kartothek.utils.ktk_adapters module

Methods to make working with Kartothek easier.

kartothek.utils.ktk_adapters.get_dataset_columns(dataset)

Get columns present in a Kartothek_Cube-compatible Kartothek dataset.

Parameters

dataset (kartothek.core.dataset.DatasetMetadata) – Dataset to get the columns from.

Returns

columns – Usable columns.

Return type

Set[str]
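Because the result is a plain set of column names, it can be compared against a caller's column list with ordinary set operations. The following is a minimal sketch with made-up column names; the `available` set only stands in for the return value of get_dataset_columns, and no Kartothek call is made.

```python
# Stand-in for the result of get_dataset_columns(dataset); the names are hypothetical.
available = {"city", "day", "avg_temp"}


def missing_columns(requested, available):
    """Return the requested columns that are not available, sorted by name."""
    return sorted(set(requested) - set(available))


missing = missing_columns(["city", "humidity"], available)
print(missing)
```

A check like this could run before a query so that a misspelled column is reported early rather than at read time.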

kartothek.utils.ktk_adapters.get_dataset_keys(dataset)

Get store keys that belong to the given Kartothek dataset.

Parameters

dataset (kartothek.core.dataset.DatasetMetadata) – Dataset to scan for keys.

Returns

keys – Storage keys.

Return type

Set[str]
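One plausible use of such a key set is finding store keys that a dataset no longer references. The sketch below assumes hypothetical key names; `dataset_keys` stands in for the return value of get_dataset_keys, and `store_keys` for a listing of the store, neither of which is produced by Kartothek here.

```python
def orphaned_keys(store_keys, dataset_keys):
    """Return the keys present in the store but not referenced by the dataset."""
    return set(store_keys) - set(dataset_keys)


# Hypothetical key names, chosen only for illustration.
store_keys = {"ds/a.parquet", "ds/b.parquet", "ds/old.parquet"}
dataset_keys = {"ds/a.parquet", "ds/b.parquet"}

orphans = orphaned_keys(store_keys, dataset_keys)
print(orphans)
```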

kartothek.utils.ktk_adapters.get_dataset_schema(dataset)

Get schema from a Kartothek_Cube-compatible Kartothek dataset.

Parameters

dataset (kartothek.core.dataset.DatasetMetadata) – Dataset to get the schema from.

Returns

schema – Schema data.

Return type

pyarrow.Schema

Deprecated since version 5.3: get_dataset_schema is deprecated and will be removed in 6.0.
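Deprecations in Python libraries are commonly signalled through the standard warnings module. Whether and how Kartothek emits a warning for this function is an assumption. The sketch below uses a hypothetical stand-in function, `legacy_get_schema`, to show how calling code could collect such warnings before upgrading.

```python
import warnings


def legacy_get_schema():
    """Hypothetical stand-in for a deprecated function; not Kartothek code."""
    warnings.warn("legacy_get_schema is deprecated", DeprecationWarning, stacklevel=2)
    return "schema"


def call_and_collect_deprecations(func):
    """Call func and return its result plus any deprecation messages it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        result = func()
    messages = [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


result, messages = call_and_collect_deprecations(legacy_get_schema)
print(messages)
```

In a test suite, a similar effect can be had by turning DeprecationWarning into an error with `warnings.simplefilter("error", DeprecationWarning)`.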

kartothek.utils.ktk_adapters.get_partition_dataframe(dataset, cube)

Create a DataFrame that represents the partitioning of the dataset.

The row index, named "partition", contains the partition labels; the columns are the physical partition columns.

Parameters

dataset (kartothek.core.dataset.DatasetMetadata) – Dataset to analyze.

cube – Cube specification.

Returns

df – DataFrame with partition data.

Return type

pandas.DataFrame
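The layout described above (one row per partition label, one column per physical partition column) lends itself to grouping labels by their physical partition. The following sketch mirrors that layout with a plain dictionary instead of a pandas.DataFrame; all labels, column names, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the described layout:
# partition label -> values of the physical partition columns.
partition_rows = {
    "label_a": {"country": "DE", "year": 2020},
    "label_b": {"country": "DE", "year": 2020},
    "label_c": {"country": "FR", "year": 2021},
}


def group_labels(rows, columns):
    """Group partition labels by their values in the given partition columns."""
    groups = defaultdict(list)
    for label, values in sorted(rows.items()):
        groups[tuple(values[c] for c in columns)].append(label)
    return dict(groups)


groups = group_labels(partition_rows, ["country", "year"])
print(groups)
```

With a real DataFrame, the equivalent step would presumably be a groupby over the partition columns.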

kartothek.utils.ktk_adapters.get_physical_partition_stats(metapartitions, store)

Get statistics for a physical partition.

Hint

To get the metapartitions pre-aligned, use concat_partitions_on_primary_index=True during dispatch.

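Parameters

metapartitions – Metapartitions that belong to the partition.

store – Store that holds the partition data.

Returns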

stats – Statistics for the current partition.

Return type

Dict[str, int]
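Since each call returns a Dict[str, int] for one partition, the per-partition results can be summed into dataset-wide totals. The sketch below assumes the statistics are additive and uses hypothetical key names ("rows", "files"); the actual keys returned by get_physical_partition_stats are not listed here.

```python
from collections import Counter


def merge_stats(stats_per_partition):
    """Sum integer statistics key by key across several partitions."""
    total = Counter()
    for stats in stats_per_partition:
        total.update(stats)  # Counter.update adds counts instead of replacing them
    return dict(total)


# Hypothetical per-partition results; the keys are made up for illustration.
per_partition = [{"rows": 10, "files": 1}, {"rows": 5, "files": 2}]

totals = merge_stats(per_partition)
print(totals)
```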

kartothek.utils.ktk_adapters.metadata_factory_from_dataset(dataset, with_schema=True, store=None)

Create a DatasetFactory from a DatasetMetadata.

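Parameters

dataset (kartothek.core.dataset.DatasetMetadata) – Dataset to create the factory from.

with_schema – Defaults to True.

store – Defaults to None.

Returns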

factory – Metadata factory with caches pre-filled.

Return type

DatasetFactory
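The benefit of a factory with pre-filled caches is that metadata already in memory does not have to be read from the store again. The sketch below illustrates that general pattern with a hypothetical `LazyMetadataFactory` class; it is not Kartothek's DatasetFactory and makes no claim about its internals.

```python
class LazyMetadataFactory:
    """Hypothetical stand-in that loads metadata on first access unless pre-filled."""

    def __init__(self, loader, preloaded=None):
        self._loader = loader
        self._cache = preloaded  # a pre-filled cache skips the loader entirely
        self.load_calls = 0

    @property
    def metadata(self):
        if self._cache is None:
            self.load_calls += 1
            self._cache = self._loader()
        return self._cache


def load_from_store():
    """Pretend to read metadata from a store; the content is made up."""
    return {"uuid": "my_dataset"}


cold = LazyMetadataFactory(load_from_store)
warm = LazyMetadataFactory(load_from_store, preloaded={"uuid": "my_dataset"})

# Both expose the same metadata, but only the cold factory had to load it.
print(cold.metadata == warm.metadata, cold.load_calls, warm.load_calls)
```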