API

This is a non-exhaustive list of the most useful kartothek functions.

Please see Versioning for guarantees we currently provide for the stability of the interface.

Dataset state and metadata

Core functions and classes to investigate the dataset state.

dataset.DatasetMetadata(uuid, partitions, …) Container holding all metadata of the dataset.
factory.DatasetFactory(dataset_uuid, …) Container that holds dataset metadata, caches it, and manages storage access.
common_metadata.SchemaWrapper(schema, …) Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.

Data retrieval and storage

Eager

Immediate pipeline execution on a single worker without the need for any external scheduling engine. Well suited for small data, low-overhead pipeline execution.

High level user interface

read_table(dataset_uuid[, store]) A utility function to load a single table with multiple partitions as a single dataframe in one go.
read_dataset_as_dataframes(dataset_uuid[, store]) Read a dataset as a list of dataframes.
store_dataframes_as_dataset(store, …[, …]) Utility function to store a list of dataframes as a partitioned dataset with multiple tables (files).
update_dataset_from_dataframes(df_list, …) Update a kartothek dataset in store at once, using a list of dataframes.
build_dataset_indices(store, dataset_uuid, …) Function which builds a ExplicitSecondaryIndex.
garbage_collect_dataset([dataset_uuid, …]) Remove auxiliary files that are no longer tracked by the dataset.
delete_dataset([dataset_uuid, store, factory]) Delete the entire dataset from the store.
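The eager round-trip (store a list of dataframes, read them back as one table) can be illustrated with a toy sketch. Note this is not the kartothek API: a plain dict stands in for a simplekv store, pickle stands in for Parquet, and mini_store_dataframes / mini_read_table are hypothetical miniatures of store_dataframes_as_dataset and read_table, showing only the key-per-partition layout idea.

```python
import pickle
import pandas as pd

# Toy stand-in for a simplekv key-value store: each partition lives
# under its own key, plus one key for the dataset metadata.
store = {}

def mini_store_dataframes(store, dataset_uuid, dfs):
    """Hypothetical miniature of store_dataframes_as_dataset:
    one key per partition, pickle instead of Parquet."""
    for i, df in enumerate(dfs):
        store[f"{dataset_uuid}/table/part-{i}"] = pickle.dumps(df)
    # Toy metadata key recording how many partitions exist.
    store[f"{dataset_uuid}.metadata"] = pickle.dumps({"partitions": len(dfs)})

def mini_read_table(store, dataset_uuid):
    """Hypothetical miniature of read_table: load every partition
    and concatenate them into a single dataframe."""
    meta = pickle.loads(store[f"{dataset_uuid}.metadata"])
    parts = [
        pickle.loads(store[f"{dataset_uuid}/table/part-{i}"])
        for i in range(meta["partitions"])
    ]
    return pd.concat(parts, ignore_index=True)

mini_store_dataframes(store, "weather", [
    pd.DataFrame({"city": ["HH"], "temp": [9.0]}),
    pd.DataFrame({"city": ["B"], "temp": [11.5]}),
])
df = mini_read_table(store, "weather")
```

The real functions additionally maintain schema information, indices, and atomic metadata commits; only the storage layout idea is sketched here.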

Expert low level interface

read_dataset_as_metapartitions([…]) Read a dataset as a list of kartothek.io_components.metapartition.MetaPartition.
create_empty_dataset_header(store, …[, …]) Create a dataset header without any partitions.
write_single_partition([store, …]) Write the parquet file(s) for a single partition.
commit_dataset(store, …) Commit new state to an existing dataset.

Iter

An iteration interface implemented as Python generators to allow for (partition-based) stream / micro-batch processing of data.

High level user interface

read_dataset_as_dataframes__iterator([…]) A Python iterator to retrieve a dataset from store where each partition is loaded as a DataFrame.
update_dataset_from_dataframes__iter(…[, …]) Update a kartothek dataset in store iteratively, using a generator of dataframes.
store_dataframes_as_dataset__iter(…[, …]) Store pd.DataFrame objects iteratively as a partitioned dataset with multiple tables (files).

Expert low level interface

read_dataset_as_metapartitions__iterator([…]) A Python iterator to retrieve a dataset from store where each partition is loaded as a MetaPartition.
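The micro-batch pattern these iterators enable can be sketched without kartothek at all: a generator yields one partition dataframe at a time, and the consumer aggregates as it goes, so only one partition is ever in memory. iter_partitions below is a hypothetical stand-in for read_dataset_as_dataframes__iterator, not the real function.

```python
import pandas as pd

def iter_partitions():
    """Stand-in generator for read_dataset_as_dataframes__iterator:
    yields one dataframe per partition instead of loading all at once."""
    for month in (1, 2, 3):
        yield pd.DataFrame({"month": [month, month], "value": [1.0, 2.0]})

# Micro-batch processing: fold each partition into a running aggregate
# as it streams by, keeping a single partition in memory at a time.
totals = {}
for part in iter_partitions():
    totals.update(part.groupby("month")["value"].sum().to_dict())
```

With the real iterator the loop body stays the same; only the source of the dataframes changes.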

Dask

The dask module integrates seamlessly with dask and provides implementations for dask data collections such as dask.Bag, dask.DataFrame, and dask.Delayed. This implementation is best suited to handling big data and scaling pipelines across many workers using dask.distributed.

DataFrame

This is the most user-friendly interface of the dask containers and offers direct access to the dask DataFrame.

read_dataset_as_ddf([dataset_uuid, store, …]) Retrieve a single table from a dataset as a partition-individual DataFrame instance.
store_dataset_from_ddf(ddf, store, …) Store a dataset from a dask.dataframe.
update_dataset_from_ddf(ddf, store, …) Update a dataset from a dask.dataframe.
collect_dataset_metadata(store, …) Collect parquet metadata of the dataset.
hash_dataset(store, …) Calculate a partition-wise, or group-wise, hash of the dataset.
pack_payload_pandas(partition, group_key)
pack_payload(df, group_key) Pack all payload columns (everything except the group_key) into a single column.
unpack_payload_pandas(partition, unpack_meta) Revert pack_payload_pandas and restore the packed payload.
unpack_payload(df, unpack_meta) Revert the payload packing of pack_payload and restore the full dataframe.
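The idea behind payload packing can be shown with a simplified, hypothetical re-implementation: serialize every non-key column into one opaque column so that a shuffle only has to move the key plus a single blob. This sketch uses pickle in place of kartothek's Arrow-based serialization, and pack_payload_sketch / unpack_payload_sketch are illustrative helpers, not the library functions.

```python
import pickle
import pandas as pd

def pack_payload_sketch(df, group_key):
    """Simplified model of pack_payload_pandas: serialize all
    non-group-key columns into a single payload column per group."""
    payload_cols = [c for c in df.columns if c != group_key]
    rows = []
    for key, g in df.groupby(group_key, sort=True):
        rows.append({
            group_key: key,
            "__payload__": pickle.dumps(g[payload_cols].reset_index(drop=True)),
        })
    return pd.DataFrame(rows)

def unpack_payload_sketch(packed, group_key):
    """Reverse of pack_payload_sketch: deserialize each blob and
    re-attach the group key to restore the full dataframe."""
    frames = []
    for _, row in packed.iterrows():
        g = pickle.loads(row["__payload__"])
        g[group_key] = row[group_key]
        frames.append(g)
    return pd.concat(frames, ignore_index=True)

df = pd.DataFrame({"k": ["a", "a", "b"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
packed = pack_payload_sketch(df, "k")
restored = unpack_payload_sketch(packed, "k")
```

Packing before a dask shuffle and unpacking afterwards reduces per-column overhead during the exchange; that is the trade-off these four functions implement.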

Bag

This offers the dataset as a dask Bag. Very well suited for (almost) embarrassingly parallel batch-processing workloads.

read_dataset_as_dataframe_bag([…]) Retrieve data as dataframes from a dask.bag of MetaPartition objects.
store_bag_as_dataset(bag, store[, …]) Transform and store a dask.bag of dictionaries containing dataframes to a kartothek dataset in store.
build_dataset_indices__bag(store, …[, …]) Function which builds a ExplicitSecondaryIndex.

Delayed

This offers a low level interface exposing the delayed interface directly.

read_table_as_delayed([dataset_uuid, store, …]) A collection of dask.delayed objects to retrieve a single table from a dataset as partition-individual DataFrame instances.
read_dataset_as_delayed([dataset_uuid, …]) A collection of dask.delayed objects to retrieve a dataset from store where each partition is loaded as a DataFrame.
store_delayed_as_dataset(delayed_tasks, store) Transform and store a list of dictionaries containing dataframes to a kartothek dataset in store.
update_dataset_from_delayed(delayed_tasks[, …]) A dask.delayed graph to add and store a list of dictionaries containing dataframes to a kartothek dataset in store.
merge_datasets_as_delayed(left_dataset_uuid, …) A dask.delayed graph to perform the merge of two full kartothek datasets.
delete_dataset__delayed([dataset_uuid, …]) Delete the dataset from the store.
garbage_collect_dataset__delayed([…]) Remove auxiliary files that are no longer tracked by the dataset.

DataFrame Serialization

DataFrame serializers

DataFrameSerializer Abstract class that supports serializing DataFrames to/from simplekv stores.
CsvSerializer([compress])
ParquetSerializer([compression, chunk_size])

Utility to handle predicates

filter_predicates_by_column(predicates, columns) Take a predicate list and remove all literals which do not reference one of the given columns.
columns_in_predicates(predicates) Determine all columns which are mentioned in the list of predicates.
filter_df_from_predicates(df, predicates, …) Filter a pandas.DataFrame based on predicates in disjunctive normal form.
filter_array_like(array_like, op, value[, …]) Filter an array-like object using operations defined in the predicates.
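Predicates are expressed in disjunctive normal form: an OR-list of AND-lists of (column, operator, value) literals. The semantics of filter_df_from_predicates can be modeled with a short pure-pandas sketch; filter_df_sketch is an illustrative simplification, not the library function.

```python
import operator
import pandas as pd

# Comparison operators supported by the predicate literals.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda col, val: col.isin(val),
}

def filter_df_sketch(df, predicates):
    """Simplified model of filter_df_from_predicates: `predicates` is
    an OR-list of AND-lists of (column, op, value) literals (DNF)."""
    keep = pd.Series(False, index=df.index)
    for conjunction in predicates:          # OR over the outer list
        mask = pd.Series(True, index=df.index)
        for col, op, value in conjunction:  # AND over the literals
            mask &= OPS[op](df[col], value)
        keep |= mask
    return df[keep]

df = pd.DataFrame({"city": ["HH", "B", "M"], "temp": [9, 11, 15]})
# (city == "HH") OR (temp > 10 AND city != "M")
result = filter_df_sketch(df, [
    [("city", "==", "HH")],
    [("temp", ">", 10), ("city", "!=", "M")],
])
```

The same predicate structure is accepted by the read functions above (e.g. the predicates parameter of read_table and read_dataset_as_ddf), where it additionally enables partition pruning before any data is loaded.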