API¶
This is a non-exhaustive list of the most useful kartothek functions.
Please see Versioning for guarantees we currently provide for the stability of the interface.
Dataset state and metadata¶
Core functions and classes to investigate the dataset state.
kartothek.core.dataset.DatasetMetadata: Container holding all metadata of the dataset.
kartothek.core.factory.DatasetFactory: Container holding metadata and caching storage access.
kartothek.core.common_metadata.SchemaWrapper: Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.
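A minimal sketch of inspecting an existing dataset, assuming a dataset named "weather" has already been written to a local store under /tmp/kartothek_example (both names are illustrative; storefact is used here to create the store):

    from functools import partial

    from storefact import get_store_from_url

    from kartothek.core.dataset import DatasetMetadata

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    # Load the dataset header to inspect partitions, indices and table schemas.
    dm = DatasetMetadata.load_from_store("weather", store_factory())
    print(dm.partitions.keys())
    print(dm.table_meta)  # schema information (SchemaWrapper) per table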
Data retrieval and storage¶
Eager¶
Immediate pipeline execution on a single worker without the need for any external scheduling engine. Well suited for small data and low-overhead pipeline execution.
High level user interface¶
kartothek.io.eager.read_table: A utility function to load a single table with multiple partitions as a single dataframe in one go.
kartothek.io.eager.read_dataset_as_dataframes: Read a dataset as a list of dataframes.
kartothek.io.eager.store_dataframes_as_dataset: Utility function to store a list of dataframes as a partitioned dataset with multiple tables (files).
kartothek.io.eager.update_dataset_from_dataframes: Update a kartothek dataset in store at once, using a list of dataframes.
kartothek.io.eager.build_dataset_indices: Function which builds an ExplicitSecondaryIndex for an existing dataset.
kartothek.io.eager.garbage_collect_dataset: Remove auxiliary files that are no longer tracked by the dataset.
kartothek.io.eager.delete_dataset: Delete the entire dataset from the store.
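A minimal sketch of the eager round trip, assuming a local filesystem store under /tmp/kartothek_example; the dataset name "weather" and the column names are illustrative only:

    from functools import partial

    import pandas as pd
    from storefact import get_store_from_url

    from kartothek.io.eager import read_table, store_dataframes_as_dataset

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    df = pd.DataFrame({"location": ["Hamburg", "London"], "temperature": [18, 21]})

    # Persist the dataframe as a new dataset ...
    store_dataframes_as_dataset(store=store_factory, dataset_uuid="weather", dfs=[df])

    # ... and load it back as a single pandas DataFrame.
    result = read_table(dataset_uuid="weather", store=store_factory)

The later examples re-use the same hypothetical dataset and store factory.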
Expert low level interface¶
kartothek.io.eager.read_dataset_as_metapartitions: Read a dataset as a list of MetaPartition objects.
kartothek.io.eager.create_empty_dataset_header: Create a dataset header without any partitions.
kartothek.io.eager.write_single_partition: Write the parquet file(s) for a single partition.
kartothek.io.eager.commit_dataset: Commit new state to an existing dataset.
Iter¶
An iteration interface implemented as Python generators to allow for (partition-based) stream / micro-batch processing of data.
High level user interface¶
kartothek.io.iter.read_dataset_as_dataframes__iterator: A Python iterator to retrieve a dataset from store where each partition is loaded as a pandas.DataFrame.
kartothek.io.iter.update_dataset_from_dataframes__iter: Update a kartothek dataset in store iteratively, using a generator of dataframes.
kartothek.io.iter.store_dataframes_as_dataset__iter: Store pandas.DataFrames iteratively as a partitioned dataset with multiple tables (files).
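A minimal sketch of partition-wise streaming, again assuming the hypothetical "weather" dataset and local store from the eager example:

    from functools import partial

    from storefact import get_store_from_url

    from kartothek.io.iter import read_dataset_as_dataframes__iterator

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    # Iterate over the dataset partition by partition instead of materialising it at once.
    for partition_data in read_dataset_as_dataframes__iterator(
        dataset_uuid="weather", store=store_factory
    ):
        # partition_data holds the dataframe(s) of a single partition.
        print(partition_data)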
Expert low level interface¶
kartothek.io.iter.read_dataset_as_metapartitions__iterator: A Python iterator to retrieve a dataset from store where each partition is loaded as a MetaPartition.
Dask¶
The dask module offers a seamless integration with dask and provides implementations for the dask data collections dask.bag.Bag, dask.dataframe.DataFrame and dask.delayed.Delayed. This implementation is best suited to handle big data and to scale pipelines across many workers using dask.distributed.
DataFrame¶
This is the most user-friendly interface of the dask containers and offers direct access to the dask DataFrame.
kartothek.io.dask.dataframe.read_dataset_as_ddf: Retrieve a single table from a dataset as partition-individual DataFrame instances, wrapped in a dask.dataframe.DataFrame.
kartothek.io.dask.dataframe.store_dataset_from_ddf: Store a dataset from a dask.dataframe.
kartothek.io.dask.dataframe.update_dataset_from_ddf: Update a dataset from a dask.dataframe.
kartothek.io.dask.dataframe.collect_dataset_metadata: Collect parquet metadata of the dataset.
kartothek.io.dask.dataframe.hash_dataset: Calculate a partition-wise, or group-wise, hash of the dataset.
kartothek.io.dask.compression.pack_payload_pandas: Pack all payload columns (everything except the group key) into a single column.
kartothek.io.dask.compression.pack_payload: Pack all payload columns (everything except the group key) of a dask.dataframe into a single column.
kartothek.io.dask.compression.unpack_payload_pandas: Revert pack_payload_pandas and restore the packed payload columns.
kartothek.io.dask.compression.unpack_payload: Revert the payload packing of pack_payload.
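A minimal sketch of the dask.dataframe round trip, re-using the hypothetical "weather" dataset and assuming the default table name "table":

    from functools import partial

    from storefact import get_store_from_url

    from kartothek.io.dask.dataframe import read_dataset_as_ddf, update_dataset_from_ddf

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    # Build a lazy dask DataFrame over all partitions of the dataset ...
    ddf = read_dataset_as_ddf(dataset_uuid="weather", store=store_factory, table="table")

    # ... derive new data and append it to the dataset.
    ddf_new = ddf.assign(temperature_f=ddf["temperature"] * 9 / 5 + 32)
    task = update_dataset_from_ddf(
        ddf_new, store=store_factory, dataset_uuid="weather", table="table"
    )
    task.compute()  # nothing is read or written until the dask graph executes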
Bag¶
This offers the dataset as a dask Bag. Very well suited for (almost) embarrassingly parallel batch processing workloads.
kartothek.io.dask.bag.read_dataset_as_dataframe_bag: Retrieve data as dataframes from a dask.bag.Bag of MetaPartition objects, one element per partition.
kartothek.io.dask.bag.store_bag_as_dataset: Transform and store a dask.bag of dictionaries containing dataframes to a kartothek dataset in store.
kartothek.io.dask.bag.build_dataset_indices__bag: Function which builds an ExplicitSecondaryIndex as a dask.bag pipeline.
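A minimal sketch, again on the hypothetical "weather" dataset; each bag element carries the data of one partition and can be processed with the usual dask.bag operators:

    from functools import partial

    from storefact import get_store_from_url

    from kartothek.io.dask.bag import read_dataset_as_dataframe_bag

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    bag = read_dataset_as_dataframe_bag(dataset_uuid="weather", store=store_factory)
    n_partitions = bag.count().compute()  # one bag element per partition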
Delayed¶
This offers a low-level interface that exposes the dask.delayed objects directly.
kartothek.io.dask.delayed.read_table_as_delayed: A collection of dask.delayed objects to retrieve a single table from a dataset as partition-individual DataFrame instances.
kartothek.io.dask.delayed.read_dataset_as_delayed: A collection of dask.delayed objects to retrieve a dataset from store where each partition is loaded as a DataFrame.
kartothek.io.dask.delayed.store_delayed_as_dataset: Transform and store a list of dictionaries containing dataframes to a kartothek dataset in store.
kartothek.io.dask.delayed.update_dataset_from_delayed: A dask.delayed graph to add and store a list of dictionaries containing dataframes to a kartothek dataset in store.
kartothek.io.dask.delayed.merge_datasets_as_delayed: A dask.delayed graph to perform the merge of two full kartothek datasets.
kartothek.io.dask.delayed.delete_dataset__delayed: A dask.delayed graph to delete a dataset from the store.
kartothek.io.dask.delayed.garbage_collect_dataset__delayed: Remove auxiliary files that are no longer tracked by the dataset.
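A minimal sketch using the delayed interface on the hypothetical "weather" dataset; one dask.delayed task is produced per partition and can be executed with any dask scheduler:

    from functools import partial

    import dask
    from storefact import get_store_from_url

    from kartothek.io.dask.delayed import read_dataset_as_delayed

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")

    tasks = read_dataset_as_delayed(dataset_uuid="weather", store=store_factory)
    partitions = dask.compute(*tasks)  # materialise all partitions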
DataFrame Serialization¶
DataFrame serializers¶
kartothek.serialization.DataFrameSerializer: Abstract class that supports serializing DataFrames to/from simplekv stores.
kartothek.serialization.CsvSerializer: Serializer to store a pandas.DataFrame as an (optionally compressed) CSV file.
kartothek.serialization.ParquetSerializer: Serializer to store a pandas.DataFrame as Parquet.
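A minimal sketch of configuring the on-disk format via the eager write path; the dataset name and parameter values are illustrative:

    from functools import partial

    import pandas as pd
    from storefact import get_store_from_url

    from kartothek.io.eager import store_dataframes_as_dataset
    from kartothek.serialization import ParquetSerializer

    store_factory = partial(get_store_from_url, "hfs:///tmp/kartothek_example")
    df = pd.DataFrame({"location": ["Hamburg"], "temperature": [18]})

    # Control compression and row-group size of the written Parquet files.
    serializer = ParquetSerializer(compression="GZIP", chunk_size=100_000)
    store_dataframes_as_dataset(
        store=store_factory,
        dataset_uuid="weather_gzip",
        dfs=[df],
        df_serializer=serializer,
    )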
Utility to handle predicates¶
kartothek.serialization.filter_predicates_by_column: Take a predicate list and remove all literals which are not referencing one of the given columns.
kartothek.serialization.columns_in_predicates: Determine all columns which are mentioned in the list of predicates.
kartothek.serialization.filter_df_from_predicates: Filter a pandas.DataFrame based on predicates in disjunctive normal form.
kartothek.serialization.filter_array_like: Filter an array-like object using the operations defined in the predicates.
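A minimal sketch of the predicate format (disjunctive normal form: a list of AND-connected literal lists that are OR-connected with each other); the dataframe content is illustrative:

    import pandas as pd

    from kartothek.serialization import filter_df_from_predicates

    df = pd.DataFrame(
        {"location": ["Hamburg", "London", "Paris"], "temperature": [18, 21, 25]}
    )

    # Keep rows where location == "Hamburg" OR temperature > 20.
    predicates = [[("location", "==", "Hamburg")], [("temperature", ">", 20)]]
    filtered = filter_df_from_predicates(df, predicates)

The same predicates structure is accepted by the read functions above (for example the predicates parameter of read_table) to push filters down to the storage layer.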