kartothek.io.dask.dataframe_cube module

Dask.DataFrame IO.

kartothek.io.dask.dataframe_cube.append_to_cube_from_dataframe(data, cube, store, metadata=None)[source]

Append data to existing cube.

For details on data and metadata, see build_cube().

Important

Physical partitions must be updated as a whole. If only single rows within a physical partition are updated, the old data is treated as “removed”.

Hint

To have better control over the overwrite “mask” (i.e. which partitions are overwritten), you should use remove_partitions() beforehand.

Parameters:
  • data – Data that should be appended to the cube. For the accepted formats, see build_cube().
  • cube – Cube specification.
  • store – Store to which the data should be written.
  • metadata – Metadata for every dataset.
Returns:

metadata_dict – A dask bag object containing the compute graph that appends the data to the cube and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
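A minimal usage sketch (not part of the original reference; the cube layout, column names, and store URL are illustrative assumptions): it appends new rows to the seed dataset of a cube that was previously built, e.g. with build_cube_from_dataframe() below.

from functools import partial

import dask.dataframe as dd
import pandas as pd
from storefact import get_store_from_url

from kartothek.core.cube.cube import Cube
from kartothek.io.dask.dataframe_cube import append_to_cube_from_dataframe

# Cube layout and store URL are assumptions for illustration only; they must
# match the cube that was originally built.
cube = Cube(
    dimension_columns=["city", "day"],
    partition_columns=["country"],
    uuid_prefix="geodata",
)
store_factory = partial(get_store_from_url, "hfs:///tmp/cube_store")

new_rows = pd.DataFrame(
    {
        "city": ["Munich"],
        "country": ["DE"],
        "day": pd.to_datetime(["2021-01-02"]),
        "avg_temp": [0.5],
    }
)

# Per the note above, physical partitions must be updated as a whole.
bag = append_to_cube_from_dataframe(
    data={cube.seed_dataset: dd.from_pandas(new_rows, npartitions=1)},
    cube=cube,
    store=store_factory,
)
metadata_dict = bag.compute()[0]  # the bag has a single partition with a single element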

kartothek.io.dask.dataframe_cube.build_cube_from_dataframe(data: Union[dask.dataframe.core.DataFrame, Dict[str, dask.dataframe.core.DataFrame]], cube: kartothek.core.cube.cube.Cube, store: Callable[[], simplekv.KeyValueStore], metadata: Optional[Dict[str, Dict[str, Any]]] = None, overwrite: bool = False, partition_on: Optional[Dict[str, Iterable[str]]] = None, shuffle: bool = False, num_buckets: int = 1, bucket_by: Optional[Iterable[str]] = None) → dask.delayed.Delayed[source]

Create dask computation graph that builds a cube with the data supplied from a dask dataframe.

Parameters:
  • data – Data that should be written to the cube. If only a single dataframe is given, it is assumed to be the seed dataset.
  • cube – Cube specification.
  • store – Store to which the data should be written.
  • metadata – Metadata for every dataset.
  • overwrite – Whether possibly existing datasets should be overwritten.
  • partition_on – Optional partition-on attributes for datasets (dictionary mapping Dataset ID -> columns). See Dimensionality and Partitioning Details for details.
Returns:

metadata_dict – A dask delayed object containing the compute graph that builds the cube and returns the dict of dataset metadata objects.

Return type:

dask.delayed.Delayed
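A minimal sketch of building a new cube from a single dask DataFrame, which then acts as the seed dataset; the store URL, cube layout, and column names are illustrative assumptions, not part of the API reference.

from functools import partial

import dask.dataframe as dd
import pandas as pd
from storefact import get_store_from_url

from kartothek.core.cube.cube import Cube
from kartothek.io.dask.dataframe_cube import build_cube_from_dataframe

# Illustrative cube: cells are addressed by (city, day), files are physically
# partitioned by country.
cube = Cube(
    dimension_columns=["city", "day"],
    partition_columns=["country"],
    uuid_prefix="geodata",
)
store_factory = partial(get_store_from_url, "hfs:///tmp/cube_store")

seed = pd.DataFrame(
    {
        "city": ["Berlin", "Hamburg"],
        "country": ["DE", "DE"],
        "day": pd.to_datetime(["2021-01-01", "2021-01-01"]),
        "avg_temp": [1.2, 3.4],
    }
)

# The function only assembles the graph; .compute() triggers the actual write
# and yields the dict of dataset metadata objects.
delayed = build_cube_from_dataframe(
    data=dd.from_pandas(seed, npartitions=1),
    cube=cube,
    store=store_factory,
)
metadata_dict = delayed.compute()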

kartothek.io.dask.dataframe_cube.extend_cube_from_dataframe(data, cube, store, metadata=None, overwrite=False, partition_on=None)[source]

Create dask computation graph that extends a cube by the data supplied from a dask dataframe.

For details on data and metadata, see build_cube().

Parameters:
  • data – Data that should be written to the cube. For the accepted formats, see build_cube().
  • cube – Cube specification.
  • store – Store to which the data should be written.
  • metadata – Metadata for every dataset.
  • overwrite – Whether possibly existing datasets should be overwritten.
  • partition_on – Optional partition-on attributes for datasets (dictionary mapping Dataset ID -> columns).
Returns:

metadata_dict – A dask bag object containing the compute graph that extends the cube and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
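A minimal sketch of extending an existing cube with an additional, non-seed dataset keyed by its dataset ID; it reuses the illustrative cube and store_factory from the build_cube_from_dataframe() example above, and the dataset ID and columns are assumptions.

import dask.dataframe as dd
import pandas as pd

from kartothek.io.dask.dataframe_cube import extend_cube_from_dataframe

# `cube` and `store_factory` as defined in the build example above (assumed).
enrich = pd.DataFrame(
    {
        "city": ["Berlin", "Hamburg"],
        "country": ["DE", "DE"],
        "day": pd.to_datetime(["2021-01-01", "2021-01-01"]),
        "population": [3_600_000, 1_800_000],
    }
)

bag = extend_cube_from_dataframe(
    data={"enrich": dd.from_pandas(enrich, npartitions=1)},  # dataset ID -> ddf
    cube=cube,
    store=store_factory,
)
metadata_dict = bag.compute()[0]  # the bag has a single partition with a single element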

kartothek.io.dask.dataframe_cube.query_cube_dataframe(cube, store, conditions=None, datasets=None, dimension_columns=None, partition_by=None, payload_columns=None)[source]

Query cube.

For detailed documentation, see query_cube().

Important

In contrast to other backends, the Dask DataFrame may contain partitions with empty DataFrames!

Parameters:
Returns:

ddf – Dask DataFrame, partitioned and ordered by partition_by. Columns of the DataFrame are ordered alphabetically. Data types are provided on a best-effort basis (they are restored based on the preserved data, but may differ due to Pandas NULL-handling; e.g., integer columns may become floats).

Return type:

dask.dataframe.DataFrame
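A minimal sketch of reading the cube back as a dask DataFrame, again reusing the illustrative cube and store_factory from the build example above; the condition and payload column are assumptions.

from kartothek.core.cube.conditions import C
from kartothek.io.dask.dataframe_cube import query_cube_dataframe

# `cube` and `store_factory` as defined in the build example above (assumed).
ddf = query_cube_dataframe(
    cube=cube,
    store=store_factory,
    conditions=(C("country") == "DE"),
    payload_columns=["avg_temp"],
)
result = ddf.compute()  # note: individual partitions may be empty DataFrames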