kartothek.io.dask.bag_cube module

Dask.Bag IO.

kartothek.io.dask.bag_cube.append_to_cube_from_bag(data, cube, store, ktk_cube_dataset_ids, metadata=None)[source]

Append data to existing cube.

For details on data and metadata, see build_cube().

Important

Physical partitions must be updated as a whole. If only single rows within a physical partition are updated, the old data is treated as “removed”.

Hint

To have better control over the overwrite “mask” (i.e. which partitions are overwritten), you should use remove_partitions() beforehand or use update_cube_from_bag() instead.

Parameters:
  • data (dask.Bag) – Bag containing dataframes
  • cube (kartothek.core.cube.cube.Cube) – Cube specification.
  • store (Callable[[], simplekv.KeyValueStore]) – Store to which the data should be written.
  • ktk_cube_dataset_ids (Optional[Iterable[str]]) – Datasets that will be written; these must be specified in advance.
  • metadata (Optional[Dict[str, Dict[str, Any]]]) – Metadata for every dataset, optional. For every dataset, only given keys are updated/replaced. Deletion of metadata keys is not possible.
Returns:

metadata_dict – A dask bag containing the compute graph that appends to the cube and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
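
A minimal usage sketch of assembling and running such an append graph. The cube spec, the storefact store URL, and the DataFrame contents below are illustrative assumptions, not taken from this page, and they presume a cube with the default seed dataset name "seed" already exists in the store:

    from functools import partial

    import dask.bag as db
    import pandas as pd
    from storefact import get_store_from_url

    from kartothek.core.cube.cube import Cube
    from kartothek.io.dask.bag_cube import append_to_cube_from_bag

    # Illustrative cube spec and store factory (assumed values).
    cube = Cube(
        dimension_columns=["x"],
        partition_columns=["p"],
        uuid_prefix="my_cube",
    )
    store_factory = partial(get_store_from_url, "hfs:///tmp/cube_data")

    # One bag partition; each element maps ktk_cube dataset IDs to DataFrames.
    new_rows = db.from_sequence(
        [{"seed": pd.DataFrame({"x": [10, 11], "p": [2, 2], "v": [1.0, 2.0]})}],
        npartitions=1,
    )

    graph = append_to_cube_from_bag(
        data=new_rows,
        cube=cube,
        store=store_factory,
        ktk_cube_dataset_ids=["seed"],
    )
    (metadata_dict,) = graph.compute()  # the bag's single element: dict of dataset metadata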

kartothek.io.dask.bag_cube.update_cube_from_bag(data, cube, store, remove_conditions, ktk_cube_dataset_ids, metadata=None)[source]

Remove partitions and append data to existing cube.

For details on data and metadata, see build_cube().

Only datasets in ktk_cube_dataset_ids will be affected.

Parameters:
  • data (dask.Bag) – Bag containing dataframes
  • cube (kartothek.core.cube.cube.Cube) – Cube specification.
  • store (Callable[[], simplekv.KeyValueStore]) – Store to which the data should be written.
  • remove_conditions – Conditions that select the partitions to remove. Must be a condition that only uses partition columns.
  • ktk_cube_dataset_ids (Optional[Iterable[str]]) – Datasets that will be written; these must be specified in advance.
  • metadata (Optional[Dict[str, Dict[str, Any]]]) – Metadata for every dataset, optional. For every dataset, only given keys are updated/replaced. Deletion of metadata keys is not possible.
Returns:

metadata_dict – A dask bag containing the compute graph that removes the selected partitions, appends to the cube, and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
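
A sketch of a partition-level overwrite, reusing the cube and store_factory defined in the append sketch above; the remove condition and replacement values are illustrative assumptions:

    import dask.bag as db
    import pandas as pd

    from kartothek.core.cube.conditions import C
    from kartothek.io.dask.bag_cube import update_cube_from_bag

    # Replace the physical partition p == 2 with fresh data (illustrative values).
    replacement = db.from_sequence(
        [{"seed": pd.DataFrame({"x": [10, 11], "p": [2, 2], "v": [3.0, 4.0]})}],
        npartitions=1,
    )

    graph = update_cube_from_bag(
        data=replacement,
        cube=cube,                        # cube spec from the append sketch
        store=store_factory,              # store factory from the append sketch
        remove_conditions=(C("p") == 2),  # may only reference partition columns
        ktk_cube_dataset_ids=["seed"],
    )
    (metadata_dict,) = graph.compute()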

kartothek.io.dask.bag_cube.build_cube_from_bag(data, cube, store, ktk_cube_dataset_ids=None, metadata=None, overwrite=False, partition_on=None)[source]

Create a dask computation graph that builds a cube with the data supplied from a dask bag.

Parameters:
Returns:

metadata_dict – A dask bag containing the compute graph that builds the cube and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
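
A sketch of building the cube in the first place, again with the cube and store_factory from the append sketch; the seed data below is an assumption for illustration:

    import dask.bag as db
    import pandas as pd

    from kartothek.io.dask.bag_cube import build_cube_from_bag

    # Seed data spread over two bag partitions (illustrative values).
    seed_bag = db.from_sequence(
        [
            {"seed": pd.DataFrame({"x": [0, 1], "p": [0, 0], "v": [1.0, 2.0]})},
            {"seed": pd.DataFrame({"x": [2, 3], "p": [1, 1], "v": [3.0, 4.0]})},
        ],
        npartitions=2,
    )

    graph = build_cube_from_bag(
        data=seed_bag,
        cube=cube,            # cube spec from the append sketch
        store=store_factory,  # store factory from the append sketch
        ktk_cube_dataset_ids=["seed"],
    )
    (metadata_dict,) = graph.compute()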

kartothek.io.dask.bag_cube.cleanup_cube_bag(cube, store, blocksize=100)[source]

Remove unused keys from cube datasets.

Important

All untracked keys which start with the cube’s uuid_prefix followed by the KTK_CUBE_UUID_SEPERATOR (e.g. my_cube_uuid++seed…) will be deleted by this routine. These keys may be leftovers from past overwrites or index updates.

Parameters:
Returns:

bag – A dask bag that performs the given operation. May contain multiple partitions.

Return type:

dask.bag.Bag
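
A sketch of running the cleanup, with cube and store_factory as defined in the sketches above:

    from kartothek.io.dask.bag_cube import cleanup_cube_bag

    # Deletes untracked keys below the cube's uuid_prefix; the bag carries no useful payload.
    cleanup_cube_bag(cube=cube, store=store_factory).compute()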

kartothek.io.dask.bag_cube.collect_stats_bag(cube, store, datasets=None, blocksize=100)[source]

Collect statistics for given cube.

Parameters:
Returns:

bag – A dask bag that returns a single result of the form Dict[str, Dict[str, int]] and contains statistics per ktk_cube dataset ID.

Return type:

dask.bag.Bag
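
A sketch of collecting and inspecting the statistics, with cube and store_factory as above; the per-dataset counters are whatever the library reports:

    from kartothek.io.dask.bag_cube import collect_stats_bag

    stats_bag = collect_stats_bag(cube=cube, store=store_factory)
    (stats,) = stats_bag.compute()  # single element: Dict[str, Dict[str, int]] keyed by dataset ID
    for dataset_id, counters in stats.items():
        print(dataset_id, counters)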

kartothek.io.dask.bag_cube.copy_cube_bag(cube, src_store, tgt_store, blocksize=100, overwrite=False, datasets=None)[source]

Copy cube from one store to another.

Parameters:
Returns:

bag – A dask bag that performs the given operation. May contain multiple partitions.

Return type:

dask.bag.Bag
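
A sketch of copying the cube to a second store; the source factory is the store_factory from the sketches above and the target store URL is a hypothetical example:

    from functools import partial

    from storefact import get_store_from_url

    from kartothek.io.dask.bag_cube import copy_cube_bag

    # Hypothetical target location; adjust the URL to the actual backing store.
    tgt_store_factory = partial(get_store_from_url, "hfs:///tmp/cube_backup")

    copy_cube_bag(cube=cube, src_store=store_factory, tgt_store=tgt_store_factory).compute()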

kartothek.io.dask.bag_cube.delete_cube_bag(cube, store, blocksize=100, datasets=None)[source]

Delete cube from store.

Important

This routine only deletes tracked files. Garbage and leftovers from old cubes and failed operations are NOT removed.

Parameters:
Returns:

bag – A dask bag that performs the given operation. May contain multiple partitions.

Return type:

dask.bag.Bag
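
A sketch of deleting the cube, with cube and store_factory as above:

    from kartothek.io.dask.bag_cube import delete_cube_bag

    # Removes tracked files only; untracked leftovers require cleanup_cube_bag.
    delete_cube_bag(cube=cube, store=store_factory).compute()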

kartothek.io.dask.bag_cube.extend_cube_from_bag(data, cube, store, ktk_cube_dataset_ids, metadata=None, overwrite=False, partition_on=None)[source]

Create a dask computation graph that extends a cube with the data supplied from a dask bag.

For details on data and metadata, see build_cube().

Parameters:
Returns:

metadata_dict – A dask bag containing the compute graph that extends the cube and returns the dict of dataset metadata objects. The bag has a single partition with a single element.

Return type:

dask.bag.Bag
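
A sketch of adding a hypothetical "enrich" dataset alongside the seed dataset, with cube and store_factory as above; the dataset name and column values are illustrative assumptions:

    import dask.bag as db
    import pandas as pd

    from kartothek.io.dask.bag_cube import extend_cube_from_bag

    # New non-seed dataset keyed on the cube's dimension and partition columns (illustrative).
    enrich_bag = db.from_sequence(
        [{"enrich": pd.DataFrame({"x": [0, 1], "p": [0, 0], "label": ["a", "b"]})}],
        npartitions=1,
    )

    graph = extend_cube_from_bag(
        data=enrich_bag,
        cube=cube,
        store=store_factory,
        ktk_cube_dataset_ids=["enrich"],
    )
    (metadata_dict,) = graph.compute()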

kartothek.io.dask.bag_cube.query_cube_bag(cube, store, conditions=None, datasets=None, dimension_columns=None, partition_by=None, payload_columns=None, blocksize=1)[source]

Query cube.

For detailed documentation, see query_cube().

Parameters:
Returns:

bag – Bag of 1-sized partitions of non-empty DataFrames, ordered by partition_by. The columns of the DataFrames are alphabetically ordered. Data types are provided on a best-effort basis (they are restored from the preserved data, but may differ due to Pandas NULL-handling, e.g. integer columns may become floats).

Return type:

dask.bag.Bag
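
A sketch of querying the cube back, with cube and store_factory as above; the condition and payload columns are illustrative assumptions:

    from kartothek.core.cube.conditions import C
    from kartothek.io.dask.bag_cube import query_cube_bag

    result_bag = query_cube_bag(
        cube=cube,
        store=store_factory,
        conditions=(C("x") > 0),  # optional filter on cube columns
        payload_columns=["v"],    # restrict the returned payload columns
    )
    for df in result_bag.compute():  # one non-empty DataFrame per 1-sized partition
        print(df)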