kartothek.io_components.cube.write module¶

Common functionality required to implement cube write functionality.

kartothek.io_components.cube.write.apply_postwrite_checks(datasets, cube, store, existing_datasets)[source]¶

Apply sanity checks that can only be done after Kartothek has written its datasets.

Parameters

datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that just got written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
store (Union[Callable[[], simplekv.KeyValueStore], simplekv.KeyValueStore]) – KV store.
existing_datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that were present before the write procedure started.

Returns

datasets – Datasets that just got written.

Return type

Dict[str, kartothek.core.dataset.DatasetMetadata]

Raises

ValueError – If sanity check failed.

kartothek.io_components.cube.write.check_datasets_prebuild(ktk_cube_dataset_ids, cube, existing_datasets)[source]¶

Check if given dataset UUIDs can be used to build a given cube, to be used before any write operation is performed.

The following checks will be applied:

the seed dataset must be part of the data
no leftovers (non-seed datasets) must be present that are not overwritten

Parameters

ktk_cube_dataset_ids (Iterable[str]) – Dataset IDs that should be written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
existing_datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that existings before the write process started.

Raises

ValueError – In case of an error.

kartothek.io_components.cube.write.check_datasets_preextend(ktk_cube_dataset_ids, cube)[source]¶

Check if given dataset UUIDs can be used to extend a given cube, to be used before any write operation is performed.

The following checks will be applied:

the seed dataset of the cube must not be touched

..warning::: It is assumed that Kartothek checks if the overwrite flags are correct. Therefore, modifications of non-seed datasets are NOT checked here.

Parameters

ktk_cube_dataset_ids (Iterable[str]) – Dataset IDs that should be written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.

Raises

ValueError – In case of an error.

kartothek.io_components.cube.write.check_provided_metadata_dict(metadata, ktk_cube_dataset_ids)[source]¶

Check metadata dict provided by the user.

Parameters

metadata (Optional[Dict[str, Dict[str, Any]]]) – Optional metadata provided by the user.
ktk_cube_dataset_ids (Iterable[str]) – ktk_cube_dataset_ids announced by the user.

Returns

metadata – Metadata provided by the user.

Return type

Dict[str, Dict[str, Any]]

Raises

TypeError – If either the dict or one of the contained values has the wrong type.:
ValueError – If a ktk_cube_dataset_id in the dict is not in ktk_cube_dataset_ids.:

kartothek.io_components.cube.write.multiplex_user_input(data, cube)[source]¶

Get input from the user and ensure it’s a multi-dataset dict.

Parameters

data (Union[pandas.DataFrame, Dict[str, pandas.DataFrame]]) – User input.
cube (kartothek.core.cube.cube.Cube) – Cube specification.

Returns

pipeline_input – Input for write pipelines.

Return type

Dict[str, pandas.DataFrame]

kartothek.io_components.cube.write.prepare_data_for_ktk(df, ktk_cube_dataset_id, cube, existing_payload, partition_on, consume_df=False)[source]¶

Prepare data so it can be handed over to Kartothek.

Some checks will be applied to the data to ensure it is sane.

Parameters

df (pandas.DataFrame) – DataFrame to be passed to Kartothek.
ktk_cube_dataset_id (str) – Ktk_cube dataset UUID (w/o cube prefix).
cube (kartothek.core.cube.cube.Cube) – Cube specification.
existing_payload (Set[str]) – Existing payload columns.
partition_on (Iterable[str]) – Partition-on attribute for given dataset.
consume_df (bool) – Whether the incoming DataFrame can be destroyed while processing it.

Returns

mp – Kartothek-ready MetaPartition, may be sentinel (aka empty and w/o label).

Return type

kartothek.io_components.metapartition.MetaPartition

Raises

ValueError – In case anything is fishy.

kartothek.io_components.cube.write.prepare_ktk_metadata(cube, ktk_cube_dataset_id, metadata)[source]¶

Prepare metadata that should be passed to Kartothek.

This will add the following information:

a flag indicating whether the dataset is considered a seed dataset
dimension columns
partition columns
optional user-provided metadata

Parameters

cube (kartothek.core.cube.cube.Cube) – Cube specification.
ktk_cube_dataset_id (str) – Ktk_cube dataset UUID (w/o cube prefix).
metadata (Optional[Dict[str, Dict[str, Any]]]) – Optional metadata provided by the user. The first key is the ktk_cube dataset id, the value is the user-level metadata for that dataset. Should be piped through check_provided_metadata_dict() beforehand.

Returns

ktk_metadata – Metadata ready for Kartothek.

Return type

Dict[str, Any]

kartothek.io_components.cube.write.prepare_ktk_partition_on(cube: kartothek.core.cube.cube.Cube, ktk_cube_dataset_ids: Iterable[str], partition_on: Optional[Dict[str, Iterable[str]]]) → Dict[str, Tuple[str, …]][source]¶

Prepare partition_on values for kartothek.

Parameters

cube – Cube specification.
ktk_cube_dataset_ids – ktk_cube_dataset_ids announced by the user.
partition_on – Optional parition-on attributes for datasets.

Returns

partition_on – Partition-on per dataset.

Return type

Dict

Raises

ValueError – In case user-provided values are invalid.: