kartothek.io_components.cube.write module
Common helpers required to implement cube write operations.
-
kartothek.io_components.cube.write.apply_postwrite_checks(datasets, cube, store, existing_datasets)
Apply sanity checks that can only be done after Kartothek has written its datasets.
- Parameters
datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that just got written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
store (Union[Callable[[], simplekv.KeyValueStore], simplekv.KeyValueStore]) – KV store.
existing_datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that were present before the write procedure started.
- Returns
datasets – Datasets that just got written.
- Return type
- Raises
ValueError – If a sanity check failed.
-
kartothek.io_components.cube.write.check_datasets_prebuild(ktk_cube_dataset_ids, cube, existing_datasets)
Check whether the given dataset UUIDs can be used to build the given cube. Intended to run before any write operation is performed.
The following checks will be applied:
the seed dataset must be part of the data
no leftovers (existing non-seed datasets that are not overwritten) may be present
- Parameters
ktk_cube_dataset_ids (Iterable[str]) – Dataset IDs that should be written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
existing_datasets (Dict[str, kartothek.core.dataset.DatasetMetadata]) – Datasets that existed before the write process started.
- Raises
ValueError – In case of an error.
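The two checks listed above can be illustrated with a minimal sketch. It is not kartothek's implementation: `sketch_check_prebuild` is a hypothetical stand-in, and the seed dataset name and the existing dataset IDs are passed as plain strings instead of a `Cube` and `DatasetMetadata` objects.

```python
from typing import Iterable, Set


def sketch_check_prebuild(
    ktk_cube_dataset_ids: Iterable[str],
    seed_dataset: str,
    existing_dataset_ids: Iterable[str],
) -> None:
    """Hypothetical stand-in for the two pre-build checks (not kartothek code)."""
    to_write: Set[str] = set(ktk_cube_dataset_ids)

    # Check 1: the seed dataset must be part of the data.
    if seed_dataset not in to_write:
        raise ValueError(f"Seed dataset {seed_dataset!r} is missing from the data.")

    # Check 2: every dataset that already exists must be overwritten,
    # otherwise it would be left over after the build.
    leftovers = set(existing_dataset_ids) - to_write
    if leftovers:
        raise ValueError(
            "Existing datasets would be left over: " + ", ".join(sorted(leftovers))
        )
```

Both failures surface as `ValueError`, matching the single exception type documented for this function.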
-
kartothek.io_components.cube.write.check_datasets_preextend(ktk_cube_dataset_ids, cube)
Check whether the given dataset UUIDs can be used to extend the given cube. Intended to run before any write operation is performed.
The following checks will be applied:
the seed dataset of the cube must not be touched
- Warning
It is assumed that Kartothek checks whether the overwrite flags are correct. Therefore, modifications of non-seed datasets are NOT checked here.
- Parameters
ktk_cube_dataset_ids (Iterable[str]) – Dataset IDs that should be written.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
- Raises
ValueError – In case of an error.
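The pre-extend check described above is small enough to sketch in a few lines. This is an illustration only: `sketch_check_preextend` is a hypothetical name, and the seed dataset is passed as a plain string instead of being read from a `Cube`.

```python
from typing import Iterable


def sketch_check_preextend(
    ktk_cube_dataset_ids: Iterable[str], seed_dataset: str
) -> None:
    """Hypothetical stand-in for the pre-extend check (not kartothek code)."""
    # Extending a cube must leave its seed dataset untouched.
    if seed_dataset in set(ktk_cube_dataset_ids):
        raise ValueError(
            f"Seed dataset {seed_dataset!r} must not be written during an extend."
        )
```

As the warning above notes, nothing here inspects non-seed datasets; that is left to Kartothek's own overwrite handling.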
-
kartothek.io_components.cube.write.check_provided_metadata_dict(metadata, ktk_cube_dataset_ids)
Check metadata dict provided by the user.
- Parameters
- Returns
metadata – Metadata provided by the user.
- Return type
- Raises
TypeError – If either the dict or one of the contained values has the wrong type.
ValueError – If a ktk_cube_dataset_id in the dict is not in ktk_cube_dataset_ids.
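The documented error conditions suggest a validation along the following lines. This is a sketch under assumptions, not the library's code: `sketch_check_metadata` is a hypothetical name, treating `None` as an empty dict is an assumption, and the order in which the checks run is a guess.

```python
from typing import Any, Dict, Iterable, Optional


def sketch_check_metadata(
    metadata: Optional[Dict[str, Dict[str, Any]]],
    ktk_cube_dataset_ids: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical stand-in for the metadata check (not kartothek code)."""
    if metadata is None:
        # Assumption: missing metadata is treated as "no metadata".
        return {}
    if not isinstance(metadata, dict):
        raise TypeError(f"metadata must be a dict, got {type(metadata).__name__}")

    known = set(ktk_cube_dataset_ids)
    for dataset_id, dataset_metadata in metadata.items():
        if dataset_id not in known:
            raise ValueError(f"Unknown ktk_cube_dataset_id: {dataset_id!r}")
        if not isinstance(dataset_metadata, dict):
            raise TypeError(
                f"Metadata for {dataset_id!r} must be a dict, "
                f"got {type(dataset_metadata).__name__}"
            )
    return metadata
```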
-
kartothek.io_components.cube.write.multiplex_user_input(data, cube)
Get input from the user and ensure it’s a multi-dataset dict.
- Parameters
data (Union[pandas.DataFrame, Dict[str, pandas.DataFrame]]) – User input.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
- Returns
pipeline_input – Input for write pipelines.
- Return type
Dict[str, pandas.DataFrame]
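A minimal sketch of the normalisation described above follows. It rests on assumptions: `sketch_multiplex` is a hypothetical name, a single frame is assumed to belong to the cube's seed dataset (the text above does not say which key is used), and any Python object stands in for a `pandas.DataFrame` so the sketch has no dependencies.

```python
from typing import Any, Dict, Union


def sketch_multiplex(
    data: Union[Any, Dict[str, Any]], seed_dataset: str
) -> Dict[str, Any]:
    """Hypothetical stand-in for input multiplexing (not kartothek code)."""
    if isinstance(data, dict):
        # Already a multi-dataset dict: return a shallow copy.
        return dict(data)
    # Assumption: a single frame is treated as data for the seed dataset.
    return {seed_dataset: data}
```

Downstream pipeline steps can then iterate over the dict without caring which form the user supplied.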
-
kartothek.io_components.cube.write.prepare_data_for_ktk(df, ktk_cube_dataset_id, cube, existing_payload, partition_on, consume_df=False)
Prepare data so it can be handed over to Kartothek.
Some checks will be applied to the data to ensure it is sane.
- Parameters
df (pandas.DataFrame) – DataFrame to be passed to Kartothek.
ktk_cube_dataset_id (str) – ktk_cube dataset UUID (without the cube prefix).
cube (kartothek.core.cube.cube.Cube) – Cube specification.
existing_payload (Set[str]) – Existing payload columns.
partition_on (Iterable[str]) – Partition-on attribute for given dataset.
consume_df (bool) – Whether the incoming DataFrame can be destroyed while processing it.
- Returns
mp – Kartothek-ready MetaPartition; may be a sentinel (i.e. empty and without a label).
- Return type
- Raises
ValueError – If any of the checks applied to the data fails.
-
kartothek.io_components.cube.write.prepare_ktk_metadata(cube, ktk_cube_dataset_id, metadata)
Prepare metadata that should be passed to Kartothek.
This will add the following information:
a flag indicating whether the dataset is considered a seed dataset
dimension columns
partition columns
optional user-provided metadata
- Parameters
cube (kartothek.core.cube.cube.Cube) – Cube specification.
ktk_cube_dataset_id (str) – ktk_cube dataset UUID (without the cube prefix).
metadata (Optional[Dict[str, Dict[str, Any]]]) – Optional metadata provided by the user. The first-level key is the ktk_cube dataset ID; the value is the user-level metadata for that dataset. Should be piped through check_provided_metadata_dict() beforehand.
- Returns
ktk_metadata – Metadata ready for Kartothek.
- Return type
Dict[str, Any]
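The four pieces of information listed above could be assembled roughly as follows. This is a sketch under assumptions: `sketch_prepare_metadata` and the `sketch_*` key names are hypothetical (the real metadata keys are not given here), the cube attributes are passed as plain values, and letting the cube-derived entries override user keys of the same name is a guess.

```python
from typing import Any, Dict, Iterable, Optional


def sketch_prepare_metadata(
    seed_dataset: str,
    dimension_columns: Iterable[str],
    partition_columns: Iterable[str],
    ktk_cube_dataset_id: str,
    metadata: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Hypothetical stand-in for metadata preparation (not kartothek code)."""
    # Start from the optional user-provided metadata for this dataset.
    user_metadata = (metadata or {}).get(ktk_cube_dataset_id, {})
    ktk_metadata: Dict[str, Any] = dict(user_metadata)

    # Add the cube-derived entries; the key names are illustrative only.
    ktk_metadata["sketch_is_seed"] = ktk_cube_dataset_id == seed_dataset
    ktk_metadata["sketch_dimension_columns"] = list(dimension_columns)
    ktk_metadata["sketch_partition_columns"] = list(partition_columns)
    return ktk_metadata
```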
-
kartothek.io_components.cube.write.prepare_ktk_partition_on(cube: kartothek.core.cube.cube.Cube, ktk_cube_dataset_ids: Iterable[str], partition_on: Optional[Dict[str, Iterable[str]]]) → Dict[str, Tuple[str, …]]
Prepare partition_on values for Kartothek.
- Parameters
cube – Cube specification.
ktk_cube_dataset_ids – ktk_cube_dataset_ids announced by the user.
partition_on – Optional partition-on attributes for datasets.
- Returns
partition_on – Partition-on per dataset.
- Return type
Dict
- Raises
ValueError – In case user-provided values are invalid.
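The shape of the result (one tuple of column names per dataset) can be illustrated with a short sketch. The details are assumptions: `sketch_prepare_partition_on` is a hypothetical name, datasets without a user-provided entry are assumed to fall back to the cube's partition columns, and the only validity rule modelled is that every key must be an announced dataset ID. The real function may enforce further rules.

```python
from typing import Dict, Iterable, Optional, Tuple


def sketch_prepare_partition_on(
    cube_partition_columns: Iterable[str],
    ktk_cube_dataset_ids: Iterable[str],
    partition_on: Optional[Dict[str, Iterable[str]]],
) -> Dict[str, Tuple[str, ...]]:
    """Hypothetical stand-in for partition_on preparation (not kartothek code)."""
    dataset_ids = list(ktk_cube_dataset_ids)
    user_values = partition_on or {}

    # Modelled validity rule: keys must refer to announced datasets.
    unknown = set(user_values) - set(dataset_ids)
    if unknown:
        raise ValueError(
            "partition_on given for unknown datasets: " + ", ".join(sorted(unknown))
        )

    # Assumption: fall back to the cube's partition columns by default.
    default = tuple(cube_partition_columns)
    return {
        dataset_id: tuple(user_values.get(dataset_id, default))
        for dataset_id in dataset_ids
    }
```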