kartothek.io_components.cube.write module

Common functionality required to implement cube writes.

kartothek.io_components.cube.write.apply_postwrite_checks(datasets, cube, store, existing_datasets)

Apply sanity checks that can only be done after Kartothek has written its datasets.

Parameters
Returns

datasets – Datasets that just got written.

Return type

Dict[str, kartothek.core.dataset.DatasetMetadata]

Raises

ValueError – If sanity check failed.

kartothek.io_components.cube.write.check_datasets_prebuild(ktk_cube_dataset_ids, cube, existing_datasets)

Check whether the given dataset UUIDs can be used to build a given cube. Intended to be called before any write operation is performed.

The following checks will be applied:

  • the seed dataset must be part of the data

  • no leftover non-seed datasets may be present unless they are overwritten

Parameters
Raises

ValueError – If any of the checks above fails.
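
The two checks listed above can be illustrated with a minimal, self-contained sketch. It does not call Kartothek: the name sketch_check_prebuild and its plain-string arguments are assumptions made for illustration only, whereas the documented function receives a cube object and the existing dataset metadata.

```python
from typing import Iterable


def sketch_check_prebuild(
    dataset_ids: Iterable[str],
    seed_dataset: str,
    existing_ids: Iterable[str],
) -> None:
    """Hypothetical stand-in mirroring the two documented pre-build checks."""
    dataset_ids = set(dataset_ids)

    # Check 1: the seed dataset must be part of the data to be written.
    if seed_dataset not in dataset_ids:
        raise ValueError(f"Seed dataset '{seed_dataset}' is missing.")

    # Check 2: existing non-seed datasets that would not be overwritten
    # count as leftovers and are rejected.
    leftovers = set(existing_ids) - dataset_ids - {seed_dataset}
    if leftovers:
        raise ValueError(f"Leftover datasets: {sorted(leftovers)}")
```

In this sketch, writing "seed" and "enrich" over an existing cube that holds exactly those two datasets passes, while omitting "seed" or leaving an untouched third dataset behind raises ValueError.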

kartothek.io_components.cube.write.check_datasets_preextend(ktk_cube_dataset_ids, cube)

Check whether the given dataset UUIDs can be used to extend a given cube. Intended to be called before any write operation is performed.

The following checks will be applied:

  • the seed dataset of the cube must not be touched

Warning

It is assumed that Kartothek checks if the overwrite flags are correct. Therefore, modifications of non-seed datasets are NOT checked here.

Parameters
Raises

ValueError – If the check above fails.
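
As a rough illustration of the single pre-extend check, the sketch below rejects any attempt to touch the seed dataset. The name sketch_check_preextend and the plain-string seed_dataset argument are hypothetical; the documented function reads the seed dataset from the cube object.

```python
from typing import Iterable


def sketch_check_preextend(dataset_ids: Iterable[str], seed_dataset: str) -> None:
    """Hypothetical stand-in: extending a cube must not touch its seed dataset."""
    if seed_dataset in set(dataset_ids):
        raise ValueError(
            f"Seed dataset '{seed_dataset}' must not be part of an extension."
        )
    # Non-seed datasets are deliberately not inspected here, in line with
    # the warning above.
```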

kartothek.io_components.cube.write.check_provided_metadata_dict(metadata, ktk_cube_dataset_ids)

Check metadata dict provided by the user.

Parameters
  • metadata (Optional[Dict[str, Dict[str, Any]]]) – Optional metadata provided by the user.

  • ktk_cube_dataset_ids (Iterable[str]) – ktk_cube_dataset_ids announced by the user.

Returns

metadata – Metadata provided by the user.

Return type

Dict[str, Dict[str, Any]]

Raises
  • TypeError – If either the dict or one of the contained values has the wrong type.

  • ValueError – If a ktk_cube_dataset_id in the dict is not in ktk_cube_dataset_ids.
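
The documented contract (TypeError for wrongly typed input, ValueError for unknown dataset ids) might look roughly like the following self-contained sketch. The name sketch_check_metadata is hypothetical, and two details are assumptions rather than documented behavior: that a missing dict is normalized to an empty one, and the order in which the checks run.

```python
from typing import Any, Dict, Iterable, Optional


def sketch_check_metadata(
    metadata: Optional[Dict[str, Dict[str, Any]]],
    ktk_cube_dataset_ids: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical stand-in for the documented type and id checks."""
    if metadata is None:
        # Assumption: None is treated as "no metadata provided".
        return {}
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a dict")

    known_ids = set(ktk_cube_dataset_ids)
    for dataset_id, dataset_metadata in metadata.items():
        if dataset_id not in known_ids:
            raise ValueError(f"Unknown ktk_cube_dataset_id: {dataset_id}")
        if not isinstance(dataset_metadata, dict):
            raise TypeError(f"Metadata for '{dataset_id}' must be a dict")
    return metadata
```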

kartothek.io_components.cube.write.multiplex_user_input(data, cube)

Get input from the user and ensure it’s a multi-dataset dict.

Parameters
Returns

pipeline_input – Input for write pipelines.

Return type

Dict[str, pandas.DataFrame]
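
To make the idea of a "multi-dataset dict" concrete, here is a minimal sketch that wraps a bare input into a dict and passes an existing dict through. It rests on an assumption not stated in this reference: that a bare input is assigned to the cube's seed dataset. The name sketch_multiplex and the plain-string seed_dataset argument are hypothetical.

```python
from typing import Any, Dict


def sketch_multiplex(data: Any, seed_dataset: str) -> Dict[str, Any]:
    """Hypothetical stand-in: always return a dict keyed by dataset id."""
    if isinstance(data, dict):
        # Already a multi-dataset dict; return a shallow copy.
        return dict(data)
    # Assumption: a bare object (for example a single DataFrame) belongs
    # to the seed dataset.
    return {seed_dataset: data}
```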

kartothek.io_components.cube.write.prepare_data_for_ktk(df, ktk_cube_dataset_id, cube, existing_payload, partition_on, consume_df=False)

Prepare data so it can be handed over to Kartothek.

Some checks will be applied to the data to ensure it is sane.

Parameters
  • df (pandas.DataFrame) – DataFrame to be passed to Kartothek.

  • ktk_cube_dataset_id (str) – Ktk_cube dataset UUID (w/o cube prefix).

  • cube (kartothek.core.cube.cube.Cube) – Cube specification.

  • existing_payload (Set[str]) – Existing payload columns.

  • partition_on (Iterable[str]) – Partition-on attribute for given dataset.

  • consume_df (bool) – Whether the incoming DataFrame can be destroyed while processing it.

Returns

mp – Kartothek-ready MetaPartition, may be sentinel (aka empty and w/o label).

Return type

kartothek.io_components.metapartition.MetaPartition

Raises

ValueError – If the data fails any of the sanity checks.

kartothek.io_components.cube.write.prepare_ktk_metadata(cube, ktk_cube_dataset_id, metadata)

Prepare metadata that should be passed to Kartothek.

This will add the following information:

  • a flag indicating whether the dataset is considered a seed dataset

  • dimension columns

  • partition columns

  • optional user-provided metadata

Parameters
  • cube (kartothek.core.cube.cube.Cube) – Cube specification.

  • ktk_cube_dataset_id (str) – Ktk_cube dataset UUID (w/o cube prefix).

  • metadata (Optional[Dict[str, Dict[str, Any]]]) – Optional metadata provided by the user. The first key is the ktk_cube dataset id, the value is the user-level metadata for that dataset. Should be piped through check_provided_metadata_dict() beforehand.

Returns

ktk_metadata – Metadata ready for Kartothek.

Return type

Dict[str, Any]
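
The four pieces of information listed above could be assembled along the lines of the sketch below. It is illustrative only: the name sketch_prepare_metadata, the flat arguments that replace the cube object, and the key names "is_seed", "dimension_columns", and "partition_columns" are placeholders, not the keys Kartothek actually writes. That the generated keys take precedence over user-provided ones is likewise an assumption.

```python
from typing import Any, Dict, Iterable, Optional


def sketch_prepare_metadata(
    dataset_id: str,
    seed_dataset: str,
    dimension_columns: Iterable[str],
    partition_columns: Iterable[str],
    metadata: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in that merges user metadata with cube information."""
    # Start from the user-level metadata of this dataset, if any.
    ktk_metadata = dict((metadata or {}).get(dataset_id, {}))

    # Placeholder key names; assumed to override user-provided entries.
    ktk_metadata.update(
        {
            "is_seed": dataset_id == seed_dataset,
            "dimension_columns": list(dimension_columns),
            "partition_columns": list(partition_columns),
        }
    )
    return ktk_metadata
```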

kartothek.io_components.cube.write.prepare_ktk_partition_on(cube: kartothek.core.cube.cube.Cube, ktk_cube_dataset_ids: Iterable[str], partition_on: Optional[Dict[str, Iterable[str]]]) → Dict[str, Tuple[str, ...]]

Prepare partition_on values for Kartothek.

Parameters
  • cube – Cube specification.

  • ktk_cube_dataset_ids – ktk_cube_dataset_ids announced by the user.

  • partition_on – Optional partition-on attributes for datasets.

Returns

partition_on – Partition-on per dataset.

Return type

Dict

Raises

ValueError – In case user-provided values are invalid.
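
A minimal sketch of the normalization described above, producing one tuple of partition-on columns per dataset. Several points are assumptions rather than documented behavior: that datasets without a user-provided entry fall back to the cube's partition columns, and that duplicate columns are one example of an invalid value. The name sketch_prepare_partition_on and the default_partition_columns argument are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def sketch_prepare_partition_on(
    default_partition_columns: Iterable[str],
    ktk_cube_dataset_ids: Iterable[str],
    partition_on: Optional[Dict[str, Iterable[str]]] = None,
) -> Dict[str, Tuple[str, ...]]:
    """Hypothetical stand-in returning partition-on tuples per dataset."""
    partition_on = partition_on or {}
    default = tuple(default_partition_columns)

    result = {}
    for dataset_id in ktk_cube_dataset_ids:
        # Assumed fallback: use the default columns when nothing is given.
        columns = tuple(partition_on.get(dataset_id, default))
        # Assumed validity rule for illustration: no duplicate columns.
        if len(set(columns)) != len(columns):
            raise ValueError(f"Duplicate partition-on columns for '{dataset_id}'")
        result[dataset_id] = columns
    return result
```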