kartothek.io_components.utils module

This module is a collection of helper functions.

class kartothek.io_components.utils.InvalidObject[source]

Bases: object

Sentinel to mark keys for removal

kartothek.io_components.utils.align_categories(dfs, categoricals)[source]

Takes a list of dataframes with categorical columns and determines the superset of categories. All specified columns are then cast to the same pd.CategoricalDtype.

Parameters
  • dfs (List[pd.DataFrame]) – A list of dataframes for which the categoricals should be aligned

  • categoricals (List[str]) – Columns holding categoricals which should be aligned

Returns

A list with aligned dataframes

Return type

List[pd.DataFrame]
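The alignment described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: `align_categories_sketch` is a hypothetical name, and the order of the combined categories (first appearance across the dataframes) is an assumption that may differ from what kartothek produces.

```python
import pandas as pd


def align_categories_sketch(dfs, categoricals):
    """Hypothetical helper: cast the given columns to one shared CategoricalDtype."""
    # Work on copies so the caller's dataframes are left untouched.
    aligned = [df.copy() for df in dfs]
    for column in categoricals:
        # Collect the superset of categories in order of first appearance
        # (assumption: the real function may order categories differently).
        categories = []
        for df in dfs:
            for value in df[column].astype("category").cat.categories:
                if value not in categories:
                    categories.append(value)
        shared_dtype = pd.CategoricalDtype(categories=categories)
        for df in aligned:
            df[column] = df[column].astype(shared_dtype)
    return aligned


df_a = pd.DataFrame({"color": pd.Categorical(["red", "green"])})
df_b = pd.DataFrame({"color": pd.Categorical(["green", "blue"])})
aligned_a, aligned_b = align_categories_sketch([df_a, df_b], ["color"])
```

After this call both `color` columns share one dtype, so the dataframes can be concatenated without the column falling back to `object`.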

kartothek.io_components.utils.check_single_table_dataset(dataset, expected_table=None)[source]

Raise if the given dataset is not a single-table dataset.

Parameters

Deprecated since version 5.3: This will be removed in 6.0. The check_single_table_dataset keyword is deprecated and will be removed.

kartothek.io_components.utils.combine_metadata(dataset_metadata: List[Dict], append_to_list: bool = True) → Dict[source]

Merge a list of dictionaries

The merge keeps only those keys that are present in all dictionaries.

If lists are encountered, the values of the result will be the concatenation of all list values in the order of the supplied dictionary list. This behaviour may be changed by using append_to_list.

Parameters
  • dataset_metadata – The list of dictionaries (usually metadata) to be combined.

  • append_to_list – If True, all list values are concatenated. If False, only unique values are kept.
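The merge rules above can be sketched in plain Python. This is a simplified illustration rather than kartothek's code: `combine_metadata_sketch` is a hypothetical name, nested dictionaries are not handled, and two behaviours are assumptions not stated in this page: keys whose non-list values disagree are dropped, and with `append_to_list=False` the unique values keep their order of first appearance.

```python
from typing import Any, Dict, List


def combine_metadata_sketch(
    dataset_metadata: List[Dict[str, Any]], append_to_list: bool = True
) -> Dict[str, Any]:
    """Hypothetical helper: merge dictionaries, keeping only shared keys."""
    if not dataset_metadata:
        return {}
    # Only keys present in every dictionary survive the merge.
    common_keys = set(dataset_metadata[0])
    for metadata in dataset_metadata[1:]:
        common_keys &= set(metadata)

    result: Dict[str, Any] = {}
    for key in dataset_metadata[0]:  # keep the key order of the first dictionary
        if key not in common_keys:
            continue
        values = [metadata[key] for metadata in dataset_metadata]
        if all(isinstance(value, list) for value in values):
            # Concatenate list values in the order of the supplied dictionaries.
            merged = [item for value in values for item in value]
            if not append_to_list:
                # Keep the first occurrence of each item (ordering is an assumption).
                unique = []
                for item in merged:
                    if item not in unique:
                        unique.append(item)
                merged = unique
            result[key] = merged
        elif all(value == values[0] for value in values[1:]):
            result[key] = values[0]
        # Assumption: keys with conflicting non-list values are dropped.
    return result


first = {"creator": "etl", "tags": [1, 2], "version": 1, "only_here": True}
second = {"creator": "etl", "tags": [2, 3], "version": 2}
combined = combine_metadata_sketch([first, second])
combined_unique = combine_metadata_sketch([first, second], append_to_list=False)
```

Here `only_here` is missing from the second dictionary and `version` has conflicting values, so neither appears in the result.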

kartothek.io_components.utils.extract_duplicates(lst)[source]

Return all items of a list that occur more than once.

Parameters

lst (List[Any]) –

Returns

lst

Return type

List[Any]
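One way to express "items that occur more than once" is with `collections.Counter`. The sketch below assumes hashable items and uses a hypothetical name, `extract_duplicates_sketch`; whether the library's function returns each duplicate once and in which order is not stated in this page.

```python
from collections import Counter
from typing import Any, List


def extract_duplicates_sketch(lst: List[Any]) -> List[Any]:
    """Hypothetical helper: return each item that appears more than once."""
    counts = Counter(lst)  # requires hashable items
    # Counter keeps first-insertion order, so duplicates are reported
    # in the order they first appear in the input.
    return [item for item, count in counts.items() if count > 1]


duplicates = extract_duplicates_sketch(["a", "b", "a", "c", "b", "a"])
```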

kartothek.io_components.utils.raise_if_indices_overlap(partition_on, secondary_indices)[source]

kartothek.io_components.utils.sort_values_categorical(df: pandas.core.frame.DataFrame, columns: Union[List[str], str]) → pandas.core.frame.DataFrame[source]

Sort a dataframe lexicographically by the categories of the given column or columns.

kartothek.io_components.utils.validate_partition_keys(dataset_uuid, store, ds_factory, default_metadata_version, partition_on, load_dataset_metadata=True)[source]