kartothek.core.index module

class kartothek.core.index.ExplicitSecondaryIndex(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, index_storage_key: Optional[str] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]

Bases: kartothek.core.index.IndexBase

An Index class representing an explicit, secondary index which is calculated and stored next to the dataset. In contrast to the PartitionIndex, this index must be calculated by an explicit pass over the data. Every mutation of this class erases the reference to the physical file, and storing the mutated object writes to a new storage key.

copy(**kwargs) → kartothek.core.index.ExplicitSecondaryIndex[source]
static from_v2(column: str, dct_or_str: Union[str, Dict[ValueType, List[str]]]) → kartothek.core.index.IndexBase[source]

Create an index instance from a version 2 Python structure.

Parameters:
  • column – Name of the column this index provides lookup for
  • dct_or_str – Either the storage key of the external index or the index itself as a Python object structure.
Returns:

index

Return type:

[kartothek.core.index.ExplicitSecondaryIndex]

load(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]])[source]

Load an external index into memory. Returns a new index object that contains the index dictionary. Returns itself if the index is internal or an already loaded index.

Parameters:store (Object) – Object that implements the .get method for file/object loading.
Returns:index
Return type:[kartothek.core.index.ExplicitSecondaryIndex]
store(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], dataset_uuid: str) → str[source]

Store the index as a Parquet file.

If compatible, the new keyname will be the name stored under the attribute index_storage_key. If this attribute is None, a new key will be generated of the format

{dataset_uuid}/indices/{column}/{timestamp}.by-dataset-index.parquet

where the timestamp has nanosecond accuracy and is created when the Index object is initialized.
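The key format above can be sketched in plain Python. This is only an illustration, not kartothek's own code: the helper name build_index_storage_key is hypothetical, and rendering the timestamp as an integer count of nanoseconds is an assumption, so the library's actual timestamp representation may differ.

```python
import time


def build_index_storage_key(dataset_uuid: str, column: str, timestamp_ns: int) -> str:
    """Hypothetical helper: assemble a key in the documented format."""
    return f"{dataset_uuid}/indices/{column}/{timestamp_ns}.by-dataset-index.parquet"


# Assumption: the timestamp is taken once, when the index object is created,
# and is reused for every later call that needs a generated key.
created_at_ns = time.time_ns()
key = build_index_storage_key("my_dataset", "col", created_at_ns)
```

Taking the timestamp at initialization rather than at store time means the generated key stays stable for the lifetime of one index object.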

Parameters:
  • store
  • dataset_uuid
unload() → kartothek.core.index.IndexBase[source]

Drop index data to save memory.

class kartothek.core.index.IndexBase(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]

Bases: kartothek.core._mixins.CopyMixin

as_flat_series(compact: bool = False, partitions_as_index: bool = False, date_as_object: bool = False, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None)[source]

Convert the Index object to a pandas.Series

Parameters:
  • compact – If True, ensures that the index will be unique. If there are multiple partition values per index value, these values will be compacted into a list (see Examples section).
  • partitions_as_index – If True, the relation between index values and partitions will be inverted for the output series: partition values will be used as the index and the index values will be mapped to the partitions.
  • predicates – A list of predicates. If a literal within the provided predicates references a column which is not part of this index, this literal is interpreted as True.

Examples

>>> import pyarrow as pa
>>> from kartothek.core.index import ExplicitSecondaryIndex
>>> index1 = ExplicitSecondaryIndex(
...     column="col", index_dct={1: ["part_1", "part_2"]}, dtype=pa.int64()
... )
>>> index1.as_flat_series()
col
1    part_1
1    part_2
Name: partition, dtype: object
>>> index1.as_flat_series(compact=True)
col
1    [part_1, part_2]
Name: partition, dtype: object
>>> index1.as_flat_series(partitions_as_index=True)
partition
part_1    1
part_2    1
Name: col, dtype: int64
copy(**kwargs) → kartothek.core.index.IndexBase[source]
eval_operator(op: str, value: ValueType) → Set[str][source]

Evaluates a given operator on the index for a given value and returns all partition labels allowed by this index.

Parameters:
  • op (str) – A string representation of the operator to be evaluated. Supported are “==”, “<=”, “>=”, “<”, “>” and “in”. For details, see the documentation of kartothek.serialization.
  • value (object) – The value to be evaluated
Returns:

Allowed partition labels

Return type:

set
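The semantics described for eval_operator can be sketched over a plain dictionary that maps index values to partition labels. This is an illustration only, not the library's implementation: the function name eval_operator_sketch is hypothetical, and value normalization and dtype handling are left out.

```python
import operator
from typing import Any, Dict, List, Set

# Comparison operators listed in the documentation; "in" is handled separately.
_COMPARISONS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def eval_operator_sketch(index_dct: Dict[Any, List[str]], op: str, value: Any) -> Set[str]:
    """Collect the partition labels whose index value satisfies the operator."""
    if op == "in":
        def matches(observed: Any) -> bool:
            return observed in value
    elif op in _COMPARISONS:
        compare = _COMPARISONS[op]

        def matches(observed: Any) -> bool:
            return compare(observed, value)
    else:
        raise ValueError(f"Unsupported operator: {op!r}")

    allowed: Set[str] = set()
    for observed, partitions in index_dct.items():
        if matches(observed):
            allowed.update(partitions)
    return allowed


index_dct = {1: ["part_1", "part_2"], 2: ["part_2"], 5: ["part_3"]}
labels_ge_2 = eval_operator_sketch(index_dct, ">=", 2)
```

Returning a set reflects that a partition is either allowed or not, however many index values point to it.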

loaded

Check if the index was already loaded into memory.

static normalize_value(dtype: pyarrow.lib.DataType, value: Any) → Any[source]

Normalize value according to index dtype.

This may apply casts (e.g. integers to floats) or parsing (e.g. timestamps from strings) to the value.

Parameters:
  • dtype (pyarrow.Type) – Arrow type of the index.
  • value (Any) – any value
Returns:

value – normalized value, with a type that matches the index dtype

Return type:

Any

Raises:
observed_values(date_as_object=True) → numpy.array[source]

Return an array of all observed values.

query(value: ValueType) → List[str][source]

Query this index for a given value. Raises an exception if the index is external and not loaded.

Parameters:value – The value that is looked up in the index dictionary.
Returns:keys – A list of keys of partitions that contain the corresponding value.
Return type:List[str]
remove_partitions(list_of_partitions: List[str], inplace=False) → kartothek.core.index.IndexBase[source]

Removes the given partitions from the internal index dictionary.

The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.

Parameters:
  • list_of_partitions (list) – The partitions to be removed
  • inplace (bool, (default: False)) – If True the operation is performed inplace and will return the same object
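As a rough sketch of what removing partitions means for the index dictionary, the plain-Python function below builds a new dictionary without the given labels. The name remove_partitions_sketch is hypothetical, and dropping values that no longer point to any partition is an assumption made for this illustration.

```python
from typing import Any, Dict, Iterable, List


def remove_partitions_sketch(
    index_dct: Dict[Any, List[str]], partitions: Iterable[str]
) -> Dict[Any, List[str]]:
    """Return a new index dictionary without the given partition labels."""
    to_remove = set(partitions)
    result: Dict[Any, List[str]] = {}
    for value, labels in index_dct.items():
        remaining = [label for label in labels if label not in to_remove]
        # Assumption: values left without any partition are dropped entirely.
        if remaining:
            result[value] = remaining
    return result


original = {1: ["part_1", "part_2"], 2: ["part_2"]}
pruned = remove_partitions_sketch(original, ["part_2"])
```

The sketch mirrors the default inplace=False behaviour: the input dictionary is left untouched and a new object is returned.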
remove_values(list_of_values: List[str], inplace: bool = False) → kartothek.core.index.IndexBase[source]

Removes the given values from the internal index dictionary.

Parameters:
  • list_of_values (list) – The values to be removed
  • inplace (bool, (default: False)) – If True the operation is performed inplace and will return the same object
to_dict() → Dict[ValueType, List[str]][source]

Serialise the object to a Python object that can be part of a larger dictionary that may be serialised to JSON.

update(index: kartothek.core.index.IndexBase, inplace: bool = False) → kartothek.core.index.IndexBase[source]

Returns a new Index object in case of a change.

The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.

Parameters:index ([kartothek.core.index.IndexBase]) – The index which should be added to this one
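Conceptually, adding one index to another combines their value-to-partition mappings. The sketch below shows this on plain dictionaries. It is not the library's implementation: update_sketch is a hypothetical name, and de-duplicating partition labels while preserving insertion order is an assumption.

```python
from typing import Any, Dict, List


def update_sketch(
    index_dct: Dict[Any, List[str]], other_dct: Dict[Any, List[str]]
) -> Dict[Any, List[str]]:
    """Return a new dictionary holding the partitions of both inputs."""
    # Copy the lists so that neither input is mutated.
    merged = {value: list(labels) for value, labels in index_dct.items()}
    for value, labels in other_dct.items():
        existing = merged.setdefault(value, [])
        for label in labels:
            # Assumption: a partition label is listed at most once per value.
            if label not in existing:
                existing.append(label)
    return merged


base = {1: ["part_1"]}
addition = {1: ["part_2"], 2: ["part_2"]}
combined = update_sketch(base, addition)
```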
class kartothek.core.index.PartitionIndex(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: pyarrow.lib.DataType = None, normalize_dtype: bool = True)[source]

Bases: kartothek.core.index.IndexBase

An Index class representing partition indices (sometimes also referred to as primary indices). A PartitionIndex is usually constructed by parsing the partition filenames which encode index information.

The constructor for this class should usually not be called explicitly. Instead, indices should be created by e.g. kartothek.core.dataset.DatasetMetadataBase.load_partition_indices().

kartothek.core.index.filter_indices(index_dict: Dict[str, kartothek.core.index.IndexBase], partitions: Iterable[str])[source]

Filter a kartothek index dictionary such that only the provided partitions are included in the index dictionary.

All indices must be embedded!

Parameters:
  • index_dict – A dictionary holding kartothek indices
  • partitions – A list of partition labels which are allowed in the output dictionary
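The filtering described above can be sketched on plain dictionaries, with one value-to-partition mapping per indexed column. This is an illustration under stated assumptions, not kartothek's code: filter_indices_sketch is a hypothetical name, the indices are represented as nested dictionaries rather than IndexBase objects, and values that lose all of their partitions are assumed to be dropped.

```python
from typing import Any, Dict, Iterable, List


def filter_indices_sketch(
    index_dicts: Dict[str, Dict[Any, List[str]]], partitions: Iterable[str]
) -> Dict[str, Dict[Any, List[str]]]:
    """Keep only the allowed partition labels in every column's index."""
    allowed = set(partitions)
    filtered: Dict[str, Dict[Any, List[str]]] = {}
    for column, index_dct in index_dicts.items():
        column_result: Dict[Any, List[str]] = {}
        for value, labels in index_dct.items():
            kept = [label for label in labels if label in allowed]
            # Assumption: values without any allowed partition are dropped.
            if kept:
                column_result[value] = kept
        filtered[column] = column_result
    return filtered


indices = {
    "col_a": {1: ["part_1", "part_2"], 2: ["part_3"]},
    "col_b": {"x": ["part_2"], "y": ["part_3"]},
}
only_part_2 = filter_indices_sketch(indices, ["part_2"])
```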
kartothek.core.index.merge_indices(list_of_indices: List[Dict[str, kartothek.core.index.IndexBase]]) → Dict[str, kartothek.core.index.IndexBase][source]

Merge a list of index dictionaries.

Parameters:list_of_indices (list of tuple) –

A list of tuples holding index information

Format: [ (partition_label, index_dict) ]

kartothek.core.index.remove_partitions_from_indices(index_dict: Dict[str, kartothek.core.index.IndexBase], partitions: List[str])[source]

Remove a given list of partitions from a kartothek index dictionary.

Parameters:
  • index_dict (dict of Index) – A dictionary holding kartothek indices
  • partitions (list) – A list of partition labels which should be removed from the index objects