kartothek.core.index module

class kartothek.core.index.ExplicitSecondaryIndex(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, index_storage_key: Optional[str] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)

Bases: kartothek.core.index.IndexBase

An Index class representing an explicit, secondary index which is calculated and stored next to the dataset. In contrast to the PartitionIndex this needs to be calculated by an explicit pass over the data. All mutations of this class will erase the reference to the physical file and the storage of the mutated object will write to a new storage key.

copy(**kwargs) → kartothek.core.index.ExplicitSecondaryIndex

static from_v2(column: str, dct_or_str: Union[str, Dict[ValueType, List[str]]]) → kartothek.core.index.IndexBase

Create an index instance from a version 2 Python structure.

Parameters

column – Name of the column this index provides lookup for
dct_or_str – Either the storage key of the external index or the index itself as a Python object structure.

Returns

index

Return type

kartothek.core.index.ExplicitSecondaryIndex

load(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]])

Load an external index into memory. Returns a new index object that contains the index dictionary. Returns itself if the index is internal or an already loaded index.

Parameters: store – Object that implements the .get method for file/object loading.
Returns: index
Return type: kartothek.core.index.ExplicitSecondaryIndex

store(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], dataset_uuid: str) → str

Store the index as a parquet file.

If compatible, the new key name will be the name stored under the attribute index_storage_key. If this attribute is None, a new key of the following format will be generated:

{dataset_uuid}/indices/{column}/{timestamp}.by-dataset-index.parquet

where the timestamp has nanosecond accuracy and is created upon Index object initialization.

Parameters

store –
dataset_uuid –
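
The key naming rule described above can be sketched in plain Python. This is only an illustration of the documented format, not kartothek's implementation: the helper name build_index_key is hypothetical, the timestamp is passed in as an already formatted string because its exact representation is not specified here, and the "if compatible" check is left out.

```python
from typing import Optional

# Suffix taken from the key format documented above.
INDEX_SUFFIX = ".by-dataset-index.parquet"


def build_index_key(
    dataset_uuid: str,
    column: str,
    timestamp: str,
    index_storage_key: Optional[str] = None,
) -> str:
    """Hypothetical helper: return the storage key an index would be written to."""
    # Reuse an existing key if one is set (compatibility check omitted in this sketch).
    if index_storage_key is not None:
        return index_storage_key
    # Otherwise build a new key from dataset UUID, column and creation timestamp.
    return f"{dataset_uuid}/indices/{column}/{timestamp}{INDEX_SUFFIX}"
```

For instance, build_index_key("my_dataset", "col", "1577836800000000000") yields a key below my_dataset/indices/col/.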

unload() → kartothek.core.index.IndexBase: Drop index data to save memory.

class kartothek.core.index.IndexBase(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)

Bases: kartothek.core._mixins.CopyMixin

Initialize an IndexBase.

Parameters

column – Name of the column this index is for.
index_dct – Mapping from index values to partition labels
dtype – Type of index. If left out and index_dct is present, this will be inferred.
normalize_dtype – Normalize type information and values within index_dct. The user may disable this when the index was already normalized, e.g. when the index Python object gets copied, or when the index data is restored from a parquet file that was written by a trusted write path.

as_flat_series(compact: bool = False, partitions_as_index: bool = False, date_as_object: bool = False, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None) → pandas.core.series.Series

Convert the Index object to a pandas.Series.

Parameters

predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[('x', '>', 0), …], …], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. The inner predicates within one list are combined with a conjunction (AND) into a larger predicate. The outermost list then combines these larger predicates with a disjunction (OR). This form can express any predicate that is possible using boolean logic.

Available operators are: ==, !=, <=, >=, <, > and in.

Filtering for missing values is supported with the operators ==, != and in, using the values np.nan and None for float and string columns respectively.

Categorical data

When using order sensitive operators on categorical data we will assume that the categories obey a lexicographical ordering. This filtering may result in less than optimal performance and may be slower than the evaluation on non-categorical data.

See also Filtering / Predicate pushdown and Efficient Querying
compact – If True, ensures that the index will be unique. If there are multiple partition values per index value, these values will be compacted into a list (see Examples section).
partitions_as_index – If True, the relation between index values and partitions will be inverted for the output: partition values will be used as index and the index values will be mapped to the partitions.
predicates – A list of predicates. If a literal within the provided predicates references a column which is not part of this index, this literal is interpreted as True.
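
The DNF semantics described for predicates can be illustrated with a small plain-Python sketch. It evaluates predicates against a single row given as a dictionary; the names row_matches and OPERATORS are hypothetical, and the sketch ignores missing-value handling, categorical data and the rule for columns that are not part of the index.

```python
import operator
from typing import Any, Dict, List, Tuple

Predicate = Tuple[str, str, Any]  # (column, operator, literal)

# Operators listed in the parameter description above.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda value, candidates: value in candidates,
}


def row_matches(row: Dict[str, Any], predicates: List[List[Predicate]]) -> bool:
    """Hypothetical helper: evaluate DNF predicates for one row."""
    # Outer list: OR over conjunctions. Inner list: AND over column predicates.
    return any(
        all(OPERATORS[op](row[column], literal) for column, op, literal in conjunction)
        for conjunction in predicates
    )
```

With row = {"x": 5, "y": "a"}, the predicates [[("x", ">", 0), ("y", "==", "b")], [("x", "in", [5, 6])]] match because the second conjunction holds even though the first does not.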

Examples

>>> import pyarrow as pa
>>> from kartothek.core.index import ExplicitSecondaryIndex
>>> index1 = ExplicitSecondaryIndex(
...     column="col", index_dct={1: ["part_1", "part_2"]}, dtype=pa.int64()
... )
>>> index1.as_flat_series()
col
1    part_1
1    part_2
Name: partition, dtype: object
>>> index1.as_flat_series(compact=True)
col
1    [part_1, part_2]
Name: partition, dtype: object
>>> index1.as_flat_series(partitions_as_index=True)
partition
part_1    1
part_2    1
Name: col, dtype: int64

copy(**kwargs) → kartothek.core.index.IndexBase

eval_operator(op: str, value: ValueType) → Set[str]

Evaluates a given operator on the index for a given value and returns all partition labels allowed by this index.

Parameters

op (str) – A string representation of the operator to be evaluated. Supported are "==", "<=", ">=", "<", ">" and "in". For details, see the documentation of kartothek.serialization.
value (object) – The value to be evaluated

Returns

Allowed partition labels

Return type

set
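
As a rough illustration of what evaluating an operator on an index means, the following plain-Python sketch works on a bare index dictionary (index value to partition labels). It is not kartothek's implementation; the function name eval_operator_sketch is hypothetical, and dtype normalization of the value is omitted.

```python
import operator
from typing import Any, Dict, List, Set

# Comparison operators listed in the parameter description above.
_COMPARATORS = {
    "==": operator.eq,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def eval_operator_sketch(index_dct: Dict[Any, List[str]], op: str, value: Any) -> Set[str]:
    """Hypothetical helper: collect partition labels whose index value satisfies `op`."""
    if op == "in":
        def matches(index_value: Any) -> bool:
            return index_value in value
    elif op in _COMPARATORS:
        compare = _COMPARATORS[op]

        def matches(index_value: Any) -> bool:
            return compare(index_value, value)
    else:
        raise ValueError(f"Unsupported operator: {op!r}")

    allowed: Set[str] = set()
    for index_value, partitions in index_dct.items():
        if matches(index_value):
            allowed.update(partitions)
    return allowed
```

For the dictionary {1: ["part_1", "part_2"], 2: ["part_2", "part_3"], 5: ["part_4"]}, the operator "<=" with value 2 selects part_1, part_2 and part_3.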

property loaded: Check if the index was already loaded into memory.

static normalize_value(dtype: pyarrow.lib.DataType, value: Any) → Any

Normalize value according to index dtype.

This may apply casts (e.g. integers to floats) or parsing (e.g. timestamps from strings) to the value.

Parameters

dtype – Arrow type of the index.
value – any value

Returns

value – normalized value, with a type that matches the index dtype

Return type

Any

Raises

ValueError – If dtype of the index was not set or derived.
NotImplementedError – If the dtype cannot be handled.
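
The idea of normalizing a value to the index dtype can be sketched without pyarrow. In this simplified illustration, Python built-in types stand in for Arrow types, which is an assumption made purely for the example; the real method takes a pyarrow.lib.DataType, and the function name normalize_value_sketch is hypothetical.

```python
from datetime import datetime
from typing import Any, Optional


def normalize_value_sketch(dtype: Optional[type], value: Any) -> Any:
    """Hypothetical helper: cast or parse `value` so that it matches `dtype`."""
    if dtype is None:
        # Mirrors the documented ValueError for a missing dtype.
        raise ValueError("dtype of the index was not set or derived")
    if dtype is float:
        return float(value)  # cast, e.g. integer to float
    if dtype is int:
        return int(value)
    if dtype is str:
        return str(value)
    if dtype is datetime:
        if isinstance(value, datetime):
            return value
        return datetime.fromisoformat(value)  # parse, e.g. timestamp from string
    # Mirrors the documented NotImplementedError for unsupported types.
    raise NotImplementedError(f"Cannot handle dtype: {dtype!r}")
```

Normalizing both the stored index values and the queried value this way is what allows a lookup with 1 to find an entry stored as 1.0.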

observed_values(date_as_object=True) → numpy.ndarray: Return an array of all observed values.

query(value: ValueType) → List[str]

Query this index for a given value. Raises an exception if the index is external and not loaded.

Parameters: value – The value that is looked up in the index dictionary.
Returns: keys – A list of keys of partitions that contain the corresponding value.
Return type: List[str]
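
The interplay between loaded and query can be pictured with a minimal stand-in class. This is a sketch only: IndexSketch is a hypothetical name, the exception type and the empty-list result for unknown values are assumptions, and value normalization is omitted.

```python
from typing import Any, Dict, List, Optional


class IndexSketch:
    """Hypothetical stand-in for an index; not the kartothek class."""

    def __init__(self, column: str, index_dct: Optional[Dict[Any, List[str]]] = None):
        self.column = column
        self.index_dct = index_dct

    @property
    def loaded(self) -> bool:
        # The index counts as loaded once the dictionary is in memory.
        return self.index_dct is not None

    def query(self, value: Any) -> List[str]:
        if not self.loaded:
            # Assumption: the concrete exception type is not stated in the text above.
            raise RuntimeError("Index is not loaded")
        # Assumption: unknown values yield an empty list.
        return list(self.index_dct.get(value, []))
```

An instance created without a dictionary behaves like an external index that has not been loaded yet, so calling query on it raises.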

remove_partitions(list_of_partitions: List[str], inplace: bool = False) → kartothek.core.index.IndexBase

Removes the given partitions from the internal index dictionary.

The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.

Parameters

list_of_partitions – The partitions to be removed
inplace – If True the operation is performed inplace and will return the same object
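
The effect of removing partitions on the index dictionary can be sketched as follows. The function works on a bare dictionary and always returns a new one; the name remove_partitions_sketch is hypothetical, and dropping index values that are left without any partition is an assumption of this sketch.

```python
from typing import Any, Dict, List


def remove_partitions_sketch(
    index_dct: Dict[Any, List[str]], list_of_partitions: List[str]
) -> Dict[Any, List[str]]:
    """Hypothetical helper: return a copy of the index without the given partitions."""
    to_remove = set(list_of_partitions)
    new_index: Dict[Any, List[str]] = {}
    for value, partitions in index_dct.items():
        remaining = [label for label in partitions if label not in to_remove]
        # Assumption: values that no longer point to any partition are dropped.
        if remaining:
            new_index[value] = remaining
    return new_index
```

Because the result no longer matches what was written to storage, it makes sense that the mutated index drops its index_storage_key, as described above.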

remove_values(list_of_values: List[str], inplace: bool = False) → kartothek.core.index.IndexBase

Removes the given values from the internal index dictionary.

Parameters

list_of_values – The values to be removed
inplace – If True the operation is performed inplace and will return the same object

to_dict() → Dict[ValueType, List[str]]: Serialise the object to a Python object that can be part of a larger dictionary that may be serialised to JSON.

update(index: kartothek.core.index.IndexBase, inplace: bool = False) → kartothek.core.index.IndexBase

Returns a new Index object in case of a change.

The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.

Parameters: index (kartothek.core.index.IndexBase) – The index which should be added to this one
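
Adding one index to another amounts to merging two index dictionaries. The sketch below shows one plausible way to do this in plain Python; the name update_sketch is hypothetical, and both the de-duplication of partition labels and their resulting order are assumptions rather than documented behaviour.

```python
from typing import Any, Dict, List


def update_sketch(
    index_dct: Dict[Any, List[str]], other_dct: Dict[Any, List[str]]
) -> Dict[Any, List[str]]:
    """Hypothetical helper: merge `other_dct` into a copy of `index_dct`."""
    # Copy the lists so the original dictionary is left untouched.
    merged = {value: list(partitions) for value, partitions in index_dct.items()}
    for value, partitions in other_dct.items():
        existing = merged.setdefault(value, [])
        for label in partitions:
            # Assumption: a partition label is listed at most once per value.
            if label not in existing:
                existing.append(label)
    return merged
```

Merging {1: ["part_1"]} with {1: ["part_2"], 2: ["part_2"]} gives an index in which value 1 points to both partitions and value 2 to part_2 only.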

class kartothek.core.index.PartitionIndex(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)

Bases: kartothek.core.index.IndexBase

An Index class representing partition indices (sometimes also referred to as primary indices). A PartitionIndex is usually constructed by parsing the partition filenames which encode index information.

The constructor for this class should usually not be called explicitly; instead, indices should be created by e.g. kartothek.core.dataset.DatasetMetadataBase.load_partition_indices().