kartothek.core.index module¶
-
class
kartothek.core.index.
ExplicitSecondaryIndex
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, index_storage_key: Optional[str] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core.index.IndexBase
An Index class representing an explicit, secondary index which is calculated and stored next to the dataset. In contrast to the PartitionIndex this needs to be calculated by an explicit pass over the data. All mutations of this class will erase the reference to the physical file and the storage of the mutated object will write to a new storage key.
-
static
from_v2
(column: str, dct_or_str: Union[str, Dict[ValueType, List[str]]]) → kartothek.core.index.IndexBase[source]¶ Create an index instance from a version 2 Python structure.
Parameters: - column – Name of the column this index provides lookup for
- dct_or_str – Either the storage key of the external index or the index itself as a Python object structure.
Returns: index
Return type:
-
load
(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]])[source]¶ Load an external index into memory. Returns a new index object that contains the index dictionary. Returns itself if the index is internal or an already loaded index.
Parameters: store (Object) – Object that implements the .get method for file/object loading.
Returns: index
Return type: kartothek.core.index.ExplicitSecondaryIndex
-
store
(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], dataset_uuid: str) → str[source]¶ Store the index as a parquet file
If compatible, the new keyname will be the name stored under the attribute index_storage_key. If this attribute is None, a new key will be generated of the format
{dataset_uuid}/indices/{column}/{timestamp}.by-dataset-index.parquet where the timestamp has nanosecond accuracy and is created upon Index object initialization.
Parameters: - store –
- dataset_uuid –
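The key naming rule described above can be sketched in plain Python. This is only an illustration: `build_index_key` is a hypothetical helper and not part of kartothek, the timestamp is passed in as an already formatted string because its exact formatting is not stated here, and the "if compatible" check is reduced to "a key is already set".

```python
from typing import Optional


def build_index_key(
    dataset_uuid: str,
    column: str,
    timestamp: str,
    index_storage_key: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented key naming rule."""
    # Assumption: an existing storage key is simply reused.
    if index_storage_key is not None:
        return index_storage_key
    # Otherwise build a fresh key in the documented format.
    return f"{dataset_uuid}/indices/{column}/{timestamp}.by-dataset-index.parquet"


# The timestamp string below is an arbitrary placeholder value.
key = build_index_key("my_dataset", "col", "2020-01-01T00:00:00.000000000")
print(key)
```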
-
static
-
class
kartothek.core.index.
IndexBase
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core._mixins.CopyMixin
-
as_flat_series
(compact: bool = False, partitions_as_index: bool = False, date_as_object: bool = False, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None)[source]¶ Convert the Index object to a pandas.Series
Parameters: - compact – If True, ensures that the index will be unique. If there are multiple partition values per index value, these values will be compacted into a list (see Examples section).
- partitions_as_index – If True, the relation between index values and partitions will be reverted for the output dataframe: partition values will be used as index and the indices will be mapped to the partitions.
- predicates – A list of predicates. If a literal within the provided predicates references a column which is not part of this index, this literal is interpreted as True.
Examples
>>> import pyarrow as pa
>>> from kartothek.core.index import ExplicitSecondaryIndex
>>> index1 = ExplicitSecondaryIndex(
...     column="col", index_dct={1: ["part_1", "part_2"]}, dtype=pa.int64()
... )
>>> index1.as_flat_series()
col
1    part_1
1    part_2
Name: partition, dtype: object
>>> index1.as_flat_series(compact=True)
col
1    [part_1, part_2]
Name: partition, dtype: object
>>> index1.as_flat_series(partitions_as_index=True)
partition
part_1    1
part_2    1
Name: col, dtype: int64
-
eval_operator
(op: str, value: ValueType) → Set[str][source]¶ Evaluates a given operator on the index for a given value and returns all partition labels allowed by this index.
Parameters: Returns: set
Return type:
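As a rough illustration of what evaluating an operator on an index means, the sketch below works on a plain dictionary that maps index values to partition labels, the same shape as `index_dct`. It is not kartothek's implementation: `eval_operator_sketch` and the `OPERATORS` table are hypothetical, and the set of supported operators is an assumption.

```python
import operator
from typing import Any, Callable, Dict, List, Set

# Assumed operator set for this sketch; the operators kartothek accepts may differ.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def eval_operator_sketch(
    index_dct: Dict[Any, List[str]], op: str, value: Any
) -> Set[str]:
    """Collect the labels of all partitions whose index value satisfies ``op``."""
    if op not in OPERATORS:
        raise ValueError(f"Unsupported operator: {op!r}")
    compare = OPERATORS[op]
    result: Set[str] = set()
    for index_value, partitions in index_dct.items():
        # Keep every partition that holds at least one matching index value.
        if compare(index_value, value):
            result.update(partitions)
    return result


index_dct = {1: ["part_1", "part_2"], 2: ["part_2"], 3: ["part_3"]}
allowed = eval_operator_sketch(index_dct, ">=", 2)
```

A partition appears in the result as soon as one of its index values matches, so the returned set is the set of partitions that still have to be read.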
-
loaded
¶ Check if the index was already loaded into memory.
-
static
normalize_value
(dtype: pyarrow.lib.DataType, value: Any) → Any[source]¶ Normalize value according to index dtype.
This may apply casts (e.g. integers to floats) or parsing (e.g. timestamps from strings) to the value.
Parameters: - dtype (pyarrow.Type) – Arrow type of the index.
- value (Any) – any value
Returns: value – normalized value, with a type that matches the index dtype
Return type: Any
Raises: - ValueError – If the dtype of the index was not set or derived.
- NotImplementedError – If the dtype cannot be handled.
-
query
(value: ValueType) → List[str][source]¶ Query this index for a given value. Raises an exception if the index is external and not loaded.
Parameters: value – The value that is looked up in the index dictionary.
Returns: keys – A list of keys of partitions that contain the corresponding value.
-
remove_partitions
(list_of_partitions: List[str], inplace=False) → kartothek.core.index.IndexBase[source]¶ Removes a partition from the internal index dictionary
The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.
Parameters: - list_of_partitions (List[str]) – The partitions to be removed
- inplace (bool, (default: False)) – If True the operation is performed inplace and will return the same object
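The effect of removing partitions can be sketched on a plain value-to-partitions dictionary. This is an illustration only: `remove_partitions_sketch` is a hypothetical function, and dropping values that are left without any partition is an assumption rather than documented behaviour. The sketch always returns a new mapping, which corresponds to inplace=False.

```python
from typing import Any, Dict, Iterable, List


def remove_partitions_sketch(
    index_dct: Dict[Any, List[str]], partitions: Iterable[str]
) -> Dict[Any, List[str]]:
    """Return a new mapping without the given partition labels."""
    to_remove = set(partitions)
    new_dct: Dict[Any, List[str]] = {}
    for value, labels in index_dct.items():
        kept = [label for label in labels if label not in to_remove]
        # Assumption: values left without any partition are dropped.
        if kept:
            new_dct[value] = kept
    return new_dct


original = {1: ["part_1", "part_2"], 2: ["part_2"]}
reduced = remove_partitions_sketch(original, ["part_2"])
```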
-
remove_values
(list_of_values: List[str], inplace: bool = False) → kartothek.core.index.IndexBase[source]¶ Removes a value from the internal index dictionary
Parameters: - list_of_values (list) – The values to be removed
- inplace (bool, (default: False)) – If True the operation is performed inplace and will return the same object
-
to_dict
() → Dict[ValueType, List[str]][source]¶ Serialise the object to a Python object that can be part of a larger dictionary that may be serialised to JSON.
-
update
(index: kartothek.core.index.IndexBase, inplace: bool = False) → kartothek.core.index.IndexBase[source]¶ Returns a new Index object in case of a change.
The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.
Parameters: index ([kartothek.core.index.IndexBase]) – The index which should be added to this one
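One way to picture an update is as a merge of two value-to-partitions dictionaries. The sketch below is not taken from kartothek: `update_sketch` is a hypothetical function, and combining the partition lists per value without duplicates is an assumption about the merge semantics.

```python
from typing import Any, Dict, List


def update_sketch(
    first: Dict[Any, List[str]], second: Dict[Any, List[str]]
) -> Dict[Any, List[str]]:
    """Merge two value-to-partitions mappings into a new mapping."""
    # Copy the lists so that neither input is mutated.
    merged = {value: list(labels) for value, labels in first.items()}
    for value, labels in second.items():
        existing = merged.setdefault(value, [])
        for label in labels:
            # Assumption: each partition label is listed at most once per value.
            if label not in existing:
                existing.append(label)
    return merged


first = {1: ["part_1"]}
second = {1: ["part_2"], 2: ["part_2"]}
merged = update_sketch(first, second)
```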
-
-
class
kartothek.core.index.
PartitionIndex
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: pyarrow.lib.DataType = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core.index.IndexBase
An Index class representing partition indices (sometimes also referred to as primary indices). A PartitionIndex is usually constructed by parsing the partition filenames which encode index information.
The constructor for this class should usually not be called explicitly but indices should be created by e.g.
kartothek.core.dataset.DatasetMetadataBase.load_partition_indices()
-
kartothek.core.index.
filter_indices
(index_dict: Dict[str, kartothek.core.index.IndexBase], partitions: Iterable[str])[source]¶ Filter a kartothek index dictionary such that only the provided list of partitions is included in the index dictionary
All indices must be embedded!
Parameters: - index_dict – A dictionary holding kartothek indices
- partitions – An iterable of partition labels which are allowed in the output dictionary
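The filtering described above can be sketched with plain dictionaries standing in for the index objects. This is a simplified model, not the library code: `filter_indices_sketch` is hypothetical, each index is represented only by its value-to-partitions mapping, and dropping values that end up with no partition is an assumption.

```python
from typing import Any, Dict, Iterable, List

IndexDct = Dict[Any, List[str]]


def filter_indices_sketch(
    index_dict: Dict[str, IndexDct], partitions: Iterable[str]
) -> Dict[str, IndexDct]:
    """Keep only the given partition labels in every index of the dictionary."""
    allowed = set(partitions)
    filtered: Dict[str, IndexDct] = {}
    for column, index_dct in index_dict.items():
        new_dct: IndexDct = {}
        for value, labels in index_dct.items():
            kept = [label for label in labels if label in allowed]
            # Assumption: values without any remaining partition are dropped.
            if kept:
                new_dct[value] = kept
        filtered[column] = new_dct
    return filtered


indices = {
    "country": {"DE": ["part_1"], "US": ["part_2", "part_3"]},
    "year": {2019: ["part_1", "part_2"], 2020: ["part_3"]},
}
filtered = filter_indices_sketch(indices, ["part_2"])
```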
-
kartothek.core.index.
merge_indices
(list_of_indices: List[Dict[str, kartothek.core.index.IndexBase]]) → Dict[str, kartothek.core.index.IndexBase][source]¶ Merge a list of index dictionaries
Parameters: list_of_indices (list of tuple) – A list of tuples holding index information
Format: [ (partition_label, index_dict) ]
-
kartothek.core.index.
remove_partitions_from_indices
(index_dict: Dict[str, kartothek.core.index.IndexBase], partitions: List[str])[source]¶ Remove a given list of partitions from a kartothek index dictionary
Parameters: - index_dict (dict of Index) – A dictionary holding kartothek indices
- partitions (list) – A list of partition labels which should be removed from the index objects