kartothek.core.index module¶
-
class
kartothek.core.index.
ExplicitSecondaryIndex
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, index_storage_key: Optional[str] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core.index.IndexBase
An Index class representing an explicit, secondary index which is calculated and stored next to the dataset. In contrast to the PartitionIndex, this index needs to be calculated by an explicit pass over the data. Every mutation of an instance erases the reference to the physical file, so storing the mutated object writes to a new storage key.
-
copy
(**kwargs) → kartothek.core.index.ExplicitSecondaryIndex[source]¶
-
static
from_v2
(column: str, dct_or_str: Union[str, Dict[ValueType, List[str]]]) → kartothek.core.index.IndexBase[source]¶ Create an index instance from a version 2 Python structure.
- Parameters
column – Name of the column this index provides lookup for
dct_or_str – Either the storage key of the external index or the index itself as a Python object structure.
- Returns
index
- Return type
-
load
(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]])[source]¶ Load an external index into memory. Returns a new index object that contains the index dictionary. Returns itself if the index is internal or an already loaded index.
- Parameters
store – Object that implements the .get method for file/object loading.
- Returns
index
- Return type
-
store
(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], dataset_uuid: str) → str[source]¶ Store the index as a parquet file.
If compatible, the new key name is the one stored in the attribute index_storage_key. If this attribute is None, a new key is generated in the format
{dataset_uuid}/indices/{column}/{timestamp}.by-dataset-index.parquet
where the timestamp has nanosecond precision and is created when the Index object is initialized.
- Parameters
store –
dataset_uuid –
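The key template above can be illustrated with a small sketch. The helper name build_index_storage_key is hypothetical and not part of kartothek, and rendering the timestamp as an integer nanosecond count is an assumption; the library may format the timestamp differently.

```python
import time


def build_index_storage_key(dataset_uuid: str, column: str, timestamp_ns: int) -> str:
    """Hypothetical helper: fill in the documented key template.

    Rendering the timestamp as an integer number of nanoseconds is an
    assumption made for this sketch only.
    """
    return f"{dataset_uuid}/indices/{column}/{timestamp_ns}.by-dataset-index.parquet"


# Take the timestamp once, mirroring "created when the Index object is initialized".
created_at_ns = time.time_ns()
key = build_index_storage_key("my_dataset", "col", created_at_ns)
print(key)
```

Because the timestamp is fixed at initialization rather than at write time, storing the same unmodified object twice would, under this reading, target the same key.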
-
unload
() → kartothek.core.index.IndexBase[source]¶ Drop index data to save memory.
-
-
class
kartothek.core.index.
IndexBase
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core._mixins.CopyMixin
Initialize an IndexBase.
- Parameters
column – Name of the column this index is for.
index_dct – Mapping from index values to partition labels
dtype – Type of index. If left out and index_dct is present, this will be inferred.
normalize_dtype – Normalize type information and values within index_dct. The user may disable this when the index was already normalized, e.g. when the index Python object gets copied, or when the index data is restored from a parquet file that was written by a trusted write path.
-
as_flat_series
(compact: bool = False, partitions_as_index: bool = False, date_as_object: bool = False, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None) → pandas.core.series.Series[source]¶ Convert the Index object to a pandas.Series
- Parameters
predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[(‘x’, ‘>’, 0), …], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.
Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The outermost list then combines all of these larger predicates with a disjunction (OR). This form can express any predicate that is possible using boolean logic.
Available operators are: ==, !=, <=, >=, <, > and in.
Filtering for missing values is supported with the operators ==, != and in, using the values np.nan and None for float and string columns respectively.
Categorical data
When using order sensitive operators on categorical data we will assume that the categories obey a lexicographical ordering. This filtering may result in less than optimal performance and may be slower than the evaluation on non-categorical data.
See also Filtering / Predicate pushdown and Efficient Querying
compact – If True, ensures that the index will be unique. If there are multiple partition values per index value, these values will be compacted into a list (see Examples section).
partitions_as_index – If True, the relation between index values and partitions will be inverted for the output series: partition values will be used as index and the index values will be mapped to the partitions.
predicates – A list of predicates. If a literal within the provided predicates references a column which is not part of this index, this literal is interpreted as True.
Examples
>>> import pyarrow as pa
>>> from kartothek.core.index import ExplicitSecondaryIndex
>>> index1 = ExplicitSecondaryIndex(
...     column="col", index_dct={1: ["part_1", "part_2"]}, dtype=pa.int64()
... )
>>> index1.as_flat_series()
col
1    part_1
1    part_2
Name: partition, dtype: object
>>> index1.as_flat_series(compact=True)
col
1    [part_1, part_2]
Name: partition, dtype: object
>>> index1.as_flat_series(partitions_as_index=True)
partition
part_1    1
part_2    1
Name: col, dtype: int64
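The DNF semantics described for predicates can be sketched in plain Python on a dictionary shaped like index_dct. This is an illustration only, not kartothek's implementation: the names evaluate_dnf and partitions_for_literal are hypothetical, and the sketch works at the level of candidate partition labels. A literal that references another column is treated as True, as stated above, so it keeps every partition.

```python
import operator
from typing import Any, Dict, List, Set, Tuple

Literal = Tuple[str, str, Any]

# Operators listed in the documentation, mapped to Python callables.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, candidates: value in candidates,
}


def partitions_for_literal(
    column: str,
    index_dct: Dict[Any, List[str]],
    literal: Literal,
    all_partitions: Set[str],
) -> Set[str]:
    """Return the partition labels that may satisfy a single literal."""
    literal_column, op, operand = literal
    if literal_column != column:
        # A literal on a column this index does not cover counts as True.
        return set(all_partitions)
    matching: Set[str] = set()
    for value, labels in index_dct.items():
        if OPS[op](value, operand):
            matching.update(labels)
    return matching


def evaluate_dnf(
    column: str,
    index_dct: Dict[Any, List[str]],
    predicates: List[List[Literal]],
) -> Set[str]:
    """Combine literals with AND (inner lists) and OR (outer list)."""
    all_partitions = {label for labels in index_dct.values() for label in labels}
    selected: Set[str] = set()
    for conjunction in predicates:
        allowed = set(all_partitions)
        for literal in conjunction:
            allowed &= partitions_for_literal(column, index_dct, literal, all_partitions)
        selected |= allowed
    return selected
```

For an index {1: ["p1", "p2"], 2: ["p2"], 5: ["p3"]} on column "x", the predicates [[("x", ">", 1), ("x", "<", 5)], [("x", "==", 5)]] select the labels p2 and p3 in this sketch.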
-
copy
(**kwargs) → kartothek.core.index.IndexBase[source]¶
-
eval_operator
(op: str, value: ValueType) → Set[str][source]¶ Evaluates a given operator on the index for a given value and returns all partition labels allowed by this index.
-
property
loaded
¶ Check if the index was already loaded into memory.
-
static
normalize_value
(dtype: pyarrow.lib.DataType, value: Any) → Any[source]¶ Normalize value according to index dtype.
This may apply casts (e.g. integers to floats) or parsing (e.g. timestamps from strings) to the value.
- Parameters
dtype – Arrow type of the index.
value – any value
- Returns
value – normalized value, with a type that matches the index dtype
- Return type
Any
- Raises
ValueError – If dtype of the index was not set or derived.
NotImplementedError – If the dtype cannot be handled.
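The documented behaviour (casting, parsing, and the two error cases) can be sketched without pyarrow. The function normalize_value_sketch is hypothetical, and the plain string tags standing in for Arrow types are an assumption made to keep the example self-contained; the real method takes a pyarrow.lib.DataType.

```python
from datetime import datetime
from typing import Any, Optional


def normalize_value_sketch(dtype: Optional[str], value: Any) -> Any:
    """Hypothetical stand-in: dtype is a string tag, not an Arrow type."""
    if dtype is None:
        # Mirrors the documented ValueError for an unset dtype.
        raise ValueError("dtype of the index was not set or derived")
    if dtype == "int64":
        return int(value)
    if dtype == "float64":
        # Cast, e.g. integers to floats.
        return float(value)
    if dtype == "timestamp":
        # Parse, e.g. timestamps from ISO 8601 strings.
        if isinstance(value, datetime):
            return value
        return datetime.fromisoformat(value)
    # Mirrors the documented NotImplementedError for unsupported dtypes.
    raise NotImplementedError(f"dtype {dtype!r} cannot be handled")
```

Normalizing both the stored keys and the looked-up value to one type is what lets a query for the integer 1 match a key stored as the float 1.0.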
-
observed_values
(date_as_object=True) → numpy.ndarray[source]¶ Return an array of all observed values
-
query
(value: ValueType) → List[str][source]¶ Query this index for a given value. Raises an exception if the index is external and not loaded.
- Parameters
value – The value that is looked up in the index dictionary.
- Returns
keys – A list of keys of partitions that contain the corresponding value.
- Return type
List[str]
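A minimal sketch of the lookup contract described for query follows. IndexSketch is a hypothetical stand-in class, not kartothek's. Two details are assumptions: the exception type raised for an unloaded index (RuntimeError here) and the empty list returned for a value that is absent from the index.

```python
from typing import Any, Dict, List, Optional


class IndexSketch:
    """Hypothetical stand-in for an index object."""

    def __init__(self, column: str, index_dct: Optional[Dict[Any, List[str]]] = None):
        self.column = column
        # None models an external index whose data has not been loaded yet.
        self.index_dct = index_dct

    @property
    def loaded(self) -> bool:
        return self.index_dct is not None

    def query(self, value: Any) -> List[str]:
        if not self.loaded:
            # Assumed exception type; the documentation only says "an exception".
            raise RuntimeError("Index is external and not loaded")
        # Assumed behaviour: unknown values map to no partitions.
        return list(self.index_dct.get(value, []))
```

With this sketch, IndexSketch("col", {1: ["part_1", "part_2"]}).query(1) returns both labels, while calling query on IndexSketch("col") raises.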
-
remove_partitions
(list_of_partitions: List[str], inplace: bool = False) → kartothek.core.index.IndexBase[source]¶ Removes the given partitions from the internal index dictionary.
The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.
- Parameters
list_of_partitions – The partitions to be removed
inplace – If True, the operation is performed in place and returns the same object
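The effect of removing partitions on the index dictionary can be sketched as follows. The function remove_partitions_sketch is hypothetical and models only the non-inplace case; dropping index values that no longer point at any partition is an assumption of this sketch.

```python
from typing import Any, Dict, List


def remove_partitions_sketch(
    index_dct: Dict[Any, List[str]], list_of_partitions: List[str]
) -> Dict[Any, List[str]]:
    """Return a new dictionary without the given partition labels."""
    to_remove = set(list_of_partitions)
    result: Dict[Any, List[str]] = {}
    for value, labels in index_dct.items():
        kept = [label for label in labels if label not in to_remove]
        if kept:
            # Assumption: values left without any partition are dropped.
            result[value] = kept
    return result
```

Building a new dictionary instead of mutating the input matches the default inplace=False behaviour, where the original index object stays usable.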
-
remove_values
(list_of_values: List[str], inplace: bool = False) → kartothek.core.index.IndexBase[source]¶ Removes the given values from the internal index dictionary.
- Parameters
list_of_values – The values to be removed
inplace – If True, the operation is performed in place and returns the same object
-
to_dict
() → Dict[ValueType, List[str]][source]¶ Serialise the object to a Python object that can be part of a larger dictionary that may be serialised to JSON.
-
update
(index: kartothek.core.index.IndexBase, inplace: bool = False) → kartothek.core.index.IndexBase[source]¶ Returns a new Index object in case of a change.
The new index object will no longer carry the attribute index_storage_key since it is no longer a proper representation of the stored index object.
- Parameters
index ([kartothek.core.index.IndexBase]) – The index which should be added to this one
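Adding one index to another amounts to merging two value-to-partition mappings. The sketch below shows one plausible merge on plain dictionaries; merge_index_dicts is a hypothetical name, and skipping partition labels that are already present is an assumption rather than documented behaviour.

```python
from typing import Any, Dict, List


def merge_index_dicts(
    base: Dict[Any, List[str]], other: Dict[Any, List[str]]
) -> Dict[Any, List[str]]:
    """Return a new mapping containing the entries of both inputs."""
    # Copy the lists so that the inputs are left untouched.
    merged = {value: list(labels) for value, labels in base.items()}
    for value, labels in other.items():
        existing = merged.setdefault(value, [])
        for label in labels:
            # Assumption: a label is listed at most once per value.
            if label not in existing:
                existing.append(label)
    return merged
```

The merged mapping no longer corresponds to any stored file, which is consistent with the note above that the resulting index drops index_storage_key.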
-
class
kartothek.core.index.
PartitionIndex
(column: str, index_dct: Optional[Dict[ValueType, List[str]]] = None, dtype: Optional[pyarrow.lib.DataType] = None, normalize_dtype: bool = True)[source]¶ Bases:
kartothek.core.index.IndexBase
An Index class representing partition indices (sometimes also referred to as primary indices). A PartitionIndex is usually constructed by parsing the partition filenames which encode index information.
The constructor for this class should usually not be called explicitly but indices should be created by e.g.
kartothek.core.dataset.DatasetMetadataBase.load_partition_indices()
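To illustrate what "parsing the partition filenames" can mean, the sketch below derives an index dictionary from storage keys whose directory names encode column=value pairs. The key layout, the function partition_index_from_keys, and the use of the file stem as partition label are all assumptions made for this example; values stay strings because no dtype normalization is applied.

```python
from collections import defaultdict
from typing import Dict, List


def partition_index_from_keys(column: str, storage_keys: List[str]) -> Dict[str, List[str]]:
    """Collect partition labels per value of `column` from key names."""
    index_dct = defaultdict(list)
    for key in storage_keys:
        *directories, filename = key.split("/")
        # Assumed label: the file name without its extension.
        label = filename.rsplit(".", 1)[0]
        for part in directories:
            name, separator, value = part.partition("=")
            if separator and name == column:
                index_dct[value].append(label)
    return dict(index_dct)


keys = [
    "ds/table/country=DE/part_1.parquet",
    "ds/table/country=DE/part_2.parquet",
    "ds/table/country=US/part_3.parquet",
]
country_index = partition_index_from_keys("country", keys)
```

Because all information comes from the key names, no pass over the data is needed, which is the contrast with ExplicitSecondaryIndex drawn earlier in this module.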