kartothek.core.dataset module
-
class
kartothek.core.dataset.
DatasetMetadata
(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)
Bases: kartothek.core.dataset.DatasetMetadataBase
Container holding all metadata of the dataset.
-
static
from_dict
(dct: Dict, explicit_partitions: bool = True)
Load dataset metadata from a dictionary.
This must have no external references. Otherwise use
load_from_dict
to have them resolved automatically.
-
static
load_from_buffer
(buf, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], format: str = 'json') → kartothek.core.dataset.DatasetMetadata
Load a dataset from a (string) buffer.
- Parameters
buf – Input to be parsed.
store – Object that implements the .get method for file/object loading.
- Returns
Parsed metadata.
- Return type
kartothek.core.dataset.DatasetMetadata
-
static
load_from_dict
(dct: Dict, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True) → kartothek.core.dataset.DatasetMetadata
Load dataset metadata from a dictionary and resolve any external includes.
- Parameters
dct – Dictionary containing the dataset metadata.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
-
static
load_from_store
(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True, load_all_indices: bool = False) → kartothek.core.dataset.DatasetMetadata
Load a dataset from a store.
- Parameters
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
load_schema – Load table schema
load_all_indices – Load all registered indices into memory.
- Returns
dataset_metadata – Parsed metadata.
- Return type
kartothek.core.dataset.DatasetMetadata
-
static
-
class
kartothek.core.dataset.
DatasetMetadataBase
(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)
Bases: kartothek.core._mixins.CopyMixin
-
static
exists
(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → bool
Check whether a dataset exists in a store.
- Parameters
uuid – UUID of the dataset.
store – Object that implements the .get method for file/object loading.
-
get_indices_as_dataframe
(columns: Optional[List[str]] = None, date_as_object: bool = True, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None)
Convert the dataset indices to a pandas DataFrame and filter the relevant indices by predicates.
For a dataset with indices on columns column_a and column_b and three partitions, the output may look like:
            column_a  column_b
    part_1  1         A
    part_2  2         B
    part_3  3         None
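To illustrate how such a table relates to the indices, the following sketch inverts per-column index mappings (value to partition labels) into one row per partition. The `indices` dictionary and the `indices_to_rows` helper are hypothetical and use plain Python rather than pandas; this is not kartothek's implementation, only a minimal model of the output shape.

```python
from typing import Any, Dict, List, Optional


def indices_to_rows(
    indices: Dict[str, Dict[Any, List[str]]],
    partitions: List[str],
) -> Dict[str, Dict[str, Optional[Any]]]:
    """Invert {column: {value: [partition, ...]}} into {partition: {column: value}}."""
    # Start with every column set to None so missing index entries stay visible.
    rows = {label: {column: None for column in indices} for label in partitions}
    for column, value_to_partitions in indices.items():
        for value, labels in value_to_partitions.items():
            for label in labels:
                rows[label][column] = value
    return rows


# Hypothetical index content matching the table above.
indices = {
    "column_a": {1: ["part_1"], 2: ["part_2"], 3: ["part_3"]},
    "column_b": {"A": ["part_1"], "B": ["part_2"]},
}
rows = indices_to_rows(indices, ["part_1", "part_2", "part_3"])
for label, row in rows.items():
    print(label, row)
```

The sketch keeps a single value per partition and column; a partition that holds several values for one indexed column would need one row per value.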
- Parameters
columns (Optional[List[Dict[str]]]) – A dictionary mapping tables to lists of columns. Only the specified columns are loaded for the corresponding table. If a specified table or column is not present in the dataset, a ValueError is raised.
predicates (List[List[Tuple[str, str, Any]]]) –
Optional list of predicates, like [[('x', '>', 0), …], …], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.
Predicates are expressed in disjunctive normal form (DNF). The innermost tuple describes a single column predicate. The predicates in one inner list are combined with a conjunction (AND) into a larger predicate. The outermost list then combines these larger predicates with a disjunction (OR). This form can express any predicate that is possible using boolean logic.
Available operators are: ==, !=, <=, >=, <, > and in.
Filtering for missing values is supported with the operators ==, != and in, using the values np.nan and None for float and string columns respectively.
Categorical data
When order-sensitive operators are used on categorical data, the categories are assumed to obey a lexicographical ordering. This filtering may be slower than the evaluation on non-categorical data.
See also Filtering / Predicate pushdown and Efficient Querying
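The DNF structure described above can be sketched in plain Python. The `matches` helper and the example rows are hypothetical and only model the AND/OR nesting for a single row; the sketch does not implement predicate pushdown, the special handling of np.nan and None, or the categorical ordering rules.

```python
import operator
from typing import Any, Dict, List, Tuple

Predicate = Tuple[str, str, Any]

# Map the documented operator strings to Python comparison functions.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda value, candidates: value in candidates,
}


def matches(row: Dict[str, Any], predicates: List[List[Predicate]]) -> bool:
    """Return True if the row satisfies at least one inner conjunction."""
    return any(
        all(OPERATORS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in predicates
    )


# (x > 0 AND y == "a") OR (x in [-1, -2])
predicates = [[("x", ">", 0), ("y", "==", "a")], [("x", "in", [-1, -2])]]

print(matches({"x": 5, "y": "a"}, predicates))   # first conjunction holds
print(matches({"x": 5, "y": "b"}, predicates))   # neither conjunction holds
print(matches({"x": -1, "y": "b"}, predicates))  # second conjunction holds
```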
-
property
index_columns
-
load_all_indices
(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_partition_indices: bool = True) → T
Load all registered indices into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters
store – Object that implements the .get method for file/object loading.
load_partition_indices – Flag if filename indices should be loaded. Default is True.
- Returns
dataset_metadata – Mutated metadata object with the loaded indices.
- Return type
-
load_index
(column: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → T
Load an index into memory.
Note: External indices need to be preloaded before they can be queried.
- Parameters
column – Name of the column for which the index should be loaded.
store – Object that implements the .get method for file/object loading.
- Returns
dataset_metadata – Mutated metadata object with the loaded index.
- Return type
-
load_partition_indices
() → T
Load all filename-encoded indices into RAM. Filename-encoded indices can be extracted from datasets whose partitions are stored in a format like
`dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet`
Which results in an in-memory index holding the information
{ "IndexCol": { IndexValue: ["partition_label"] }, "SecondIndexCol": { Value: ["partition_label"] } }
Deprecated since version 5.3: This will be removed in 6.0. The load_partition_indices keyword is deprecated and will be removed.
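As a rough illustration of the filename encoding described above, the sketch below derives such an in-memory index from a list of storage keys. The helper `partition_indices_from_keys` and the example keys are hypothetical; the sketch assumes the exact path layout shown above, keeps all index values as strings, and does not decode escaped characters, so it should not be read as kartothek's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def partition_indices_from_keys(keys: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build {column: {value: [partition_label, ...]}} from key paths."""
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for key in keys:
        parts = key.split("/")
        # parts[0] is the dataset UUID, parts[1] the table, parts[-1] the file name.
        label = parts[-1].rsplit(".", 1)[0]
        for segment in parts[2:-1]:
            column, _, value = segment.partition("=")
            index[column][value].append(label)
    # Convert the nested defaultdicts into plain dictionaries.
    return {column: dict(values) for column, values in index.items()}


# Hypothetical keys following the layout shown above.
keys = [
    "my_uuid/table/country=DE/year=2020/part_a.parquet",
    "my_uuid/table/country=DE/year=2021/part_b.parquet",
]
partition_index = partition_indices_from_keys(keys)
print(partition_index)
```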
-
property
primary_indices_loaded
-
query
(indices: Optional[List[kartothek.core.index.IndexBase]] = None, **kwargs) → List[str]
Query the dataset for partitions that contain specific values. The lookup uses the embedded indices and any loaded external indices. Additional indices must refer to the same partitions that the dataset contains; otherwise an empty list is returned, because the query method only restricts the set of partition keys using the indices.
- Parameters
indices – List of optional additional indices.
**kwargs – Map of columns and values.
- Returns
List of keys of partitions that contain the queried values in the respective columns.
- Return type
List[str]
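The restriction behaviour described for query can be modelled with set intersections. The `query_partitions` helper below is a hypothetical sketch that represents each index as a plain dictionary from value to partition labels instead of an IndexBase object; it shows why an index that refers to foreign partitions yields an empty result.

```python
from typing import Any, Dict, List, Set


def query_partitions(
    partitions: List[str],
    indices: Dict[str, Dict[Any, List[str]]],
    **kwargs: Any,
) -> List[str]:
    """Restrict the dataset's partition keys using the available indices."""
    candidates: Set[str] = set(partitions)
    for column, value in kwargs.items():
        # Columns without an index cannot restrict the candidate set.
        if column in indices:
            candidates &= set(indices[column].get(value, []))
    return sorted(candidates)


partitions = ["part_1", "part_2", "part_3"]
indices = {
    "column_a": {1: ["part_1", "part_2"], 2: ["part_3"]},
    "column_b": {"A": ["part_1"], "B": ["part_2", "part_3"]},
}
print(query_partitions(partitions, indices, column_a=1, column_b="B"))

# An index built on partitions the dataset does not contain.
foreign = {"column_c": {"x": ["other_1"]}}
print(query_partitions(partitions, foreign, column_c="x"))
```

Because the candidate set starts from the dataset's own partition keys, an index can only remove keys and never add any.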
-
property
schema
-
property
secondary_indices
-
static
storage_keys
(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → List[str]
Retrieve all keys that belong to the given dataset.
- Parameters
uuid – UUID of the dataset.
store – Object that implements the .iter_keys method for key retrieval.
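A minimal sketch of key retrieval through an iter_keys method follows. The `InMemoryStore` class, the `dataset_keys` helper, and the example keys are hypothetical, and the sketch assumes that a dataset's keys begin with the UUID followed by "." or "/"; that layout is an assumption made for illustration, not something stated in this reference.

```python
from typing import Dict, Iterator, List


class InMemoryStore:
    """Hypothetical stand-in for a key-value store exposing iter_keys."""

    def __init__(self, data: Dict[str, bytes]) -> None:
        self._data = data

    def iter_keys(self, prefix: str = "") -> Iterator[str]:
        return (key for key in self._data if key.startswith(prefix))


def dataset_keys(uuid: str, store: InMemoryStore) -> List[str]:
    # Assumption: keys of a dataset start with "<uuid>." or "<uuid>/".
    markers = (uuid + ".", uuid + "/")
    return sorted(key for key in store.iter_keys(uuid) if key.startswith(markers))


store = InMemoryStore(
    {
        "sales.by-dataset-metadata.json": b"{}",
        "sales/table/part_1.parquet": b"",
        "sales_2020/table/part_1.parquet": b"",
    }
)
found = dataset_keys("sales", store)
print(found)
```

Checking for the separator after the UUID keeps keys of a dataset named sales_2020 out of the result for sales, which a plain prefix match would include.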
-
property
table_meta
-
property
tables
-
static