kartothek.core.dataset module

class kartothek.core.dataset.DatasetMetadata(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)

Bases: kartothek.core.dataset.DatasetMetadataBase

Container holding all metadata of the dataset.

static from_buffer(buf: str, format: str = 'json', explicit_partitions: bool = True)

static from_dict(dct: Dict, explicit_partitions: bool = True)

Load dataset metadata from a dictionary.

This must have no external references. Otherwise use load_from_dict to have them resolved automatically.

static load_from_buffer(buf, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], format: str = 'json') → kartothek.core.dataset.DatasetMetadata

Load a dataset from a (string) buffer.

Parameters
  • buf – Input to be parsed.

  • store – Object that implements the .get method for file/object loading.

Returns

Parsed metadata.

Return type

DatasetMetadata

static load_from_dict(dct: Dict, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True) → kartothek.core.dataset.DatasetMetadata

Load dataset metadata from a dictionary and resolve any external includes.

Parameters
  • dct

  • store – Object that implements the .get method for file/object loading.

  • load_schema – Load table schema

static load_from_store(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True, load_all_indices: bool = False) → kartothek.core.dataset.DatasetMetadata

Load a dataset from storage.

Parameters
  • uuid – UUID of the dataset.

  • store – Object that implements the .get method for file/object loading.

  • load_schema – Load table schema

  • load_all_indices – Load all registered indices into memory.

Returns

dataset_metadata – Parsed metadata.

Return type

DatasetMetadata

class kartothek.core.dataset.DatasetMetadataBase(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)

Bases: kartothek.core._mixins.CopyMixin

static exists(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → bool

Check whether a dataset exists in storage.

Parameters
  • uuid – UUID of the dataset.

  • store – Object that implements the .get method for file/object loading.

get_indices_as_dataframe(columns: Optional[List[str]] = None, date_as_object: bool = True, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None)

Convert the dataset indices to a pandas DataFrame and filter the relevant indices by predicates.

For a dataset with indices on columns column_a and column_b and three partitions, the output may look like

        column_a column_b
part_1         1        A
part_2         2        B
part_3         3     None
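
The table above can be read as an inversion of the indices: an index maps each value to the partitions that contain it, while the output has one row per partition. The following is a minimal plain-Python sketch of that inversion, not kartothek's implementation. It assumes each index is a plain dict from value to partition labels and that a partition holds at most one value per indexed column; the helper name `indices_to_rows` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def indices_to_rows(
    indices: Dict[str, Dict[Any, List[str]]],
    partitions: List[str],
) -> Dict[str, Dict[str, Optional[Any]]]:
    """Invert value -> partitions indices into partition -> {column: value}.

    Hypothetical helper for illustration only. Partitions that do not
    appear in an index keep None for that column.
    """
    rows: Dict[str, Dict[str, Optional[Any]]] = {
        label: {column: None for column in indices} for label in partitions
    }
    for column, index in indices.items():
        for value, labels in index.items():
            for label in labels:
                if label in rows:
                    rows[label][column] = value
    return rows


rows = indices_to_rows(
    {
        "column_a": {1: ["part_1"], 2: ["part_2"], 3: ["part_3"]},
        "column_b": {"A": ["part_1"], "B": ["part_2"]},
    },
    partitions=["part_1", "part_2", "part_3"],
)
```

Here `rows["part_3"]` has no entry in the `column_b` index, so it keeps `None` for that column, matching the last row of the table.
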
Parameters
  • columns (Optional[List[Dict[str]]]) – A dictionary mapping tables to lists of columns. Only the specified columns are loaded for the corresponding table. If a specified table or column is not present in the dataset, a ValueError is raised.

  • predicates (List[List[Tuple[str, str, Any]]]) –

    Optional list of predicates, like [[('x', '>', 0), …], …], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

    Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are all combined with a conjunction (AND) into a larger predicate. The outermost list then combines all of these larger predicates with a disjunction (OR). This makes it possible to express any predicate that can be written in boolean logic.

    Available operators are: ==, !=, <=, >=, <, > and in.

    Filtering for missing values is supported with the operators ==, != and in, using the values np.nan for float columns and None for string columns.

    Categorical data

    When using order sensitive operators on categorical data we will assume that the categories obey a lexicographical ordering. This filtering may result in less than optimal performance and may be slower than the evaluation on non-categorical data.

    See also Filtering / Predicate pushdown and Efficient Querying
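
The DNF structure described above can be sketched in plain Python. This only illustrates the semantics (AND within each inner list, OR across inner lists); it is not kartothek's evaluation code, and the names `OPERATORS` and `matches` are hypothetical. The sketch ignores the special handling of missing values and categorical data.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Predicate = Tuple[str, str, Any]

# Map each operator string to a two-argument comparison function.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda value, candidates: value in candidates,
}


def matches(row: Dict[str, Any], predicates: List[List[Predicate]]) -> bool:
    """Return True if `row` satisfies at least one conjunction (inner list)."""
    return any(
        all(OPERATORS[op](row[column], value) for column, op, value in conjunction)
        for conjunction in predicates
    )


# (x > 0 AND y == "a") OR (x in [-1, -2])
predicates = [[("x", ">", 0), ("y", "==", "a")], [("x", "in", [-1, -2])]]

rows = [
    {"x": 1, "y": "a"},   # satisfies the first conjunction
    {"x": 1, "y": "b"},   # satisfies neither
    {"x": -1, "y": "b"},  # satisfies the second conjunction
]
selected = [row for row in rows if matches(row, predicates)]
```
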

property index_columns

load_all_indices(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_partition_indices: bool = True) → T

Load all registered indices into memory.

Note: External indices need to be preloaded before they can be queried.

Parameters
  • store – Object that implements the .get method for file/object loading.

  • load_partition_indices – Flag if filename indices should be loaded. Default is True.

Returns

dataset_metadata – Mutated metadata object with the loaded indices.

Return type

DatasetMetadata

load_index(column: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → T

Load an index into memory.

Note: External indices need to be preloaded before they can be queried.

Parameters
  • column – Name of the column for which the index should be loaded.

  • store – Object that implements the .get method for file/object loading.

Returns

dataset_metadata – Mutated metadata object with the loaded index.

Return type

DatasetMetadata

load_partition_indices() → T

Load all filename-encoded indices into memory. Filename-encoded indices can be extracted from datasets whose partitions are stored under keys of the form

`dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet`

This results in an in-memory index holding the information

{
    "IndexCol": {
        IndexValue: ["partition_label"]
    },
    "SecondIndexCol": {
        Value: ["partition_label"]
    }
}

Deprecated since version 5.3: This will be removed in 6.0. The load_partition_indices keyword is deprecated and will be removed.
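
How such an index can be derived from the storage keys is sketched below in plain Python. This is an illustration under stated assumptions, not kartothek's implementation: the helper `partition_indices_from_keys` is hypothetical, index values are kept as plain strings, and the partition label is taken to be the file name without its extension, as in the example above.

```python
from collections import defaultdict
from typing import Dict, List


def partition_indices_from_keys(keys: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build {column: {value: [partition labels]}} from storage keys.

    Hypothetical helper. Expected key layout:
    dataset_uuid/table/Col=Value/.../partition_label.parquet
    """
    indices = defaultdict(lambda: defaultdict(list))
    for key in keys:
        *directories, filename = key.split("/")
        # Drop the file extension to obtain the partition label.
        label = filename.rsplit(".", 1)[0]
        for segment in directories:
            # Only directory segments of the form Col=Value encode an index.
            if "=" in segment:
                column, value = segment.split("=", 1)
                indices[column][value].append(label)
    return {column: dict(values) for column, values in indices.items()}


index = partition_indices_from_keys(
    [
        "dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet",
    ]
)
```
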

property primary_indices_loaded

query(indices: Optional[List[kartothek.core.index.IndexBase]] = None, **kwargs) → List[str]

Query the dataset for partitions that contain specific values. Lookup is performed using the embedded and loaded external indices. Additional indices need to operate on the same partitions that the dataset contains, otherwise an empty list will be returned (the query method only restricts the set of partition keys using the indices).

Parameters
  • indices – List of optional additional indices.

  • **kwargs – Map of columns and values.

Returns

List of keys of partitions that contain the queried values in the respective columns.

Return type

List[str]
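
The restricting behaviour described above can be sketched as a set intersection. This is a plain-Python illustration, not kartothek's implementation: the function `query_partitions` is hypothetical, indices are modelled as dicts from value to partition labels, and columns without an index are simply ignored (an assumption of this sketch).

```python
from typing import Any, Dict, List, Set


def query_partitions(
    partition_keys: List[str],
    indices: Dict[str, Dict[Any, List[str]]],
    **kwargs: Any,
) -> List[str]:
    """Restrict the dataset's partition keys using the given indices."""
    # Start from the dataset's own partitions; indices can only narrow this set.
    candidates: Set[str] = set(partition_keys)
    for column, value in kwargs.items():
        if column in indices:
            candidates &= set(indices[column].get(value, []))
    return sorted(candidates)


partition_keys = ["part_1", "part_2", "part_3"]
indices = {
    "column_a": {1: ["part_1", "part_2"], 2: ["part_3"]},
    "column_b": {"A": ["part_1"], "B": ["part_2", "part_3"]},
}

hits = query_partitions(partition_keys, indices, column_a=1, column_b="B")

# An index that refers to partitions outside the dataset yields an empty result.
foreign = query_partitions(
    partition_keys, {"column_c": {"x": ["other_part"]}}, column_c="x"
)
```

Because the candidate set starts from the dataset's own partition keys, an index over unrelated partitions cannot add results, which is why `foreign` is empty.
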

property schema

property secondary_indices

static storage_keys(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → List[str]

Retrieve all keys that belong to the given dataset.

Parameters
  • uuid – UUID of the dataset.

  • store – Object that implements the .iter_keys method for key retrieval.

property table_meta

property tables

to_dict() → Dict

to_json() → bytes

to_msgpack() → bytes