kartothek.core.dataset module

class kartothek.core.dataset.DatasetMetadata(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict[KT, VT]] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)[source]

Bases: kartothek.core.dataset.DatasetMetadataBase

Container holding all metadata of the dataset.

static from_buffer(buf: str, format: str = 'json', explicit_partitions: bool = True)[source]
static from_dict(dct: Dict[KT, VT], explicit_partitions: bool = True)[source]

Load dataset metadata from a dictionary.

This must have no external references. Otherwise use load_from_dict to have them resolved automatically.
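For illustration, here is a minimal sketch of the kind of plain dictionary that from_dict consumes. The field names mirror the constructor arguments; the exact serialised layout (e.g. the "files" nesting inside partitions) is an assumption here, not taken from the kartothek source.

```python
import json

# Hypothetical self-contained metadata dictionary; field names follow the
# DatasetMetadata constructor, the nesting is an assumption for illustration.
dataset_dict = {
    "dataset_uuid": "my-uuid",
    "metadata_version": 4,
    "partitions": {
        "part_1": {"files": {"table": "my-uuid/table/part_1.parquet"}},
    },
    "metadata": {"creation_time": "2020-01-01"},
}

# Such a dictionary must be self-contained: if it held external references,
# load_from_dict (which takes a store) would be needed to resolve them.
roundtripped = json.loads(json.dumps(dataset_dict))
```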

static load_from_buffer(buf, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], format: str = 'json') → kartothek.core.dataset.DatasetMetadata[source]

Load a dataset from a (string) buffer.

Parameters:
  • buf – Input to be parsed.
  • store – Object that implements the .get method for file/object loading.
Returns:

Parsed metadata.

Return type:

DatasetMetadata

static load_from_dict(dct: Dict[KT, VT], store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True)[source]

Load dataset metadata from a dictionary and resolve any external includes.

Parameters:
  • dct (dict) –
  • store (Object) – Object that implements the .get method for file/object loading.
  • load_schema (bool) – Load table schema
Returns:

dataset_metadata – Parsed metadata.

Return type:

DatasetMetadata

static load_from_store(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_schema: bool = True, load_all_indices: bool = False) → kartothek.core.dataset.DatasetMetadata[source]

Load a dataset from storage.

Parameters:
  • uuid (str or unicode) – UUID of the dataset.
  • store (Object) – Object that implements the .get method for file/object loading.
  • load_schema (bool) – Load table schema
  • load_all_indices (bool) – Load all registered indices into memory.
Returns:

dataset_metadata – Parsed metadata.

Return type:

DatasetMetadata

class kartothek.core.dataset.DatasetMetadataBase(uuid: str, partitions: Optional[Dict[str, kartothek.core.partition.Partition]] = None, metadata: Optional[Dict[KT, VT]] = None, indices: Optional[Dict[str, kartothek.core.index.IndexBase]] = None, metadata_version: int = 4, explicit_partitions: bool = True, partition_keys: Optional[List[str]] = None, table_meta: Optional[Dict[str, kartothek.core.common_metadata.SchemaWrapper]] = None)[source]

Bases: kartothek.core._mixins.CopyMixin

static exists(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → bool[source]

Check if a dataset exists in a store.

Parameters:
  • uuid (str or unicode) – UUID of the dataset.
  • store (Object) – Object that implements the .get method for file/object loading.
Returns:

exists – Whether a metadata file could be found.

Return type:

bool

get_indices_as_dataframe(columns: Optional[List[str]] = None, date_as_object: bool = True, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None)[source]

Convert the dataset indices to a pandas DataFrame and filter relevant index entries by predicates.

For a dataset with indices on columns column_a and column_b and three partitions, the dataset output may look like

        column_a column_b
part_1         1        A
part_2         2        B
part_3         3     None
Parameters:
  • columns (list of str, optional) – Restrict the output to the indices of these columns. If a specified column has no index in the dataset, a ValueError is raised.
  • predicates (list of list of tuple[str, str, Any]) –

    Optional list of predicates, like [[('x', '>', 0), ...]], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

    Predicates are expressed in disjunctive normal form (DNF). This means that the innermost tuple describes a single column predicate. These inner predicates are combined with a conjunction (AND) into a larger predicate. The outermost list then combines these conjunctions with a disjunction (OR). This makes it possible to express any predicate representable in boolean logic.

    Available operators are: ==, !=, <=, >=, <, > and in.

    Filtering for missing values is supported with the operators ==, != and in, using the values np.nan and None for float and string columns respectively.

    Categorical data

    When using order-sensitive operators on categorical data, we assume that the categories obey a lexicographical ordering. Filtering on categorical data may be slower than the evaluation on non-categorical data.
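As a sketch of the DNF semantics described above (an illustration, not kartothek's own filtering code), a predicate list can be evaluated against a single row like this:

```python
def matches(row, predicates):
    """Evaluate DNF predicates against one row, given as a column -> value dict.
    Illustrative re-implementation of the semantics described above."""
    ops = {
        "==": lambda a, b: a == b,
        "!=": lambda a, b: a != b,
        "<=": lambda a, b: a <= b,
        ">=": lambda a, b: a >= b,
        "<": lambda a, b: a < b,
        ">": lambda a, b: a > b,
        "in": lambda a, b: a in b,
    }
    # Outer list: OR over conjunctions; inner list: AND over column predicates.
    return any(
        all(ops[op](row[col], value) for col, op, value in conjunction)
        for conjunction in predicates
    )


# Means: (x > 0) OR (y in {'a', 'b'})
predicates = [[("x", ">", 0)], [("y", "in", ["a", "b"])]]
```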

index_columns
load_all_indices(store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]], load_partition_indices: bool = True) → T[source]

Load all registered indices into memory.

Note: External indices need to be preloaded before they can be queried.

Parameters:
  • store (Object) – Object that implements the .get method for file/object loading.
  • load_partition_indices (bool) – Flag if filename indices should be loaded. Default is True.
Returns:

dataset_metadata – Mutated metadata object with the loaded indices.

Return type:

DatasetMetadata

load_index(column: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → T[source]

Load an index into memory.

Note: External indices need to be preloaded before they can be queried.

Parameters:
  • column (str) – Name of the column for which the index should be loaded.
  • store (Object) – Object that implements the .get method for file/object loading.
Returns:

dataset_metadata – Mutated metadata object with the loaded index.

Return type:

DatasetMetadata

load_partition_indices() → T[source]

Load all filename-encoded indices into RAM. Filename-encoded indices can be extracted from datasets with partitions stored in a format like

`dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet`

which results in an in-memory index holding the information

{
    "IndexCol": {
        IndexValue: ["partition_label"]
    },
    "SecondIndexCol": {
        Value: ["partition_label"]
    }
}
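The extraction described above can be sketched in plain Python (an illustration only; kartothek's own parsing additionally handles quoting and type coercion of the values):

```python
def index_from_partition_keys(keys):
    """Rebuild the in-memory index structure sketched above from the
    ``Col=Value`` segments of partition storage keys. Illustrative only."""
    index = {}
    for key in keys:
        parts = key.split("/")
        # Drop the dataset uuid/table prefix and the trailing partition file.
        segments = parts[2:-1]
        label = parts[-1].rsplit(".", 1)[0]
        for segment in segments:
            column, value = segment.split("=", 1)
            index.setdefault(column, {}).setdefault(value, []).append(label)
    return index


keys = [
    "dataset_uuid/table/IndexCol=IndexValue/SecondIndexCol=Value/partition_label.parquet"
]
```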
primary_indices_loaded
query(indices: List[kartothek.core.index.IndexBase] = None, **kwargs) → List[str][source]

Query the dataset for partitions that contain specific values. Lookup is performed using the embedded and loaded external indices. Additional indices need to operate on the same partitions that the dataset contains, otherwise an empty list will be returned (the query method only restricts the set of partition keys using the indices).

Parameters:
  • indices – List of optional additional indices.
  • **kwargs – Map of columns and values.
Returns:

Return type:

List of keys of partitions that contain the queried values in the respective columns.
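The intersection semantics of query can be sketched as follows, assuming each index is represented as a plain value -> partition-labels mapping (a hypothetical simplification of IndexBase, for illustration only):

```python
def query_partitions(indices, **kwargs):
    """Sketch of the query semantics: each column index maps values to the
    partition labels containing them; a multi-column query intersects the
    per-column results. Illustrative, not kartothek's implementation."""
    result = None
    for column, value in kwargs.items():
        labels = set(indices.get(column, {}).get(value, []))
        result = labels if result is None else result & labels
    return sorted(result) if result is not None else []


indices = {
    "column_a": {1: ["part_1"], 2: ["part_2"]},
    "column_b": {"A": ["part_1", "part_2"]},
}
```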

secondary_indices
static storage_keys(uuid: str, store: Union[str, simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) → List[str][source]

Retrieve all keys that belong to the given dataset.

Parameters:
  • uuid – UUID of the dataset.
  • store – Object that implements the .iter_keys method for key retrieval.
Returns:

Sorted list of storage keys.

Return type:

keys

tables
to_dict() → Dict[KT, VT][source]
to_json() → bytes[source]
to_msgpack() → bytes[source]
class kartothek.core.dataset.DatasetMetadataBuilder(uuid: str, metadata_version=4, explicit_partitions=True, partition_keys=None, table_meta=None)[source]

Bases: kartothek.core._mixins.CopyMixin

Incrementally build up a dataset.

In contrast to a kartothek.core.dataset.DatasetMetadata instance, this object is mutable and may not describe a complete dataset (e.g. partitions don’t need to be fully materialised).

add_embedded_index(column, index)[source]

Embed an index into the metadata.

Parameters:
  • column (str) – Name of the indexed column.
  • index – The index object to embed into the metadata.
add_external_index(column, filename=None)[source]

Add a reference to an external index.

Parameters:
  • column (str) – Name of the indexed column.
  • filename (str, optional) – Explicit storage key for the external index.
Returns:storage_key – The location where the external index should be stored.
Return type:str
add_metadata(key, value)[source]

Add arbitrary key->value metadata.

Parameters:
  • key (str) –
  • value (str) –
add_partition(name, partition)[source]

Add an (embedded) Partition.

Parameters:
  • name (str) – Identifier of the partition.
  • partition (kartothek.core.partition.Partition) – The partition to add.
static from_dataset(dataset)[source]
to_dataset() → kartothek.core.dataset.DatasetMetadata[source]
to_dict()[source]

Render the dataset to a dict.

to_json()[source]

Render the dataset to JSON.

Returns:
  • storage_key (str) – The path where this metadata should be placed in the storage.
  • dataset_json (str) – The rendered JSON for this dataset.
to_msgpack() → Tuple[str, bytes][source]

Render the dataset to msgpack.

Returns:
  • storage_key (str) – The path where this metadata should be placed in the storage.
  • dataset_msgpack (bytes) – The rendered msgpack for this dataset.
kartothek.core.dataset.create_partition_key(dataset_uuid, table, index_values, filename='data')[source]

Create partition key for a kartothek partition

Parameters:
  • dataset_uuid (str) –
  • table (str) –
  • index_values (list of (str, str) tuples) –
  • filename (str) –
Example:

create_partition_key('my-uuid', 'testtable', [('index1', 'value1'), ('index2', 'value2')])

returns 'my-uuid/testtable/index1=value1/index2=value2/data'
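The key layout can be sketched as a one-line join (a hypothetical re-implementation; kartothek's own version may additionally quote special characters in the values):

```python
def make_partition_key(dataset_uuid, table, index_values, filename="data"):
    """Illustrative sketch of the key layout produced by create_partition_key:
    uuid, table, one Col=Value segment per index pair, then the filename."""
    index_segments = ["{}={}".format(col, value) for col, value in index_values]
    return "/".join([dataset_uuid, table] + index_segments + [filename])


key = make_partition_key(
    "my-uuid", "testtable", [("index1", "value1"), ("index2", "value2")]
)
```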

kartothek.core.dataset.to_ordinary_dict(dct: Dict[KT, VT]) → Dict[KT, VT][source]