kartothek.core.common_metadata module

class kartothek.core.common_metadata.SchemaWrapper(schema, origin: Union[str, Set[str]])[source]

Bases: object

Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.

equals(self, Schema other, bool check_metadata=False)[source]

Test if this schema is equal to the other schema.

Parameters:
  • other (Schema) – Schema to compare against.
  • check_metadata (bool) – Whether the key/value metadata must match as well.
Returns:

is_equal

Return type:

bool

internal()[source]
origin
remove(i)[source]

Remove the field at index i from the schema.

Parameters:
  • i (int) – Index of the field to remove.
Returns:

schema

Return type:

Schema

remove_metadata(self)[source]

Create new schema without metadata, if any

Returns:

schema

Return type:

pyarrow.Schema

set(self, int i, Field field)[source]

Replace a field at position i in the schema.

Parameters:
  • i (int) – Index of the field to replace.
  • field (Field) – New field to place at position i.
Returns:

schema

Return type:

Schema

with_origin(origin: Union[str, Set[str]]) → kartothek.core.common_metadata.SchemaWrapper[source]

Create new SchemaWrapper with given origin.

Parameters:
  • origin (Union[str, Set[str]]) – New origin.
Returns:

schema – New schema with the given origin.

Return type:

SchemaWrapper

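A minimal usage sketch (the field names and origin strings below are illustrative):

    import pyarrow as pa
    from kartothek.core.common_metadata import SchemaWrapper

    # Wrap a plain pyarrow.Schema and record where it came from.
    schema = pa.schema([pa.field("part", pa.string()), pa.field("value", pa.int64())])
    wrapped = SchemaWrapper(schema, origin="dataset/partition_0")
    print(wrapped.origin)

    # with_origin() leaves the wrapped schema untouched and only swaps the origin.
    rewrapped = wrapped.with_origin({"dataset/partition_0", "dataset/partition_1"})
    print(rewrapped.origin)
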
kartothek.core.common_metadata.empty_dataframe_from_schema(schema, columns=None, date_as_object=False)[source]

Create an empty DataFrame from provided schema.

Parameters:
  • schema (Schema) – Schema information of the new empty DataFrame.
  • columns (Union[None, List[str]]) – Optional list of columns that should be part of the resulting DataFrame. All columns in that list must also be part of the provided schema.
  • date_as_object (bool) – If True, date columns are returned with object dtype (datetime.date values) instead of being cast to datetime64[ns].
Returns:

Empty DataFrame with requested columns and types.

Return type:

DataFrame
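
A minimal sketch that derives the schema from an illustrative sample frame via make_meta() (documented below):

    import pandas as pd
    from kartothek.core.common_metadata import empty_dataframe_from_schema, make_meta

    # Derive normalized schema metadata from an illustrative sample frame.
    sample = pd.DataFrame({"id": [1], "name": ["a"], "price": [1.0]})
    schema = make_meta(sample, origin="example")

    # Full empty frame carrying all schema columns with matching dtypes.
    df_all = empty_dataframe_from_schema(schema)
    print(df_all.dtypes)

    # Restrict the result to a subset of columns; every requested name must be in the schema.
    df_subset = empty_dataframe_from_schema(schema, columns=["id", "price"])
    print(list(df_subset.columns))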

kartothek.core.common_metadata.make_meta(obj, origin, partition_keys=None)[source]

Create metadata object for DataFrame.

Note

For convenience, this function can also be applied to schema objects, in which case they are returned unchanged.

Warning

Information for categoricals will be stripped!

normalize_type() will be applied to normalize the type information and normalize_column_order() will be applied to reorder the column information.

Parameters:
  • obj (Union[DataFrame, Schema]) – Object to extract metadata from.
  • origin (str) – Origin of the schema data, used for debugging and error reporting.
  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.
Returns:

schema – Schema information for DataFrame.

Return type:

SchemaWrapper
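
A minimal sketch; the column names, origin string and partition key are illustrative:

    import pandas as pd
    from kartothek.core.common_metadata import make_meta

    df = pd.DataFrame(
        {
            "value32": pd.Series([1, 2], dtype="int32"),
            "part": ["a", "b"],
            "flag": [True, False],
        }
    )

    # int32 is widened to int64 and the partition key "part" is sorted to the front.
    meta = make_meta(df, origin="partition_0", partition_keys=["part"])
    print(meta.origin)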

kartothek.core.common_metadata.normalize_column_order(schema, partition_keys=None)[source]

Normalize column order in schema.

Columns are sorted in the following way:

  1. Partition keys (as provided by partition_keys)
  2. DataFrame columns in alphabetic order
  3. Remaining fields as generated by pyarrow, mostly index columns
Parameters:
  • schema (SchemaWrapper) – Schema information for DataFrame.
  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.
Returns:

schema – Schema information for DataFrame.

Return type:

SchemaWrapper
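
A minimal sketch of the resulting order; the column names are illustrative, and internal() is assumed here to expose the wrapped pyarrow.Schema:

    import pandas as pd
    from kartothek.core.common_metadata import make_meta, normalize_column_order

    meta = make_meta(
        pd.DataFrame({"value": [1], "location": ["x"], "date": ["2020-01-01"]}),
        origin="example",
    )

    # With "location" as partition key the order becomes:
    # location (partition key), then date and value alphabetically, then index fields.
    normalized = normalize_column_order(meta, partition_keys=["location"])
    print(normalized.internal().names)  # internal() assumed to return the wrapped pyarrow.Schema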

kartothek.core.common_metadata.normalize_type(t_pa, t_pd, t_np, metadata)[source]

This will normalize types as follows:

  • all signed integers (int8, int16, int32, int64) will be converted to int64
  • all unsigned integers (uint8, uint16, uint32, uint64) will be converted to uint64
  • all floats (float32, float64) will be converted to float64
  • all list value types will be normalized (e.g. list[int16] to list[int64], list[list[uint8]] to list[list[uint64]])
  • all dict value types will be normalized (e.g. dictionary<values=float32, indices=int16, ordered=0> to float64)
Parameters:
  • t_pa (pyarrow.Type) – pyarrow type object, e.g. pa.list_(pa.int8()).
  • t_pd (string) – pandas type identifier, e.g. "list[int8]".
  • t_np (string) – numpy type identifier, e.g. "object".
  • metadata (Union[None, Dict[String, Any]]) – metadata associated with the type, e.g. information about categoricals.
Returns:

type_tuple – tuple of t_pa, t_pd, t_np, metadata for normalized type

Return type:

Tuple[pyarrow.Type, string, string, Union[None, Dict[String, Any]]]
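
A minimal sketch using the example types from the parameter descriptions above:

    import pyarrow as pa
    from kartothek.core.common_metadata import normalize_type

    # Signed integers are widened: int8 becomes int64, and the pandas/numpy
    # identifiers in the returned tuple are updated to match.
    print(normalize_type(pa.int8(), "int8", "int8", None))

    # List value types are normalized recursively, e.g. list[int16] -> list[int64].
    print(normalize_type(pa.list_(pa.int16()), "list[int16]", "object", None))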

kartothek.core.common_metadata.read_schema_metadata(dataset_uuid: str, store: simplekv.KeyValueStore, table: str) → kartothek.core.common_metadata.SchemaWrapper[source]

Read schema and metadata from store.

Parameters:
  • dataset_uuid (str) – Unique ID of the dataset in question.
  • store (obj) – Object that implements .get(key) to read data.
  • table (str) – Table to read metadata for.
Returns:

schema – Schema information for DataFrame/table.

Return type:

SchemaWrapper
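
A minimal round-trip sketch using an in-memory simplekv store; the dataset UUID and table name are illustrative, the schema is written first via store_schema_metadata() (documented below), and internal() is assumed here to expose the wrapped pyarrow.Schema:

    import pandas as pd
    from simplekv.memory import DictStore
    from kartothek.core.common_metadata import (
        make_meta,
        read_schema_metadata,
        store_schema_metadata,
    )

    store = DictStore()  # any store implementing .get(key)/.put(key, data) works

    # Persist schema metadata first, then read it back for the same dataset/table.
    schema = make_meta(pd.DataFrame({"x": [1], "y": ["a"]}), origin="example")
    store_schema_metadata(schema=schema, dataset_uuid="my_dataset", store=store, table="table")

    restored = read_schema_metadata(dataset_uuid="my_dataset", store=store, table="table")
    print(restored.internal())  # internal() assumed to return the wrapped pyarrow.Schema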

kartothek.core.common_metadata.store_schema_metadata(schema: kartothek.core.common_metadata.SchemaWrapper, dataset_uuid: str, store: simplekv.KeyValueStore, table: str) → str[source]

Store schema and metadata to store.

Parameters:
  • schema (Schema) – Schema information for DataFrame/table.
  • dataset_uuid (str) – Unique ID of the dataset in question.
  • store (obj) – Object that implements .put(key, data) to write data.
  • table (str) – Table to write metadata for.
Returns:

key – Key to which the metadata was written.

Return type:

str
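
A minimal sketch showing the returned key; the store, dataset UUID and table name are illustrative:

    import pandas as pd
    from simplekv.memory import DictStore
    from kartothek.core.common_metadata import make_meta, store_schema_metadata

    store = DictStore()  # any store implementing .put(key, data) works
    schema = make_meta(pd.DataFrame({"x": [1]}), origin="example")

    # The returned key is the location inside the store where the schema was written.
    key = store_schema_metadata(
        schema=schema, dataset_uuid="my_dataset", store=store, table="table"
    )
    print(key)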

kartothek.core.common_metadata.validate_compatible(schemas, ignore_pandas=False)[source]

Validate that all schemas in a given list are compatible.

Apart from the pandas version preserved in the schema metadata, schemas must be completely identical. That includes a perfect match of the whole metadata (except the pandas version) and pyarrow types.

Use make_meta() and normalize_column_order() for type and column order normalization.

If none of the schemas contain any pandas metadata, the Arrow schemas are checked directly for compatibility.

Parameters:
  • schemas (List[Schema]) – Schema information from multiple sources, e.g. multiple partitions. List may be empty.
  • ignore_pandas (bool) – Ignore the schema information given by pandas and always use the Arrow schema.
Returns:

schema – The reference schema against which all other schemas were validated.

Return type:

SchemaWrapper

Raises:

ValueError – At least two schemas are incompatible.
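
A minimal sketch with schemas derived via make_meta(); the column names and origin strings are illustrative:

    import pandas as pd
    from kartothek.core.common_metadata import make_meta, validate_compatible

    schema_a = make_meta(pd.DataFrame({"x": [1], "y": ["a"]}), origin="partition_a")
    schema_b = make_meta(pd.DataFrame({"x": [2], "y": ["b"]}), origin="partition_b")

    # Identical layouts validate and the reference schema is returned.
    reference = validate_compatible([schema_a, schema_b])

    # A different column set is incompatible and raises ValueError.
    schema_c = make_meta(pd.DataFrame({"x": [1], "z": [0.5]}), origin="partition_c")
    try:
        validate_compatible([schema_a, schema_c])
    except ValueError as exc:
        print(f"incompatible: {exc}")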

kartothek.core.common_metadata.validate_shared_columns(schemas, ignore_pandas=False)[source]

Validate that columns that are shared amongst schemas are compatible.

Only DataFrame columns are taken into account, other fields (like index data) are ignored. The following data must be an exact match:

  • metadata (as stored in the "columns" list of the b'pandas' schema metadata)
  • pyarrow type (that means that e.g. int8 and int64 are NOT compatible)

Columns that are present in only a subset of the provided schemas need only be compatible within that subset, i.e. missing columns are ignored. The order of the columns in the provided schemas is irrelevant.

Type normalization should be handled by make_meta().

If none of the schemas contain any pandas metadata, the Arrow schemas are checked directly for compatibility; in that case the pandas metadata is not checked (as it does not exist).

Parameters:
  • schemas (List[Schema]) – Schema information from multiple sources, e.g. multiple tables. List may be empty.
  • ignore_pandas (bool) – Ignore the schema information given by pandas and always use the Arrow schema.
Raises:

ValueError – Incompatible columns were found.
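
A minimal sketch; the column names and origin strings are illustrative, and only the shared column x is compared:

    import pandas as pd
    from kartothek.core.common_metadata import make_meta, validate_shared_columns

    # Only the shared column "x" is compared; "y" and "z" are ignored.
    schema_a = make_meta(pd.DataFrame({"x": [1], "y": ["a"]}), origin="table_a")
    schema_b = make_meta(pd.DataFrame({"x": [2], "z": [0.5]}), origin="table_b")
    validate_shared_columns([schema_a, schema_b])  # passes

    # A type mismatch on a shared column (int64 vs. float64) raises ValueError.
    schema_c = make_meta(pd.DataFrame({"x": [1.5]}), origin="table_c")
    try:
        validate_shared_columns([schema_a, schema_c])
    except ValueError as exc:
        print(f"incompatible shared column: {exc}")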