kartothek.core.common_metadata module

class kartothek.core.common_metadata.SchemaWrapper(schema, origin: Union[str, Set[str]])[source]

Bases: object

Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.

equals(self, Schema other, bool check_metadata=False)[source]

Test if this schema is equal to the other

Parameters
  • other (pyarrow.Schema) –

  • check_metadata (bool, default False) – Key/value metadata must be equal too

Returns

is_equal

Return type

bool
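
A minimal usage sketch (the schema fields and origin label are illustrative placeholders): wrap a plain pyarrow.Schema and compare it against another schema via equals().

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import SchemaWrapper
>>> arrow_schema = pa.schema([("part", pa.string()), ("value", pa.int64())])
>>> wrapped = SchemaWrapper(arrow_schema, origin="partition-0")
>>> # Key/value metadata is ignored unless check_metadata=True.
>>> assert wrapped.equals(arrow_schema)
>>> other = pa.schema([("part", pa.string()), ("value", pa.float64())])
>>> assert not wrapped.equals(other)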

internal()[source]
property origin
remove(i)[source]

Remove the field at index i from the schema.

Parameters

i (int) –

Returns

schema

Return type

Schema

remove_metadata(self)[source]

Create new schema without metadata, if any

Returns

schema

Return type

pyarrow.Schema

set(self, int i, Field field)[source]

Replace a field at position i in the schema.

Parameters
  • i (int) –

  • field (Field) –

Returns

schema

Return type

Schema
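
A short sketch of the field-manipulation helpers above; each call returns a new schema object and leaves the original wrapper untouched (the field names are arbitrary examples).

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import SchemaWrapper
>>> wrapped = SchemaWrapper(
...     pa.schema([("a", pa.int64()), ("b", pa.string())]), origin="example"
... )
>>> replaced = wrapped.set(1, pa.field("b", pa.float64()))  # change type of "b"
>>> shrunk = wrapped.remove(0)                              # drop field "a"
>>> bare = wrapped.remove_metadata()                        # strip key/value metadata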

with_origin(origin: Union[str, Set[str]]) → kartothek.core.common_metadata.SchemaWrapper[source]

Create new SchemaWrapper with given origin.

Parameters

origin – New origin.
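
For illustration (the origin labels are placeholders), with_origin() can be used to re-tag schema information, e.g. after combining schemas from several partitions:

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import SchemaWrapper
>>> wrapped = SchemaWrapper(pa.schema([("x", pa.int64())]), origin="partition-0")
>>> merged = wrapped.with_origin({"partition-0", "partition-1"})
>>> assert "partition-1" in merged.origin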

kartothek.core.common_metadata.empty_dataframe_from_schema(schema, columns=None, date_as_object=False)[source]

Create an empty DataFrame from provided schema.

Parameters
  • schema (Schema) – Schema information of the new empty DataFrame.

  • columns (Union[None, List[str]]) – Optional list of columns that should be part of the resulting DataFrame. All columns in that list must also be part of the provided schema.

  • date_as_object (bool) – Load date columns as Python date objects (object dtype) instead of casting them to datetime64[ns].

Returns

Empty DataFrame with requested columns and types.

Return type

DataFrame
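
A short sketch, assuming a plain pyarrow.Schema is passed (the documented parameter type); the field names are placeholders.

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import empty_dataframe_from_schema
>>> schema = pa.schema(
...     [("id", pa.int64()), ("price", pa.float64()), ("name", pa.string())]
... )
>>> df = empty_dataframe_from_schema(schema)
>>> len(df)
0
>>> df_subset = empty_dataframe_from_schema(schema, columns=["id", "price"])
>>> list(df_subset.columns)
['id', 'price']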

kartothek.core.common_metadata.make_meta(obj, origin, partition_keys=None)[source]

Create metadata object for DataFrame.

Note

For convenience, this function can also be applied to schema objects, in which case they are returned unchanged.

Warning

Information for categoricals will be stripped!

normalize_type() will be applied to normalize type information and normalize_column_order() will be applied to reorder column information.

Parameters
  • obj (Union[DataFrame, Schema]) – Object to extract metadata from.

  • origin (str) – Origin of the schema data, used for debugging and error reporting.

  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.

Returns

schema – Schema information for DataFrame.

Return type

SchemaWrapper
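
A minimal sketch of extracting normalized schema information from a DataFrame; the column names and the origin label are placeholders.

>>> import pandas as pd
>>> from kartothek.core.common_metadata import make_meta
>>> df = pd.DataFrame(
...     {"part": ["a"], "x": pd.Series([1], dtype="int32"), "y": [1.5]}
... )
>>> meta = make_meta(df, origin="partition-0", partition_keys=["part"])
>>> # int32 is widened to int64 by normalize_type(); "part" is sorted to the
>>> # front by normalize_column_order() because it is a partition key.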

kartothek.core.common_metadata.normalize_column_order(schema, partition_keys=None)[source]

Normalize column order in schema.

Columns are sorted in the following way:

  1. Partition keys (as provided by partition_keys)

  2. DataFrame columns in alphabetic order

  3. Remaining fields as generated by pyarrow, mostly index columns

Parameters
  • schema (SchemaWrapper) – Schema information for DataFrame.

  • partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.

Returns

schema – Schema information for DataFrame.

Return type

SchemaWrapper
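
A sketch of a direct call on a hand-built wrapper (the field names are placeholders); the resulting order follows the rules listed above.

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import SchemaWrapper, normalize_column_order
>>> wrapped = SchemaWrapper(
...     pa.schema(
...         [("value", pa.int64()), ("location", pa.string()), ("date", pa.date32())]
...     ),
...     origin="example",
... )
>>> normalized = normalize_column_order(wrapped, partition_keys=["date"])
>>> # "date" is sorted to the front as a partition key; the remaining columns
>>> # follow in alphabetic order, index-like fields would come last.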

kartothek.core.common_metadata.normalize_type(t_pa: pyarrow.lib.DataType, t_pd: Optional[str], t_np: Optional[str], metadata: Optional[Dict[str, Any]]) → Tuple[pyarrow.lib.DataType, Optional[str], Optional[str], Optional[Dict[str, Any]]][source]

This will normalize types as follows:

  • all signed integers (int8, int16, int32, int64) will be converted to int64

  • all unsigned integers (uint8, uint16, uint32, uint64) will be converted to uint64

  • all floats (float32, float64) will be converted to float64

  • all list value types will be normalized (e.g. list[int16] to list[int64], list[list[uint8]] to list[list[uint64]])

  • all dict value types will be normalized (e.g. dictionary<values=float32, indices=int16, ordered=0> to float64)

Parameters
  • t_pa – pyarrow type object, e.g. pa.list_(pa.int8()).

  • t_pd – pandas type identifier, e.g. "list[int8]".

  • t_np – numpy type identifier, e.g. "object".

  • metadata – metadata associated with the type, e.g. information about categoricals.
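
A sketch of the scalar and list cases, using the identifier conventions from the parameter examples above.

>>> import pyarrow as pa
>>> from kartothek.core.common_metadata import normalize_type
>>> t_pa, t_pd, t_np, meta = normalize_type(pa.int8(), "int8", "int8", None)
>>> # t_pa is expected to be int64 per the signed-integer rule above.
>>> t_pa, t_pd, t_np, meta = normalize_type(
...     pa.list_(pa.int16()), "list[int16]", "object", None
... )
>>> # The list value type is normalized as well, i.e. list[int16] -> list[int64].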

kartothek.core.common_metadata.validate_compatible(schemas, ignore_pandas=False)[source]

Validate that all schemas in a given list are compatible.

Apart from the pandas version preserved in the schema metadata, schemas must be completely identical. That includes a perfect match of the whole metadata (except the pandas version) and pyarrow types.

Use make_meta() and normalize_column_order() for type and column order normalization.

If none of the schemas contain any pandas metadata, the Arrow schemas are checked directly for compatibility.

Parameters
  • schemas (List[Schema]) – Schema information from multiple sources, e.g. multiple partitions. List may be empty.

  • ignore_pandas (bool) – Ignore the schema information given by pandas and always use the Arrow schema.

Returns

schema – The reference schema that the other schemas were validated against.

Return type

SchemaWrapper

Raises

ValueError – At least two schemas are incompatible.
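
A short end-to-end sketch (column names and origins are placeholders): schemas extracted from matching partitions validate, while a diverging type raises ValueError.

>>> import pandas as pd
>>> from kartothek.core.common_metadata import make_meta, validate_compatible
>>> schema_a = make_meta(pd.DataFrame({"x": [1], "y": ["a"]}), origin="part-a")
>>> schema_b = make_meta(pd.DataFrame({"x": [2], "y": ["b"]}), origin="part-b")
>>> reference = validate_compatible([schema_a, schema_b])  # compatible
>>> schema_c = make_meta(pd.DataFrame({"x": [1.5], "y": ["c"]}), origin="part-c")
>>> # validate_compatible([schema_a, schema_c]) would raise ValueError, because
>>> # "x" is int64 in one schema and float64 in the other.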