kartothek.core.common_metadata module
-
class kartothek.core.common_metadata.SchemaWrapper(schema, origin: Union[str, Set[str]])[source]

Bases: object

Wrapper object for pyarrow.Schema to handle forwards and backwards compatibility.
-
equals(self, Schema other, bool check_metadata=False)[source]

Test if this schema is equal to the other.

- Parameters
  other (pyarrow.Schema) –
  check_metadata (bool, default False) – Key/value metadata must be equal too.
- Returns
  is_equal
- Return type
  bool
-
property origin
-
remove(i)[source]

Remove the field at position i from the schema.

- Parameters
  i (int) –
- Returns
  schema
- Return type
  Schema
-
remove_metadata(self)[source]

Create new schema without metadata, if any.

- Returns
  schema
- Return type
  Schema
-
set(self, int i, Field field)[source]

Replace a field at position i in the schema.

- Parameters
  i (int) –
  field (Field) –
- Returns
  schema
- Return type
  Schema
-
with_origin(origin: Union[str, Set[str]]) → kartothek.core.common_metadata.SchemaWrapper[source]

Create new SchemaWrapper with given origin.

- Parameters
  origin – New origin.
-
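The wrapper class above can be sketched minimally as follows. This is a hypothetical illustration (class and attribute names are made up); the real SchemaWrapper forwards the full pyarrow.Schema interface, which is omitted here:

```python
class SchemaWrapperSketch:
    """Minimal sketch: wrap an underlying schema object and track its origin(s)."""

    def __init__(self, schema, origin):
        self._schema = schema
        # Accept either a single origin string or a set of origins.
        self._origin = {origin} if isinstance(origin, str) else set(origin)

    @property
    def origin(self):
        return self._origin

    def with_origin(self, origin):
        # Return a new wrapper around the same schema with a new origin.
        return SchemaWrapperSketch(self._schema, origin)

wrapped = SchemaWrapperSketch(object(), "my_table/my_dataset_uuid")
rewrapped = wrapped.with_origin({"origin_a", "origin_b"})
```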
kartothek.core.common_metadata.empty_dataframe_from_schema(schema, columns=None, date_as_object=False)[source]

Create an empty DataFrame from provided schema.

- Parameters
  schema (Schema) – Schema information of the DataFrame.
  columns (Union[None, List[str]]) – Optional list of columns that should be part of the resulting DataFrame; all of them must be present in the provided schema.
  date_as_object (bool) – Cast dates to objects.
- Returns
  Empty DataFrame with requested columns and types.
- Return type
  DataFrame
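A rough sketch of what this produces, using a plain dict as a hypothetical stand-in for the schema (the real function accepts a pyarrow-backed schema object):

```python
import pandas as pd

# Hypothetical stand-in for a schema: a plain mapping of column -> dtype.
schema = {"part": "int64", "price": "float64", "label": "object"}

def empty_dataframe(schema, columns=None):
    # Build a zero-row DataFrame with the requested columns and dtypes,
    # mirroring the shape of empty_dataframe_from_schema's result.
    cols = list(schema) if columns is None else columns
    return pd.DataFrame({c: pd.Series(dtype=schema[c]) for c in cols})

df = empty_dataframe(schema, columns=["part", "price"])
```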
-
kartothek.core.common_metadata.make_meta(obj, origin, partition_keys=None)[source]

Create metadata object for DataFrame.

Note
For convenience, this function can also be applied to schema objects, in which case they are simply returned.

Warning
Information for categoricals will be stripped!

normalize_type() will be applied to normalize type information and normalize_column_order() will be applied to reorder column information.

- Parameters
  obj (DataFrame) –
  origin (str) – Origin of the schema data, e.g. the table name and dataset UUID.
  partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.
- Returns
  schema – Schema information for DataFrame.
- Return type
  SchemaWrapper
-
kartothek.core.common_metadata.normalize_column_order(schema, partition_keys=None)[source]

Normalize column order in schema.

Columns are sorted in the following way:

1. Partition keys (as provided by partition_keys)
2. DataFrame columns in alphabetic order
3. Remaining fields as generated by pyarrow, mostly index columns

- Parameters
  schema (SchemaWrapper) – Schema information for DataFrame.
  partition_keys (Union[None, List[str]]) – Partition keys used to split the dataset.
- Returns
  schema – Schema information for DataFrame.
- Return type
  SchemaWrapper
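The three-step ordering above can be sketched on plain column names. This is a simplified illustration under the assumption that columns are given as strings; the real function operates on schema objects:

```python
def normalized_order(columns, partition_keys=None, index_columns=()):
    # 1. Partition keys first, in the order given.
    pk = [c for c in (partition_keys or []) if c in columns]
    # 2. Remaining payload columns, alphabetically.
    payload = sorted(c for c in columns if c not in pk and c not in index_columns)
    # 3. Index fields last, as pyarrow generated them.
    return pk + payload + [c for c in index_columns if c in columns]

result = normalized_order(
    ["b", "part", "a", "__index_level_0__"],
    partition_keys=["part"],
    index_columns=["__index_level_0__"],
)
# -> ["part", "a", "b", "__index_level_0__"]
```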
-
kartothek.core.common_metadata.normalize_type(t_pa: pyarrow.lib.DataType, t_pd: Optional[str], t_np: Optional[str], metadata: Optional[Dict[str, Any]]) → Tuple[pyarrow.lib.DataType, Optional[str], Optional[str], Optional[Dict[str, Any]]][source]

This will normalize types as follows:

- all signed integers (int8, int16, int32, int64) will be converted to int64
- all unsigned integers (uint8, uint16, uint32, uint64) will be converted to uint64
- all floats (float32, float64) will be converted to float64
- all list value types will be normalized (e.g. list[int16] to list[int64], list[list[uint8]] to list[list[uint64]])
- all dict value types will be normalized (e.g. dictionary<values=float32, indices=int16, ordered=0> to float64)

- Parameters
  t_pa – pyarrow type object, e.g. pa.list_(pa.int8()).
  t_pd – pandas type identifier, e.g. "list[int8]".
  t_np – numpy type identifier, e.g. "object".
  metadata – metadata associated with the type, e.g. information about categoricals.
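The scalar and list rules above can be sketched on pandas-style type strings (a hypothetical helper; the real function works on pyarrow DataType objects and also handles dictionary types and metadata):

```python
# Scalar normalization table, as listed above.
_SCALAR = {
    "int8": "int64", "int16": "int64", "int32": "int64", "int64": "int64",
    "uint8": "uint64", "uint16": "uint64", "uint32": "uint64", "uint64": "uint64",
    "float32": "float64", "float64": "float64",
}

def normalize_name(t: str) -> str:
    # Recurse into list value types, e.g. list[int16] -> list[int64].
    if t.startswith("list[") and t.endswith("]"):
        return "list[" + normalize_name(t[5:-1]) + "]"
    # Unknown types (e.g. "object") pass through unchanged.
    return _SCALAR.get(t, t)
```

For example, `normalize_name("list[list[uint8]]")` yields `"list[list[uint64]]"`.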
-
kartothek.core.common_metadata.validate_compatible(schemas, ignore_pandas=False)[source]

Validate that all schemas in a given list are compatible.

Apart from the pandas version preserved in the schema metadata, schemas must be completely identical. That includes a perfect match of the whole metadata (except the pandas version) and pyarrow types.

Use make_meta() and normalize_column_order() for type and column order normalization.

In the case that none of the schemas contain any pandas metadata, the Arrow schemas are checked directly for compatibility.

- Parameters
  schemas (List[Schema]) – Schema information from multiple sources, e.g. multiple partitions. List may be empty.
  ignore_pandas (bool) – Ignore the schema information given by pandas and always use the Arrow schema.
- Returns
  schema – The reference schema which was tested against.
- Return type
  SchemaWrapper
- Raises
  ValueError – At least two schemas are incompatible.
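The comparison rule above (everything must match except the pandas version in the metadata) can be sketched with hypothetical plain-dict schemas; the real function compares pyarrow schemas and their serialized pandas metadata:

```python
def validate_compatible_sketch(schemas):
    """Check that all schemas match the first one, ignoring the pandas version."""
    if not schemas:
        return None

    def stripped(s):
        # Everything except the pandas version entry must match exactly.
        meta = {k: v for k, v in s["metadata"].items() if k != "pandas_version"}
        return {"columns": s["columns"], "metadata": meta}

    reference = schemas[0]
    for other in schemas[1:]:
        if stripped(other) != stripped(reference):
            raise ValueError("Schema violation")
    return reference

compatible = [
    {"columns": {"a": "int64"}, "metadata": {"pandas_version": "1.0.0"}},
    {"columns": {"a": "int64"}, "metadata": {"pandas_version": "1.2.0"}},
]
incompatible = compatible + [{"columns": {"a": "float64"}, "metadata": {}}]
```

Here the differing pandas versions are tolerated, while the third schema with a differing column type raises ValueError.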