kartothek.serialization package

Module contents

class kartothek.serialization.CsvSerializer(compress=True)[source]

Bases: kartothek.serialization._generic.DataFrameSerializer

static restore_dataframe(store: simplekv.KeyValueStore, key: str, filter_query: Optional[str] = None, columns: Optional[Iterable[str]] = None, predicate_pushdown_to_io: Any = None, categories: Optional[Iterable[str]] = None, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None, date_as_object: Any = None, **kwargs)[source]

Load a DataFrame from the specified store. The key is also used to detect the used format.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key (str) – Key that specifies the path where the object should be retrieved from the store resource.
  • filter_query (str) – Optional query to filter the DataFrame. Must adhere to the specification of pandas.DataFrame.query.
  • columns (list of str or None) – Only read in the listed columns. When set to None, the full file will be read in.
  • predicate_pushdown_to_io (bool) – Push predicates through to the I/O layer, default True. Disable this if you see problems with predicate pushdown for the given file even if the file format supports it. Note that this option only hides problems in the store layer that need to be addressed there.
  • categories (list of str (optional)) – Columns that should be loaded as categoricals.
  • predicates (list of list of tuple[str, str, Any]) –

    Optional list of predicates, like [[('x', '>', 0), …]], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

    Predicates are expressed in disjunctive normal form (DNF): each innermost tuple describes a single column predicate, the inner lists combine these predicates with a conjunction (AND), and the outermost list combines the conjunctions with a disjunction (OR). This allows expressing any predicate that can be formulated with boolean logic.

  • date_as_object (bool) – Retrieve all date columns as an object column holding datetime.date objects instead of pd.Timestamp. Note that this option only works for type-stable serializers, e.g. ParquetSerializer.
Returns:

Data in pandas DataFrame format.

Return type:

pandas.DataFrame
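The DNF semantics of the predicates parameter can be sketched in plain Python. This is a simplified illustration over row dicts, not kartothek's actual implementation; the helper name evaluate_predicates is hypothetical:

```python
import operator

# Map predicate operator strings to Python comparison functions.
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda a, b: a in b,
}

def evaluate_predicates(rows, predicates):
    """Keep rows matching at least one conjunction (an OR of ANDs)."""
    def matches(row):
        return any(
            all(_OPS[op](row[col], value) for col, op, value in conjunction)
            for conjunction in predicates
        )
    return [row for row in rows if matches(row)]

rows = [{"x": -1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 5, "y": "a"}]
# (x > 0 AND y == "a") OR (x > 4)
predicates = [[("x", ">", 0), ("y", "==", "a")], [("x", ">", 4)]]
print(evaluate_predicates(rows, predicates))  # [{'x': 5, 'y': 'a'}]
```

In a real call, a file format with predicate pushdown (e.g. Parquet) may apply such a filter at the I/O layer instead of materializing all rows first.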

store(store, key_prefix, df)[source]

Persist a DataFrame to the specified store.

The used store format (e.g. Parquet) will be appended to the key.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key_prefix (str) – Key prefix that specifies the path under which the object should be stored in the store resource. The used file format will be appended to the key.
  • df (pandas.DataFrame or pyarrow.Table) – DataFrame that shall be persisted
Returns:

The actual key where the DataFrame is stored.

Return type:

str

class kartothek.serialization.DataFrameSerializer[source]

Bases: object

Abstract class that supports serializing DataFrames to/from simplekv stores.

classmethod register_serializer(suffix, serializer)[source]
classmethod restore_dataframe(store: simplekv.KeyValueStore, key: str, filter_query: Optional[str] = None, columns: Optional[Iterable[str]] = None, predicate_pushdown_to_io: bool = True, categories: Optional[Iterable[str]] = None, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None, date_as_object: bool = False) → pandas.core.frame.DataFrame[source]

Load a DataFrame from the specified store. The key is also used to detect the used format.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key (str) – Key that specifies the path where the object should be retrieved from the store resource.
  • filter_query (str) – Optional query to filter the DataFrame. Must adhere to the specification of pandas.DataFrame.query.
  • columns (list of str or None) – Only read in the listed columns. When set to None, the full file will be read in.
  • predicate_pushdown_to_io (bool) – Push predicates through to the I/O layer, default True. Disable this if you see problems with predicate pushdown for the given file even if the file format supports it. Note that this option only hides problems in the store layer that need to be addressed there.
  • categories (list of str (optional)) – Columns that should be loaded as categoricals.
  • predicates (list of list of tuple[str, str, Any]) –

    Optional list of predicates, like [[('x', '>', 0), …]], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

    Predicates are expressed in disjunctive normal form (DNF): each innermost tuple describes a single column predicate, the inner lists combine these predicates with a conjunction (AND), and the outermost list combines the conjunctions with a disjunction (OR). This allows expressing any predicate that can be formulated with boolean logic.

  • date_as_object (bool) – Retrieve all date columns as an object column holding datetime.date objects instead of pd.Timestamp. Note that this option only works for type-stable serializers, e.g. ParquetSerializer.
Returns:

Data in pandas DataFrame format.

Return type:

pandas.DataFrame

store(store: simplekv.KeyValueStore, key_prefix: str, df: pandas.core.frame.DataFrame) → str[source]

Persist a DataFrame to the specified store.

The used store format (e.g. Parquet) will be appended to the key.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key_prefix (str) – Key prefix that specifies the path under which the object should be stored in the store resource. The used file format will be appended to the key.
  • df (pandas.DataFrame or pyarrow.Table) – DataFrame that shall be persisted
Returns:

The actual key where the DataFrame is stored.

Return type:

str

type_stable = False
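The register_serializer mechanism pairs a file suffix with a serializer so that restore_dataframe can dispatch on the ending of a storage key. The pattern can be sketched as follows; this is an illustration only, and the names _REGISTRY and serializer_for_key are assumptions, not kartothek internals:

```python
# Registry mapping file suffixes to serializers (illustrative sketch).
_REGISTRY = {}

def register_serializer(suffix, serializer):
    """Associate a key suffix with a serializer."""
    _REGISTRY[suffix] = serializer

def serializer_for_key(key):
    """Pick the serializer whose registered suffix the storage key ends with."""
    for suffix, serializer in _REGISTRY.items():
        if key.endswith(suffix):
            return serializer
    raise ValueError(f"No serializer registered for key: {key}")

# Strings stand in for serializer instances here.
register_serializer(".parquet", "ParquetSerializer")
register_serializer(".csv.gz", "CsvSerializer")

print(serializer_for_key("dataset/table/part-0.parquet"))  # ParquetSerializer
```

This is why the docstrings above say "The key is also used to detect the used format": the suffix that store() appends is what restore_dataframe later dispatches on.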
class kartothek.serialization.ParquetSerializer(compression='SNAPPY', chunk_size=None)[source]

Bases: kartothek.serialization._generic.DataFrameSerializer

static restore_dataframe(store: simplekv.KeyValueStore, key: str, filter_query: Optional[str] = None, columns: Optional[Iterable[str]] = None, predicate_pushdown_to_io: bool = True, categories: Optional[Iterable[str]] = None, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]] = None, date_as_object: bool = False)[source]

Load a DataFrame from the specified store. The key is also used to detect the used format.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key (str) – Key that specifies the path where the object should be retrieved from the store resource.
  • filter_query (str) – Optional query to filter the DataFrame. Must adhere to the specification of pandas.DataFrame.query.
  • columns (list of str or None) – Only read in the listed columns. When set to None, the full file will be read in.
  • predicate_pushdown_to_io (bool) – Push predicates through to the I/O layer, default True. Disable this if you see problems with predicate pushdown for the given file even if the file format supports it. Note that this option only hides problems in the store layer that need to be addressed there.
  • categories (list of str (optional)) – Columns that should be loaded as categoricals.
  • predicates (list of list of tuple[str, str, Any]) –

    Optional list of predicates, like [[('x', '>', 0), …]], that are used to filter the resulting DataFrame, possibly using predicate pushdown, if supported by the file format. This parameter is not compatible with filter_query.

    Predicates are expressed in disjunctive normal form (DNF): each innermost tuple describes a single column predicate, the inner lists combine these predicates with a conjunction (AND), and the outermost list combines the conjunctions with a disjunction (OR). This allows expressing any predicate that can be formulated with boolean logic.

  • date_as_object (bool) – Retrieve all date columns as an object column holding datetime.date objects instead of pd.Timestamp. Note that this option only works for type-stable serializers, e.g. ParquetSerializer.
Returns:

Data in pandas DataFrame format.

Return type:

pandas.DataFrame

store(store, key_prefix, df)[source]

Persist a DataFrame to the specified store.

The used store format (e.g. Parquet) will be appended to the key.

Parameters:
  • store (simplekv.KeyValueStore) – store engine
  • key_prefix (str) – Key prefix that specifies the path under which the object should be stored in the store resource. The used file format will be appended to the key.
  • df (pandas.DataFrame or pyarrow.Table) – DataFrame that shall be persisted
Returns:

The actual key where the DataFrame is stored.

Return type:

str

type_stable = True
kartothek.serialization.default_serializer()[source]
kartothek.serialization.check_predicates(predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]]) → None[source]

Check if predicates are well-formed.

kartothek.serialization.columns_in_predicates(predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]]) → Set[str][source]

Determine all columns which are mentioned in the list of predicates.

Parameters: predicates – The predicates to be scanned.
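Given the DNF structure documented above, the behavior can be sketched in plain Python (an illustration, not the actual implementation):

```python
def columns_in_predicates(predicates):
    """Collect every column name referenced by any predicate literal."""
    if predicates is None:
        return set()
    return {col for conjunction in predicates for col, _op, _value in conjunction}

predicates = [[("A", "==", 1), ("B", "<", 5)], [("C", "==", 4)]]
print(columns_in_predicates(predicates))  # {'A', 'B', 'C'} (set order may vary)
```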
kartothek.serialization.filter_array_like(array_like, op, value, mask=None, out=None, strict_date_types=False, column_name=None)[source]

Filter an array-like object using operations defined in the predicates

Parameters:
  • array_like (array-like, c.f. pd.api.types.is_array_like) – The array-like object to be filtered
  • op (str) – The comparison operator as a string, e.g. "==", "!=", "<", "<=", ">", ">=" or "in"
  • value (object) – The value each element is compared against
  • mask (boolean array-like, optional) – A boolean array-like object which will be combined with the result of this evaluation using a logical AND. Passing an all-True mask yields the same result as passing no mask.
  • out (array-like) – An array into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned.
  • strict_date_types (bool) – If False (default), cast all datelike values to datetime64 for comparison.
  • column_name (str, optional) – Name of the column where array_like originates from, used for nicer error messages.
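A simplified sketch of these semantics over plain Python lists (the real function operates on numpy/pandas arrays; this illustrative version only covers the op and mask parameters):

```python
import operator

# Operator strings understood by the sketch.
_OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
        "<=": operator.le, ">": operator.gt, ">=": operator.ge,
        "in": lambda a, b: a in b}

def filter_array_like(array_like, op, value, mask=None):
    """Return a boolean list; mask is AND-ed with the elementwise result."""
    result = [_OPS[op](item, value) for item in array_like]
    if mask is not None:
        result = [r and m for r, m in zip(result, mask)]
    return result

print(filter_array_like([1, 5, 10], ">", 3))                            # [False, True, True]
print(filter_array_like([1, 5, 10], ">", 3, mask=[True, False, True]))  # [False, False, True]
```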
kartothek.serialization.filter_df_from_predicates(df: pandas.core.frame.DataFrame, predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]], strict_date_types: bool = False) → pandas.core.frame.DataFrame[source]

Filter a pandas.DataFrame based on predicates in disjunctive normal form.

Parameters:
  • df (pd.DataFrame) – The pandas DataFrame to be filtered
  • predicates (list of lists) – Predicates in disjunctive normal form (DNF). For thorough documentation, see DataFrameSerializer.restore_dataframe. If None, the df is returned unmodified.
  • strict_date_types (bool) – If False (default), cast all datelike values to datetime64 for comparison.
Returns:

The filtered DataFrame.

Return type:

pd.DataFrame

kartothek.serialization.filter_df(df, filter_query=None)[source]

General implementation of query filtering.

Serialization formats such as Parquet that support predicate push-down may pre-filter in their own implementations.

kartothek.serialization.filter_predicates_by_column(predicates: Optional[List[List[Tuple[str, str, LiteralValue]]]], columns: List[str]) → Optional[List[List[Tuple[str, str, LiteralValue]]]][source]

Takes a predicate list and removes all literals which do not reference one of the given columns.

In [1]: from kartothek.serialization import filter_predicates_by_column

In [2]: predicates = [[("A", "==", 1), ("B", "<", 5)], [("C", "==", 4)]]

In [3]: filter_predicates_by_column(predicates, ["A"])
Out[3]: [[('A', '==', 1)]]
Parameters:
  • predicates – A list of predicates to be filtered
  • columns – A list of all columns allowed in the output
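Consistent with the session above, the behavior can be sketched in plain Python. This is an illustration under the assumption that conjunctions left empty after filtering are dropped entirely, as the Out[3] result shows; edge-case behavior of the real function may differ:

```python
def filter_predicates_by_column(predicates, columns):
    """Drop literals whose column is not in `columns`; drop emptied conjunctions."""
    if predicates is None:
        return None
    filtered = []
    for conjunction in predicates:
        kept = [(col, op, val) for col, op, val in conjunction if col in columns]
        if kept:
            filtered.append(kept)
    return filtered or None

predicates = [[("A", "==", 1), ("B", "<", 5)], [("C", "==", 4)]]
print(filter_predicates_by_column(predicates, ["A"]))  # [[('A', '==', 1)]]
```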