kartothek.io_components.cube.query package¶
Module contents¶
Common code to build query functions/pipelines.
-
class
kartothek.io_components.cube.query.
QueryGroup
(metapartitions: Dict[int, Dict[str, Tuple[kartothek.io_components.metapartition.MetaPartition, …]]], load_columns: Dict[str, Tuple[str, …]], output_columns: Tuple[str, …], predicates: Dict[str, Tuple[Tuple[Tuple[str, str, Any], …], …]], empty_df: Dict[str, pandas.core.frame.DataFrame], dimension_columns: Tuple[str, …], restrictive_dataset_ids: Set[str])[source]¶ Bases:
object
Query group, aka logical partition w/ all kartothek metapartition and information required to load the data.
- Parameters
metapartition (Dict[int, Dict[str, Tuple[kartothek.io_components.metapartition.MetaPartition, ..]]]) – Mapping from partition ID to metapartitions per dataset ID.
output_columns (Tuple[str, ..]) – Tuple of columns that will be returned from the query API.
predicates (Dict[str, Tuple[Tuple[Tuple[str, str, Any], ..], ..]]) – Predicates for each dataset ID.
empty_df (Dict[str, pandas.DataFrame]) – Empty DataFrame for each dataset ID.
dimension_columns (Tuple[str, ..]) – Dimension columns, used for de-duplication and to join data.
restrictive_dataset_ids (Set[str]) – Datasets (by Ktk_cube dataset ID) that are restrictive during the join process.
-
class
kartothek.io_components.cube.query.
QueryIntention
(dimension_columns: Tuple[str, …], partition_by: Tuple[str, …], conditions_pre: Dict[str, kartothek.core.cube.conditions.Conjunction], conditions_post: Dict[str, kartothek.core.cube.conditions.Conjunction], output_columns: Tuple[str, …])[source]¶ Bases:
object
Checked user intention during the query process.
- Parameters
dimension_columns (Tuple[str, ..]) – Real dimension columns.
partition_by (Tuple[str, ..]) – Real partition-by columns, may be empty.
conditions_pre (Dict[str, kartothek.core.cube.conditions.Conjunction]) – Conditions to be applied based on the index data alone.
conditions_post (Dict[str, kartothek.core.cube.conditions.Conjunction]) – Conditions to be applied during the load process.
output_columns (Tuple[str, ..]) – Output columns to be passed back to the user, in correct order.
-
kartothek.io_components.cube.query.
load_group
(group, store, cube)[source]¶ Load
QueryGroup
and return DataFrame.- Parameters
group (QueryGroup) – Query group.
store (Union[Callable[[], simplekv.KeyValueStore], simplekv.KeyValueStore]) – Store to load data from.
cube (kartothek.core.cube.cube.Cube) – Cube specification.
- Returns
df – Dataframe, may be empty.
- Return type
-
kartothek.io_components.cube.query.
plan_query
(conditions, cube, datasets, dimension_columns, partition_by, payload_columns, store)[source]¶ Plan cube query execution.
Important
If the intention does not contain a partition-by, this partition by the cube partition columns to speed up the query on parallel backends. In that case, the backend must concat and check the resulting dataframes before passing it to the user.
- Parameters
conditions (Union[None, Condition, Iterable[Condition], Conjunction]) – Conditions that should be applied.
cube (Cube) – Cube specification.
datasets (Union[None, Iterable[str], Dict[str, kartothek.core.dataset.DatasetMetadata]]) – Datasets to query, must all be part of the cube.
dimension_columns (Optional[Iterable[str]]) – Dimension columns of the query, may result in projection.
partition_by (Optional[Iterable[str]]) – By which column logical partitions should be formed.
payload_columns (Optional[Iterable[str]]) – Which columns apart from
dimension_columns
andpartition_by
should be returned.store (Union[simplekv.KeyValueStore, Callable[[], simplekv.KeyValueStore]]) – Store to query from.
- Returns
intent (QueryIntention) – Query intention.
empty_df (pandas.DataFrame) – Empty DataFrame representing the output types.
groups (Tuple[QueryGroup]) – Tuple of query groups. May be empty.
-
kartothek.io_components.cube.query.
quick_concat
(dfs, dimension_columns, partition_columns)[source]¶ Fast version of:
pd.concat( dfs, ignore_index=True, sort=False, ).sort_values(dimension_columns + partition_columns).reset_index(drop=True)
if inputs are presorted.
- Parameters
dfs (Iterable[pandas.DataFrame]) – DataFrames to concat.
dimension_columns (Iterable[str]) – Dimension columns in correct order.
partition_columns (Iterable[str]) – Partition columns in correct order.
- Returns
df – Concatenated result.
- Return type