kartothek.utils.pandas module
Pandas performance helpers.

kartothek.utils.pandas.aggregate_to_lists(df, by, data_col)
Do a group-by and collect the results as Python lists.
Roughly equivalent to:
df = df.groupby(
    by=by,
    as_index=False,
)[data_col].agg(lambda series: list(series.values))
- Parameters
df (pandas.DataFrame) – Dataframe.
by (Iterable[str]) – Group-by columns, might be empty.
data_col (str) – Column with values to be collected.
- Returns
df – DataFrame w/ operation applied.
- Return type
pandas.DataFrame
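
As a hedged illustration of the behaviour described for aggregate_to_lists, the following pandas-only sketch mirrors the "roughly equivalent" expression. The name `aggregate_to_lists_sketch` is hypothetical, and the handling of an empty `by` (collapsing all rows into a single list) is an assumption made for this example, not a statement about kartothek's actual implementation.

```python
import pandas as pd


def aggregate_to_lists_sketch(df, by, data_col):
    """Hypothetical stand-in: group by `by` and collect `data_col` into Python lists."""
    by = list(by)
    if not by:
        # Assumption: without group-by columns, all values end up in one list.
        return pd.DataFrame({data_col: [list(df[data_col].values)]})
    return df.groupby(by=by, as_index=False)[data_col].agg(
        lambda series: list(series.values)
    )


df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
grouped = aggregate_to_lists_sketch(df, by=["key"], data_col="value")
ungrouped = aggregate_to_lists_sketch(df, by=[], data_col="value")
```

Here `grouped` holds one row per distinct `key`, and each `value` cell is a plain Python list.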

kartothek.utils.pandas.concat_dataframes(dfs, default=None)
Concatenate given DataFrames.
For non-empty iterables, this is roughly equivalent to:
pd.concat(dfs, ignore_index=True, sort=False)
except that the resulting index is undefined.
Important
If dfs is a list, it gets emptied during the process.
Warning
This requires all DataFrames to have the very same set of columns!
- Parameters
dfs (Iterable[pandas.DataFrame]) – Iterable of DataFrames w/ identical columns.
default (Optional[pandas.DataFrame]) – Optional default if iterable is empty.
- Returns
df – Concatenated DataFrame or default value.
- Return type
pandas.DataFrame
- Raises
ValueError – If iterable is empty but no default was provided.
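
The contract documented for concat_dataframes (concatenate, fall back to a default, raise on an empty iterable without a default) can be sketched with plain pandas. The name `concat_dataframes_sketch` is hypothetical and the error message is invented for illustration. Unlike the documented function, this sketch copies the input and therefore does not empty a list passed by the caller.

```python
import pandas as pd


def concat_dataframes_sketch(dfs, default=None):
    """Hypothetical stand-in for the documented contract, built on pd.concat."""
    dfs = list(dfs)  # copy, so the caller's list is NOT emptied here
    if not dfs:
        if default is None:
            raise ValueError("Cannot concatenate an empty iterable without a default.")
        return default
    return pd.concat(dfs, ignore_index=True, sort=False)


parts = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]
combined = concat_dataframes_sketch(parts)
fallback = concat_dataframes_sketch([], default=pd.DataFrame({"x": []}))
```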

kartothek.utils.pandas.drop_sorted_duplicates_keep_last(df, columns)
Drop duplicates on sorted data, keep last occurrence as unique entry.
Roughly equivalent to:
df.drop_duplicates(subset=columns, keep='last')
- Parameters
df (pandas.DataFrame) – DataFrame in question.
columns (Iterable[str]) – Column-subset for duplicate-check (remaining columns are ignored).
- Returns
df – DataFrame w/o duplicates.
- Return type
pandas.DataFrame
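
A small example of the "roughly equivalent" pandas call for drop_sorted_duplicates_keep_last may help to show what "keep last" means when only a subset of columns is checked. The column names `key` and `payload` are made up for this illustration, and the snippet uses plain pandas rather than kartothek itself.

```python
import pandas as pd

# Data sorted by the subset column "key"; "payload" is ignored by the duplicate check.
df = pd.DataFrame({"key": [1, 1, 2, 3, 3], "payload": ["a", "b", "c", "d", "e"]})

# For every run of equal keys, only the last row survives.
deduplicated = df.drop_duplicates(subset=["key"], keep="last")
```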

kartothek.utils.pandas.is_dataframe_sorted(df, columns)
Check that the given DataFrame is sorted as specified.
This is more efficient than sorting the DataFrame.
An empty DataFrame (no rows) is considered to be sorted.
Warning
This function does NOT handle NULL values correctly!
- Parameters
df (pd.DataFrame) – DataFrame to check.
columns (Iterable[str]) – Columns that the DataFrame should be sorted by.
- Returns
sorted – True if DataFrame is sorted, False otherwise.
- Return type
bool
- Raises
ValueError – If columns is empty.
KeyError – If a column specified in columns is missing.
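
One way to check sortedness without sorting is to compare each row with its successor, column by column. The sketch below illustrates that idea under stated assumptions: the name `is_dataframe_sorted_sketch` is hypothetical, the approach is not claimed to be kartothek's implementation, and, like the documented function, it makes no attempt to handle NULL values.

```python
import numpy as np
import pandas as pd


def is_dataframe_sorted_sketch(df, columns):
    """Hypothetical stand-in: check lexicographic order without sorting."""
    columns = list(columns)
    if not columns:
        raise ValueError("`columns` must not be empty.")
    # Indexing raises KeyError if a requested column is missing.
    arrays = [df[col].to_numpy() for col in columns]
    if len(df) < 2:
        return True
    # tied[i] is True while rows i and i+1 are equal on all columns seen so far.
    tied = np.ones(len(df) - 1, dtype=bool)
    for values in arrays:
        previous, following = values[:-1], values[1:]
        if np.any(tied & (previous > following)):
            return False  # a row sorts after its successor
        tied &= previous == following
    return True


sorted_df = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 0]})
unsorted_df = pd.DataFrame({"a": [1, 1, 2], "b": [2, 1, 0]})
empty_df = pd.DataFrame({"a": [], "b": []})
```

The check is a single linear pass per column, which is why it can be cheaper than sorting the frame and comparing the result.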

kartothek.utils.pandas.mask_sorted_duplicates_keep_last(df, columns)
Mask duplicates on sorted data, keep last occurrence as unique entry.
Roughly equivalent to:
df.duplicated(subset=columns, keep='last').values
- Parameters
df (pandas.DataFrame) – DataFrame in question.
columns (Iterable[str]) – Column-subset for duplicate-check (remaining columns are ignored).
- Returns
mask – 1-dimensional boolean array, marking duplicates w/ True.
- Return type
numpy.ndarray
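
On sorted data, a row is a duplicate that should be dropped exactly when the next row has the same values in the checked columns. The following sketch shows that idea and compares it with the "roughly equivalent" pandas call. The name `mask_sorted_duplicates_keep_last_sketch` and the column names are hypothetical, and the neighbour comparison is an assumption about how sortedness can be exploited, not a description of kartothek's code.

```python
import numpy as np
import pandas as pd


def mask_sorted_duplicates_keep_last_sketch(df, columns):
    """Hypothetical stand-in: mark rows whose successor has the same key."""
    columns = list(columns)
    mask = np.zeros(len(df), dtype=bool)
    if len(df) < 2:
        return mask
    same_as_next = np.ones(len(df) - 1, dtype=bool)
    for col in columns:
        values = df[col].to_numpy()
        same_as_next &= values[:-1] == values[1:]
    # The last row of the frame is never a duplicate under keep="last".
    mask[:-1] = same_as_next
    return mask


df = pd.DataFrame({"key": [1, 1, 2, 3, 3], "payload": list("abcde")})
mask = mask_sorted_duplicates_keep_last_sketch(df, ["key"])
reference = df.duplicated(subset=["key"], keep="last").values
kept = df[~mask]
```

Inverting the mask (`df[~mask]`) selects the rows that remain after de-duplication.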

kartothek.utils.pandas.merge_dataframes_robust(df1, df2, how)
Merge two given DataFrames but also work if there are no columns to join on.
If no shared column between the given DataFrames is found, the join is performed on a single, constant column.
- Parameters
df1 (pd.DataFrame) – Left DataFrame.
df2 (pd.DataFrame) – Right DataFrame.
how (str) – How to join the frames.
- Returns
df_joined – Joined DataFrame.
- Return type
pandas.DataFrame
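
The fallback described for merge_dataframes_robust, joining on a single constant column when the frames share no columns, can be sketched as follows. The function name `merge_dataframes_robust_sketch` and the helper column name `__dummy__` are hypothetical choices for this example. Joining on all shared columns when some exist is also an assumption.

```python
import pandas as pd


def merge_dataframes_robust_sketch(df1, df2, how):
    """Hypothetical stand-in: merge on shared columns, else on a constant column."""
    shared = [col for col in df1.columns if col in df2.columns]
    if shared:
        return df1.merge(df2, on=shared, how=how)
    dummy = "__dummy__"  # hypothetical helper column, assumed not to clash
    joined = df1.assign(**{dummy: 1}).merge(
        df2.assign(**{dummy: 1}), on=dummy, how=how
    )
    return joined.drop(columns=[dummy])


# No shared columns: every left row is paired with every right row.
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": ["x", "y", "z"]})
cross = merge_dataframes_robust_sketch(left, right, how="inner")

# Shared column "k": behaves like a regular merge.
left_k = pd.DataFrame({"k": [1, 2], "a": [10, 20]})
right_k = pd.DataFrame({"k": [2, 3], "b": [200, 300]})
joined = merge_dataframes_robust_sketch(left_k, right_k, how="inner")
```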

kartothek.utils.pandas.sort_dataframe(df, columns)
Sort DataFrame by columns.
This is roughly equivalent to:
df.sort_values(columns).reset_index(drop=True)
Warning
This function does NOT handle NULL values correctly!
- Parameters
df (pandas.DataFrame) – DataFrame to sort.
columns (Iterable[str]) – Columns to sort by.
- Returns
df – Sorted DataFrame w/ reset index.
- Return type
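
For reference, the "roughly equivalent" pandas expression for sort_dataframe behaves as shown in this short example. The column names and index values are invented for illustration, and the snippet uses plain pandas rather than kartothek itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1, 1], "b": [0, 5, 3]}, index=[10, 11, 12])

# Sort by "a", then "b", and replace the old index with 0..n-1.
sorted_df = df.sort_values(["a", "b"]).reset_index(drop=True)
```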