kartothek.utils.pandas module

Pandas performance helpers.

kartothek.utils.pandas.aggregate_to_lists(df, by, data_col)[source]

Do a group-by and collect the results as python lists.

Roughly equivalent to:

df = df.groupby(
    by=by,
    as_index=False,
)[data_col].agg(lambda series: list(series.values))
Parameters
  • df (pandas.DataFrame) – Dataframe.

  • by (Iterable[str]) – Group-by columns, might be empty.

  • data_col (str) – Column with values to be collected.

Returns

df – DataFrame w/ operation applied.

Return type

pandas.DataFrame
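
The pandas expression quoted above can be tried on a small frame. This is only a sketch of the "roughly equivalent" behaviour using plain pandas; the column names `k` and `v` are made up for illustration, and the helper itself is not called here.

```python
import pandas as pd

# Hypothetical example data: "k" is the group-by column, "v" the data column.
df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})

# Plain-pandas counterpart of aggregate_to_lists(df, by=["k"], data_col="v").
out = df.groupby(
    by=["k"],
    as_index=False,
)["v"].agg(lambda series: list(series.values))

# Each group now holds its values as a Python list in column "v".
print(out)
```

The result has one row per group, with the group-by columns kept as regular columns because of `as_index=False`.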

kartothek.utils.pandas.concat_dataframes(dfs, default=None)[source]

Concatenate given DataFrames.

For non-empty iterables, this is roughly equivalent to:

pd.concat(dfs, ignore_index=True, sort=False)

except that the resulting index is undefined.

Important

If dfs is a list, it gets emptied during the process.

Warning

This requires all DataFrames to have the very same set of columns!

Parameters
  • dfs (Iterable[pandas.DataFrame]) – Iterable of DataFrames w/ identical columns.

  • default (Optional[pandas.DataFrame]) – Optional default if iterable is empty.

Returns

df – Concatenated DataFrame or default value.

Return type

pandas.DataFrame

Raises

ValueError – If iterable is empty but no default was provided.
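
The documented contract (concatenate, fall back to `default`, raise on an empty iterable without a default) could be sketched in plain pandas as follows. The function name `concat_or_default` and the error message are assumptions made for this illustration; this is not kartothek's implementation, and unlike the real helper it copies the input, so a list passed in is not emptied.

```python
import pandas as pd


def concat_or_default(dfs, default=None):
    """Hypothetical stand-in that mirrors the documented behaviour."""
    dfs = list(dfs)  # copies the iterable; the caller's list stays intact
    if not dfs:
        if default is None:
            raise ValueError("Cannot concatenate an empty iterable without a default.")
        return default
    # All frames are expected to share the very same set of columns.
    return pd.concat(dfs, ignore_index=True, sort=False)


parts = [
    pd.DataFrame({"x": [1, 2]}),
    pd.DataFrame({"x": [3]}),
]
combined = concat_or_default(parts)
fallback = concat_or_default([], default=pd.DataFrame({"x": []}))
```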

kartothek.utils.pandas.drop_sorted_duplicates_keep_last(df, columns)[source]

Drop duplicates on sorted data, keep last occurrence as unique entry.

Roughly equivalent to:

df.drop_duplicates(subset=columns, keep='last')
Parameters
  • df (pandas.DataFrame) – DataFrame in question.

  • columns (Iterable[str]) – Column-subset for duplicate-check (remaining columns are ignored).

Returns

df – DataFrame w/o duplicates.

Return type

pandas.DataFrame
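
A small illustration of the equivalent pandas call on data that is already sorted by the duplicate-check column. The column names `a` and `b` are made up, and only plain pandas is used.

```python
import pandas as pd

# Hypothetical frame, already sorted by "a".
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": [10, 11, 20, 21, 22]})

# Only "a" is checked for duplicates; "b" is ignored by the check.
deduped = df.drop_duplicates(subset=["a"], keep="last")

# The last row of each run of equal "a" values survives.
print(deduped)
```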

kartothek.utils.pandas.is_dataframe_sorted(df, columns)[source]

Check that the given DataFrame is sorted as specified.

This is more efficient than sorting the DataFrame.

An empty DataFrame (no rows) is considered to be sorted.

Warning

This function does NOT handle NULL values correctly!

Parameters
  • df (pd.DataFrame) – DataFrame to check.

  • columns (Iterable[str]) – Columns that the DataFrame should be sorted by.

Returns

sorted – True if DataFrame is sorted, False otherwise.

Return type

bool

Raises
  • ValueError – If columns is empty.

  • KeyError – If any of the specified columns is missing.
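
One way to check sortedness without sorting the frame is a lexicographic monotonicity test on the key columns. The sketch below uses `pandas.MultiIndex.is_monotonic_increasing` for that; the function name `looks_sorted` is hypothetical, this is not kartothek's implementation, and like the documented helper it makes no attempt to handle NULL values.

```python
import pandas as pd


def looks_sorted(df, columns):
    """Hypothetical check mirroring the documented contract."""
    columns = list(columns)
    if not columns:
        raise ValueError("columns must not be empty")
    keys = df[columns]  # raises KeyError if a column is missing
    if len(keys) == 0:
        return True  # an empty frame counts as sorted
    # Lexicographic, non-strict "increasing" check over all key columns.
    return bool(pd.MultiIndex.from_frame(keys).is_monotonic_increasing)


ordered = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 0]})
shuffled = pd.DataFrame({"a": [1, 1, 2], "b": [2, 1, 0]})

print(looks_sorted(ordered, ["a", "b"]))
print(looks_sorted(shuffled, ["a", "b"]))
```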

kartothek.utils.pandas.mask_sorted_duplicates_keep_last(df, columns)[source]

Mask duplicates on sorted data, keep last occurrence as unique entry.

Roughly equivalent to:

df.duplicated(subset=columns, keep='last').values
Parameters
  • df (pandas.DataFrame) – DataFrame in question.

  • columns (Iterable[str]) – Column-subset for duplicate-check (remaining columns are ignored).

Returns

mask – 1-dimensional boolean array, marking duplicates w/ True

Return type

numpy.ndarray
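
The mask marks the rows that would be dropped, so inverting it selects the rows to keep. A short plain-pandas illustration with made-up column names:

```python
import pandas as pd

# Hypothetical frame, already sorted by "a".
df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": [10, 11, 20, 21, 22]})

# True marks a duplicate, i.e. every row except the last of each "a" value.
mask = df.duplicated(subset=["a"], keep="last").values

# Inverting the mask keeps exactly the last occurrences.
kept = df[~mask]
```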

kartothek.utils.pandas.merge_dataframes_robust(df1, df2, how)[source]

Merge two given DataFrames but also work if there are no columns to join on.

If no shared column between the given DataFrames is found, then the join will be performed on a single, constant column.

Parameters
Returns

df_joined – Joined DataFrame.

Return type

pd.DataFrame
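
The fallback described above (joining on a constant column when the frames share no column) might look roughly like this in plain pandas. The function name `merge_with_fallback` and the temporary column name `__const__` are assumptions made for this sketch, not details taken from kartothek.

```python
import pandas as pd


def merge_with_fallback(df1, df2, how):
    """Hypothetical sketch of a merge that tolerates missing join columns."""
    shared = [col for col in df1.columns if col in df2.columns]
    if shared:
        return df1.merge(df2, on=shared, how=how)
    key = "__const__"  # assumed name for the temporary constant column
    left = df1.assign(**{key: 1})
    right = df2.assign(**{key: 1})
    # Every row matches every other row on the constant key.
    return left.merge(right, on=key, how=how).drop(columns=key)


no_overlap = merge_with_fallback(
    pd.DataFrame({"x": [1, 2]}),
    pd.DataFrame({"y": ["a", "b", "c"]}),
    how="inner",
)
with_overlap = merge_with_fallback(
    pd.DataFrame({"k": [1, 2], "x": [10, 20]}),
    pd.DataFrame({"k": [2, 3], "y": ["b", "c"]}),
    how="inner",
)
```

With no shared column, the constant key pairs every row of one frame with every row of the other, so the result here is the full cross product.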

kartothek.utils.pandas.sort_dataframe(df, columns)[source]

Sort DataFrame by columns.

This is roughly equivalent to:

df.sort_values(columns).reset_index(drop=True)

Warning

This function does NOT handle NULL values correctly!

Parameters
Returns

df – Sorted DataFrame w/ reset index.

Return type

pandas.DataFrame
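
For illustration, the equivalent pandas call applied to a small frame with made-up columns and no NULL values:

```python
import pandas as pd

# Hypothetical unsorted frame.
df = pd.DataFrame({"a": [2, 1, 1], "b": [0, 2, 1]})

# Sort by "a", then "b", and replace the old index with 0..n-1.
sorted_df = df.sort_values(["a", "b"]).reset_index(drop=True)
```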