kartothek.api.consistency module

Functions for checking a persisted cube for consistency.

kartothek.api.consistency.check_datasets(datasets: Dict[str, kartothek.core.dataset.DatasetMetadata], cube: kartothek.core.cube.cube.Cube) → Dict[str, kartothek.core.dataset.DatasetMetadata]

Apply sanity checks to persisted Kartothek datasets.

The following checks will be applied:

  • seed dataset present

  • metadata version correct

  • only the cube-specific table is present

  • partition keys are correct

  • no overlapping payload columns exist

  • datatypes are consistent

  • dimension columns are present everywhere

  • required index structures are present (more are allowed)

    • PartitionIndex for every partition key

    • for seed dataset, ExplicitSecondaryIndex for every dimension column

    • for all datasets, ExplicitSecondaryIndex for every index column

Parameters
  • datasets – Datasets to check.

  • cube – Cube specification.

Returns

datasets – Same as input, but with partition indices loaded.

Return type

Dict[str, DatasetMetadata]

Raises

ValueError – If a sanity check fails.
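
Two of the checks listed above, the seed dataset being present and payload columns not overlapping, can be illustrated without Kartothek itself. The sketch below is a simplified illustration under stated assumptions, not the library's implementation: check_columns_sketch is a hypothetical helper, each dataset is represented by a plain set of column names rather than a DatasetMetadata object, and "payload" is assumed to mean any column that is neither a dimension column nor a partition column.

```python
from typing import Dict, Iterable, Set


def check_columns_sketch(
    dataset_columns: Dict[str, Set[str]],
    seed_dataset: str,
    dimension_columns: Iterable[str],
    partition_columns: Iterable[str],
) -> None:
    """Hypothetical stand-in for two of the checks; raises ValueError on failure."""
    # Check: seed dataset present.
    if seed_dataset not in dataset_columns:
        raise ValueError(f"Seed dataset '{seed_dataset}' is missing.")

    # Assumption: payload = columns that are neither dimension nor partition columns.
    non_payload = set(dimension_columns) | set(partition_columns)

    # Check: no payload column may appear in more than one dataset.
    seen: Dict[str, str] = {}  # payload column -> dataset that first contained it
    for name in sorted(dataset_columns):
        for column in sorted(dataset_columns[name] - non_payload):
            if column in seen:
                raise ValueError(
                    f"Payload column '{column}' appears in both "
                    f"'{seen[column]}' and '{name}'."
                )
            seen[column] = name
```

As with check_datasets, a failed check surfaces as a ValueError, so callers can wrap the call in a try/except block and report the message.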

kartothek.api.consistency.get_cube_payload(datasets: Dict[str, kartothek.core.dataset.DatasetMetadata], cube: kartothek.core.cube.cube.Cube) → Set[str]

Get payload columns of the whole cube.

Parameters
  • datasets – Datasets that make up the cube.

  • cube – Cube specification.

Returns

payload – Payload columns.

Return type

Set[str]
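
Conceptually, the payload of the whole cube is the union of the payload columns of its datasets. The following is a minimal sketch of that idea, not the library's code: cube_payload_sketch is a hypothetical name, datasets are modelled as plain sets of column names, and payload is assumed to be every column that is neither a dimension column nor a partition column.

```python
from typing import Dict, Iterable, Set


def cube_payload_sketch(
    dataset_columns: Dict[str, Set[str]],
    dimension_columns: Iterable[str],
    partition_columns: Iterable[str],
) -> Set[str]:
    """Hypothetical helper: union of per-dataset payload columns."""
    # Assumption: dimension and partition columns never count as payload.
    non_payload = set(dimension_columns) | set(partition_columns)

    payload: Set[str] = set()
    for columns in dataset_columns.values():
        payload |= set(columns) - non_payload
    return payload


example = cube_payload_sketch(
    {"seed": {"x", "p", "v1"}, "enrich": {"x", "p", "v2"}},
    dimension_columns=["x"],
    partition_columns=["p"],
)
print(sorted(example))  # prints ['v1', 'v2']
```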

kartothek.api.consistency.get_payload_subset(columns: Iterable[str], cube: kartothek.core.cube.cube.Cube) → Set[str]

Get payload column subset from a given set of columns.

Parameters
  • columns – Columns to select the payload subset from.

  • cube – Cube specification.

Returns

payload – Payload columns.

Return type

Set[str]
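
The filtering step can be pictured as a set difference. The sketch below assumes that payload columns are those left over once dimension and partition columns are removed. CubeSpec and payload_subset_sketch are hypothetical stand-ins introduced only for this illustration; CubeSpec mimics just the two Cube attributes the sketch needs.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class CubeSpec(NamedTuple):
    """Hypothetical, minimal stand-in for a cube specification."""

    dimension_columns: Tuple[str, ...]
    partition_columns: Tuple[str, ...]


def payload_subset_sketch(columns: Iterable[str], cube: CubeSpec) -> Set[str]:
    """Keep only the columns that are neither dimension nor partition columns."""
    return set(columns) - set(cube.dimension_columns) - set(cube.partition_columns)


spec = CubeSpec(dimension_columns=("x", "y"), partition_columns=("p",))
print(sorted(payload_subset_sketch(["x", "y", "p", "v1", "v2"], spec)))
# prints ['v1', 'v2']
```

Returning a set rather than a list matches the documented return type and makes the result independent of the order of the input columns.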