Table type system

This document explains the type system of Kartothek.

Motivation

To understand why and how the type system was designed, we illustrate the requirements and use cases Kartothek should cover:

  • simplicity: For the average programmer, it should be possible to understand the semantics of Kartothek quickly.
  • optimized data representation: Data providers (humans and software) should be able to pick a memory representation that is sufficient to hold the data in question (e.g. pick an 8-bit unsigned integer if you only have integer up to 255).
  • lifecycle management: A dataset may be used for many months and can be extended over this time. Also, the software used with the dataset might change.
  • efficiency: Data written once should not be altered if not really necessary.
  • stability: Kartothek is made for production use. It should provide stable, predicable outputs and must not crash or lead other libraries to behave in an expected way.
  • compatibility: Kartothek is not the only software in the stack and must play nicely with others (also see Related Type Systems).

Base Type System

Since we think that Apache Arrow is a solid, future-proof choice for in-memory dataframes and will hopefully offer a sane alternative for many Pandas use cases, we opt for Apache Arrow as our type system baseline. In contrast to NumPy and Pandas, it offers better support for nested types, is more strict about type conversions and can also be used when interacting with other languages like R, Julia and Java.

Note

Even though we have chosen Apache Arrow as our type system, we will demonstrate many things using NumPy since it enables us to easily play around with fixed-size scalar values.

Note

Most user interaction with Kartothek are likely done by using Pandas DataFrames. Please consolidate the pyarrow documentation on Type differences to see how different Pandas constructs map to Apache Arrow.

Type Classes

This section shows how and why certain types are treated as compatible. This is required to enable partition-based memory and performance optimization. The rules for picking compatible type classes (i.e. a set of types which can be mapped to single container type) are:

  • feasibility: the container type must be able to hold all values of all types in the type class
  • semantics: types within a type class should be considered to be semantically compatible

Bool

Booleans (short “bool”) are the most basic unit of information. A bool can either be true or false. It is a quite handy bit of information to store things like “is this account active or not”.

Apache Arrow knows a single type here, bool, which does not require any normalization on its own.

Unsigned Integers

Unsigned integers are used to store whole non-negative numbers with numerical information (often counts like number of cars) or to hold IDs (like a numeric ID generated by a database).

The maximum value of an unsigned integer with a given bitwidth \(b\) is \(2^b - 1\). The minimum value is always 0. All whole numbers between minimum and maximum can be represented. The following table lists the ranges for all supported unsigned integer types.

Type Minimum Maximum
uint8 0 255
uint16 0 65,535
uint32 0 4,294,967,295
uint64 0 18,446,744,073,709,551,615

Therefore, all values of all supported unsigned integers types can be represented by uint64.

Warning

Since many libraries do not guarantee that small unsigned integers (like uint8) are kept small (Pandas for example chooses uint64 whenever it detects an overflow), do NOT rely on the exact bitwidth and pattern. If you want to encode bitstream data, use bytes!

Signed Integers

Signed integers are used to store whole numbers with numerical information (like delta of number of cars) or can also be used to hold index data like array offsets.

They work similar to unsigned integers, but the range is different. For a given bitwidth \(b\), the minimum is \(-(2^{b - 1})\) and the maximum is \(2^{b - 1} - 1\). The following table listsa all range of all supported signed integer types.

Type Minimum Maximum
int8 -128 127
int16 -32,768 32,767
int32 -2,147,483,648 2,147,483,647
int64 -9,223,372,036,854,775,808 9,223,372,036,854,775,807

Therefore, all values of all supported signed integer types can be represented by int64.

Floats

Floating point values (often just called “floats”) are used to store quantitative data where precision is often only relatively important. For example, if you want to know the age of the universe in milliseconds (around 4.35e+20), a millisecond more or less might not make the difference.

The floating point types float16, float32, and float64 are defined in IEEE 754 and are called binary{16, 32, 64} there.

Type Fraction Bits Exponent Bits
float16 11 5
float32 24 8
float64 53 11

Values are stored, depending on the bit-pattern in the exponent bits, as:

  • normalized: \((-1)^\text{signbit} \times 1.\text{fraction}_2 \times 2^{\text{exponent}_2 - \text{expbias}}\)
  • subnormals: \((-1)^\text{signbit} \times 0.\text{fraction}_2 \times 2^{1 - \text{expbias}}\)
  • zero: \((-1)^\text{signbit} \times 0\)
  • infinity: \((-1)^\text{signbit} \times \infty\)

where \(\text{expbias} = 2^{\#\text{exponentbits}}-1\).

Important

Signaling NaN values are discouraged and should not be used!

Important

NaN payloads are not handled and should not be used. The IEEE 754 declares them as optional and hardware and software may wipe them anyway, so portable code cannot make use this data.

For each of these categories, we can represent all values of float{16, 32} by using float64. So we normalize all floating point types to float64.

Decimal

Decimals have a given precision and scale and used to store fixed-point floats like money.

There is a single decimal type decimal128[P, S] where P measures the total precision in digits and S measures the scale in digits (therefore \(P \ge S\)).

>>> from decimal import Context, Decimal
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "profit_eu": [Decimal("110.12"), Decimal("20.00")],
...     "reveneu_eu": [Decimal("20.00"), Decimal("1000.00")],
...     "profit_lyd": [Decimal("0.0000"), Decimal("22.1050")],
...     "reveneu_lyd": [Decimal("0.0000"), Decimal("200.0000")],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("profit_eu").type
Decimal128Type(decimal(5, 2))
>>> schema.field("reveneu_eu").type
Decimal128Type(decimal(6, 2))
>>> schema.field("profit_lyd").type
Decimal128Type(decimal(6, 4))
>>> schema.field("reveneu_lyd").type
Decimal128Type(decimal(7, 4))

As shown, not only the scale changes for various numbers but also the precision is bound to the largest number. While the scale-handling makes sense (currencies should not be mixed), the precision-handling is unfortunate and may lead to various problems.

We currently do not implement a normalization. This might change in future metadata versions.

Warning

Because no normalization is implemented for different decimal precisions, we strongly advice against using them in Kartothek.

Date

Dates are normally used to to store “which day it is”.

There are two date types with slightly different semantics:

  • date32: 32bit unsigned integer counter for days since UNIX epoch
  • date64: 64bit unsigned integer counter for milliseconds since UNIX epoch

In theory, we could fit all date32 values into date64:

>>> import math
>>> n_years_date32 = math.floor(2**32 / 366)
>>> n_years_date64 = math.floor(2**64 / (366 * 24 * 3600 * 1000))
>>> n_years_date32, n_years_date64
(11734883, 583344214)
>>> n_years_date64 > n_years_date32
True

Since date64 is a very rarely used, this normalization is currently NOT implemented. This might change in a future metadata version.

Note

Date in Pandas can only be used by using an object column with datetime.date objects. Since this is neither backed by NumPy nor has a special implementation in Pandas, this might be too slow and memory intensive for certain use cases. There are the following known workarounds:

  • timestamps: Timestamps are backed by NumPy using the datetime64 type and map directly to integer-like data and arithmetics. Use “midn” as a time (e.g. 2019-05-21 00:00:00) and most features including Pandas support work as expected.
  • extension types: Using Extension Types would make it possible to have proper, fast date types in Pandas. Note that this would also require to either convert them back and forth before/after the Kartothek interaction or to teach pyarrow about them.

Time

This is the colleague of Date and stores the time at a given day.

The normalization of time32[U] and time64[U] (where U is either "s" for seconds or "ms" for milliseconds) is currently not implemented. This might change in a future metadata version.

Timestamp

A combination of Date and Time and is particularly useful to store when an event occurred without the need to store date and time separately.

There is a single, parametrized timestamp type called timestamp[U, Z] (where U is any of "s" for seconds, "ms" for milliseconds, "us" for microseconds, "ns" for nanoseconds; and Z stands for the timezone). It occupies 64bits.

We cannot treat timestamps for different timezones as the same time because the timezone parameter has important semantic meaning. We also cannot treat timestamps with different unit types as same since they all have very different ranges. So, no normalization is implemented for timestamps.

Lists

They are used to store a set of elements in a fixed order, like a list of cities to visit, or a plan how to connect given points to draw a panda.

Lists in Apache Arrow have a homogeneous element type. We can therefore assume that they can be optimized for certain partitions similar to other data types. We therefore treat lists with compatible element types as compatible, i.e. list[T1] and list[T2] are compatible iff T1 and T2 are compatible.

Structs

Structures (short “structs”) might be the most complex data type. They are used to store a collection of other data types, like all ID card information (containing name, the birthday and a picture). They can even be nested, i.e. a struct can hold another struct.

Normalization for structs is currently not implemented but might be in future releases.

Incompatibilities

This section points out why we treat certain type classes as incompatible, also in contrast to other libraries.

Signed / Unsigned Integer

This section shows why signed and unsigned integers are two distinct type classes.

Let us assume we represent signed and unsigned integers by the largest available types, int64 and uint64. If they would be in the same type class, either int64 or uint64 should than be able to represent all values of the other. This however, does not work for uint64 because it cannot represent negative numbers. For int64, this also is not feasible because the range \((9223372036854775807, 18446744073709551615]\) cannot be represented (this is the range int64 sacrifices to be able to represent negative numbers):

>>> import numpy as np
>>> x = ~np.uint64(0)
>>> y = np.int64(x)
>>> x, y
(18446744073709551615, -1)

Now you could represent uint{8, 16, 32} (w/o uint64) with int64, but making uint64 special would be confusing and also contradict the illustrated optimization use case.

Important

This is different to Dask and Pandas.

Float / Integer

Looking at the range of float64, it may be feasible to just pack all integers into a floating point values and everything is fine. This is what Pandas is doing by default. Since a float64 only has 53 fraction bits, it cannot store all 64 bit integers:

>>> import numpy as np
>>> x = np.int64((1 << 53) + 1)
>>> y = np.int64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)
>>> import numpy as np
>>> x = np.uint64((1 << 53) + 1)
>>> y = np.uint64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)

Integers might hold IDs which are by nature rather categorical than numeric. There, these tiny errors might lead to wrong / unpredictable results or crashes, we decided to treat integers and floats as distinct type classes.

Important

This is different to Dask and Pandas.

String / Binary

Not all binary values are valid Unicode, e.g.:

>>> b"\xff".decode("utf8")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Furthermore, the encoding of Unicode strings is not per se defined. It might be UTF-8, UTF-16, UTF-32, or something completely different. For that reason, we also cannot just represent all string values with binary.

This incompatibility is also supported by the semantic meaning that binary data might be any bitstream (like image data, crypto keys, Thrift bitstreams) and string is reserved for text-like data.

Note

This was especially problematic under Python 2, where the content of str was undefined and unicode was not the default choice of many libraries like Pandas. Under Python 3, this is now clarified (str are always Unicode), so it is easier for users to produce and consume proper string data.

Bool / Integer

We could encode booleans as signed or unsigned integer (False -> 0 and True -> 1), but decided against it for the following reasons:

  • semantic: Integers and booleans have a different meaning. Also, it is not always clear that False and True are mapped to 0 and 1.

  • optimization: Booleans are clearly more efficient than integers and we would like to preserve that extreme advantage.

  • library support: Pandas for example makes a difference depending if a column contains boolean or integer data:

    >>> import pandas as pd
    >>> df = pd.DataFrame({
    ...     "b": [False, True],
    ...     "i": [0, 1],
    ... })
    >>> df.dtypes
    b     bool
    i    int64
    dtype: object
    >>> ~df["b"]
    0     True
    1    False
    Name: b, dtype: bool
    >>> ~df["i"]
    0   -1
    1   -2
    Name: i, dtype: int64
    

Null

While null has a semantic meaning, they can easily occur in production due to the type inference that pyarrow has to do when working with pandas dataframes:

>>> import pandas as p
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "single_value": [None, "foo", None],
...     "no_value": [None, None, None],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("single_value").type
DataType(string)
>>> schema.field("no_value").type
DataType(null)

The reason is that string and also data objects are stored as object columns in pandas, which can contain arbitrary python objects. None acts as a placeholder “missing value”. Apache Arrow requires that values in a columns have one single type and therefore needs to guess what an object column should represent (i.e. type inference). If pyarrow does not find any non-Null object, it treats the column as null. Sadly, this might be wrong. It could easily also have meant to be a string or date32 column, but pyarrow cannot know that.

To keep things pragmatic, we ignore null during type checks.

Dictionary Encoding

Dictionary encoded data is normally produced by Pandas categoricals:

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
...     "s": pd.Series(["foo", "foo", "bar"]).astype("category"),
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("s").type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

They have the form dictionary[T, I, O] where T represents the value type, I the index type (mostly integers) and O flags if the index is ordered or not.

Since categoricals are, in our opinion, a pure optimization and do not alter the nature of the data, we treat dictionary-encoded data like the values they encode. So dictionary[T1, I1, O1] is compatible with T2 if T1 and T2 are compatible. This also means that it is compatible with dictionary[T2, I2, O2]. Note that the ordered flag and the index data type are ignored. So the values in the example shown above are treated like string.

Normalization

Following the outlined guidelines, we can write down the following normalization rule set:

Type Class Normalization Examples
signed integer int{8, 16, 32, 64} -> int64
norm(int8) = int64
norm(int64) = int64
unsigned integer uint{8, 16, 32, 64} -> uint64
norm(uint8) = uint64
norm(uint64) = uint64
float float{16, 32, 64} -> float64
norm(float8) = float64
norm(float64) = float64
list list[T] -> list[norm(T)]
norm(list[int8]) = list[int64]
norm(list[int64]) = list[int64]
norm(list[list[int8]]) = list[list[int64]]
norm(list[string]) = list[string]
norm(list[dictionary[int8, int8, 1]]) = list[int64]
dictionary dictionary[T, I, O] -> norm(T)
norm(dictionary[str, int8, 0]) = str
norm(dictionary[int8, int16, 1]) = int64
norm(dictionary[list[int8], int8, 1]) = list[int64]

Technical Implementation

There are three sources of type information:

  • partition parquet files: the actual payload data written to the different parquet files
  • common metadata: the metadata that offers a quick introspection and is also used to recover type information for partition indices since they are stored as strings and are part of the payload storage keys
  • secondary indices: parquet with secondary index information are typed

The ground truth for type information is the common metadata file. There, the outlined normalization is applied. The payload data and the secondary indices may have any type that belongs to the correct type class, i.e. where norm(T_payload) equals T_common_metadata.