# Table type system¶

This document explains the type system of Kartothek.

## Motivation¶

To understand why and how the type system was designed, we illustrate the requirements and use cases Kartothek should cover:

**simplicity:**For the average programmer, it should be possible to understand the semantics of Kartothek quickly.**optimized data representation:**Data providers (humans and software) should be able to pick a memory representation that is sufficient to hold the data in question (e.g. pick an 8-bit unsigned integer if you only have integer up to 255).**lifecycle management:**A dataset may be used for many months and can be extended over this time. Also, the software used with the dataset might change.**efficiency:**Data written once should not be altered if not really necessary.**stability:**Kartothek is made for production use. It should provide stable, predicable outputs and must not crash or lead other libraries to behave in an expected way.**compatibility:**Kartothek is not the only software in the stack and must play nicely with others (also see Related Type Systems).

## Base Type System¶

Since we think that Apache Arrow is a solid, future-proof choice for in-memory dataframes and will hopefully offer a sane alternative for many Pandas use cases, we opt for Apache Arrow as our type system baseline. In contrast to NumPy and Pandas, it offers better support for nested types, is more strict about type conversions and can also be used when interacting with other languages like R, Julia and Java.

Note

Even though we have chosen Apache Arrow as our type system, we will demonstrate many things using NumPy since it enables us to easily play around with fixed-size scalar values.

Note

Most user interaction with Kartothek are likely done by using Pandas DataFrames. Please consolidate the pyarrow documentation on Type differences to see how different Pandas constructs map to Apache Arrow.

## Type Classes¶

This section shows how and why certain types are treated as compatible. This is required to enable partition-based memory and performance optimization. The rules for picking compatible type classes (i.e. a set of types which can be mapped to single container type) are:

**feasibility:**the container type must be able to hold all values of all types in the type class**semantics:**types within a type class should be considered to be semantically compatible

### Bool¶

Booleans (short “bool”) are the most basic unit of information. A bool can either be true or false. It is a quite handy bit of information to store things like “is this account active or not”.

Apache Arrow knows a single type here, `bool`

, which does not require any normalization on its own.

### Unsigned Integers¶

Unsigned integers are used to store whole non-negative numbers with numerical information (often counts like number of cars) or to hold IDs (like a numeric ID generated by a database).

The maximum value of an unsigned integer with a given bitwidth \(b\) is \(2^b - 1\). The minimum value is always 0. All whole numbers between minimum and maximum can be represented. The following table lists the ranges for all supported unsigned integer types.

Type | Minimum | Maximum |
---|---|---|

`uint8` |
0 | 255 |

`uint16` |
0 | 65,535 |

`uint32` |
0 | 4,294,967,295 |

`uint64` |
0 | 18,446,744,073,709,551,615 |

Therefore, all values of all supported unsigned integers types can be represented by `uint64`

.

Warning

Since many libraries do not guarantee that small unsigned integers (like `uint8`

) are kept small (Pandas for
example chooses `uint64`

whenever it detects an overflow), do NOT rely on the exact bitwidth and pattern. If you
want to encode bitstream data, use `bytes`

!

### Signed Integers¶

Signed integers are used to store whole numbers with numerical information (like delta of number of cars) or can also be used to hold index data like array offsets.

They work similar to unsigned integers, but the range is different. For a given bitwidth \(b\), the minimum is \(-(2^{b - 1})\) and the maximum is \(2^{b - 1} - 1\). The following table listsa all range of all supported signed integer types.

Type | Minimum | Maximum |
---|---|---|

`int8` |
-128 | 127 |

`int16` |
-32,768 | 32,767 |

`int32` |
-2,147,483,648 | 2,147,483,647 |

`int64` |
-9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |

Therefore, all values of all supported signed integer types can be represented by `int64`

.

### Floats¶

Floating point values (often just called “floats”) are used to store quantitative data where precision is often only relatively important. For example, if you want to know the age of the universe in milliseconds (around 4.35e+20), a millisecond more or less might not make the difference.

The floating point types `float16`

, `float32`

, and `float64`

are defined in IEEE 754 and are called
`binary{16, 32, 64}`

there.

Type | Fraction Bits | Exponent Bits |
---|---|---|

`float16` |
11 | 5 |

`float32` |
24 | 8 |

`float64` |
53 | 11 |

Values are stored, depending on the bit-pattern in the exponent bits, as:

**normalized:**\((-1)^\text{signbit} \times 1.\text{fraction}_2 \times 2^{\text{exponent}_2 - \text{expbias}}\)**subnormals:**\((-1)^\text{signbit} \times 0.\text{fraction}_2 \times 2^{1 - \text{expbias}}\)**zero:**\((-1)^\text{signbit} \times 0\)**infinity:**\((-1)^\text{signbit} \times \infty\)

where \(\text{expbias} = 2^{\#\text{exponentbits}}-1\).

Important

Signaling NaN values are discouraged and should not be used!

Important

NaN payloads are not handled and should not be used. The IEEE 754 declares them as optional and hardware and software may wipe them anyway, so portable code cannot make use this data.

For each of these categories, we can represent all values of `float{16, 32}`

by using `float64`

. So we normalize all
floating point types to `float64`

.

### Decimal¶

Decimals have a given precision and scale and used to store fixed-point floats like money.

There is a single decimal type `decimal128[P, S]`

where `P`

measures the total precision in digits and `S`

measures the scale in digits (therefore \(P \ge S\)).

```
>>> from decimal import Context, Decimal
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "profit_eu": [Decimal("110.12"), Decimal("20.00")],
... "reveneu_eu": [Decimal("20.00"), Decimal("1000.00")],
... "profit_lyd": [Decimal("0.0000"), Decimal("22.1050")],
... "reveneu_lyd": [Decimal("0.0000"), Decimal("200.0000")],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("profit_eu").type
Decimal128Type(decimal(5, 2))
>>> schema.field("reveneu_eu").type
Decimal128Type(decimal(6, 2))
>>> schema.field("profit_lyd").type
Decimal128Type(decimal(6, 4))
>>> schema.field("reveneu_lyd").type
Decimal128Type(decimal(7, 4))
```

As shown, not only the scale changes for various numbers but also the precision is bound to the largest number. While the scale-handling makes sense (currencies should not be mixed), the precision-handling is unfortunate and may lead to various problems.

We currently do not implement a normalization. This might change in future metadata versions.

Warning

Because no normalization is implemented for different decimal precisions, we strongly advice against using them in Kartothek.

### Date¶

Dates are normally used to to store “which day it is”.

There are two date types with slightly different semantics:

`date32`

: 32bit unsigned integer counter for days since UNIX epoch`date64`

: 64bit unsigned integer counter for milliseconds since UNIX epoch

In theory, we could fit all `data32`

values into `date64`

:

```
>>> import math
>>> n_years_date32 = math.floor(2**32 / 366)
>>> n_years_date64 = math.floor(2**64 / (366 * 24 * 3600 * 1000))
>>> n_years_date32, n_years_date64
(11734883, 583344214)
>>> n_years_date64 > n_years_date32
True
```

Since `date64`

is a very rarely used, this normalization is currently NOT implemented. This might change in a future
metadata version.

Note

Date in Pandas can only be used by using an `object`

column with `datetime.date`

objects. Since this is
neither backed by NumPy nor has a special implementation in Pandas, this might be too slow and memory intensive
for certain use cases. There are the following known workarounds:

**timestamps:**Timestamps are backed by NumPy using the`datetime64`

type and map directly to integer-like data and arithmetics. Use “midn” as a time (e.g.`2019-05-21 00:00:00`

) and most features including Pandas support work as expected.**extension types:**Using Extension Types would make it possible to have proper, fast date types in Pandas. Note that this would also require to either convert them back and forth before/after the Kartothek interaction or to teach pyarrow about them.

### Time¶

This is the colleague of Date and stores the time at a given day.

The normalization of `time32[U]`

and `time64[U]`

(where `U`

is either `"s"`

for seconds or `"ms"`

for
milliseconds) is currently not implemented. This might change in a future metadata version.

### Timestamp¶

A combination of Date and Time and is particularly useful to store when an event occurred without the need to store date and time separately.

There is a single, parametrized timestamp type called `timestamp[U, Z]`

(where `U`

is any of `"s"`

for seconds,
`"ms"`

for milliseconds, `"us"`

for microseconds, `"ns"`

for nanoseconds; and `Z`

stands for the timezone). It
occupies 64bits.

We cannot treat timestamps for different timezones as the same time because the timezone parameter has important semantic meaning. We also cannot treat timestamps with different unit types as same since they all have very different ranges. So, no normalization is implemented for timestamps.

### Lists¶

They are used to store a set of elements in a fixed order, like a list of cities to visit, or a plan how to connect given points to draw a panda.

Lists in Apache Arrow have a homogeneous element type. We can therefore assume that they can be optimized for certain
partitions similar to other data types. We therefore treat lists with compatible element types as compatible, i.e.
`list[T1]`

and `list[T2]`

are compatible iff `T1`

and `T2`

are compatible.

### Structs¶

Structures (short “structs”) might be the most complex data type. They are used to store a collection of other data types, like all ID card information (containing name, the birthday and a picture). They can even be nested, i.e. a struct can hold another struct.

Normalization for structs is currently not implemented but might be in future releases.

## Incompatibilities¶

This section points out why we treat certain type classes as incompatible, also in contrast to other libraries.

### Signed / Unsigned Integer¶

This section shows why signed and unsigned integers are two distinct type classes.

Let us assume we represent signed and unsigned integers by the largest available types, `int64`

and `uint64`

. If
they would be in the same type class, either `int64`

or `uint64`

should than be able to represent all values of the
other. This however, does not work for `uint64`

because it cannot represent negative numbers. For `int64`

, this also
is not feasible because the range \((9223372036854775807, 18446744073709551615]\) cannot be represented (this is the
range `int64`

sacrifices to be able to represent negative numbers):

```
>>> import numpy as np
>>> x = ~np.uint64(0)
>>> y = np.int64(x)
>>> x, y
(18446744073709551615, -1)
```

Now you could represent `uint{8, 16, 32}`

(w/o `uint64`

) with `int64`

, but making `uint64`

special would be
confusing and also contradict the illustrated optimization use case.

Important

This is different to Dask and Pandas.

### Float / Integer¶

Looking at the range of `float64`

, it may be feasible to just pack all integers into a floating point values and
everything is fine. This is what Pandas is doing by default. Since a `float64`

only has 53
fraction bits, it cannot store all 64 bit integers:

```
>>> import numpy as np
>>> x = np.int64((1 << 53) + 1)
>>> y = np.int64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)
```

```
>>> import numpy as np
>>> x = np.uint64((1 << 53) + 1)
>>> y = np.uint64(np.float64(x))
>>> x, y
(9007199254740993, 9007199254740992)
```

Integers might hold IDs which are by nature rather categorical than numeric. There, these tiny errors might lead to wrong / unpredictable results or crashes, we decided to treat integers and floats as distinct type classes.

Important

This is different to Dask and Pandas.

### String / Binary¶

Not all `binary`

values are valid Unicode, e.g.:

```
>>> b"\xff".decode("utf8")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```

Furthermore, the encoding of Unicode strings is not per se defined. It might be UTF-8, UTF-16, UTF-32, or something
completely different. For that reason, we also cannot just represent all `string`

values with `binary`

.

This incompatibility is also supported by the semantic meaning that `binary`

data might be any bitstream (like image
data, crypto keys, Thrift bitstreams) and `string`

is reserved for text-like data.

Note

This was especially problematic under Python 2, where the content of `str`

was undefined and `unicode`

was not
the default choice of many libraries like Pandas. Under Python 3, this is now clarified (`str`

are always
Unicode), so it is easier for users to produce and consume proper string data.

### Bool / Integer¶

We could encode booleans as signed or unsigned integer (`False -> 0`

and `True -> 1`

), but decided against it for
the following reasons:

**semantic:**Integers and booleans have a different meaning. Also, it is not always clear that`False`

and`True`

are mapped to`0`

and`1`

.**optimization:**Booleans are clearly more efficient than integers and we would like to preserve that extreme advantage.**library support:**Pandas for example makes a difference depending if a column contains boolean or integer data:>>> import pandas as pd >>> df = pd.DataFrame({ ... "b": [False, True], ... "i": [0, 1], ... }) >>> df.dtypes b bool i int64 dtype: object >>> ~df["b"] 0 True 1 False Name: b, dtype: bool >>> ~df["i"] 0 -1 1 -2 Name: i, dtype: int64

## Null¶

While `null`

has a semantic meaning, they can easily occur in production due to the type inference that pyarrow has to
do when working with pandas dataframes:

```
>>> import pandas as p
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "single_value": [None, "foo", None],
... "no_value": [None, None, None],
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("single_value").type
DataType(string)
>>> schema.field("no_value").type
DataType(null)
```

The reason is that string and also data objects are stored as `object`

columns in pandas, which can contain arbitrary
python objects. `None`

acts as a placeholder “missing value”. Apache Arrow requires that values in a columns have
one single type and therefore needs to guess what an `object`

column should represent (i.e. type inference). If
pyarrow does not find any non-Null object, it treats the column as `null`

. Sadly, this might be wrong. It could easily
also have meant to be a `string`

or `date32`

column, but pyarrow cannot know that.

To keep things pragmatic, we ignore `null`

during type checks.

## Dictionary Encoding¶

Dictionary encoded data is normally produced by Pandas categoricals:

```
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({
... "s": pd.Series(["foo", "foo", "bar"]).astype("category"),
... })
>>> schema = pa.Schema.from_pandas(df)
>>> schema.field("s").type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
```

They have the form `dictionary[T, I, O]`

where `T`

represents the value type, `I`

the index type (mostly integers)
and `O`

flags if the index is ordered or not.

Since categoricals are, in our opinion, a pure optimization and do not alter the nature of the data, we treat
dictionary-encoded data like the values they encode. So `dictionary[T1, I1, O1]`

is compatible with `T2`

if `T1`

and `T2`

are compatible. This also means that it is compatible with `dictionary[T2, I2, O2]`

. Note that the ordered
flag and the index data type are ignored. So the values in the example shown above are treated like `string`

.

## Normalization¶

Following the outlined guidelines, we can write down the following normalization rule set:

Type Class | Normalization | Examples |
---|---|---|

signed integer | `int{8, 16, 32, 64} -> int64` |
`norm(int8) = int64` `norm(int64) = int64` |

unsigned integer | `uint{8, 16, 32, 64} -> uint64` |
`norm(uint8) = uint64` `norm(uint64) = uint64` |

float | `float{16, 32, 64} -> float64` |
`norm(float8) = float64` `norm(float64) = float64` |

list | `list[T] -> list[norm(T)]` |
`norm(list[int8]) = list[int64]` `norm(list[int64]) = list[int64]` `norm(list[list[int8]]) = list[list[int64]]` `norm(list[string]) = list[string]` `norm(list[dictionary[int8, int8, 1]]) = list[int64]` |

dictionary | `dictionary[T, I, O] -> norm(T)` |
`norm(dictionary[str, int8, 0]) = str` `norm(dictionary[int8, int16, 1]) = int64` `norm(dictionary[list[int8], int8, 1]) = list[int64]` |

## Technical Implementation¶

There are three sources of type information:

**partition parquet files:**the actual payload data written to the different parquet files**common metadata:**the metadata that offers a quick introspection and is also used to recover type information for partition indices since they are stored as strings and are part of the payload storage keys**secondary indices:**parquet with secondary index information are typed

The ground truth for type information is the common metadata file. There, the outlined normalization is applied. The
payload data and the secondary indices may have any type that belongs to the correct type class, i.e. where
`norm(T_payload)`

equals `T_common_metadata`

.