Inside Parquet Format

Parquet-format
Yue Chen
http://linkedin.com/in/yuechen2
http://dataera.wordpress.com

Goal
To have a state-of-the-art columnar storage available across the Hadoop platform
Hadoop is very reliable for big long running queries but also IO heavy.
Incrementally take advantage of column based storage in existing framework.
Not tied to any framework in particular
Can be used to store nested data

Columnar Storage
Limits IO to data actually needed:
loads only the columns that need to be accessed.
Saves space:
Columnar layout compresses better
Type specific encodings.
Enables vectorized execution engines.

Columnar Storage
row-oriented storage
colume-oriented storage

The Model
Schema:
required: exactly one occurrence
optional: 0 or 1 occurrence
repeated: 0 or more occurrences
Example:

The Model
Lists (or Sets) can be represented by a repeating field.

The Model
A Map is equivalent to a repeating field containing groups of key-value pairs where the key is required.

Table Format

How to store?
The structure of the record is captured for each value by two integers called repetition level and definition level. Using definition and repetition levels, we can fully reconstruct the nested structures.

Definition Levels
To support nested records we need to store the level for which the field is null.

Definition Levels Example

Definition Levels More Example
The maximum definition level is now 2 as b does not need one.(cannot be null)

Repetition levels
To support repeated fields we need to store when new lists are starting in a column of values.
The repetition level can be seen as a marker of when to start a new list and at which level.

Repetition levels Example
0 marks every new record and implies creating a new level1 and level2 list; 1 marks every new level1 list and implies creating a new level2 list as well; 2 marks every new element in a level2 list;

Repetition levels Example

Summary Example

We’ll now focus on the column contacts.phoneNumber to illustrate this.
Summary Example

To write the column we iterate through the record data for this column:
contacts.phoneNumber: “555 987 6543”
new record: R = 0
value is defined: D = maximum (2)
contacts.phoneNumber: null
repeated contacts: R = 1
only defined up to contacts: D = 1
contacts: null
new record: R = 0
only defined up to AddressBook: D = 0
Summary Example

The columns contains the following data:
Summary Example

To reconstruct the records from the column, we iterate through the column:
R=0, D=2, Value = “555 987 6543”:
R = 0 means a new record. We recreate the nested records from the root until the definition level (here 2)
D = 2 which is the maximum. The value is defined and is inserted.
R=1, D=1:
R = 1 means a new entry in the contacts list at level 1.
D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts.
R=0, D=0:
R = 0 means a new record. we create the nested records from the root until the definition level
D = 0 => contacts is actually null, so we only have an empty AddressBook
Summary Example

Compression Codecs
Snappy, GZIP; currently Snappy by default

Reference
https://blog.twitter.com/2013/dremel-made-simple-with- parquet

Inside Parquet Format

More Related Content

Inside Parquet Format