The document discusses Parquet, a column-oriented data storage format for Hadoop. Parquet aims to provide efficient column-based storage across Hadoop platforms by limiting I/O to only the needed columns and supporting nested data structures. It uses definition and repetition levels to reconstruct nested data representations from columns and allows for different compression codecs like Snappy and GZIP.
2. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Goal
To have a state-of-the-art columnar storage available across the Hadoop platform
Hadoop is very reliable for big long running queries but also IO heavy.
Incrementally take advantage of column based storage in existing framework.
Not tied to any framework in particular
Can be used to store nested data
3. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Columnar Storage
Limits IO to data actually needed:
loads only the columns that need to be accessed.
Saves space:
Columnar layout compresses better
Type specific encodings.
Enables vectorized execution engines.
9. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
How to store?
The structure of the record is captured for each value by two integers called repetition level and definition level. Using definition and repetition levels, we can fully reconstruct the nested structures.
14. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
Repetition levels Example
0 marks every new record and implies creating a new level1 and level2 list; 1 marks every new level1 list and implies creating a new level2 list as well; 2 marks every new element in a level2 list;
18. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
To write the column we iterate through the record data for this column:
contacts.phoneNumber: “555 987 6543”
new record: R = 0
value is defined: D = maximum (2)
contacts.phoneNumber: null
repeated contacts: R = 1
only defined up to contacts: D = 1
contacts: null
new record: R = 0
only defined up to AddressBook: D = 0
Summary Example
20. http://dataera.wordpress.com
http://linkedin.com/in/yuechen2
To reconstruct the records from the column, we iterate through the column:
R=0, D=2, Value = “555 987 6543”:
R = 0 means a new record. We recreate the nested records from the root until the definition level (here 2)
D = 2 which is the maximum. The value is defined and is inserted.
R=1, D=1:
R = 1 means a new entry in the contacts list at level 1.
D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts.
R=0, D=0:
R = 0 means a new record. we create the nested records from the root until the definition level
D = 0 => contacts is actually null, so we only have an empty AddressBook
Summary Example