Parquet-format 
Yue Chen 
http://linkedin.com/in/yuechen2 
http://dataera.wordpress.com
Goal 
To have a state-of-the-art columnar storage available across the Hadoop platform 
Hadoop is very reliable for big, long-running queries, but it is also IO heavy. 
Incrementally take advantage of column-based storage in existing frameworks. 
Not tied to any framework in particular 
Can be used to store nested data
Columnar Storage 
Limits IO to data actually needed: 
loads only the columns that need to be accessed. 
Saves space: 
Columnar layout compresses better 
Type specific encodings. 
Enables vectorized execution engines.
Columnar Storage 
row-oriented storage 
column-oriented storage
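As a rough illustration of the two layouts (a sketch only, not Parquet's actual on-disk format; the field names and sample values are made up for the example, and every row is assumed to have the same fields), the same records can be pivoted from rows to columns so that a query touching one field reads a single list:

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical sample data, invented for this sketch.
rows = [
    {"id": 1, "country": "US", "clicks": 10},
    {"id": 2, "country": "US", "clicks": 7},
    {"id": 3, "country": "FR", "clicks": 3},
]

columns = to_columns(rows)

# Only the "clicks" column is touched; "id" and "country" are never read.
total_clicks = sum(columns["clicks"])
```

Values of the same type also end up next to each other, which is what makes type-specific encodings and better compression possible.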
The Model 
Schema: 
required: exactly one occurrence 
optional: 0 or 1 occurrence 
repeated: 0 or more occurrences 
Example:
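A minimal sketch of the three repetition types, written as a check on how many times a field may occur in a record. The function name `is_valid_count` is hypothetical and exists only for this illustration:

```python
def is_valid_count(repetition, count):
    """Return True if `count` occurrences are allowed for the repetition type."""
    if repetition == "required":
        return count == 1          # exactly one occurrence
    if repetition == "optional":
        return count in (0, 1)     # 0 or 1 occurrence
    if repetition == "repeated":
        return count >= 0          # 0 or more occurrences
    raise ValueError(f"unknown repetition type: {repetition}")
```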
The Model 
Lists (or Sets) can be represented by a repeating field.
The Model 
A Map is equivalent to a repeating field containing groups of key-value pairs where the key is required.
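The Map representation can be sketched in plain Python as follows. The helper names are hypothetical, and treating the value as optional is an assumption of this sketch (the slide only states that the key is required):

```python
def map_to_repeated_group(mapping):
    """Represent a dict as a repeated field of key-value groups."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def repeated_group_to_map(groups):
    """Rebuild the dict; every group must carry a key."""
    result = {}
    for group in groups:
        if "key" not in group:
            raise ValueError("key is required in every group")
        # Assumption: a missing value is read back as None.
        result[group["key"]] = group.get("value")
    return result


groups = map_to_repeated_group({"a": 1, "b": 2})
```

A List (or Set) is the simpler case: the repeated field holds the elements directly instead of key-value groups.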
Table Format
How to store? 
The structure of the record is captured for each value by two integers called repetition level and definition level. Using definition and repetition levels, we can fully reconstruct the nested structures.
Definition Levels 
To support nested records, we need to store the level at which the field is null.
Definition Levels Example
Definition Levels: Another Example 
The maximum definition level is now 2, because b is required (it cannot be null) and so does not need a level of its own.
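One way to read this rule is that only fields that may be absent (optional or repeated) contribute a definition level. The sketch below assumes a path a.b.c in which a and c are optional and b is required; the slide only confirms that b is required, so the other two types are assumptions, as is the function name:

```python
def max_definition_level(path_repetitions):
    """Count the fields along the path that are allowed to be absent."""
    return sum(1 for rep in path_repetitions if rep != "required")


# Assumed path a.b.c: a optional, b required, c optional.
level = max_definition_level(["optional", "required", "optional"])
```

With these assumptions `level` is 2, while the same path with an optional b would give 3.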
Repetition levels 
To support repeated fields we need to store when new lists are starting in a column of values. 
The repetition level can be seen as a marker of when to start a new list and at which level.
Repetition levels Example 
0 marks every new record and implies creating a new level1 and level2 list. 
1 marks every new level1 list and implies creating a new level2 list as well. 
2 marks every new element in a level2 list.
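The three markers can be sketched for a column whose records are lists of lists. This is a simplified illustration that assumes every list is non-empty (empty or missing lists need definition levels as well); the function name and the sample letters are invented for the example:

```python
def repetition_levels(records):
    """Return (repetition_level, value) pairs for records shaped as
    a level1 list of level2 lists. Assumes no list is empty."""
    out = []
    for record in records:
        first_in_record = True
        for level1 in record:
            first_in_level1 = True
            for value in level1:
                if first_in_record:
                    r = 0      # new record
                elif first_in_level1:
                    r = 1      # new level1 entry, hence a new level2 list
                else:
                    r = 2      # next element in the current level2 list
                out.append((r, value))
                first_in_record = False
                first_in_level1 = False
    return out


# Hypothetical sample: two records.
records = [
    [["a", "b", "c"], ["d", "e"]],
    [["h"], ["i", "j"]],
]
levels = repetition_levels(records)
```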
Repetition levels Example
Summary Example
We’ll now focus on the column contacts.phoneNumber to illustrate this. 
To write the column we iterate through the record data for this column: 
contacts.phoneNumber: “555 987 6543” 
new record: R = 0 
value is defined: D = maximum (2) 
contacts.phoneNumber: null 
repeated contacts: R = 1 
only defined up to contacts: D = 1 
contacts: null 
new record: R = 0 
only defined up to AddressBook: D = 0 
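The write steps above can be sketched in Python. Records are modelled here as plain dicts, which is an assumption of this sketch rather than how Parquet represents them, and `write_phone_column` is a hypothetical helper written only for the contacts.phoneNumber column:

```python
def write_phone_column(address_books):
    """Shred contacts.phoneNumber into (R, D, value) triples."""
    column = []
    for book in address_books:
        contacts = book.get("contacts") or []
        if not contacts:
            # contacts is null: new record, defined only up to AddressBook.
            column.append((0, 0, None))
            continue
        for i, contact in enumerate(contacts):
            r = 0 if i == 0 else 1          # 0 = new record, 1 = repeated contacts
            phone = contact.get("phoneNumber")
            if phone is None:
                column.append((r, 1, None))  # defined only up to contacts
            else:
                column.append((r, 2, phone)) # fully defined: D = maximum (2)
    return column


# The two records from the example, reduced to the fields this column needs.
address_books = [
    {"contacts": [{"phoneNumber": "555 987 6543"}, {}]},
    {},
]
column = write_phone_column(address_books)
```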
The column contains the following data: 
R=0, D=2, Value = “555 987 6543” 
R=1, D=1, Value = null 
R=0, D=0, Value = null 
To reconstruct the records from the column, we iterate through the column: 
R=0, D=2, Value = “555 987 6543”: 
R = 0 means a new record. We recreate the nested records from the root down to the definition level (here 2). 
D = 2 which is the maximum. The value is defined and is inserted. 
R=1, D=1: 
R = 1 means a new entry in the contacts list at level 1. 
D = 1 means contacts is defined but not phoneNumber, so we just create an empty contacts. 
R=0, D=0: 
R = 0 means a new record. We create the nested records from the root down to the definition level (here 0). 
D = 0 => contacts is actually null, so we only have an empty AddressBook 
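The read path can be sketched the same way. Again, plain dicts stand in for records and `read_phone_column` is a hypothetical helper that handles only this one column (maximum definition level 2, maximum repetition level 1):

```python
def read_phone_column(column):
    """Rebuild AddressBook records from (R, D, value) triples."""
    records = []
    for r, d, value in column:
        if r == 0:
            records.append({})       # R = 0: start a new AddressBook
        book = records[-1]
        if d >= 1:
            contact = {}             # D >= 1: a contacts entry exists
            if d == 2:
                contact["phoneNumber"] = value  # D = 2: value is defined
            book.setdefault("contacts", []).append(contact)
        # D = 0: contacts is null, the AddressBook stays empty.
    return records


rebuilt = read_phone_column([
    (0, 2, "555 987 6543"),
    (1, 1, None),
    (0, 0, None),
])
```

The result is one AddressBook with two contacts (the second without a phone number) followed by an empty AddressBook.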
Compression Codecs 
Snappy, GZIP; currently Snappy by default
Reference 
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
