Ingesting Data at Blazing Speed with Apache ORC
Gustavo Arocena
IBM Toronto Lab
IBM Canada Lab
Toronto
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2017. All rights reserved.
U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at
▪“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
▪TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
▪Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
▪Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
▪Other company, product, or service names may be trademarks or service marks of others.
Agenda
What is Big SQL?
From Parquet to ORC
Reading ORC Files Fast
Scaling Up
Tuning Big SQL for ORC
Big SQL Background
What is Big SQL?

[Architecture diagram: Big SQL integrates data security, the metastore, cluster management, and administration, and runs on the Hadoop data platform.]
What’s the Big Deal?
Grace Under Pressure
The Big SQL Advantages
▪ Scale: the only engine to run TPC-DS at 100TB scale.
▪ Complex SQL: capable of running all 99 TPC-DS queries since 2014; complex queries are optimized by the IBM cost-based optimizer.
▪ Concurrency: handles highly concurrent workloads gracefully (12-stream TPC-DS).
▪ Efficient resource utilization: memory, CPU, and IO.
Metrics for Big SQL 4.2.5 vs. Spark SQL 2.1
▪ Hadoop DS @ 100TB, 4 streams

[Bar charts, reconstructed as a table:]

Metric                   Big SQL   Spark SQL   Big SQL advantage
Elapsed time (hours)     13.7      43.2        ~1/3 the time
CPU utilization (%)      76.4      88.2        ~15% less CPU
Disk reads (MB/sec)      107       388         ~1/3 the reads
Disk writes (MB/sec)     25        237         ~1/9 the writes
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
From Parquet to ORC
Big SQL Architecture (as of 2016)

[Architecture diagram: a Head Node coordinates several Worker Nodes. Each worker has two IO engines, a native Parquet IO engine and a Hive Compatibility IO engine, reading from local HDFS storage; the cluster shares the Hive Metastore and the HDFS NameNode.]
2017: Big SQL on HDP
▪ The most popular data format on HDP is ORC.
▪ ORC performance becomes the top priority.
ORC vs Parquet in Big SQL 4.2

[Bar chart: elapsed time (sec), 1TB TPC-DS, Parquet vs. ORC v0. At 1 stream, ORC is 70% slower than Parquet; at 4 streams, ORC is 315% slower, with the ORC bar exceeding 50,000 seconds.]
Limitations of the Hive Compatibility IO Engine
▪ Slow ingestion: processes a single row at a time.
▪ JIT unfriendly: represents data values as Java objects.
▪ Low scalability: large memory footprint per scan, excessive CPU use, overloaded disks.
The Roadmap Towards Fast ORC Ingestion
▪ 1st phase (Big SQL 4.2.5, Dec ‘16): fast ORC ingestion.
▪ 2nd phase (Big SQL 5.0.1, Aug ‘17): ORC at scale.
1st Phase – Fast Ingestion using Apache ORC

[Bar chart: elapsed time (sec), 1TB TPC-DS, Parquet vs. ORC v1. At 1 stream, ORC is now 2% faster than Parquet; at 4 streams, it is still 65% slower.]

Key benefits of the Apache ORC libs:
▪ Many-row-at-a-time API
▪ Enables JIT-friendly code
▪ Represents data using primitive Java types
▪ Makes projection and selection pushdown very easy
2nd Phase – Managing Resources

[Bar chart: elapsed time (sec), 1TB TPC-DS, Parquet vs. ORC v2. ORC is now 15% faster than Parquet at 1 stream and 3.4% faster at 4 streams.]

The Resource Manager has global oversight over:
▪ The total number of threads
▪ Overall JVM heap consumption
▪ The degree of parallelism per scan
ORC as a First Class Citizen in 5.0.1

[Architecture diagram: the same Head Node / Worker Node layout as before, with each worker now carrying a dedicated ORC IO engine alongside the Parquet IO and Hive Compatibility IO engines.]
ORC Background
What is Apache ORC?

ORC = efficient storage + fast ingestion
▪ Compression: type-specific encodings (RLE for numbers, dictionary for strings, etc.) plus generic compression (Zlib, Snappy).
▪ Data skipping: column skipping based on the data layout; row skipping based on MIN/MAX stats and bloom filters.
▪ JIT friendly: vectorized APIs retrieve data as arrays of primitive values.

Engines leverage all these features; the Apache ORC libs allow applications to leverage them too.
ORC Physical Data Layout

CREATE HADOOP TABLE SALES(id INTEGER,
                          quantity INTEGER,
                          amount DOUBLE)

[Layout diagram: a file is divided into stripes, each sized to an HDFS block; a stripe is divided into row groups of 10K rows; within a row group, each column’s values are stored together (10K id values, 10K quantity values, 10K amount values). Statistics are kept at every level: row group stats, stripe stats, and file stats.]
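
The reader API exposes this layout directly. A minimal sketch of walking it, assuming the SALES data was written to a hypothetical /tmp/sales.orc (Reader, OrcFile and StripeInformation are in org.apache.orc):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StripeInformation;

// Print the stripe layout of an ORC file.
Configuration conf = new Configuration();
Path path = new Path("/tmp/sales.orc"); // hypothetical location
Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));
System.out.println(r.getNumberOfRows() + " rows, schema: " + r.getSchema());
for (StripeInformation stripe : r.getStripes()) {
    System.out.println("stripe @ " + stripe.getOffset() + ": "
        + stripe.getNumberOfRows() + " rows, "
        + stripe.getDataLength() + " data bytes");
}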
Leveraging the Apache ORC Libraries
Dependencies and Classes
▪ Java Dependencies (org.apache.orc group ID in Maven)
orc-core-1.4.0-nohive.jar
aircompressor-0.3.jar
▪ Java Classes for “vectorized” processing
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
import org.apache.orc.storage.ql.io.sarg.SearchArgument;
Using Vectorized ORC APIs

Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));
RecordReader rr = r.rows();
VectorizedRowBatch batch = r.getSchema().createRowBatch(1000);

▪ A vectorized row batch is a Java object that holds up to 1000 decoded rows. For the SALES table it wraps one column vector per column, each backed by a primitive array: a long[1000] for ID, a long[1000] for QUANTITY, and a double[1000] for AMOUNT.
JIT friendly code
▪ Get the total for sales involving less than 500 items:

SELECT sum(amount)
FROM sales
WHERE quantity < 500

// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) {
    long[] qty = ((LongColumnVector) batch.cols[1]).vector;
    double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
    for (int i = 0; i < batch.size; i++)
        if (qty[i] < 500)
            sum += amt[i];
}

• No objects
• No method calls
• The tight loop compiles to machine code

(LongColumnVector comes from org.apache.orc.storage.ql.exec.vector, like the other vector classes listed earlier.)
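
One caveat the simplified loop above skips: real ORC data can contain nulls and run-length “repeating” vectors, so production code should consult each ColumnVector’s noNulls, isRepeating and isNull fields before trusting vector values. A null-aware sketch of the same aggregation (an illustration, not Big SQL’s actual code):

double sum = 0;
while (rr.nextBatch(batch)) {
    LongColumnVector qtyCol = (LongColumnVector) batch.cols[1];
    DoubleColumnVector amtCol = (DoubleColumnVector) batch.cols[2];
    for (int i = 0; i < batch.size; i++) {
        // a repeating vector stores a single value at index 0
        int q = qtyCol.isRepeating ? 0 : i;
        int a = amtCol.isRepeating ? 0 : i;
        if ((qtyCol.noNulls || !qtyCol.isNull[q])
                && (amtCol.noNulls || !amtCol.isNull[a])
                && qtyCol.vector[q] < 500) {
            sum += amtCol.vector[a];
        }
    }
}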
Column Skipping/Pruning (a.k.a. Projection Pushdown)
▪ Unless we say otherwise, ORC will read all the columns (ID, QUANTITY, AMOUNT).
▪ But our query uses only two of them:

SELECT sum(amount)
FROM sales
WHERE quantity < 500
Column Skipping/Pruning (a.k.a. Projection Pushdown)

// Projection: the include array is indexed by ORC column id, where
// id 0 is the root struct; skip "id", read "quantity" and "amount"
boolean[] projection = new boolean[] {true, false, true, true};

// ORC RecordReader with projection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection));

// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { … }

SELECT sum(amount)
FROM sales
WHERE quantity < 500
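
Hard-coding column ids is brittle. The include array can instead be derived from the schema by name; a sketch using org.apache.orc.TypeDescription (the inner loop over a field’s id range also covers nested types):

// Build the include array by column name instead of by position.
TypeDescription schema = r.getSchema();
boolean[] include = new boolean[schema.getMaximumId() + 1];
include[0] = true; // the root struct
java.util.List<String> names = schema.getFieldNames();
java.util.List<TypeDescription> fields = schema.getChildren();
for (int i = 0; i < names.size(); i++) {
    if (names.get(i).equals("quantity") || names.get(i).equals("amount")) {
        // mark the field and its whole subtree of column ids
        for (int id = fields.get(i).getId(); id <= fields.get(i).getMaximumId(); id++)
            include[id] = true;
    }
}
RecordReader rr = r.rows(new Reader.Options().include(include));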
Column Skipping/Pruning (a.k.a. Projection Pushdown)

create external hadoop table web_sales (
  ws_sold_date_sk int, ws_sold_time_sk int, ws_ship_date_sk int, ws_item_sk int not null, ws_bill_customer_sk int,
  ws_bill_cdemo_sk int, ws_bill_hdemo_sk int, ws_bill_addr_sk int, ws_ship_customer_sk int, ws_ship_cdemo_sk int,
  ws_ship_hdemo_sk int, ws_ship_addr_sk int, ws_web_page_sk int, ws_web_site_sk int, ws_ship_mode_sk int, ws_warehouse_sk int,
  ws_promo_sk int, ws_order_number bigint not null, ws_quantity bigint, ws_wholesale_cost double, ws_list_price double,
  ws_sales_price double, ws_ext_discount_amt double, ws_ext_sales_price double, ws_ext_wholesale_cost double,
  ws_ext_list_price double, ws_ext_tax double, ws_coupon_amt double, ws_ext_ship_cost double, ws_net_paid double,
  ws_net_paid_inc_tax double, ws_net_paid_inc_ship double, ws_net_paid_inc_ship_tax double, ws_net_profit double
)
STORED AS ORC;

[Bar chart: elapsed time (seconds) for SELECT max(ws_sold_date_sk) FROM tpcds_1tb.web_sales on this 34-column table: 3.5 seconds with projection pushdown vs. 24.9 seconds without.]
Row Skipping/Pruning (a.k.a. Predicate Pushdown)
▪ Row skipping leverages the MIN/MAX stats kept per stripe and per row group.

[Diagram: two stripes with quantity stats MIN:274/MAX:590 and MIN:603/MAX:3000, and two row groups with quantity stats MIN:510/MAX:540 and MIN:330/MAX:420. For the predicate quantity < 500, the second stripe (MIN 603) and the first row group (MIN 510) contain no qualifying rows and can be skipped entirely.]

SELECT sum(amount)
FROM sales
WHERE quantity < 500

• Data must be sorted for pruning to be effective!
Row Skipping/Pruning (a.k.a. Predicate Pushdown)

// Predicates (SearchArgumentFactory and PredicateLeaf live in
// org.apache.orc.storage.ql.io.sarg, alongside SearchArgument)
SearchArgument selection = SearchArgumentFactory.newBuilder()
    .lessThan("quantity", PredicateLeaf.Type.LONG, 500L).build();

// ORC RecordReader with projection and selection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection)
        .searchArgument(selection, new String[] {}));

// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { ... }

SELECT sum(amount)
FROM sales
WHERE quantity < 500

Note that the search argument only skips whole stripes and row groups; rows in a surviving row group still come back, so the per-row quantity < 500 check in the loop is still required.
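
The builder also composes predicates. A sketch of a compound search argument (the amount IS NOT NULL condition is hypothetical, added only to show grouping; Type.FLOAT is the sarg type used for DOUBLE columns):

// quantity < 500 AND amount IS NOT NULL (illustrative only)
SearchArgument sarg = SearchArgumentFactory.newBuilder()
    .startAnd()
        .lessThan("quantity", PredicateLeaf.Type.LONG, 500L)
        .startNot()
            .isNull("amount", PredicateLeaf.Type.FLOAT)
        .end()
    .end()
    .build();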
Scaling Up
▪ Reading too many files at once hurts performance and can cause OOM.
▪ How many is too many? That depends on two things: the number of disks and the Java heap size.
▪ You must have multiple disks, AND the files must be evenly distributed across the disks.
▪ But the number of reader threads must be limited by the Java heap size:
 • Each open ORC file/split needs 30 to 80 MB of Java heap.
 • The biggest consumers are the RecordReader and the VectorizedRowBatch.
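
To make that budget concrete: at the quoted 30 to 80 MB per open file, a worker that can spare, say, 4 GB of heap for scans (an assumed figure, purely for illustration) can keep only roughly 50 to 135 ORC files open at once (4096 MB / 80 MB ≈ 51; 4096 MB / 30 MB ≈ 136).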
Scaling Up in an Engine
▪ An engine must handle concurrency (multiple queries/users).
▪ A fixed number of open files per scan leads to OOMs; the engine must gracefully degrade performance instead.
▪ So the total number of open files must be limited globally, which raises several issues (see the sketch after this list):
 • Starvation: all queries must make reasonable progress.
 • Stragglers: the work for a scan must be evenly balanced across nodes.
 • Adapting parallelism to concurrency: a single query must take full advantage of the available resources; as concurrency increases, parallelism decreases for ongoing scans, and when concurrency decreases, parallelism per scan increases again.
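
A minimal sketch of that global limit (an illustration of the idea, not Big SQL’s actual resource manager): a fair semaphore sized from the heap budget caps open readers per worker, so new splits block instead of exhausting the heap, and FIFO fairness keeps any one query from starving.

import java.util.concurrent.Semaphore;

// Caps the number of concurrently open ORC readers on a worker.
class OrcScanGovernor {
    private final Semaphore permits;

    OrcScanGovernor(long heapBudgetBytes, long bytesPerOpenFile) {
        // fair=true hands permits out FIFO, so every scan keeps making progress
        this.permits = new Semaphore((int) (heapBudgetBytes / bytesPerOpenFile), true);
    }

    // Call before opening a file/split; blocks when the budget is spent.
    void beforeOpen() throws InterruptedException { permits.acquire(); }

    // Call after closing the reader and releasing its batches.
    void afterClose() { permits.release(); }
}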
Big SQL Tuning for ORC
▪ The “fundamentals” (regardless of the data format):
 • Ensure Big SQL has enough resources: % memory, % CPU, and temp storage spread across multiple disks (e.g. the same disks as HDFS).
 • Use partitioning.
 • Run ANALYZE on all your tables (or ensure AUTO ANALYZE is enabled).
▪ ORC-specific tuning (Big SQL 5.0.1 & up):
 • The property bigsql.java.orc.preftp.size controls the max number of open ORC files.
▪ ORC file creation (see the sketch below):
 • Data sorted by the filtering columns
 • Stripe and row group size
 • Bloom filters
▪ For more Big SQL tuning tips, see
https://developer.ibm.com/hadoop/2016/11/16/top-6-big-sql-v4-2-performance-tips/
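
A sketch of those file-creation knobs through the Apache ORC writer API (the sizes and the bloom filter column are illustrative values, not recommendations):

import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

// Write the SALES data with explicit stripe size, row group
// (index) stride, and a bloom filter on the filtering column.
TypeDescription schema =
    TypeDescription.fromString("struct<id:int,quantity:int,amount:double>");
Writer w = OrcFile.createWriter(path,
    OrcFile.writerOptions(conf)
        .setSchema(schema)
        .compress(CompressionKind.ZLIB)    // generic compression
        .stripeSize(64L * 1024 * 1024)     // 64 MB stripes
        .rowIndexStride(10000)             // 10K-row row groups
        .bloomFilterColumns("quantity"));  // bloom filter for the WHERE column
// ... append pre-sorted VectorizedRowBatches with w.addRowBatch(batch), then w.close();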
Summary
▪ ORC = Storage Efficiency + Fast Ingestion
▪ Fast ingestion using Vectorized APIs in Apache ORC
▪ Big SQL performs best with data in ORC format!
Questions?

Thank you!