Using LLVM to accelerate processing of data in Apache Arrow

© 2018 Dremio Corporation @DremioHQ
DataWorks Summit, San Jose
June 21, 2018
Siddharth Teotia

Who?
Siddharth Teotia
@siddcoder
loonytek
Quora
• Software Engineer @ Dremio
• Committer - Apache Arrow
• Formerly at Oracle (Database Engine team)

Agenda
• Introduction to Apache Arrow
• Arrow in Practice: Introduction to Dremio
• Why Runtime Code Generation in Databases?
• Commonly used Runtime Code Generation Techniques
• Runtime Code Generation Requirements
• Introduction to LLVM
• LLVM in Dremio

Apache Arrow Project
• Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Designed to work with any programming language
3. Flexible data model that handles both flat and nested types
• Developers from 13+ major open source projects involved.
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R

Arrow goals
• Columnar in-memory representation optimized for efficient use of processor
cache through data locality.
• Designed to take advantage of modern CPU characteristics by implementing
algorithms that leverage hardware acceleration.
• Interoperability for high speed exchange between data systems.
• Embeddable in execution engines, storage layers, etc.
• Well-documented and cross language compatible.

High Performance Interface for Data Exchange
Without Arrow:
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. Parquet-to-Arrow reader)

Focus on CPU Efficiency
Traditional Memory Buffer (row format)
Arrow Memory Buffer (columnar format)
• Maximize CPU throughput
– SIMD
– Cache locality
• Vectorized operations
• Constant-time value access
– With minimal structure overhead
• Use efficient lightweight compression schemes on a per-column basis
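To make the row versus columnar contrast concrete, here is a minimal sketch that sums one field in both layouts. The record shape and the names `Row`, `Columns`, `sum_price_rows` and `sum_price_columns` are assumptions made for illustration; they are not part of Arrow.

```cpp
#include <cstdint>
#include <vector>

// Row format: the fields of each record are stored next to each other.
struct Row {
  int32_t id;
  int32_t price;
};

// Columnar format: each field lives in its own contiguous buffer.
struct Columns {
  std::vector<int32_t> id;
  std::vector<int32_t> price;
};

// Summing one field in row format strides over the unused `id` field,
// so half of every cache line that is fetched carries data we ignore.
int64_t sum_price_rows(const std::vector<Row>& rows) {
  int64_t total = 0;
  for (const Row& r : rows) total += r.price;
  return total;
}

// In columnar format the loop reads a single dense array, which is
// cache friendly and straightforward for a compiler to vectorize.
int64_t sum_price_columns(const Columns& cols) {
  int64_t total = 0;
  for (int32_t p : cols.price) total += p;
  return total;
}
```

Both functions return the same total; only the memory access pattern differs, and that is where the cache locality and SIMD benefits listed above come from.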

Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, List, Union

Columnar Data

Real World Arrow: Sabot
• Dremio is an OSS Data-as-a-Service Platform
• The core engine is “Sabot”
– Built entirely on top of Arrow libraries, runs in JVM

Why Runtime Code Generation in Databases?
• In general, what would be the optimal query execution plan?
– A hand-written plan that does the required processing for exactly the data types and operators the query needs.
– Such a plan works only for that particular query, but it is the fastest way to execute it.
– We can implement extremely fast dedicated code to process _only_ the kind of data touched by the query.
• However, query engines need to support broad functionality
– Several different data types, SQL operators, etc.
– Interpreter-based execution.
– Generic control blocks to understand arbitrary query-specific runtime information (field types etc., which are not known during query compilation).
– Dynamic dispatch (i.e. virtual calls via function pointers in C++).
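As a rough illustration of the gap between generic and query-specific code, the sketch below evaluates `a + b` over two int32 columns in two ways. All type and function names (`Expr`, `ColumnA`, `ColumnB`, `Add`, `EvalInterpreted`, `EvalSpecialized`) are hypothetical and are not taken from any engine; both columns are assumed to have the same length.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Generic, interpreted evaluation: every node of the expression tree is
// reached through a virtual call, once per row.
struct Expr {
  virtual ~Expr() = default;
  virtual int32_t Eval(const int32_t* a, const int32_t* b, std::size_t row) const = 0;
};

struct ColumnA : Expr {
  int32_t Eval(const int32_t* a, const int32_t*, std::size_t row) const override {
    return a[row];
  }
};

struct ColumnB : Expr {
  int32_t Eval(const int32_t*, const int32_t* b, std::size_t row) const override {
    return b[row];
  }
};

struct Add : Expr {
  std::unique_ptr<Expr> lhs, rhs;
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int32_t Eval(const int32_t* a, const int32_t* b, std::size_t row) const override {
    return lhs->Eval(a, b, row) + rhs->Eval(a, b, row);
  }
};

std::vector<int32_t> EvalInterpreted(const Expr& expr,
                                     const std::vector<int32_t>& a,
                                     const std::vector<int32_t>& b) {
  std::vector<int32_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = expr.Eval(a.data(), b.data(), i);  // three virtual calls per row
  }
  return out;
}

// The kind of code a hand-written (or generated) plan for exactly this
// query would contain: one tight loop, no dispatch, known types.
std::vector<int32_t> EvalSpecialized(const std::vector<int32_t>& a,
                                     const std::vector<int32_t>& b) {
  std::vector<int32_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
  return out;
}
```

The two functions compute the same result. The interpreted path pays for generality on every row, while the specialized loop is what runtime code generation tries to produce automatically.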

Why Runtime Code Generation in Databases? Cont’d
• Interpreted (non-code-generated) execution is not very CPU efficient and hurts query performance
– Generic code that is not tailored to a specific query branches excessively.
– Cost of a branch misprediction: the entire pipeline has to be flushed.
– Not the best way to implement performance-critical code paths on modern pipelined architectures.
• Most databases generate code at runtime (query execution time)
– When query execution is about to begin, all the information needed to generate query-specific code is available.
– The code-generated function(s) are specific to the query since they are based on information resolved at runtime.
– Optimized custom code for executing a particular query.

Commonly Used Runtime Code Generation Techniques
• Generate query specific Java classes at query runtime using predefined templates
– Use Janino to compile runtime generated classes in-memory to bytecode,
load and execute the bytecode in same JVM.
– Dremio uses this mechanism.
• Generate query specific C/C++ code at runtime, execv a compiler and load the
executable.
• Problems with existing code-generation mechanisms:
– Heavy object instantiation and dereferencing in generated Java code.
– Compiling and optimizing C/C++ code is known to be slow.
– Inefficient handling of complex and arbitrary SQL expressions.
– Limited opportunities for leveraging modern hardware capabilities
• SIMD vectorization, use of wider registers for handling decimals etc

Runtime Code Generation Requirements
• Efficient code-generation
– The method to generate query specific code at runtime should itself be very
efficient.
– The method should be able to leverage target hardware capabilities.
• Query specific optimized code
– The method should generate highly optimized code to improve query
execution performance.
• Handle arbitrarily complex SQL expressions efficiently
– The method should remain efficient as SQL expressions grow more complex.

Introduction to LLVM
• A library providing compiler related modular tools for implementing JIT
compilation infrastructure.
• LLVM can be used to efficiently generate query specific optimized native
machine code at query runtime for performance critical operations.
• Potential for significant speedup in overall query execution time.
• Two high level steps:
– Generate IR (Intermediate Representation) code
– Compile and optimize IR to machine code targeting a specific architecture
• IR is a low-level specification that is independent of both the source (language) and the target (architecture)
• Custom optimization: separate passes to optimize the generated IR.
– Vectorizing loops, combining instructions etc.
• Full API support for all steps of compilation process

Introduction to LLVM Cont’d
IR (Intermediate Representation) is the core of LLVM for code generation:
• A low-level, assembly-like language used by LLVM to represent code during compilation.
• Generating IR using IRBuilder
– Part of C++ API provided by LLVM.
– Programmatically assemble IR modules/functions instruction by instruction.
• Generating IR using Cross-compilation
– Clang C++ compiler as a frontend to LLVM.
– Compile C++ functions to corresponding IR code.

LLVM in Dremio
Goal: Use LLVM for efficient execution of SQL expressions in native code.
• Has the potential to significantly improve the performance of our execution
engine.
Welcome to Gandiva !!

Gandiva - Introduction
• A standalone C++ library for efficient evaluation of arbitrary SQL
expressions on Arrow vectors using runtime code-generation in
LLVM.
• Has no runtime or compile time dependencies on Dremio or any
other execution engine.
• Provides Java APIs that use the JNI bridge underneath to talk to
C++ code for code generation and expression evaluation
– Dremio’s execution engine leverages Gandiva Java APIs
• Expression support
– If/Else, CASE, ==, !=, <, >, etc
– Function expressions: +, -, /, *, %
– All fixed width scalar types
– More to come
• Boolean expressions, variable width data, complex
types etc.

Gandiva - Design
IR Generation

Gandiva - Design
Tree Based Expression Builder
• Define the operator, operands, output at each level in the tree

Gandiva - Design
High level usage of
main C++ modules

Gandiva - Design (Sample Usage)
// schema for input fields
auto fielda = field("a", int32());
auto fieldb = field("b", int32());
auto schema = arrow::schema({fielda, fieldb});
// output fields
auto field_result = field("res", int32());
// build expression
auto node_a = TreeExprBuilder::MakeField(fielda);
auto node_b = TreeExprBuilder::MakeField(fieldb);
auto condition = TreeExprBuilder::MakeFunction("greater_than", {node_a, node_b}, boolean());
auto sum = TreeExprBuilder::MakeFunction("add", {node_a, node_b}, int32());
auto sub = TreeExprBuilder::MakeFunction("subtract", {node_a, node_b}, int32());
auto if_node = TreeExprBuilder::MakeIf(condition, sum, sub, int32());
auto expr = TreeExprBuilder::MakeExpression(if_node, field_result);
// Build a projector for the expressions
std::shared_ptr<Projector> projector;
Status status = Projector::Make(schema, {expr}, pool_, &projector);
// Create an input Arrow record-batch with some sample data
// Evaluate expression on record batch
arrow::ArrayVector outputs;
status = projector->Evaluate(*in_batch, &outputs);
Expression:
if (a > b)
  a + b
else
  a - b

Gandiva - Design
Expression Decomposition
• Suitable for expressions of type
– input is null -> output null
• Evaluate vector’s data buffer and validity buffer independently
– Reduced branches.
– Better CPU efficiency
– Amenable to SIMD.
– Junk data is also evaluated but it doesn’t affect the end result
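A minimal sketch of the decomposition idea for `x + y` on nullable int32 columns. It assumes a simplified column layout (a data vector plus a validity bitmap packed 64 rows per word, with a set bit meaning non-null); `Int32Column`, `AddDecomposed` and `IsValid` are hypothetical names and do not reflect Arrow's or Gandiva's actual classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified nullable column: values plus a validity bitmap.
struct Int32Column {
  std::vector<int32_t> data;
  std::vector<uint64_t> validity;  // bit i of word i/64 set => row i is non-null
};

Int32Column AddDecomposed(const Int32Column& x, const Int32Column& y) {
  Int32Column out;
  out.data.resize(x.data.size());
  out.validity.resize(x.validity.size());

  // Data pass: no null checks inside the loop. Slots that belong to null
  // rows hold junk and are added as well. The addition is done on unsigned
  // values so that junk inputs cannot trigger signed-overflow behavior.
  for (std::size_t i = 0; i < x.data.size(); ++i) {
    out.data[i] = static_cast<int32_t>(static_cast<uint32_t>(x.data[i]) +
                                       static_cast<uint32_t>(y.data[i]));
  }

  // Validity pass: a row is valid only if both inputs are valid, so one
  // bitwise AND settles 64 rows at a time.
  for (std::size_t w = 0; w < x.validity.size(); ++w) {
    out.validity[w] = x.validity[w] & y.validity[w];
  }
  return out;
}

// Helper to read a single validity bit.
bool IsValid(const Int32Column& c, std::size_t row) {
  return ((c.validity[row / 64] >> (row % 64)) & 1u) != 0;
}
```

Neither loop contains a per-row branch on nullness. The junk sums computed for null rows are simply never read, because the output bitmap marks those rows invalid.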

Gandiva - Design
Expression
Decomposition

Gandiva - Design
Categories of Function Expressions
NULL_IF_NULL
● Always decomposable
● If input null -> output null
● Input validity is pushed to top of tree to determine validity of output
● Highly optimized execution
● Eg: +, -, *, / etc
● Majority of functions
NULL_NEVER
● Output is never null
● No need to push validity for final result
● Eg: isNumeric(expr), isNull(expr), isDate(expr)
● Actual evaluation done using conditions
NULL_INTERNAL
● Output can be null
● Eg: castStringToInt(x) + y + z
● Evaluate sub-tree and generate a local bitmap
● Rest of the tree uses local bitmap to continue with decomposed evaluation
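The NULL_INTERNAL case can be sketched for `castStringToInt(x) + y`. Everything here is an illustrative assumption: `ParseInt`, `Int32Result` and `CastThenAdd` are hypothetical names, `std::vector<bool>` stands in for a real bitmap, and the cast is assumed to yield null for any text that is not a short run of decimal digits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical cast helper: succeeds only for 1 to 9 decimal digits,
// which keeps the parsed value inside the int32 range.
bool ParseInt(const std::string& s, int32_t* value) {
  if (s.empty() || s.size() > 9) return false;
  int32_t v = 0;
  for (char c : s) {
    if (c < '0' || c > '9') return false;
    v = v * 10 + (c - '0');
  }
  *value = v;
  return true;
}

struct Int32Result {
  std::vector<int32_t> data;
  std::vector<bool> valid;
};

Int32Result CastThenAdd(const std::vector<std::string>& x,
                        const std::vector<bool>& x_valid,
                        const std::vector<int32_t>& y,
                        const std::vector<bool>& y_valid) {
  const std::size_t n = x.size();

  // NULL_INTERNAL sub-tree: evaluated with conditions, because the output
  // can be null even when the input is not. It fills a local bitmap.
  std::vector<int32_t> cast(n, 0);
  std::vector<bool> local(n, false);
  for (std::size_t i = 0; i < n; ++i) {
    int32_t v = 0;
    if (x_valid[i] && ParseInt(x[i], &v)) {
      cast[i] = v;
      local[i] = true;
    }
  }

  // Rest of the tree (NULL_IF_NULL): decomposed evaluation, with the local
  // bitmap standing in for the validity of the sub-tree.
  Int32Result out{std::vector<int32_t>(n), std::vector<bool>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    // Unsigned addition so junk values in null rows cannot overflow.
    out.data[i] = static_cast<int32_t>(static_cast<uint32_t>(cast[i]) +
                                       static_cast<uint32_t>(y[i]));
  }
  for (std::size_t i = 0; i < n; ++i) out.valid[i] = local[i] && y_valid[i];
  return out;
}
```

Only the cast needs per-row conditions. Once its local bitmap exists, the addition above it proceeds exactly like any other NULL_IF_NULL function.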

Gandiva - Design
Handling CASE Expressions
• Interpreting CASE as if-else-if statements loses optimization opportunities
– Evaluation of same condition across multiple cases
– Evaluation of same validity across multiple cases
• Treat as switch case
• LLVM helps with removing redundant evaluation of validity and conditions across multiple cases
• A temporary bitmap is created and shared amongst all expressions for computing validity of output
– Detect nested if-else and use a single bitmap
– Only the matching “if or else” updates bitmap
case
when cond1 then exp1
..
else exp
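The shared-bitmap point can be sketched for a small, made-up expression: `case when x < 10 then a when x < 20 then b else c end`. The names `Nullable` and `EvalCase` are hypothetical, `std::vector<bool>` stands in for the bitmap, and `x` is assumed to contain no nulls to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified nullable int32 column.
struct Nullable {
  std::vector<int32_t> data;
  std::vector<bool> valid;
};

Nullable EvalCase(const std::vector<int32_t>& x,
                  const Nullable& a, const Nullable& b, const Nullable& c) {
  const std::size_t n = x.size();
  // One output bitmap shared by every branch of the CASE.
  Nullable out{std::vector<int32_t>(n), std::vector<bool>(n)};
  for (std::size_t i = 0; i < n; ++i) {
    // Only the branch that matches writes the value and the validity bit;
    // the other branches are never evaluated for this row.
    if (x[i] < 10) {
      out.data[i] = a.data[i];
      out.valid[i] = a.valid[i];
    } else if (x[i] < 20) {
      out.data[i] = b.data[i];
      out.valid[i] = b.valid[i];
    } else {
      out.data[i] = c.data[i];
      out.valid[i] = c.valid[i];
    }
  }
  return out;
}
```

This hand-written loop only shows the intended end state: a single bitmap and one validity update per row. How Gandiva arrives there through LLVM is not detailed in the slides.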

Using Gandiva in Dremio

Performance
Java JIT runtime bytecode generation vs. Gandiva runtime code generation in LLVM
• Compare expression evaluation time of five simple expressions on a JSON dataset of 500 million rows
• Tests were run on a Mac (2.7GHz quad-core Intel Core i7, 16GB RAM)
Project 5 columns
SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x)
FROM json.d500
Case - 10
SELECT count
(case
when x < 1000000 then x/1000000 + 0
when x < 2000000 then x/2000000 + 1
when x < 3000000 then x/3000000 + 2
when x < 4000000 then x/4000000 + 3
when x < 5000000 then x/5000000 + 4 ……………...
else 10 end)
FROM json.d500

Performance
Test                 Project time (secs)   Project time (secs)   Improvement
                     with Java JIT         with Gandiva LLVM
Sum                  3.805                 0.558                 6.8x
Project 5 columns    8.681                 1.689                 5.13x
Project 10 columns   24.923                3.476                 7.74x
CASE-10              4.308                 0.925                 4.66x
CASE-100             1361                  15.187                89.6x

Get Involved
• Gandiva
– https://github.com/dremio/gandiva
• Arrow
– dev@arrow.apache.org
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ
• Dremio
– https://community.dremio.com/
– https://github.com/dremio/dremio-oss
