© 2018 Dremio Corporation @DremioHQ
Using LLVM to accelerate processing of data in
Apache Arrow
DataWorks Summit, San Jose
June 21, 2018
Siddharth Teotia
Who?
Siddharth Teotia
@siddcoder
loonytek
Quora
• Software Engineer @ Dremio
• Committer - Apache Arrow
• Formerly at Oracle (Database Engine team)
Agenda
• Introduction to Apache Arrow
• Arrow in Practice: Introduction to Dremio
• Why Runtime Code Generation in Databases?
• Commonly used Runtime Code Generation Techniques
• Runtime Code Generation Requirements
• Introduction to LLVM
• LLVM in Dremio
Apache Arrow Project
• Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Designed to work with any programming language
3. Flexible data model that handles both flat and nested types
• Developers from 13+ major open source projects involved.
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Arrow goals
• Columnar in-memory representation optimized for efficient use of processor
cache through data locality.
• Designed to take advantage of modern CPU characteristics by implementing
algorithms that leverage hardware acceleration.
• Interoperability for high speed exchange between data systems.
• Embeddable in execution engines, storage layers, etc.
• Well-documented and cross-language compatible.
High Performance Interface for Data Exchange
Without Arrow:
• Each system has its own internal memory format
• 70-80% of CPU time wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
Apache Arrow Adoption
Focus on CPU Efficiency
Figure: traditional memory buffer (row format) vs. Arrow memory buffer (columnar format)
• Maximize CPU throughput
– SIMD
– Cache locality
• Vectorized operations
• Constant-time value access
– With minimal structure overhead
• Use efficient lightweight compression schemes on a per-column basis
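The row-versus-columnar contrast can be sketched in plain C++. This is only an illustration and not Arrow's actual buffer layout: the `RowRecord` and `ColumnarBatch` types, their fields, and both function names are hypothetical and chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Row format: each record keeps all of its fields together, so a scan over
// one field also drags the neighbouring fields through the cache.
struct RowRecord {
  int64_t session_id;
  std::string source_ip;
  int32_t bytes_sent;
};

// Columnar format: each field lives in its own contiguous array.
struct ColumnarBatch {
  std::vector<int64_t> session_id;
  std::vector<std::string> source_ip;
  std::vector<int32_t> bytes_sent;
};

// Sums one field by walking every record.
int64_t SumBytesRowFormat(const std::vector<RowRecord>& rows) {
  int64_t total = 0;
  for (const RowRecord& row : rows) total += row.bytes_sent;
  return total;
}

// Sums the same field by walking a single dense array, which is the access
// pattern that benefits from cache locality and SIMD.
int64_t SumBytesColumnar(const ColumnarBatch& batch) {
  // All columns of a batch describe the same rows.
  assert(batch.session_id.size() == batch.bytes_sent.size());
  int64_t total = 0;
  for (int32_t value : batch.bytes_sent) total += value;
  return total;
}
```

Both functions return the same answer; the difference is how much unrelated data each one has to touch to get there.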
Arrow Data Types
• Scalars
– Boolean
– [u]int[8,16,32,64], Decimal, Float, Double
– Date, Time, Timestamp
– UTF8 String, Binary
• Complex
– Struct, List, Union
Columnar Data
Real World Arrow: Sabot
• Dremio is an OSS Data-as-a-Service platform
• The core engine is “Sabot”
– Built entirely on top of Arrow libraries, runs in the JVM
Why Runtime Code Generation in Databases?
• In general, what would be the optimal query execution plan?
– A hand-written query plan that does the required processing for exactly the data types and operators used by the query.
– Such a plan works only for that particular query, but it is the fastest way to execute it.
– We can implement extremely fast dedicated code to process _only_ the kind of data touched by the query.
• However, query engines need to support broad functionality:
– Several different data types, SQL operators, etc.
– Interpreter-based execution.
– Generic control blocks to understand arbitrary query-specific runtime information (field types, etc., which are not known during query compilation).
– Dynamic dispatch (aka virtual calls via function pointers in C++).
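A minimal sketch of the trade-off described above, assuming a toy engine that only knows int32 columns and addition. The `Expr`, `ColumnRef`, `Add` and `AddColumnsSpecialized` names are hypothetical and are not taken from Dremio or Gandiva.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Generic, interpreter-style expression tree: every node is evaluated
// through a virtual call, once per row.
struct Expr {
  virtual ~Expr() = default;
  virtual int32_t Eval(const std::vector<int32_t>& row) const = 0;
};

struct ColumnRef : Expr {
  explicit ColumnRef(std::size_t index) : column_index(index) {}
  int32_t Eval(const std::vector<int32_t>& row) const override {
    assert(column_index < row.size());
    return row[column_index];
  }
  std::size_t column_index;
};

struct Add : Expr {
  Add(std::unique_ptr<Expr> left, std::unique_ptr<Expr> right)
      : lhs(std::move(left)), rhs(std::move(right)) {}
  int32_t Eval(const std::vector<int32_t>& row) const override {
    return lhs->Eval(row) + rhs->Eval(row);
  }
  std::unique_ptr<Expr> lhs;
  std::unique_ptr<Expr> rhs;
};

// What dedicated code for the single expression "a + b" could look like:
// no tree walk, no dynamic dispatch, just a tight loop over two columns.
void AddColumnsSpecialized(const int32_t* a, const int32_t* b, int32_t* out,
                           std::size_t num_rows) {
  for (std::size_t i = 0; i < num_rows; ++i) out[i] = a[i] + b[i];
}
```

The generic tree can evaluate any expression built from its node types, while the specialized loop handles exactly one; runtime code generation aims to produce the second kind of code on demand.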
Why Runtime Code Generation in Databases? Cont’d
• Interpreted (non-code-generated) execution is not very CPU efficient and hurts query performance
– Generic code that is not tailored to a specific query has excessive branching.
– Cost of branch misprediction: the entire pipeline has to be flushed.
– Not the best way to implement performance-critical code paths on modern pipelined architectures.
• Most databases generate code at runtime (query execution time)
– When query execution is about to begin, all the information needed to generate query-specific code is available.
– The code-generated function(s) are specific to the query since they are based on information resolved at runtime.
– The result is optimized custom code for executing a particular query.
Commonly Used Runtime Code Generation Techniques
• Generate query-specific Java classes at query runtime using predefined templates
– Use Janino to compile the runtime-generated classes in memory to bytecode, then load and execute the bytecode in the same JVM.
– Dremio uses this mechanism.
• Generate query-specific C/C++ code at runtime, execv a compiler, and load the resulting executable.
• Problems with existing code-generation mechanisms:
– Heavy object instantiation and dereferencing in generated Java code.
– Compiling and optimizing C/C++ code is known to be slow.
– Inefficient handling of complex and arbitrary SQL expressions.
– Limited opportunities for leveraging modern hardware capabilities
• SIMD vectorization, use of wider registers for handling decimals, etc.
Runtime Code Generation Requirements
• Efficient code generation
– The method used to generate query-specific code at runtime should itself be very efficient.
– The method should be able to leverage target hardware capabilities.
• Query-specific optimized code
– The method should generate highly optimized code to improve query execution performance.
• Handle arbitrarily complex SQL expressions efficiently
– The method should be able to handle complex SQL expressions efficiently.
Introduction to LLVM
• A library providing modular compiler tools for implementing JIT compilation infrastructure.
• LLVM can be used to efficiently generate query-specific, optimized native machine code at query runtime for performance-critical operations.
• Potential for significant speedup in overall query execution time.
• Two high-level steps:
– Generate IR (Intermediate Representation) code
– Compile and optimize the IR to machine code targeting a specific architecture
• IR is a low-level specification that is independent of both the source language and the target architecture
• Custom optimization: separate passes to optimize the generated IR.
– Vectorizing loops, combining instructions, etc.
• Full API support for all steps of the compilation process
Introduction to LLVM Cont’d
IR (Intermediate Representation) is the core of LLVM for code generation:
• A low-level, assembly-like specification used by LLVM for representing code during compilation.
• Generating IR using IRBuilder
– Part of the C++ API provided by LLVM.
– Programmatically assemble IR modules/functions instruction by instruction.
• Generating IR using cross-compilation
– Clang C++ compiler as a frontend to LLVM.
– Compile C++ functions to corresponding IR code.
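As a rough illustration of the cross-compilation route, helper functions like the ones below could be written in ordinary C++ and lowered to IR with Clang, for example with `clang++ -O2 -S -emit-llvm helpers.cpp -o helpers.ll` (the file names are placeholders). The function names are hypothetical and are not claimed to match Gandiva's own precompiled functions.

```cpp
#include <cassert>
#include <cstdint>

// extern "C" keeps the symbol names unmangled, which makes the functions
// easy to locate by name in the emitted IR module.
extern "C" int32_t add_int32_int32(int32_t left, int32_t right) {
  return left + right;
}

extern "C" bool greater_than_int32_int32(int32_t left, int32_t right) {
  return left > right;
}

extern "C" int32_t mod_int32_int32(int32_t left, int32_t right) {
  assert(right != 0);  // assumption: the caller rules out a zero divisor
  return left % right;
}
```

Small, side-effect-free functions like these are natural candidates for inlining once their IR is linked into a generated module.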
LLVM in Dremio
Goal: Use LLVM for efficient execution of SQL expressions in native code.
• Has the potential to significantly improve the performance of our execution
engine.
Welcome to Gandiva !!
Gandiva - Introduction
• A standalone C++ library for efficient evaluation of arbitrary SQL
expressions on Arrow vectors using runtime code-generation in
LLVM.
• Has no runtime or compile time dependencies on Dremio or any
other execution engine.
• Provides Java APIs that use the JNI bridge underneath to talk to
C++ code for code generation and expression evaluation
– Dremio’s execution engine leverages Gandiva Java APIs
• Expression support
– If/Else, CASE, ==, !=, <, >, etc
– Function expressions: +, -, /, *, %
– All fixed width scalar types
– More to come
• Boolean expressions, variable width data, complex
types etc.
Gandiva - Design
IR Generation
Gandiva - Design
Tree Based Expression Builder
• Define the operator, operands, output at each level in the tree
Gandiva - Design
High level usage of
main C++ modules
Gandiva - Design (Sample Usage)
// schema for input fields
auto fielda = field("a", int32());
auto fieldb = field("b", int32());
auto schema = arrow::schema({fielda, fieldb});

// output fields
auto field_result = field("res", int32());

// build expression
auto node_a = TreeExprBuilder::MakeField(fielda);
auto node_b = TreeExprBuilder::MakeField(fieldb);
auto condition = TreeExprBuilder::MakeFunction("greater_than", {node_a, node_b}, boolean());
auto sum = TreeExprBuilder::MakeFunction("add", {node_a, node_b}, int32());
auto sub = TreeExprBuilder::MakeFunction("subtract", {node_a, node_b}, int32());
auto if_node = TreeExprBuilder::MakeIf(condition, sum, sub, int32());
auto expr = TreeExprBuilder::MakeExpression(if_node, field_result);

// Build a projector for the expressions
std::shared_ptr<Projector> projector;
Status status = Projector::Make(schema, {expr}, pool_, &projector);

// Create an input Arrow record-batch (in_batch) with some sample data

// Evaluate expression on record batch
arrow::ArrayVector outputs;
status = projector->Evaluate(*in_batch, &outputs);
Expression:
if (a > b)
a + b
else
a - b
Gandiva - Design
Expression Decomposition
• Suitable for expressions of the type
– input is null -> output is null
• Evaluate the vector’s data buffer and validity buffer independently
– Reduced branches.
– Better CPU efficiency.
– Amenable to SIMD.
– Junk data is also evaluated, but it doesn’t affect the end result.
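A hand-written sketch of what decomposed evaluation of `a + b` could look like, assuming Arrow-style validity bitmaps (one bit per row, least-significant bit first, 1 meaning non-null). The `Int32Column` struct and the function names are hypothetical; Gandiva generates equivalent logic as LLVM IR rather than using code like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Int32Column {
  std::vector<int32_t> data;      // one slot per row; null rows hold junk
  std::vector<uint8_t> validity;  // one bit per row, 1 = non-null
};

bool IsValid(const Int32Column& column, std::size_t row) {
  return ((column.validity[row / 8] >> (row % 8)) & 1) != 0;
}

Int32Column AddDecomposed(const Int32Column& a, const Int32Column& b) {
  assert(a.data.size() == b.data.size());
  assert(a.validity.size() == b.validity.size());
  Int32Column out;
  out.data.resize(a.data.size());
  out.validity.resize(a.validity.size());

  // Data pass: no null checks, so the loop body has no branches. Junk slots
  // are added too; unsigned arithmetic keeps that well defined on overflow.
  for (std::size_t i = 0; i < a.data.size(); ++i) {
    out.data[i] = static_cast<int32_t>(static_cast<uint32_t>(a.data[i]) +
                                       static_cast<uint32_t>(b.data[i]));
  }

  // Validity pass: a row is non-null only if both inputs are non-null,
  // which is a bitwise AND covering eight rows per byte.
  for (std::size_t i = 0; i < a.validity.size(); ++i) {
    out.validity[i] = static_cast<uint8_t>(a.validity[i] & b.validity[i]);
  }
  return out;
}
```

A consumer reads `out.data[i]` only where `IsValid(out, i)` is true, so the junk results computed for null rows are never observed.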
Gandiva - Design
Expression Decomposition
Gandiva - Design
Categories of Function Expressions
NULL_IF_NULL
● Always decomposable
● If input null -> output null
● Input validity is pushed to top of tree to determine validity of output
● Highly optimized execution
● Eg: +, -, *, / etc
● Majority of functions
NULL_NEVER
● Output is never null
● No need to push validity for final result
● Eg: isNumeric(expr), isNull(expr), isDate(expr)
● Actual evaluation done using conditions
NULL_INTERNAL
● Output can be null
● Eg: castStringToInt(x) + y + z
● Evaluate sub-tree and generate a local bitmap
● Rest of the tree uses local bitmap to continue with decomposed evaluation
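The three categories differ mainly in who decides the validity of the result. The sketch below shows one hypothetical function signature per category; the names and signatures are illustrative only, and the cast is assumed to yield null for unparseable input, which the slides do not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// NULL_IF_NULL: the function never sees validity. The caller derives the
// output validity from the input validity bits on its own.
int32_t AddInt32(int32_t left, int32_t right) { return left + right; }

// NULL_NEVER: input validity is an argument, and the result is always
// a valid value.
bool IsNull(bool input_is_valid) { return !input_is_valid; }

// NULL_INTERNAL: the function itself decides whether its output is null
// and reports that through out_valid.
int32_t CastStringToInt(const std::string& text, bool input_is_valid,
                        bool* out_valid) {
  assert(out_valid != nullptr);
  *out_valid = false;
  // Accept only 1 to 9 decimal digits so the result always fits in int32_t.
  if (!input_is_valid || text.empty() || text.size() > 9) return 0;
  int32_t result = 0;
  for (char c : text) {
    if (c < '0' || c > '9') return 0;
    result = result * 10 + (c - '0');
  }
  *out_valid = true;
  return result;
}
```

For an expression such as `castStringToInt(x) + y + z`, the `out_valid` results of the cast would form the local bitmap, and the two additions could then proceed as ordinary NULL_IF_NULL operations.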
Gandiva - Design
Handling CASE Expressions
• Interpreting CASE as if-else-if statements loses optimization opportunities
– Evaluation of the same condition across multiple cases
– Evaluation of the same validity across multiple cases
• Treat as switch case
• LLVM helps with removing redundant evaluation of validity and conditions across multiple cases
• A temporary bitmap is created and shared amongst all expressions for computing the validity of the output
– Detect nested if-else and use a single bitmap
– Only the matching “if or else” updates the bitmap
case
  when cond1 then exp1
  when cond2 then exp2
  when cond3 then exp3
  ..
  else exp
end
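The single shared bitmap can be illustrated with a hand-written evaluator for one fixed expression, `case when x < 10 then x + 1 when x < 20 then y else 0 end`. This is a sketch under stated assumptions, not Gandiva's generated code: the expression, the helper names, and the row-at-a-time loop are all chosen for the example, and a null `x` is assumed to make both conditions false.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

bool GetBit(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

void SetBit(std::vector<uint8_t>& bitmap, std::size_t i, bool value) {
  const uint8_t mask = static_cast<uint8_t>(1u << (i % 8));
  if (value) {
    bitmap[i / 8] = static_cast<uint8_t>(bitmap[i / 8] | mask);
  } else {
    bitmap[i / 8] = static_cast<uint8_t>(bitmap[i / 8] & ~mask);
  }
}

// Evaluates the CASE expression above. One output bitmap is shared by all
// branches, and only the branch that matches a row writes that row's bit.
void EvalCase(const std::vector<int32_t>& x, const std::vector<uint8_t>& x_valid,
              const std::vector<int32_t>& y, const std::vector<uint8_t>& y_valid,
              std::vector<int32_t>* out, std::vector<uint8_t>* out_valid) {
  assert(x.size() == y.size());
  const std::size_t num_rows = x.size();
  out->assign(num_rows, 0);
  out_valid->assign((num_rows + 7) / 8, 0);
  for (std::size_t i = 0; i < num_rows; ++i) {
    const bool x_ok = GetBit(x_valid, i);  // read once, reused by both conditions
    if (x_ok && x[i] < 10) {
      (*out)[i] = x[i] + 1;
      SetBit(*out_valid, i, true);
    } else if (x_ok && x[i] < 20) {
      (*out)[i] = y[i];
      SetBit(*out_valid, i, GetBit(y_valid, i));  // validity follows y
    } else {
      (*out)[i] = 0;
      SetBit(*out_valid, i, true);  // a literal is never null
    }
  }
}
```

Reading the validity of `x` once per row, instead of once per `when` clause, is the kind of redundancy removal the slide attributes to LLVM.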
Using Gandiva in Dremio
Performance
Java JIT runtime bytecode generation vs. Gandiva runtime code generation in LLVM
• Compare expression evaluation time of five simple expressions on a JSON dataset of 500 million rows
• Tests were run on a Mac machine (2.7GHz quad-core Intel Core i7, 16GB RAM)
Project 5 columns:
SELECT
  sum(x + N2x + N3x),
  sum(x * N2x - N3x),
  sum(3 * x + 2 * N2x + N3x),
  count(x >= N2x - N3x),
  count(x + N2x = N3x)
FROM json.d500
Case - 10:
SELECT count(
  case
    when x < 1000000 then x/1000000 + 0
    when x < 2000000 then x/2000000 + 1
    when x < 3000000 then x/3000000 + 2
    when x < 4000000 then x/4000000 + 3
    when x < 5000000 then x/5000000 + 4
    ……………...
    else 10 end)
FROM json.d500
Performance
Test                  Project time (secs)   Project time (secs)   Improvement
                      with Java JIT         with Gandiva LLVM
Sum                   3.805                 0.558                 6.8x
Project 5 columns     8.681                 1.689                 5.13x
Project 10 columns    24.923                3.476                 7.74x
CASE-10               4.308                 0.925                 4.66x
CASE-100              1361                  15.187                89.6x
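The Improvement column is presumably the Java JIT time divided by the Gandiva time. A small helper (the name `Speedup` is hypothetical) makes that arithmetic easy to re-check against the table.

```cpp
#include <cassert>
#include <cmath>

// Ratio of the two project times, rounded to two decimal places.
double Speedup(double java_jit_secs, double gandiva_secs) {
  assert(gandiva_secs > 0.0);
  return std::round(java_jit_secs / gandiva_secs * 100.0) / 100.0;
}
```

With the listed times this gives roughly 6.82 for Sum, 5.14 for Project 5 columns, 4.66 for CASE-10 and 89.62 for CASE-100, in line with the table. For Project 10 columns the listed times give about 7.17, whereas the table reports 7.74x.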
Get Involved
• Gandiva
– https://github.com/dremio/gandiva
• Arrow
– dev@arrow.apache.org
– http://arrow.apache.org
– Follow @ApacheArrow, @DremioHQ
• Dremio
– https://community.dremio.com/
– https://github.com/dremio/dremio-oss
