SparkSQL: A Compiler from Queries to RDDs

SparkSQL:
A Compiler from Queries to RDDs
Sameer Agarwal
Spark Summit | Boston | February 9th 2017

About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)

Background: What is an RDD?
• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]
3

• Dependencies
• Partitions
4
Opaque Computation

• Dependencies
• Partitions
5
Opaque Data

RDD Programming Model
6
Constructexecution DAG using low level RDD operators.

7

8

SQL/Structured Programming Model
• High-level APIs (SQL, DataFrame/Dataset): Programs
describe what data operations are neededwithout
specifying how to executethese operations
• More efficient: An optimizer can automatically find out
the most efficient plan to executea query
9

10
SQL AST
DataFrame
Dataset
Query Plan
Optimized
Query Plan
RDDs
Transformations
Catalyst
Abstractionsof users’programs
(Trees)
Spark SQL Overview
Tungsten

11
How Catalyst Works: An Overview
SQL AST
DataFrame
Dataset
Query Plan
Optimized
Query Plan
RDDs
Transformations
Catalyst
Abstractions of users’ programs
(Trees)

12
Trees: Abstractions of Users’ Programs
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp

13
SELECT sum(v)
FROM (
SELECT
t1.id,
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Expression
• An expressionrepresentsa
new value, computed based
on input values
• e.g. 1 + 2 + t1.value

14
SELECT sum(v)
FROM (
SELECT
t1.id,
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Query Plan
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000

Logical Plan
• A Logical Plan describescomputation
on datasets without defining how to
conductthe computation
15
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000

Physical Plan
• A Physical Plan describescomputation
on datasets with specific definitions on
how to conductthe computation
16
Parquet Scan
(t1)
JSONScan
(t2)
Sort-Merge
Join
Filter
Project
Hash-
Aggregate
sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000

17
How Catalyst Works: An Overview
SQL AST
DataFrame
Dataset
(Java/Scala)
Query Plan
Optimized
Query Plan
RDDs
Transformations
Catalyst
(Trees)

• A function associated with everytree used to
implement a single rule
Transform
18
Attribute
(t1.value)
Add
Add
Literal(1) Literal(2)
1 + 2 + t1.value
Attribute
(t1.value)
Add
Literal(3)
3+ t1.valueEvaluate 1 + 2 onceEvaluate 1 + 2
for every row

Transform
• A transform is defined as a Partial Function
• Partial Function: A function that is defined for a subset
of its possible arguments
19
val expression: Expression = ...
expression.transform {
case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
Literal(x + y)
}
Case statement determineifthe partialfunction is definedfora given input

Literal(x + y)
}
Transform
20
Attribute
(t1.value)
Add
Add
1 + 2 + t1.value

Literal(x + y)
}
Transform
21
Attribute
(t1.value)
Add
Add
1 + 2 + t1.value

Literal(x + y)
}
Transform
22
Attribute
(t1.value)
Add
Add
1 + 2 + t1.value

Literal(x + y)
}
Transform
23
Attribute
(t1.value)
Add
Add
1 + 2 + t1.value
Attribute
(t1.value)
Add
Literal(3)
3+ t1.value

Combining Multiple Rules
24
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Predicate Pushdown
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t2.id>50*1000
t1.id=t2.id

25
Constant Folding
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t2.id>50*1000
t1.id=t2.id
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id

26
Column Pruning
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Project Project
t1.id
t1.value t2.id

27
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
1+2+t1.value
as v
t1.id=t2.id
t2.id>50*1000
Scan
(t1)
Scan
(t2)
Join
Filter
Project
Aggregate sum(v)
t1.id,
3+t1.value as
v
t2.id>50000
t1.id=t2.id
Project Projectt1.id
t1.value
t2.id
Before transformations
After transformations

28
SQL AST
DataFrame
Dataset
Query Plan
Optimized
Query Plan
RDDs
Transformations
Catalyst
(Trees)
Spark SQL Overview
Tungsten

Scan
Filter
Project
Aggregate
select count(*) from store_sales
where ss_item_sk = 1000

G. Graefe, Volcano— An Extensible and Parallel Query Evaluation System,
In IEEE Transactions on Knowledge and Data Engineering 1994

Volcano Iterator Model
• Standard for 30 years:
almost all databases do it
• Each operator is an
“iterator” that consumes
records from its input
operator
class Filter(
child: Operator,
predicate: (Row => Boolean))
extends Operator {
def next(): Row = {
var current = child.next()
while (current == null ||predicate(current)) {
current = child.next()
}
return current
}
}

Downside of the Volcano Model
1. Too many virtual function calls
o at least 3 calls for each row in Aggregate
2. Extensive memory access
o “row” is a small segment in memory (or in L1/L2/L3 cache)
3. Can’t take advantage of modern CPU features
o SIMD, pipelining, prefetching, branch prediction, ILP, instruction
cache, …

Scan
Filter
Project
Aggregate
long count = 0;
for (ss_item_sk in store_sales) {
if (ss_item_sk == 1000) {
count += 1;
}
}
Whole-stage Codegen: Spark as a “Compiler”

Whole-stage Codegen
• Fusing operators together so the generated code looks like
hand optimized code:
- Identify chains of operators (“stages”)
- Compile each stage into a single function
- Functionality of a general purpose execution engine;
performance as if hand built system just to run your query

T Neumann, Efficiently compiling efficient query plans for modern hardware. InVLDB 2011

Operator Benchmarks: Cost/Row (ns)
5-30x
Speedups

Radix Sort
10-100x
Speedups

Shuffling
still the
bottleneck

10x
Speedup

TPC-DS (Scale Factor 1500, 100 cores)
QueryTime
Query #
Spark 2.0 Spark 1.6
Lower is Better

Spark 2.2 and beyond
1. SPARK-16026: Cost Based Optimizer
- Leverage table/column level statistics to optimize joins and aggregates
- Statistics Collection Framework (Spark 2.1)
- Cost Based Optimizer (Spark 2.2)
2. Boosting Spark’s Performance on Many-Core Machines
- In-memory/ single node shuffle
3. Improving quality of generated code and betterintegration
with the in-memory column format in Spark

SparkSQL: A Compiler from Queries to RDDs

More Related Content

SparkSQL: A Compiler from Queries to RDDs