From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015

From DataFrames to Tungsten:
A Peek into Spark’s Future
Reynold Xin @rxin
Spark Summit, San Francisco
June 16th, 2015

DataFrame
noun
Making Spark accessible to everyone (data
scientists, engineers, statisticians, …)

Tungsten
noun
Making Spark faster and preparing it for the next
five years.

How do DataFrames and
Tungsten relate to each other?

Google Trends for “dataframe”
Single-node tabular data structure, with API for
relational algebra (filter, join, …)
math and stats
input/output (CSV, JSON, …)
ad infinitum

Data frame: lingua franca for “small data”

head(flights)
#> Source: local data frame [6 x 16]
#>
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...

Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

Distributed data frame for Java, Python, R, Scala
APIs similar to those of single-node tools (pandas, dplyr), i.e. easy to learn

[Figure: data size axis (KB, MB, GB, TB, PB), comparing the range covered by existing single-node data frames with the range covered by the Spark DataFrame]

It is not Spark vs Python/R,
but Spark and Python/R.

Spark and Python/R
Spark DF: scalability (multi-core, multi-machine)
Python/R DF: wealth of libraries (viz, machine learning, stats)

Spark RDD Execution
Java/Scala API -> JVM execution
Python API -> Python execution
Opaque closures (user-defined functions)

Spark DataFrame Execution
DataFrame -> Logical Plan -> Catalyst optimizer -> Physical Execution
Logical plan: intermediate representation for computation

Spark DataFrame Execution
Python DF, Java/Scala DF, R DF -> Logical Plan -> Catalyst optimizer -> Physical Execution
Logical plan: intermediate representation for computation
Language DataFrames: simple wrappers to create the logical plan
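
The flow on the slide above can be sketched in a few lines of plain Python. This is a toy illustration only, not Spark or Catalyst code: the names `Scan`, `Filter`, `Project`, `ToyDataFrame`, `optimize`, and `execute` are hypothetical, and the single rewrite rule (move a filter beneath a projection) is an assumed stand-in for what a real optimizer does. The point is that predicates are stored as data rather than as opaque closures, so the plan can be inspected and rewritten before it runs.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Union

Row = Dict[str, Any]

# Logical plan nodes (hypothetical names, not Spark classes).
@dataclass
class Scan:
    rows: List[Row]

@dataclass
class Filter:
    child: "Plan"
    column: str
    op: str        # one of the keys in OPS
    value: Any

@dataclass
class Project:
    child: "Plan"
    columns: List[str]

Plan = Union[Scan, Filter, Project]
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def optimize(plan: Plan) -> Plan:
    """Toy rule: move each Filter beneath a Project, closer to the scan."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Project):
            pushed = optimize(Filter(child.child, plan.column, plan.op, plan.value))
            return Project(pushed, child.columns)
        return Filter(child, plan.column, plan.op, plan.value)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan

def execute(plan: Plan) -> List[Row]:
    """Toy physical execution: evaluate the plan bottom-up on in-memory rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        test = OPS[plan.op]
        return [r for r in execute(plan.child) if test(r[plan.column], plan.value)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

class ToyDataFrame:
    """Thin frontend wrapper: every method only builds a bigger logical plan."""
    def __init__(self, plan: Plan):
        self.plan = plan

    def select(self, *columns: str) -> "ToyDataFrame":
        return ToyDataFrame(Project(self.plan, list(columns)))

    def filter(self, column: str, op: str, value: Any) -> "ToyDataFrame":
        return ToyDataFrame(Filter(self.plan, column, op, value))

    def collect(self) -> List[Row]:
        return execute(optimize(self.plan))

# Illustrative sample rows, not taken from the slides' data set.
rows = [
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.867, "waiting": 48},
]
df = ToyDataFrame(Scan(rows)).select("waiting").filter("waiting", "<", 50)
optimized = optimize(df.plan)   # Project(Filter(Scan(...))) after the rewrite
result = df.collect()
print(result)  # [{'waiting': 47}, {'waiting': 48}]
```

Because the wrapper class does nothing but construct plan nodes, a frontend in another language would only need to reproduce that thin layer; the optimizer and executor stay shared.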

Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)

Performance
[Figure: runtime for an example aggregation workload with the RDD API, Java/Scala vs. Python; axis from 0 to 10]

Benefit of Logical Plan:
Performance Parity Across Languages
[Figure: runtime for an example aggregation workload (secs), axis from 0 to 10; RDD: Java/Scala, Python; DataFrame: Java/Scala, Python, R, SQL]

Hardware Trends
           2010             2015              Change
Storage    50+MB/s (HDD)    500+MB/s (SSD)    10X
Network    1Gbps            10Gbps            10X
CPU        ~3GHz            ~3GHz             ☹

Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1)  Runtime code generation
(2)  Exploiting cache locality
(3)  Off-heap memory management
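
Item (1) can be illustrated with a small Python analogy. The slides do not show how Tungsten generates code, so this is only a sketch of the general technique under stated assumptions: the expression classes `Col`, `Lit`, `BinOp` and the helpers `interpret`, `generate_source`, and `compile_expr` are hypothetical names. An expression tree is either walked node by node for every row, or turned once into source text and compiled into a single function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Union

Row = Dict[str, Any]

# Expression tree nodes (hypothetical names, for illustration only).
@dataclass
class Col:
    name: str

@dataclass
class Lit:
    value: Any

@dataclass
class BinOp:
    op: str          # one of "+", "*", "<"
    left: "Expr"
    right: "Expr"

Expr = Union[Col, Lit, BinOp]
ALLOWED_OPS = {"+", "*", "<"}

def interpret(expr: Expr, row: Row) -> Any:
    """Tree-walking evaluation: type checks and dispatch for every node, per row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    left = interpret(expr.left, row)
    right = interpret(expr.right, row)
    if expr.op == "+":
        return left + right
    if expr.op == "*":
        return left * right
    if expr.op == "<":
        return left < right
    raise ValueError(f"unsupported operator: {expr.op}")

def _to_source(expr: Expr) -> str:
    """Render the tree as a Python expression string."""
    if isinstance(expr, Col):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if expr.op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {expr.op}")
    return f"({_to_source(expr.left)} {expr.op} {_to_source(expr.right)})"

def generate_source(expr: Expr) -> str:
    """Produce the source of a one-argument function specialised to this tree."""
    return f"lambda row: {_to_source(expr)}"

def compile_expr(expr: Expr) -> Callable[[Row], Any]:
    """Compile the generated source once; the result is reused for every row."""
    code = compile(generate_source(expr), "<generated>", "eval")
    return eval(code, {"__builtins__": {}})

# Example: is the total delay below 30 minutes? (column names from the flights slide)
expr = BinOp("<", BinOp("+", Col("dep_delay"), Col("arr_delay")), Lit(30))
source = generate_source(expr)
fast = compile_expr(expr)

row_a = {"dep_delay": 2, "arr_delay": 11}
row_b = {"dep_delay": 2, "arr_delay": 33}
print(source)
print(fast(row_a), fast(row_b))  # True False
```

The generated function gives the same answers as `interpret` but skips the per-row tree traversal. That is the general idea behind runtime code generation, shown here in miniature.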

From DataFrame to Tungsten
Python DF, Java/Scala DF, R DF -> Logical Plan -> Tungsten Execution

5PM: Deep Dive into Project Tungsten
Developer Track, by Josh Rosen

Initial Performance Results
[Figure: runtime (seconds, axis from 0 to 1200) vs. relative data set size (1x, 2x, 4x, 8x), Tungsten-off vs. Tungsten-on]

Unified API, One Engine, Automatically Optimized
Language frontend: Python, Java/Scala, R, SQL, …
DataFrame
Logical Plan
Tungsten backend: LLVM, JVM, GPU, NVRAM, …

Python, SQL, R, Streaming, Advanced Analytics -> DataFrame -> Tungsten Execution

Spark Office Hours Today
Databricks booth A1
Topic Area
1:00-1:45 Core, YARN, Ops
1:45-2:30 Core/SQL/Data Science
3:00-3:40 Streaming
3:40-4:15 Core, Python, R
4:30-5:15 Machine Learning
5:15-6:00 Matei Zaharia
