The document discusses Spark's DataFrame API and the Tungsten project. DataFrames make Spark accessible to different users by providing a common API across languages like Python, R and Scala. Tungsten aims to improve Spark's performance for the next five years through techniques like runtime code generation and off-heap memory management. Initial results show Tungsten doubling performance. Together, DataFrames and Tungsten will help Spark scale to larger data and queries across different languages and environments.
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Francisco 2015
1. From DataFrames to Tungsten:
A Peek into Spark’s Future
Reynold Xin @rxin
Spark Summit, San Francisco
June 16th, 2015
5. Google Trends for “dataframe”
Single-node tabular data structure, with API for
relational algebra (filter, join, …)
math and stats
input/output (CSV, JSON, …)
ad infinitum
6. Data frame: lingua franca for “small data”
head(flights)
#> Source: local data frame [6 x 16]
#>
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
7. Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##   eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48
Distributed data frame for Java, Python, R, Scala
APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn
8. Data size
[Figure: data-size axis running KB, MB, GB, TB, PB, with two labeled ranges: "Existing Single-node Data Frames" and "Spark DataFrame"]
9. It is not Spark vs Python/R,
but Spark and Python/R.
13. Spark DataFrame Execution
[Diagram: Python DF, Java/Scala DF, R DF → Logical Plan → Catalyst optimizer → Physical Execution]
Logical plan: intermediate representation for computation
DataFrames: simple wrappers to create logical plan
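The idea of thin language wrappers that only build a logical plan can be sketched in a few lines of plain Python. This is a toy illustration, not Spark's or Catalyst's actual code: the names `ToyFrame`, `Scan`, `Filter`, `Project`, `optimize`, and `execute` are all hypothetical, the single rewrite rule (pushing a filter below a projection) merely stands in for a real optimizer, and the sample rows are made up.

```python
import operator
from dataclasses import dataclass, replace

# Logical plan nodes: a tiny, hypothetical intermediate representation.
@dataclass
class Scan:
    rows: list          # in-memory source rows (dicts)

@dataclass
class Filter:
    child: object
    column: str
    op: str
    value: object

@dataclass
class Project:
    child: object
    columns: tuple

class ToyFrame:
    """Frontend wrapper: each method only adds a node to the plan."""
    def __init__(self, plan):
        self.plan = plan

    def filter(self, column, op, value):
        return ToyFrame(Filter(self.plan, column, op, value))

    def select(self, *columns):
        return ToyFrame(Project(self.plan, columns))

    def collect(self):
        # Nothing runs until here: optimize the plan, then execute it.
        return execute(optimize(self.plan))

def optimize(plan):
    """Stand-in optimizer with one rule: push Filter below Project."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        pushed = Filter(proj.child, plan.column, plan.op, plan.value)
        return Project(optimize(pushed), proj.columns)
    if isinstance(plan, (Filter, Project)):
        return replace(plan, child=optimize(plan.child))
    return plan

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def execute(plan):
    """Naive 'physical execution' over Python lists."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        test = OPS[plan.op]
        return [r for r in execute(plan.child) if test(r[plan.column], plan.value)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

# Made-up sample rows with the same column names as the R example.
rows = [
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.867, "waiting": 48},
]
result = ToyFrame(Scan(rows)).select("eruptions", "waiting").filter("waiting", "<", 50)
print(result.collect())
```

Because the wrapper carries no execution logic of its own, a binding for another language would only need to construct the same plan nodes; the optimizer and executor are shared.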
14. Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
15. Performance
[Bar chart: runtime for an example aggregation workload using the RDD API, Java/Scala vs. Python; axis from 0 to 10]
16. Benefit of Logical Plan: Performance Parity Across Languages
[Bar chart: runtime for an example aggregation workload (secs), axis from 0 to 10; DataFrame bars for Java/Scala, Python, R, and SQL alongside RDD bars for Java/Scala and Python]
22. Tungsten: Preparing Spark for Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
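As a rough illustration of item (1), the toy below contrasts interpreting an expression tree on every row with generating source code for it once and compiling that into a function. It is a sketch of the general technique only: Tungsten targets the JVM, and nothing here reflects its actual implementation. The tuple-based expression format and the names `interpret`, `to_source`, and `generate` are assumptions made for this example.

```python
import operator

# Hypothetical expression tree: ("col", name), ("lit", value), or (op, left, right).
EXPR = ("+", ("*", ("col", "a"), ("lit", 2)), ("col", "b"))

BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(expr, row):
    """Baseline: walk the whole tree again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return BINARY[kind](interpret(expr[1], row), interpret(expr[2], row))

def to_source(expr):
    """Translate the tree into a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in BINARY:
        raise ValueError(f"unsupported operator: {kind!r}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"

def generate(expr):
    """Generate source once, compile it at runtime, return (function, source)."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval")), source

rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
fn, source = generate(EXPR)
print(source)
print([fn(r) for r in rows])
```

The generated function evaluates the expression directly, without the per-row tree walk and dispatch on node type that the interpreter pays.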
23. From DataFrame to Tungsten
[Diagram: Python DF, Java/Scala DF, R DF → Logical Plan → Tungsten Execution]
5PM: Deep Dive into Project Tungsten
Developer Track, by Josh Rosen