Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
84 views

Spark SQL - Relational Data Processing in Spark

Uploaded by

Ana Ilie
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
84 views

Spark SQL - Relational Data Processing in Spark

Uploaded by

Ana Ilie
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Spark SQL: Relational Data Processing in Spark

Michael Armbrust† , Reynold S. Xin† , Cheng Lian† , Yin Huai† , Davies Liu† , Joseph K. Bradley† ,
Xiangrui Meng† , Tomer Kaftan‡ , Michael J. Franklin†‡ , Ali Ghodsi† , Matei Zaharia†⇤

Databricks Inc. ⇤
MIT CSAIL ‡
AMPLab, UC Berkeley

ABSTRACT While the popularity of relational systems shows that users often
prefer writing declarative queries, the relational approach is insuf-
Spark SQL is a new module in Apache Spark that integrates rela-
ficient for many big data applications. First, users want to per-
tional processing with Spark’s functional programming API. Built
form ETL to and from various data sources that might be semi-
on our experience with Shark, Spark SQL lets Spark program-
or unstructured, requiring custom code. Second, users want to
mers leverage the benefits of relational processing (e.g., declarative
perform advanced analytics, such as machine learning and graph
queries and optimized storage), and lets SQL users call complex
processing, that are challenging to express in relational systems.
analytics libraries in Spark (e.g., machine learning). Compared to
In practice, we have observed that most data pipelines would ide-
previous systems, Spark SQL makes two main additions. First,
ally be expressed with a combination of both relational queries and
it offers much tighter integration between relational and procedu-
complex procedural algorithms. Unfortunately, these two classes
ral processing, through a declarative DataFrame API that integrates
of systems—relational and procedural—have until now remained
with procedural Spark code. Second, it includes a highly extensible
largely disjoint, forcing users to choose one paradigm or the other.
optimizer, Catalyst, built using features of the Scala programming
language, that makes it easy to add composable rules, control code This paper describes our effort to combine both models in Spark
generation, and define extension points. Using Catalyst, we have SQL, a major new component in Apache Spark [39]. Spark SQL
built a variety of features (e.g., schema inference for JSON, ma- builds on our earlier SQL-on-Spark effort, called Shark. Rather
chine learning types, and query federation to external databases) than forcing users to pick between a relational or a procedural API,
tailored for the complex needs of modern data analysis. We see however, Spark SQL lets users seamlessly intermix the two.
Spark SQL as an evolution of both SQL-on-Spark and of Spark it- Spark SQL bridges the gap between the two models through two
self, offering richer APIs and optimizations while keeping the ben- contributions. First, Spark SQL provides a DataFrame API that
efits of the Spark programming model. can perform relational operations on both external data sources and
Spark’s built-in distributed collections. This API is similar to the
Categories and Subject Descriptors widely used data frame concept in R [32], but evaluates operations
H.2 [Database Management]: Systems lazily so that it can perform relational optimizations. Second, to
support the wide range of data sources and algorithms in big data,
Keywords Spark SQL introduces a novel extensible optimizer called Catalyst.
Catalyst makes it easy to add data sources, optimization rules, and
Databases; Data Warehouse; Machine Learning; Spark; Hadoop
data types for domains such as machine learning.
1 Introduction The DataFrame API offers rich relational/procedural integration
within Spark programs. DataFrames are collections of structured
Big data applications require a mix of processing techniques, data
records that can be manipulated using Spark’s procedural API, or
sources and storage formats. The earliest systems designed for
using new relational APIs that allow richer optimizations. They
these workloads, such as MapReduce, gave users a powerful, but
can be created directly from Spark’s built-in distributed collections
low-level, procedural programming interface. Programming such
of Java/Python objects, enabling relational processing in existing
systems was onerous and required manual optimization by the user
Spark programs. Other Spark components, such as the machine
to achieve high performance. As a result, multiple new systems
learning library, take and produce DataFrames as well. DataFrames
sought to provide a more productive user experience by offering
are more convenient and more efficient than Spark’s procedural
relational interfaces to big data. Systems like Pig, Hive, Dremel
API in many common situations. For example, they make it easy
and Shark [29, 36, 25, 38] all take advantage of declarative queries
to compute multiple aggregates in one pass using a SQL statement,
to provide richer automatic optimizations.
something that is difficult to express in traditional functional APIs.
Permission to make digital or hard copies of all or part of this work for personal or They also automatically store data in a columnar format that is sig-
classroom use is granted without fee provided that copies are not made or distributed nificantly more compact than Java/Python objects. Finally, unlike
for profit or commercial advantage and that copies bear this notice and the full cita-
existing data frame APIs in R and Python, DataFrame operations
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- in Spark SQL go through a relational optimizer, Catalyst.
publish, to post on servers or to redistribute to lists, requires prior specific permission To support a wide variety of data sources and analytics work-
and/or a fee. Request permissions from permissions@acm.org. loads in Spark SQL, we designed an extensible query optimizer
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
called Catalyst. Catalyst uses features of the Scala programming
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2758-9/15/05 ...$15.00.
language, such as pattern-matching, to express composable rules
http://dx.doi.org/10.1145/2723372.2742797. in a Turing-complete language. It offers a general framework for
transforming trees, which we use to perform analysis, planning, and This code creates an RDD of strings called lines by reading an
runtime code generation. Through this framework, Catalyst can HDFS file, then transforms it using filter to obtain another RDD,
also be extended with new data sources, including semi-structured errors. It then performs a count on this data.
data such as JSON and “smart” data stores to which one can push RDDs are fault-tolerant, in that the system can recover lost data
filters (e.g., HBase); with user-defined functions; and with user- using the lineage graph of the RDDs (by rerunning operations such
defined types for domains such as machine learning. Functional as the filter above to rebuild missing partitions). They can also
languages are known to be well-suited for building compilers [37], explicitly be cached in memory or on disk to support iteration [39].
so it is perhaps no surprise that they made it easy to build an extensi-
One final note about the API is that RDDs are evaluated lazily.
ble optimizer. We indeed have found Catalyst effective in enabling
Each RDD represents a “logical plan” to compute a dataset, but
us to quickly add capabilities to Spark SQL, and since its release
Spark waits until certain output operations, such as count, to launch
we have seen external contributors easily add them as well.
a computation. This allows the engine to do some simple query op-
Spark SQL was released in May 2014, and is now one of the
timization, such as pipelining operations. For instance, in the ex-
most actively developed components in Spark. As of this writing,
ample above, Spark will pipeline reading lines from the HDFS file
Apache Spark is the most active open source project for big data
with applying the filter and computing a running count, so that it
processing, with over 400 contributors in the past year. Spark SQL
never needs to materialize the intermediate lines and errors re-
has already been deployed in very large scale environments. For
sults. While such optimization is extremely useful, it is also limited
example, a large Internet company uses Spark SQL to build data
because the engine does not understand the structure of the data in
pipelines and run queries on an 8000-node cluster with over 100
RDDs (which is arbitrary Java/Python objects) or the semantics of
PB of data. Each individual query regularly operates on tens of ter-
user functions (which contain arbitrary code).
abytes. In addition, many users adopt Spark SQL not just for SQL
queries, but in programs that combine it with procedural process- 2.2 Previous Relational Systems on Spark
ing. For example, 2/3 of customers of Databricks Cloud, a hosted
service running Spark, use Spark SQL within other programming Our first effort to build a relational interface on Spark was Shark [38],
languages. Performance-wise, we find that Spark SQL is competi- which modified the Apache Hive system to run on Spark and imple-
tive with SQL-only systems on Hadoop for relational queries. It is mented traditional RDBMS optimizations, such as columnar pro-
also up to 10⇥ faster and more memory-efficient than naive Spark cessing, over the Spark engine. While Shark showed good perfor-
code in computations expressible in SQL. mance and good opportunities for integration with Spark programs,
More generally, we see Spark SQL as an important evolution it had three important challenges. First, Shark could only be used
of the core Spark API. While Spark’s original functional program- to query external data stored in the Hive catalog, and was thus not
ming API was quite general, it offered only limited opportunities useful for relational queries on data inside a Spark program (e.g., on
for automatic optimization. Spark SQL simultaneously makes Spark the errors RDD created manually above). Second, the only way
accessible to more users and improves optimizations for existing to call Shark from Spark programs was to put together a SQL string,
ones. Within Spark, the community is now incorporating Spark which is inconvenient and error-prone to work with in a modular
SQL into more APIs: DataFrames are the standard data representa- program. Finally, the Hive optimizer was tailored for MapReduce
tion in a new “ML pipeline” API for machine learning, and we hope and difficult to extend, making it hard to build new features such as
to expand this to other components, such as GraphX and streaming. data types for machine learning or support for new data sources.
We start this paper with a background on Spark and the goals of
Spark SQL (§2). We then describe the DataFrame API (§3), the 2.3 Goals for Spark SQL
Catalyst optimizer (§4), and advanced features we have built on
Catalyst (§5). We evaluate Spark SQL in §6. We describe external With the experience from Shark, we wanted to extend relational
research built on Catalyst in §7. Finally, §8 covers related work. processing to cover native RDDs in Spark and a much wider range
of data sources. We set the following goals for Spark SQL:
2 Background and Goals 1. Support relational processing both within Spark programs (on
2.1 Spark Overview native RDDs) and on external data sources using a programmer-
friendly API.
Apache Spark is a general-purpose cluster computing engine with
2. Provide high performance using established DBMS techniques.
APIs in Scala, Java and Python and libraries for streaming, graph
processing and machine learning [6]. Released in 2010, it is to our 3. Easily support new data sources, including semi-structured data
knowledge one of the most widely-used systems with a “language- and external databases amenable to query federation.
integrated” API similar to DryadLINQ [20], and the most active 4. Enable extension with advanced analytics algorithms such as
open source project for big data processing. Spark had over 400 graph processing and machine learning.
contributors in 2014, and is packaged by multiple vendors.
Spark offers a functional programming API similar to other re-
cent systems [20, 11], where users manipulate distributed collec-
3 Programming Interface
tions called Resilient Distributed Datasets (RDDs) [39]. Each RDD Spark SQL runs as a library on top of Spark, as shown in Fig-
is a collection of Java or Python objects partitioned across a cluster. ure 1. It exposes SQL interfaces, which can be accessed through
RDDs can be manipulated through operations like map, filter, JDBC/ODBC or through a command-line console, as well as the
and reduce, which take functions in the programming language DataFrame API integrated into Spark’s supported programming lan-
and ship them to nodes on the cluster. For example, the Scala code guages. We start by covering the DataFrame API, which lets users
below counts lines starting with “ERROR” in a text file: intermix procedural and relational code. However, advanced func-
lines = spark . textFile (" hdfs ://...") tions can also be exposed in SQL through UDFs, allowing them to
errors = lines . filter (s => s. contains (" ERROR ")) be invoked, for example, by business intelligence tools. We discuss
println ( errors . count ()) UDFs in Section 3.7.
as well as complex (i.e., non-atomic) data types: structs, arrays,
User Programs
JDBC Console maps and unions. Complex data types can also be nested together
(Java, Scala, Python)
to create more powerful types. Unlike many traditional DBMSes,
Spark SQL provides first-class support for complex data types in
Spark SQL DataFrame API the query language and the API. In addition, Spark SQL also sup-
ports user-defined types, as described in Section 4.4.2.
Catalyst Optimizer Using this type system, we have been able to accurately model
data from a variety of sources and formats, including Hive, rela-
Spark tional databases, JSON, and native objects in Java/Scala/Python.
Resilient Distributed Datasets 3.3 DataFrame Operations
Users can perform relational operations on DataFrames using a
domain-specific language (DSL) similar to R data frames [32] and
Figure 1: Interfaces to Spark SQL, and interaction with Spark. Python Pandas [30]. DataFrames support all common relational
operators, including projection (select), filter (where), join, and
aggregations (groupBy). These operators all take expression ob-
3.1 DataFrame API jects in a limited DSL that lets Spark capture the structure of the
The main abstraction in Spark SQL’s API is a DataFrame, a dis- expression. For example, the following code computes the number
tributed collection of rows with a homogeneous schema. A DataFrame of female employees in each department.
is equivalent to a table in a relational database, and can also be employees
manipulated in similar ways to the “native” distributed collections .join(dept , employees (" deptId ") === dept (" id "))
in Spark (RDDs).1 Unlike RDDs, DataFrames keep track of their . where ( employees (" gender ") === " female ")
schema and support various relational operations that lead to more . groupBy (dept (" id"), dept (" name "))
optimized execution. .agg( count (" name "))
DataFrames can be constructed from tables in a system cata- Here, employees is a DataFrame, and employees("deptId") is
log (based on external data sources) or from existing RDDs of an expression representing the deptId column. Expression ob-
native Java/Python objects (Section 3.5). Once constructed, they jects have many operators that return new expressions, including
can be manipulated with various relational operators, such as where the usual comparison operators (e.g., === for equality test, > for
and groupBy, which take expressions in a domain-specific language greater than) and arithmetic ones (+, -, etc). They also support ag-
(DSL) similar to data frames in R and Python [32, 30]. Each gregates, such as count("name"). All of these operators build up an
DataFrame can also be viewed as an RDD of Row objects, allowing abstract syntax tree (AST) of the expression, which is then passed
users to call procedural Spark APIs such as map.2 to Catalyst for optimization. This is unlike the native Spark API
Finally, unlike traditional data frame APIs, Spark DataFrames that takes functions containing arbitrary Scala/Java/Python code,
are lazy, in that each DataFrame object represents a logical plan to which are then opaque to the runtime engine. For a detailed listing
compute a dataset, but no execution occurs until the user calls a spe- of the API, we refer readers to Spark’s official documentation [6].
cial “output operation” such as save. This enables rich optimization Apart from the relational DSL, DataFrames can be registered as
across all operations that were used to build the DataFrame. temporary tables in the system catalog and queried using SQL. The
To illustrate, the Scala code below defines a DataFrame from a code below shows an example:
table in Hive, derives another based on it, and prints a result:
users . where ( users (" age ") < 21)
ctx = new HiveContext () . registerTempTable (" young ")
users = ctx. table (" users ") ctx.sql (" SELECT count (*) , avg(age) FROM young ")
young = users . where ( users (" age ") < 21)
println ( young . count ()) SQL is sometimes convenient for computing multiple aggregates
concisely, and also allows programs to expose datasets through JD-
In this code, users and young are DataFrames. The snippet BC/ODBC. The DataFrames registered in the catalog are still un-
users("age") < 21 is an expression in the data frame DSL, which materialized views, so that optimizations can happen across SQL
is captured as an abstract syntax tree rather than representing a and the original DataFrame expressions. However, DataFrames can
Scala function as in the traditional Spark API. Finally, each DataFrame also be materialized, as we discuss in Section 3.6.
simply represents a logical plan (i.e., read the users table and filter
3.4 DataFrames versus Relational Query Languages
for age < 21). When the user calls count, which is an output opera-
tion, Spark SQL builds a physical plan to compute the final result. While on the surface, DataFrames provide the same operations as
This might include optimizations such as only scanning the “age” relational query languages like SQL and Pig [29], we found that
column of the data if its storage format is columnar, or even using they can be significantly easier for users to work with thanks to
an index in the data source to count the matching rows. their integration in a full programming language. For example,
We next cover the details of the DataFrame API. users can break up their code into Scala, Java or Python functions
that pass DataFrames between them to build a logical plan, and
3.2 Data Model
will still benefit from optimizations across the whole plan when
Spark SQL uses a nested data model based on Hive [19] for ta- they run an output operation. Likewise, developers can use control
bles and DataFrames. It supports all major SQL data types, includ- structures like if statements and loops to structure their work. One
ing boolean, integer, double, decimal, string, date, and timestamp, user said that the DataFrame API is “concise and declarative like
1 We chose the name DataFrame because it is similar to structured data li- SQL, except I can name intermediate results,” referring to how it is
braries in R and Python, and designed our API to resemble those. easier to structure computations and debug intermediate steps.
2 These Row objects are constructed on the fly and do not necessarily rep- To simplify programming in DataFrames, we also made API an-
resent the internal storage format of the data, which is typically columnar. alyze logical plans eagerly (i.e., to identify whether the column
names used in expressions exist in the underlying tables, and whether of UDFs, without the complicated packaging and registration pro-
their data types are appropriate), even though query results are cess found in other database systems. This feature has proven cru-
computed lazily. Thus, Spark SQL reports an error as soon as user cial for the adoption of the API.
types an invalid line of code instead of waiting until execution. This In Spark SQL, UDFs can be registered inline by passing Scala,
is again easier to work with than a large SQL statement. Java or Python functions, which may use the full Spark API inter-
nally. For example, given a model object for a machine learning
3.5 Querying Native Datasets
model, we could register its prediction function as a UDF:
Real-world pipelines often extract data from heterogeneous sources val model : LogisticRegressionModel = ...
and run a wide variety of algorithms from different programming
libraries. To interoperate with procedural Spark code, Spark SQL ctx.udf. register (" predict ",
allows users to construct DataFrames directly against RDDs of ob- (x: Float , y: Float ) => model . predict ( Vector (x, y)))
jects native to the programming language. Spark SQL can automat- ctx.sql (" SELECT predict (age , weight ) FROM users ")
ically infer the schema of these objects using reflection. In Scala
and Java, the type information is extracted from the language’s type Once registered, the UDF can also be used via the JDBC/ODBC
system (from JavaBeans and Scala case classes). In Python, Spark interface by business intelligence tools. In addition to UDFs that
SQL samples the dataset to perform schema inference due to the operate on scalar values like the one here, one can define UDFs that
dynamic type system. operate on an entire table by taking its name, as in MADLib [12],
For example, the Scala code below defines a DataFrame from an and use the distributed Spark API within them, thus exposing ad-
RDD of User objects. Spark SQL automatically detects the names vanced analytics functions to SQL users. Finally, because UDF
(“name” and “age”) and data types (string and int) of the columns. definitions and query execution are expressed using the same general-
purpose language (e.g., Scala or Python), users can debug or profile
case class User(name: String , age: Int) the entire program using standard tools.
The example above demonstrates a common use case in many
// Create an RDD of User objects
usersRDD = spark . parallelize ( pipelines, i.e., one that employs both relational operators and ad-
List(User (" Alice ", 22) , User (" Bob", 19))) vanced analytics methods that are cumbersome to express in SQL.
The DataFrame API lets developers seamlessly mix these methods.
// View the RDD as a DataFrame
usersDF = usersRDD .toDF 4 Catalyst Optimizer
Internally, Spark SQL creates a logical data scan operator that To implement Spark SQL, we designed a new extensible optimizer,
points to the RDD. This is compiled into a physical operator that Catalyst, based on functional programming constructs in Scala.
accesses fields of the native objects. It is important to note that this Catalyst’s extensible design had two purposes. First, we wanted
is very different from traditional object-relational mapping (ORM). to make it easy to add new optimization techniques and features to
ORMs often incur expensive conversions that translate an entire ob- Spark SQL, especially to tackle various problems we were seeing
ject into a different format. In contrast, Spark SQL accesses the na- specifically with “big data” (e.g., semistructured data and advanced
tive objects in-place, extracting only the fields used in each query. analytics). Second, we wanted to enable external developers to ex-
tend the optimizer—for example, by adding data source specific
The ability to query native datasets lets users run optimized re-
rules that can push filtering or aggregation into external storage
lational operations within existing Spark programs. In addition, it
systems, or support for new data types. Catalyst supports both rule-
makes it simple to combine RDDs with external structured data.
based and cost-based optimization.
For example, we could join the users RDD with a table in Hive:
While extensible optimizers have been proposed in the past, they
views = ctx. table (" pageviews ") have typically required a complex domain specific language to spec-
usersDF .join(views , usersDF (" name ") === views (" user ")) ify rules, and an “optimizer compiler” to translate the rules into exe-
cutable code [17, 16]. This leads to a significant learning curve and
3.6 In-Memory Caching maintenance burden. In contrast, Catalyst uses standard features of
Like Shark before it, Spark SQL can materialize (often referred the Scala programming language, such as pattern-matching [14], to
to as “cache") hot data in memory using columnar storage. Com- let developers use the full programming language while still mak-
pared with Spark’s native cache, which simply stores data as JVM ing rules easy to specify. Functional languages were designed in
objects, the columnar cache can reduce memory footprint by an or- part to build compilers, so we found Scala well-suited to this task.
der of magnitude because it applies columnar compression schemes Nonetheless, Catalyst is, to our knowledge, the first production-
such as dictionary encoding and run-length encoding. Caching is quality query optimizer built on such a language.
particularly useful for interactive queries and for the iterative algo- At its core, Catalyst contains a general library for representing
rithms common in machine learning. It can be invoked by calling trees and applying rules to manipulate them.3 On top of this frame-
cache() on a DataFrame.
work, we have built libraries specific to relational query processing
(e.g., expressions, logical query plans), and several sets of rules
3.7 User-Defined Functions that handle different phases of query execution: analysis, logical
User-defined functions (UDFs) have been an important extension optimization, physical planning, and code generation to compile
point for database systems. For example, MySQL relies on UDFs parts of queries to Java bytecode. For the latter, we use another
to provide basic support for JSON data. A more advanced exam- Scala feature, quasiquotes [34], that makes it easy to generate code
ple is MADLib’s use of UDFs to implement machine learning al- at runtime from composable expressions. Finally, Catalyst offers
gorithms for Postgres and other database systems [12]. However, several public extension points, including external data sources and
database systems often require UDFs to be defined in a separate user-defined types.
programming environment that is different from the primary query 3 Cost-based optimization is performed by generating multiple plans using
interfaces. Spark SQL’s DataFrame API supports inline definition rules, and then computing their costs.
Add Rules (and Scala pattern matching in general) can match multi-
ple patterns in the same transform call, making it very concise to
implement multiple transformations at once:
Attribute(x) Add tree. transform {
case Add( Literal (c1), Literal (c2 )) => Literal (c1+c2)
case Add(left , Literal (0)) => left
case Add( Literal (0) , right ) => right
}
Literal(1) Literal(2)
In practice, rules may need to execute multiple times to fully
transform a tree. Catalyst groups rules into batches, and executes
Figure 2: Catalyst tree for the expression x+(1+2). each batch until it reaches a fixed point, that is, until the tree stops
changing after applying its rules. Running rules to fixed point
means that each rule can be simple and self-contained, and yet
4.1 Trees still eventually have larger global effects on a tree. In the exam-
The main data type in Catalyst is a tree composed of node ob- ple above, repeated application would constant-fold larger trees,
jects. Each node has a node type and zero or more children. New such as (x+0)+(3+3). As another example, a first batch might an-
node types are defined in Scala as subclasses of the TreeNode class. alyze an expression to assign types to all of the attributes, while
These objects are immutable and can be manipulated using func- a second batch might use these types to do constant folding. Af-
tional transformations, as discussed in the next subsection. ter each batch, developers can also run sanity checks on the new
As a simple example, suppose we have the following three node tree (e.g., to see that all attributes were assigned types), often also
classes for a very simple expression language:4 written via recursive matching.
• Literal(value: Int): a constant value Finally, rule conditions and their bodies can contain arbitrary
Scala code. This gives Catalyst more power than domain specific
• Attribute(name: String): an attribute from an input row, e.g., “x” languages for optimizers, while keeping it concise for simple rules.
• Add(left: TreeNode, right: TreeNode): sum of two expres- In our experience, functional transformations on immutable trees
sions. make the whole optimizer very easy to reason about and debug.
These classes can be used to build up trees; for example, the They also enable parallelization in the optimizer, although we do
tree for the expression x+(1+2), shown in Figure 2, would be rep- not yet exploit this.
resented in Scala code as follows: 4.3 Using Catalyst in Spark SQL
Add( Attribute (x), Add( Literal (1) , Literal (2))) We use Catalyst’s general tree transformation framework in four
phases, shown in Figure 3: (1) analyzing a logical plan to resolve
4.2 Rules references, (2) logical plan optimization, (3) physical planning, and
(4) code generation to compile parts of the query to Java bytecode.
Trees can be manipulated using rules, which are functions from a In the physical planning phase, Catalyst may generate multiple
tree to another tree. While a rule can run arbitrary code on its input plans and compare them based on cost. All other phases are purely
tree (given that this tree is just a Scala object), the most common rule-based. Each phase uses different types of tree nodes; Catalyst
approach is to use a set of pattern matching functions that find and includes libraries of nodes for expressions, data types, and logical
replace subtrees with a specific structure. and physical operators. We now describe each of these phases.
Pattern matching is a feature of many functional languages that
4.3.1 Analysis
allows extracting values from potentially nested structures of alge-
braic data types. In Catalyst, trees offer a transform method that Spark SQL begins with a relation to be computed, either from an
applies a pattern matching function recursively on all nodes of the abstract syntax tree (AST) returned by a SQL parser, or from a
tree, transforming the ones that match each pattern to a result. For DataFrame object constructed using the API. In both cases, the re-
example, we could implement a rule that folds Add operations be- lation may contain unresolved attribute references or relations: for
tween constants as follows: example, in the SQL query SELECT col FROM sales, the type of
col, or even whether it is a valid column name, is not known until
tree. transform {
case Add( Literal (c1), Literal (c2 )) => Literal (c1+c2) we look up the table sales. An attribute is called unresolved if we
} do not know its type or have not matched it to an input table (or
an alias). Spark SQL uses Catalyst rules and a Catalog object that
Applying this to the tree for x+(1+2), in Figure 2, would yield tracks the tables in all data sources to resolve these attributes. It
the new tree x+3. The case keyword here is Scala’s standard pattern starts by building an “unresolved logical plan” tree with unbound
matching syntax [14], and can be used to match on the type of an attributes and data types, then applies rules that do the following:
object as well as give names to extracted values (c1 and c2 here). • Looking up relations by name from the catalog.
The pattern matching expression that is passed to transform is a
partial function, meaning that it only needs to match to a subset of • Mapping named attributes, such as col, to the input provided
all possible input trees. Catalyst will tests which parts of a tree a given operator’s children.
given rule applies to, automatically skipping over and descending • Determining which attributes refer to the same value to give
into subtrees that do not match. This ability means that rules only them a unique ID (which later allows optimization of expres-
need to reason about the trees where a given optimization applies sions such as col = col).
and not those that do not match. Thus, rules do not need to be
• Propagating and coercing types through expressions: for exam-
modified as new types of operators are added to the system.
ple, we cannot know the return type of 1 + col until we have
4 We use Scala syntax for classes here, where each class’s fields are defined resolved col and possibly casted its subexpressions to a com-
in parentheses, with their types given using a colon. patible types.
Logical Physical Code
Analysis
Optimization Planning Generation
SQL Query

Cost Model
Physical Selected
Unresolved Optimized Physical
Physical
Logical Plan Plans Physical RDDs
Logical Plan Logical Plan Plans
Plans Plan
DataFrame
Catalog

Figure 3: Phases of query planning in Spark SQL. Rounded rectangles represent Catalyst trees.

In total, the rules for the analyzer are about 1000 lines of code. 4.3.4 Code Generation
4.3.2 Logical Optimization The final phase of query optimization involves generating Java byte-
The logical optimization phase applies standard rule-based opti- code to run on each machine. Because Spark SQL often operates on
mizations to the logical plan. These include constant folding, pred- in-memory datasets, where processing is CPU-bound, we wanted
icate pushdown, projection pruning, null propagation, Boolean ex- to support code generation to speed up execution. Nonetheless,
pression simplification, and other rules. In general, we have found code generation engines are often complicated to build, amounting
it extremely simple to add rules for a wide variety of situations. For essentially to a compiler. Catalyst relies on a special feature of the
example, when we added the fixed-precision DECIMAL type to Spark Scala language, quasiquotes [34], to make code generation simpler.
SQL, we wanted to optimize aggregations such as sums and aver- Quasiquotes allow the programmatic construction of abstract syn-
ages on DECIMALs with small precisions; it took 12 lines of code to tax trees (ASTs) in the Scala language, which can then be fed to the
write a rule that finds such decimals in SUM and AVG expressions, Scala compiler at runtime to generate bytecode. We use Catalyst to
and casts them to unscaled 64-bit LONGs, does the aggregation on transform a tree representing an expression in SQL to an AST for
that, then converts the result back. A simplified version of this rule Scala code to evaluate that expression, and then compile and run
that only optimizes SUM expressions is reproduced below: the generated code.
As a simple example, consider the Add, Attribute and Literal tree
object DecimalAggregates extends Rule[ LogicalPlan ] {
/** Maximum number of decimal digits in a Long */
nodes introduced in Section 4.2, which allowed us to write expres-
val MAX_LONG_DIGITS = 18 sions such as (x+y)+1. Without code generation, such expressions
would have to be interpreted for each row of data, by walking down
def apply(plan: LogicalPlan ): LogicalPlan = { a tree of Add, Attribute and Literal nodes. This introduces large
plan transformAllExpressions {
case Sum(e @ DecimalType . Expression (prec , scale )) amounts of branches and virtual function calls that slow down ex-
if prec + 10 <= MAX_LONG_DIGITS => ecution. With code generation, we can write a function to translate
MakeDecimal (Sum( UnscaledValue (e)), prec + 10, scale ) a specific expression tree to a Scala AST as follows:
}
} def compile (node: Node ): AST = node match {
case Literal ( value ) => q" $value "
As another example, a 12-line rule optimizes LIKE expressions case Attribute (name) => q"row.get( $name )"
with simple regular expressions into String.startsWith or case Add(left , right ) =>
q"${ compile (left )} + ${ compile ( right )}"
String.contains calls. The freedom to use arbitrary Scala code in }
rules made these kinds of optimizations, which go beyond pattern-
matching the structure of a subtree, easy to express. In total, the The strings beginning with q are quasiquotes, meaning that al-
logical optimization rules are 800 lines of code. though they look like strings, they are parsed by the Scala compiler
4.3.3 Physical Planning at compile time and represent ASTs for the code within. Quasiquotes
can have variables or other ASTs spliced into them, indicated using
In the physical planning phase, Spark SQL takes a logical plan and $ notation. For example, Literal(1) would become the Scala AST
generates one or more physical plans, using physical operators that for 1, while Attribute("x") becomes row.get("x"). In the end, a
match the Spark execution engine. It then selects a plan using a tree like Add(Literal(1), Attribute("x")) becomes an AST for
cost model. At the moment, cost-based optimization is only used a Scala expression like 1+row.get("x").
to select join algorithms: for relations that are known to be small, Quasiquotes are type-checked at compile time to ensure that only
Spark SQL uses a broadcast join, using a peer-to-peer broadcast fa- appropriate ASTs or literals are substituted in, making them signif-
cility available in Spark.5 The framework supports broader use of icantly more useable than string concatenation, and they result di-
cost-based optimization, however, as costs can be estimated recur- rectly in a Scala AST instead of running the Scala parser at runtime.
sively for a whole tree using a rule. We thus intend to implement Moreover, they are highly composable, as the code generation rule
richer cost-based optimization in the future. for each node does not need to know how the trees returned by its
The physical planner also performs rule-based physical optimiza- children are constructed. Finally, the resulting code is further opti-
tions, such as pipelining projections or filters into one Spark map mized by the Scala compiler in case there are expression-level opti-
operation. In addition, it can push operations from the logical plan mizations that Catalyst missed. Figure 4 shows that quasiquotes let
into data sources that support predicate or projection pushdown. us generate code with performance similar to hand-tuned programs.
We will describe the API for these data sources in Section 4.4.1. We have found quasiquotes very straightforward to use for code
In total, the physical planning rules are about 500 lines of code. generation, and we observed that even new contributors to Spark
5 Table sizes are estimated if the table is cached in memory or comes from SQL could quickly add rules for new types of expressions. Quasiquotes
an external file, or if it is the result of a subquery with a LIMIT. also work well with our goal of running on native Java objects:
These interfaces allow data sources to implement various degrees
Intepreted of optimization, while still making it easy for developers to add
simple data sources of virtually any type. We and others have used
Hand-written
the interface to implement the following data sources:
Generated • CSV files, which simply scan the whole file, but allow users to
specify a schema.
0 10 20 30 40
• Avro [4], a self-describing binary format for nested data.
Runtime (seconds)
• Parquet [5], a columnar file format for which we support col-
umn pruning as well as filters.
Figure 4: A comparision of the performance evaluating the ex-
• A JDBC data source that scans ranges of a table from an RDBMS
presion x+x+x, where x is an integer, 1 billion times.
in parallel and pushes filters into the RDBMS to minimize com-
munication.
when accessing fields from these objects, we can code-generate a To use these data sources, programmers specify their package
direct access to the required field, instead of having to copy the ob- names in SQL statements, passing key-value pairs for configuration
ject into a Spark SQL Row and use the Row’s accessor methods. options. For example, the Avro data source takes a path to the file:
Finally, it was straightforward to combine code-generated evalua- CREATE TEMPORARY TABLE messages
tion with interpreted evaluation for expressions we do not yet gen- USING com. databricks . spark .avro
erate code for, since the Scala code we compile can directly call OPTIONS (path " messages .avro ")
into our expression interpreter.
All data sources can also expose network locality information,
In total, Catalyst’s code generator is about 700 lines of code.
i.e., which machines each partition of the data is most efficient to
4.4 Extension Points read from. This is exposed through the RDD objects they return, as
RDDs have a built-in API for data locality [39].
Catalyst’s design around composable rules makes it easy for users Finally, similar interfaces exist for writing data to an existing or
and third-party libraries to extend. Developers can add batches of new table. These are simpler because Spark SQL just provides an
rules to each phase of query optimization at runtime, as long as RDD of Row objects to be written.
they adhere to the contract of each phase (e.g., ensuring that anal-
ysis resolves all attributes). However, to make it even simpler to 4.4.2 User-Defined Types (UDTs)
add some types of extensions without understanding Catalyst rules, One feature we wanted to allow advanced analytics in Spark SQL
we have also defined two narrower public extension points: data was user-defined types. For example, machine learning applica-
sources and user-defined types. These still rely on facilities in the tions may need a vector type, and graph algorithms may need types
core engine to interact with the rest of the rest of the optimizer. for representing a graph, which is possible over relational tables [15].
4.4.1 Data Sources Adding new types can be challenging, however, as data types per-
vade all aspects of the execution engine. For example, in Spark
Developers can define a new data source for Spark SQL using sev- SQL, the built-in data types are stored in a columnar, compressed
eral APIs, which expose varying degrees of possible optimization. format for in-memory caching (Section 3.6), and in the data source
All data sources must implement a createRelation function that API from the previous section, we need to expose all possible data
takes a set of key-value parameters and returns a BaseRelation types to data source authors.
object for that relation, if one can be successfully loaded. Each In Catalyst, we solve this issue by mapping user-defined types to
BaseRelation contains a schema and an optional estimated size in structures composed of Catalyst’s built-in types, described in Sec-
bytes.6 For instance, a data source representing MySQL may take tion 3.2. To register a Scala type as a UDT, users provide a mapping
a table name as a parameter, and ask MySQL for an estimate of the from an object of their class to a Catalyst Row of built-in types, and
table size. an inverse mapping back. In user code, they can now use the Scala
To let Spark SQL read the data, a BaseRelation can implement type in objects that they query with Spark SQL, and it will be con-
one of several interfaces that let them expose varying degrees of verted to built-in types under the hood. Likewise, they can register
sophistication. The simplest, TableScan, requires the relation to re- UDFs (see Section 3.7) that operate directly on their type.
turn an RDD of Row objects for all of the data in the table. A more As a simple example, suppose we wanted to register two-dimensional
advanced PrunedScan takes an array of column names to read, and points (x, y) as a UDT. We can represent such vectors as two DOUBLE
should return Rows containing only those columns. A third inter- values. To register the UDT, one would write the following:
face, PrunedFilteredScan, takes both desired column names and
class PointUDT extends UserDefinedType [ Point ] {
an array of Filter objects, which are a subset of Catalyst’s expres- def dataType = StructType (Seq( // Our native structure
sion syntax, allowing predicate pushdown.7 The filters are advi- StructField ("x", DoubleType ),
sory, i.e., the data source should attempt to return only rows pass- StructField ("y", DoubleType )
))
ing each filter, but it is allowed to return false positives in the case def serialize (p: Point ) = Row(p.x, p.y)
of filters that it cannot evaluate. Finally, a CatalystScan interface def deserialize (r: Row) =
is given a complete sequence of Catalyst expression trees to use in Point(r. getDouble (0), r. getDouble (1))
}
predicate pushdown, though they are again advisory.
6 Unstructured After registering this type, Points will be recognized within na-
data sources can also take a desired schema as a parameter;
for example, there is a CSV file data source that lets users specify column
tive objects that Spark SQL is asked to convert to DataFrames, and
names and types. will be passed to UDFs defined on Points. In addition, Spark SQL
7 At the moment, Filters include equality, comparisons against a constant, will store Points in a columnar format when caching data (com-
and IN clauses, each on one attribute. pressing x and y as separate columns), and Points will be writable
{ 18, 27], but simpler because it only infers a static tree structure,
"text ": "This is a tweet about #Spark", without allowing recursive nesting of elements at arbitrary depths.
"tags ": ["# Spark "],
"loc ": {" lat ": 45.1 , "long ": 90}
Specifically, the algorithm attempts to infer a tree of STRUCT types,
} each of which may contain atoms, arrays, or other STRUCTs. For
each field defined by a distinct path from the root JSON object
{
"text ": "This is another tweet",
(e.g., tweet.loc.latitude), the algorithm finds the most specific
"tags ": [], Spark SQL data type that matches observed instances of the field.
"loc ": {" lat ": 39, "long ": 88.5} For example, if all occurrences of that field are integers that fit into
} 32 bits, it will infer INT; if they are larger, it will use LONG (64-bit)
{ or DECIMAL (arbitrary precision); if there are also fractional values,
"text ": "A # tweet without # location ", it will use FLOAT. For fields that display multiple types, Spark SQL
"tags ": ["# tweet ", "# location "] uses STRING as the most generic type, preserving the original JSON
}
representation. And for fields that contain arrays, it uses the same
“most specific supertype" logic to determine an element type from
Figure 5: A sample set of JSON records, representing tweets. all the observed elements. We implement this algorithm using a
single reduce operation over the data, which starts with schemata
(i.e., trees of types) from each individual record and merges them
text STRING NOT NULL ,
tags ARRAY < STRING NOT NULL > NOT NULL , using an associative “most specific supertype" function that gen-
loc STRUCT <lat FLOAT NOT NULL , long FLOAT NOT NULL > eralizes the types of each field. This makes the algorithm both
single-pass and communication-efficient, as a high degree of re-
duction happens locally on each node.
Figure 6: Schema inferred for the tweets in Figure 5. As a short example, note how in Figures 5 and 6, the algorithm
generalized the types of loc.lat and loc.long. Each field appears
as an integer in one record and a floating-point number in another,
to all of Spark SQL’s data sources, which will see them as pairs of
so the algorithm returns FLOAT. Note also how for the tags field,
DOUBLEs. We use this capability in Spark’s machine learning library,
the algorithm inferred an array of strings that cannot be null.
as we describe in Section 5.2.
In practice, we have found this algorithm to work well with real-
5 Advanced Analytics Features world JSON datasets. For example, it correctly identifies a usable
schema for JSON tweets from Twitter’s firehose, which contain
In this section, we describe three features we added to Spark SQL
around 100 distinct fields and a high degree of nesting. Multiple
specifically to handle challenges in “big data" environments. First,
Databricks customers have also successfully applied it to their in-
in these environments, data is often unstructured or semistructured.
ternal JSON formats.
While parsing such data procedurally is possible, it leads to lengthy
boilerplate code. To let users query the data right away, Spark In Spark SQL, we also use the same algorithm for inferring schemas
SQL includes a schema inference algorithm for JSON and other of RDDs of Python objects (see Section 3), as Python is not stat-
semistructured data. Second, large-scale processing often goes be- ically typed so an RDD can contain multiple object types. In the
yond aggregation and joins to machine learning on the data. We de- future, we plan to add similar inference for CSV files and XML.
scribe how Spark SQL is being incorporated into a new high-level Developers have found the ability to view these types of datasets
API for Spark’s machine learning library [26]. Last, data pipelines as tables and immediately query them or join them with other data
often combine data from disparate storage systems. Building on extremely valuable for their productivity.
the data sources API in Section 4.4.1, Spark SQL supports query 5.2 Integration with Spark’s Machine Learning Library
federation, allowing a single program to efficiently query disparate
sources. These features all build on the Catalyst framework. As an example of Spark SQL’s utility in other Spark modules, ML-
lib, Spark’s machine learning library, introduced a new high-level
5.1 Schema Inference for Semistructured Data API that uses DataFrames [26]. This new API is based on the con-
Semistructured data is common in large-scale environments be- cept of machine learning pipelines, an abstraction in other high-
cause it is easy to produce and to add fields to over time. Among level ML libraries like SciKit-Learn [33]. A pipeline is a graph
Spark users, we have seen very high usage of JSON for input data. of transformations on data, such as feature extraction, normaliza-
Unfortunately, JSON is cumbersome to work with in a procedu- tion, dimensionality reduction, and model training, each of which
ral environment like Spark or MapReduce: most users resorted to exchange datasets. Pipelines are a useful abstraction because ML
ORM-like libraries (e.g., Jackson [21]) to map JSON structures to workflows have many steps; representing these steps as compos-
Java objects, or some tried parsing each input record directly with able elements makes it easy to change parts of the pipeline or to
lower-level libraries. search for tuning parameters at the level of the whole workflow.
In Spark SQL, we added a JSON data source that automatically To exchange data between pipeline stages, MLlib’s developers
infers a schema from a set of records. For example, given the JSON needed a format that was compact (as datasets can be large) yet
objects in Figure 5, the library infers the schema shown in Figure 6. flexible, allowing multiple types of fields to be stored for each
Users can simply register a JSON file as a table and query it with record. For example, a user may start with records that contain
syntax that accesses fields by their path, such as: text fields as well as numeric ones, then run a featurization algo-
rithm such as TF-IDF on the text to turn it into a vector, normalize
SELECT loc.lat , loc.long FROM tweets
WHERE text LIKE ’% Spark %’ AND tags IS NOT NULL
one of the other fields, perform dimensionality reduction on the
whole set of features, etc. To represent datasets, the new API uses
Our schema inference algorithm works in one pass over the data, DataFrames, where each column represents a feature of the data.
and can also be run on a sample of the data if desired. It is related to All algorithms that can be called in pipelines take a name for the
prior work on schema inference for XML and object databases [9, input column(s) and output column(s), and can thus be called on
model define it. The JDBC data source will also push the filter predicate
down into MySQL to reduce the amount of data transferred.
tokenizer tf lr CREATE TEMPORARY TABLE users USING jdbc
OPTIONS ( driver " mysql " url "jdbc: mysql :// userDB /users ")

(text, label) (text, label, (text, label, CREATE TEMPORARY TABLE logs
words) words, features) USING json OPTIONS (path "logs.json ")

data = <DataFrame of (text , label ) records > SELECT users .id , users .name , logs. message
FROM users JOIN logs WHERE users .id = logs. userId
tokenizer = Tokenizer () AND users . registrationDate > "2015 -01 -01"
. setInputCol (" text "). setOutputCol (" words ")
tf = HashingTF () Under the hood, the JDBC data source uses the PrunedFiltered-
. setInputCol (" words "). setOutputCol (" features ") Scan interface in Section 4.4.1, which gives it both the names of the
lr = LogisticRegression ()
. setInputCol (" features ")
columns requested and simple predicates (equality, comparison and
IN clauses) on these columns. In this case, the JDBC data source
pipeline = Pipeline (). setStages ([ tokenizer , tf , lr ]) will run the following query on MySQL:8
model = pipeline .fit(data)
SELECT users .id , users .name FROM users
WHERE users . registrationDate > "2015 -01 -01"
Figure 7: A short MLlib pipeline and the Python code to run it.
We start with a DataFrame of (text, label) records, tokenize the In future Spark SQL releases, we are also looking to add predi-
text into words, run a term frequency featurizer (HashingTF) to cate pushdown for key-value stores such as HBase and Cassandra,
get a feature vector, then train logistic regression. which support limited forms of filtering.

6 Evaluation
any subset of the fields and produce new ones. This makes it easy We evaluate the performance of Spark SQL on two dimensions:
for developers to build complex pipelines while retaining the orig- SQL query processing performance and Spark program performance.
inal data for each record. To illustrate the API, Figure 7 shows a In particular, we demonstrate that Spark SQL’s extensible archi-
short pipeline and the schemas of DataFrames created. tecture not only enables a richer set of functionalities, but brings
The main piece of work MLlib had to do to use Spark SQL was substantial performance improvements over previous Spark-based
to create a user-defined type for vectors. This vector UDT can store SQL engines. In addition, for Spark application developers, the
both sparse and dense vectors, and represents them as four primi- DataFrame API can bring substantial speedups over the native Spark
tive fields: a boolean for the type (dense or sparse), a size for the API while making Spark programs more concise and easier to un-
vector, an array of indices (for sparse coordinates), and an array derstand. Finally, applications that combine relational and proce-
of double values (either the non-zero coordinates for sparse vec- dural queries run faster on the integrated Spark SQL engine than
tors or all coordinates otherwise). Apart from DataFrames’ utility by running SQL and procedural code as separate parallel jobs.
for tracking and manipulating columns, we also found them use-
ful for another reason: they made it much easier to expose MLlib’s 6.1 SQL Performance
new API in all of Spark’s supported programming languages. Pre- We compared the performance of Spark SQL against Shark and Im-
viously, each algorithm in MLlib took objects for domain-specific pala [23] using the AMPLab big data benchmark [3], which uses a
concepts (e.g., a labeled point for classification, or a (user, prod- web analytics workload developed by Pavlo et al. [31]. The bench-
uct) rating for recommendation), and each of these classes had to mark contains four types of queries with different parameters per-
be implemented in the various languages (e.g., copied from Scala forming scans, aggregation, joins and a UDF-based MapReduce
to Python). Using DataFrames everywhere made it much simpler job. We used a cluster of six EC2 i2.xlarge machines (one mas-
to expose all algorithms in all languages, as we only need data con- ter, five workers) each with 4 cores, 30 GB memory and an 800
versions in Spark SQL, where they already exist. This is especially GB SSD, running HDFS 2.4, Spark 1.3, Shark 0.9.1 and Impala
important as Spark adds bindings for new programming languages. 2.1.1. The dataset was 110 GB of data after compression using the
Finally, using DataFrames for storage in MLlib also makes it columnar Parquet format [5].
very easy to expose all its algorithms in SQL. We can simply de- Figure 8 shows the results for each query, grouping by the query
fine a MADlib-style UDF, as described in Section 3.7, which will type. Queries 1–3 have different parameters varying their selectiv-
internally call the algorithm on a table. We are also exploring APIs ity, with 1a, 2a, etc being the most selective and 1c, 2c, etc being the
to expose pipeline construction in SQL. least selective and processing more data. Query 4 uses a Python-
5.3 Query Federation to External Databases based Hive UDF that was not directly supported in Impala, but was
largely bound by the CPU cost of the UDF.
Data pipelines often combine data from heterogeneous sources. For
We see that in all queries, Spark SQL is substantially faster than
example, a recommendation pipeline might combine traffic logs
Shark and generally competitive with Impala. The main reason
with a user profile database and users’ social media streams. As
for the difference with Shark is code generation in Catalyst (Sec-
these data sources often reside in different machines or geographic
tion 4.3.4), which reduces CPU overhead. This feature makes Spark
locations, naively querying them can be prohibitively expensive.
SQL competitive with the C++ and LLVM based Impala engine in
Spark SQL data sources leverage Catalyst to push predicates down
many of these queries. The largest gap from Impala is in query 3a
into the data sources whenever possible.
where Impala chooses a better join plan because the selectivity of
For example, the following uses the JDBC data source and the the queries makes one of the tables very small.
JSON data source to join two tables together to find the traffic
log for the most recently registered users. Conveniently, both data 8 The JDBC data source also supports “sharding” a source table by a partic-
sources can automatically infer the schema without users having to ular column and reading different ranges of it in parallel.
40 450 700 800
35 400 600
30 350 600
500
300
Runtime (s)

Runtime (s)

Runtime (s)

Runtime (s)
25
250 400
20 400
200 300
15 150
10 200 200
100
5 50 100
0 0 0 0
1a 1b 1c 2a 2b 2c 3a 3b 3c
Query 1 (Scan) Query 2 (Aggregation) Query 3 (Join) Query 4 (UDF)
Shark Impala Spark SQL Shark Impala Spark SQL Shark Impala Spark SQL Shark Spark SQL

Figure 8: Performance of Shark, Impala and Spark SQL on the big data benchmark queries [31].

6.2 DataFrames vs. Native Spark Code


Python API
In addition to running SQL queries, Spark SQL can also help non-
SQL developers write simpler and more efficient Spark code through Scala API
the DataFrame API. Catalyst can perform optimizations on DataFrame
operations that are hard to do with hand written code, such as pred- DataFrame
icate pushdown, pipelining, and automatic join selection. Even
without these optimizations, the DataFrame API can result in more 0 50 100 150 200
efficient execution due to code generation. This is especially true Runtime (seconds)
for Python applications, as Python is typically slower than the JVM.
For this evaluation, we compared two implementations of a Spark Figure 9: Performance of an aggregation written using the na-
program that does a distributed aggregation. The dataset consists of tive Spark Python and Scala APIs versus the DataFrame API.
1 billion integer pairs, (a, b) with 100,000 distinct values of a, on
the same five-worker i2.xlarge cluster as in the previous section.
We measure the time taken to compute the average of b for each
value of a. First, we look at a version that computes the average SQL + Spark
using the map and reduce functions in the Python API for Spark:
filter
sum_and_count = \ DataFrame word count
data.map( lambda x: (x.a, (x.b, 1))) \
. reduceByKey ( lambda x, y: (x[0]+y[0], x[1]+y [1])) \
. collect () 0 200 400 600 800 1000
[(x[0] , x [1][0] / x [1][1]) for x in sum_and_count ]
Runtime (seconds)
In contrast, the same program can written as a simple manipula-
tion using the DataFrame API: Figure 10: Performance of a two-stage pipeline written as a
df. groupBy ("a"). avg ("b") separate Spark SQL query and Spark job (above) and an inte-
grated DataFrame job (below).
Figure 9, shows that the DataFrame version of the code outper-
forms the hand written Python version by 12⇥, in addition to being
much more concise. This is because in the DataFrame API, only In this experiment, we generated a synthetic dataset of 10 billion
the logical plan is constructed in Python, and all physical execution messages in HDFS. Each message contained on average 10 words
is compiled down into native Spark code as JVM bytecode, result- drawn from an English dictionary. The first stage of the pipeline
ing in more efficient execution. In fact, the DataFrame version also uses a relational filter to select roughly 90% of the messages. The
outperforms a Scala version of the Spark code above by 2⇥. This second stage computes the word count.
is mainly due to code generation: the code in the DataFrame ver- First, we implemented the pipeline using a separate SQL query
sion avoids expensive allocation of key-value pairs that occurs in followed by a Scala-based Spark job, as might occur in environ-
hand-written Scala code. ments that run separate relational and procedural engines (e.g., Hive
and Spark). We then implemented a combined pipeline using the
6.3 Pipeline Performance
DataFrame API, i.e., using DataFrame’s relational operators to per-
The DataFrame API can also improve performance in applications form the filter, and using the RDD API to perform a word count on
that combine relational and procedural processing, by letting de- the result. Compared with the first pipeline, the second pipeline
velopers write all operations in a single program and pipelining avoids the cost of saving the whole result of the SQL query to
computation across relational and procedural code. As a simple ex- an HDFS file as an intermediate dataset before passing it into the
ample, we consider a two-stage pipeline that selects a subset of text Spark job, because SparkSQL pipelines the map for the word count
messages from a corpus and computes the most frequent words. with the relational operators for the filtering. Figure 10 compares
Although very simple, this can model some real-world pipelines, the runtime performance of the two approaches. In addition to be-
e.g., computing the most popular words used in tweets by a spe- ing easier to understand and operate, the DataFrame-based pipeline
cific demographic. also improves performance by 2⇥.
7 Research Applications way using constructs in the host programming language (see Sec-
In addition to the immediately practical production use cases of tion 3.4). It also allows running relational queries directly on native
Spark SQL, we have also seen significant interest from researchers RDDs, and supports a wide range of data sources beyond Hive.
working on more experimental projects. We outline two research One system that inspired Spark SQL’s design was DryadLINQ [20],
projects that leverage the extensibility of Catalyst: one in approxi- which compiles language-integrated queries in C# to a distributed
mate query processing and one in genomics. DAG execution engine. LINQ queries are also relational but can
operate directly on C# objects. Spark SQL goes beyond DryadLINQ
7.1 Generalized Online Aggregation by also providing a DataFrame interface similar to common data
Zeng et al. have used Catalyst in their work on improving the gen- science libraries [32, 30], an API for data sources and types, and
erality of online aggregation [40]. This work generalizes the exe- support for iterative algorithms through execution on Spark.
cution of online aggregation to support arbitrarily nested aggregate Other systems use only a relational data model internally and rel-
queries. It allows users to view the progress of executing queries egate procedural code to UDFs. For example, Hive and Pig [36, 29]
by seeing results computed over a fraction of the total data. These offer relational query languages but have widely used UDF inter-
partial results also include accuracy measures, letting the user stop faces. ASTERIX [8] has a semi-structured data model internally.
the query when sufficient accuracy has been reached. Stratosphere [2] also has a semi-structured model, but offers APIs
In order to implement this system inside of Spark SQL, the au- in Scala and Java that let users easily call UDFs. PIQL [7] likewise
thors add a new operator to represent a relation that has been broken provides a Scala DSL. Compared to these systems, Spark SQL in-
up into sampled batches. During query planning a call to transform tegrates more closely with native Spark applications by being able
is used to replace the original full query with several queries, each to directly query data in user-defined classes (native Java/Python
of which operates on a successive sample of the data. objects), and lets developers mix procedural and relational APIs
However, simply replacing the full dataset with samples is not in the same language. In addition, through the Catalyst optimizer,
sufficient to compute the correct answer in an online fashion. Op- Spark SQL implements both optimizations (e.g., code generation)
erations such as standard aggregation must be replaced with stateful and other functionality (e.g., schema inference for JSON and ma-
counterparts that take into account both the current sample and the chine learning data types) that are not present in most large-scale
results of previous batches. Furthermore, operations that might fil- computing frameworks. We believe that these features are essential
ter out tuples based on approximate answers must be replaced with to offering an integrated, easy-to-use environment for big data.
versions that can take into account the current estimated errors. Finally, data frame APIs have been built both for single ma-
Each of these transformations can be expressed as Catalyst rules chines [32, 30] and clusters [13, 10]. Unlike previous APIs, Spark
that modify the operator tree until it produces correct online an- SQL optimizes DataFrame computations with a relational optimizer.
swers. Tree fragments that are not based on sampled data are ig-
Extensible Optimizers The Catalyst optimizer shares similar goals
nored by these rules and can execute using the standard code path.
with extensible optimizer frameworks such as EXODUS [17] and
By using Spark SQL as a basis, the authors were able to implement
Cascades [16]. Traditionally, however, optimizer frameworks have
a fairly complete prototype in approximately 2000 lines of code.
required a domain-specific language to write rules in, as well as
7.2 Computational Genomics an “optimizer compiler” to translate them to runnable code. Our
A common operation in computational genomics involves inspect- major improvement here is to build our optimizer using standard
ing overlapping regions based on a numerical offsets. This problem features of a functional programming language, which provide the
can be represented as a join with inequality predicates. Consider same (and often greater) expressivity while decreasing the main-
two datasets, a and b, with a schema of (start LONG, end LONG). tenance burden and learning curve. Advanced language features
The range join operation can be expressed in SQL as follows: helped with many areas of Catalyst—for example, our approach to
code generation using quasiquotes (Section 4.3.4) is one of the sim-
SELECT * FROM a JOIN b plest and most composable approaches to this task that we know.
WHERE a. start < a.end
AND b. start < b.end
While extensibility is hard to measure quantitatively, one promis-
AND a. start < b. start ing indication is that Spark SQL had over 50 external contributors
AND b. start < a.end in the first 8 months after its release.
For code generation, LegoBase [22] recently proposed an ap-
Without special optimization, the preceding query would be ex- proach using generative programming in Scala, which would be
ecuted by many systems using an inefficient algorithm such as a possible to use instead of quasiquotes in Catalyst.
nested loop join. In contrast, a specialized system could compute
the answer to this join using an interval tree. Researchers in the Advanced Analytics Spark SQL builds on recent work to run ad-
ADAM project [28] were able to build a special planning rule into vanced analytics algorithms on large clusters, including platforms
a version of Spark SQL to perform such computations efficiently, for iterative algorithms [39] and graph analytics [15, 24]. The de-
allowing them to leverage the standard data manipulation abilities sire to expose analytics functions is also shared with MADlib [12],
alongside specialized processing code. The changes required were though the approach there is different, as MADlib had to use the
approximately 100 lines of code. limited interface of Postgres UDFs, while Spark SQL’s UDFs can
be full-fledged Spark programs. Finally, techniques including Sinew
8 Related Work and Invisible Loading [35, 1] have sought to provide and optimize
Programming Model Several systems have sought to combine re- queries over semi-structured data such as JSON. We hope to apply
lational processing with the procedural processing engines initially some of these techniques in our JSON data source.
used for large clusters. Of these, Shark [38] is the closest to Spark
SQL, running on the same engine and offering the same combi-
9 Conclusion
nation of relational queries and advanced analytics. Spark SQL We have presented Spark SQL, a new module in Apache Spark
improves on Shark through a richer and more programmer-friendly providing rich integration with relational processing. Spark SQL
API, DataFrames, where queries can be combined in a modular extends Spark with a declarative DataFrame API to allow rela-
tional processing, offering benefits such as automatic optimization, [16] G. Graefe. The Cascades framework for query optimization.
and letting users write complex pipelines that mix relational and IEEE Data Engineering Bulletin, 18(3), 1995.
complex analytics. It supports a wide range of features tailored [17] G. Graefe and D. DeWitt. The EXODUS optimizer
to large-scale data analysis, including semi-structured data, query generator. In SIGMOD, 1987.
federation, and data types for machine learning. To enable these [18] J. Hegewald, F. Naumann, and M. Weis. XStruct: efficient
features, Spark SQL is based on an extensible optimizer called Cat- schema extraction from multiple and large XML documents.
alyst that makes it easy to add optimization rules, data sources and In ICDE Workshops, 2006.
data types by embedding into the Scala programming language. [19] Hive data definition language.
User feedback and benchmarks show that Spark SQL makes it sig- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
nificantly simpler and more efficient to write data pipelines that [20] M. Isard and Y. Yu. Distributed data-parallel computing
mix relational and procedural processing, while offering substan- using a high-level programming language. In SIGMOD,
tial speedups over previous SQL-on-Spark engines. 2009.
Spark SQL is open source at http://spark.apache.org. [21] Jackson JSON processor. http://jackson.codehaus.org.
10 Acknowledgments [22] Y. Klonatos, C. Koch, T. Rompf, and H. Chafi. Building
efficient query engines in a high-level language. PVLDB,
We would like to thank Cheng Hao, Tayuka Ueshin, Tor Myklebust, 7(10):853–864, 2014.
Daoyuan Wang, and the rest of the Spark SQL contributors so far. [23] M. Kornacker et al. Impala: A modern, open-source SQL
We would also like to thank John Cieslewicz and the other mem- engine for Hadoop. In CIDR, 2015.
bers of the F1 team at Google for early discussions on the Catalyst
[24] Y. Low et al. Distributed GraphLab: a framework for
optimizer. The work of authors Franklin and Kaftan was supported
machine learning and data mining in the cloud. VLDB, 2012.
in part by: NSF CISE Expeditions Award CCF-1139158, LBNL
[25] S. Melnik et al. Dremel: interactive analysis of web-scale
Award 7076018, and DARPA XData Award FA8750-12-2-0331,
datasets. Proc. VLDB Endow., 3:330–339, Sept 2010.
and gifts from Amazon Web Services, Google, SAP, The Thomas
and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc., Blue [26] X. Meng, J. Bradley, E. Sparks, and S. Venkataraman. ML
Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, pipelines: a new high-level API for MLlib.
Facebook, Guavus, Huawei, Informatica, Intel, Microsoft, NetApp, https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-
Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware. high-level-api-for-mllib.html.
[27] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting
11 References schema from semistructured data. In ICDM, 1998.
[28] F. A. Nothaft, M. Massie, T. Danford, Z. Zhang, U. Laserson,
[1] A. Abouzied, D. J. Abadi, and A. Silberschatz. Invisible C. Yeksigian, J. Kottalam, A. Ahuja, J. Hammerbacher,
loading: Access-driven data transfer from raw files into M. Linderman, M. J. Franklin, A. D. Joseph, and D. A.
database systems. In EDBT, 2013. Patterson. Rethinking data-intensive science using scalable
[2] A. Alexandrov et al. The Stratosphere platform for big data analytics systems. In SIGMOD, 2015.
analytics. The VLDB Journal, 23(6):939–964, Dec. 2014. [29] C. Olston, B. Reed, U. Srivastava, R. Kumar, and
[3] AMPLab big data benchmark. A. Tomkins. Pig Latin: a not-so-foreign language for data
https://amplab.cs.berkeley.edu/benchmark. processing. In SIGMOD, 2008.
[4] Apache Avro project. http://avro.apache.org. [30] pandas Python data analysis library.
[5] Apache Parquet project. http://parquet.incubator.apache.org. http://pandas.pydata.org.
[6] Apache Spark project. http://spark.apache.org. [31] A. Pavlo et al. A comparison of approaches to large-scale
[7] M. Armbrust, N. Lanham, S. Tu, A. Fox, M. J. Franklin, and data analysis. In SIGMOD, 2009.
D. A. Patterson. The case for PIQL: a performance insightful [32] R project for statistical computing. http://www.r-project.org.
query language. In SOCC, 2010. [33] scikit-learn: machine learning in Python.
[8] A. Behm et al. Asterix: towards a scalable, semistructured http://scikit-learn.org.
data platform for evolving-world models. Distributed and [34] D. Shabalin, E. Burmako, and M. Odersky. Quasiquotes for
Parallel Databases, 29(3):185–216, 2011. Scala, a technical report. Technical Report 185242, École
[9] G. J. Bex, F. Neven, and S. Vansummeren. Inferring XML Polytechnique Fédérale de Lausanne, 2013.
schema definitions from XML data. In VLDB, 2007. [35] D. Tahara, T. Diamond, and D. J. Abadi. Sinew: A SQL
[10] BigDF project. https://github.com/AyasdiOpenSource/bigdf. system for multi-structured data. In SIGMOD, 2014.
[11] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, [36] A. Thusoo et al. Hive–a petabyte scale data warehouse using
R. Bradshaw, and N. Weizenbaum. FlumeJava: Easy, Hadoop. In ICDE, 2010.
efficient data-parallel pipelines. In PLDI, 2010. [37] P. Wadler. Monads for functional programming. In Advanced
[12] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and Functional Programming, pages 24–52. Springer, 1995.
C. Welton. MAD skills: new analysis practices for big data. [38] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker,
VLDB, 2009. and I. Stoica. Shark: SQL and rich analytics at scale. In
[13] DDF project. http://ddf.io. SIGMOD, 2013.
[14] B. Emir, M. Odersky, and J. Williams. Matching objects with [39] M. Zaharia et al. Resilient distributed datasets: a
patterns. In ECOOP 2007 – Object-Oriented Programming, fault-tolerant abstraction for in-memory cluster computing.
volume 4609 of LNCS, pages 273–298. Springer, 2007. In NSDI, 2012.
[15] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. [40] K. Zeng et al. G-OLA: Generalized online aggregation for
Franklin, and I. Stoica. GraphX: Graph processing in a interactive analysis on big data. In SIGMOD, 2015.
distributed dataflow framework. In OSDI, 2014.

You might also like