
CS 744: SPARK SQL

Shivaram Venkataraman
Fall 2019
ADMINISTRIVIA
- Assignment 2 grades this week
- Midterm details on Piazza
- Course Project Proposal comments
Applications: Machine Learning, SQL, Streaming, Graph
Computational Engines
Scalable Storage Systems
Resource Management
Datacenter Architecture
SQL: STRUCTURED QUERY LANGUAGE
DATABASE SYSTEMS
SQL IN BIG DATA SYSTEMS
- Scale: How do we handle large datasets and clusters?

- Wide-area: How do we handle queries across datacenters?


SPARK SQL: ARCHITECTURE
DATAFRAME
Motivation: Understanding the structure of data

lines = sc.textFile("users")
csv = lines.map(x => x.split(','))
young = csv.filter(x => x(1).toInt < 21)   // age column is a string; parse before comparing
println(young.count())
PROCEDURAL VS. RELATIONAL

Procedural (RDD API):
lines = sc.textFile("users")
csv = lines.map(x => x.split(','))
young = csv.filter(x => x(1).toInt < 21)
println(young.count())

Relational (DataFrame API):
ctx = new HiveContext()
users = ctx.table("users")
young = users.where(users("age") < 21)
println(young.count())
OPERATORS → EXPRESSIONS
Projection (select), Filter, Join, Aggregations take in Expressions

employees.join(dept,
  employees("deptId") === dept("id"))

Build up Abstract Syntax Tree (AST)
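
A minimal sketch of what building the AST means in practice: the === below does not compare values, it constructs a Catalyst expression node (col is Spark's standalone column constructor; the column names are illustrative).

import org.apache.spark.sql.functions.col

// === returns a Column wrapping an expression tree, evaluated later by Catalyst.
val cond = col("deptId") === col("id")
println(cond.expr)   // the underlying Catalyst node: an equality over two attributes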


OTHER FEATURES
1. Debugging: Eager analysis of logical plans

2. Interoperability: Convert RDDs to DataFrames (see the sketch below)
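
A minimal interoperability sketch, assuming the sc and ctx from the earlier slides; the User case class is illustrative:

case class User(name: String, age: Int)

val rdd = sc.parallelize(Seq(User("ann", 19), User("bob", 30)))
val users = ctx.createDataFrame(rdd)   // schema inferred by reflection over the case class
users.where(users("age") < 21).show()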


OTHER FEATURES
3. Caching: Columnar caching with compression

4. UDFs: Python or Scala functions


val model: LogisticRegressionModel = ...
ctx.udf.register("predict", (x: Float, y: Float) =>
  model.predict(Vector(x, y)))
ctx.sql("SELECT predict(age, weight) FROM users")
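
For the caching point above, a minimal sketch (assumes the users DataFrame from earlier; cache() is the standard DataFrame API, and Spark SQL stores cached DataFrames in a compressed in-memory columnar format):

users.cache()   // marks the DataFrame for columnar, compressed in-memory storage
users.count()   // the first action materializes the cache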
CATALYST
Goal: Extensibility to add new optimization rules
CATALYST DESIGN
Library for representing trees and rules to manipulate them

tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  case Add(left, Literal(0))  => left
  case Add(Literal(0), right) => right
}
LOGICAL, PHYSICAL PLANS
1. Analyzer: Look up relations, map named attributes, propagate types
2. Logical Optimization

3. Physical Planning
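
A hedged way to see these stages on a real query: DataFrame.explain(true) prints the parsed, analyzed, and optimized logical plans followed by the physical plan (assumes the users DataFrame from the earlier slides).

val young = users.where(users("age") < 21)
young.explain(true)
// prints: == Parsed Logical Plan ==, == Analyzed Logical Plan ==,
//         == Optimized Logical Plan ==, == Physical Plan ==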
CODE GENERATION
Queries become CPU-bound once data is in memory
Interpretation overheads: branches, virtual function calls, etc.

// Catalyst generates code with Scala quasiquotes (q"..."), which build
// type-checked Scala ASTs that are compiled at runtime.
def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) =>
    q"${compile(left)} + ${compile(right)}"
}
EXTENSIONS
Data sources
- Define a BaseRelation that contains the schema
- A TableScan returns an RDD[Row]
- Pruning / filtering optimizations (see the sketch below)
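
A minimal data-source sketch against the Spark 1.x-era interfaces named above; RangeRelation and its contents are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A toy relation exposing the integers 0 until n as one INT column.
class RangeRelation(val sqlContext: SQLContext, n: Int)
    extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(StructField("id", IntegerType, nullable = false) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}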

User-Defined Types (UDTs)


- Support advanced analytics, e.g., a Vector type
- Users provide mapping from UDT to Catalyst Row
SUMMARY, TAKEAWAYS
Relational API
- Enables rich space of optimizations
- Easy to use, integration with Scala, Python

Catalyst Optimizer
- Extensible, rule-based optimizer
- Code generation for high performance

Evolution of Spark API


DISCUSSION
https://forms.gle/r6DnV7wLGHjYmYd17
Does SparkSQL help ML workloads? Consider the MNIST code in your
assignment. What parts of your code would benefit from SparkSQL and what
parts would not?
What are some limitations of the Catalyst optimizer as described in the
paper? Describe one or two ideas to improve the optimizer.
NEXT STEPS
Next class: Wide-area SQL queries
Midterm coming up!
SCHEMA INFERENCE
Common data formats: JSON, CSV, semi-structured data

JSON schema inference


- Find the most specific SparkSQL type that matches all instances
  e.g., if every tweet.loc.latitude value fits in 32 bits, the type is INT
- Fall back to STRING if the type is unknown
- Implemented using a reduce over trees of types
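
A minimal sketch of schema inference in use, assuming a SQLContext ctx and a hypothetical file tweets.json with one JSON object per line:

val tweets = ctx.read.json("tweets.json")   // infers the schema in one pass over the data
tweets.printSchema()                        // nested fields (e.g., loc.latitude) get the most specific type observed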
