CS 744: Spark SQL
Shivaram Venkataraman
Fall 2019
ADMINISTRIVIA
- Assignment 2 grades this week
- Midterm details on Piazza
- Course Project Proposal comments
Applications
Computational Engines
Resource Management
Datacenter Architecture
SQL: STRUCTURED QUERY LANGUAGE
DATABASE SYSTEMS
SQL IN BIG DATA SYSTEMS
- Scale: How do we handle large datasets and clusters?
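// sc is an existing SparkContext; load the file as an RDD of lines.
val lines = sc.textFile("users")
// Split each line into comma-separated fields.
val csv = lines.map(x => x.split(','))
// Keep rows whose second field, parsed as an integer, is below 21.
val young = csv.filter(x => x(1).toInt < 21)
println(young.count())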
PROCEDURAL VS. RELATIONAL
employees.join(dept,
  employees("deptId") === dept("id"))
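One way to see the contrast is the plain-Scala sketch below, which has no Spark dependency. All names in it (`Expr`, `Col`, `Lit`, `LessThan`, `referencedColumns`, and the `age` column) are hypothetical and are not Spark's API. A lambda is opaque to the engine, whereas a relational predicate is a data structure the engine can inspect before running anything.

```scala
object ProceduralVsRelational {
  // Procedural: an opaque function. The engine can only call it.
  val youngProcedural: Array[String] => Boolean = row => row(1).toInt < 21

  // Relational: the same predicate as an expression tree (hypothetical types).
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class LessThan(left: Expr, right: Expr) extends Expr

  // "age" is an assumed column name used only for illustration.
  val youngRelational: Expr = LessThan(Col("age"), Lit(21))

  // Because the predicate is data, an optimizer can ask which columns it reads.
  def referencedColumns(e: Expr): Set[String] = e match {
    case Col(name)      => Set(name)
    case Lit(_)         => Set.empty
    case LessThan(l, r) => referencedColumns(l) ++ referencedColumns(r)
  }
}
```

Knowing which columns a predicate reads is what would let an engine skip unused columns or push the filter closer to the data source, neither of which is possible with the opaque lambda.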
tree.transform {
  case Add(Literal(c1), Literal(c2)) =>
    Literal(c1 + c2)
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}
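To make the rule above concrete, here is a minimal self-contained sketch of an expression tree with a `transform` method. It is an assumption-laden simplification and not Catalyst's actual classes. `Attribute` is added so the tree has a non-constant leaf, and this `transform` applies the rule once, bottom-up, whereas a real optimizer would typically rerun rules until the tree stops changing.

```scala
object ConstantFolding {
  sealed trait Expr {
    // Rewrite children first, then try the rule on the rebuilt node.
    def transform(rule: PartialFunction[Expr, Expr]): Expr = {
      val withNewChildren: Expr = this match {
        case Add(l, r) => Add(l.transform(rule), r.transform(rule))
        case leaf      => leaf
      }
      // Nodes the rule does not match are returned unchanged.
      rule.applyOrElse(withNewChildren, (e: Expr) => e)
    }
  }
  case class Literal(value: Int) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr

  // The same three cases as the slide, written as a reusable rule.
  val foldConstants: PartialFunction[Expr, Expr] = {
    case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
    case Add(left, Literal(0))         => left
    case Add(Literal(0), right)        => right
  }

  // Represents x + (1 + 2).
  val tree: Expr = Add(Attribute("x"), Add(Literal(1), Literal(2)))
}
```

With these definitions, `ConstantFolding.tree.transform(ConstantFolding.foldConstants)` folds the inner addition and yields `Add(Attribute("x"), Literal(3))`.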
LOGICAL, PHYSICAL PLANS
1. Analyzer: Lookup relations, map named attributes, propagate types
2. Logical Optimization
3. Physical Planning
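The physical planning step can be sketched as a function from a logical plan (what to compute) to a physical plan (how to compute it). The types, the size estimates, and the 10 MB threshold below are all assumptions made for illustration, not Spark's internal API. The sketch picks a broadcast join when one side is small and falls back to a sort-merge join otherwise.

```scala
object JoinPlanning {
  // Logical plan: what to compute (hypothetical, with a size estimate per relation).
  sealed trait LogicalPlan
  case class Relation(name: String, sizeInBytes: Long) extends LogicalPlan
  case class Join(left: Relation, right: Relation) extends LogicalPlan

  // Physical plan: how to compute it.
  sealed trait PhysicalPlan
  case class Scan(name: String) extends PhysicalPlan
  case class BroadcastHashJoin(build: PhysicalPlan, stream: PhysicalPlan) extends PhysicalPlan
  case class SortMergeJoin(left: PhysicalPlan, right: PhysicalPlan) extends PhysicalPlan

  // Assumed threshold for this sketch: broadcast a relation of at most 10 MB.
  val broadcastThreshold: Long = 10L * 1024 * 1024

  def plan(p: LogicalPlan): PhysicalPlan = p match {
    case Relation(name, _) => Scan(name)
    case Join(l, r) if math.min(l.sizeInBytes, r.sizeInBytes) <= broadcastThreshold =>
      // Ship the smaller side to every node and stream the larger side.
      val (small, large) = if (l.sizeInBytes <= r.sizeInBytes) (l, r) else (r, l)
      BroadcastHashJoin(plan(small), plan(large))
    case Join(l, r) => SortMergeJoin(plan(l), plan(r))
  }
}
```

The same logical `Join` maps to different physical operators depending only on the estimated sizes, which is the kind of decision a cost model makes at this stage.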
CODE GENERATION
- CPU bound when data is in memory
- Overheads: branches, virtual function calls, etc.
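The overhead named above can be illustrated with a small sketch that contrasts interpreting an expression tree per row with resolving the tree once up front. All names here are hypothetical. Closures stand in for generated code, so this is only an analogy: actual code generation would emit specialized source or bytecode and also remove the remaining function calls.

```scala
object CodegenSketch {
  sealed trait Expr
  case class Literal(value: Int) extends Expr
  case class Attribute(index: Int) extends Expr // column position in the row
  case class Add(left: Expr, right: Expr) extends Expr

  type Row = Array[Int]

  // Interpreted: walks the tree and branches on the node type for every row.
  def eval(e: Expr, row: Row): Int = e match {
    case Literal(v)   => v
    case Attribute(i) => row(i)
    case Add(l, r)    => eval(l, row) + eval(r, row)
  }

  // "Compiled": walk the tree once; the returned function no longer
  // pattern-matches on node types when it is applied to a row.
  def compile(e: Expr): Row => Int = e match {
    case Literal(v)   => _ => v
    case Attribute(i) => row => row(i)
    case Add(l, r) =>
      val (f, g) = (compile(l), compile(r))
      row => f(row) + g(row)
  }
}
```

A typical use would be to call `compile` once per query and then apply the resulting function to each row, so the tree-walking cost is paid once instead of per row.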
Catalyst Optimizer
- Extensible, rule-based optimizer
- Code generation for high performance