Achieving Lakehouse Models
with Spark 3.0
Simon Whiteley
Director of Engineering, Advancing Analytics
Agenda
Why Lakehouse?
Kimball Problems
Delta & Spark 3.0
▪ SCD & SQL Merge
▪ Dynamic Partition Pruning
▪ Adaptive Query Execution
Enabling the Lakehouse
The Data Lakehouse
Analytics Evolution
[Diagram: analytics storage evolving from PARQUET to Delta Lake.]
The Modern Warehouse
[Diagram: a lake with RAW and BASE zones, both stored as PARQUET, feeding the warehouse.]
The Lakehouse
[Diagram: RAW, BASE and ENRICHED zones, with BASE and ENRICHED stored as DELTA.]
Lakehouse Barriers
When you think Warehouse…
We automatically think of Star Schemas and Kimball warehousing approaches.
A large central fact table with smaller reference dimensions… some of which aren’t so small.
“You can’t use Kimball in a Data Lake”
(Literally Everyone, All The Time)
Three Historical Challenges
▪ Slowly Changing Dimensions
▪ Filtering Dimensions
▪ General SQL Performance
Slowly Changing Dimensions
SCD - Enabling the Familiar
Before the change arrives, every key holds a single current row:

| PrimaryKey | Address                | Current | EffectiveDate | EndDate    |
| 11         | A new customer address | TRUE    | 03/08/2020    | null       |
| 58         | Yet another address    | TRUE    | 03/08/2020    | null       |
| 41         | A different address    | TRUE    | 03/08/2020    | null       |

After a Type 2 update to key 11, the old row is closed off and a new current row is appended:

| PrimaryKey | Address                | Current | EffectiveDate | EndDate    |
| 11         | A new customer address | FALSE   | 03/08/2020    | 22/10/2020 |
| 11         | An updated address     | TRUE    | 22/10/2020    | null       |
| 58         | Yet another address    | TRUE    | 03/08/2020    | null       |
| 41         | A different address    | TRUE    | 03/08/2020    | null       |
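As a concrete anchor for the MERGE examples that follow, here is a minimal sketch of this dimension as a Delta table; the dataai.addresses name is borrowed from the next slide, and the column types are assumptions:

CREATE TABLE dataai.addresses (
  PrimaryKey INT,
  Address STRING,
  `Current` BOOLEAN,   -- flags the live version of each key
  EffectiveDate DATE,
  EndDate DATE
) USING DELTA;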
SCD - Merge Commands
MERGE INTO dataai.addresses AS original
USING updates
ON original.primaryKey = updates.primaryKey
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
Available in the SQL, Scala and Python APIs, the MERGE command has made many complex warehousing jobs accessible to the wider analytics community. This is enabled by the Delta file format.
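The MERGE above is a Type 1 pattern: the matched row is overwritten in place and no history is kept. For the Type 2 behaviour shown in the address tables earlier, a common approach is to stage each changed key twice, once to close the current row and once to insert its replacement. A sketch, assuming an updates source with primaryKey, Address and EffectiveDate columns:

MERGE INTO dataai.addresses AS original
USING (
  -- Every incoming row, keyed on its primary key: these match existing rows
  SELECT u.primaryKey AS mergeKey, u.* FROM updates u
  UNION ALL
  -- Changed rows again with a NULL key: these never match, so they insert
  SELECT NULL AS mergeKey, u.*
  FROM updates u
  JOIN dataai.addresses a ON u.primaryKey = a.PrimaryKey
  WHERE a.`Current` = TRUE AND u.Address <> a.Address
) staged
ON original.PrimaryKey = staged.mergeKey
WHEN MATCHED AND original.`Current` = TRUE AND original.Address <> staged.Address THEN
  UPDATE SET `Current` = FALSE, EndDate = staged.EffectiveDate
WHEN NOT MATCHED THEN
  INSERT (PrimaryKey, Address, `Current`, EffectiveDate, EndDate)
  VALUES (staged.primaryKey, staged.Address, TRUE, staged.EffectiveDate, NULL)

Unchanged keys match but fail the update condition, so they pass through untouched; brand-new keys fall straight into the insert branch.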
Dynamic Partition Pruning
Spark Partitioning
SELECT * FROM Sales WHERE Month = 3

[Diagram: the SALES table stored as folders Month=1, Month=2, Month=3, Month=4. The query action reads only the Month=3 files; filtering is performed by selectively reading files.]
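A minimal sketch of the layout behind this, using hypothetical Sales columns: partitioning the Delta table by Month writes each value to its own folder, which is what the selective file read relies on.

CREATE TABLE Sales (
  SaleId BIGINT,
  DateKey INT,      -- join key used in the cross-filter slides below
  Month INT,
  Amount DOUBLE
) USING DELTA
PARTITIONED BY (Month);

-- Static partition pruning: only the Month=3 folder is read
SELECT * FROM Sales WHERE Month = 3;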
Cross-Filter Spark 2.4
SELECT * FROM Sales JOIN Date WHERE DateMonth = 3

[Diagram: the partitioned SALES table joined to the DimDATE dimension. In Spark 2.4, partition keys are not hit when filtering on joined tables, so all four Month folders are scanned.]
Cross-Filter Spark 3.0
SELECT * FROM Sales JOIN Date WHERE DateMonth = 3

[Diagram: the same join in Spark 3.0. Dynamic Partition Pruning determines the partition filters during runtime, so only the Month=3 folder of SALES is read.]
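Dynamic Partition Pruning needs no code changes; it sits behind a real Spark 3.0 flag that is already on by default. For the pruning to kick in, the join must be on the fact table's partition column, with the selective filter on some other dimension attribute. A sketch continuing the hypothetical Sales schema above (DimDate, DateMonth and MonthName are assumed names):

SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;

-- DimDate is a small dimension with one row per month (hypothetical columns):
-- the MonthName filter runs first, and the surviving DateMonth values become
-- a runtime partition filter on Sales.Month, so only Month=3 files are read
SELECT s.*
FROM Sales s
JOIN DimDate d ON s.Month = d.DateMonth
WHERE d.MonthName = 'March';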
Adaptive Query Execution
AQE in Spark 3.0
AQE will speed up common queries in a number of ways:
▪ Coalescing Shuffle Partitions
▪ Switching Join Strategies
▪ Optimizing Skew Joins
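All three behaviours sit behind a single master switch plus per-feature flags. A minimal sketch using real Spark 3.0 settings (AQE itself is off by default in 3.0; join strategy switching has no separate flag and is covered by the master switch):

SET spark.sql.adaptive.enabled = true;                     -- master switch for AQE
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge small shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split oversized, skewed partitions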
Before AQE - Shuffle Coalescing
[Diagram: two read tasks fan out into 200 shuffle tasks, the static spark.sql.shuffle.partitions default, then collapse back into two write tasks. Stage widths: 2 Tasks, 200 Tasks, 2 Tasks.]
Using Spark 3.0 AQE - Shuffle Coalescing
[Diagram: the same job with AQE enabled runs two tasks at every stage; the shuffle partitions are coalesced at runtime to match the data. Stage widths: 2 Tasks, 2 Tasks, 2 Tasks.]
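A sketch of the kind of query this helps, reusing the hypothetical Sales table from earlier; without AQE, the GROUP BY shuffle runs the static 200 tasks however small the data is.

-- With the AQE flags above set, the 200 shuffle partitions behind this
-- aggregation are coalesced at runtime into a handful of right-sized tasks
SELECT Month, SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Month;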
DEMO: Let’s see it in action
The Data Lakehouse
Delta & Spark 3.0 enable the Lakehouse through:
▪ Enabling familiar (SQL) patterns
▪ Removing technical barriers
▪ Targeting performance of common warehousing activities
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
