Achieving Lakehouse Models
with Spark 3.0
Simon Whiteley
Director of Engineering, Advancing Analytics
Agenda
Why Lakehouse?
Kimball Problems
Delta & Spark 3.0
▪ SCD & SQL Merge
▪ Dynamic Partition Pruning
▪ Adaptive Query Execution
Enabling the Lakehouse
The Data Lakehouse
Analytics Evolution
[Diagram: analytics storage evolving from PARQUET to Delta Lake.]
The Modern Warehouse
[Diagram: a lake with RAW and BASE zones, both stored as PARQUET, feeding the warehouse.]
The Lakehouse
[Diagram: RAW, BASE and ENRICHED zones, with BASE and ENRICHED stored as DELTA.]
Lakehouse Barriers
When you think Warehouse…
We automatically think of Star Schemas and Kimball warehousing approaches.
A large central fact table with smaller reference dimensions… some of which aren’t so small.
“You can’t use Kimball in a Data Lake”
(Literally Everyone, All The Time)
Three Historical Challenges
▪ Slowly Changing Dimensions
▪ Filtering Dimensions
▪ General SQL Performance
Slowly Changing Dimensions
SCD - Enabling the Familiar
Before the change arrives, every key holds a single current row:

| PrimaryKey | Address                | Current | EffectiveDate | EndDate    |
| 11         | A new customer address | TRUE    | 03/08/2020    | null       |
| 58         | Yet another address    | TRUE    | 03/08/2020    | null       |
| 41         | A different address    | TRUE    | 03/08/2020    | null       |

After a Type 2 update to key 11, the old row is closed off and a new current row is appended:

| PrimaryKey | Address                | Current | EffectiveDate | EndDate    |
| 11         | A new customer address | FALSE   | 03/08/2020    | 22/10/2020 |
| 11         | An updated address     | TRUE    | 22/10/2020    | null       |
| 58         | Yet another address    | TRUE    | 03/08/2020    | null       |
| 41         | A different address    | TRUE    | 03/08/2020    | null       |
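As a concrete anchor for the MERGE examples that follow, here is a minimal sketch of this dimension as a Delta table; the dataai.addresses name is borrowed from the next slide, and the column types are assumptions:

CREATE TABLE dataai.addresses (
  PrimaryKey INT,
  Address STRING,
  `Current` BOOLEAN,   -- flags the live version of each key
  EffectiveDate DATE,
  EndDate DATE
) USING DELTA;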
SCD - Merge Commands
MERGE INTO dataai.addresses AS original
USING updates
ON original.primaryKey = updates.primaryKey
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
Available in the SQL, Scala and Python APIs, the MERGE command has made many complex warehousing jobs accessible to the wider analytics community. This is enabled by the Delta file format.
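The MERGE above is a Type 1 pattern: the matched row is overwritten in place and no history is kept. For the Type 2 behaviour shown in the address tables earlier, a common approach is to stage each changed key twice, once to close the current row and once to insert its replacement. A sketch, assuming an updates source with primaryKey, Address and EffectiveDate columns:

MERGE INTO dataai.addresses AS original
USING (
  -- Every incoming row, keyed on its primary key: these match existing rows
  SELECT u.primaryKey AS mergeKey, u.* FROM updates u
  UNION ALL
  -- Changed rows again with a NULL key: these never match, so they insert
  SELECT NULL AS mergeKey, u.*
  FROM updates u
  JOIN dataai.addresses a ON u.primaryKey = a.PrimaryKey
  WHERE a.`Current` = TRUE AND u.Address <> a.Address
) staged
ON original.PrimaryKey = staged.mergeKey
WHEN MATCHED AND original.`Current` = TRUE AND original.Address <> staged.Address THEN
  UPDATE SET `Current` = FALSE, EndDate = staged.EffectiveDate
WHEN NOT MATCHED THEN
  INSERT (PrimaryKey, Address, `Current`, EffectiveDate, EndDate)
  VALUES (staged.primaryKey, staged.Address, TRUE, staged.EffectiveDate, NULL)

Unchanged keys match but fail the update condition, so they pass through untouched; brand-new keys fall straight into the insert branch.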
Dynamic Partition Pruning
Spark Partitioning
SELECT * FROM Sales WHERE Month = 3

[Diagram: the SALES table stored as folders Month=1, Month=2, Month=3, Month=4. The query action reads only the Month=3 files; filtering is performed by selectively reading files.]
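A minimal sketch of the layout behind this, using hypothetical Sales columns: partitioning the Delta table by Month writes each value to its own folder, which is what the selective file read relies on.

CREATE TABLE Sales (
  SaleId BIGINT,
  DateKey INT,      -- join key used in the cross-filter slides below
  Month INT,
  Amount DOUBLE
) USING DELTA
PARTITIONED BY (Month);

-- Static partition pruning: only the Month=3 folder is read
SELECT * FROM Sales WHERE Month = 3;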
Cross-Filter Spark 2.4
SELECT * FROM Sales JOIN Date WHERE DateMonth = 3

[Diagram: the partitioned SALES table joined to the DimDATE dimension. In Spark 2.4, partition keys are not hit when filtering on joined tables, so all four Month folders are scanned.]
Cross-Filter Spark 3.0
SELECT * FROM Sales JOIN Date WHERE DateMonth = 3

[Diagram: the same join in Spark 3.0. Dynamic Partition Pruning determines the partition filters during runtime, so only the Month=3 folder of SALES is read.]
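Dynamic Partition Pruning needs no code changes; it sits behind a real Spark 3.0 flag that is already on by default. For the pruning to kick in, the join must be on the fact table's partition column, with the selective filter on some other dimension attribute. A sketch continuing the hypothetical Sales schema above (DimDate, DateMonth and MonthName are assumed names):

SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true;

-- DimDate is a small dimension with one row per month (hypothetical columns):
-- the MonthName filter runs first, and the surviving DateMonth values become
-- a runtime partition filter on Sales.Month, so only Month=3 files are read
SELECT s.*
FROM Sales s
JOIN DimDate d ON s.Month = d.DateMonth
WHERE d.MonthName = 'March';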
Adaptive Query Execution
AQE in Spark 3.0
AQE will speed up common queries in a number of ways:
▪ Coalescing Shuffle Partitions
▪ Switching Join Strategies
▪ Optimizing Skew Joins
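All three behaviours sit behind a single master switch plus per-feature flags. A minimal sketch using real Spark 3.0 settings (AQE itself is off by default in 3.0; join strategy switching has no separate flag and is covered by the master switch):

SET spark.sql.adaptive.enabled = true;                     -- master switch for AQE
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge small shuffle partitions
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split oversized, skewed partitions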
Before AQE - Shuffle Coalescing
[Diagram: two read tasks fan out into 200 shuffle tasks, the static spark.sql.shuffle.partitions default, then collapse back into two write tasks. Stage widths: 2 Tasks, 200 Tasks, 2 Tasks.]
Using Spark 3.0 AQE - Shuffle Coalescing
[Diagram: the same job with AQE enabled runs two tasks at every stage; the shuffle partitions are coalesced at runtime to match the data. Stage widths: 2 Tasks, 2 Tasks, 2 Tasks.]
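A sketch of the kind of query this helps, reusing the hypothetical Sales table from earlier; without AQE, the GROUP BY shuffle runs the static 200 tasks however small the data is.

-- With the AQE flags above set, the 200 shuffle partitions behind this
-- aggregation are coalesced at runtime into a handful of right-sized tasks
SELECT Month, SUM(Amount) AS TotalSales
FROM Sales
GROUP BY Month;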
DEMO: Let’s see it in action
The Data Lakehouse
Delta & Spark 3.0 enable the Lakehouse through:
▪ Enabling familiar (SQL) patterns
▪ Removing technical barriers
▪ Targeting performance of common warehousing activities
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
