Lesson 01.05 The 5 Ss Storage
The 5 Most Common Performance Problems (The 5 Ss)
Storage
● Storage is our 4th problem area
Storage - More Examples
We are going to take a look at a couple of examples:
● Tiny Files
● Scanning
If you had only 60 seconds to pick up as many coins as you can, one coin at a time, which pile do you want to work from?
[Figure: a pile of coins labeled $0.01 to $0.13 vs. a pile of coins labeled $0.25 to $3.25]
Storage - Tiny Files In Action
See Experiment #8923, contrast Step B, Step C and Step D
● In the Spark UI, see the Stage Details for the last
stage of each step and note the Input Size / Records
● In the Spark UI, see the Query Details for the last job of each step and note the...
■ number of files read
■ scan time total
■ filesystem read time total
■ size of files read
Storage - Tiny Files, Review
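Step D - Tiny Files: ~34 M records, ~1.5 hours duration, 345,612 files read, 12 hours scan time total, > 6 hours filesystem read time total, 2.1 GB size of files read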
What can we do to mitigate the impact of tiny files?
Storage - Only 2.5 Options?
I might be wrong, but I think there are really only two options...
Scenario #1
● You caused the problem…
○ You can fix the problem
Scenario #2
● Someone else caused the problem…
○ Push back on design
○ Just live with it
Storage - The Ideal File Size
● The ideal part-file is between 128MB and 1GB
● Remember...
1 Spark-Partition == 1 Part-File upon write
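As a rough illustration of the 128 MB to 1 GB guideline, the sketch below labels part-file sizes as tiny, ideal or oversized. The function name and the sample sizes are hypothetical; only the two thresholds come from the slide.

```python
MB = 1024 ** 2
GB = 1024 ** 3

IDEAL_MIN_BYTES = 128 * MB  # lower bound of the guideline
IDEAL_MAX_BYTES = 1 * GB    # upper bound of the guideline


def classify_part_file(size_bytes):
    """Return 'tiny', 'ideal' or 'oversized' for one part-file size."""
    if size_bytes < IDEAL_MIN_BYTES:
        return "tiny"
    if size_bytes > IDEAL_MAX_BYTES:
        return "oversized"
    return "ideal"


# Hypothetical part-file sizes, for illustration only.
sample_sizes = [6 * 1024, 64 * MB, 128 * MB, 512 * MB, 2 * GB]
labels = [classify_part_file(size) for size in sample_sizes]
print(labels)
```

Because one Spark-partition becomes one part-file on write, a listing dominated by "tiny" labels usually points at too many partitions at write time.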
Storage - Manual Compaction
● You can control the on-disk, part-file size
Storage - Manual Compaction, How-To
The Algorithm
1. Determine the size of your dataset on disk
2. Decide what your ideal part-file size is
3. Compute the number of spark-partitions required (divide size-on-disk / ideal-size)
4. Configure a cluster with N cores (more cores == less time)
5. Read in your data, repartition by the number of partitions from step 3, and then write to disk
6. Check the Spark UI for spill and any other issues

An Example In Action
1. Size on disk is 150 GB
2. Assume ½ GB part-files
3. 150 GB / ½ GB = 300 partitions
4. 9 x C4.8xlarge (60 GB, 36 cores). How?
   ● 9 VMs x 36 cores for 324 total cores
   ● 60 GB / 2 = 30 GB execution (default is 60% but 50% is safe)
   ● 30 GB / 36 cores = 0.83 GB (over our ½ GB goal, but disk vs RAM)
5. Read in your data, repartition by 300, and then write to disk

Homework: See how to manually compact tiny files in Experiment #2586
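The arithmetic in the example can be sketched in a few lines of Python. This is a minimal sketch, not part of the referenced experiment: the function name and dictionary keys are hypothetical, the 50% execution fraction is the "safe" assumption from the slide, and the "task_waves" value is an extra illustration of "more cores == less time".

```python
import math


def plan_compaction(size_on_disk_gb, ideal_file_gb, vms, cores_per_vm,
                    ram_per_vm_gb, execution_fraction=0.5):
    """Work out the partition count and the per-core execution memory."""
    # Step 3: size-on-disk divided by the ideal part-file size.
    partitions = math.ceil(size_on_disk_gb / ideal_file_gb)
    # Step 4: total cores and execution memory available to each core.
    total_cores = vms * cores_per_vm
    execution_gb_per_vm = ram_per_vm_gb * execution_fraction
    execution_gb_per_core = execution_gb_per_vm / cores_per_vm
    # Extra illustration: how many rounds of tasks the cluster needs.
    task_waves = math.ceil(partitions / total_cores)
    return {
        "partitions": partitions,
        "total_cores": total_cores,
        "execution_gb_per_core": execution_gb_per_core,
        "task_waves": task_waves,
    }


# The numbers from the example: 150 GB, 1/2 GB files, 9 VMs x 36 cores, 60 GB RAM.
plan = plan_compaction(150, 0.5, vms=9, cores_per_vm=36, ram_per_vm_gb=60)
print(plan)

# Step 5 in PySpark would look roughly like this (paths are hypothetical):
# spark.read.parquet(src_path).repartition(plan["partitions"]).write.parquet(dst_path)
```

With these inputs the 300 partitions fit in a single wave on 324 cores, and each core has about 0.83 GB of execution memory for its ½ GB partition.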
Storage - Automatic Compaction
Databricks Delta’s Optimize Operation
● See Optimize (Delta Lake on Databricks) for more information
● Targets a 1GB size for each part-file
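For orientation, the Optimize operation is issued as a SQL statement. The helper below is a hypothetical convenience that only builds the statement text; the table name "events" and the "date" column are made up for illustration, and the optional WHERE clause is assumed to filter on a partition column.

```python
def optimize_statement(table, where=None):
    """Build the text of a Delta Lake OPTIMIZE statement."""
    stmt = f"OPTIMIZE {table}"
    if where:
        # Assumption: the predicate refers to a partition column.
        stmt += f" WHERE {where}"
    return stmt


stmt = optimize_statement("events", where="date >= '2021-01-01'")
print(stmt)

# On a Databricks cluster the statement would then be run with:
# spark.sql(stmt)
```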
Storage - Why Auto Optimize?
● Manually compacting files, or writing them out
correctly the first time, is the most efficient process
Storage - Traditional Writes
[Figure, 5 frames: Traditional Writes vs. Optimized Writes. Spark Cluster tasks write part-files of 16 MB to 144 MB into Delta Tables under Disk-Partition A, Disk-Partition B and Disk-Partition C.]
Each task will write one part file to the target disk-partition
Storage - Optimized Writes
[Figure, 17 frames: Traditional Writes vs. Optimized Writes. An Adaptive Shuffle step in the Spark Cluster precedes the write to Delta Tables; 128 MB part-files appear under Disk-Partition A, Disk-Partition B and Disk-Partition C.]
Each task will write one part file to the target disk-partition
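The contrast between the two write paths can be approximated with a toy model. This is only a sketch of the idea and is not Delta's actual algorithm: the task contents are hypothetical, and the "optimized" path simply gathers each disk-partition's data before cutting files of up to 128 MB.

```python
from collections import defaultdict

TARGET_MB = 128  # assumed target size for the toy model

# Hypothetical data: task -> {disk-partition: MB held by that task}
tasks = {
    "task-1": {"A": 32, "B": 64, "C": 16},
    "task-2": {"A": 32, "B": 32, "C": 48},
    "task-3": {"A": 64, "B": 32, "C": 64},
}


def traditional_write(tasks):
    """Each task writes one part-file per disk-partition it holds data for."""
    files = defaultdict(list)
    for chunks in tasks.values():
        for partition, mb in chunks.items():
            files[partition].append(mb)
    return dict(files)


def optimized_write(tasks, target_mb=TARGET_MB):
    """Toy model: shuffle by disk-partition first, then cut files of up to target_mb."""
    totals = defaultdict(int)
    for chunks in tasks.values():
        for partition, mb in chunks.items():
            totals[partition] += mb
    files = {}
    for partition, total in totals.items():
        full, rest = divmod(total, target_mb)
        files[partition] = [target_mb] * full + ([rest] if rest else [])
    return files


before = traditional_write(tasks)
after = optimized_write(tasks)
print(sum(len(f) for f in before.values()), "files vs", sum(len(f) for f in after.values()))
```

In this toy data the traditional path produces nine small files, while the shuffled path produces three 128 MB files for the same bytes.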
Storage - Directory Scanning
The next version of the “Tiny Files Problem” is Directory Scanning
Storage - Scanning Example
Consider this common scenario:
● Consider 1 year’s worth of data partitioned by year, month, day & hour
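To get a feel for the scale of this scenario, the sketch below counts the directories such a layout creates. The function name is hypothetical and 2021 is an arbitrary non-leap year chosen for illustration.

```python
import calendar


def hourly_partition_dirs(year):
    """Count directories for data partitioned by year, month, day and hour."""
    months = 12
    days = 366 if calendar.isleap(year) else 365
    hours = days * 24  # one leaf directory per hour of the year
    return {
        "year": 1,
        "month": months,
        "day": days,
        "hour": hours,
        "total": 1 + months + days + hours,
    }


dirs = hourly_partition_dirs(2021)
print(dirs)
```

Every one of the 8,760 hour-level directories has to be listed before any file in it can be read, which is where the scanning cost comes from.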
Storage - Scanning In Action
See Experiment #8973
● See Step E, F & G for more variants and how they affect scanning
● For Step J open the Spark UI and look at the Query Details for the last job
■ Identify the proof that scanning is the root cause of this performance problem
Storage - Scanning, Review
Step | Description | Duration | Records | Files | Directories
D | Partitioned by year, month, day & hour | ~15 minutes | 37,413,338 | 6,273 | 8,760
Storage - Scanning, Prove It
What proof is there in the Query Details for Experiment #8973, Step J,
that scanning is the root cause of these performance problems?
What can we do to mitigate the impact of scanning?
Storage - Can we…?
Performance Tuning on Apache Spark
Storage - Schemas
● Inferring a schema (for JSON and CSV) requires a full read of the file to
determine data types, even if you only want a subset of the data
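Why inference needs the whole file can be shown with a toy example in plain Python. This is not Spark's inference code: the CSV content, the function names and the three-type ladder (int, float, str) are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV: "price" looks like an integer until the last row.
CSV_TEXT = "id,price\n1,10\n2,20\n3,30\n4,12.5\n"


def infer_type(values):
    """Pick the narrowest of int, float, str that fits every value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(text, sample_rows=None):
    """Infer column types from all rows, or only from the first sample_rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if sample_rows is not None:
        rows = rows[:sample_rows]
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


sampled = infer_schema(CSV_TEXT, sample_rows=3)
full = infer_schema(CSV_TEXT)
print(sampled, full)
```

Reading only the first three rows yields an integer type for "price"; only the full read sees the 12.5 in the last row, so a correct inferred type cannot be had without touching every record.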
47
What can we do to mitigate the schema issues?
Storage - Schema Mitigation
There are several ways to mitigate some of these issues:
● Use tables - the backing metastore will track the table’s schema