This document discusses optimizing Apache Spark (PySpark) workloads in production. It provides an agenda for a presentation on various Spark topics including the primary data structures (RDD, DataFrame, Dataset), executors, cores, containers, stages and jobs. It also discusses strategies for optimizing joins, parallel reads from databases, bulk loading data, and scheduling Spark workflows with Apache Airflow. The presentation is given by a solution architect from Accionlabs, a global technology services firm focused on emerging technologies like Apache Spark, machine learning, and cloud technologies.
1. No more struggles with Apache Spark (PySpark)
workloads in production
Chetan Khatri, Solution Architect - Data Science.
Accionlabs India.
PyconZA 2019, The Wanderers Club in Illovo.
Johannesburg, South Africa
11th Oct, 2019
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
2. Who am I?
Solution Architect - Data Science @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Ex - Data Engineering @: Nazara Games, Eccella Corporation.
Masters - Computer Science from University of Kachchh, India.
Daily Activity?
Functional Programming, Distributed Computing, Python, Scala, Haskell, Data
Science, Product Development
3. Helping organizations create innovative
product and solutions using the emerging
technologies
An Innovation Focused Technology Services Firm
Employees: 2300+
Clients: 75+
Accelerators: 20+
Global Offices: 12+
Development Centers: 7
4. Accion Labs - Introduction
● A Global Technology Services firm focused on Emerging Technologies
○ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients
● Profitable, venture-backed company
○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership
● Flexible Outcome-based Engagement Models
○ Projects, Extended teams, Shared IP, Co-development, Professional Services
● Framework Based Approach to Accelerate Digital Transformation
○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency
● Action-oriented Leadership Team
○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
5. Accion’s Emerging Tech Capabilities
Adaptive UI, UX Engineering
NLP, Voice Interface & Chat Bots
Artificial Intelligence and Machine Learning
Data Lake & Big Data Analytics
Blockchain, Payment Technologies
Cloud Strategy and Transformation
Mobile Development
MicroServices and Serverless Computing
QA Engineering, RPA and DevOps Automation
SFDC, ServiceNow, IBM Solutions, Azure
6. Agenda
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Airflow DAG scheduling for Apache Spark workflow - Design, Architecture, Demo.
7. What is Apache Spark?
● Apache Spark is a fast and general-purpose cluster computing system / Unified Engine for massive data
processing.
● It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general
execution graphs.
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming
14. Essential Spark Operations
Categories: General, Math / Statistical, Set Theory / Relational, Data Structure / I/O
TRANSFORMATIONS
map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy,
sample, randomSplit,
union, intersection, subtract, distinct, cartesian, zip,
keyBy, zipWithIndex, zipWithUniqueId, zipPartitions, coalesce, repartition,
repartitionAndSortWithinPartitions, pipe
ACTIONS
reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate, treeReduce,
foreachPartition, collectAsMap,
count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance,
countApprox, countApproxDistinct,
takeOrdered,
saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile,
saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
15. When to use RDDs ?
You care about control of the dataset and know what the data looks like; you care
about the low-level API.
You don't mind writing lots of lambda functions instead of using a DSL.
You don't care about the schema or structure of the data.
You don't care about optimization, performance & inefficiencies!
Very slow for non-JVM languages like Python, R.
You don't care about inadvertent inefficiencies.
18. Structured APIs in Apache Spark
                 SQL       DataFrames     Datasets
Syntax Errors    Runtime   Compile Time   Compile Time
Analysis Errors  Runtime   Runtime        Compile Time
Analysis errors are caught before a job runs on cluster
20. DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
results = spark.sql(
    """SELECT sprint, sum(numStories) AS count
       FROM audits
       WHERE project = 'finance'
       GROUP BY sprint
       LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
21. Catalyst in Spark
SQL AST / DataFrame / Datasets
-> Unresolved Logical Plan
-> Logical Plan
-> Optimized Logical Plan
-> Physical Plans
-> Cost Model
-> Selected Physical Plan
-> RDD
22. Example: DataFrame Optimization
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
Logical Plan: employees table + events file -> join -> filter
Physical Plan: scan (employees) + filter over scan (events) -> join
Physical Plan With Predicate Pushdown and Column Pruning: optimized scan (events) + optimized scan (employees) -> join
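The reordering in this slide can be illustrated with a small plain-Python model (not Spark itself): filtering the events before the join yields the same rows as filtering after it, while fewer rows enter the join. The sample rows and the helper names `join` and `is_recent` are hypothetical, chosen only for illustration.

```python
# Hypothetical sample data standing in for the employees table and events file.
employees = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
events = [
    {"eid": 1, "date": "2014-06-30"},
    {"eid": 1, "date": "2015-03-01"},
    {"eid": 2, "date": "2016-01-15"},
]

def join(left, right):
    """Naive inner join on employees.id == events.eid."""
    return [{**l, **r} for l in left for r in right if l["id"] == r["eid"]]

def is_recent(row):
    """Same predicate as the slide: date > '2015-01-01' (ISO strings compare correctly)."""
    return row["date"] > "2015-01-01"

# As written in the query: join everything, then filter.
joined_then_filtered = [row for row in join(employees, events) if is_recent(row)]

# After pushdown: filter the events first, then join the smaller input.
recent_events = [e for e in events if is_recent(e)]
filtered_then_joined = join(employees, recent_events)

print(joined_then_filtered == filtered_then_joined)  # True
print(len(events), "->", len(recent_events))         # 3 -> 2
```

In Spark the optimizer performs this rewrite automatically; the sketch only shows why the rewrite is safe for this query.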
25. Spark Internals terminology
Job - Each action in Spark triggers a separate job; transformations are lazy and only extend the plan.
Stage - A set of tasks in each job which can run in parallel using a ThreadPoolExecutor.
Task - Lowest-level unit of concurrent and parallel execution.
Each stage is split into #number-of-partitions tasks,
i.e. number of tasks in a job = sum, over its stages, of the number of partitions in each stage
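Since each stage is split into one task per partition, the task count of a job can be sketched as the sum of the partition counts of its stages. The helper `total_tasks` and the example stage sizes are hypothetical; 200 is used for the second stage because it is the default value of spark.sql.shuffle.partitions.

```python
def total_tasks(partitions_per_stage):
    """Total tasks in a job: one task per partition, summed over all stages."""
    return sum(partitions_per_stage)

# Hypothetical job: stage 0 reads 8 input partitions,
# stage 1 runs after a shuffle into 200 partitions.
stages = [8, 200]
print(total_tasks(stages))  # 208
```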
48. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
JoinSelection execution planning strategy uses the
spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the maximum size
of a dataset that is broadcast to all worker nodes when performing a join.
# check broadcast join threshold (in MB)
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) // 1024 // 1024
10
# logical plan with numbered tree (Scala API)
sampleDF.queryExecution.logical.numberedTreeString
# Query plan
sampleDF.explain()
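As a rough mental model of the threshold, the size check can be sketched in plain Python. This is a simplification under stated assumptions: the real planner works on estimated statistics and also considers the join type, and the helper name `is_auto_broadcast_candidate` is hypothetical.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB, the documented default

def is_auto_broadcast_candidate(estimated_size_bytes,
                                threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Simplified size check: a side qualifies if its estimated size is within the threshold.

    A threshold of -1 disables automatic broadcasting, so nothing qualifies.
    """
    return 0 <= estimated_size_bytes <= threshold_bytes

print(is_auto_broadcast_candidate(8 * 1024 * 1024))                       # True
print(is_auto_broadcast_candidate(64 * 1024 * 1024))                      # False
print(is_auto_broadcast_candidate(8 * 1024 * 1024, threshold_bytes=-1))   # False
```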
49. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Repartition: boost parallelism by increasing the number of partitions. Partition on the join key so that
rows with the same key are joined faster.
// coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, in a bulk JDBC write with the parameter "bulkCopyBatchSize" -> "2500", the DataFrame has 10 partitions
and each partition writes batches of 2500 records in parallel.
Reduce: impact on network communication, file I/O, network I/O, bandwidth, etc.
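To get a feel for the numbers in the bulk write example, the arithmetic can be sketched as follows. The total row count is a hypothetical figure, rows are assumed to be spread evenly across partitions, and `batches_per_partition` is an illustrative helper, not part of any connector API.

```python
import math

def batches_per_partition(total_rows, num_partitions, batch_size):
    """Number of write batches each partition issues, assuming evenly sized partitions."""
    rows_per_partition = math.ceil(total_rows / num_partitions)
    return math.ceil(rows_per_partition / batch_size)

# Hypothetical table of 1,000,000 rows, coalesced to 10 partitions,
# written with a batch size of 2500 as in the slide.
print(batches_per_partition(1_000_000, 10, 2500))  # 40
```

With 10 partitions there are at most 10 concurrent writers, so the partition count after coalesce bounds the load placed on the target database.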
50. An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
1. // disable autoBroadcastJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter
table1.leftjoin(table2) or table2.leftjoin(table1)
3. force broadcast, if one DataFrame is not small!
4. Minimize shuffling & boost parallelism: partitioning, bucketing, coalesce, repartition,
HashPartitioner