PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow

No more struggles with Apache Spark (PySpark)
workloads in production
Chetan Khatri, Solution Architect - Data Science.
Accionlabs India.
PyconZA 2019, The Wanderers Club in Illovo.
Johannesburg, South Africa
11th Oct, 2019
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri

Who am I?
Solution Architect - Data Science @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Ex - Data Engineering @: Nazara Games, Eccella Corporation.
Masters - Computer Science from University of Kachchh, India.
Daily Activity?
Functional Programming, Distributed Computing, Python, Scala, Haskell, Data
Science, Product Development

Helping organizations create innovative products and solutions using emerging technologies
An Innovation Focused Technology Services Firm
2300+ Employees
75+ Clients
20+ Accelerators
12+ Global Offices
7 Development Centers

Accion Labs - Introduction
● A global technology services firm focused on emerging technologies
○ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients
● Profitable, venture-backed company
○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership
● Flexible Outcome-based Engagement Models
○ Projects, Extended teams, Shared IP, Co-development, Professional Services
● Framework Based Approach to Accelerate Digital Transformation
○ A collection of tools and frameworks, Breeze Digital Blueprint helps gain 25-30% efficiency
● Action-oriented Leadership Team
○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018

Accion’s Emerging Tech Capabilities
Adaptive UI, UX Engineering
NLP, Voice Interface & Chat Bots
Artificial Intelligence and Machine Learning
Data Lake & Big Data Analytics
Blockchain, Payment Technologies
Cloud Strategy and Transformation
Mobile Development
MicroServices and Serverless Computing
QA Engineering, RPA and DevOps Automation
SFDC, ServiceNow, IBM Solutions, Azure

Agenda
● Apache Spark
● Primary data structures (RDD, DataSet, Dataframe)
● Pragmatic explanation - executors, cores, containers, stages, jobs and tasks in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Airflow DAG scheduling for Apache Spark workflow - Design, Architecture, Demo.

What is Apache Spark?
● Apache Spark is a fast, general-purpose cluster computing system: a unified engine for massive data processing.
● It provides high-level APIs for Scala, Java, Python and R, and an optimized engine that supports general execution graphs.
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming

1. Distributed Data Abstraction
RDD RDD RDD RDD
Logical Model Across Distributed Storage on Cluster
HDFS, S3

2. Resilient & Immutable
RDD RDD RDD
T T
RDD -> T -> RDD -> T -> RDD
T = Transformation

3. Compile-time Type Safe / Strong type inference
Integer RDD
String or Text RDD
Double or Binary RDD

4. Lazy evaluation
RDD RDD RDD
T T
RDD RDD RDD
T A
RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation
A = Action
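
To make the transformation / action split concrete, here is a minimal pure-Python sketch. It is not Spark code: the LazyDataset class and its methods are hypothetical names invented for this illustration. Transformations only record a step in the lineage and return a new, immutable dataset; nothing is computed until an action such as collect is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical class): immutable and lazily evaluated."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original data
        self._lineage = lineage    # tuple of recorded transformations

    def map(self, fn):
        # Transformation: returns a NEW dataset and runs nothing yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the recorded lineage.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []


def double(x):
    calls.append(x)   # side effect lets us observe when work really happens
    return x * 2


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(double).filter(lambda x: x > 4)
work_before_action = len(calls)   # only the plan has been built so far
result = doubled.collect()        # the action triggers evaluation
```

Because every transformation returns a new object, base is untouched and can be recomputed from its lineage at any time, which is the idea behind "resilient and immutable" above.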

Apache Spark Operations
Operations
Transformation
Action

Essential Spark Operations
Categories: General, Math / Statistical, Set Theory / Relational, Data Structure / I/O
TRANSFORMATIONS
map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy,
sample, randomSplit, union, intersection, subtract, distinct, cartesian, zip,
keyBy, zipWithIndex, zipWithUniqueId, zipPartitions, coalesce, repartition,
repartitionAndSortWithinPartitions, pipe
ACTIONS
reduce, collect, aggregate, fold, first, take, foreach, top, treeAggregate,
treeReduce, foreachPartition, collectAsMap, count, takeSample, max, min, sum,
histogram, mean, variance, stdev, sampleVariance, countApprox,
countApproxDistinct, takeOrdered, saveAsTextFile, saveAsSequenceFile,
saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile,
saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

When to use RDDs ?
You care about control over your dataset, know what the data looks like, and want the low-level API.
You don't mind writing lots of lambda functions instead of using a DSL.
You don't care about the schema or structure of the data.
You don't care about optimization, performance and inadvertent inefficiencies!
Caveat: RDDs are very slow for non-JVM languages like Python and R.

Inadvertent inefficiencies in RDDs

Structured in Spark
DataFrames
Datasets

Structured APIs in Apache Spark
                 SQL       DataFrames     Datasets
Syntax Errors    Runtime   Compile Time   Compile Time
Analysis Errors  Runtime   Runtime        Compile Time
Analysis errors are caught before a job runs on cluster

DataFrame API Code
# convert RDD -> DF with column names
from pyspark.sql import functions as F
parsedDF = parsedRDD.toDF(["project", "sprint", "numStories"])
# filter, groupBy, then aggregate with sum()
(parsedDF.filter(F.col("project") == "finance")
    .groupBy("sprint")
    .agg(F.sum("numStories").alias("count"))
    .limit(100)
    .show(100))
project sprint numStories
finance 3 20
finance 4 22

DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22

Catalyst in Spark
SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD

Example: DataFrame Optimization
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
Inputs: events file, employees table
Logical Plan: scan (employees), scan (events) -> join -> filter
Physical Plan: scan (employees); scan (events) -> filter; then join
Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (events), optimized scan (employees) -> join

DataFrames are Faster than RDDs
Source: Databricks

Pragmatic Approach: Executors, Cores, Containers, Stage, Job, Task

Spark Internals terminology
Job - Each action in Spark triggers a separate job; transformations are lazy and only extend the lineage.
Stage - A set of tasks within a job that can run in parallel (on each executor via a ThreadPoolExecutor); stages are separated by shuffle boundaries.
Task - The lowest-level unit of concurrent and parallel execution; one task processes one partition.
Each stage is split into as many tasks as it has partitions,
i.e. number of tasks in a job = sum, over its stages, of the number of partitions in each stage

Spark on Yarn Internals terminology
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.scheduler.minimum-allocation-mb = 4096
yarn.scheduler.maximum-allocation-mb = 28832
yarn.nodemanager.resource.memory-mb = 54000
Max number of containers you can run per node = yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb = 54000 / 4096 = 13 (rounded down)
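
The container arithmetic above can be checked with a few lines of plain Python. This is a rough sketch, not a YARN API: the function names are made up for illustration, and the executor overhead rule (the larger of 384 MB or 10% of executor memory) is assumed to be Spark's default. The 24 GB executor is the value used in the spark-submit example later in this talk.

```python
def max_containers_per_node(node_memory_mb, min_allocation_mb):
    """Upper bound on containers per node: each one takes at least the minimum allocation."""
    return node_memory_mb // min_allocation_mb


def executor_container_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Memory YARN must grant for one executor: heap plus memory overhead.

    Assumes Spark's default overhead rule: max(384 MB, 10% of executor memory).
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead


node_memory_mb = 54000      # yarn.nodemanager.resource.memory-mb
min_allocation_mb = 4096    # yarn.scheduler.minimum-allocation-mb
max_allocation_mb = 28832   # yarn.scheduler.maximum-allocation-mb

containers = max_containers_per_node(node_memory_mb, min_allocation_mb)
request_mb = executor_container_mb(24 * 1024)            # --executor-memory 24g
fits_in_one_container = request_mb <= max_allocation_mb  # else YARN rejects the request
executors_per_node = node_memory_mb // request_mb

print(containers, request_mb, fits_in_one_container, executors_per_node)
```

Under these assumptions a 24 GB executor needs about 27 GB from YARN, so it fits under the maximum allocation but only one such executor fits on a 54000 MB node.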

Spark on Yarn Internals terminology

Resource Manager (Yarn) Tuning

Parallel read from JDBC: Challenges
and best practices.

Spark JDBC Read
What happens when you run this code?
What would be the impact at Database engine side?

Spark JDBC Read: Impact on Database engine e.g MSSQL Server

Impact on Database after Spark Parallel Read
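
To reason about the load a parallel read places on the database, it helps to see what the range split looks like. Spark's JDBC source takes the options partitionColumn, lowerBound, upperBound and numPartitions, and issues one range query per partition. The plain-Python sketch below only approximates that split (it is not Spark's implementation, and the function name and the "id" column are hypothetical); the point is that every predicate becomes its own concurrent query on its own connection.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the per-partition WHERE clauses of a range-partitioned JDBC read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if num_partitions == 1:
        return [None]   # a single, unpartitioned query with no extra predicate

    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    bound = lower
    for i in range(num_partitions):
        low = None if i == 0 else bound
        bound += stride
        high = None if i == num_partitions - 1 else bound
        if low is None:
            # First partition is open-ended downwards and also picks up NULLs.
            predicates.append(f"{column} < {high} OR {column} IS NULL")
        elif high is None:
            # Last partition is open-ended upwards.
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {high}")
    return predicates


predicates = jdbc_partition_predicates("id", 0, 100, 4)
for p in predicates:
    print(p)
```

Note that the bounds only shape the ranges; they do not filter rows, so skewed values outside the bounds all land in the first or last partition. Choosing numPartitions is therefore also choosing how many simultaneous connections and scans the database has to serve.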

An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
The JoinSelection execution planning strategy uses the
spark.sql.autoBroadcastJoinThreshold property (default: 10 MB) to control the maximum size
of a dataset that is broadcast to all worker nodes when performing a join.
# check broadcast join threshold (in MB)
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) // 1024 // 1024
10
# logical plan as a numbered tree (reached through the internal _jdf handle in PySpark)
print(sampleDF._jdf.queryExecution().logical().numberedTreeString())
# query plan
sampleDF.explain()
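
The difference between the two strategies is easier to see on plain Python lists. The sketch below is a single-machine illustration of the two algorithms, not Spark's implementation; the function names and sample rows are invented. A broadcast hash join builds a hash table from the small side and streams the large side past it, so the large side never has to be shuffled. A sort-merge join sorts both sides by key and walks them with two cursors, which in a cluster means shuffling and sorting both sides.

```python
from collections import defaultdict


def broadcast_hash_join(large, small):
    """Inner join: hash the small side once, then probe it for every large-side row."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(k, lv, sv) for k, lv in large for sv in table.get(k, [])]


def sort_merge_join(left, right):
    """Inner join: sort both sides by key, then merge them with two cursors."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, pair it with each left row.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[r][1]) for r in range(j, j_end))
                i += 1
            j = j_end
    return out


employees = [(1, "ann"), (2, "bob"), (3, "cy")]                        # small side
events = [(2, "login"), (1, "login"), (2, "logout"), (4, "login")]     # large side

hashed = broadcast_hash_join(events, employees)
merged = sort_merge_join(events, employees)
```

Both functions return the same rows, only in a different order; what changes is the cost profile, which is why the threshold above matters for large joins.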

Repartition: boost parallelism by increasing the number of partitions. Partition on the join keys so that rows with the same key are joined faster.
// coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))
For example, in a bulk JDBC write with the parameter "bulkCopyBatchSize" -> "2500", the DataFrame has 10 partitions
and each partition writes batches of 2500 records in parallel.
Reduce: the impact on network communication, file I/O, network I/O, bandwidth, etc.

1. // disable autoBroadcastJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. // Order doesn't matter
table1.leftjoin(table2) or table2.leftjoin(table1)
3. Force a broadcast with an explicit hint if one DataFrame is small but not small enough to be broadcast automatically.
4. Minimize shuffling and boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner


Spark Submit Hyper-parameters and Dynamic Allocation
./bin/spark-submit \
--conf spark.yarn.maxAppAttempts=1 \
--name PyConLT19 \
--master yarn \
--deploy-mode cluster \
--driver-memory 18g \
--executor-memory 24g \
--num-executors 4 \
--executor-cores 6 \
--conf spark.speculation=false \
--conf spark.broadcast.compress=true \
--conf spark.sql.broadcastTimeout=36000 \
--conf spark.dynamicAllocation.executorAllocationRatio=1 \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
--conf spark.network.timeout=1200s \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.initialExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=6 \
examples/src/main/python/pi.py
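
Long spark-submit invocations are easier to maintain when they are generated from data rather than typed by hand. The sketch below is one possible approach, not part of Spark or Airflow: build_spark_submit and check_dynamic_allocation are hypothetical helpers. Keeping the settings in a dictionary guarantees each key appears only once, and a small check catches inconsistent dynamic allocation bounds before the job is submitted.

```python
import shlex


def check_dynamic_allocation(conf):
    """Raise if the executor bounds are inconsistent (min <= initial <= max)."""
    low = int(conf["spark.dynamicAllocation.minExecutors"])
    start = int(conf["spark.dynamicAllocation.initialExecutors"])
    high = int(conf["spark.dynamicAllocation.maxExecutors"])
    if not low <= start <= high:
        raise ValueError(f"expected min <= initial <= max, got {low}, {start}, {high}")


def build_spark_submit(app, options, conf):
    """Return the spark-submit command as an argument list."""
    cmd = ["./bin/spark-submit"]
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    for key, value in sorted(conf.items()):   # dict keys are unique by construction
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


options = {
    "master": "yarn",
    "deploy-mode": "cluster",
    "executor-memory": "24g",
    "executor-cores": 6,
}
conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": 2,
    "spark.dynamicAllocation.initialExecutors": 2,
    "spark.dynamicAllocation.maxExecutors": 6,
}

check_dynamic_allocation(conf)
command = build_spark_submit("examples/src/main/python/pi.py", options, conf)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed directly to subprocess.run, or the same dictionaries can be stored as JSON and loaded by the scheduler.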

Case Study: High Level Architecture
Components: OLTP; Shadow Data Source; Sqoop; HDFS (Parquet); Apache Spark / Spark SQL; Yarn Cluster manager; Bulk Load; Customer Specific Reporting DB
Parallelism Orchestration: Airflow

Spark Streaming
Code!
Ref. https://github.com/chetkhatri/getting-started-airflow-for-spark/blob/master/spark_streaming_kafka.py

Key role of Apache Airflow for Scheduling Data Pipelines
Codebase: https://github.com/chetkhatri/getting-started-airflow-for-spark

Trigger the Airflow DAG from API
curl -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs \
  -H "Content-Type: application/json" \
  -d '{"conf": "{\"retail_id\": \"29\", \"env_type\": \"dev\", \"size_is\": \"medium\"}", "run_id": "retailer_1111"}'
Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin
SparkSubmitOperator (inherits from BaseOperator)
https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
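
The request body is where hand-written quoting most often goes wrong, because conf is itself a JSON document embedded as a string inside the outer JSON. A small sketch of building it programmatically follows; build_dag_run_payload is a hypothetical helper, and the endpoint, run_id and conf values are the ones shown on the slide. Sending the request is left to whichever HTTP client you already use.

```python
import json


def build_dag_run_payload(run_id, conf):
    """Body for POST /api/experimental/dags/<dag_id>/dag_runs.

    conf is serialized first, then embedded as a string, so all inner
    quotes are escaped correctly by json.dumps.
    """
    return json.dumps({"run_id": run_id, "conf": json.dumps(conf)})


payload = build_dag_run_payload(
    "retailer_1111",
    {"retail_id": "29", "env_type": "dev", "size_is": "medium"},
)
print(payload)

# Round trip: decode the outer document, then the embedded conf string.
outer = json.loads(payload)
inner = json.loads(outer["conf"])
```

Inside the DAG, the same values are then available to tasks through the DAG run's conf.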

Airflow - spark_hyperparameters.json

Airflow - nextgen_data_platform DAG

Airflow - nextgen_data_master_tables_subdag


Thank you!
PyCon ZA Organizers and South Africa Python Community.
