Apache Spark Python Slides
• To run our programs we will use the Python API for Spark:
PySpark
• Under the hood, Spark will automatically distribute the data contained in
RDDs across your cluster and parallelize the operations you perform on
them.
What can we do with RDDs?
Transformations
Actions
Transformations
• Apply some function to the data in an RDD to create a new RDD.
• Typical workflow: create RDDs, apply transformations, then launch actions.
Creating RDDs
How to create an RDD
• Take an existing collection in our program and pass it to SparkContext's
parallelize method.
• All the elements in the collection will then be copied to form a distributed
dataset that can be operated on in parallel.
• Very handy to create an RDD with little effort.
• NOT practical working with large datasets.
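As a minimal sketch (the local master and app name are illustrative choices), creating an RDD with parallelize looks like this:

from pyspark import SparkContext

# Local SparkContext; master and app name are illustrative.
sc = SparkContext("local[*]", "parallelizeExample")

# Copy a small in-memory collection into a distributed dataset.
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.count())   # 5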
How to create an RDD
• Load RDDs from external storage by calling textFile method on
SparkContext.
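A hedged sketch, reusing the SparkContext sc from the earlier example; the file path is a placeholder for your own input file:

# Each line of the text file becomes one element of the RDD.
lines = sc.textFile("in/some-input.text")   # placeholder path
print(lines.count())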
• The return type of the map function is not necessarily the same as its input type.
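For example (an illustrative sketch), mapping an RDD of strings to an RDD of integers:

lines = sc.parallelize(["3", "14", "15"])
# Input elements are str, output elements are int.
numbers = lines.map(lambda s: int(s))
print(numbers.collect())   # [3, 14, 15]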
Solution to
Airports by latitude problem
flatMap transformation
flatMap
● flatMap is a transformation to create an RDD
from an existing RDD.
flatMap VS map
• map: 1 to 1 relationship — RDD.map() produces exactly one output element per input element.
• flatMap: 1 to many relationship — RDD.flatMap() can produce zero or more output elements per input element.
To the code!
flatMap
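A small sketch contrasting map and flatMap (the example data is made up for illustration):

lines = sc.parallelize(["hello world", "apache spark"])

# map: one output element per input element (here, a list per line).
print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['apache', 'spark']]

# flatMap: the per-line lists are flattened into a single RDD of words.
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'apache', 'spark']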
Set operations
Set operations which are performed on one RDD:
– sample
– distinct
sample
• The sample transformation returns a random sample of the elements in the input RDD.
distinct
• The distinct transformation returns the distinct rows from the input RDD.
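An illustrative sketch of both single-RDD set operations:

numbers = sc.parallelize([1, 2, 2, 3, 3, 3])

# distinct removes duplicate elements (this requires a shuffle).
print(numbers.distinct().collect())          # [1, 2, 3] in some order

# sample returns a random subset: no replacement, roughly half the elements.
print(numbers.sample(False, 0.5).collect())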
union operation
• Union operation gives us back an RDD consisting of the data from both input RDDs.
• If there are any duplicates in the input RDDs, the resulting RDD of Spark's union
operation will contain duplicates as well.
intersection operation
• Intersection operation returns the common elements which appear in both input RDDs.
• Intersection operation removes all duplicates including the duplicates from single RDD
before returning the results.
• Intersection operation is quite expensive since it requires shuffling all the data across
partitions to identify common elements.
subtract operation
• Subtract operation takes in another RDD as an argument and returns us an RDD that
only contains element present in the first RDD and not the second RDD.
• Subtract operation requires a shuffling of all the data which could be quite expensive
for large datasets.
cartesian operation
• Cartesian transformation returns all possible pairs of a and b where a is in the source
RDD and b is in the other RDD.
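A sketch of the two-RDD set operations on small made-up inputs:

a = sc.parallelize([1, 2, 3, 3])
b = sc.parallelize([3, 4, 5])

print(a.union(b).collect())         # [1, 2, 3, 3, 3, 4, 5]  (duplicates kept)
print(a.intersection(b).collect())  # [3]                    (duplicates removed)
print(a.subtract(b).collect())      # [1, 2]
print(a.cartesian(b).collect())     # all (a, b) pairs, 4 * 3 = 12 of them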
reduce
• reduce takes a function that operates on two elements of the type in the
input RDD and returns a new element of the same type. It reduces the
elements of this RDD using the specified binary function.
• This function produces the same result when repetitively applied on the
same set of RDD data, and reduces to a single value.
• With reduce operation, we can perform different types of aggregations.
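For instance (a minimal sketch), summing the elements of an RDD with reduce:

numbers = sc.parallelize([1, 2, 3, 4, 5])

# The binary function is applied repeatedly until a single value remains.
total = numbers.reduce(lambda x, y: x + y)
print(total)   # 15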
Sample solution for
the Sum of Numbers problem
Important aspects about RDDs
RDDs are Distributed
• Each RDD is broken into multiple pieces called partitions,
and these partitions are divided across the nodes in the cluster.
• In case any node in the cluster goes down, Spark can recover the parts
of the RDDs from the input and pick up from where it left off.
• Spark does the heavy lifting for you to make sure that RDDs are fault
tolerant.
Summary of RDD Operations
Summary
• Transformations are operations on RDDs that return a new
RDD, such as map and filter.
• Even though new RDDs can be defined at any time, they are
only computed by Spark in a lazy fashion, that is, the first
time they are used in an ACTION.
Lazy Evaluation
• Transformations on RDDs are lazily evaluated, meaning that
Spark will not begin to execute until it sees an action.
• Rather than thinking of an RDD as containing specific data, it
might be better to think of each RDD as consisting of
instructions on how to compute the data that we build up
through transformations.
• Spark uses lazy evaluation to reduce the number of passes it
has to take over our data by grouping operations together.
Transformations return RDDs, whereas actions return some
other data type.
Transformations: map, filter, flatMap, distinct, sample, union, intersection, subtract, cartesian.
Actions: collect, count, take, reduce, saveAsTextFile.
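A hedged sketch (with a placeholder input path) showing that a transformation returns another RDD while an action returns a plain Python value, and that nothing is computed until the action runs:

lines = sc.textFile("in/numbers.text")       # placeholder path; nothing is read yet

evens = lines.map(lambda s: int(s)).filter(lambda n: n % 2 == 0)
print(type(evens))    # transformations return an RDD

count = evens.count() # the action triggers the actual computation
print(count)          # actions return an ordinary Python value (an int here)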
Caching and Persistence
Persistence
• Sometimes we would like to call actions on the same RDD multiple times.
• If we do this naively, the RDD and all of its dependencies are recomputed
each time an action is called on the RDD.
• This can be very expensive, especially for some iterative algorithms,
which would call actions on the same dataset many times.
• If you want to reuse an RDD in multiple actions, you can also ask Spark to
persist by calling the persist() method on the RDD.
• When you persist an RDD, the first time it is computed in an action, it will
be kept in memory across the nodes.
Different Storage Levels
RDD.persist(StorageLevel level)
RDD.cache() = RDD.persist(MEMORY_ONLY)
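A sketch of persisting an RDD before calling several actions on it (the input path is a placeholder):

from pyspark import StorageLevel

lines = sc.textFile("in/big-input.text")     # placeholder path
words = lines.flatMap(lambda line: line.split(" "))

# Keep the RDD around so the two actions below do not recompute it.
words.persist(StorageLevel.MEMORY_ONLY)      # words.cache() is equivalent

print(words.count())
print(words.take(5))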
Which Storage Level should we choose?
• Spark's storage levels are meant to provide different trade-offs between
memory usage and CPU efficiency.
• If the RDDs can fit comfortably with the default storage level, MEMORY_ONLY is
the ideal option. This is the most CPU-efficient option, allowing operations on
the RDDs to run as fast as possible.
• If not, try using MEMORY_ONLY_SER to make the objects much more space-
efficient, but still reasonably fast to access.
• Don't save to disk unless the functions that computed your datasets are
expensive, or they filter a significant amount of the data.
• What would happen if you attempt to cache too much data to fit in
memory?
– Spark will evict old partitions automatically using a Least Recently Used cache
policy.
– For the MEMORY_ONLY storage level, Spark will re-compute these partitions
the next time they are needed.
– For the MEMORY_AND_DISK storage level, Spark will write these partitions
to disk.
– In either case, your Spark job won't break even if you ask Spark to cache too
much data.
– Caching unnecessary data can cause Spark to evict useful data and lead to
longer re-computation time.
Spark Architecture
Spark - Master-Slave Architecture
[Diagram: the driver program (with its SparkContext) sends tasks to executors running on worker nodes; each executor has a cache and runs tasks.]
[Diagram: word count example — each partition of the input text (lines about New York, US history, New Jersey, the metro, etc.) produces partial counts such as new: 1, york: 1, of: 2, which are then combined into the final counts, e.g. new: 4, york: 3.]
Running Spark in the Local Mode
[Diagram: in local mode, the driver program, SparkContext, and executor with its cache and tasks all run on a single machine.]
Spark Components
Spark SQL
• Spark package designed for working with structured data, built on top of Spark Core.
• Provides an SQL-like interface for working with structured data.
Spark Streaming
• Running on top of Spark, Spark Streaming provides an API for manipulating
data streams that closely matches Spark Core's RDD API.
• Since this is a typical pattern, Spark provides the mapValues function. The
mapValues function will be applied to each key-value pair and will
convert the values based on the supplied function, but it will not change
the keys.
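An illustrative sketch of mapValues on a pair RDD (data made up):

prices = sc.parallelize([("1-bedroom", 300000), ("2-bedroom", 450000)])

# Only the values are transformed; the keys (and any partitioning) are untouched.
in_thousands = prices.mapValues(lambda price: price / 1000)
print(in_thousands.collect())   # [('1-bedroom', 300.0), ('2-bedroom', 450.0)]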
reduceByKey aggregation
Aggregation
• When our dataset is described in the format of key-value pairs, it is quite
common that we would like to aggregate statistics across all elements with
the same key.
• We have looked at the reduce actions on regular RDDs, and there is a
similar operation for pair RDD, it is called reduceByKey.
• reduceByKey runs several parallel reduce operations, one for each key in
the dataset, where each operation combines values that have the same
key.
• Considering input datasets could have a huge number of keys, reduceByKey
operation is not implemented as an action that returns a value to the driver
program. Instead, it returns a new RDD consisting of each key and the
reduced value for that key.
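A minimal word-count sketch with reduceByKey (input made up for illustration):

words = sc.parallelize(["new", "york", "new", "jersey", "new"])

# Build (word, 1) pairs, then combine the counts per key.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)
print(counts.collect())   # [('new', 3), ('york', 1), ('jersey', 1)] in some order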
Sample Solution for
the Average House problem
Task: compute the average price for houses with different numbers of bedrooms
Sorted word count problem
Solution:
1. Flip the key value of the word count RDD to create a new Pair RDD with the
key being the count and the value being the word.
2. Do sortByKey on the intermediate RDD to sort the count.
3. Flip back the Pair RDD again, with the key back to the word and the value back
to the count.
A better solution:
1. Use the sortBy transformation!
sortBy
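A sketch of the sortBy approach on a made-up word-count pair RDD:

word_counts = sc.parallelize([("new", 3), ("york", 1), ("jersey", 2)])

# sortBy sorts on a computed key directly, so no flipping of the pairs is needed.
by_count_desc = word_counts.sortBy(lambda pair: pair[1], ascending=False)
print(by_count_desc.collect())   # [('new', 3), ('jersey', 2), ('york', 1)]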
partitionBy groupByKey
[Diagram: partitionBy places all values with the same key (A, B, C) into the same partition, so groupByKey can combine them locally: A:1 + A:2 = A:3, B:2 + B:1 = B:3, C:3 + C:3 = C:6.]
Operations which would benefit from partitioning
• Join
• leftOuterJoin
• rightOuterJoin
• groupByKey
• reduceByKey
• combineByKey
• lookup
How reduceByKey benefits from partitioning
• Operations like map could cause the new RDD to forget the
parent's partitioning information, as such operations could,
in theory, change the key of each element in the RDD.
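A hedged sketch of pre-partitioning a pair RDD so that reduceByKey can combine values locally (the numbers mirror the earlier partitionBy diagram):

pairs = sc.parallelize([("A", 1), ("B", 2), ("A", 2), ("C", 3), ("B", 1), ("C", 3)])

# Hash-partition by key once and persist, so later pair operations reuse the layout.
partitioned = pairs.partitionBy(2).persist()

# With a known partitioner, reduceByKey can combine values without a full shuffle.
print(partitioned.reduceByKey(lambda x, y: x + y).collect())
# [('A', 3), ('B', 3), ('C', 6)] in some order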
When there are multiple values for the same key in one of the inputs, the
resulting pair RDD will have an entry for every possible pair of values with that key
from the two input RDDs.
• However, sometimes we want the keys in our
result as long as they appear in one of the RDDs.
For instance, if we were joining customer
information with feedback, we might not want
to drop customers if there was no
feedback yet.
Outer Joins
The resulting RDD has entries for each key in the source RDDs. The value associated
with each key in the resulting RDD is a tuple of the value from the source RDD and
an optional value from the other pair RDD.
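An illustrative sketch of leftOuterJoin (names and values made up):

ages = sc.parallelize([("Henry", 50), ("Tom", 30)])
addresses = sc.parallelize([("Henry", "New York")])

# Every key from the left RDD is kept; a missing right-side value becomes None.
print(ages.leftOuterJoin(addresses).collect())
# [('Henry', (50, 'New York')), ('Tom', (30, None))]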
Best Practices
• If both RDDs have duplicate keys, the join operation can dramatically expand
the size of the data. It's recommended to perform a distinct or
combineByKey operation to reduce the key space if possible.
• Join operation may require large network transfers or even create data
sets beyond our capability to handle.
• Joins, in general, are expensive since they require that corresponding keys
from each RDD are located at the same partition so that they can be
combined locally. If the RDDs do not have known partitioners, they will
need to be shuffled so that both RDDs share a partitioner and data with
the same keys lives in the same partitions.
Shuffled Hash Join
• To join data, Spark needs the data that is to be joined
to live on the same partition.
• The default implementation of join in Spark is a
shuffled hash join.
• The shuffled hash join ensures that data on each
partition will contain the same keys by partitioning
the second dataset with the same default partitioner
as the first so that the keys with the same hash value
from both datasets are in the same partition.
• While this approach always works, it can be more
expensive than necessary because it requires a
shuffle.
Avoid Shuffle
• The shuffle can be avoided if both RDDs have a known
partitioner.
• DataFrames store data in a more efficient manner than native RDDs, taking
advantage of their schema.
• Unlike an RDD, data is organized into named columns, like a table in a relational
database.
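A minimal DataFrame sketch (the app name, column names, and rows are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframeExample").getOrCreate()

# Data is organized into named columns with a schema, like a relational table.
df = spark.createDataFrame([("Henry", 50), ("Tom", 30)], ["Name", "Age"])
df.printSchema()
df.filter(df["Age"] > 40).show()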
Important Spark SQL Concept
• DataFrame
• Dataset
Dataset
• The Dataset API, released in Spark 1.6, provides:
– the familiar object-oriented programming style
– compile-time type safety of the RDD API
– the benefits of leveraging schema to work with structured data
• A Dataset is a set of structured data, not necessarily a Row; it could be
of a particular type.
• Java and Spark will know the type of the data in a dataset at compile time.
• The Dataset API is not available in Python.
DataFrame and Dataset
• Starting in Spark 2.0, DataFrame APIs merge with Dataset APIs.
• Dataset takes on two distinct API characteristics: a strongly-typed API
and an untyped API.
• Consider a DataFrame as an untyped view of a Dataset, which is a Dataset of
Row where a Row is a generic untyped JVM object.
• Dataset, by contrast, is a collection of strongly-typed JVM objects.
• The Dataset API is only available on Java and Scala.
• For Python we stick with the DataFrame API.
Spark SQL in Action
Group by Salary Bucket
Salary Middle Point Range | Number of Developers
0 – 20,000                | 10
20,000 – 40,000           | 29
40,000 – 60,000           | 12

Salary Middle Point | Bucket Column Value
51,000              | 40,000

51,000 / 20,000 = 2.55
int(2.55) = 2
2 * 20,000 = 40,000
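A hedged sketch of that bucketing logic with the DataFrame API; the DataFrame and column names (response_df, salary_middle_point) are assumptions for illustration:

from pyspark.sql import functions as F

# int(salary / 20000) * 20000, e.g. 51,000 -> int(2.55) = 2 -> 2 * 20,000 = 40,000
bucketed = response_df.withColumn(
    "salary_bucket",
    F.floor(F.col("salary_middle_point") / 20000) * 20000,
)
bucketed.groupBy("salary_bucket").count().show()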
Catalyst Optimizer
• Spark SQL uses an optimizer called Catalyst to optimize all the queries
written both in Spark SQL and DataFrame DSL.
• This optimizer makes queries run much faster than their RDD
counterparts.
• The Catalyst is a modular library which is built as a rule-based system.
Each rule in the framework focuses on a specific optimization. For
example, a rule like ConstantFolding focuses on removing constant
expressions from the query.
Spark SQL practice:
House Price Problem
Spark SQL Joins
Spark SQL join Vs. core Spark join
• Spark SQL supports the same basic join types as core Spark.
• Spark SQL Catalyst optimizer can do more of the heavy lifting
for us to optimize the join performance.
• Using Spark SQL join, we have to give up some of our control.
For example, Spark SQL can sometimes push down or re-order
operations to make the joins more efficient. The downside is
that we don't have control over the partitioner for
DataFrames, so we can't manually avoid shuffles as we did with
core Spark joins.
Spark SQL Join Types
• The standard SQL join types are supported by Spark SQL and can be
specified as the how parameter when performing a join.
Name  | Age
Henry | 50
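A sketch of passing the how argument when joining two DataFrames; the DataFrame and column names here are illustrative (people_df holds rows like Henry, 50):

# Default join type is inner.
inner = people_df.join(cities_df, on="Name")

# Keep every row of people_df even when there is no matching city.
left = people_df.join(cities_df, on="Name", how="left_outer")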
• The postcode in the maker space data source is the full postcode.
– W1T 3AC
• The postcode in the postcode data source is only the prefix of the postcode.
– W1T
• Join condition:
– If the postcode column in the maker space data source starts with the
postcode column in the postcode data source.
• Corner case:
– W14D T2Y might match both W14D and W14
• Solution:
– Append a space to the postcode prefix
– Then W14D T2Y only matches “W14D “, not “W14 “
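A hedged sketch of that join condition; the DataFrame and column names (maker_space_df, postcode_df, Postcode, postcode) are assumptions:

from pyspark.sql import functions as F

# Append a space to the prefix so "W14D T2Y" matches "W14D " but not "W14 ".
prefix_df = postcode_df.withColumn(
    "postcode_prefix", F.concat(F.col("postcode"), F.lit(" ")))

joined = maker_space_df.join(
    prefix_df,
    maker_space_df["Postcode"].startswith(prefix_df["postcode_prefix"]),
    "left_outer")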
DataFrame or RDD?
DataFrame
• DataFrames are the new hotness.
• MLlib is shifting to a DataFrame-based API.
• Spark Streaming is also moving towards something called
Structured Streaming, which is heavily based on the DataFrame
API.
Are RDDs being treated as second class citizens?
Are they being deprecated?
NO
• The RDDs are still the core and fundamental
building block of Spark.
• Setting the codegen option will ask Spark SQL to compile each query to Java byte code before executing it.
• This codegen option could make long queries or repeated queries substantially
faster, as Spark generates specific code to run them.
• For short queries or some non-repeated ad-hoc queries, this option could add
unnecessary overhead, as Spark has to run a compiler for each query.
• It's recommended to use the codegen option for workflows which involve large
queries, or the same repeated query.
Configure Spark Properties
• spark.sql.inMemoryColumnarStorage.batchSize
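A sketch of setting a Spark SQL property on the SparkSession; the value shown is only an example:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("configExample") \
    .config("spark.sql.inMemoryColumnarStorage.batchSize", 10000) \
    .getOrCreate()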
Running Spark in the Cluster Mode
[Diagram: in cluster mode, the driver program runs on the master machine and sends tasks to executors running on worker nodes (slave machines); each executor has its own cache.]
Cluster Manager
• The cluster manager is a pluggable component in Spark.
• Spark is packaged with a built-in cluster manager called the Standalone
Cluster Manager.
• There are other types of Spark cluster managers, such as:
– Hadoop Yarn
• A resource management and scheduling tool for a Hadoop MapReduce cluster.
– Apache Mesos
• Centralized fault-tolerant cluster manager and global resource manager for your entire data center.
• The cluster manager abstracts away the underlying cluster environment
so that you can use the same unified high-level Spark API to write Spark
program which can run on different clusters.
• You can use spark-submit to submit an application to the cluster
spark-submit
Running Spark Applications on a Cluster
• The user submits an application using spark-submit.
• Spark-submit launches the driver program and invokes the main method
specified by the user.
• The driver program contacts the cluster manager to ask for resources to
start executors.
• The cluster manager launches executors on behalf of the driver program.
• The driver process runs through the user application. Based on the RDD
or dataframe operations in the program, the driver sends work to
executors in the form of tasks.
• Tasks are run on executor processes to compute and save results.
• If the driver's main method exits or it calls SparkContext.stop(), it will
terminate the executors.
spark-submit options
spark-submit \
--executor-memory 20G \
--total-executor-cores 100 \
path/to/examples.py
Benefits of spark-submit
• We can run Spark applications from a command line or
execute the script periodically using a Cron job or other
scheduling service.