
PySpark Transformations Tutorial

This document provides examples of PySpark transformations. PySpark transformations produce a new DataFrame, DataSet or RDD from an existing one. Some key transformations discussed include:
- map: Applies a function to each element of an RDD and returns a new RDD.
- filter: Returns a new RDD containing only elements that satisfy a predicate function.
- reduceByKey: Operates on key-value pairs and combines the values for each key using a function, such as to sum the counts from a baby names dataset grouped by name.
- groupByKey: Groups the elements of each key in a key-value pair RDD into a single list.
The examples demonstrate how these transformations work on sample datasets to

Uploaded by

ravikumar lanka

PySpark Transformations Tutorial [14 Examples]

January 14, 2023 by Todd M


If you’ve read the previous PySpark tutorials on this site, you know that PySpark
transformations are functions that produce a DataFrame, DataSet or Resilient Distributed Dataset
(RDD).  Resilient distributed datasets are Spark’s main programming abstraction, and RDDs are
automatically parallelized across the cluster.
PySpark transformation functions are lazily evaluated.
As Spark matured, this abstraction changed from RDDs to DataFrames to DataSets, but the
underlying concept of a Spark transformation remains the same: transformations produce a
new, lazily evaluated abstraction over a data set, whether the underlying implementation is an RDD,
DataFrame or DataSet.
Note: as you would probably expect when using Python, RDDs can hold objects of multiple types
because Python is dynamically typed.
Some of the PySpark transformation examples shown below load a CSV file.  A snippet of this
CSV file:
Year,First Name,County,Sex,Count
2012,DOMINIC,CAYUGA,M,6
2012,ADDISON,ONONDAGA,F,14
2012,JULIA,ONONDAGA,F,15
There is a link to download this file in the Resources section below.
PySpark Transformations Examples
- map
- flatMap
- filter
- mapPartitions
- mapPartitionsWithIndex
- sample
- union
- intersection
- distinct
- The Keys
- groupByKey
- reduceByKey
- aggregateByKey
- sortByKey
- join
- Frequently Asked Questions
  1. What is a transformation in PySpark?
  2. What is the difference between transformations and actions in PySpark?
  3. What are the different transformation types in PySpark?
- References

map
The map transformation returns a new RDD by applying a function to each element of the source RDD.
>>> baby_names = sc.textFile("baby_names.csv")
>>> rows = baby_names.map(lambda line: line.split(","))
So, in this PySpark transformation example, we’re creating a new RDD called “rows” by splitting
every row in the baby_names RDD.  We accomplish this by mapping over every element in
baby_names and passing in a lambda function that splits each line on commas.
From here, we could use Python to access the array as shown next:
>>> for row in rows.take(rows.count()): print(row[1])

First Name
DAVID
JAYDEN
...
flatMap
flatMap is similar to map, because it applies a function to all elements in an RDD.  But flatMap
also flattens the results.
Compare flatMap to map in the following:
>>> sc.parallelize([2, 3, 4]).flatMap(lambda x: [x,x,x]).collect()
[2, 2, 2, 3, 3, 3, 4, 4, 4]

>>> sc.parallelize([1,2,3]).map(lambda x: [x,x,x]).collect()
[[1, 1, 1], [2, 2, 2], [3, 3, 3]]
This is helpful with nested datasets such as those found in JSON.
collect was added to the flatMap and map calls above for clarity.  Without collect, we can focus
on the Spark aspect of the example (the RDD return type):
>>> sc.parallelize([2, 3, 4]).flatMap(lambda x: [x,x,x])
PythonRDD[36] at RDD at PythonRDD.scala:43
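To make the difference concrete without a Spark session, here is a minimal plain-Python sketch. The helper names local_map and local_flat_map are hypothetical stand-ins for rdd.map(f).collect() and rdd.flatMap(f).collect(); they only mimic the shape of the results on an ordinary list and are not part of PySpark.

```python
from itertools import chain


def local_map(f, items):
    """Mimic rdd.map(f).collect() on a plain list: one output per input."""
    return [f(x) for x in items]


def local_flat_map(f, items):
    """Mimic rdd.flatMap(f).collect(): apply f, then flatten one level."""
    return list(chain.from_iterable(f(x) for x in items))


def triple(x):
    return [x, x, x]


mapped = local_map(triple, [1, 2, 3])
flattened = local_flat_map(triple, [2, 3, 4])

print(mapped)     # [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(flattened)  # [2, 2, 2, 3, 3, 3, 4, 4, 4]
```

Only one level of nesting is removed, which is why flatMap expects the function to return an iterable for each element.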
filter
Create a new RDD by returning only the elements that satisfy the search filter.  For the
SQL-minded, think of a WHERE clause.
>>> rows.filter(lambda line: "MICHAEL" in line).collect()
Out[36]:
[[u'2013', u'MICHAEL', u'QUEENS', u'M', u'155'],
[u'2013', u'MICHAEL', u'KINGS', u'M', u'146'],
[u'2013', u'MICHAEL', u'SUFFOLK', u'M', u'142']...
For a more in depth tutorial on filter see PySpark Filter Tutorial.


mapPartitions
Consider mapPartitions a tool for performance optimization if you have the resources available.
It won’t do much when running examples on your laptop.  It’s the same as “map”, but the function
is called once per partition of the RDD instead of once per element, and those partitions are
distributed.  Remember the first D in RDD – Resilient Distributed Datasets.
In the examples below, when parallelize is used, the elements of the collection are copied to form a
distributed dataset that can be operated on in parallel.
One important parameter for parallel collections is the number of partitions to cut the dataset
into. Spark will run one task for each partition of the cluster.
>>> one_through_9 = range(1,10)
>>> parallel = sc.parallelize(one_through_9, 3)
>>> def f(iterator): yield sum(iterator)
>>> parallel.mapPartitions(f).collect()
[6, 15, 24]

>>> parallel = sc.parallelize(one_through_9)
>>> parallel.mapPartitions(f).collect()
[1, 2, 3, 4, 5, 6, 7, 17]
See what’s happening?  The results [6,15,24] are created because mapPartitions loops through 3
partitions, the number given as the second argument to the sc.parallelize call.
Partition 1: 1+2+3 = 6
Partition 2: 4+5+6 = 15
Partition 3: 7+8+9 = 24
The second example produces [1,2,3,4,5,6,7,17], which I’m guessing means the default number of
partitions on my laptop is 8.
Partition 1: 1
Partition 2: 2
Partition 3: 3
Partition 4: 4
Partition 5: 5
Partition 6: 6
Partition 7: 7
Partition 8: 8+9 = 17
Typically you want 2-4 partitions for each CPU core in your cluster. Normally, Spark tries to set
the number of partitions automatically based on your cluster, or on your hardware in a standalone
environment.
To find the default number of partitions and confirm the guess of 8 above:
>>> print(sc.defaultParallelism)
8
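The partition arithmetic above can be reproduced in plain Python. This is only a sketch: slice_range and local_map_partitions are hypothetical helpers, and the index-based split (chunk i covers indices i*n//k up to (i+1)*n//k) is an assumption chosen because it matches the partition contents listed above, not a statement about how Spark must split every collection.

```python
def slice_range(r, num_slices):
    """Split a range into num_slices contiguous chunks by index.

    Assumption: chunk i covers indices [i*n // k, (i+1)*n // k),
    which reproduces the partition contents shown in the text.
    """
    n = len(r)
    return [
        list(r[i * n // num_slices:(i + 1) * n // num_slices])
        for i in range(num_slices)
    ]


def f(iterator):
    # Same function as in the tutorial: one sum per partition.
    yield sum(iterator)


def local_map_partitions(func, partitions):
    """Call func once per partition and concatenate whatever it yields."""
    out = []
    for part in partitions:
        out.extend(func(iter(part)))
    return out


three = slice_range(range(1, 10), 3)
eight = slice_range(range(1, 10), 8)

print(three)                            # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(local_map_partitions(f, three))   # [6, 15, 24]
print(local_map_partitions(f, eight))   # [1, 2, 3, 4, 5, 6, 7, 17]
```

The point to notice is that f runs once per partition, so the number of results equals the number of partitions, not the number of elements.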
mapPartitionsWithIndex
Similar to mapPartitions, but the function also receives an int value giving the index
position of the partition.
>>> parallel = sc.parallelize(range(1,10),4)
>>> def show(index, iterator): yield 'index: '+str(index)+" values: "+ str(list(iterator))
>>> parallel.mapPartitionsWithIndex(show).collect()

['index: 0 values: [1, 2]',
'index: 1 values: [3, 4]',
'index: 2 values: [5, 6]',
'index: 3 values: [7, 8, 9]']
When learning these APIs on an individual laptop or desktop, it might be helpful to show
differences in capabilities and outputs.  For example, if we change the above example to use a
parallelized list with 3 slices, our output changes significantly:
>>> parallel = sc.parallelize(range(1,10),3)
>>> def show(index, iterator): yield 'index: '+str(index)+" values: "+ str(list(iterator))
>>> parallel.mapPartitionsWithIndex(show).collect()

['index: 0 values: [1, 2, 3]',
'index: 1 values: [4, 5, 6]',
'index: 2 values: [7, 8, 9]']
sample
Return a randomly sampled subset of the input RDD.
>>> parallel = sc.parallelize(range(1,10))
>>> parallel.sample(True,.2).count()
2

>>> parallel.sample(True,.2).count()
1

>>> parallel.sample(True,.2).count()
2
sample(withReplacement, fraction, seed=None)
Parameters:
- withReplacement – can elements be sampled multiple times (replaced when sampled out)
- fraction – expected size of the sample as a fraction of this RDD’s size
  without replacement: probability that each element is chosen; fraction must be [0, 1]
  with replacement: expected number of times each element is chosen; fraction must be >= 0
- seed – seed for the random number generator
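The "probability that each element is chosen" wording explains why the counts above differ from run to run. Below is a plain-Python sketch of the without-replacement case only; local_sample is a hypothetical helper, and the per-element coin flip is an illustration of the documented meaning of fraction rather than Spark's actual sampler. The with-replacement case used in the example above is not modeled here.

```python
import random


def local_sample(items, fraction, seed=None):
    """Keep each element independently with probability `fraction`.

    Models sampling WITHOUT replacement, where fraction must be in [0, 1].
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1] without replacement")
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


data = list(range(1, 10))

# The sample size is only *expected* to be fraction * len(data);
# individual runs can be smaller or larger.
sizes = [len(local_sample(data, 0.2, seed=s)) for s in range(5)]
print(sizes)

# The same seed always gives the same sample.
print(local_sample(data, 0.2, seed=42) == local_sample(data, 0.2, seed=42))  # True
```

Passing a seed is the way to make a sampled result repeatable, for example in tests.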
union
Simple.  Return the union of two RDDs
>>> one = sc.parallelize(range(1,10))
>>> two = sc.parallelize(range(10,21))
>>> one.union(two).collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

intersection
Again, simple.  Similar to union but return the intersection of two RDDs
>>> one = sc.parallelize(range(1,10))
>>> two = sc.parallelize(range(5,15))
>>> one.intersection(two).collect()
[5, 6, 7, 8, 9]
distinct
Another simple one.  Return a new RDD with distinct elements within a source RDD
>>> parallel = sc.parallelize(range(1,9))
>>> par2 = sc.parallelize(range(5,15))

>>> parallel.union(par2).distinct().collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Formal API: distinct(): RDD[T]
The Keys
The following group of transformation functions (groupByKey, reduceByKey, aggregateByKey,
sortByKey, join) all act on RDDs of (key, value) pairs.

For the following, we’re going to use the baby_names.csv file again which was introduced in a
previous post What is Apache Spark?
All the following examples presume the baby_names.csv file has been loaded and split such as:
>>> baby_names = sc.textFile("baby_names.csv")
>>> rows = baby_names.map(lambda line: line.split(","))
groupByKey
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. “
The following groups all names to counties in which they appear over the years.
>>> rows = baby_names.map(lambda line: line.split(","))
>>> namesToCounties = rows.map(lambda n: (str(n[1]),str(n[2]) )).groupByKey()
>>> namesToCounties.map(lambda x : {x[0]: list(x[1])}).collect()

[{'GRIFFIN': ['ERIE',
'ONONDAGA',
'NEW YORK',
'ERIE',
'SUFFOLK',
'MONROE',
'NEW YORK',
...
The above example was created from baby_names.csv file which was introduced in previous
post What is Apache Spark? See reading CSV from PySpark tutorial for loading CSV in PySpark.
For a more in depth tutorial on groupBy see PySpark groupBy Tutorial.
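As a rough mental model, groupByKey collects every value for a key into one iterable without combining them. The sketch below shows that behavior in plain Python; local_group_by_key is a hypothetical helper and the sample pairs are made up for illustration (they reuse names and counties that appear in this tutorial, not real rows from the file).

```python
from collections import defaultdict


def local_group_by_key(pairs):
    """Mimic groupByKey on a list of (key, value) pairs: keep all values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Hypothetical (name, county) pairs for illustration only.
pairs = [
    ("GRIFFIN", "ERIE"),
    ("GRIFFIN", "ONONDAGA"),
    ("JULIA", "ONONDAGA"),
    ("GRIFFIN", "ERIE"),
]

grouped = local_group_by_key(pairs)
print(grouped)
# {'GRIFFIN': ['ERIE', 'ONONDAGA', 'ERIE'], 'JULIA': ['ONONDAGA']}
```

Duplicates such as the second 'ERIE' are kept, which matches the repeated counties in the output above.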
reduceByKey
Operates on (key, value) pairs again, but the func must be of type (V,V) => V.
Let’s sum the yearly name counts over the years in the CSV.  Notice we need to filter out the
header row.  Also notice we are going to use the “Count” column value (n[4]).
>>> filtered_rows = baby_names.filter(lambda line: "Count" not in line).map(lambda line: line.split(","))
>>> filtered_rows.map(lambda n: (str(n[1]), int(n[4]) ) ).reduceByKey(lambda v1,v2: v1 + v2).collect()

[('GRIFFIN', 268),
('KALEB', 172),
('JOHNNY', 219),
('SAGE', 5),
('MIKE', 40),
('NAYELI', 44),
....
Formal API: reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]


The above example was created from baby_names.csv file which was introduced in previous
post What is Apache Spark?
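One way to see why the function must have type (V,V) => V is to model the work in two stages: values are first combined within each partition, then the partial results are combined across partitions, so the same function has to accept both raw values and earlier results. The plain-Python sketch below illustrates this; local_reduce_by_key is a hypothetical helper, the partitions are plain lists, and the counts are invented for illustration.

```python
def local_reduce_by_key(func, partitions):
    """Combine values per key inside each partition, then merge partitions."""

    def combine(pairs):
        acc = {}
        for key, value in pairs:
            # func sees either two raw values or a partial result and a value.
            acc[key] = func(acc[key], value) if key in acc else value
        return acc

    partials = [combine(part) for part in partitions]
    merged = combine(kv for partial in partials for kv in partial.items())
    return sorted(merged.items())


# Hypothetical (name, count) pairs spread over two partitions.
partitions = [
    [("ADDISON", 14), ("JULIA", 15), ("ADDISON", 3)],
    [("JULIA", 5), ("DOMINIC", 6)],
]

totals = local_reduce_by_key(lambda v1, v2: v1 + v2, partitions)
print(totals)  # [('ADDISON', 17), ('DOMINIC', 6), ('JULIA', 20)]
```

The result is sorted here only to make the output deterministic; the order of a collected RDD is not sorted by key.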
aggregateByKey
Ok, I admit, this one drives me a bit nuts.  Why wouldn’t we just use reduceByKey?  I don’t
feel smart enough to know when to use aggregateByKey over reduceByKey.  For example, the
same results as reduceByKey may be produced:
>>> filtered_rows = baby_names.filter(lambda line: "Count" not in line).map(lambda line: line.split(","))
>>> filtered_rows.map(lambda n: (str(n[1]), int(n[4]) ) ).aggregateByKey(0, lambda acc,v: acc + int(v), lambda acc1,acc2: acc1 + acc2).collect()

[('GRIFFIN', 268),
('KALEB', 172),
('JOHNNY', 219),
('SAGE', 5),
...
And again,  the above example was created from baby_names.csv file which was introduced in
previous post What is Apache Spark?
There’s a gist of aggregateByKey as well.
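One case where aggregateByKey seems to earn its keep is when the accumulated result has a different type from the values, for example a (sum, count) pair built from plain integer counts. The plain-Python sketch below models the three arguments (zero value, a function that folds one value into an accumulator, and a function that merges two accumulators); local_aggregate_by_key is a hypothetical helper and the data is invented for illustration.

```python
def local_aggregate_by_key(zero, seq_op, comb_op, partitions):
    """Fold values into per-key accumulators per partition, then merge them."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # seq_op: (accumulator, value) -> accumulator
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    merged = {}
    for partial in partials:
        for key, acc_value in partial.items():
            # comb_op: (accumulator, accumulator) -> accumulator
            merged[key] = comb_op(merged[key], acc_value) if key in merged else acc_value
    return sorted(merged.items())


# Hypothetical (name, count) pairs spread over two partitions.
partitions = [
    [("ADDISON", 14), ("JULIA", 15), ("ADDISON", 4)],
    [("JULIA", 5), ("DOMINIC", 6)],
]

sum_count = local_aggregate_by_key(
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold one int into the pair
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two pairs
    partitions,
)
averages = {name: total / count for name, (total, count) in sum_count}

print(sum_count)  # [('ADDISON', (18, 2)), ('DOMINIC', (6, 1)), ('JULIA', (20, 2))]
print(averages)   # {'ADDISON': 9.0, 'DOMINIC': 6.0, 'JULIA': 10.0}
```

With reduceByKey the values would first have to be mapped to (count, 1) pairs; aggregateByKey lets the accumulator type differ from the value type directly.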
sortByKey
This simply sorts the (K,V) pairs by K.  Try it out. See the examples above for where
filtered_rows originates.
>>> filtered_rows.map(lambda n: (str(n[1]), int(n[4]) ) ).sortByKey().collect()
[('AADEN', 18),
('AADEN', 11),
('AADEN', 10),
('AALIYAH', 50),
('AALIYAH', 44),
...

# opposite (descending) order
>>> filtered_rows.map(lambda n: (str(n[1]), int(n[4]) ) ).sortByKey(False).collect()

[('ZOIE', 5),
('ZOEY', 37),
('ZOEY', 32),
('ZOEY', 30),
...
join
If you have relational database experience, this will be easy.  It joins two datasets on their
keys.  Other joins are available as well, such as leftOuterJoin and rightOuterJoin.
>>> names1 = sc.parallelize(("abe", "abby", "apple")).map(lambda a: (a, 1))
>>> names2 = sc.parallelize(("apple", "beatty", "beatrice")).map(lambda a: (a, 1))
>>> names1.join(names2).collect()

[('apple', (1, 1))]


leftOuterJoin, rightOuterJoin
>>> names1.leftOuterJoin(names2).collect()
[('abe', (1, None)), ('apple', (1, 1)), ('abby', (1, None))]

>>> names1.rightOuterJoin(names2).collect()
[('apple', (1, 1)), ('beatrice', (None, 1)), ('beatty', (None, 1))]
If you are interested in learning more, there are PySpark Joins with SQL and PySpark Joins with
DataFrame tutorials.
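The three join results above can be reproduced with a small plain-Python sketch. local_join is a hypothetical helper that works on lists of (key, value) pairs; it is meant to show the shape of the output, with None filling the missing side in the outer joins, and it sorts by key only so the output is deterministic.

```python
def local_join(left, right, how="inner"):
    """Join two lists of (key, value) pairs on key.

    how is "inner", "left" (like leftOuterJoin) or "right" (like rightOuterJoin).
    """
    left_keys = {k for k, _ in left}
    out = []
    for k, lv in left:
        matches = [rv for rk, rv in right if rk == k]
        if matches:
            out.extend((k, (lv, rv)) for rv in matches)
        elif how == "left":
            out.append((k, (lv, None)))
    if how == "right":
        out.extend((k, (None, rv)) for k, rv in right if k not in left_keys)
    return sorted(out, key=lambda kv: kv[0])


names1 = [(a, 1) for a in ("abe", "abby", "apple")]
names2 = [(a, 1) for a in ("apple", "beatty", "beatrice")]

inner = local_join(names1, names2)
left_outer = local_join(names1, names2, how="left")
right_outer = local_join(names1, names2, how="right")

print(inner)        # [('apple', (1, 1))]
print(left_outer)   # [('abby', (1, None)), ('abe', (1, None)), ('apple', (1, 1))]
print(right_outer)  # [('apple', (1, 1)), ('beatrice', (None, 1)), ('beatty', (None, 1))]
```

The contents match the PySpark output above; only the ordering differs, because collect() does not return results sorted by key.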
Frequently Asked Questions
1. What is a transformation in PySpark?
A PySpark transformation is an operation that creates a new RDD (Resilient Distributed
Dataset) or DataFrame from an existing one.
Transformations are lazily evaluated, meaning they are not executed immediately when called,
but rather, create a plan for how to execute the operation when an action is called. This allows
PySpark to optimize the execution plan and reduce unnecessary computation.
Transformations are an essential part of PySpark programming.
2. What is the difference between transformations and actions in PySpark?
Transformations create a new DataFrame without immediately computing the result, while
PySpark actions trigger the computation of the DataFrame and return a value or perform a side
effect on the data.
See PySpark action examples for deeper dive on PySpark actions.
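The split between transformations and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea: LazyPipeline is a hypothetical stand-in for an RDD, its map and filter methods merely record steps, and collect plays the role of an action that finally runs them. It says nothing about how Spark plans or optimizes real jobs.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations record steps, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, f):
        # Transformation: returns a new pipeline, runs nothing.
        return LazyPipeline(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        # Transformation: returns a new pipeline, runs nothing.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan executed.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def noisy_double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


pipeline = LazyPipeline(range(1, 6)).map(noisy_double).filter(lambda x: x > 4)
calls_before = list(calls)   # nothing has executed yet
result = pipeline.collect()  # the action triggers the work

print(calls_before)  # []
print(result)        # [6, 8, 10]
print(calls)         # [1, 2, 3, 4, 5]
```

Each transformation returns a new object and leaves the original untouched, which mirrors the "new RDD from an existing one" wording above.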
3. What are the different transformation types in PySpark?
As we saw above, there are many different transformation types in Pyspark.
PySpark RDD – Transformations
12 months ago
by Gottumukkala Sravan Kumar
PySpark is the Python API for Apache Spark; it provides Spark-style distributed processing from Python.
RDD stands for Resilient Distributed Dataset. The RDD is a fundamental data structure in
Apache Spark.
The RDD class lives in the pyspark.rdd module; the examples below import it, although
calling parallelize() does not require the import.
In PySpark, we can create an RDD with the parallelize() method.
Syntax:
spark_app.sparkContext.parallelize(data)
Where data can be one-dimensional (linear) data or two-dimensional (row-column) data.
RDD Transformations:
A transformation is an operation that is applied to an RDD to create a new RDD from the
existing one. For example, a transformation can filter the elements of an RDD.
Let’s see the transformations that are performed on the given RDD.
We will discuss them one by one.
1. map()
The map() transformation applies a function to every element in the RDD. It takes a function
as a parameter, typically an anonymous lambda, and returns a new RDD of the transformed elements.
Syntax:
RDD_data.map(anonymous_function)
Parameters:
anonymous_function looks like:
lambda element:operation
For example, the operation can add a value to, or subtract a value from, every element.
Let’s see the examples to understand this transformation better.
Example 1:
In this example, we create an RDD named student_marks with 20 elements, apply the map()
transformation to add 20 to each element, and display the result using the collect() action.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd

from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements


student_marks = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])

#display data in RDD

print("Actual data in RDD: ",student_marks.map(lambda element: element).collect())

#Apply map() transformation by adding 20 to each element in RDD

print("After adding 20 to each element in RDD:",student_marks.map(lambda element: element+ 20).collect())
Output:
Actual data in RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

After adding 20 to each element in RDD: [109, 96, 98, 109, 110, 120, 54, 76, 74, 42, 65, 63, 43, 76, 98, 41, 54, 54, 76, 54]
From the above output, we can see that 20 is added to every element in the RDD
by the lambda function passed to the map() transformation.
Example 2:
In this example, we create an RDD named student_marks with 20 elements, apply the map()
transformation to subtract 15 from each element, and display the result using the collect() action.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd

from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements

student_marks = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])

#display data in RDD

print("Actual data in RDD: ",student_marks.map(lambda element: element).collect())


#Apply map() transformation by subtracting 15 from each element in RDD

print("After subtracting 15 from each element in RDD:",student_marks.map(lambda element: element-15).collect())
Output:
Actual data in RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

After subtracting 15 from each element in RDD: [74, 61, 63, 74, 75, 85, 19, 41, 39, 7, 30, 28, 8, 41, 63, 6, 19, 19, 41, 19]
From the above output, we can see that 15 is subtracted from every element in the
RDD by the lambda function passed to the map() transformation.
2. filter()
The filter() transformation selects values from the RDD. It takes a function, typically an
anonymous lambda, and returns a new RDD containing only the elements for which the function returns true.
Syntax:
RDD_data.filter(anonymous_function)
Parameters:
anonymous_function looks like:
lambda element:condition/expression
The condition is an expression evaluated for each element; elements for which it is true are kept.
Let’s see examples to understand this transformation better.
Example 1:
In this example, we create an RDD named student_marks with 20 elements and apply filter()
transformation by filtering only multiples of 5 and displaying them using collect() action.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd

from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements

student_marks = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])

 
#display data in RDD

print("Actual data in RDD: ",student_marks.map(lambda element: element).collect())

#Apply filter() transformation by returning only multiples of 5.

print("Multiples of 5 from an RDD: ",student_marks.filter(lambda element: element%5==0).collect())
Output:
Actual data in RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

Multiples of 5 from an RDD: [90, 100, 45]


From the above output, we can see that only the multiples of 5 are kept from the RDD.
Example 2:
In this example, we create an RDD named student_marks with 20 elements and apply filter()
transformation by filtering elements that are greater than 45 and displaying them using collect()
action.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd

from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements

student_marks = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])

#display data in RDD

print("Actual data in RDD: ",student_marks.map(lambda element: element).collect())


 

#Apply filter() transformation by filtering values greater than 45

print("Values greater than 45: ",student_marks.filter(lambda element: element>45).collect())


Output:
Actual data in RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

Values greater than 45: [89, 76, 78, 89, 90, 100, 56, 54, 56, 78, 56]


From the above output, we can see that only the elements greater than 45 are kept from the RDD.
3. union()
The union() transformation combines two RDDs into one. Duplicate elements are not removed,
as the outputs below show.
Syntax:
RDD_data1.union(RDD_data2)
Let’s see examples to understand this transformation better.
Example 1:
In this example, we will create a single RDD with student marks data and generate two RDDs
from it by filtering some values using the filter() transformation. After that, we
perform the union() transformation on the two filtered RDDs.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd

from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements

student_marks = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])
 

#display data in RDD

print("Actual data in RDD: ",student_marks.map(lambda element: element).collect())

first_filter = student_marks.filter(lambda element: element >90)

second_filter = student_marks.filter(lambda element: element <40)

#display first filtered transformation

print("Elements in RDD greater than 90 ",first_filter.collect())

#display second filtered transformation

print("Elements in RDD less than 40 ",second_filter.collect())

#Apply union() transformation by performing union on the above 2 filters

print("Union Transformation on two filtered data",first_filter.union(second_filter).collect())


Output:
Actual data in RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

Elements in RDD greater than 90 [100]

Elements in RDD less than 40 [34, 22, 23, 21, 34, 34, 34]

Union Transformation on two filtered data [100, 34, 22, 23, 21, 34, 34, 34]


From the above output, you can see that we performed a union on first_filter and second_filter.
first_filter contains the elements of the student_marks RDD that are greater than 90, and
second_filter contains the elements of the student_marks RDD that are less than 40; both are
obtained with the filter() transformation.
Example 2:
In this example, we will create two RDDs such that the first RDD has 20 elements and the second
RDD has 10 elements. Following that, we can apply a union() transformation to these two RDDs.
#import the pyspark module

import pyspark

#import SparkSession for creating a session

from pyspark.sql import SparkSession

# import RDD from pyspark.rdd


from pyspark.rdd import RDD

#create an app named linuxhint

spark_app = SparkSession.builder.appName('linuxhint').getOrCreate()

# create student marks data with 20 elements

student_marks1 = spark_app.sparkContext.parallelize([89,76,78,89,90,100,34,56,54,22,45,43,23,56,78,21,34,34,56,34])

# create student marks data with 10 elements

student_marks2 =spark_app.sparkContext.parallelize([45,43,23,56,78,21,34,34,56,34])

#display data in RDD

print("Actual data in student marks 1 RDD: ",student_marks1.map(lambda element: element).collect())

#display data in RDD

print("Actual data in student marks 2 RDD: ",student_marks2.map(lambda element: element).collect())

#Apply union() transformation by performing union on the above 2 RDD's

print("Union Transformation on two RDD ",student_marks1.union(student_marks2).collect())


Output:
Actual data in student marks 1 RDD: [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

Actual data in student marks 2 RDD: [45, 43, 23, 56, 78, 21, 34, 34, 56, 34]

Union Transformation on two RDD [89, 76, 78, 89, 90, 100, 34, 56, 54, 22, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34, 45, 43, 23, 56, 78, 21, 34, 34, 56, 34]
We can see that the two RDDs are combined using the union() transformation.
Conclusion
In this PySpark tutorial, we saw three transformations applied to RDDs. The map() transformation
applies a function to every element of an RDD, and filter() creates a new RDD containing only
the elements of the existing RDD that satisfy a condition. Finally, we discussed union(), which
combines two RDDs.

Overview
At a high level, every Spark application consists of a driver program that runs the
user’s main function and executes various parallel operations on a cluster. The main abstraction
Spark provides is a resilient distributed dataset (RDD), which is a collection of elements
partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created
by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or
an existing Scala collection in the driver program, and transforming it. Users may also ask Spark
to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By
default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy
of each variable used in the function to each task. Sometimes, a variable needs to be shared
across tasks, or between tasks and the driver program. Spark supports two types of shared
variables: broadcast variables, which can be used to cache a value in memory on all nodes,
and accumulators, which are variables that are only “added” to, such as counters and sums.
This guide shows each of these features in each of Spark’s supported languages. It is easiest to
follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell
or bin/pyspark for the Python one.
Linking with Spark

Spark 3.4.1 is built and distributed to work with Scala 2.12 by default. (Spark can be built to work
with other versions of Scala, too.) To write applications in Scala, you will need to use a
compatible Scala version (e.g. 2.12.X).
To write a Spark application, you need to add a Maven dependency on Spark. Spark is available
through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.12
version = 3.4.1
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-
client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
(Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable
essential implicit conversions.)
Initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark
how to access a cluster. To create a SparkContext you first need to build a SparkConf object that
contains information about your application.
Only one SparkContext should be active per JVM. You must stop() the active SparkContext
before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the cluster UI. master is
a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice,
when running on a cluster, you will not want to hardcode master in the program, but
rather launch the application with spark-submit and receive it there. However, for local testing
and unit tests, you can pass “local” to run Spark in-process.
Using the Shell

In the PySpark shell, a special interpreter-aware SparkContext is already created for you, in the
variable called sc. Making your own SparkContext will not work. You can set which master the
context connects to using the --master argument, and you can add Python .zip, .egg or .py files
to the runtime path by passing a comma-separated list to --py-files. For third-party Python
dependencies, see Python Package Management. You can also add dependencies (e.g. Spark
Packages) to your shell session by supplying a comma-separated list of Maven coordinates to
the --packages argument. Any additional repositories where dependencies might exist (e.g.
Sonatype) can be passed to the --repositories argument. For example, to run bin/pyspark on
exactly four cores, use:
$ ./bin/pyspark --master local[4]
Or, to also add code.py to the search path (in order to later be able to import code), use:
$ ./bin/pyspark --master local[4] --py-files code.py
For a complete list of options, run pyspark --help. Behind the scenes, pyspark invokes the more
general spark-submit script.
It is also possible to launch the PySpark shell in IPython, the enhanced Python interpreter.
PySpark works with IPython 1.0.0 and later. To use IPython, set
the PYSPARK_DRIVER_PYTHON variable to ipython when running bin/pyspark:
$ PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
To use the Jupyter notebook (previously known as the IPython notebook):
$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
You can customize the ipython or jupyter commands by
setting PYSPARK_DRIVER_PYTHON_OPTS.
After the Jupyter Notebook server is launched, you can create a new notebook from the “Files”
tab. Inside the notebook, you can input the command %pylab inline as part of your notebook
before you start to try Spark from the Jupyter notebook.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-
tolerant collection of elements that can be operated on in parallel. There are two ways to create
RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an
external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling SparkContext’s parallelize method on an existing
iterable or collection in your driver program. The elements of the collection are copied to form a
distributed dataset that can be operated on in parallel. For example, here is how to create a
parallelized collection holding the numbers 1 to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we
can call distData.reduce(lambda a, b: a + b) to add up the elements of the list. We describe
operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut the dataset
into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for
each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically
based on your cluster. However, you can also set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a
synonym for partitions) to maintain backward compatibility.
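As a rough illustration of what cutting a dataset into partitions and running one task per partition means, the plain-Python sketch below (no Spark required) slices a list into contiguous chunks, reduces each chunk on its own, and then combines the partial results. The helper names slice_into_partitions and partitioned_reduce are hypothetical, and the slicing arithmetic is only an assumption about how an even split could be done, not a statement of Spark's internals.

```python
from functools import reduce


def slice_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


def partitioned_reduce(data, func, num_partitions):
    """Reduce each non-empty partition separately (one 'task' each), then combine."""
    partitions = slice_into_partitions(data, num_partitions)
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


data = [1, 2, 3, 4, 5]
partitions = slice_into_partitions(data, 2)               # two chunks of the list
total = partitioned_reduce(data, lambda a, b: a + b, 2)   # sum of all elements
```

In PySpark itself the corresponding calls are the ones already mentioned above: sc.parallelize(data, 2) to choose the number of partitions and distData.reduce(lambda a, b: a + b) to aggregate.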
External Datasets
PySpark can create distributed datasets from any storage source supported by Hadoop,
including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text
files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for
the file (either a local path on the machine, or a hdfs://, s3a://, etc. URI) and reads it as a collection
of lines. Here is an example invocation:
>>> distFile = sc.textFile("data.txt")
Once created, distFile can be acted on by dataset operations. For example, we can add up the
sizes of all the lines using the map and reduce operations as follows: distFile.map(lambda s:
len(s)).reduce(lambda a, b: a + b).
Some notes on reading files with Spark:
 If using a path on the local filesystem, the file must also be accessible at the same path on
worker nodes. Either copy the file to all workers or use a network-mounted shared file
system.
 All of Spark’s file-based input methods, including textFile, support running on directories,
compressed files, and wildcards as well. For example, you can
use textFile("/my/directory"), textFile("/my/directory/*.txt"),
and textFile("/my/directory/*.gz").
 The textFile method also takes an optional second argument for controlling the number of
partitions of the file. By default, Spark creates one partition for each block of the file (blocks
being 128MB by default in HDFS), but you can also ask for a higher number of partitions by
passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark’s Python API also supports several other data formats:
 SparkContext.wholeTextFiles lets you read a directory containing multiple small text files,
and returns each of them as (filename, content) pairs. This is in contrast with textFile, which
would return one record per line in each file.
 RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a simple format
consisting of pickled Python objects. Batching is used on pickle serialization, with default
batch size 10.
 SequenceFile and Hadoop Input/Output Formats
Note this feature is currently marked Experimental and is intended for advanced users. It may be
replaced in future with read/write support based on Spark SQL, in which case Spark SQL is the
preferred approach.
Writable Support
PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables
to base Java types, and pickles the resulting Java objects using pickle. When saving an RDD of
key-value pairs to SequenceFile, PySpark does the reverse. It unpickles Python objects into Java
objects and then converts them to Writables. The following Writables are automatically
converted:
Writable Type Python Type
Text str
IntWritable int
FloatWritable float
DoubleWritable float
BooleanWritable bool
BytesWritable bytearray
NullWritable None
MapWritable dict
Arrays are not handled out-of-the-box. Users need to specify custom ArrayWritable subtypes
when reading or writing. When writing, users also need to specify custom converters that
convert arrays to custom ArrayWritable subtypes. When reading, the default converter will
convert custom ArrayWritable subtypes to Java Object[], which then get pickled to Python
tuples. To get Python array.array for arrays of primitive types, users need to specify custom
converters.
Saving and Loading SequenceFiles
Similarly to text files, SequenceFiles can be saved and loaded by specifying the path. The key and
value classes can be specified, but for standard Writables this is not required.
>>> rdd = sc.parallelize(range(1, 4)).map(lambda x: (x, "a" * x))
>>> rdd.saveAsSequenceFile("path/to/file")
>>> sorted(sc.sequenceFile("path/to/file").collect())
[(1, u'a'), (2, u'aa'), (3, u'aaa')]
Saving and Loading Other Hadoop Input/Output Formats
PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, for both
‘new’ and ‘old’ Hadoop MapReduce APIs. If required, a Hadoop configuration can be passed in
as a Python dict. Here is an example using the Elasticsearch ESInputFormat:
$ ./bin/pyspark --jars /path/to/elasticsearch-hadoop.jar
>>> conf = {"es.resource" : "index/type"} # assume Elasticsearch is running on localhost defaults
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat",
                             "org.apache.hadoop.io.NullWritable",
                             "org.elasticsearch.hadoop.mr.LinkedMapWritable",
                             conf=conf)
>>> rdd.first() # the result is a MapWritable that is converted to a Python dict
(u'Elasticsearch ID',
 {u'field1': True,
  u'field2': u'Some Text',
  u'field3': 12345})
Note that, if the InputFormat simply depends on a Hadoop configuration and/or input path, and
the key and value classes can easily be converted according to the above table, then this
approach should work well for such cases.
If you have custom serialized binary data (such as loading data from Cassandra / HBase), then
you will first need to transform that data on the Scala/Java side to something which can be
handled by pickle’s pickler. A Converter trait is provided for this. Simply extend this trait and
implement your transformation code in the convert method. Remember to ensure that this
class, along with any dependencies required to access your InputFormat, is packaged into your
Spark job jar and included on the PySpark classpath.
See the Python examples and the Converter examples for examples of using Cassandra /
HBase InputFormat and OutputFormat with custom converters.
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an
existing one, and actions, which return a value to the driver program after running a
computation on the dataset. For example, map is a transformation that passes each dataset
element through a function and returns a new RDD representing the results. On the other
hand, reduce is an action that aggregates all the elements of the RDD using some function and
returns the final result to the driver program (although there is also a parallel reduceByKey that
returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The
transformations are only computed when an action requires a result to be returned to the driver
program. This design enables Spark to run more efficiently. For example, we can realize that a
dataset created through map will be used in a reduce and return only the result of the reduce to
the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory using the persist (or cache) method, in which
case Spark will keep the elements around on the cluster for much faster access the next time
you query it. There is also support for persisting RDDs on disk, or replicated across multiple
nodes.
Basics
To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or
otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as
the result of a map transformation. Again, lineLengths is not immediately computed, due to
laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation
into tasks to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the first time it is
computed.
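The laziness and persistence described above can be modelled in a few lines of plain Python. The sketch below is not Spark code: LazyDataset is a hypothetical toy class that only records a map transformation and runs it when the reduce action is called, and persist() simply keeps the first computed result. It is meant to show the order in which things happen, under those simplifying assumptions.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for an RDD: records a computation and runs it only on an action."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing a list
        self._persist = False
        self._cached = None

    def map(self, func):
        # Transformation: nothing is computed here, we only build a new recipe.
        return LazyDataset(lambda: [func(x) for x in self._materialize()])

    def persist(self):
        # Ask for the result to be kept after the first time it is computed.
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def reduce(self, func):
        # Action: triggers the whole chain of recorded transformations.
        return reduce(func, self._materialize())


calls = []  # records every time the length function actually runs


def length(s):
    calls.append(s)
    return len(s)


lines = LazyDataset(lambda: ["spark", "is", "lazy"])
line_lengths = lines.map(length).persist()
calls_before_action = len(calls)                          # nothing has run yet
total_length = line_lengths.reduce(lambda a, b: a + b)    # runs the map
total_again = line_lengths.reduce(lambda a, b: a + b)     # served from the kept result
```

In PySpark the same three steps read lines = sc.textFile("data.txt"), lineLengths = lines.map(lambda s: len(s)) and totalLength = lineLengths.reduce(lambda a, b: a + b), with lineLengths.persist() added before the reduce if lineLengths is needed again.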
Passing Functions to Spark
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There
are two recommended ways to do this:
 Anonymous function syntax, which can be used for short pieces of code.
 Static methods in a global singleton object. For example, you can define object
MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance (as opposed
to a singleton object), this requires sending the object that contains that class along with the
method. For example, consider:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass instance and call doStuff on it, the map inside there references
the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster.
It is similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this
issue, the simplest way is to copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
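The same concern applies to Python functions, because a function that refers to self keeps the whole object in its closure. The plain-Python sketch below (no Spark involved) inspects what each lambda captures; the class layout, the big_state attribute and the captured_values helper are hypothetical names chosen for illustration.

```python
class MyClass:
    def __init__(self):
        self.field = "Hello"
        self.big_state = list(range(1000))  # stands in for data we do not want to ship

    def func_capturing_self(self):
        # The lambda refers to self.field, so its closure holds the whole object.
        return lambda x: self.field + x

    def func_capturing_local(self):
        # Copy the field into a local variable first; only the string is captured.
        field = self.field
        return lambda x: field + x


def captured_values(func):
    """Return the objects that a function's closure keeps alive."""
    return [cell.cell_contents for cell in (func.__closure__ or ())]


obj = MyClass()
heavy = obj.func_capturing_self()
light = obj.func_capturing_local()

heavy_captures = captured_values(heavy)   # contains the MyClass instance itself
light_captures = captured_values(light)   # contains only the copied string
```

Both functions return the same results when called; the difference is only in how much state would have to travel with them if they were serialized and sent to another process.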
Understanding closures 
One of the harder things about Spark is understanding the scope and life cycle of variables and
methods when executing code across a cluster. RDD operations that modify variables outside of
their scope can be a frequent source of confusion. In the example below we’ll look at code that
uses foreach() to increment a counter, but similar issues can occur for other operations as well.
Example
Consider the naive RDD element sum below, which may behave differently depending on
whether execution is happening within the same JVM. A common example of this is when
running Spark in local mode (--master local[n]) versus deploying a Spark application to a
cluster (e.g. via spark-submit to YARN):
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)


Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs,
Spark breaks up the processing of RDD operations into tasks, each of which is executed by an
executor. Prior to execution, Spark computes the task’s closure. The closure is those variables
and methods which must be visible for the executor to perform its computations on the RDD (in
this case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when counter is
referenced within the foreach function, it’s no longer the counter on the driver node. There is
still a counter in the memory of the driver node but this is no longer visible to the executors! The
executors only see the copy from the serialized closure. Thus, the final value of counter will still
be zero since all operations on counter were referencing the value within the serialized closure.
In local mode, in some circumstances, the foreach function will actually execute within the same
JVM as the driver and will reference the same original counter, and may actually update it.
To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator.
Accumulators in Spark are used specifically to provide a mechanism for safely updating a
variable when execution is split up across worker nodes in a cluster. The Accumulators section
of this guide discusses these in more detail.
In general, closures (constructs like loops or locally defined methods) should not be used to
mutate some global state. Spark does not define or guarantee the behavior of mutations to
objects referenced from outside of closures. Some code that does this may work in local mode,
but that’s just by accident and such code will not behave as expected in distributed mode. Use
an Accumulator instead if some global aggregation is needed.
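The copy semantics described in this section can be imitated without a cluster. In the plain-Python sketch below, pickle stands in for closure serialization and run_task for an executor; both are simplifying assumptions, not Spark's actual mechanism. Each simulated task mutates its own deserialized copy, so the driver-side counter never changes, while merging the per-task results on the driver (the idea behind an accumulator) gives a well-defined total.

```python
import pickle


def run_task(serialized_closure, partition):
    """Simulate an executor: deserialize its own copy of the state and mutate it."""
    state = pickle.loads(serialized_closure)
    for x in partition:
        state["counter"] += x
    return state["counter"]          # value of the executor-side copy after the task


partitions = [[1, 2], [3, 4, 5]]

# Naive approach: the driver's state is serialized and shipped; updates hit the copies.
driver_state = {"counter": 0}
shipped = pickle.dumps(driver_state)
executor_counters = [run_task(shipped, p) for p in partitions]
naive_result = driver_state["counter"]      # unchanged on the driver

# Accumulator-style approach: each task reports its partial update and the
# driver merges them, so the driver-side total is well defined.
accumulated = sum(executor_counters)
```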
Printing elements of an RDD
Another common idiom is attempting to print out the elements of an RDD
using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the
expected output and print all the RDD’s elements. However, in cluster mode, the executors
write to their own stdout, not the one on the driver, so stdout on the driver won’t show these
elements. To print all elements on the driver, one can use the collect() method to first bring
the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run
out of memory, though, because collect() fetches the entire RDD to a single machine; if you only
need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
Working with Key-Value Pairs
While most Spark operations work on RDDs containing any type of objects, a few special
operations are only available on RDDs of key-value pairs. The most common ones are distributed
“shuffle” operations, such as grouping or aggregating the elements by a key.
In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the
built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations
are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.
For example, the following code uses the reduceByKey operation on key-value pairs to count
how many times each line of text occurs in a file:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and
finally counts.collect() to bring them back to the driver program as an array of objects.
Note: when using custom objects as the key in key-value pair operations, you must be sure that
a custom equals() method is accompanied with a matching hashCode() method. For full details,
see the contract outlined in the Object.hashCode() documentation.
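The note about equals() and hashCode() has a direct Python counterpart: objects used as keys need consistent __eq__ and __hash__ methods, otherwise equal keys will not be grouped together. The plain-Python sketch below models reduceByKey with a dictionary; reduce_by_key and UserKey are hypothetical names, and a frozen dataclass is used here only because it generates both methods automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserKey:
    """Custom key type; frozen=True makes the dataclass generate __eq__ and __hash__."""
    country: str
    user_id: int


def reduce_by_key(pairs, func):
    """Single-machine model of reduceByKey: merge the values of equal keys with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [
    (UserKey("DE", 1), 1),
    (UserKey("US", 2), 1),
    (UserKey("DE", 1), 1),   # equal to the first key, so it must land in the same group
]
counts = reduce_by_key(pairs, lambda a, b: a + b)
```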
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the
RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Transformation    Meaning
map(func)    Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func)    Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func)    Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)    Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)    Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)    Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset)    Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset)    Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions])    Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions])    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions])    When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])    When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions])    When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions])    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions])    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)    When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars])    Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)    Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)    Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)    Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API
doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Action    Meaning
reduce(func)    Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()    Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()    Return the number of elements in the dataset.
first()    Return the first element of the dataset (similar to take(1)).
take(n)    Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])    Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])    Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)    Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala)    Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala)    Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey()    Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)    Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
The Spark RDD API also exposes asynchronous versions of some actions,
like foreachAsync for foreach, which immediately return a FutureAction to the caller instead of
blocking on completion of the action. This can be used to manage or wait for the asynchronous
execution of the action.
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s
mechanism for re-distributing data so that it’s grouped differently across partitions. This
typically involves copying data across executors and machines, making the shuffle a complex
and costly operation.
Background
To understand what happens during the shuffle, we can consider the example of
the reduceByKey operation. The reduceByKey operation generates a new RDD where all values
for a single key are combined into a tuple - the key and the result of executing a reduce function
against all values associated with that key. The challenge is that not all values for a single key
necessarily reside on the same partition, or even the same machine, but they must be co-located
to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a
specific operation. During computations, a single task will operate on a single partition - thus, to
organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform
an all-to-all operation. It must read from all partitions to find all the values for all keys, and then
bring together values across partitions to compute the final result for each key - this is called
the shuffle.
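To make the all-to-all step more concrete, the plain-Python sketch below models a reduceByKey with a shuffle: values are first combined per key inside each input partition, the partial results are routed to an output partition chosen from the key, and each output partition then merges what it received. The function names are hypothetical, and the CRC-based partition_for is only a deterministic stand-in for a hash partitioner, not Spark's own.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (an assumption for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def reduce_by_key_with_shuffle(input_partitions, func, num_partitions):
    # Map side: combine values per key inside each input partition, then
    # bucket the partial results by the output partition the key belongs to.
    buckets = defaultdict(list)  # target partition -> list of (key, partial)
    for part in input_partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        for key, partial in local.items():
            buckets[partition_for(key, num_partitions)].append((key, partial))

    # Reduce side: each output partition merges the partials it received.
    output = []
    for p in range(num_partitions):
        merged = {}
        for key, partial in buckets[p]:
            merged[key] = func(merged[key], partial) if key in merged else partial
        output.append(merged)
    return output


input_partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1), ("a", 1)],
]
output_partitions = reduce_by_key_with_shuffle(input_partitions, lambda x, y: x + y, 2)
totals = {k: v for part in output_partitions for k, v in part.items()}
```

After the shuffle every key lives in exactly one output partition, which is what allows the final per-key result to be computed by a single task.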
Although the set of elements in each partition of newly shuffled data will be deterministic, and
so is the ordering of partitions themselves, the ordering of these elements is not. If one desires
predictably ordered data following shuffle then it’s possible to use:
 mapPartitions to sort each partition using, for example, .sorted
 repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously
repartitioning
 sortBy to make a globally ordered RDD
Operations which can cause a shuffle include repartition operations
like repartition and coalesce, ‘ByKey operations (except for counting)
like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The shuffle is an expensive operation since it involves disk I/O, data serialization, and network
I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the
data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and
does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these
are sorted based on the target partition and written to a single file. On the reduce side, tasks
read the relevant sorted blocks.
Certain shuffle operations can consume significant amounts of heap memory since they employ
in-memory data structures to organize records before or after transferring them.
Specifically, reduceByKey and aggregateByKey create these structures on the map side,
and 'ByKey operations generate these on the reduce side. When data does not fit in memory
Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased
garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
are preserved until the corresponding RDDs are no longer used and are garbage collected. This
is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage
collection may happen only after a long period of time, if the application retains references to
these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
consume a large amount of disk space. The temporary storage directory is specified by
the spark.local.dir configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
‘Shuffle Behavior’ section within the Spark Configuration Guide.
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory
across operations. When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that dataset (or datasets derived from
it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool
for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time
it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant
– if any partition of an RDD is lost, it will automatically be recomputed using the transformations
that originally created it.
In addition, each persisted RDD can be stored using a different storage level, allowing you, for
example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to
save space), or replicate it across nodes. These levels are set by passing a StorageLevel object
(Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default
storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
The full set of storage levels is:
Storage Level    Meaning
MEMORY_ONLY    Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK    Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER (Java and Scala)    Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala)    Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY    Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.    Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)    Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
Note: In Python, stored objects will always be serialized with the Pickle library, so it does not
matter whether you choose a serialized level. The available storage levels in Python
include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY,
DISK_ONLY_2, and DISK_ONLY_3.
Spark also automatically persists some intermediate data in shuffle operations
(e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the
entire input if a node fails during the shuffle. We still recommend users call persist on the
resulting RDD if they plan to reuse it.
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and
CPU efficiency. We recommend going through the following process to select one:
 If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them
that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as
fast as possible.
 If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the
objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
 Don’t spill to disk unless the functions that computed your datasets are expensive, or they
filter a large amount of the data. Otherwise, recomputing a partition may be as fast as
reading it from disk.
 Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve
requests from a web application). All the storage levels provide full fault tolerance by
recomputing lost data, but the replicated ones let you continue running tasks on the RDD
without waiting to recompute a lost partition.
Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a
least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of
waiting for it to fall out of the cache, use the RDD.unpersist() method. Note that this method
does not block by default. To block until resources are freed, specify blocking=true when calling
this method.
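The least-recently-used policy and manual removal described above can be sketched with an ordered dictionary. This is a simplified plain-Python model, not Spark's memory manager: LruPartitionCache is a hypothetical class, capacity is counted in partitions rather than bytes, and cache keys are assumed to be (rdd_id, partition_index) tuples.

```python
from collections import OrderedDict


class LruPartitionCache:
    """Toy cache that evicts the least recently used partition when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()   # oldest entry first, newest last

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, partition):
        self._store[key] = partition
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry

    def unpersist(self, rdd_id):
        # Manual removal: drop every cached partition that belongs to rdd_id.
        for key in [k for k in self._store if k[0] == rdd_id]:
            del self._store[key]

    def keys(self):
        return list(self._store)


cache = LruPartitionCache(capacity=2)
cache.put(("rdd1", 0), [1, 2])
cache.put(("rdd2", 0), [3, 4])
cache.get(("rdd1", 0))            # touch rdd1 so rdd2 becomes the oldest entry
cache.put(("rdd3", 0), [5, 6])    # cache is full, so the oldest entry is evicted
after_eviction = cache.keys()
cache.unpersist("rdd1")           # explicit removal instead of waiting for eviction
after_unpersist = cache.keys()
```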
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a
remote cluster node, it works on separate copies of all the variables used in the function. These
variables are copied to each machine, and no updates to the variables on the remote machine
are propagated back to the driver program. Supporting general, read-write shared variables
across tasks would be inefficient. However, Spark does provide two limited types of shared
variables for two common usage patterns: broadcast variables and accumulators.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine
rather than shipping a copy of it with tasks. They can be used, for example, to give every node a
copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within each
stage. The data broadcasted this way is cached in serialized form and deserialized before
running each task. This means that explicitly creating broadcast variables is only useful when
tasks across multiple stages need the same data or when caching the data in deserialized form is
important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The
broadcast variable is a wrapper around v, and its value can be accessed by calling
the value method. The code below shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions
run on the cluster so that v is not shipped to the nodes more than once. In addition, the
object v should not be modified after it is broadcast in order to ensure that all nodes get the
same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
To release the resources that the broadcast variable copied onto executors, call .unpersist(). If
the broadcast is used again afterwards, it will be re-broadcast. To permanently release all
resources used by the broadcast variable, call .destroy(). The broadcast variable can’t be used
after that. Note that these methods do not block by default. To block until resources are freed,
specify blocking=true when calling them.
Accumulators
Accumulators are variables that are only “added” to through an associative and commutative
operation and can therefore be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types,
and programmers can add support for new types.
As a user, you can create named or unnamed accumulators. As seen in the image below, a
named accumulator (in this instance counter) will display in the web UI for the stage that
modifies that accumulator. Spark displays the value for each accumulator modified by a task in
the “Tasks” table.
Tracking accumulators in the UI can be useful for understanding the progress of running stages
(NOTE: this is not yet supported in Python).
A numeric accumulator can be created by
calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate
values of type Long or Double, respectively. Tasks running on a cluster can then add to it using
the add method. However, they cannot read its value. Only the driver program can read the
accumulator’s value, using its value method.
The code below shows an accumulator being used to add up the elements of an array:
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))


...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
While this code used the built-in support for accumulators of type Long, programmers can also
create their own types by subclassing AccumulatorV2. The AccumulatorV2 abstract class has
several methods which one has to override: reset for resetting the accumulator to zero, add for
adding another value into the accumulator, merge for merging another same-type accumulator
into this one. Other methods that must be overridden are contained in the API documentation.
For example, supposing we had a MyVector class representing mathematical vectors, we could
write:
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {

  private val myVector: MyVector = MyVector.createZeroVector

  def reset(): Unit = {
    myVector.reset()
  }

  def add(v: MyVector): Unit = {
    myVector.add(v)
  }
  ...
}

// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into spark context:
sc.register(myVectorAcc, "MyVectorAcc1")
Note that, when programmers define their own type of AccumulatorV2, the resulting type can
be different than that of the elements added.
For accumulator updates performed inside actions only, Spark guarantees that each task’s
update to the accumulator will only be applied once, i.e. restarted tasks will not update the
value. In transformations, users should be aware that each task’s update may be applied
more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated
within an operation on an RDD, their value is only updated once that RDD is computed as part of
an action. Consequently, accumulator updates are not guaranteed to be executed when made
within a lazy transformation like map(). The below code fragment demonstrates this property:

accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return f(x)
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
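The same effect can be reproduced in plain Python, because the built-in map is also lazy. This is only an analogy, not Spark code: the seen dictionary stands in for an accumulator, and list() plays the role of an action.

```python
# Plain-Python stand-in for an accumulator (an assumption for illustration).
seen = {"total": 0}

def g(x):
    seen["total"] += x   # side effect, analogous to accum.add(x)
    return x * 2

data = [1, 2, 3, 4]

mapped = map(g, data)          # lazy: g has not been called yet
total_before = seen["total"]   # still 0 at this point

results = list(mapped)         # forcing evaluation, like running an action
total_after = seen["total"]    # now reflects every element

print(total_before, total_after, results)
```

The side effect appears only once something consumes the mapped values, which mirrors why the accumulator above stays at 0 until an action runs.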
Deploying to a Cluster
The application submission guide describes how to submit applications to a cluster. In short,
once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for
Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Launching Spark jobs from Java / Scala
The org.apache.spark.launcher package provides classes for launching Spark jobs as child
processes using a simple Java API.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework. Simply create
a SparkContext in your test with the master URL set to local, run your operations, and then
call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or
the test framework’s tearDown method, as Spark does not support two contexts running
concurrently in the same program.
Where to Go from Here
You can see some example Spark programs on the Spark website. In addition, Spark includes
several samples in the examples directory (Scala, Java, Python, R). You can run Java and Scala
examples by passing the class name to Spark’s bin/run-example script; for instance:
./bin/run-example SparkPi
For Python examples, use spark-submit instead:
./bin/spark-submit examples/src/main/python/pi.py
For R examples, use spark-submit as well:
./bin/spark-submit examples/src/main/r/dataframe.R
For help on optimizing your programs, the configuration and tuning guides provide information
on best practices. They are especially important for making sure that your data is stored in
memory in an efficient format. For help on deploying, the cluster mode overview describes the
components involved in distributed operation and supported cluster managers.
Finally, full API documentation is available in Scala, Java, Python and R.

Apache Spark™ examples
These examples give a quick overview of the Spark API. Spark is built on the concept
of distributed datasets, which contain arbitrary Java or Python objects. You create a dataset
from external data, then apply parallel operations to it. The building block of the Spark API is
its RDD API. In the RDD API, there are two types of operations: transformations, which define a
new dataset based on previous ones, and actions, which kick off a job to execute on a cluster.
On top of Spark’s RDD API, high level APIs are provided, e.g. DataFrame API and Machine
Learning API. These high level APIs provide a concise way to conduct certain data operations. In
this page, we will show examples using RDD API as well as examples using high level APIs.
RDD API examples

Word count
In this example, we use a few transformations to build a dataset of (String, Int) pairs
called counts and then save it to a file.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
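To see what each step of the pipeline above produces, here is a single-machine equivalent in plain Python. It is a sketch for intuition only: the two input lines are made up, and ordinary lists and a dictionary stand in for the distributed dataset.

```python
# Made-up input standing in for the lines of the text file.
lines = ["to be or", "not to be"]

# flatMap: split every line and flatten the results into one list of words.
words = [word for line in lines for word in line.split(" ")]

# map: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with a + b.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)
```

In Spark the same three steps run per partition, and reduceByKey additionally shuffles pairs so that equal keys meet on the same node.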

Pi estimation
Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts"
at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the
unit circle. The fraction should be π / 4, so we use this to get our estimate.

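import random

# NUM_SAMPLES is assumed to be defined beforehand, e.g. NUM_SAMPLES = 100000

def inside(p):
    # p is the sample index and is ignored; each call draws a fresh random point
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)) \
    .filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))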
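The estimate itself does not depend on Spark, so the arithmetic can be checked locally. The sketch below is a single-process version; the estimate_pi helper, the seed, and the sample count are illustrative choices, not part of the original example.

```python
import math
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi from points drawn in the unit square."""
    rng = random.Random(seed)      # seeded so repeated runs give the same value
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:      # the point falls inside the quarter circle
            inside += 1
    # The fraction of points inside approximates pi / 4.
    return 4.0 * inside / num_samples

estimate = estimate_pi(100_000)
error = abs(estimate - math.pi)
print("Pi is roughly %f (error %f)" % (estimate, error))
```

Spark's version distributes the loop across partitions, with filter(...).count() doing the work of the inside counter.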

DataFrame API examples


In Spark, a DataFrame is a distributed collection of data organized into named columns. Users
can use DataFrame API to perform various relational operations on both external data sources
and Spark’s built-in distributed collections without providing specific procedures for processing
data. Also, programs based on DataFrame API will be automatically optimized by Spark’s built-in
optimizer, Catalyst.

Text search
In this example, we search through the error messages in a log file.

from pyspark.sql import Row
from pyspark.sql.functions import col

textFile = sc.textFile("hdfs://...")

# Creates a DataFrame having a single column named "line"
df = textFile.map(lambda r: Row(r)).toDF(["line"])
errors = df.filter(col("line").like("%ERROR%"))
# Counts all the errors
errors.count()
# Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
# Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Simple data operations


In this example, we read a table stored in a database and calculate the number of people for
every age. Finally, we save the calculated result to S3 in the format of JSON. A simple MySQL
table "people" is used in the example and this table has two columns, "name" and "age".

# Creates a DataFrame based on a table named "people"
# stored in a MySQL database.
url = \
    "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword"
df = sqlContext \
    .read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "people") \
    .load()

# Prints the schema of this DataFrame.
df.printSchema()

# Counts people by age
countsByAge = df.groupBy("age").count()
countsByAge.show()

# Saves countsByAge to S3 in the JSON format.
countsByAge.write.format("json").save("s3a://...")

Machine learning example


MLlib, Spark’s Machine Learning (ML) library, provides many distributed ML algorithms. These
algorithms cover tasks such as feature extraction, classification, regression, clustering,
recommendation, and more. MLlib also provides tools such as ML Pipelines for building
workflows, CrossValidator for tuning parameters, and model persistence for saving and loading
models.

Prediction with logistic regression


In this example, we take a dataset of labels and feature vectors. We learn to predict the labels
from feature vectors using the Logistic Regression algorithm.

from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])

# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)

# Fit the model to the data.
model = lr.fit(df)

# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Additional examples
PySpark SelectExpr()
3 weeks ago
by Gottumukkala Sravan Kumar
Using the selectExpr() function in PySpark, we can directly evaluate an expression without
creating any TABLE or VIEW. This method is available on the pyspark.sql.DataFrame class
and is similar to the select() method. With selectExpr(), we can display columns, apply
functions to columns, evaluate expressions, perform aggregate operations, etc.
It is also possible to evaluate/specify multiple columns at a time.
pyspark.sql.DataFrame.selectExpr()
The selectExpr() function takes columns or a set of expressions and returns a DataFrame
based on the specified expressions/columns. Multiple expressions can be specified in this
function, separated by commas. To display the DataFrame, we can use the
show()/collect() functions.
Syntax:
pyspark_DataFrame_object.selectExpr("Columns"/"Expressions")
Here, the pyspark_DataFrame_object is the input PySpark DataFrame.
Scenario 1: Select the Columns
In this scenario, we will see how to select the particular columns from the PySpark DataFrame
using the selectExpr() function.
The expression that is used is “existing_column as new_name”. Here, the existing_column is the
column name that is present in the DataFrame and it is displayed as new_name (Aliasing).
Example:
Create a PySpark DataFrame named “agri_df” with 5 rows and 5 columns. Get the “Soil_status”
and “Soil_Type” columns as “STATUS” and “TYPE”.
import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':None,'Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

# Get the Soil_status and Soil_Type as "STATUS" and "TYPE".

agri_df.selectExpr("Soil_status as STATUS","Soil_Type as TYPE").show()


Output:

Scenario 2: Specifying the Conditional Expressions


In this scenario, we will see how to evaluate the conditions within the selectExpr() function.
The expression that is used is “existing_column operator value”. Here, the existing_column is
the column name that is present in the DataFrame and we compare each value in this column
with the string/value.
Example 1:
Check whether the country is “USA” or not. The equal-to (=) operator is used here.
import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':None,'Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

# Check whether the country is 'USA' or not.

agri_df.selectExpr("Country = 'USA'").show()
Output:

Example 2:
Check whether the Soil_Type is NULL or not. The IS NULL predicate checks whether the value is
NULL. If it is null, true is returned. Otherwise, false is returned. The final expression is
“Soil_Type IS NULL”.
import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':None,'Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

# Check whether the Soil_Type is NULL or not.

agri_df.selectExpr("Soil_Type IS NULL").show()
Output:

Scenario 3: Evaluating the Expressions


In this scenario, we will see how to specify the mathematical expressions. The expression that is
used is “existing_column mathematical_expression”.
Example:
1. Display the actual “Acres” column.
2. Add 100 to the “Acres” column.
3. Subtract 100 from the “Acres” column.
4. Multiply the “Acres” column by 100.
5. Divide the “Acres” column by 100.
import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':None,'Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)

# Write 4 Expressions to subtract, add, divide and multiply Acres column.

agri_df.selectExpr("Acres","Acres - 100","Acres * 100","Acres + 100","Acres / 100").show()


Output:
Scenario 4: Applying the Aggregate Functions
SUM(column_name) – It evaluates the total value in the specified column.
MEAN(column_name) – It evaluates the average value in the specified column.
MIN(column_name) – It returns the minimum element among all elements in the specified
column.
MAX(column_name) – It returns the maximum element among all elements in the specified
column.
Example:
1. Find the total, average, count, minimum, and maximum elements of “Acres”.
2. Find the minimum and maximum elements in the “Soil_status” column.
import pyspark

from pyspark.sql import SparkSession

linuxhint_spark_app = SparkSession.builder.appName('Linux Hint').getOrCreate()

# farming data with 5 rows and 5 columns

agri =[{'Soil_Type':'Black','Irrigation_availability':'No','Acres':2500,'Soil_status':'Dry',
'Country':'USA'},

{'Soil_Type':'Black','Irrigation_availability':'Yes','Acres':3500,'Soil_status':'Wet',
'Country':'India'},

{'Soil_Type':None,'Irrigation_availability':'Yes','Acres':210,'Soil_status':'Dry',
'Country':'UK'},

{'Soil_Type':'Other','Irrigation_availability':'No','Acres':1000,'Soil_status':'Wet',
'Country':'USA'},

{'Soil_Type':'Sand','Irrigation_availability':'No','Acres':500,'Soil_status':'Dry',
'Country':'India'}]

# create the dataframe from the above data

agri_df = linuxhint_spark_app.createDataFrame(agri)
# Aggregate operations

agri_df.selectExpr("SUM(Acres)","MEAN(Acres)","COUNT(Acres)", "AVG(Acres)","MIN(Acres)",
                   "MAX(Acres)").show()

agri_df.selectExpr("MIN(Soil_status)", "MAX(Soil_status)").show()
Output:

Conclusion
We discussed the selectExpr() function, which takes columns or sets of expressions and
returns a DataFrame based on the specified expressions/columns. As part of this, we learned
the four major scenarios in which selectExpr() is applicable. Multiple expressions can be
specified in this function, separated by commas. There is no need to create a
TEMPORARY VIEW to use the selectExpr() function.
Overview
At a high level, every Spark application consists of a driver program that runs the
user’s main function and executes various parallel operations on a cluster. The main abstraction
Spark provides is a resilient distributed dataset (RDD), which is a collection of elements
partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created
by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or
an existing Scala collection in the driver program, and transforming it. Users may also ask Spark
to persist an RDD in memory, allowing it to be reused efficiently across parallel operations.
Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By
default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy
of each variable used in the function to each task. Sometimes, a variable needs to be shared
across tasks, or between tasks and the driver program. Spark supports two types of shared
variables: broadcast variables, which can be used to cache a value in memory on all nodes,
and accumulators, which are variables that are only “added” to, such as counters and sums.
This guide shows each of these features in each of Spark’s supported languages. It is easiest to
follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell
or bin/pyspark for the Python one.
Linking with Spark

Spark 3.4.1 is built and distributed to work with Scala 2.12 by default. (Spark can be built to work
with other versions of Scala, too.) To write applications in Scala, you will need to use a
compatible Scala version (e.g. 2.12.X).
To write a Spark application, you need to add a Maven dependency on Spark. Spark is available
through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.12
version = 3.4.1
In addition, if you wish to access an HDFS cluster, you need to add a dependency on
hadoop-client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
(Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable
essential implicit conversions.)
Initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark
how to access a cluster. To create a SparkContext you first need to build a SparkConf object that
contains information about your application.
Only one SparkContext should be active per JVM. You must stop() the active SparkContext
before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the cluster UI. master is
a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice,
when running on a cluster, you will not want to hardcode master in the program, but
rather launch the application with spark-submit and receive it there. However, for local testing
and unit tests, you can pass “local” to run Spark in-process.
Using the Shell

In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the
variable called sc. Making your own SparkContext will not work. You can set which master the
context connects to using the --master argument, and you can add JARs to the classpath by
passing a comma-separated list to the --jars argument. You can also add dependencies (e.g.
Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates
to the --packages argument. Any additional repositories where dependencies might exist (e.g.
Sonatype) can be passed to the --repositories argument. For example, to run bin/spark-shell on
exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the
more general spark-submit script.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a
fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create
RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an
external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat.
Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing
collection in your driver program (a Scala Seq). The elements of the collection are copied to
form a distributed dataset that can be operated on in parallel. For example, here is how to
create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we
might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe
operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut the dataset
into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for
each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically
based on your cluster. However, you can also set it manually by passing it as a second parameter
to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a
synonym for partitions) to maintain backward compatibility.
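As a rough picture of what cutting a collection into partitions means, the sketch below splits a list into contiguous, near-equal slices in plain Python. The slice_collection helper and the 4-core figure are hypothetical, and the boundaries shown are one plausible even split, not necessarily the exact ones Spark chooses.

```python
def slice_collection(data, num_slices):
    """Split data into num_slices contiguous, near-equal chunks."""
    n = len(data)
    return [
        data[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

data = [1, 2, 3, 4, 5]
slices = slice_collection(data, 2)
print(slices)

# The 2-4 partitions per CPU guideline, applied to a hypothetical 4-core cluster.
cores = 4
suggested_range = (2 * cores, 4 * cores)
print(suggested_range)
```

Each slice would be processed by one task, so the number of slices bounds how much of the work can run at the same time.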
External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop, including
your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text
files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for
the file (either a local path on the machine, or an hdfs://, s3a://, etc. URI) and reads it as a collection
of lines. Here is an example invocation:
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at
<console>:26
Once created, distFile can be acted on by dataset operations. For example, we can add up the
sizes of all the lines using the map and reduce operations as follows: distFile.map(s =>
s.length).reduce((a, b) => a + b).
Some notes on reading files with Spark:
• If using a path on the local filesystem, the file must also be accessible at the same path on
worker nodes. Either copy the file to all workers or use a network-mounted shared file
system.
• All of Spark’s file-based input methods, including textFile, support running on directories,
compressed files, and wildcards as well. For example, you can
use textFile("/my/directory"), textFile("/my/directory/*.txt"),
and textFile("/my/directory/*.gz"). When multiple files are read, the order of the partitions
depends on the order the files are returned from the filesystem. It may or may not, for
example, follow the lexicographic ordering of the files by path. Within a partition, elements
are ordered according to their order in the underlying file.
• The textFile method also takes an optional second argument for controlling the number of
partitions of the file. By default, Spark creates one partition for each block of the file (blocks
being 128MB by default in HDFS), but you can also ask for a higher number of partitions by
passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark’s Scala API also supports several other data formats:
• SparkContext.wholeTextFiles lets you read a directory containing multiple small text files,
and returns each of them as (filename, content) pairs. This is in contrast with textFile, which
would return one record per line in each file. Partitioning is determined by data locality
which, in some cases, may result in too few partitions. For those
cases, wholeTextFiles provides an optional second argument for controlling the minimal
number of partitions.
• For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the
types of key and values in the file. These should be subclasses of
Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to
specify native types for a few common Writables; for example, sequenceFile[Int, String] will
automatically read IntWritables and Texts.
• For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which
takes an arbitrary JobConf and input format class, key class and value class. Set these the
same way you would for a Hadoop job with your input source. You can also
use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce
API (org.apache.hadoop.mapreduce).
• RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple
format consisting of serialized Java objects. While this is not as efficient as specialized
formats like Avro, it offers an easy way to save any RDD.
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an
existing one, and actions, which return a value to the driver program after running a
computation on the dataset. For example, map is a transformation that passes each dataset
element through a function and returns a new RDD representing the results. On the other
hand, reduce is an action that aggregates all the elements of the RDD using some function and
returns the final result to the driver program (although there is also a parallel reduceByKey that
returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The
transformations are only computed when an action requires a result to be returned to the driver
program. This design enables Spark to run more efficiently. For example, we can realize that a
dataset created through map will be used in a reduce and return only the result of the reduce to
the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory using the persist (or cache) method, in which
case Spark will keep the elements around on the cluster for much faster access the next time
you query it. There is also support for persisting RDDs on disk, or replicated across multiple
nodes.
Basics

To illustrate RDD basics, consider the simple program below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or
otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as
the result of a map transformation. Again, lineLengths is not immediately computed, due to
laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation
into tasks to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the first time it is
computed.
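The cost that persist() avoids can be modelled in plain Python by counting how often the mapping function runs. This is an analogy rather than Spark code: the calls counter and the three sample strings are made up, and a materialised list stands in for the persisted RDD.

```python
# Counts how many times the "transformation" function is invoked.
calls = {"n": 0}

def line_length(s):
    calls["n"] += 1
    return len(s)

lines = ["spark", "rdd", "persist"]

def line_lengths():
    # A fresh lazy pipeline each time, like an unpersisted RDD.
    return map(line_length, lines)

# Without persistence: every "action" recomputes all elements.
total = sum(line_lengths())
longest = max(line_lengths())
calls_without_persist = calls["n"]

# With persistence: compute once, then reuse the stored results.
calls["n"] = 0
persisted = list(line_lengths())
total_cached = sum(persisted)
longest_cached = max(persisted)
calls_with_persist = calls["n"]

print(calls_without_persist, calls_with_persist)
```

Two actions cost two full passes over the data without persistence and one pass with it, which is the trade of memory for time that persist() offers.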
Passing Functions to Spark

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There
are three recommended ways to do this:

• Lambda expressions, for simple functions that can be written as an expression. (Lambdas do
not support multi-statement functions or statements that do not return a value.)
• Local defs inside the function calling into Spark, for longer code.
• Top-level functions in a module.
For example, to pass a longer function than can be supported using a lambda, consider the code
below:
"""MyScript.py"""
if __name__ == "__main__":
    def myFunc(s):
        words = s.split(" ")
        return len(words)

    sc = SparkContext(...)
    sc.textFile("file.txt").map(myFunc)
Note that while it is also possible to pass a reference to a method in a class instance (as opposed
to a singleton object), this requires sending the object that contains that class along with the
method. For example, consider:
class MyClass(object):
    def func(self, s):
        return s
    def doStuff(self, rdd):
        return rdd.map(self.func)
Here, if we create a new MyClass and call doStuff on it, the map inside there references
the func method of that MyClass instance, so the whole object needs to be sent to the cluster.
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass(object):
    def __init__(self):
        self.field = "Hello"
    def doStuff(self, rdd):
        return rdd.map(lambda s: self.field + s)
To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it
externally:
def doStuff(self, rdd):
    field = self.field
    return rdd.map(lambda s: field + s)
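What a function drags along with it can be inspected in plain Python, without a cluster. The sketch below is illustrative only: big_payload, make_heavy, make_light, and captured_values are hypothetical names, and inspecting __closure__ and __self__ only shows what the function references, not how Spark serializes it.

```python
class MyClass:
    def __init__(self):
        self.field = "Hello"
        # Stand-in for heavy state that should not be shipped to workers.
        self.big_payload = list(range(100_000))

    def func(self, s):
        return s

    def make_heavy(self):
        # The lambda refers to self, so it keeps the whole instance alive.
        return lambda s: self.field + s

    def make_light(self):
        field = self.field          # copy the field into a local variable
        return lambda s: field + s  # the lambda now captures only the string

def captured_values(fn):
    """Return the objects a function has captured from enclosing scopes."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]

obj = MyClass()
heavy = obj.make_heavy()
light = obj.make_light()

heavy_captures = captured_values(heavy)   # contains the MyClass instance
light_captures = captured_values(light)   # contains just "Hello"
bound_owner = obj.func.__self__           # a bound method points at its instance

print(type(heavy_captures[0]).__name__, light_captures)
```

Both lambdas compute the same result, but only the second can be sent elsewhere without taking the entire object with it.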
Understanding closures 
One of the harder things about Spark is understanding the scope and life cycle of variables and
methods when executing code across a cluster. RDD operations that modify variables outside of
their scope can be a frequent source of confusion. In the example below we’ll look at code that
uses foreach() to increment a counter, but similar issues can occur for other operations as well.
Example
Consider the naive RDD element sum below, which may behave differently depending on
whether execution is happening within the same JVM. A common example of this is when
running Spark in local mode (--master local[n]) versus deploying a Spark application to a
cluster (e.g. via spark-submit to YARN):

counter = 0
rdd = sc.parallelize(data)

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x
rdd.foreach(increment_counter)

print("Counter value: ", counter)


Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs,
Spark breaks up the processing of RDD operations into tasks, each of which is executed by an
executor. Prior to execution, Spark computes the task’s closure. The closure is those variables
and methods which must be visible for the executor to perform its computations on the RDD (in
this case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when counter is
referenced within the foreach function, it’s no longer the counter on the driver node. There is
still a counter in the memory of the driver node but this is no longer visible to the executors! The
executors only see the copy from the serialized closure. Thus, the final value of counter will still
be zero since all operations on counter were referencing the value within the serialized closure.
In local mode, in some circumstances, the foreach function will actually execute within the same
JVM as the driver and will reference the same original counter, and may actually update it.
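The copy semantics described above can be modelled without Spark. The sketch below is a simplified stand-in, not Spark's actual task machinery: it uses pickle to play the role of closure serialization, and run_task, driver_state and the two-way data split are hypothetical names chosen for illustration.

```python
import pickle


def run_task(serialized_closure, partition):
    """Model one executor task (hypothetical helper)."""
    # Deserializing gives the task its own private copy of the state.
    state = pickle.loads(serialized_closure)
    for x in partition:
        state["counter"] += x
    return state["counter"]


driver_state = {"counter": 0}
payload = pickle.dumps(driver_state)   # what the driver ships out
partitions = [[1, 2], [3, 4]]          # hypothetical data split

task_counters = [run_task(payload, p) for p in partitions]

print(task_counters)            # [3, 7]
print(driver_state["counter"])  # 0
```

Each task increments its own deserialized copy, so the driver's counter never changes.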
To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator.
Accumulators in Spark are used specifically to provide a mechanism for safely updating a
variable when execution is split up across worker nodes in a cluster. The Accumulators section
of this guide discusses these in more detail.
In general, closures (constructs like loops or locally defined methods) should not be used to
mutate some global state. Spark does not define or guarantee the behavior of mutations to
objects referenced from outside of closures. Some code that does this may work in local mode,
but that’s just by accident and such code will not behave as expected in distributed mode. Use
an Accumulator instead if some global aggregation is needed.
Printing elements of an RDD
Another common idiom is attempting to print out the elements of an RDD
using rdd.foreach(print) or rdd.map(print) (rdd.foreach(println) or rdd.map(println) in Scala).
On a single machine, this will generate the expected output and print all the RDD’s elements.
However, in cluster mode, the executors write to their own stdout, not the one on the driver,
so stdout on the driver won’t show these elements! To print all elements on the driver,
one can use the collect() method to first bring the RDD to the driver node
thus: for x in rdd.collect(): print(x). This can cause the driver to run out of memory, though,
because collect() fetches the entire RDD to a single machine; if you only need to print a few
elements of the RDD, a safer approach is to use take(): for x in rdd.take(100): print(x).
Working with Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special
operations are only available on RDDs of key-value pairs. The most common ones are distributed
“shuffle” operations, such as grouping or aggregating the elements by a key.
In Python, these operations work on RDDs containing built-in Python tuples such as (1, 2). Simply
create such tuples and then call your desired operation.
For example, the following code uses the reduceByKey operation on key-value pairs to count
how many times each line of text occurs in a file:
lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and
finally counts.collect() to bring them back to the driver program as a list of objects.
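For readers who want to check the semantics without a cluster, here is a plain-Python sketch of the same pipeline. It is only a model: the in-memory lines list is a hypothetical stand-in for data.txt, and sorted plus groupby stand in for the grouping that reduceByKey and sortByKey perform.

```python
from functools import reduce
from itertools import groupby

lines = ["a", "b", "a", "c", "a", "b"]  # hypothetical file contents

# Equivalent of lines.map(lambda s: (s, 1))
pairs = [(s, 1) for s in lines]

# Bring equal keys together, then fold each group with the same
# (a, b) -> a + b function that reduceByKey receives.
pairs_sorted = sorted(pairs, key=lambda kv: kv[0])
counts = [
    (key, reduce(lambda a, b: a + b, (v for _, v in group)))
    for key, group in groupby(pairs_sorted, key=lambda kv: kv[0])
]

print(counts)  # [('a', 3), ('b', 2), ('c', 1)]
```

The result is a list of (line, count) tuples ordered by key, which is the shape one would expect back from counts.sortByKey().collect().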
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the
RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Transformation | Meaning
map(func) | Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) | Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) | Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions]) | Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
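The difference between map, flatMap and mapPartitions in the table above is easiest to see on a tiny dataset. The sketch below models an RDD as a list of partitions in plain Python; map_, flat_map and map_partitions are hypothetical helpers written for this illustration, not Spark functions.

```python
from itertools import chain

partitions = [["a b", "c"], ["d e f"]]  # hypothetical RDD with two partitions


def map_(func, parts):
    # One output element per input element.
    return [[func(x) for x in part] for part in parts]


def flat_map(func, parts):
    # func returns a sequence; the sequences are flattened into the output.
    return [list(chain.from_iterable(func(x) for x in part)) for part in parts]


def map_partitions(func, parts):
    # func receives an iterator over a whole partition and returns an iterable.
    return [list(func(iter(part))) for part in parts]


mapped = map_(lambda s: s.split(" "), partitions)
flat = flat_map(lambda s: s.split(" "), partitions)
sizes = map_partitions(lambda it: [sum(1 for _ in it)], partitions)

print(mapped)  # [[['a', 'b'], ['c']], [['d', 'e', 'f']]]
print(flat)    # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(sizes)   # [[2], [1]]
```

Note how map keeps one (possibly nested) result per input, flatMap flattens the results, and mapPartitions is called once per partition rather than once per element.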
Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API
doc (Scala, Java, Python, R)
and pair RDD functions doc (Scala, Java) for details.
Action | Meaning
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
first() | Return the first element of the dataset (similar to take(1)).
take(n) | Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
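The requirement that the function passed to reduce be commutative and associative can be checked with a small plain-Python model. In this sketch, distributed_reduce is a hypothetical helper that reduces each partition first and then combines the partial results, which is one plausible evaluation order and not a description of Spark's scheduler.

```python
from functools import reduce


def distributed_reduce(func, partitions):
    """Reduce within each partition, then across partitions (hypothetical helper)."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


# The same four numbers, split in two different ways.
split_a = [[1, 2], [3, 4]]
split_b = [[1], [2, 3, 4]]


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


add_results = (distributed_reduce(add, split_a), distributed_reduce(add, split_b))
sub_results = (distributed_reduce(sub, split_a), distributed_reduce(sub, split_b))

print(add_results)  # (10, 10)
print(sub_results)  # (0, 6)
```

Addition gives the same answer however the data is partitioned, while subtraction, which is neither commutative nor associative, depends on the split.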
The Spark RDD API also exposes asynchronous versions of some actions,
like foreachAsync for foreach, which immediately return a FutureAction to the caller instead of
blocking on completion of the action. This can be used to manage or wait for the asynchronous
execution of the action.
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s
mechanism for re-distributing data so that it’s grouped differently across partitions. This
typically involves copying data across executors and machines, making the shuffle a complex
and costly operation.
Background
To understand what happens during the shuffle, we can consider the example of
the reduceByKey operation. The reduceByKey operation generates a new RDD where all values
for a single key are combined into a tuple - the key and the result of executing a reduce function
against all values associated with that key. The challenge is that not all values for a single key
necessarily reside on the same partition, or even the same machine, but they must be co-located
to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a
specific operation. During computations, a single task will operate on a single partition - thus, to
organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform
an all-to-all operation. It must read from all partitions to find all the values for all keys, and then
bring together values across partitions to compute the final result for each key - this is called
the shuffle.
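The all-to-all step can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's shuffle implementation: toy_reduce_by_key and partition_for are hypothetical helpers, and a CRC32-based partitioner is used only so that the key placement is deterministic.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Deterministic hash partitioner (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def toy_reduce_by_key(partitions, func, num_reducers):
    # Map side: combine values per key inside each input partition, then
    # bucket the combined records by the reduce partition that owns the key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        for key, value in combined.items():
            buckets[partition_for(key, num_reducers)][key].append(value)

    # Reduce side: every value for a key now sits in one bucket, so each
    # bucket can be reduced independently of the others.
    output = []
    for bucket in buckets:
        reduced = {}
        for key, values in bucket.items():
            total = values[0]
            for value in values[1:]:
                total = func(total, value)
            reduced[key] = total
        output.append(reduced)
    return output


input_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
shuffled = toy_reduce_by_key(input_partitions, lambda x, y: x + y, num_reducers=2)

merged = {}
for part in shuffled:
    merged.update(part)
print(merged == {"a": 4, "b": 6, "c": 5})  # True
```

The values for key "b" start out in two different input partitions and only meet after bucketing, which is the data movement the shuffle pays for.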
Although the set of elements in each partition of newly shuffled data will be deterministic, and
so is the ordering of partitions themselves, the ordering of these elements is not. If one desires
predictably ordered data following shuffle then it’s possible to use:

 mapPartitions to sort each partition using, for example, .sorted
 repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously
repartitioning
 sortBy to make a globally ordered RDD
Operations which can cause a shuffle include repartition operations
like repartition and coalesce, ‘ByKey operations (except for counting)
like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The shuffle is an expensive operation since it involves disk I/O, data serialization, and network
I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the
data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and
does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these
are sorted based on the target partition and written to a single file. On the reduce side, tasks
read the relevant sorted blocks.
Certain shuffle operations can consume significant amounts of heap memory since they employ
in-memory data structures to organize records before or after transferring them.
Specifically, reduceByKey and aggregateByKey create these structures on the map side,
and 'ByKey operations generate these on the reduce side. When data does not fit in memory
Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased
garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
are preserved until the corresponding RDDs are no longer used and are garbage collected. This
is done so the shuffle files don’t need to be re-created if the lineage is re-computed. Garbage
collection may happen only after a long period of time, if the application retains references to
these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
consume a large amount of disk space. The temporary storage directory is specified by
the spark.local.dir configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
‘Shuffle Behavior’ section within the Spark Configuration Guide.
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory
across operations. When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that dataset (or datasets derived from
it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool
for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time
it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant
– if any partition of an RDD is lost, it will automatically be recomputed using the transformations
that originally created it.
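The effect of persisting can be illustrated with a toy model that counts how often the underlying computation runs. ToyRDD below is a hypothetical class written for this sketch and has nothing to do with Spark's internals; it only mirrors the behavior described above, where the data is kept after the first action that computes it.

```python
class ToyRDD:
    """Minimal lazy dataset that can optionally keep its computed data."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the elements
        self._persist = False
        self._cache = None
        self.computations = 0     # how many times compute() actually ran

    def persist(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        self.computations += 1
        data = list(self._compute())
        if self._persist:
            self._cache = data
        return data

    def count(self):
        return len(self._materialize())

    def sum(self):
        return sum(self._materialize())


plain = ToyRDD(lambda: (x * x for x in range(5)))
plain.count()
plain.sum()

cached = ToyRDD(lambda: (x * x for x in range(5))).persist()
cached.count()
cached.sum()

print(plain.computations)   # 2
print(cached.computations)  # 1
```

Without persist, each action recomputes the data; with it, the second action reuses what the first one produced.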
In addition, each persisted RDD can be stored using a different storage level, allowing you, for
example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to
save space), or replicate it across nodes. These levels are set by passing a StorageLevel object
(Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default
storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
The full set of storage levels is:
Storage Level | Meaning
MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER (Java and Scala) | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala) | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY | Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) | Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.
Note: In Python, stored objects will always be serialized with the Pickle library, so it does not
matter whether you choose a serialized level. The available storage levels in Python
include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY,
DISK_ONLY_2, and DISK_ONLY_3.
Spark also automatically persists some intermediate data in shuffle operations
(e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the
entire input if a node fails during the shuffle. We still recommend users call persist on the
resulting RDD if they plan to reuse it.
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and
CPU efficiency. We recommend going through the following process to select one:
 If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them
that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as
fast as possible.
 If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the
objects much more space-efficient, but still reasonably fast to access. (Java and Scala)
 Don’t spill to disk unless the functions that computed your datasets are expensive, or they
filter a large amount of the data. Otherwise, recomputing a partition may be as fast as
reading it from disk.
 Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve
requests from a web application). All the storage levels provide full fault tolerance by
recomputing lost data, but the replicated ones let you continue running tasks on the RDD
without waiting to recompute a lost partition.
Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a
least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of
waiting for it to fall out of the cache, use the RDD.unpersist() method. Note that this method
does not block by default. To block until resources are freed, specify blocking=True
(blocking=true in Scala) when calling this method.
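As a rough illustration of least-recently-used eviction and manual removal, consider the toy cache below. It is a deliberately simplified model, not Spark's block manager: ToyBlockCache, its capacity counted in blocks, and the block identifiers are all hypothetical.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Fixed-capacity cache that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def put(self, block_id, data):
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        while len(self._blocks) > self.capacity:
            # The first entry is the least recently used one.
            self._blocks.popitem(last=False)

    def get(self, block_id):
        if block_id not in self._blocks:
            return None
        self._blocks.move_to_end(block_id)  # mark as recently used
        return self._blocks[block_id]

    def unpersist(self, prefix):
        # Manually drop every block whose id starts with the prefix.
        for block_id in [b for b in self._blocks if b.startswith(prefix)]:
            del self._blocks[block_id]

    def block_ids(self):
        return list(self._blocks)


cache = ToyBlockCache(capacity=2)
cache.put("rdd1_p0", [1, 2])
cache.put("rdd1_p1", [3, 4])
cache.get("rdd1_p0")           # touch p0, so p1 becomes the oldest
cache.put("rdd2_p0", [5, 6])   # evicts rdd1_p1
after_eviction = cache.block_ids()

cache.unpersist("rdd1_")
after_unpersist = cache.block_ids()

print(after_eviction)   # ['rdd1_p0', 'rdd2_p0']
print(after_unpersist)  # ['rdd2_p0']
```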
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a
remote cluster node, it works on separate copies of all the variables used in the function. These
variables are copied to each machine, and no updates to the variables on the remote machine
are propagated back to the driver program. Supporting general, read-write shared variables
across tasks would be inefficient. However, Spark does provide two limited types of shared
variables for two common usage patterns: broadcast variables and accumulators.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine
rather than shipping a copy of it with tasks. They can be used, for example, to give every node a
copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within each
stage. The data broadcasted this way is cached in serialized form and deserialized before
running each task. This means that explicitly creating broadcast variables is only useful when
tasks across multiple stages need the same data or when caching the data in deserialized form is
important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The
broadcast variable is a wrapper around v, and its value can be accessed by calling
the value method. The code below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions
run on the cluster so that v is not shipped to the nodes more than once. In addition, the
object v should not be modified after it is broadcast in order to ensure that all nodes get the
same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
To release the resources that the broadcast variable copied onto executors, call .unpersist(). If
the broadcast is used again afterwards, it will be re-broadcast. To permanently release all
resources used by the broadcast variable, call .destroy(). The broadcast variable can’t be used
after that. Note that these methods do not block by default. To block until resources are freed,
specify blocking=True (blocking=true in Scala) when calling them.
Accumulators
Accumulators are variables that are only “added” to through an associative and commutative
operation and can therefore be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types,
and programmers can add support for new types.
As a user, you can create named or unnamed accumulators. As seen in the image below, a
named accumulator (in this instance counter) will display in the web UI for the stage that
modifies that accumulator. Spark displays the value for each accumulator modified by a task in
the “Tasks” table.

Tracking accumulators in the UI can be useful for understanding the progress of running stages
(NOTE: this is not yet supported in Python).

A numeric accumulator can be created by
calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate
values of type Long or Double, respectively. Tasks running on a cluster can then add to it using
the add method. However, they cannot read its value. Only the driver program can read the
accumulator’s value, using its value method.
The code below shows an accumulator being used to add up the elements of an array:
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
While this code used the built-in support for accumulators of type Long, programmers can also
create their own types by subclassing AccumulatorV2. The AccumulatorV2 abstract class has
several methods which one has to override: reset for resetting the accumulator to zero, add for
adding another value into the accumulator, merge for merging another same-type accumulator
into this one. Other methods that must be overridden are contained in the API documentation.
For example, supposing we had a MyVector class representing mathematical vectors, we could
write:
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {

  private val myVector: MyVector = MyVector.createZeroVector

  def reset(): Unit = {
    myVector.reset()
  }

  def add(v: MyVector): Unit = {
    myVector.add(v)
  }
  ...
}

// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into spark context:
sc.register(myVectorAcc, "MyVectorAcc1")
Note that, when programmers define their own type of AccumulatorV2, the resulting type can
be different than that of the elements added.
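The reset, add and merge contract can also be pictured in plain Python. The class below is a hypothetical stand-in written for this sketch; it is not a PySpark API and does not subclass anything from Spark. It assumes each task fills its own accumulator copy and the driver merges the copies afterwards.

```python
class ListSumAccumulator:
    """Element-wise vector sum following a reset/add/merge contract."""

    def __init__(self, size):
        self.size = size
        self.values = [0] * size

    def reset(self):
        # Return the accumulator to its zero value.
        self.values = [0] * self.size

    def add(self, vector):
        # Fold one input value into the accumulator.
        self.values = [a + b for a, b in zip(self.values, vector)]

    def merge(self, other):
        # Combine another accumulator of the same type into this one.
        self.add(other.values)


driver_acc = ListSumAccumulator(2)
partitions = [[[1, 2], [3, 4]], [[5, 6]]]  # hypothetical vectors per partition

for part in partitions:
    task_acc = ListSumAccumulator(2)   # each task works on its own copy
    for vector in part:
        task_acc.add(vector)
    driver_acc.merge(task_acc)         # the driver merges task results

print(driver_acc.values)  # [9, 12]
```

Because merging is just addition, the order in which task results arrive does not change the final value.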
For accumulator updates performed inside actions only, Spark guarantees that each task’s
update to the accumulator will be applied only once, i.e. restarted tasks will not update the
value. In transformations, users should be aware that each task’s update may be applied
more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated
within an operation on an RDD, their value is only updated once that RDD is computed as part of
an action. Consequently, accumulator updates are not guaranteed to be executed when made
within a lazy transformation like map(). The below code fragment demonstrates this property:

val accum = sc.longAccumulator
data.map { x => accum.add(x); x }
// Here, accum is still 0 because no actions have caused the map operation to be computed.
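The same laziness can be observed in plain Python, where the built-in map is also lazy. This is only an analogy: the acc dictionary is a hypothetical stand-in for an accumulator, and calling list plays the role of a Spark action.

```python
acc = {"value": 0}  # stand-in for an accumulator


def add_and_pass(x):
    acc["value"] += x
    return x


data = [1, 2, 3, 4]
mapped = map(add_and_pass, data)  # lazy: nothing has run yet

before = acc["value"]
result = list(mapped)             # forcing evaluation acts like an action
after = acc["value"]

print(before, after)  # 0 10
```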
Deploying to a Cluster
The application submission guide describes how to submit applications to a cluster. In short,
once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for
Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Launching Spark jobs from Java / Scala
The org.apache.spark.launcher package provides classes for launching Spark jobs as child
processes using a simple Java API.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework. Simply create
a SparkContext in your test with the master URL set to local, run your operations, and then
call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or
the test framework’s tearDown method, as Spark does not support two contexts running
concurrently in the same program.
Where to Go from Here
You can see some example Spark programs on the Spark website. In addition, Spark includes
several samples in the examples directory (Scala, Java, Python, R). You can run Java and Scala
examples by passing the class name to Spark’s bin/run-example script; for instance:
./bin/run-example SparkPi
For Python examples, use spark-submit instead:
./bin/spark-submit examples/src/main/python/pi.py
For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R
For help on optimizing your programs, the configuration and tuning guides provide information
on best practices. They are especially important for making sure that your data is stored in
memory in an efficient format. For help on deploying, the cluster mode overview describes the
components involved in distributed operation and supported cluster managers.
Finally, full API documentation is available in Scala, Java, Python and R.
Getting Started

 Starting Point: SparkSession
 Creating DataFrames
 Untyped Dataset Operations (aka DataFrame Operations)
 Running SQL Queries Programmatically
 Global Temporary View
 Creating Datasets
 Interoperating with RDDs
o Inferring the Schema Using Reflection
o Programmatically Specifying the Schema
 Scalar Functions
 Aggregate Functions
Starting Point: SparkSession

The entry point into all functionality in Spark is the SparkSession class. To create a
basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to write
queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use
these features, you do not need to have an existing Hive setup.
Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive
table, or from Spark data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation
in Scala, Java, Python and R.
As mentioned above, in Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs.
These operations are also referred to as “untyped transformations”, in contrast to the “typed
transformations” that come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:

In Python, it’s possible to access a DataFrame’s columns either by attribute (df.age) or by
indexing (df['age']). While the former is convenient for interactive data exploration, users are
highly encouraged to use the latter form, which is future proof and won’t break with column
names that are also attributes on the DataFrame class.
# spark, df are from the previous example
# Print the schema in a tree format
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
# |   name|(age + 1)|
# +-------+---------+
# |Michael|     null|
# |   Andy|       31|
# | Justin|       20|
# +-------+---------+

# Select people older than 21
df.filter(df['age'] > 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+

# Count people by age
df.groupBy("age").count().show()
# +----+-----+
# | age|count|
# +----+-----+
# |  19|    1|
# |null|    1|
# |  30|    1|
# +----+-----+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
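The advice to prefer df['age'] over df.age comes down to name collisions, which a toy class can make concrete. ToyFrame below is hypothetical and is not the PySpark DataFrame; it only assumes that columns are reachable both by indexing and, as a fallback, by attribute lookup.

```python
class ToyFrame:
    """Tiny column container with both index and attribute access."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def count(self):
        """Number of rows, analogous to a method on a DataFrame class."""
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, name):
        return self._columns[name]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so a column
        # named like an existing method can never be found this way.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None


toy = ToyFrame({"age": [30, 19], "count": [7, 9]})

by_attribute = toy.age          # works: no method is called "age"
by_index = toy["count"]         # always returns the column
shadowed = toy.count            # the method, not the "count" column

print(by_attribute)        # [30, 19]
print(by_index)            # [7, 9]
print(callable(shadowed))  # True
```

Indexing keeps working even when a column shares its name with a method, which is why it is the safer habit.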
For a complete list of the types of operations that can be performed on a DataFrame refer to
the API Documentation.
In addition to simple column references and expressions, DataFrames also have a rich library of
functions including string manipulation, date arithmetic, common math operations and more.
The complete list is available in the DataFrame Function Reference.
Running SQL Queries Programmatically

The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Global Temporary View
Temporary views in Spark SQL are session-scoped and will disappear if the session that created
them terminates. If you want a temporary view that is shared among all sessions and kept
alive until the Spark application terminates, you can create a global temporary view. A global
temporary view is tied to a system-preserved database global_temp, and you must use the
qualified name to refer to it, e.g. SELECT * FROM global_temp.view1.

# Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

# Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Creating Datasets
Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a
specialized Encoder to serialize the objects for processing or transmitting over the network.
While both encoders and standard serialization are responsible for turning an object into bytes,
encoders are code generated dynamically and use a format that allows Spark to perform many
operations like filtering, sorting and hashing without deserializing the bytes back into an object.

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
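To make the encoder idea concrete, here is a minimal plain-Python sketch. It is not Spark's Encoder or its internal binary format: the PERSON layout, the helper functions, and the sample ages are all hypothetical. It only illustrates the claim above that, with a fixed binary layout, a single field can be read straight from the bytes, so records can be filtered and sorted without rebuilding the objects.

```python
import struct

# Hypothetical fixed layout: 8-byte signed age, then a 16-byte zero-padded UTF-8 name.
PERSON = struct.Struct("<q16s")


def encode(name, age):
    """Serialize one record into the fixed layout."""
    return PERSON.pack(age, name.encode("utf-8"))


def age_of(record):
    """Read only the age field at offset 0; the name bytes are never decoded."""
    return struct.unpack_from("<q", record, 0)[0]


def decode(record):
    """Rebuild the full (name, age) tuple, only when an object is really needed."""
    age, raw_name = PERSON.unpack(record)
    return raw_name.rstrip(b"\x00").decode("utf-8"), age


# Made-up sample values for illustration.
records = [encode("Michael", 45), encode("Andy", 30), encode("Justin", 19)]

# Filtering and sorting operate directly on the serialized bytes.
adults = [r for r in records if age_of(r) >= 21]
by_age = sorted(records, key=age_of)

youngest = decode(by_age[0])
print(len(adults), youngest)
```

The point of the design is that decode is called once, at the very end; the filter and the sort never materialize a Python object per record.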
Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects.
This reflection-based approach leads to more concise code and works well when you already
know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you
to construct a schema and then apply it to an existing RDD. While this method is more verbose,
it allows you to construct Datasets when the columns and their types are not known until
runtime.
Inferring the Schema Using Reflection

Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the data types. Rows
are constructed by passing key/value pairs as kwargs to the Row class. The keys
define the column names of the table, and the types are inferred from the data, similar to the
inference that is performed on JSON files. By default only the first rows are inspected; pass
samplingRatio to createDataFrame to sample more of the dataset.
from pyspark.sql import Row

sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are DataFrame objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)
# Name: Justin
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Programmatically Specifying the Schema

 Scala
 Java
 Python
When a dictionary of kwargs cannot be defined ahead of time (for example, the structure of
records is encoded in a string, or a text dataset will be parsed and fields will be projected
differently for different users), a DataFrame can be created programmatically in three steps.

1. Create an RDD of tuples or lists from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the tuples or lists in
the RDD created in step 1.
3. Apply the schema to the RDD via the createDataFrame method provided by SparkSession.
For example:
# Import data types
from pyspark.sql.types import StringType, StructType, StructField

sc = spark.sparkContext
# Load a text file and split each line into fields.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql("SELECT name FROM people")

results.show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Scalar Functions
Scalar functions are functions that return a single value per row, as opposed to aggregation
functions, which return a value for a group of rows. Spark SQL supports a variety of Built-in
Scalar Functions. It also supports User Defined Scalar Functions.
Aggregate Functions
Aggregate functions are functions that return a single value for a group of rows. The Built-in
Aggregation Functions provide common aggregations such
as count(), count_distinct(), avg(), max(), min(), etc. Users are not limited to the predefined
aggregate functions and can create their own. For more details about user defined aggregate
functions, please refer to the documentation of User Defined Aggregate Functions.
