Spark Using Python
What is Spark
• Distributed data processing framework
• Distributed – runs on several machines
• We get more RAM, more processing power
• Data processing
• Read and process data
• Based on Resilient Distributed Datasets/DataFrames
• Used for 'big data' processing
• 'big' is something that doesn't fit on normal machines
• Changes as machines become more powerful
Resilient Distributed Datasets
• Imagine a big set of objects, and how we can distribute/parallelize it
• We can divide it into slices and keep each slice on a different node
• Values are computed only when needed
• To guarantee fault-tolerance, we also keep info about how we calculated each slice, so we can re-generate it if a node fails
• We can hint to keep it in cache, or even save it to disk
• Immutable – not designed for read/write
• instead, transform an existing one into a new one
• It is basically a huge list
• But distributed over many computers
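• A small sketch of these ideas (assuming a SparkContext named sc, created later in these slides):

nums = sc.parallelize(range(1, 1000000))  # distributed across slices/partitions
squares = nums.map(lambda x: x*x)         # a new RDD; nothing is computed yet
squares.cache()                           # hint: keep this RDD in memory once computed
squares.take(5)                           # an action finally triggers the computation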
Shared Spark Variables
• Broadcast variables
• a read-only copy is kept at each node
• Accumulators
• workers can only add to them; the main node can read the value
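• A minimal sketch of both (assuming a SparkContext named sc, created later in these slides):

lookup = sc.broadcast({'M': 'Male', 'F': 'Female'})  # read-only copy shipped to each node
errors = sc.accumulator(0)                           # workers can only add to it

def expand(code):
    if code not in lookup.value:                     # read the broadcast value on a worker
        errors.add(1)                                # count records we could not expand
    return lookup.value.get(code, '?')

sc.parallelize(['M', 'F', 'X']).map(expand).collect()
errors.value                                         # only the driver reads the total (1 here)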
Functional programming in Python
• A lot of these concepts are already in Python
• But the Python community tends to promote loops
• Functional tools in Python
• map
• filter
• reduce
• lambda
Map in Python
• Python supports the map operation over any list
• We apply an operation to each element of a list and return a new list with the results
• a=[1,2,3]
• def add1(x): return x+1
• map(add1,a) => [2,3,4]
• We usually do this with a for loop; this is a slightly different way of thinking
Filter
• Select only certain elements from a list
• Example:
• a=[1,2,3,4]
• def isOdd(x): return x%2==1
• filter(isOdd,a) => [1,3]
reduce in Python
• Applies a function repeatedly to pairs of elements of a list; returns ONE value, not a list
• Example:
• a=[1,2,3,4]
• def add(x,y): return x+y
• reduce(add,a) => 10
• add(add(add(1,2),3),4)
• Better for functions that are commutative and associative, so order doesn't matter
Lambdas
• When doing map/reduce/filter, we end up with many tiny functions
• Lambdas allow us to define a function as a value, without giving it a name
• example: lambda x: x+1
• Can only have one expression
• do not write return
• I put parentheses around it; usually not needed by the syntax
• (lambda x: x+1)(3) => 4
• map(lambda x: x+1, [1,2,3])=> [2,3,4]
Exercises
• (lambda x: 2*x)(3) => ?
• map(lambda x: 2*x, [1,2,3]) =>
• map(lambda t: t[0], [ (1,2), (3,4), (5,6) ] ) =>
• reduce(lambda x,y: x+y, [1,2,3]) =>
• reduce(lambda x,y: x+y, map(lambda t: t[0], [ (1,2), (3,4), (5,6) ] )) =>
More exercises
• Given
• a=[ (1,2), (3,4), (5,6)]
• Write an expression to get only the second elements of each tuple
• Write an expression to get the sum of the second elements
• Write an expression to get the sum of the odd first elements
Now let's do those with Spark
• Start the Spark notebook
• import os
• os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'
• # First two lines to work with Python 2.7
• import pyspark
• sc = pyspark.SparkContext()
Creating RDDs in Spark
• All Spark commands operate on RDDs (think: big distributed list)
• You can use sc.parallelize to go from list to RDD
• Later we will see how to read from files
• Many commands are lazy (they don't actually compute the results until you need them)
• In pySpark, sc represents your SparkContext
Simple example
• list1=sc.parallelize( range(1,1000) )
• list2=list1.map(lambda x: x*10) # notice: lazy
• list2.reduce(lambda x,y: x+y)
• list2.filter(lambda x: x%100==0).collect()
Transformations vs Actions
• We divide RDD methods into two kinds:
• Transformations
• return another RDD
• are not really performed until an action is called (lazy)
• Actions
• return a value other than an RDD
• are performed immediately
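• A tiny illustration of the difference (a sketch, assuming sc from the earlier setup):

rdd = sc.parallelize(range(1, 11))
doubled = rdd.map(lambda x: 2*x)  # transformation: returns an RDD, nothing runs yet
doubled.count()                   # action: triggers the computation, returns 10
doubled.collect()                 # action: returns [2, 4, ..., 20] as a plain list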
Some RDD methods
• Transformations
• .map( f ) – returns a new RDD applying f to each element
• .filter( f ) – returns a new RDD containing elements that satisfy f
• .flatMap( f ) – like map, but f can return several results per element, which are 'flattened' into a single RDD
• Actions
• .reduce( f ) – returns a value reducing RDD elements with f
• .take( n ) – returns n items from the RDD
• .collect() – returns all elements as a list
• .sum() - sum of (numeric) elements of an RDD
• max,min,mean …
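• A short sketch of map vs flatMap (assuming sc from the earlier setup):

lines = sc.parallelize(["a b", "c d e"])
lines.map(lambda s: s.split(' ')).collect()      # [['a', 'b'], ['c', 'd', 'e']]  (one list per element)
lines.flatMap(lambda s: s.split(' ')).collect()  # ['a', 'b', 'c', 'd', 'e']  (flattened)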
More examples
• rdd1=sc.parallelize( range(1,100) )
• rdd1.map(lambda x: x*x).sum()
• rdd1.filter(lambda x: x%2==0).take(5)
Exercises
1. Get an RDD with the numbers 1 to 10
2. Get all the elements in that RDD which are divisible by 3
3. Get the product of the elements in 2
Reading files
• sc.textFile(urlOrPath, minPartitions, use_unicode=True)
• Returns an RDD of strings (one per line)
• Can read from many files, using wildcards (*)
• Can read from hdfs, …
• We normally use map right after, to split/parse the lines
• Example:
• people=sc.textFile("../data/people.txt")
• people=sc.textFile("../data/people.txt").map(lambda x: x.split('\t'))
Tuples and ReduceByKey
• Many times we want to group elements first, and then calculate values for each group
• In Spark, we operate on <Key,Value> tuples, and we normally use reduceByKey to perform a reduce on the elements of each group
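• A minimal sketch of the pattern (assuming sc from the earlier setup):

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # -> [('a', 4), ('b', 6)]  (order may vary)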
People example/Exercises
• We have a people.txt file with the following schema:
• Name | Gender | Age | Favorite Language
• We can load it with:
• people=sc.textFile("../data/people.txt").map(lambda x: x.split('\t'))
• Find the number of people by gender
• first get tuples like: ('M',1),('F',1) ... then reduce by key
• people.map(lambda t: (t[1],1)).reduceByKey(lambda x,y:x+y).collect()
• Let's find the number of people by favorite programming language
• Example: age of the youngest person per gender
• people.map(lambda t: (t[1],int(t[2]))).reduceByKey(lambda x,y:min(x,y)).collect()
More people exercises (homework)
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups: <40, 40+
Person example with objects
• Using tuples for everything is ... OK, but sometimes we want a nicer schema
• We can use regular Python objects
• We still need to use tuples for joins and reduceByKey, since they operate on tuples
• Can use x.name, x.age, etc., which makes it slightly easier
Person class
class Person:
    def parse(self, line):
        fields = line.split('\t')
        self.name = fields[0]
        self.gender = fields[1]
        self.age = int(fields[2])
        self.favorite_language = fields[3]
        return self

    def __repr__(self):
        return "Person( %s, gender=%s, %d years old, likes %s)" % \
            (self.name, self.gender, self.age, self.favorite_language)
• people=sc.textFile("../../data/people.txt").map(lambda line: Person().parse(line))  # parse each line into a new Person
Sending programs within shell
• You can use extra parameters to include Python (or Java) programs in your shell
• --py-files (comma-separated list of files)
• Can use .py, .zip, .egg
• --jars to include Java jars
• --packages, --repositories to include Maven packages (Java)
• --files to include arbitrary files in the working directory of each executor
• Get out of pyspark
• Ctrl-D
• Run it again, including person.py in your --py-files
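• Example (a sketch; person.py is assumed to be in your current directory):
• pyspark --py-files person.py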
Person with Objects
• Number of people by gender
• people.map(lambda t: (t.gender, 1)).reduceByKey(lambda x,y:x+y).collect()
• Let's do the number of people by programming language
• Age of the youngest person, by gender
• people.map(lambda t: (t.gender, t.age)).reduceByKey(lambda x,y:min(x,y)).collect()
More people exercises
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups: <40, 40+
• Get the age of the oldest person, by programming language
End (for now)