
Introduction to Spark

using Python
What is Spark
• Distributed data processing framework
• Distributed – runs in several machines
• We get more RAM, more processing power
• Data processing
• Read and process data
• Based on Resilient Distributed Datasets/DataFrames
• Used for 'big data' processing
• 'big' is something that doesn't fit on normal
machines
• Changes as machines become more powerful
Resilient Distributed Datasets
• Imagine a big set of objects, and how we can
distribute/parallelize
• We can divide in slices and keep each slice in a different
node;
• Values are computed only when needed
• To guarantee fault-tolerance, we also keep info about how
we calculated each slice, so we can re-generate it if a node
fails
• We can hint Spark to keep an RDD in cache, or even save it on disk
• Immutable – not designed for read/write
• instead, transform an existing one into a new one
• It is basically a huge list
• But distributed over many computers
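• A quick sketch of the slicing and caching ideas (assuming a SparkContext named sc is already available – we create one later in these slides):
rdd = sc.parallelize(range(10), 4)   # ask for 4 slices (partitions)
rdd.getNumPartitions()               # 4
rdd.cache()                          # hint to keep the slices in memory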
Shared Spark Variables
• Broadcast variables
• copy is kept at each node
• Accumulators
• workers can only add; the driver (main node) can read the value
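• A minimal sketch of both kinds of shared variables (toy data, again assuming an existing SparkContext sc):
lookup = sc.broadcast({"M": "male", "F": "female"})   # a read-only copy is kept at each node
counter = sc.accumulator(0)                           # workers can only add to it
rdd = sc.parallelize(["M", "F", "M"])
rdd.foreach(lambda x: counter.add(1))                 # adding from the workers
counter.value                                         # the main node reads the total: 3
lookup.value["M"]                                     # 'male' (also readable inside tasks)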
Functional programming in Python
• A lot of these concepts are already in Python
• But the Python community tends to promote loops
• Functional tools in Python
• map
• filter
• reduce
• lambda
Map in Python
• Python supports the map operation over any list
• We apply an operation to each element of a list and
return a new list with the results
• a=[1,2,3]
• def add1(x): return x+1
• map(add1,a) => [2,3,4]
• We usually do this with a for loop; this is a slightly
different way of thinking
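• A runnable version of the example above (one caveat not on the slide: in Python 3, map returns an iterator, so we wrap it in list() to see the results):
a = [1, 2, 3]
def add1(x):
    return x + 1
list(map(add1, a))   # [2, 3, 4]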
Filter
• Select only certain elements from a list
• Example:
• a=[1,2,3,4]
• def isOdd(x): return x%2==1
• filter(isOdd,a) => [1,3]
reduce in python
• Applies a function cumulatively to pairs of elements of a list;
returns ONE value, not a list
• Example:
• a=[1,2,3,4]
• def add(x,y): return x+y
• reduce(add,a) => 10
• add(add(add(1,2),3),4)
• Better for functions that are commutative and
associative, so order doesn't matter
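• A runnable version of the example (note that in Python 3, reduce has moved into the functools module):
from functools import reduce
a = [1, 2, 3, 4]
def add(x, y):
    return x + y
reduce(add, a)   # 10, folding from the left: add(add(add(1, 2), 3), 4)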
lambdas
• When doing map/reduce/filter, we end up with
many tiny functions
• Lambdas allow us to define a function as a value,
without giving it a name
• example: lambda x: x+1
• Can only have one expression
• do not write return
• I put parentheses around it; usually not needed by
the syntax
• (lambda x: x+1)(3) => 4
• map(lambda x: x+1, [1,2,3])=> [2,3,4]
Exercises
• (lambda x: 2*x)(3) => ?
• map(lambda x: 2*x, [1,2,3]) =>
• map(lambda t: t[0], [ (1,2), (3,4), (5,6) ] ) =>
• reduce(lambda x,y: x+y, [1,2,3]) =>
• reduce(lambda x,y: x+y, map(lambda t: t[0], [ (1,2),
(3,4), (5,6) ] ))=>
More exercises
• Given
• a=[ (1,2), (3,4), (5,6)]
• Write an expression to get only the second
elements of each tuple
• Write an expression to get the sum of the second
elements
• Write an expression to get the sum of the odd first
elements
Now let's do those with Spark
• Start the spark notebook
• import os
• os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'
• # First two lines to work with Python 2.7
• import pyspark
• sc = pyspark.SparkContext()
Creating RDDs in Spark
• All spark commands operate on RDDs (think big
distributed list)
• You can use sc.parallelize to go from list to RDD
• Later we will see how to read from files
• Many commands are lazy (they don't actually
compute the results until you need them)
• In pySpark, sc represents your SparkContext
Simple example
• list1=sc.parallelize( range(1,1000) )
• list2=list1.map(lambda x: x*10) # notice: lazy
• list2.reduce(lambda x,y: x+y)
• list2.filter(lambda x: x%100==0).collect()
Transformations vs Actions
• We divide RDD methods into two kinds:
• Transformations
• return another RDD
• are not really performed until an action is called (lazy)
• Actions
• return a value other than an RDD
• are performed immediately
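• A small sketch of the difference (assuming sc exists): the map call returns immediately without touching the data; only the action triggers the computation.
squares = sc.parallelize(range(1, 5)).map(lambda x: x * x)   # transformation: nothing runs yet
squares.collect()                                            # action: [1, 4, 9, 16]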
Some RDD methods
• Transformations
• .map( f ) – returns a new RDD applying f to each element
• .filter( f ) – returns a new RDD containing elements that
satisfy f
• .flatMap( f ) – returns a 'flattened' RDD, where each input
element can produce several output elements
• Actions
• .reduce( f ) – returns a value reducing RDD elements with f
• .take( n ) – returns n items from the RDD
• .collect() – returns all elements as a list
• .sum() - sum of (numeric) elements of an RDD
• max,min,mean …
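• A short sketch of flatMap next to the other methods (toy data, assuming sc exists):
lines = sc.parallelize(["hello world", "hello spark"])
words = lines.flatMap(lambda line: line.split(" "))    # 2 lines become 4 words in one flat RDD
words.take(2)                                          # ['hello', 'world']
words.filter(lambda w: w == "hello").collect()         # ['hello', 'hello']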
More examples
• rdd1=sc.parallelize( range(1,100) )
• rdd1.map(lambda x: x*x).sum()
• rdd1.filter(lambda x: x%2==0).take(5)
Exercises
1. Get an RDD with numbers 1 to 10
2. Get all the elements in that RDD which are
divisible by 3
3. Get the product of the elements in 2
Reading files
• sc.textFile(urlOrPath, minPartitions, use_unicode=True)
• Returns an rdd of strings (one per line)
• Can read from many files, using wildcards (*)
• Can read from hdfs, …
• We normally use map right after and split/parse the
lines
• Example:
• people=sc.textFile("../data/people.txt")
• people=sc.textFile("../data/people.txt").map(lambda x:
x.split('\t'))
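• The wildcard form mentioned above might look like this (the paths here are only illustrative):
logs = sc.textFile("../data/logs/*.txt")   # every matching file ends up in one RDD of lines
# an hdfs URL works the same way, e.g. sc.textFile("hdfs:///data/people.txt")
logs.take(2)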
Tuples and ReduceByKey
• Many times we want to group elements first, and
then calculate values for each group
• In Spark, we operate on <Key, Value> tuples, and we
normally use reduceByKey to perform a reduce on
the elements of each group
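• A minimal sketch of the pattern with toy tuples (assuming sc exists), before we try it on real data on the next slide:
pairs = sc.parallelize([("M", 1), ("F", 1), ("M", 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # [('M', 2), ('F', 1)] (order may vary)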
People example/Exercises
• We have a people.txt file with following schema:
• Name | Gender | Age | Favorite Language
• We can load with:
• people=sc.textFile("../data/people.txt").map(lambda x:
x.split('\t'))
• Find number of people by gender
• first get tuples like: ('M',1),('F',1) ... then reduce by key
• people.map(lambda t: (t[1],1)).reduceByKey(lambda
x,y:x+y).collect()
• Let’s find number of people by favorite programming
language
• Example: youngest person per gender
• people.map(lambda t: (t[1],int(t[2]) )).reduceByKey(lambda
x,y:min(x,y)).collect()
More people exercises
(homework)
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups
<40, 40+
Person example with objects
• Using tuples for everything is … ok, but sometimes
we want a nicer schema
• We can use regular python objects
• We still need to use tuples for joins and reduceByKey, since
they operate on tuples
• Can use x.name, x.age, etc., which makes it slightly easier
Person class
class Person:
    def parse(self, line):
        fields = line.split('\t')
        self.name = fields[0]
        self.gender = fields[1]
        self.age = int(fields[2])
        self.favorite_language = fields[3]
        return self

    def __repr__(self):
        return "Person( %s, gender=%s, %d years old, likes %s)" % (
            self.name, self.gender, self.age, self.favorite_language)

• people=sc.textFile("../../data/people.txt").map(lambda line: Person().parse(line))  # build a fresh Person per line
Sending programs within shell
• You can use extra parameters to include python (or
java) programs in your shell
• --py-files (comma-separated list of files)
• Can use .py, .zip, .egg
• --jars to include java jars
• --packages, --repositories to include maven packages (java)
• --files to include arbitrary files in home folder of
executor
• Get out of pyspark
• Ctrl-D
• Run it again, including person.py in your --py-files
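• For example (assuming person.py sits in the current directory), relaunching might look like:
• pyspark --py-files person.py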
Person with Objects
• Number of people by gender
• people.map(lambda t: (t.gender,
1)).reduceByKey(lambda x,y:x+y).collect()
• Let’s do number of people by programming
language
• Youngest person by gender
• people.map(lambda t:
(t.gender,t.age)).reduceByKey(lambda x,y:min(x,y)).collect()
More people exercises
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups
<40, 40+
• Get age of oldest person, by programming
language
End (for now)
