Spark Using Python
What is Spark
• Distributed data processing framework
• Distributed – runs on several machines
• We get more RAM, more processing power
• Data processing
• Read and process data
• Based on Resilient Distributed Datasets/DataFrames
• Used for 'big data' processing
• 'big' is something that doesn't fit on normal machines
• Changes as machines become more powerful
Resilient Distributed Datasets
• Imagine a big set of objects, and how we can distribute/parallelize it
• We can divide it into slices and keep each slice on a different node
• Values are computed only when needed
• To guarantee fault-tolerance, we also keep info about how we calculated each slice, so we can re-generate it if a node fails
• We can hint to keep it in cache, or even save it to disk
• Immutable – not designed for read/write
• instead, transform an existing one into a new one
• It is basically a huge list
• But distributed over many computers
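• A small sketch of these ideas (assuming a SparkContext named sc, created later in these slides):

nums = sc.parallelize(range(1, 1000000))  # distributed across slices/partitions
squares = nums.map(lambda x: x*x)         # a new RDD; nothing is computed yet
squares.cache()                           # hint: keep this RDD in memory once computed
squares.take(5)                           # an action finally triggers the computation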
Shared Spark Variables
• Broadcast variables
• a read-only copy is kept at each node
• Accumulators
• workers can only add to them; the main node can read the value
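• A minimal sketch of both (assuming a SparkContext named sc, created later in these slides):

lookup = sc.broadcast({'M': 'Male', 'F': 'Female'})  # read-only copy shipped to each node
errors = sc.accumulator(0)                           # workers can only add to it

def expand(code):
    if code not in lookup.value:                     # read the broadcast value on a worker
        errors.add(1)                                # count records we could not expand
    return lookup.value.get(code, '?')

sc.parallelize(['M', 'F', 'X']).map(expand).collect()
errors.value                                         # only the driver reads the total (1 here)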
Functional programming in Python
• A lot of these concepts are already in Python
• But the Python community tends to promote loops
• Functional tools in Python
• map
• filter
• reduce
• lambda
Map in Python
• Python supports the map operation over any list
• We apply an operation to each element of a list and return a new list with the results
• a=[1,2,3]
• def add1(x): return x+1
• map(add1,a) => [2,3,4]
• We usually do this with a for loop; this is a slightly different way of thinking
Filter
• Select only certain elements from a list
• Example:
• a=[1,2,3,4]
• def isOdd(x): return x%2==1
• filter(isOdd,a) => [1,3]
reduce in Python
• Applies a function repeatedly to pairs of elements of a list; returns ONE value, not a list
• Example:
• a=[1,2,3,4]
• def add(x,y): return x+y
• reduce(add,a) => 10
• add(add(add(1,2),3),4)
• Better for functions that are commutative and associative, so order doesn't matter
Lambdas
• When doing map/reduce/filter, we end up with many tiny functions
• Lambdas allow us to define a function as a value, without giving it a name
• example: lambda x: x+1
• Can only have one expression
• do not write return
• I put parentheses around it; usually not needed by the syntax
• (lambda x: x+1)(3) => 4
• map(lambda x: x+1, [1,2,3])=> [2,3,4]
Exercises
• (lambda x: 2*x)(3) => ?
• map(lambda x: 2*x, [1,2,3]) =>
• map(lambda t: t[0], [ (1,2), (3,4), (5,6) ] ) =>
• reduce(lambda x,y: x+y, [1,2,3]) =>
• reduce(lambda x,y: x+y, map(lambda t: t[0], [ (1,2), (3,4), (5,6) ] )) =>
More exercises
• Given
• a=[ (1,2), (3,4), (5,6)]
• Write an expression to get only the second elements of each tuple
• Write an expression to get the sum of the second elements
• Write an expression to get the sum of the odd first elements
Now let's do those with Spark
• Start the Spark notebook
• import os
• os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'
• # First two lines to work with Python 2.7
• import pyspark
• sc = pyspark.SparkContext()
Creating RDDs in Spark
• All Spark commands operate on RDDs (think: big distributed list)
• You can use sc.parallelize to go from list to RDD
• Later we will see how to read from files
• Many commands are lazy (they don't actually compute the results until you need them)
• In pySpark, sc represents your SparkContext
Simple example
• list1=sc.parallelize( range(1,1000) )
• list2=list1.map(lambda x: x*10) # notice: lazy
• list2.reduce(lambda x,y: x+y)
• list2.filter(lambda x: x%100==0).collect()
Transformations vs Actions
• We divide RDD methods into two kinds:
• Transformations
• return another RDD
• are not really performed until an action is called (lazy)
• Actions
• return a value other than an RDD
• are performed immediately
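• A tiny illustration of the difference (a sketch, assuming sc from the earlier setup):

rdd = sc.parallelize(range(1, 11))
doubled = rdd.map(lambda x: 2*x)  # transformation: returns an RDD, nothing runs yet
doubled.count()                   # action: triggers the computation, returns 10
doubled.collect()                 # action: returns [2, 4, ..., 20] as a plain list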
Some RDD methods
• Transformations
• .map( f ) – returns a new RDD applying f to each element
• .filter( f ) – returns a new RDD containing elements that satisfy f
• .flatMap( f ) – like map, but f can return several results per element, which are 'flattened' into a single RDD
• Actions
• .reduce( f ) – returns a value reducing RDD elements with f
• .take( n ) – returns n items from the RDD
• .collect() – returns all elements as a list
• .sum() - sum of (numeric) elements of an RDD
• max,min,mean …
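• A short sketch of map vs flatMap (assuming sc from the earlier setup):

lines = sc.parallelize(["a b", "c d e"])
lines.map(lambda s: s.split(' ')).collect()      # [['a', 'b'], ['c', 'd', 'e']]  (one list per element)
lines.flatMap(lambda s: s.split(' ')).collect()  # ['a', 'b', 'c', 'd', 'e']  (flattened)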
More examples
• rdd1=sc.parallelize( range(1,100) )
• rdd1.map(lambda x: x*x).sum()
• rdd1.filter(lambda x: x%2==0).take(5)
Exercises
1. Get an RDD with the numbers 1 to 10
2. Get all the elements in that RDD which are divisible by 3
3. Get the product of the elements in 2
Reading files
• sc.textFile(urlOrPath, minPartitions, use_unicode=True)
• Returns an RDD of strings (one per line)
• Can read from many files, using wildcards (*)
• Can read from hdfs, …
• We normally use map right after, to split/parse the lines
• Example:
• people=sc.textFile("../data/people.txt")
• people=sc.textFile("../data/people.txt").map(lambda x: x.split('\t'))
Tuples and ReduceByKey
• Many times we want to group elements first, and then calculate values for each group
• In Spark, we operate on <Key,Value> tuples, and we normally use reduceByKey to perform a reduce on the elements of each group
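• A minimal sketch of the pattern (assuming sc from the earlier setup):

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])
pairs.reduceByKey(lambda x, y: x + y).collect()   # -> [('a', 4), ('b', 6)]  (order may vary)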
People example/Exercises
• We have a people.txt file with the following schema:
• Name | Gender | Age | Favorite Language
• We can load it with:
• people=sc.textFile("../data/people.txt").map(lambda x: x.split('\t'))
• Find the number of people by gender
• first get tuples like: ('M',1),('F',1) ... then reduce by key
• people.map(lambda t: (t[1],1)).reduceByKey(lambda x,y:x+y).collect()
• Let's find the number of people by favorite programming language
• Example: age of the youngest person per gender
• people.map(lambda t: (t[1],int(t[2]))).reduceByKey(lambda x,y:min(x,y)).collect()
More people exercises (homework)
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups: <40, 40+
Person example with objects
• Using tuples for everything is ... OK, but sometimes we want a nicer schema
• We can use regular Python objects
• We still need to use tuples for joins and reduceByKey, since they operate on tuples
• Can use x.name, x.age, etc., which makes it slightly easier
Person class
class Person:
    def parse(self, line):
        fields = line.split('\t')
        self.name = fields[0]
        self.gender = fields[1]
        self.age = int(fields[2])
        self.favorite_language = fields[3]
        return self

    def __repr__(self):
        return "Person( %s, gender=%s, %d years old, likes %s)" % \
            (self.name, self.gender, self.age, self.favorite_language)
• people=sc.textFile("../../data/people.txt").map(lambda line: Person().parse(line))  # parse each line into a new Person
Sending programs within shell
• You can use extra parameters to include Python (or Java) programs in your shell
• --py-files (comma-separated list of files)
• Can use .py, .zip, .egg
• --jars to include Java jars
• --packages, --repositories to include Maven packages (Java)
• --files to include arbitrary files in the working directory of each executor
• Get out of pyspark
• Ctrl-D
• Run it again, including person.py in your --py-files
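• Example (a sketch; person.py is assumed to be in your current directory):
• pyspark --py-files person.py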
Person with Objects
• Number of people by gender
• people.map(lambda t: (t.gender, 1)).reduceByKey(lambda x,y:x+y).collect()
• Let's do the number of people by programming language
• Age of the youngest person, by gender
• people.map(lambda t: (t.gender, t.age)).reduceByKey(lambda x,y:min(x,y)).collect()
More people exercises
• Get number of people with age 40+
• Using filter
• Using map and reduceByKey to produce two groups: <40, 40+
• Get the age of the oldest person, by programming language
End (for now)