PySpark Code
from pyspark import SparkConf, SparkContext

# Create (or reuse) a SparkContext with the given configuration
conf = SparkConf().setAppName("Read File")
sc = SparkContext.getOrCreate(conf=conf)
rdd = sc.parallelize((1, 4, 7, 10))
# or
listv = [1, 4, 7, 10]
rdd = sc.parallelize(listv)
# Read a text file into an RDD (one element per line)
rdd = sc.textFile("/FileStore/tables/numbers.txt")
rdd.count()
rdd.filter(function)   # function returns True for elements to keep, e.g. rdd.filter(lambda x: x == 'apple')
rdd.distinct()
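A minimal sketch tying these basics together (the sample fruits data is made up for illustration):
fruits = sc.parallelize(["apple", "banana", "apple", "cherry"])   # hypothetical sample data
only_apples = fruits.filter(lambda x: x == "apple")               # keep matching elements
print(only_apples.count())                                        # -> 2
print(fruits.distinct().collect())                                # -> ['apple', 'banana', 'cherry'] (order may vary)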
WORD COUNT
flat = rdd.flatMap(lambda x: x.split(" "))   # split each line into words
maprdd = flat.map(lambda x: (x, 1))          # convert each word (single column) into a (key, value) pair
maprdd.groupByKey().mapValues(list).collect()   # word -> list of 1s
maprdd.groupByKey().mapValues(len).collect()    # word -> count
maprdd.reduceByKey(lambda x, y: x + y).collect()   # here x and y are values of the same key
flat.countByValue()   # shortcut action: returns a dict of word -> count
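Putting the whole word-count pipeline together (the file path reuses the one above; the output shown is only illustrative):
lines = sc.textFile("/FileStore/tables/numbers.txt")
words = lines.flatMap(lambda line: line.split(" "))
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())   # e.g. [('spark', 3), ('rdd', 2), ...]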
rdd.saveAsTextFile('/FileStore/22march/')   # writes one part file per partition
rdd.getNumPartitions() --> 3
rdd1 = rdd.repartition(5)   # increase or decrease the number of partitions (triggers a full shuffle)
rdd1.getNumPartitions() --> 5
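A quick sketch of partition counts; coalesce() is not covered above and is shown here only as the shuffle-free way to reduce partitions:
nums = sc.parallelize(range(12), 3)              # explicitly request 3 partitions
print(nums.getNumPartitions())                   # -> 3
print(nums.repartition(5).getNumPartitions())    # -> 5 (full shuffle)
print(nums.coalesce(1).getNumPartitions())       # -> 1 (no shuffle when only decreasing)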
Average(movie ratings)
rdd1 = rdd.map(lambda x: x.split(','))                             # each line is assumed to be "movie_id,rating"
rdd2 = rdd1.map(lambda x: (x[0], (int(x[1]), 1)))                  # (k, (v1, v2)) = (movie, (rating, 1))
rdd3 = rdd2.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))   # x[0], y[0] are v1 (ratings); x[1], y[1] are v2 (counts)
rdd4 = rdd3.map(lambda x: (x[0], x[1][0] / x[1][1]))               # x[1][0] = sum of ratings, x[1][1] = number of ratings
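The same average pattern as a self-contained sketch (the three "movie_id,rating" lines are made up):
sample = sc.parallelize(["m1,4", "m1,5", "m2,3"])   # hypothetical ratings data
pairs = sample.map(lambda line: line.split(',')).map(lambda x: (x[0], (int(x[1]), 1)))
sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
avgs = sums.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))
print(avgs.collect())   # -> [('m1', 4.5), ('m2', 3.0)] (order may vary)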
DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark DataFrames").getOrCreate()
df = spark.read.option("header",True).csv('/FileStore/tables/StudentData.csv')
df.show()
df.count()
df.printSchema()
df = spark.read.options(inferSchema='True', header='True').csv("/FileStore/tables/StudentData.csv")
CUSTOM SCHEMA
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

custom_schema = StructType([
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("name", StringType(), True),
    StructField("course", StringType(), True),
    StructField("roll", StringType(), True),
    StructField("marks", IntegerType(), True),
    StructField("email", StringType(), True)
])
df = spark.read.options(header='True').schema(custom_schema).csv('/FileStore/tables/StudentData.csv')
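To confirm the declared types took effect (the output shape below is illustrative, not a captured run):
df.printSchema()
# root
#  |-- age: integer (nullable = true)
#  |-- gender: string (nullable = true)
#  ...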
select on dataframe
df.select("name","gender").show()
df.select(df.name, df.gender).show()
df.select(col("name"), col("gender")).show()
df.select('*').show()
df.select(df.columns[2:6]).show()
withColumn, filter, sort, groupBy and join on dataframe
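The notes below jump straight to filter, so first a minimal withColumn sketch (the bonus_marks name is an assumption; col is imported above):
df.withColumn("bonus_marks", col("marks") + 5).show()   # add a derived column
df.withColumn("marks", col("marks") * 2).show()         # overwrite an existing column
df.withColumnRenamed("roll", "roll_no").show()          # rename a column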
df.filter(col("course") == "DB").show()
df.filter(df.course.isin(courses_list)).show()
df.select("gender").distinct().show()
df.dropDuplicates(["gender","course"]).show()
df.sort(df.marks).show(1000)
df.sort(df.marks.desc(), df.age.asc()).show(1000)
df.orderBy(df.marks, df.age).show()
df.groupBy("gender").sum("marks").show()
df.groupBy("gender").count().show()
df.groupBy("gender").max("marks").show()
df.groupBy("gender").min("marks").show()
df.groupBy("gender").avg("marks").show()
df.groupBy("gender").mean("marks").show()
df.join(df1, df.id == df1.id, "inner").show()   # df1 is another DataFrame sharing an id column
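A self-contained join sketch with two throwaway DataFrames (names and values are made up):
students = spark.createDataFrame([(1, "Asha"), (2, "Ravi")], ["id", "name"])
scores = spark.createDataFrame([(1, 85), (3, 60)], ["id", "marks"])
students.join(scores, students.id == scores.id, "inner").show()   # only id 1 matches
students.join(scores, students.id == scores.id, "left").show()    # keeps id 2 with null marks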
To run SQL queries directly, create a temporary view and use the view name in place of a table name.
df.createOrReplaceTempView("karthick")
spark.sql("select course, gender, count(*) from karthick group by course,
gender").show()
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def get_total_salary(salary):
    return salary + 100

totalSalaryUDF = udf(lambda x: get_total_salary(x), IntegerType())
df.withColumn("total_salary", totalSalaryUDF(df.salary)).show()   # assumes df has a salary column
Spark Streaming
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                    # micro-batch interval of 10 seconds (assumed)
rdd = ssc.textFileStream("/FileStore/tables/")    # DStream of lines from new files landing in the directory
rdd = rdd.map(lambda x: (x, 1))                   # each whole line becomes a (line, 1) pair
rdd = rdd.reduceByKey(lambda x, y: x + y)         # count identical lines within each batch
rdd.pprint()
ssc.start()
ssc.awaitTerminationOrTimeout(1000000)
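To count words rather than whole lines, split each line first (a sketch of an alternative DStream pipeline; the batch interval is assumed):
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)
lines = ssc.textFileStream("/FileStore/tables/")
words = lines.flatMap(lambda line: line.split(" "))
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
word_counts.pprint()
ssc.start()
ssc.awaitTerminationOrTimeout(1000000)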