pyspark-vs-pandas
Pandas vs PySpark
Eren Han
Data Engineering Fundamentals
1
LOAD CSV
Pandas:
    df = pd.read_csv('sample.csv')
PySpark:
    df = spark.read \
        .options(header=True, inferSchema=True) \
        .csv('sample.csv')
2
VIEW DATAFRAME
Pandas:
    df
    df.head(10)
PySpark:
    df.show()
    df.show(10)
3
CHECK COLUMNS AND DATA TYPES
Pandas:
    df.columns
    df.dtypes
PySpark:
    df.columns
    df.dtypes
4
RENAME COLUMNS
Pandas PySpark
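A minimal sketch of renaming columns in both libraries. The sample data and the column names (`old_a`, `old_b`) are hypothetical, and the PySpark half is wrapped in a function that assumes it receives an existing `pyspark.sql.DataFrame`, so the pandas half runs without a Spark installation.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
pdf = pd.DataFrame({"old_a": [1, 2], "old_b": [3, 4]})

# pandas: rename() returns a new DataFrame with the mapped column names.
pdf_renamed = pdf.rename(columns={"old_a": "a", "old_b": "b"})


def rename_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # withColumnRenamed() renames one column per call, so the calls are chained.
    return sdf.withColumnRenamed("old_a", "a").withColumnRenamed("old_b", "b")
```

Both calls leave the original DataFrame untouched and return a new one.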
5
DROP COLUMN
Pandas PySpark
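Dropping a column might look like the sketch below. The column names are hypothetical, and the PySpark function assumes an existing `pyspark.sql.DataFrame` is passed in.

```python
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# pandas: drop() with columns= returns a new DataFrame without those columns.
pdf_dropped = pdf.drop(columns=["b"])


def drop_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # drop() takes column names as positional arguments.
    return sdf.drop("b")
```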
6
FILTERING
Pandas:
    df[(df.column < 80) & (df.column2 == 50)]
PySpark:
    df[(df.column < 80) & (df.column2 == 50)]
7
ADD COLUMN
Pandas PySpark
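One way to add a derived column in each library, as a sketch under assumed names: the column `a` and the new column `a_doubled` are hypothetical, and the PySpark function expects an existing `pyspark.sql.DataFrame`.

```python
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"a": [1, 2]})

# pandas: assigning to a new column label adds the column in place.
pdf["a_doubled"] = pdf["a"] * 2


def add_column_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # Imported lazily so the pandas half runs without a Spark installation.
    from pyspark.sql.functions import col

    # withColumn() returns a new DataFrame; Spark DataFrames are immutable.
    return sdf.withColumn("a_doubled", col("a") * 2)
```

Note the difference in style: the pandas assignment mutates `pdf`, while `withColumn()` returns a new DataFrame that has to be reassigned.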
8
FILL NULLS
Pandas:
    df.fillna(0)
PySpark:
    df.fillna(0)
9
AGGREGATION
Pandas PySpark
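A grouped aggregation could be sketched as follows. The `group` and `value` columns and the choice of mean and max are assumptions for illustration; the PySpark function expects an existing `pyspark.sql.DataFrame`.

```python
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 3, 5]})

# pandas: group by a key, then compute several aggregates of one column.
summary = pdf.groupby("group")["value"].agg(["mean", "max"])


def aggregate_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # Imported lazily so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    # groupBy().agg() takes one aggregate expression per output column.
    return sdf.groupBy("group").agg(
        F.mean("value").alias("mean"),
        F.max("value").alias("max"),
    )
```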
10
STANDARD TRANSFORMATIONS
Pandas PySpark
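As one example of a standard transformation, the sketch below applies a natural logarithm to a numeric column. The choice of transformation and the column names are assumptions; pandas typically leans on NumPy for this, while PySpark provides equivalents in `pyspark.sql.functions`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"value": [1.0, 10.0, 100.0]})

# pandas: NumPy functions apply element-wise to a column.
pdf["log_value"] = np.log(pdf["value"])


def log_transform_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # Imported lazily so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    # F.log() with a single argument computes the natural logarithm.
    return sdf.withColumn("log_value", F.log(F.col("value")))
```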
11
CONDITIONAL STATEMENTS
Pandas PySpark
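Conditional column logic might be written as in the sketch below. The columns `score` and `cyl`, the thresholds, and the labels are all hypothetical. The pandas half uses `np.select`, which is one of several possible approaches, and the PySpark half chains `when()` and `otherwise()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
pdf = pd.DataFrame({"score": [25, 10, 10], "cyl": [4, 6, 8]})

# pandas: np.select() picks the value of the first condition that matches.
pdf["label"] = np.select(
    [pdf["score"] > 20, pdf["cyl"] == 6],
    [1, 2],
    default=3,
)


def label_spark(sdf):
    """PySpark counterpart; sdf is assumed to be a pyspark.sql.DataFrame."""
    # Imported lazily so the pandas half runs without a Spark installation.
    from pyspark.sql import functions as F

    # when() clauses are evaluated in order; otherwise() is the fallback.
    return sdf.withColumn(
        "label",
        F.when(F.col("score") > 20, 1)
        .when(F.col("cyl") == 6, 2)
        .otherwise(3),
    )
```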
12
MERGE / JOIN DATAFRAMES
Pandas PySpark
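An inner join on a shared key could look like the following sketch. The frames `left` and `right` and the key column `key` are hypothetical; the PySpark function assumes two existing `pyspark.sql.DataFrame` objects.

```python
import pandas as pd

# Hypothetical sample data with a shared key column.
left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "y": [20, 30, 40]})

# pandas: merge() joins on the named key; how= selects the join type.
merged = left.merge(right, on="key", how="inner")


def join_spark(left_sdf, right_sdf):
    """PySpark counterpart; both arguments are assumed to be Spark DataFrames."""
    # join() takes the same on= and how= arguments as the pandas call above.
    return left_sdf.join(right_sdf, on="key", how="inner")
```

Only keys present in both frames survive an inner join, so rows with `key` 1 and 4 are dropped here.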
13
SUMMARY STATISTICS
Pandas:
    df.describe()
PySpark:
    df.describe().show()
Note: the PySpark version reports only count, mean, stddev, min, and max (no quartiles).
14
CHANGE DATA TYPES
Pandas:
    df['A'] = df['A'].astype(int)
PySpark:
    from pyspark.sql.functions import col
    from pyspark.sql.types import IntegerType
    df = df.withColumn('A', col('A').cast(IntegerType()))