Data Engineering Fundamentals
Pandas vs PySpark

Eren Han

1. LOAD CSV

Pandas:
df = pd.read_csv('sample.csv')

PySpark:
df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv('sample.csv')
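The PySpark snippet assumes an existing SparkSession named spark (the pyspark shell and most notebook environments create one automatically). A minimal setup sketch for a standalone script, with an illustrative app name:

import pandas as pd
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; the app name is arbitrary
spark = SparkSession.builder \
    .appName("pandas-vs-pyspark") \
    .getOrCreate()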

2. VIEW DATAFRAME

Pandas:
df
df.head(10)

PySpark:
df.show()
df.show(10)
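df.show() prints a plain-text table to stdout rather than returning a value. For a pandas-style rendering in a notebook, one common pattern (assuming the slice fits in driver memory) is:

# Pull a small slice back to the driver as a pandas DataFrame
df.limit(10).toPandas()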

3. CHECK COLUMNS AND DATA TYPES

Pandas:
df.columns
df.dtypes

PySpark:
df.columns
df.dtypes
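The calls look identical, but PySpark's df.dtypes returns a plain list of (name, type) tuples rather than a pandas Series. PySpark also offers a tree-formatted view:

# Print the schema as an indented tree, including nullability
df.printSchema()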

4. RENAME COLUMNS

Pandas:
df.columns = ["x", "y", "z"]
df.rename(columns={"old": "new"})

PySpark:
df.toDF("x", "y", "z")
df.withColumnRenamed("old", "new")
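toDF replaces every column name at once, so it needs exactly as many names as the DataFrame has columns; withColumnRenamed renames one column and can be chained. A sketch with illustrative names:

# Rename several columns, one at a time (a no-op if the old name is absent)
df = df.withColumnRenamed("old1", "new1") \
       .withColumnRenamed("old2", "new2")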

5. DROP COLUMN

Pandas:
df.drop("column", axis=1)

PySpark:
df.drop("column")
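Both can drop several columns in one call, with different spellings (column names are illustrative):

df.drop(columns=["col_a", "col_b"])   # pandas: list via the columns keyword
df.drop("col_a", "col_b")             # PySpark: names as separate arguments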

6. FILTERING

Pandas:
df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]

PySpark:
df[df.column < 80]
df[(df.column < 80) & (df.column2 == 50)]
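The bracket syntax happens to work in both libraries; in PySpark, filter (and its alias where) is the more idiomatic spelling:

# Equivalent PySpark filters
df.filter(df.column < 80)
df.where((df.column < 80) & (df.column2 == 50))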

7. ADD COLUMN

Pandas:
df["new"] = 1 / df.column

Note: Division by zero is infinite.

PySpark:
df.withColumn("new", 1 / df.column)

Note: Division by zero is NULL.
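1 / df.column works because Column overloads the arithmetic operators; a bare constant on its own, by contrast, must be wrapped in F.lit on the PySpark side. A sketch:

import pyspark.sql.functions as F

df["flag"] = 1                         # pandas broadcasts the constant
df = df.withColumn("flag", F.lit(1))   # PySpark needs an explicit literal Column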

8. FILL NULLS

Pandas:
df.fillna(0)

PySpark:
df.fillna(0)
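Both also support per-column fills, with slightly different spellings (column names here are illustrative):

df.fillna({"sales": 0, "region": "unknown"})   # pandas: dict of column -> value
df.fillna(0, subset=["sales"])                 # PySpark: one value, limited to listed columns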

9. AGGREGATION

Pandas:
df.groupby(["date", "product"]) \
  .agg({"sales": "mean", "revenue": "max"})

PySpark:
df.groupby(["date", "product"]) \
  .agg({"sales": "mean", "revenue": "max"})

10. STANDARD TRANSFORMATIONS

Pandas:
import numpy as np

df["logcolumn"] = np.log(df.column)

PySpark:
import pyspark.sql.functions as F

df.withColumn("logcolumn", F.log(df.column))
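F.log, like np.log, is the natural logarithm. Many numpy element-wise functions have a counterpart in pyspark.sql.functions:

df.withColumn("rootcolumn", F.sqrt(df.column))   # np.sqrt -> F.sqrt
df.withColumn("expcolumn", F.exp(df.column))     # np.exp  -> F.exp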

11. CONDITIONAL STATEMENTS

Pandas:
df["cond"] = df.apply(lambda x: 1 if x.col1 > 20
                      else 2 if x.col2 == 6
                      else 3, axis=1)

PySpark:
import pyspark.sql.functions as F

df.withColumn("cond",
    F.when(df.col1 > 20, 1)
     .when(df.col2 == 6, 2)
     .otherwise(3))
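The row-wise apply is slow on large frames; a vectorized pandas equivalent uses np.select, whose conditions are checked in order like chained when() calls:

import numpy as np

df["cond"] = np.select([df.col1 > 20, df.col2 == 6], [1, 2], default=3)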

12. MERGE / JOIN DATAFRAMES

Pandas:
df.merge(df2, on="key")
df.merge(df2, left_on="a", right_on="b")

PySpark:
df.join(df2, on="key")
df.join(df2, df.a == df2.b)
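Both default to an inner join; the join type is selected with the how argument in either API:

df.merge(df2, on="key", how="left")   # pandas
df.join(df2, on="key", how="left")    # PySpark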

13. SUMMARY STATISTICS

Pandas:
df.describe()

PySpark:
df.describe().show()

Note: Only count, mean, stddev, min, max.
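For quartiles as well, Spark 2.3+ offers df.summary(), which adds percentiles to the describe() statistics:

# count, mean, stddev, min, 25%, 50%, 75%, max
df.summary().show()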

14. CHANGE DATA TYPES

Pandas:
df['A'] = df['A'].astype(int)

PySpark:
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df = df.withColumn('A', col('A').cast(IntegerType()))
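cast also accepts a type name as a string, which avoids both imports:

df = df.withColumn('A', df['A'].cast('int'))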

Thank you for reading. I hope you enjoyed it.

Eren Han
