
CHEATSHEET: PANDAS VS PYSPARK

Vanessa Afolabi

Import Libraries and Set System Options:

PANDAS                                              PYSPARK
import pandas as pd                                 from pyspark.sql.types import *
pd.options.display.max_colwidth = 1000              from pyspark.sql.functions import *
                                                    from pyspark.sql import SQLContext
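A minimal setup sketch to go with the imports above (hedged: the cheatsheet uses the legacy SQLContext entry point; since Spark 2.x a SparkSession typically fills that role, and `sc` below is the SparkContext it wraps):

    import pandas as pd
    pd.options.display.max_colwidth = 1000   # show long strings in full

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("cheatsheet").getOrCreate()
    sc = spark.sparkContext   # needed for the SQLContext(sc) calls shown below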

Define and create a dataset:

PANDAS                                              PYSPARK
data = {'col1': [ , , ], 'col2': [ , , ]}           StructField('Col1', IntegerType())
df = pd.DataFrame(data, columns=['col1', 'col2'])   StructField('Col2', StringType())
                                                    schema = StructType([list of StructFields])
                                                    df = SQLContext(sc).createDataFrame(sc.emptyRDD(), schema)
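A concrete sketch of both patterns, with made-up values (column names and data are illustrative only; `spark` is the session from the setup sketch):

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    # Pandas: build directly from a dict of lists
    data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
    pdf = pd.DataFrame(data, columns=['col1', 'col2'])

    # PySpark: declare an explicit schema, then create from rows
    schema = StructType([
        StructField('col1', IntegerType(), True),   # nullable=True
        StructField('col2', StringType(), True),
    ])
    sdf = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], schema)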

Read and Write to CSV:

PANDAS                                              PYSPARK
pd.read_csv()                                       SQLContext(sc).read.csv()
df.to_csv()                                         df.toPandas().to_csv()
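Note that df.toPandas().to_csv() collects the whole dataset to the driver; Spark's native writer avoids that. A sketch with illustrative file paths:

    import pandas as pd

    # Pandas round trip
    pdf = pd.read_csv('data.csv')
    pdf.to_csv('out.csv', index=False)

    # PySpark: read on the session, write with the native (distributed) writer
    sdf = spark.read.csv('data.csv', header=True, inferSchema=True)
    sdf.write.csv('out_dir', header=True, mode='overwrite')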

Indexing and Splitting:

PANDAS                                              PYSPARK
df.loc[ ]                                           df.randomSplit(weights=[ ], seed=n)
df.iloc[ ]
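PySpark DataFrames carry no row index, so there is no direct .loc/.iloc analogue; randomSplit divides rows by weight instead. A sketch using the frames defined above:

    # Pandas: label-based and position-based access
    rows = pdf.loc[0:2]           # rows by index label
    cell = pdf.iloc[0, 0]         # row/column by position

    # PySpark: split rows randomly by weight (weights are normalized)
    train, test = sdf.randomSplit([0.8, 0.2], seed=42)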

Inspect Data:

PANDAS                                              PYSPARK
df.head()                                           df.show()
df.head(n)
df.columns                                          df.printSchema()
                                                    df.columns
df.shape                                            df.count()
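One behavioral difference worth noting: pandas head() returns a DataFrame, while Spark's show() only prints. A quick sketch:

    # Pandas
    pdf.head(3)          # first rows, returned as a DataFrame
    pdf.shape            # (n_rows, n_cols)

    # PySpark
    sdf.show(3)          # prints rows; returns None
    sdf.printSchema()    # column names and types
    sdf.count()          # row count (triggers a Spark job)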
Handling Duplicate Data:

PANDAS                                              PYSPARK
df['col'].unique()                                  df.distinct().count()
df.duplicated()
df.drop_duplicates()                                df.dropDuplicates()
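In pandas, .unique() lives on a Series (hence df['col'] above), and duplicated() returns a boolean mask. A sketch:

    # Pandas
    distinct_vals = pdf['col2'].unique()        # array of distinct values
    dup_mask = pdf.duplicated()                 # True for repeated rows
    deduped = pdf.drop_duplicates()

    # PySpark
    n_distinct = sdf.distinct().count()         # count distinct whole rows
    deduped_s = sdf.dropDuplicates(['col2'])    # optional column subset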

Rename Columns:

PANDAS                                              PYSPARK
df.rename(columns={"old_col": "new_col"})           df.withColumnRenamed("old_col", "new_col")

Handling Missing Data:

PANDAS                                              PYSPARK
df.dropna()                                         df.na.drop()
df.fillna()                                         df.na.fill()
df.replace()                                        df.na.replace()
df['col'].isna()                                    df.col.isNull()
df['col'].isnull()
df['col'].notna()                                   df.col.isNotNull()
df['col'].notnull()
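Both libraries accept per-column arguments for dropping and filling. A sketch:

    from pyspark.sql.functions import col

    # Pandas
    cleaned = pdf.dropna(subset=['col1'])       # drop rows with NaN in col1
    filled  = pdf.fillna({'col1': 0})           # per-column fill values

    # PySpark: .na exposes the DataFrameNaFunctions helpers
    cleaned_s = sdf.na.drop(subset=['col1'])
    filled_s  = sdf.na.fill({'col1': 0})
    null_rows = sdf.filter(col('col1').isNull())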

Common Column Functions:

PANDAS                                              PYSPARK
df["col"] = df["col"].str.lower()                   df = df.withColumn('col', lower(df.col))
df["col"] = df["col"].str.replace()                 df = df.select('*', regexp_replace().alias())
                                                    df = df.select('*', regexp_extract().alias())
df["col"] = df["col"].str.split()                   df = df.withColumn('col', split('col'))
df["col"] = df["col"].str.join()                    df = df.withColumn('col', UDF_JOIN(df.col, lit(' ')))
df["col"] = df["col"].str.strip()                   df = df.withColumn('col', trim(df.col))
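UDF_JOIN above stands in for a user-defined join function; since Spark 2.4 the built-in array_join covers the same need. A sketch with illustrative patterns:

    from pyspark.sql.functions import lower, trim, split, regexp_replace, array_join

    # Pandas: vectorized string methods under the .str accessor
    pdf['col2'] = pdf['col2'].str.lower().str.strip()
    pdf['parts'] = pdf['col2'].str.split('-')

    # PySpark: column functions, composed inside withColumn/select
    sdf = sdf.withColumn('col2', trim(lower(sdf.col2)))
    sdf = sdf.withColumn('col2', regexp_replace('col2', '-', ' '))
    sdf = sdf.withColumn('parts', split('col2', ' '))
    sdf = sdf.withColumn('joined', array_join('parts', ' '))   # built-in alternative to UDF_JOIN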

Apply User Defined Functions:

PANDAS                                              PYSPARK
df['col'] = df['col'].map(UDF)                      df = df.withColumn('col', UDF(df.col))
df.apply(f)                                         df = df.withColumn('col', when(cond, UDF(df.col)).otherwise())
df.applymap(f)
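In PySpark a plain Python function must be wrapped with udf() and given a return type before it can be applied. A sketch with a hypothetical upper-casing function:

    from pyspark.sql.functions import udf, when
    from pyspark.sql.types import StringType

    def shout(s):                     # illustrative function
        return s.upper() if s is not None else None

    # Pandas: element-wise over one column
    pdf['col2'] = pdf['col2'].map(shout)

    # PySpark: wrap first, then apply; optionally guard with when/otherwise
    shout_udf = udf(shout, StringType())
    sdf = sdf.withColumn('col2', shout_udf(sdf.col2))
    sdf = sdf.withColumn('col2',
                         when(sdf.col1 > 1, shout_udf(sdf.col2)).otherwise(sdf.col2))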

Join two dataset columns:

PANDAS                                              PYSPARK
df['new_col'] = df['col1'] + df['col2']             df = df.withColumn('new_col', concat_ws(' ', df.col1, df.col2))
                                                    df.select('*', concat(df.col1, df.col2).alias('new_col'))
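One difference: concat returns null if any input is null, while concat_ws skips nulls. A sketch (casting the integer column first on the pandas side):

    from pyspark.sql.functions import concat_ws

    # Pandas: + concatenates string Series element-wise
    pdf['new_col'] = pdf['col1'].astype(str) + '_' + pdf['col2']

    # PySpark: concat_ws casts inputs to string and skips nulls
    sdf = sdf.withColumn('new_col', concat_ws('_', sdf.col1, sdf.col2))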
Convert dataset column to a list:

PANDAS                                              PYSPARK
list(df['col'])                                     df.select('col').rdd.flatMap(lambda x: x).collect()
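Both idioms pull the whole column onto one machine; in Spark that means collecting to the driver, so watch the size. A list comprehension over collected Rows is an equivalent idiom:

    # Pandas
    vals = pdf['col2'].tolist()

    # PySpark: both forms collect the column to the driver
    vals = sdf.select('col2').rdd.flatMap(lambda x: x).collect()
    vals = [row['col2'] for row in sdf.select('col2').collect()]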

Filter Dataset:

PANDAS                                              PYSPARK
df = df[df['col'] != " "]                           df = df[df['col'] == val]
                                                    df = df.filter(df['col'] == val)
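PySpark's filter and where are synonyms, and bracket indexing with a boolean column also works. A sketch:

    from pyspark.sql.functions import col

    # Pandas: boolean-mask indexing
    subset = pdf[pdf['col1'] > 1]

    # PySpark: equivalent forms
    subset_s = sdf.filter(col('col1') > 1)
    subset_s = sdf.where(sdf.col1 > 1)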

Select Columns:

PANDAS                                              PYSPARK
df = df[['col1', 'col2', 'col3']]                   df = df.select('col1', 'col2', 'col3')

Drop Columns:

PANDAS                                              PYSPARK
df.drop(['B', 'C'], axis=1)                         df.drop('col1', 'col2')
df.drop(columns=['B', 'C'])

Grouping Data:

PANDAS                                              PYSPARK
df.groupby(by=['col1', 'col2']).count()             df.groupBy('col').count().show()
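Beyond count(), both sides take named aggregations. A sketch:

    from pyspark.sql.functions import count, avg

    # Pandas: groupby then aggregate
    stats = pdf.groupby('col2')['col1'].agg(['count', 'mean'])

    # PySpark: agg with aliased column functions
    stats_s = sdf.groupBy('col2').agg(count('*').alias('n'),
                                      avg('col1').alias('mean_col1'))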

Combining Data:

PANDAS                                              PYSPARK
pd.concat([df1, df2])                               df1.union(df2)
df1.append(df2)  (deprecated; use pd.concat)
df1.join(df2)                                       df1.join(df2)
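Spark's union matches columns by position, while unionByName matches by name; and where pandas join aligns on the index by default, a Spark join needs an explicit key or condition. A sketch:

    # Pandas: stack rows
    both = pd.concat([pdf, pdf], ignore_index=True)

    # PySpark: positional vs. name-based union
    both_s = sdf.union(sdf)
    both_s = sdf.unionByName(sdf)

    # Joins: pandas merges on a key column; Spark joins need an explicit key
    merged = pdf.merge(pdf, on='col1', suffixes=('', '_b'))
    merged_s = sdf.join(sdf.withColumnRenamed('col2', 'col2_b'), on='col1')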

Cartesian Product:

PANDAS                                              PYSPARK
df1['key'] = 1                                      df1.crossJoin(df2)
df2['key'] = 1
df1.merge(df2, how='outer', on='key')
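Since pandas 1.2, merge(how='cross') does this directly, without the dummy-key trick:

    # Pandas >= 1.2: direct cross join
    prod = pdf.merge(pdf, how='cross')

    # PySpark
    prod_s = sdf.crossJoin(sdf)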

Sorting Data:

PANDAS                                              PYSPARK
df.sort_values()                                    df.sort()
df.sort_index()                                     df.orderBy()
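sort and orderBy are synonyms in PySpark; sort direction is set per column. A sketch:

    from pyspark.sql.functions import col

    # Pandas: sort by values, descending
    ordered = pdf.sort_values('col1', ascending=False)

    # PySpark
    ordered_s = sdf.orderBy(col('col1').desc())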
