Learn Data Analysis With Python

The document discusses techniques for cleaning data in Python, including handling missing values, outliers, and duplicates. For missing values, it demonstrates dropping rows, replacing with 0, mean, median or mode, and selecting only rows without missing values. For outliers, it shows standardizing based on z-scores or using the interquartile range to identify outliers. For duplicates, it provides code to identify, display only, and remove duplicates while keeping the last observation. The goal is to clean raw data by addressing issues like missing values, outliers and duplicates to prepare for analysis.

Uploaded by

Anjali Agarwal


Learn Data Analysis with Python: Find Out the Practical Code for Data Cleaning

Introduction

If you want to apply for a data analyst or data scientist role, you need to know at least one of the programming languages used in such roles, for example R, Python, or Scala. I have chosen Python for data analysis.

Data preparation is the most important step in winning the battle of a data analysis project. This document shows how to clean data (missing values, outliers, duplicates) with Python.

Raw data is full of impurities such as outliers, missing values, and duplicates. Cleaning it means making it logical, meaningful, and consistent.

Missing Data

Missing data is one of the most common issues, and there are many ways to handle it. Let us look at some practical code.

Image by Author (graderecord.csv)

This is a CSV file with two missing values in the Grade column. Let us now look at practical code for handling this kind of missing information.

import pandas as pd
df = pd.read_csv("graderecord.csv")
df.head(10)
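Before choosing a treatment, it can help to check how many values are actually missing and where. The sketch below is only an illustration: graderecord.csv is not reproduced here, so the small DataFrame and its names are hypothetical stand-ins with two missing grades.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for graderecord.csv: two grades are missing.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli"],
    "Grade": [88.0, np.nan, 92.0, np.nan, 75.0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Counting first makes it easier to judge whether dropping rows would throw away too much data.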

Code: Drop Rows with Missing values

df_no_missing = df.dropna()

df_no_missing

Code: Replace Missing Values with 0

df.fillna(0)

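Code: Replace Missing Values with the Mean, Median, or Mode of the Column

# Pick ONE of the three options: once the gaps are filled, the later lines have nothing left to replace.
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())
df["Grade"] = df["Grade"].fillna(df["Grade"].median())
# mode() returns a Series (ties are possible), so take its first value.
df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])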
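To see how the three choices differ, here is a minimal sketch on a small hypothetical Series (the values are made up for illustration and are not taken from graderecord.csv). Each fill is written to its own variable so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical grades with one missing value at position 2.
grades = pd.Series([50.0, 50.0, np.nan, 70.0, 100.0, 130.0], name="Grade")

# fillna returns a new Series, so the original stays untouched.
mean_filled = grades.fillna(grades.mean())      # mean of the 5 known values
median_filled = grades.fillna(grades.median())  # middle of the 5 known values
mode_filled = grades.fillna(grades.mode()[0])   # most frequent known value

print(mean_filled[2], median_filled[2], mode_filled[2])
```

The mean is pulled upward by the large value 130, while the median and mode are not, which is one reason the median is often preferred for skewed data.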

Code: Selecting Rows with No Missing values

df[df["Grade"].notnull()]

Code: If a Column Has Only Empty Values, Drop the Completely Empty Column

# Add a column with empty values
import numpy as np
df["newcol"] = np.nan
df.head()

# Drop the empty column
df.dropna(axis=1, how="all")
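The how argument decides how strict the column drop is. The following sketch, on a hypothetical three-column DataFrame, contrasts how="all" (drop a column only if every value is missing) with how="any" (drop a column if at least one value is missing).

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Grade" has one gap, "newcol" is completely empty.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Grade": [88.0, np.nan, 92.0],
    "newcol": [np.nan, np.nan, np.nan],
})

# Only the completely empty column is removed.
only_empty_dropped = df.dropna(axis=1, how="all")

# Every column with at least one missing value is removed.
any_missing_dropped = df.dropna(axis=1, how="any")

print(list(only_empty_dropped.columns))
print(list(any_missing_dropped.columns))
```

With how="any" the partly filled Grade column is lost as well, so how="all" is usually the safer choice when the goal is only to remove empty columns.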

Outlier Treatment

In this blog, I consider two ways of treating outliers.

Method 1: If the data is normally distributed, we can use the standardization (z-score) method. For a 95% confidence level (alpha = 5%), the two-sided z-score is 1.96, which means about 95% of the data lies within 1.96 standard deviations of the mean. We can therefore drop the values below or above this range.

Image by Author (gradedata.csv)

import pandas as pd
df = pd.read_csv("gradedata.csv")
meangrade = df["grade"].mean()
stdgrade = df["grade"].std()
higherrange = meangrade + stdgrade * 1.96
lowerrange = meangrade - stdgrade * 1.96
df = df.drop(df[df["grade"] > higherrange].index)
df = df.drop(df[df["grade"] < lowerrange].index)
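The same filter can be wrapped in a small reusable function. This is a sketch under assumptions: the function name drop_zscore_outliers and the sample grades are hypothetical, and it keeps rows with a boolean mask instead of calling drop twice.

```python
import pandas as pd

def drop_zscore_outliers(df, column, z=1.96):
    """Keep rows whose value lies within z standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # pandas uses the sample standard deviation (ddof=1)
    lower, upper = mean - z * std, mean + z * std
    # between() includes both bounds; rows with a missing value are also dropped.
    return df[df[column].between(lower, upper)]

# Hypothetical grades with one obviously extreme value.
df = pd.DataFrame({"grade": [70, 72, 74, 75, 76, 78, 80, 150]})
cleaned = drop_zscore_outliers(df, "grade")
print(cleaned)
```

Note that the extreme value itself inflates the mean and standard deviation used for the cut-off, which is one reason the IQR method is often more robust.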

Method 2: In this method, we use the interquartile range (IQR). The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Any value lower than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is considered an outlier.

q1 = df["grade"].quantile(.25)
q3 = df["grade"].quantile(.75)
iqr = q3 - q1
highrange = q3 + iqr * 1.5
lowrange = q1 - iqr * 1.5
df = df.drop(df[df["grade"] > highrange].index)
df = df.drop(df[df["grade"] < lowrange].index)
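As a worked example of the IQR rule, the sketch below computes the two bounds for a small hypothetical set of grades and lists the values that fall outside them. The helper name iqr_bounds is an assumption made for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical grades with one obviously extreme value.
grades = pd.Series([70, 72, 74, 75, 76, 78, 80, 150])
low, high = iqr_bounds(grades)

# Values outside the bounds are treated as outliers.
outliers = grades[(grades < low) | (grades > high)]
print(low, high)
print(outliers)
```

Here Q1 = 73.5 and Q3 = 78.5, so IQR = 5 and the bounds are 66 and 86; only the value 150 lies outside them.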

Finding Duplicates

Using Python libraries, we can identify duplicate rows as well as unique rows of a data set.

# Creating a dataset with duplicates
import pandas as pd
Emp = ["Jane", "Johny", "Boby", "Jane", "Mary", "Jony", "Melica", "Melica"]
Salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
SalaryList = list(zip(Emp, Salary))
df = pd.DataFrame(data=SalaryList, columns=["Emp", "Salary"])

# Displaying only the duplicates in the DataFrame
# (df.duplicated() alone returns a True/False mask, so use it to filter the rows)
df[df.duplicated()]

# Displaying the dataset without duplicates
df.drop_duplicates()

# Drop rows with a duplicate Emp, keeping the last observation
df.drop_duplicates(["Emp"], keep="last")
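The difference between full-row duplicates and duplicates in a single column is easy to miss. The sketch below reuses the Emp/Salary data from this section; the variable names are chosen for illustration. It shows which rows each call selects.

```python
import pandas as pd

df = pd.DataFrame({
    "Emp": ["Jane", "Johny", "Boby", "Jane", "Mary", "Jony", "Melica", "Melica"],
    "Salary": [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000],
})

# Full-row duplicates: only the second "Jane, 9500" row repeats an earlier row.
full_dupes = df[df.duplicated()]

# Every row whose Emp value occurs more than once (keep=False flags all copies).
all_emp_dupes = df[df.duplicated(subset=["Emp"], keep=False)]

# One row per Emp, keeping the last observation of each.
last_per_emp = df.drop_duplicates(subset=["Emp"], keep="last")

print(full_dupes)
print(all_emp_dupes)
print(last_per_emp)
```

The two Melica rows have different salaries, so they are duplicates only when the comparison is restricted to the Emp column.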
