Data Preprocessing in Python: Handling Missing Data

The Click Reader · 5 min read · Sep 21, 2021

Data pre-processing is a series of data preparation steps used to remove
unwanted noise and extract the relevant data from a dataset. Learn how to
preprocess data in this article by reading about seven different ways to
handle missing data in Python.

A common rule of thumb holds that almost 80% of one's time is spent
pre-processing data, whereas only 20% is spent building the actual ML model.
Data pre-processing is therefore a vital step in building robust ML models.

Techniques For Handling Missing Data

Data may not always be complete; that is, some of the values may be missing
or null. There are several ways to handle the missing values and make the
data complete.

In the following example, the 'Years of Experience' of the 'Employee' is
missing, and so is the 'Salary' (in USD per year) of the 'Junior Manager'.

import pandas as pd

# Creating the example dataframe
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Viewing the contents of the dataframe
df.head()
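Before choosing a technique, it can help to see where the gaps are. The following is a minimal sketch, assuming the same example dataframe; the variable names `missing_per_column` and `rows_with_missing` are illustrative and not part of the original article.

```python
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here each numeric column has one missing value, and the affected rows are the ones at index 2 and 3.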


Some of the ways to handle missing data are listed below:

1. Data Removal

Remove the rows (data points) that contain missing data from the dataset.
However, this technique shrinks the available dataset, which makes any
result less robust if the dataset was small to begin with.

# Dropping the rows at index 2 and 3, which contain the missing values
dropped_df = df.drop([2, 3], axis=0)

# Viewing the dataframe
dropped_df
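When the positions of the missing values are not known in advance, pandas can find and drop the affected rows itself. This is a small sketch using `DataFrame.dropna` on the same example data; the name `dropped_salary_df` is illustrative.

```python
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Drop every row that has at least one missing value
dropped_df = df.dropna()

# Drop a row only when the 'Salary' column is missing
dropped_salary_df = df.dropna(subset=['Salary'])

print(dropped_df)
print(dropped_salary_df)
```

The `subset` argument is useful when only some columns matter for the analysis, since it keeps rows whose gaps are in other columns.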

2. Fill missing value through statistical imputation

Fill the missing data with the mean or median of the available data points.
Generally, the median is preferred because, unlike the mean, it is not
heavily affected by outliers. The example below uses the mean; replacing
.mean() with .median() fills with the median instead.

# Filling each column with its mean value
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Viewing the dataframe
df
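To illustrate why the median is less sensitive to outliers, here is a short sketch on made-up numbers. The salary values are hypothetical and chosen only so that one of them is extreme; they do not come from the article's dataset.

```python
import pandas as pd

# Hypothetical salaries: one extreme value and one missing entry
salaries = pd.Series([20000, 40000, 60000, 80000, 1000000, None])

# Both statistics ignore the missing entry by default
mean_value = salaries.mean()      # pulled upward by the extreme value
median_value = salaries.median()  # stays at the middle of the observed values

# Fill the gap with the median
filled_with_median = salaries.fillna(median_value)

print(mean_value, median_value)
print(filled_with_median)
```

With these numbers the mean is 240000 while the median is 60000, so a mean-filled value would sit far above four of the five observed salaries.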

3. Fill missing value using observation

Manually fill in the missing data based on observation. This may be
feasible for small datasets, but it quickly becomes impractical for larger
ones.
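A manual fill could look like the sketch below. The values 2 and 60000 are assumptions that stand in for whatever an analyst would look up or infer; they are not taken from the article.

```python
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Hypothetical values supplied by hand for the two gaps
df.loc[df['Job Position'] == 'Employee', 'Years of Experience'] = 2
df.loc[df['Job Position'] == 'Junior Manager', 'Salary'] = 60000

print(df)
```

Selecting the row by 'Job Position' rather than by index number keeps the assignment readable and independent of row order.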

4. Fill in the most repeated value

Fill in the missing value with the most repeated value (the mode) in the
column. This is appropriate when most of the values are repeated and there
is a sound reason to expect the missing value to match them. Since there
are no repeated values in the example, any one of the numbers in the
respective column could be used.
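Because the article's example has no repeated values, the sketch below uses a hypothetical column in which one value clearly dominates. The data and the names `years` and `most_common` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical 'Years of Experience' values where 3 is repeated
years = pd.Series([3, 5, 3, None, 1, 3])

# mode() ignores missing values and returns the most frequent value(s);
# take the first one in case of ties
most_common = years.mode()[0]

# Fill the gap with the most repeated value
filled = years.fillna(most_common)

print(most_common)
print(filled)
```

If several values tie for most frequent, `mode()` returns all of them in sorted order, so taking the first is an arbitrary but deterministic choice.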

5. Fill in with random value within the range of available data

Take the range of the available data points and fill in each missing value
with a value selected at random from that range.
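One possible reading of this technique is to draw from a uniform distribution between the column's minimum and maximum. The sketch below assumes that interpretation and uses NumPy's random generator; the seed is arbitrary and only there to make runs repeatable.

```python
import numpy as np
import pandas as pd

# Same example data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

column = 'Years of Experience'

# Range of the observed values (min and max skip missing entries)
low, high = df[column].min(), df[column].max()

# Draw one uniform random value per missing entry and fill it in
missing = df[column].isna()
df.loc[missing, column] = rng.uniform(low, high, size=missing.sum())

print(df[column])
```

The filled value will fall somewhere between 1 and 5 years. Sampling from the observed values themselves, instead of from the continuous range, would be a reasonable alternative.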

6. Fill in by regression

Use regression analysis to predict the most probable value for each missing
data point from the other columns.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating the linear regression model
regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(train_df[['Years of Experience']], train_df[['Salary']])

# Predicting the salary for 3 years of experience
regr.predict([[3]])

The regression therefore predicts a salary of 60000 for 3 years of
experience. Next, the same approach is used to find the years of experience
from the salary.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating the linear regression model
regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(train_df[['Salary']], train_df[['Years of Experience']])

# Predicting the years of experience for a salary of 40000
regr.predict([[40000.0]])

Therefore, the predicted years of experience for a salary of 40000 is 2.
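The two predictions above can also be written back into the dataframe without hard-coding row numbers. The following is a sketch under the assumption that each incomplete row is missing only one of the two columns; the names `complete`, `salary_model` and `experience_model` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same example data as in the article
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Train only on rows where both columns are present
complete = df.dropna(subset=['Years of Experience', 'Salary'])

# Model 1: predict Salary from Years of Experience
salary_model = LinearRegression()
salary_model.fit(complete[['Years of Experience']], complete['Salary'])

# Model 2: predict Years of Experience from Salary
experience_model = LinearRegression()
experience_model.fit(complete[['Salary']], complete['Years of Experience'])

# Fill Salary where it is missing and the feature is known
needs_salary = df['Salary'].isna() & df['Years of Experience'].notna()
df.loc[needs_salary, 'Salary'] = salary_model.predict(
    df.loc[needs_salary, ['Years of Experience']]
)

# Fill Years of Experience where it is missing and Salary was originally known
needs_experience = df['Years of Experience'].isna() & ~needs_salary
df.loc[needs_experience, 'Years of Experience'] = experience_model.predict(
    df.loc[needs_experience, ['Salary']]
)

print(df)
```

Both models are fitted before any value is filled in, so neither is trained on a value that the other one predicted.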

In Conclusion

Do you have any problems handling missing data in Python? Let us know in
the comment section below. Also, visit www.theclickreader.com to read
more articles like this.
