
Data Preprocessing in Python - Handling Missing Data

The document discusses techniques for handling missing data in Python. It describes seven techniques: data removal, statistical imputation using mean or median, manual filling based on observation, filling with most repeated value, random filling within data range, regression analysis filling, and finding relationships between variables. Examples are provided using Pandas to demonstrate statistical imputation and regression techniques.

Uploaded by

reyesward085
Data Preprocessing in Python - Handling Missing Data


The Click Reader
5 min read · Sep 21, 2021

Data pre-processing is a series of data preparation steps used to remove
unwanted noise from a dataset and keep only the data that is needed. Learn
how to preprocess data in this article by reading about seven different ways
to handle missing data in Python.

A common rule of thumb is that almost 80% of one’s time is spent on
pre-processing data, whereas only 20% is used to build the actual ML model
itself. Data pre-processing is therefore a vital step in building intelligent,
robust ML models.

Techniques For Handling Missing Data


Data may not always be complete, i.e. some of the values in the data may be
missing or null. There are several ways to handle the missing data and make
the dataset complete.

In the following example, the ‘Years of Experience’ of ‘Employee’ is
missing. Also, the ‘Salary (in USD per year)’ of ‘Junior Manager’ is missing.

import pandas as pd

# Creating the dataframe described above
df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Viewing the contents of the dataframe
df.head()
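
Before choosing a technique, it can help to confirm where the gaps are. The following is a minimal sketch, not part of the original walkthrough: it rebuilds the same example data under the hypothetical name `demo_df` (so it does not overwrite `df`) and uses `isna()` to count and locate the missing values.

```python
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Count the missing values in each column
missing_counts = demo_df.isna().sum()
print(missing_counts)

# Show only the rows that contain at least one missing value
rows_with_missing = demo_df[demo_df.isna().any(axis=1)]
print(rows_with_missing)
```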

Some of the ways to handle missing data are listed below:

1. Data Removal

Remove the rows (data points) with missing data from the dataset. However,
this technique shrinks the available dataset, which in turn makes the
results less robust if the dataset is small to begin with.

# Dropping the rows at index 2 and 3 (the rows with missing values)
dropped_df = df.drop([2, 3], axis=0)

# Viewing the dataframe
dropped_df
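
Dropping rows by hard-coded index works for a five-row example, but it assumes you already know which rows are incomplete. As a hedged alternative sketch, pandas' `dropna()` removes every row that has a missing value without naming the indices. The `demo_df` and `auto_dropped_df` names are hypothetical and only used for illustration.

```python
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Drop every row that contains at least one missing value
auto_dropped_df = demo_df.dropna(axis=0)
print(auto_dropped_df)
```

On this dataset the result should contain the same three rows as the manual drop above.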

2. Fill missing value through statistical imputation

Fill the missing data by taking the mean or median of the available data
points. Generally, the median of the data points is used to fill the missing
values, as it is less affected by outliers than the mean. Here, we have used
the mean to fill the missing data; replacing .mean() with .median() uses the
median instead.

# Filling each column with its mean value
df['Years of Experience'] = df['Years of Experience'].fillna(df['Years of Experience'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Viewing the dataframe
df
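
For comparison, here is a minimal sketch of the median variant. It works on a fresh copy of the example data (the hypothetical `demo_df`) so that it does not depend on whether `df` has already been filled, and it fills both numeric columns in one call by passing the column medians to `fillna()`.

```python
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

numeric_cols = ['Years of Experience', 'Salary']

# Work on a copy so the original data stays untouched
median_filled_df = demo_df.copy()

# median() skips missing values; fillna() matches each median to its column
column_medians = median_filled_df[numeric_cols].median()
median_filled_df[numeric_cols] = median_filled_df[numeric_cols].fillna(column_medians)
print(median_filled_df)
```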

3. Fill missing value using observation

Manually fill in the missing data from observation. This is sometimes
possible for small datasets, but it is very difficult for larger ones.
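
A manual fill might look like the sketch below. The two values written in (2 years and 60000) are placeholders chosen purely for illustration; in practice they would come from whatever you observed or looked up. The `demo_df` and `manual_df` names are also hypothetical.

```python
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

manual_df = demo_df.copy()

# Placeholder values for illustration only; replace with observed values
manual_df.loc[manual_df['Job Position'] == 'Employee', 'Years of Experience'] = 2
manual_df.loc[manual_df['Job Position'] == 'Junior Manager', 'Salary'] = 60000
print(manual_df)
```

Selecting the row by its 'Job Position' rather than by index number keeps the edit readable and makes it harder to overwrite the wrong row.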

4. Fill in the most repeated value

Fill in the missing value using the most repeated value (the mode) in the
column. This is done when most of the data is repeated and there is good
reason to do so. Since there are no repeated values in the example, we can
fill it with any one of the numbers in the respective column.
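
One way to sketch this is with pandas' `mode()`, which returns the most frequent values of a column in sorted order. Because every value in the example occurs once, all of them tie, and taking the first entry simply picks the smallest one. That choice is an assumption of this sketch, as are the `demo_df` and `mode_filled_df` names.

```python
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

mode_filled_df = demo_df.copy()

for col in ['Years of Experience', 'Salary']:
    # mode() ignores missing values and returns ties in sorted order;
    # [0] takes the first of them (an arbitrary tie-break for this sketch)
    most_repeated = mode_filled_df[col].mode()[0]
    mode_filled_df[col] = mode_filled_df[col].fillna(most_repeated)

print(mode_filled_df)
```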

5. Fill in with random value within the range of available data

Take the given range of data points and fill in the data by randomly selecting
a value from the available range.
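
A minimal sketch of this idea, assuming a uniform draw between each column's minimum and maximum (the text does not specify a distribution). The fixed seed and the `demo_df` and `random_filled_df` names are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Seeded generator so the fill is reproducible between runs
rng = np.random.default_rng(seed=0)

random_filled_df = demo_df.copy()

for col in ['Years of Experience', 'Salary']:
    # Range of the values that are actually present in this column
    low, high = random_filled_df[col].min(), random_filled_df[col].max()
    missing = random_filled_df[col].isna()
    # Draw one value per missing cell, uniformly within [low, high)
    random_filled_df.loc[missing, col] = rng.uniform(low, high, size=missing.sum())

print(random_filled_df)
```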

6. Fill in by regression

Use regression analysis to find the most probable data point for filling in the
dataset.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating linear regression model
regr = LinearRegression()

# Here the target is the Salary and the feature is Years of Experience
regr.fit(train_df[['Years of Experience']], train_df[['Salary']])

# Predicting for 3 years of experience
# (a DataFrame with the training column name avoids a feature-name warning)
regr.predict(pd.DataFrame({'Years of Experience': [3]}))

Therefore, the salary for 3 years of experience by regression is 60000. Now,
let us find the years of experience based on salary.

from sklearn.linear_model import LinearRegression

# Excluding the rows with the null data
train_df = df.drop([2, 3], axis=0)

# Creating linear regression model
regr = LinearRegression()

# Here the target is the Years of Experience and the feature is Salary
regr.fit(train_df[['Salary']], train_df[['Years of Experience']])

# Predicting for 40000 salary
# (a DataFrame with the training column name avoids a feature-name warning)
regr.predict(pd.DataFrame({'Salary': [40000.0]}))

Therefore, the years of experience for a salary of 40000 is 2.
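
The predictions above are only computed, not written back into the dataframe. The sketch below shows one possible way to do that. The helper `fill_by_regression` and the `demo_df` and `regr_filled_df` names are hypothetical; the model is trained only on the rows that were complete in the original data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same example data as in the article, under a hypothetical name (demo_df)
demo_df = pd.DataFrame({
    'Job Position': ['CEO', 'Senior Manager', 'Junior Manager', 'Employee', 'Assistant Staff'],
    'Years of Experience': [5, 4, 3, None, 1],
    'Salary': [100000, 80000, None, 40000, 20000],
})

# Train only on rows where both columns are present
complete_rows = demo_df.dropna()
regr_filled_df = demo_df.copy()


def fill_by_regression(frame, train, feature, target):
    """Hypothetical helper: fill missing `target` values predicted from `feature`."""
    model = LinearRegression().fit(train[[feature]], train[target])
    # Only rows where the target is missing but the feature is known
    missing = frame[target].isna() & frame[feature].notna()
    if missing.any():
        frame.loc[missing, target] = model.predict(frame.loc[missing, [feature]])
    return frame


# Fill Salary from Years of Experience, then the other way round
fill_by_regression(regr_filled_df, complete_rows, 'Years of Experience', 'Salary')
fill_by_regression(regr_filled_df, complete_rows, 'Salary', 'Years of Experience')
print(regr_filled_df)
```

Training on `complete_rows` rather than on an already-imputed dataframe keeps filled-in guesses from influencing the fitted line.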

In Conclusion
Do you have any problems handling missing data in Python? Let us know in
the comment section below. Also, visit www.theclickreader.com to read
more articles like this.
