Handling Missing Values in Python

Real-world data is messy and often contains a lot of missing values. There can be multiple reasons for the missing values; the main mechanisms of missing-ness are outlined in the next section. Either way, we need to address this issue before we proceed with modelling. It is also important to note that some algorithms,
like XGBoost and LightGBM, can treat missing data without any pre-processing.

Syed Afroz Ali (Data Scientist)
https://www.kaggle.com/pythonafroz
https://www.linkedin.com/in/syed-afroz-70939914/
Reasons for Missing Values
Before we start treating the missing values, it is important to understand the various reasons for the missing-ness in data. Broadly speaking, there
can be three possible reasons:

1. Missing Completely at Random (MCAR)


The missing values on a given variable (Y) are not associated with other variables in a given data set or with the variable (Y) itself. In other
words, there is no particular reason for the missing values.

2. Missing at Random (MAR)


MAR occurs when the missing-ness is not random, but can be fully accounted for by other variables for which there is complete
information.

3. Missing Not at Random (MNAR)


Missing-ness depends on unobserved data or the value of the missing data itself.
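
To make these three mechanisms concrete, here is a minimal simulation sketch on toy data (the column names and probabilities are illustrative assumptions, not part of any real dataset):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'income': rng.normal(50, 10, 1000),
                   'age': rng.integers(20, 70, 1000)})

# MCAR: a random 10% of income values go missing, independent of everything
mcar = df.copy()
mcar.loc[rng.random(1000) < 0.10, 'income'] = np.nan

# MAR: income is more likely to be missing for younger respondents,
# i.e. missing-ness depends only on the fully observed 'age' column
mar = df.copy()
mar.loc[(df['age'] < 30) & (rng.random(1000) < 0.4), 'income'] = np.nan

# MNAR: high earners are more likely to withhold income,
# i.e. missing-ness depends on the (unobserved) value itself
mnar = df.copy()
mnar.loc[(df['income'] > 60) & (rng.random(1000) < 0.4), 'income'] = np.nan
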
Deletions

Deletion means removing the missing values from a dataset. This is, however, generally not recommended, as it can result in a loss of information. We should only delete missing values from a dataset if their proportion is very small. Deletions are of three types:
Pairwise Deletion
Pairwise deletion is used when values are missing completely at random, i.e. MCAR. During pairwise deletion, only the missing values are excluded from each individual calculation, so every computation still uses all the data available to it. Most operations in pandas, such as mean and sum, intrinsically skip missing values.
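
A quick illustration of that default behaviour on toy data (pandas' skipna=True default is what does the pairwise work):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 35, 41],
                   'Fare': [7.25, 71.3, np.nan, 13.0]})

# Each statistic ignores only the NaNs in its own column (skipna=True by default)
print(df['Age'].mean())   # average of 22, 35 and 41
print(df.sum())           # column-wise sums over the non-missing values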

Listwise Deletion / Dropping rows


During listwise deletion, complete rows (which contain the missing values) are deleted. As a result, it is also called complete-case deletion.
Like pairwise deletion, listwise deletion is also only used for MCAR values.

# Drop rows which contain any NaN or missing value in the Age column
train_1.dropna(subset=['Age'], how='any', inplace=True)
train_1['Age'].isnull().sum()

The Age column no longer has any missing values. A major disadvantage of listwise deletion is that a major chunk of data, and hence a lot of
information, is lost. Hence, it is advisable to use it only when the number of missing values is very small.

Dropping complete columns


If a column contains a lot of missing values, say more than 80%, and the feature is not significant, you might want to delete that feature.
However, again, deleting data is generally not a good methodology.
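
A minimal sketch of this rule, assuming the same train DataFrame used in the other examples and a hypothetical 80% cut-off:

# Drop any column whose fraction of missing values exceeds the cut-off
threshold = 0.8
missing_fraction = train.isnull().mean()   # per-column share of NaNs
cols_to_drop = missing_fraction[missing_fraction > threshold].index
train_reduced = train.drop(columns=cols_to_drop)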

Imputation Techniques for non-Time Series Problems

Imputation refers to replacing missing data with substituted values. There are many ways in which missing values can be imputed, depending
upon the nature of the problem and the data. Broadly, imputation techniques can be classified as follows:

Basic Imputation Techniques
 Imputing with a constant value
 Imputation using a statistic (mean, median or most frequent) of each column in which the missing values are located

For this we shall use the SimpleImputer class from sklearn.
# Imputing with a constant (for numeric data, SimpleImputer's default fill_value is 0)
from sklearn.impute import SimpleImputer

train_constant = train.copy()
constant_imputer = SimpleImputer(strategy='constant')
train_constant.iloc[:, :] = constant_imputer.fit_transform(train_constant)
train_constant.isnull().sum()

# Imputing with the most frequent value (strategy can also be 'mean' or 'median')
from sklearn.impute import SimpleImputer

train_most_frequent = train.copy()
most_frequent_imputer = SimpleImputer(strategy='most_frequent')
train_most_frequent.iloc[:, :] = most_frequent_imputer.fit_transform(train_most_frequent)
train_most_frequent.isnull().sum()

Imputation Techniques for Time Series Problems


Now let's look at ways to impute data in a typical time series problem. Tackling missing values in time series problems is a bit different. Methods such as
ffill(), bfill() and interpolate() are used for imputing missing values in such problems.
 Basic Imputation Techniques

 'ffill' or 'pad' - replace NaNs with the last observed value
 'bfill' or 'backfill' - replace NaNs with the next observed value
 Linear interpolation method

Time Series dataset
The dataset is called Air Quality Data in India (2015 - 2020) and it contains air quality data and AQI (Air Quality Index) at hourly and daily
level for various stations across multiple cities in India. The dataset has a lot of missing values and is a classic time series problem.

city_day['Xylene'][50:64]

# Forward-fill: propagate the last observed value forwards into the gaps
city_day = city_day.ffill()
city_day['Xylene'][50:65]
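
For completeness, the backward-fill variant works the same way in the other direction. A small sketch, assuming a fresh, untouched copy of the data (city_day_raw is a hypothetical name), since the frame above has already been forward-filled:

# Backward-fill: propagate the next observed value backwards into the gaps
city_day_bfill = city_day_raw.copy()   # hypothetical untouched copy of the original data
city_day_bfill = city_day_bfill.bfill()
city_day_bfill['Xylene'][50:65]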

Imputation using the Linear Interpolation method
Time series data often varies considerably over time, so imputing with backfill and forward fill isn't always the best possible solution to the
missing value problem. A more apt alternative is to use interpolation methods, where the gaps are filled with gradually incrementing or
decrementing values.

Linear interpolation is an imputation technique that assumes a linear relationship between data points and utilises non-missing values from
adjacent data points to compute a value for a missing data point.

city_day1['Xylene'][50:65]

# Interpolate using the linear method (the default), filling in both directions
city_day1.interpolate(limit_direction="both", inplace=True)
city_day1['Xylene'][50:65]

Advanced Imputation Techniques
Advanced imputation techniques use machine learning algorithms to impute the missing values in a dataset, unlike the previous techniques,
which used statistics of the same column (or neighbouring observations) to fill the missing values.

K-Nearest Neighbour Imputation


The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbours approach. Each missing value is
imputed using values from the n_neighbors nearest neighbours that have a value for the feature. The feature values of the neighbours are averaged,
either uniformly or weighted by the distance to each neighbour.
from sklearn.impute import KNNImputer

train_knn = train.copy(deep=True)

# Note: with a single column there are no other features to compute distances on,
# so this effectively falls back to a mean-like fill
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
train_knn['Age'] = knn_imputer.fit_transform(train_knn[['Age']])
train_knn['Age'].isnull().sum()
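
KNN imputation is most useful when several features are passed, since the distances are computed on the other observed columns. A sketch, assuming the train DataFrame has Titanic-style numeric columns (the column names here are assumptions):

from sklearn.impute import KNNImputer

train_knn = train.copy(deep=True)
numeric_cols = ['Age', 'Fare', 'Pclass', 'SibSp']   # assumed numeric columns

# Distances are computed on the observed entries of these columns, and each
# missing value is averaged over its 5 nearest neighbours, weighted by distance
knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
train_knn[numeric_cols] = knn_imputer.fit_transform(train_knn[numeric_cols])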

Multivariate feature imputation - Multivariate Imputation by Chained Equations (MICE)


A strategy for imputing missing values by modelling each feature that has missing values as a function of the other features, in a round-robin
fashion: at each step a regression is fitted on the observed data and used to predict the missing entries. Full MICE repeats this to produce
multiple imputed datasets and pools the results; sklearn's IterativeImputer performs a single such chained imputation and is used as follows
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

train_mice = train.copy(deep=True)

# As with KNN, a single column gives the chained model nothing to regress on,
# so this effectively reduces to the initial (mean) fill
mice_imputer = IterativeImputer()
train_mice['Age'] = mice_imputer.fit_transform(train_mice[['Age']])
train_mice['Age'].isnull().sum()
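
The round-robin modelling only has something to work with when several columns are passed. A multi-column sketch, using the same assumed columns as above:

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

train_mice = train.copy(deep=True)
numeric_cols = ['Age', 'Fare', 'Pclass', 'SibSp']   # assumed numeric columns

# Each column with missing values is regressed on the others, round-robin,
# until the imputations stabilise or max_iter is reached
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
train_mice[numeric_cols] = mice_imputer.fit_transform(train_mice[numeric_cols])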

Algorithms which handle missing values
Some algorithms, like XGBoost and LightGBM, can handle missing values without any pre-processing, by supplying relevant parameters.

https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
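
A minimal sketch of this with XGBoost's sklearn wrapper (toy data; the NaNs are left in place and no imputation step is performed):

import numpy as np
from xgboost import XGBClassifier

# Toy feature matrix with missing entries left as NaN
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0],
              [4.0, 6.0]])
y = np.array([0, 1, 0, 1])

# During training, each tree split learns a default direction for missing values
model = XGBClassifier(n_estimators=10)
model.fit(X, y)
print(model.predict(X))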
