Handling Missing Values in Python
Real-world data is messy and often contains many missing values. There can be several reasons for the missing values, but primarily the
reason for missingness can be attributed to
Either way, we need to address missing values before we proceed with modelling. It is also important to note that some algorithms,
such as XGBoost and LightGBM, can handle missing data without any pre-processing.
Deletion means removing the missing values from a dataset. This is generally not recommended, as it can result in a loss of information from the
dataset. We should delete missing values only when their proportion is very small. Deletion is further of three types:
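Before deleting anything, it helps to check how large the proportion of missing values actually is. The snippet below is a minimal sketch on a made-up DataFrame; the column names and the 25% threshold are assumptions chosen for illustration, not values from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, np.nan, 53.10],
    "Pclass": [3, 1, 3, 1, 2],
})

# Fraction of missing values in each column
missing_fraction = df.isnull().mean()
print(missing_fraction)

# Assumed threshold: only consider deletion when few values are missing
threshold = 0.25
deletion_candidates = missing_fraction[
    (missing_fraction > 0) & (missing_fraction <= threshold)
].index.tolist()
print(deletion_candidates)
```

What counts as "very small" depends on the dataset and the problem, so the threshold should be treated as a tunable choice.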
Syed Afroz Ali (Data Scientist)
https://www.kaggle.com/pythonafroz
https://www.linkedin.com/in/syed-afroz-70939914/
Pairwise Deletion
Pairwise deletion is used when values are missing completely at random (MCAR). During pairwise deletion, only the missing values are
skipped; all observed values are still used. Most pandas operations, such as mean and sum, skip missing values by default.
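The default skipping behaviour in pandas can be seen on a small example. This is a sketch on made-up data; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, np.nan],
})

# skipna=True is the default, so NaN entries are ignored
age_mean = df["Age"].mean()   # uses 20 and 40 only
fare_sum = df["Fare"].sum()   # uses 10 and 20 only

# With skipna=False the missing value propagates into the result
age_mean_strict = df["Age"].mean(skipna=False)

print(age_mean, fare_sum, age_mean_strict)
```

Each statistic uses every value that is available for its own column, so no full row is discarded.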
# Listwise deletion: drop every row that has a missing value in the Age column
train_1.dropna(subset=['Age'], how='any', inplace=True)
train_1['Age'].isnull().sum()
The Age column no longer has any missing values. A major disadvantage of listwise deletion is that a large chunk of the data, and hence a lot of
information, is lost. Hence, it is advisable to use it only when the number of missing values is very small.
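The amount of data lost through listwise deletion is easy to measure by comparing row counts before and after. The following is a minimal sketch on a made-up DataFrame, not on the dataset used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

rows_before = len(df)

# Listwise deletion: a row is dropped if any of its values is missing
complete_cases = df.dropna(how="any")
rows_after = len(complete_cases)

# Restricting the check to one column keeps more rows
age_only = df.dropna(subset=["Age"])

print(rows_before, rows_after, len(age_only))
```

Here only one of the four rows is complete, which shows how quickly listwise deletion can shrink a dataset.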
Imputation refers to replacing missing data with substituted values. There are many ways in which missing values can be imputed,
depending upon the nature of the problem and the data. Imputation techniques can be broadly
classified as follows:
For this we shall use the SimpleImputer class from sklearn.
# Imputing with a constant
from sklearn.impute import SimpleImputer
train_constant = train.copy()
# With no fill_value given, the default is 0 for numeric data and "missing_value" for strings
constant_imputer = SimpleImputer(strategy='constant')
train_constant.iloc[:,:] = constant_imputer.fit_transform(train_constant)
train_constant.isnull().sum()
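The same idea can be tried on a small numeric DataFrame with an explicit fill value. This is a sketch under assumptions: the data is made up, and the sentinel value -1.0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, np.nan],
})

# Replace every missing value with an assumed sentinel of -1.0
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
imputed = pd.DataFrame(
    constant_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
print(imputed.isnull().sum())
```

Wrapping the result in a new DataFrame keeps the original column names, since fit_transform returns a plain NumPy array by default.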
# Inspect a slice that contains missing values
city_day['Xylene'][50:65]
# Forward fill: propagate the last valid observation forward
city_day.ffill(inplace=True)
city_day['Xylene'][50:65]
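Forward filling is easier to follow on a tiny series. The sketch below uses made-up values on a daily index; it is not the air quality data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Each gap takes the most recent observed value
filled = readings.ffill()
print(filled)

# A gap at the very start has nothing before it, so it stays missing
leading = pd.Series([np.nan, 2.0, np.nan]).ffill()
print(leading)
```

Because leading gaps survive a forward fill, it is worth re-checking isnull().sum() afterwards.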
Linear interpolation is an imputation technique that assumes a linear relationship between data points and utilises non-missing values from
adjacent data points to compute a value for a missing data point.
city_day1['Xylene'][50:65]
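Linear interpolation as described above can be sketched with the pandas interpolate method. The series here is made up for illustration, and method="linear" treats the points as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-point gap between 1.0 and 4.0
readings = pd.Series([1.0, np.nan, np.nan, 4.0, 6.0])

# Missing points are placed on the straight line between their neighbours
interpolated = readings.interpolate(method="linear")
print(interpolated)
```

The two missing points are filled with 2.0 and 3.0, the evenly spaced values between the observed 1.0 and 4.0.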
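# IterativeImputer is experimental, so it must be enabled explicitly before it is imported
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
mice_imputer = IterativeImputer()
train_mice['Age'] = mice_imputer.fit_transform(train_mice[['Age']]).ravel()
train_mice['Age'].isnull().sum()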
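IterativeImputer models each feature with missing values as a function of the other features, so it has more to work with when several columns are passed in together. The following is a minimal sketch on a made-up DataFrame; the column names, values, and the random_state setting are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data; only Age has missing values
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46],
    "SibSp": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
})

# Age is estimated from Fare and SibSp; random_state makes the run repeatable
imputer = IterativeImputer(random_state=0, max_iter=10)
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
```

Observed values are left untouched and only the missing Age entries are estimated. With just a single column there are no other features to model from, so passing related columns together is usually the more informative setup.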
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn