Data Preprocessing in Python - Handling Missing Data
1. Data Removal
Remove the rows (data points) that contain missing data from the dataset.
However, this technique shrinks the available dataset, which makes any
analysis less robust when the dataset is small to begin with.
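As a minimal sketch of row removal, assuming pandas is available: the DataFrame below, including its column names and values, is made up for illustration and is not the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are assumptions.
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 41],
    "Salary": [50000, np.nan, 62000, 58000],
})

# Drop every row that has at least one missing value.
df_clean = df.dropna()

# Compare how many rows survive the removal.
rows_before = len(df)
rows_after = len(df_clean)
print(rows_before, rows_after)
```

Comparing the row counts before and after is a quick way to see how much data this technique costs on a small dataset.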
2. Fill in by mean or median
Fill the missing data with the mean or median of the available data
points. The median is generally preferred because, unlike the mean, it is
not heavily affected by outliers. Here, we use the median to fill the
missing data.
# Fill the missing values in the Salary column with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
3. Fill in manually
Manually fill in the missing data based on observation. This can work for
small datasets, but it quickly becomes impractical for larger ones.
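A manual fill could look roughly like the sketch below. The names, the DataFrame, and the value 55000.0 are all hypothetical; in practice the value would come from checking the original record by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; names and values are assumptions.
df = pd.DataFrame({
    "Name": ["Asha", "Bikram", "Chandra"],
    "Salary": [50000.0, np.nan, 62000.0],
})

# List the rows that still need a value so they can be inspected.
missing_rows = df[df["Salary"].isna()]
print(missing_rows)

# Write in the value that was looked up by hand (55000.0 is made up here).
df.loc[df["Name"] == "Bikram", "Salary"] = 55000.0
```

Each missing cell needs its own lookup and assignment, which is why this approach does not scale to large datasets.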
4. Fill in the most repeated value
Fill in the missing value with the most repeated value (the mode) of the
column. This is done when most of the data is repeated and there is good
reason to do so. Since there are no repeated values in the example, we can
fill it with any one of the numbers in the respective column.
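For a column that does contain repeated values, filling with the mode might be sketched as follows. The "Department" column and its values are assumptions chosen so that one value clearly repeats.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column in which "Sales" is the most repeated value.
df = pd.DataFrame(
    {"Department": ["Sales", "IT", "Sales", np.nan, "Sales", "HR"]}
)

# Series.mode() ignores missing values and returns a Series, because
# several values can tie for most frequent; take the first one.
most_common = df["Department"].mode()[0]

# Replace the missing entries with that value.
df["Department"] = df["Department"].fillna(most_common)
print(df)
```

When every value occurs only once, all of them tie as the mode, which matches the remark above that any number in the column could be used.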
5. Fill in by random value
Take the range of the available data points and fill in the missing data by
randomly selecting a value from that range.
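One possible reading of this technique is to draw each missing value uniformly between the column's minimum and maximum. The sketch below assumes that interpretation; the data and the fixed seed are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; values are assumptions.
df = pd.DataFrame({"Salary": [50000.0, np.nan, 62000.0, np.nan, 58000.0]})

# Fixed seed so the illustration is reproducible.
rng = np.random.default_rng(seed=0)

# The available range is taken from the observed (non-missing) values.
low, high = df["Salary"].min(), df["Salary"].max()
missing = df["Salary"].isna()

# Draw one value per missing entry, uniformly from [low, high).
df.loc[missing, "Salary"] = rng.uniform(low, high, size=missing.sum())
print(df)
```

An alternative under the same idea would be to sample from the observed values themselves, which keeps the filled values identical to ones that already occur in the column.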
6. Fill in by regression
Use regression analysis to estimate the most probable value for each
missing data point.
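As a rough sketch, a simple linear regression on one other column can supply the missing values. The "YearsExperience" predictor, the data, and the choice of a straight-line fit with numpy.polyfit are all assumptions made for illustration; a real dataset may call for a different model.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column names and values are assumptions.
df = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Salary": [40000.0, 45000.0, np.nan, 55000.0, 60000.0],
})

# Rows where the target is known are used to fit the model.
known = df["Salary"].notna()

# Fit Salary = slope * YearsExperience + intercept on the complete rows.
slope, intercept = np.polyfit(
    df.loc[known, "YearsExperience"], df.loc[known, "Salary"], deg=1
)

# Predict the missing salaries from the fitted line and fill them in.
df.loc[~known, "Salary"] = slope * df.loc[~known, "YearsExperience"] + intercept
print(df)
```

The filled values are only as good as the fitted relationship, so it is worth checking that the predictor column actually correlates with the column being filled.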
In Conclusion
Do you have any problems handling missing data in Python? Let us know in
the comment section below. Also, visit www.theclickreader.com to read
more articles like this.