Data Pre-Processing Python For Beginner
The process of dealing with unclean data and transforming it into a form
more appropriate for modeling is called data pre-processing. This step
can be considered mandatory in the machine learning process for several
reasons.
While data pre-processing can differ from case to case, there are
some common tasks that can be used:
• data cleansing
• feature selection
• data scaling
• feature engineering
• dimensionality reduction
Even though most ML algorithms require a complete dataset, not all of
them fail when there is missing data. Some algorithms are robust
to missing values, like KNN and Naive Bayes, while other algorithms can
use missing values as a unique value, like decision trees. Nevertheless,
the scikit-learn implementations of those algorithms are not
robust to missing values.
Four features have missing values. We will work on the features ‘Age’, ‘BuildingArea’, and
‘YearBuilt’.
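As an illustration, here is a minimal imputation sketch using scikit-learn's SimpleImputer. The DataFrame below is toy stand-in data (the values are invented for the example); only the column names come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in data; column names follow the features mentioned above
df = pd.DataFrame({
    "Age": [5.0, np.nan, 12.0, 40.0],
    "BuildingArea": [120.0, 95.0, np.nan, 150.0],
    "YearBuilt": [2015.0, np.nan, 2008.0, 1980.0],
})

cols = ["Age", "BuildingArea", "YearBuilt"]

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

print(df.isna().sum().sum())  # no missing values remain
```

Other strategies such as "median" or "most_frequent" can be passed to SimpleImputer depending on the feature's distribution.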
Internal
Feature Selection
Feature selection is used to:
• reduce complexity
• prevent overfitting
credit: machinelearningmastery.com
When using statistics-based feature selection, it is important to choose
the method based on the data types of the input and output variables.
This decision tree helps decide which statistics-based method is
suitable for our data:
credit: machinelearningmastery.com
We are going to use the RFE method to select the most important features
from our dataset. Recursive Feature Elimination (RFE) is popular due
to its flexibility and ease of use. It reduces model complexity
by removing features one by one until only the selected number of
features remains.
The six most relevant features according to RFE are indicated by “Selected=True”.
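A minimal RFE sketch with scikit-learn, assuming a regression task; the synthetic data below stands in for the housing features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the dataset (10 candidate features)
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=6, random_state=0)

# Recursively drop the weakest feature until six remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6)
rfe.fit(X, y)

for i, selected in enumerate(rfe.support_):
    print(f"Feature {i}: Selected={selected}")
```

The `support_` attribute is a boolean mask over the input columns; `Selected=True` marks the features RFE kept.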
Feature Scaling
All maximum values have been scaled to 1
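The scaling step can be sketched with scikit-learn's MinMaxScaler; the array below is invented toy data with columns on very different ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy numeric features on very different ranges
X = np.array([[1000.0, 1.0],
              [2000.0, 3.0],
              [4000.0, 5.0]])

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.max(axis=0))  # every column's maximum is now 1.0
```

MinMaxScaler subtracts each column's minimum and divides by its range, so every column ends up between 0 and 1.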
Feature Engineering
1. Decomposing a Date-Time
• datetime -> hour_of_day
• hour -> morning, night
• etc.
2. Discretization
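Both steps can be sketched with pandas; the timestamps and bin labels below are illustrative choices, not from the original dataset:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2021-01-01 07:30", "2021-01-01 14:00", "2021-01-01 22:45"])})

# Decompose the date-time into an hour-of-day feature
df["hour_of_day"] = df["timestamp"].dt.hour

# Discretize the hour into coarse time-of-day bins
df["part_of_day"] = pd.cut(df["hour_of_day"],
                           bins=[0, 12, 18, 24],
                           labels=["morning", "afternoon", "night"],
                           right=False)

print(df[["hour_of_day", "part_of_day"]])
```

`pd.cut` assigns each hour to a half-open interval ([0, 12), [12, 18), [18, 24)), turning a continuous feature into a categorical one.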
Next, for handling categorical features, there are several methods called
encoding. These are three common encoding techniques, with samples.
Label Encoding
• Useful for non-linear and tree-based algorithms.
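A label-encoding sketch with scikit-learn; the color values are an invented example:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

# Map each category to an integer label
# (LabelEncoder sorts alphabetically: blue=0, green=1, red=2)
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoded)  # [2 1 0 1]
```

Note the integers impose an arbitrary order on the categories, which is why label encoding suits tree-based models better than linear ones.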
One-Hot Encoding
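One-hot encoding creates one binary column per category, with exactly one 1 per row. A sketch with pandas, using an invented color column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})

# One binary indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

print(one_hot)
```

Unlike label encoding, this adds no artificial ordering, at the cost of one new column per category.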
Binary Encoding
• Variables -> numerical labels (label encoding) -> binary
numbers -> split every digit into a different column.
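The steps above can be sketched by hand with pandas (the `category_encoders` package also offers a ready-made BinaryEncoder); the categories and column names here are invented for the example:

```python
import pandas as pd

categories = ["red", "green", "blue", "yellow"]

# Step 1: label-encode the categories
# (alphabetical: blue=0, green=1, red=2, yellow=3)
labels = {cat: i for i, cat in enumerate(sorted(set(categories)))}

# Step 2: write each label in binary and split the digits into columns
n_bits = max(labels.values()).bit_length()  # 2 bits cover labels 0..3
rows = [[(labels[cat] >> bit) & 1 for bit in range(n_bits - 1, -1, -1)]
        for cat in categories]
encoded = pd.DataFrame(rows, columns=[f"bit_{i}" for i in range(n_bits)])

print(encoded)
```

Binary encoding needs only ⌈log2(k)⌉ columns for k categories, a middle ground between label encoding (1 column) and one-hot encoding (k columns).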
Dimensionality Reduction
Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied
machine learning to simplify a classification or regression
dataset in order to better fit a predictive model.
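As one common example, a PCA sketch with scikit-learn, using the built-in iris dataset rather than the housing data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 4 original features

# Project the 4-dimensional data down to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (150, 2)
```

The two components are the directions of greatest variance, so the reduced data keeps as much of the original spread as two columns can.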
Handling Outliers
Many datasets have outliers that can heavily affect model training results.
In Python, outliers can be easily detected using a boxplot visualization.
We can adjust the outliers without any additional library using the
winsorization method: outlier values are replaced by certain values
called the upper and lower bounds.
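A winsorization sketch using only NumPy, with bounds derived from the interquartile range (one common rule; the data values are invented):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])  # 50.0 is an outlier

# Derive lower/upper bounds from the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip outliers to the bounds instead of dropping the rows
winsorized = np.clip(data, lower, upper)

print(winsorized.max())  # no value exceeds the upper bound
```

Clipping keeps the row count unchanged, which matters when other features in the same rows are still useful.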
Those are several common methods for data preparation. Every project
is unique and may need a different approach to data pre-processing and
cleansing.
References
• https://www.kaggle.com/alexisbcook/missing-values
• https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/
• https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
• https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd