Data Preparation For ML in Practice v213
• Feature Engineering
  • Target encoding
  • Dummy variables
  • Interaction variables
• Feature Scaling
  • Standard Scaler
  • Normalize
• Dimension Reduction
  • LDA
  • PCA
  • Factor Analysis
Data Cleaning
• Data duplication
• Missing values
• Imbalanced classes
Handling duplications
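• As a brief, hedged sketch (assuming the raw data has been loaded into a pandas DataFrame named train, a hypothetical name and file path), duplicate rows could be detected and dropped as follows:

import pandas as pd

train = pd.read_csv('train.csv')                          # hypothetical file path
print('duplicate rows:', train.duplicated().sum())        # count fully duplicated rows
train = train.drop_duplicates().reset_index(drop=True)    # keep the first occurrence of each row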
Handling missing values
• Remedies (a sketch of these in pandas is shown below):
  • ps_car_03_cat and ps_car_05_cat have a large proportion of records with missing values. Remove these variables.
  • ps_reg_03 (real/interval) has missing values for 18% of all records. Replace by the mean.
  • ps_car_11 (categorical/ordinal) has only 5 records with missing values. Replace by the mode, that is, the most frequent value.
  • ps_car_12 (real/interval) has only 1 record with a missing value. Replace by the mean.
  • ps_car_14 (real/interval) has missing values for 7% of all records. Replace by the mean.
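• A minimal pandas sketch of these remedies, assuming the training data is in a DataFrame named train and the missing entries have already been converted to NaN:

# Drop the two variables with a large proportion of missing values.
train = train.drop(['ps_car_03_cat', 'ps_car_05_cat'], axis=1)

# Replace missing real/interval values by the column mean.
for col in ['ps_reg_03', 'ps_car_12', 'ps_car_14']:
    train[col] = train[col].fillna(train[col].mean())

# Replace missing categorical/ordinal values by the mode (most frequent value).
train['ps_car_11'] = train['ps_car_11'].fillna(train['ps_car_11'].mode()[0])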
Handling missing values
• Here, we show how the NaN values are replaced using the mode (most frequent value) approach.
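• One possible way to do this (not necessarily the exact code of the original notebook) is scikit-learn's SimpleImputer with the 'most_frequent' strategy:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN in the ordinal column ps_car_11 with its most frequent value (the mode).
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
train[['ps_car_11']] = mode_imputer.fit_transform(train[['ps_car_11']])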
Handling imbalanced classes
• Specialised techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class. Examples include (see the sketch below):
  • Random Undersampling
  • SMOTE Oversampling
Source: Applications of Machine Learning Methods in Complex Economics and Financial Networks
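• A hedged sketch of both techniques, assuming the imbalanced-learn package is installed and X, y are the prepared feature matrix and target:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Random undersampling: drop majority-class samples until the classes are balanced.
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)

# SMOTE oversampling: synthesise new minority-class samples by interpolating neighbours.
smote = SMOTE(random_state=42)
X_over, y_over = smote.fit_resample(X, y)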
Feature Engineering
• Target encoding
• Dummy variables
• Interaction variables
Handling high-cardinality categorical attributes using Target Encoding
• In this method, a categorical variable is replaced with just one new numerical variable: each category is replaced with the corresponding probability of the target (if the target is categorical) or the average of the target (if it is numerical). A minimal sketch is shown below.
• Target encoding (TE) is particularly useful for handling categorical attributes characterised by a large number of distinct values, known as high cardinality. High cardinality represents a serious challenge for many classification and regression algorithms that require numerical inputs.
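• A minimal sketch of the idea, assuming a DataFrame train with a categorical column ps_car_11_cat and a binary 0/1 target column named target (the mean of a 0/1 target is the probability of the positive class):

# Map each category to the mean of the target observed for that category.
te_map = train.groupby('ps_car_11_cat')['target'].mean()
train['ps_car_11_cat_te'] = train['ps_car_11_cat'].map(te_map)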
Handling high-cardinality categorical attributes using Target Encoding
• However, when dealing with very high-cardinality input variables, many values may have a relatively small sample size. Simply trusting the statistics of the target for such values can therefore lead to trouble (namely overfitting).
• For this reason, smoothing approaches [1][2] can be adopted, whereby the estimate of the target given a categorical attribute gets blended with the prior probability of the target (the baseline).
• The blending is controlled by a parameter that depends on the sample size:
  • The larger the sample, the more the estimate is weighted toward the value of the target given the categorical attribute.
  • The smaller the sample, the more the estimate is weighted toward the overall baseline for the target.
• Adding some random Gaussian noise to the column for each data point after target encoding the feature is another popular way of handling the overfitting issue. A sketch of the blending idea follows this list.
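• As a sketch of the blending idea (the exact formulas in [1][2] differ slightly), the per-category mean can be shrunk towards the global prior with a weight that grows with the category's sample size; the smoothing strength m is an assumed value:

prior = train['target'].mean()                                   # overall baseline
stats = train.groupby('ps_car_11_cat')['target'].agg(['mean', 'count'])
m = 20                                                           # smoothing strength (assumed)
smoothed = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
train['ps_car_11_cat_te'] = train['ps_car_11_cat'].map(smoothed)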
Further reading:
[1] https://maxhalford.github.io/blog/target-encoding/
[2] https://dl.acm.org/doi/10.1145/507533.507538
[3] https://machinelearninginterview.com/topics/machine-learning/target-encoding-for-categorical-features/
[4] https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b
Target Encoding: apply
• Figure: applying a Target Encoder to the dataset.
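• One possible way to apply this in practice, assuming the category_encoders package is installed (the smoothing value and the '_cat' column-name convention are illustrative):

import category_encoders as ce

# Target-encode the categorical columns, with smoothing to reduce overfitting.
cat_cols = [c for c in train.columns if c.endswith('_cat')]
encoder = ce.TargetEncoder(cols=cat_cols, smoothing=1.0)
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])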
Creating dummy variables: understanding
• If you are analysing your data using multiple linear regression (MLR) and any of your independent variables are on a nominal or ordinal scale, you will need to create dummy variables: numerical variables limited to two specific values, 1 or 0.
• This is because nominal and ordinal independent variables, broadly known as categorical independent variables, cannot be entered directly into a multiple regression analysis.
• The process of converting categorical variables into dummy variables is known as dummification.
• MLR is an extension of ordinary (aka simple) linear regression (OLR), which uses just one explanatory variable, although there are non-linear regression methods for more complicated data and analysis. The general form of each type of regression is:
  • OLR: y = b0 + b1*x + e
  • MLR: y = b0 + b1*x1 + b2*x2 + … + bp*xp + e
• By dropping the reference column, this is called dummy coding [1].
Creating dummy variables: apply
• We can make use of the dummification process to filter out missing values (NaN), as sketched below.
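• A hedged sketch with pandas (the '_cat' column-name convention is assumed; drop_first=True drops the reference category, i.e. dummy coding, and rows with NaN simply get 0 in every dummy column unless dummy_na=True is set):

import pandas as pd

# One-hot encode the nominal columns, dropping the reference category (dummy coding).
cat_cols = [c for c in train.columns if c.endswith('_cat')]
train = pd.get_dummies(train, columns=cat_cols, drop_first=True)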
Creating interaction variables: understanding
• Often, the input features for a predictive modelling task interact in unexpected and often nonlinear ways. Interaction and polynomial features can help to better expose the important relationships between the input variables and the target variable.
• Polynomial features are features created by raising existing features to an exponent (for example, a power of 2).
  • For example, if a dataset had one input feature X, then a polynomial feature would be the addition of a new feature (column) whose values are calculated by squaring the values in X, e.g. X^2. This process can be repeated for each input variable in the dataset, creating a transformed version of each.
• As such, polynomial features are a type of feature engineering, i.e. the creation of new input features based on the existing features.
• The degree of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables for each input variable.
• Typically a small degree is used, such as 2 or 3.
• Sometimes these features can result in improved modelling performance, although at the cost of adding many additional input variables, depending on the polynomial degree.
• Typically, linear algorithms such as linear regression and logistic regression (used for classification problems) respond well to the use of polynomial input variables.
• Linear regression is linear in the model parameters, and adding polynomial terms to the model can be an effective way of allowing the model to identify nonlinear patterns. A minimal sketch is shown below.
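• A minimal sketch with scikit-learn (degree=2 and include_bias=False are illustrative choices, not necessarily the notebook's settings), where X is the numeric feature matrix:

from sklearn.preprocessing import PolynomialFeatures

# Create degree-2 polynomial and interaction features from the numeric inputs.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out()[:10])   # inspect the generated names (scikit-learn >= 1.0)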
Creating interaction variables: apply
• These suggest that the input (feature) variables are squared (x^2).
Selecting features based on feature variance: low variance analysis
• If the variance of a feature is low or close to zero, then the feature is approximately constant. Evidence suggests that such low-variance features hold no predictive power. In this case, we can consider removing them.
• Sklearn has a handy method to do that: VarianceThreshold. By default it removes features with zero variance. This will not be applicable for the Porto Seguro dataset, as we saw in the EDA that there are no zero-variance variables.
• We can set an arbitrary variance threshold and use the accuracy of the predictions to determine which features to remove (or select).
• Setting a threshold of 1%, we found 28 low-variance features. It looks like we would lose many variables if we select features based on variance alone (see the sketch below).
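• A sketch of this check, assuming X is the pre-processed feature DataFrame and using the 1% threshold quoted above:

from sklearn.feature_selection import VarianceThreshold

# Flag features whose variance is below 1%.
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
low_variance = X.columns[~selector.get_support()]     # features that would be dropped
print(len(low_variance), 'low-variance features')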
• Instead, Sklearn also comes with other feature selection methods:
  • SelectKBest: evaluates all features using a scoring function such as the Chi-squared statistic and selects the K highest-scoring features.
  • SelectFromModel: selects features based on the importance weights of a fitted model.
Selecting features based on SelectKBest and Chi-squared stats
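• A hedged sketch (chi-squared requires non-negative feature values, so this assumes the chosen columns are non-negative; k=20 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 20 features with the highest chi-squared score against the target.
skb = SelectKBest(score_func=chi2, k=20)
X_new = skb.fit_transform(X, y)
selected = X.columns[skb.get_support()]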
Selecting features using Random Forest (and SelectFromModel)
• Random Forest provides good predictive performance, low overfitting, and easy interpretability (i.e. it is easy to determine feature importance).
• A Random Forest model can be interpreted in two ways:
  • Overall interpretation: determine which variables (or combinations of variables) have the most predictive power, and which ones have the least.
  • Local interpretation: for a given data point and associated prediction, determine which variables (or combinations of variables) explain this specific prediction.
• Here, we will base feature selection on the feature importances of a random forest model.
• We will first fit a model on our pre-processed training set (X, y) using Sklearn's RandomForestClassifier.
• Then, with the fitted model, we can use Sklearn's SelectFromModel to select those features whose importance is greater than the mean importance of all the features (the default threshold, which we can alter if we want). To see the selected features, we can use the get_support method on the fitted selector. A sketch of these steps is shown below.
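• A sketch of the steps just described (the hyperparameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a random forest on the pre-processed training set (X, y).
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Keep the features whose importance exceeds the mean importance (the default threshold).
sfm = SelectFromModel(rf, threshold='mean', prefit=True)
selected = X.columns[sfm.get_support()]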
• Let's look at the hyperparameters of Sklearn's built-in random forest function, which are used either to increase the predictive power of the model or to make the model faster.
Dimension Reduction
• Dimensionality reduction (DR) is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension [1].
• Working in high-dimensional spaces can be undesirable for many reasons:
  • raw data are often sparse as a consequence of the curse of dimensionality [2]
  • a large number of features available in the dataset may result in overfitting of the learning model
• DR in machine learning and statistics reduces the number of random variables under consideration by acquiring a collection of critical variables. It can be divided into feature selection and feature extraction.
• Here, we will explore the following 3 DR techniques:
  • Linear Discriminant Analysis (LDA)
  • Principal Component Analysis (PCA)
  • Factor Analysis
[1] https://en.wikipedia.org/wiki/Dimensionality_reduction#cite_note-dr_review-1
[2] van der Maaten, Laurens; Postma, Eric; van den Herik, Jaap (October 26, 2009). "Dimensionality Reduction: A Comparative Review" (PDF). J Mach Learn Res. 10: 66–71.
Introduction to PCA and LDA
Principal Component Analysis (PCA)
• PCA uses linear projection to transform data into the new feature space:
• (a) original data in feature space
• (b) centered data
• (c) projecting a data vector x onto another vector
• (d) direction of maximum variance of the projected coordinates
Principal Component Analysis (PCA)
• Separating features (x) and target (y) from the train dataset.
Applying PCA to Porto Seguro's dataset
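• A sketch of separating the features and target, standardising, and applying PCA (n_components=2 is an illustrative choice; identifier columns, if any, should also be dropped):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Separate features (X) and target (y) from the train DataFrame, then standardise.
X = train.drop('target', axis=1)
y = train['target']
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)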
PCA vs LDA on Porto Seguro's dataset
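• For comparison, a sketch of LDA on the same X_scaled and y as above; because LDA is supervised and the target here is binary, at most one discriminant component can be extracted:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA uses the class labels, so it is fitted on (X_scaled, y).
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)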
Introducing Factor Analysis (FA)
Factor Analysis (FA)
• Factor Analysis (FA) is another unsupervised ML technique used for dimensionality reduction. This algorithm creates factors from the observed variables to represent the common variance, i.e. the variance due to correlation among the observed variables.
• The goal of FA is to describe the variability among correlated variables in terms of fewer variables, called factors, based on the concept that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable.
• FA and PCA are very similar in that they both identify patterns in the correlations between variables. These patterns are used to infer the existence of underlying latent variables in the data. These latent variables are often referred to as factors, components, or dimensions.
The difference between FA and PCA
• FA:
  • Reduces a large number of variables into a smaller number of factors.
  • Puts maximum common variance into a common score.
  • Associates multiple observed variables with a latent variable.
  • Has the same number of factors as variables, where each factor contains a certain amount of the overall variance.
  • Eigenvalue: a measure of the variance that a factor explains for the observed variables. A factor with an eigenvalue < 1 explains less variance than a single observed variable.
• PCA:
  • Seeks the linear combination of variables that extracts the maximum variance.
  • Computes the Eigenvectors, which are the principal components of the dataset, and collects them in a projection matrix.
  • Each Eigenvector is associated with an Eigenvalue, which is its magnitude.
  • Reduces the dataset to a smaller-dimensional subspace by dropping the less informative Eigenpairs.
Applying FA to Porto Seguro's dataset
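• A sketch using scikit-learn's FactorAnalysis (the number of factors is an illustrative choice, not necessarily the notebook's setting):

from sklearn.decomposition import FactorAnalysis

# Extract a small number of latent factors from the standardised features.
fa = FactorAnalysis(n_components=5, random_state=42)
X_fa = fa.fit_transform(X_scaled)
print(fa.components_.shape)    # loading matrix: (n_factors, n_original_features)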
Feature Scaling
• Standard Scaler
• Normalize
• In this case, we choose to use scikit-learn's Standard Scaler in our data preparation.
• The Standard Scaler rescales the attributes so that they have a mean of 0 and a variance of 1.
• The ultimate goal of standardisation is to bring all the features down to a common scale without distorting the differences in the ranges of their values.
Feature Scaling: fit_transform() vs transform()
• The fit_transform() method calculates the mean and variance of each of the features present in our (training) data and scales it accordingly.
• The transform() method uses the same mean and variance, as calculated from the training data, to transform the test data. The sketch below illustrates this.
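• A sketch illustrating the distinction, where X_train and X_test are assumed to be the training and test feature matrices:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform learns the per-feature mean and variance from the training data and scales it.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics to scale the test data (no refitting).
X_test_scaled = scaler.transform(X_test)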
Conclusion
• We have so far covered the entire workflow of data science, working all the way to the stage right before the development of an ML prediction model.
• This includes the exploratory data analysis (EDA) and the data preparation for ML, including sourcing of the raw data (in this case, the Porto Seguro dataset), data cleaning, feature engineering, feature selection (pre-fitting with an ML algorithm), dimensionality reduction, and finally feature scaling.
• In conclusion, I advise that the EDA and data preparation be studied alongside the supplied Jupyter notebook, as they were developed in conjunction.
End of Lesson