
CP70066E Machine Learning

Data Preparation for ML in Practice

Professor Jonathan Loo


Chair in Computing and Engineering
School of Engineering and Computing
University of West London
Lesson Outline

• Data Cleaning
  • Data duplication
  • Missing values
  • Imbalanced classes

• Feature Engineering
  • Target encoding
  • Dummy variables
  • Interaction variables

• Feature Scaling
  • Standard Scaler
  • Normalize

• Feature Selection
  • Low variance features
  • SelectKBest and Chi-squared
  • Random Forest and SelectFromModel

• Dimension Reduction
  • LDA
  • PCA
  • Factor Analysis
Data Cleaning

• Data duplication
• Missing values
• Imbalanced classes
Handling duplications
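The slide's code is not reproduced here; below is a minimal pandas sketch of how duplicated records can be detected and dropped, assuming the training data has been loaded into a DataFrame named train (the file name is illustrative):

import pandas as pd

# Illustrative file name; substitute the actual Porto Seguro training CSV
train = pd.read_csv('train.csv')

# Count rows that are exact duplicates of an earlier row
print(train.duplicated().sum(), 'duplicated rows found')

# Keep the first occurrence of each duplicate and drop the rest
train = train.drop_duplicates(keep='first').reset_index(drop=True)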
Handling missing values

• Remedies:
  • ps_car_03_cat and ps_car_05_cat have a large proportion of records with missing values. Remove these variables.
  • ps_reg_03 (real/interval) has missing values for 18% of all records. Replace by the mean.
  • ps_car_11 (categorical/ordinal) has only 5 records with missing values. Replace by the mode, that is the most frequent value.
  • ps_car_12 (real/interval) has only 1 record with a missing value. Replace by the mean.
  • ps_car_14 (real/interval) has missing values for 7% of all records. Replace by the mean.
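A minimal pandas sketch of these remedies, assuming the missing entries have already been converted to NaN and the data sits in a DataFrame named train:

# Drop the variables with a large proportion of missing values
train = train.drop(['ps_car_03_cat', 'ps_car_05_cat'], axis=1)

# Replace missing real/interval values by the column mean
for col in ['ps_reg_03', 'ps_car_12', 'ps_car_14']:
    train[col] = train[col].fillna(train[col].mean())

# Replace missing ordinal values by the mode (most frequent value)
train['ps_car_11'] = train['ps_car_11'].fillna(train['ps_car_11'].mode()[0])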
Handling missing values
Handling missing values

• There are still missing values in the categorical variables.
• Can you suggest how those NaN values might be imputed?

For the remaining categorical variables with missing values, we can leave the missing values (NaN) as they are and eliminate them at a later stage, replace them with the mode (most frequent value), or use an algorithm-based approach such as KNN imputation [1][2].

[1] Imputing missing values: https://jamesrledoux.com/code/imputation
[2] Preprocessing: Encode and KNN Impute All Categorical Features Fast: https://towardsdatascience.com/preprocessing-encode-and-knn-impute-all-categorical-features-fast-b05f50b4dfaa
Handling missing values

• Here, we show how the NaN values are replaced by the mode approach.
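The slide's code is shown as a screenshot; a minimal sketch of the mode approach, assuming the remaining categorical columns follow the dataset's _cat naming convention:

# Replace the remaining NaN in the categorical variables by the mode of each column
cat_cols = [col for col in train.columns if col.endswith('_cat')]
for col in cat_cols:
    train[col] = train[col].fillna(train[col].mode()[0])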
Handling missing values

• Finally, the data is cleaned.
• We now need to address other data quality issues.
Handling imbalanced classes

• As observed in the Target variable, the proportion of records with target=1 is far less than target=0. This can lead to a model that has great accuracy but does not add any value in practice.
• Two possible strategies to deal with this problem are:
  • oversampling records with target=1
  • undersampling records with target=0

• MachineLearningMastery.com gives a nice overview of tactics to combat imbalanced classes in an ML dataset [1].

[1] Overview of tactics to combat imbalanced classes in an ML dataset: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Handling imbalanced classes

• Specialised techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class. Examples include:
  • Random Undersampling
  • SMOTE Oversampling

Source: Applications of Machine Learning Methods in Complex Economics and Financial Networks
Handling imbalanced classes

• Imbalanced-learn library: https://imbalanced-learn.org/stable/index.html

Handling imbalanced classes: under-sampling using the library

Handling imbalanced classes: SMOTE (Synthetic Minority Over-sampling Technique) over-sampling using the library

Handling imbalanced classes: combination of over- and under-sampling using SMOTEENN

G. Batista, B. Bazzan, M. Monard, "Balancing Training Data for Automated Annotation of Keywords: a Case Study," In WOB, 10-18, 2003.
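A minimal sketch of the three resampling strategies using the imbalanced-learn library, assuming X and y are the pre-processed features and target (target=1 being the minority class):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Random under-sampling of the majority class (target=0)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# SMOTE over-sampling: synthesise new minority-class records (target=1)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

# Combination of SMOTE over-sampling and ENN cleaning
X_comb, y_comb = SMOTEENN(random_state=0).fit_resample(X, y)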
Handling imbalanced classes: using a simple algorithm
Feature Engineering

• Target encoding
• Dummy variables
• Interaction variables
Handling high-cardinality categorical attributes using Target Encoding

• Target Encoding (TE) is the numerisation of categorical variables via the target.

• In this method, a categorical variable is replaced with just one new numerical variable: each category of the categorical variable is replaced with its corresponding probability of the target (if the target is categorical) or average of the target (if the target is numerical).

• TE is particularly useful in handling categorical attributes characterised by a large number of distinct values, known as high cardinality. Such attributes represent a serious challenge for many classification and regression algorithms that require numerical inputs.
Handling high-cardinality categorical attributes using Target Encoding

• However, when dealing with very high-cardinality input variables, many values may have a relatively small sample size. Therefore, simply trusting the statistics of the target could mean trouble (namely overfitting).
• For this reason, smoothing approaches [1][2] can be adopted whereby the estimate of the target given a categorical attribute gets blended with the prior probability of the target (or the baseline).
• The blending is controlled by a parameter that depends on the sample size:
  • The larger the sample, the more the estimate is weighted toward the value of the target given the categorical attribute.
  • The smaller the sample, the more the estimate is weighted toward the overall baseline for the target.

• Adding some random Gaussian noise to the column for each data point after target encoding the feature is another popular way of handling the overfitting issue.

Further reading:
[1] https://maxhalford.github.io/blog/target-encoding/
[2] https://dl.acm.org/doi/10.1145/507533.507538
[3] https://machinelearninginterview.com/topics/machine-learning/target-encoding-for-categorical-features/
[4] https://towardsdatascience.com/dealing-with-features-that-have-high-cardinality-1c9212d7ff1b
Target Encoding: apply

• High-cardinality categorical attributes turn into real (continuous) values after TE.
[Diagram: high-cardinality categorical attributes → Target Encoder → real (continuous) values]

• The smoothing approach is computed as in the following paper by Daniele Micci-Barreca: https://dl.acm.org/doi/10.1145/507533.507538
  • min_samples_leaf (int): minimum samples to take the category average into account
  • smoothing (int): smoothing effect to balance the categorical average vs the prior
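A minimal sketch using the TargetEncoder from the category_encoders package, which implements this kind of smoothing; the column name and parameter values below are illustrative, assuming X holds the features and y the binary target:

import category_encoders as ce

# Encode a high-cardinality categorical column against the binary target y
encoder = ce.TargetEncoder(cols=['ps_car_11_cat'],
                           min_samples_leaf=100,  # minimum samples to take the category average into account
                           smoothing=10)          # balances the category average against the prior
X_encoded = encoder.fit_transform(X, y)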
Creating dummy variables

• If you are analysing your data using multiple linear regression (MLR) and any of your independent variables are on a nominal or ordinal scale, you will need to create dummy variables: numerical variables limited to two specific values, 1 or 0.

• This is because nominal and ordinal independent variables, broadly known as categorical independent variables, cannot be directly entered into a multiple regression analysis.

• The process of converting categorical variables into dummy variables is known as dummification.
• MLR is an extension of ordinary (aka simple) linear regression (OLR), which uses just one explanatory variable, although there are non-linear regression methods for more complicated data and analysis. The general form of each type of regression is:
  • OLR: y = b0 + b1*x
  • MLR: y = b0 + b1*x1 + b2*x2 + … + bk*xk
Creating dummy variables: understanding
• By dropping the reference column, this is called Dummy coding [1]

[1]
Creating dummy variables: apply

• We can make use of the dummification process to filter out missing values (NaN).

Creating dummy variables: apply

• Finally, we create dummy variables for all the categorical variables.
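A minimal pandas sketch of the dummification step, assuming the categorical columns follow the _cat naming convention; dummy_na=True is one way to turn missing values into their own indicator column, and drop_first=True gives the dummy coding described earlier:

import pandas as pd

cat_cols = [col for col in train.columns if col.endswith('_cat')]
train = pd.get_dummies(train, columns=cat_cols,
                       drop_first=True,  # drop the reference column (dummy coding)
                       dummy_na=True)    # give missing values (NaN) their own indicator column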
Creating interaction variables: understanding

• Often, the input features for a predictive modelling task interact in unexpected and often nonlinear ways. Interaction and polynomial features can help to better expose the important relationships between the input variables and the target variable.
• Polynomial features are features created by raising existing features to an exponent (e.g. a power of 2).
  • For example, if a dataset had one input feature X, then a polynomial feature would be the addition of a new feature (column) whose values were calculated by squaring the values in X, e.g. X^2. This process can be repeated for each input variable in the dataset, creating a transformed version of each.

• As such, polynomial features are a type of feature engineering, i.e. the creation of new input features based on the existing features.
Creating interaction variables: understanding

• The degree of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables for each input variable.
  • Typically a small degree is used, such as 2 or 3.

• Sometimes these features can result in improved modelling performance, although at the cost of adding lots of additional input variables, depending on the polynomial degree.
• Typically linear algorithms, such as linear regression and logistic regression (used in classification problems), respond well to the use of polynomial input variables.
• Linear regression is linear in the model parameters, and adding polynomial terms to the model can be an effective way of allowing the model to identify nonlinear patterns.
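A minimal sketch using sklearn's PolynomialFeatures, assuming X is a DataFrame of the features and continuous_cols is an assumed list of the interval features to expand:

from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion: squares and pairwise interaction terms of the selected features
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X[continuous_cols])

# Names of the generated features, e.g. 'x1 x2' and 'x1^2' (use get_feature_names on older sklearn)
poly_names = poly.get_feature_names_out(continuous_cols)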
Creating interaction variables: understanding
Creating interaction variables: apply

• These suggest that the input (feature) variables are x2.

Creating interaction variables: apply

• Any idea how the 157 variables were obtained, given 65 interaction variables were added to the existing 102 variables?
Feature Selection

• Low variance features
• SelectKBest and Chi-squared
• Random Forest and SelectFromModel
Selecting features based on feature variance: low variance analysis

• If the variance is low or close to zero, then a feature is approximately constant. Evidence suggests that such low variance features hold no predictive power. In this case, we can consider removing them.
• Sklearn has a handy method to do that: VarianceThreshold. By default it removes features with zero variance. This will not be applicable for Porto Seguro's dataset, as we saw in the EDA that there are no zero-variance variables.
• We can set an arbitrary variance threshold and use the accuracy of the predictions to determine which features to remove (or select).
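A minimal sketch of variance-based selection with sklearn's VarianceThreshold, assuming X is a DataFrame of the pre-processed features:

from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is below 1%; the default threshold of 0 only removes constant features
selector = VarianceThreshold(threshold=0.01)
X_high_variance = selector.fit_transform(X)

# Names of the features that were kept
kept_features = X.columns[selector.get_support()]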
Selecting features based on feature variance: low variance analysis

• Setting a threshold of 1%, we saw that 28 low variance features were found. It looks like we would lose many variables if we select features based on variance alone.
• Instead, Sklearn also comes with other feature selection methods:
  • SelectKBest: evaluate all features using a scoring function such as the chi-squared statistic and select the K highest scoring features
  • SelectFromModel: select features based on the importance weights of a fitted model
Selecting features based on SelectKBest and Chi-squared stats

• Here, we run the score function on (X, y) using SelectKBest and the Chi-squared statistic.
• We use the scores to rank the top 20 features with the highest values for the chi-squared test statistic.
• The way to interpret the chi-squared scores is that categorical features with the highest values indicate higher relevance and importance in predicting the target, and may be included in a predictive model development.
Selecting features based on SelectKBest and Chi-squared stats

• Here, we use fit_transform() to select the top 20 features.
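A minimal sketch of the chi-squared ranking and selection, assuming X is a DataFrame of non-negative features (a requirement of the chi2 score function) and y is the target:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Score every feature against the target and keep the 20 highest-scoring ones
selector = SelectKBest(score_func=chi2, k=20)
X_top20 = selector.fit_transform(X, y)

# Rank the features by their chi-squared score
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(20))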


Selecting features using Random Forest (and SelectFromModel)

• Random Forest is one of the most popular machine learning algorithms.
• It is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample.
• The determination of the prediction varies depending on the type of problem:
  • For a regression task, the predictions of the individual decision trees are averaged
  • For a classification task, a majority vote, i.e. the most frequent predicted class, yields the predicted class

Ensemble methods allow us to take a sample of decision trees into account, calculate which features to use or questions to ask at each split, and make a final predictor based on the aggregated results of the sampled decision trees.

Further reading: https://www.ibm.com/cloud/learn/random-forest
Selecting features using Random Forest (and SelectFromModel)

• Random Forest provides good predictive performance, low overfitting, and easy interpretability (i.e. it is easy to determine feature importance).
• A Random Forest model can be interpreted in 2 ways:
  • Overall interpretation: determine which variables (or combinations of variables) have the most predictive power, and which ones have the least
  • Local interpretation: for a given data point and associated prediction, determine which variables (or combinations of variables) explain this specific prediction

• Here, we will base feature selection on the feature importances of a random forest model.
• We will first fit a model to our pre-processed training set (X, y) using Sklearn's RandomForestClassifier.
• Then, with the fitted model, we can use Sklearn's SelectFromModel to select those features whose importance is greater than the mean importance of all the features by default; this threshold can be altered if we want. To see the selected features, we can call the get_support() method on the fitted selector.
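A minimal sketch of this two-step procedure, assuming X is a DataFrame of the pre-processed features and y the target:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Step 1: fit a random forest on the pre-processed training set
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X, y)

# Step 2: keep features whose importance exceeds the mean importance (the default threshold)
sfm = SelectFromModel(rf, threshold='mean', prefit=True)
selected_features = X.columns[sfm.get_support()]
X_selected = sfm.transform(X)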
Selecting features using Random Forest (and SelectFromModel)

• Let's look at the hyperparameters of sklearn's built-in random forest function, which are either used to increase the predictive power of the model or to make the model faster.

• Increasing the predictive power:
  • n_estimators is the number of trees the algorithm builds before taking the maximum vote or averaging the predictions. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.
  • max_features is the maximum number of features random forest considers when splitting a node. Sklearn provides several options, all described in the documentation.
  • min_samples_leaf determines the minimum number of samples required to be at a leaf node.

• Increasing the model's speed:
  • n_jobs tells the engine how many processors it is allowed to use. If it has a value of one, it can only use one processor. A value of "-1" means that there is no limit.
  • random_state makes the model produce the same results each time it is run with a definite value of random_state.
Selecting features using Random Forest (and SelectFromModel)
Dimensionality Reduction

• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
• Factor Analysis
Dimensionality Reduction

• Dimensionality reduction (DR) is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension [1].
• Working in high-dimensional spaces can be undesirable for many reasons:
  • raw data are often sparse as a consequence of the curse of dimensionality [2]
  • a large number of features available in the dataset may result in overfitting in the learning model

• DR in machine learning and statistics reduces the number of random variables under consideration by acquiring a collection of critical variables. It can be divided into feature selection and feature extraction.
• Here, we will explore the following 3 DR techniques:
  • Linear Discriminant Analysis (LDA)
  • Principal Component Analysis (PCA)
  • Factor Analysis

[1] https://en.wikipedia.org/wiki/Dimensionality_reduction#cite_note-dr_review-1
[2] van der Maaten, Laurens; Postma, Eric; van den Herik, Jaap (October 26, 2009). "Dimensionality Reduction: A Comparative Review" (PDF). J Mach Learn Res. 10: 66–71.
Introduction to PCA and LDA
Principal Component Analysis (PCA)

• PCA is an unsupervised ML technique.
• It is a projection method where data with m columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data.
• The key idea of principal component analysis is to reduce the dimensionality of a dataset consisting of several variables, strongly or lightly correlated with each other, while preserving to the maximum degree the variance present in the dataset.
• This is achieved by translating the variables into a new collection of variables that are combinations of the original dataset's variables or attributes, so that maximum variance is preserved.
• These attribute combinations are known as Principal Components (PCs), and the component that captures the most variance is called the dominant principal component. The amount of variance retained decreases as we step down in order, i.e. PC1, PC2, PC3 … PCn.
Principal Component Analysis (PCA)

• PCA uses linear projection to transform data into the new feature space:
  • (a) original data in feature space
  • (b) centered data
  • (c) projecting a data vector x onto another vector
  • (d) direction of maximum variance of the projected coordinates
Principal Component Analysis (PCA)

• Let's consider a dataset with 2 features (e.g. height and weight) illustrated in a 2D space.
• Considering a PCA with 2 components aligning to the 2 features (in this case, no reduction is taking place), PCA projects every data point into a new coordinate system where every point has a new (x, y) value.
• We'll then break this matrix down into two separate components: direction and magnitude. We can then understand the "directions" of our data and their "magnitude" (i.e. how "important" each direction is).
• The snapshot, from the setosa.io applet, displays the two main directions in this data: the "red direction" and the "green direction".
• The "red direction" is the more important one as it is the principal component.
Linear Discriminant Analysis (LDA)

• LDA is a supervised ML technique (it requires target variables) used to distinguish two classes/groups (because it is linear).
• The critical principle of LDA is to optimise the separability between the two classes so as to identify them in the best way we can determine.
• LDA is similar to PCA in that it helps reduce dimensionality, but by constructing a new linear axis and projecting the data points onto that axis, it optimises the separability between established categories.
• LDA does not focus on finding the direction of maximum variance; it merely looks at which points/features/subspace offer the most discrimination between the classes.
PCA vs LDA

• Both PCA and LDA are linear transformation techniques that are commonly used for DR.
• PCA is an "unsupervised" algorithm, since it "ignores" class labels and its goal is to find the directions (the so-called principal components) that maximise the variance in a dataset.
• LDA is "supervised" and computes the directions ("linear discriminants") that will represent the axes that maximise the separation between multiple classes.

• Comparisons between classification accuracies for image recognition after using PCA or LDA show that PCA tends to outperform LDA if the number of samples per class is relatively small [1].
• In practice, it is also not uncommon to use both LDA and PCA in combination: e.g., PCA for dimensionality reduction followed by an LDA.

[1] A.M. Martinez and A.C. Kak, "PCA versus LDA": https://ieeexplore.ieee.org/document/908974?arnumber=908974
Experimenting PCA and LDA: data loading and pre-processing

This is the famous "Iris" dataset deposited on the UCI ML repository, which contains measurements for 150 iris flowers from 3 different species.

The 3 classes in the Iris dataset:
1) Iris-setosa (n=50)
2) Iris-versicolor (n=50)
3) Iris-virginica (n=50)

The 4 features of the Iris dataset:
1) sepal length in cm
2) sepal width in cm
3) petal length in cm
4) petal width in cm
Experimenting PCA and LDA: data loading and pre-processing

• Let's inspect the data; our objective is to see how we separate the features (x) and target (y).

Experimenting PCA and LDA: comparison between PCA and LDA
Experimenting PCA and LDA: comparison between PCA and LDA
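The comparison shown on these slides is a screenshot; a minimal sketch of the same idea, using sklearn's bundled copy of the Iris data rather than the UCI download for brevity:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# PCA is unsupervised: it ignores the class labels
X_pca = PCA(n_components=2).fit_transform(X_std)

# LDA is supervised: it uses the class labels to maximise class separation
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)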
Experimenting PCA and LDA: further PCA evaluation
Experimenting PCA and LDA: evaluating PCA
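A minimal sketch of one common way to evaluate PCA, reusing X_std from the previous sketch and looking at how much variance each component retains (the slide's own evaluation may differ):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)

# Proportion of variance explained by each principal component, and its running total
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))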
Applying PCA to Porto Seguro's dataset

• Separating the features (x) and target (y) from the train dataset

PCA vs LDA on Porto Seguro's dataset
Introducing Factor Analysis (FA)
Factor Analysis (FA)

• Factor Analysis (FA) is another unsupervised ML technique used for dimensionality reduction. This algorithm creates factors from the observed variables to represent the common variance, i.e. variance due to correlation among the observed variables.
  • The goal of FA is to describe the variability among correlated variables in terms of fewer variables called factors, based on the concept that multiple observed variables have similar patterns of responses because they are all associated with a latent (i.e. not directly measured) variable.

• FA and PCA are very similar as they both identify patterns in the correlations between variables. These patterns are used to infer the existence of underlying latent variables in the data. These latent variables are often referred to as factors, components, and dimensions.
The difference between FA and PCA

• The mathematics of FA and PCA are different.
• FA explicitly assumes the existence of latent factors underlying the observed data. PCA instead seeks to identify variables that are composites of the observed variables.
The difference between FA and PCA

• FA
  • Reduces a large number of variables into a smaller number of factors.
  • Puts maximum common variance into a common score.
  • Associates multiple observed variables with a latent variable.
  • Has the same number of factors as variables, where each factor contains a certain amount of the overall variance.
  • Eigenvalue: a measure of the variance that a factor explains for the observed variables. A factor with an eigenvalue < 1 explains less variance than a single observed variable.

• PCA
  • Seeks the linear combination of variables in order to extract the maximum variance.
  • Computes the Eigenvectors, which are the principal components of the dataset, and collects them in a projection matrix.
  • Each Eigenvector is associated with an Eigenvalue, which gives its magnitude.
  • Reduces the dataset into a smaller-dimensional subspace by dropping the less informative Eigenpairs.
Applying FA to Porto Seguro's dataset

• Here, we are using the Factor Analyzer package (https://pypi.org/project/factor-analyzer/), which can be installed using the following command line:
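A minimal sketch of installing and using the package; the number of factors and the rotation are illustrative choices, applied to the pre-processed features X:

# pip install factor-analyzer

from factor_analyzer import FactorAnalyzer

# Fit a factor analysis model on the pre-processed features X
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(X)

loadings = fa.loadings_                 # factor loadings of each variable
eigenvalues, _ = fa.get_eigenvalues()   # eigenvalues help choose the number of factors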
FA vs PCA on Porto Seguro's dataset
Feature Scaling

• Standard Scaler
• Normalize
Feature Scaling

• Feature scaling is an important part of data preparation.
• Often in a dataset, we see that some attributes contain numeric values where some values are very high and some are very low. This can cause issues in an ML system. To solve that problem, we put all values on the same scale.
• There are two methods to solve that problem: the Standard Scaler (standardisation) and Normalize (normalisation).

• In this case, we choose to use scikit-learn's Standard Scaler in our data preparation.
• The Standard Scaler rescales the attributes so that they have a mean of 0 and a variance of 1.
• The ultimate goal of performing standardisation is to bring all the features down to a common scale without distorting the differences in the range of the values.
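A minimal sketch of the Standard Scaler, assuming X holds the numeric features:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each rescaled feature now has (approximately) zero mean and unit variance
print(np.round(X_scaled.mean(axis=0), 3))
print(np.round(X_scaled.std(axis=0), 3))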
Feature Scaling
Feature Scaling: fit_transform() vs transform()

We use fit_transform() on the training data but transform() on the test data… why?

• The fit_transform() method calculates the mean and variance of each of the features present in the training data and scales the training data with them.
• The transform() method uses the same mean and variance as calculated from the training data to transform the test data.
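A minimal sketch of this train/test pattern, assuming X_train and X_test come from an earlier split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit_transform: learn the mean and variance from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the training mean and variance to scale the test data,
# so no information from the test set leaks into the preparation step
X_test_scaled = scaler.transform(X_test)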
Conclusion

• We have so far covered the entire workflow of data science, working all the way to the stage right before the development of an ML prediction model.
• This includes the exploratory data analysis (EDA) and data preparation for ML, covering the sourcing of raw data (in this case, the Porto Seguro dataset), data cleaning, feature engineering, feature selection (pre-fitting with an ML algorithm), dimensionality reduction and finally feature scaling.
• In conclusion, I advise that the EDA and data preparation material be studied alongside the supplied Jupyter notebook, as they were developed in conjunction.
End of Lesson
