Unit 2 ML 2019
Prepared by Prof. P.D.Bendale
Unit-2 Feature Engineering (7 Hrs)
• Concept of Feature, Preprocessing of data: Normalization and Scaling, Standardization, Managing missing values
• Introduction to Dimensionality Reduction, Principal Component Analysis (PCA)
• Feature Extraction: Kernel PCA, Local Binary Pattern
• Introduction to various Feature Selection Techniques, Sequential Forward Selection, Sequential Backward Selection
• Statistical feature engineering: count-based, Length, Mean, Median, Mode etc. based feature vector creation
• Multidimensional Scaling, Matrix Factorization Techniques
Concept of Feature Preprocessing of data:
Normalization
• Normalization is used when we want to bound our values between two numbers, typically [0, 1] or [-1, 1].
• It changes the values of numeric columns in the dataset so that they use a common scale.
• It is needed only when the features of a machine learning model have different ranges, as in the sketch below.
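A minimal sketch (not from the original slides) of min-max normalization with scikit-learn's MinMaxScaler; the age/salary values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different ranges: age (years) and salary.
X = np.array([[25, 20000],
              [35, 45000],
              [50, 90000]], dtype=float)

# Rescale every column to [0, 1]: x' = (x - min) / (max - min).
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X)
print(X_norm)   # both columns now lie between 0 and 1
```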
Standardization
• Standardization (z-score scaling) rescales each feature so that it has a mean of 0 and a standard deviation of 1, using z = (x - μ) / σ.
• Unlike normalization, it does not bound values to a fixed range and is less affected by extreme values.
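A minimal sketch of standardization with scikit-learn's StandardScaler on the same kind of made-up age/salary data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 20000],
              [35, 45000],
              [50, 90000]], dtype=float)

# z = (x - mean) / std, computed per column.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0))  # ~0 for each column
print(X_std.std(axis=0))   # ~1 for each column
```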
When to use Normalization or Standardization?
• Normalization is typically preferred when the data does not follow a Gaussian distribution, or when the algorithm needs inputs in a bounded range (e.g. k-nearest neighbours, neural networks).
• Standardization is typically preferred when the data is approximately Gaussian, or when the algorithm assumes zero-mean, unit-variance inputs (e.g. PCA).
Managing Missing Values
Missing Completely At Random (MCAR)
• Missing values are completely independent of the other data; there is no pattern.
• In the case of MCAR, the data could be missing due to human error, system/equipment failure, loss of a sample, or some unsatisfactory technicalities while recording the values.
• There is no relationship between the missing data and any other values, observed or unobserved.
• For example, suppose some overdue-book records in a library's computer system are missing. The reason might be human error, such as the librarian forgetting to type in the values. The missing overdue values are therefore not related to any other variable/data in the system.
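A minimal sketch of two common ways to manage such missing values, using a made-up version of the overdue-books table (dropping incomplete rows, or imputing with scikit-learn's SimpleImputer):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up library records with some overdue counts missing at random.
df = pd.DataFrame({"member_id": [1, 2, 3, 4],
                   "overdue_books": [2, np.nan, 0, np.nan]})

# Option 1: drop rows that contain missing values.
dropped = df.dropna()

# Option 2: fill missing values with a summary statistic (here, the mean).
imputer = SimpleImputer(strategy="mean")
df["overdue_books"] = imputer.fit_transform(df[["overdue_books"]]).ravel()
print(df)
```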
Introduction to Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of input features in a dataset while retaining as much of the relevant information as possible.
• There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the original set of variables (features) that can be used to model the problem. It usually involves three approaches:
– Filter
– Wrapper
– Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
Methods of Dimensionality Reduction
Feature Selection
• Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy.
• Three methods are used for feature selection:
1. Filter Methods
• In this method, the dataset is filtered, and a subset that contains only the relevant features is taken.
• Some common techniques of filter methods are (see the sketch after this list):
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.
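A minimal sketch of a filter method, scoring features with the chi-square test via scikit-learn's SelectKBest (Iris is used only as a convenient example dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)     # 4 numeric, non-negative features

# Filter method: score each feature against the target with the
# chi-square test and keep the 2 highest-scoring features.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)    # chi-square score per feature
print(X_new.shape)         # (150, 2) -> only 2 features kept
```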
Feature Selection
2. Wrapper Methods
• In this method, some features are fed to the ML model and its performance is evaluated.
• The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to work with.
• Some common techniques of wrapper methods are (a forward-selection sketch follows this list):
• Forward Selection
• Backward Selection
• Bi-directional Elimination
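A minimal sketch of sequential forward selection with scikit-learn's SequentialFeatureSelector, wrapping a k-NN classifier on the Iris data (both choices are illustrative, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Wrapper method: forward selection adds one feature at a time, keeping
# the feature whose addition most improves cross-validated accuracy.
model = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(model, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the selected features
# direction="backward" gives sequential backward selection instead.
```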
Feature Selection
3. Embedded Methods
• Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature.
• Some common techniques of embedded methods are (a LASSO-based sketch follows this list):
• LASSO
• Elastic Net
• Ridge Regression, etc.
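A minimal LASSO-based sketch, assuming scikit-learn and its built-in diabetes dataset (chosen only for illustration); the L1 penalty performs selection during training by zeroing out unhelpful coefficients:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)    # LASSO is scale-sensitive

# Embedded method: the L1 penalty drives the coefficients of
# unimportant features to exactly zero while the model is trained.
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)   # indices of surviving features
print(lasso.coef_)
print("kept features:", selected)
```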
Feature Extraction:
• Feature extraction transforms the original features into a new, smaller set of features that captures most of the useful information in the data.
Principal Component Analysis
• Principal Component Analysis (PCA) is an unsupervised technique that converts a set of correlated features into a smaller set of linearly uncorrelated features, called principal components, which capture most of the variance in the data.
Principal Component Analysis
• The PCA algorithm is based on mathematical concepts such as (a from-scratch sketch follows this list):
• Variance and Covariance
• Eigenvalues and Eigenvectors
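A minimal NumPy sketch of these ideas on made-up data: centre the features, build the covariance matrix, and take its eigenvalues and eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy data: 100 samples, 3 features

# 1. Centre the data (zero mean per feature).
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen-decomposition: eigenvectors are the principal directions,
#    eigenvalues are the variance explained along each direction.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top-2 principal components.
X_pca = X_centred @ eigvecs[:, :2]
print(eigvals)            # variance captured by each component
print(X_pca.shape)        # (100, 2)
```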
Principal Component Analysis
An Advantage of Variance
• One of the primary advantages of variance is that it treats all deviations from the mean of the data set in the same way, regardless of direction.
A Disadvantage of Variance
• Variance gives added weight to numbers that are far from the mean, i.e. outliers.
• Squaring these deviations can at times result in skewed interpretations of the data set as a whole, as the short example below illustrates.
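A tiny NumPy illustration of this outlier sensitivity, using made-up numbers:

```python
import numpy as np

data = np.array([10, 11, 9, 10, 12], dtype=float)
with_outlier = np.append(data, 40)       # one value far from the mean

# Squared deviations give the outlier a disproportionately large weight.
print(np.var(data))           # small variance for the tight cluster
print(np.var(with_outlier))   # variance jumps sharply with one outlier
```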
Principal Component Analysis
• The eigenvalues of a matrix A are found from the characteristic equation |A − λI| = 0, where I is the identity matrix.
• Solving this equation for the example matrix gives the two eigenvalues λ = 0 and λ = 4.
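The example matrix itself did not survive extraction; the 2x2 matrix below is a hypothetical one chosen only because its eigenvalues are indeed 0 and 4, so NumPy can reproduce the result:

```python
import numpy as np

# Hypothetical symmetric matrix whose eigenvalues happen to be 0 and 4;
# the matrix from the original slide is not shown in the extracted text.
A = np.array([[2.0, 2.0],
              [2.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)      # solves |A - lambda*I| = 0
print(eigvals)                           # -> 4 and 0 (order may vary)
print(eigvecs)                           # corresponding eigenvectors
```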
Principal Component Analysis
Advantages of Dimensionality Reduction
• It helps in data compression and hence reduces storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define the dataset.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.
• Some common terms used in the PCA algorithm:
• Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.
• Correlation: It signifies how strongly two variables are related to each other, i.e. if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that they are directly proportional to each other.
• Orthogonal: It means that the variables are not correlated with each other, and hence the correlation between each pair of variables is zero.
• Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
• Covariance Matrix: A matrix containing the covariances between each pair of variables is called the covariance matrix.
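A minimal scikit-learn sketch tying these terms together (Iris is used only as a convenient 4-column dataset; the covariance and eigen-decomposition are handled internally by PCA):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)        # dimensionality = 4 columns
X = StandardScaler().fit_transform(X)    # PCA works on centred data

# Keep 2 orthogonal (uncorrelated) principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance per component
```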
Kernel PCA
• PCA is a linear method, so it can only be applied effectively to datasets that are linearly separable; for such datasets it does an excellent job.
• But if we apply it to non-linear datasets, the result may not be the optimal dimensionality reduction.
• Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it becomes linearly separable. This is similar to the idea behind Support Vector Machines.
• There are various kernels, such as linear, polynomial, and Gaussian (RBF); a sketch follows.
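A minimal scikit-learn sketch on a made-up non-linear dataset (concentric circles); the RBF kernel and gamma value are illustrative choices, not from the slides:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA with a Gaussian (RBF) kernel implicitly maps the data to a
# higher-dimensional feature space before extracting components.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)   # (400, 2); the two circles become roughly separable
```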
Local Binary Pattern
• The strength of Local Binary Pattern (LBP) lies in its ability to capture tiny differences in texture and topography and to identify key features with which we can then differentiate between images of the same type, with no painstaking labeling required.
• The goal of LBP is to encode geometric features of an image by detecting edges, corners, raised or flat areas and hard lines, allowing us to generate a feature-vector representation of an image, or group of images.
Local Binary Pattern
• We can determine the level of similarity between our target representation and an unseen image, and can calculate the probability that the presented image is of the same variety or type as the target image.
• LBP can be split into 4 key steps:
• Simplification
• Binarisation
• PDF (probability density function) calculation
• Comparison (of the above functions)
Local Binary Pattern
Simplification
• This is our data preprocessing step. In essence, it is our first step in dimensionality reduction, which allows the algorithm to focus purely on local differences in luminance rather than on any other potential features.
• Therefore, we first convert the image into a single-channel (typically greyscale) representation.
Binarisation
• Next, we calculate the relative local luminance changes. This allows us to create a local, low-dimensional, binary representation of each pixel based on luminance.
Local Binary Pattern
• For each comparison, we output a binary value of 0 or 1, depending on whether the central pixel's intensity (scalar value) is greater or less (respectively) than the comparison pixel's.
• This forms a k-bit binary value, which can then be converted to a base-10 number, forming a new intensity for that given pixel; a sketch using scikit-image follows.
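A minimal sketch of the simplification, binarisation and PDF steps using scikit-image's local_binary_pattern; the sample image and the (P, R) parameters are illustrative choices:

```python
import numpy as np
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from skimage.util import img_as_ubyte

# Simplification: single-channel (greyscale) image.
image = img_as_ubyte(rgb2gray(data.astronaut()))

# Binarisation: compare each pixel with P neighbours on a circle of
# radius R, producing a k-bit code converted to a base-10 intensity.
P, R = 8, 1
lbp = local_binary_pattern(image, P, R, method="uniform")

# PDF calculation: the normalised histogram of LBP codes acts as the
# texture feature vector; two images can then be compared by their PDFs.
hist, _ = np.histogram(lbp, bins=np.arange(0, P + 3), density=True)
print(hist)
```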
Statistical feature engineering: Feature Vector Creation
• Statistical feature engineering creates new features from simple statistics of the existing data, such as counts, lengths, means, medians, and modes, and assembles them into a feature vector.
Statistical feature engineering:
• The features may represent, as a whole, one mere pixel or an entire image. The granularity depends on what someone is trying to learn or represent about the object. For example, you could describe a 3-dimensional shape with a feature vector indicating its height, width, depth, etc.; a count/mean/median/mode-based sketch follows.
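A minimal pandas sketch of count-, mean-, median-, mode- and length-based feature creation from a made-up transaction table (all column names are illustrative):

```python
import pandas as pd

# Made-up transaction log: one row per purchase.
df = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B", "C"],
    "item":     ["pen", "book", "pen", "bag", "pen", "book"],
    "amount":   [10, 250, 12, 900, 11, 240],
})

# Count-, mean- and median-based features, one feature vector per customer.
features = df.groupby("customer")["amount"].agg(
    purchase_count="count",
    amount_mean="mean",
    amount_median="median",
)
# Mode-based feature: the most frequently bought item.
features["favourite_item"] = df.groupby("customer")["item"].agg(
    lambda s: s.mode().iloc[0])
# Length-based feature: length of the customer identifier.
features["name_length"] = features.index.str.len()
print(features)
```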