FEATURE SELECTION & FEATURE EXTRACTION
13.1 Introduction
13.2 Dimensionality Reduction
13.2.1 Feature Selection
13.1 INTRODUCTION
Data sets are made up of numerous data columns, also referred to as data
attributes. These columns can be interpreted as the dimensions of an
n-dimensional feature space, and the data rows as points inside that space.
Viewing a dataset geometrically in this way often gives a better
understanding of it. In practice, many of these attributes are measurements
of the same underlying entity, and their presence can confuse the
algorithm's logic and change how well the model performs.
Input variables are the columns of data that are fed into a model in order to
predict a target variable. When the data is given in the form of rows and
columns, such as a spreadsheet, the term features is used interchangeably
with input variables. A large number of dimensions in the feature space
means that the volume of that space is enormous, so the points (data rows) in
that space form a small and non-representative sample of the space's
contents. The performance of machine learning algorithms can therefore
degrade when there are too many input variables; this degradation on data
with a large number of input attributes is referred to as the "curse of
dimensionality." As a consequence, one of the most common goals is to cut
down the number of input features. The process of decreasing the number of
dimensions that characterise a feature space is referred to as
"dimensionality reduction."
The usefulness of data mining can occasionally be hindered by an excessive
amount of information. Some of the data columns compiled for constructing
and testing a model offer no significant information to the model, and some
actually reduce the reliability and precision of the model.
For instance, suppose you want to build a model that forecasts the incomes
of people employed in their respective fields. Data columns such as
cellphone number, house number, and so on will not contribute any value to
the dataset and can therefore be omitted. Such irrelevant attributes
introduce noise into the data and affect the accuracy of the model. In
addition, because of this noise, the size of the model and the amount of
time and system resources required for model construction and scoring are
both increased.
At this point we need to put the concept of dimensionality reduction into
practice. This can be done in one of two ways: by feature selection or by
feature extraction. Both of these approaches are described in greater detail
below. Dimensionality reduction is one of the preprocessing phases of the
data mining process, and it can be beneficial in minimising the effects of
noise, correlation, and excessive dimensionality.
Some more examples are presented below to illustrate what dimensionality
reduction has to do with machine learning and predictive modelling:
A simple e-mail classification problem, in which we must decide whether or
not a given email is spam, can serve as a practical illustration of
dimensionality reduction. The features can include whether the email has a
generic subject line, the content of the email, whether it uses a template,
and so on. However, some of these features may overlap with one another.

A classification problem that involves humidity and rainfall can sometimes
be reduced to just one underlying feature, because of the strong
relationship between the two variables. In such circumstances the number of
features can be cut down.

A classification problem in three dimensions can be hard to visualise,
whereas a two-dimensional one can be mapped to a simple two-dimensional
space, and a one-dimensional problem to a simple line. This concept is
depicted in the diagram that follows, which shows how a three-dimensional
feature space can be split into two two-dimensional feature spaces, with the
number of features reduced even further if the features are found to be
related.
In the context of dimensionality reduction, techniques such as Principal
Component Analysis, Linear Discriminant Analysis and Singular Value
Decomposition are frequently used. In this unit we will discuss all of these
concepts related to dimensionality reduction.
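As a brief illustration (this code is not part of the original unit; the toy data and parameter choices are only assumptions for demonstration), all three techniques named above are available in scikit-learn:

import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy data: 100 samples, 5 features
y = rng.integers(0, 2, size=100)         # toy binary class labels

X_pca = PCA(n_components=2).fit_transform(X)            # unsupervised, max-variance axes
X_svd = TruncatedSVD(n_components=2).fit_transform(X)   # SVD-based reduction
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # supervised

PCA and SVD use only the input features, whereas LDA also uses the class labels, which is why it is passed y as well.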
[Figure: Dimensionality reduction: a three-dimensional feature space decomposed into two two-dimensional feature spaces, which can be reduced further if the features are found to be correlated.]
Accuracy = 87%

We will next use the Gender feature to train the model, and we obtain an
accuracy of 80%.
At this point, we select the variable that produced the most favourable
results. Looking at this table, the variable "Calories Burnt" alone gives an
accuracy of 87 percent, "Gender" gives 80 percent, and "Plays Sport" gives
85 percent. When the results are compared, the winner is, unsurprisingly,
Calories Burnt, so we select this variable first.
Accuracy = 88%
We obtain 91 percent accuracy when we combine Plays Sport with Calories
Burnt. We keep the variable that yields the greatest improvement, which
makes intuitive sense: combining Plays Sport with Calories Burnt gives a
better result, so we keep it and use it in our model. We keep repeating the
process until all the features have been considered for improving the model
performance. A short code sketch of this greedy procedure is given below.
Accuracy = 91%
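The greedy procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the DataFrame, the feature names and the logistic-regression scorer are assumptions, not part of the original example, and the loop stops once no remaining feature improves the cross-validated accuracy:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X: pd.DataFrame, y, estimator=None, cv=5):
    """Greedy forward selection: add the feature that most improves CV accuracy."""
    estimator = estimator or LogisticRegression(max_iter=1000)
    remaining, selected, best_score = list(X.columns), [], 0.0
    while remaining:
        # Score every candidate feature when added to the currently selected subset.
        scores = {f: cross_val_score(estimator, X[selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:   # no candidate improves accuracy: stop
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = scores[best_feature]
    return selected, best_score

On the example above, the first pass would select Calories Burnt (87 percent) and a later pass would add Plays Sport, raising the accuracy to 91 percent.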
[Figure: Original feature axes f1 and f2 with the principal component axes e1 and e2.]
When applying the PCA method, the following are the primary steps to be
followed:
1. Obtain the dataset.
2. Compute the mean vector (μ).
3. Subtract the mean from each of the given data vectors.
4. Compute the covariance matrix.
5. Determine the eigenvectors and eigenvalues of the covariance matrix.
6. Form a feature vector by deciding which components are the major ones,
i.e. the principal components.
7. Derive the new data set by projecting the data onto the selected
eigenvectors. As a result, we keep a smaller number of eigenvectors, and
some information may be lost in the process, but the retained eigenvectors
should preserve the most significant variances. (A short code sketch of
these steps follows below.)
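The following is a minimal NumPy sketch of these seven steps (an illustrative implementation, not the only possible one). X is assumed to hold one data point per row, and the covariance matrix uses a 1/n factor so that the numbers match the worked example later in this unit:

import numpy as np

def pca(X, k):
    mu = X.mean(axis=0)                      # step 2: mean vector
    Xc = X - mu                              # step 3: subtract the mean
    cov = Xc.T @ Xc / len(X)                 # step 4: covariance matrix (1/n)
    eigvals, eigvecs = np.linalg.eigh(cov)   # step 5: eigen values / eigen vectors
    order = np.argsort(eigvals)[::-1][:k]    # step 6: keep the k largest components
    W = eigvecs[:, order]                    # feature vector (principal components)
    return Xc @ W, W, mu                     # step 7: project the data onto W

# Example with the two-dimensional patterns used in Problem-01 below:
X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
Z, W, mu = pca(X, k=1)                       # Z holds the 1-D projected data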
Merits of Dimensionality Reduction
It helps to compress data, which reduces the amount of space needed to
store it and the amount of time it takes to process it.
If there are any redundant features, it also helps to get rid of them.
Limitations of Dimensionality Reduction
You might lose some data.
PCA fails when the mean and covariance are not enough to describe a
dataset.
We do not know in advance how many principal components to keep; in
practice, we follow some rules of thumb (one common rule is sketched
below).
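One widely used rule of thumb, given here only as an illustrative assumption rather than something prescribed by this unit, is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold (for example 95 percent):

import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)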
Below is a practice question for Principal Component Analysis (PCA):

Problem-01: The given data are 2, 3, 4, 5, 6, 7; 1, 5, 3, 6, 7, 8. Using
the PCA algorithm, compute the principal component.

OR

Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7)
and (7, 8). Using the PCA algorithm, compute the principal component.

OR

Compute the principal component of the following data:
x1 = (2, 1), x2 = (3, 5), x3 = (4, 3), x4 = (5, 6), x5 = (6, 7), x6 = (7, 8)

Solution-
The mean vector is obtained by averaging the coordinates:
μ = ( (2+3+4+5+6+7)/6 , (1+5+3+6+7+8)/6 )
Thus, Mean vector (μ) = (4.5, 5)
Step-03:
On subtracting the mean vector (μ) from the given feature vectors:
x1 - μ = (2 - 4.5, 1 - 5) = (-2.5, -4)
and similarly for the others.
The feature vectors (xi - μ) generated after subtraction are the columns of
[ -2.5  -1.5  -0.5   0.5   1.5   2.5 ]
[ -4     0    -2     1     2     3   ]
Step-04:
Now to find the covariance matrix:

Covariance Matrix = (1/n) Σ (xi - μ)(xi - μ)^T

(Here a 2x2 matrix is written row-wise as [row1; row2].)

m1 = (x1 - μ)(x1 - μ)^T = [-2.5; -4] [-2.5  -4] = [6.25  10; 10  16]
m2 = (x2 - μ)(x2 - μ)^T = [-1.5; 0] [-1.5  0] = [2.25  0; 0  0]
m3 = (x3 - μ)(x3 - μ)^T = [-0.5; -2] [-0.5  -2] = [0.25  1; 1  4]
m4 = (x4 - μ)(x4 - μ)^T = [0.5; 1] [0.5  1] = [0.25  0.5; 0.5  1]
m5 = (x5 - μ)(x5 - μ)^T = [1.5; 2] [1.5  2] = [2.25  3; 3  4]
m6 = (x6 - μ)(x6 - μ)^T = [2.5; 3] [2.5  3] = [6.25  7.5; 7.5  9]

Covariance Matrix = (1/6) [17.5  22; 22  34] = [2.92  3.67; 3.67  5.67]
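As a quick numerical cross-check of Step-04 (a side sketch, not part of the original solution), the same covariance matrix can be computed directly:

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
Xc = X - X.mean(axis=0)          # subtract the mean vector (4.5, 5)
cov = Xc.T @ Xc / len(X)         # (1/6) * sum of the outer products above
print(np.round(cov, 2))          # [[2.92 3.67]
                                 #  [3.67 5.67]]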
Step-05:
Eigen values and eigen vectors of the covariance matrix:

| 2.92 - λ      3.67     |
| 3.67          5.67 - λ |  = 0

From here,
(2.92 - λ)(5.67 - λ) - (3.67 x 3.67) = 0
16.56 - 2.92λ - 5.67λ + λ² - 13.47 = 0
λ² - 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ = 8.22, 0.38
Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigen value is very small compared to the first, so the
second eigen vector can be left out.
The eigen vector corresponding to the greatest eigen value is the principal
component of the given data set. So, we find the eigen vector corresponding
to the eigen value λ1. We use the following equation to find the eigen
vector:
MX = λX
where M = covariance matrix, X = eigen vector, and λ = eigen value.
Substituting the values in the above equation, we get:
[2.92  3.67; 3.67  5.67] [x1; x2] = 8.22 [x1; x2]

From the first row, 2.92 x1 + 3.67 x2 = 8.22 x1, i.e. 3.67 x2 = 5.30 x1,
which gives

x1 = 0.69 x2

Choosing x2 = 3.67 gives x1 ≈ 2.55, so the principal component (eigen
vector) for the given data set is approximately (2.55, 3.67).

[Figure: The given data points plotted in the feature plane (both axes from 1 to 8) together with the principal component direction.]
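The eigen decomposition of Step-05 can be checked in the same way (again a side sketch; the small difference from 8.22 comes from the rounding used in the hand computation):

import numpy as np

cov = np.array([[2.92, 3.67],
                [3.67, 5.67]])
eigvals, eigvecs = np.linalg.eigh(cov)   # eigen values returned in ascending order
print(eigvals[::-1])                     # about 8.21 and 0.38
print(eigvecs[:, -1])                    # principal direction, proportional to (2.55, 3.67) up to sign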
Problem-02: Use the PCA algorithm to transform the pattern (2, 1) onto the
eigen vector obtained in the previous question.
Solution-
The given feature vector is (2, 1), i.e. x = [2; 1].
The feature vector gets transformed to:
= Transpose of Eigen vector x (Feature Vector - Mean Vector)
= [2.55  3.67] ( [2; 1] - [4.5; 5] )
= [2.55  3.67] [-2.5; -4]
= (2.55)(-2.5) + (3.67)(-4)
= -21.055
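The transformation in Problem-02 is a single dot product, so it can be verified in a couple of lines (a side check using the rounded eigen vector from Problem-01):

import numpy as np

e = np.array([2.55, 3.67])        # eigen vector (principal component) from Problem-01
x = np.array([2.0, 1.0])          # pattern to transform
mu = np.array([4.5, 5.0])         # mean vector
print(e @ (x - mu))               # approximately -21.055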
[Figure: The two classes of points plotted in the (X1, X2) plane (axes 0 to 10) together with the LDA projection direction W_LDA.]
Solution:
To understand the working of LDA, let us take an example and work it out
step by step.
Step 1: Compute the within-class scatter matrix (Sw). Sw measures how the
data is scattered within each class, and is obtained from the class means μ1
and μ2, where μ1 and μ2 are the means of classes C1 and C2 respectively.
Now, Sw is given by
Sw = S1 + S2
where S1 is the covariance matrix for class C1 and S2 is the covariance
matrix for class C2.
Let us find the covariance matrices S1 and S2 of each class.
Let’s find the covariance matrices S1 & S2 of each class
S1 = x (x ) (x )T
1
c1
where, μ1 is the mean of class C1, which is computed by averaging the coordinates
of dataset X1
(Coordinates of X1)
4+2+2+3+4 1+4+3+6+4
1 , ;
5 5
1 3.00 ; 3.60
S =
1
(x ) (x )
1 1
T
= [3
1
3 60]
x w 1
(x ) = 1 1 1 0 1
1 1
0.4
2.6 0.4 0.6 2.4
Now, for each x in X1 we calculate (x - μ1)(x - μ1)^T, so we will have 5
such matrices. Solving them one by one, i.e. for each deviation vector
(x - μ1) we form the outer product (x - μ1)(x - μ1)^T:

[ 1; -2.6] [ 1  -2.6] = [1  -2.6; -2.6  6.76]    (first matrix)
[-1;  0.4] [-1   0.4] = [1  -0.4; -0.4  0.16]
[-1; -0.6] [-1  -0.6] = [1   0.6;  0.6  0.36]
[ 0;  2.4] [ 0   2.4] = [0   0;    0    5.76]
[ 1;  0.4] [ 1   0.4] = [1   0.4;  0.4  0.16]

Averaging these five matrices gives S1 = [0.8  -0.4; -0.4  2.64].
Similar to PCA, we find the projection vector using the eigen vector having
the largest eigen value:

Sw⁻¹ SB V = λV        ... (a)        (V is the projection vector)

i.e. | Sw⁻¹ SB - λI | = 0

| 11.89 - λ      8.81     |
| 5.08           3.76 - λ |  = 0

Solving for the largest eigen value and substituting it in equation (a), we
get

V = [V1; V2] = [0.91; 0.39]
Or we can directly solve
V = [V1; V2] = Sw⁻¹ (μ1 - μ2) = [0.384  0.032; 0.032  0.192] [-5.4; -4]
which, after normalising, is proportional to the same direction (0.91, 0.39).
Note: Sw⁻¹ is found using the formula
[a  b; c  d]⁻¹ = (1/(ad - bc)) [d  -b; -c  a]
So, Sw = [2.64  -0.44; -0.44  5.28]
and
Sw⁻¹ = (1/13.74) [5.28  0.44; 0.44  2.64] = [0.384  0.032; 0.032  0.192]
This is the projection vector corresponding to the highest eigen value.
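Pulling the LDA steps above together, the projection direction can be computed with a short NumPy routine. This is a sketch that follows the unit's conventions (covariances with a 1/n factor); class C1 is taken from the worked example, while the coordinates of the second class are not reproduced in this extract and must be supplied by the reader:

import numpy as np

def lda_direction(X1, X2):
    """Fisher LDA projection direction for two classes given as (n_i, d) arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)   # class-1 covariance (1/n factor)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)   # class-2 covariance (1/n factor)
    Sw = S1 + S2                               # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)         # w = Sw^(-1) (mu1 - mu2)
    return w / np.linalg.norm(w)               # unit-length projection vector

# Class C1 from the worked example; pass the second class of points as X2.
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)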