
FEATURE SELECTION AND EXTRACTION

13.1 Introduction
13.2 Dimensionality Reduction
13.2.1 Feature Selection
13.2.2 Feature Extraction
13.3 Principal Component Analysis
13.4 Linear Discriminant Analysis
13.5 Singular Value Decomposition
13.6 Summary
13.7 Solutions/Answers
13.8 Further Readings

13.1 INTRODUCTION
Data sets are made up of numerous data columns, also referred to as data attributes. These columns can be interpreted as the dimensions of an n-dimensional feature space, and the data rows can be interpreted as points inside that space. Viewing a dataset geometrically in this way often gives a better understanding of it. In practice, several of these attributes may measure closely related aspects of the same entity, and their combined presence can confuse the logic of a learning algorithm and change how well the model performs.
Input variables are the columns of data that are fed into a model in order to predict a target variable. When the data is arranged in rows and columns, as in a spreadsheet, "features" is simply another term for these input variables. A large number of dimensions in the feature space implies that the volume of that space is enormous, so the points (data rows) in it form a small and unrepresentative sample of the space's contents. The performance of machine learning algorithms therefore tends to degrade when there are too many input variables; this phenomenon is referred to as the "curse of dimensionality". As a consequence, one of the most common goals is to cut down the number of input features. The process of decreasing the number of dimensions that characterise a feature space is referred to as "dimensionality reduction".
An excessive amount of information can sometimes hinder the usefulness of data mining. Some of the data columns compiled for building and testing a model offer no information that is significant to the model, and some actually reduce its reliability and precision.
For instance, suppose you want to build a model that forecasts the incomes of people already employed in their respective fields. Columns such as cellphone number or house number contribute no real value to the dataset and can therefore be omitted. Such irrelevant attributes introduce noise into the data and affect the accuracy of the model. Because of this noise, the size of the model as well as the time and system resources required for model construction and scoring all increase.
At this point we need to put the concept of dimensionality reduction into practice. This can be done in one of two ways: by feature selection or by feature extraction. Both approaches are described in greater detail below. Dimension reduction is one of the preprocessing steps of the data mining process, and it can help minimise the effects of noise, correlation, and excessive dimensionality.
Some further examples are presented below to show what dimensionality reduction has to do with machine learning and predictive modelling:
 A simple e-mail classification problem, in which we must decide whether or not a given message is spam, can serve as a practical illustration of dimensionality reduction. The features can include whether the email has a standard subject line, the content of the email, whether it uses a template, and so on. Some of these features, however, may overlap with one another.
 A classification problem that involves both humidity and rainfall can sometimes be reduced to a single underlying feature, because the two variables are strongly correlated. In such circumstances the number of features can be cut down.
 A classification problem in three dimensions can be difficult to visualise, whereas it may be possible to map it to a two-dimensional space, and further to a one-dimensional line. The diagram that follows depicts this idea: a three-dimensional feature space is broken down into lower-dimensional feature spaces, with the number of features reduced even further if some of them turn out to be related.
In the context of dimensionality reduction, techniques such as Principal Component Analysis, Linear Discriminant Analysis and Singular Value Decomposition are frequently used. In this unit we will discuss all of these concepts related to dimensionality reduction.

(Figure: a feature space described by X and Y being reduced to lower-dimensional representations.)

13.2 DIMENSIONALITY REDUCTION


Both data mining and machine learning methodologies face processing challenges when working with large amounts of data (many attributes). The dimensions of the feature space used by the approach, often referred to as the model attributes, play the most important role: processing algorithms become more difficult and time-consuming as the dimensionality of the processing space increases.
These elements, also known as the model attributes, are the fundamental qualities of the data, and they can be either variables or features. When there are more features, it is harder to visualise them all, and working with the training set becomes more complex as well. The complexity increases further when a significant number of the features are correlated, which can make the classification unreliable. In such circumstances, strategies for decreasing the number of dimensions can prove highly beneficial. In a nutshell, "the process of making a set of principal variables from a huge number of random variables is what is referred to as dimension reduction." When conducting data mining, dimension reduction can be a helpful preprocessing step to lessen the negative effects of noise, correlation, and excessive dimensionality.
Dimension reduction can be accomplished in two ways:
 Feature selection: In this approach, a subset of the complete set of variables is selected, so the number of attributes used to describe the problem is narrowed down. It is normally done in one of three ways:
○ Filter method
○ Wrapper method
○ Embedded method
 Feature extraction: It takes data from a space with many dimensions and transforms it into another space with fewer dimensions.
13.2.1 Feature Selection
Feature selection is the process of selecting some attributes from a given collection of prospective features and discarding the rest. It is done for one of two reasons: to keep a limited number of characteristics in order to prevent overfitting, or to avoid features that are redundant or irrelevant. For data scientists, the ability to pick features is a vital asset, and understanding how to choose the most relevant features to analyse is essential to the success of a machine learning algorithm. Features that are irrelevant, redundant, or noisy can contaminate an algorithm and have a detrimental impact on learning performance, accuracy, and computing cost. The importance of feature selection will only increase as the size and complexity of typical datasets continue to grow.
Feature Selection Methods: Feature selection methods can be divided into two categories: supervised methods, which are appropriate for labelled data, and unsupervised methods, which are appropriate for unlabelled data. Both kinds of approach are commonly grouped into filter methods, wrapper methods, embedded methods, and hybrid methods; a short scikit-learn sketch of the first three appears after the descriptions below.
 Filter methods: Filter methods choose features based on statistical measures rather than on cross-validated feature selection performance. Using a chosen metric, irrelevant attributes are identified and recursive feature selection is performed. Filter methods can be either univariate, in which an ordered ranking of the features is produced to help choose the final subset, or multivariate, in which the relevance of the features as a whole is evaluated to find features that are redundant or unimportant.
 Wrapper methods: Wrapper methods treat the choice of a set of features as a search problem: candidate subsets are prepared, evaluated with a model, and compared with other subsets. This approach makes it easier to detect possible interactions between variables. Wrapper methods focus on the subsets of features that improve the quality of the results from the learning algorithm used for the selection. Popular examples are Boruta feature selection and forward feature selection.
 Embedded methods: Embedded approaches incorporate feature selection as an integral component of the learning algorithm itself, so classification and feature selection take place simultaneously within the method. At each iteration of model training, the characteristics that make the greatest contribution are retained. Common embedded approaches include LASSO feature selection, random forest feature selection, and decision tree feature selection.
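As a rough illustration of the three families described above, the following scikit-learn sketch applies a filter, a wrapper, and an embedded selector to a synthetic dataset; the dataset, the chosen estimators, and the parameter values are illustrative assumptions, not prescriptions from this unit.

# Illustrative sketch of the three feature selection families (assumed synthetic data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                        SequentialFeatureSelector, SelectFromModel)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter: rank features with a univariate statistic (ANOVA F-score) and keep the top k.
filter_sel = SelectKBest(score_func=f_classif, k=4).fit(X, y)
print("filter keeps:", np.flatnonzero(filter_sel.get_support()))

# Wrapper: forward search that repeatedly adds the feature improving CV accuracy most.
wrapper_sel = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                        n_features_to_select=4, direction="forward").fit(X, y)
print("wrapper keeps:", np.flatnonzero(wrapper_sel.get_support()))

# Embedded: L1-regularised logistic regression zeroes out weak coefficients while training.
embedded_sel = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5)).fit(X, y)
print("embedded keeps:", np.flatnonzero(embedded_sel.get_support()))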
Among all of these approaches, the most conventional is forward feature selection.
Forward feature selection: The first step of forward feature selection is to evaluate each individual feature on its own and choose the one that yields the best model. After that, every combination of the selected feature with one additional feature is evaluated and a second feature is selected, and so on, until the required number of features has been chosen. The operation of the forward feature selection algorithm is described below.

The procedure for carrying out forward feature selection is as follows (a Python sketch of the loop appears after the steps):


1. Train the model with each feature being treated as a separate entity, and
then evaluate its overall performance.
2. Select the variable that results in the highest level of performance.
3. Carry on with the process while gradually introducing each variable.
4. The variable that produced the greatest amount of improvement is the one
that gets kept.
5. Perform the entire process once more until the performance of the model
does not show any meaningful signs of improvement.
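The following Python sketch implements the five-step loop above for a generic scikit-learn style model. The helper name, the choice of classifier, and the minimum-gain threshold are assumptions made for illustration, and the accuracies it reports on the small fitness table below will not necessarily match the percentages quoted in the text, which come from an unspecified model.

# Minimal sketch of the forward selection loop (assumed model and threshold).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_feature_selection(df, features, target, model=None, min_gain=0.001):
    model = model or DecisionTreeClassifier(random_state=0)
    selected, best_score, remaining = [], 0.0, list(features)
    while remaining:
        # Steps 1-3: score each remaining feature together with those already kept.
        scores = {f: cross_val_score(model, df[selected + [f]], df[target], cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        # Steps 4-5: keep the best feature only while it still gives a meaningful improvement.
        if scores[best] - best_score < min_gain:
            break
        selected.append(best)
        best_score = scores[best]
        remaining.remove(best)
    return selected, best_score

# The fitness table from this section, with categorical columns encoded as 0/1.
data = pd.DataFrame({
    "Calories_burnt": [121, 230, 342, 70, 278, 146, 168, 231, 150, 190],
    "Gender":         [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],   # M=1, F=0
    "Plays_Sport":    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],   # Yes=1, No=0
    "Fitness_Level":  [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],   # Fit=1, Unfit=0
})
print(forward_feature_selection(data, ["Calories_burnt", "Gender", "Plays_Sport"], "Fitness_Level"))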
A fitness level prediction based on three independent variables is used below to show how forward feature selection works.

ID   Calories_burnt   Gender   Plays_Sport?   Fitness Level
1    121              M        Yes            Fit
2    230              M        No             Fit
3    342              F        No             Unfit
4    70               M        Yes            Fit
5    278              F        Yes            Unfit
6    146              M        Yes            Fit
7    168              F        No             Unfit
8    231              F        Yes            Fit
9    150              M        No             Fit
10   190              F        No             Fit
The first step in forward feature selection is therefore to train n models, one per feature, and judge how well each works on its own. With three independent variables we train three models, one for each of these three features. Suppose we train the model using only the Calories_burnt feature against the Fitness Level target variable and get an accuracy of 87 percent.
Next we use the Gender feature to train the model and obtain an accuracy of 80%.

Similarly, the Plays_Sport? variable gives an accuracy of 85%.

At this point we select the variable that produced the most favourable results. Looking at the table below, the variable Calories_burnt alone has an accuracy of 87 percent, Gender has an accuracy of 80 percent, and Plays_Sport? has an accuracy of 85 percent. Comparing the three, the winner is, unsurprisingly, Calories_burnt, so we select this variable.

Variable used     Accuracy
Calories_burnt    87.00%
Gender            80.00%
Plays_Sport?      85.00%
Next we repeat the previous step, but this time we add a single variable at a time while retaining the Calories_burnt variable that has already been selected. For example, adding Gender gives an accuracy of 88 percent.

Combining Plays_Sport? with Calories_burnt instead gives an accuracy of 91 percent. The variable that yields the greatest improvement is kept, which makes intuitive sense: since Plays_Sport? combined with Calories_burnt gives the better result, we keep it in the model. We keep repeating this process until adding further features no longer improves the model's performance.


13.2.2 Feature Extraction


Feature extraction is the process of reducing the amount of resources needed to describe a large amount of data. One of the main problems with complicated data analysis is the large number of variables involved. A large number of variables requires a lot of memory and processing power, and it can also cause a classification algorithm to overfit the training examples and fail to generalise to new samples. Feature extraction is a broad term for the different ways of combining variables to get around these problems while still giving a faithful description of the data. Many machine learning practitioners consider properly extracted features to be the key to building good models. The features must represent the information in the data in a form that suits the algorithm that will be used to solve the problem. Some "inherent" features can be taken straight from the raw data, but most of the time we need to derive "relevant" features from these "inherent" ones before the problem can be solved.
In simple terms "feature extraction." can be described as a technique for
Defining a set of features, or visual qualities, that best show the information.
Feature Extraction Techniques such as: PCA, ICA, LDA, LLE, t-SNE and
AE. are some of the common examples in machine learning.
Feature extraction fulfils the following requirement: it takes the raw attributes and turns them into useful information by reformatting, combining, and transforming the primary features into new ones, until a new set of data is created that machine learning models can use to reach their goals.
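As a small, assumed illustration of this idea, the sketch below uses PCA (covered in Section 13.3) as a feature extractor: four noisy, correlated raw columns are transformed into two new combined features. The synthetic data is an assumption for demonstration only.

# Feature extraction sketch: derive a smaller set of new features from raw columns.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                                 # two underlying factors
X_raw = np.column_stack([base[:, 0], 2 * base[:, 0],             # four observed columns that are
                         base[:, 1], base[:, 1] - base[:, 0]])   # noisy mixtures of the factors
X_raw += rng.normal(scale=0.05, size=X_raw.shape)

extractor = PCA(n_components=2)
X_new = extractor.fit_transform(X_raw)                 # the new, lower-dimensional feature set
print(X_raw.shape, "->", X_new.shape)                  # (100, 4) -> (100, 2)
print("variance retained:", round(extractor.explained_variance_ratio_.sum(), 3))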
Methods of Dimensionality Reduction: The following are two well-known and widely used dimension reduction techniques:
 Principal Component Analysis (PCA)
 Fisher Linear Discriminant Analysis (LDA)
The reduction of dimensionality can be linear or non-linear, depending on the method used. The most common linear method is Principal Component Analysis, or PCA.
Check Your Progress - 1
Qn1. Define the term feature selection.
Qn2. What is the purpose of feature extraction in machine learning?
Qn3. Expand the following terms: PCA, LDA, GDA.
Qn4. Name the components of dimensionality reduction.

13.3 PRINCIPAL COMPONENT ANALYSIS


Karl Pearson was the first to propose this technique. It is based on the idea that when data from a higher-dimensional space is mapped into a lower-dimensional space, the lower-dimensional space should retain the maximum variance of the data. In simple terms, principal component analysis (PCA) is a way to extract the important variables (in the form of components) from a large set of variables in a data set. It tends to find the direction in which the data is most spread out. PCA is most useful when you have data with three or more dimensions.

(Figure: data points in the f1-f2 plane with the principal directions e1 and e2.)

When applying the PCA method, the following are the primary steps to be followed (a numpy sketch of these steps appears after the list):
1. Obtain the dataset.
2. Calculate the mean vector (µ).
3. Subtract the mean from each data point.
4. Compute the covariance matrix.
5. Determine the eigenvectors and eigenvalues of the covariance matrix.
6. Form a feature vector by deciding which components are the major ones, i.e. the principal components.
7. Create a new data set by projecting the data onto the chosen components. As a result we keep a smaller number of eigenvectors, and some information may be lost in the process, but the retained eigenvectors should preserve the most significant variances.
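The numpy sketch below follows the seven steps directly; the small data matrix is the example worked out later in this section, used here purely for illustration, and the covariance is divided by n to match the hand calculation in the text.

# Step-by-step PCA sketch in numpy (rows of X are data points).
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)  # step 1
mu = X.mean(axis=0)                          # step 2: mean vector
Xc = X - mu                                  # step 3: subtract the mean
C = (Xc.T @ Xc) / len(X)                     # step 4: covariance matrix (divided by n, as in the text)
eigvals, eigvecs = np.linalg.eigh(C)         # step 5: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]            # step 6: keep the component(s) with the largest eigenvalues
W = eigvecs[:, order[:1]]
Z = Xc @ W                                   # step 7: project the centred data onto the component
print("eigenvalues:", eigvals[order].round(2))   # approx. [8.21, 0.38]; the text rounds to 8.22
print("projected data:", Z.ravel().round(2))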

Merits of Dimensionality Reduction
 It helps to compress data, which reduces the amount of space needed to
store it and the amount of time it takes to process it.
 If there are any redundant features, it also helps to get rid of them.
Limitations of Dimensionality Reduction
 Some information may be lost.
 PCA fails when the mean and covariance are not enough to describe a dataset.
 We do not know in advance how many principal components should be kept; in practice, some rules of thumb are followed.
Below is a practice question for Principal Component Analysis (PCA):
Problem-01: Given the data 2, 3, 4, 5, 6, 7 (first coordinate) and 1, 5, 3, 6, 7, 8 (second coordinate), compute the principal component using the PCA algorithm.
OR
Consider the two-dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7) and (7, 8). Compute the principal component using the PCA algorithm.
OR
Compute the principal component of the following data:

     Class 1 values    Class 2 values
X    2, 3, 4           5, 6, 7
Y    1, 5, 3           6, 7, 8
Answer:
Step-01: Get the data.
The given feature vectors are:
x1 = (2, 1), x2 = (3, 5), x3 = (4, 3), x4 = (5, 6), x5 = (6, 7), x6 = (7, 8)
Step-02: Find the mean vector (µ).
Mean vector (µ) = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6) = (4.5, 5)
Thus, mean vector (µ) = (4.5, 5).
Step-03: Subtract the mean vector (µ) from each of the given feature vectors.
x1 − µ = (2 − 4.5, 1 − 5) = (−2.5, −4), and similarly for the others.
The feature vectors (xi − µ) generated after subtraction are:
(−2.5, −4), (−1.5, 0), (−0.5, −2), (0.5, 1), (1.5, 2), (2.5, 3)
Step-04: Compute the covariance matrix.
Covariance matrix = (1/n) * sum over i of (xi − µ)(xi − µ)^T
m1 = (x1 − µ)(x1 − µ)^T = [ 6.25   10   ]
                          [ 10     16   ]
m2 = (x2 − µ)(x2 − µ)^T = [ 2.25   0    ]
                          [ 0      0    ]
m3 = (x3 − µ)(x3 − µ)^T = [ 0.25   1    ]
                          [ 1      4    ]
m4 = (x4 − µ)(x4 − µ)^T = [ 0.25   0.5  ]
                          [ 0.5    1    ]
m5 = (x5 − µ)(x5 − µ)^T = [ 2.25   3    ]
                          [ 3      4    ]
m6 = (x6 − µ)(x6 − µ)^T = [ 6.25   7.5  ]
                          [ 7.5    9    ]
Covariance matrix = (1/6) [ 17.5   22 ]  =  [ 2.92   3.67 ]
                          [ 22     34 ]     [ 3.67   5.67 ]
Step-05: Find the eigenvalues and eigenvectors of the covariance matrix.
| 2.92 − λ    3.67     |
| 3.67        5.67 − λ |  = 0
From here,
(2.92 − λ)(5.67 − λ) − (3.67 × 3.67) = 0
16.56 − 2.92λ − 5.67λ + λ^2 − 13.47 = 0
λ^2 − 8.59λ + 3.09 = 0
Solving this quadratic equation, we get λ ≈ 8.22 and λ ≈ 0.38.

Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.
Clearly, the second eigenvalue is very small compared to the first, so the second eigenvector can be left out. The eigenvector corresponding to the greatest eigenvalue is the principal component of the given data set. So we find the eigenvector corresponding to the eigenvalue λ1, using the equation
M X = λ X
where M is the covariance matrix, X is the eigenvector, and λ is the eigenvalue.
Substituting the values into the above equation, we get:
[ 2.92   3.67 ] [ X1 ]         [ X1 ]
[ 3.67   5.67 ] [ X2 ]  = 8.22 [ X2 ]
Solving these, we get:
2.92 X1 + 3.67 X2 = 8.22 X1
3.67 X1 + 5.67 X2 = 8.22 X2
On simplification:
5.3 X1 = 3.67 X2   ……(1)
3.67 X1 = 2.55 X2  ……(2)
From (1) and (2), X1 = 0.69 X2.
From (2), the eigenvector is:
Eigenvector: [ X1 ]   [ 2.55 ]
             [ X2 ] = [ 3.67 ]
Thus, the principal component for the given problem is
Principal component: [ X1 ]   [ 2.55 ]
                     [ X2 ] = [ 3.67 ]
Lastly, we project the data points onto the new subspace.
(Figure: the data points plotted together with the principal axis x1 = 0.69 x2, onto which they are projected.)
Problem-02: Use the PCA algorithm to transform the pattern (2, 1) onto the eigenvector obtained in the previous question.
Solution:
The given feature vector is (2, 1).
The feature vector gets transformed to:
= Transpose of the eigenvector × (Feature vector − Mean vector)
= [2.55  3.67] × (2 − 4.5, 1 − 5)
= [2.55  3.67] × (−2.5, −4)
= 2.55 × (−2.5) + 3.67 × (−4) = −21.055
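The short numpy check below reproduces this transformation numerically; the unnormalised eigenvector (2.55, 3.67) is taken from the worked solution above, and, as noted in the comments, a unit-length eigenvector would give the same projection up to a constant scale factor.

# Numerical check of Problem-02.
import numpy as np

mu = np.array([4.5, 5.0])        # mean vector from Step-02
v = np.array([2.55, 3.67])       # eigenvector of the largest eigenvalue (unnormalised, as in the text)
x = np.array([2.0, 1.0])         # pattern to transform

print(v @ (x - mu))                              # -> about -21.06, matching the value above
print((v / np.linalg.norm(v)) @ (x - mu))        # -> about -4.71 with a unit-length eigenvector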

Check Your Progress - 3


Qn1. What are the advantages of dimensionality reduction?
Qn2. What are the disadvantages of dimensionality reduction?

13.4 LINEAR DISCRIMINANT ANALYSIS


In most cases, the application of logistic regression is restricted to problems involving two classes. Linear Discriminant Analysis, on the other hand, is the linear classification method recommended when there are more than two classes.
Logistic regression is a linear classification algorithm known for being both straightforward and robust. However, it has a few restrictions or shortcomings that highlight the need for more capable linear classification algorithms. Some of the problems are listed below:
 Binary class problems. Logistic regression is designed for problems that involve binary classification, i.e. two classes. It can be extended to handle multi-class classification, but in practice this is not very common.
 Unstable with well-separated classes. When the classes are extremely distinct from one another, logistic regression can become unstable.
 Unstable with few examples. When there are not enough examples from which to estimate the parameters, the logistic regression model can become unstable.
In view of these limitations of logistic regression, linear discriminant analysis is a suitable linear method for multi-class classification, primarily because it addresses all of the concerns (i.e. the flaws of logistic regression) mentioned above. For problems that involve binary classification, both logistic regression and linear discriminant analysis are effective linear statistical methods.
Understanding LDA Models: In order to simplify the analysis of your data and make it more accessible, LDA makes the following assumptions about it:
1. The distribution of your data is Gaussian, and when plotted, each variable
appears to be a bell curve.
2. Each feature has the same variance, which indicates that the values of
each feature vary by the same amount on average in relation to the mean.
On the basis of these assumptions, the LDA model estimates the mean and the variance of each class. This is easiest to think about in the case of a single input variable, known as the univariate scenario.
The mean value µk of the input x for each class k is computed by dividing the sum of the values by the number of values:
µk = 1/nk * sum(x)
Where,
µk represents the average value of x for class k and
nk represents the total number of occurrences that belong to class k.
The variance is calculated across all classes as the average squared difference of each value from its class mean:
σ2 = 1 / (n-K) * sum((x – µ)2)
Where σ2 represents the variance of all inputs (x), n represents the number of
instances, K represents the number of classes, and µ is the mean for input x.
Now we discuss how to use LDA to make predictions. LDA generates predictions by estimating, for a new input, the probability of each class; the prediction is the output class with the highest probability. The model uses Bayes' theorem to compute these probabilities: using the prior probability of each class and the probability of the data given that class, Bayes' theorem estimates the probability of the output class k given the input x as
P(Y = k | X = x) = (πk * fk(x)) / sum over l of (πl * fl(x))
The base rate of each class k in the training data is denoted by πk (e.g. 0.5 for a 50-50 split in a two-class problem); within Bayes' theorem this is referred to as the prior probability:
πk = nk / n
Here fk(x) is the estimated likelihood that x belongs to class k; we use a Gaussian distribution function for fk(x). Simplifying the equation above and substituting the Gaussian, we arrive at the equation below. Such a function is called a discriminant function, and the output classification y is made by choosing the class with the greatest value:
Dk(x) = x * (µk / σ2) − µk2 / (2 * σ2) + ln(πk)
where Dk(x) is the discriminant function for class k given input x, and µk, σ2, and πk are all estimated from your data.
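The sketch below turns the univariate discriminant function above into runnable Python; the toy training arrays and the test value are assumptions for illustration, while the estimates of µk, σ2, and πk follow the formulas given in this section.

# Univariate LDA discriminant sketch (assumed toy data).
import numpy as np

x_train = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
y_train = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(y_train)
n, K = len(x_train), len(classes)

mu = {k: x_train[y_train == k].mean() for k in classes}       # mu_k = (1/n_k) * sum(x)
prior = {k: (y_train == k).mean() for k in classes}           # pi_k = n_k / n
sigma2 = sum(((x_train[y_train == k] - mu[k]) ** 2).sum() for k in classes) / (n - K)

def discriminant(x, k):
    # D_k(x) = x * (mu_k / sigma^2) - mu_k^2 / (2 * sigma^2) + ln(pi_k)
    return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + np.log(prior[k])

x_new = 2.5
scores = {k: discriminant(x_new, k) for k in classes}
print(scores, "-> predicted class:", max(scores, key=scores.get))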
Now to perform the above task we need to prepare our data first, so the
question arises, How to prepare data suitable for LDA?
In order to prepare data suitable for LDA, one needs to understand following:-
1) Problems with Classification: LDA is used to solve classification
problems where the output variable is a categorical one. This may seem
obvious, as LDA works with both two and more than two classes.
2) Gaussian Distribution: The standard way to use the model assumes that
the input variables have a Gaussian distribution. Think about looking at
the univariate distributions of each attribute and using transformations to
make them look more like Gaussian distributions (e.g. log and root for
exponential distributions and Box-Cox for skewed distributions).
3) Remove Outliers: Think about removing outliers from your data. These
things can mess up the basic statistics like the mean and the standard
deviation that LDA uses to divide classes.
4) Same Variance: LDA assumes that the variance of each input variable is the same. Before using LDA, you should almost always standardise your data so that it has a mean of 0 and a standard deviation of 1 (a minimal preprocessing sketch follows this list).
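A minimal preprocessing sketch for this checklist is given below, assuming synthetic skewed data; the percentile thresholds and transformer choices are illustrative, and Box-Cox requires strictly positive inputs.

# Preparing data for LDA: reduce skew, clip outliers, standardise (assumed data).
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=(200, 3))            # skewed, non-Gaussian inputs

low, high = np.percentile(X, [1, 99], axis=0)
X = np.clip(X, low, high)                                # crude outlier removal

X = PowerTransformer(method="box-cox", standardize=False).fit_transform(X)  # push towards Gaussian
X = StandardScaler().fit_transform(X)                    # mean 0, standard deviation 1 for every input
print(X.mean(axis=0).round(3), X.std(axis=0).round(3))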
Below is a practice problem based on Linear Discriminant Analysis (LDA).
Problem: Compute the linear discriminant projection for the following two-dimensional dataset:
X1 = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)} and X2 = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}

(Figure: scatter plot of the two classes in the x1-x2 plane, together with the LDA projection direction WLDA.)
Solution: We work through the LDA computation step by step.
Step 1: Compute the within-class scatter matrix Sw. Sw measures how the data is scattered within each class; it is computed from the class means µ1 and µ2, where µ1 and µ2 are the means of classes C1 and C2 respectively.
Now, Sw is given by
Sw = S1 + S2
where S1 is the covariance matrix of class C1 and S2 is the covariance matrix of class C2.
Let us find the covariance matrices S1 and S2 of each class:
S1 = (1/N1) * sum over x in C1 of (x − µ1)(x − µ1)^T
where µ1 is the mean of class C1, computed by averaging the coordinates of dataset X1:
µ1 = ((4 + 2 + 2 + 3 + 4) / 5, (1 + 4 + 3 + 6 + 4) / 5) = (3.00, 3.60)
Similarly, µ2 = (8.4, 7.60).

The mean-reduced data of class C1, (x − µ1), is given by the columns
[  1     −1     −1      0      1   ]
[ −2.6    0.4   −0.6    2.4    0.4 ]
Now, for each column of (x − µ1) we compute (x − µ1)(x − µ1)^T, which gives five such matrices:
[  1     −2.6  ]   [  1     −0.4  ]   [  1      0.6  ]   [ 0    0    ]   [ 1      0.4  ]
[ −2.6    6.76 ]   [ −0.4    0.16 ]   [  0.6    0.36 ]   [ 0    5.76 ]   [ 0.4    0.16 ]
Adding these five matrices and taking the average, we get the covariance matrix
S1 = [  0.8    −0.4  ]
     [ −0.4     2.64 ]
Similarly, for class C2 the covariance matrix is
S2 = [  1.84   −0.04 ]        with µ2 = (8.4, 7.6)
     [ −0.04    2.64 ]
Sw = S1 + S2
Sw = [  2.64   −0.44 ]
     [ −0.44    5.28 ]
Step 2: Compute the between-class scatter matrix SB.
SB = (µ1 − µ2)(µ1 − µ2)^T
   = [ −5.4 ] (−5.4  −4)  =  [ 29.16   21.6  ]
     [ −4   ]                [ 21.6    16.00 ]

Step 3: Find the best LDA projection vector.
The LDA projection vector is the vector onto which all data samples are projected; it carries the features needed to keep a good balance between the classes of the dataset. This projection vector is an eigenvector, and, similarly to PCA, we pick the eigenvector with the largest eigenvalue. The expression used for it is
Sw^-1 SB V = λ V
where V is the projection vector, i.e. |Sw^-1 SB − λ I| = 0:
| 11.89 − λ    8.81     |
| 5.08         3.76 − λ |  = 0
which gives λ = 15.65.
Substituting λ into the equation:
[ 11.89   8.81 ] [ V1 ]          [ V1 ]
[ 5.08    3.76 ] [ V2 ]  = 15.65 [ V2 ]
We get
[ V1 ]   [ 0.91 ]
[ V2 ] = [ 0.39 ]
Or we can directly solve
[ V1 ]                         [ 0.384   0.032 ] [ −5.4 ]     [ 0.91 ]
[ V2 ]  =  Sw^-1 (µ1 − µ2)  =  [ 0.032   0.192 ] [ −4   ]  ∝  [ 0.39 ]
Note: Sw^-1 is found using the 2×2 inverse formula
[ a  b ]^-1       1       [  d   −b ]
[ c  d ]     =  -------   [ −c    a ]
                ad − bc
So, with Sw = [  2.64   −0.44 ]
              [ −0.44    5.28 ]
Sw^-1 = (1/13.74) [ 5.28   0.44 ]  =  [ 0.384   0.032 ]
                  [ 0.44   2.64 ]     [ 0.032   0.192 ]

Step 4: Dimension reduction.
Each input data sample X is projected as y = w^T X, where w = [0.91  0.39] is the projection vector corresponding to the highest eigenvalue:
y1 = [0.91  0.39] [ 4   2   2   3   4 ]
                  [ 1   4   3   6   4 ]
y2 = [0.91  0.39] [ 9   6   9   8   10 ]
                  [ 10  8   5   7   8  ]

Note: Linear Discriminant Analysis (LDA) is a technique for dimensionality reduction that is used as a pre-processing step for pattern classification and machine learning applications. LDA is similar to PCA, but the basic difference is that LDA additionally finds the axes that maximise the separation between multiple classes. The goal is to project the N-dimensional feature space onto a smaller k-dimensional subspace (k <= N) while maintaining the class-discriminatory information.
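To close the worked example, the numpy sketch below recomputes the quantities above; it follows the text's convention of averaging the outer products when forming S1 and S2, and, up to sign and rounding, the projection direction agrees with the (0.91, 0.39) obtained by hand.

# Numerical check of the two-class Fisher LDA example.
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # (3, 3.6) and (8.4, 7.6)
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)           # averaged outer products, as in the text
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
Sw = S1 + S2                                       # ~ [[2.64, -0.44], [-0.44, 5.28]]
SB = np.outer(mu1 - mu2, mu1 - mu2)                # between-class scatter

w = np.linalg.solve(Sw, mu1 - mu2)                 # direction of Sw^-1 (mu1 - mu2)
w = w / np.linalg.norm(w)                          # ~ +/-(0.92, 0.39); the text rounds to (0.91, 0.39)
print("w =", w.round(2))
print("class 1 projections:", (X1 @ w).round(2))
print("class 2 projections:", (X2 @ w).round(2))   # the two classes separate cleanly on this line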
