3b Features PDF

Foundations of Machine Learning
Module 3: Instance Based Learning and

Feature Reduction
Part B: Feature Selection
Sudeshna Sarkar
IIT Kharagpur
Feature Reduction in ML
- The information about the target class is inherent
in the variables.
- Naïve view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
Curse of Dimensionality
• number of training examples is fixed
=> the classifier’s performance usually will
degrade for a large number of features!
Feature Reduction in ML
- Irrelevant and
- redundant features
- can confuse learners.
- Limited training data.

- Limited computational resources.
- Curse of dimensionality.
Feature Selection
Problem of selecting some subset of features, while
ignoring the rest
Feature Extraction
Criteria for selection/extraction:

either improve or maintain the classification
accuracy, simplify classifier complexity.
Feature Selection - Definition
Subset selection
7
Feature Selection Steps
Feature selection is an
optimization problem.
o Step 1: Search the space
of possible feature
subsets.
o Step 2: Pick the subset
that is optimal or near-
optimal with respect to
some objective function.
Feature Selection Steps (cont’d)
Search strategies
– Optimum
– Heuristic
– Randomized
Evaluation strategies
- Filter methods
- Wrapper methods
Evaluating feature subset
• Supervised (wrapper method)
– Train using selected subset
– Estimate error on validation dataset
• Unsupervised (filter method)

– Look at input only
– Select the subset that has the most information
Evaluation Strategies
Filter Methods Wrapper Methods
Subset selection
• Select uncorrelated features
• Forward search
– Start from empty set of features
– Try each of remaining features
– Estimate classification/regression error for adding specific
feature
– Select feature that gives maximum improvement in
validation error
– Stop when no significant improvement
• Backward search
– Start with original set of size d
– Drop features with smallest impact on error
Feature selection
Univariate (looks at each feature independently of others)
– Pearson correlation coefficient
– F-score Univariate methods measure
– Chi-square some type of correlation
– Signal to noise ratio between two random variables
– mutual information • the label (yi) and a fixed feature
– Etc. (xij for fixed j)
• Rank features by importance
• Ranking cut-off is determined by user
Pearson correlation coefficient
Pearson correlation coefficient
From Wikipedia
Signal to noise ratio
• Difference in means divided by difference in
standard deviation between the two classes
S2N(X,Y) = (μX - μY)/(σX – σY)
• Large values indicate a strong correlation

Multivariate feature selection
• Multivariate (considers all features simultaneously)
• Consider the vector w for any linear classifier.
• Classification of a point x is given by wTx+w0.
• Small entries of w will have little effect on the dot
product and therefore those features are less relevant.
• For example if w = (10, .01, -9) then features 0 and 2 are
contributing more to the dot product than feature 1.
– A ranking of features given by this w is 0, 2, 1.
Multivariate feature selection
• The w can be obtained by any of linear classifiers
• A variant of this approach is called recursive feature
elimination:
– Compute w on all features
– Remove feature with smallest wi
– Recompute w on reduced data
– If stopping criterion not met then go to step 2
Feature Extraction
Foundations of Machine Learning
Module 3: Instance Based Learning and
Feature Reduction
Part C: Feature Extraction
Sudeshna Sarkar
IIT Kharagpur
Feature extraction - definition
Feature Extraction
PCA
Geometric picture of principal components (PCs)
Algebraic definition of PCs
Given a sample of p observations on a vector of N variables
x , x ,, x  
1 2 p
N
define the first principal component of the sample

by the linear transformation
N
z1  a x j   ai1 xij ,
T
1 j  1,2,  , p.
i 1
where the vector a1  (a11 , a21 ,  , a N 1 )

x j  ( x1 j , x2 j ,  , x Nj )
is chosen such that var[ z1 ] is maximum.
PCA
PCA
PCA for image compression
p=1 p=2 p=4 p=8
Original
p=16 p=32 p=64 p=100 Image
Is PCA a good criterion for classification?
• Data variation
determines the
projection direction
• What’s missing?
– Class information
What is a good projection?
Two classes
• Similarly, what is a
overlap
good criterion?
– Separating different
classes
Two classes are

separated
What class information may be useful?
• Between-class distance
– Distance between the centroids
of different classes
Between-class distance
What class information may be useful?
• Between-class distance
– Distance between the centroids of
different classes
• Within-class distance
• Accumulated distance of an instance
to the centroid of its class
• Linear discriminant analysis (LDA) finds

most discriminant projection by
• maximizing between-class distance
• and minimizing within-class distance Within-class distance
Linear Discriminant Analysis
Means and Scatter after projection
Good Projection
• Means are as far away as possible
• Scatter is small as possible
• Fisher Linear Discriminant
 m1  m2 
2
J w 
s s
2
1
2
2
Multiple Classes

3b Features PDF

Uploaded by

Copyright:

Available Formats

3b Features PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

3b Features PDF

Uploaded by

Copyright:

Available Formats

Foundations of Machine Learning

Module 3: Instance Based Learning and

Part B: Feature Selection

- Limited training data.

Criteria for selection/extraction:

• Unsupervised (filter method)

S2N(X,Y) = (μX - μY)/(σX – σY)

• Large values indicate a strong correlation

Part C: Feature Extraction

define the first principal component of the sample

where the vector a1  (a11 , a21 ,  , a N 1 )

p=1 p=2 p=4 p=8

Two classes are

• Linear discriminant analysis (LDA) finds

You might also like