
1

Feature Selection in Machine Learning
U.V Vandebona

2

• Feature Selection
• Dataset 1 - Iris Dataset
• Forward Selection
• Backward Selection
• Genetic Algorithm
• Dataset 2 - Abalone Dataset
• Dataset 3 – Custom Dataset

3

 The data set contains 3 classes of 50 instances
each, where each class refers to a type of iris
flower.

4

 Attribute Information (All in centimeters)
› Sepal length
› Sepal width
› Petal length
› Petal width
› Flower class
Ex: 5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
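
For readers following along outside the simulator, a minimal loading sketch in Python (it assumes the UCI file has been saved locally as iris.data and uses pandas; neither is part of the original workflow):

```python
import pandas as pd

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv("iris.data", header=None, names=cols)
print(iris["class"].value_counts())   # 50 instances of each class
```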

5

1. One of the classes (Iris Setosa) is linearly
separable from the other two. However, the
other two classes are not linearly separable.
2. There is some overlap between the Versicolor
and Virginica classes, so that it is impossible to
achieve a perfect classification rate.
3. There is some redundancy in the four input
variables, so that it is possible to achieve a good
solution with only three of them, or even (with
difficulty) from two.

6


7

1. The Iris dataset is a simple dataset whose values are delimited by commas.
2. The dataset does not include any variable names or case names.
3. We can edit the dataset to give the variables proper names.
› The Class field is automatically treated as a nominal field, as it contains only three nominal values.

8

 Analysis  Feature Selection
 From the available variables, set the dependent
and independent (output and input) variables.
Dependent variable:
Class
Independent variables:
Sepal Width
Sepal Length
Petal Width
Petal Length

9

 The dependent (output) variable states which flower class the record belongs to: either Virginica, Versicolor, or Setosa.
 Independent or input variables are used to
predict that decision.
 Typically we do not have a strong idea of the
relationship between the available variables and
the desired prediction.

10

 To an extent, some neural network architectures
(e.g., multilayer perceptrons) can actually learn to
ignore useless variables.
 However, other architectures (e.g., radial basis
functions) are adversely affected, and in all cases a
larger number of inputs implies that a larger number
of training cases are required.
 As a rule of thumb, the number of training cases should be a good few times larger than the number of weights in the network, to prevent over-learning.
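
A rough back-of-the-envelope illustration of this rule of thumb (the 4-5-3 network and the multiplier of 5 below are illustrative assumptions, not values from the slides):

```python
# Weights (including biases) in a 4-input, 5-hidden-unit, 3-output MLP.
n_inputs, n_hidden, n_outputs = 4, 5, 3
n_weights = n_inputs * n_hidden + n_hidden * n_outputs + n_hidden + n_outputs
print(n_weights)        # 43
print(5 * n_weights)    # ~215 training cases, i.e. "a good few times" more
```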

11

 As a consequence, the performance of a network
can be improved by reducing the number of
inputs, even sometimes at the cost of losing
some input information.
› In many problem domains, a range of input variables
are available which may be used to train a neural
network, but it is not clear which of them are most
useful, or indeed are needed at all.

12

 In non-linear problems, there may be
interdependencies and redundancies between
variables;
› for example, a pair of variables may be of no value
individually, but extremely useful in conjunction, or
any one of a set of parameters may be useful.
› It is not possible, in general, to simply rank
parameters in order of importance.

13

 The "curse of dimensionality" means that it is
sometimes actually better to discard some
variables that do have a genuine information
content, simply to reduce the total number of
input variables, and therefore the complexity of
the problem, and the size of the network.
 Counter-intuitively, this can actually improve the
network's generalization capabilities.

14

 The only method guaranteed to select the best input set is to train networks with all possible input sets and all possible architectures, and to select the best.
› In practice, this is impossible for any significant
number of candidate inputs.
 If you wish to examine the selection of variables
more closely yourself, Feature Selection is a
good technique.
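
As a small-scale illustration of this exhaustive idea, the sketch below scores every non-empty input subset of the Iris data. It assumes the iris DataFrame loaded earlier, and a k-nearest-neighbour classifier with 5-fold cross-validation stands in for the PNN/GRNN models the simulator actually builds:

```python
from itertools import combinations

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X, y = iris[features], iris["class"]

results = {}
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        # Mean 5-fold cross-validated accuracy for this input subset.
        score = cross_val_score(KNeighborsClassifier(), X[list(subset)], y, cv=5).mean()
        results[subset] = score

best = max(results, key=results.get)
print(best, round(results[best], 3))
```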

15

 The Feature Selection Algorithms conduct a large
number of experiments with different
combinations of inputs, building probabilistic or
generalized regression networks for each
combination, evaluating the performance, and
using this to further guide the search.
 This is a "brute force" technique that may
sometimes find results much faster.

16

 These algorithms explicitly identify input variables that do not contribute significantly to the performance of the networks, and then suggest removing them.
 They are either stepwise algorithms, which progressively add or remove variables, or genetic algorithms.

17

5.1 Randomized assignment of cases to the train, select, and test subsets.

18

5.2 Fixed assignment of cases to the train, select, and test subsets.
› Add a column containing the nominal values “Train”, “Select”, “Test”, and “Ignore”. To generate the values, support from a spreadsheet package may be needed (a pandas alternative is sketched below). Name the column NNSET.
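
A hedged sketch of generating the NNSET column with pandas instead of a spreadsheet (the 50/25/25 split proportions and the random seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
iris["NNSET"] = rng.choice(["Train", "Select", "Test"],
                           size=len(iris), p=[0.5, 0.25, 0.25])
iris.to_csv("iris_with_nnset.csv", index=False)
```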

19

Run the Feature Selection once and
these subsets will be assigned after
that.

20

 A major problem with neural networks is the
generalization issue (the tendency to overfit the
training data), accompanied by the difficulty in
quantifying likely performance on new data.
 It is important to have ways to estimate the
performance of the models on new data, and to
be able to select among them.
 Most work on assessing performance in neural
modeling concentrates on approaches to
resampling.

21

 Typically the neural network is trained using a training subset.
 The test subset is used to perform an unbiased
estimation of the network's likely performance.

22

 Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. It keeps an independent check on the performance of the networks during training, with deterioration in the selection error indicating over-learning.
 If over-learning occurs, training is stopped and the network is restored to the state with the minimum selection error.
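
A minimal sketch of this stop-and-restore behaviour, using scikit-learn's MLPClassifier as a stand-in for the simulator's networks (the network size, patience value, and 25% selection split are illustrative assumptions; X and y are reused from the earlier Iris sketch):

```python
from copy import deepcopy

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hold back 25% of the cases as the selection subset.
X_train, X_sel, y_train, y_sel = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(5,), random_state=0)
classes = sorted(y.unique())

best_err, best_net, patience, since_best = float("inf"), None, 10, 0
for epoch in range(500):
    net.partial_fit(X_train, y_train, classes=classes)   # one training pass
    sel_err = 1.0 - net.score(X_sel, y_sel)              # selection error
    if sel_err < best_err:
        best_err, best_net, since_best = sel_err, deepcopy(net), 0
    else:
        since_best += 1
        if since_best >= patience:                       # sustained deterioration
            break

net = best_net   # restore the state with minimum selection error
```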

23


24

6. In the results shown after the analysis, each row represents a particular test of one combination of inputs, so the output lists every combination that was tried.
7. It is sometimes a good idea to reduce the number of input variables to a network even at the cost of a little performance, as this improves generalization capability and decreases the network size and execution time.

25

You can apply some extra pressure to eliminate unwanted variables by assigning a Unit Penalty.
› This is multiplied by the number of units in the network and added to the error level when assessing how good a network is, thus penalizing larger networks.
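
A small sketch of how such a penalty tilts the comparison toward smaller networks (the error values, unit counts, and penalty of 0.001 below are made up for illustration):

```python
def penalised_error(selection_error, n_units, unit_penalty=0.001):
    """Error used to compare networks once the unit penalty is applied."""
    return selection_error + unit_penalty * n_units

# A 12-unit network with slightly lower raw error can lose to a 6-unit one:
print(penalised_error(0.040, 12))   # 0.052
print(penalised_error(0.043, 6))    # 0.049
```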

26

If there are a large number of cases, the
evaluations performed by the feature selection
algorithms can be very time-consuming (the time
taken is proportional to the number of cases).
› For this reason, you can specify a sub-sampling rate. (However, as we have very few cases here, the default sampling rate of 1.0 is fine.)
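
If sub-sampling were wanted, it could be sketched in one line with pandas (the 0.1 rate is only an example; the slides keep the default of 1.0 here):

```python
# Evaluate the feature-selection experiments on a random 10% of the cases.
subsample = iris.sample(frac=0.1, random_state=0)
```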

27

 Begins by locating the single input variable that, on its own, best predicts the output variable.
 It then searches for a second variable that, when added to the first, most improves the model.
 The process repeats until either all variables have been selected or no further improvement is made.
 Good for a larger number of variables; a greedy sketch follows below.
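
A minimal sketch of this greedy loop, under the same assumptions as the exhaustive-search sketch above (kNN with 5-fold cross-validation in place of the simulator's PNN/GRNN evaluation):

```python
def forward_select(X, y, features):
    """Greedy forward selection: add the variable that most improves CV accuracy."""
    chosen, best_score = [], 0.0
    while True:
        candidates = [f for f in features if f not in chosen]
        if not candidates:
            break
        scores = {
            f: cross_val_score(KNeighborsClassifier(), X[chosen + [f]], y, cv=5).mean()
            for f in candidates
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:      # no further improvement: stop
            break
        chosen.append(f)
        best_score = score
    return chosen

print(forward_select(X, y, features))
```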

28

 Generally faster. Much faster if there are few relevant variables, as it will locate them at the beginning of its search.
 Behaves sensibly when the data set has a large number of variables, as it selects variables early on.
 It may miss key variables if they are interdependent (that is, where two or more variables must be added at the same time in order to improve the model).

29

 The row label indicates the stage (e.g., 2.3 indicates the third test in stage 2). The final row replicates the best result found, for convenience. The first column is the selection error of the Probabilistic Neural Network (PNN) or Generalized Regression Neural Network (GRNN). Subsequent columns indicate which inputs were selected for that particular combination.

30

(Results shown for unit penalties of 0, 0.001, 0.002, 0.003, 0.005, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width and Petal Length are good features to keep if the number of input features must be reduced.

31

 The reverse process.
 Starts with a model including all the variables, and then removes them one at a time.
 At each stage, it removes the variable whose absence least degrades the model.
 Good for a smaller number of variables (20 or fewer); a matching sketch follows below.
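
A matching sketch of backward elimination, under the same assumptions as the forward-selection sketch above:

```python
def backward_select(X, y, features):
    """Backward elimination: drop the variable whose removal least degrades CV accuracy."""
    chosen = list(features)
    best_score = cross_val_score(KNeighborsClassifier(), X[chosen], y, cv=5).mean()
    while len(chosen) > 1:
        scores = {
            f: cross_val_score(
                KNeighborsClassifier(), X[[c for c in chosen if c != f]], y, cv=5
            ).mean()
            for f in chosen
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score < best_score:       # every removal now degrades the model: stop
            break
        chosen.remove(f)
        best_score = score
    return chosen

print(backward_select(X, y, features))
```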

32

 Does not suffer from missing key variables.
 As it starts with the whole set of variables, the initial evaluations are the most time-consuming.
 Struggles with a large number of variables, especially if there are only a few weakly predictive ones in the set.
 It does not cut out the irrelevant variables until the very end of its search.

33

(Results shown for unit penalties of 0, 0.001, 0.002, 0.003, 0.004, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width, Petal Length, and Sepal Length are good features to keep if the number of input features must be reduced.

34

 An optimization algorithm.
 Genetic algorithms are a particularly effective
search technique for combinatorial problems
(where a set of interrelated yes/no decisions
needs to be made).
 The method is time-consuming (it typically
requires building and testing many thousands of
networks)
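
A compact sketch of a genetic search over binary feature masks (population size, generation count, mutation rate, and truncation selection are illustrative choices, not the simulator's actual settings; it reuses X, y, and features from the earlier Iris sketches):

```python
import random

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, features, unit_penalty=0.0):
    """CV accuracy of the masked feature set, minus an optional unit penalty."""
    cols = [f for f, keep in zip(features, mask) if keep]
    if not cols:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[cols], y, cv=5).mean()
    return acc - unit_penalty * len(cols)

def ga_select(X, y, features, pop_size=20, generations=30, mut_rate=0.1):
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in features] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(m, X, y, features), reverse=True)
        parents = scored[: pop_size // 2]                # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(features))        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(m, X, y, features))
    return [f for f, keep in zip(features, best) if keep]

print(ga_select(X, y, features))
```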

35

 For reasonably-sized problem domains (perhaps
50-100 possible input variables, and cases
numbering in the low thousands), the algorithm
can be employed effectively overnight or at the
weekend on a fast PC.
 With sub-sampling, it can be applied in minutes
or hours, although at the cost of reduced
reliability for very large numbers of variables.

36

 Run with the default settings, the algorithm would perform 10,000 evaluations (a population of 100 over 100 generations).
Since our problem has only 4 candidate inputs, the total number of possible combinations is only 16 (2 raised to the 4th power).
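
A quick sanity check of that count:

```python
from itertools import product
print(len(list(product([0, 1], repeat=4))))   # 16 possible input combinations
```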

37

(Results shown for unit penalties of 0, 0.003, 0.004, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width, Petal Length, and Sepal Length are good features to keep if the number of input features must be reduced.

38

 The age of an abalone can be determined by counting the number of rings.
 The number of rings is the value to predict
from physical measurements.

39

Attribute name    Data type    Unit    Description
Sex               nominal      --      M, F, and I (infant)
Length            continuous   mm      Longest shell measurement
Diameter          continuous   mm      Perpendicular to length
Height            continuous   mm      With meat in shell
Whole weight      continuous   grams   Whole abalone
Shucked weight    continuous   grams   Weight of meat
Viscera weight    continuous   grams   Gut weight (after bleeding)
Shell weight      continuous   grams   After being dried
Rings             integer      --      +1.5 gives the age in years
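
A sketch of loading the UCI file (assumed saved locally as abalone.data) and recovering the age from the ring count, as the table describes:

```python
import pandas as pd

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
abalone = pd.read_csv("abalone.data", header=None, names=cols)
abalone["age"] = abalone["rings"] + 1.5   # rings + 1.5 gives the age in years
```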

40

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

41

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

42

(Results shown for a sampling rate of 0.1, with a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

43

 4 classes: C1, C2, C3, C4
 17 attribute features: F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15, F17, F18

44

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

45

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

46

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

47

 http://archive.ics.uci.edu/ml/datasets/Iris [Online; accessed 2015-10-25]
 http://archive.ics.uci.edu/ml/datasets/Abalone [Online; accessed 2015-10-25]
 Trajan Neural Network Simulator Help


Editor's Notes

  1. But the precise choice of best variables is not obvious.
  2. You are likely therefore to accumulate a variety of data, with some which you suspect is important, and some which is of dubious
  3. You are likely therefore to accumulate a variety of data, with some which you suspect is important, and some which is of dubious
  4. (However, it is not necessary to choose a Unit Penalty - usually the algorithm is run without it, or with a very low value to eliminate only very marginal input variables.)
  5. The genetic algorithm is a good alternative when there are large numbers of variables (more than fifty, say), and also provides a valuable "second or third opinion" for smaller numbers of variables.
  6. Cutting the shell through the cone, staining it, and counting the rings through a microscope is a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.