
1

Feature Selection in Machine Learning
U.V Vandebona

2

• Feature Selection
• Dataset 1 - Iris Dataset
• Forward Selection
• Backward Selection
• Genetic Algorithm
• Dataset 2 - Abalone Dataset
• Dataset 3 – Custom Dataset

3

 The data set contains 3 classes of 50 instances
each, where each class refers to a type of iris
flower.

4

 Attribute Information (All in centimeters)
› Sepal length
› Sepal width
› Petal length
› Petal width
› Flower class
Ex: 5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
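
For readers following along outside the simulator, a minimal loading sketch in Python (it assumes the UCI file has been saved locally as iris.data and uses pandas; neither is part of the original workflow):

```python
import pandas as pd

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv("iris.data", header=None, names=cols)
print(iris["class"].value_counts())   # 50 instances of each class
```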

5

1. One of the classes (Iris Setosa) is linearly
separable from the other two. However, the
other two classes are not linearly separable.
2. There is some overlap between the Versicolor
and Virginica classes, so that it is impossible to
achieve a perfect classification rate.
3. There is some redundancy in the four input
variables, so that it is possible to achieve a good
solution with only three of them, or even (with
difficulty) from two.

6


7

1. The Iris dataset is a simple dataset whose values are delimited by commas.
2. The dataset does not include any variable names or case names.
3. We can edit the dataset to give the variables proper names.
› The Class field is automatically treated as a nominal field, as it contains only three nominal values.

8

 Analysis  Feature Selection
 From the available variables, set the dependent
and independent (output and input) variables.
Dependent variable:
Class
Independent variables:
Sepal Width
Sepal Length
Petal Width
Petal Length

9

 The dependent (output) variable states which flower class the record belongs to: either Virginica, Versicolor, or Setosa.
 Independent or input variables are used to
predict that decision.
 Typically we do not have a strong idea of the
relationship between the available variables and
the desired prediction.

10

 To an extent, some neural network architectures
(e.g., multilayer perceptrons) can actually learn to
ignore useless variables.
 However, other architectures (e.g., radial basis
functions) are adversely affected, and in all cases a
larger number of inputs implies that a larger number
of training cases are required.
 As a rule of thumb, the number of training cases should be a good few times larger than the number of weights in the network, to prevent over-learning.
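
A rough back-of-the-envelope illustration of this rule of thumb (the 4-5-3 network and the multiplier of 5 below are illustrative assumptions, not values from the slides):

```python
# Weights (including biases) in a 4-input, 5-hidden-unit, 3-output MLP.
n_inputs, n_hidden, n_outputs = 4, 5, 3
n_weights = n_inputs * n_hidden + n_hidden * n_outputs + n_hidden + n_outputs
print(n_weights)        # 43
print(5 * n_weights)    # ~215 training cases, i.e. "a good few times" more
```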

11

 As a consequence, the performance of a network
can be improved by reducing the number of
inputs, even sometimes at the cost of losing
some input information.
› In many problem domains, a range of input variables
are available which may be used to train a neural
network, but it is not clear which of them are most
useful, or indeed are needed at all.

12

 In non-linear problems, there may be
interdependencies and redundancies between
variables;
› for example, a pair of variables may be of no value
individually, but extremely useful in conjunction, or
any one of a set of parameters may be useful.
› It is not possible, in general, to simply rank
parameters in order of importance.

13

 The "curse of dimensionality" means that it is
sometimes actually better to discard some
variables that do have a genuine information
content, simply to reduce the total number of
input variables, and therefore the complexity of
the problem, and the size of the network.
 Counter-intuitively, this can actually improve the
network's generalization capabilities.

14

 The only method guaranteed to select the best input set is to train networks with all possible input sets and all possible architectures, and to select the best.
› In practice, this is impossible for any significant
number of candidate inputs.
 If you wish to examine the selection of variables
more closely yourself, Feature Selection is a
good technique.
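
As a small-scale illustration of this exhaustive idea, the sketch below scores every non-empty input subset of the Iris data. It assumes the iris DataFrame loaded earlier, and a k-nearest-neighbour classifier with 5-fold cross-validation stands in for the PNN/GRNN models the simulator actually builds:

```python
from itertools import combinations

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X, y = iris[features], iris["class"]

results = {}
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        # Mean 5-fold cross-validated accuracy for this input subset.
        score = cross_val_score(KNeighborsClassifier(), X[list(subset)], y, cv=5).mean()
        results[subset] = score

best = max(results, key=results.get)
print(best, round(results[best], 3))
```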

15

 The Feature Selection Algorithms conduct a large
number of experiments with different
combinations of inputs, building probabilistic or
generalized regression networks for each
combination, evaluating the performance, and
using this to further guide the search.
 This is a "brute force" technique that may
sometimes find results much faster.

16

 These algorithms explicitly identify input variables that do not contribute significantly to the performance of the networks, and then suggest removing them.
 They are either stepwise algorithms, which progressively add or remove variables, or genetic algorithms.

17

5.1 Randomized assignment of cases to the train, select, and test subsets.

18

5.2 Fixed assignment of cases to the train, select, and test subsets.
› Add a column containing the nominal values “Train”, “Select”, “Test”, and “Ignore”. To generate the values, support from a spreadsheet package may be needed (a pandas alternative is sketched below). Name the column NNSET.
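
A hedged sketch of generating the NNSET column with pandas instead of a spreadsheet (the 50/25/25 split proportions and the random seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
iris["NNSET"] = rng.choice(["Train", "Select", "Test"],
                           size=len(iris), p=[0.5, 0.25, 0.25])
iris.to_csv("iris_with_nnset.csv", index=False)
```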

19

Run the Feature Selection once and
these subsets will be assigned after
that.

20

 A major problem with neural networks is the
generalization issue (the tendency to overfit the
training data), accompanied by the difficulty in
quantifying likely performance on new data.
 It is important to have ways to estimate the
performance of the models on new data, and to
be able to select among them.
 Most work on assessing performance in neural
modeling concentrates on approaches to
resampling.

21

 Typically the neural network is trained using a training subset.
 The test subset is used to perform an unbiased
estimation of the network's likely performance.

22

 Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. It keeps an independent check on the performance of the networks during training, with deterioration in the selection error indicating over-learning.
 If over-learning occurs, training is stopped and the network is restored to the state with the minimum selection error.
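
A minimal sketch of this stop-and-restore behaviour, using scikit-learn's MLPClassifier as a stand-in for the simulator's networks (the network size, patience value, and 25% selection split are illustrative assumptions; X and y are reused from the earlier Iris sketch):

```python
from copy import deepcopy

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hold back 25% of the cases as the selection subset.
X_train, X_sel, y_train, y_sel = train_test_split(X, y, test_size=0.25, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(5,), random_state=0)
classes = sorted(y.unique())

best_err, best_net, patience, since_best = float("inf"), None, 10, 0
for epoch in range(500):
    net.partial_fit(X_train, y_train, classes=classes)   # one training pass
    sel_err = 1.0 - net.score(X_sel, y_sel)              # selection error
    if sel_err < best_err:
        best_err, best_net, since_best = sel_err, deepcopy(net), 0
    else:
        since_best += 1
        if since_best >= patience:                       # sustained deterioration
            break

net = best_net   # restore the state with minimum selection error
```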

23


24

6. In the results shown after the analysis, each row represents a particular test of one combination of inputs, so the output lists every combination that was tried.
7. It is sometimes a good idea to reduce the number of input variables to a network even at the cost of a little performance, as this improves generalization capability and decreases the network size and execution time.

25

You can apply some extra pressure to eliminate unwanted variables by assigning a Unit Penalty.
› This is multiplied by the number of units in the network and added to the error level when assessing how good a network is, thus penalizing larger networks.
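
A small sketch of how such a penalty tilts the comparison toward smaller networks (the error values, unit counts, and penalty of 0.001 below are made up for illustration):

```python
def penalised_error(selection_error, n_units, unit_penalty=0.001):
    """Error used to compare networks once the unit penalty is applied."""
    return selection_error + unit_penalty * n_units

# A 12-unit network with slightly lower raw error can lose to a 6-unit one:
print(penalised_error(0.040, 12))   # 0.052
print(penalised_error(0.043, 6))    # 0.049
```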

26

If there are a large number of cases, the
evaluations performed by the feature selection
algorithms can be very time-consuming (the time
taken is proportional to the number of cases).
› For this reason, you can specify a sub-sampling rate. (However, as we have very few cases here, the default sampling rate of 1.0 is fine.)
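
If sub-sampling were wanted, it could be sketched in one line with pandas (the 0.1 rate is only an example; the slides keep the default of 1.0 here):

```python
# Evaluate the feature-selection experiments on a random 10% of the cases.
subsample = iris.sample(frac=0.1, random_state=0)
```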

27

 Begins by locating the single input variable that, on its own, best predicts the output variable.
 It then searches for a second variable that, when added to the first, most improves the model.
 The process repeats until either all variables have been selected or no further improvement is made.
 Good for a larger number of variables; a greedy sketch follows below.
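
A minimal sketch of this greedy loop, under the same assumptions as the exhaustive-search sketch above (kNN with 5-fold cross-validation in place of the simulator's PNN/GRNN evaluation):

```python
def forward_select(X, y, features):
    """Greedy forward selection: add the variable that most improves CV accuracy."""
    chosen, best_score = [], 0.0
    while True:
        candidates = [f for f in features if f not in chosen]
        if not candidates:
            break
        scores = {
            f: cross_val_score(KNeighborsClassifier(), X[chosen + [f]], y, cv=5).mean()
            for f in candidates
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:      # no further improvement: stop
            break
        chosen.append(f)
        best_score = score
    return chosen

print(forward_select(X, y, features))
```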

28

 Generally faster. Much faster if there are few relevant variables, as it will locate them at the beginning of its search.
 Behaves sensibly when the data set has a large number of variables, as it selects variables early on.
 It may miss key variables if they are interdependent (that is, where two or more variables must be added at the same time in order to improve the model).

29

 The row label indicates the stage (e.g., 2.3 indicates the third test in stage 2). The final row replicates the best result found, for convenience. The first column is the selection error of the Probabilistic Neural Network (PNN) or Generalized Regression Neural Network (GRNN). Subsequent columns indicate which inputs were selected for that particular combination.

30

(Results shown for unit penalties of 0, 0.001, 0.002, 0.003, 0.005, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width and Petal Length are good features to keep if the number of input features must be reduced.

31

 The reverse process.
 Starts with a model including all the variables, and then removes them one at a time.
 At each stage, it removes the variable whose absence least degrades the model.
 Good for a smaller number of variables (20 or fewer); a matching sketch follows below.
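
A matching sketch of backward elimination, under the same assumptions as the forward-selection sketch above:

```python
def backward_select(X, y, features):
    """Backward elimination: drop the variable whose removal least degrades CV accuracy."""
    chosen = list(features)
    best_score = cross_val_score(KNeighborsClassifier(), X[chosen], y, cv=5).mean()
    while len(chosen) > 1:
        scores = {
            f: cross_val_score(
                KNeighborsClassifier(), X[[c for c in chosen if c != f]], y, cv=5
            ).mean()
            for f in chosen
        }
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score < best_score:       # every removal now degrades the model: stop
            break
        chosen.remove(f)
        best_score = score
    return chosen

print(backward_select(X, y, features))
```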

32

 Does not suffer from missing key variables.
 As it starts with the whole set of variables, the initial evaluations are the most time-consuming.
 Struggles with a large number of variables, especially if there are only a few weakly predictive ones in the set.
 It does not cut out the irrelevant variables until the very end of its search.

33

(Results shown for unit penalties of 0, 0.001, 0.002, 0.003, 0.004, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width, Petal Length, and Sepal Length are good features to keep if the number of input features must be reduced.

34

 An optimization algorithm.
 Genetic algorithms are a particularly effective
search technique for combinatorial problems
(where a set of interrelated yes/no decisions
needs to be made).
 The method is time-consuming (it typically
requires building and testing many thousands of
networks)
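
A compact sketch of a genetic search over binary feature masks (population size, generation count, mutation rate, and truncation selection are illustrative choices, not the simulator's actual settings; it reuses X, y, and features from the earlier Iris sketches):

```python
import random

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, features, unit_penalty=0.0):
    """CV accuracy of the masked feature set, minus an optional unit penalty."""
    cols = [f for f, keep in zip(features, mask) if keep]
    if not cols:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[cols], y, cv=5).mean()
    return acc - unit_penalty * len(cols)

def ga_select(X, y, features, pop_size=20, generations=30, mut_rate=0.1):
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in features] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(m, X, y, features), reverse=True)
        parents = scored[: pop_size // 2]                # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(features))        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(m, X, y, features))
    return [f for f, keep in zip(features, best) if keep]

print(ga_select(X, y, features))
```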

35

 For reasonably-sized problem domains (perhaps
50-100 possible input variables, and cases
numbering in the low thousands), the algorithm
can be employed effectively overnight or at the
weekend on a fast PC.
 With sub-sampling, it can be applied in minutes
or hours, although at the cost of reduced
reliability for very large numbers of variables.

36

 Run with the default settings, the algorithm would perform 10,000 evaluations (a population of 100 over 100 generations).
Since our problem has only 4 candidate inputs, the total number of possible combinations is only 16 (2 raised to the 4th power).
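
A quick sanity check of that count:

```python
from itertools import product
print(len(list(product([0, 1], repeat=4))))   # 16 possible input combinations
```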

37

(Results shown for unit penalties of 0, 0.003, 0.004, and 0.012.)
Conclusion: Considering the spread of the error values across these penalty settings, Petal Width, Petal Length, and Sepal Length are good features to keep if the number of input features must be reduced.

38

 The age of an abalone can be determined by counting the number of rings.
 The number of rings is the value to predict
from physical measurements.

39

Attribute name    Data type    Unit    Description
Sex               nominal      --      M, F, and I (infant)
Length            continuous   mm      Longest shell measurement
Diameter          continuous   mm      Perpendicular to length
Height            continuous   mm      With meat in shell
Whole weight      continuous   grams   Whole abalone
Shucked weight    continuous   grams   Weight of meat
Viscera weight    continuous   grams   Gut weight (after bleeding)
Shell weight      continuous   grams   After being dried
Rings             integer      --      +1.5 gives the age in years
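
A sketch of loading the UCI file (assumed saved locally as abalone.data) and recovering the age from the ring count, as the table describes:

```python
import pandas as pd

cols = ["sex", "length", "diameter", "height", "whole_weight",
        "shucked_weight", "viscera_weight", "shell_weight", "rings"]
abalone = pd.read_csv("abalone.data", header=None, names=cols)
abalone["age"] = abalone["rings"] + 1.5   # rings + 1.5 gives the age in years
```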

40

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

41

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

42

(Results shown for a sampling rate of 0.1, with a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Sex, Whole Weight, and Shell Weight are good features to keep if the number of input features must be reduced. Height does not contribute any useful effect.

43

 4 classes: C1, C2, C3, C4
 17 attribute features: F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15, F17, F18

44

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

45

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

46

(Results shown for a high penalty of 0.001 and a low penalty of 0.0001.)
Conclusion: Feature F14 is a good feature to keep if the number of input features must be reduced. Features F2, F5, F6, F7, F9, and F11 do not contribute any useful effect.

47

 http://archive.ics.uci.edu/ml/datasets/Iris [Online; accessed 2015-10-25]
 http://archive.ics.uci.edu/ml/datasets/Abalone [Online; accessed 2015-10-25]
 Trajan Neural Network Simulator Help


Editor's Notes

  1. But the precise choice of best variables is not obvious.
  2. You are likely therefore to accumulate a variety of data, with some which you suspect is important, and some which is of dubious
  3. You are likely therefore to accumulate a variety of data, with some which you suspect is important, and some which is of dubious
  4. (However, it is not necessary to choose a Unit Penalty - usually the algorithm is run without it, or with a very low value to eliminate only very marginal input variables.)
  5. The genetic algorithm is a good alternative when there are large numbers of variables (more than fifty, say), and also provides a valuable "second or third opinion" for smaller numbers of variables.
  6. Cutting the shell through the cone, staining it, and counting the rings through a microscope is a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.