Feature Selection in Machine Learning
Machine learning works on a simple rule: if you put garbage in, you will only get garbage out. By garbage here, I mean noise in the data.
This becomes even more important when the number of features is very large. You need
not use every feature at your disposal for creating an algorithm. You can assist your
algorithm by feeding in only those features that are really important.
This is useful not only in competitions but in industrial applications as well.
You not only reduce the training time and the evaluation time, you also have fewer things to
worry about!
Next, we’ll discuss various methodologies and techniques that you can use to subset your
feature space and help your models perform better and more efficiently. So, let’s get started.
Filter Methods
Filter methods are generally used as a preprocessing step. The selection of features is
independent of any machine learning algorithm. Instead, features are selected on the basis
of their scores in various statistical tests of their correlation with the outcome variable.
Correlation is a subjective term here: which test is appropriate depends on the types of the
feature and the response variable. For basic guidance, you can refer to the following table:

Feature type    Response type    Suggested measure
Continuous      Continuous       Pearson's Correlation
Categorical     Continuous       ANOVA
Pearson’s Correlation: It is used as a measure for quantifying the linear dependence
between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s
correlation is given as:

ρ(X, Y) = cov(X, Y) / (σ_X · σ_Y)
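As a quick illustration, here is a minimal sketch (on made-up toy data) of scoring features against a continuous target by their Pearson correlation, using scipy:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data: 100 samples, 3 features, and a continuous target
# that depends on the first two features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

# Score each feature by its Pearson correlation with the target.
for i in range(X.shape[1]):
    r, p = pearsonr(X[:, i], y)
    print(f"feature {i}: r = {r:+.3f}, p = {p:.3g}")
```

Features whose correlation is close to zero (with a large p-value) are candidates for removal.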
ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA (Linear Discriminant
Analysis) except that it operates on one or more categorical independent features and one
continuous dependent feature. It provides a statistical test of whether the means of
several groups are equal or not.
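In the same spirit, a one-way ANOVA F-test can score a continuous variable against a categorical grouping. A minimal sketch with three made-up groups, using scipy:

```python
import numpy as np
from scipy.stats import f_oneway

# Toy data: one continuous measurement observed under
# three levels of a categorical feature.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, size=40)
group_b = rng.normal(loc=0.5, size=40)
group_c = rng.normal(loc=1.0, size=40)

# One-way ANOVA: tests whether all group means are equal.
f_stat, p = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p:.3g}")  # small p => the means differ
```

(When selecting features for classification, scikit-learn's f_classif applies the same test with the class label as the grouping variable.)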
One thing that should be kept in mind is that filter methods do not remove multicollinearity.
So, you must also deal with multicollinearity among features before training models on your
data.
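One simple way to handle this is to drop one feature from every highly correlated pair before modeling. A sketch using pandas, with a purely illustrative threshold of 0.9:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```

The threshold is a judgment call and should be tuned to your data.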
Wrapper Methods
In wrapper methods, we try to use a subset of features and train a model using them. Based
on the inferences that we draw from the previous model, we decide to add or remove
features from the subset. The problem is essentially reduced to a search problem, and these
methods are usually computationally very expensive.
Some common examples of wrapper methods are forward feature selection, backward
feature elimination, recursive feature elimination, etc.
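For example, recursive feature elimination is available in scikit-learn as RFE, which repeatedly fits a model and discards the weakest features. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 10 features, only 3 informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursively drop the lowest-ranked feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3)
rfe.fit(X, y)

print("selected mask:", rfe.support_)
print("feature ranking:", rfe.ranking_)  # 1 = selected
```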
One of the best ways to implement feature selection with wrapper methods is to use the
Boruta package, which finds the importance of a feature by creating shadow features. It
works as follows (a code sketch follows the steps below):
1. Firstly, it adds randomness to the given data set by creating shuffled copies of all
features (which are called shadow features).
2. Then, it trains a random forest classifier on the extended data set and applies a
feature importance measure (the default is Mean Decrease Accuracy) to evaluate
the importance of each feature, where higher means more important.
3. At every iteration, it checks whether a real feature has a higher importance than the
best of its shadow features (i.e. whether the feature has a higher Z-score than the
maximum Z-score of its shadow features) and constantly removes features which
are deemed highly unimportant.
4. Finally, the algorithm stops either when all features get confirmed or rejected or it
reaches a specified limit of random forest runs.
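Here is a sketch in Python, assuming the BorutaPy port of the R Boruta package is installed (pip install Boruta); note that BorutaPy ranks features with the random forest's built-in importance rather than Mean Decrease Accuracy:

```python
import numpy as np
from boruta import BorutaPy  # Python port of the R Boruta package
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 carry real signal.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators='auto', random_state=0)
boruta.fit(X, y)  # BorutaPy expects numpy arrays, not DataFrames

print("confirmed:", np.where(boruta.support_)[0])
print("tentative:", np.where(boruta.support_weak_)[0])
```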
Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods. They are
implemented by algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and Ridge regression,
which have built-in penalization functions to reduce overfitting. LASSO's L1 penalty, in
particular, can shrink coefficients exactly to zero, which effectively selects features.
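A sketch of LASSO-based selection with scikit-learn, using SelectFromModel to keep the features whose coefficients survive the L1 penalty:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic regression data: 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=1.0, random_state=0)

# LassoCV picks the penalty strength by cross-validation; the L1
# penalty drives uninformative coefficients to exactly zero, and
# SelectFromModel keeps the features with non-negligible weights.
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)

print("selected features:", np.where(selector.get_support())[0])
```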
To summarize, here are the key differences between filter and wrapper methods:

- Filter methods measure the relevance of features by their correlation with the
dependent variable, while wrapper methods measure the usefulness of a subset of
features by actually training a model on it.
- Filter methods are much faster than wrapper methods, as they do not involve
training models. Wrapper methods, on the other hand, are computationally very
expensive.
- Filter methods use statistical tests to evaluate features, while wrapper methods
use cross-validation.
- Filter methods may fail to find the best subset of features on many occasions,
whereas wrapper methods, which search the subset space directly, can usually find a
better-performing one.
- Using the subset of features from wrapper methods makes the model more prone
to overfitting than using the subset of features from filter methods.