Ensembles of Classifiers
Evgueni Smirnov
Outline
• Methods for Independently Constructing Ensembles
– Majority Vote
– Bagging and Random Forest
– Randomness Injection
– Feature-Selection Ensembles
– Error-Correcting Output Coding
• Methods for Coordinated Construction of Ensembles
– Boosting
– Stacking
• Reliable Classification: Meta-Classifier Approach
• Co-Training and Self-Training
Ensembles of Classifiers
• Basic idea is to learn a set of
classifiers (experts) and to allow them
to vote.
• Advantage: improvement in
predictive accuracy.
• Disadvantage: it is difficult to
understand an ensemble of classifiers.
Why do ensembles work?
Dietterich (2002) showed that ensembles overcome three problems:
• The Statistical Problem arises when the hypothesis space is too
large for the amount of available data. Hence, there are many
hypotheses with the same accuracy on the data and the learning
algorithm chooses only one of them! There is a risk that the
accuracy of the chosen hypothesis is low on unseen data!
• The Computational Problem arises when the learning algorithm
cannot guarantee finding the best hypothesis.
• The Representational Problem arises when the hypothesis space
does not contain any good approximation of the target class(es).
[Diagram: Step 1: build multiple classifiers C1, C2, …, Ct-1, Ct from the original training data D; Step 2: combine them into a single classifier C*.]
Why does majority voting work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume the errors made by the classifiers are uncorrelated
– The ensemble is wrong only when at least 13 of the 25 base
classifiers are wrong, so the probability that the ensemble
classifier makes a wrong prediction is
P(X ≥ 13) = Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^(25-i) ≈ 0.06
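This number can be reproduced with a short binomial-tail computation; a minimal sketch in Python (only ε = 0.35 and the 25 classifiers come from the slide):

from math import comb

eps, n = 0.35, 25
# the ensemble errs when at least 13 of the 25 base classifiers err
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))   # 0.06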
Bagging
• Employs the simplest way of combining predictions that
belong to the same type.
• Combining can be realized with voting or averaging
• Each model receives equal weight
• “Idealized” version of bagging:
– Sample several training sets of size n (instead of just
having one training set of size n)
– Build a classifier for each training set
– Combine the classifier’s predictions
• This improves performance in almost all cases if the
learning scheme is unstable, i.e. small changes in the training
data can produce very different classifiers (e.g. decision trees)
Bagging classifiers
Classifier generation
Let n be the size of the training set.
For each of t iterations:
Sample n instances with replacement from the
training set.
Apply the learning algorithm to the sample.
Store the resulting classifier.
Classification
For each of the t classifiers:
Predict class of instance using classifier.
Return class that was predicted most often.
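The pseudocode translates almost directly into Python. The sketch below is my own illustration (not part of the slides), using decision trees as the unstable base learner and assuming numpy arrays X (features) and y (integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    classifiers = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers                            # store the resulting classifiers

def bagging_predict(classifiers, X):
    votes = np.array([clf.predict(X) for clf in classifiers])
    # for each instance, return the class that was predicted most often
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])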
Why does bagging work?
• Bagging reduces variance by voting/
averaging, thus reducing the overall expected
error
– In the case of classification there are pathological
situations where the overall error might increase
– Usually, the more classifiers the better
Random Forest
Classifier generation
Let n be the size of the training set.
For each of t iterations:
(1) Sample n instances with replacement from
the training set.
(2) Learn a decision tree s.t. the variable
for any new node is the best variable among m
randomly selected variables.
(3) Store the resulting decision tree.
Classification
For each of the t decision trees:
Predict class of instance.
Return class that was predicted most often.
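In scikit-learn the same scheme is available off the shelf; in this sketch (an illustration, not from the slides) n_estimators corresponds to t and max_features to the m randomly selected variables tried at each node:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# t trees, each grown on a bootstrap sample; sqrt(#features) variables tried per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())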
Bagging and Random Forest
• Bagging usually improves decision trees.
• Random forest usually outperforms
bagging due to the fact that errors of the
decision trees in the forest are less
correlated.
Randomization Injection
• Inject some randomness into the learning algorithm, e.g. random
initial weights of a neural network or a random choice among the
best splits of a decision tree: different runs then yield diverse
classifiers that can be combined by voting.
Error-Correcting Output Coding
[Code-matrix residue; only the rows for some classes survive.
Error-correcting 7-bit codewords: y2: +1 +1 +1 -1 -1 -1 -1, y3: +1 -1 -1 +1 +1 -1 -1, y4: -1 +1 -1 +1 -1 +1 -1.
Shorter codings shown for y3 and y4: 4-bit one-per-class (y3: -1 -1 +1 -1, y4: -1 -1 -1 +1), 6-bit with zeros (y3: 0 -1 0 -1 0 +1, y4: 0 0 -1 0 -1 -1), 2-bit (y3: -1 -1, y4: -1 +1).]
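A minimal error-correcting output coding sketch under these assumptions: the 7-bit codewords of y2, y3, y4 are taken from the rows above, the all-positive row for y1 is my guess, classes are encoded as integers 0 to 3, one binary classifier is trained per codeword column, and a new instance receives the class whose codeword is nearest in Hamming distance:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

CODE = np.array([[+1, +1, +1, +1, +1, +1, +1],   # y1 (assumed)
                 [+1, +1, +1, -1, -1, -1, -1],   # y2
                 [+1, -1, -1, +1, +1, -1, -1],   # y3
                 [-1, +1, -1, +1, -1, +1, -1]])  # y4

def ecoc_fit(X, y):
    # one binary classifier per column (bit) of the code matrix
    return [DecisionTreeClassifier().fit(X, CODE[y, j]) for j in range(CODE.shape[1])]

def ecoc_predict(classifiers, X):
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # pick, for every instance, the class whose codeword is closest in Hamming distance
    return (bits[:, None, :] != CODE[None, :, :]).sum(axis=2).argmin(axis=1)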
Boosting (AdaBoost)
Classification
Assign weight of zero to all classes.
For each of the t classifiers:
Add -log(e / (1 - e)) to the weight of the class predicted
by the classifier, where e is the classifier's error.
Return the class with the highest weight.
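Written out as code, this rule is a weighted vote; a minimal sketch assuming the t trained classifiers and their errors e are already given (the boosting rounds that produce them are not shown here):

import numpy as np

def boosted_predict(classifiers, errors, x, classes):
    weights = dict.fromkeys(classes, 0.0)          # assign weight of zero to all classes
    for clf, e in zip(classifiers, errors):
        # each classifier adds -log(e / (1 - e)) to the weight of the class it predicts
        weights[clf.predict([x])[0]] += -np.log(e / (1.0 - e))
    return max(weights, key=weights.get)           # class with the highest weight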
Remarks on Boosting
• Boosting can be applied without weights using re-
sampling with probability determined by weights;
• Boosting decreases the training error exponentially in the
number of iterations;
• Boosting works well if base classifiers are not too
complex and their error doesn’t become too large too
quickly!
• Boosting reduces the bias component of the error of
simple classifiers!
Stacking
• Uses a meta learner instead of voting to
combine the predictions of the base learners
– Predictions of the base learners (level-0 models) are
used as input for the meta learner (level-1 model)
• Base learners are usually different learning
schemes
• Hard to analyze theoretically: “black magic”
Stacking
[Diagram: instance1 is passed to the base classifiers BC1, BC2, …, BCn; their predictions (0, 1, …, 1) form the level-1 instance "instance1: 0 1 1 1". Likewise instance2 yields "instance2: 1 0 0 0".]
[Diagram: the meta classifier (level-1 model) is then trained on these level-1 instances.]
[Diagram: at classification time a new instance is passed to BC1, …, BCn and their predictions (0, 1, …, 1) are fed to the meta classifier, which returns the final class (here 1).]
More on stacking
• Predictions on the training data can't be used to generate
the data for the level-1 model! The reason is that level-0
classifiers that overfit the training data would then be
favoured by the level-1 model! Thus,
• a k-fold cross-validation-like scheme is employed: each
level-0 model is trained on k - 1 folds and its predictions on
the held-out fold supply the level-1 data. An example for k = 3:
[Diagram: the training data is split into 3 folds, each fold in turn serving as the test part (train/train/test, train/test/train, …); the base classifiers' predictions on the held-out instances form the level-1 training data, e.g. instance1: 0 1 0, …, instancen: 1 1 1.]
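scikit-learn's cross_val_predict produces exactly these out-of-fold predictions, so stacking can be sketched as follows (the particular level-0 and level-1 learners are my choice, not the slides'):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, k=3):
    level0 = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
    # level-1 training data = out-of-fold predictions of every level-0 model
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=k) for m in level0])
    meta = LogisticRegression().fit(meta_X, y)
    level0 = [m.fit(X, y) for m in level0]         # refit level-0 models on all the data
    return level0, meta

def stacking_predict(level0, meta, X):
    return meta.predict(np.column_stack([m.predict(X) for m in level0]))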
Reliable Classification: Meta-Classifier Approach
[Diagram: each training instance is labelled 1 if the base classifier BC classifies it correctly and 0 otherwise (instance1: 0, …, instancen: 1); the meta classifier MC is trained to predict this label.]
Combined Classifier
[Diagram: a new instance is passed to both BC and MC; BC supplies the class prediction and MC indicates whether that prediction is reliable.]
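One common instantiation of this idea, sketched under the assumption that MC is trained to predict whether BC's prediction is correct (base and meta learners are arbitrary choices of mine):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def fit_reliable(X, y, cv=5):
    y = np.asarray(y)
    bc = DecisionTreeClassifier().fit(X, y)                       # base classifier
    # meta-label: 1 if the out-of-fold base prediction is correct, 0 otherwise
    meta_y = (cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv) == y).astype(int)
    mc = DecisionTreeClassifier().fit(X, meta_y)                  # meta classifier
    return bc, mc

def predict_reliable(bc, mc, X):
    pred, reliable = bc.predict(X), mc.predict(X) == 1
    # combined classifier: keep BC's prediction only where MC deems it reliable
    return [p if ok else None for p, ok in zip(pred, reliable)]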