Ensemble Classifiers
• Introduction & Motivation
• Construction of Ensemble Classifiers
– Boosting (AdaBoost)
– Bagging
– Random Forests
• Empirical Comparison
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier), you
decide to take the opinion of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and then, based on their diagnoses,
you can get a fairly accurate idea of the true diagnosis.
• Majority voting - ‘bagging’
• More weightage to the opinion of some ‘good’ (accurate)
doctors - ‘boosting’
• In bagging, you give equal weightage to all classifiers,
whereas in boosting you give weightage according to the
accuracy of each classifier.
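A minimal sketch of the two combination rules, assuming a toy set of doctor opinions and made-up accuracy weights (Python):

```python
from collections import Counter

# Hypothetical diagnoses from five doctors (base classifiers)
opinions = ["flu", "flu", "cold", "flu", "cold"]

# Bagging-style combination: every doctor gets an equal vote
majority = Counter(opinions).most_common(1)[0][0]
print("Bagging (majority vote):", majority)           # -> flu

# Boosting-style combination: more accurate doctors get larger weights
weights = [0.9, 0.8, 0.4, 0.7, 0.3]                    # assumed accuracies
scores = {}
for opinion, w in zip(opinions, weights):
    scores[opinion] = scores.get(opinion, 0.0) + w
weighted = max(scores, key=scores.get)
print("Boosting (weighted vote):", weighted)           # -> flu
```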
Ensemble Methods
• Construct a set of classifiers from the training data
• Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
• Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
• Step 3: Combine the classifiers into a single classifier C*
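A small sketch of the three steps, assuming scikit-learn decision trees as base classifiers and the Iris data as a stand-in training set; the bootstrap resampling and majority vote are written out explicitly:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
t = 5  # number of base classifiers

# Step 1 + Step 2: create data sets D1..Dt and build classifiers C1..Ct
classifiers = []
for _ in range(t):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample
    clf = DecisionTreeClassifier().fit(X[idx], y[idx])
    classifiers.append(clf)

# Step 3: combine into C* by majority vote
preds = np.stack([clf.predict(X) for clf in classifiers])            # shape (t, n)
combined = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print("ensemble training accuracy:", (combined == y).mean())
```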
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of
‘base classifiers’ from the training data
• Methods for constructing an EC
• Manipulating training set
• Manipulating input features
• Manipulating class labels
• Manipulating learning algorithms
Ensemble Classifiers (EC)
• Manipulating training set
• Multiple training sets are created by resampling the
data according to some sampling distribution
• Sampling distribution determines how likely it is that
an example will be selected for training – may vary
from one trial to another
• A classifier is built from each training set using a
particular learning algorithm
• Examples: Bagging & Boosting
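A sketch of resampling under a sampling distribution, assuming NumPy; uniform weights give bagging-style samples, while a skewed distribution (as maintained by boosting) makes some examples more likely to be drawn:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10
X = np.arange(n)                        # stand-in for 10 training examples

# Uniform sampling distribution (bagging): every example equally likely
uniform = np.full(n, 1.0 / n)
bag_sample = rng.choice(X, size=n, replace=True, p=uniform)

# Skewed distribution (boosting-style): hard examples get higher probability
weights = np.array([1, 1, 1, 1, 1, 1, 1, 5, 5, 5], dtype=float)
weights /= weights.sum()
boost_sample = rng.choice(X, size=n, replace=True, p=weights)

print("bagging-style sample :", bag_sample)
print("boosting-style sample:", boost_sample)   # indices 7-9 appear more often
```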
Ensemble Classifiers (EC)
• Manipulating input features
• Subset of input features chosen to form each training
set
• Subset can be chosen randomly or based on inputs
given by Domain Experts
• Good for data that has redundant features
• Random Forest is an example which uses DTs as its
base classifiers
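A sketch of forming each training set from a randomly chosen subset of input features; the data set and the subset size are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(1)
n_features = X.shape[1]

ensemble = []
for _ in range(5):
    # choose a random subset of the input features for this base classifier
    cols = rng.choice(n_features, size=2, replace=False)
    clf = DecisionTreeClassifier().fit(X[:, cols], y)
    ensemble.append((cols, clf))

# each base classifier predicts using only its own feature subset
votes = np.stack([clf.predict(X[:, cols]) for cols, clf in ensemble])
print(votes[:, :5])   # predictions of the 5 classifiers on the first 5 tuples
```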
Ensemble Classifiers (EC)
• Manipulating class labels
• When no. of classes is sufficiently large
• Training data is transformed into a binary class problem
by randomly partitioning the class labels into 2 disjoint
subsets, A0 & A1
• Re-labelled examples are used to train a base classifier
• By repeating the class re-labelling and model building steps
several times, an ensemble of base classifiers is obtained
• How is a new tuple classified?
• Example – error correcting output codings
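A sketch of the random class re-labelling idea, assuming the scikit-learn digits data as a multi-class example; A0 and A1 follow the naming on this slide:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # 10 original classes
rng = np.random.default_rng(7)

classes = np.unique(y)
ensemble = []
for _ in range(5):
    # randomly split the class labels into two disjoint subsets A0 and A1
    A1 = set(rng.choice(classes, size=len(classes) // 2, replace=False))
    y_binary = np.isin(y, list(A1)).astype(int)      # 1 if label in A1, else 0
    clf = DecisionTreeClassifier(max_depth=5).fit(X, y_binary)
    ensemble.append((A1, clf))

# To classify a new tuple, each base classifier votes for every class in the
# subset (A0 or A1) it predicts; the class with the most votes wins.
x_new = X[:1]
votes = np.zeros(len(classes))
for A1, clf in ensemble:
    predicted_subset = A1 if clf.predict(x_new)[0] == 1 else set(classes) - A1
    for c in predicted_subset:
        votes[c] += 1
print("predicted class:", votes.argmax(), "true class:", y[0])
```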
Ensemble Classifiers (EC)
• Manipulating learning algorithm
• Learning algorithms can be manipulated in such a way
that applying the algorithm several times on the same
training data may result in different models
• Example – ANN can produce different models by
changing network topology or the initial weights of links
between neurons
• Example – ensemble of DTs can be constructed by
introducing randomness into the tree growing procedure
– instead of choosing the best split attribute at each
node, we randomly choose one of the top k attributes
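A sketch of the randomized split selection in the last example: score every attribute with information gain, then pick the split attribute at random from the top k rather than always taking the best (the data and function names are illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(X_col, y):
    # gain from splitting on a single discrete attribute
    gain = entropy(y)
    for v in np.unique(X_col):
        mask = X_col == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

def choose_split(X, y, k=2, rng=np.random.default_rng(0)):
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    top_k = np.argsort(gains)[::-1][:k]      # indices of the k best attributes
    return rng.choice(top_k)                 # randomly pick one of the top k

# toy binary data: 4 attributes, 8 examples
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(8, 4))
y = X[:, 0] & X[:, 1]                        # labels depend on attributes 0 and 1
print("chosen split attribute:", choose_split(X, y))
```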
Ensemble Classifiers (EC)
• First 3 approaches are generic – can be applied
to any classifier
• Fourth approach depends on the type of
classifier used
• Base classifiers can be generated sequentially or
in parallel
Ensemble Classifiers
• Ensemble methods work better with ‘unstable
classifiers’
• Classifiers that are sensitive to minor
perturbations in the training set
• Examples:
– Decision trees
– Rule-based
– Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume the classifiers are independent
– The ensemble is wrong only when a majority (at least 13 of the 25)
of the base classifiers are wrong
– Probability that the ensemble classifier makes a wrong prediction:

  P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06

– Check it out yourself!
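The 0.06 figure can be checked directly with a short computation of the binomial sum above:

```python
from math import comb

eps = 0.35   # error rate of each independent base classifier
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 3))   # ~0.06: the ensemble errs only when >= 13 of 25 err
```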
Examples of Ensemble Methods
• Accuracy of bagging:

  Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{\text{test\_set}} + 0.368 \cdot Acc(M_i)_{\text{train\_set}} \right)

• Weight of a classifier M_i's vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
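A small numeric sketch of the accuracy formula; the per-model test and train accuracies are made-up values for k = 3, and the result is reported as the average over the k rounds:

```python
# assumed accuracies of each model M_i on its test set and training set
acc_test  = [0.80, 0.78, 0.82]
acc_train = [0.95, 0.96, 0.94]

k = len(acc_test)
# 0.632/0.368 combination for each round, averaged over the k rounds
acc = sum(0.632 * te + 0.368 * tr for te, tr in zip(acc_test, acc_train)) / k
print(round(acc, 3))
```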
Adaboost
• The lower a classifier error rate, the more accurate it is, and
therefore, the higher its weight for voting should be
• Weight of a classifier M_i's vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
• For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
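A sketch of this weighted-vote rule: each classifier's vote for its predicted class is weighted by log((1 - error)/error); the error rates and predictions below are assumed values:

```python
from math import log

# assumed error rates of three boosted classifiers and their predictions for X
errors      = [0.10, 0.25, 0.40]
predictions = ["yes", "yes", "no"]

votes = {}
for err, cls in zip(errors, predictions):
    votes[cls] = votes.get(cls, 0.0) + log((1 - err) / err)

winner = max(votes, key=votes.get)
print(votes)              # class -> summed weights
print("winner:", winner)  # "yes": the accurate classifiers dominate the vote
```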
Example: AdaBoost
• Base classifiers: C1, C2, …, CT
• Error rate of base classifier C_i (with example weights w_j):

  \varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

• Importance of a classifier:

  \alpha_i = \frac{1}{2} \ln\left( \frac{1 - \varepsilon_i}{\varepsilon_i} \right)
Example: AdaBoost
• Weight update:

  w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
  \begin{cases}
    \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\
    \exp(\alpha_j)  & \text{if } C_j(x_i) \neq y_i
  \end{cases}

  where Z_j is the normalization factor

• Final (combined) classifier:

  C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)
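A compact NumPy sketch that follows the update rules on these slides, using brute-force decision stumps as base classifiers; the data set and the stump learner are illustrative assumptions, not part of the original example:

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the feature/threshold/sign with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # toy labels in {-1, +1}

N, T = len(X), 5
w = np.full(N, 1.0 / N)                             # initial example weights
stumps, alphas = [], []
for _ in range(T):
    err, j, thr, sign = train_stump(X, y, w)
    err = max(err, 1e-10)                           # guard against division by zero
    alpha = 0.5 * np.log((1 - err) / err)           # importance alpha_j
    pred = np.where(X[:, j] <= thr, sign, -sign)
    w = w * np.exp(-alpha * y * pred)               # shrink correct, boost wrong
    w /= w.sum()                                    # Z_j: normalization factor
    stumps.append((j, thr, sign)); alphas.append(alpha)

# C*(x): sign of the alpha-weighted vote of the T stumps
agg = sum(a * np.where(X[:, j] <= thr, s, -s) for a, (j, thr, s) in zip(alphas, stumps))
print("training accuracy:", (np.sign(agg) == y).mean())
```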
Illustrating AdaBoost
• [Figure: three boosting rounds and the combined result]
– Round 1 (B1): predictions + + + - - - - - - -, example weights 0.0094, 0.0094, 0.4623; α = 1.9459
– Round 2 (B2): predictions - - - - - - - - + +, example weights 0.3037, 0.0009, 0.0422; α = 2.9323
– Round 3 (B3): predictions + + + + + + + + + +, example weights 0.0276, 0.1819, 0.0038; α = 3.8744
– Overall combined prediction: + + + - - - - - + +
Random Forests
• Ensemble method specifically designed for
decision tree classifiers
• Random Forests grow many classification
trees (that is why the name!)
• Ensemble of unpruned decision trees
• Each base classifier classifies a “new” vector
• Forest chooses the classification having the
most votes (over all the trees in the forest)
Random Forests
• Introduce two sources of randomness:
“Bagging” and “Random input vectors”
– Each tree is grown using a bootstrap sample of
training data
– At each node, best split is chosen from random
sample of mtry variables instead of all variables
Random Forest Algorithm
• Given M input variables, a number m << M is specified such that at
each node, m variables are selected at random out of the M
and the best split on these m is used to split the node
• m is held constant during the forest growing
• Each tree is grown to the largest extent possible
• There is no pruning
• Bagging using decision trees is a special case of random
forests when m=M
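A minimal scikit-learn sketch of the algorithm above; max_features plays the role of m (mtry) and n_estimators is the number of trees, with illustrative parameter values:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# m variables tried at each node = max_features; trees are grown unpruned by default
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("random forest test accuracy:", rf.score(X_te, y_te))

# using all M variables at each node (m = M) reduces random forests to bagged trees
bagged = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
bagged.fit(X_tr, y_tr)
print("bagged-tree test accuracy  :", bagged.score(X_te, y_te))
```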
Random Forest Algorithm
• Out-of-bag (OOB) error provides a built-in estimate of the
generalization error
• Good accuracy without over-fitting
• Fast algorithm (can be faster than growing/pruning a single
tree); easily parallelized
• Handles high-dimensional data without much problem
• Only one tuning parameter, mtry (typically √p for classification),
and results are usually not sensitive to it
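A sketch of the out-of-bag estimate, using scikit-learn's oob_score option (the data set and parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# each tree is evaluated on the ~36.8% of examples left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3))
print("OOB error   :", round(1 - rf.oob_score_, 3))
```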