
Ensemble Classifiers


Ensemble Classifiers
• Introduction & Motivation
• Construction of Ensemble Classifiers
– Boosting (AdaBoost)
– Bagging
– Random Forests
• Empirical Comparison
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier), you decide to take the opinions of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and, based on their combined diagnoses, you get a fairly accurate idea of the diagnosis.
• Majority voting – ‘bagging’
• More weight given to the opinions of the ‘good’ (accurate) doctors – ‘boosting’
• In bagging, you give equal weight to all classifiers, whereas in boosting you weight each classifier according to its accuracy.
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
General Idea
[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, …, Dt-1, Dt; Step 2 builds one classifier C1, C2, …, Ct-1, Ct on each data set; Step 3 combines them into a single ensemble classifier C*.]
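A minimal sketch of these three steps (scikit-learn, the synthetic data, and the choice of five base trees below are assumptions for illustration, not part of the slides):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    classifiers = []
    for _ in range(5):
        # Step 1: create a resampled training set
        idx = rng.integers(0, len(X_train), len(X_train))
        # Step 2: build a classifier on it
        classifiers.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

    # Step 3: combine the classifiers by majority vote (labels here are 0/1)
    preds = np.array([clf.predict(X_test) for clf in classifiers])
    majority = np.round(preds.mean(axis=0)).astype(int)
    print("ensemble accuracy:", (majority == y_test).mean())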
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of
‘base classifiers’ from the training data
• Methods for constructing an EC
• Manipulating training set
• Manipulating input features
• Manipulating class labels
• Manipulating learning algorithms
Ensemble Classifiers (EC)
• Manipulating training set
• Multiple training sets are created by resampling the
data according to some sampling distribution
• Sampling distribution determines how likely it is that
an example will be selected for training – may vary
from one trial to another
• A classifier is built from each training set using a particular learning algorithm
• Examples: Bagging & Boosting
Ensemble Classifiers (EC)
• Manipulating input features
• Subset of input features chosen to form each training
set
• Subset can be chosen randomly or based on inputs
given by Domain Experts
• Good for data that has redundant features
• Random Forest is an example which uses DTs as its base classifiers
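A minimal sketch of this idea, assuming the training data sits in a NumPy feature matrix X (the helper name is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_feature_subset(X, n_features):
        # Pick a random subset of columns for one member of the ensemble;
        # keep the indices so the same columns can be used at prediction time.
        cols = rng.choice(X.shape[1], size=n_features, replace=False)
        return X[:, cols], cols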
Ensemble Classifiers (EC)
• Manipulating class labels
• When no. of classes is sufficiently large
• Training data is transformed into a binary class problem
by randomly partitioning the class labels into 2 disjoint
subsets, A0 & A1
• Re-labelled examples are used to train a base classifier
• By repeating the class labeling and model building steps several times, an ensemble of base classifiers is obtained
• How is a new tuple classified?
• Example – error-correcting output coding
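A minimal sketch of the random relabeling step (plain Python; the function name is illustrative):

    import random

    def random_binary_relabel(y, classes):
        # Randomly partition the class labels into two disjoint subsets A0 and A1,
        # then relabel every example as 0 (in A0) or 1 (in A1).
        shuffled = list(classes)
        random.shuffle(shuffled)
        A0 = set(shuffled[: len(shuffled) // 2])
        return [0 if label in A0 else 1 for label in y]

Repeating this relabeling and training one base classifier per relabeled copy yields the ensemble; a new tuple is classified by combining the binary votes, as in error-correcting output coding.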
Ensemble Classifiers (EC)
• Manipulating learning algorithm
• Learning algorithms can be manipulated in such a way
that applying the algorithm several times on the same
training data may result in different models
• Example – ANN can produce different models by
changing network topology or the initial weights of links
between neurons
• Example – ensemble of DTs can be constructed by
introducing randomness into the tree growing procedure
– instead of choosing the best split attribute at each
node, we randomly choose one of the top k attributes
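A minimal sketch of this randomized split selection, assuming a small (feature_dict, label) representation of the data; the helper names are illustrative, not a library API:

    import math
    import random

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in (labels.count(l) for l in set(labels)))

    def info_gain(data, attr):
        # data is a list of (feature_dict, label) pairs
        base = entropy([y for _, y in data])
        remainder = 0.0
        for v in set(x[attr] for x, _ in data):
            subset = [y for x, y in data if x[attr] == v]
            remainder += len(subset) / len(data) * entropy(subset)
        return base - remainder

    def choose_split_attribute(data, attributes, k=3):
        # Rank attributes by information gain, then pick one of the top k at random
        ranked = sorted(attributes, key=lambda a: info_gain(data, a), reverse=True)
        return random.choice(ranked[:k])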
Ensemble Classifiers (EC)
• First 3 approaches are generic – can be applied
to any classifier
• Fourth approach depends on the type of
classifier used
• Base classifiers can be generated sequentially or
in parallel
Ensemble Classifiers
• Ensemble methods work better with ‘unstable
classifiers’
• Classifiers that are sensitive to minor
perturbations in the training set
• Examples:
– Decision trees
– Rule-based
– Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction (it is wrong only when at least 13 of the 25 base classifiers are wrong):

  P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
– Check for yourself that this value is correct!
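A quick check of this figure in plain Python (the variable names are illustrative):

    from math import comb

    # 25 independent base classifiers, each with error rate 0.35; the majority
    # vote is wrong only when 13 or more of them are wrong.
    eps = 0.35
    p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
    print(round(p_wrong, 3))   # ≈ 0.06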
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bagging
– Boosting
– Random Forests
Bagging
• Also known as bootstrap aggregation
Original Data       1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)   7   8  10   8   2   5  10  10   5    9
Bagging (Round 2)   1   4   9   1   2   3   2   7   3    2
Bagging (Round 3)   1   8   5  10   5   5   9   6   3    7

• Sampling uniformly with replacement
• Build a classifier on each bootstrap sample
• 0.632 bootstrap
• Each bootstrap sample Di contains approx. 63.2% of the original training data
• The remaining 36.8% are used as the test set
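A small sketch of where the 63.2% figure comes from (plain Python; the sample size is arbitrary):

    import random

    # Draw n indices uniformly with replacement and measure what fraction of the
    # original records appears at least once; this approaches 1 - 1/e ≈ 0.632.
    n = 10_000
    sample = [random.randrange(n) for _ in range(n)]
    print(len(set(sample)) / n)   # ≈ 0.632; the remaining ~36.8% are "out of bag"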
Bagging

• Accuracy of bagging:

  Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set} \right)

• Works well for small data sets
• Example:

  X   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  y    1    1    1   -1   -1   -1   -1    1    1    1
Bagging
• Decision Stump
• A single-level binary decision tree
• The best entropy-based split is either x <= 0.35 or x <= 0.75
• Accuracy is at most 70%
Bagging

• Accuracy of the bagged ensemble classifier on this data: 100%
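A hedged sketch of this experiment with scikit-learn (in older scikit-learn versions the estimator argument of BaggingClassifier is named base_estimator):

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # The ten-point data set from the example above
    X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5],
                  [0.6], [0.7], [0.8], [0.9], [1.0]])
    y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

    stump = DecisionTreeClassifier(max_depth=1)   # a single stump is capped near 70%
    bag = BaggingClassifier(estimator=stump, n_estimators=25, random_state=0)
    bag.fit(X, y)
    print(bag.score(X, y))   # the bagged ensemble can reach 100% on this data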


Bagging- Final Points
• Works well if the base classifiers are unstable
• Increased accuracy because it reduces the
variance of the individual classifier
• Does not focus on any particular instance of
the training data
• Therefore, less susceptible to model over-
fitting when applied to noisy data
• What if we want to focus on particular instances of the training data?
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more
on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of
a boosting round
Boosting
• Records that are wrongly classified will have
their weights increased
• Records that are classified correctly will have
their weights decreased
Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4  10   6    3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4    2
Boosting (Round 3)   4   4   8  10   4   5   4   6   3    4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
Boosting
• Equal weights are assigned to each training tuple
(1/d for round 1)
• After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to “pay more attention” to tuples that were misclassified by Mi.
• Final boosted classifier M* combines the votes of
each individual classifier
• Weight of each classifier’s vote is a function of its
accuracy
• Adaboost – popular boosting algorithm
Adaboost
• Input:
– Training set D containing d tuples
– k rounds
– A classification learning scheme
• Output:
– A composite model
Adaboost
• Data set D containing d class-labeled tuples
(X1, y1), (X2, y2), (X3, y3), …, (Xd, yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds
or iterations
• In round i, tuples from D are sampled with replacement to form Di (of size d)
• Each tuple’s chance of being selected depends
on its weight
Adaboost
• Base classifier Mi, is derived from training
tuples of Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted
depending on how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• Weight of a tuple indicates how hard it is to
classify it (directly proportional)
Adaboost
• Some classifiers may be better at classifying some
“hard” tuples than others
• We finally have a series of classifiers that
complement each other!
• Error rate of model Mi:

  error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)

  where err(X_j) is the misclassification error of X_j (= 1 if X_j is misclassified, 0 otherwise)
• If classifier error exceeds 0.5, we abandon it
• Try again with a new Di and a new Mi derived from it
Adaboost
• error(Mi) affects how the weights of training tuples are updated
• If a tuple is correctly classified in round i, its weight is multiplied by

  \frac{error(M_i)}{1 - error(M_i)}

• Adjust the weights of all correctly classified tuples
• Then the weights of all tuples (including the misclassified tuples) are normalized
• Normalization factor = \frac{sum\_of\_old\_weights}{sum\_of\_new\_weights}
• The weight of classifier Mi’s vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
Adaboost
• The lower a classifier’s error rate, the more accurate it is, and therefore the higher its weight for voting should be
• The weight of classifier Mi’s vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
• For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
Example: AdaBoost
• Base classifiers: C1, C2, …, CT

• Error rate:

  \varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

• Importance of a classifier:

  \alpha_i = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)
Example: AdaBoost
• Weight update:

  w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
    \begin{cases}
      e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\
      e^{\alpha_j}  & \text{if } C_j(x_i) \neq y_i
    \end{cases}

  where Z_j is the normalization factor

• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the re-sampling procedure is repeated
• Classification:

  C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)
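A compact from-scratch sketch of this procedure (decision stumps from scikit-learn act as base classifiers, weighted training stands in for the slides’ resampling step, and the function names are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, rounds=10):
        # y must use +1 / -1 labels
        n = len(y)
        w = np.full(n, 1.0 / n)                   # start with equal weights 1/n
        models, alphas = [], []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if err >= 0.5:                        # abandon a classifier no better than chance
                w = np.full(n, 1.0 / n)
                continue
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            w *= np.exp(-alpha * y * pred)        # shrink correct, grow misclassified weights
            w /= w.sum()                          # normalize
            models.append(stump)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, X):
        # Weighted majority vote of the base classifiers
        votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
        return np.sign(votes)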
Illustrating AdaBoost
[Figure: a one-dimensional data set of 10 points (labels + and −), each starting with weight 0.1. Boosting round 1 produces classifier B1 with α = 1.9459, round 2 produces B2 with α = 2.9323, and round 3 produces B3 with α = 3.8744; the point weights shift toward the examples each round misclassifies. The overall weighted vote of B1, B2 and B3 reproduces the original labeling (+ + + − − − − − + +).]
Random Forests
• Ensemble method specifically designed for
decision tree classifiers
• Random Forests grows many classification trees (hence the name!)
• Ensemble of unpruned decision trees
• Each base classifier classifies a “new” vector
• Forest chooses the classification having the
most votes (over all the trees in the forest)
Random Forests
• Introduce two sources of randomness:
“Bagging” and “Random input vectors”
– Each tree is grown using a bootstrap sample of
training data
– At each node, best split is chosen from random
sample of mtry variables instead of all variables
Random Forests
Random Forest Algorithm
• Given M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node.
• m is held constant during the forest growing
• Each tree is grown to the largest extent possible
• There is no pruning
• Bagging using decision trees is a special case of random
forests when m=M
Random Forest Algorithm
• Out-of-bag (OOB) error estimate
• Good accuracy without over-fitting
• Fast algorithm (can be faster than growing/pruning a single tree); easily parallelized
• Handles high-dimensional data without much problem
• Only one tuning parameter, mtry (commonly set to the square root of the number of predictors), and results are usually not sensitive to it
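A hedged usage sketch with scikit-learn’s implementation (the iris data set and the parameter values are illustrative, not from the slides):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(
        n_estimators=100,       # number of trees grown
        max_features="sqrt",    # mtry: a random subset of about sqrt(p) features per split
        oob_score=True,         # estimate accuracy from the out-of-bag samples
        random_state=0,
    )
    rf.fit(X, y)
    print("OOB accuracy estimate:", rf.oob_score_)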
