
Ensemble Classifiers


Ensemble Classifiers
• Introduction & Motivation
• Construction of Ensemble Classifiers
– Boosting (AdaBoost)
– Bagging
– Random Forests
• Empirical Comparison
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier), you decide to take the opinions of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and, based on their combined diagnoses, you get a fairly accurate idea of the diagnosis.
• Majority voting – ‘bagging’
• More weight given to the opinions of the ‘good’ (accurate) doctors – ‘boosting’
• In bagging, you give equal weight to all classifiers, whereas in boosting you weight each classifier according to its accuracy.
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
General Idea
[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, …, Dt-1, Dt; Step 2 builds one classifier C1, C2, …, Ct-1, Ct on each data set; Step 3 combines them into a single ensemble classifier C*.]
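A minimal sketch of these three steps (scikit-learn, the synthetic data, and the choice of five base trees below are assumptions for illustration, not part of the slides):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    classifiers = []
    for _ in range(5):
        # Step 1: create a resampled training set
        idx = rng.integers(0, len(X_train), len(X_train))
        # Step 2: build a classifier on it
        classifiers.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

    # Step 3: combine the classifiers by majority vote (labels here are 0/1)
    preds = np.array([clf.predict(X_test) for clf in classifiers])
    majority = np.round(preds.mean(axis=0)).astype(int)
    print("ensemble accuracy:", (majority == y_test).mean())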
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of
‘base classifiers’ from the training data
• Methods for constructing an EC
• Manipulating training set
• Manipulating input features
• Manipulating class labels
• Manipulating learning algorithms
Ensemble Classifiers (EC)
• Manipulating training set
• Multiple training sets are created by resampling the
data according to some sampling distribution
• Sampling distribution determines how likely it is that
an example will be selected for training – may vary
from one trial to another
• A classifier is built from each training set using a particular learning algorithm
• Examples: Bagging & Boosting
Ensemble Classifiers (EC)
• Manipulating input features
• Subset of input features chosen to form each training
set
• Subset can be chosen randomly or based on inputs
given by Domain Experts
• Good for data that has redundant features
• Random Forest is an example which uses DTs as its base classifiers
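A minimal sketch of this idea, assuming the training data sits in a NumPy feature matrix X (the helper name is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def random_feature_subset(X, n_features):
        # Pick a random subset of columns for one member of the ensemble;
        # keep the indices so the same columns can be used at prediction time.
        cols = rng.choice(X.shape[1], size=n_features, replace=False)
        return X[:, cols], cols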
Ensemble Classifiers (EC)
• Manipulating class labels
• When no. of classes is sufficiently large
• Training data is transformed into a binary class problem
by randomly partitioning the class labels into 2 disjoint
subsets, A0 & A1
• Re-labelled examples are used to train a base classifier
• By repeating the class labeling and model building steps several times, an ensemble of base classifiers is obtained
• How is a new tuple classified?
• Example – error-correcting output coding
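A minimal sketch of the random relabeling step (plain Python; the function name is illustrative):

    import random

    def random_binary_relabel(y, classes):
        # Randomly partition the class labels into two disjoint subsets A0 and A1,
        # then relabel every example as 0 (in A0) or 1 (in A1).
        shuffled = list(classes)
        random.shuffle(shuffled)
        A0 = set(shuffled[: len(shuffled) // 2])
        return [0 if label in A0 else 1 for label in y]

Repeating this relabeling and training one base classifier per relabeled copy yields the ensemble; a new tuple is classified by combining the binary votes, as in error-correcting output coding.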
Ensemble Classifiers (EC)
• Manipulating learning algorithm
• Learning algorithms can be manipulated in such a way
that applying the algorithm several times on the same
training data may result in different models
• Example – ANN can produce different models by
changing network topology or the initial weights of links
between neurons
• Example – ensemble of DTs can be constructed by
introducing randomness into the tree growing procedure
– instead of choosing the best split attribute at each
node, we randomly choose one of the top k attributes
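A minimal sketch of this randomized split selection, assuming a small (feature_dict, label) representation of the data; the helper names are illustrative, not a library API:

    import math
    import random

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in (labels.count(l) for l in set(labels)))

    def info_gain(data, attr):
        # data is a list of (feature_dict, label) pairs
        base = entropy([y for _, y in data])
        remainder = 0.0
        for v in set(x[attr] for x, _ in data):
            subset = [y for x, y in data if x[attr] == v]
            remainder += len(subset) / len(data) * entropy(subset)
        return base - remainder

    def choose_split_attribute(data, attributes, k=3):
        # Rank attributes by information gain, then pick one of the top k at random
        ranked = sorted(attributes, key=lambda a: info_gain(data, a), reverse=True)
        return random.choice(ranked[:k])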
Ensemble Classifiers (EC)
• First 3 approaches are generic – can be applied
to any classifier
• Fourth approach depends on the type of
classifier used
• Base classifiers can be generated sequentially or
in parallel
Ensemble Classifiers
• Ensemble methods work better with ‘unstable
classifiers’
• Classifiers that are sensitive to minor
perturbations in the training set
• Examples:
– Decision trees
– Rule-based
– Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction (it is wrong only when at least 13 of the 25 base classifiers are wrong):

  P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
– Check for yourself that this value is correct!
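A quick check of this figure in plain Python (the variable names are illustrative):

    from math import comb

    # 25 independent base classifiers, each with error rate 0.35; the majority
    # vote is wrong only when 13 or more of them are wrong.
    eps = 0.35
    p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
    print(round(p_wrong, 3))   # ≈ 0.06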
Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
– Bagging
– Boosting
– Random Forests
Bagging
• Also known as bootstrap aggregation
Original Data       1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)   7   8  10   8   2   5  10  10   5    9
Bagging (Round 2)   1   4   9   1   2   3   2   7   3    2
Bagging (Round 3)   1   8   5  10   5   5   9   6   3    7

• Sampling uniformly with replacement
• Build a classifier on each bootstrap sample
• 0.632 bootstrap
• Each bootstrap sample Di contains approx. 63.2% of the original training data
• The remaining 36.8% are used as the test set
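A small sketch of where the 63.2% figure comes from (plain Python; the sample size is arbitrary):

    import random

    # Draw n indices uniformly with replacement and measure what fraction of the
    # original records appears at least once; this approaches 1 - 1/e ≈ 0.632.
    n = 10_000
    sample = [random.randrange(n) for _ in range(n)]
    print(len(set(sample)) / n)   # ≈ 0.632; the remaining ~36.8% are "out of bag"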
Bagging

• Accuracy of bagging:

  Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set} \right)

• Works well for small data sets
• Example:

  X   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  y    1    1    1   -1   -1   -1   -1    1    1    1
Bagging
• Decision Stump
• A single-level binary decision tree
• The best entropy-based split is either x <= 0.35 or x <= 0.75
• Accuracy is at most 70%
Bagging

• Accuracy of the bagged ensemble classifier on this data: 100%
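A hedged sketch of this experiment with scikit-learn (in older scikit-learn versions the estimator argument of BaggingClassifier is named base_estimator):

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # The ten-point data set from the example above
    X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5],
                  [0.6], [0.7], [0.8], [0.9], [1.0]])
    y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

    stump = DecisionTreeClassifier(max_depth=1)   # a single stump is capped near 70%
    bag = BaggingClassifier(estimator=stump, n_estimators=25, random_state=0)
    bag.fit(X, y)
    print(bag.score(X, y))   # the bagged ensemble can reach 100% on this data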


Bagging- Final Points
• Works well if the base classifiers are unstable
• Increased accuracy because it reduces the
variance of the individual classifier
• Does not focus on any particular instance of
the training data
• Therefore, less susceptible to model over-
fitting when applied to noisy data
• What if we want to focus on particular instances of the training data?
Boosting
• An iterative procedure to adaptively change
distribution of training data by focusing more
on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of
a boosting round
Boosting
• Records that are wrongly classified will have
their weights increased
• Records that are classified correctly will have
their weights decreased
Original Data        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)   7   3   2   8   7   9   4  10   6    3
Boosting (Round 2)   5   4   9   4   2   5   1   7   4    2
Boosting (Round 3)   4   4   8  10   4   5   4   6   3    4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
Boosting
• Equal weights are assigned to each training tuple
(1/d for round 1)
• After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to “pay more attention” to tuples that were misclassified by Mi.
• Final boosted classifier M* combines the votes of
each individual classifier
• Weight of each classifier’s vote is a function of its
accuracy
• Adaboost – popular boosting algorithm
Adaboost
• Input:
– Training set D containing d tuples
– k rounds
– A classification learning scheme
• Output:
– A composite model
Adaboost
• Data set D containing d class-labeled tuples
(X1, y1), (X2, y2), (X3, y3), …, (Xd, yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds
or iterations
• In round i, tuples from D are sampled with replacement to form Di (of size d)
• Each tuple’s chance of being selected depends
on its weight
Adaboost
• Base classifier Mi, is derived from training
tuples of Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted
depending on how they were classified
– Correctly classified: Decrease weight
– Incorrectly classified: Increase weight
• Weight of a tuple indicates how hard it is to
classify it (directly proportional)
Adaboost
• Some classifiers may be better at classifying some
“hard” tuples than others
• We finally have a series of classifiers that
complement each other!
• Error rate of model Mi:

  error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)

  where err(X_j) is the misclassification error of X_j (= 1 if X_j is misclassified, 0 otherwise)
• If classifier error exceeds 0.5, we abandon it
• Try again with a new Di and a new Mi derived from it
Adaboost
• error(Mi) affects how the weights of training tuples are updated
• If a tuple is correctly classified in round i, its weight is multiplied by

  \frac{error(M_i)}{1 - error(M_i)}

• Adjust the weights of all correctly classified tuples
• Then the weights of all tuples (including the misclassified tuples) are normalized
• Normalization factor = \frac{sum\_of\_old\_weights}{sum\_of\_new\_weights}
• The weight of classifier Mi’s vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
Adaboost
• The lower a classifier’s error rate, the more accurate it is, and therefore the higher its weight for voting should be
• The weight of classifier Mi’s vote is

  \log \frac{1 - error(M_i)}{error(M_i)}
• For each class c, sum the weights of each classifier that
assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
Example: AdaBoost
• Base classifiers: C1, C2, …, CT

• Error rate:

  \varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

• Importance of a classifier:

  \alpha_i = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_i}{\varepsilon_i}\right)
Example: AdaBoost
• Weight update:

  w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
    \begin{cases}
      e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\
      e^{\alpha_j}  & \text{if } C_j(x_i) \neq y_i
    \end{cases}

  where Z_j is the normalization factor

• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the re-sampling procedure is repeated
• Classification:

  C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)
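A compact from-scratch sketch of this procedure (decision stumps from scikit-learn act as base classifiers, weighted training stands in for the slides’ resampling step, and the function names are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, rounds=10):
        # y must use +1 / -1 labels
        n = len(y)
        w = np.full(n, 1.0 / n)                   # start with equal weights 1/n
        models, alphas = [], []
        for _ in range(rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w * (pred != y)) / np.sum(w)
            if err >= 0.5:                        # abandon a classifier no better than chance
                w = np.full(n, 1.0 / n)
                continue
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            w *= np.exp(-alpha * y * pred)        # shrink correct, grow misclassified weights
            w /= w.sum()                          # normalize
            models.append(stump)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, X):
        # Weighted majority vote of the base classifiers
        votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
        return np.sign(votes)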
Illustrating AdaBoost
[Figure: a one-dimensional data set of 10 points (labels + and −), each starting with weight 0.1. Boosting round 1 produces classifier B1 with α = 1.9459, round 2 produces B2 with α = 2.9323, and round 3 produces B3 with α = 3.8744; the point weights shift toward the examples each round misclassifies. The overall weighted vote of B1, B2 and B3 reproduces the original labeling (+ + + − − − − − + +).]
Random Forests
• Ensemble method specifically designed for
decision tree classifiers
• Random Forests grows many classification trees (hence the name!)
• Ensemble of unpruned decision trees
• Each base classifier classifies a “new” vector
• Forest chooses the classification having the
most votes (over all the trees in the forest)
Random Forests
• Introduce two sources of randomness:
“Bagging” and “Random input vectors”
– Each tree is grown using a bootstrap sample of
training data
– At each node, best split is chosen from random
sample of mtry variables instead of all variables
Random Forests
Random Forest Algorithm
• Given M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node.
• m is held constant during the forest growing
• Each tree is grown to the largest extent possible
• There is no pruning
• Bagging using decision trees is a special case of random
forests when m=M
Random Forest Algorithm
• Out-of-bag (OOB) error estimate
• Good accuracy without over-fitting
• Fast algorithm (can be faster than growing/pruning a single tree); easily parallelized
• Handles high-dimensional data without much problem
• Only one tuning parameter, mtry (commonly set to the square root of the number of predictors), and results are usually not sensitive to it
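A hedged usage sketch with scikit-learn’s implementation (the iris data set and the parameter values are illustrative, not from the slides):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    rf = RandomForestClassifier(
        n_estimators=100,       # number of trees grown
        max_features="sqrt",    # mtry: a random subset of about sqrt(p) features per split
        oob_score=True,         # estimate accuracy from the out-of-bag samples
        random_state=0,
    )
    rf.fit(X, y)
    print("OOB accuracy estimate:", rf.oob_score_)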
