
Ensemble Classifiers
Prof. Navneet Goyal

Ensemble Classifiers
Introduction & Motivation
Construction of Ensemble Classifiers
Boosting (AdaBoost)
Bagging
Random Forests

Empirical Comparison

Introduction & Motivation

Suppose that you are a patient with a set of symptoms.
Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors!
Is this a good idea? Indeed it is.
Consult many doctors and then, based on their combined diagnoses, you can get a fairly accurate diagnosis.
Majority voting - bagging
More weight to the opinions of some good (accurate) doctors - boosting
In bagging, you give equal weight to all classifiers, whereas in boosting you weight each classifier according to its accuracy.

Ensemble Methods
Construct a set of classifiers from the training data
Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers
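
As a minimal sketch of the aggregation step (not from the slides; the helper name `majority_vote` is illustrative), the snippet below combines the predictions of several already-trained base classifiers by majority vote:

```python
import numpy as np

def majority_vote(predictions):
    """Combine predictions from several base classifiers by majority vote.

    predictions: array of shape (n_classifiers, n_records) holding class labels.
    Returns an array of shape (n_records,) with the most frequent label per record.
    """
    predictions = np.asarray(predictions)
    n_records = predictions.shape[1]
    combined = np.empty(n_records, dtype=predictions.dtype)
    for j in range(n_records):
        labels, counts = np.unique(predictions[:, j], return_counts=True)
        combined[j] = labels[np.argmax(counts)]
    return combined

# Three base classifiers voting on four unseen records
votes = [[ 1, -1, 1,  1],
         [ 1,  1, 1, -1],
         [-1, -1, 1,  1]]
print(majority_vote(votes))  # [ 1 -1  1  1]
```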

General Idea

Ensemble Classifiers (EC)

An ensemble classifier constructs a set of base classifiers from the training data
Methods for constructing an EC:
Manipulating the training set
Manipulating the input features
Manipulating the class labels
Manipulating the learning algorithm

Ensemble Classifiers (EC)

Manipulating the training set
Multiple training sets are created by resampling the data according to some sampling distribution
The sampling distribution determines how likely it is that an example will be selected for training, and may vary from one trial to another
A classifier is built from each training set using a particular learning algorithm
Examples: Bagging & Boosting
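
A minimal sketch of the resampling step (the helper name `resample` is an illustrative assumption): a training set is drawn with replacement according to a sampling distribution over the examples, uniform for bagging and skewed towards hard examples for boosting.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(n_examples, weights=None):
    """Draw a training set of size n_examples by sampling indices with replacement.

    weights: sampling distribution over the examples (uniform if None, as in bagging;
    skewed towards hard examples in boosting).
    """
    return rng.choice(n_examples, size=n_examples, replace=True, p=weights)

print(resample(10))                            # uniform sampling, bagging-style
print(resample(10, weights=np.full(10, 0.1)))  # the same distribution given explicitly
```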

Ensemble Classifiers (EC)

Manipulating the input features
A subset of the input features is chosen to form each training set
The subset can be chosen randomly or based on inputs given by domain experts
Good for data that has redundant features
Random Forest is an example, which uses decision trees as its base classifiers
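
A minimal sketch of this random-subspace idea (the helper name `random_feature_subset` is illustrative): each base classifier sees only a randomly chosen subset of the feature columns.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_subset(X, n_selected):
    """Return X restricted to n_selected randomly chosen feature columns."""
    features = rng.choice(X.shape[1], size=n_selected, replace=False)
    return X[:, features], features

X = rng.normal(size=(100, 20))            # 100 examples, 20 features
X_sub, chosen = random_feature_subset(X, n_selected=5)
print(chosen, X_sub.shape)                # 5 chosen feature indices, (100, 5)
```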

Ensemble Classifiers (EC)

Manipulating the class labels
Useful when the number of classes is sufficiently large
The training data is transformed into a binary class problem by randomly partitioning the class labels into two disjoint subsets, A0 & A1
The re-labelled examples are used to train a base classifier
By repeating the class-relabelling and model-building steps several times, an ensemble of base classifiers is obtained
How is a new tuple classified?
Example: error-correcting output coding (pp. 307)
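
A minimal sketch of the relabelling step (assuming integer class labels; the helper name `relabel_binary` is illustrative): the classes are split at random into two disjoint groups A0 and A1, and each example is mapped to 0 or 1 accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel_binary(y):
    """Randomly partition the class labels into A0 and A1 and relabel y as 0/1."""
    classes = np.unique(y)
    mask = rng.random(classes.size) < 0.5      # True -> A1, False -> A0
    a1 = set(classes[mask])
    return np.array([1 if label in a1 else 0 for label in y]), a1

y = np.array([0, 3, 2, 1, 4, 2, 0, 3])
y_bin, a1 = relabel_binary(y)
print(a1, y_bin)    # which classes went to A1, and the new binary labels
```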

Ensemble Classifiers (EC)

Manipulating the learning algorithm
Learning algorithms can be manipulated in such a way that applying the algorithm several times on the same training data may result in different models
Example: an ANN can produce different models by changing the network topology or the initial weights of the links between neurons
Example: an ensemble of decision trees can be constructed by introducing randomness into the tree-growing procedure; instead of choosing the best split attribute at each node, we randomly choose one of the top k attributes
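
A minimal sketch of this randomized split choice (the function name and the example gain values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_split_attribute(scores, k=3):
    """Pick the split attribute at a node.

    Instead of always returning the best-scoring attribute, randomly choose one of
    the top-k attributes so that repeated runs grow different trees.
    scores: 1-D array of split scores (e.g. information gain), one per attribute.
    """
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k best attributes
    return rng.choice(top_k)

gains = np.array([0.02, 0.31, 0.12, 0.28, 0.05])
print(choose_split_attribute(gains, k=3))  # one of attributes 1, 3 or 2
```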

Ensemble Classifiers (EC)

The first three approaches are generic and can be applied to any classifier
The fourth approach depends on the type of classifier used
Base classifiers can be generated sequentially or in parallel

Ensemble Classifiers
Ensemble methods work better with unstable classifiers
Classifiers that are sensitive to minor perturbations in the training set
Examples:
Decision trees
Rule-based classifiers
Artificial neural networks

Why does it work?

Suppose there are 25 base classifiers
Each classifier has error rate \varepsilon = 0.35
Assume the classifiers are independent
The ensemble makes a wrong prediction only if a majority (at least 13) of the base classifiers are wrong:

P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06

Check for yourself that this is correct!
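
The 0.06 figure can be checked with a few lines of Python (25 independent base classifiers, each with error rate 0.35; the ensemble errs when at least 13 of them are wrong):

```python
from math import comb

eps, n = 0.35, 25
# P(at least 13 of the 25 independent base classifiers are wrong)
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 2))   # 0.06, as claimed on the slide
```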

Examples of Ensemble Methods
How to generate an ensemble of classifiers?
Bagging
Boosting
Random Forests

Bagging
Also known as bootstrap aggregation
Sampling uniformly with replacement:

Original Data:       1   2   3   4   5   6   7   8   9   10
Bagging (Round 1):   7   8   10  8   2   5   10  10  5   9
Bagging (Round 2):   1   4   9   1   2   3   2   7   3   2
Bagging (Round 3):   1   8   5   10  5   5   9   6   3   7

Build a classifier on each bootstrap sample
0.632 bootstrap:
Each bootstrap sample Di contains approx. 63.2% of the original training data
The remaining 36.8% are used as the test set
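
The 63.2% figure can be checked quickly: a bootstrap sample of size n leaves out any given record with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so roughly 63.2% of the records appear in the sample. A minimal simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000                                    # size of the original training set
sample = rng.choice(n, size=n, replace=True)  # one bootstrap sample D_i
unique_fraction = np.unique(sample).size / n
print(round(unique_fraction, 3))              # ~0.632: fraction of originals appearing in D_i
```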

Bagging
Accuracy of bagging (.632 bootstrap estimate):

Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{\text{test set}} + 0.368 \cdot Acc(M_i)_{\text{train set}} \right)

Works well for small data sets


Example:

x:   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
y:    1     1     1    -1    -1    -1    -1     1     1     1

Bagging
Decision stump:
A single-level binary decision tree
Entropy-based split at x <= 0.35 or x <= 0.75
Accuracy at most 70%

Bagging

Accuracy of the ensemble classifier: 100%
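
A minimal sketch of this experiment with scikit-learn, assuming the label pattern shown in the table above (whether bagging reaches exactly 100% depends on the bootstrap samples drawn):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data from the slides: ten one-dimensional points with blocks of +1 / -1 labels
X = np.linspace(0.1, 1.0, 10).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

# A single decision stump (one-level tree) is at most 70% accurate here
stump = DecisionTreeClassifier(max_depth=1)
print(stump.fit(X, y).score(X, y))           # 0.7

# Bagging many stumps can recover the full pattern (often 100% on this data)
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                        n_estimators=10, random_state=1)
print(bag.fit(X, y).score(X, y))             # frequently 1.0, as on the slide
```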

Bagging - Final Points

Works well if the base classifiers are unstable
Increases accuracy because it reduces the variance of the individual classifiers
Does not focus on any particular instance of the training data
Therefore, less susceptible to model overfitting when applied to noisy data
What if we want to focus on particular instances of the training data?

Boosting
An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike in bagging, the weights may change at the end of each boosting round

Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Original Data:        1   2   3   4   5   6   7   8   9   10
Boosting (Round 1):   7   3   2   8   7   9   4   10  6   3
Boosting (Round 2):   5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):   4   4   8   10  4   5   4   6   3   4

Example 4 is hard to classify
Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Boosting
Equal weights are assigned to each training tuple (1/d for round 1)
After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to pay more attention to the tuples that were misclassified by Mi
The final boosted classifier M* combines the votes of each individual classifier
The weight of each classifier's vote is a function of its accuracy
AdaBoost - a popular boosting algorithm

AdaBoost
Input:
A training set D containing d tuples
k rounds
A classification learning scheme

Output:
A composite model

AdaBoost
Data set D contains d class-labelled tuples (X1, y1), (X2, y2), (X3, y3), ..., (Xd, yd)
Initially, assign an equal weight of 1/d to each tuple
To generate k base classifiers, we need k rounds or iterations
In round i, tuples from D are sampled with replacement to form Di (of size d)
Each tuple's chance of being selected depends on its weight

AdaBoost
A base classifier Mi is derived from the training tuples of Di
The error of Mi is tested using Di
The weights of the training tuples are adjusted depending on how they were classified:
Correctly classified: decrease weight
Incorrectly classified: increase weight
The weight of a tuple indicates how hard it is to classify (directly proportional)

AdaBoost
Some classifiers may be better at classifying some hard tuples than others
We finally have a series of classifiers that complement each other!
Error rate of model Mi:

error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)

where err(Xj) is the misclassification error of Xj (= 1 if Xj is misclassified, 0 otherwise)
If the classifier's error exceeds 0.5, we abandon it
Try again with a new Di and a new Mi derived from it

AdaBoost
error(Mi) affects how the weights of the training tuples are updated
If a tuple is correctly classified in round i, its weight is multiplied by

\frac{error(M_i)}{1 - error(M_i)}

Adjust the weights of all correctly classified tuples
Then the weights of all tuples (including the misclassified tuples) are normalized:

\text{normalization factor} = \frac{\text{sum of old weights}}{\text{sum of new weights}}

The weight of classifier Mi's vote is

\log \frac{1 - error(M_i)}{error(M_i)}

AdaBoost
The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
The weight of classifier Mi's vote is

\log \frac{1 - error(M_i)}{error(M_i)}

For each class c, sum the weights of every classifier that assigned class c to X (an unseen tuple)
The class with the highest sum is the WINNER!
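
A minimal sketch of this weighted vote (the trained base classifiers, their error rates, and all names here are illustrative assumptions):

```python
import math
from collections import defaultdict

def weighted_vote(classifiers, errors, x):
    """Classify an unseen tuple x by AdaBoost-style weighted voting.

    classifiers: list of trained models, each callable on x and returning a class label.
    errors: list of error(M_i) values, each assumed to lie in (0, 0.5).
    """
    class_weights = defaultdict(float)
    for model, err in zip(classifiers, errors):
        vote_weight = math.log((1 - err) / err)   # weight of M_i's vote
        class_weights[model(x)] += vote_weight
    return max(class_weights, key=class_weights.get)

# Toy usage: three "classifiers" as plain functions, with made-up error rates
models = [lambda x: 1, lambda x: -1, lambda x: 1]
print(weighted_vote(models, errors=[0.10, 0.30, 0.45], x=None))   # 1
```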

Example: AdaBoost
Base classifiers: C1, C2, ..., CT
Error rate of classifier Ci:

\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

Importance of a classifier:

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)

Example: AdaBoost
Weight update:

w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}

where Z_j is the normalization factor
If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the re-sampling procedure is repeated
Classification:

C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)
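
A minimal sketch of a boosting loop built directly from these formulas, using scikit-learn decision stumps as base classifiers; sample weights are updated in place rather than by re-sampling, which is a common equivalent variant (the data set here is synthetic and illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy binary data with labels in {-1, +1}
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

T = 10                                  # number of boosting rounds
N = len(y)
w = np.full(N, 1 / N)                   # start with equal weights 1/N
classifiers, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    miss = pred != y
    eps = np.sum(w[miss])                               # weighted error rate epsilon_j
    if eps > 0.5:                                       # abandon the round, reset weights
        w = np.full(N, 1 / N)
        continue
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))   # importance alpha_j
    w = w * np.exp(np.where(miss, alpha, -alpha))       # raise/lower tuple weights
    w = w / w.sum()                                     # normalize (Z_j)
    classifiers.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the alpha-weighted sum of base predictions
ensemble_pred = np.sign(sum(a * c.predict(X) for a, c in zip(alphas, classifiers)))
print("training accuracy:", np.mean(ensemble_pred == y))
```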

Illustrating AdaBoost

[Figure: training data points with the initial weights assigned to each data point]

Random Forests
An ensemble method specifically designed for decision tree classifiers
Random Forests grow many classification trees (that is why the name!)
Ensemble of unpruned decision trees
Each base classifier classifies a new vector
The forest chooses the classification having the most votes (over all the trees in the forest)

Random Forests
Introduces two sources of randomness: bagging and random input vectors
Each tree is grown using a bootstrap sample of the training data
At each node, the best split is chosen from a random sample of mtry variables instead of all variables

Random Forests

Random Forest Algorithm
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node
m is held constant while the forest is grown
Each tree is grown to the largest extent possible
There is no pruning
Bagging using decision trees is a special case of random forests, obtained when m = M
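
A minimal sketch with scikit-learn (the Iris data and the parameter values are illustrative; `max_features` plays the role of m, and leaving `max_depth` unset grows unpruned trees):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 unpruned trees; at each node only a random subset of features ("sqrt" of M) is considered
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))   # accuracy of the forest's majority vote on the training data
```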

Random Forest Algorithm
Out-of-bag (OOB) error estimate (sketched below)
Good accuracy without over-fitting
Fast algorithm (can be faster than growing/pruning a single tree); easily parallelized
Handles high-dimensional data without much problem
Only one tuning parameter, mtry (typically √p for classification); results are usually not sensitive to it
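
A minimal sketch of the OOB estimate with scikit-learn (dataset and parameters illustrative): each tree is evaluated on the roughly 36.8% of tuples left out of its bootstrap sample, giving an error estimate without a separate test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB error estimate:", round(1 - forest.oob_score_, 3))
```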
