
A Simple Proof of AdaBoost Algorithm

Yin Zhao
yz_math@hotmail.com

Updated on November 29, 2014

AdaBoost is a powerful algorithm for building predictive models. However, a major disadvantage is that AdaBoost may overfit in the presence of noise. [1] proved that the training error of the ensemble is bounded by the following expression:

$$ e_{\mathrm{ensemble}} \le \prod_t 2\sqrt{\epsilon_t \cdot (1 - \epsilon_t)} \qquad (1) $$

where $\epsilon_t$ is the error rate of base classifier $t$. If the error rate is less than 0.5, we can write $\epsilon_t = 0.5 - \gamma_t$, where $\gamma_t$ measures how much better the classifier is
than random guessing (on binary problems). The bound on the training error of the ensemble becomes

$$ e_{\mathrm{ensemble}} \le \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2} \qquad (2) $$

Thus if each base classifier is slightly better than random, so that $\gamma_t > \gamma$ for some
$\gamma > 0$, then the training error drops exponentially fast. Nevertheless, because
of its tendency to focus on training examples that are misclassified, the AdaBoost
algorithm can be quite susceptible to over-fitting. [2]
We will give a new, simple proof of (1) and (2); additionally, we try to explain why the parameter

$$ \alpha_t = \frac{1}{2} \cdot \log \frac{1 - \epsilon_t}{\epsilon_t} $$

is chosen in the boosting algorithm.

AdaBoost Algorithm:
Recall the boosting algorithm is:
Given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize

$$ D_1(i) = \frac{1}{m} $$

For $t = 1, 2, \ldots, T$: Train weak learner using distribution $D_t$.
Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error

$$ \epsilon_t = \Pr_{i \sim D_t}\left[ h_t(x_i) \ne y_i \right] $$
If $\epsilon_t > 0.5$, then the weights $D_t(i)$ are reverted back to their original uniform values $\frac{1}{m}$.
Choose

$$ \alpha_t = \frac{1}{2} \cdot \log \frac{1 - \epsilon_t}{\epsilon_t} \qquad (3) $$
Update:

$$ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} \qquad (4) $$

where $Z_t$ is a normalization factor.
Output the final hypothesis:

$$ H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t \cdot h_t(x) \right) $$
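To make the loop above concrete, here is a minimal Python/NumPy sketch of the procedure, using one-dimensional decision stumps as the weak learners. The helper names (train_stump, adaboost_train) and the toy data set are illustrative assumptions, not part of the algorithm statement above.

import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the feature, threshold and sign that minimize the
    weighted error under distribution D (an illustrative choice of weak learner)."""
    m, n = X.shape
    best = (np.inf, None)  # (weighted error, stump parameters)
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(D[pred != y])
                if err < best[0]:
                    best = (err, (j, thr, sign))
    j, thr, sign = best[1]
    return best[0], lambda Z: np.where(Z[:, j] <= thr, sign, -sign)

def adaboost_train(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                   # D_1(i) = 1/m
    alphas, hs = [], []
    for t in range(T):
        eps, h = train_stump(X, y, D)
        if eps > 0.5:                         # revert to uniform weights, as in the text
            D = np.full(m, 1.0 / m)
            continue
        eps = max(eps, 1e-12)                 # guard against a perfect stump (eps = 0)
        alpha = 0.5 * np.log((1 - eps) / eps)     # equation (3)
        D = D * np.exp(-alpha * y * h(X))         # equation (4), both cases at once
        D /= D.sum()                              # normalize by Z_t
        alphas.append(alpha); hs.append(h)
    def H(Z):                                 # final hypothesis: sign of the weighted vote
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
    return H

# Toy usage: a small binary problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
H = adaboost_train(X, y, T=10)
print("training error:", np.mean(H(X) != y))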

Proof:
Firstly, we will prove (1). Note that $D_{t+1}(i)$ is a distribution and its summation
$\sum_i D_{t+1}(i)$ equals 1, hence

$$ Z_t = \sum_i D_{t+1}(i) \cdot Z_t = \sum_i D_t(i) \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} $$

$$ = \sum_{i : h_t(x_i) = y_i} D_t(i) \cdot e^{-\alpha_t} + \sum_{i : h_t(x_i) \ne y_i} D_t(i) \cdot e^{\alpha_t} $$

$$ = e^{-\alpha_t} \cdot \sum_{i : h_t(x_i) = y_i} D_t(i) + e^{\alpha_t} \cdot \sum_{i : h_t(x_i) \ne y_i} D_t(i) $$

$$ = e^{-\alpha_t} \cdot (1 - \epsilon_t) + e^{\alpha_t} \cdot \epsilon_t \qquad (5) $$
In order to find $\alpha_t$ we can minimize $Z_t$ by setting its first-order derivative equal
to 0:

$$ \frac{\partial}{\partial \alpha_t}\left[ e^{-\alpha_t} \cdot (1 - \epsilon_t) + e^{\alpha_t} \cdot \epsilon_t \right] = -e^{-\alpha_t} \cdot (1 - \epsilon_t) + e^{\alpha_t} \cdot \epsilon_t = 0 $$

$$ \Rightarrow \alpha_t = \frac{1}{2} \cdot \log \frac{1 - \epsilon_t}{\epsilon_t} $$

which is (3) in the boosting algorithm. Substituting $\alpha_t$ back into (5),

$$ Z_t = e^{-\alpha_t} \cdot (1 - \epsilon_t) + e^{\alpha_t} \cdot \epsilon_t = e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}} \cdot (1 - \epsilon_t) + e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}} \cdot \epsilon_t $$

$$ = 2\sqrt{\epsilon_t \cdot (1 - \epsilon_t)} \qquad (6) $$
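As an illustrative sanity check of (3), (5), and (6) (not part of the proof), the following NumPy snippet evaluates $Z_t(\alpha) = e^{-\alpha}(1-\epsilon_t) + e^{\alpha}\epsilon_t$ on a dense grid and confirms that the minimizer matches (3) and the minimum value matches (6); the error rate eps = 0.3 is an arbitrary choice.

import numpy as np

eps = 0.3                                    # an arbitrary error rate in (0, 0.5)
alphas = np.linspace(-3, 3, 200001)          # dense grid of candidate alpha values
Z = np.exp(-alphas) * (1 - eps) + np.exp(alphas) * eps   # Z_t(alpha), equation (5)

alpha_star = 0.5 * np.log((1 - eps) / eps)   # closed-form minimizer, equation (3)
Z_star = 2 * np.sqrt(eps * (1 - eps))        # minimum value, equation (6)

print(abs(alphas[np.argmin(Z)] - alpha_star) < 1e-3)   # True: grid minimizer matches (3)
print(abs(Z.min() - Z_star) < 1e-6)                    # True: minimum value matches (6)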
On the other hand, from (4) we have

$$ D_{t+1}(i) = \frac{D_t(i) \cdot e^{-\alpha_t \cdot y_i \cdot h_t(x_i)}}{Z_t} = \frac{D_t(i) \cdot e^{K_t}}{Z_t} $$

where $K_t = -\alpha_t \cdot y_i \cdot h_t(x_i)$; this single exponent covers both cases of (4), since the product $y_i \cdot h_t(x_i)$ equals $1$ if $h_t(x_i) = y_i$ and $-1$ if $h_t(x_i) \ne y_i$.
Thus we can write down all of the equations:

$$ D_1(i) = \frac{1}{m} $$

$$ D_2(i) = \frac{D_1(i) \cdot e^{K_1}}{Z_1} $$

$$ D_3(i) = \frac{D_2(i) \cdot e^{K_2}}{Z_2} $$

$$ \cdots\cdots $$

$$ D_{t+1}(i) = \frac{D_t(i) \cdot e^{K_t}}{Z_t} $$
Multiply all the equalities above and obtain

$$ D_{t+1}(i) = \frac{1}{m} \cdot \frac{e^{-y_i \cdot f(x_i)}}{\prod_t Z_t} $$

where $f(x_i) = \sum_t \alpha_t \cdot h_t(x_i)$.
Thus

$$ \frac{1}{m} \cdot \sum_i e^{-y_i \cdot f(x_i)} = \sum_i D_{t+1}(i) \cdot \prod_t Z_t = \prod_t Z_t \qquad (7) $$
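Identity (7) uses only the update rule (4) and the fact that each $D_{t+1}$ is normalized, so it can be checked numerically with arbitrary weak hypotheses. The snippet below is an illustrative check with random $\pm 1$ predictions and arbitrary positive $\alpha_t$ values (assumptions made for the demonstration, not tied to any particular weak learner).

import numpy as np

rng = np.random.default_rng(1)
m, T = 50, 8
y = rng.choice([-1, 1], size=m)              # labels
h = rng.choice([-1, 1], size=(T, m))         # h_t(x_i) for each round, arbitrary +-1 values
alpha = rng.uniform(0.1, 1.0, size=T)        # arbitrary positive alpha_t

D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
Zs = []
for t in range(T):
    w = D * np.exp(-alpha[t] * y * h[t])     # unnormalized update, equation (4)
    Z = w.sum()                              # normalization factor Z_t
    D = w / Z
    Zs.append(Z)

f = (alpha[:, None] * h).sum(axis=0)         # f(x_i) = sum_t alpha_t * h_t(x_i)
lhs = np.mean(np.exp(-y * f))                # (1/m) * sum_i exp(-y_i f(x_i))
rhs = np.prod(Zs)                            # prod_t Z_t
print(np.isclose(lhs, rhs))                  # True, which is identity (7)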

Note that if $\epsilon_t > 0.5$ the data set will be re-sampled until $\epsilon_t \le 0.5$; in other words,
the parameter $\alpha_t \ge 0$ in each valid iteration. The training error of the
ensemble can be expressed as

$$ e_{\mathrm{ensemble}} = \frac{1}{m} \cdot \sum_i \begin{cases} 1 & \text{if } y_i \ne H(x_i) \\ 0 & \text{if } y_i = H(x_i) \end{cases} = \frac{1}{m} \cdot \sum_i \begin{cases} 1 & \text{if } y_i \cdot f(x_i) \le 0 \\ 0 & \text{if } y_i \cdot f(x_i) > 0 \end{cases} $$

$$ \le \frac{1}{m} \cdot \sum_i e^{-y_i \cdot f(x_i)} = \prod_t Z_t \qquad (8) $$

The inequality holds because $e^{-y_i \cdot f(x_i)} \ge 1$ whenever $y_i \cdot f(x_i) \le 0$ and $e^{-y_i \cdot f(x_i)} \ge 0$ otherwise; the last equality derives from (7).
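A tiny illustrative check of this pointwise bound $e^{-z} \ge \mathbf{1}\{z \le 0\}$, with $z = y_i \cdot f(x_i)$:

import numpy as np

z = np.linspace(-5, 5, 1001)                 # candidate margins y_i * f(x_i)
indicator = (z <= 0).astype(float)           # the 0-1 training-error indicator
print(np.all(np.exp(-z) >= indicator))       # True: exp(-z) upper-bounds the indicator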


According to (6) and (8), we have proved (1):

$$ e_{\mathrm{ensemble}} \le \prod_t 2\sqrt{\epsilon_t \cdot (1 - \epsilon_t)} \qquad (9) $$

In order to prove (2), we first prove the following inequality:

$$ 1 + x \le e^x \qquad (10) $$

or, equivalently, $e^x - x - 1 \ge 0$.
Let $f(x) = e^x - x - 1$; then

$$ f'(x) = e^x - 1 = 0 \Rightarrow x = 0 $$

Since $f''(x) = e^x > 0$,

$$ f(x)_{\min} = f(0) = 0 \Rightarrow e^x - x - 1 \ge 0 $$

which is desired. Now we go back to (9) and let

$$ \epsilon_t = \frac{1}{2} - \gamma_t $$

where $\gamma_t$ measures how much better the classifier is than random guessing (on
binary problems). Based on (10) we have

$$ e_{\mathrm{ensemble}} \le \prod_t 2\sqrt{\epsilon_t \cdot (1 - \epsilon_t)} $$

$$ = \prod_t \sqrt{1 - 4\gamma_t^2} $$

$$ = \prod_t \left[ 1 + (-4\gamma_t^2) \right]^{\frac{1}{2}} $$

$$ \le \prod_t \left( e^{-4\gamma_t^2} \right)^{\frac{1}{2}} = \prod_t e^{-2\gamma_t^2} $$

$$ = e^{-2 \cdot \sum_t \gamma_t^2} $$

as desired.
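The per-round step $\sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$ can also be confirmed numerically over the whole admissible range $\gamma_t \in [0, \frac{1}{2}]$; the snippet below is only an illustrative check of the algebra above.

import numpy as np

gamma = np.linspace(0.0, 0.5, 5001)          # gamma_t ranges over [0, 1/2]
lhs = np.sqrt(1.0 - 4.0 * gamma**2)          # per-round factor sqrt(1 - 4 gamma_t^2)
rhs = np.exp(-2.0 * gamma**2)                # bound e^{-2 gamma_t^2} from inequality (10)
print(np.all(lhs <= rhs + 1e-12))            # True: each factor obeys the bound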

References
[1] Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.

[2] Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston:
Pearson Addison Wesley.
