A Simple Proof of AdaBoost Algorithm
Yin Zhao
yz_math@hotmail.com
where $\epsilon_t$ is the error rate of base classifier $h_t$. If the error rate is less than $0.5$, we can write $\epsilon_t = 0.5 - \gamma_t$, where $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). The bound on the training error of the ensemble becomes
$$e_{\mathrm{ensemble}} \le \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2} \qquad (2)$$
Thus if each base classifier is slightly better than random, so that $\gamma_t > \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast. Nevertheless, because of its tendency to focus on training examples that are misclassified, the AdaBoost algorithm can be quite susceptible to over-fitting [2].
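As a quick numerical illustration of the bound (2) (a sketch with assumed toy values, not taken from the paper), take $T = 50$ rounds with a constant edge $\gamma_t = 0.1$:

```python
# Toy illustration of bound (2): with T = 50 and gamma_t = 0.1 for every round,
# the product bound lies below the exponential bound, and both are already small.
import numpy as np

T, gamma = 50, 0.1                                    # assumed toy values
gammas = np.full(T, gamma)
prod_bound = np.prod(np.sqrt(1.0 - 4.0 * gammas**2))  # prod_t sqrt(1 - 4 gamma_t^2)
exp_bound = np.exp(-2.0 * np.sum(gammas**2))          # exp(-2 sum_t gamma_t^2)
print(prod_bound, exp_bound)                          # ~0.36 <= ~0.37
assert prod_bound <= exp_bound
```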
We will give a new, simple proof of (1) and (2); additionally, we try to explain why the parameter
$$\alpha_t = \frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t}$$
is chosen in the boosting algorithm.
AdaBoost Algorithm:
Recall that the boosting algorithm is as follows:
Given $(x_1, y_1), (x_2, y_2), \cdots, (x_m, y_m)$, where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize
$$D_1(i) = \frac{1}{m}$$
For $t = 1, 2, \ldots, T$: Train the weak learner using distribution $D_t$.
Get weak hypothesis $h_t : X \to \{-1, +1\}$ with error
$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \ne y_i\right]$$
If $\epsilon_t > 0.5$, then the weights $D_t(i)$ are reverted back to their original uniform values $\frac{1}{m}$.
Choose
$$\alpha_t = \frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t} \qquad (3)$$
Update:
e −αt
½
D t (i ) if h t (x i ) = y i
D t +1 (i ) = × (4)
Zt e αt if h t (x i ) 6= y i
where Z t is a normalization factor.
Output the final hypothesis:
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t(x)\right)$$
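The algorithm translates almost line for line into code. The following is a minimal sketch (not from the paper): it assumes decision stumps as the weak learners, since the algorithm above leaves the weak learner unspecified, and it reads the $\epsilon_t > 0.5$ rule as resetting the weights to uniform and discarding that round's hypothesis; the function names (`train_stump`, `adaboost`, `predict`) are illustrative.

```python
# Minimal NumPy sketch of the AdaBoost loop above. Assumptions not stated in the
# paper: decision stumps (single-feature threshold classifiers) as weak learners,
# and the eps_t > 0.5 rule is read as "reset weights to uniform and skip the round".
import numpy as np

def train_stump(X, y, D):
    """Return (weighted error, stump) for the best single-feature threshold classifier."""
    m, d = X.shape
    best_err, best_stump = np.inf, None
    for j in range(d):                              # feature index
        for thr in np.unique(X[:, j]):              # candidate threshold
            for polarity in (+1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()            # epsilon_t under distribution D
                if err < best_err:
                    best_err, best_stump = err, (j, thr, polarity)
    return best_err, best_stump

def stump_predict(stump, X):
    j, thr, polarity = stump
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, T=20):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                         # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        eps, stump = train_stump(X, y, D)
        if eps > 0.5:                               # revert weights to uniform, skip the round
            D = np.full(m, 1.0 / m)
            continue
        eps = max(eps, 1e-12)                       # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)       # equation (3)
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)           # update (4), combined exponent form
        D /= D.sum()                                # divide by the normalizer Z_t
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def predict(hypotheses, alphas, X):
    f = sum(a * stump_predict(h, X) for h, a in zip(hypotheses, alphas))
    return np.sign(f)                               # final hypothesis H(x)
```

The weight update is written in the combined form $D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}/Z_t$, which is exactly the expression used in the proof below.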
Proof:
Firstly, we prove (1). Note that $D_{t+1}(i)$ is a distribution, so its sum $\sum_i D_{t+1}(i)$ equals $1$; hence
$$Z_t = \sum_i D_{t+1}(i)\cdot Z_t = \sum_i D_t(i) \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases}$$
$$= \sum_{i:\, h_t(x_i) = y_i} D_t(i)\cdot e^{-\alpha_t} + \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)\cdot e^{\alpha_t}$$
$$= e^{-\alpha_t} \cdot \sum_{i:\, h_t(x_i) = y_i} D_t(i) + e^{\alpha_t} \cdot \sum_{i:\, h_t(x_i) \ne y_i} D_t(i)$$
$$= e^{-\alpha_t}\cdot(1-\epsilon_t) + e^{\alpha_t}\cdot\epsilon_t \qquad (5)$$
In order to find $\alpha_t$ we can minimize $Z_t$ by setting its first-order derivative equal to $0$:
$$\left[e^{-\alpha_t}\cdot(1-\epsilon_t) + e^{\alpha_t}\cdot\epsilon_t\right]' = -e^{-\alpha_t}\cdot(1-\epsilon_t) + e^{\alpha_t}\cdot\epsilon_t = 0$$
$$\Rightarrow\; \alpha_t = \frac{1}{2}\cdot\log\frac{1-\epsilon_t}{\epsilon_t},$$
which is (3) in the boosting algorithm. Substituting $\alpha_t$ back into (5),
$$Z_t = e^{-\alpha_t}\cdot(1-\epsilon_t) + e^{\alpha_t}\cdot\epsilon_t = e^{-\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot(1-\epsilon_t) + e^{\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}}\cdot\epsilon_t$$
$$= 2\sqrt{\epsilon_t\cdot(1-\epsilon_t)} \qquad (6)$$
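As a sanity check on the last two steps (again a sketch, assuming NumPy and SciPy are available; the error rates below are arbitrary), the closed-form $\alpha_t$ of (3) can be compared with a direct numerical minimization of $Z_t(\alpha)$, and the minimum value with $2\sqrt{\epsilon_t(1-\epsilon_t)}$ from (6):

```python
# For several error rates, the closed-form alpha_t of (3) matches the numerical
# minimizer of Z_t(alpha) = e^{-alpha}(1 - eps) + e^{alpha} eps, and the minimum
# equals 2 * sqrt(eps * (1 - eps)) as in (6).
import numpy as np
from scipy.optimize import minimize_scalar

for eps in (0.1, 0.25, 0.4, 0.49):                   # arbitrary error rates < 0.5
    Z = lambda a, e=eps: np.exp(-a) * (1 - e) + np.exp(a) * e
    alpha_closed = 0.5 * np.log((1 - eps) / eps)     # equation (3)
    alpha_num = minimize_scalar(Z, bounds=(0.0, 10.0), method="bounded").x
    assert abs(alpha_closed - alpha_num) < 1e-4
    assert abs(Z(alpha_closed) - 2 * np.sqrt(eps * (1 - eps))) < 1e-12   # equation (6)
```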
On the other hand, from (4) we have
$$D_{t+1}(i) = \frac{D_t(i)\cdot e^{-\alpha_t\cdot y_i\cdot h_t(x_i)}}{Z_t} = \frac{D_t(i)\cdot e^{K_t}}{Z_t},$$
since the product $y_i \cdot h_t(x_i)$ is $+1$ if $h_t(x_i) = y_i$ and $-1$ if $h_t(x_i) \ne y_i$ (we write $K_t = -\alpha_t\cdot y_i\cdot h_t(x_i)$ for brevity).
Thus we can write down all of the equations:
$$D_1(i) = \frac{1}{m}$$
$$D_2(i) = \frac{D_1(i)\cdot e^{K_1}}{Z_1}$$
$$D_3(i) = \frac{D_2(i)\cdot e^{K_2}}{Z_2}$$
$$\cdots\cdots$$
$$D_{t+1}(i) = \frac{D_t(i)\cdot e^{K_t}}{Z_t}$$
Multiplying all of the equalities above, we obtain
$$D_{t+1}(i) = \frac{1}{m}\cdot\frac{e^{-y_i\cdot f(x_i)}}{\prod_t Z_t}$$
where $f(x_i) = \sum_t \alpha_t\cdot h_t(x_i)$.
Thus
$$\frac{1}{m}\cdot\sum_i e^{-y_i\cdot f(x_i)} = \sum_i D_{t+1}(i)\cdot\prod_t Z_t = \prod_t Z_t \qquad (7)$$
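A short self-contained simulation (with synthetic $\pm 1$ predictions standing in for the weak hypotheses; none of these values come from the paper) makes the telescoping explicit: iterating the update (4) and multiplying the normalization factors reproduces both the closed form for $D_{t+1}(i)$ and identity (7):

```python
# Synthetic check of the telescoping product behind (7).
import numpy as np

rng = np.random.default_rng(0)
m, T = 50, 10
y = rng.choice([-1, 1], size=m)                 # labels
preds = rng.choice([-1, 1], size=(T, m))        # stand-ins for h_t(x_i)

D = np.full(m, 1.0 / m)                         # D_1(i) = 1/m
alphas, Zs = [], []
for t in range(T):
    eps = D[preds[t] != y].sum()                # weighted error of round t
    eps = np.clip(eps, 1e-12, 0.5 - 1e-12)      # keep alpha_t >= 0 (simplification of the reset rule)
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * preds[t])       # update (4) in combined form
    Z = w.sum()                                 # normalization factor Z_t
    D = w / Z
    alphas.append(alpha)
    Zs.append(Z)

f = np.sum(np.array(alphas)[:, None] * preds, axis=0)        # f(x_i) = sum_t alpha_t h_t(x_i)
assert np.allclose(D, np.exp(-y * f) / (m * np.prod(Zs)))    # telescoped weights D_{T+1}(i)
assert np.isclose(np.mean(np.exp(-y * f)), np.prod(Zs))      # identity (7)
```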
Note that if $\epsilon_t > 0.5$ the data set will be re-sampled until $\epsilon_t \le 0.5$; in other words, the parameter $\alpha_t \ge 0$ in each valid iteration. The training error of the ensemble can be expressed as
$$e_{\mathrm{ensemble}} = \frac{1}{m}\cdot\sum_i \begin{cases} 1 & \text{if } y_i \ne H(x_i) \\ 0 & \text{if } y_i = H(x_i) \end{cases} = \frac{1}{m}\cdot\sum_i \begin{cases} 1 & \text{if } y_i\cdot f(x_i) \le 0 \\ 0 & \text{if } y_i\cdot f(x_i) > 0 \end{cases}$$
$$\le \frac{1}{m}\cdot\sum_i e^{-y_i\cdot f(x_i)} = \prod_t Z_t \qquad (8)$$
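The inequality step in (8) rests on the pointwise bound $\mathbf{1}[y_i\cdot f(x_i) \le 0] \le e^{-y_i\cdot f(x_i)}$; a two-line check over assumed toy margins:

```python
# The 0/1 indicator of a non-positive margin never exceeds exp(-margin).
import numpy as np

margins = np.linspace(-3.0, 3.0, 25)            # stand-ins for y_i * f(x_i)
assert np.all((margins <= 0).astype(float) <= np.exp(-margins))
```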
To prove (2), write
$$\epsilon_t = \frac{1}{2} - \gamma_t \qquad (9)$$
and use the elementary inequality
$$1 + x \le e^x \qquad (10)$$
or, equivalently, $e^x - x - 1 \ge 0$. To see this, let $g(x) = e^x - x - 1$; then
$$g'(x) = e^x - 1 = 0 \;\Rightarrow\; x = 0.$$
Since $g''(x) = e^x > 0$,
$$g(x)_{\min} = g(0) = 0 \;\Rightarrow\; e^x - x - 1 \ge 0.$$
Here, as in (9), $\gamma_t$ measures how much better the classifier is than random guessing (on binary problems). Based on (10) we have
$$e_{\mathrm{ensemble}} \le \prod_t 2\cdot\sqrt{\epsilon_t\cdot(1-\epsilon_t)}$$
$$= \prod_t \sqrt{1 - 4\gamma_t^2}$$
$$= \prod_t \left[1 + (-4\gamma_t^2)\right]^{\frac{1}{2}}$$
$$\le \prod_t \left(e^{-4\gamma_t^2}\right)^{\frac{1}{2}} = \prod_t e^{-2\gamma_t^2}$$
$$= e^{-2\cdot\sum_t \gamma_t^2}$$
as desired.
References
[1] Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
[2] Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson Addison Wesley.