Tuo Zhao Notes
In this lecture we formulate the basic (supervised) learning problem and introduce several key concepts
including loss function, risk and error decomposition.
1 Basic Concepts
We use X and Y to denote the input space and the output space, where typically we have X = Rp . A joint
probability distribution on X × Y is denoted as PX,Y . Let (X, Y ) be a pair of random variables distributed
according to PX,Y . We also use PX and PY |X to denote the marginal distribution of X and the conditional
distribution of Y given X.
Let Dn = {(x1, y1), . . . , (xn, yn)} be an i.i.d. random sample from PX,Y. The goal of supervised learning
is to find a mapping h : X → Y based on Dn so that h(X) is a good approximation of Y. When Y = R
the learning problem is often called regression, and when Y = {0, 1} or {−1, 1} it is often called (binary)
classification.
The dataset Dn is often called the training set (or training data), and it is important since the distribution
PX,Y is usually unknown. A learning algorithm is a procedure A which takes the training set Dn and
produces a predictor ĥ = A(Dn ) as the output. Typically the learning algorithm will search over a space of
functions H, which we call the hypothesis space.
The quality of a predictor h is measured by its risk R(h) = E[ℓ(Y, h(X))] for a given loss function ℓ, and the best achievable risk is
    R∗ = inf_h R(h),
where the infimum is often taken with respect to all measurable functions. The performance of a given
predictor/estimator can be evaluated by how close R(h) is to R∗. Minimization of the risk is non-trivial
because the underlying distribution PX,Y is in general unknown, and the training data Dn only gives us
incomplete knowledge of PX,Y in practice.
2.1 Binary Classification
For the classification problem, a predictor h is also called a classifier, and the loss function for binary classification
is often taken to be the 0/1 loss ℓ(y, p) = I(y ≠ p). In this case, the conditional risk of h at X = x is
    P(h(X) ≠ Y | X = x) = 1 − P(h(X) = Y | X = x)
    = 1 − (P(h(X) = 1, Y = 1 | X = x) + P(h(X) = 0, Y = 0 | X = x))
    = 1 − (E[I(h(X) = 1)I(Y = 1) | X = x] + E[I(h(X) = 0)I(Y = 0) | X = x])
    = 1 − (I(h(x) = 1)E[I(Y = 1) | X = x] + I(h(x) = 0)E[I(Y = 0) | X = x])
    = 1 − I(h(x) = 1)P(Y = 1 | X = x) − I(h(x) = 0)P(Y = 0 | X = x).
Let η(x) := P(Y = 1 | X = x) and let h∗(x) = I(η(x) ≥ 1/2) denote the Bayes classifier. For any classifier h,
    P(h(X) ≠ Y | X = x) − P(h∗(X) ≠ Y | X = x)
    = P(h∗(X) = Y | X = x) − I(h(x) = 1)η(x) − I(h(x) = 0)(1 − η(x))
    = η(x)[I(h∗(x) = 1) − I(h(x) = 1)] + (1 − η(x))[I(h∗(x) = 0) − I(h(x) = 0)]
    = (2η(x) − 1)[I(h∗(x) = 1) − I(h(x) = 1)]
    ≥ 0,
where the last inequality holds by the definition of h∗(x): h∗(x) = 1 exactly when 2η(x) − 1 ≥ 0. The result follows by integrating both sides with respect to x. □
2.2 Regression
In regression we typically have X = Rp and Y = R, and the risk is often measured by the squared error
loss ℓ(p, y) = (p − y)². The following result shows that for regression with the squared error loss, the optimal predictor
is the conditional mean function E[Y | X = x].
Theorem 1-2. Suppose the loss function ℓ(., .) is the squared error loss. Let h∗ (x) = E[Y |X = x], then we
have R(h∗ ) = R∗ .
The proof is left as an exercise. Thus regression with the squared error loss can be thought of as trying to estimate
the conditional mean function. What about regression with its risk defined by the absolute error loss function?
3 Approximation Error vs. Estimation Error
Suppose that the learning algorithm chooses the predictor from the hypothesis space H, and define
h∗ = arg inf_{h∈H} R(h), i.e. h∗ is the best predictor within H.¹ Then the excess risk of the output ĥn of the learning
algorithm is defined and can be decomposed as follows²:
    R(ĥn) − R∗ = [R(h∗) − R∗] + [R(ĥn) − R(h∗)],
where the first term is the approximation error and the second term is the estimation error.
Such a decomposition reflects a trade-off similar to the bias-variance trade-off (perhaps slightly more general).
The approximation error is deterministic and is caused by the restriction to H. The estimation error
is caused by the use of a finite sample that cannot completely represent the underlying distribution.
The approximation error behaves like a squared-bias term, and the estimation error behaves like the
variance term in standard statistical estimation problems. Similar to the bias-variance trade-off, there is also
a trade-off between the approximation error and the estimation error: if H is large then we have
a small approximation error but a relatively large estimation error, and vice versa.
¹ Sometimes h∗ is defined as the approximately best predictor chosen by the estimation procedure when an infinite amount of
data is given. In that case, the approximation error can be caused both by the function class H and by some intrinsic bias of
the estimation procedure.
² Sometimes the decomposition is done for E_{Dn}[R(ĥn)] − R∗.
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
In this lecture we introduce the concepts of empirical risk minimization, overfitting, model complexity and
regularization.
A natural strategy is to pick the predictor that minimizes the average loss on the training data:
    ĥn = arg min_{h∈H} (1/n) Σ_{i=1}^n ℓ(yi, h(xi)),
which we call empirical risk minimization (ERM). Furthermore, we define the empirical risk R̂n(h) as
    R̂n(h) := (1/n) Σ_{i=1}^n ℓ(yi, h(xi)).
Because under some conditions R̂n (h) →p R(h) by the law of large numbers, the usage of ERM is at least
partially justified.
ERM covers many popular methods and is widely used in practice. For example, if we take H = {h(x) :
h(x) = θT x, θ ∈ Rp} and ℓ(y, p) = (y − p)², then ERM becomes the well-known least squares estimation.
The celebrated maximum likelihood estimation (MLE) is also a special case of ERM where the loss function
is taken to be the negative log-likelihood function. Example: in binary classification with data (x1, y1), . . . , (xn, yn),
where yi ∈ {−1, 1} and H = {h(x) : h(x) = θT x, θ ∈ Rp}, logistic regression is computed by minimizing
the logistic loss:
    θ̂ = arg min_θ (1/n) Σ_{i=1}^n log(1 + exp(−yi θT xi)),
which is equivalent to MLE.
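To make the ERM recipe concrete, here is a minimal sketch (not from the notes) of logistic-loss ERM solved by plain gradient descent; the synthetic data, step size and iteration count are illustrative assumptions.

```python
import numpy as np

def logistic_erm(X, y, lr=0.1, iters=1000):
    """Minimize the empirical logistic loss (1/n) sum_i log(1 + exp(-y_i theta^T x_i))."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(iters):
        margins = y * (X @ theta)                          # y_i theta^T x_i
        # gradient: -(1/n) sum_i y_i x_i / (1 + exp(margin_i))
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

# toy linearly structured data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200))
theta_hat = logistic_erm(X, y)
```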
2 Overfitting
ERM works by minimizing the empirical risk R̂n (h), while the goal of learning is to obtain a predictor with
a small risk R(h). Although under certain conditions the former will converge to the latter as n → ∞, in
practice we always have a finite sample and as a result, there might be a large discrepancy between those
two targets, especially when H is large and n is small. Overfitting refers to the situation where we have a
small empirical risk but still a relatively large true risk.
Consider the following example. Let ℓ(y, p) = (y − p)2 and we obtain the predictor ĥ by ERM:
    ĥ = arg min_{h∈H} (1/n) Σ_{i=1}^n (yi − h(xi))².
[Figure 1: panels showing polynomial fits of increasing degree (degree 1, degree 2, . . .) to the same 10 training points; axes omitted.]
Figure 1: Overfitting of polynomial regression. The true signal (blue line) is h∗(x) = sin(x), and
the function is fitted using 10 training examples (red dots). P1 and P2 show a lack of fit (underfitting),
while P5 overfits.
[Figure: schematic of risk versus model complexity; the empirical risk decreases as complexity grows while the true risk is U-shaped, and the best model in theory sits at the minimum of the true risk.]
Figure 1 shows the case where H is taken to be P1, P2, . . ., where Pk is the set of all polynomial functions
of degree up to k.
We can see that when H = P3 the fitted predictor will have a small risk (close to the true signal sin(x)).
Taking Pk with larger k as the hypothesis space can clearly improve the fit to the
10 observations (red dots), but this does not necessarily reduce the true risk, as it overfits the training data.
Learning is more about generalization than memorization.
To avoid overfitting, one usually controls the complexity of the model. Two common approaches are the following.
1. Take H1, H2, . . . , Hn, . . . to be a sequence of hypothesis spaces of increasing size; typically one has
Hk ⊂ Hk+1 and ∪k Hk = H. Given the training data Dn one finds ĥn by minimizing the empirical risk
within a properly chosen Hk. This covers the method of sieves and structural risk minimization (SRM).
2. Define a penalty function Ω : H → R+ and find ĥn by the following optimization procedure:
    ĥn = arg min_{h∈H} (1/n) Σ_{i=1}^n ℓ(yi, h(xi)) + λn Ω(h),
where λn > 0 balances the trade-off between goodness-of-fit and model complexity. This is also known
as penalized empirical risk minimization.
In practice we often need to select Hn or λn based on the training data to achieve a good balance between
goodness-of-fit and model complexity.
Consider the following regression problem: let H = {h(x) : h(x) = θT x, θ ∈ Rp} and suppose we are trying to find
an estimator θ̂ which minimizes the risk EX,Y (Y − θT X)². For the first approach, we could define a sequence
of increasing constants 0 ≤ η1 ≤ η2 ≤ . . . ≤ ηk ≤ . . . and define Hk = {h(x) : h(x) = θT x, θT θ ≤ ηk}. For the
second approach we define Ω(h) = θT θ. Then it is well known from optimization that these two approaches
are mathematically equivalent (i.e. for any ηk there exists a λ such that the two optimization problems
have the same solution).
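As an illustration of the second (penalized) approach in this linear setting: with the squared error loss and Ω(h) = θT θ, penalized ERM is ridge regression and has a closed-form solution. The following is a sketch; the data and the value of λ are illustrative assumptions.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Penalized ERM: argmin_theta (1/n)||y - X theta||^2 + lam * theta^T theta.
    Setting the gradient to zero gives (X^T X / n + lam I) theta = X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
theta = ridge_erm(X, y, lam=0.1)
print(np.sum(theta**2))   # decreasing lam relaxes the implicit constraint theta^T theta <= eta
```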
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
The approximation error is deterministic and mainly caused by two possible reasons: (1) the restriction
of using the function class H; (2) if inf h∈H R(h) in the above equation were replaced by the minimum risk
achievable by the learning algorithm with infinite amount of data, then it can also be caused by the systematic
bias of the learning algorithm. Such an error is often not controllable as we do not know the underlying
distribution PX,Y . On the other hand, the estimation error depends on the sample size, the function class
H and the learning algorithm which we have control over. We would like to obtain probability bounds for
the estimation error.
Specifically, if we use ERM to obtain our predictor ĥn = arg min_{h∈H} R̂n(h), and assume that inf_{h∈H} R(h) =
R(h∗) for some h∗ ∈ H, then since R̂n(ĥn) ≤ R̂n(h∗) we have
    R(ĥn) − R(h∗) = [R(ĥn) − R̂n(ĥn)] + [R̂n(ĥn) − R̂n(h∗)] + [R̂n(h∗) − R(h∗)] ≤ 2 sup_{h∈H} |R(h) − R̂n(h)|.
Thus if we can obtain a uniform bound on sup_{h∈H} |R(h) − R̂n(h)|, then the estimation error can be bounded.
This again justifies the usage of the ERM method.
Intuitively, for the 0/1 loss and any fixed h ∈ H, nR̂n(h) follows a Binomial distribution, so R̂n(h) is a random
variable with mean R(h). More generally, R̂n(h) is the average of a sequence of i.i.d. random variables, so we should be
able to bound the difference between such an average and its mean. The uniform
bound, however, will depend crucially on how large/complex the hypothesis space H is.
The probably approximately correct (PAC) learning model typically states as follows: we say that ĥn is
ϵ-accurate with probability 1 − δ if
    P( R(ĥn) − inf_{h∈H} R(h) > ϵ ) < δ.
In other words, we have R(ĥn) − inf_{h∈H} R(h) ≤ ϵ with probability at least 1 − δ.
2 Concentration Inequalities
Concentration inequalities will be used to measure how fast the empirical risk converges to the true risk. We
start with some loose but simple ones and then get to more useful results.
Theorem 3-1 (Markov Inequality ). For any nonnegative random variable X and ϵ > 0,
    P(X ≥ ϵ) ≤ E[X]/ϵ.
Proof. We have
E[X] ≥ E[I(X ≥ ϵ)X] ≥ ϵE[I(X ≥ ϵ)] = ϵP (X ≥ ϵ)
and thus P(X ≥ ϵ) ≤ E[X]/ϵ. □
Theorem 3-2 (Chernoff Inequality). For any random variable X, any t > 0 and ϵ > 0,
    P(X ≥ ϵ) ≤ E[exp(tX)]/exp(tϵ),
and thus
    P(X ≥ ϵ) ≤ inf_{t>0} E[exp(tX)]/exp(tϵ).
Proof. For any t > 0, since exp(tx) is a nonnegative monotone increasing function of x, we have
    P(X ≥ ϵ) = P(exp(tX) ≥ exp(tϵ)) ≤ E[exp(tX)]/exp(tϵ). □
Theorem 3-3 (Chebyshev Inequality). For any random variable X and ϵ > 0,
    P(|X − E[X]| > ϵ) ≤ V[X]/ϵ².
Both the Markov and Chebyshev bounds are polynomial in 1/ϵ, and often we need bounds which converge to
zero exponentially fast. In fact, the Chebyshev inequality can be quite poor. Consider the following example.
Let X1, . . . , Xn ∈ {0, 1} be i.i.d. binary random variables with p = P(Xi = 1). Then σ² := V[Xi] =
p(1 − p). Define Sn = Σ_{i=1}^n Xi, so that E[Sn] = np and V[Sn] = np(1 − p) = nσ². From the Chebyshev
inequality, for any ϵ̃ > 0,
    P( |Sn/n − E[Sn]/n| ≥ ϵ̃ ) ≤ σ²/(n ϵ̃²).
Thus the tail probability goes to zero at a rate of n⁻¹. But from the central limit theorem (CLT) we have
    √n ( Sn/n − E[Sn]/n ) →d N(0, σ²).
In other words, we have
    P( √(n/σ²) (Sn/n − p) ≥ y ) → 1 − Φ(y) = (1/√(2π)) ∫_y^∞ exp(−x²/2) dx ≤ exp(−y²/2)/(√(2π) y).
So
    P( Sn/n − E[Sn]/n ≥ ϵ̃ ) = P( √(n/σ²) (Sn/n − p) ≥ √(n/σ²) ϵ̃ ) ≈ exp( −n ϵ̃²/(2p(1 − p)) ),
which decreases exponentially fast as a function of n. So the Chebyshev inequality does poorly in this case
and we need something better.
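This comparison is easy to check numerically. The sketch below (illustrative p, n and ϵ̃) tabulates the Chebyshev bound against the exact binomial tail, together with the two-sided Hoeffding bound derived next.

```python
import numpy as np
from scipy.stats import binom

p, eps = 0.5, 0.1
for n in [50, 100, 200, 400]:
    cheb = p * (1 - p) / (n * eps**2)            # Chebyshev bound on P(|Sn/n - p| >= eps)
    exact = (binom.sf(np.ceil(n * (p + eps)) - 1, n, p)
             + binom.cdf(np.floor(n * (p - eps)), n, p))
    hoeff = 2 * np.exp(-2 * n * eps**2)          # two-sided Hoeffding bound (next section)
    print(n, round(cheb, 4), round(float(exact), 6), round(float(hoeff), 6))
```

As n grows, the exponential bound quickly overtakes the polynomial Chebyshev bound.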
The Hoeffding inequality studies the concentration of a sum of independent random variables and gives an
exponential tail bound. Given independent random variables X1, . . . , Xn, let Sn = Σ_{i=1}^n Xi.
By Chernoff bound we have
    P(Sn − E[Sn] ≥ ϵ) = P( exp(t(Sn − E[Sn])) ≥ exp(tϵ) )
    ≤ exp(−tϵ) E[ exp(t(Sn − E[Sn])) ]
    = exp(−tϵ) E[ exp( t Σ_{i=1}^n (Xi − E[Xi]) ) ]
    = exp(−tϵ) Π_{i=1}^n E[ exp(t(Xi − E[Xi])) ].
The following lemma shows some property of a bounded random variable with mean zero.
Lemma 3-4. If random variable X has mean zero, i.e. E[X] = 0, and is bounded in [a, b], then for any
s > 0,
E[exp(sX)] ≤ exp(s2 (b − a)2 /8).
Proof. By convexity of the exponential function and the fact that a ≤ X ≤ b,
    exp(sX) ≤ (X − a)/(b − a) · exp(sb) + (b − X)/(b − a) · exp(sa).
Taking expectations on both sides and using the fact that E[X] = 0, we have
    E[exp(sX)] ≤ ( b exp(sa) − a exp(sb) )/(b − a)
    = [1 − λ + λ exp(s(b − a))] exp(−λs(b − a)),
where λ = −a/(b − a). Now let u = s(b − a) and define
φ(u) := −λu + log(1 − λ + λ exp(u)),
then the above inequality becomes
E[exp(sX)] ≤ exp(φ(u)).
Now we need to find an upper bound on φ(u). Using Taylor's expansion we have
    φ(u) = φ(0) + uφ′(0) + (u²/2) φ″(ξ)
for some ξ ∈ [0, u]. It is easy to verify that φ(0) = 0 and φ′(0) = 0. And we have
    φ″(u) = λ exp(u)/(1 − λ + λ exp(u)) − (λ exp(u))²/(1 − λ + λ exp(u))²
    = [λ exp(u)/(1 − λ + λ exp(u))] · [1 − λ exp(u)/(1 − λ + λ exp(u))]
    ≤ 1/4.
So we have φ(u) ≤ u²/8, and therefore
    E[exp(sX)] ≤ exp(u²/8) = exp(s²(b − a)²/8). □
Theorem 3-5 (Hoeffding Inequality). Let X1, . . . , Xn be independent random variables with ai ≤ Xi ≤ bi, and let Sn = Σ_{i=1}^n Xi. Then for all ϵ > 0,
    1. P(Sn − E[Sn] ≥ ϵ) ≤ exp( −2ϵ² / Σ_{i=1}^n (bi − ai)² );
    2. P(Sn − E[Sn] ≤ −ϵ) ≤ exp( −2ϵ² / Σ_{i=1}^n (bi − ai)² );
    3. P(|Sn − E[Sn]| ≥ ϵ) ≤ 2 exp( −2ϵ² / Σ_{i=1}^n (bi − ai)² ).
Proof of 1. Continuing the Chernoff bound above and applying Lemma 3-4 to each Xi − E[Xi] ∈ [ai − E[Xi], bi − E[Xi]], we get
    P(Sn − E[Sn] ≥ ϵ) ≤ exp(−tϵ) Π_{i=1}^n exp( t²(bi − ai)²/8 ).
Now choosing t = 4ϵ / Σ_{i=1}^n (bi − ai)², we have
    P(Sn − E[Sn] ≥ ϵ) ≤ exp( −2ϵ² / Σ_{i=1}^n (bi − ai)² ).
If we apply the Hoeffding inequality to the average of a sequence of Bernoulli random variables X1, . . . , Xn, we
have
    P(Sn/n − p ≥ ϵ) ≤ exp(−2nϵ²)
since bi − ai = 1. This matches the exponential rate suggested by the CLT calculation above when p = 1/2. The following is a straightforward
application of the Hoeffding inequality:
Corollary 3-6. Assume that H = {h1, . . . , hm}. Then for all ϵ > 0,
    P( sup_{h∈H} |R̂n(h) − R(h)| ≥ ϵ ) ≤ 2m exp(−2nϵ²)
for any distribution PX,Y, where R(h) = EX,Y [I(Y ≠ h(X))] and R̂n(h) = (1/n) Σ_{i=1}^n I(yi ≠ h(xi)).
Finally we introduce the McDiarmid inequality, which generalizes the Hoeffding inequality to certain
functions of independent random variables. Some restrictions are needed in order to get exponential bounds.
Theorem 3-6 (McDiarmid Inequality / Bounded Differences). Suppose random variables X1, . . . , Xn ∈
X are independent and f is a mapping from X^n to R. If for any i and any x1, . . . , xn, x′i ∈ X, f satisfies the bounded-differences condition
    |f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x′i, . . . , xn)| ≤ ci,
then for all ϵ > 0,
    P( f(X1, . . . , Xn) − E[f(X1, . . . , Xn)] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n ci² ),
    P( f(X1, . . . , Xn) − E[f(X1, . . . , Xn)] ≤ −ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n ci² ).
Proof (sketch). Define the martingale differences Vi = E[f | X1, . . . , Xi] − E[f | X1, . . . , Xi−1], so that f − E[f] = Σ_{i=1}^n Vi, each Vi ranges in an interval of length at most ci, and E[Vi | X1, . . . , Xi−1] = 0. Similar to the proof of the Hoeffding inequality, we have
    P(f − E[f] ≥ ϵ) ≤ inf_{t>0} exp(−tϵ) E[ Π_{i=1}^n exp(tVi) ].
And we have
    E[ Π_{i=1}^n exp(tVi) ] = E[ E[ exp(tVn) Π_{i=1}^{n−1} exp(tVi) | X1, . . . , Xn−1 ] ]
    = E[ Π_{i=1}^{n−1} exp(tVi) · E[exp(tVn) | X1, . . . , Xn−1] ]
    ≤ E[ Π_{i=1}^{n−1} exp(tVi) ] exp(t²cn²/8)
    ≤ · · · ≤ exp( t² Σ_{i=1}^n ci² / 8 ).
Setting t = 4ϵ / Σ_{i=1}^n ci² we obtain the claimed results. □
Example. Consider
    f(X1, . . . , Xn) = sup_{g∈G} | E[g(X)] − (1/n) Σ_{i=1}^n g(Xi) |.
If all g : X → [a, b], then we can take ci = (b − a)/n, and the McDiarmid inequality gives
    P( f(X1, . . . , Xn) − E[f(X1, . . . , Xn)] ≥ ϵ ) ≤ exp( −2nϵ²/(b − a)² ).
As a final note, the bounds we obtained are worst-case since we did not utilize any variance
information. Sharper bounds are available when the variance is known, such as Bernstein's inequality.
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
For a finite training sample Dn, the predictor ĥn can be thought of as the output of the learning algorithm
given the training data and the hypothesis space, i.e. ĥn = A(Dn, H). Its risk R(ĥn) = EX,Y [I(Y ≠ ĥn(X))]
is a random variable which depends on Dn, A and H.
Analysis of the consistency of a learning algorithm focuses on the mean of this random variable, i.e. EDn [R(ĥn)]. In
PAC learning we are interested in its tail distribution, i.e. finding a bound which holds with large probability:
    P( sup_{h∈H} [R(h) − R̂n(h)] ≥ ϵ ) ≤ δ.
The basic idea is to set the probability of being misled to δ and then solve for ϵ.
Example 1 (single classifier). Consider the special case H = {h}, i.e. we only have a single function.
Furthermore, assume that it achieves 0 training error over Dn, i.e. R̂n(h) = 0. Then what is the
probability that its generalization error satisfies R(h) ≥ ϵ? We have
    P( R̂n(h) = 0, R(h) ≥ ϵ ) = (1 − R(h))^n ≤ (1 − ϵ)^n ≤ exp(−nϵ).
Setting the RHS to δ and solving for ϵ gives ϵ = (1/n) log(1/δ). Thus with probability at least 1 − δ,
    R̂n(h) = 0 implies R(h) < (1/n) log(1/δ).
Note that we can also utilize the Hoeffding inequality to obtain P(|R̂n(h) − R(h)| ≥ ϵ) ≤ 2 exp(−2nϵ²),
which leads to
    P( |R̂n(h) − R(h)| ≥ √( log(2/δ)/(2n) ) ) ≤ δ.
This is more general but not as tight as the previous bound, since it does not utilize the fact that R̂n(h) = 0. □
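The gap between the two bounds translates directly into required sample sizes. A small sketch (the values of ϵ and δ are illustrative):

```python
import math

eps, delta = 0.05, 0.01
# realizable bound: R(h) < (1/n) log(1/delta)  =>  n >= (1/eps) log(1/delta)
n_realizable = math.ceil(math.log(1 / delta) / eps)
# Hoeffding bound: |R_hat(h) - R(h)| < sqrt(log(2/delta) / (2n))  =>  n >= log(2/delta) / (2 eps^2)
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps**2))
print(n_realizable, n_hoeffding)   # 93 vs 1060 samples for the same (eps, delta)
```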
Although the result in Example 1 is very simple, it has very limited practical meaning. The main reason is
that it only applies to a single fixed function h. Essentially, it says that for each fixed function h, there is
a set S of samples (whose measure satisfies P(S) ≥ 1 − δ) for which |R̂n(h) − R(h)| is bounded. However, such
sets S could be different for different functions. To handle this issue we need uniform deviations, since
    R(ĥn) − R̂n(ĥn) ≤ sup_{h∈H} (R(h) − R̂n(h)).
The idea is to utilize the union bound as shown in the following example.
Example 2 (finite number of classifiers). Consider the case H = {h1, . . . , hm}. Define
    Bk := { ((x1, y1), . . . , (xn, yn)) : R(hk) − R̂n(hk) ≥ ϵ }, k = 1, . . . , m.
Each Bk is the set of all bad samples for hk, i.e. the samples for which the bound fails for hk. In other
words, it contains all misleading samples. If we want to measure the probability of the samples which are
bad for some hk (k = 1, . . . , m), we can apply the union (Bonferroni) inequality to obtain:
    P(B1 ∪ . . . ∪ Bm) ≤ Σ_{k=1}^m P(Bk).
Thus we have
    P( ∃h ∈ H : R(h) − R̂n(h) ≥ ϵ ) = P( ∪_{k=1}^m { R(hk) − R̂n(hk) ≥ ϵ } )
    ≤ Σ_{k=1}^m P( R(hk) − R̂n(hk) ≥ ϵ )    (union bound)
    ≤ m exp(−2nϵ²).
Several remarks on the looseness of this argument:
• The Hoeffding inequality does not utilize variance information, so the results could be improved by
utilizing such information.
• The union bound could be quite loose; it is as bad as if all the functions in H were
independent.
• The supremum over H might be too conservative.
The bound in Example 2 becomes meaningless when m is infinite. The following example generalizes it to
the case of countably many classifiers.
Example 3 (countable number of classifiers). Consider the case H = {h1, h2, . . .}. Since we
need to bound the probability of the set of misleading samples (which could mislead any h ∈ H) by δ, we
budget the probability of being misled by hk to wk δ, where Σ_{k=1}^∞ wk ≤ 1. For each hk, the Hoeffding
inequality shows that P( R(hk) − R̂n(hk) ≥ ϵk ) ≤ wk δ holds with ϵk = √( log(1/(wk δ))/(2n) ). This choice
satisfies
    P( ∃hk ∈ H : R(hk) − R̂n(hk) ≥ ϵk ) ≤ δ,
since
    P( ∃hk ∈ H : R(hk) − R̂n(hk) ≥ ϵk ) = P( ∪_{k=1}^∞ { R(hk) − R̂n(hk) ≥ ϵk } )
    ≤ Σ_{k=1}^∞ P( R(hk) − R̂n(hk) ≥ ϵk )    (union bound)
    ≤ Σ_k wk δ
    ≤ δ.
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
We have considered the case when H is finite or countably infinite. In practice, however, the function class
H could be uncountable. Under this situation, the previous method does not work. The key idea is to group
functions based on the sample.
Given a sample Dn = {(x1, y1), . . . , (xn, yn)}, define S = {x1, . . . , xn} and consider the set
    HS = Hx1,...,xn = { (h(x1), . . . , h(xn)) : h ∈ H }.
The size of this set is the total number of possible ways that S = {x1, . . . , xn} can be classified. For binary
classification the cardinality of this set is always finite, no matter how large H is.
Definition (Growth Function). The growth function is the maximum number of ways into which n points
can be classified by the function class:
    GH(n) = sup_{x1,...,xn} |HS|.
The growth function can be thought of as a measure of the "size" of the class of functions H. For binary
classification we always have GH(n) ≤ 2^n, and we say that H shatters a set {x1, . . . , xn} if every labeling of it is
realizable, i.e. |HS| = 2^n, in which case
    GH(n) = 2^n.
Definition (VC Dimension). The VC dimension dVC(H) is the largest n such that GH(n) = 2^n.
In other words, the VC dimension of a function class H is the cardinality of the largest set that it can shatter.
Example. Consider all functions of the form H = {h(x) = I(x ≤ θ), θ ∈ R}. This class can shatter any single
point, but it cannot shatter any two points x1 < x2 (the labeling h(x1) = 0, h(x2) = 1 is not realizable), so dVC(H) = 1. □
Example. Consider all linear classifiers in a 2-d space, i.e. X = R². In this case, linear classifiers can
shatter some set of 3 points, but no set of four points can be shattered by linear classifiers. So the VC dimension
in this case is 3. □
Example. Consider all linear classifiers in a p-dimensional Euclidean space, i.e. X = Rp. Given x1, . . . , xn ∈
Rp, we define the augmented data vectors
    zi = [1, xi]T ∈ Rp+1, i = 1, . . . , n.
Then the set of all linear classifiers can be written as
    H = { h : h(z) = sign(θT z), θ ∈ Rp+1 }.
Define
Z = [z1 , z2 , . . . , zn ] ∈ R(p+1)×n
and we argue that x1 , . . . , xn is shattered by H if and only if the n columns of Z are linearly independent.
• If columns z1 , . . . , zn are linearly independent, we have n ≤ p + 1 and for any possible classification
assignment y ∈ {±1}n the linear system ZT θ = y must have a solution. Thus, there is a linear classifier
in H (by taking the solution of the linear equation) which can produce such arbitrary class assignment
y.
• Suppose the columns z1, . . . , zn are not linearly independent. For H to shatter the set, there must exist, for
every sign pattern in {±1}^n, a θ ∈ Rp+1 with (sign(z1T θ), . . . , sign(znT θ)) equal to that pattern; in other words,
the vector ZT θ must be able to lie in any of the 2^n orthants of R^n. However, linear dependence gives a nonzero
c with Zc = 0, hence cT (ZT θ) = 0 for every θ, so ZT θ can never realize the sign pattern of c. This is a contradiction.
Since for n > p + 1 the columns of Z cannot be linearly independent, while for n ≤ p + 1 we can
always choose x1, . . . , xn to make them so, we have dVC(H) = p + 1. □
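The "linearly independent implies shattered" direction can be verified numerically: solve ZT θ = y for every labeling y and check the resulting signs. A sketch with three illustrative points in R²:

```python
import numpy as np
from itertools import product

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # n = 3 points in R^2
Z = np.hstack([np.ones((3, 1)), x]).T                 # columns z_i = [1, x_i]^T, shape (p+1) x n

# the columns of Z are linearly independent, so every labeling is realizable
for y in product([-1.0, 1.0], repeat=3):
    theta = np.linalg.lstsq(Z.T, np.array(y), rcond=None)[0]  # solve Z^T theta = y
    assert np.all(np.sign(Z.T @ theta) == np.array(y))        # the 3 points are shattered
print("all 8 labelings realized")
```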
A somewhat surprising result shows that the growth function GH either grows exponentially in n or only
increases polynomially in n, depending on whether n is smaller or larger than the VC dimension dVC(H).
Theorem 5-1 (Sauer). Let H be a class of functions with binary outputs and VC dimension d =
dVC(H). Then for all n ∈ N,
    GH(n) ≤ Σ_{i=0}^d C(n, i),
where C(n, i) denotes the binomial coefficient. Furthermore, for all n ≥ d, we have
    GH(n) ≤ (en/d)^d.
Proof. For any S = {x1, . . . , xn}, consider a table T containing the values of the functions in HS (i.e. we only
consider the distinct tuples obtained by projecting onto the sample S), one row for each such unique tuple; for
example, take S = {x1, x2, . . . , x5}. Each row is one possible tuple for some h ∈ H evaluated on the sample S.
Obviously the number of rows in T is the same as the cardinality |HS|, so we can bound the growth function of H
by the maximum number of rows in the table T. Next we transform the table T by processing each column
sequentially. For example, to process the first column, for each row we replace a "+" with a "−" unless doing so
produces a duplicated row in the table. Table 2 shows the table after processing the first column (left table) and
the final table after processing all 5 columns (right table).
h∗(x1) h(x2) h(x3) h(x4) h(x5)   |   h∗(x1) h∗(x2) h∗(x3) h∗(x4) h∗(x5)
  −      +     −     +     +     |     −      +      −      −      −
  −      −     −     +     +     |     −      −      −      +      +
  −      +     +     −     +     |     −      −      −      −      +
  −      +     +     −     −     |     −      −      −      −      −
  −      −     −     +     −     |     −      −      −      +      −
Table 2: transformed tables (left: after processing the first column; right: after processing all 5 columns)
1. The size of the table is not changed by such transformations, and the rows in the final table T∗ are
still unique. Thus we can use an upper bound on the number of rows in T∗ to bound the growth function
GH(n).
2. The final table T∗ possesses the property that replacing any "+" with a "−" results in a duplication.
So the set of "+" positions in each row must be a subset of S that is shattered by the table T∗
(in fact, by the set of functions H∗ corresponding to the table T∗).
3. If a subset A ⊂ S can be shattered by the later table Tk+1, then it must also be shattered by the
previous table Tk. To see this, notice that if A does not contain the transformed column xk, then the
claim holds trivially, as all columns in A remain the same in Tk and Tk+1. If A contains the transformed
column xk, then for each of the 2^{|A|−1} sign combinations of the elements in A \ {xk}, there must be two rows in
Tk+1 taking the values "+" and "−" in the column xk. Those two rows must also exist in the previous table Tk:
the "+" row is obviously there, and the "−" row must be there as well, since otherwise the "+" would not
have survived the processing procedure into Tk+1.
Since dVC(T∗) ≤ dVC(T) = dVC(H) = d by observation 3, each row in T∗ has at most d "+" entries by
observation 2. Thus an upper bound on the total number of rows in T∗ is Σ_{i=0}^d C(n, i), which is also an
upper bound on the growth function GH(n) by observation 1.
The second statement comes from the fact that for n ≥ d,
    Σ_{i=0}^d C(n, i) ≤ (n/d)^d Σ_{i=0}^d C(n, i) (d/n)^i
    ≤ (n/d)^d Σ_{i=0}^n C(n, i) (d/n)^i
    = (n/d)^d (1 + d/n)^n
    ≤ (en/d)^d. □
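Numerically, the two bounds in Sauer's theorem are easy to tabulate against 2^n. A small sketch with the illustrative choice d = 3:

```python
from math import comb, e

d = 3
for n in [3, 5, 10, 20, 50]:
    sauer = sum(comb(n, i) for i in range(d + 1))      # sum_{i<=d} C(n, i)
    print(n, sauer, round((e * n / d) ** d, 1), 2**n)  # polynomial in n, versus 2^n
```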
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
We introduce the generalization error bound which utilizes the growth function of H or the VC dimension of H
instead of the naive cardinality |H|.
Theorem 6-1 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 − δ,
    ∀h ∈ H, R(h) ≤ R̂n(h) + 2 √( (2 log GH(2n) + 2 log(2/δ)) / n ).
The proof of Theorem 6-1 utilizes a technique called symmetrization. For notational simplicity, we will use
    Pf = E[f(X, Y)],    Pn f = (1/n) Σ_{i=1}^n f(xi, yi).
Here f(X, Y) can be thought of as ℓ(Y, h(X)). Also we define Zi = (Xi, Yi). The key idea is to upper bound
the true risk by an estimate from an independent sample, which is often known as the "ghost" sample. We
use Z′1, . . . , Z′n to denote the ghost sample and
    P′n f = (1/n) Σ_{i=1}^n f(x′i, y′i).
Then we could project the functions in H onto this double sample and apply the union bound with the help
of the growth function GH (.) of H.
Lemma (Symmetrization). For any t > 0 such that nt² ≥ 2, we have
    P( sup_{f∈F} |Pf − Pn f| ≥ t ) ≤ 2 P( sup_{f∈F} |P′n f − Pn f| ≥ t/2 ).
Proof. Let fn be the function achieving the supremum. By the triangle inequality we have
    I(|Pfn − Pn fn| > t) · P_{D′n}(|Pfn − P′n fn| < t/2) ≤ P_{D′n}(|P′n fn − Pn fn| > t/2).
By Chebyshev's inequality, P_{D′n}(|Pfn − P′n fn| < t/2) ≥ 1/2 when nt² ≥ 2; taking expectation over Dn then gives the claim. □
Proof of Theorem 6-1:
Let F = {f : f(x, y) = ℓ(y, h(x)), h ∈ H}. First note that GH(n) = GF(n). Then
    P( sup_{h∈H} (R(h) − R̂n(h)) ≥ ϵ ) = P( sup_{f∈F} (Pf − Pn f) ≥ ϵ )
    ≤ 2 P( sup_{f∈F} (P′n f − Pn f) ≥ ϵ/2 )
    = 2 P( sup_{f∈F_{Dn,D′n}} (P′n f − Pn f) ≥ ϵ/2 ),
where F_{Dn,D′n} is the projection of F onto the double sample, which contains at most GF(2n) = GH(2n)
distinct elements. Applying the union bound over these elements and then a Hoeffding-type bound to each one
yields a bound of the form 4GH(2n) exp(−nϵ²/32); setting the right-hand side to δ and solving for ϵ gives the
theorem. □
Note that in order for the result in Theorem 6-1 to be meaningful, we require dVC(H) to be finite. A
class of functions whose VC dimension is finite is called a VC class. We can also utilize this result to obtain
a bound on the expected risk E[R(ĥn)], where ĥn is the empirical risk minimizer. Since
    R(ĥn) − inf_{h∈H} R(h) ≤ 2 sup_{h∈H} |R(h) − R̂n(h)|,
we have
    P( R(ĥn) − inf_{h∈H} R(h) ≥ ϵ ) ≤ P( sup_{h∈H} |R(h) − R̂n(h)| ≥ ϵ/2 ).
Define the nonnegative random variable Z = R(ĥn) − inf_{h∈H} R(h); then P(Z ≥ ϵ) ≤ 4GH(2n) exp(−nϵ²/32).
Thus, for any u > 0,
    E[Z²] = ∫_0^∞ P(Z² ≥ t) dt
    = ∫_0^u P(Z² ≥ t) dt + ∫_u^∞ P(Z² ≥ t) dt
    ≤ u + ∫_u^∞ 4GH(2n) exp(−nt/32) dt
    = u + (128 GH(2n)/n) exp(−nu/32).
Minimizing the RHS with respect to u gives u = (32/n) log(4GH(2n)). Plugging in, E[Z²] ≤ 32(log(4GH(2n)) + 1)/n.
By the Cauchy-Schwarz inequality we have
    E[R(ĥn)] − inf_{h∈H} R(h) = E[Z] ≤ √(E[Z²]) ≤ O( √( log GH(2n)/n ) ).
So if the growth function is only polynomially increasing as a function of n, then obviously we have E[R(ĥn )]−
inf h∈H R(h) → 0, i.e. the expected risk will converge to the minimum risk within the function class H.
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
Recall from the previous lecture that R(ĥn) − inf_{h∈H} R(h) ≤ 2 sup_{h∈H} |R(h) − R̂n(h)|; thus, in order to
bound the quantity R(ĥn) − inf_{h∈H} R(h), it suffices to bound the quantity
    sup_{h∈H} |R(h) − R̂n(h)|.
Thus the uniform bound plays an important role in statistical learning theory. The Glivenko-Cantelli class
is defined such that the above property holds as n → ∞.
Definition. H is a Glivenko-Cantelli class with respect to a probability measure P if
    P( lim_{n→∞} sup_{f∈H} |Pf − Pn f| = 0 ) = 1,
i.e. sup_{f∈H} |Pf − Pn f| converges to zero almost surely (with probability 1). H is said to be a uniformly GC
class if the convergence is uniform over all probability measures P.
Note that Vapnik and Chervonenkis have shown that a function class is a uniformly GC class if and only if
it is a VC class.
Given a sequence of i.i.d. real-valued random variables Z1, . . . , Zn and any z ∈ R, we know that the quantity
I(Zi ≤ z) is a Bernoulli random variable with mean P(Z ≤ z) = F(z), where F(·) is the CDF. Furthermore,
by the strong law of large numbers, we know that
    (1/n) Σ_{i=1}^n I(Zi ≤ z) → F(z)
almost surely. The following theorem is one of the most fundamental theorems in mathematical statistics,
which generalizes the strong law of large numbers: the empirical distribution function uniformly almost
surely converges to the true distribution function.
Theorem (Glivenko-Cantelli). Let Z1, . . . , Zn be i.i.d. real-valued random variables with distribution
function F(z) = P(Zi ≤ z). Denote the standard empirical distribution function by
    Fn(z) = (1/n) Σ_{i=1}^n I(Zi ≤ z).
Then
    P( sup_{z∈R} |F(z) − Fn(z)| > ϵ ) ≤ 8(n + 1) exp(−nϵ²/32),
and in particular, by the Borel-Cantelli lemma, we have sup_{z∈R} |F(z) − Fn(z)| → 0 almost surely.
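A quick simulation of this uniform convergence (a sketch; the standard normal distribution and the sample sizes are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for n in [100, 1000, 10000]:
    z = np.sort(rng.normal(size=n))
    Fn = np.arange(1, n + 1) / n   # empirical CDF evaluated at the order statistics
    # sup_z |Fn(z) - F(z)| is attained just before or at a jump of Fn
    sup_dev = np.max(np.maximum(np.abs(Fn - norm.cdf(z)),
                                np.abs(Fn - 1 / n - norm.cdf(z))))
    print(n, sup_dev)              # decreases toward 0 as n grows
```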
Proof.
We use the notation ν(A) := P(Z ∈ A) and νn(A) := (1/n) Σ_{i=1}^n I(Zi ∈ A) for any measurable set A ⊂ R. If we
let A denote the class of sets of the form (−∞, z] for all z ∈ R, then we have
    sup_{z∈R} |F(z) − Fn(z)| = sup_{A∈A} |ν(A) − νn(A)|.
We assume nϵ² > 2, since otherwise the result holds trivially. The proof consists of several key steps.
(1) Symmetrization by a ghost sample: Introduce a ghost sample Z′1, . . . , Z′n, i.i.d. together
with the original sample, and denote by ν′n the empirical measure with respect to the ghost sample. Then
for nϵ² > 2 we have (by the symmetrization lemma)
    P( sup_{A∈A} |νn(A) − ν(A)| > ϵ ) ≤ 2 P( sup_{A∈A} |νn(A) − ν′n(A)| > ϵ/2 ).
(2)-(3) Projection and randomization: restricted to the sample, the class A produces only finitely many
distinct sets, and introducing i.i.d. Rademacher (random sign) variables σ1, . . . , σn reduces the problem to
bounding P( sup_{A∈A} |(1/n) Σ_i σi I(Zi ∈ A)| > ϵ/4 ). The next step is to find an exponential bound for this
quantity.
(4) Hoeffding's inequality: With z1, . . . , zn fixed, Σ_{i=1}^n σi I(zi ∈ A) is a sum of n independent zero-mean
random variables taking values in [−1, 1], and A restricted to the sample yields at most n + 1 distinct sets.
Thus, by the union bound and the Hoeffding inequality we have
    P( sup_{A∈A} |(1/n) Σ_{i=1}^n σi I(Zi ∈ A)| > ϵ/4 | Z1, . . . , Zn )
    ≤ (n + 1) sup_{A∈A} P( |(1/n) Σ_{i=1}^n σi I(Zi ∈ A)| > ϵ/4 | Z1, . . . , Zn )
    ≤ 2(n + 1) exp(−nϵ²/32).
Taking expectations on both sides we obtain the claimed result
    P( sup_{A∈A} |νn(A) − ν(A)| > ϵ ) ≤ 8(n + 1) exp(−nϵ²/32). □
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
Recall that in the proof of the Glivenko-Cantelli theorem we used the Rademacher random variables σ1 , . . . , σn
which are iid uniform {±1} random variables.
Definition. Let µ be a probability measure on X and assume that X1, . . . , Xn are independent random
variables distributed according to µ. Let F be a class of functions mapping from X to R. Define the random variable
    R̂n(F) := E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) | X1, . . . , Xn ],
where σ1, . . . , σn are independent uniform {±1}-valued random variables. R̂n(F) is called the empirical
Rademacher average of F. Note that it depends on the sample and can actually be computed. Essentially
it measures the correlation, in the supremum sense, between random noise (labeling) and the functions in the
class F. The Rademacher average of F is
    Rn(F) = E[R̂n(F)].
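Since R̂n(F) depends only on the sample, it can be estimated by Monte Carlo over draws of σ. A sketch for a finite class represented by its values on the sample (the function names and the example class are illustrative):

```python
import numpy as np

def empirical_rademacher(values, n_draws=10000, seed=0):
    """values: (m, n) array; row j holds (f_j(x_1), ..., f_j(x_n)) for the m functions in F.
    Returns a Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) | sample ]."""
    m, n = values.shape
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))    # iid uniform {-1, +1}
    return np.mean(np.max(sigma @ values.T / n, axis=1))  # sup over functions, mean over draws

# two constant binary functions evaluated on a sample of size 50 (illustrative)
vals = np.vstack([np.zeros(50), np.ones(50)])
print(empirical_rademacher(vals))   # roughly E[max(0, mean(sigma))], about 0.056 here
```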
For a function class F and a sample S = {x1, . . . , xn}, we would like to bound the random quantity
    φ(S) := sup_{f∈F} (Pf − Pn f) = sup_{f∈F} ( E[f(X)] − (1/n) Σ_{i=1}^n f(xi) ).
First, we bound the difference between this random variable and its mean using the McDiarmid inequality.
Consider another sample S′ which differs from S in exactly one example. If the functions in F take values in
an interval of length c, then
    |φ(S) − φ(S′)| = | sup_{f∈F} (Pf − Pn f) − sup_{f∈F} (Pf − P′n f) | ≤ c/n.
Next, we relate E[φ(S)] to the Rademacher average. From now on, we define S′ = {X′1, . . . , X′n} to
be a ghost sample of S (not the same S′ as before). Note that
    ES[ sup_{f∈F} (Pf − Pn f) ] = ES[ sup_{f∈F} ( E[f(X)] − (1/n) Σ_{i=1}^n f(Xi) ) ]
    = ES[ sup_{f∈F} E[ (1/n) Σ_{i=1}^n f(X′i) − (1/n) Σ_{i=1}^n f(Xi) | X1, . . . , Xn ] ]
    ≤ E_{S,S′}[ sup_{f∈F} (1/n) Σ_{i=1}^n ( f(X′i) − f(Xi) ) ]    (Jensen)
    = E_{S,S′,σ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σi ( f(X′i) − f(Xi) ) ]
    ≤ E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(X′i) ] + E[ sup_{f∈F} ( −(1/n) Σ_{i=1}^n σi f(Xi) ) ]
    = 2Rn(F).
So we have shown that ES[φ(S)] ≤ 2Rn(F). Combining this with the first step (taking c = 1 for binary-valued
functions), we have shown the first part of the following theorem.
Theorem 8-1. Let F be a set of binary-valued {0, 1} functions. For all δ > 0, with probability at least 1 − δ,
    ∀f ∈ F, Pf ≤ Pn f + 2Rn(F) + √( log(1/δ)/(2n) ),
and also with probability at least 1 − δ,
    ∀f ∈ F, Pf ≤ Pn f + 2R̂n(F) + C √( log(2/δ)/n ),
where C = √2 + 1/√2.
Proof.
The first part has been proven above. For the second part, we apply the McDiarmid inequality again, this
time to the empirical Rademacher average
    R̂n(F) = E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) | X1, . . . , Xn ].
Note that R̂n(F) is a function of X1, . . . , Xn and satisfies the condition of the McDiarmid inequality with
bounded differences at most 1/n. So we have
    P( 2Rn(F) − 2R̂n(F) > ϵ ) ≤ exp(−nϵ²/2),
and combining this deviation bound with the first part gives the claim. □
Assume that H is the hypothesis space and F = ℓ ◦ H = {f : f(x, y) = ℓ(y, h(x)), h ∈ H} is the class
induced from H. Then the Rademacher averages of H and F are closely related. In fact, if we assume Y = {±1},
then we have
    Rn(F) = E[ sup_{h∈H} (1/n) Σ_{i=1}^n σi I(Yi ≠ h(Xi)) ]
    = E[ sup_{h∈H} (1/n) Σ_{i=1}^n σi (1 − Yi h(Xi))/2 ]
    = (1/2) E[ sup_{h∈H} (1/n) Σ_{i=1}^n (−σi Yi) h(Xi) ]
    = (1/2) Rn(H),
where the last equality uses the fact that −σi Yi has the same distribution as σi.
The supremum above can be interpreted through R̂n(h, σ), the empirical risk of classifier h with respect to the
random labels σ = [σ1, . . . , σn]: the Rademacher average is large when some classifier fits random labels well.
When H is so large that it can fit every random labeling perfectly, the Rademacher average attains its maximum
(Rn(F) = 1/2) and the bound becomes meaningless.
The Rademacher average is related to the growth function and the VC dimension: one can bound the Rademacher
average by either of them. We can also estimate Rademacher averages for function classes
which are built from simpler classes. The following is a list of properties of Rademacher averages.
1. Scaling: for cF := {cf : f ∈ F}, we have Rn(cF) = |c| Rn(F), where the σi's are i.i.d. Rademacher random
variables. Since |c|σi has the same distribution as cσi, we have
    Rn(cF) = |c| E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) ] = |c| Rn(F).
2. Translation: for a fixed function g, Rn(F + g) = Rn(F). To see this, using E[σi] = 0, we have
    Rn(F + g) = E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi ( f(Xi) + g(Xi) ) ]
    = E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) ] + E[ (1/n) Σ_{i=1}^n σi g(Xi) ]
    = Rn(F).
3. Ledoux-Talagrand contraction inequality: if each φi is Lipschitz, i.e. it satisfies |φi(a) − φi(b)| ≤ L|a − b|,
then
    E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi φi(f(Xi)) ] ≤ L E[ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Xi) ] = L Rn(F).
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
Recall that if S = {x1, . . . , xn} ⊂ X^n and H is a set of binary-valued functions, then HS denotes the
restriction of H to S, and the growth function (shatter coefficient) is defined by GH(n) = sup_{x1,...,xn} |HS|.
Since H maps X into {0, 1}, HS is finite for any finite S (|HS| ≤ 2^n).
Such a definition works well for binary-valued functions, but if H is a set of real-valued functions then HS
will be a set with infinite cardinality even for finite n. Essentially we do not need to divide H into unique
elements based on HS , but only need to partition the function class H into small groups which are “local”
in nature.
Definition. Let (W, d) be a metric space and F ⊂ W. For every ϵ > 0, denote by N (ϵ, F , d) the minimal
number of open balls (with respect to metric d) needed to cover F . That is, N (ϵ, F , d) is the minimal
cardinality of the set {f1 , . . . , fm } ⊂ W with the property that for every f ∈ F there is some fi such that
d(f, fi ) < ϵ. The set {f1 , . . . , fm } is called an ϵ-cover of F . The logarithm of the covering number is called
the entropy of the set.
We will be interested in metrics induced by samples. For every sample {x1, . . . , xn}, let µn be the empirical
measure of the sample. For 1 ≤ p < ∞ and a function f, put
    ∥f∥_{Lp(µn)} = ( (1/n) Σ_{i=1}^n |f(xi)|^p )^{1/p},
and in particular ∥f∥_{L∞(µn)} = max_{1≤i≤n} |f(xi)|. Let N(ϵ, F, Lp(µn)) be the covering number of F
at scale ϵ with respect to the norm Lp(µn).
Theorem 9-1. For any class F of real-valued functions, any sample S = {x1, . . . , xn}, any 1 ≤ p ≤ q ≤ ∞ and ϵ > 0,
    N(ϵ, F, Lp(µn)) ≤ N(ϵ, F, Lq(µn)).
Define the uniform covering number Np(ϵ, F, n) := sup_{x1,...,xn} N(ϵ, F, Lp(µn)).
It is easy to see that the uniform covering number is a generalization of the growth function. Suppose that
F contains functions which map X into {0, 1}. Then for any S = {x1, . . . , xn}, ϵ < 1 and p = ∞, we have
N(ϵ, F, L∞(µn)) = |FS|, so N∞(ϵ, F, n) = GF(n).
Based on the covering number we are able to obtain uniform convergence result for real-valued function class.
Theorem 9-2. Let F be a class of functions which map X into [−1, 1] and let µ be a probability measure on
X. Assume X1, . . . , Xn are independent random variables distributed according to µ. For every ϵ > 0 and
n ≥ 8/ϵ²,
    P( sup_{f∈F} | (1/n) Σ_{i=1}^n f(Xi) − E[f(X)] | > ϵ ) ≤ 8 E[ N(ϵ/8, F, L1(µn)) ] exp( −nϵ²/128 ),
where µn is the empirical measure on X1, . . . , Xn.
Proof. By symmetrization with a ghost sample and Rademacher random variables, it suffices to bound the
probability of the event A = { sup_{f∈F} | Σ_{i=1}^n σi f(Xi) | > nϵ/4 }.
For any realization of X1, . . . , Xn, its empirical measure is µn. Let G be an ϵ/8-cover of F with respect to
the L1(µn) norm, and we can assume that every function g ∈ G is bounded by 1. First, observe that
    P( sup_{f∈F} | Σ_{i=1}^n σi f(Xi) | > nϵ/4 ) ≤ P( sup_{g∈G} | Σ_{i=1}^n σi g(Xi) | > nϵ/8 ).
This is because if there is some element f∗ ∈ F which makes the LHS event true, then we can find some
g∗ ∈ G such that
    (1/n) Σ_{i=1}^n | σi f∗(Xi) − σi g∗(Xi) | = (1/n) Σ_{i=1}^n | f∗(Xi) − g∗(Xi) | ≤ ϵ/8,
and hence | Σ_{i=1}^n σi g∗(Xi) | > nϵ/4 − nϵ/8 = nϵ/8, so sup_{g∈G} | Σ_{i=1}^n σi g(Xi) | > nϵ/8.
Applying the union bound and the Hoeffding inequality, and utilizing the fact that Σ_{i=1}^n g(xi)² ≤ n for all
g ∈ G, we have
    P( sup_{g∈G} | Σ_{i=1}^n σi g(Xi) | > nϵ/8 ) ≤ |G| · sup_{g∈G} P( | Σ_{i=1}^n σi g(Xi) | > nϵ/8 )
    ≤ 2 N(ϵ/8, F, L1(µn)) exp( −nϵ²/128 ).
The claim follows by combining this result with the symmetrization step (ghost sample and Rademacher
random variables). □
Lemma. For A ⊂ R^n with r = max_{a∈A} ∥a∥, and σ1, . . . , σn being Rademacher random variables, we have
    E[ sup_{a∈A} Σ_{i=1}^n σi ai ] ≤ r √(2 log |A|).
Proof.
For any s > 0 we have
    exp( s E[ sup_{a∈A} Σ_i σi ai ] ) ≤ E[ exp( s sup_{a∈A} Σ_i σi ai ) ]    (Jensen)
    = E[ sup_{a∈A} exp( s Σ_i σi ai ) ]
    ≤ Σ_{a∈A} E[ exp( s Σ_i σi ai ) ]
    ≤ Σ_{a∈A} exp( (s²/2) Σ_i ai² )
    ≤ |A| exp( s² r²/2 ).
So we have
    E[ sup_{a∈A} Σ_i σi ai ] ≤ inf_{s>0} ( log|A|/s + s r²/2 ) = r √(2 log |A|). □
By the lemma, it is immediate that
    R̂n(F) ≤ √( 2 log |F| / n )
if F is finite with output values in [−1, 1].
Theorem 9-3. For F ⊂ [−1, 1]^X, we have
    R̂n(F) ≤ inf_{ϵ>0} ( √( 2 log N(ϵ, F, L2(µn)) / n ) + ϵ ).
Proof.
For an ϵ > 0, let G be an ϵ-cover of F with respect to L2(µn). Then we have
    R̂n(F) = Eσ[ sup_{f∈F} (1/n) Σ_i σi f(xi) ]
    = Eσ[ sup_{g∈G} sup_{f∈F∩Bϵ(g)} ( (1/n) Σ_i σi g(xi) + (1/n) Σ_i σi ( f(xi) − g(xi) ) ) ]
    ≤ Eσ[ sup_{g∈G} (1/n) Σ_i σi g(xi) ] + ϵ
    ≤ √( 2 log N(ϵ, F, L2(µn)) / n ) + ϵ,
where the first equality utilizes the fact that F = ∪_{g∈G} (F ∩ Bϵ(g)), and the first inequality comes from the
fact that ∥f − g∥_{L2(µn)} ≤ ϵ together with the Cauchy-Schwarz inequality. □
Definition. Let (W, d) be a metric space and F ⊂ W. For ϵ > 0, a subset A is said to be an ϵ-packing of
F , if for all distinct f1 , f2 ∈ A, we have d(f1 , f2 ) > ϵ. The ϵ-packing number P (ϵ, F , d) is defined as the
maximum cardinality of an ϵ-packing subset.
Both the covering number and packing number can be used to measure the size of the sets, and they are
obviously related. The following simple result shows that as long as one of them can be computed, we can
easily obtain a bound for the other one.
Theorem 9-4. Given a metric space (W, d). Then for all ϵ > 0 and for every F ⊂ W, the covering number
and packing number satisfy
P (2ϵ, F , d) ≤ N (ϵ, F , d) ≤ P (ϵ, F , d).
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
We start with the perceptron algorithm which is probably one of the simplest linear classifiers. We then
introduce the margin maximization idea and derive the linear SVM classifier.
Assume that Dn = {(x1, y1), . . . , (xn, yn)} where X = Rp and Y = {±1}. For simplicity we only consider
linear classifiers without intercept here, i.e. H = {x ↦ wT x : w ∈ Rp}. Furthermore, we assume that the
data are linearly separable, i.e. there exists some w∗ which correctly classifies all examples: sign(w∗T xi) = yi
for i = 1, . . . , n. The perceptron algorithm works as follows.
1. Start with w0 = 0 and t = 0;
2. While wt has training error > 0: pick any example (xi, yi) with sign(wtT xi) ≠ yi, update wt+1 = wt + yi xi,
and set t ← t + 1.
Theorem 10-1 (Novikov). Define r = max_i ∥xi∥ and δ = min_i (yi w∗T xi)/∥w∗∥, where w∗ is some classifier
which linearly separates Dn. Then the algorithm terminates after T ≤ r²/δ² updates.
Proof.
First, note that δ has the meaning of the "margin": the minimum distance of an example to the decision
hyperplane. So the larger the margin, the smaller the number of steps needed to converge. The basic idea of
the proof is to show that wt gets closer and closer to w∗. Since ∥wt − w∗∥² = ∥wt∥² + ∥w∗∥² − 2wtT w∗,
essentially we need to upper bound ∥wt∥² and lower bound wtT w∗, and then combine the results.
First we have w0T w∗ = 0 and, at each update on a misclassified example (xi, yi),
    wt+1T w∗ = wtT w∗ + yi xiT w∗ ≥ wtT w∗ + δ∥w∗∥,
so after T updates wTT w∗ ≥ Tδ∥w∗∥. Similarly, since the updated example was misclassified (yi xiT wt ≤ 0),
    ∥wt+1∥² = ∥wt + yi xi∥² = ∥wt∥² + ∥xi∥² + 2yi xiT wt ≤ ∥wt∥² + r²,
so ∥wT∥² ≤ T r². Combining, Tδ∥w∗∥ ≤ wTT w∗ ≤ ∥wT∥ ∥w∗∥ ≤ √T r ∥w∗∥, which gives T ≤ r²/δ². □
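A direct implementation of the algorithm takes only a few lines. The sketch below (the toy data are illustrative) also reports the number of updates, which Novikov's theorem bounds by r²/δ².

```python
import numpy as np

def perceptron(X, y, max_updates=100000):
    """Perceptron without intercept: while some example is misclassified, w <- w + y_i x_i."""
    w = np.zeros(X.shape[1])
    for t in range(max_updates):
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]   # indices with sign(w^T x_i) != y_i
        if len(mistakes) == 0:
            return w, t                              # t updates were performed in total
        i = mistakes[0]
        w = w + y[i] * X[i]                          # the perceptron update
    raise RuntimeError("no separating w found within the update budget")

# linearly separable toy data: two well-separated clusters (illustrative)
rng = np.random.default_rng(2)
Xp = rng.normal(size=(100, 2)) + 3.0
X = np.vstack([Xp, -Xp])
y = np.hstack([np.ones(100), -np.ones(100)])
w, updates = perceptron(X, y)
print(updates)
```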
Maximum Margin Classifier: Support Vector Machines (SVM)
We consider the set of linear classifiers H = {h(x) = wT x + b, w ∈ Rp , b ∈ R}. Suppose the training
examples are linearly separable, i.e. there exists some linear classifier which has 0 training error. Consider
the following optimization problem:
    min_{w,b} ∥w∥²  subject to  yi(wT xi + b) ≥ 1, i = 1, . . . , n.
This is a constrained optimization where the objective is a quadratic function and the constraints are linear.
So it is a convex optimization problem (quadratic programming, to be more specific).
Given a hyperplane (classifier), define the margin as the minimum distance from the plane to any of the
examples. Now we show that the above optimization essentially finds a classifier which maximizes
the margin. First, assume that there are two examples x+ and x−, both on the margin boundary (see
Figure 1). Then the margin equals half the length of the projection of (x+ − x−) onto the direction
perpendicular to the hyperplane:
    margin = (1/2) (x+ − x−)T w/∥w∥.
Using the fact that x+ and x− lie on the margin boundary, we have wT x+ + b = 1 and wT x− + b = −1, so
wT(x+ − x−) = 2 and we conclude that the margin is 1/∥w∥. Thus minimizing ∥w∥² subject to the linear
constraints is equivalent to maximizing the margin 1/∥w∥ subject to the same constraints.
Since in practice examples may not be linearly separable, we introduce the concept of slack variables. For
each example, define ξi ≥ 0 to be the slack variable which measures how much this example violates the
margin condition. Instead of minimizing ∥w∥² alone, we also add a term Σ_{i=1}^n ξi which penalizes violations
of the margin condition. The relaxed optimization problem can be written as:
    min_{w,b,ξ} Σ_{i=1}^n ξi + λ∥w∥²
    s.t. yi(wT xi + b) ≥ 1 − ξi, ∀i = 1, . . . , n
         ξi ≥ 0, ∀i = 1, . . . , n,
where λ > 0 is a tuning parameter which controls the balance between training error and the margin. Note
that, equivalently, we can write down the optimization problem as
    min_{w,b} Σ_{i=1}^n ( 1 − yi(wT xi + b) )+ + λ∥w∥².
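This unconstrained hinge-loss form is convenient to optimize directly. Below is a minimal subgradient-descent sketch; the step size, λ and iteration count are illustrative assumptions, not the notes' prescription.

```python
import numpy as np

def linear_svm(X, y, lam=0.1, lr=0.01, iters=2000):
    """Minimize sum_i (1 - y_i (w^T x_i + b))_+ + lam * ||w||^2 by subgradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1                     # examples with positive hinge loss
        # at a violation the hinge term contributes subgradient -y_i x_i (and -y_i for b)
        gw = -(y[viol][:, None] * X[viol]).sum(axis=0) + 2 * lam * w
        gb = -y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b
```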
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
In this lecture we first give some background about convex optimization including the KKT condition and
duality. We then derive the SVM dual optimization problem.
Consider the constrained minimization problem of the form
minimize f (x)
subject to: gi (x) = 0 i = 1, . . . , m ≤ n (1)
hj (x) ≤ 0 j = 1, . . . , p.
gi ’s are equality constraints and hj ’s are inequality constraints and usually they are assumed to be within
the class C 2 . A point that satisfies all constraints is said to be a feasible point. An inequality constraint is
said to be active at a feasible point x if hj (x) = 0 and inactive if hj (x) < 0. Equality constraints are always
active at any feasible point. To simplify notation we write h = [h1 , . . . , hp ] and g = [g1 , . . . , gm ], and the
constraints now become g(x) = 0 and h(x) ≤ 0.
It is convenient to introduce the Lagrangian associated with the problem,
    L(x, µ, λ) = f(x) + µT g(x) + λT h(x),
where µ ∈ Rm, λ ∈ Rp and λ ≥ 0 are Lagrange multipliers. The first-order (KKT) conditions consist of
stationarity ∇x L(x∗, µ, λ) = 0 (2), primal feasibility g(x∗) = 0 (3), and complementary slackness
λj hj(x∗) = 0, j = 1, . . . , p (4). Convince yourself why these conditions hold geometrically. Note that
equations (2), (3) and (4) together give a total of n + m + p equations in the n + m + p variables x∗, λ and µ.
From now on we assume that we only have inequality constraints, for simplicity. The case with equality
constraints can be handled in a similar way, except that µ does not carry the nonnegativity constraint that λ
does. So in our case we have the following optimization problem:
    min_x f(x)  s.t.  h(x) ≤ 0.
For any x and λ ≥ 0 we have inf_{x′} L(x′, λ) ≤ L(x, λ) ≤ sup_{λ′≥0} L(x, λ′), and thus
    sup_{λ≥0} inf_x L(x, λ) ≤ inf_x sup_{λ≥0} L(x, λ).
When equality holds and is attained at a common point (x∗, λ∗), i.e.
    inf_x sup_{λ≥0} L(x, λ) = sup_{λ≥0} inf_x L(x, λ),
the point (x∗, λ∗) is called a saddle point. One example is the function L(x, λ) = x² − λ², with
saddle point (0, 0), as shown in Figure 1.
Weak duality always holds, and strong duality holds if f and the hj's are convex and there exists at least one
feasible point which is an interior point. The Lagrange dual function D(λ) is defined as
    D(λ) := inf_x L(x, λ) = inf_x { f(x) + Σ_{j=1}^p λj hj(x) }.
Note that (1) D(λ) is a concave function; (2) for any feasible λ and x we have D(λ) ≤ f(x). In fact, if we
define p∗ to be the minimum of the primal optimization problem (the primal solution), and d∗ to be the
maximum of the dual problem, d∗ = sup_{λ≥0} D(λ) (the dual solution), then weak duality says d∗ ≤ p∗. The
quantity p∗ − d∗ is known as the duality gap, which can be a useful criterion for convergence.
Now we illustrate this duality relationship with a simple example where we only have one inequality
constraint:
    min f(x)  s.t.  h(x) ≤ 0.
Define ω(z) = inf{f(x) : h(x) ≤ z} for z ∈ R. It is easy to observe that ω(z) is nonincreasing in z. Duality can
be illustrated by the fact that the primal solution p∗ is the intercept of ω(z) with the vertical axis z = 0, and
it is an upper bound of the maximum vertical-axis intercept over all lines that lie below ω(·). Such lines have
the form lλ(z) = −λz + inf_x { f(x) + λh(x) } with λ ≥ 0. An example is shown in Figure 1 (right).
[Figure 1: Left: saddle point (0, 0) of L(x, λ) = x² − λ²; Right: geometric interpretation of duality, showing ω(z), the intercept p∗ at z = 0, and supporting lines lλ(z) lying below ω(·).]
Back to the SVM problem (rescaling the slack term by 1/n), the Lagrangian is
    L(w, b, ξ, α, β) = (1/n) Σ_{i=1}^n ξi + λ∥w∥² + Σ_{i=1}^n αi ( 1 − ξi − yi(wT xi + b) ) − Σ_{i=1}^n βi ξi,
where the Lagrange multipliers satisfy α ≥ 0 and β ≥ 0. We want to remove the primal variables w, b, ξ by
minimizing over them, i.e. we set the following derivatives to zero:
    ∂L/∂w = 0  ⟹  w = (1/(2λ)) Σ_{i=1}^n αi yi xi
    ∂L/∂b = 0  ⟹  Σ_{i=1}^n αi yi = 0
    ∂L/∂ξi = 0  ⟹  αi + βi = 1/n.
Plugging these in, we obtain the dual:
    D(α, β) = Σ_{i=1}^n αi − (1/(4λ)) Σ_{i,j} αi αj yi yj xiT xj.
Since αi ≥ 0, βi ≥ 0 and αi + βi = 1/n, we have 0 ≤ αi ≤ 1/n. So the dual optimization
problem becomes
    max_α Σ_{i=1}^n αi − (1/(4λ)) Σ_{i,j} αi αj yi yj xiT xj
    s.t. 0 ≤ αi ≤ 1/n,  Σ_{i=1}^n αi yi = 0,
which is a quadratic programming problem. Note that due to the constraints, the dual solution is in general
sparse, i.e. many αi's are equal to 0. We have the following observations:
1. If αi > 0: we have yi (wT xi + b) = 1 − ξi ≤ 1. So the example is either at or on the wrong side of the
margin. Such examples for αi > 0 are called support vectors.
2. If αi = 0: we have βi = 1/n and thus ξi = 0. So yi (wT xi + b) ≥ 1. Such examples are on the correct
side of the margin.
3. If yi(wT xi + b) < 1: we have ξi > 0 and thus βi = 0 and αi = 1/n. So if an example incurs a margin
error then its dual variable αi sits at the upper boundary 1/n.
4. It is possible that for examples which are on the correct side of the margin, their αi ’s are nonzero.
5. In the objective, the xi's always appear in the form of inner products xiT xj. So if we first map xi into a
feature vector φ(xi), then we can replace xiT xj by ⟨φ(xi), φ(xj)⟩. This leads to the introduction of the
reproducing kernel Hilbert space in SVM.
ISYE/CSE 8803: Advanced Machine Learning. Instructor: Tuo Zhao
We first define the Hilbert space and then introduce the concept of a Reproducing Kernel Hilbert Space (RKHS),
which plays an important role in machine learning.
Definition. A Hilbert space is an inner product space which is also complete and separable¹ with respect
to the norm/distance function induced by the inner product. For any f, g, h ∈ H and α ∈ R, ⟨·, ·⟩ is an inner
product if and only if it satisfies the following conditions:
1. ⟨f, g⟩ = ⟨g, f⟩;
2. ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ and ⟨αf, g⟩ = α⟨f, g⟩;
3. ⟨f, f⟩ ≥ 0, and ⟨f, f⟩ = 0 if and only if f = 0.
The norm/distance induced by the inner product is defined as ∥f∥ = √⟨f, f⟩ and ∥f − g∥ = √⟨f − g, f − g⟩.
⟨·, ·⟩ is called a semi-inner product if the third condition only says ⟨f, f⟩ ≥ 0. In this case, the induced norm
is actually a semi-norm.
Examples of Hilbert spaces include:
1. Rn with ⟨a, b⟩ = aT b;
2. the ℓ2 space of square-summable sequences, with inner product ⟨x, y⟩ = Σ_{i=1}^∞ xi yi;
3. the space L2 of square-integrable functions, with inner product ⟨f, g⟩ = ∫ f(x)g(x) dx.
A closed linear subspace G of a Hilbert space H is also a Hilbert space. The distance between an element
f ∈ H and G is defined as inf g∈G ∥f − g∥. Since G is closed, the infimum can be attained and we have fG ∈ G
such that ∥f − fG ∥ = inf g∈G ∥f − g∥. Such fG is called the projection of f onto G. It can be shown that such
fG is unique, and ⟨f − fG , g⟩ = 0 for all g ∈ G. The linear subspace G c = {f : ⟨f, g⟩ = 0, ∀g ∈ G} is called
the orthogonal complement of G. It can be shown that G c is also closed and f = fG + fG c for any f ∈ H,
where fG and fG c are projections of f onto G and G c . The decomposition f = fG + fG c is called a tensor
sum decomposition and is denoted by H = G ⊕ G c , G c = H ⊖ G or G = H ⊖ G c .
A simple example of decomposition would be H = R2 and G = {(x, 0) : x ∈ R} and G c = {(0, y) : y ∈ R}.
Any element (x, y) in H can be decomposed as (x, y) = (x, 0) + (0, y) and this decomposition is unique.
Theorem 12-1 (Riesz). For every continuous linear functional L on a Hilbert space H, there exists a
unique gL ∈ H such that L(f) = ⟨gL, f⟩ for all f ∈ H.
Proof.
Define NL = {f : L(f) = 0} to be the null space of L. Since L is continuous, NL is a closed linear
subspace. If NL ≠ H, then there exists a nonzero element g0 ∈ H ⊖ NL. We have
    (L(f)) g0 − (L(g0)) f ∈ NL,
and since g0 is orthogonal to NL,
    ⟨ (L(f)) g0 − (L(g0)) f, g0 ⟩ = 0.
Thus we get
    L(f) = ⟨ (L(g0)/⟨g0, g0⟩) g0, f ⟩.
Hence we can take gL = (L(g0)) g0 / ⟨g0, g0⟩. If NL = H we simply take gL = 0. If there were two
representers gL and g̃L for L, then we would have ⟨gL − g̃L, f⟩ = 0 for any f ∈ H, thus ∥gL − g̃L∥ = 0 and
gL = g̃L. □
¹ A vector space H is complete if every Cauchy sequence in H converges to an element of H. A sequence satisfying
lim_{m,n→∞} ∥fn − fm∥ = 0 is called a Cauchy sequence.
Reproducing Kernel Hilbert Space
Definition. A function k : X × X → R is a kernel if (1) it is symmetric and (2) it is positive semidefinite,
i.e. for any x1, . . . , xn the Gram matrix K = [k(xi, xj)]_{i,j} is positive semidefinite.
Properties: (1) k(x, x) ≥ 0; (2) k(x, z) ≤ √( k(x, x) k(z, z) ).
There are a couple of ways to define RKHS which are equivalent.
Definition. k(., .) is a reproducing kernel of a Hilbert space H if for ∀f ∈ H, we have f (x) = ⟨k(x, .), f (.)⟩.
Definition. A RKHS is a Hilbert space H with a reproducing kernel whose span is dense in H.
An equivalent definition of an RKHS would be "a Hilbert space of functions with all evaluation functionals
bounded and linear", or "a Hilbert space of functions on which all evaluation functionals are continuous".
Theorem 12-2 (Mercer). Let (X, µ) be a finite measure space and k ∈ L∞(X × X, µ × µ) be a kernel such
that the integral operator Tk : L2(X, µ) → L2(X, µ), (Tk f)(x) = ∫ k(x, z) f(z) dµ(z), is positive semidefinite,
i.e. ∫∫ k(x, z) f(x) f(z) dµ(x) dµ(z) ≥ 0 for all f ∈ L2(X, µ).
Let φi ∈ L2(X, µ) be the normalized eigenfunctions of Tk associated with the eigenvalues λi ≥ 0. Then:
(1) the eigenvalues {λi}_{i=1}^∞ are absolutely summable;
(2) k(x, z) = Σ_{i=1}^∞ λi φi(x) φi(z), where the series converges absolutely and uniformly.
We can construct an RKHS as the completion of the span of the eigenfunctions defined by the kernel:
    H = { f : f(x) = Σ_i αi φi(x) s.t. ∥f∥H < ∞ }.
Given f = Σ_i αi φi and g = Σ_i βi φi, the inner product and the norm induced by the inner product are
defined as
    ⟨f, g⟩H = ⟨ Σ_i αi φi, Σ_i βi φi ⟩H = Σ_i αi βi / λi
and
    ∥f∥²H = ⟨ Σ_i αi φi, Σ_i αi φi ⟩H = Σ_i αi² / λi.
It is easy to see that the reproducing property holds:
    ⟨f(·), k(·, x)⟩H = ⟨ Σ_i αi φi(·), Σ_i λi φi(x) φi(·) ⟩H = Σ_i αi λi φi(x) / λi = f(x).
The RKHS concept can be utilized in SVM and other kernel machines, which is known as the kernel trick.
Given the eigenvalues λi and eigenfunctions φi of a reproducing kernel k(·, ·), we can map x ∈ X into
a higher-dimensional feature space:
    x ↦ Φ(x) = ( √λ1 φ1(x), . . . , √λi φi(x), . . . ).
The dimensionality of the feature vector Φ(x) is the number of nonzero eigenvalues of k(·, ·),
which could be infinite. By Mercer's theorem, the standard ℓ2 inner product between any
two feature vectors Φ(x) and Φ(z) can now be computed via the reproducing kernel, since
    ⟨Φ(x), Φ(z)⟩ = Σ_{i=1}^∞ λi φi(x) φi(z) = k(x, z).
Representer Theorem
Theorem 12-3 (Representer). Given a reproducing kernel k, let H be the corresponding RKHS. Then
for a function L : Rn → R and a non-decreasing function Ω : R → R, a solution of the optimization problem
    min_{f∈H} J(f) = min_{f∈H} { L(f(x1), . . . , f(xn)) + Ω(∥f∥²H) }
can be expressed as
    f∗ = Σ_{i=1}^n αi k(xi, ·).
Furthermore, if Ω(·) is strictly increasing, then all solutions have this form.
Proof.
Define the subspace G := span{ k(xi, ·), 1 ≤ i ≤ n } and decompose f as f = fG + fGc. We have
    ∥f∥²H = ∥fG∥²H + ∥fGc∥²H
by the orthogonality of G and Gc. Since Ω is non-decreasing, we have
    Ω(∥f∥²H) ≥ Ω(∥fG∥²H).
On the other hand, since the kernel k has the reproducing property, we have
    f(xi) = ⟨f, k(xi, ·)⟩ = ⟨fG, k(xi, ·)⟩ + ⟨fGc, k(xi, ·)⟩ = ⟨fG, k(xi, ·)⟩ = fG(xi).
So this implies that L(f(x1), . . . , f(xn)) = L(fG(x1), . . . , fG(xn)), i.e. the first component of the optimization
objective only depends on the projection of f onto G, the span of the k(xi, ·)'s. Since Ω(∥f∥²H) ≥
Ω(∥fG∥²H), a minimizer can be expressed as f∗(·) = Σ_{i=1}^n αi k(xi, ·). If Ω(·) is strictly
increasing, then fGc must be zero and all minimizers must take the above form. □
Examples of Kernels
Some simple examples of kernels: the linear kernel k(x, z) = xT z, the polynomial kernel k(x, z) = (1 + xT z)^d,
and the Gaussian (RBF) kernel k(x, z) = exp(−∥x − z∥²/(2σ²)).
We can also construct kernels from simpler ones. For instance, the following are kernels (it can be shown that
each k(·, ·) satisfies the conditions of a kernel):
• k(x, z) = Σ_i αi ki(x, z), where αi ≥ 0 and the ki(·, ·) are kernels;
• k(x, z) = k1(x, z) k2(x, z);
• k(x, z) = exp(k1(x, z));
• k(x, z) = P(k1(x, z)), where P(t) is a polynomial in t with nonnegative coefficients.
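These closure rules can be sanity-checked numerically by building the Gram matrix of a composed kernel and confirming that its eigenvalues are nonnegative; the base kernels and the sample points below are illustrative choices.

```python
import numpy as np

def k_lin(x, z): return float(x @ z)                        # linear kernel
def k_rbf(x, z): return float(np.exp(-np.sum((x - z)**2)))  # Gaussian kernel

def gram(k, xs):
    return np.array([[k(a, b) for b in xs] for a in xs])

rng = np.random.default_rng(3)
xs = [rng.normal(size=2) for _ in range(20)]
# a kernel composed via the rules above: nonnegative sum, product, and exp
k = lambda x, z: 0.5 * k_lin(x, z) + k_lin(x, z) * k_rbf(x, z) + np.exp(k_rbf(x, z))
K = gram(k, xs)
print(np.linalg.eigvalsh(K).min() >= -1e-8)   # PSD up to numerical round-off
```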
Rademacher Average
We next present an upper bound on the Rademacher average of a function class which is a ball in the RKHS.
Consider the following learning problem:
    min_{f∈H} (1/n) Σ_{i=1}^n ℓ(yi, f(xi)) + λ∥f∥H,
which is equivalent to minimizing the empirical risk subject to the constraint ∥f∥H ≤ t for some properly
chosen t > 0. So we would like to investigate the Rademacher average of the function class Ft = {f : ∥f∥H ≤ t}.
Theorem. Let H be an RKHS with kernel k, and let K ∈ Rn×n with Kij = k(xi, xj). Define Ft = {f :
f ∈ H, ∥f∥H ≤ t}. Then we have
    R̂n(Ft) := E[ sup_{f∈Ft} (1/n) Σ_{i=1}^n ϵi f(xi) | X1, . . . , Xn ] ≤ (t/n) √(trace(K))
and
    Rn(Ft) ≤ (t/√n) √( Σ_{i=1}^∞ λi ),
where the λi's are the eigenvalues of the operator Tk : f ↦ ∫ k(·, x) f(x) dP(x).
Proof.
By the reproducing property we have

sup_{f∈Ft} (1/n) Σ_{i=1}^n ϵi f(xi) = sup_{f∈Ft} (1/n) Σ_{i=1}^n ϵi ⟨k(xi, ·), f⟩
= sup_{∥f∥H≤t} ⟨ (1/n) Σ_{i=1}^n ϵi k(xi, ·), f ⟩
= t ∥ (1/n) Σ_{i=1}^n ϵi k(xi, ·) ∥H
= (t/n) √( Σ_{i,j=1}^n ϵi ϵj k(xi, xj) ).
Therefore we have

R̂n(Ft) = (t/n) E[ √( Σ_{i,j=1}^n ϵi ϵj k(xi, xj) ) | X1, ..., Xn ]
≤ (t/n) √( E[ Σ_{i,j=1}^n ϵi ϵj k(xi, xj) | X1, ..., Xn ] )
= (t/n) √( Σ_{i=1}^n k(xi, xi) )
= (t/n) √trace(K),

where we used Jensen's inequality together with the properties E[ϵi] = 0 and V[ϵi] = 1, so that the cross terms vanish. Since k(x, x) = Σ_{i=1}^∞ λi φi(x)φi(x), where the φi's form an orthonormal basis of L2(X, P), we have

E[trace(K)] = n E[k(X, X)] = n Σ_{i=1}^∞ λi E[φi(X)²] = n Σ_{i=1}^∞ λi.

Applying Jensen's inequality once more,

Rn(Ft) = E[R̂n(Ft)] ≤ (t/n) √( E[trace(K)] ) = (t/√n) √( Σ_{i=1}^∞ λi ).
!
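A Monte Carlo sanity check of the first bound (a sketch; the Gaussian kernel, sample size, and radius t are arbitrary choices). As derived above, for fixed data the supremum is available in closed form, sup_{∥f∥H≤t} (1/n) Σ_i ϵi f(xi) = (t/n) √(ϵᵀKϵ), so we can average it over random sign vectors and compare with (t/n)√trace(K).

```python
import numpy as np

rng = np.random.default_rng(4)
n, t = 200, 1.0
X = rng.normal(size=(n, 1))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                                 # Gaussian kernel Gram matrix

vals = []
for _ in range(2000):
    eps = rng.choice([-1.0, 1.0], size=n)             # Rademacher signs
    vals.append((t / n) * np.sqrt(eps @ K @ eps))     # exact supremum for this draw
print(np.mean(vals), (t / n) * np.sqrt(np.trace(K)))  # empirical average vs. trace bound
```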
5
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical MachineTheory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao
In ensemble learning, we obtain a final classifier by combining a set of base classifiers. Often, even if the candidate classifiers are weak (meaning they do only slightly better than random guessing), the final classifier can still have very good performance. In most cases, the final classifier is a weighted combination of weak learners, where the weights can be determined in different ways. We will focus on boosting algorithms (AdaBoost in particular), which have been shown to perform well in practice.
The AdaBoost algorithm is as follows.
1. Initialize the weights D1(i) = 1/n for i = 1, ..., n.
2. For t = 1, ..., T:
(a) Train a weak classifier ft on the training data weighted by Dt, and compute its weighted error ϵt = Σ_{i=1}^n Dt(i) I(yi ≠ ft(xi)).
(b) Let αt = (1/2) log((1−ϵt)/ϵt) and update Ft = Ft−1 + αt ft.
(c) Update the weights Dt+1(i) = Dt(i) exp(−yi αt ft(xi)) / Zt, where Zt is the normalizing constant making Σ_i Dt+1(i) = 1.
3. Output FT; the final classifier is sign(FT(·)).
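Below is a compact sketch of this algorithm in code; the choice of decision stumps as weak learners and the toy data are illustrative assumptions, not part of the notes.

```python
import numpy as np

def stump_predict(X, j, thr, s):
    # decision stump: predict s * sign(x_j - thr) in {-1, +1}
    return s * np.sign(X[:, j] - thr + 1e-12)

def best_stump(X, y, D):
    # weak learner: stump minimizing the weighted error sum_i D(i) I(y_i != f(x_i))
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (-1.0, 1.0):
                err = D @ (stump_predict(X, j, thr, s) != y)
                if err < best[0]:
                    best = (err, (j, thr, s))
    return best

def adaboost(X, y, T=20):
    n = len(y)
    D = np.full(n, 1.0 / n)                        # step 1: D_1(i) = 1/n
    model = []
    for _ in range(T):
        eps, (j, thr, s) = best_stump(X, y, D)     # step 2(a): weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)      # step 2(b)
        D = D * np.exp(-y * alpha * stump_predict(X, j, thr, s))
        D /= D.sum()                               # step 2(c): normalize by Z_t
        model.append((alpha, j, thr, s))
    return model

def predict(model, X):
    F = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in model)
    return np.sign(F)                              # step 3: sign(F_T)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                     # XOR-like labels in {-1, +1}
model = adaboost(X, y)
print(np.mean(predict(model, X) == y))             # training accuracy
```

No individual stump does much better than random on this XOR-like data, yet the weighted combination fits it well, which is exactly the point of boosting.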
The next theorem shows that AdaBoost essentially minimizes the exponential loss on the training examples via a greedy search.
Theorem. The empirical training error of the final classifier FT(·) is upper bounded by ∏_{t=1}^T 2√( ϵt(1−ϵt) ). Furthermore, if ϵt ≤ 1/2 − γ for all t = 1, ..., T, then the training error is further upper bounded by (1 − 4γ²)^{T/2}.
Proof.
The first part can be shown as follows. Since yFT(x) ≤ 0 implies exp(−yFT(x)) ≥ 1, the training error is upper bounded as follows:

(1/n) Σ_{i=1}^n I(yi ≠ FT(xi)) = (1/n) Σ_{i=1}^n I(yi FT(xi) ≤ 0)
≤ (1/n) Σ_{i=1}^n exp(−yi FT(xi))
= (1/n) Σ_{i=1}^n exp( −yi Σ_{t=1}^T αt ft(xi) )
= (1/n) Σ_{i=1}^n ∏_{t=1}^T exp(−yi αt ft(xi)).
Since yi, ft(xi) ∈ {±1}, their product is also ±1. By the definition of Dt+1(i) we have exp(−yi αt ft(xi)) · Dt(i) = Dt+1(i) Zt. Thus we have

(1/n) Σ_{i=1}^n ∏_{t=1}^T exp(−yi αt ft(xi)) = (1/n) Σ_{i=1}^n ∏_{t=1}^T ( Dt+1(i)/Dt(i) ) Zt
= (1/n) Σ_{i=1}^n ( DT+1(i)/D1(i) ) ∏_{t=1}^T Zt    (1)
= ∏_{t=1}^T Zt,

where the last step comes from the facts that D1(i) = 1/n and Σ_{i=1}^n DT+1(i) = 1 because it is normalized.
Now we choose αt to minimize Zt. By definition we have

Zt = Σ_{i: yi = ft(xi)} Dt(i) exp(−αt) + Σ_{i: yi ≠ ft(xi)} Dt(i) exp(αt)
= (1 − ϵt) exp(−αt) + ϵt exp(αt),

and we obtain αt = (1/2) ln((1−ϵt)/ϵt) by setting ∂Zt/∂αt = 0. Plugging in, we have

Zt = (1 − ϵt) √( ϵt/(1−ϵt) ) + ϵt √( (1−ϵt)/ϵt ) = 2√( ϵt(1−ϵt) ).
The second part then follows since if ϵt ≤ 1/2 − γ for all t, then

∏_{t=1}^T 2√( ϵt(1−ϵt) ) ≤ ∏_{t=1}^T 2√( 1/4 − γ² ) = (1 − 4γ²)^{T/2}.

Furthermore, since 1 − x ≤ e^{−x}, we have (1 − 4γ²)^{T/2} ≤ exp(−2γ²T), so if we let exp(−2γ²T) ≤ δ, then for T ≥ log(1/δ)/(2γ²) the training error will be upper bounded by δ.
!
Assume that we have already fixed α1, ..., αt−1 and f1, ..., ft−1 in the first t − 1 steps. In the t-th step, we have

(1/n) Σ_{i=1}^n exp(−yi Ft(xi))
= (1/n) Σ_{i=1}^n exp( −yi (Ft−1(xi) + αt ft(xi)) )
= (1/n) Σ_{i=1}^n [ (exp(αt) − exp(−αt)) I(yi ≠ ft(xi)) + exp(−αt) ] exp(−yi Ft−1(xi))
= (exp(−αt)/n) Σ_{i=1}^n exp(−yi Ft−1(xi)) + (exp(αt) − exp(−αt)) · (1/n) Σ_{i=1}^n I(yi ≠ ft(xi)) exp(−yi Ft−1(xi)).

Since by equation (1) we have

(1/n) exp(−yi Ft−1(xi)) = Dt(i) ∏_{s=1}^{t−1} Zs,

plugging in gives

(1/n) Σ_{i=1}^n exp(−yi Ft(xi)) = (exp(−αt)/n) Σ_{i=1}^n exp(−yi Ft−1(xi)) + (exp(αt) − exp(−αt)) ∏_{s=1}^{t−1} Zs · Σ_{i=1}^n I(yi ≠ ft(xi)) Dt(i).

As a result, for any αt this quantity is minimized with respect to ft if ft minimizes Σ_{i=1}^n Dt(i) I(yi ≠ ft(xi)), the weighted training error. Given ft, the αt which minimizes the empirical exponential error (1/n) Σ_{i=1}^n exp(−yi Ft(xi)) is the same αt which minimizes Zt, i.e. αt = (1/2) ln((1−ϵt)/ϵt).
We can see that AdaBoost is essentially a greedy algorithm which minimizes the empirical exponential loss in a coordinate-descent fashion: in each iteration it minimizes with respect to ft and then with respect to αt. Furthermore, this viewpoint allows us to generalize AdaBoost to other loss functions. Given any convex loss function φ(·) other than the exponential loss (such as the logistic loss or the squared loss), we can use a gradient descent method to minimize

R̂φ(Ft−1 + αt ft) = (1/n) Σ_{i=1}^n φ( yi (Ft−1(xi) + αt ft(xi)) )

with respect to ft and αt. Gradient descent chooses a direction d = [ft(x1), ..., ft(xn)] aligned with the negative gradient of R̂φ(Ft−1 + z) at z = 0. Since the gradient of R̂φ(Ft−1 + z) at z = 0 is

[ (1/n) φ′(y1 Ft−1(x1)) y1, ..., (1/n) φ′(yn Ft−1(xn)) yn ],

essentially we want to find d which minimizes

dᵀ [ φ′(y1 Ft−1(x1)) y1, ..., φ′(yn Ft−1(xn)) yn ] = Σ_{i=1}^n φ′(yi Ft−1(xi)) yi ft(xi).
Define ai = φ′(yi Ft−1(xi)), which is a constant; note that ai ≤ 0 for a decreasing loss, so −ai ≥ 0. In terms of minimization, we have the following equivalences:

min_{ft} Σ_{i=1}^n ai yi ft(xi) ⟺ min_{ft} Σ_{i=1}^n (−ai)(−yi ft(xi)) ⟺ min_{ft} Σ_{i=1}^n [ (−ai)(−yi ft(xi)) + (−ai) ] / 2 ⟺ min_{ft} Σ_{i=1}^n I(yi ≠ ft(xi)) (−ai),

where the last step uses ( −yi ft(xi) + 1 )/2 = I(yi ≠ ft(xi)) for yi ft(xi) ∈ {±1}.
So we can obtain ft by minimizing

Σ_{i=1}^n I(yi ≠ ft(xi)) Dt(i)

with respect to ft(·), where Dt(i) = −φ′(yi Ft−1(xi)) / Zt and Zt is a normalizing constant. This generalizes AdaBoost from the exponential loss φ(z) = exp(−z) to other loss functions.
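To make the reweighting concrete, here is a minimal sketch of the weights Dt(i) ∝ −φ′(yi Ft−1(xi)) for the exponential and logistic losses (the margin values below are made up for illustration). For the logistic loss φ(z) = log(1 + e^{−z}), one gets −φ′(m) = 1/(1 + e^m), which puts large weight on small-margin examples but, unlike AdaBoost, bounded weight on badly misclassified ones.

```python
import numpy as np

def boosting_weights(margins, loss="exp"):
    """margins[i] = y_i * F_{t-1}(x_i); returns the normalized weights D_t."""
    if loss == "exp":                        # AdaBoost: -phi'(m) = exp(-m)
        w = np.exp(-margins)
    elif loss == "logistic":                 # -phi'(m) = 1 / (1 + exp(m))
        w = 1.0 / (1.0 + np.exp(margins))
    else:
        raise ValueError(loss)
    return w / w.sum()                       # dividing by the sum plays the role of Z_t

m = np.array([-2.0, 0.0, 2.0])               # one badly wrong, one borderline, one correct
print(boosting_weights(m, "exp"))
print(boosting_weights(m, "logistic"))
```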
In this lecture we consider one of the most popular approaches in statistics: maximum likelihood estimation (MLE). In order to apply MLE, we need to make stronger assumptions about the distribution of (X, Y). Such assumptions are often reasonable in practical applications.
The MLE seeks the model which maximizes the likelihood or, equivalently, minimizes the negative log-likelihood. This is reasonable since the MLE picks the parameter under which the observed data are most probable. Formally, let Θ be a parameter space, and assume that we have the model

yi ∼ pθ∗(y), i = 1, ..., n,

for i.i.d. observations y1, ..., yn, where θ∗ ∈ Θ is the true parameter. Here we omit the covariates xi for simplicity; it is straightforward to include them in the model. The MLE of θ∗ is

θ̂n = arg max_{θ∈Θ} ∏_{i=1}^n pθ(yi) = arg min_{θ∈Θ} − (1/n) Σ_{i=1}^n log pθ(yi).
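A minimal sketch of this definition in code, assuming the Gaussian model N(θ, 1) (an illustrative choice): numerically minimizing the average negative log-likelihood should recover the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
theta_star = 1.5
y = rng.normal(theta_star, 1.0, size=1000)   # i.i.d. observations from p_{theta*}

def avg_nll(theta):
    # -(1/n) sum_i log p_theta(y_i) for the N(theta, 1) density
    return 0.5 * np.mean((y - theta) ** 2) + 0.5 * np.log(2 * np.pi)

res = minimize_scalar(avg_nll)
print(res.x, y.mean())                       # the two estimates agree up to solver tolerance
```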
So we can think of maximum likelihood as trying to minimize E[−log pθ(Y)]. On the other hand, consider the quantity

E[ log pθ∗(Y) − log pθ(Y) ] = E[ log ( pθ∗(Y)/pθ(Y) ) ]
= ∫ log ( pθ∗(y)/pθ(y) ) pθ∗(y) dy
= KL(pθ, pθ∗)
≥ 0,
where KL(q, p) is the KL-divergence between two distributions q and p. Although not a distance (it is not symmetric), the KL-divergence measures the discrepancy between the two distributions. Also note that the last inequality becomes an equality if and only if pθ = pθ∗. This is because

KL(q, p) = Ep[ log(p/q) ] = −Ep[ log(q/p) ] ≥ −log Ep[ q/p ] = 0

by Jensen's inequality, and the equality holds if and only if p(x) = q(x) for all x. So we can see that if we minimize E[−log pθ(Y)], the minimum it can achieve is E[−log pθ∗(Y)], and this minimum is achieved at θ = θ∗, the true parameter value we want to find.
It is easy to see that MLE can be viewed as a special case of empirical risk minimization, where the loss function is simply the negative log-likelihood: ℓ(θ, yi) = −log pθ(yi). Another observation is that minimizing the negative log-likelihood yields the least squares estimator when the error follows a normal distribution: if yi = fθ(xi) + εi with εi ∼ N(0, σ²), then −log pθ(yi) = (yi − fθ(xi))²/(2σ²) plus a constant. The empirical risk is

R̂n(θ) = −(1/n) Σ_{i=1}^n log pθ(yi)

and the risk is

R(θ) = E[ℓ(θ, Y)] = E[−log pθ(Y)].

The excess risk of θ is

R(θ) − R(θ∗) = E[ −log pθ(Y) + log pθ∗(Y) ] = KL(pθ, pθ∗),

the KL-divergence between pθ and pθ∗.
This lemma says that convergence in KL-divergence implies convergence in Hellinger distance. So if we can establish convergence in KL-divergence, then the consistency of the MLE can be proven.
The convergence of the KL-divergence can be seen as follows. Since θ̂n maximizes the likelihood over θ ∈ Θ, we have

Σ_{i=1}^n log ( pθ∗(yi)/pθ̂n(yi) ) = Σ_{i=1}^n log pθ∗(yi) − Σ_{i=1}^n log pθ̂n(yi) ≤ 0.

Thus

(1/n) Σ_{i=1}^n log ( pθ∗(yi)/pθ̂n(yi) ) − KL(pθ̂n, pθ∗) + KL(pθ̂n, pθ∗) ≤ 0.

So we have

KL(pθ̂n, pθ∗) ≤ | (1/n) Σ_{i=1}^n log ( pθ∗(yi)/pθ̂n(yi) ) − KL(pθ̂n, pθ∗) |.

The right-hand side is an empirical process term: if a uniform law of large numbers holds over Θ, i.e. sup_{θ∈Θ} | (1/n) Σ_{i=1}^n log( pθ∗(yi)/pθ(yi) ) − KL(pθ, pθ∗) | → 0, then the bound can be applied to the data-dependent θ̂n as well. As a result, the convergence of the KL-divergence can be established.
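As a simple simulation of this consistency (assuming the Gaussian model N(θ, 1), for which the MLE is the sample mean and KL(pθ̂, pθ∗) = (θ̂ − θ∗)²/2 in closed form):

```python
import numpy as np

rng = np.random.default_rng(7)
theta_star = 1.5
for n in (10, 100, 1000, 10000):
    theta_hat = rng.normal(theta_star, 1.0, size=n).mean()   # Gaussian MLE
    print(n, 0.5 * (theta_hat - theta_star) ** 2)            # KL to the truth, shrinking with n
```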