0% found this document useful (0 votes)

172 views

Tuo Zhao Notes

1. The document introduces the concepts of supervised learning, including formulating the basic learning problem, loss functions, risk, and error decomposition. 2. It discusses empirical risk minimization, where the true risk is approximated by minimizing the empirical risk on the training data. This helps address the problem that the true distribution is unknown. 3. Overfitting, model complexity, and regularization are introduced as ways to balance the approximation error and estimation error in order to improve generalization performance.

Uploaded by

Marc Romaní

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

172 views

Tuo Zhao Notes

Uploaded by

Marc Romaní

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 47

ISYE/CSE

STAT 598Y8803: Advanced

Statistical Machine Theory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao

Lecture 1: Introduction to Supervised Learning

In this lecture we formulate the basic (supervised) learning problem and introduce several key concepts
including loss function, risk and error decomposition.

1 Basic Concepts
We use X and Y to denote the input space and the output space, where typically we have X = Rp . A joint
probability distribution on X × Y is denoted as PX,Y . Let (X, Y ) be a pair of random variables distributed
according to PX,Y . We also use PX and PY |X to denote the marginal distribution of X and the conditional
distribution of Y given X.
Let Dn = {(x1 , y1 ), . . . , (xn , yn )} be an i.i.d. random sample from PX,Y . The goal of supervised learning
is to find a mapping h : X "→ Y based on Dn so that h(X) is a good approximation of Y . When Y = R
the learning problem is often called regression and when Y = {0, 1} or {−1, 1} it is often called (binary)
classification.
The dataset Dn is often called the training set (or training data), and it is important since the distribution
PX,Y is usually unknown. A learning algorithm is a procedure A which takes the training set Dn and
produces a predictor ĥ = A(Dn ) as the output. Typically the learning algorithm will search over a space of
functions H, which we call the hypothesis space.

2 Loss Function and Risk

A loss function is a mapping ℓ : Y × Y "→ R+ (sometimes R × R "→ R+ ). For example, in binary classification
the 0/1 loss function ℓ(y, p) = I(y ̸ = p) is often used and in regression the squared error loss function
ℓ(y, p) = (y − p)2 is often used. Other loss functions include the following: absolute loss, Huber loss, ϵ-
insensitive loss, hinge loss, logistic loss, exponential loss, modified least squares loss, etc. They will be
discussed later in more details.
The performance of a predictor h : X "→ Y is measured by the expected loss, a.k.a. the risk or generalization
error :
R(h) := EX,Y [ℓ(Y, h(X))],
where the expectation is taken with respect to the distribution PX,Y . Since in practice we estimate ĥ
based on the training set Dn , we have R(ĥ) itself a random variable. Thus we may also use the quantity
R(A) = EDn [R(ĥ)] to characterize the generalization performance of the learning algorithm A, which is also
called the expected risk of the learning algorithm.
The risk is an important measure of the goodness of the predictor h(.) since it tells how it performs on
average in terms of the loss function ℓ(., .). The minimum risk is defined as

R∗ = inf R(h)
h

where the infinum is often taken with respect to all measurable functions. The performance of a given
predictor/estimator can be evaluated by how close R(h) is to R∗ . Minimization of the risk is non-trivial
because the underlying distribution PX,Y is in general unknown, and the training data Dn only gives us an
incomplete knowledge of PX,Y in practice.

1
2.1 Binary Classification

For classification problem, a predictor h is also called a classifier, and the loss function for binary classification
is often taken to be the 0/1 loss. In this case, we have

R(h) = EX,Y [ℓ(Y, f (X))] = EX,Y [I(Y ̸

= f (X))] = P (f (X) ̸
= Y ).

And the infimum risk R∗ is also known as the Bayes risk.

The following result shows that the Bayes classifier, which is defined as
!
∗ 1, P (Y = 1|X = x) ≥ 1/2
h (x) = ,
0, otherwise

can achieve the Bayes risk.

Theorem 1-1. For any classifier h we have R(h) ≥ R(h∗ ), i.e. R(h∗ ) = R∗ .
Proof.
= Y ) ≥ P (h∗ (X) ̸
We will show that P (h(X) ̸ = Y ). For any X = x, we have

Letting η(x) := P (Y = 1|X = x), we have for any X = x,

= Y |X = x) − P (h∗ (X) ̸
P (h(X) ̸ = Y |X = x)
= P (h∗ (X) = Y |X = x) − I(h(x) = 1)η(x) − I(h(x) = 0)(1 − η(x))
= η(x)[I(h∗ (x) = 1) − I(h(x) = 1)] + (1 − η(x))[I(h∗ (x) = 0) − I(h(x) = 0)]
= (2η(x) − 1)[I(h∗ (x) = 1) − I(h(x) = 1)]
≥ 0

where the last inequality is true by the definition of h∗ (x). The result follows by intergrating both sides with
respect to x.
!

2.2 Regression

In regression we typically have X = Rp and Y = R. And the risk is often measured by the squared error
loss, ℓ(p, y) = (p − y)2 . The following result shows that for squared error regression, the optimal predictor
is the conditional mean function E[Y |X = x].
Theorem 1-2. Suppose the loss function ℓ(., .) is the squared error loss. Let h∗ (x) = E[Y |X = x], then we
have R(h∗ ) = R∗ .
The proof will be left as an exercise. Thus regression with squared error can be thought as trying to estimate
the conditional mean function. How about regression with its risk defined by the absolute error loss function?

2
3 Approximation Error vs. Estimation Error
Suppose that the learning algorithm chooses the predictor from the hypothesis space H, and define

h∗ = arg inf R(h),

h∈H

i.e. h∗ is the best predictor among H1 . Then the excess risk of the output ĥn of the learning algorithm is
defined and can be decomposed as follows2 :
" # " #
R(ĥn ) − R∗ = R(h∗ ) − R∗ + R(ĥn ) − R(h∗ )
$ %& ' $ %& '
approximation error estimation error

Such a decomposition reflects a trade-off similar to the bias-variance tradeoff (maybe slightly more general).
The approximation error is deterministic and is caused by the restriction of using H. The estimation error
is caused by the usage of a finite sample that cannot completely represent the underlying distribution.
The approximation error term behaves like a bias square term, and the estimation error behaves like the
variance term in standard statistical estimation problems. Similar to the bias-variance trade-off, there is also
a trade-off between the approximation error and the estimation error. Basically if H is large then we have
a small approximation error but a relatively large estimation error and vice versa.

1 Sometimes h∗ is defined as the approximately best predictor chosen by the estimation procedure if infinite amount of
samples are given. In that case, the approximation error can be caused by both the function class H and some intrinsic bias of
the estimation procedure.
2 Sometimes the decomposition is done for E ∗
Dn [R(ĥn )] − R .

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical MachineTheory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao

Lecture 2: Risk Minimization

In this lecture we introduce the concepts of empirical risk minimization, overfitting, model complexity and
regularization.

1 Empirical Risk Minimization

Given a loss function ℓ(., .), the risk R(h) is not computable as PX,Y is unknown. Thus we may not able
to directly minimize R(h) to obtain some predictor. Fortunately we are provided with the training data
Dn = {(x1 , y1 ), . . . , (xn , yn )} which represents the underlying distribution PX,Y .
Instead of minimizing R(h) = EX,Y [ℓ(Y, h(X))], one may replace PX,Y by its empirical distribution and thus
obtain the following minimization problem:

1!
n
ĥn = arg min ℓ(yi , h(xi ))
h∈H n i=1

which we call empirical risk minimization (ERM). Furthermore, we also define the empirical risk R̂n (h) as

1!
n
R̂n (h) := ℓ(yi , h(xi )).
n i=1

Because under some conditions R̂n (h) →p R(h) by the law of large numbers, the usage of ERM is at least
partially justified.
ERM covers many popular methods and is widely used in practice. For example, if we take H = {h(x) :
h(x) = θT x, ∀θ ∈ Rp } and ℓ(y, p) = (y − p)2 , then ERM becomes the well-known least squares estimation.
The celebrated maximum likelihood estimation (MLE) is also a special case of ERM where the loss function
is taken to be the negative log-likelihood function. Example: in binary classification (x1 , y1 ), . . . , (xn , yn )
where yi ∈ {−1, 1} and H = {h(x) : h(x) = θT x, ∀θ ∈ Rp }, the logistic regression is computed by minimizing
the logistic loss:
1!
n
θ̂ = arg min log(1 + exp(−yi θT xi ))
n i=1
which is equivalent to MLE.

2 Overfitting

ERM works by minimizing the empirical risk R̂n (h), while the goal of learning is to obtain a predictor with
a small risk R(h). Although under certain conditions the former will converge to the latter as n → ∞, in
practice we always have a finite sample and as a result, there might be a large discrepancy between those
two targets, especially when H is large and n is small. Overfitting refers to the situation where we have a
small empirical risk but still a relatively large true risk.
Consider the following example. Let ℓ(y, p) = (y − p)2 and we obtain the predictor ĥ by ERM:

1!
n
ĥ = arg min (yi − h(xi ))2 .
h∈H n
i=1

1
Polynomial with degree1 Polynomial with degree2
1.5 1.5
1 1
0.5 0.5
0 0
−0.5 −0.5
−1 −1
−1.5 −1.5
0 2 4 6 0 2 4 6

Polynomial with degree3 Polynomial with degree5

1.5 1.5
1 1
0.5 0.5
0 0
−0.5 −0.5
−1 −1
−1.5 −1.5
0 2 4 6 0 2 4 6

Figure 1: Overfitting of polynomial regression. The true signal function (blue line) is h∗ (x) = sin(x), and
the function is fitted using 10 training examples (red dots). P1 and P2 show a lack of fitting (underfitting)
and P5 is overfitting.

True/Empirical Risk VS. Model Complexity

Empirical Risk
True Risk
Best Model in Theory
Risk

Model Complexity

Figure 2: True/empirical risk vs. model complexity

2
Figure 1 shows the case where H is taken to be P1 , P2 , ..., where Pk is the set of all polynomial functions
with order up to k.
We can see that when H = P3 the fitted predictor will have a small risk (close to the true signal sin(x)).
Taking Pk with larger k values as the hypothesis space can clearly improve its fitting with respect to the
10 observations (red dots), but this does not necessarily reduce the true risk as it overfits the training data.
Learning is more about generalization than memorization.

3 Controlling Model Complexity

Overfitting is mainly caused by the fact that the hypothesis space H is too large for the sample size n.
Clearly the complexity of the hypothesis space H (i..e the size of H) we can aﬀord depends on the amount
of training data we have. For a given training dataset, the relationship between the true risk, the empirical
risk and model complexity can be best illustrated as in Figure 2.
One way to avoid overfitting is to choose H so that it is appropriate for the sample size. There are many
ways to control the model complexity, and they are in fact quite similar in spirit. Here we list two commonly
used approaches:

1. Take H1 , H2 , . . . ,"Hn , . . . to be a sequence of increasing sized spaces. For example, one typically has
Hk ⊂ Hk+1 and Hk = H. Given the training data Dn one finds ĥn by minimizing

ĥn = arg min R̂n (h).

h∈Hn

This covers the method of Sieves and structural risk minimization (SRM).

2. Define a penalty function Ω : H '→ R+ and find ĥn by the following optimization procedure:

1!
n
ĥn = arg min ℓ(yi , h(xi )) + λn Ω(h)
h∈H n i=1

where λn > 0 balances the trade-oﬀ between goodness-of-fit and model complexity. This is also known
as the penalized empirical risk minimization.

In practice we often need to select Hn or λn based on the training data to achieve a good balance between
goodness-of-fit and model complexity.
Consider the following regression problem: let H = {h(x) : h(x) = θT x, ∀θ ∈ Rp } and we are trying to find
an estimator θ̂ which minimizes the risk EX,Y (Y − θT X)2 . For the first approach, we could define a sequence
of increasing constants 0 ≤ η1 ≤ η2 ≤ . . . ≤ ηk ≤ ... and define Hk = {h(x) : h(x) = θT x, θT θ ≤ ηk }. For the
second approach we define Ω(h) = θT θ. Then it is well-known from optimization that those two approaches
become mathematically equivalent (i.e. for any ηk there exists a λ such that those two optimization problems
have the same solution).

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 3: Generalization Error Bounds and Concentration Inequalities

1 Generalization Error Bounds and PAC Learning

Although the concept of consistency of a learning algorithm is very important, it only measures how the
expectation of a random variable R(ĥn ) converges to the optimal Bayes risk asymptotically. However, it
does not say how fast this convergence is, and neither does it tell us how this random variable is distributed.
In particular, we are interested in probability bounds of the generalization error, such as the following: “with
probability at least 1 − δ, the risk R(ĥn ) is bounded by some quantity”.
Recall that the excess risk can be decomposed into approximation error and estimation error, i.e.
! " ! "
R(ĥn ) − R∗ = inf R(h) − R∗ + R(ĥn ) − inf R(h) .
h∈H h∈H
# $% & # $% &
approximation error estimation error

The approximation error is deterministic and mainly caused by two possible reasons: (1) the restriction
of using the function class H; (2) if inf h∈H R(h) in the above equation were replaced by the minimum risk
achievable by the learning algorithm with infinite amount of data, then it can also be caused by the systematic
bias of the learning algorithm. Such an error is often not controllable as we do not know the underlying
distribution PX,Y . On the other hand, the estimation error depends on the sample size, the function class
H and the learning algorithm which we have control over. We would like to obtain probability bounds for
the estimation error.
Specifically, if we use ERM to obtain our predictor ĥn = arg minh∈H R̂n (h), and assume that inf h∈H R(h) =
R(h∗ ) for some h∗ ∈ H, then we have

R(ĥn ) − inf R(h) = R(ĥn ) − R(h∗ )

h∈H

≤ R(ĥn ) − R(h∗ ) + R̂n (h∗ ) − R̂n (ĥn )

= (R(ĥn ) − R̂n (ĥn )) − (R(h∗ ) − R̂n (h∗ ))
' '
' '
≤ 2 sup 'R(h) − R̂n (h)' .
h∈H

Thus if we can obtain uniform bound of suph∈H |R(h)− R̂n (h)| then the approximation error can be bounded.
Thus again justifies the usage of the ERM method.
Intuitively, for any h ∈ H, R̂n (h) is a random variable which follows (in fact nR̂n (h)) a Binomial distribution
with mean R(h). Or we could think of it as the average of a series of random variables. Thus we should be
able to bound the diﬀerence between an average of a set of random variables and their mean. The uniform
bound, however, will depend crucially on how large/complex the hypothesis space H is.
The probably approximately correct (PAC) learning model typically states as follows: we say that ĥn is
ϵ-accurate with probability 1 − δ, if
( )
P R(ĥn ) − inf R(h∗ ) > ϵ < δ.
h∈H

In other words, we have R(ĥn ) − inf h∈H R(h) ≤ ϵ with probability at least (1 − δ).

1
2 Concentration Inequalities
Concentration inequalities will be used to measure how fast the empirical risk converges to the true risk. We
start with some loose but simple ones and then get to more useful results.
Theorem 3-1 (Markov Inequality ). For any nonnegative random variable X and ϵ > 0,

E[X]
P (X ≥ ϵ) ≤ .
ϵ

Proof. We have
E[X] ≥ E[I(X ≥ ϵ)X] ≥ ϵE[I(X ≥ ϵ)] = ϵP (X ≥ ϵ)
and thus P (X ≥ ϵ) ≤ E[X]/ϵ. !

Theorem 3-2 (Chernoﬀ Inequality ) For any random variable X and ϵ > 0,

E[exp(tX)]
P (X ≥ ϵ) ≤
exp(tϵ)

and thus
E[exp(tX)]
P (X ≥ ϵ) ≤ inf .
t>0 exp(tϵ)

Proof. For any t > 0, since exp(tx) is a nonnegative monotone increasing function in x, we have

E[exp(tX)]
P (X ≥ ϵ) = P (exp(tX) ≥ exp(tϵ)) ≤ .
exp(tϵ)

Theorem 3-3. (Chebyshev Inequality ) For any random variable X and ϵ > 0,

V[X]
P (|X − E[X]| > ϵ) ≤ .
ϵ2

Proof. Apply Markov Inequality with random variable Y = |X − E[X]|. !

Both Markov and Chebyshev bounds are polynomial in 1/ϵ and often we need bounds which can converge to
zero exponentially fast. In fact, the Chebyshev inequality can be quite poor. Consider the following example:
Let binary iid random * variables X1 , . . . , Xn ∈ {0, 1} and p = P (Xi = 1). Then we have σ 2 := V[Xi ] =
p(1 − p). Define Sn = ni=1 Xi and we have E[Sn ] = np and V[Sn ] = np(1 − p) = nσ 2 . From Chebyshev
inequality by using ϵ̃ = nϵ, we have
(' ' )
' Sn E[Sn ] '' σ2
P '' − ≥ ϵ̃ ≤ 2.
n n ' nϵ̃

Thus the tail probability goes to zero at a rate of n−1 . But from the central limit theorem (CLT) we have
( )
√ 1 1
n Sn − E[Sn ] →d N (0, σ 2 ).
n n

2
In other words, we have
(+ ) ˆ ∞
n Sn 1 2 1 exp(−y 2 /2)
P ( − p) ≥ y → 1 − Φ(y) = √ exp(−x /2)dx ≤ √ .
σ2 n y 2π 2π y
So ( ) (+ + ) ( )
Sn E[Sn ] n Sn n nϵ̃2
P − ≥ ϵ̃ = P ( − p) ≥ ϵ̃ ≈ exp −
n n σ2 n σ2 2p(1 − p)
which decreases exponentially fast as a function of n. So the Chebyshev inequality does poorly in this case
and we need something better.
The Hoeﬀding’s inequality studies the concentration on the sum of independent random variables and*gives an
n
exponential tail bound. Given random variables X1 , . . . , Xn which are independent, and let Sn = i=1 Xi .
By Chernoﬀ bound we have
P (Sn − E[Sn ] ≥ ϵ) = P (exp(t(Sn − E[Sn ])) ≥ exp(tϵ))
≤ exp(−tϵ)E[exp(t(Sn − E[Sn ]))]
, - n /0
.
= exp(−tϵ)E exp t (Xi − E[Xi ])
i=1
n
1
= exp(−tϵ) E[exp(t(Xi − E[Xi ]))].
i=1

The following lemma shows some property of a bounded random variable with mean zero.
Lemma 3-4. If random variable X has mean zero, i.e. E[X] = 0, and is bounded in [a, b], then for any
s > 0,
E[exp(sX)] ≤ exp(s2 (b − a)2 /8).

Proof.
By convexity of exponential function and Jensen’s inequality and the fact a ≤ X ≤ b,
X −a b−X
exp(sX) ≤ exp(sb) + exp(sa).
b−a b−a
Taking expectation on both sides, and utilizing the fact that E[X] = 0, we have
b exp(sa) − a exp(sb)
E[exp(sX)] ≤
b−a
= [1 − λ + λ exp(s(b − a))] exp(−λs(b − a))
a
where λ = − b−a . Now let u = s(b − a) and define
φ(u) := −λu + log(1 − λ + λ exp(u)),
then the above inequality becomes
E[exp(sX)] ≤ exp(φ(u)).
Now we need to find an upper bound on exp(φ(u)). Using Taylor’s expansion we have
u2 ′′
φ(u) = φ(0) + uφ′ (0) + φ (ξ)
2
for some ξ ∈ [0, u]. It is easy to verify that φ(0) = 0 and φ′ (0) = 0. And we have
λ exp(u) (λ exp(u))2
φ′′ (u) = −
1 − λ + λ exp(u) (1 − λ + λ exp(u))2
( )
λ exp(u) λ exp(u)
= 1−
1 − λ + λ exp(u) 1 − λ + λ exp(u)
1
≤ .
4

3
So we have φ(u) ≤ u2 /8, and therefore

E[exp(sX)] ≤ exp(s2 (b − a)2 /8).

Now are are ready to present the Hoeﬀding’s inequality.

Theorem 3-5 (Hoeﬀding Inequality ) * Let X1 , . . . , Xn be independent bounded random variables such that
n
Xi ∈ [ai , bi ] with probability 1. Let Sn = i=1 Xi . Then for any ϵ > 0, we have:
! "
2ϵ2
1. P (Sn − E[Sn ] ≥ ϵ) ≤ exp − Pn
i=1 (bi −ai )
2

! "
2ϵ2
2. P (Sn − E[Sn ] ≤ −ϵ) ≤ exp − Pn 2
i=1 (bi −ai )

! "
2ϵ2
3. P (|Sn − E[Sn ]| ≥ ϵ) ≤ 2 exp − Pn .
i=1 (bi −ai )
2

Proof. By the above derivation and Lemma 3-4 we have

n
1
P (Sn − E[Sn ] ≥ ϵ) ≤ exp(−tϵ) E [exp(t(Xi − E[Xi ]))]
i=1
- n
/
. t2 (bi − ai )2
≤ exp(−tϵ) exp
i=1
8

4ϵ
Now choose t = Pn 2 we have
i=1 (bi −ai )

( )
2ϵ2
P (Sn − E[Sn ] ≥ ϵ) ≤ exp − *n 2
.
i=1 (bi − ai )

Similarly we can prove the other two claims.

If we appy the Hoeﬀding inequality to the average of a series of bernoulli random variables X1 , . . . , Xn , we
have
P (Sn /n − p ≥ ϵ) ≤ exp(−2nϵ2 )
since bi − ai = 1. This is exactly what the CLT indicates when p = 1/2. The following is a straightforward
application of the Hoeﬀding inequality:
Corollary 3-6 Assume that H = {h1 , . . . , hm }. Then for all ϵ > 0,
( ' ' )
P sup 'R̂n (h) − R(h)' ≥ ϵ ≤ 2m exp(−2nϵ2 ),
' '
h∈H

1
*n
for any distribution PX,Y , where R(h) = EX,Y [I(Y ̸
= h(X))] and R̂n (h) = n i=1 I(yi ̸
= h(xi )).
Finally we will introduce the McDiarmid inequality which generalizes the Hoeﬀding’s inequality to some
function of iid random variables. Some restrictions are needed in order to get exponential bounds.
Theorem 3-6 (McDiarmid Inequality/Bounded Diﬀerences) Suppose random variables X1 , . . . , Xn ∈
X are independent, f is a mapping from X n to R. If for any i and any x1 , . . . , xn , x′i ∈ X , f satisfies

|f (x1 , . . . , xn ) − f (x1 , . . . , xi−1 , x′i , xi+1 , . . . , xn )| ≤ ci .

4
Then for all ϵ > 0,
( )
2ϵ2
P (f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )] ≥ ϵ) ≤ exp − n 2
*
i=1 ci
( )
2ϵ2
P (f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )] ≤ −ϵ) ≤ exp − *n 2 .
i=1 ci

Proof. The proof utilizes a martingale sequence. Define

Vi = E[f |X1 , . . . , Xi ] − E[f |X1 , . . . , Xi−1 ],
and note that E[Vi ] = 0 and
n
.
Vi = E[f |X1 , . . . , Xn ] − E[f ] = f (X1 , . . . , Xn ) − E[f ].
i=1

Define uppper and lower bounds as

Li = inf E[f |X1 , . . . , Xi−1 , x] − E[f |X1 , . . . , Xi−1 ]
x
Ui = sup E[f |X1 , . . . , Xi−1 , x] − E[f |X1 , . . . , Xi−1 ]
x
and note that Li ≤ Vi ≤ Ui . Furthermore, we have
Ui − Li = sup sup (E[f |X1 , . . . , Xi ] − E[X1 , . . . , Xi′ ]) ≤ ci .
Xi Xi′

And we have E[Vi |X1 , . . . , Xi−1 ] = 0. Similar to the proof of Hoeﬀding’s inequality, we have
, n 0
1
P (f − E[f ] ≥ ϵ) ≤ inf exp(−tϵ)E exp(tVi ) .
t>0
i=1

And we have
, n
0 , , n−1
00
1 1
E exp(tVi ) = E E exp(tVn ) exp(tVi ) | X1 , . . . , Xn−1
i=1 i=1
,n−1 0
1
= E exp(tVi )E [exp(tVn ) | X1 , . . . , Xn−1 ]
i=1
,n−1 0
1
≤ E exp(tVi ) exp(t2 c2n /8)
i=1
..
. ( *n )
t2 2
i=1 ci
≤ exp .
8
Setting t = Pn4ϵ we obtain the claimed results.
i=1 c2i

!
Example. Consider ' '
' n '
' 1. '
f (X1 , . . . , Xn ) = sup 'E[g(X)] − g(Xi )' .
g∈G ' n i=1
'
If all g : X →
) [a, b] then we have ci = (b − a)/n. Thus we have
- ' n
' , ' n
'0/ ( )
'
' 1. '
'
'
' 1. '
' 2nϵ2
P sup 'E[g(X)] − g(Xi )' − E sup 'E[g(X)] − g(Xi )' ≤ exp − .
g∈G ' n '
i=1 g∈G ' n ' (b − a)2 i=1

As a final note, the bounds we obtained are the worst case senario since we did not utilize the variance
information. We could obtain bounds if the variance were known, such as Bernstein’s inequality.

5
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 4: PAC Learning: Simple Examples

For a finite training sample Dn , the predictor ĥn can be thought as the output of the learning algorithm
given the training data and the hypothesis space, i.e. ĥn = A(Dn , H). Its risk R(ĥn ) = EX,Y [I(Y ̸= ĥn (X))]
is a random variable which depends on Dn , A and H.
Consistency of the learning algorithm has focused on the mean of this random variable, i.e. EDn [R(ĥn )]. In
PAC learning we are interested in its tail distribution, i.e. finding a bound which holds with large probability:
! "
P sup [R(h) − R̂n (h)] ≥ ϵ ≤ δ.
h∈H

The basic idea is to set the probability of being misled to δ and thus solve the ϵ.
Example 1 (single classifier). Consider the special case H = {h}, i.e. we only have a single function.
Furthermore, we assume that it can achieve 0 trainining error over Dn , i.e. R̂n (h) = 0. Then what is the
probability that its generalization error R(h) ≥ ϵ? We have
# $
P R̂n (h) = 0, R(h) ≥ ϵ = (1 − R(h))n
n
≤ (1 − ϵ)
≤ exp(−nϵ).

Setting the RHS to δ and solve for ϵ we have ϵ = n1 log 1δ . Thus with probability (1 − δ),
! "
1 1
P R̂n (h) = 0, R(h) < log .
n δ

Note that we can also utilzie the Hoeﬀding’s inequality to obtain P (|R̂n (h) − R(h)| ≥ ϵ) ≤ 2 exp(−2nϵ2 ),
which leads to % & '
1 2
P |R̂n (h) − R(h)| ≥ log ≤ δ.
2n δ

This is more general but not as tight as the previous one since it does not utilize the fact R̂n (h) = 0. !
Although the result in Example 1 is very simple, it has very limited practical meaning. The main reason is
that it only applies to a single fixed function h. Essentially, it says that for each fixed function h, there is
a set S of samples (whose measure P (S) ≥ 1 − δ) for which |R̂n (h) − R(h)| is bounded. However, such S
sets could be diﬀerent for diﬀerent functions. To handle this issue we need to obtain the uniform deviations
since:
R̂n (ĥn ) − Rn (ĥn ) ≤ sup (R̂n (h) − R(h)).
h∈H

The idea is to utilize the union bound as shown in the following example.
Example 2 (finite number of classifiers). Consider the case H = {h1 , . . . , hm }. Define
( )
Bk := (x1 , y1 ) . . . , (xn , yn ) : R(hk ) − R̂n (hk ) ≥ ϵ , k = 1, . . . , m.

Each Bk is the set of all bad samples for hk , i.e. the samples for which the bound fails for hk . In other
words, it contains all misleading samples. If we want to measure the proability of the samples which are
bad for any hk (k = 1, . . . , m), we could apply the Bonferroni inequality to simply obtain:
m
*
P (B1 ∪ . . . ∪ Bm ) ≤ P (Bk ).
k=1

1
Thus we have
m (
% '
# $ + )
P ∃h ∈ H : R(h) − R̂n (h) ≥ ϵ = P R(hk ) − R̂n (hk ) ≥ ϵ
k=1
m
union bound * # $
≤ P R(hk ) − R̂n (hk ) ≥ ϵ
k=1
≤ m exp(−2nϵ2 ).

Hence for H = {h1 , . . . , hm }, with probability at least (1 − δ),

,
log m + log δ1
∀h ∈ H, R(h) − R̂n (h) ≤ .
2n

Since this is a uniform upper bound, it can be applied to ĥn ∈ H. !

Note that we could also bound the expected´ value of E[suph∈H |R(h) − R̂n (h)|] by using the fact that for
∞
any nonnegative random variable Z, E[Z] = 0 P (Z > t)dt.
From the above PAC learning examples we can see that

• It requires assumptions on data generation, i.e. samples are iid.

• The error bounds are valid with respect to repeated samples of training data.
√
• For a fixed function we roughly have R(h) − R̂n (h) ≈ 1/ n.
-
• If |H| = m then suph∈H (R(h)− R̂n (h)) ≈ log m/n. The term log m can be thought as the complexity
of the hypothesis space H.

There are several things which can be improved:

• Hoeﬀding’s inequality does not utlize the variance information. So the results could be improved by
utilizing such information.
• The union bound could be quite loose. For instance, it is as bad as if all the functions in H were
independent.
• The supremum over H might be too conservative.

The bound in Example 2 becomes meaning less when m is infinite. The following example generalizes it to
the case of coutably many classifiers.
Example 3 (countable number of classifiers). Consider the case H = {h1 , h2 , . . . , hm , . . .}. Since we
need to bound the probability of the set of misleading samples (which
.∞could mislead any h ∈ H) by δ, we
need budget the proability of being misled by hm to wm δ such that k=1 wk ≤ 1. So in order to find ϵ > 0
which satisfies
# $
P ∃h ∈ H : R(h) − R̂n (h) ≥ ϵ ≤ δ,

we only need to make sure that for any k,

# $
P R(hk ) − R̂n (hk ) ≥ ϵ ≤ wk δ

2
since
% ∞ (
'
# $ + )
P ∃h ∈ H : R(h) − R̂n (h) ≥ ϵ = P R(hk ) − R̂n (hk ) ≥ ϵ
k=1
∞
union bound * # $
≤ P R(hk ) − R̂n (hk ) ≥ ϵ
k=1
*
= wk δ
k
≤ δ.

Again the first inequality comes from the

/ Bonferroni inequality. By a similar argument, we solve ϵ by setting
1
2
exp(−2nϵ ) = wk δ, which leads to ϵ = 2n log w1k δ . Thus we have with probability (1 − δ),
,
log w1k + log 1δ
∀h ∈ H, R(h) ≤ R̂n (h) + .
2n
!
Note that wk ’s have to be specified before seeing the training data, otherwise the result will not hold. One
way to interpret wk ’s is that they can be thought as the “prior” knowledge about the functions (such as in
Bayesian inference).

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 5: Growth Function and VC Dimension

We have considered the case when H is finite or countably infinite. In practice, however, the function class
H could be uncountable. Under this situation, the previous method does not work. The key idea is to group
functions based on the sample.
Given a sample Dn = {(x1 , y1 ), . . . , (xn , yn )}, and define S = {x1 , . . . , xn }. Consider the set
HS = Hx1 ,...,xn = {(h(x1 ), . . . , h(xn ) : h ∈ H} .
The size of this set is the total number of possible ways that S = {x1 , . . . , xn } can be classified. For binary
classification the cardinality of this set is always finite, no matter how large H is.
Definition (Growth Function). The growth function is the maximum number of ways into which n points
can be classified by the function class:
GH (n) = sup |HS | .
x1 ,...,xn

Growth function can be thought as a measure of the “size” for the class of functions H. Several facts about
the growth function:

• When H is finite, we always have GH (n) ≤ |H| = m.

• Since h(x) ∈ {0, 1}, we have GH (n) ≤ 2n . If GH (n) = 2n , then there is a set of n points such that the
class of functions H can generate any possible classification result on these points.

Definition (Shatterring). We say that H shatters S if |HS | = 2|S| .

Definition (VC Dimension). The VC dimension of a class H is the largest n = dV C (H) such that

GH (n) = 2n .
In other words, VC dimension of a function class H is the cardinality of the largest set that it can shatters.
Example. Consider all functions of the form H = {h(x) = I(x ≤ θ), θ ∈ R}. Then it can shatter 2 points
but for any three points it cannot shatter. !
Example. Consider all linear classifiers in a 2-d space, i.e. X = R2 . In this case, all linear classifiers can
shatter a set of 3 points. No set of four points can be shattered by linear classifiers. So the VC dimension
in this case is 2. !
Example. Consider all linear classifiers in a p-dimensional Euclidean space, i.e. X = Rp . Given x1 , . . . , xn ∈
Rp , we define the augmented data vector
zi = [1, xi ]T ∈ Rp+1 , i = 1, . . . , n.
Then the set of all linear classifiers can be written as
H = h : h(z) = sign θT z , θ ∈ Rp+1 .
! " # $

Define
Z = [z1 , z2 , . . . , zn ] ∈ R(p+1)×n
and we argue that x1 , . . . , xn is shattered by H if and only if the n columns of Z are linearly independent.

• If columns z1 , . . . , zn are linearly independent, we have n ≤ p + 1 and for any possible classification
assignment y ∈ {±1}n the linear system ZT θ = y must have a solution. Thus, there is a linear classifier
in H (by taking the solution of the linear equation) which can produce such arbitrary class assignment
y.

1
• Suppose columns z1 , . . . , zn are not linearly independent. For H to shatter the set there must exist a
θ ∈ Rp+1 with sign(z1T θ), ..., sign(znT θ) taking any possible vector in {±1}n . In other words, this means
that the vector ZT θ can be in any of the 2n orthants of Rn . However, this contradicts the fact that
z1 , . . . , zn are linearly dependent.

Since if n > p + 1 it is not possible to have Z’s columns linearly independent, but for n ≤ p + 1 we can
always find such x1 , . . . , xn to make it happen, we have dV C (H) = p + 1. !
A somewhat surprising result shows that the growth function GH either grows exponentially in n or only
increases polynomially in n, depends on whether n is greater than its VC dimension dV C (H) or not.
Theorem 5-1 (Sauer). If H is a class of functions with binary outputs and its VC dimension is d =
dV C (H). Then for all n ∈ N,
d & '
% n
GH (n) ≤ .
i
i=0
Furthermore, for all n ≥ d, we have
( en )d
GH (n) ≤ .
d

Proof. For any S = {x1 , . . . , xn }, consider a table containing values of functions in HS (i.e. we only
consider distinct ones projected onto the sample S), each row for one such unique tuple. For example, if
S = {x1 , x2 , . . . , x5 } we might have the following table T :

h(x1 ) h(x2 ) h(x3 ) h(x4 ) h(x5 )

- + - + +
+ - - + +
+ + + - +
- + + - -
- - - + -

Table 1: An example of H projected onto S = {x1 , . . . , x5 }

Each row is one possible tuple for some h ∈ H evaluated on the sample S. Obviously the number of rows
in T is the same as the cadinality of |HS |. Thus we can bound the growth function of H by the maximum
number of rows in table T . Next we transform the table T by processing each column sequentially. For
example, to process the first column, for each row, we replace a ” + ” into a ” − ” unless it produces a
duplicated row in the table. Table 2 shows the table after processing the first column (left table) and the
final table after processing all 5 columns (right table).

h∗ (x1 ) h(x2 ) h(x3 ) h(x4 ) h(x5 ) h∗ (x1 ) h∗ (x2 ) h∗ (x3 ) h∗ (x4 ) h∗ (x5 )
- + - + + - + - - -
- - - + + - - - + +
- + + - + - - - - +
- + + - - - - - - -
- - - + - - - - + -

Table 2: transformed tables (left: after processing the first column; right: after processing all 5 columns)

Now we have the following observations:

1. The size of the tables are not changed for such transformations, and rows in the final table T ∗ are
still unique. Thus we use the upper bound of the number of rows in T ∗ to bound the growth function
GH (n).

2
2. The final table T ∗ possess the property that if we replace any ”+” to ”−”, it will result in a duplication.
So the set of ” + ” elements in each row must be a subset of S that can be shattered by the table T ∗
(in fact, by the set of functions H∗ corresponding to the table T ∗ ).
3. If a subset A ⊂ S can be shattered by a latter table Tk+1 , then it must also be shattered by the
previous table Tk . To see this, notice that if A does not contain the transformed column xk , then the
result holds trivially as all columns in A remain the same in Tk and Tk+1 . If A contain the transformed
column xk , then for each +/- combination (2|A|−1 ) of elements in A\{xk }, we must have two rows in
Tk+1 such that they have ” + ” and “-” values in the column xk . Now in the previous table Tk , those
two rows must also exist. The “+” row is obviously there, and it must also contain the “-” row since
otherwise the “+” would not show up in the later table Tk+1 by the processing procedure.

Since dV C (T ∗ ) ≤ dV C (T ) = dV C (H) = d by observation 3, each row in T ∗ has at most d “+” elements. Thus
*d
an upper bound of the total number of rows in T ∗ is i=0 (ni), which is also an upper bound of the growth
function GH (n) by observation 1.
The second statement comes from the fact that for n ≥ d,
d & ' ( n )d %d & ' & 'i
% n n d
≤
i d i=0 i n
i=0
( n )d & 'n
d
= 1+
d n
( en )d
≤ .
d

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 6: Generalization Bound for Infinite Function Class

We introduce the generalization error bound which utilizes the growth function of H or VC dimension of H
instead of the naive cardinality |H|.
Theorem 6-1 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 − δ,
!
2 log GH (2n) + 2 log 2δ
∀h ∈ H, R(h) ≤ R̂n (h) + 2 .
n

The proof of Theorem 6-1 utilizes a technique called symmetrization. For notational simplicty, we will use
n
1"
Pf = E[f (X, Y )], Pn f = f (xi , yi ).
n i=1

Here f (X, Y ) can be thought as ℓ(Y, h(X)). Also we define Zi = (Xi , Yi ). The key idea is to upper bound
the true risk by an estimate from an independent sample, which is often known as the “ghost” sample. We
use Z1′ , . . . , Zn′ to denote the ghost sample and
n
′ 1"
Pn f = f (x′i , yi′ ).
n i=1

Then we could project the functions in H onto this double sample and apply the union bound with the help
of the growth function GH (.) of H.
#
Lemma (Symmetrization). For any t > 0 such that t ≥ 2/n, we have
$ % & (
' ′ '
P sup |Pf − Pn f | ≥ t ≤ 2P sup 'Pn f − Pn f ' ≥ t/2 .
' '
f ∈F h∈H

Proof. Let fn be the function achieving the supremum. By triangle inequality we have

I (|Pfn − Pn fn | > t) I (|Pfn − P′n fn | < t/2) ≤ I (|P′n fn − Pn fn | > t/2) .

Taking expecation with respect to the ghost sample we have

By Chebyshev’s inequality, we have

4V[fn ] 1
PDn′ (|Pfn − P′n fn | < t/2) ≤ 2
≤ 2
nt nt
where V[fn ] ≤ 1/4 because fn is a random variable (function of the sample) whose value is between 0 and
1. Hence
& (
1
I (|Pfn − Pn fn | > t) 1 − 2 ≤ PDn′ (|P′n fn − Pn fn | > t/2) .
nt
#
Taking expecation with respect to the original sample and utilizing the fact that t ≥ 2/n we obtain the
result. Note the same result holds if we remove the absolute operator. !
The symmetrization lemma replaces the true risk by an average over the ghost sample. As a result, the RHS
only depends on the project of the function class F on the double sample FDn ,Dn′ .

1
Proof of Theorem 6-1:
Let F = {f : f (x, y) = ℓ(y, h(x)), h ∈ H}. First note that GH (n) = GF (n).
& ( $ %
P sup R(h) − R̂n (h)) ≥ ϵ = P sup (Pf − Pn f ) ≥ ϵ
h∈H f ∈F
$ %
≤ 2P sup (P′n f − Pn f ) ≥ ϵ/2
f ∈F
$ %
= 2P sup (P′n f − Pn f ) ≥ ϵ/2
f ∈FDn ,D′
n

≤ 2GF (2n)P ((P′n f − Pn f ) ≥ ϵ/2)

≤ 2GF (2n) exp(−nϵ2 /8).
The last inequality is by the Hoeﬀding’s inequality since P (Pn′ f − Pn f ≥ t) ≤ exp(−nt2 /2). Setting
δ = 2GF (2n) exp(−nϵ2 /8) we have the claimed result. !
Note that in the case of a finite function class, we have |H| = m and GH (2n) ≤ m. So except for the
constants, the bound is at least as good as the one before. Furthermore, if the VC dimension of H is
dV C (H) ≤ n, we can apply the Sauer’s lemma to obtain the following result:
Corollary 6-2. For any δ > 0, with probability at least 1 − δ,
!
2en
2dV C (H) log dV C (H) + 2 log 2δ
∀h ∈ H, R(h) ≤ R̂n (h) + 2 .
n

Note that in order for the result to be meaningful in Theorem 6-1, we require the dV C (H) to be finite. A
class of functions whose VC dimension is finite is called a VC class. We can also utilize this result to obtain
a bound on the expected risk E[R(ĥn )], where ĥn is the empirical minimizer. Since
' '
R(ĥn ) − inf R(h) ≤ 2 sup 'R(h) − R̂n (h)' ,
' '
h∈H h∈H

we have
& ( & ' ' (
P R(ĥn ) − inf R(h) ≥ ϵ ≤ P sup 'R(h) − R̂n (h)' ≥ ϵ/2
' '
h∈H h∈H

≤ 4GH (2n) exp(−nϵ2 /32).

Define a nonnegative random variable Z = R(ĥn )−inf h∈H R(h), and we have P (Z ≥ ϵ) ≤ 4GH (2n) exp(−nϵ2 /32).
Thus
ˆ ∞
E[Z 2 ] = P (Z 2 ≥ t)dt
ˆ0 u ˆ ∞
= P (Z 2 ≥ t)dt + P (Z 2 ≥ t)dt
0 u
ˆ ∞
≤ u+ 4GH (2n) exp(−nt/32)dt
u
32GH (2n) ) nu *
= u+ exp − .
n 32
32 log(4GH (2n))
Minimizing the RHS with respect to u we have u = 32GH (2n)/n. Plugging in we have E[Z 2 ] ≤ n .
By the Cauchy-Schwarz inequality we have
& (
# log GH (2n)
E[R(ĥn )] − inf R(h) = E[Z] ≤ E[Z 2 ] ≤ O .
h∈H n

2
So if the growth function is only polynomially increasing as a function of n, then obviously we have E[R(ĥn )]−
inf h∈H R(h) → 0, i.e. the expected risk will converge to the minimum risk within the function class H.

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 7: Glivenko-Cantelli Theorem

Recall that if we use empirical minimization to obtain our predictor

ĥn = arg min R̂n (h),

h∈H

then in order to bound the quantity R(ĥn ) − inf h∈H R(h), it suﬃces to bound the quantity
sup |R(h) − R̂n (h)|.
h∈H

Thus the uniform bound plays an important role in statistical learning theory. The Glivenko-Cantelli class
is defined such that the above property holds as n → ∞.
Definition. H is a Glivenko-Cantelli class with respect to a probability measure P if for all ϵ > 0,
! "
P lim sup |Pf − Pn f | = 0 = 1,
n→∞ h∈H

i.e. suph∈H |Pf − Pn f | converges to zero almost surely (with proability 1). H is said to be a uniformly GC
Class if the convergence is uniformly over all probability measures P .
Note that Vapnik and Chervonenkis have shown that a function class is a uniformly GC class if and only if
it is a VC class.
Given a set of iid real-valued random variables Z1 , . . . , Zn and any z ∈ R, we know that the quantity
I(Zi ≤ z) is a Bernoulli random variable with mean P (Z ≤ z) = F (z), where F (.) is the CDF. Furthermore,
by strong law of large numbers, we know that
n
1#
I(Zi ≤ z) → F (z)
n i=1

almost surely. The following theorem is one of the most fundamental theorems in mathematical statistics,
which generalizes the strong law of large numbers: the empirical distribution function uniformly almost
surely converges to the true distribution function.
Theorem (Glivenko-Cantelli). Let Z1 , . . . , Zn be iid real-valued random variables with distribution func-
tion F (z) = P (Zi ≤ z). Denote the standard empirical distribution function by
n
1#
Fn (z) = I(Zi ≤ z).
n i=1

Then
nϵ2
! " ! "
P sup |F (z) − Fn (z)| > ϵ ≤ 8(n + 1) exp − ,
z∈R 32
and in particular, by the Borel-Cantelli lemma, we have

lim sup |F (z) − Fn (z)| = 0 almost surely.

n→∞ z∈R

Proof.
$n
We use the notation ν(A) := P (Z ∈ A) and νn (A) = n1 i=1 I(Zi ∈ A) for any measurable set A ⊂ R. If we
let A denote the class of sets of the form (−∞, z] for all z ∈ R, then we have

sup |F (z) − Fn (z)| = sup |ν(A) − νn (A)|.

z∈R A∈A

1
We assume nϵ2 > 2 since otherwise the result holds trivially. The proof consists of several key steps.
(1) Symmetrization by a ghost sample: Introduce a ghost sample Z1′ , . . . , Zn′ which are iid together
with the original sample, and denote by νn′ the empirical measure with respect to the ghost sample. Then
for nϵ2 > 2 we have (by the symmetrization lemma)
! " ! "
P sup |νn (A) − ν(A)| > ϵ ≤ 2P sup |νn (A) − νn′ (A)| > ϵ/2 .
A∈A A∈A

(2) Symmetrization by Rademacher Variables: Let σ1 , . . . , σn be iid random variables, independent of

Z1 , . . . , Zn , Z1′ , . . . , Zn′ , with P (σi = 1) = P (σi = −1) = 1/2. Such random variables are called Rademacher
random variables. Observe that the distribution of
% n %
%# %
′
sup % (I(Zi ∈ A) − I(Zi ∈ A)%
% %
A∈A %
i=1
%
is the same as % n %
%# %
sup % σi (I(Zi ∈ A) − I(Zi′ ∈ A)%
% %
A∈A %
i=1
%
by the definition of Z1 , . . . , Zn ; Z1′ , . . . , Zn′ and σ1 , . . . , σn . Thus we have
! "
P sup |νn (A) − ν(A)| > ϵ
A∈A
! "
′
≤ 2P sup |νn (A) − νn (A)| > ϵ/2
A∈A
& % n % '
%1 # % ϵ
′
= 2P sup % σi (I(Zi ∈ A) − I(Zi ∈ A)% >
% %
A∈A % n i=1 % 2
% % % %
n n
& ' & '
1 %%# % ϵ 1 %%# ′
% ϵ
≤ 2P sup % σi I(Zi ∈ A)% > + 2P sup % σi I(Zi ∈ A)% >
% %
A∈A n % i=1 % 4 A∈A n % i=1 % 4
& % % '
n
1 %%# % ϵ
= 4P sup % σi I(Zi ∈ A)% > .
%
A∈A n %
i=1
% 4

(3) Conditioning: To bound the proability

& % % ' & % % '
n n
1 %%# % ϵ 1 %%# % ϵ
P sup % σi I(Zi ∈ A)% > = P sup % σi I(Zi ≤ z)% >
% %
A∈A n % i=1 % 4 z∈R n % i=1 % 4
we condition on Z1 , . . . , Zn . Fix z1 , . . . , zn ∈ R and note that the vector [I(z1 ≤ z), . . . , I(zn ≤ z)] can take
at most (n + 1) possible values for any z. Thus conditioned on Z1 , . . . , Zn , the supremum is just a maximum
over at most n + 1 random variables. Applying union bound we obtain
% % & % n %
n
& ' '
1 %%# % ϵ % 1 %%# % ϵ %%
P sup % σi I(Zi ∈ A)% > % Z1 , . . . , Zn ≤ (n + 1) sup P σi I(Zi ∈ A)% > % Z1 , . . . , Zn
% % %
A∈A n % i=1 % 4 n % i=1 % 4
%
A∈A

where the sup is outside of the probability. The next step is to find an exponential bound for the RHS.
$n
(4) Hoeffding’s Inequality: With z1 , . . . , zn fixed, i=1 σi I(zi ∈ A) is a sum of n independent zero
mean random variables between [−1, 1]. Thus, by Hoeﬀding’s inequality we have
% %
n
& '
1 %%# % ϵ %%
P sup % σi I(Zi ∈ A)% > % Z1 , . . . , Zn
%
A∈A n % i=1 % 4
& % n % '
1 %%# % ϵ %%
≤ (n + 1) sup P σi I(Zi ∈ A)% > % Z1 , . . . , Zn
%
n % i=1 % 4
%
A∈A

nϵ2
! "
≤ 2(n + 1) exp − .
32

2
Taking expectation on both side we obtain the claimed result

nϵ2
! " ! "
P sup |νn (A) − ν(A)| > ϵ ≤ 8(n + 1) exp − .
A∈A 32

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 8: Rademacher Complexity

Recall that in the proof of the Glivenko-Cantelli theorem we used the Rademacher random variables σ1 , . . . , σn
which are iid uniform {±1} random variables.
Definition. Let µ be a proability measure on X and assume that X1 , . . . , Xn are independent random
variables according to µ. Let F be a class of functions mapping from X to R. Define the random variable
! n
$
1" #
R̂n (F ) := E sup σi f (Xi ) #X1 , . . . , Xn ,
#
f ∈F n i=1

where σ1 , . . . , σn are independent uniform {±1}-valued random variables. R̂n (F ) is called the empirical
Rademacher averages of F . Note that it depends on the sample and can be actually computed. Essentially
it measures the correlation between a random noise (labeling) and functions in the function class F , in the
supremum sense. The Rademacher averages of F is

Rn (F ) = E[R̂n (F )].

For a function class F and a sample S = {x1 , . . . , xn }, we would like to bound the random quantity
% n
&
1"
φ(S) := sup (Pf − Pn f ) = sup E[f (X)] − f (xi ) .
f ∈F f ∈F n i=1

First, we bound the diﬀerence between the random variable and its mean by using McDiarmid inequality.
Consider another sample S ′ which only diﬀers from S at one example. Then we have
# #
′
#
′
# c
|φ(S) − φ(S )| = # sup (Pf − Pn f ) − sup (Pf − Pn f )# ≤
# #
#f ∈F f ∈F # n

where we assume that f ∈ [a, b] with c = b − a. Then by McDiarmind inequality we have

% ! $ &
2nϵ2
' (
P sup (Pf − Pn f ) − E sup (Pf − Pn f ) ≥ ϵ ≤ exp − 2 .
f ∈F f ∈F c

Thus, by setting δ = exp(−2nϵ2 /c2 ) we have ∀f ∈ F,

)
c2 log(1/δ)
φ(S) ≤ E[φ(S)] + .
2n

Next, we related the E[φ(S)] with the Rademacher averages. From now on, we define S ′ = {X1′ , . . . , Xn′ } to

1
be a ghost sample of S (not the same S ′ before). Note that
! $ ! % n
&$
1"
ES sup (Pf − Pn f ) = ES sup E[f (X)] − f (xi )
f ∈F f ∈F n i=1
! ! n n
$$
1" 1 " #
= ES sup E f (Xi′ ) − f (Xi )#X1 , . . . , Xn
#
f ∈F n i=1 n i=1
! n
$
Jensen 1" ′
≤ ES,S ′ sup (f (Xi ) − f (Xi ))
f ∈F n i=1
! n
$
1" ′
= ES,S ′ sup σi (f (Xi ) − f (Xi ))
f ∈F n i=1
! % n & % n
&$
1" ′ 1"
≤ ES,S ′ sup σi f (Xi ) + sup − σi f (Xi )
f ∈F n i=1 f ∈F n i=1
= 2Rn (F ).

So we have shown that ES [φ(S)] ≤ 2Rn (F ). Combine it with the first step (with c = 1), we have shown the
first part of the following theorem:
Theorem 8-1. Let F be a set of binary-valued {0, 1} functions. For all δ > 0, with proability at least 1 − δ,
)
log(1/δ)
∀f ∈ F, Pf ≤ Pn f + 2Rn (F ) + ,
2n
and also with probability at least 1 − δ,
)
log(2/δ)
∀f ∈ F, Pf ≤ Pn f + 2R̂n (F ) + C ,
n
√ √
where C = 2 + 1/ 2.
Proof.
First part has been proven above. For the second part, we apply the McDiarmid’s inequality again to the
empirical Rademacher averages
! n
$
1" #
R̂n (F ) = E sup σi f (Xi ) #X1 , . . . , Xn .
#
f ∈F n i=1

Note that R̂n (F ) is a function of X1 , . . . , Xn and satisfies the condition of the McDiarmid’s inequality with
bounded diﬀerence at most 1/n. So we have
* +
P 2Rn (F ) − 2R̂n (F ) > ϵ ≤ exp(−nϵ2 /2).

So with probability at least 1 − δ,

)
2 log(1/δ)
2Rn (F ) ≤ 2R̂n (F ) + .
n
So if we allow each step to be wrong with δ/2 probability, then we have with probability at least 1 − δ,
)
log(2/δ)
∀f ∈ F, Pf ≤ Pn f + 2R̂n (F ) + C
n
√ √
where C = 2 + 1/ 2.

2
!
Assume that H is the hypothesis space and F = ℓ ◦ H = {f : f (x, y) = ℓ(y, h(x)), ∀h ∈ H} is the class
induced from H. Then the Rademacher averges of H and F are quite related. In fact, if we assume Y = {±1},
then we have
! n
$
1"
Rn (F ) = E sup σi I(Yi ̸= h(Xi ))
h∈H n i=1
! n
$
1" 1
= E sup σi (1 − Yi h(Xi ))
h∈H n i=1 2
! n
$
1 1"
= E sup σi Yi h(Xi )
2 h∈H n i=1

1
= Rn (H).
2

The Rademacher average Rn (H) can actually be computed. Notice that

! n
$
1 1 1"
Rn (H) = E sup σi Yi h(Xi )
2 2 h∈H n i=1
! n
$
1 1 " 1 − σi h(Xi )
= + E sup −
2 h∈H n i=1 2
! n
$
1 1 " 1 − σi h(Xi )
= − E inf
2 h∈H n
i=1
2
, -
1
= − E inf R̂n (h, σ)
2 h∈H

where R̂n (h, σ) is the empirical risk of classifier h with respect to random label σ = [σ1 , . . . , σn ]. When H
is so large that it can fit every random labeling perfectly, we have Rn (H) = 1/2 and the bound becomes
meaningless.
The Rademacher average is related to the growth funciton and VC dimension. One can bound the Rademacher
average by the growth function or VC dimension. We could estimate Rademacher averages for function classes
which are built from simpler classes. The following is a list of properties about Rademacher averages.

1. If F ⊂ G then Rn (F ) ≤ Rn (G). It follows from the definition.

2. Rn (c · F) = |c|Rn (F ), where c · F = {x *→ cf (x) : f ∈ F}. Since we have
! n
$
1"
Rn (cF ) = E sup cσi f (Xi )
f ∈F n i=1

where σi ’s are iid Rademacher random variables. Since |c|σi has the same distribution as cσi , we have
! n
$
1"
Rn (cF ) = |c|E sup σi f (Xi ) = |c|Rn (F ).
f ∈F n i=1

3. Rn (F + g) = Rn (F ), where F + g is defined as {f + g : ∀f ∈ F} and g is a fixed function. To show

3
this, we have
! n
$
1"
Rn (F + g) = E sup σi [f (Xi ) + g(Xi )]
f ∈F n i=1
! n
$ ! n $
1" 1"
= E sup σi f (Xi ) + E σi g(Xi )
f ∈F n i=1 n i=1
= Rn (F )

since the second term is zero.

4. Let the convex hull of a set of functions F be defined as
. k k
/
" "
conv(F ) = αk fi : k ≥ 1, αi ≥ 0, αi = 1, f1 , . . . , fk ∈ F .
i=1 i=1

Then we have Rn (F ) = Rn (conv(F )) since

⎡ ⎤
n k
1 " "
Rn (conv(F )) = E⎣ sup σi αi fj (Xi )⎦
fj ∈F ,∥α∥1 =1 n i=1 j=1
⎡ % &⎤
k n
" 1 "
= E ⎣ sup sup αj σi fj (Xi ) ⎦
fj ∈F ∥α∥1 =1 j=1 n i=1
! n
$
1"
= E sup max σi fj (Xi )
fj ∈F j n i=1
= Rn (F ).

5. Ledoux-Talagrand contraction inequality: If φi is Lipshtiz, i.e. it satisfies |φi (a) − φi (b)| ≤ L|a − b|,
then ! $ ! $
n n
1" 1"
E sup σi φi (f (Xi )) ≤ LE sup σi f (Xi ) = LRn (F ).
f ∈F n i=1 f ∈F n i=1

4
ISYE/CSE
STAT 598Y8803: Advanced
Statistical Machine Theory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao

Lecture 9: Covering Number and Real-Valued Function Class

Recall that if S = {x1 , . . . , xn } ⊂ X n and H is a set of binary-valued functions, then HS denotes the
restriction of H to S, and the growth function (shatter coeﬃcient) is defined by

GH (n) = sup |HS |.

Since H maps X into {0, 1}, HS is finite for any finite S (|HS | ≤ 2n ).
Such a definition works well for binary-valued functions, but if H is a set of real-valued functions then HS
will be a set with infinite cardinality even for finite n. Essentially we do not need to divide H into unique
elements based on HS , but only need to partition the function class H into small groups which are “local”
in nature.
Definition. Let (W, d) be a metric space and F ⊂ W. For every ϵ > 0, denote by N (ϵ, F , d) the minimal
number of open balls (with respect to metric d) needed to cover F . That is, N (ϵ, F , d) is the minimal
cardinality of the set {f1 , . . . , fm } ⊂ W with the property that for every f ∈ F there is some fi such that
d(f, fi ) < ϵ. The set {f1 , . . . , fm } is called an ϵ-cover of F . The logarithm of the covering number is called
the entropy of the set.
We will be interested in metrics induced by samples. For every sample {x1 , . . . , xn } let µn be the empirical
measure of the sample. For 1 ≤ p ≤ ∞ and a function f , put
! n
#1/p
1"
∥f ∥Lp(µn ) = |f (xi )|p
n i=1

and in particular, we have ∥f ∥L∞(µn ) = max1≤i≤n |f (xi )|. Let N (ϵ, F , Lp (µn )) be the covering number of F
at scale ϵ with respect to the norm Lp (µn ).
Theorem 9-1. For any class F of real-valued functions, any sample S = {x1 , . . . , xn } and ϵ > 0,

N (ϵ, F , L1 (µn )) ≤ N (ϵ, F , L2 (µn )) ≤ N (ϵ, F , L∞ (µn )).

Definition. For ϵ > 0, define the uniform covering number

Np (ϵ, F , n) = sup N (ϵ, F , Lp (µn )).

µn

It is easy to see that the uniform covering number is a generalization of the growth function. Suppose that
F contains functions which map X into {0, 1}. Then for any S = {x1 , . . . , xn }, ϵ < 1 and p = ∞, we have
N (ϵ, F , L∞ (µn )) = |GH |, so N∞ (ϵ, F , n) = |GH |.
Based on the covering number we are able to obtain uniform convergence result for real-valued function class.
Theorem 9-2. Let F be a class of functions which map X into [−1, 1] and let µ be a probability measure on
X . Assume X1 , . . . , Xn are independent random variables distributed according to µ. For every ϵ > 0 and
n ≥ 8/ϵ2, ! #
nϵ2
$ $ % &
$1 " $
P sup $ $ f (Xi ) − E[f (X)]$ > ϵ ≤ 8E[N (ϵ, F , L1 (µn ))] exp −
$ ,
f ∈F n 128
where µn is the empirical measure on X1 , . . . , Xn .

1
'n
Proof. Consider the event A = {supf ∈F | i=1 σi f (Xi )| > nϵ/4}. We have

P (A) = Eµ [Eσ [I(A)|X1 , . . . , Xn ]]

n
( ( ! # ))
" nϵ
= Eµ Eσ I sup | σi f (Xi )| > |X1 , . . . , Xn .
f ∈F i=1 4

For any realization of X1 , . . . , Xn , its empirical measure is µn . Let G be an ϵ/8 cover of F with respect to
the L1 (µn ) norm, and we can assume that any function g ∈ G is bounded by 1. First, observe that
! $ n $ # ! $ n $ #
$" $ nϵ $" $ nϵ
P sup $ σi f (Xi )$ > ≤ P sup $ σi g(Xi )$ > .
$ $ $ $
f ∈F $i=1
$ 4 g∈G $ $
i=1
8

This is becuase if there is some element f ∗ ∈ F which makes the LHS event true, then we can find some
g ∗ ∈ G such that
n n
1" ∗ ∗ 1" ∗
|σi f (Xi ) − σi g (Xi )| = |f (Xi ) − g ∗ (Xi )| ≤ nϵ.
n i=1 n i=1
'n
So we have supg∈G | i=1 σi g(Xi )| > nϵ/8. Applying the union bound, Hoeﬀding’s inequality and utilizing
'N
the fact that ∀g ∈ G, i=1 g(xi )2 ≤ n, we have
! $ n $ # !$ n $ #
$" $ nϵ $" $ nϵ
P sup $ σi g(Xi )$ > ≤ |G| · sup P $ σi g(Xi )$ >
$ $ $ $
g∈G $ i=1 $ 8 g∈G $
i=1
$ 8
nϵ2
% &
≤ 2N (ϵ/8, F , L1(µn )) exp − .
128

The claim follows by combing this result with symmetrization (with ghost sample and Rademacher random
variables).
!
Lemma. For A ⊂ Rn with r = maxa∈A ∥a∥, and σ1 , . . . , σn being Rademacher random variables, we have
! n #
" *
E sup σi ai ≤ r 2 log |A|.
a∈A i=1

Proof.
For any s > 0 we have
n n
! ! ## ! ! ##
" "
exp sE sup σi ai ≤ E exp s sup σi ai
a∈A i=1 a∈A i=1
n
! ! ##
"
= E sup exp s σi ai
a∈A i=1
n
! #
" "
≤ E exp s σi ai
a∈A i=1
n
! #
" s2 " 2
≤ exp a
2 i=1 i
a∈A

≤ |A| exp s2 r2 /2 .
+ ,

So we have
n
! #
log |A| sr2
" % &
*
E sup σi ai ≤ inf + =r 2 log |A|.
a∈A i=1
s>0 s 2

2
!
By the lemma we have obvious that -
2 log |F|
R̂n (F ) ≤
n
if F is finite with output values within [−1, 1].
Theorem 9-3. For F ⊂ [−1, 1]X , we have
!- #
2 log N (ϵ, F , L2 (µn ))
R̂n (F ) ≤ inf +ϵ .
ϵ>0 n

Proof.
For an ϵ > 0, let G be an ϵ-cover of F . Then we have
n
( )
1"
R̂n (F ) = Eσ sup σi f (xi )
f ∈F n i=1
( ! n n
#)
1" 1"
= Eσ sup sup σi g(xi ) + σi (f (xi ) − g(xi ))
g∈G f ∈F ∩Bϵ (g) n i=1 n i=1
n
( )
1"
≤ Eσ sup σi g(xi ) + ϵ
g∈G n i=1
-
2 log N (ϵ, F , L2 (µn ))
≤ + ϵ,
n
.
where the first equality utilizes the fact that F = g∈G (F ∩ Bϵ (g)), and the first inequality comes from the
fact that ∥f − g∥L2 (µn ) ≤ ϵ and C-S inequality.
!
Definition. Let (W, d) be a metric space and F ⊂ W. For ϵ > 0, a subset A is said to be an ϵ-packing of
F , if for all distinct f1 , f2 ∈ A, we have d(f1 , f2 ) > ϵ. The ϵ-packing number P (ϵ, F , d) is defined as the
maximum cardinality of an ϵ-packing subset.
Both the covering number and packing number can be used to measure the size of the sets, and they are
obviously related. The following simple result shows that as long as one of them can be computed, we can
easily obtain a bound for the other one.
Theorem 9-4. Given a metric space (W, d). Then for all ϵ > 0 and for every F ⊂ W, the covering number
and packing number satisfy
P (2ϵ, F , d) ≤ N (ϵ, F , d) ≤ P (ϵ, F , d).

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 10: Perceptron and Linear SVM

We start with the perceptron algorithm which is probably one of the simplest linear classifiers. We then
introduce the margin maximization idea and derive the linear SVM classifier.
Assume that Dn = {(x1 , y1 ), . . . , (xn , yn )} where X = Rp and Y = {±1}. For simplicity we only consider the
linear classifiers without intercept here, i.e. H = {x !→ wT x : w ∈ Rp }. Furthermore, we assume that the
data are linearly separable, i.e. there exists some w∗ which can correctly classify all examples sign(w∗T xi ) = yi
for i = 1, . . . , n. The perceptron algorithm works as follows.

1. Start w0 = 0 and t = 0;
2. While wt has training error > 0:

(a) Pick one observation (xi , yi ) which is misclassified by wt ;

(b) Update wt+1 = wt + yi xi ;
(c) t = t + 1

3. Let T = t, w = wT and return ĥn (x) = sign(wT x).

wT x y
∗ i i
Theorem 10-1 (Novikov). Define r = maxi ∥xi ∥ and δ = mini ∥w ∗∥
, where w∗ is some classifier which
can linearly separate Dn . Then it terminates after T ≤ r2 /δ 2 steps.
Proof.
First, note that δ has the meaning of the “margin”: the minimum distance of an example to the decision
hyperplane. So the larger the margin, the smaller number of steps we need to converge. The basic idea of
the proof is to show that wt are getting closer and closer to w∗ . Since ∥wt − w∗ ∥2 = ∥wt ∥2 + ∥w∗ ∥2 − 2wtT w∗ ,
essentially we need to upperbound ∥wt ∥2 and lowerbound wtT w∗ and then combine the results.
First we have w0T w∗ = 0 and
T
wt+1 w∗ = wtT w∗ + yi xTi w∗
≥ wtT w∗ + δ∥w∗ ∥.

Clearly we have wtT w∗ ≥ tδ∥w∗ ∥ by induction. Second, since ∥w0 ∥ = 0 and

∥wt+1 ∥2 = ∥wt + yi xi ∥2
= ∥wt ∥2 + ∥xi ∥2 + 2yi xTi wt
≤ ∥wt ∥2 + r2 .

So we have ∥wt ∥2 ≤ tr2 . Thus we have

√
tδ∥w∗ ∥ ≤ wtT w∗ ≤ ∥wt ∥∥w∗ ∥ ≤ tr∥w∗ ∥.

Then it follows that t ≤ r2 /δ 2 for any t, and thus T ≤ r2 /δ 2 .

1
Maximum Margin Classifier: Support Vector Machines (SVM)

We consider the set of linear classifiers H = {h(x) = wT x + b, w ∈ Rp , b ∈ R}. Suppose the training
examples are linearly separable, i.e. there exist some linear classifier which has 0 training error. Consider
the following optimization problem:

min ∥w∥2
w,b

s.t. yi (wT xi + b) ≥ 1 ∀i = 1, . . . , n.

This is a constrained optimization where the objective is a quadratic function and the constraints are linear.
So it is a convex optimization problem (quadratic programming, to be more specific).
Given a hyperplance (classifier), define margin as the minimum distance between the plane to any of the
example. Now we show that the above optimization essentially tries to find a classifier which maximizes
the margin. First, assume that there are two examples x+ and x− , both are on the margin boundary (see
Figure 1). Then we know that the margin equals half of the distance between (x+ − x− )’s projection along
the direction that is perpendicular to the hyperplane. So we have
1 w
margin = (x+ − x− )T
2 ∥w∥

Using the fact that x+ and x− lie on margin, we have wT x+ +b = 1 and wT x− +b = −1. Thus wT (x+ −x− ) =
2. So we conclude that the margin is 1/∥w∥. Thus minimizing ∥w∥2 subject to the linear constraints is
equvalent to maximizing 1/∥w∥2 subject to the same constraints.
Since in practice examples may not be linearly separable, we introduce the concept of slack variables. For
each example, define ξi ≥ 0 to be the slack variable which measures how
!n much this example violates the
margin condition. Instead of minimizing ∥w∥2 , we will also add a term i=1 ξi which penalizes violation of
the margin condition. The relaxed optimization problem can be written as:
!n 2
min i=1 ξi + λ∥w∥
w,b

s.t. yi (wT xi + b) ≥ 1 ∀i = 1, . . . , n
ξi ≥ 0 ∀i = 1, . . . , n.

where λ > 0 is a tuning parameter which controls the balance between training error and the margin. Note
that equivalently, wecan write down the optimization problem as
"
n
min (1 − yi (wT xi + b))+ + λ∥w∥2
w,b
i=1

where (t)+ = max(t, 0).

2
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 11: Convex Optimization

In this lecture we first give some background about convex optimization including the KKT condition and
duality. We then derive the SVM dual optimization problem.
Consider the constrained minimization problem of the form

minimize f (x)
subject to: gi (x) = 0 i = 1, . . . , m ≤ n (1)
hj (x) ≤ 0 j = 1, . . . , p.

gi ’s are equality constraints and hj ’s are inequality constraints and usually they are assumed to be within
the class C 2 . A point that satisfies all constraints is said to be a feasible point. An inequality constraint is
said to be active at a feasible point x if hj (x) = 0 and inactive if hj (x) < 0. Equality constraints are always
active at any feasible point. To simplify notation we write h = [h1 , . . . , hp ] and g = [g1 , . . . , gm ], and the
constraints now become g(x) = 0 and h(x) ≤ 0.

Karush-Kuhn-Tucker (KKT) Conditions

KKT conditions (a.k.a. Kuhn-Tucker conditions) are necessary conditions for the local minimum solutions
of problem (1). Let x∗ be a local minimum point for Problem (1) and suppose x∗ is a regular point for the
constraints. Then there is a vector µ ∈ Rm and a vector λ ∈ Rp with λ ≥ 0 such that

∇f (x∗ ) + λT ∇h(x∗ ) + µT ∇g(x∗ ) = 0 (2)

g(x∗ ) = 0 (3)
λj hj (x∗ ) = 0 (j = 1, . . . , p) (4)

Convince yourself why the above conditions hold geometrically. It is convenient to introduce the Lagrangian
associated with the problem as

L(x, λ, µ) = f (x) + λT h(x) + µT g(x)

where µ ∈ Rm , λ ∈ Rp and λ ≥ 0 are Lagrange multipliers. Note that equation (2), (3) and (4) together
give a total of n + m + p equations in the n + m + p variables x∗ , λ and µ.
From now on we assume that we only have inequality constrains for simplicity. The case with equality
constraints can be done in a similar way, except that µ does not have the nonnegative constraint as λ. So in
our case we have the following optimization problem:

min f (x) s.t. h(x) ≤ 0.

Weak Duality and Strong Duality

Consider the Lagrangian L(x, λ) for the above optimization problem. Then we have the following two types
of dualities:

• weak duality: we have obviously for any λ ≥ 0 that

inf L(x, λ) ≤ inf sup L(x, λ)

x x λ≥0

1
and thus
sup inf L(x, λ) ≤ inf sup L(x, λ).
λ≥0 x x λ≥0

Such a relation always holds.

• strong duality: suppose in addition there exist x∗ and λ∗ ≥ 0 such that
L(x∗ , λ) ≤ L(x∗ , λ∗ ) ≤ L(x, λ∗ )
for all feasible x and λ ≥ 0. Then we have
inf sup L(x, λ) ≤ sup L(x∗ , λ)
x λ≥0 λ≥0
= L(x∗ , λ∗ )
= inf L(x, λ∗ )
x
≤ sup inf L(x, λ).
λ≥0 x

Thus we have
inf sup L(x, λ) = sup inf L(x, λ).
x λ≥0 λ≥0 x
∗ ∗
The point (x , λ ) is called the saddle point. One example is the function L(x, λ) = x2 − λ2 , with
saddle point (0, 0) as shown in Figure 1.

Weak duality always holds, and strong duality holds if f and hj ’s are convex and there exists at least one
feasible point which is an interior point. The Lagrange dual function D(λ) is defined as
⎧ ⎫
⎨ p
$ ⎬
D(λ) := inf L(x, λ) = inf f (x) + λj hj (x)
x x ⎩ ⎭
j=1

and we define the dual optimization problem as:

max D(λ) s.t. λ ≥ 0.

Note that (1) D(λ) is a concave function; (2) for any feasible λ and x we have D(λ) ≤ f (x). In fact if we
define p∗ to be the minimum solution of the primal optimization problem (primal solution), and d∗ to the
maximum of the dual problem d∗ = supλ≥0 D(λ) (dual solution). Then the weak duality says d∗ ≤ p∗ . The
quantity p∗ − d∗ is known as the duality gap, which can be a useful criteria for convergence.
Now we illustrate this duality relationship with a simple example where we only have one inequality con-
straint:
min f (x) s.t. h(x) ≤ 0.
Define ω(z) = inf{f (x) : h(x) ≤ z} for z ∈ R. Then it is easy to observe that ω(z) is monotone on each
coordinate of z. The duality can be illustrated by the fact that the primal solution p∗ is the intercept of
ω(z) with the vertical axis z = 0, and it is an upperbound of the maximum intercept with the vertical axis
of all hyperplanes that lie below ω(.). Such hyperplanes have the form lλ (z) = −λT z + inf x {f (x) + λh(x)}
with λ ≥ 0. An example is shown in Figure 2.

SVM Dual Problem

The SVM primal problem can be written as
1
(n
minw,b,ξ n i=1 ξi + λ∥w∥2
T
s.t. yi (w xi + b) ≥ 1 − ξi ; ξi ≥ 0 ∀i

2
Saddle Point Duality
8

6 ω(z)
4

5
2

4
L(x, λ)

0
3
−2
p*
2 2
1
−4
2 0 1
1
0 −1
−1 0
−2 −2 x −2 −1.5 −1 −0.5 0 0.5 1 1.5 2
λ

Figure 1: Left: Saddle point (0, 0) of L(x, λ) = x2 − λ2 ; Right: Geometric interpretation of duality.

Now the Lagrangian can be written as

n n n
1$ $ $
L(w, b, ξ, α, β) = ξi + λwT w + αi (1 − ξi − yi wT xi − yi b) − βi ξi
n i=1 i=1 i=1

where the Lagrange multiplers α ≥ 0 and β ≥ 0. We want to remove the primal variables w, b, ξ by
maximization, i.e. set the following derivatives to zero:
n
∂L 1 $
=0 =⇒ w= αi yi xi
∂w 2λ i=1
n
∂L $
=0 =⇒ αi yi = 0
∂b i=1
∂L 1
=0 =⇒ αi + βi = .
∂ξ n
Plugging in and we obtain the dual:
n
$ 1 $
D(α, β) = αi − αi αj yi yj xTi xj .
i=1
4λ i,j

Since we have αi ≥ 0 and βi ≥ 0 and αi + βi = 1/n, thus we have 0 ≤ αi ≤ 1/n. So the dual optimization
problem becomes
(n 1 ( T
max i=1 αi − 4λ i,j αi αj yi yj xi xj
α
s.t. 0 ≤ αi ≤ 1/n.

which is a quadratic programming problem. Note that due to the constraints, the dual solution is in general
sparse, i.e. we have many α′i s equal to 0. We have the following observations:

1. If αi > 0: we have yi (wT xi + b) = 1 − ξi ≤ 1. So the example is either at or on the wrong side of the
margin. Such examples for αi > 0 are called support vectors.
2. If αi = 0: we have βi = 1/n and thus ξi = 0. So yi (wT xi + b) ≥ 1. Such examples are on the correct
side of the margin.

3
3. If yi (wT xi + b) < 1: we have ξ > 0 and thus βi = 0 and αi = 1/n. So if an example causes margin
error then its dual variable αi will take at the right boundary 1/n.
4. It is possible that for examples which are on the correct side of the margin, their αi ’s are nonzero.
5. In the objective xi ’s appear always in the form of inner product xTi xj . So if we first map xi into a
feature vector φ(xi ), then we could replace xTi xj by ⟨φ(xi ), φ(xj )⟩. This leads to the introduction of
reproducing kernel Hilbert space in SVM.

4
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical Machine
Learning Learning
Theory Instructor:
Instructor: TuoZhang
Jian Zhao

Lecture 12: Reproducing Kernel Hilbert Spaces and Kernel Methods

We first define Hilbert space and then introduce the concept of Reproducing Kernel Hilbert Space (RKHS)
which plays an important role in machine learning.
Definition. A Hilbert space is an inner product space which is also complete and separable 1 with respect
to the norm/distance function induced by the inner product. For any f, g ∈ H and α ∈ R, ⟨., .⟩ is an inner
product if and only if it satisfies the following conditions:

1. ⟨f, g⟩ = ⟨g, f ⟩;
2. ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ and ⟨αf, g⟩ = α⟨f, g⟩;
3. ⟨f, f ⟩ ≥ 0 and ⟨f, h⟩ = 0 if and only if f = 0.
! !
The norm/distance induced by the inner product is defined as ∥f ∥ = ⟨f, f ⟩ and ∥f − g∥ = ⟨f − g, f − g⟩.
⟨., .⟩ is called a semi-inner product if the third condition only says ⟨f, f ⟩ ≥ 0. In this case, the induced norm
is actually a semi-norm.
Examples of Hilbert space includes:

1. Rn with ⟨a, b⟩ = aT b;
"∞
2. ℓ2 space of square summable sequence with inner product ⟨x, y⟩ = xi yi ;
i=1
´
3. The space of L2 square integrable functions with inner product ⟨f, g⟩ = f (x)g(x)dx.

A closed linear subspace G of a Hilbert space H is also a Hilbert space. The distance between an element
f ∈ H and G is defined as inf g∈G ∥f − g∥. Since G is closed, the infimum can be attained and we have fG ∈ G
such that ∥f − fG ∥ = inf g∈G ∥f − g∥. Such fG is called the projection of f onto G. It can be shown that such
fG is unique, and ⟨f − fG , g⟩ = 0 for all g ∈ G. The linear subspace G c = {f : ⟨f, g⟩ = 0, ∀g ∈ G} is called
the orthogonal complement of G. It can be shown that G c is also closed and f = fG + fG c for any f ∈ H,
where fG and fG c are projections of f onto G and G c . The decomposition f = fG + fG c is called a tensor
sum decomposition and is denoted by H = G ⊕ G c , G c = H ⊖ G or G = H ⊖ G c .
A simple example of decomposition would be H = R2 and G = {(x, 0) : x ∈ R} and G c = {(0, y) : y ∈ R}.
Any element (x, y) in H can be decomposed as (x, y) = (x, 0) + (0, y) and this decomposition is unique.
Theorem 12-1 (Riesz). For every contniuous linear functional L in a Hilbert space H, there exists a
unique gL ∈ H such that L(f ) = ⟨gL , f ⟩ for ∀f ∈ H.
Proof.
Define NL = {f : L(f ) = 0} to be the null space of L. Since L is continuous we have NL a closed linear
subspace. Assume NL ⊂ H then there exists a nonzero element g0 ∈ H ⊖ NL . We have
(L(f ))g0 − (L(g0 ))f ∈ NL ,
and thus
⟨(L(f ))g0 , (L(g0 ))f, g0 ⟩ = 0.
Thus we get # $
L(g0 )
L(f ) = g0 , f .
⟨g0 , g0 ⟩
Hence we take gL = (L(g0 ))g0 /⟨g0 , g0 ⟩. If NL = H we simply take gL = 0. If there are gL and g̃L two
representers for L then we have ⟨gL − g̃L , f ⟩ = 0 for any f ∈ H and thus ∥gL − g̃L ∥ = 0 and then gL = g̃L .
!.
1 A vector space H is complete if every Cauchy sequence in H converges to an element in H. A sequence satisfying
limm,n→∞ ∥fn − fm ∥ = 0 is called a Cauchy sequence.

1
Reproducing Kernel Hilbert Space
Definition. A kernel k : X × X ,→ R if (1) it is symmetric; (2) it is positive semi definite. I.e. any x1 , . . . , xn
the gram matrix K is positive semi definite.
!
Properties: (1) k(x, x) ≥ 0; (2) k(x, z) ≤ K(x, x)K(z, z).
There are a couple of ways to define RKHS which are equivalent.
Definition. k(., .) is a reproducing kernel of a Hilbert space H if for ∀f ∈ H, we have f (x) = ⟨k(x, .), f (.)⟩.
Definition. A RKHS is a Hilbert space H with a reproducing kernel whose span is dense in H.
An equivalent definition of RKHS woud be “a Hilbert space of functions with all evaluation functinos bounded
and linear” or “all evaluation functionals are continuous”.
Theorem 12-2 (Mercer’s). Let (X , µ) be a finite measure ´ space and k ∈ L∞ (X ×X , µ×µ) be a kernel such
that Tk : L2 (X , µ) ,→ L2 (X , µ) is positive definite, i.e. k(x, z)f (x)f (z)dµ(x)dµ(z) ≥ 0 for all f ∈ L2 (X , µ).
Let φi ∈ L2 (X , µ) be the normalized eigenfunctions of Tk associated with the eigenvalues λi ≥ 0. Then
∞
"∞ {λi }i=1 are absolutely summable;
(1) The eigenvalues
(2) k(x, z) = i=1 λi φi (x)φi (z) holds the series converges absolutely and uniformly.
We can construct a RKHS as the completed space of the span of eigenfunctions defined by the kernel:
% '
&
H= f : f (x) = αi φi (x) s.t. ∥f ∥H < ∞
i
" "
Given f = i αi φi and g = βi φi , the inner product and the norm induced by the inner product are
i
defined as ( )
& & & αi βi
⟨f, g⟩H = αi φi (x), βi φi (x) =
i i i
λi
H
and ( )
& & & α2
i
∥f ∥2H = αi φi (x), αi φi (x) = .
i i i
λi
H
It is easy to see that the representer property holds:
( )
& & & αi λi φi (x)
⟨f (.), k(., x)⟩H = αi φi (.), λi φi (x)φi (.) = = f (x).
i i i
λi
H

The RKHS concept can be utilized in SVM and other kernel machines which is known as the kernel trick.
Given the eigenvalues λi ’s and eigenfunctions φi ’s of a reproducing kernel k(., .), we can may the x ∈ Rp into
a higher dimensional feature space:
*! ! +
x ,→ Φ(x) = λ1 φ1 (x), . . . , λi φi (x), . . . .

The dimensionality of the feature vector Φ(x) is the same as the number of nonzero eigenvalues of k(., .),
which could be of infinite dimensional. By Mercer’s theorem, the standard ℓ2 inner product between any
two feature vectors Φ(x) and Φ(z) can now be computed by the reproducing kernel since

k(x, z) = ⟨Φ(x), Φ(z)⟩ℓ2 .

2
Representer Theorem
Theorem 12-3 (Representer). Given a reproducing kernel k and let H be the corresponding RKHS. Then
for a function L : Rn ,→ R and non-decreasing function Ω : R ,→ R, the solution of the optimization problem
min J(f ) = min L(f (x1 ), . . . , f (xn )) + Ω(∥f ∥2H )
, -
f ∈H f ∈H

can be expressed as
n
&
f∗ = αi k(xi , .).
i=1
Furthermore, if Ω(.) is strictly increasing, then all solutions have this form.
Proof.
Define the subspace G to be the span of
span{k(xi , .), 1 ≤ i ≤ n}.
Decompose f as f = fG + fG c . We have
∥f ∥H = ∥fG ∥G + ∥fG c ∥H
by orthogonality of G with G c . Since Ω is non-decreasing, we have
Ω(∥f ∥2H ) ≥ Ω(∥fG ∥2H ).
On the other hand, since the kernel k has the reproducing property, we have
f (xi ) = ⟨f, k(xi , .)⟩
= ⟨fG , k(xi , .)⟩ + ⟨fG c , k(xi , .)⟩
= ⟨fG , k(xi , .)⟩
= fG (xi ).
So this implies that L(f (x1 ), . . . , f (xn )) = L(fG (x1 ), . . . , fG (xn )), i.e. the first component of the optimization
′ 2
objective only depends on the projection of f onto G which is the "nspan of k(xi , .) s. Since Ω(∥f ∥H ) ≥
2 ∗
Ω(∥fG ∥H ), we have the minimizer can be expressed as f (.) = i=1 αi k(xi , .). If Ω(.) is strictly non-
decreasing, then fG c must be zero and all minimizers must take the above form.
!

Examples of Kernel
Some simple examples of kernel:

• Linear kernel: k(x, z) = xT z or more generally, k(x, z) = xT Bz for B " 0.

• Polynomial kernel: k(x, z) = (xT z + c)d where c ≥ 0 and d ∈ N+ .
• RBF kernel: k(x, z) = exp(−γ∥x − z∥2 ).

We could also construct kernels based on simple ones. For instance, we have kernels (it can be shown that
k(., .) satisfies the conditions of a kernel):
"
• k(x, z) = i αi ki (x, z) where αi ≥ 0 and ki (., .) are kernels;
• k(x, z) = k1 (x, z)k2 (x, z);
• k(x, z) = exp(k1 (x, z));
• k(x, z) = P (k1 (x, z)) where P (t) is a polynomial of t with nonnegative coeﬃcients.

3
Rademacher Average
We next present a result which computes the upper bound of the Rademacher average of a function class
which is a ball in the RKHS. Consider the following learning problem:
n
1&
min ℓ(yi , f (xi )) + λ∥f ∥H ,
f ∈H n
i=1

where H is a RKHS with kernel k. The optimization problem is equivalent to

n
1&
min ℓ(yi , f (xi ))
f ∈H,∥f ∥H ≤t n
i=1

for some properly chosen t > 0. So we would like to invest the function class F = {f : ∥f ∥H ≤ t}’s
Rademacher average.
Theorem. Let H be a RKHS with kernel k, and let K ∈ Rn×n so that Kij = k(xi , xj ). Define Ft = {f :
f ∈ H, ∥f ∥H ≤ t}. Then we have
. n
/
1& t!
R̂n (Ft ) := E sup ϵi f (xi )|X1 , . . . , Xn ≤ trace(K)
f ∈Ft n i=1 n

and 0
1∞
t 1&
Rn (Ft ) ≤ √ 2 λi
n i=1
´
where λi ’s are the eigenvalues of the operator Tk : f ,→ k(., x)f (x)dP (x).
Proof.
By the reproducing property we have
n n
1& 1&
sup ϵi f (xi ) = sup ϵi ⟨k(xi , .), f ⟩
f ∈Ft n i=1 f ∈Ft n i=1
( n )
1&
= sup ϵi k(xi , .), f
∥f ∥H ≤t n i=1
3 n 3
31 & 3
= t3 ϵi k(xi , .)3
3 3
3n 3
i=1
0
1 n
11 &
= t2 2 ϵi ϵj k(xi , xj ).
n i,j=1

Therefore we have
⎡ 0 ⎤
1& n
t 1
R̂n (Ft ) = E⎣ 2 ϵi ϵj k(xi , xj )|X1 , . . . , Xn ⎦
n i,j=1
0 ⎡ ⎤
1
n
t1 ⎣&
1
≤ 2E ϵi ϵj k(xi , xj )|X1 , . . . , Xn ⎦
n i,j=1
0
1 n
t1&
= 2 k(xi , xi )
n i=1
t!
= trace(K),
n

4
where we used the property that E[ϵi ] = 0 and V[ϵi ] = 1 and Jensen’s inequality. Since k(x, x) =
" ∞
i=1 λi φi (x)φi (x), where φi ’s are an orthonomral basis, we have

Rn (Ft ) = E[R̂n (Ft )]

0
1 n
t 11 &
≤ √ E2 k(xi , xi )
n n i=1
t !
≤ √ E[k(X, X)]
n
0
1∞
t 1&
≤ √ 2 λi .
n i=1

5
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical MachineTheory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao

Lecture 13: Ensemble Learning and Boosting

In ensemble learning, we obtain a final classifier by combining a set of classifiers. Often even if the set of
candidate classifiers are weak (means they only do slightly better than random), the final classifier could still
have very good performance. In most cases, the final classifier is a weighted combination of weak learners,
where the weights could be determined in diﬀerent ways. We will focus on boosting algorithms (adaboost,
in particular) which has been shown to perform well in practice.

1 Adaptive Boosting (AdaBoost)

Let H be a function class and (x1 , y1 ), . . . , (xn , yn ) be a set of training examples. H is often called the set
of weak learners, such as linear threshold functions, decision trees, decision stumps (a simple decision tree
with only one node, such as y = I(xy ≥ 3)), a set of simple decision functions, etc. The AdaBoost algorithm
works as follows.

1. Start with a uniform weight distribution

1
D1 (i) = , i = 1, . . . n
n
and let the initial classifier be F0 (x) = 0.
2. For t = 1..T :

(a) Choose ft ∈ H to (approximately) minimize the weighted classification error:

n
!
ϵt = Dt (i)I(yi ̸
= ft (xi )).
i=1

(b) Let αt = 1
2 log 1−ϵ
ϵt and update Ft = Ft−1 + αt ft .
t

(c) Update "

Dt (i) exp(αt ) yi ̸
= ft (xi )
Dt+1 (i) = ×
Zt exp(−αt ) otherwise,
#
where we will see later that Zt = 2 ϵt (1 − ϵt ).

3. Output Ft

The next theorem shows that AdaBoost is essentially minimizing the exponential loss of the training examples
by using a greedy search.
Theorem. The empirical training error of the final classifier FT (.) is upper bounded by Tt=1 2 ϵt (1 − ϵt ).
$ #

Furthermore, if ϵt ≤ 1/2 − γ for all t = 1, . . . , T , then the training error is further upper bouned by (1 −
4γ 2 )T /2 .
Proof.

1
The first part can be shown as follows. Since Y FT ≤ 0 implies exp(−Y FT (X)) ≥ 1. So we have the training
error upper bounded as follows:
n n
1! 1!
I(yi ̸
= FT (xi )) = I(yi FT (xi ) ≤ 0)
n i=1 n i=1
n
1!
≤ exp(−yi FT (xi ))
n i=1
n T
1! !
= exp(−yi αt ft (xi ))
n i=1 t=1
n T
1 !%
= exp(−yi αt ft (xi )).
n i=1 t=1

Since yi , f (xi ) ∈ {±}, their product is also ±1. By definition of Dt+1 (i) we have exp(−yi αt ft (xi )) · Dt (i) =
Dt+1 (i)Zt . Thus we have
n T n T
1 !% 1 ! % Dt+1 (i)
exp(−yi αt ft (xi )) = Zt
n i=1 t=1 n i=1 t=1 Dt (i)
n
&T '
1! % DT +1 (i)
= Zt (1)
n i=1 t=1 D1 (i)
T
%
= Zt ,
t=1
(n
where the last step comes from the facts that D1 (i) = 1/n and i=1 DT +1 (i) = 1 because it is normalized.
Now we choose αt to minimize Zt . By definition we have
! !
Zt = Dt (i) exp(−αt ) + Dt (i) exp(αt )
i:yi =ft (xi ) i:yi ̸
=ft (xi )

= (1 − ϵt ) exp(−αt ) + ϵt exp(αt )
and we obtain αt = 1
2 ln 1−ϵ
ϵt by setting ∂Zt /∂αt = 0. Plugging in we have
t

) )
ϵt 1 − ϵt #
Zt = (1 − ϵt ) + ϵt = 2 ϵt (1 − ϵt ).
1 − ϵt ϵt

1
The second part then follows since if we have ϵt ≥ 2 − γ for all t , then
T
% # % #
2 ϵt (1 − ϵt ) ≤ 2 1/4 − γ 2
t=1 t

= (1 − 4γ 2 )T /2 .
log 1/δ
Furthermore, if we let (1 − 4γ 2)T /2 ≤ δ then we have for T ≥ 2γ 2 the training error will be upper bounded
by δ. !

2 Alternative Interpretation of AdaBoost

Consider the upper bound of training error
n n T n T n
&T ' T
1! 1 !% 1 ! % Dt+1 (i) 1! % DT +1 (i) %
exp(−yi FT (xi )) = exp(−yi αt ft (xi )) = Zt = Zt = Zt .
n i=1 n i=1 t=1 n i=1 t=1 Dt (i) n i=1 t=1 D1 (i) t=1

2
Assume that we already fixed α1 , . . . , αt−1 and f1 , . . . , ft−1 in the first t − 1 steps. In the t-th step, we have
n
1!
exp(−yi FT (xi ))
n i=1
n
1!
= exp (−yi (Ft−1 (xi ) + αt ft (xi )))
n i=1
n
1!
= [(exp(αt ) − exp(−αt ))I(yi ̸
= ft (xi )) + exp(−αt )] exp(−yi Ft−1 (xi ))
n i=1
n n
exp(−αt ) ! 1!
= exp(−yi Ft−1 (xi )) + (exp(αt ) + exp(−αt )) · I(yi ̸
= ft (xi )) exp(−yi FT −1 (xi )).
n i=1
n i=1
Since we have
T%−1
1
exp(−yi FT −1 (xi )) = DT (i) Zt
n t=1
by equation 1, plugging in we have
n
1!
exp(−yi FT (xi ))
n i=1
n T%−1 n
exp(−αt ) ! !
= exp(−yi Ft−1 (xi )) + (exp(αt ) + exp(−αt )) Zs · I(yi ̸
= ft (xi ))DT (i).
n i=1 s=1 i=1
As
(na result, for any αt we have the quantity is minimized with respect to ft if ft minimizes the quantity
i=1 DT (i)I(yi ̸ t (xi )), the weighted training error. Given ft , the αt which minimizes the empirical
= f(
exponential error n1 ni=1 exp(−yi FT (xi )) will be the same αt which minimizes Zt , i.e. it implies αt =
1 1−ϵt
2 ln ϵt .
We can see that AdaBoost is essentially a greedy algorithm which tries to minimize the empirical exponential
loss by using gradient descent: alternatively minimizes w.r.t. ft and αt in each iteration. Furthermore, we
can generalize AdaBoost to other loss functions by this viewpoint. Given any convex loss function φ(.) other
than the exponential loss (such as the logistic loss or the squared loss), we can use gradient descent method
to minimize
n
1!
R̂φ (Ft−1 + αt ft ) = φ(yi (Ft−1 (xi ) + αt ft (xi )))
n i=1
with respect to ft and αt . Gradient descent will choose a direction d = [ft (x1 ), . . . , ft (xn )] which is the
negative gradient of R̂φ (Ft−1 + z) at z = 0. Since the gradient of R̂φ (Ft−1 + z) at z = 0 is
* +
1 ′ 1 ′
φ (y1 Ft−1 (x1 ))y1 , . . . , φ (yn Ft−1 (xn ))yn ,
n n
essentially we want to find d which minimizes
n
!
T ′ ′
d [φ (y1 Ft−1 (x1 ))y1 , . . . , φ (yn Ft−1 (xn ))yn ] = φ′ (yi Ft−1 (xi ))yi ft (xi ).
i=1

Define ai = φ′ (yi Ft−1 (xi )) which is a constant. In terms of minimization, we have the following equivalences:
n n n n
! ! ! (−ai )(−yi ft (xi )) + ai !
min ai yi ft (xi ) ⇐⇒ min (−ai )(−yi ft (xi )) ⇐⇒ min ⇐⇒ min I(yi ̸
= ft (xi ))ai .
ft
i=1
ft
i=1
ft
i=1
2 ft
i=1
So we can obtain ft by minimizing
n
!
I(yi ̸
= ft (xi ))Dt (i)
i=1
−φ′ (yi Ft−1 (xi ))
wrt ft (.), where Dt (i) = Zt . This generalizes the AdaBoost from exponential loss φ(z) = exp(−z)
to other loss functions.

3
ISYE/CSE
STAT 598Y 8803: Advanced
Statistical MachineTheory
Learning Learning Instructor:Jian
Instructor: Tuo Zhang
Zhao

Lecture 14: Maximum Likilihood Estimation

In this lecture we will consider the one of the most popular approaches in statistics: the maximum likelihood
estimation (MLE). In order to apply MLE, we need to make stronger assumptions about the distribution of
(X, Y ). Often such assumptions are reasonable in practical applications.

1 Maximum Likelihood Estimation

Consider the model
yi = h(xi ) + ϵi , i = 1, . . . , n
where we assume ϵi ’s are iid zero mean noises. Note that both classification and regression can be represented
in this way. Furthermore, we assume that the errors have a distribution F (µ) with mean µ = 0. Then we
have
yi ∼ F (h(xi )).
Assume that the probability density function for yi is ph (yi ), we can write down the joint likelihood as
n
!
ph (yi ).
i=1

The MLE estimator seeks the model which maximizes the likelihood, or equivalently, minimizes the negative
log-likelihood. This is reasonable since the MLE estimator is the most probable explanation for the observed
data. Formally, let Θ be a parameter space, and assume that we have the model

yi ∼ pθ∗ (y), i = 1, . . . , n

for iid observations y1 , . . . , yn and θ∗ ∈ Θ is the true parameter. Here we do not have the covariates xi ’s for
simplicity, and it is straightforward to include them in the model. The MLE of θ∗ is
n
!
θ̂n = arg max pθ (yi )
θ∈Θ i=1
n
1 "
= arg min − log pθ (yi ).
θ∈Θ n i=1

By strong law of large numbers we have

n
1"
− log pθ (yi ) →a.s. E[− log pθ (Y )].
n i=1

So we can think of maximum likelihood as trying to minimize E[− log pθ (Y )]. On the other hand, consider
the quantity
# $
pθ∗ (Y )
E [log pθ∗ (Y ) − log pθ (Y )] = E log
pθ (Y )
pθ∗ (y)
ˆ
= log pθ∗ (y)dy
pθ (y)
= KL(pθ , pθ∗ )
≥ 0

1
where KL(q, p) is the KL-divergence between two distributions q and p. Although not a distance measure
(not symmetric), the KL-divergence measures the discrepancy between the two distributions. Also note that
the last inequality becomes equality if and only if pθ = pθ∗ . This is because
# $ # $ # $
p q q
KL(q, p) = Ep log = −Ep log ≥ − log Ep = 0.
q p p
By Jensen’s inequality, the equality happens if and only if p(x) = q(x) for all x. So we can see that if we
minimize E[− log pθ (Y )], the minimum it can achieve is E[− log pθ∗ (Y )], and it achieves this minimum when
θ = θ∗ , the true parameter value we want to find.
It is easy to see that MLE can be thought as a special case of empirical risk minimization, where the loss
function is simply the negative log-likelihood: ℓ(θ, yi ) = − log pθ (yi ). Another observation is that minimizing
the negative log-likelihood will result in the least squares estimator, if the error follows a normal distribution.
The empirical risk is
n
1"
R̂n (θ) = − log pθ (yi )
n i=1
and the risk is
R(θ) = E[ℓ(θ, Y )] = E[− log pθ (Y )].
The excess risk of θ is
R(θ) − R(θ∗ ) = E[− log pθ (Y ) + log pθ∗ (Y )] = KL(pθ , pθ∗ ),
the KL-divergence between pθ and pθ∗ .

2 Hellinger Distance and Consistency of MLE

We define the Hellinger distance between two distributions p and q as
% ˆ
1 &' ' (2
h(p, q) = p(x) − q(x) dx.
2
It is easy to see that it is always nonnegative, symmetric, and satisfies the triangle inequality. And further-
more, h(p, q) = 0 if and only if p = q. The hellinger distance plays an important role in studying MLE
because it can be upper bounded by the KL-divergence, as shown in the following Lemma.
Lemma. We have h2 (q, p) ≤ 12 KL(q, p).
1
Proof. Using the fact that 2 log v ≤ v 1/2 − 1 for all v > 0, we have
'
1 q(x) q(x)
log ≤ ' − 1,
2 p(x) p(x)
and thus )' *
1 q(X)
KL(q, p) ≥ 1 − E ' .
2 p(X)
The result follows since
)' *
q(X)
ˆ ' '
1−E ' = 1− q(x) p(x)dx
p(X)
1 1
ˆ ˆ ˆ ' '
= p(x)dx + q(x)dx − q(x) p(x)dx
2 2
1 (2
ˆ &' '
= p(x) − q(x) dx
2
= h2 (p, q) = h2 (q, p).

2
!
This lemma says that convergence in KL-divergence will lead to convergence in hellinger distance. So if we
can establish the convergence in KL-divergence then the consistency of MLE can be proven.
The convergence of the KL-divergence can be seen as follows. Since θ̂n maximizes the likelihood over θ ∈ Θ,
we have
n n n
" pθ∗ (yi ) " "
log = log pθ∗ (yi ) − log pθ̂n (yi ) ≤ 0.
i=1
pθ̂n (yi ) i=1 i=1

Thus
n
1" pθ∗ (yi )
log − KL(pθ̂n , pθ∗ ) + KL(pθ̂n , pθ∗ ) ≤ 0.
n i=1 pθ̂n (yi )
So we have + +
+1 "n
pθ∗ (yi ) +
KL(pθ̂n , pθ∗ ) ≤ + log − KL(pθ̂n , pθ∗ )+ .
+ +
+n
i=1
p (y
θ̂n i ) +

If law of large numbers holds uniformly, we have

+ n +
+1 " pθ∗ (yi ) +
sup + log − KL(pθ , pθ∗ )+ → 0
+ +
θ∈Θ n i=1
+ pθ (yi ) +

which can be applied to θ̂n as well. As a result, the convergence of KL-divergence can be established.

Dynamical System - Meiss
No ratings yet
Dynamical System - Meiss
34 pages
Hbs
No ratings yet
Hbs
45 pages
Hassler Whitney - Tensor Products of Abelian Groups
No ratings yet
Hassler Whitney - Tensor Products of Abelian Groups
34 pages
Digital Lab Manual 2023
No ratings yet
Digital Lab Manual 2023
168 pages
Abstract Algebra Homework
No ratings yet
Abstract Algebra Homework
10 pages
Momentum Space Operator
100% (1)
Momentum Space Operator
12 pages
Hungerford Solution 6FIELDEXT
No ratings yet
Hungerford Solution 6FIELDEXT
10 pages
Apostol Chapter 02 Solutions
No ratings yet
Apostol Chapter 02 Solutions
23 pages
Solutions To The 82nd William Lowell Putnam Mathematical Competition Saturday, December 4, 2021
100% (2)
Solutions To The 82nd William Lowell Putnam Mathematical Competition Saturday, December 4, 2021
5 pages
Real Analysis - Homework Solutions: Chris Monico, May 2, 2013
No ratings yet
Real Analysis - Homework Solutions: Chris Monico, May 2, 2013
37 pages
PVTmathscope 1376530257
No ratings yet
PVTmathscope 1376530257
56 pages
116 hw1
No ratings yet
116 hw1
5 pages
Lecture 9: Contraction Mapping - June 20, 2012: Functional Analysis by R. Vittal Rao
No ratings yet
Lecture 9: Contraction Mapping - June 20, 2012: Functional Analysis by R. Vittal Rao
4 pages
Riemann Zeta (2k) Using Fourier Analysis
No ratings yet
Riemann Zeta (2k) Using Fourier Analysis
7 pages
Solution 4
No ratings yet
Solution 4
3 pages
Dummit and Foote Chapter 2 Solutions
No ratings yet
Dummit and Foote Chapter 2 Solutions
31 pages
263 Homework
No ratings yet
263 Homework
153 pages
Stein-Shakarchi Complex Analysis Solution Chapter 3 Meromorphic Functions and the Logarithm
No ratings yet
Stein-Shakarchi Complex Analysis Solution Chapter 3 Meromorphic Functions and the Logarithm
7 pages
Ejercicios Munkres Resueltos
No ratings yet
Ejercicios Munkres Resueltos
28 pages
Specht, Et Al: Euclidean Geometry Exercises and Answers: Contents
No ratings yet
Specht, Et Al: Euclidean Geometry Exercises and Answers: Contents
130 pages
BNW-vp (3)
No ratings yet
BNW-vp (3)
11 pages
2 12 PDF
No ratings yet
2 12 PDF
10 pages
Functional Analysis by R. Vittal Rao: Lecture 7: Convergence - June 13, 2012
No ratings yet
Functional Analysis by R. Vittal Rao: Lecture 7: Convergence - June 13, 2012
4 pages
Chapter 2 Metric Spaces 2018
No ratings yet
Chapter 2 Metric Spaces 2018
19 pages
Real Analysis by R. Vittal Rao: Function From A Set To Its Power Set
No ratings yet
Real Analysis by R. Vittal Rao: Function From A Set To Its Power Set
4 pages
Solution Tu's Manifolds Exer1 - 6
No ratings yet
Solution Tu's Manifolds Exer1 - 6
1 page
Abstract Algebra ch13 Dummit Foote2 PDF
No ratings yet
Abstract Algebra ch13 Dummit Foote2 PDF
19 pages
1 S5 PDF
No ratings yet
1 S5 PDF
85 pages
Shankar 7.4.5
100% (1)
Shankar 7.4.5
3 pages
Gaitsgory's Harvard Math 122 Notes PDF
No ratings yet
Gaitsgory's Harvard Math 122 Notes PDF
116 pages
IMOmath - Functional Equations - Problems With Solutions
No ratings yet
IMOmath - Functional Equations - Problems With Solutions
15 pages
Ee263 Ps1 Sol
No ratings yet
Ee263 Ps1 Sol
11 pages
111aass4 Solutions
No ratings yet
111aass4 Solutions
7 pages
Solutions PDF
No ratings yet
Solutions PDF
19 pages
Exercise 1 2022
No ratings yet
Exercise 1 2022
3 pages
Quantum Field Theory by Peskin - Chap15 Solution
No ratings yet
Quantum Field Theory by Peskin - Chap15 Solution
7 pages
Midterm Exam Solutions
100% (1)
Midterm Exam Solutions
26 pages
Real Analysis HW 2 Solutions
No ratings yet
Real Analysis HW 2 Solutions
4 pages
Rudin
No ratings yet
Rudin
26 pages
Modes of Convergence: N N N N
No ratings yet
Modes of Convergence: N N N N
4 pages
HW10 Sol
100% (1)
HW10 Sol
7 pages
Maths Methods Week 1: Vector Spaces
No ratings yet
Maths Methods Week 1: Vector Spaces
100 pages
Mathematical Techniques: Revision Notes: DR A. J. Bevan
No ratings yet
Mathematical Techniques: Revision Notes: DR A. J. Bevan
5 pages
Algebra (Dummit) HW 1
No ratings yet
Algebra (Dummit) HW 1
15 pages
QED ICTP Note0
No ratings yet
QED ICTP Note0
113 pages
Linear Dynamical Systems - Course Reader
No ratings yet
Linear Dynamical Systems - Course Reader
414 pages
On The Exponential Solution of Differential Equations For A Linear Operator PDF
No ratings yet
On The Exponential Solution of Differential Equations For A Linear Operator PDF
25 pages
Math 202 - Homework #6, 14.7 Solutions: N N N N
100% (1)
Math 202 - Homework #6, 14.7 Solutions: N N N N
3 pages
Mathematical Methods (Second Year) MT 2009: Problem Set 5: Partial Differential Equations
No ratings yet
Mathematical Methods (Second Year) MT 2009: Problem Set 5: Partial Differential Equations
4 pages
Elementary Symmetric Polynomial
No ratings yet
Elementary Symmetric Polynomial
4 pages
George Arfken - Solutions - 07
No ratings yet
George Arfken - Solutions - 07
5 pages
Limit Sequence
No ratings yet
Limit Sequence
2 pages
Mathematical Techniques: Revision Notes: DR A. J. Bevan
No ratings yet
Mathematical Techniques: Revision Notes: DR A. J. Bevan
8 pages
Integral Equations
75% (4)
Integral Equations
53 pages
Mathematical Methods (Second Year) MT 2009 Problem Set 2: Linear Algebra II
No ratings yet
Mathematical Methods (Second Year) MT 2009 Problem Set 2: Linear Algebra II
3 pages
Cha1 - 4 (Ashcroft and Mermin
25% (4)
Cha1 - 4 (Ashcroft and Mermin
2 pages
Statistical Learning Theory: 18.657: Mathematics of Machine Learning
No ratings yet
Statistical Learning Theory: 18.657: Mathematics of Machine Learning
9 pages
3.1 Binary Classification
No ratings yet
3.1 Binary Classification
4 pages
MIT18 657F15 LecNote PDF
No ratings yet
MIT18 657F15 LecNote PDF
194 pages
Lec11 Handout
No ratings yet
Lec11 Handout
86 pages
Classification
No ratings yet
Classification
19 pages
Lecture 1.2
No ratings yet
Lecture 1.2
7 pages
Linear Algebra 3ed Greub WH
No ratings yet
Linear Algebra 3ed Greub WH
448 pages
Barilari GeoDiff
No ratings yet
Barilari GeoDiff
145 pages
bAppM 2021 deRidderL
No ratings yet
bAppM 2021 deRidderL
62 pages
Angle Calculations For 3-And 4 - Circle X-Ray and Neutron Diffraetometers
No ratings yet
Angle Calculations For 3-And 4 - Circle X-Ray and Neutron Diffraetometers
8 pages
Few-Body Problems: The Motion of The Planets
No ratings yet
Few-Body Problems: The Motion of The Planets
36 pages
2.lecture2 Ate
No ratings yet
2.lecture2 Ate
61 pages
Lecture Notes On Ridge Regression
No ratings yet
Lecture Notes On Ridge Regression
113 pages
Gradient Based Optimization
No ratings yet
Gradient Based Optimization
24 pages
Single Tablet Press
No ratings yet
Single Tablet Press
11 pages
Chaos Theory and Strategy: Theory, Application, and Managerial Implications
No ratings yet
Chaos Theory and Strategy: Theory, Application, and Managerial Implications
12 pages
Using GW-BASIC For Drawing Mandelbrot Sets
No ratings yet
Using GW-BASIC For Drawing Mandelbrot Sets
3 pages
Chapter 6 - Behavioural Modelling
No ratings yet
Chapter 6 - Behavioural Modelling
42 pages
Mechanical Engineering - Yale University
No ratings yet
Mechanical Engineering - Yale University
7 pages
Vorticity and Circulation
No ratings yet
Vorticity and Circulation
5 pages
Intermediate Singa
No ratings yet
Intermediate Singa
20 pages
ENME 332, Spring 2013 Transfer Processes: Instructors: Reinhard Radermacher & Bao Yang
No ratings yet
ENME 332, Spring 2013 Transfer Processes: Instructors: Reinhard Radermacher & Bao Yang
18 pages
The System Building Blueprint
No ratings yet
The System Building Blueprint
8 pages
Texlon Manual
No ratings yet
Texlon Manual
20 pages
Advanced Thermodynamics: Exergy / Availability
No ratings yet
Advanced Thermodynamics: Exergy / Availability
64 pages
Bmos Mentoring Scheme (Senior Level) December 2012 (Sheet 3) Solutions
No ratings yet
Bmos Mentoring Scheme (Senior Level) December 2012 (Sheet 3) Solutions
6 pages
Applied Acoustics: Masoud Golzari, Ali Asghar Jafari
No ratings yet
Applied Acoustics: Masoud Golzari, Ali Asghar Jafari
22 pages
Handout Week 2 Loads On Structures CE415
No ratings yet
Handout Week 2 Loads On Structures CE415
79 pages
(Said S.E.H. Elnashaie, Parag Garhyan) Conservatio
No ratings yet
(Said S.E.H. Elnashaie, Parag Garhyan) Conservatio
661 pages
c03 Crypto DES AES Utc
No ratings yet
c03 Crypto DES AES Utc
72 pages
Fir Compiler Xilinx
No ratings yet
Fir Compiler Xilinx
85 pages
Acr Math Month Culmination
No ratings yet
Acr Math Month Culmination
4 pages
PPT1
No ratings yet
PPT1
93 pages
Vectors-Model Paper-01
No ratings yet
Vectors-Model Paper-01
4 pages
Cable Stay Bridge Modelling
No ratings yet
Cable Stay Bridge Modelling
7 pages
Upsc Cse Free Test Series 2024
No ratings yet
Upsc Cse Free Test Series 2024
6 pages
Phy 1
100% (1)
Phy 1
50 pages
Multiobjective Optimization
No ratings yet
Multiobjective Optimization
36 pages
Fefef
No ratings yet
Fefef
3 pages
Weekly Learning Activity Sheets Proving Triangle Congruence: Take Note
No ratings yet
Weekly Learning Activity Sheets Proving Triangle Congruence: Take Note
6 pages
Implementation of MAC Unit Using Booth Multiplier & Ripple Carry Adder
No ratings yet
Implementation of MAC Unit Using Booth Multiplier & Ripple Carry Adder
3 pages
6 Itf DPP Genetry
No ratings yet
6 Itf DPP Genetry
12 pages