Wcep2021 Notes
Soumendu Sundar Mukherjee
Indian Statistical Institute, Kolkata
April 6, 2021
Warning: These course notes (for M.Stat. second year students) have not been subject to very careful scrutiny, and so there may be various typos/mistakes here and there. Please let me know at soumendu041@gmail.com if you find one.
Contents
1 Review of metric topology
  1.1 Metric spaces
  1.2 Open and closed sets
  1.3 Continuity
  1.4 Limit points and accumulation points
  1.5 Denseness
  1.6 Convergence, Cauchy sequences and completeness
  1.7 Compactness
  1.8 Connectedness
1 Review of metric topology
Let ρ(x, y) = 0 if x = y and ρ(x, y) = 1 if x ≠ y. Then ρ is called the discrete metric, and (M, ρ) is called a discrete metric space.
Example 1.4. Let M = C[0, 1], the space of all real-valued continuous functions on [0, 1]. M becomes a metric space with the metric
ρ(f, g) = ( ∫_0^1 |f(t) − g(t)|^p dt )^{1/p},
where p ≥ 1.
C[0, 1] with the uniform metric will be a central object in this course. We will be interested in various topological properties of this metric space.
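As a quick numerical aside (not part of the notes' formal development), the uniform metric ρ(f, g) = sup_{t∈[0,1]} |f(t) − g(t)| can be approximated by maximizing over a fine grid; the sketch below assumes Lipschitz inputs, so the grid maximum is close to the true supremum.

```python
# Grid approximation of the uniform (sup) metric on C[0,1].
# A sketch only: a grid maximum underestimates the true supremum,
# but for Lipschitz functions the error is of the order of the mesh size.

def uniform_dist(f, g, n_grid=10_001):
    """Approximate sup_{t in [0,1]} |f(t) - g(t)| on an equispaced grid."""
    return max(abs(f(i / (n_grid - 1)) - g(i / (n_grid - 1)))
               for i in range(n_grid))

# |t - t^2| is maximized at t = 1/2, so the distance is 1/4.
print(uniform_dist(lambda t: t, lambda t: t * t))  # 0.25
```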
Example 1.5. Let M = (V, E) be a finite graph and define ρ(v_1, v_2) to be the length of a shortest path between v_1 and v_2. If no such path exists, we define ρ(v_1, v_2) = +∞, in accordance with the convention inf ∅ = +∞.
More structure, such as a norm or an inner product, can be given beyond just a metric when M is also a linear space (vector space).
How do open and closed sets behave under unions and intersections?
• Arbitrary union (resp. intersection) of open (resp. closed) sets is open (resp.
closed).
• Finite intersection (resp. union) of open (resp. closed) sets is open (resp. closed).
Let A ⊂ M. The closure of A, denoted by Ā, is the set ∩_{F⊃A, F closed} F, i.e. Ā is the "smallest" closed set containing A. The interior of A, denoted by A°, is the set ∪_{G⊂A, G open} G, i.e. A° is the "largest" open set contained in A. The boundary of A, denoted by ∂A, is the set Ā \ A°.
Exercise 1.6. Prove or disprove: (a) the closure of B(x; e) equals B[x; e], (b) B[x; e]° = B(x; e), and (c) ∂B[x; e] = ∂B(x; e) = S(x; e) := {y ∈ M | ρ(y, x) = e}.
Exercise 1.7. Show that every open (resp. closed) set is Fσ (resp. Gδ ).
Exercise 1.8. What are all the open (closed) sets in a discrete metric space?
1.3 Continuity
Let M1 , M2 be metric spaces, and f : M1 → M2 . We say that f is continuous at x if,
for any neighbourhood V of f ( x ) in M2 , there exists a neighbourhood U of x in M1
such that U ⊂ f −1 (V ).
Exercise 1.10. Show that f : M1 → M2 is continuous if and only if, for all open V ⊂ M2 ,
f −1 (V ) is open in M1 .
Side remark. If, for all open U ⊂ M_1, f(U) is open in M_2, then we call f an open map. An invertible map that is continuous as well as open is called a homeomorphism.
1.5 Denseness
X ⊂ M is dense in M if, for all x ∈ M and for any open neighbourhood U_x of x, U_x ∩ X ≠ ∅. A metric space is called separable if it contains a countable dense subset.
Exercise 1.20. Show that C [0, 1] with the uniform metric is separable. (Hint: Stone-Weierstrass!)
Why do we care about separability? Well, measurability plays well with respect to
countable operations. So if one wants to do measure theory on metric spaces, sepa-
rability helps to restrict attention to countable operations (such as taking sup over a
countable dense set).
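Exercise 1.20 can be made concrete via Bernstein polynomials, one standard route through Stone–Weierstrass: B_n f(x) = Σ_{k=0}^n f(k/n) C(n,k) x^k (1−x)^{n−k} converges uniformly to f, and perturbing the coefficients to rationals yields a countable dense set. A numerical sketch (not part of the notes):

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial of f at x in [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

# For f(t) = t^2 one can show B_n f(x) = x^2 + x(1 - x)/n, so the uniform
# error is exactly 1/(4n); here n = 20 gives 1/80 = 0.0125.
f = lambda t: t * t
err = max(abs(bernstein(f, 20, i / 200) - f(i / 200)) for i in range(201))
print(err)  # 0.0125 (up to floating point)
```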
Exercise 1.21. Show that all convergent sequences are Cauchy. Give a counterexample to the
converse statement.
Exercise 1.22. What are all the Cauchy sequences in a discrete metric space?
Exercise 1.25. Any closed subset of a complete metric space is itself complete.
The fact that C [0, 1] is complete with the uniform metric will be very useful to us.
Complete, separable metric spaces are also called Polish spaces. Thus C [0, 1] is a Polish
space under the uniform metric.
1.7 Compactness
X ⊂ M is called compact if any open cover {Uα }α∈ I admits a finite subcover. X is
called bounded if ∃ an open ball B( x, e) ⊃ X.
Exercise 1.27. Let diam( A) := supx,y∈ A ρ( x, y) be the diameter of A. Show that A is
bounded if and only if A is either empty or diam( A) is finite.
Exercise 1.28. If X is compact, then it is closed and bounded. In Euclidean spaces the converse
also holds (the so-called Heine-Borel theorem).
Exercise 1.29. What are the compact subsets of a discrete metric space?
Exercise 1.30. Suppose K ⊂ M_1 is compact and f : M_1 → M_2 is continuous. Then f is uniformly continuous on K.
Exercise 1.31. Suppose K ⊂ M_1 is compact and f : M_1 → M_2 is continuous. Then f(K) is compact (hence bounded).
X ⊂ M is called totally bounded if, for any e > 0, ∃ points x_1, ..., x_k ∈ M, where k = k(e) ≥ 1, such that X ⊂ ∪_{i=1}^k B(x_i; e).²
Exercise 1.32. Prove that the closure of a totally bounded set is totally bounded.
For metric spaces there are equivalent characterisations of compactness.
Lemma 1.33. The following are equivalent:
(a) K is compact.
(d) Every sequence in K has a convergent subsequence (this is also called sequential com-
pactness).
A set X ⊂ M is called precompact or relatively compact if X̄ is compact.
Exercise 1.34. In complete metric spaces, the precompact sets are precisely the totally bounded
sets.
Exercise 1.35. Is the closed ball B[ x; e] compact?
As we will be working with C [0, 1] extensively, let us characterize its compact sub-
sets. For this we need to define the notion of equicontinuity of a family of functions.
Let F be a family of functions from M1 to M2 . F is called equicontinuous at x ∈ M1 ,
if, for any e > 0, ∃δ = δ( x, e) such that ρ1 (y, x ) < δ =⇒ ρ2 ( f (y), f ( x )) < e for all
f ∈ F . If F is equicontinuous at all x ∈ M1 , then we simply say that the family F is
equicontinuous.
Exercise 1.36. Define the notion of uniform equicontinuity and then show that if M1 is
compact, then F is uniformly equicontinuous.
² Such a finite collection of e-balls covering a space is called an e-net. We will encounter them time and again when we start talking about empirical process theory.
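The footnote's e-nets are easy to build greedily for finite sets: keep adding any point that is not yet within e of the current net. A sketch for finite subsets of R² with the Euclidean metric (the function names are mine, not the notes'):

```python
from math import dist

def greedy_net(points, eps):
    """Greedily build an eps-net of a finite point set in R^d (Euclidean metric)."""
    net = []
    for p in points:
        if all(dist(p, q) >= eps for q in net):
            net.append(p)
    return net

# A finite grid in [0,1]^2 is totally bounded; every point ends up
# within eps of some net point, by construction.
pts = [(i / 10, j / 10) for i in range(11) for j in range(11)]
net = greedy_net(pts, 0.3)
assert all(any(dist(p, q) < 0.3 for q in net) for p in pts)
print(len(net))
```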
Theorem 1.40 (Arzelà–Ascoli for C[0,1]). S ⊂ C[0,1] has compact closure if and only if the following two conditions hold: (i) sup_{x∈S} |x(0)| < ∞, and (ii) lim_{δ→0} sup_{x∈S} w_x(δ) = 0, where w_x(δ) := sup_{|s−t|≤δ} |x(s) − x(t)| is the modulus of continuity of x.

1.8 Connectedness
Not needed for now.

2 Probability measures on metric spaces

2.1 Preliminaries
Lemma 2.1 (Regularity of probability measures). Every probability measure µ on (M, B)
is regular, i.e. for any Borel set A, and for any e > 0, one can find G open and F closed such
that F ⊂ A ⊂ G, and µ( G \ F ) < e.
Proof. Use the "good set" principle. Show that the statement is true if A is a closed set, and then show that the set of all such sets forms a σ-field. Fill in the details.
Notation: Let µ be a measure and f ∈ C_b(M). Then µf := ∫ f dµ. For probability measures P, Pf is thus the same as E(f(X)), where X ∼ P.
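The notation Pf = E f(X) suggests a simple Monte Carlo approximation, namely the empirical average (1/n) Σ f(X_i) that reappears in Section 5. A sketch (my own illustration) with P = Uniform(0,1) and f(x) = x², where Pf = 1/3:

```python
import random

def mc_estimate(f, sampler, n, seed=0):
    """Monte Carlo estimate of Pf = E f(X), where X ~ P is drawn by `sampler`."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

est = mc_estimate(lambda x: x * x, lambda rng: rng.random(), 100_000)
print(est)  # close to 1/3
```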
Proof. For a general Borel set A, by regularity, one can construct two sequences of closed sets F_i^n ⊂ A such that P_i(F_i^n) → P_i(A), i = 1, 2. But then P_i(F_1^n ∪ F_2^n) → P_i(A), i = 1, 2, which implies that P_1(A) = P_2(A).
Lemma 2.3. C_b(M), the class of bounded continuous functions on M, is probability determining. The same goes for C_b^u(M), the class of bounded uniformly continuous functions on M.
Lemma 2.4. Any π-system A generating the Borel σ-field B of M is a determining class of
sets.
Exercise 2.5. Probability measures on σ-compact metric spaces (i.e. metric spaces that are countable unions of compact sets) are tight.
Now let e ↓ 0.
"(c) ⟺ (d)": For the "⟹" direction, note that, by (b), lim sup P_n(G^c) ≤ P(G^c), i.e. lim sup(1 − P_n(G)) ≤ 1 − P(G), i.e. 1 − lim inf P_n(G) ≤ 1 − P(G), which is the same as what we want. Similarly for "(d) ⟹ (c)".
P( Ā) ≥ lim sup Pn ( Ā) ≥ lim sup Pn ( A) ≥ lim inf Pn ( A) ≥ lim inf Pn ( A◦ ) ≥ P( A◦ ).
"(e) ⟹ (a)": Let f ∈ C_b(M). By scaling and translating suitably, we may assume that 0 < f < 1. Now Pf = ∫_0^1 P(f > t) dt. Note that P(f > t) = 1 − F_{f(X)}(t), where F_{f(X)} is the CDF of the (real-valued) random variable f(X). As the set of jumps of a CDF is at most countable, and ∂{f > t} ⊂ {f = t} (by continuity of f), we conclude that P_n(f > t) → P(f > t) for all but countably many t. Hence an application of the BCT gives P_n f = ∫_0^1 P_n(f > t) dt → ∫_0^1 P(f > t) dt = Pf.
Recall the statement: x_n → x if and only if every subsequence of x_n contains a further subsequence that converges to x. We have an analogue of this for weak convergence.
Lemma 2.8. P_n →w P if and only if every subsequence of {P_n} contains a further subsequence that converges weakly to P.
Proof. Only the "if" part is non-trivial. Let f ∈ C_b(M). Consider the real sequence {P_n f}. Take a subsequence {P_{n_k} f}. By hypothesis, for the subsequence {P_{n_k}} of {P_n}, there exists a further subsequence {P_{n_{k_ℓ}}} that converges weakly to P. Hence {P_{n_{k_ℓ}} f} converges to Pf. So any subsequence {P_{n_k} f} of {P_n f} contains a further subsequence {P_{n_{k_ℓ}} f} that converges to Pf. It follows that P_n f → Pf, and we are done.
Exercise 2.9. Show that δ_{x_n} →w δ_x if and only if x_n → x.
Exercise 2.10. Show that, if δ_{x_n} →w P, then P = δ_x for some x.
A class of sets A ⊂ B is called convergence determining if P_n(A) → P(A) for all P-continuity sets³ A ∈ A implies that P_n →w P.
Exercise 2.11. Any convergence determining class is a probability determining class.
Lemma 2.12. Let A be a π-system such that every open set G can be written as a countable union of sets from A. Then P_n(A) → P(A) for all A ∈ A implies that P_n →w P.
³ Such sets A are called P-continuity sets.
Lemma 2.13. Suppose M is separable. Let A be a π-system such that for any x ∈ M and e > 0, there exists A_{x,e} ∈ A with x ∈ A°_{x,e} ⊂ A_{x,e} ⊂ B(x; e). Then P_n(A) → P(A) for all A ∈ A implies that P_n →w P.
Let A x,e be the class of sets A such that x ∈ A◦ ⊂ A ⊂ B( x; e). Let ∂A x,e = {∂A |
A ∈ A x,e }.
Lemma 2.14. Suppose M is separable. Let A be a π-system such that ∂A x,e either contains
∅ or has uncountably many disjoint elements. Then A is a convergence determining class.
Example 2.15. On Rd , rectangles of the form ( a, b] = ( a1 , b1 ] × · · · × ( ad , bd ] form a con-
vergence determining class.
Example 2.16. On R^∞, finite dimensional cylinders π_{i_1,...,i_d}^{-1}((a, b]) form a convergence determining class.
Example 2.17. On C[0,1], finite dimensional cylinders π_{t_1,...,t_k}^{-1}((a, b]) do not form a convergence determining class. However, they do form a determining class. To see this, note that the sequence of functions z_n(t) = nt I(t ≤ 1/n) + (2 − nt) I(1/n < t ≤ 2/n) converges pointwise to the 0 function. However, ‖z_n‖_∞ = 1. So δ_{z_n} does not converge weakly to δ_0. However, since π_{t_1,...,t_k} z_n = (z_n(t_1), ..., z_n(t_k)) = (0, ..., 0) = π_{t_1,...,t_k} 0 for all large enough n, the finite dimensional distributions of δ_{z_n} converge to those of δ_0 (prove this rigorously).
where the last inequality follows from the set inclusion cl(h^{-1}(F)) ∩ D_h^c ⊂ h^{-1}(F), which is true because, if x belongs to the LHS, then there is a sequence x_n such that h(x_n) ∈ F and x_n → x. But, since x ∈ D_h^c, this means that h(x_n) → h(x). As F is closed, h(x) ∈ F, i.e. x ∈ h^{-1}(F).
Exercise 2.19. Another tool for your portmanteau: P_n →w P if and only if, for any bounded measurable function h with P(D_h) = 0, one has P_n h → Ph.
Exercise 2.20. Using the previous exercise, give an alternative proof of the mapping theorem.
Lemma 2.22. Consider probability measures on C[0,1]. If {P_n} is relatively compact, and all finite dimensional distributions converge weakly, then there exists a probability measure P on C[0,1] such that P_n →w P.
Proof. Consider a subsequence {P_{n_ℓ}} converging weakly to some probability measure Q. Note then that P_{n_ℓ} ∘ π_{t_1,...,t_k}^{-1} →w Q ∘ π_{t_1,...,t_k}^{-1}. It follows that P_n ∘ π_{t_1,...,t_k}^{-1} →w Q ∘ π_{t_1,...,t_k}^{-1}. As finite dimensional cylinders form a determining class, it also follows that all subsequential weak limits of P_n are equal to Q. Since, by relative compactness, any subsequence has a further subsequence that converges weakly to Q, we conclude that P_n →w Q.
How does one show relative compactness of a family? A sufficient condition is
tightness of the family—a family Π is called tight if for any e > 0 there exists a compact
set Ke such that P(Ke ) > 1 − e for all P ∈ Π.
Proof. The proof is similar to the proof of Lemma 2.6, which establishes tightness of a single measure. It suffices to show that if M can be written as an increasing union of open sets G_i, then for any e > 0 there exists n_e ≥ 1 such that P(G_{n_e}) > 1 − e for all P ∈ Π. Assume not; then, for some e_0 > 0, we can find a sequence {P_n} in Π such that P_n(G_n) ≤ 1 − e_0. By relative compactness, there is a subsequence {P_{n_k}} converging weakly to some probability measure Q. Now, for any m ≥ 1,
Q(G_m) ≤ lim inf P_{n_k}(G_m) ≤ lim inf P_{n_k}(G_{n_k}) ≤ 1 − e_0.
Let m → ∞ to get Q(M) ≤ 1 − e_0, a contradiction.
To use this, write M = ∪_{n≥1} B(x_n; 1/k), where {x_1, x_2, ...} is a countable dense set. There is n_k such that P(∪_{n=1}^{n_k} B(x_n; 1/k)) > 1 − e/2^k for all P ∈ Π. Hence, the totally bounded set A = ∩_{k≥1} ∪_{n=1}^{n_k} B(x_n; 1/k) satisfies P(A) > 1 − e for all P ∈ Π. Ā is the required compact set.
3 Weak convergence in C [0, 1]
Now we already know that tightness plus convergence of finite dimensional distri-
butions implies weak convergence. Combining this with the above lemma we can say
the following.
Lemma 3.2. Let {P_n} be a sequence of Borel probability measures on C[0,1]. Suppose
(i) P_n ∘ π_{t_1,...,t_k}^{-1} converges weakly, for any t_1, ..., t_k ∈ [0,1], k ≥ 1;
(ii) for all e > 0,
lim_{δ→0} lim sup_{n→∞} P_n(w_x(δ) ≥ e) = 0.
Then there is a Borel probability measure P on C[0,1] such that P_n →w P. The converse is also true.
When not necessary, we will drop ω for notational simplicity. Note that we can compactly write X_n(t) as
X_n(t) = S_{⌊nt⌋}/(σ√n) + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/(σ√n).
Exercise 3.3. Show that Xn defined above is indeed a random function, i.e., Xn is A/B(C [0, 1])
measurable.
We will show that {P_n = P ∘ X_n^{-1}} is tight and that P_n ∘ π_{t_1,...,t_k}^{-1} converges weakly to a certain k-variate Gaussian, for any t_1, ..., t_k ∈ [0,1], k ≥ 1. This will establish the existence of a measure on C[0,1] with Gaussian finite dimensional distributions.
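The interpolation formula for X_n is easy to code; a minimal sketch, checked with deterministic increments so the values are exact (in Donsker's theorem the ξ_i would be i.i.d. with mean 0 and variance σ²):

```python
from math import floor, sqrt

def X_n(t, xi, sigma=1.0):
    """Linearly interpolated partial-sum process at time t in [0, 1].

    xi = (xi_1, ..., xi_n) are the increments; S_k = xi_1 + ... + xi_k.
    """
    n = len(xi)
    k = floor(n * t)
    S_k = sum(xi[:k])
    frac = n * t - k
    jump = xi[k] if k < n else 0.0  # xi_{k+1}; its coefficient vanishes at t = 1
    return (S_k + frac * jump) / (sigma * sqrt(n))

# With all increments equal to +1, S_k = k, so X_n(k/n) = k / sqrt(n).
xi = [1.0] * 100
print(X_n(0.5, xi), X_n(1.0, xi))  # 5.0 10.0
```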
Theorem 3.4 (Existence of Wiener measure and Donsker's theorem). There exists, on C[0,1], a Borel probability measure W such that, for any t_1, ..., t_k ∈ [0,1], k ≥ 1, one has W ∘ π_{t_1,...,t_k}^{-1} ≡ N(0, Σ^{(t_1,...,t_k)}), where Σ^{(t_1,...,t_k)}_{ij} = t_i ∧ t_j. Moreover, P_n →w W.ᵃ
ᵃ W is called the Wiener measure after mathematician Norbert Wiener, who was the first to construct Brownian motion rigorously.
Let 0 = t_0 < t_1 < · · · < t_{k−1} < t_k = 1 be δ-sparse, i.e. min_{1<i<k}(t_i − t_{i−1}) ≥ δ. Then
It follows that
P_n(w_x(δ) ≥ e) ≤ Σ_{1≤i≤k} P_n( sup_{t_{i−1} ≤ s ≤ t_i} |x(s) − x(t_{i−1})| ≥ e/3 ).
max_{nt_{i−1} ≤ j ≤ nt_i} |S_j − S_{nt_{i−1}}|/(σ√n) =d max_{1≤j≤m} |S_j|/(σ√n),
By the CLT, |S_j|/(σ√j) →d |Z|. Therefore
P( |S_j|/(σ√j) > e/(9√(2δ)) ) → P( |Z| > e/(9√(2δ)) ).
Therefore there exists j_0 = j_0(e, δ) such that, for all j ≥ j_0, one has
P( |S_j|/(σ√j) > e/(9√(2δ)) ) ≤ 2 P( |Z| > e/(9√(2δ)) ) ≤ 2 (9√2)^4 δ² EZ⁴ / e⁴ = C_1 δ²/e⁴.
⁴ This says that, for independent random variables ξ_1, ..., ξ_n, one has P(max_{1≤i≤n} |S_i| ≥ 3α) ≤ 3 max_{1≤i≤n} P(|S_i| ≥ α).
for some constant C_2 > 0. Thus we get, for all large enough n, that
P_n(w_x(δ) ≥ e) ≤ (6/δ) max{ C_1 δ²/e⁴, C_2 j_0/(e² n) } = max{ C_3 δ/e⁴, C_4 j_0/(e² n δ) },
for constants C_3, C_4 > 0. It follows that
lim_{δ→0} lim sup_{n→∞} P_n(w_x(δ) ≥ e) = 0.
This verifies condition (ii) of Lemma 3.2. This proves, simultaneously, the existence of W and that P_n →w W.
A random function B on C[0,1], whose distribution is the Wiener measure, is called a standard Brownian motion on [0,1]. Thus we have proved that X_n →d B.
Corollary 3.5 (Donsker). Let h : C[0,1] → R be such that W(D_h) = 0. Then h(X_n) →d h(B).
2. It is possible to find the joint distribution of (m, M, B(1)) using Donsker's theorem, where m = min_{t∈[0,1]} B(t), M = max_{t∈[0,1]} B(t). Indeed, if m_n = min_{0≤k≤n} S_k, M_n = max_{0≤k≤n} S_k, then
(1/(σ√n)) (m_n, M_n, S_n) →d (m, M, B(1)),
because the functional x ↦ (min_{t∈[0,1]} x(t), max_{t∈[0,1]} x(t), x(1)) is continuous and the piecewise linear X_n attains its extremes at the grid points k/n.
By using properties of the SSRW, one can derive the limit distribution of (1/(σ√n))(m_n, M_n, S_n) to show that, for a < 0 < b, a < x < y < b,
P( a < m ≤ M < b, x < B(1) < y) (1)
= ∑ P(x + 2k(b − a) < Z < y + 2k(b − a))
k∈ Z
− ∑ P(2b − y + 2k(b − a) < Z < 2b − x + 2k(b − a)),
k∈ Z
≤ E[ w_h(‖B − B^br‖_∞) | |B(1)| ≤ e ]
≤ w_h(e).
Now let e → 0.
⁵ A stochastic process X_t on [0,1] is called a Gaussian process if all of its finite dimensional distributions are (multivariate) Gaussians. Thus both B and B^br are examples of Gaussian processes.
P( sup_{t∈[0,1]} |B^br(t)| ≤ b ) = Σ_{k∈Z} (−1)^k e^{−2k²b²}.
Hence, letting e → 0 and using Lemma 3.9 and the mapping theorem (why is the interchange of sum and limit justified?), the probability above equals
Σ_{k∈Z} e^{−2(2k)²b²} − Σ_{k∈Z} e^{−2(2k+1)²b²} = Σ_{k∈Z} (−1)^k e^{−2k²b²}.
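This alternating series (the CDF of the Kolmogorov distribution) converges extremely fast, so it is easy to evaluate numerically; a quick sketch:

```python
from math import exp

def bridge_sup_cdf(b, terms=50):
    """P(sup_{t in [0,1]} |B_br(t)| <= b) = sum_{k in Z} (-1)^k exp(-2 k^2 b^2)."""
    return sum((-1) ** k * exp(-2 * k * k * b * b)
               for k in range(-terms, terms + 1))

# Only a handful of terms matter; the classical value at b = 1:
print(bridge_sup_cdf(1.0))  # ≈ 0.7300
```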
In this case, one says that the distribution of Y_i undergoes a change in mean at epoch τ, which is called a changepoint. Suppose that we wish to test H_0 : θ = θ_0, i.e. that there is no change. A natural test statistic for this is the CUSUM statistic:
T_n^{(α,β,δ)} := max_{αn ≤ j ≤ βn} [ (j/n)(1 − j/n) ]^δ ( (1/j) Σ_{i≤j} Y_i − (1/(n−j)) Σ_{i>j} Y_i ),
Exercise 3.9. Show that for s, t ∈ [0, ∞), cov( B̃(s), B̃(t)) = s ∧ t.
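Exercise 3.9 can be sanity-checked numerically, assuming B̃(t) = (1 + t) B^br(t/(1 + t)) (the map Ψ discussed below, applied to the Brownian bridge) and the standard bridge covariance cov(B^br(u), B^br(v)) = u ∧ v − uv:

```python
# Check that the time-inverted bridge has the Brownian covariance s ^ t.
# Assumption (mine): Btilde(t) = (1 + t) * B_br(t / (1 + t)), with
# cov(B_br(u), B_br(v)) = min(u, v) - u * v.

def btilde_cov(s, t):
    u, v = s / (1 + s), t / (1 + t)
    return (1 + s) * (1 + t) * (min(u, v) - u * v)

for s, t in [(0.2, 0.7), (1.0, 3.0), (5.0, 5.0)]:
    assert abs(btilde_cov(s, t) - min(s, t)) < 1e-9
print("covariance identity verified on sample points")
```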
Question: What would be the right topology on C[0,∞) for this to make sense? That is, if Ψ is the map taking x ∈ C[0,1] to Ψ(x)(t) = (1 + t) x(t/(1 + t)) ∈ C[0,∞), then what topology on C[0,∞) would make this map Borel measurable, so that one can define Wiener measure on C[0,∞) as W^br ∘ Ψ^{−1}?
It turns out that the right topology is the topology of uniform convergence on compact sets. This is metrizable, e.g., using the following metric:
ρ(x, y) = Σ_{k≥1} 2^{−k} min{ sup_{t∈[0,k]} |x(t) − y(t)|, 1 }.
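A truncated version of this metric is easy to compute; the sketch below approximates each sup over [0, k] by a grid maximum and truncates the series at K terms (so it is exact only up to 2^{−K} and grid error):

```python
def rho(x, y, K=40, grid=200):
    """Truncated/grid version of the metric for uniform convergence on compacts."""
    total = 0.0
    for k in range(1, K + 1):
        sup_k = max(abs(x(i * k / grid) - y(i * k / grid)) for i in range(grid + 1))
        total += 2.0 ** (-k) * min(sup_k, 1.0)
    return total

# The constant functions 0 and 1 differ by 1 on every [0, k], so rho = sum_k 2^{-k} = 1.
print(rho(lambda t: 0.0, lambda t: 1.0))  # ≈ 1.0 (up to the 2^{-K} truncation)
```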
Exercise 3.10. Show that C [0, ∞), equipped with ρ, is a Polish space.
Exercise 3.11. Ψ is continuous. In fact, ρ(Ψ(x), Ψ(y)) ≤ C ‖x − y‖_∞, for some universal constant C > 0.
Thus we can define the Wiener measure on C[0,∞) as the pushforward W^br ∘ Ψ^{−1} of W^br via the map Ψ.
An alternative route is to characterize tightness on (C [0, ∞), ρ) using a suitable
Arzelà-Ascoli theorem and then prove weak convergence of the following random
functions
X̃_n(t) = S_{⌊nt⌋}/(σ√n) + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/(σ√n),  t ≥ 0.
The limit, naturally, would be Brownian motion on C [0, ∞). A treatment of this way
of constructing B̃ can be found in Section 2.4 of Karatzas and Shreve (2012).
4 Weak convergence in D [0, 1]
In this notation,
w_x(δ) = sup_{t∈[0,1−δ]} w_x([t, t + δ]).
Since functions in D[0,1] may have jumps, w_x(δ) may be arbitrarily large! The problem is that we cannot allow every t in the supremum in the definition of w_x(δ); we have to avoid the jump points.
Exercise 4.1. Construct a sequence of functions xn ∈ D such that lim infn wxn (δ) = ∞, for
any δ > 0.
Recall that a δ-sparse partition σ = {t_i} of [0,1] is a set of points 0 = t_0 < t_1 < · · · < t_k = 1 such that min_i (t_i − t_{i−1}) ≥ δ. Define
Lemma 4.3. Let e > 0. There exists a δ-sparse partition {ti } such that
Now the above lemma also implies that any x ∈ D has at most countably many jump
discontinuities and that k x k∞ < ∞.
Given a partition σ = {t_i}, consider the map S_σ : D[0,1] → D[0,1] taking x to the simple function
S_σ x(t) = Σ_i x(t_{i−1}) I_{[t_{i−1}, t_i)}(t) + x(1) I_{{1}}(t).
It is clear that
‖x − S_σ x‖_∞ ≤ max_i w_x([t_{i−1}, t_i)).
Therefore, one can find a sequence of partitions σ_n such that S_{σ_n} x → x uniformly. Thus every càdlàg function, being a uniform limit of (Borel measurable) simple functions, is itself Borel measurable.
Let Λ be the group of continuous strictly increasing surjective functions λ on [0, 1].
Clearly, λ(0) = 0, λ(1) = 1. We will think of λ’s as deformations of the time scale.
Going back to the previous example, we want to devise a metric that not only compares the values of I_{[0, a+1/n)}(t) and I_{[0, a)}(t), but also their supports. The λ's will help us accomplish the latter. Define
d(x, y) = inf_{λ∈Λ} ( ‖x − y ∘ λ‖_∞ ∨ ‖λ − I‖_∞ ).
5 More on Empirical Processes
(P_n − P) f = P_n f − P f = (1/n) Σ_{i=1}^n f(ξ_i) − P f.
If we take F = {1(−∞,t] (·) | t ∈ R}, then we recover the uniform empirical process
studied earlier.
A classical result in statistics is the Glivenko–Cantelli theorem:
sup_{t∈R} |F_n(t) − F(t)| → 0 a.s.,
which can be thought of as a uniform law of large numbers (because the SLLN only guarantees that F_n(t) − F(t) → 0 a.s. for each fixed t).
In this section, we will develop methods to prove such uniform laws for more gen-
eral empirical processes.
To this end, call a class F of measurable functions P-Glivenko–Cantelli if
‖P_n − P‖_F := sup_{f∈F} |P_n f − P f| → 0 a.s.
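For F = {1_{(−∞,t]}}, the quantity ‖P_n − P‖_F is the Kolmogorov–Smirnov statistic; for Uniform(0,1) samples it can be computed exactly from the order statistics. A sketch illustrating the Glivenko–Cantelli decay:

```python
import random

def ks_uniform(n, seed=0):
    """sup_t |F_n(t) - t| for an i.i.d. Uniform(0,1) sample, via order statistics."""
    rng = random.Random(seed)
    u = sorted(rng.random() for _ in range(n))
    return max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))

# The sup distance decays like 1/sqrt(n), illustrating Glivenko-Cantelli.
print(ks_uniform(100), ks_uniform(10_000))
```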
Remark 5.1 (Measurability issues). There is an obvious question of whether ‖P_n − P‖_F is measurable. There are some standard techniques to deal with this issue, such as (a) the outer-expectation approach of Hoffmann–Jørgensen, (b) working with separable stochastic processes only, etc. We will henceforth not address measurability issues at all. If you are uncomfortable, assume that the function class F is countable—no measurability issues would arise then.
Exercise 5.1. Suppose that P is a discrete probability measure. Then F = {1_A(·) | A is a Borel set} is P-Glivenko–Cantelli.
Exercise 5.2. Show that the above statement is false for absolutely continuous probability
measures on R.
Exercise 5.5 (Hoeffding's inequality). Suppose X_i are independent sub-Gaussians with parameters σ_i, i = 1, ..., n. Then
P( Σ_{i=1}^n a_i X_i ≥ E Σ_{i=1}^n a_i X_i + δ ) ≤ exp( −δ² / (2 Σ_{i=1}^n a_i² σ_i²) ).
Exercise 5.6. Suppose X_i are zero-mean independent sub-Gaussians with common parameter σ, i = 1, ..., n. Then
P( (1/n) Σ_{i=1}^n X_i ≥ δ ) ≤ exp( −nδ² / (2σ²) ).
Exercise 5.7. Suppose η takes values ±1 with probability 1/2 each (such random variables
are called Rademacher variables). Show that η is sub-Gaussian with parameter 1.
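Since Rademacher variables are sub-Gaussian with parameter 1, the bound of Exercise 5.6 (with σ = 1) applies to their averages; a numerical comparison against the exact binomial tail (a sketch):

```python
from math import ceil, comb, exp

def rademacher_tail(n, delta):
    """Exact P((1/n) * sum_{i<=n} X_i >= delta) for i.i.d. Rademacher X_i."""
    h_min = ceil(n * (1 + delta) / 2)  # number of +1's needed for the mean to reach delta
    return sum(comb(n, h) for h in range(h_min, n + 1)) / 2 ** n

n, delta = 20, 0.5
exact = rademacher_tail(n, delta)
bound = exp(-n * delta ** 2 / 2)
print(exact, bound)  # ≈ 0.0207 vs ≈ 0.0821; the sub-Gaussian bound holds
```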
Exercise 5.8. Let X be a random variable such that a ≤ X ≤ b almost surely. Show that
X is sub-Gaussian with parameter σ = (b − a)/2. (Hint: Consider Taylor-expanding the
cumulant generating function ψ(λ) = log EeλX of X around 0.)
Using a soft symmetrization argument one can show that bounded random variables are sub-Gaussian with parameter (b − a). Here is how it goes. Let X′ be an independent copy of X and let η be a Rademacher variable independent of both X and X′. We have
E_X e^{λ(X − EX)} = E_X e^{λ(X − EX′)} = E_X e^{E_{X′} λ(X − X′)} ≤ E_X E_{X′} e^{λ(X − X′)}   (Jensen).
Now η(X − X′) =d (X − X′). Hence
E_X E_{X′} e^{λ(X − X′)} = E_X E_{X′} E_η e^{λη(X − X′)} ≤ E_X E_{X′} e^{λ²(X − X′)²/2} ≤ e^{λ²(b − a)²/2}.
Expected maxima of sub-Gaussians will appear many times in the later subsections.
(As pointed out by Apratim, this condition implies, via Exercise 1 of Homework 5,
that Dk is a sequence of martingale-differences.)
Exercise 5.10. Show that under the above setup we have
P( Σ_{k=1}^n D_k ≥ δ ) ≤ exp( −δ² / (2 Σ_{k=1}^n σ_k²) ).
Yk = E[ f ( X ) | Fk ].
Exercise 5.12 (Bounded-differences inequality). Suppose that the function f has the bounded-
differences property, i.e., for any 1 ≤ i ≤ n, there is some positive constant bi such that
Theorem 5.14. Let F be a class of measurable functions such that N_[ ](e, F, ‖·‖_{L_1}) < ∞ for any e > 0. Then F is P-Glivenko–Cantelli.
Similarly,
P f − P_n f ≤ P f − P_n ℓ_j = −(P_n − P)ℓ_j + P(f − ℓ_j) ≤ −min_j (P_n − P)ℓ_j + e.
Thus
|P_n f − P f| ≤ max{ max_j (P_n − P)u_j , −min_j (P_n − P)ℓ_j } + e =: ∆_{n,e} + e.
The right hand side does not depend on f, and hence serves as a uniform upper bound.
Choosing a sequence e_k → 0 and adjusting the null set accordingly, we conclude that
5.3 Symmetrization
Recall that a Rademacher random variable η takes values ±1 with equal probability. Take an i.i.d. vector of such variables η = (η_1, ..., η_d). Clearly, η is a uniformly chosen corner of the d-dimensional hypercube of side 2 centered at the origin. The size (or, perhaps more accurately, "spread") of a set A ⊂ R^d can be measured by the maximum inner product of an element of A with such a random Rademacher direction. We define the Rademacher complexity of A as
R(A) := E_η sup_{a∈A} ⟨a, η⟩.
Given a function class F and n i.i.d. random elements ξ_i, the average size (w.r.t. the distribution of the ξ_i's) of F may be measured by the expected Rademacher complexity of the Euclidean set F(ξ) := { (1/n)(f(ξ_1), ..., f(ξ_n)) | f ∈ F } (intuitively, if F is large, one would expect the Euclidean set F(ξ) to be large as well). We define this to be the Rademacher complexity of F:
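For a finite set A ⊂ R^d, R(A) = E_η sup_{a∈A} ⟨a, η⟩ can be computed exactly by enumerating all 2^d sign vectors; a small sketch (feasible only for tiny d):

```python
from itertools import product

def rademacher_complexity(A):
    """Exact R(A) = E_eta sup_{a in A} <a, eta>, enumerating all sign vectors."""
    d = len(next(iter(A)))
    total = 0.0
    for eta in product((-1, 1), repeat=d):
        total += max(sum(ai * ei for ai, ei in zip(a, eta)) for a in A)
    return total / 2 ** d

# A singleton has R = 0, while the symmetric pair {+e1, -e1} has R = E|eta_1| = 1.
print(rademacher_complexity([(1.0, 0.0)]),
      rademacher_complexity([(1.0, 0.0), (-1.0, 0.0)]))  # 0.0 1.0
```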
= E_ξ sup_{f∈F} | E_{ξ′} (1/n) Σ_{i=1}^n (f(ξ_i) − f(ξ_i′)) |
≤ E_ξ sup_{f∈F} E_{ξ′} | (1/n) Σ_{i=1}^n (f(ξ_i) − f(ξ_i′)) |
≤ E_ξ E_{ξ′} sup_{f∈F} | (1/n) Σ_{i=1}^n (f(ξ_i) − f(ξ_i′)) |.
Now let η_i be i.i.d. Rademacher, independent of the ξ_i's and the ξ_i′'s. Then
( η_i (f(ξ_i) − f(ξ_i′)) )_{1≤i≤n} =d ( f(ξ_i) − f(ξ_i′) )_{1≤i≤n},
so that
sup_{f∈F} | (1/n) Σ_{i=1}^n (f(ξ_i) − f(ξ_i′)) | =d sup_{f∈F} | (1/n) Σ_{i=1}^n η_i (f(ξ_i) − f(ξ_i′)) |.
Therefore
E_ξ E_{ξ′} sup_{f∈F} | (1/n) Σ_{i=1}^n (f(ξ_i) − f(ξ_i′)) | = E_ξ E_{ξ′} E_η sup_{f∈F} | (1/n) Σ_{i=1}^n η_i (f(ξ_i) − f(ξ_i′)) |
≤ E_ξ E_{ξ′} E_η [ sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(ξ_i) | + sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(ξ_i′) | ]
= 2 R(F).
Exercise 5.17. Let F be a class of functions that are uniformly bounded by B. Show that
P( ‖P_n − P‖_F ≥ E‖P_n − P‖_F + δ ) ≤ e^{−nδ²/(2B²)}.
Conclude that, with probability at least 1 − e^{−nδ²/(2B²)}, we have
In the next subsection, we will learn how to bound the Rademacher complexity
R(F ) of uniformly bounded function classes of finite VC-dimension.
for all t, t′ ∈ T.
Example 5.18. Let T = F, a class of measurable functions (which we will later assume to be uniformly bounded and to have finite VC dimension). Let
X_f = (1/√n) Σ_{i=1}^n η_i f(ξ_i),
where the η_i are i.i.d. Rademacher and the ξ_i's are fixed. One can then check that (X_f)_{f∈F} is sub-Gaussian w.r.t. the pseudo-metric
ρ(f, g) = ‖f − g‖_{P_n} = sqrt( (1/n) Σ_{i=1}^n (f(ξ_i) − g(ξ_i))² ).
Example 5.19. Let T = R^{n×p}. Let X be a random n × p matrix with independent zero-mean sub-Gaussian entries with sub-Gaussianity parameter 1. Let X_Θ = ⟨X, Θ⟩_F = trace(XΘ^⊤). Then it is easy to check that (X_Θ)_{Θ∈T} is sub-Gaussian w.r.t. the Frobenius metric ρ(Θ, Θ′) = ‖Θ − Θ′‖_F.
To prove the second inequality, let N = N(δ, T, ρ) and let S = {t_1, ..., t_N} be a δ-net of T. Then, for any t, find t_j ∈ S which is δ-close to t. We have
±(X_t − X_{t_1}) = ±(X_t − X_{t_j}) ± (X_{t_j} − X_{t_1}) ≤ sup_{t,t′∈T, ρ(t,t′)≤δ} (X_t − X_{t′}) + max_j |X_{t_j} − X_{t_1}|.
Therefore
X_t − X_{t′} = X_t − X_{t_1} − (X_{t′} − X_{t_1}) ≤ 2 sup_{t,t′∈T, ρ(t,t′)≤δ} (X_t − X_{t′}) + 2 max_j |X_{t_j} − X_{t_1}|.
The above bound, being uniform in t, t′, also works for sup_{t,t′∈T} (X_t − X_{t′}). Taking expectations, we get
E sup_{t,t′∈T} (X_t − X_{t′}) ≤ 2 E sup_{t,t′∈T, ρ(t,t′)≤δ} (X_t − X_{t′}) + 2 E max_j |X_{t_j} − X_{t_1}|.
Now, because (X_t)_{t∈T} is sub-Gaussian w.r.t. ρ, the random variable X_{t_j} − X_{t_1} is sub-Gaussian with parameter ρ²(t_j, t_1) ≤ D². Hence, by Exercise 5.9, we get
E max_j |X_{t_j} − X_{t_1}| ≤ 2 sqrt( D² log N(δ, T, ρ) ).
Now, X_f − X_g = (1/√n) Σ_{i=1}^n η_i (f(ξ_i) − g(ξ_i)) ≤ ‖η‖_2 ‖f − g‖_{P_n} by Cauchy–Schwarz, whence
E sup_{f,g∈F, ‖f−g‖_{P_n} ≤ δ} (X_f − X_g) ≤ δ E‖η‖_2 ≤ δ sqrt(E‖η‖_2²) = δ√n.
E sup_{f∈F} X_f ≤ C′_{B,ν} √(log n),
for some constant C′_{B,ν} that only depends on B and ν. Coming back to the original object of consideration,
E_η sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(ξ_i) | = (1/√n) E sup_{f∈F} |X_f| ≤ (2/√n) E sup_{f∈F} X_f ≤ 2 C′_{B,ν} √(log n / n),
where we have used the fact that X_f =d −X_f, so that
E sup_{f∈F} |X_f| = E sup_{f∈F} ( X_f ∨ (−X_f) ) ≤ E ( sup_{f∈F} X_f ∨ sup_{f∈F} (−X_f) ) ≤ E ( sup_{f∈F} X_f + sup_{f∈F} (−X_f) ) = 2 E sup_{f∈F} X_f.
In the bound
E_η sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(ξ_i) | = O_{B,ν}( √(log n / n) ),
the log n factor is an artifact of the crudeness of our discretization and can be removed via a much more refined way of discretization called chaining.
Example 5.22. Consider upper bounding E (1/√n) ‖X‖_op,⁶ where X is as in Example 5.19. We have, using the variational representation of the operator norm,
‖X‖_op = sup_{‖x‖_2 = ‖y‖_2 = 1} ⟨X, xy^⊤⟩_F ≤ sup_{Θ∈T} ⟨X, Θ⟩_F,
⁶ It can be shown that the largest eigenvalue of the scaled Wishart matrix (1/n) X^⊤X converges almost surely to (1 + √γ)² in the asymptotic regime p/n → γ > 0. Here we are using a cheap discretization bound to prove a non-asymptotic version of this.
Therefore
E sup_{Θ,Θ′∈T, ‖Θ−Θ′‖_F ≤ δ} (X_Θ − X_{Θ′}) ≤ 2δ E‖X‖_op.
E‖X‖_op ≤ C(√n + √p).
Using Gaussian comparison inequalities one can even prove the above result with C = 1, which is the best possible in light of the asymptotic result.
‖X‖_op = sup_{rank(Θ)=1, ‖Θ‖_F=1} ⟨X, Θ⟩_F.
Proof. The first few steps of the proof are the same as those in the proof of the crude discretization bound. We begin with the inequality
Example 5.26. Going back to Example 5.21, if we use chaining, the resulting bound is
E sup_{f∈F} X_f ≤ 2δ√n + 32 ∫_{δ/4}^D sqrt( log C_{B,ν} + 2(ν − 1) log(1/u) ) du.
The integral above can be upper-bounded by a constant multiple of ∫_0^D sqrt(log(1/u)) du, which is finite. Hence, choosing δ = 1/√n, we get
E_η sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(ξ_i) | = O_{B,ν}( 1/√n ).
Exercise 5.27. Show that ∫_0^D sqrt(log(1/u)) du < ∞.
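For Exercise 5.27, the substitution u = e^{−t²} gives ∫_0^1 sqrt(log(1/u)) du = ∫_0^∞ 2t² e^{−t²} dt = √π/2, so the integral is indeed finite. A numerical sketch confirming the value (midpoint rule, truncating the tail at t = 6):

```python
from math import exp, pi, sqrt

def entropy_integral(steps=200_000, T=6.0):
    """Midpoint-rule evaluation of int_0^1 sqrt(log(1/u)) du via u = exp(-t^2),
    i.e. of int_0^T 2 t^2 exp(-t^2) dt (the tail beyond T = 6 is ~ e^{-36})."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += 2.0 * t * t * exp(-t * t) * h
    return total

print(entropy_integral(), sqrt(pi) / 2)  # both ≈ 0.8862
```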
Remark 5.2. Although chaining is able to remove extra log n factors in many problems, there
are situations where it does not produce optimal bounds. Michel Talagrand’s generic chaining
method has the final say—it produces optimal upper bounds.
We end this section by collecting, in the form of a theorem, the results we have proved so far about a uniformly bounded function class of finite VC-dimension.
This is much stronger than just saying that ‖P_n − P‖_F → 0 a.s., which follows from the above theorem by the Borel–Cantelli lemma. Note that if we choose δ² = 2kB²/n, then, with probability at least 1 − e^{−k}, we have
‖P_n − P‖_F = O_{B,ν}( √(k/n) ),
References
Billingsley, P. (2013). Convergence of probability measures. John Wiley & Sons.
Karatzas, I. and Shreve, S. (2012). Brownian motion and stochastic calculus, volume 113.
Springer Science & Business Media.