Stochastic Processes
or
Almost None of the Theory of Stochastic
Processes
Cosma Shalizi
Spring 2007
Contents

Preface

2 Building Processes
  2.1 Finite-Dimensional Distributions
  2.2 Consistency and Extension

5 Stationary Processes
  5.1 Kinds of Stationarity

6 Random Times
  6.1 Reminders about Filtrations and Stopping Times
  6.2 Waiting Times
  6.3 Kac's Recurrence Theorem
  6.4 Exercises

7 Continuity
  7.1 Kinds of Continuity for Processes
  7.2 Why Continuity Is an Issue
  7.3 Separable Random Functions
  7.4 Exercises

8 More on Continuity
  8.1 Separable Versions
  8.2 Measurable Versions
  8.3 Cadlag Versions
  8.4 Continuous Modifications

10 Markov Characterizations
  10.1 Markov Sequences as Transduced Noise
  10.2 Time-Evolution (Markov) Operators
  10.3 Exercises

11 Markov Examples
  11.1 Probability Densities in the Logistic Map
  11.2 Transition Kernels and Evolution Operators for the Wiener Process
  11.3 Lévy Processes and Limit Laws
  11.4 Exercises

12 Generators
  12.1 Exercises

14 Feller Processes
  14.1 Markov Families
  14.2 Feller Processes
  14.3 Exercises

27 Mixing
  27.1 Definition and Measurement of Mixing
  27.2 Examples of Mixing Processes
  27.3 Convergence of Distributions Under Mixing
  27.4 A Central Limit Theorem for Mixing Sequences

Bibliography

Definitions, Lemmas, Propositions, Theorems, Corollaries, Examples and Exercises
Part I

Stochastic Processes in General
Chapter 1

Basics
Example 8 (Random set functions) Let T = B, the Borel field on the reals, and Ξ = R̄^+, the non-negative extended reals. Then {X_t}_{t∈T} is a random set function on the reals.
i.e., the fraction of the samples up to time n which fall into that set: P̂_n(B) = (1/n) Σ_{i=1}^n 1_B(Z_i). This is the empirical measure. P̂_n(B) is a one-sided random sequence of set functions — in fact, of probability measures. We would like to be able to say something about how it behaves. It would be very reassuring, for instance, to be able to show that it converges to the common distribution of the Z_i (Figure 1.8).
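To make this concrete, here is a minimal numerical sketch of the empirical measure (Python; the exponential distribution for the Z_i and the sample sizes are illustrative choices of mine, not anything fixed by the text):

```python
import numpy as np

def empirical_measure(samples, lo, hi):
    """P-hat_n(B): the fraction of the samples falling in B = [lo, hi]."""
    samples = np.asarray(samples)
    return np.mean((samples >= lo) & (samples <= hi))

rng = np.random.default_rng(42)
z = rng.exponential(scale=1.0, size=10_000)   # IID draws Z_1, Z_2, ...
target = np.exp(-1.0) - np.exp(-2.0)          # P(1 <= Z <= 2) under Exp(1)
for n in (10, 100, 1_000, 10_000):
    print(n, empirical_measure(z[:n], 1.0, 2.0), "->", target)
```

As n grows, the printed estimates settle down toward the true probability, which is exactly the sort of convergence we would like to establish.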
For any finite k, k-dimensional cylinder sets are defined similarly, and clearly are the intersections of k different one-dimensional cylinder sets. To see why they have this name, notice that a cylinder, in Euclidean geometry, consists of all the points where the x and y coordinates fall into a certain set (the base), leaving the z coordinate unconstrained. Similarly, a cylinder set like A_t × ∏_{s≠t} Ξ_s consists of all the functions in Ξ^T where f(t) ∈ A_t, which are otherwise unconstrained.
N.B., it has become common to apply the term “sample path” or even just
“path” even in situations where the geometric analogy it suggests may be some-
what misleading. For instance, for the empirical distributions of Example 10,
the “sample path” is the measure P̂n , not the curves shown in Figure 1.8.
Definition 14 (Functional of the Sample Path) Let E, 𝓔 be a measurable space. A functional of the sample path is a mapping f : Ξ^T → E which is X^T/𝓔-measurable.
Examples of useful and common functionals include maxima, minima, sam-
ple averages, etc. Notice that none of these are functions of any one random
variable, and in fact their value cannot be determined from any part of the
sample path smaller than the whole thing.
Definition 15 (Projection Operator, Coordinate Map) A projection op-
erator or coordinate map πt is a map from ΞT to Ξ such that πt X = X(t).
The projection operators are a convenient device for recovering the individ-
ual coordinates — the random variables in the collection — from the random
function. Obviously, as t ranges over T , πt X gives us a collection of random vari-
ables, i.e., a stochastic process in the sense of our first definition. The following
lemma lets us go back and forth between the collection-of-variables, coordinate
view, and the entire-function, sample-path view.
Theorem 16 (Product σ-field-measurability is equivalent to measurability of all coordinates) X is F/⊗_{t∈T} X_t-measurable iff π_t X is F/X_t-measurable for every t.
Proof: This follows from the fact that the one-dimensional cylinder sets
generate the product σ-field.
We have said before that we will want to constrain our stochastic processes
to have certain properties — to be probability measures, rather than just set
functions, or to be continuous, or twice differentiable, etc. Write the set of all
such functions in ΞT as U . Notice that U does not have to be an element of the
product σ-field, and in general is not. (We will consider some of the reasons for
this later.) As usual, by U ∩ X^T we will mean the collection of all sets of the form U ∩ C, where C ∈ X^T. Notice that (U, U ∩ X^T) is a measurable space. What
we want is to ensure that the sample path of our random function lies in U .
Definition 17 (A Stochastic Process Is a Random Function) A Ξ-valued stochastic process on T with paths in U, U ⊆ Ξ^T, is a random function X : Ω → U which is F/U ∩ X^T-measurable.
1.3 Exercises
Exercise 1.1 (The product σ-field answers countable questions) Let D = ⋃_S X^S, where the union ranges over all countable subsets S of the index set T. For any event D ∈ D, whether or not a sample path x ∈ D depends on the value of x_t at only a countable number of indices t.
1. Show that D is a σ-field.
Figure 1.1: Examples of point processes. The top row shows the dates of ap-
pearances of 44 genres of English novels (data taken from Moretti (2005)). The
bottom two rows show independent realizations of a Poisson process with the
same mean time between arrivals as the actual history. The number of tick-
marks falling within any measurable set on the horizontal axis determines an
integer-valued set function, in fact a measure.
[Figure residue: a DNA base sequence and binary sequences, examples of discrete-valued, discrete-parameter random sequences.]
Figure 1.4: Examples of one-sided random sequences. These are linear Gaussian random sequences, X_{t+1} = 0.8 X_t + Z_{t+1}, where the Z_t are all i.i.d. N(0, 1), and X_1 = Z_0. Different shapes of dots represent different independent samples of this autoregressive process. (The line segments are simply guides to the eye.) This is a Markov sequence, but one with a continuous state-space.
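Paths like those in Figure 1.4 are easy to reproduce; a sketch (Python; the path length and seed handling are arbitrary choices of mine):

```python
import numpy as np

def ar1_path(T=50, a=0.8, rng=None):
    """One realization of X_{t+1} = a X_t + Z_{t+1}, with X_1 = Z_0 and Z_t IID N(0,1)."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(T)       # Z_0, Z_1, ..., Z_{T-1}
    x = np.empty(T)
    x[0] = z[0]                      # X_1 = Z_0
    for t in range(1, T):
        x[t] = a * x[t - 1] + z[t]   # the autoregression
    return x

paths = [ar1_path() for _ in range(3)]   # independent samples, as in the figure
```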
Figure 1.5: Nonlinear, non-Gaussian random sequences. Here X1 ∼ U (0, 1), i.e.,
uniformly distributed on the unit interval, and Xt+1 = 4Xt (1−Xt ). Notice that
while the two samples begin very close together, they rapidly separate; after a
few time-steps their locations are, in fact, effectively independent. We will study
both this approach to independence, known as mixing, and this example, known
as the logistic map, in some detail.
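The rapid separation of nearby trajectories is easy to see numerically; a sketch (Python; the initial point and the 10^-8 perturbation are arbitrary choices of mine):

```python
import numpy as np

def logistic_orbit(x0, T=50, a=4.0):
    """Iterate X_{t+1} = a X_t (1 - X_t) starting from X_1 = x0."""
    xs = [x0]
    for _ in range(T - 1):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

gap = np.abs(logistic_orbit(0.3) - logistic_orbit(0.3 + 1e-8))
print(gap[:30:5])   # the gap grows roughly geometrically until it saturates
```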
Figure 1.6: Continuous-time random processes. Shown are three samples from
the standard Wiener process, also known as Brownian motion, a Gaussian pro-
cess with independent increments and continuous trajectories. This is a central
part of the course, and actually what forced probability to be re-defined in terms
of measure theory.
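Sample paths like these can be drawn by discretizing time and using the independent Gaussian increments; a sketch (Python; the grid size is my choice, and a discrete skeleton is of course only an approximation to the continuous-time object):

```python
import numpy as np

def wiener_path(T=1.0, n_steps=1000, rng=None):
    """Approximate standard Wiener path on [0, T]: cumulative sum of N(0, dt) increments."""
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    increments = np.sqrt(dt) * rng.standard_normal(n_steps)
    w = np.concatenate([[0.0], np.cumsum(increments)])   # W(0) = 0
    return np.linspace(0.0, T, n_steps + 1), w

t, w = wiener_path()   # one sample; repeat for the three paths in the figure
```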
[Figure residue: a further sample-path plot, and the panels of Figure 1.8: empirical cumulative distribution functions F_n(x) at increasing sample sizes.]
You will sometimes see “FDDs” and “fidis” as abbreviations for “finite-
dimensional distributions”. Please do not use “fidis”.
We can at least hope to specify the finite-dimensional distributions. But we
are going to want to ask a lot of questions about asymptotics, and global proper-
ties of sample paths, which go beyond any finite dimension, so you might worry
that we’ll still need to deal directly with the infinite-dimensional distribution.
The next theorem says that this worry is unfounded; the finite-dimensional dis-
tributions specify the infinite-dimensional distribution (pretty much) uniquely.
Proof: “Only if”: Since X and Y have the same distribution, applying any given set of coordinate mappings will result in identically-distributed random vectors, hence all the finite-dimensional distributions will agree.
“If”: We’ll use the π-λ theorem. Let C be the finite cylinder sets, i.e., all sets of the form {x ∈ Ξ^T : x(t_1) ∈ B_1, ..., x(t_k) ∈ B_k}, for some finite collection of indices t_1, ..., t_k ∈ T and measurable sets B_1, ..., B_k ∈ X.
This is going to get really awkward to write over and over, so let’s introduce
some simplifying notation. Fin(T ) will denote the class of all finite sub-sets
of our index set T , and likewise Denum(T ) all denumerable sub-sets. We’ll
indicate such sub-sets, for the moment, by capital letters like J, K, etc., and
extend the definition of coordinate maps (Definition 15) so that πJ maps from
ΞT to ΞJ in the obvious way, and πJK maps from ΞK to ΞJ , if J ⊂ K. If µ is
the measure for the whole process, then the finite-dimensional distributions are
{µJ |J ∈ Fin(T )}. Clearly, µJ = µ ◦ πJ −1 .
Proof: This is just the fact that we get marginal distributions by integrating
out some variables from the joint distribution. But, to proceed formally: Letting
J and K be finite sets of indices, J ⊂ K, we know that µ_K = µ ∘ π_K^{−1}, that µ_J = µ ∘ π_J^{−1}, and that π_J = π_{JK} ∘ π_K. Hence

µ_J = µ ∘ (π_{JK} ∘ π_K)^{−1}   (2.3)
    = µ ∘ π_K^{−1} ∘ π_{JK}^{−1}   (2.4)
    = µ_K ∘ π_{JK}^{−1}   (2.5)
as required.
I claimed that the reason to care about finite-dimensional distributions is
that if we specify them, we specify the distribution of the whole process. Lemma
25 says that a putative family of finite dimensional distributions must be consis-
tent, if they are to let us specify a stochastic process. Theorem 23 says that there
can’t be more than one process distribution with all the same finite-dimensional
marginals, but it doesn’t guarantee that a given collection of consistent finite-
dimensional distributions can be extended to a process distribution — it gives
uniqueness but not existence. Proving the existence of an extension requires
some extra assumptions. Either we need to impose topological conditions on Ξ,
Proof: See 36-752 lecture notes (Theorem 50, Exercise 51), or Kallenberg,
Theorem 2.5, pp. 26–27. Note that “extension” here means extending from a
mere field to a σ-field, not from finite to infinite index sets.
Proof: This will be easier to follow if we first consider the case where T is countable, which is basically Theorem 27 again, and then the general case, where we need Proposition 28.
Countable T: We can, by definition, put the elements of T in 1-1 correspondence with the elements of N. This in turn establishes a bijection between the product space ⊗_{t∈T} Ξ_t = Ξ^T and the sequence space ⊗_{i=1}^∞ Ξ_{t_i}. This bijection
also induces a projective family of distributions on finite sequences. The Daniell
Extension Theorem (27) gives us a measure on the sequence space, which the
bijection takes back to a measure on ΞT . To see that this µ does not depend on
the order in which we arranged T , notice that any two arrangements must give
identical results for any finite set J, and then use Theorem 23.
Uncountable T: For each countable K ⊂ T, the argument of the preceding paragraph gives us a measure µ_K on Ξ^K. And, clearly, these µ_K themselves form a projective family. Now let's define a set function µ on the countable cylinder sets, i.e., on the class D of sets of the form A × Ξ^{T∖K}, for some K ∈ Denum(T) and some A ∈ X_K. Specifically, µ : D → [0, 1], and µ(A × Ξ^{T∖K}) = µ_K(A).
We would like to use Carathéodory’s theorem to extend this set function to
a measure on the product σ-algebra X^T. First, let's check that the countable cylinder sets form a field: (i) Ξ^T ∈ D, clearly. (ii) The complement, in Ξ^T, of a countable cylinder A × Ξ^{T∖K} is another countable cylinder, A^c × Ξ^{T∖K}. (iii) The union of two countable cylinders B_1 = A_1 × Ξ^{T∖K_1} and B_2 = A_2 × Ξ^{T∖K_2} is another countable cylinder, since we can always write it as A × Ξ^{T∖K} for some A ∈ X_K, where K = K_1 ∪ K_2. Clearly, µ(∅) = 0, so we just need to
check that µ is countably additive. So consider any sequence of disjoint cylinder sets B_1, B_2, .... Because they're cylinder sets, for each i, B_i = A_i × Ξ^{T∖K_i}, for some K_i ∈ Denum(T), and some A_i ∈ X_{K_i}. Now set K = ⋃_i K_i; this is a countable union of countable sets, and so itself countable. Furthermore, say C_i = A_i × Ξ^{K∖K_i}, so we can say that ⋃_i B_i = (⋃_i C_i) × Ξ^{T∖K}. With this notation in place,
µ(⋃_i B_i) = µ_K(⋃_i C_i)   (2.6)
= Σ_i µ_K(C_i)   (2.7)
= Σ_i µ_{K_i}(A_i)   (2.8)
= Σ_i µ(B_i)   (2.9)
where in the second line we’ve used the fact that µK is a probability measure
on ΞK , and so countably additive on sets like the Ci . This proves that µ is
countably additive, so by Proposition 28 it extends to a measure on σ(D), the
σ-field generated by the countable cylinder sets. But we know from Definition
12 that this σ-field is the product σ-field. Since µ(ΞT ) = 1, Proposition 28
further tells us that the extension is unique.
Borel spaces are good enough for most of the situations we find ourselves modeling, so the Daniell-Kolmogorov Extension Theorem (as it's often known) sees a lot of work. Still, some people dislike having to make topological assump-
tions to solve probabilistic problems; it seems inelegant. The Ionescu-Tulcea
Extension Theorem provides a purely probabilistic solution, available if we can
write down the FDDs recursively, in terms of regular conditional probability
distributions, even if the spaces where the process has its coordinates are not
nice and Borel. Doing this properly will involve our revisiting and extending
some ideas about conditional probability, which you will have seen in 36-752, so
it will be deferred to the next lecture.
Chapter 3

Building Processes by Conditioning
Proof: As before, we'll be working with the cylinder sets, but now we'll make our life simpler if we consider cylinders where the base set rests in the first n spaces Ξ_1, ..., Ξ_n. More specifically, set B_n = ⊗_{i=1}^n X_i (these are the base sets), and C_n = B_n × ∏_{i=n+1}^∞ Ξ_i (these are the cylinder sets), and C = ⋃_n C_n.
C clearly contains all the finite cylinders, so it generates the product σ-field on
infinite sequences. We will use it as the field in Proposition 32. (Checking that
C is a field is entirely parallel to checking that the D appearing in the proof of
Theorem 29 was a field.)
For each base set A ∈ B_n, let [A] be the corresponding cylinder, [A] = A × ∏_{i=n+1}^∞ Ξ_i. Notice that for every set C ∈ C, there is at least one A, in some B_n, such that C = [A]. Now we define a set function µ on C:

µ([A]) = (⊗_{i=1}^n κ_i)(A)   (3.1)
3.3 Exercises
Exercise 3.1 (Łomnicki-Ulam Theorem on infinite product measures)
Let T be an uncountable index set, and (Ξt , Xt , µt ) a collection of probability
spaces. Show that there exist independent random variables Xt in Ξt with dis-
tributions µt . Hint: use the Ionescu Tulcea theorem on countable subsets of T ,
and then imitate the proof of the Kolmogorov extension theorem.
Exercise 3.2 (Measures of cylinder sets) In the proof of the Ionescu Tulcea
Theorem, we employed a set function on the finite cylinder sets, where the mea-
sure of an infinite-dimensional cylinder set [A] is taken to be the measure of its
finite-dimensional base set A. However, the same cylinder set can be specified
by different base sets, so it is necessary to show that Equation 3.1 has a unique
value on its right-hand side. In what follows, C is an arbitrary member of the
class C.
1. Show that, when A, B ∈ Bn , [A] = [B] iff A = B. That is, two cylinders
generated by bases of equal dimensionality are equal iff their bases are
equal.
2. Show that there is a smallest n such that C = [A] for an A ∈ Bn . Conclude
that the right-hand side of Equation 3.1 could be made well-defined if we
took n there to be this least possible n.
Part II

One-Parameter Processes in General
Chapter 4
One-Parameter Processes,
Usually Functions of Time
We’ve been doing a lot of pretty abstract stuff, but the point of this is to
establish a common set of tools we can use across many different concrete situa-
tions, rather than having to build very similar, specialized tools for each distinct
case. Today we’re going to go over some examples of the kind of situation our
tools are supposed to let us handle, and begin to see how they let us do so. In
particular, the two classic areas of application for stochastic processes are dy-
namics (systems changing over time) and inference (conclusions changing as we
acquire more and more data). Both of these can be treated as “one-parameter”
processes, where the parameter is time in the first case and sample size in the
second.
Example 35 (Bernoulli process) You all know this one: a one-sided infinite
sequence of independent, identically-distributed binary variables, where, for all
t, Xt = 1 with probability p.
Example 38 (Wiener Process) The standard Wiener process is the real-valued random function W(t), t ≥ 0, such that

1. W(0) = 0,
2. for any three times, t1 < t2 < t3, W(t2) − W(t1) and W(t3) − W(t2) are independent (the “independent increments” property),
3. W(t2) − W(t1) ∼ N(0, t2 − t1) and
4. W(t, ω) is a continuous function of t for almost all ω.
We will spend a lot of time with the Wiener process, because it turns out to
play a role in the theory of stochastic processes analogous to that played by
the Gaussian distribution in elementary probability — the easily-manipulated,
formally-nice distribution delivered by limit theorems.
When we examine the Wiener process in more detail, we will see that it
almost never has a derivative. Nonetheless, in a sense which will be made
clearer when we come to stochastic calculus, the Wiener process can be regarded
as the integral over time of something very like white noise, as described in the
preceding example.
Example 39 (Logistic Map) Let T = N, Ξ = [0, 1], X(0) ∼ U (0, 1), and
X(t + 1) = aX(t)(1 − X(t)), a ∈ [0, 4]. This is called the logistic map. Notice
that all the randomness is in the initial value X(0); given the initial condition,
all later values X(t) are fixed. Nonetheless, this is a Markov process, and we will
see that, at least for certain values of a, it satisfies versions of the laws of large
numbers and the central limit theorem. In fact, large classes of deterministic
dynamical systems have such stochastic properties.
Example 46 (Text) Text (at least in most writing systems!) can be repre-
sented by a sequence of discrete values at discrete, ordered locations. Since texts
can be arbitrarily long, but they all start somewhere, they are discrete-parameter,
one-sided processes. Or, more exactly, once we specify a distribution over se-
quences from the appropriate alphabet, we will have such a process.
Linguists believe that no Markovian model (with finitely many states) can capture human language (Chomsky, 1957; Pullum, 1991). Whether this is true of DNA sequences is not known. In both cases, hidden Markov models are used extensively (Baldi and Brunak, 2001; Charniak, 1993), even if they can only be approximately true of language.
(A semi-group does not need to have an identity element, and one which
does is technically called a “monoid”. No one talks about the shift monoid or
time-evolution monoid, however.)
Before we had a Ξ-valued stochastic process X on T , i.e., our process was
a random function from T to Ξ. To extract individual random variables, we
used the projection operators πt , which took X to Xt . With the shift operators,
we simply have πt = π0 ◦ Σt . To represent the passage of time, then, we just
apply elements of this semi-group to the function space. Rather than having
complicated dynamics which gets us from one value to the next, by working with
shifts on function space, all of the complexity is shifted to the initial distribution.
This will prove to be extremely useful when we consider stationary processes in
the next lecture, and even more useful when, later on, we want to extend the
limit theorems from IID sequences to dependent processes.
4.3 Exercises
Exercise 4.1 (Existence of proto-Wiener processes) Use Theorem 29 and
the properties of Gaussian distributions to show that processes exist which satisfy
CHAPTER 4. ONE-PARAMETER PROCESSES 32
points (1)–(3) of Example 38 (but not necessarily continuity). You will want to
begin by finding a way to write down the FDDs recursively.
Exercise 4.2 (Time-Evolution Semi-Group) These are all very easy, but
worth the practice.
3. Verify that πτ = π0 ◦ Στ .
4. Verify that, for a discrete-parameter process, Σ_t = (Σ_1)^t, and so Σ_1 generates the semi-group. (For this reason it is often abbreviated to Σ.)
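For one-sided discrete-parameter paths, the identities in this exercise can be checked mechanically; a toy sketch (Python, with finite lists standing in for infinite sequences, which is an illustrative simplification of mine):

```python
def shift(x, t=1):
    """The shift operator Sigma_t on one-sided sequences: (Sigma_t x)(s) = x(s + t)."""
    return x[t:]

x = list(range(10))                            # stand-in sample path x(0), x(1), ...
assert shift(x, 3) == shift(shift(shift(x)))   # Sigma_3 = (Sigma_1)^3
assert shift(x, 3)[0] == x[3]                  # pi_3 = pi_0 . Sigma_3
assert shift(shift(x, 2), 3) == shift(x, 5)    # the semi-group property
```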
[Figure 4.1 panels: the empirical process E_n(x) for n = 10^2, 10^3, 10^4, 10^5, and 10^6 samples from a log-normal; the bottom-right panel magnifies part of the n = 10^6 curve.]
Figure 4.1: Empirical process for successively larger samples from a log-normal distribution, mean of the log = 0.66, standard deviation of the log = 0.66. The samples were formed by taking the first n variates from an over-all sample of 1,000,000 variates. The bottom right figure magnifies a small part of E_{10^6} to show how spiky the paths are.
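Figure 4.1 can be regenerated along the following lines (a sketch in Python; it assumes the centered-and-scaled definition E_n(x) = √n (F_n(x) − F(x)), which matches the figure's stable vertical scale but should be checked against the definition in the text):

```python
import numpy as np
from scipy import stats

mu, sigma = 0.66, 0.66                    # parameters of the log, as in the caption
rng = np.random.default_rng(1)
big_sample = rng.lognormal(mu, sigma, size=10**6)
grid = np.linspace(0.01, 20.0, 500)
F = stats.lognorm.cdf(grid, s=sigma, scale=np.exp(mu))   # the true CDF

for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    sample = np.sort(big_sample[:n])      # first n variates, as in the caption
    F_n = np.searchsorted(sample, grid, side="right") / n
    E_n = np.sqrt(n) * (F_n - F)          # the empirical process on the grid
    print(n, np.abs(E_n).max())
```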
Chapter 5
Stationary One-Parameter
Processes
Notice that when the parameter is discrete, we can get away with just check-
ing the distributions of blocks of consecutive indices.
At this point, you should check that a weakly stationary process has time-
invariant correlations. (We will say much more about this later.) You should
also check that strong stationarity implies weak stationarity. It will turn out
that weak and strong stationarity coincide for Gaussian processes, but not in
general.
Definition 51 (Conditional (Strong) Stationarity) A one-parameter process is conditionally stationary if its conditional distributions are invariant under time-translation: ∀n ∈ N, for every set of n + 1 indices t_1, ..., t_{n+1} ∈ T, t_i < t_{i+1}, and every shift τ,

L(X_{t_{n+1}} | X_{t_1}, X_{t_2}, ..., X_{t_n}) = L(X_{t_{n+1}+τ} | X_{t_1+τ}, X_{t_2+τ}, ..., X_{t_n+τ})   (5.4)

(a.s.).
Strict stationarity implies conditional stationarity, but the converse is not
true, in general. (Homogeneous Markov processes, for instance, are all con-
ditionally stationary, but most are not stationary.) Many methods which are
normally presented using strong stationarity can be adapted to processes which
are merely conditionally stationary.1
Strong stationarity will play an important role in what follows, because it is the natural generalization of the IID assumption to situations with dependent variables — we allow for dependence, but the probabilistic set-up remains, in a sense, unchanging. This will turn out to be enough to let us learn a great deal about the process from observation, just as in the IID case.
“Only if”: The statement that µ = µ ∘ Σ_τ^{−1} really means that, for any set A ∈ X^T, µ(A) = µ(Σ_τ^{−1} A). Suppose A is a finite-dimensional cylinder set. Then the equality holds, because all the finite-dimensional distributions agree (by hypothesis). But this means that X and Σ_τ X are two processes with the same finite-dimensional distributions, and so their infinite-dimensional distributions agree (Theorem 23), and the equality holds on all measurable sets A.
This can be generalized somewhat.
To see that Eq. 5.10 implies Eq. 5.11, pick any measurable set B, and then
apply 5.10 to F (B) (which is ∈ X , because F is measurable). To go the other
way, from 5.11 to 5.10, it would have to be the case that, ∀A ∈ X , ∃B ∈ X such
that A = F (B), i.e., every measurable set would have to be the image, under
F , of another measurable set. This is not necessarily the case; it would require,
for starters, that F be onto (surjective).
Theorem 52 says that every stationary process can be represented by a
measure-preserving transformation, namely the shift. Since measure-preserving
transformations arise in many other ways, however, it is useful to know about
the processes they generate.
Proof: Consider shifting the sequence F^n(X) by one: the nth term in the shifted sequence is F^{n+1}(X) = F^n(F(X)). But since L(F(X)) = L(X), by hypothesis, L(F^{n+1}(X)) = L(F^n(X)), and the measure is shift-invariant. So, by Theorem 52, the process F^n(X) is stationary.
5.3 Exercises
Exercise 5.1 (Functions of Stationary Processes) Use Corollary 54 to
show that if g is any measurable function on Ξ, then the sequence g(F n (X)) is
also stationary.
Chapter 6

Random Times
1 Actually, this is just one variety of option (an “American call”), out of a huge variety. I
Note 1: If I’m to be honest with you, I should admit that “return time”
and “recurrence time” are used more or less interchangeably in the literature to
refer to either the time coordinate of the first return (what I’m calling the return
time) or the time interval which elapses before that return (what I’m calling
the recurrence time). I will try to keep these straight here. Check definitions
carefully when reading papers!
Note 2: Observe that if we have a discrete-parameter process, and are in-
terested in recurrences of a finite-length sequence of observations w ∈ Ξk , we
can handle this situation by the device of working with the shift operator in
sequence space.
The question of whether any of these waiting times is optional (i.e., mea-
surable) must, sadly, be raised. The following result is generally enough for our
purposes.
en = wn−1 − wn , n ≥ 1 (6.5)
rn = en−1 − en , n ≥ 2 (6.6)
rn = wn−2 − 2wn−1 + wn , n ≥ 2 (6.7)
using first stationarity and then total probability. To see the second equality,
notice that, by total probability,
Proof: The event {θA = k, X1 ∈ A} is the same as the event {Y1 = 1, Y2 = 0, . . . Yk+1 = 1}.
Since P (X1 ∈ A) > 0, we can handle the conditional probabilities in an elemen-
tary fashion:
P(θ_A = k | X_1 ∈ A) = P(θ_A = k, X_1 ∈ A) / P(X_1 ∈ A)   (6.12)
= P(Y_1 = 1, Y_2 = 0, ..., Y_{k+1} = 1) / P(Y_1 = 1)   (6.13)

Σ_{k=1}^∞ P(θ_A = k | X_1 ∈ A) = (Σ_{k=1}^∞ P(Y_1 = 1, Y_2 = 0, ..., Y_{k+1} = 1)) / P(Y_1 = 1)   (6.14)
= (Σ_{k=2}^∞ r_k) / e_1   (6.15)
where the last line uses Eq. 6.6. Since wn−1 ≥ wn , there exists a limn wn , which
is ≥ 0 since every individual wn is. Hence limn wn−1 − wn = 0.
Σ_{k=1}^∞ P(θ_A = k | X_1 ∈ A) = (Σ_{k=2}^∞ r_k) / e_1   (6.21)
= lim_{n→∞} (e_1 − (w_{n−1} − w_n)) / e_1   (6.22)
= e_1 / e_1   (6.23)
= 1   (6.24)
so we just need to show that the last series above sums to 1. Using Eq. 6.7
again,
Σ_{k=1}^n k r_{k+1} = Σ_{k=1}^n k (w_{k−1} − 2w_k + w_{k+1})   (6.27)
= Σ_{k=1}^n k w_{k−1} + Σ_{k=1}^n k w_{k+1} − 2 Σ_{k=1}^n k w_k   (6.28)
= Σ_{k=0}^{n−1} (k+1) w_k + Σ_{k=2}^{n+1} (k−1) w_k − 2 Σ_{k=1}^n k w_k   (6.29)
= w_0 + n w_{n+1} − (n+1) w_n   (6.30)
= 1 − w_n − n(w_n − w_{n+1})   (6.31)
We therefore wish to show that limn wn = 0 implies limn wn + n(wn − wn+1 ) =
0. By hypothesis, it is enough to show that limn n(wn − wn+1 ) = 0. The partial
sums on the left-hand side of Eq. 6.27 are non-decreasing, so wn +n(wn −wn+1 )
is non-increasing. Since it is also ≥ 0, the limit limn wn + n(wn − wn+1 ) exists.
Since wn → 0, wn − wn+1 ≤ wn must also go to zero; the only question is
whether it goes to zero fast enough. So consider
lim_n ( Σ_{k=1}^n w_k − Σ_{k=1}^n w_{k+1} )   (6.32)
Telescoping the sums again, this is lim_n (w_1 − w_{n+1}). Since lim_n w_{n+1} = lim_n w_n = 0, the limit exists. But we can equally re-write Eq. 6.32 as
lim_n Σ_{k=1}^n (w_k − w_{k+1}) = Σ_{n=1}^∞ (w_n − w_{n+1})   (6.33)
Since the sum converges, the individual terms wn −wn+1 must be o(n−1 ). Hence
limn n(wn − wn+1 ) = 0, as was to be shown.
“Only if”: From Eq. 6.31 in the “if” part, we see that the hypothesis is
equivalent to
1 = lim_n (1 − w_n − n(w_n − w_{n+1}))   (6.34)
6.4 Exercises
Exercise 6.1 (Weakly Optional Times and Right-Continuous Filtrations) Show that a random time τ is weakly {F_t}-optional iff it is {F_{t+}}-optional.
Exercise 6.3 (Kac’s Theorem for the Logistic Map) First, do Exercise
5.3. Then, using the same code, suitably modified, numerically check Kac’s
Theorem for the logistic map with a = 4. Pick any interval I ⊂ [0, 1] you like,
but be sure not to make it too small.
1. Generate n initial points in I, according to the invariant density 1/(π√(x(1−x))). For each point x_i, find the first t such that F^t(x_i) ∈ I, and take the mean over the sample. What happens to this space average as n grows?

2. Generate a single point x_0 in I, according to the invariant measure. Iterate it T times. Record the successive times t_1, t_2, ... at which F^t(x_0) ∈ I, and find the mean of t_i − t_{i−1} (taking t_0 = 0). What happens to this time average as T grows?
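A possible scaffold for part 1 (Python; the interval, the sample sizes, and the sin² device for sampling the arcsine density are all my choices, and Kac's theorem enters as "mean recurrence time of I = 1/(invariant probability of I)"):

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: 4.0 * x * (1.0 - x)     # logistic map, a = 4
lo, hi = 0.2, 0.4                     # the interval I: arbitrary, not too small

def invariant_draw(size):
    """Sample the density 1/(pi sqrt(x(1-x))): sin^2 of a uniform angle."""
    return np.sin(0.5 * np.pi * rng.uniform(size=size)) ** 2

def first_return_time(x):
    t = 0
    while True:
        x, t = F(x), t + 1
        if lo <= x <= hi:
            return t

starts = invariant_draw(200_000)
starts = starts[(starts >= lo) & (starts <= hi)][:5_000]
space_avg = np.mean([first_return_time(x) for x in starts])

# Invariant probability of I, from the arcsine CDF (2/pi) arcsin(sqrt(x)).
p_I = (2.0 / np.pi) * (np.arcsin(np.sqrt(hi)) - np.arcsin(np.sqrt(lo)))
print(space_avg, "vs Kac's prediction", 1.0 / p_I)
```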
2 “How Sampling Reveals a Process” (Ornstein and Weiss, 1990); Algoet (1992); Kon-
Chapter 7

Continuity of Stochastic Processes
Note that neither Lp -continuity nor stochastic continuity says that the indi-
vidual sample paths, themselves, are continuous.
As we will see, it will not be easy to show that our favorite random processes
have any of these desirable properties. What will be easy will be to show that
they are, in some sense, easily modified into ones which do have good regularity
properties, without loss of probabilistic content. This is made more precise
by the notion of versions of a stochastic process, related to that of versions of
conditional probabilities.
Notice that saying X and Y are indistinguishable means that their sample
paths are equal almost surely, while saying they are versions of one another
means that, at any time, they are almost surely equal. Indistinguishable pro-
cesses are versions of one another, but not necessarily the reverse. (Look at
where the quantifier and the probability statements go.) However, if T = Rd ,
then any two right-continuous versions of the same process are indistinguishable
(Exercise 7.2).
Proof: It’ll be enough to prove this for the supremum function M ; the proof
for m is entirely parallel. First, notice that M (ω) must be finite, because the
sample paths X(·, ω) are continuous functions, and continuous functions are
bounded on bounded intervals. Next, notice that M (ω) > a if and only if
X(t, ω) > a for some t ∈ I. But then, by continuity, there will be some rational
t_0 ∈ I ∩ Q such that X(t_0, ω) > a; countably many, in fact.² Hence

{ω : M(ω) > a} = ⋃_{t∈I∩Q} {ω : X(t, ω) > a}
1 Strictly speaking, we don’t really know that space-time is a continuum, but the discretiza-
Since, for each t, X(t, ω) is a random variable, the sets in the union on the
right-hand side are all measurable, and the union of a countable collection of
measurable sets is itself measurable. Since intervals of the form (a, ∞) generate
the Borel σ-field on the reals, we have shown that M (ω) is a measurable function
from Ω to the reals, i.e., a random variable.
Continuity raises some very tricky technical issues. The product σ-field is
the usual way of formalizing the notion that what we know about a stochas-
tic process are values observed at certain particular times. What we saw in
Exercise 1.1 is that “the product σ-field answers countable questions”: for any
measurable set A, whether x(·, ω) ∈ A depends only on the value of x(t, ω) at
countably many indices t. It follows that the class of all continuous sample
paths is not product-σ-field measurable, because x(·, ω) is continuous at t iff x(t_n, ω) → x(t, ω) along every sequence t_n → t, and this involves the value of the function at uncountably many coordinates. It is further true that the class
of differentiable functions is not product σ-field measurable. For that matter,
neither is the class of piecewise linear functions! (See Exercise 7.1.)
You might think that, on the basis of Theorem 23, this should not really be
much of an issue: that even if the class of continuous sample paths (say) isn’t
strictly measurable, it could be well-approximated by measurable sets, and so
getting the finite-dimensional distributions right is all that matters. This would
make the theory of stochastic processes in continuous time much simpler, but
unfortunately it’s not quite the case. Here is an example to show just how bad
things can get, even when all the finite-dimensional distributions agree.3
Why care about this example? Two reasons. First, and technically, we’re
going to want to take suprema of the Wiener process a lot when we deal with
large deviations theory, and with the approximation of discrete processes by
continuous ones. Second, nothing in the example really depended on starting
3 I stole this example from Pollard (2002, p. 214).
from the Wiener process; any process with continuous sample paths would have
worked as well. So controlling the finite-dimensional distributions is not enough
to guarantee that a process has measurable extrema, which would lead to all
kinds of embarrassments. We could set up what would look like a reasonable
stochastic model, and then be unable to say things like “with probability p, the
{ temperature/ demand for electricity/ tracking error/ interest rate } will not
exceed r over the course of the year”.
Fundamentally, the issues with continuity are symptoms of a deeper problem.
The reason the supremum function is non-measurable in the example is that it
involves uncountably many indices. A countable collection of ill-behaved sets of
measure zero is a set of measure zero, and may be ignored, but an uncountable
collection of them can have probability 1. Fortunately, there are standard ways
of evading these measure-theoretic issues, by showing that one can always find
random functions which not only have prescribed finite-dimensional distribu-
tions (what we did in Lectures 2 and 3), but also are regular enough that we
can take suprema, or integrate over time, or force them to be continuous. This
hinges on the notion of separability for random functions.
Proof: (1) Take the separating set to be T itself. (2) Pick any countable dense D. By density, for every t there will be a sequence t_i ∈ D such that t_i → t. By continuity, along any sequence converging to t, x(t_i) → x(t). (3) Just like (2), only be sure to pick the t_i > t. (You can do this, again, for any countable dense D.)
7.4 Exercises
Exercise 7.1 (Piecewise Linear Paths Not in the Product σ-Field) Consider real-valued functions on the unit interval (i.e., Ξ = R, T = [0, 1], X = B). The product σ-field is thus B^{[0,1]}. In many circumstances, it would be useful to constrain sample paths to be piece-wise linear functions of the index. Let PL([0, 1]) denote this class of functions. Use the argument of Exercise 1.1 to show that PL([0, 1]) ∉ B^{[0,1]}.
Chapter 8

More on Continuity
Recall the story so far: last time we saw that the existence of processes with
given finite-dimensional distributions does not guarantee that they have desir-
able and natural properties, like continuity, and in fact that one can construct
discontinuous versions of processes which ought to be continuous. We therefore
need extra theorems to guarantee the existence of continuous versions of pro-
cesses with specified FDDs. To get there, we will first prove the existence of
separable versions. This will require various topological conditions on both the
index set T and the value space Ξ.
In the interest of space (or is it time?), Section 8.1 will provide complete and
detailed proofs. The other sections will simply state results, and refer proofs
to standard sources, mostly Gikhman and Skorokhod (1965/1969). (They in
turn follow Doob (1953), but are explicit about what he regarded as obvious
generalizations and extensions, and they cost about $20, whereas Doob costs
$120 in paperback.)
Example 86 (Compactifying the Reals) The real numbers R are not compact: they have no finite covering by open intervals (or other open sets). The extended reals, R̄ ≡ R ∪ {+∞} ∪ {−∞}, are compact, since intervals of the form (a, ∞] and [−∞, a) are open. This is a two-point compactification of the reals. There is also a one-point compactification, with a single point at ±∞, but this has the undesirable property of making big negative and positive numbers close to each other.
Recall that a random function is separable if its value at any arbitrary in-
dex can be determined almost surely by examining its values on some fixed,
countable collection of indices. The next lemma states an alternative charac-
terization of separability. The lemma after that gives conditions under which a
weaker property holds — the almost-sure determination of whether X(t, ω) ∈ B,
for a specific t and set B, by the behavior of X(tn , ω) at countably many tn .
The final lemma extends this to large collections of sets, and then the proof of
the theorem puts all the parts together.
Lemma 87 (Alternative Characterization of Separability) Let T be a
separable set, Ξ a compact metric space, and D a countable dense subset of T .
Define V as the class of all open balls in T centered at points in D and with
rational radii. For any G ⊂ T , let
R(G, ω) ≡ closure( ⋃_{t∈G∩D} X(t, ω) )   (8.1)

R(t, ω) ≡ ⋂_{S: S∈V, t∈S} R(S, ω)   (8.2)

Then X(t, ω) is D-separable if and only if there exists a set N ⊂ Ω such that

ω ∉ N ⇒ ∀t, X(t, ω) ∈ R(t, ω)   (8.3)

and P(N) = 0.
Proof: Roughly speaking, R(t, ω) is what we'd think the range of the function would be, in the vicinity of t, if we went just by what it did at points in the separating set D. The actual value of the function falling into this range (almost surely) is necessary and sufficient for the function to be separable. But let's speak less roughly.
“Only if”: Since X(t, ω) is D-separable, for almost all ω, for any t there is some sequence t_n ∈ D such that t_n → t and X(t_n, ω) → X(t, ω). For any ball S centered at t, there is some N such that t_n ∈ S if n ≥ N. Hence the values of x(t_n) are eventually confined to the set ⋃_{t∈S∩D} X(t, ω). Recall that the closure of a set A consists of the points x such that, for some sequence x_n ∈ A, x_n → x. As X(t_n, ω) → X(t, ω), it must be the case that X(t, ω) ∈ closure( ⋃_{t∈S∩D} X(t, ω) ). Since this applies to all S, X(t, ω) must be in the intersection of all those closures, hence X(t, ω) ∈ R(t, ω) — unless we are on one of the probability-zero bad sample paths, i.e., unless ω ∈ N.
“If”: Assume that, with probability 1, X(t, ω) ∈ R(t, ω). Thus, for any
S ∈ V , we know that there exists a sequence of points tn ∈ S ∩ D such that
X(tn , ω) → X(t, ω). However, this doesn’t say that tn → t, which is what we
need for separability. We will now build such a sequence. Consider a series
of spheres Sk ∈ V such that (i) every point in Sk is within a distance 2−k of
(k)
t and (ii) Sk+1 ⊂ Sk . For each Sk , there is a sequence tn ∈ Sk such that
(k) (k)
X(tn , ω) → X(t, ω). In fact, for any m > 0, |X(tn , ω) − X(t, ω)| < 2−m if
n ≥ N (k, m), for some N (k, m). Our final sequence of indices ti then consists of
(1) (2)
the following points: tn for n from N (1, 1) to N (1, 2); tn for n from N (2, 2)
(k)
to N (2, 3); and in general tn for n from N (k, k) to N (k, k + 1). Clearly, ti → t,
and X(ti , ω) → X(t, ω). Since every ti ∈ D, we have shown that X(t, ω) is
D-separable.
has probability 0.
Mn is the set where the random function, evaluated at the first n indices, gives a
value in our favorite set; it’s clearly measurable. Ln (t), also clearly measurable,
gives the collection of points in Ω where, if we chose t for the next point in
the collection, this will break down. pn is the worst-case probability of this
happening. For each t, Ln+1 (t) ⊆ Ln (t), so pn+1 ≤ pn . Suppose pn = 0;
then we’ve found the promised denumerable sequence, and we’re done. Suppose
instead that p_n > 0. Pick any t such that P(L_n(t)) ≥ (1/2) p_n, and call it t_{n+1}.
(There has to be such a point, or else pn wouldn’t be the supremum.) Now notice
that L1 (t2 ), L2 (t3 ), . . . Ln (tn+1 ) are all mutually exclusive, but not necessarily
jointly exhaustive. So
1 = P(Ω)   (8.8)
≥ P( ⋃_n L_n(t_{n+1}) )   (8.9)
= Σ_n P(L_n(t_{n+1}))   (8.10)
≥ Σ_n (1/2) p_n > 0   (8.11)
so pn → 0 as n → ∞.
We saw that Ln (t) is a monotone-decreasing sequence of sets, for each t,
so a limiting set exists, and in fact limn Ln (t) = N (t, B). So, by monotone
convergence,
P(N(t, B)) = P( lim_n L_n(t) )   (8.12)
= lim_n P(L_n(t))   (8.13)
≤ lim_n p_n   (8.14)
= 0   (8.15)
as was to be shown.
to be N(t, B), and define N(t) ≡ ⋃_{B∈B_0} N(t, B). Since N(t) is a countable union of probability-zero events, it is itself a probability-zero event. Now, take any B ∈ B, and any B_0 ∈ B_0 such that B ⊆ B_0. Then
{X(t, ω) ∉ B_0} ∩ ( ⋂_{n=1}^∞ {X(t_n, ω) ∈ B} )   (8.17)
⊆ {X(t, ω) ∉ B_0} ∩ ( ⋂_{n=1}^∞ {X(t_n, ω) ∈ B_0} )
⊆ N(t)   (8.18)

Since B = ⋂_k B_0^{(k)} for some sequence of sets B_0^{(k)} ∈ B_0, it follows (via De Morgan’s laws and the distributive law) that

{X(t, ω) ∉ B} = ⋃_{k=1}^∞ {X(t, ω) ∉ B_0^{(k)}}   (8.19)

{X(t, ω) ∉ B} ∩ ( ⋂_{n=1}^∞ {X(t_n, ω) ∈ B} )
= ⋃_{k=1}^∞ [ {X(t, ω) ∉ B_0^{(k)}} ∩ ( ⋂_{n=1}^∞ {X(t_n, ω) ∈ B} ) ]   (8.20)
⊆ ⋃_{n=1}^∞ N(t)   (8.21)
= N(t)   (8.22)
things can go wrong. Set I = ⋃_{S∈V} I(S) and N(t) = ⋃_{S∈V} N_S(t). Because V is countable, I is still a countable set of indices, and N(t) is still of measure zero. I is going to be our separating set, and we're going to show that having uncountably many sets N(t) won't be a problem.
Define X̃(t, ω) = X(t, ω) if t ∈ I or ω ∉ N(t) — if we're at a time in the separating set, or we're at some other time but have avoided the bad set, we'll just copy our original random function. What to do otherwise, when t ∉ I and ω ∈ N(t)? Construct R(t, ω), as in the proof of Lemma 87, and let X̃(t, ω) take any value in this set. Since R(t, ω) depends only on the value of the function at indices in the separating set, it doesn't matter whether we build it from X or from X̃. In fact, for all t and ω, X̃(t, ω) ∈ R(t, ω), so, by Lemma 87, X̃(t, ω) is separable. Finally, for every t, {X̃(t, ω) ≠ X(t, ω)} ⊆ N(t), so ∀t, P(X̃(t) = X(t)) = 1, and X̃ is a version of X (Definition 75).
and have them mean something. Irritatingly, this will require another modifi-
cation.
2 If you want to be really picky, define a 1-1 function h : Ξ → Ξ̃ taking points to their
Proof: See Gikhman and Skorokhod (1965/1969, ch. IV, sec. 3, thm. 1,
p. 157).
Part III

Markov Processes

Chapter 9

Markov Processes
Lemma 103 (The Markov Property Extends to the Whole Future) Let X_{t+} stand for the collection of X_u, u > t. If X is Markov, then X_{t+} ⫫ F_t | X_t.
specifically wanted to show that independence was not a necessary condition for
the law of large numbers to hold, because his arch-enemy claimed that it was,
and used that as grounds for believing in free will and Christianity.1 It turns
out that all the key limit theorems of probability — the weak and strong laws of
large numbers, the central limit theorem, etc. — work perfectly well for Markov
processes, as well as for IID variables.
The other route to the Markov property begins with completely deterministic
systems in physics and dynamics. The state of a deterministic dynamical system
is some variable which fixes the value of all present and future observables.
As a consequence, the present state determines the state at all future times.
However, strictly deterministic systems are rather thin on the ground, so a
natural generalization is to say that the present state determines the distribution
of future states. This is precisely the Markov property.
Remarkably enough, it is possible to represent any one-parameter stochastic
process X as a noisy function of a Markov process Z. The shift operators give
a trivial way of doing this, where the Z process is not just homogeneous but
actually fully deterministic. An equally trivial, but slightly more probabilistic,
approach is to set Z_t = X_t^−, the complete past up to and including time t. (This is not necessarily homogeneous.) It turns out that, subject to mild topological conditions on the space X lives in, there is a unique non-trivial representation where Z_t = ε(X_t^−) for some function ε, Z_t is a homogeneous Markov process, and X_u ⫫ σ({X_t, t ≤ u}) | Z_t. (See Knight (1975, 1992); Shalizi and Crutchfield (2001).) We may explore such predictive Markovian representations at the end of the course, if time permits.
= (µ ⊗ ν)(x, Ξ × B) (9.2)
1 I am not making this up. See Basharin et al. (2004) for a nice discussion of the origin of
Markov chains and of Markov’s original, highly elegant, work on them. There is a translation
of Markov’s original paper in an appendix to Howard (1971), and I dare say other places as
well.
Intuitively, all the product does is say that the probability of starting at the point x and landing in the set B is equal to the probability of first going to y and then ending in B, integrated over all intermediate points y. (Strictly speaking, there is an abuse of notation in Eq. 9.2, since the second kernel in a composition ⊗ should be defined over a product space, here Ξ × Ξ. So suppose we have such a kernel ν′, only ν′((x, y), B) = ν(y, B).) Finally, observe that if µ(x, ·) = δ_x, the delta function at x, then (µν)(x, B) = ν(x, B), and similarly that (νµ)(x, B) = ν(x, B).
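On a finite state space this intuition is exact and easy to check: a kernel is a row-stochastic matrix and composition is matrix multiplication. A sketch (Python, with arbitrary random 4-state kernels of my choosing):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_kernel(k):
    """A kernel on a k-point space: M[x, y] = mu(x, {y}), rows summing to 1."""
    M = rng.random((k, k))
    return M / M.sum(axis=1, keepdims=True)

mu, nu = random_kernel(4), random_kernel(4)
composed = mu @ nu                   # (mu nu)(x, {z}) = sum_y mu(x, {y}) nu(y, {z})

delta = np.eye(4)                    # mu(x, .) = delta_x, the identity kernel
assert np.allclose(delta @ nu, nu)   # (delta nu) = nu
assert np.allclose(nu @ delta, nu)   # (nu delta) = nu
```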
As with the shift semi-group, this is really a monoid (because µt,t acts as
the identity).
The major theorem is the existence of Markov processes with specified tran-
sition kernels.
νs = νt µt,s (9.3)
L (Xt ) = νt (9.4)
and ∀t1 ≤ t2 ≤ . . . ≤ tn ,
Proof: (From transition kernels to a Markov process.) For any finite set of
times J = {t1 , . . . tn } (in ascending order), define a distribution on ΞJ as
It is easily checked, using point (2) in the definition of a transition kernel semi-
group (Definition 105), that the νJ form a projective family of distributions.
Thus, by the Kolmogorov Extension Theorem (Theorem 29), there exists a
stochastic process whose finite-dimensional distributions are the νJ . Now pick
a J of size n, and two sets, B ∈ X n−1 and C ∈ X .
P(X_J ∈ B × C) = ν_J(B × C)   (9.8)
= E[1_{B×C}(X_J)]   (9.9)
= E[1_B(X_{J∖t_n}) µ_{t_{n−1},t_n}(X_{t_{n−1}}, C)]   (9.10)
= ν(B) (9.18)
The term “equilibrium” comes from statistical physics, where however its meaning is a bit more strict, in that “detailed balance” must also be satisfied: for any two sets A, B ∈ X,

∫ ν(dx) 1_A(x) µ_t(x, B) = ∫ ν(dx) 1_B(x) µ_t(x, A)   (9.19)
i.e., the flow of probability from A to B must equal the flow in the opposite
direction. Much confusion has resulted from neglecting the distinction between
equilibrium in the strict sense of detailed balance and equilibrium in the weaker
sense of invariance.
Theorem 108 (Stationarity and Invariance for Homogeneous Markov
Processes) Suppose X is homogeneous, and L (Xt ) = ν, where ν is an invari-
ant distribution. Then the process Xt+ is stationary.
Proof: Exercise 9.4.
The next-to-last line uses the fact that X_s ⫫ G_t | X_t, because X is Markovian with respect to {G_t}, and this in turn implies that conditioning X_s, or any function thereof, on G_t is equivalent to conditioning on X_t alone. (Recall that X_t is G_t-measurable.) The last line uses the facts that (i) E[f(X_s)|X_t] is a function of X_t, (ii) X is adapted to {F_t}, so X_t is F_t-measurable, and (iii) if Y is F-measurable, then E[Y|F] = Y. Since this holds for all f ∈ L_1, it holds in particular for 1_A, where A is any measurable set, and this establishes the conditional independence which constitutes the Markov property. Since (Lemma 111) the natural filtration is the coarsest filtration to which X is adapted, the remainder of the theorem follows.
The converse is false, as the following example shows.
The issue can be illustrated with graphical models (Spirtes et al., 2001; Pearl,
1988). A discrete-time Markov process looks like Figure 9.1a. Xn blocks all the
pasts from the past to the future (in the diagram, from left to right), so it
produces the desired conditional independence. Now let’s add another variable
which actually drives the Xn (Figure 9.1b). If we can’t measure the Sn variables,
just the Xn ones, then it can still be the case that we’ve got the conditional
independence among what we can see. But if we can see Xn as well as Sn —
which is what refining the filtration amounts to — then simply conditioning on
Xn does not block all the paths from the past of X to its future, and, generally
speaking, we will lose the Markov property. Note that knowing Sn does block all
paths from past to future — so this remains a hidden Markov model. Markovian
representation theory is about finding conditions under which we can get things
to look like Figure 9.1b, even if we can’t get them to look like Figure 9.1a.
[Diagram: (a) the chain X1 → X2 → X3 → ⋯; (b) the hidden chain S1 → S2 → S3 → ⋯, with each Sn also emitting the observed Xn.]
Figure 9.1: (a) Graphical model for a Markov chain. (b) Refining the filtration,
say by conditioning on an additional random variable, can lead to a failure of
the Markov property.
9.4 Exercises
Exercise 9.1 (Extension of the Markov Property to the Whole Fu-
ture) Prove Lemma 103.
process. Suppose that, for all t, X_{t−1} ⫫ X_{t+1} | X_t. Either prove that X is a Markov process, or provide a counter-example. You may assume that Ξ is a Borel space if you find that helpful.
for some finite integer k. For any Ξ-valued discrete-time process X, define the k-block process Y^{(k)} as the Ξ^k-valued process where Y_t^{(k)} = (X_t, X_{t+1}, ..., X_{t+k−1}).
The second result shows that studying the theory of first-order Markov processes involves no essential loss of generality. For a test of the hypothesis that X is Markovian of order k against the alternative that it is Markovian of order l > k, see Billingsley (1961). For recent work on estimating the order of a Markov process, assuming it is Markovian to some finite order, see the elegant paper by Peres and Shields (2005).
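Building the k-block process from a realization is mechanical; a toy sketch (Python, on an arbitrary binary sequence of my choosing):

```python
def k_block(xs, k):
    """The k-block process: Y_t = (X_t, X_{t+1}, ..., X_{t+k-1})."""
    return [tuple(xs[t:t + k]) for t in range(len(xs) - k + 1)]

x = [0, 1, 1, 0, 1, 0, 0, 1]
print(k_block(x, 2))   # [(0, 1), (1, 1), (1, 0), (0, 1), (1, 0), (0, 0), (0, 1)]
```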
where e_1 is the unit vector along the first coordinate axis.

1. Prove that AR(1) models are Markov for all choices of a_0 and a_1, and all distributions of X(0) and ε.
Chapter 10

Alternative Characterizations of Markov Processes
Proof: Kallenberg, Proposition 8.6, p. 145. Notice that, in order to get the “only if” direction to work, Kallenberg invokes what we have as Proposition 26, which is where the assumption that Ξ is a Borel space comes in. You should verify that the “if” direction does not require this assumption.
Let us stick to the homogeneous case, and consider the function f in some-
what more detail.
ν_s = ν_t µ_{t,s}   (10.1)

ν_s(B) = ∫ ν_t(dx) µ_{t,s}(x, B)   (10.2)
M (a1 µ1 + a2 µ2 ) = a1 M µ1 + a2 M µ2 (10.3)
Note that every function in B(Ξ) is in L∞ (µ) for each µ, so the definition
of an L∞ transition operator is actually stricter than that of a plain transition
operator. This is unlike the case with Markov operators, where the L1 version
is weaker than the unrestricted version.
M ν = νκ (10.4)
Proof: Follows from the fact that V † is a vector space, and each gi is a
linear operator.
You are already familiar with an example of a conjugate space.
Here is the simplest example where the conjugate space is not equal to the
original space.
Example 127 (Row and Column Vectors) The space of row vectors is conjugate to the space of column vectors, since every continuous linear functional of a column vector x takes the form of y^T x for some other column vector y.
Example 128 (Lp spaces) The function spaces L_p(µ) and L_q(µ) are conjugate to each other, when 1/p + 1/q = 1, and the inner product is defined through

⟨f, g⟩ ≡ ∫ f g dµ   (10.9)

for all f ∈ V, g ∈ V†.
Conversely, let X be any stochastic process, and let Kt,s be a semi-group of tran-
sition operators such that Equation 10.12 is valid (a.s.). Then X is a Markov
process.
Lemma 134 (Markov Operators are Contractions) For any Markov operator P and f ∈ L_1,

‖P f‖ ≤ ‖f‖   (10.13)

Proof (after Lasota and Mackey (1994, prop. 3.1.1, pp. 38–39)): First, notice that (P f(x))^+ ≤ P f^+(x), because

(P f(x))^+ = (P f^+ − P f^−)^+ = max(0, P f^+ − P f^−) ≤ max(0, P f^+) = P f^+

Similarly (P f(x))^− ≤ P f^−(x). Therefore |P f| ≤ P|f|, and then the statement follows by integration.
10.3 Exercises
Exercise 10.1 (Kernels and Operators) Prove Lemma 121. Hints: 1. You
will want to use the fact that 1A ∈ B(Ξ) for all measurable sets A. 2. In going
back and forth between transition kernels and transition operators, you may find
Proposition 32 helpful.
Chapter 11

Examples of Markov Processes
P(X_{n+1} ≤ x) = P(X_{n+1} ∈ [0, x])   (11.1)
= P(X_n ∈ F^{−1}([0, x]))   (11.2)
= P(X_n ∈ [0, (1/2)(1 − √(1−x))] ∪ [(1/2)(1 + √(1−x)), 1])   (11.3)
= ∫_0^{(1/2)(1−√(1−x))} ρ_n(y) dy + ∫_{(1/2)(1+√(1−x))}^1 ρ_n(y) dy   (11.4)

Differentiating with respect to x gives the density of X_{n+1}:

ρ_{n+1}(x) = ρ_n((1/2)(1 − √(1−x))) (d/dx)[(1/2)(1 − √(1−x))]
           − ρ_n((1/2)(1 + √(1−x))) (d/dx)[(1/2)(1 + √(1−x))]   (11.7)
= (1/(4√(1−x))) [ρ_n((1/2)(1 − √(1−x))) + ρ_n((1/2)(1 + √(1−x)))]   (11.8)
Notice that this defines a linear operator taking densities to densities. (You
should verify the linearity.) In fact, this is a Markov operator, by the terms of
Definition 117. Markov operators of this sort, derived from deterministic maps,
are called Perron-Frobenius or Frobenius-Perron operators, and accordingly de-
noted by P . Thus an invariant density is a ρ∗ such that ρ∗ = P ρ∗ . All the
problem asks us to do is to verify that ρ*(x) = 1/(π√(x(1−x))) is such a solution.

ρ*((1/2)(1 − √(1−x)))   (11.9)
= (1/π) [(1/2)(1 − √(1−x))]^{−1/2} [1 − (1/2)(1 − √(1−x))]^{−1/2}
= (1/π) [(1/2)(1 − √(1−x))]^{−1/2} [(1/2)(1 + √(1−x))]^{−1/2}   (11.10)
= 2/(π√x)   (11.11)

The same computation gives ρ*((1/2)(1 + √(1−x))) = 2/(π√x), so, substituting into Eq. 11.8, Pρ*(x) = (1/(4√(1−x)))(4/(π√x)) = 1/(π√(x(1−x))) = ρ*(x), as desired.
By Lemma 135, for any distribution ρ, ‖P^n ρ − P^n ρ*‖ is a non-increasing function of n. However, P^n ρ* = ρ*, so the iterates of any distribution, under the map, approach the invariant distribution monotonically. It would be very handy if we could show that any initial distribution ρ eventually converged on ρ*, i.e. that ‖P^n ρ − ρ*‖ → 0. When we come to ergodic theory, we will see conditions under which such distributional convergence holds, as it does for the logistic map, and learn how such convergence in distribution is connected to both pathwise convergence properties, and to the decay of correlations.
In Exercise 4.1, you showed that a process satisfying points (1)–(3) above
exists. To show that any such process has a version with continuous sample
paths, invoke Theorem 97 (Exercise 11.3). Here, however, we will show that
it is a homogeneous Markov process, and find both the transition kernels and
the evolution operators. Markovianity follows from the independent-increments property (2), while homogeneity and the form of the transition operators come from the Gaussian assumption (3).
First, here’s how the Gaussian increments property gives us the transition
probabilities:
$$\begin{aligned}
P(W(t_2) \in B \mid W(t_1) = w_1) &= P(W(t_2) - W(t_1) \in B - w_1) && (11.15)\\
&= \int_{B - w_1} \frac{1}{\sqrt{2\pi(t_2 - t_1)}}\,e^{-\frac{u^2}{2(t_2 - t_1)}}\,du && (11.16)\\
&= \int_B \frac{1}{\sqrt{2\pi(t_2 - t_1)}}\,e^{-\frac{(w_2 - w_1)^2}{2(t_2 - t_1)}}\,dw_2 && (11.17)\\
&\equiv \mu_{t_1, t_2}(w_1, B) && (11.18)
\end{aligned}$$
Since this evidently depends only on t2 − t1 , and not the individual times, the
process is homogeneous. Now, assuming homogeneity, we can find the time-
evolution operators for well-behaved observables:
$$\begin{aligned}
K_t f(w) &= E\left[f(W(s+t)) \mid W(s) = w\right] && (11.19)\\
&= \int f(u)\,\mu_t(w, du) && (11.20)\\
&= \int f(u)\,\frac{1}{\sqrt{2\pi t}}\,e^{-\frac{(u-w)^2}{2t}}\,du && (11.21)\\
&= E\left[f(w + \sqrt{t}\,Z)\right] && (11.22)
\end{aligned}$$
where $Z$ is a standard Gaussian random variable.
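Since Eq. 11.22 is just a Gaussian average, it can be evaluated by Monte Carlo; here is a small sketch (the observable $f = \cos$ and the values of $t$ and $w$ are arbitrary choices, and for $\cos$ the integral is known exactly, $K_t \cos(w) = e^{-t/2}\cos w$, which gives a check):

    import numpy as np

    rng = np.random.default_rng(1)

    # Evolution operator of the Wiener process, Eq. 11.22, by Monte Carlo:
    # K_t f(w) = E[f(w + sqrt(t) Z)], with Z ~ N(0,1).
    def K(t, f, w, n_samples=10**6):
        Z = rng.standard_normal(n_samples)
        return f(w + np.sqrt(t) * Z).mean()

    t, w = 2.0, 0.5
    print(K(t, np.cos, w), np.exp(-t / 2) * np.cos(w))  # both approx. 0.323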
Proof: The joint distribution for any finite collection of times factors into
a product of incremental distributions:
$$K_t f(x) = E\left[f(x + Z_t)\right] \quad (11.27)$$
Another way to phrase our conclusion is that $\nu_t = \nu_{t/2} \star \nu_{t/2}$. Clearly, this argument could be generalized to relate $\nu_t$ to $\nu_{t/n}$ for any $n$:
$$\nu_t = \nu_{t/n}^{\star n} \quad (11.34)$$
where the IID $D_i \sim \nu_{t/n}$. This is a very curious-looking property with a name.
that of $X(1/n)$; inverting the latter then gives the desired distribution. Because increments are independent, $X(n + 1/m) \overset{d}{=} X(n) + X(1/m)$, hence $\mathcal{L}(X(1))$ fixes the increment distribution for all rational time-intervals. By continuity, however, this also fixes it for time intervals of any length. Since the distribution of increments and the condition $X(0) = 0$ fix the finite-dimensional distributions, it remains only to show that the process is cadlag, which can be done by observing that it is continuous in probability, and then using Theorem 96.
We have established a correspondence between the limiting distributions of
IID sums, and processes with stationary independent increments. There are
many possible applications of this correspondence; one is to reduce problems
about limits of IID sums to problems about certain sorts of Markov process,
and vice versa. Another is to relate discrete and continuous time processes. For
$0 \le t \le 1$, set
$$X^{(n)}_t = \frac{1}{\sqrt{n}} \sum_{i=0}^{\lfloor nt \rfloor} Y_i \quad (11.36)$$
where $Y_0 = 0$ but otherwise the $Y_i$ are IID with mean 0 and variance 1. Then each $X^{(n)}$ is a Markov process in discrete time, with stationary, independent increments, and its time-evolution operator is
$$K_n f(x) = E\left[f\left(x + \frac{1}{\sqrt{n}} \sum_{i=1}^{\lfloor nt \rfloor} Y_i\right)\right] \quad (11.37)$$
As $n$ grows, the normalized sum approaches $\sqrt{t}\,Z$, with $Z$ a standard Gaussian random variable. (Where does the $\sqrt{t}$ come from?) Since the time-evolution
operators determine the finite-dimensional distributions of a Markov process
(Theorem 133), this suggests that in some sense X (n) should be converging
in distribution on the Wiener process W . This is in fact true, but we will
have to go deeper into the structure of the operators concerned before we can
make it precise, as a first form of the functional central limit theorem. Just
as the infinitely divisible distributions are the limits of IID sums, processes
with stationary and independent increments are the limits of the corresponding
random walks.
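Here is a small simulation of Eq. 11.36, for what it's worth, checking that at a fixed time $t$ the scaled sum has variance close to $t$ (the $\pm 1$ steps and the sample sizes are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)

    # X^(n)_t = n^{-1/2} sum_{i <= nt} Y_i with IID +/-1 steps (mean 0, var 1).
    t = 0.64
    for n in (10, 100, 10000):
        steps = rng.choice([-1.0, 1.0], size=(100000, int(n * t)))
        X_nt = steps.sum(axis=1) / np.sqrt(n)
        print(f"n = {n:6d}: sample variance = {X_nt.var():.4f} (target t = {t})")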
There is yet a further sense in which the Gaussian distribution is a special kind
of limiting distribution, which is reflected in another curious property of the
Wiener process. Gaussian random variables are not only infinitely divisible, but
they can be additively decomposed into more Gaussians. Distributions which
can be infinitely divided into others of the same kind will be called “stable”
(under convolution or averaging). “Of the same kind” isn’t very mathematical;
here is a more precise expression of the idea.
11.4 Exercises
Exercise 11.1 (Wiener Process with Constant Drift) Consider a process $X(t)$ which, like the Wiener process, has $X(0) = 0$ and independent increments, but where $X(t_2) - X(t_1) \sim N(a(t_2 - t_1), \sigma^2(t_2 - t_1))$. $a$ is called the drift rate
and σ 2 the diffusion constant. Show that X(t) is a Markov process, following
the argument for the standard Wiener process (a = 0, σ 2 = 1) above. Do such
processes have continuous modifications for all (finite) choices of a and σ 2 ? If
so, prove it; if not, give at least one counter-example.
Exercise 11.3 (Continuity of the Wiener Process) Show that the Wiener
process has continuous sample paths, using its finite-dimensional distributions
and Theorem 97.
Chapter 12

Generators of Markov Processes
The strategy this suggests is to look for some other operator $G$ such that
$$K_t = e^{tG} \equiv \sum_{i=0}^{\infty} \frac{t^i G^i}{i!}$$
Such an operator $G$ is called the generator of the process, and the purpose of this chapter is to work out the conditions under which this analogy can be carried through.

In the exponential function case, we notice that $g$ can be extracted by taking the derivative at zero: $\left.\frac{d}{dt} e^{tg}\right|_{t=0} = g$. This suggests the following definition.
Proof: Since $\nu$ is invariant, $\nu K_t = \nu$ for all $t$, hence $\nu K_t f = \nu f$ for all $t \ge 0$ and all $f$. Since taking expectations with respect to a measure is a linear operator, $\nu(K_t f - f) = 0$, and obviously then (Eq. 12.1) $\nu G f = 0$.
Remark: Once again, $\nu G f = 0$ for all $f$ is not enough, in itself, to show that $\nu$ is an invariant measure.
You will usually see the definition of the generator written with f instead
of K0 f , but I chose this way of doing it to emphasize that G is, basically,
the derivative at zero, that $G = dK/dt|_{t=0}$. Recall, from calculus, that the exponential functions $k^t$ can be defined by the fact that $\frac{d}{dt} k^t \propto k^t$ (and $e$ can be defined as the $k$ such that the constant of proportionality is 1). As part of our program, we will want to extend this differential point of view. The next lemma builds towards it, by showing that if $f \in \mathrm{Dom}(G)$, then $K_t f$ is too.
$$u'(t_0, x) = \lim_{t \to t_0} \frac{u(t, x) - u(t_0, x)}{t - t_0} \quad (12.5)$$
exists in the $L$ sense, then we say that $u'(t_0)$ is the time derivative or strong derivative of $u(t)$ at $t_0$.
when the density in the last expression exists. You may recall, from this context, that the distributions of well-behaved random variables are completely specified by their moment-generating functions; this is actually a special case of a more general result about when functions are uniquely described by their Laplace transforms, i.e., when $f$ can be expressed uniquely in terms of $\tilde{f}$. This is important to us, because it turns out that the Laplace transform, so to speak, of a semi-group of operators is better-behaved than the semi-group itself, and we'll want to say when we can use the Laplace transform to recover the semi-group.
The analogy with exponential functions, again, is a key. Notice that, for any positive constant $\lambda$,
$$\int_{t=0}^{\infty} e^{-\lambda t}\,e^{tg}\,dt = \frac{1}{\lambda - g} \quad (12.8)$$
from which we could recover $g$, as that value of $\lambda$ for which the Laplace transform is singular. In our analogy, we will want to take the Laplace transform of the semi-group. Just as $\tilde{f}(\lambda)$ is another real number, the Laplace transform of a semi-group of operators is going to be another operator. We will want that to be the inverse operator to $\lambda - G$.
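The scalar identity in Eq. 12.8 is easy to check numerically; a one-line sketch, with arbitrary $g$ and $\lambda > g$ (the operator version replaces $g$ by $G$ and the left-hand side by the resolvent):

    import numpy as np
    from scipy.integrate import quad

    # Laplace transform of t -> e^{tg} equals 1/(lambda - g) for lambda > g.
    g, lam = 0.7, 2.0
    value, _ = quad(lambda t: np.exp((g - lam) * t), 0, np.inf)
    print(value, 1 / (lam - g))  # both approx. 0.7692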
assuming all of the sums inside the limit converge, which is yet to be shown.
Remark 3: The point of condition (2) in the theorem is that, when it holds, $(\lambda I - G)^{-1}$ is well-defined, i.e., there is an inverse to the operator $\lambda I - G$. This is, recall, what we would like the resolvent to be, if the analogy with exponential functions is to hold good.
Remark 4: If we start from the semi-group $K_t$ and obtain its generator $G$, the theorem tells us that it satisfies properties (1)–(3). If we start from an operator $G$ and see that it satisfies (1)–(3), the theorem tells us that it generates some semi-group. It might seem, however, that it doesn't tell us how to construct that semi-group, since the Yosida approximation involves the resolvent $R_\lambda$, which is defined in terms of the semi-group, creating an air of circularity. In fact, when we start from $G$, we decree $R_\lambda$ to be $(\lambda I - G)^{-1}$. The Yosida approximation is then defined in terms of $G$ and $\lambda$ alone.
where the Xi are independent copies of the Markov process corresponding to the
semi-group Kt generated by G, with initial condition Xi (0) = x.
12.1 Exercises
Exercise 12.1 (Generators are Linear) Prove Lemma 152.
When w is less than zero or above 2π, f (w) moves along the x axis of the plane;
in between, it moves along a circle of radius 1, centered at (0, 1), which it enters
and leaves at the origin. Notice that f is invertible everywhere except at the
origin, which is ambiguous between w = 0 and w = 2π.
Let $X(t) = f(W(t) + \pi)$, where $W(t)$ is a standard Wiener process. At all $t$, $P(W(t) + \pi = 0) = P(W(t) + \pi = 2\pi) = 0$, so, with probability 1, $X(t)$ can be inverted to get $W(t)$. Since $W(t)$ is a Markov process, it follows that $P(X(t+h) \in B \mid X(t) = x) = P\left(X(t+h) \in B \mid \mathcal{F}^X_t\right)$ almost surely, i.e., $X$ is Markov. Now consider $\tau = \inf\{t : X(t) = (0,0)\}$, the hitting time of the origin. This is clearly an $\mathcal{F}^X$-optional time, and equally clearly almost surely finite, because, with probability 1, $W(t)$ will leave the interval $(-\pi, \pi)$ within a finite time. But, equally clearly, the future behavior of $X$ will be very different if it hits the origin because $W = \pi$ or because $W = -\pi$, which cannot be determined just from $X$. Hence, there is at least one optional time at which $X$ is not strongly Markovian, so $X$ is not a strong Markov process.
Since we often want to condition on the state of the process at random times,
we would like to find conditions under which a process is strongly Markovian
for all optional times.
for any t ≥ 0 and f ∈ Dom(G). The relationship between Kt f and the condi-
tional expectation of f suggests the following definition.
Proof: Take the definition of a martingale and re-arrange the terms in Eq.
13.5.
Martingale problems are important because of the two following theorems
(which can both be refined considerably).
of Markov processes converges on a limit which is, itself, a Markov process. One
approach is to show that the terms in the sequence solve martingale problems
(via Theorem 171), argue that then the limiting process does too, and finally
invoke Theorem 172 to argue that the limiting process must itself be strongly
Markovian. This is often much easier than showing directly that the limiting
process is Markovian, much less strongly Markovian. Theorem 172 itself is often
a convenient way of showing that the strong Markov property holds.
13.3 Exercises
Exercise 13.1 (Strongly Markov at Discrete Times) Let X be a homoge-
neous Markov process with respect to a filtration {Ft } and τ be an {Ft }-optional
time. Prove that if P (τ < ∞) = 1, and τ takes on only countably many values,
then X is strongly Markovian at τ . (Note: the requirement that X be homo-
geneous can be lifted, but requires some more technical machinery I want to
avoid.)
Chapter 14

Feller Processes
Section 14.1 makes explicit the idea that the transition kernels
of a Markov process induce a kernel over sample paths, mostly to
fix notation for later use.
Section 14.2 defines Feller processes, which link the cadlag and
strong Markov properties.
If ν = δ(x − a), the delta distribution at a, then we write Xa and call it the
Markov process with initial state a.
The existence of processes with given initial distributions and initial states
is a trivial consequence of Theorem 106, our general existence result for Markov
processes.
Lemma 174 (Kernel from Initial States to Paths) For every initial state $x$, there is a probability distribution $P_x$ on $\left(\Xi^T, \mathcal{X}^T\right)$. The function $P_x(A): \Xi \times \mathcal{X}^T \to [0,1]$ is a probability kernel.
That is, rather than thinking of a different stochastic process for each initial
state, we can simply think of different distributions over the path space ΞT .
This suggests the following definition.
Remark 1: The first property basically says that the dynamics are a smooth
function of the initial state. Recall1 that if we have an ordinary differential
equation, dx/dt = f (x), and the function f is reasonably well-behaved, the ex-
istence and uniqueness theorem tells us that there is a function x(t, x0 ) satisfying
the equation, and such that x(0, x0 ) = x0 . Moreover, x(t, x0 ) is a continuous
function of x0 for all t. The first Feller property is one counterpart of this for
stochastic processes. This is a very natural assumption to make in physical or
social modeling, that very similar initial conditions should lead to very similar
developments.
Remark 2: The second property imposes a certain amount of smoothness
on the trajectories themselves, and not just how they depend on the initial
state. It's a pretty well-attested fact about the world that teleportation does not take place, and that the world's state changes in a reasonably continuous manner ("natura non facit saltum", as they used to say). However, the second Feller
property does not require that the sample paths actually be continuous. We
will see below that they are, in general, merely cadlag. So a certain limited
amount of teleportation, or salti, is allowed after all. We do not think this
actually happens, but it is a convenience when using a discrete set of states to
approximate a continuous variable.
In developing the theory of Feller processes, we will work mostly with the
time-evolution operators, acting on observables, rather than the Markov oper-
ators, acting on distributions. This is traditional in this part of the theory, as
it seems to let us get away with less technical machinery, in effect because the norm $\sup_x |f(x)|$ is stronger than the norm $\int |f(x)|\,dx$. Of course, because the
two kinds of operators are adjoint, you can work out everything for the Markov
operators, if you want to.
As usual, we warm up with some definitions. The first two apply to operators
on any normed linear space L, which norm we generically write as k · k. The
second two apply specifically when L is a space of real-valued functions on some
Ξ, such as Lp , p from 1 to ∞ inclusive.
Lemma 185 (The First Pair of Feller Properties) Eq. 14.11 holds if and
only if Eq. 14.3 does.
Lemma 186 (The Second Pair of Feller Properties) Eq. 14.12 holds if
and only if Eq. 14.4 does.
Proof: From Corollary 159, we have, as seen in Chapter 13, for all $t \ge 0$,
$$K_t f - f = \int_0^t K_s Gf\,ds \quad (14.13)$$
Theorem 193 (Feller Implies Cadlag) Every Feller process X has a cadlag
version.
Proof: The strong Markov property holds if and only if, for all bounded,
continuous functions f , t ≥ 0 and F X+ -optional times τ ,
We’ll show this holds for arbitrary, fixed choices of f , t and τ . First, we discretize
time, to exploit the fact that the Markov and strong Markov properties coincide
for discrete parameter processes. For every h > 0, set
$$E\left[f(X_h(n+m)) \mid \mathcal{F}^{X_h}_n\right] = K_{mh} f(X_h(n)) \quad (14.25)$$
Since the Markov and strong Markov properties coincide for Markov sequences,
we can now assert that
$$E\left[f(X(\tau_h + mh)) \mid \mathcal{F}^X_{\tau_h}\right] = K_{mh} f(X(\tau_h)) \quad (14.26)$$
2 Roughly speaking, if $f(x_n) \to f(x)$ for all continuous functions $f$, it should be obvious that there is no way to avoid having $x_n \to x$. Picking a countable dense subset of functions is still enough.
Since $\tau_h \ge \tau$, $\mathcal{F}^X_\tau \subseteq \mathcal{F}^X_{\tau_h}$. Now pick any set $B \in \mathcal{F}^{X+}_\tau$ and use smoothing:
$$E\left[f(X(\tau_h + t))\,1_B\right] = E\left[K_t f(X(\tau_h))\,1_B\right] \quad (14.27)$$
$$E\left[f(X(\tau + t))\,1_B\right] = E\left[K_t f(X(\tau))\,1_B\right] \quad (14.28)$$
where we let $h \downarrow 0$, and invoke the fact that $X(t)$ is right-continuous (Theorem 193) and $K_t f$ is continuous. Since this holds for arbitrary $B \in \mathcal{F}^{X+}_\tau$, and $K_t f(X(\tau))$ has to be $\mathcal{F}^{X+}_\tau$-measurable, we have that
$$E\left[f(X(\tau + t)) \mid \mathcal{F}^{X+}_\tau\right] = K_t f(X(\tau)) \quad (14.29)$$
as required.
Here is a useful consequence of the Feller property, related to the martingale-problem properties we saw last time.
Theorem 195 (Dynkin's Formula) Let $X$ be a Feller process with generator $G$. Let $\alpha$ and $\beta$ be two almost-surely-finite $\mathcal{F}$-optional times, $\alpha \le \beta$. Then, for every continuous $f \in \mathrm{Dom}(G)$,
$$E\left[f(X(\beta)) - f(X(\alpha))\right] = E\left[\int_\alpha^\beta Gf(X(t))\,dt\right] \quad (14.30)$$
14.3 Exercises
Exercise 14.1 (Yet Another Interpretation of the Resolvents) Consider
again a homogeneous Markov process with transition kernel µt . Let τ be an
exponentially-distributed random variable with rate λ, independent of X. Show
that E [Kτ f (x)] = λRλ f (x).
Exercise 14.2 (The First Pair of Feller Properties) Prove Lemma 185.
Hint: you may use the fact that, for measures, νt → ν if and only if νt f → νf ,
for every bounded, continuous f .
Exercise 14.3 (The Second Pair of Feller Properties) Prove Lemma 186.
Exercise 14.5 (Lévy and Feller Processes) Is every Lévy process a Feller
process? If yes, prove it. If not, provide a counter-example, and try to find a
sufficient condition for a Lévy process to be Feller.
Chapter 15
Convergence of Feller
Processes
Definition 198 (The Space D) By D(T, Ξ) we denote the space of all cadlag
functions from T to Ξ. By default, D will mean D(R+ , Ξ).
D admits of multiple topologies. For most purposes, the most convenient one
is the Skorokhod topology, a.k.a. the J1 topology or the Skorokhod J1 topology,
which makes D(Ξ) a complete separable metric space when Ξ is itself complete
and separable. (See Appendix A2 of Kallenberg.) For our purposes, we need
only the following notion and propositions.
where the infimum is over partitions of [0, t) into half-open intervals whose length
is at least h (except possibly for the last one). Because x is cadlag, for fixed x
and t, w(x, t, h) → 0 as h → 0.
The idea of a core is that we can get away with knowing how the operator
works on a linear subspace, which is often much easier to deal with, rather than
controlling how it acts on its whole domain.
Lemma 204 (Feller Generators Are Closed) The generator of every Feller
semigroup is closed.
Proof: We need to show that the graph of G contains all of its limit points,
that is, if fn ∈ Dom(G) converges (in L∞ ) on f , and Gfn → g, then f ∈ Dom(G)
and Gf = g. First we show that f ∈ Dom(G).
But $(I - G)^{-1} = R_1$. Since this is a bounded linear operator, we can exchange applying the inverse and taking the limit, i.e.,
Since the range of the resolvents is contained in the domain of the generator,
f ∈ Dom(G). We can therefore say that f − g = (I − G)f , which implies that
Gf = g. Hence, the graph of G contains all its limit points, and G is closed.
$X_n(t) \equiv Y_n(\lfloor t/h_n \rfloor)$, with corresponding semigroup $K_{n,t} = H_n^{\lfloor t/h_n \rfloor}$, and generator $A_n = (1/h_n)(H_n - I)$. Even though $X_n$ is not in general a homogeneous Markov process, the conclusions of Theorem 205 remain valid.
Observe that pure-jump Markov processes always have cadlag sample paths.
Also observe that the average amount of time the process spends in state x, once
it jumps there, is $1/\lambda(x)$. So the time-average "velocity", i.e., rate of change, starting from $x$, is
$$\lambda(x) \int_\Xi (y - x)\,\mu(x, dy)$$
Let X(s, x0 ) be the solution to the initial-value problem where the differential is
given by F , i.e., for each 0 ≤ s ≤ t,
$$\frac{\partial}{\partial s} X(s, x_0) = F(X(s, x_0)) \quad (15.15)$$
$$X(0, x_0) = x_0 \quad (15.16)$$
The first conditions on the Xn basically make sure that they are Feller
processes. The subsequent ones make sure that the mean time-averaged rate of
change of the jump processes converges on the instantaneous rate of change of
the differential equation, and that, if we’re sufficiently close to the solution of
the differential equation in Rk , we’re not in some weird way outside the relevant
domains of definition. Even though Theorem 205 is about weak convergence,
converging in distribution on a non-random object is the same as converging in
probability, which is how we get uniform-in-time convergence in probability for
a conclusion.
There are, broadly speaking, two kinds of uses for this result. One kind is
practical, and has to do with justifying convenient approximations. If n is large,
we can get away with using an ODE instead of the noisy stochastic scheme, or
alternately we can use stochastic simulation to approximate the solutions of ugly
ODEs. The other kind is theoretical, about showing that the large-population
limit behaves deterministically, even when the individual behavior is stochastic
and strongly dependent over time.
15.4 Exercises
Exercise 15.1 (Poisson Counting Process) Show that the Poisson counting
process is a pure jump Markov process.
Chapter 16

Convergence of Random Walks
Let's check that $\{K_t\}$, $t \ge 0$, is a Feller semi-group. The first Feller property is easier to check in its probabilistic form: for all $t$, $y \to x$ implies $W_y(t) \overset{d}{\to} W_x(t)$. The distribution of $W_x(t)$ is just $N(x,t)$, and it is indeed true that $y \to x$ implies $N(y,t) \to N(x,t)$. The second Feller property can be checked in its semi-group form: as $t \to 0$, $\mu_t(w_1, B)$ approaches $\delta(w - w_1)$, so $\lim_{t\to 0} K_t f(x) = f(x)$. Thus, the Wiener process is a Feller process. This implies that it has cadlag sample paths (Theorem 193), but we already knew that, since we know it's continuous. What we did not know was that the Wiener process is not just Markov but strong Markov, which follows from Theorem 194.
To find the generator of $\{K_t\}$, $t \ge 0$, it will help to re-write it in an equivalent form, as
$$K_t f(w) = E\left[f(w + Z\sqrt{t})\right] \quad (16.4)$$
where $Z$ is an independent $N(0,1)$ random variable. (We saw that this was equivalent in Section 11.2.) Now let's pick an $f \in C_0$ which is also twice
continuously differentiable, i.e., $f \in C_0 \cap C^2$. Look at $K_t f(w) - f(w)$, and apply Taylor's theorem, expanding around $w$:
$$\begin{aligned}
K_t f(w) - f(w) &= E\left[f(w + Z\sqrt{t})\right] - f(w) && (16.5)\\
&= E\left[f(w + Z\sqrt{t}) - f(w)\right] && (16.6)\\
&= E\left[Z\sqrt{t}\,f'(w) + \frac{1}{2}tZ^2 f''(w) + R(Z\sqrt{t})\right] && (16.7)\\
&= \sqrt{t}\,f'(w)\,E[Z] + t\,\frac{f''(w)}{2}\,E\left[Z^2\right] + E\left[R(Z\sqrt{t})\right] && (16.8)
\end{aligned}$$
Recalling that $E[Z] = 0$, $E\left[Z^2\right] = 1$,
$$\lim_{t\downarrow 0}\frac{K_t f(w) - f(w)}{t} = \frac{1}{2}f''(w) + \lim_{t\downarrow 0}\frac{E\left[R(Z\sqrt{t})\right]}{t} \quad (16.9)$$
So, we need to investigate the behavior of the remainder term $R(Z\sqrt{t})$. We know from Taylor's theorem that
$$R(Z\sqrt{t}) = \frac{tZ^2}{2}\int_0^1 du\,\left[f''(w + uZ\sqrt{t}) - f''(w)\right] \quad (16.10)$$
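Numerically, the limit in Eq. 16.9 is easy to watch; here is a sketch with an arbitrary smooth observable ($f = \cos$ at $w = 0.3$, for which $f''(w)/2 = -\cos(w)/2$), evaluating $K_t$ by Monte Carlo as in Eq. 16.4:

    import numpy as np

    rng = np.random.default_rng(3)

    # (K_t f - f)/t should approach f''(w)/2 as t -> 0 (Eq. 16.9).
    w = 0.3
    Z = rng.standard_normal(10**7)
    for t in (1.0, 0.1, 0.01):
        Ktf = np.cos(w + Z * np.sqrt(t)).mean()
        print(f"t = {t:5.2f}: (K_t f - f)/t = {(Ktf - np.cos(w)) / t:+.4f}; "
              f"f''(w)/2 = {-np.cos(w) / 2:+.4f}")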
Proof: By explicit calculation. For any $n$, for any $t$ and any $h > 0$,
$$\begin{aligned}
Y_n(t+h) &= \frac{1}{n^{1/2}} \sum_{i=0}^{\lfloor n(t+h) \rfloor} X_i && (16.21)\\
&= Y_n(t) + n^{-1/2} \sum_{i=\lfloor nt \rfloor + 1}^{\lfloor n(t+h) \rfloor} X_i && (16.22)\\
&= Y_n(t) + n^{-1/2} \sum_{i=0}^{\lfloor n(t+h) \rfloor - \lfloor nt \rfloor} X'_i && (16.23)
\end{aligned}$$
using, in the last line, the fact that the $X_i$ are IID. Eq. 16.20 follows from the definition of $S(m)$.
To see that the first property (Eq. 14.3) holds, we need the distribution of $Y_{n,y}(t)$, that is, the distribution of the state of $Y_n$ at time $t$, when started from state $y$ (rather than 0). By Lemma 210,
$$Y_{n,y}(t) \overset{d}{=} y + n^{-1/2} S'(\lfloor nt \rfloor) \quad (16.24)$$
Clearly, as $y \to x$, $y + n^{-1/2} S'(\lfloor nt \rfloor) \overset{d}{\to} x + n^{-1/2} S'(\lfloor nt \rfloor)$, so the first Feller property holds.

To see that the second Feller property holds, observe that, for each $n$, for all $t$, and for all $\omega$, $Y_n(t+h, \omega) = Y_n(t, \omega)$ if $0 \le h < 1/n$. This sure convergence implies almost-sure convergence, which implies convergence in probability.
Lemma 212 (Evolution Operators of Random Walks) The "evolution" (i.e., conditional expectation) operators of the random walk $Y_n$, $K_{n,t}$, are given by
$$K_{n,t} f(y) = E\left[f(y + Y'_n(t))\right] \quad (16.25)$$
where $Y'_n$ is an independent copy of $Y_n$.

Proof: Substitute Lemma 210 into the definition of the evolution operator.
$$\begin{aligned}
K_{n,h} f(y) &\equiv E\left[f(Y_n(t+h)) \mid Y_n(t) = y\right] && (16.26)\\
&= E\left[f(Y_n(t+h) + Y_n(t) - Y_n(t)) \mid Y_n(t) = y\right] && (16.27)\\
&= E\left[f\left(n^{-1/2} S'(\lfloor nt \rfloor) + Y_n(t)\right) \mid Y_n(t) = y\right] && (16.28)\\
&= E\left[f\left(y + n^{-1/2} S'(\lfloor nt \rfloor)\right)\right] && (16.29)
\end{aligned}$$
In words, the transition operator simply takes the expectation over the increments of the random walk, just as with the Wiener process. Finally, substitution of $Y'_n(t)$ for $n^{-1/2} S'(\lfloor nt \rfloor)$ is licensed by Eq. 16.19.
Theorem 213 (Functional Central Limit Theorem (I)) $Y_n \overset{d}{\to} W$ in $D$.

Proof: Apply Theorem 206. Clause (4) of the theorem says that if any of the other three clauses are satisfied, and $Y_n(0) \overset{d}{\to} W(0)$ in $\mathbb{R}$, then $Y_n \overset{d}{\to} W$ in $D$. Clause (2) is that $K_{n,t} \to K_t$ for all $t > 0$. That is, for any $t > 0$, and $f \in C_0$, $K_{n,t} f \to K_t f$ as $n \to \infty$. Pick any such $t$ and $f$ and consider $K_{n,t} f$. By Lemma 212,
$$K_{n,t} f(y) = E\left[f\left(y + n^{-1/2} S'(\lfloor nt \rfloor)\right)\right] \quad (16.30)$$
As $n \to \infty$, $n^{-1/2} \sim t^{1/2} \lfloor nt \rfloor^{-1/2}$. Since the $X_i$ (and so the $X'_i$) have variance 1, the central limit theorem applies, and $\lfloor nt \rfloor^{-1/2} S'(\lfloor nt \rfloor) \overset{d}{\to} N(0,1)$, say $Z$. Consequently $n^{-1/2} S'(\lfloor nt \rfloor) \overset{d}{\to} \sqrt{t}\,Z$. Since $f \in C_0$, it is continuous and bounded, hence, by the definition of convergence in distribution,
$$E\left[f\left(y + n^{-1/2} S'(\lfloor nt \rfloor)\right)\right] \to E\left[f\left(y + \sqrt{t}\,Z\right)\right] \quad (16.31)$$
But $E\left[f(y + \sqrt{t}\,Z)\right] = K_t f(y)$, the time-evolution operator of the Wiener process applied to $f$ at $y$. Since the evolution operators of the random walks converge on those of the Wiener process, and since their initial conditions match, by the theorem $Y_n \overset{d}{\to} W$ in $D$.
Finally, for any three times $t_1 < t_2 < t_3$, $Y_n(t_3) - Y_n(t_2)$ and $Y_n(t_2) - Y_n(t_1)$ are independent for sufficiently large $n$, because they become sums of disjoint collections of independent random variables. The same applies to larger groups of times. Thus, the limiting distribution of $Y_n$ starts at the origin and has independent Gaussian increments. Since these properties determine the finite-dimensional distributions of the Wiener process, $Y_n \overset{fd}{\to} W$.
Theorem 215 (Functional Central Limit Theorem (II)) $Y_n \overset{d}{\to} W$.

Proof: By Proposition 200, it is enough to show that $Y_n \overset{fd}{\to} W$, and that any of the properties in Proposition 201 hold. The lemma took care of the finite-dimensional convergence, so we can turn to the second part. A sufficient condition is property (1) in the latter theorem, that $|Y_n(\tau_n + h_n) - Y_n(\tau_n)| \overset{P}{\to} 0$ for all finite optional times $\tau_n$ and any sequence of positive constants $h_n \to 0$.
$$\begin{aligned}
|Y_n(\tau_n + h_n) - Y_n(\tau_n)| &= n^{-1/2}\,|S(\lfloor n\tau_n + nh_n \rfloor) - S(\lfloor n\tau_n \rfloor)| && (16.35)\\
&\overset{d}{=} n^{-1/2}\,|S(\lfloor nh_n \rfloor) - S(0)| && (16.36)\\
&= n^{-1/2}\,|S(\lfloor nh_n \rfloor)| && (16.37)\\
&= n^{-1/2}\left|\sum_{i=0}^{\lfloor nh_n \rfloor} X_i\right| && (16.38)
\end{aligned}$$
To see that this converges in probability to zero, we will appeal to Chebyshev's inequality: if $Z_i$ have common mean 0 and variance $\sigma^2$, then, for every positive $\epsilon$,
$$P\left(\left|\sum_{i=1}^m Z_i\right| > \epsilon\right) \le \frac{m\sigma^2}{\epsilon^2} \quad (16.39)$$
Here we have $Z_i = X_i/\sqrt{n}$, so $\sigma^2 = 1/n$, and $m = \lfloor nh_n \rfloor$. Thus
$$P\left(n^{-1/2}\,|S(\lfloor nh_n \rfloor)| > \epsilon\right) \le \frac{\lfloor nh_n \rfloor}{n\epsilon^2} \quad (16.40)$$
As $0 \le \lfloor nh_n \rfloor / n \le h_n$, and $h_n \to 0$, the bounding probability must go to zero for every fixed $\epsilon$. Hence $n^{-1/2}\,|S(\lfloor nh_n \rfloor)| \overset{P}{\to} 0$.
Proof: (Xi − µ)/σ has mean 0 and variance 1, so Theorem 215 applies.
This result is called “the invariance principle”, because it says that the
limiting distribution of the sequences of sums depends only on the mean and
variance of the individual terms, and is consequently invariant under changes
which leave those alone. Both this result and the previous one are known as the
“functional central limit theorem”, because convergence in distribution is the
same as convergence of all bounded continuous functionals of the sample path.
Another name is “Donsker’s Theorem”, which is sometimes associated however
with the following corollary of Theorem 215.
Corollary 217 (Donsker's Theorem) Let $Y_n(t)$ and $W(t)$ be as before, but restrict the index set $T$ to the unit interval $[0,1]$. Let $f$ be any function from $D([0,1])$ to $\mathbb{R}$ which is measurable and a.s. continuous at $W$. Then $f(Y_n) \overset{d}{\to} f(W)$.
Proof: Exercise.
This version is especially important for statistical purposes, as we’ll see a
bit later.
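To see Donsker's theorem at work, one can push a functional of the rescaled walk against a known functional of $W$; here is a sketch using $f(x) = \sup_{t \le 1} x(t)$, whose distribution under Wiener measure is given by the reflection principle, $P(\sup_{t\le 1} W(t) \le a) = 2\Phi(a) - 1$ (all numerical settings below are arbitrary choices):

    import numpy as np
    from math import erf

    rng = np.random.default_rng(4)

    # f(Y_n) = max of the rescaled walk on [0,1] vs. the law of sup W.
    n, n_paths, a = 2000, 20000, 1.0
    steps = rng.choice([-1.0, 1.0], size=(n_paths, n))
    sup_Yn = (np.cumsum(steps, axis=1) / np.sqrt(n)).max(axis=1)
    Phi = lambda z: 0.5 * (1 + erf(z / np.sqrt(2)))
    print((sup_Yn <= a).mean(), 2 * Phi(a) - 1)  # both approx. 0.68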
16.3 Exercises
Exercise 16.1 (Example 167 Revisited) Go through all the details of Ex-
ample 167.
1. Show that $\mathcal{F}^X_t \subseteq \mathcal{F}^W_t$ for all $t$, and that $\mathcal{F}^X_t \subset \mathcal{F}^W_t$.
2. Show that $\tau = \inf\{t : X(t) = (0,0)\}$ is an $\mathcal{F}^X_t$-optional time, and that it is finite with probability 1.
3. Show that X is Markov with respect to both its natural filtration and the
natural filtration of the driving Wiener process.
4. Show that X is not strongly Markov at τ .
5. Which, if either, of the Feller properties does X have?
The partial differential equation
$$\frac{1}{2}\frac{\partial^2 f(x,t)}{\partial x^2} = \frac{\partial f(x,t)}{\partial t}$$
is called the diffusion equation. From our discussion of initial value problems
in Chapter 12 (Corollary 159 and related material), it is clear that the function
f (x, t) solves the diffusion equation with initial condition f (x, 0) if and only if
f (x, t) = Kt f (x, 0), where Kt is the evolution operator of the Wiener process.
1. Take $f(x,0) = (2\pi 10^{-4})^{-1/2}\,e^{-\frac{x^2}{2\cdot 10^{-4}}}$. $f(x,t)$ can be found analytically; do so.
2. Estimate f (x, 10) over the interval [−5, 5] stochastically. Use the fact that
Kt f (x) = E [f (W (t))|W (0) = x], and that random walks converge on the
Wiener process. (Be careful that you scale your random walks the right
way!) Give an indication of the error in this estimate.
3. Can you find an analytical form for f (x, t) if f (x, 0) = 1[−0.5,0.5] (x)?
4. Find f (x, 10), with the new initial conditions, by numerical integration on
the domain [−10, 10], and compare it to a stochastic estimate.
Suppose that, despite their interdependence, the $X_i$ still obey the central limit theorem,
$$\frac{1}{n^{1/2}} \sum_{i=1}^n X_i \overset{d}{\to} N(0,1)$$
Are these conditions enough to prove a functional central limit theorem, that $Y_n \overset{d}{\to} W$? If so, prove it. If not, explain what the problem is, and suggest an additional sufficient condition on the $X_i$.
Part IV
Chapter 17
Section 17.1 introduces the ideas which will occupy us for the
next few lectures, the continuous Markov processes known as diffu-
sions, and their description in terms of stochastic calculus.
Section 17.2 collects some useful properties of the most important
diffusion, the Wiener process.
Section 17.3 shows, first heuristically and then more rigorously,
that almost all sample paths of the Wiener process don’t have deriva-
tives.
Remark: Having said that, I should confess that some authors don’t insist
that diffusions be homogeneous, and some even don’t insist that they be strong
Markov processes. But this is the general sense in which the term is used.
Diffusions matter to us for several reasons. First, they are very natural
models of many important systems — the motion of physical particles (the
Some authors prefer to speak of stochastic integral equations, rather than stochastic differential equations, because things like 17.5 are really shorthands for "find an $X$ such that Eq. 17.4 holds", and objects like $dX$ don't really make much sense on their own. There's a certain logic to this, but custom is overwhelmingly against them.
The fully general theory of stochastic calculus considers integration with re-
spect to a very broad range of stochastic processes, but the original case, which
is still the most important, is integration with respect to the Wiener process,
which corresponds to driving a system with white noise. In addition to its
many applications in all the areas which use diffusions, the theory of integra-
tion against the Wiener process occupies a central place in modern probability
theory; I simply would not be doing my job if this course did not cover it.
We therefore begin our study of diffusions and stochastic calculus by reviewing
some of the properties of the Wiener process — which is also the most important
diffusion process.
$$\begin{aligned}
E\left[W(t+h) \mid \mathcal{F}^X_t\right] &= E\left[W(t+h) \mid W(t)\right] && (17.6)\\
&= E\left[W(t+h) - W(t) + W(t) \mid W(t)\right] && (17.7)\\
&= E\left[W(t+h) - W(t) \mid W(t)\right] + W(t) && (17.8)\\
&= 0 + W(t) = W(t) && (17.9)
\end{aligned}$$
where the first line uses the Markov property of $W$, and the last line the Gaussian increments property.
Proof: Pick any $k$ times $t_1 < t_2 < \ldots < t_k$. Then the increments $W(t_1) - W(0), W(t_2) - W(t_1), W(t_3) - W(t_2), \ldots, W(t_k) - W(t_{k-1})$ are independent Gaussian random variables. If $X$ and $Y$ are independent Gaussians, then $(X, X+Y)$ is a multivariate Gaussian, so (recursively) $(W(t_1) - W(0), W(t_2) - W(0), \ldots, W(t_k) - W(0))$ has a multivariate Gaussian distribution. Since $W(0) = 0$, the Gaussian distribution property follows. Since $t_1, \ldots, t_k$ were arbitrary, as was $k$, all the finite-dimensional distributions are Gaussian.
Just as the distribution of a Gaussian random variable is determined by
its mean and covariance, the distribution of a Gaussian process is determined
by its mean over time, E [X(t)], and its covariance function, cov (X(s), X(t)).
(You might find it instructive to prove this without looking at Lemma 13.1 in
Kallenberg.) Clearly, E [W (t)] = 0, and, taking s ≤ t without loss of generality,
starting from prescriptions on the measures of finite-dimensional cylinders and building from
there, deriving the incremental properties we’ve started with as consequences.
the whole space, it treats all points uniformly, and it’s reasonably normalizable.
Wiener measure will turn out to play a similar role when our sample space is
C.
A mathematically important question, which will also turn out to matter
to us very greatly when we try to set up stochastic differential equations, is
whether, under this Wiener measure, most curves are differentiable. If, say,
almost all curves were differentiable, then it would be easy to define dW/dt.
Unfortunately, this is not the case; almost all curves are nowhere differentiable.
There is an easy heuristic argument to this conclusion. $W(t)$ is a Gaussian, whose variance is $t$. If we look at the ratio in a derivative
$$\frac{W(t+h) - W(t)}{(t+h) - t}$$
the numerator has variance h and the denominator is the constant h, so the
ratio has variance 1/h, which goes to infinity as h → 0. In other words, as we
look at the curve of W (t) on smaller and smaller scales, it becomes more and
more erratic, and the slope finally blows up into a completely unpredictable
quantity. This is basically the shape of the more rigorous argument as well.
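The heuristic is easy to dramatize numerically, since the increment $W(t+h) - W(t)$ has the exact law $N(0, h)$ (the sample sizes below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(5)

    # The difference quotient (W(t+h) - W(t))/h has variance 1/h.
    for h in (1e-1, 1e-2, 1e-3):
        q = rng.normal(0.0, np.sqrt(h), size=10**6) / h
        print(f"h = {h:.0e}: sample variance = {q.var():10.1f}; 1/h = {1/h:10.1f}")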
4 Could we have used quadratic spline interpolation to get sample paths with continuous first derivatives, and still have had $\overset{d}{\to} W$?
5 Doing Exercises 1.1 and 7.1, if you haven't already, should help you resolve this paradox.
Chapter 18
There are some very clean proofs of this theorem1 — but they require us to
use stochastic calculus! Doob (1953, pp. 384ff) gives a proof which does not,
however. The details of his proof are messy, but the basic idea is to get the
central limit theorem to apply, using the martingale property of M 2 (t) − t to
get the variance to grow linearly with time and to get independent increments,
and then seeing that between any two times t1 and t2 , we can fit arbitrarily
many little increments so we can use the CLT.
We will return to this result as an illustration of the stochastic calculus
(Theorem 247).
Then $I_n(x) \to I(x)$, if $x$ is well-behaved and the length of the largest interval $\to 0$. If we want to evaluate $\int_{t=a}^{t=b} x(t)\,dw$, where $w$ is another function of $t$, the natural thing to do is to get the derivative of $w$, $w'$, replace the integrand by $x(t)w'(t)$, and perform the integral with respect to $t$. The approximating sums are then
$$\sum_{i=1}^n x(t_{i-1})\,w'(t_{i-1})\,(t_i - t_{i-1}) \quad (18.1)$$
from the initial condition x(t0 ) = x0 , and uses linear interpolation to get a
continuous, almost-everywhere-differentiable curve. Remarkably enough, this
converges on the actual solution as ∆t shrinks (Arnol’d, 1973).)
Let's try to carry all this over to random functions of time $X(t)$ and $W(t)$. The integral $\int X(t)\,dt$ is generally not a problem — we just find a version of $X$
1 See especially Ethier and Kurtz (1986, Theorem 5.2.12, p. 290).
with measurable sample paths (Section 8.2). $\int X(t)\,dW$ is also comprehensible if $dW/dt$ exists (almost surely). Unfortunately, we've seen that this is not the case for the Wiener process, which (as you can tell from the $W$) is what we'd really like to use here. So we can't approximate the integral with a sum like Eq. 18.1. But there's nothing preventing us from using one like Eq. 18.2, since that only demands increments of $W$. So what we would like to say is that
$$\int_{t=a}^{t=b} X(t)\,dW \equiv \lim_{n\to\infty} \sum_{i=1}^n X(t_{i-1})\,(W(t_i) - W(t_{i-1})) \quad (18.4)$$
Notice that if X is bounded on [a, b], in the sense that |X(t)| ≤ M with
probability 1 for all a ≤ t ≤ b, then X is square-integrable from a to b.
Definition 229 ($S_2$ norm) The norm of a process $X \in S_2[a,b]$ is its root-mean-square time integral:
$$\|X\|_{S_2} \equiv \left(E\left[\int_a^b X^2(t)\,dt\right]\right)^{1/2} \quad (19.1)$$
Proof: Recall that a semi-norm is a function from a vector space to the real numbers such that, for any vector $X$ and any scalar $a$, $\|aX\| = |a|\,\|X\|$, and, for any two vectors $X$ and $Y$, $\|X + Y\| \le \|X\| + \|Y\|$. The root-mean-square time integral $\|X\|_{S_2}$ clearly has both properties. To be a norm and not just a semi-norm, we need in addition that $\|X\| = 0$ if and only if $X = 0$. This is not true for random processes, because the process which is zero at irrational times $t \in [a,b]$ but 1 at rational times in the interval also has semi-norm 0. However, by identifying two processes $X$ and $Y$ if $X(t) - Y(t) = 0$ a.s. for almost all $t$, we get a norm. This is exactly analogous to the way the $L_2$ norm for random variables is really only a norm on equivalence classes of variables which are equal almost always. The proof that the space with this norm is complete is exactly analogous to the proof that the $L_2$ space of random variables is complete; see, e.g., Lemma 1.31 in Kallenberg (2002, p. 15).
Proof: Set
$$X_n(t) \equiv \sum_{i=0}^{\infty} X(i2^{-n})\,1_{[i/2^n, (i+1)/2^n)}(t) \quad (19.5)$$
where the last step uses the fact that X 2 must also be elementary.
has a limit in L2 . Moreover, this limit is the same for any such approximating
sequence Xn .
for every positive $k$. Since $X_n$ and $X_{n+k}$ are both elementary processes, their difference is also elementary, and we can apply Lemma 236 to it. That is, for every $\epsilon > 0$, there is an $n$ such that
$$E\left[\left(\int_a^b (X_{n+k}(t) - X_n(t))\,dW\right)^2\right] < \epsilon \quad (19.20)$$
for all $k$. (Choose the $\epsilon$ in Eq. 19.19 to be the square root of the $\epsilon$ in Eq. 19.20.) But this is to say that $I_n(X)$ is a Cauchy sequence in $L_2$, therefore it has a limit, which is also in $L_2$ (because $L_2$ is complete). If $Y_n$ is another sequence
taking the limit in L2 , with Xn as in Lemma 235. We will say that X is Itô-
integrable on [a, b].
Corollary 239 (The Itô isometry) Eq. 19.10 holds for all Itô-integrable X.
i.e., the Itô integral of the function which is always 1, $1_{\mathbb{R}^+}(t)$. If this is any kind of integral at all, it should be $W$ — more exactly, because this is a definite integral, we want $\int_a^b dW = W(b) - W(a)$. Mercifully, this works. Pick any set of time-points $t_i$ we like, and treat 1 as an elementary function with those times as its break-points. Then, using our definition of the Itô integral for elementary functions,
$$\int_a^b dW = \sum_{t_i} \left[W(t_{i+1}) - W(t_i)\right] \quad (19.23)$$
$$= W(b) - W(a) \quad (19.24)$$
as required. (This would be a good time to convince yourself that adding extra break-points to an elementary function doesn't change its integral; see Exercise 19.5.)
19.2.2 $\int W\,dW$

Tradition dictates that the next example be $\int W\,dW$. First, we should convince ourselves that $W(t)$ is Itô-integrable: it's clearly measurable and non-anticipative, but is it square-integrable? Yes; by Fubini's theorem,
$$E\left[\int_0^t W^2(s)\,ds\right] = \int_0^t E\left[W^2(s)\right]\,ds \quad (19.25)$$
$$= \int_0^t s\,ds \quad (19.26)$$
which is clearly finite on finite intervals $[0,t]$. So, this integral should exist. Now, if the ordinary rules for change of variables held — equivalently, if the chain rule worked the usual way — we could say that $W\,dW = \frac{1}{2}d(W^2)$, so $\int W\,dW = \frac{1}{2}\int dW^2$, and we'd expect $\int_0^t W\,dW = \frac{1}{2}W^2(t)$. But, alas, this can't be right. To see why, take the expectation: it'd be $\frac{1}{2}t$. But we know that it has to be zero, and it has to be a martingale in $t$, and this is neither. A bone-head would try to fix this by subtracting off the non-martingale part, i.e., a bone-head would guess that $\int_0^t W\,dW = \frac{1}{2}W^2(t) - t/2$. Annoyingly, in this case the bone-head is correct. The demonstration is fundamentally straightforward, if somewhat long-winded.
To begin, we need to approximate $W$ by elementary functions. For each $n$, let $t_i = i\frac{t}{2^n}$, $0 \le i \le 2^n - 1$. Set $\phi_n(t) = \sum_{i=0}^{2^n-1} W(t_i)\,1_{[t_i, t_{i+1})}$. Let's check that this converges to $W(t)$ as $n \to \infty$:
$$\begin{aligned}
E\left[\int_0^t (\phi_n(s) - W(s))^2\,ds\right] &= E\left[\sum_{i=0}^{2^n-1} \int_{t_i}^{t_{i+1}} (W(t_i) - W(s))^2\,ds\right] && (19.27)\\
&= \sum_{i=0}^{2^n-1} E\left[\int_{t_i}^{t_{i+1}} (W(t_i) - W(s))^2\,ds\right] && (19.28)\\
&= \sum_{i=0}^{2^n-1} \int_{t_i}^{t_{i+1}} E\left[(W(t_i) - W(s))^2\right]\,ds && (19.29)\\
&= \sum_{i=0}^{2^n-1} \int_{t_i}^{t_{i+1}} (s - t_i)\,ds && (19.30)\\
&= \sum_{i=0}^{2^n-1} \int_0^{t2^{-n}} s\,ds && (19.31)\\
&= \sum_{i=0}^{2^n-1} \left.\frac{s^2}{2}\right|_0^{t2^{-n}} && (19.32)\\
&= \sum_{i=0}^{2^n-1} t^2 2^{-2n-1} && (19.33)\\
&= t^2 2^{-n-1} && (19.34)
\end{aligned}$$
which $\to 0$ as $n \to \infty$. Hence
$$\begin{aligned}
\int_0^t W(s)\,dW &= \lim_n \int_0^t \phi_n(s)\,dW && (19.35)\\
&= \lim_n \sum_{i=0}^{2^n-1} W(t_i)(W(t_{i+1}) - W(t_i)) && (19.36)\\
&= \lim_n \sum_{i=0}^{2^n-1} W(t_i)\Delta W(t_i) && (19.37)
\end{aligned}$$
where $\Delta W(t_i) \equiv W(t_{i+1}) - W(t_i)$, because I'm getting tired of writing both subscripts. Define $\Delta W^2(t_i)$ similarly. Since $W(0) = 0 = W^2(0)$, we have that
$$W(t) = \sum_i \Delta W(t_i) \quad (19.38)$$
$$W^2(t) = \sum_i \Delta W^2(t_i) \quad (19.39)$$
Now let's re-write $\Delta W^2$ in such a way that we get a $W\Delta W$ term, which is what we want to evaluate our integral.
$$\begin{aligned}
\Delta W^2(t_i) &= W^2(t_{i+1}) - W^2(t_i) && (19.40)\\
&= (\Delta W(t_i) + W(t_i))^2 - W^2(t_i) && (19.41)\\
&= (\Delta W(t_i))^2 + 2W(t_i)\Delta W(t_i) + W^2(t_i) - W^2(t_i) && (19.42)\\
&= (\Delta W(t_i))^2 + 2W(t_i)\Delta W(t_i) && (19.43)
\end{aligned}$$
This looks promising, because it's got $W\Delta W$ in it. Plugging in to Eq. 19.39,
$$W^2(t) = \sum_i \left[(\Delta W(t_i))^2 + 2W(t_i)\Delta W(t_i)\right] \quad (19.44)$$
$$\sum_i W(t_i)\Delta W(t_i) = \frac{1}{2}W^2(t) - \frac{1}{2}\sum_i (\Delta W(t_i))^2 \quad (19.45)$$
in $L_2$, so we have that
$$\begin{aligned}
\int_0^t W(s)\,dW &= \lim_n \sum_{i=0}^{2^n-1} W(t_i)\Delta W(t_i) && (19.47)\\
&= \frac{1}{2}W^2(t) - \frac{1}{2}\lim_n \sum_{i=0}^{2^n-1} (\Delta W(t_i))^2 && (19.48)\\
&= \frac{1}{2}W^2(t) - \frac{t}{2} && (19.49)
\end{aligned}$$
as required.
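A simulation makes a reassuring check on Eq. 19.49; this sketch forms the left-endpoint sums on a fixed grid (grid size and path count are arbitrary) and compares them to $\frac{1}{2}W^2(t) - t/2$ path by path:

    import numpy as np

    rng = np.random.default_rng(6)

    # Left-endpoint (Ito) sums of W dW vs. (1/2) W(t)^2 - t/2.
    t, n_steps, n_paths = 1.0, 4096, 2000
    dW = rng.normal(0.0, np.sqrt(t / n_steps), size=(n_paths, n_steps))
    W = np.cumsum(dW, axis=1)
    W_left = np.hstack([np.zeros((n_paths, 1)), W[:, :-1]])  # W(t_i)
    ito_sum = (W_left * dW).sum(axis=1)
    target = 0.5 * W[:, -1] ** 2 - t / 2
    print(np.abs(ito_sum - target).mean())  # small, shrinking as n_steps grows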
Clearly, something weird is going on here, and it would be good to get to the
bottom of this. At the very least, we’d like to be able to use change of variables,
so that we can find functions of stochastic integrals.
Proof: Clearly, the non-anticipating processes are closed under linear operations, so it's enough to show that the three components of any Itô process are non-anticipating. But a process which is always equal to $X_0$ (which is independent of $W(t)$) is clearly non-anticipating. Similarly, since $A(t)$ is non-anticipating, $\int A(s)\,ds$ is too: its natural filtration is smaller than that of $A$, so it cannot provide more information about $W(t)$, and $A$ is, by assumption, non-anticipating. Finally, Itô integrals are always non-anticipating, so $\int B(s)\,dW$ is non-anticipating.
$$dF = \frac{\partial f}{\partial t}(t, X(t))\,dt + \frac{\partial f}{\partial x}(t, X(t))\,dX + \frac{1}{2}B^2(t)\frac{\partial^2 f}{\partial x^2}(t, X(t))\,dt \quad (19.50)$$
That is,
Proof: I will suppose first of all that $f$, and its partial derivatives appearing in Eq. 19.50, are all bounded. (You can show that the general case of $C^2$ functions follows by an approximation argument.)
[You can use the definition in the last line to build up a theory of stochastic
integrals with respect to arbitrary Itô processes, not just Wiener processes.]
$$\sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial t^2}\,(\Delta t_i)^2 \to 0\cdot\int_0^t \frac{\partial^2 f}{\partial t^2}\,ds = 0 \quad (19.58)$$
The first term on the right-hand side, in $(\Delta t)^2$, goes to zero as $n$ increases. Since $A$ is square-integrable and $\frac{\partial^2 f}{\partial x^2}$ is bounded, $\sum \frac{\partial^2 f}{\partial x^2} A^2(t_i)\Delta t_i$ converges to a finite value as $\Delta t \to 0$, so multiplying by another factor $\Delta t$, as $n \to \infty$, gives zero. (This is the same argument as the one for Eq. 19.58.) Similarly for the second term, in $\Delta t\,\Delta X$:
$$\lim_n \sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial x^2}\,A(t_i)B(t_i)\,\Delta t_i\,\Delta W(t_i) = \lim_n \frac{t}{2^n} \int_0^t \frac{\partial^2 f}{\partial x^2}\,A(s)B(s)\,dW \quad (19.60)$$
because $A$ and $B$ are elementary and the partial derivative is bounded. Now apply the Itô isometry:
$$E\left[\left(\frac{t}{2^n}\int_0^t \frac{\partial^2 f}{\partial x^2}\,A(s)B(s)\,dW\right)^2\right] = \frac{t^2}{2^{2n}}\,E\left[\int_0^t \left(\frac{\partial^2 f}{\partial x^2}\right)^2 A^2(s)B^2(s)\,ds\right]$$
The time-integral on the right-hand side is finite, since $A$ and $B$ are square-integrable and the partial derivative is bounded, and so, as $n$ grows, both sides go to zero. But this means that, in $L_2$,
$$\sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial x^2}\,A(t_i)B(t_i)\,\Delta t_i\,\Delta W(t_i) \to 0 \quad (19.61)$$
The third term, in $(\Delta X)^2$, does not vanish, but rather converges in $L_2$ to a time integral:
$$\sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial x^2}\,B^2(t_i)\,(\Delta W(t_i))^2 \to \int_0^t \frac{\partial^2 f}{\partial x^2}\,B^2(s)\,ds \quad (19.62)$$
$$\sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial t\,\partial x}\,A(t_i)\,(\Delta t_i)^2 \to 0 \quad (19.64)$$
$$\sum_{i=0}^{2^n-1} \frac{\partial^2 f}{\partial t\,\partial x}\,B(t_i)\,\Delta t_i\,\Delta W(t_i) \to 0 \quad (19.65)$$
where the argument for Eq. 19.64 is the same as that for Eq. 19.58, while that for Eq. 19.65 follows the pattern of Eq. 19.61.
Let us, as promised, dispose of the remainder terms, given by Eq. 19.54, re-stated here for convenience:
$$R_i = O\left((\Delta t)^3 + \Delta X(\Delta t)^2 + (\Delta X)^2\Delta t + (\Delta X)^3\right) \quad (19.66)$$
Taking $\Delta X = A\Delta t + B\Delta W$, expanding the powers of $\Delta X$, and using the fact that everything is inside a big $O$ to let us group together terms with the same powers of $\Delta t$ and $\Delta W$, we get
$$R_i = O\left((\Delta t)^3 + \Delta W(\Delta t)^2 + (\Delta W)^2\Delta t + (\Delta W)^3\right) \quad (19.67)$$
From our previous uses of Exercise 19.4, it's clear that in the limit $(\Delta W)^2$ terms will behave like $\Delta t$ terms, so
$$R_i = O\left((\Delta t)^3 + \Delta W(\Delta t)^2 + (\Delta t)^2 + \Delta W\Delta t\right) \quad (19.68)$$
Now, by our previous arguments, the sum of terms which are $O((\Delta t)^2) \to 0$, so the first three terms all go to zero; similarly we have seen that a sum of terms which are $O(\Delta W\Delta t) \to 0$. We may conclude that the sum of the remainder terms goes to 0, in $L_2$, as $n$ increases.
Putting everything together, we have that
$$F(t) - F(0) = \int_0^t \left(\frac{\partial f}{\partial t} + A\frac{\partial f}{\partial x} + \frac{1}{2}B^2\frac{\partial^2 f}{\partial x^2}\right)dt + \int_0^t B\frac{\partial f}{\partial x}\,dW \quad (19.69)$$
exactly as required. This completes the proof, under the stated restrictions on $f$, $A$ and $B$; approximation arguments extend the result to the general case.
Remark 1. Our manipulations in the course of the proof are often summarized in the following multiplication rules for differentials: $dt\,dt = 0$, $dW\,dt = 0$, $dt\,dW = 0$, and, most important of all,
$$dW\,dW = dt$$
This last is of course related to the linear growth of the variance of the increments
of the Wiener process. If we used a different driving noise term, it would be
replaced by the corresponding rule for the growth of that noise’s variance.
Remark 2. Re-arranging Itô's formula a little yields
$$dF = \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial x}dX + \frac{1}{2}B^2\frac{\partial^2 f}{\partial x^2}dt \quad (19.70)$$
The first two terms are what we expect from the ordinary rules of calculus; it's the third term which is new and strange. Notice that it disappears if $\frac{\partial^2 f}{\partial x^2} = 0$. When we come to stochastic differential equations, this will correspond to state-independent noise.
Remark 3. One implication of Itô’s formula is that Itô processes are closed
under the application of C 2 mappings.
Example 243 (Section 19.2.2 summarized) The integral $\int W\,dW$ is now trivial. Let $X(t) = W(t)$ (by setting $A = 0$, $B = 1$ in the definition of an Itô process), and $f(t,x) = x^2/2$. Applying Itô's formula,
$$\begin{aligned}
dF &= \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial x}dW + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}dt && (19.71)\\
\frac{1}{2}dW^2 &= W\,dW + \frac{1}{2}dt && (19.72)\\
\frac{1}{2}\int dW^2 &= \int W\,dW + \frac{1}{2}\int dt && (19.73)\\
\int_0^t W(s)\,dW &= \frac{1}{2}W^2(t) - \frac{t}{2} && (19.74)
\end{aligned}$$
All of this extends naturally to higher dimensions.
summing over repeated indices, with the understanding that $dW_i\,dW_j = \delta_{ij}\,dt$, $dW_i\,dt = dt\,dW_i = dt\,dt = 0$.
Proof: Entirely parallel to the one-dimensional case, only with even more
algebra.
It is also possible to define Wiener processes and stochastic integrals on
arbitrary curved manifolds, but this would take us way, way too far afield.
That is, we could evade the Itô formula if we evaluated our test function in the middle of intervals, rather than at their beginning. This leads to what are called Stratonovich integrals. However, while Stratonovich integrals give simpler change-of-variable formulas, they have many other inconveniences: they are not martingales, for instance, and the nice connections between the form of an SDE and its generator, which we will see and use in the next chapter, go away. Fortunately, every Stratonovich SDE can be converted into an Itô SDE, and vice versa, by adding or subtracting the appropriate correction term.
$$\begin{aligned}
dY &= \frac{\partial f}{\partial t}dt + \frac{\partial f}{\partial x}dM + \frac{1}{2}B^2\frac{\partial^2 f}{\partial x^2}dt && (19.79)\\
&= -dt + 2MM'\,dW + (M')^2\,dt && (19.80)\\
0 &= -1 + (M'(t))^2 && (19.81)\\
\pm 1 &= M'(t) && (19.82)
\end{aligned}$$
Since $-W$ is also a Wiener process, it follows that $M \overset{d}{=} W$ (plus a possible additive constant).
The a term is called the drift, and the b term the diffusion.
this reason — the Wiener process is constructed in such a way that the class of
Itô processes is impoverished. This leads to the idea of a weak solution to Eq.
19.83, which is a pair $X$, $W$ such that $W$ is a Wiener process with respect to the appropriate filtration, and $X$ is then given by Eq. 19.84. I will avoid weak solutions in what follows.
Remark 2: In a non-autonomous SDE, the coefficients would be explicit functions of time, $a(t,X)dt + b(t,X)dW$. The usual trick for dealing with non-autonomous $n$-dimensional ODEs is to turn them into autonomous $(n+1)$-dimensional ODEs, making $x_{n+1} = t$ by decreeing that $x_{n+1}(t_0) = t_0$, $x'_{n+1} = 1$ (Arnol'd, 1973). This works for SDEs, too: add time as an extra variable with constant drift 1 and constant diffusion 0. Without loss of generality, therefore, I'll only consider autonomous SDEs.
Let's now prove the existence of unique solutions to SDEs. First, recall how we do this for ordinary differential equations. There are several approaches, most of which carry over to SDEs, but one of the most elegant is the "method of successive approximations", or "Picard's method" (Arnol'd, 1973, §§30–31). To construct a solution to $dx/dt = f(x)$, $x(0) = x_0$, this approach uses functions $x_n(t)$, with $x_{n+1}(t) = x_0 + \int_0^t f(x_n(s))\,ds$, starting with $x_0(t) = x_0$. That is, there is an operator $P$ such that $x_{n+1} = Px_n$, and $x$ solves the ODE iff it is a fixed point of the operator. Step 1 is to show that the sequence $x_n$ is Cauchy on finite intervals $[0,T]$. Step 2 uses the fact that the space of continuous functions is complete, with the topology of uniform convergence on compact sets — which, for $\mathbb{R}^+$, is the same as uniform convergence on finite intervals. So, $x_n$ has a limit. Step 3 is to show that the limit point must be a fixed point of $P$, that is, a solution. Uniqueness is proved by showing that there cannot be more than one fixed point.
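Picard's method can be imitated for an SDE on a computer by fixing a Brownian path on a grid and iterating the operator $P$ with left-endpoint sums; here is a sketch (the coefficients below give the linear SDE $dX = -X\,dt + 0.5\,dW$, an arbitrary test case, and the discretization is an assumption of the sketch, not part of the text's proof):

    import numpy as np

    rng = np.random.default_rng(7)

    # Successive approximations X_{n+1} = P X_n for dX = a(X) dt + b(X) dW,
    # on one fixed Brownian path; the iterates converge on the (discretized)
    # solution.
    def a(x): return -x
    def b(x): return 0.5 * np.ones_like(x)

    T, n_steps, x0 = 1.0, 1000, 1.0
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=n_steps)

    X = np.full(n_steps + 1, x0)  # X_0(t) = x0
    for n in range(8):
        X_new = np.concatenate(
            [[x0], x0 + np.cumsum(a(X[:-1]) * dt + b(X[:-1]) * dW)])
        print(f"iteration {n+1}: sup|X_(n+1) - X_n| = {np.abs(X_new - X).max():.2e}")
        X = X_new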
Before plunging into the proof, we need some lemmas: an algebraic triviality, a maximal inequality for martingales, a consequent maximal inequality for Itô processes, and an inequality from real analysis about integral equations.
Lemma 249 (A Quadratic Inequality) For any real numbers $a$ and $b$, $(a+b)^2 \le 2a^2 + 2b^2$.

Proof: No matter what $a$ and $b$ are, $a^2$, $b^2$, and $(a-b)^2$ are non-negative, so
$$\begin{aligned}
(a-b)^2 &\ge 0 && (19.85)\\
a^2 + b^2 - 2ab &\ge 0 && (19.86)\\
a^2 + b^2 &\ge 2ab && (19.87)\\
2a^2 + 2b^2 &\ge a^2 + 2ab + b^2 = (a+b)^2 && (19.88)
\end{aligned}$$
Definition 251 (The Space QM(T)) Let $QM(T)$, $T > 0$, be the space of all non-anticipating processes, square-integrable on $[0,T]$, with norm $\|X\|_{QM(T)} \equiv \|X^*(T)\|_2$.
Technically, this is only a norm on equivalence classes of processes, where
the equivalence relation is “is a version of”. You may make that amendment
mentally as you read what follows.
Lemma 252 (Completeness of QM(T )) QM(T ) is a complete normed space
for each T .
Proof: Identical to the usual proof that Lp spaces are complete, see, e.g.,
Lemma 1.31 of Kallenberg (2002, p. 15).
Proposition 253 (Doob's Martingale Inequalities) If $M(t)$ is a continuous martingale, then, for all $p \ge 1$, $t \ge 0$ and $\epsilon > 0$,
$$P\left(M^*(t) \ge \epsilon\right) \le \frac{E\left[|M(t)|^p\right]}{\epsilon^p} \quad (19.89)$$
$$\|M^*(t)\|_p \le q\,\|M(t)\|_p \quad (19.90)$$
where $q$ is the conjugate exponent, $1/p + 1/q = 1$.
Proof: See Propositions 7.15 and 7.16 in Kallenberg (pp. 128 and 129).
These can be thought of as versions of the Markov inequality, only for martingales. They accordingly get used all the time.
Lemma 254 (A Maximal Inequality for Itô Processes) Let $X(t)$ be an Itô process, $X(t) = X_0 + \int_0^t A(s)\,ds + \int_0^t B(s)\,dW$. Then there exists a constant $C$, depending only on $T$, such that, for all $t \in [0,T]$,
$$\|X\|^2_{QM(t)} \le C\left(E\left[X_0^2\right] + E\left[\int_0^t A^2(s) + B^2(s)\,ds\right]\right) \quad (19.91)$$

Proof: Clearly,
$$X^*(t) \le |X_0| + \int_0^t |A(s)|\,ds + \sup_{0\le s\le t}\left|\int_0^s B(s')\,dW\right| \quad (19.92)$$
$$\left(X^*(t)\right)^2 \le 2X_0^2 + 2\left(\int_0^t |A(s)|\,ds\right)^2 + 2\sup_{0\le s\le t}\left(\int_0^s B(s')\,dW\right)^2 \quad (19.93)$$
Writing $I(t)$ for $\int_0^t B(s)\,dW$, and noticing that it is a martingale, we have, from Doob's inequality (Proposition 253), $E\left[(I^*(t))^2\right] \le 4E\left[I^2(t)\right]$. But, from Itô's isometry (Corollary 239), $E\left[I^2(t)\right] = E\left[\int_0^t B^2(s)\,ds\right]$. Putting all the parts together, then, using the Cauchy-Schwarz inequality on the drift term,
$$E\left[(X^*(t))^2\right] \le 2E\left[X_0^2\right] + 2tE\left[\int_0^t A^2(s)\,ds\right] + 8E\left[\int_0^t B^2(s)\,ds\right] \quad (19.95)$$
PX − PY .
Proof: I’ll first prove existence, along with square-integrability, and then
uniqueness. That X is non-anticipating follows from the fact that it is an Itô
process (Lemma 241). For concision, abbreviate PX0 ,a,b by P .
As with ODEs, iteratively construct approximate solutions. Fix a $T > 0$, and, for $t \in [0,T]$, set
$$X_0(t) = X_0 \quad (19.103)$$
$$X_{n+1}(t) = PX_n(t) \quad (19.104)$$
$$\le CT\,E\left[a^2(X_0) + b^2(X_0)\right] \quad (19.114)$$
Because a and b are Lipschitz, this will be finite if X0 has a finite second moment,
which, by assumption, it does. So Xn is a Cauchy sequence in QM(T ), which
is a complete space, so Xn has a limit in QM(T ), call it X:
$$\lim_{n\to\infty} \|X - X_n\|_{QM(T)} = 0 \quad (19.115)$$
The next step is to show that X is a fixed point of the operator P . This is
true because P X is also a limit of the sequence Xn .
$$\|PX - X_{n+1}\|^2_{QM(T)} = \|PX - PX_n\|^2_{QM(T)} \quad (19.116)$$
$$\le DT\,\|X - X_n\|^2_{QM(T)} \quad (19.117)$$
Proof: Entirely parallel to the one-dimensional case, only with more alge-
bra.
The conditions on the coefficients can be reduced to something like “locally
Lipschitz up to a stopping time”, but it does not seem profitable to pursue this
here. See Rogers and Williams (2000, Ch. V, Sec. 12).
$$\frac{d\vec{x}_i}{dt} = \frac{\vec{p}_i}{m_i} \quad (19.120)$$
$$\frac{d\vec{p}_i}{dt} = F(x, p, t) \quad (19.121)$$
constitute the laws of motion. All the physical content comes from specifying
the force function F (x, p, t). We will consider only autonomous systems, so we
do not need to deal with forces which are explicit functions of time. Newton’s
2 See Selmeczi et al. (2006) for an account of the physical theory of Brownian motion, including some of its history and some fascinating biophysical applications. Wax (1954) collects classic papers on this and related subjects, still very much worth reading.
third law says that total momentum is conserved, when all bodies are taken into
account.
Consider a large particle of (without loss of generality) mass 1, such as a
pollen grain, sitting in a still fluid at thermal equilibrium. What forces act on
it? One is drag. At a molecular level, this is due to the particle colliding with
the molecules (mass m) of the fluid, whose average momentum is zero. This
typically results in momentum being transferred from the pollen to the fluid
molecules, and the amount of momentum lost by the pollen is proportional to
what it had, i.e., one term in d~ p/dt is −γ~p. In addition, however, there will be
fluctuations, which will be due to the fact that the fluid molecules are not all at
rest. In fact, because the fluid is at equilibrium, the momenta of the molecules
will follow a Maxwell-Boltzmann distribution,
$$f(\vec{p}_{\mathrm{molec}}) = (2\pi m k_B T)^{-3/2}\,e^{-\frac{1}{2}\frac{p^2_{\mathrm{molec}}}{m k_B T}}$$
which is a zero-mean Gaussian with variance $mk_B T$. Tracing this through,
we expect that, over short time intervals in which the pollen grain nonetheless
collides with a large number of molecules, there will be a random impulse (i.e.,
random change in momentum) which is Gaussian, but uncorrelated over shorter
sub-intervals (by the functional CLT). That is, we would like to write
$$d\vec{p} = -\gamma\vec{p}\,dt + DI\,dW \quad (19.122)$$
where D is the diffusion constant, I is the 3 × 3 identity matrix, and W of
course is the standard three-dimensional Wiener process. This is known as the
Langevin equation in the physics literature, as this model was introduced by
Langevin in 1907 as a correction to Einstein’s 1905 model of Brownian motion.
(Of course, Langevin didn’t use Wiener processes and Itô integrals, which came
much later, but the spirit was the same.) If you like time-series models, you might recognize this as a continuous-time version of a mean-reverting AR(1) model, which explains why it also shows up as an interest-rate model in financial theory.
We can consider each component of the Langevin equation separately, because they decouple, and solve them easily with Itô's formula:
$$\begin{aligned}
d(e^{\gamma t} p) &= De^{\gamma t}\,dW && (19.123)\\
e^{\gamma t} p(t) &= p_0 + D\int_0^t e^{\gamma s}\,dW && (19.124)\\
p(t) &= p_0 e^{-\gamma t} + D\int_0^t e^{-\gamma(t-s)}\,dW && (19.125)
\end{aligned}$$
We will see in the next chapter a general method of proving that solutions of
equations like 19.122 are Markov processes; for now, you can either take that
on faith, or try to prove it yourself.
Assuming $p_0$ is itself Gaussian, with mean 0 and variance $\sigma^2$, then (using Exercise 19.6), $\vec{p}$ always has mean zero, and the covariance is
$$\mathrm{cov}\left(\vec{p}(t), \vec{p}(s)\right) = \sigma^2 e^{-\gamma(s+t)} + \frac{D^2}{2\gamma}\left(e^{-\gamma|s-t|} - e^{-\gamma(s+t)}\right) \quad (19.126)$$
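For a numerical cross-check of Eq. 19.126, here is a sketch integrating one component of Eq. 19.122 by the Euler-Maruyama scheme (a standard discretization, not part of the text) and comparing the long-run variance with the stationary value $D^2/2\gamma$ implied by the covariance formula:

    import numpy as np

    rng = np.random.default_rng(8)

    # dp = -gamma p dt + D dW, started at rest (sigma^2 = 0); after many
    # relaxation times 1/gamma, var p(t) -> D^2 / (2 gamma).
    gamma, D = 1.0, 0.8
    dt, n_steps, n_paths = 1e-3, 20000, 5000
    p = np.zeros(n_paths)
    for _ in range(n_steps):
        p += -gamma * p * dt + D * np.sqrt(dt) * rng.standard_normal(n_paths)
    print(p.var(), D ** 2 / (2 * gamma))  # both approx. 0.32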
19.6 Exercises
Exercise 19.1 (Basic Properties of the Itô Integral) Prove the following,
first for elementary Itô-integrable processes, and then in general.
1.
$$\int_a^c X(t)\,dW = \int_a^b X(t)\,dW + \int_b^c X(t)\,dW \quad (19.127)$$
almost surely.
2. If $c$ is any real constant, then, almost surely,
$$\int_a^b (cX(t) + Y(t))\,dW = c\int_a^b X\,dW + \int_a^b Y(t)\,dW \quad (19.128)$$
Exercise 19.3 (Continuity of the Itô Integral) Show that Ix (t) has (a
modification with) continuous sample paths.
Exercise 19.4 (“The square of dW ”) Use the notation of Section 19.2 here.
1. Show that Σ_i (ΔW(t_i))² converges on t (in L²) as n grows. Hint: Show that the terms in the sum are IID, and that their variance shrinks sufficiently fast as n grows. (You will need the fourth moment of a Gaussian distribution.)
2. If X(t) is measurable and non-anticipating, show that
lim_n Σ_{i=0}^{2^n−1} X(t_i)(ΔW(t_i))² = ∫_0^t X(s)ds   (19.130)
in L².
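Not a solution, but a numerical sanity check of part 1 (my addition, with an arbitrary choice of t = 3): the sum of squared Wiener increments over [0, t] settles on t as the partition refines.

```python
# Sum of squared increments of W over [0, t] with 2^n pieces; each increment is
# N(0, t/2^n), so the sum has mean t and variance 2 t^2 / 2^n -> 0.
import numpy as np

rng = np.random.default_rng(2)
t = 3.0
for n in [4, 8, 12, 16]:
    k = 2**n
    dW = rng.normal(0.0, np.sqrt(t / k), size=k)
    print(n, (dW**2).sum())          # approaches t = 3.0
```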
Exercise 19.6 (Itô integrals are Gaussian processes) For any fixed, non-random cadlag function f on R⁺, let I_f(t) = ∫_0^t f(s)dW.
1. Show that E[I_f(t)] = 0 for all t.
2. Show cov(I_f(t), I_f(s)) = ∫_0^{t∧s} f²(u)du.
3. Show that I_f(t) is a Gaussian process.
Chapter 20
More on Stochastic Differential Equations
Section 20.1 shows that the solutions of SDEs are diffusions, and
how to find their generators. Our previous work on Feller processes
and martingale problems pays off here. Some other basic properties
of solutions are sketched, too.
Section 20.2 explains the “forward” and “backward” equations
associated with a diffusion (or other Feller process). We get our
first taste of finding invariant distributions by looking for stationary
solutions of the forward equation.
For the rest of this lecture, whenever I say “an SDE”, I mean “an SDE
satisfying the requirements of the existence and uniqueness theorem”, either
Theorem 259 (in one dimension) or Theorem 260 (in multiple dimensions). And
when I say “a solution”, I mean “a strong solution”. If you are really curious
about what has to be changed to accommodate weak solutions, see Rogers and
Williams (2000, ch. V, sec. 16–18).
Theorem 263 (SDEs and Feller Diffusions) The processes which solve
SDEs are all Feller diffusions.
Proof: Theorem 262 shows that solutions are homogeneous strong Markov
processes, the previous theorem shows that they are continuous (or can be made
so), and so by Definition 218, solutions are diffusions. For them to be Feller, we need (i) for every t > 0, X_y(t) → X_x(t) in distribution as y → x, and (ii) X_x(t) → x in probability as t → 0. But, since solutions are a.s. continuous, X_x(t) → x with probability 1, automatically implying convergence in probability, so (ii) is automatic.
1 Here, and elsewhere, I am going to freely use the Einstein conventions for vector calculus: repeated indices in a term indicate that you should sum over those indices; ∂_i abbreviates ∂/∂x_i; ∂²_{ij} means ∂²/∂x_i∂x_j, etc. Also, ∂_t ≡ ∂/∂t.
Proof: For every initial condition, the generator of the semi-group is the
same (Theorem 262, proof). Since the process is Feller for every initial condition
(Theorem 263), and a Feller semi-group is determined by its generator (Theorem
188), the process has the same evolution operator for every initial condition.
Hence, condition (ii) of Theorem 205 holds. This implies condition (iv) of that
theorem, which is the stated convergence.
Next, turn the expectations into integrals with respect to the transition probability kernels:
∂_t ∫ µ_t(x, dy)f(y) = G ∫ µ_t(x, dy)f(y)   (20.9)
Finally, assume that there is some reference measure λ, with µ_t(x, ·) ≪ λ for all t ∈ T and x ∈ Ξ. Denote the corresponding transition densities by κ_t(x, y).
∂_t ∫ dλ κ_t(x, y)f(y) = G ∫ dλ κ_t(x, y)f(y)   (20.10)
∫ dλ f(y) ∂_t κ_t(x, y) = ∫ dλ f(y) Gκ_t(x, y)   (20.11)
∫ dλ f(y) [∂_t κ_t(x, y) − Gκ_t(x, y)] = 0   (20.12)
The operator G alters the way a function depends on x, the initial state. That is,
this equation is about how the transition density κ depends on the starting point,
“backwards” in time. Generally, we’re in a position to know κ0 (x, y) = δ(x − y);
what we want, rather, is κt (x, y) for some positive value of t. To get this, we
need the “forward” equation.
We obtain this from Lemma 155, which asserts that GKt = Kt G.
∂_t ∫ dλ κ_t(x, y)f(y) = K_t Gf(x)   (20.14)
= ∫ dλ κ_t(x, y)Gf(y)   (20.15)
Notice that here, G is altering the dependence on the y coordinate, i.e., the state at time t, not the initial state at time 0. Writing the adjoint operator as G†,
∂_t ∫ dλ κ_t(x, y)f(y) = ∫ dλ G†κ_t(x, y)f(y)   (20.16)
For an SDE with drift vector a(x) and noise matrix b(x), the generator is
Gf(x) = a_i(x)∂_i f(x) + ½(bb^T)_{ij}(x)∂²_{ij}f(x)   (20.19)
You can show — it’s an exercise in vector calculus, integration by parts, etc. —
that the adjoint to G is the differential operator
G†f(x) = −∂_i[a_i(x)f(x)] + ½∂²_{ij}[(bb^T)_{ij}(x)f(x)]   (20.20)
Notice that the space-dependence of the SDE’s coefficients now appears inside
the derivatives. Of course, if a and b are independent of x, then they simply
pull outside the derivatives, giving us, in that special case,
G†f(x) = −a_i∂_i f(x) + ½(bb^T)_{ij}∂²_{ij}f(x)   (20.21)
Let’s interpret this physically, imagining a large population of independent
tracer particles wandering around the state space Ξ, following independent
copies of the diffusion process. The second derivative term is easy: diffusion
tends to smooth out the probability density, taking probability mass away from
maxima (where f'' < 0) and adding it to minima. (Remember that bb^T is posi-
tive semi-definite.) If ai is positive, then particles tend to move in the positive
direction along the ith axis. If ∂i ρ is also positive, this means that, on average,
the point x sends more particles up along the axis than wander down, against
the gradient, so the density at x will tend to decline.
Example 265 (Wiener process, heat equation) Notice that (for diffusions
produced by SDEs) G† = G when a = 0 and b is constant over the state space.
This is the case with Wiener processes, where G = G† = 12 ∇2 . Thus, the heat
equation holds both for the evolution of observable functions of the Wiener pro-
cess, and for the evolution of the Wiener process’s density. You should convince
yourself that there is no non-negative integrable ρ such that Gρ(x) = 0.
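By contrast, adding a restoring drift gives a stationary solution. Below is a small symbolic sketch (my addition, using sympy, with an assumed linear drift a(x) = −γx and constant noise amplitude σ, not an example from the text): the Gaussian density solves the stationary forward equation G†ρ = 0 built from Eq. 20.20.

```python
# Sketch under assumed coefficients a(x) = -gamma*x, b = sigma: check symbolically
# that rho(x) ∝ exp(-gamma x^2 / sigma^2) satisfies G†rho = 0, i.e. is a
# stationary solution of the forward equation.
import sympy as sp

x = sp.symbols('x', real=True)
gamma, sigma = sp.symbols('gamma sigma', positive=True)

rho = sp.exp(-gamma * x**2 / sigma**2)        # unnormalized Gaussian density
a = -gamma * x                                # drift coefficient
drift_term = -sp.diff(a * rho, x)             # -d/dx [a(x) rho(x)]
diffusion_term = sp.Rational(1, 2) * sp.diff(sigma**2 * rho, x, 2)
print(sp.simplify(drift_term + diffusion_term))   # prints 0
```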
20.3 Exercises
Exercise 20.1 (Langevin equation with a conservative force) A conser-
vative force is one derived from an external potential, i.e., there is a function
φ(x) giving energy, and F (x) = −dφ/dx. The equations of motion for a body
subject to a conservative force, drag, and noise read
dx = (p/m)dt   (20.22)
dp = −γp dt + F(x)dt + σdW   (20.23)
Chapter 21
Small-Noise SDEs
In this chapter, we will use the results we have already obtained about SDEs to give a rough estimate of a basic problem, frequently arising in practice,1
namely taking a system governed by an ordinary differential equation and seeing
how much effect injecting a small amount of white noise has. More exactly,
we will put an upper bound on the probability that the perturbed trajectory
goes very far from the unperturbed trajectory, and see the rate at which this
1 For applications in statistical physics and chemistry, see Keizer (1987). For applications
in signal processing and systems theory, see Kushner (1984). For applications in nonparamet-
ric regression and estimation, and also radio engineering (!) see Ibragimov and Has’minskii
(1979/1981). The last book is especially recommended for those who care about the connec-
tions between stochastic process theory and statistical inference, but unfortunately expound-
ing the results, or even just the problems, would require a too-long detour through asymptotic
statistical theory.
probability goes to zero as the amplitude ε of the noise shrinks; this will be O(e^{−Cε^{−2}}). This will be our first illustration of a large deviations calculation. It
will be crude, but it will also introduce some themes to which we will return
(inshallah!) at greater length towards the end of the course. Then we will see
that the major improvement of the more refined tools is to give a lower bound
to match the upper bound we will calculate now, and see that we at least got
the logarithmic rate right.
I should say before going any further that this example is shamelessly ripped
off from Freidlin and Wentzell (1998, ch. 3, sec. 1, pp. 70–71), which is the
book on the subject of large deviations for continuous-time processes.
‖G_ε f − G_0 f‖_∞ = ‖a_i∂_i f(x) + (ε²/2)∇²f(x) − a_i∂_i f(x)‖_∞   (21.4)
= (ε²/2)‖∇²f(x)‖_∞   (21.5)
2 You can amuse yourself by showing this. Remember that X_y(t) → X_x(t) in distribution is equivalent to E[f(X_t)|X_0 = y] → E[f(X_t)|X_0 = x] for all bounded continuous f, and the solution of an ODE depends continuously on its initial condition.
which goes to zero as ε → 0. But this is condition (i) of the convergence theorem, which is equivalent to condition (iv), that convergence in distribution of the initial condition implies convergence in distribution of the whole trajectory. Since the initial condition is the same non-random point x_0 for all ε, we have X^ε → X^0 in distribution as ε → 0. In fact, since X^0 is non-random, we have that X^ε → X^0 in probability. That last assertion really needs some consideration of metrics on the space of continuous random functions to make sense (see Appendix A2 of Kallenberg), but once that's done, the upshot is
Theorem 267 (Small-Noise SDEs Converge in Probability on No-Noise ODEs) Let Δ_ε(t) = |X^ε(t) − X^0(t)|. For every T > 0, δ > 0,
lim_{ε→0} P( sup_{0≤t≤T} Δ_ε(t) > δ ) = 0   (21.6)
The only random component on the RHS is the supremum of the Wiener process,
so we’re in business, at least once we take on two standard results, one about
the Wiener process itself, the other just about multivariate Gaussians.
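Before the formal bound, here is a Monte Carlo sketch of Theorem 267 (my own, with the assumed toy dynamics dx = −x dt + ε dW, so that the unperturbed solution from x_0 = 0 is identically zero):

```python
# Estimate P(sup_{t<=T} |X^eps(t) - X^0(t)| > delta) by simulation; with
# X^0 = 0 the deviation is just |X^eps|. The probability drops fast as eps -> 0.
import numpy as np

rng = np.random.default_rng(3)
T, dt, delta, n_paths = 1.0, 1e-3, 0.1, 2_000
n_steps = int(T / dt)

for eps in [0.1, 0.05, 0.025]:
    x = np.zeros(n_paths)
    hit = np.zeros(n_paths, dtype=bool)
    for _ in range(n_steps):
        x += -x * dt + eps * np.sqrt(dt) * rng.normal(size=n_paths)
        hit |= np.abs(x) > delta
    print(eps, hit.mean())
```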
P(Δ*_ε(T) > δ) ≤ P( |W|*(T) > δe^{−K_a T}/ε )   (21.19)
= 2P( |W(T)| > δe^{−K_a T}/ε )   (21.20)
≤ (4 / 2^{d/2}Γ(d/2)) (δ²e^{−2K_a T}/ε²)^{d/2−1} e^{−δ²e^{−2K_a T}/2ε²}   (21.21)
5. Suppose we're right about the rate (which, it will turn out, we are), and it holds both from above and below. It would be nice to be able to say something like
P(Δ*_ε(T) > δ) → C₁(δ, T)e^{−C₂(δ,T)ε^{−2}}   (21.24)
rather than
ε² log P(Δ*_ε(T) > δ) → −C₂(δ, T)   (21.25)
The difficulty with making an assertion like 21.24 is that the large deviation probability actually converges on any function which goes to zero asymptotically! So, to extract the actual rate of dependence, we need to get a result like 21.25.
More generally, one consequence of Theorem 270 is that SDE trajectories
which are far from the trajectory of the ODE have exponentially small proba-
bilities. The vast majority of the probability will be concentrated around the
unperturbed trajectory. Reasonable sample-path functionals can therefore be
well-approximated by averaging their value over some small (δ) neighborhood of
the unperturbed trajectory. This should sound very similar to Laplace’s method
for the evaluation of asymptotic integrals in Euclidean space, and in fact one of
the key parts of large deviations theory is an extension of Laplace’s method to
infinite-dimensional function spaces.
In addition to this mathematical content, there is also a close connection
to the principle of least action in physics. In classical mechanics, the system
follows the trajectory of least action, the “action” of a trajectory being the
integral of the kinetic minus potential energy along that path. In quantum
mechanics, this is no longer an axiom but a consequence of the dynamics: the
action-minimizing trajectory is the most probable one, and large deviations from
it have exponentially small probability. Similarly, the theory of large deviations
can be used to establish quite general stochastic principles of least action for
Markovian systems.3
3 For a fuller discussion, see Eyink (1996) and Freidlin and Wentzell (1998, ch. 3).
Chapter 22
Spectral Analysis and L² Ergodicity
Section 22.1 makes sense of the idea of white noise. This forms the bridge from the ideas about Wiener integrals of the previous lectures to the spectral and ergodic theory which we will pursue here.
Section 22.2 introduces the spectral representation of weakly sta-
tionary processes, and the central Wiener-Khinchin theorem con-
necting autocovariance to the power spectrum. Subsection 22.2.1
explains why white noise is “white”.
Section 22.3 gives our first classical ergodic result, the “mean
square” (L2 ) ergodic theorem for weakly stationary processes. Sub-
section 22.3.1 gives an easy proof of a sufficient condition, just using
the autocovariance. Subsection 22.3.2 gives a necessary and suffi-
cient condition, using the spectral representation.
This is one example of an “analysis”, in the original sense of resolving into parts,
of a function into a collection of orthogonal basis functions. (You can find the
details in any book on Fourier analysis, as well as the varying conventions on
where the 2π goes, which side gets the e−iνt , the constraints on x̃ which arise
from the fact that x is real, etc.)
There are various reasons to prefer the trigonometric basis functions eiνt
over other possible choices. One is that they are invariant under translation
in time, which just changes phases.1 This suggests that the Fourier basis will
be particularly useful when dealing with time-invariant systems. For stochas-
tic processes, however, time-invariance is stationarity. This suggests that there
should be some useful way of doing Fourier analysis on stationary random func-
tions. In fact, it turns out that stationary and even weakly-stationary processes
can be productively Fourier-transformed. This is potentially a huge topic, es-
pecially when it’s expanded to include representing random functions in terms
of (countable) series of orthogonal functions. The spectral theory of random
functions connects Fourier analysis, disintegration of measures, Hilbert spaces
and ergodicity. This lecture will do no more than scratch the surface, covering,
in succession, white noise, the basics of the spectral representation of weakly-
stationary random functions and the fundamental Wiener-Khinchin theorem
linking covariance functions to power spectra, why white noise is called “white”,
and the mean-square ergodic theorem.
Good sources, if you want to go further, are the books of Bartlett (1955,
ch. 6) (from whom I’ve stolen shamelessly), the historically important and in-
spiring Wiener (1949, 1961), and of course Doob (1953). Loève (1955, ch. X) is
highly edifying, particularly his discussion of Karhunen-Loève transforms, and the
associated construction of the Wiener process as a Fourier series with random
phases.
We can give meaning to the white noise ξ, formally “dW/dt”, through integration by parts:
d(uW)/dt = u(dW/dt) + W(du/dt)   (22.1)
= u(t)ξ(t) + u̇(t)W(t)   (22.2)
∫_0^t [d(uW)/ds] ds = ∫_0^t u̇(s)W(s) + u(s)ξ(s) ds   (22.3)
u(t)W(t) − u(0)W(0) = ∫_0^t u̇(s)W(s)ds + ∫_0^t u(s)ξ(s)ds   (22.4)
∫_0^t u(s)ξ(s)ds ≡ u(t)W(t) − ∫_0^t u̇(s)W(s)ds   (22.5)
We can take the last line to define ξ, and time-integrals within which it appears.
Notice that the terms on the RHS are well-defined without the Itô calculus: one
is just a product of two measurable random variables, and the other is the time-
integral of a continuous random function. With this definition, we can establish
some properties of ξ.
Proof:
∫_0^t (a₁u₁(s) + a₂u₂(s))ξ(s)ds   (22.7)
= (a₁u₁(t) + a₂u₂(t))W(t) − ∫_0^t (a₁u̇₁(s) + a₂u̇₂(s))W(s)ds
= a₁ ∫_0^t u₁(s)ξ(s)ds + a₂ ∫_0^t u₂(s)ξ(s)ds   (22.8)
Proposition 272 (White Noise Has Mean Zero) For all t, E [ξ(t)] = 0.
Proof:
∫_0^t u(s)E[ξ(s)]ds = E[∫_0^t u(s)ξ(s)ds]   (22.9)
= E[u(t)W(t) − ∫_0^t u̇(s)W(s)ds]   (22.10)
= E[u(t)W(t)] − ∫_0^t u̇(s)E[W(s)]ds   (22.11)
= 0 − 0 = 0   (22.12)
Proposition 273 (White Noise and Itô Integrals) For all u ∈ C¹, ∫_0^t u(s)ξ(s)ds = ∫_0^t u(s)dW.
Proof: By Itô's formula (for C¹ u, this is just ordinary calculus plus the Itô integral),
d(uW) = W(t)u̇(t)dt + u(t)dW   (22.13)
u(t)W(t) = ∫_0^t u̇(s)W(s)ds + ∫_0^t u(s)dW   (22.14)
∫_0^t u(s)dW = u(t)W(t) − ∫_0^t u̇(s)W(s)ds   (22.15)
= ∫_0^t u(s)ξ(s)ds   (22.16)
This could be used to extend the definition of white-noise integrals to any
Itô-integrable process.
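Here is a numerical check of Proposition 273 on a single sample path (my addition, taking u(t) = cos t as an arbitrary C¹ test function): the integration-by-parts definition from Eq. 22.5 and the Itô sum agree.

```python
# Compare the Ito sum  sum_i u(t_i) dW_i  with the white-noise definition
# u(t)W(t) - int_0^t u'(s) W(s) ds  on one discretized Wiener path.
import numpy as np

rng = np.random.default_rng(4)
t_max, n = 1.0, 100_000
t = np.linspace(0.0, t_max, n + 1)
dW = rng.normal(0.0, np.sqrt(t_max / n), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

u, udot = np.cos(t), -np.sin(t)
ito_sum = (u[:-1] * dW).sum()
by_parts = u[-1] * W[-1] - (udot[:-1] * W[:-1] * (t_max / n)).sum()
print(ito_sum, by_parts)     # agree up to O(1/n) discretization error
```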
Proof: Since E[ξ(t)] = 0, we just need to show that E[ξ(t₁)ξ(t₂)] = δ(t₁ − t₂). Remember (Eq. 17.14 on p. 127) that E[W(t₁)W(t₂)] = t₁ ∧ t₂.
∫_0^t ∫_0^t u(t₁)u(t₂)E[ξ(t₁)ξ(t₂)]dt₁dt₂   (22.17)
= E[∫_0^t u(t₁)ξ(t₁)dt₁ ∫_0^t u(t₂)ξ(t₂)dt₂]   (22.18)
= E[(∫_0^t u(t₁)ξ(t₁)dt₁)²]   (22.19)
= ∫_0^t E[u²(t₁)]dt₁ = ∫_0^t u²(t₁)dt₁   (22.20)
using the preceding proposition, the Itô isometry, and the fact that u is non-random. But
∫_0^t ∫_0^t u(t₁)u(t₂)δ(t₁ − t₂)dt₁dt₂ = ∫_0^t u²(t₁)dt₁   (22.21)
Proof: To show that it is Gaussian, use Exercise 19.6. The mean is constant
for all times, and the covariance depends only on |t1 −t2 |, so it satisfies Definition
50 and is weakly stationary. But a weakly stationary Gaussian process is also
strongly stationary.
Remark 2. Notice that the only part of the right-hand side of Equation
22.22 which depends on t is the integrand, e−iνt , which just changes the phase
of each Fourier component deterministically. Roughly speaking, for a fixed ω the
amplitudes of the different Fourier components in X(t, ω) are fixed, and shifting
forward in time just involves changing their phases. (Making this simple is why
we have to allow X̃ to have complex values.)
The spectral representation is another stochastic integral, like the Itô integral
we saw in Section 19.1. There, the measure of the time interval [t1 , t2 ] was given
by the increment of the Wiener process, W (t2 )−W (t1 ). Here, for any frequency
interval [ν1 , ν2 ], the increment X̃(ν2 ) − X̃(ν1 ) defines a random set function
(admittedly, a complex-valued one). Since those intervals are a generating class
for the Borel σ-field, the usual arguments extend this set function uniquely to a random complex measure on (R, B). When we write something like ∫G(ν)dX̃(ν),
we mean an integral with respect to this measure.
Rather than dealing directly with this measure, we can, as with the Itô inte-
gral, use approximation by elementary processes. That is, we should interpret
∫_{−∞}^{∞} G(t, ν)dX̃(ν)
as the limit of sums Σ_i G(t, ν_i)(X̃(ν_{i+1}) − X̃(ν_i)) as sup_i (ν_{i+1} − ν_i) goes to zero. This limit can be shown to exist in pretty much exactly the same way we showed the corresponding limit for Itô integrals to exist.2
Proof: See Loève (1955, §34.4). You can prove this yourself, however, using
the material on characteristic functions in 36-752.
Definition 281 (Jump of the Spectral Process) The jump of the spec-
tral process at ν is the difference between the right- and left- hand limits at ν,
∆X̃(ν) ≡ X̃(ν + 0) − X̃(ν − 0).
Remark 1: As usual, X̃(ν+0) ≡ limh↓0 X̃(ν + h), and X̃(ν−0) ≡ limh↓0 X̃(ν − h).
Remark 2: Some people call the set of points at which the jump is non-
zero the “spectrum”. This usage comes from functional analysis, but seems
needlessly confusing in the present context.
Remark. Some people prefer to talk about the spectral function as the
Fourier transform of the autocorrelation function, rather than of the autoco-
variance. This has the advantage that the spectral function turns out to be
a normalized cumulative distribution function (see Theorem 286 immediately
below), but is otherwise inconsequential.
Theorem 286 (Weakly Stationary Processes Have Spectral Functions)
The spectral function exists for every weakly stationary process, if Γ(τ ) is con-
tinuous. Moreover, S(ν) ≥ 0, S is non-decreasing, S(−∞) = 0, S(∞) = Γ(0),
and limh↓0 S(ν + h) and limh↓0 S(ν − h) exist for every ν.
Proof: Usually, by an otherwise-obscure result in Fourier analysis called
Bochner’s theorem. A more direct proof is due to Loève. Assume, without loss
of generality, that E [X(t)] = 0.
Start by defining
H_T(ν) ≡ (1/√T) ∫_{−T/2}^{T/2} e^{iνt}X(t)dt   (22.30)
and define f_T(ν) through H:
2πf_T(ν) ≡ E[H_T(ν)H†_T(ν)]   (22.31)
= (1/T) E[∫_{−T/2}^{T/2} ∫_{−T/2}^{T/2} e^{iνt₁}X(t₁)e^{−iνt₂}X†(t₂)dt₁dt₂]   (22.32)
= (1/T) ∫_{−T/2}^{T/2} ∫_{−T/2}^{T/2} e^{iν(t₁−t₂)}E[X(t₁)X(t₂)]dt₁dt₂   (22.33)
= (1/T) ∫_{−T/2}^{T/2} ∫_{−T/2}^{T/2} e^{iν(t₁−t₂)}Γ(t₁ − t₂)dt₁dt₂   (22.34)
= ∫_{−T}^{T} (1 − |τ|/T) Γ(τ)e^{iντ}dτ   (22.35)
Recall that Γ(τ) defines a non-negative quadratic form, meaning that
Σ_{s,t} a†_s a_t Γ(t − s) ≥ 0
for any sets of times and any complex numbers a_t. This will in particular work if the complex numbers lie on the unit circle and can be written e^{iνt}. This means that integrals
∫∫ e^{iν(t₁−t₂)}Γ(t₁ − t₂)dt₁dt₂ ≥ 0   (22.36)
so f_T(ν) ≥ 0.
Define φ_T(τ) as the integrand in Eq. 22.35, so that
f_T(ν) = (1/2π) ∫_{−∞}^{∞} φ_T(τ)e^{iντ}dτ   (22.37)
which is recognizable as a proper Fourier transform. Now pick some N > 0 and massage the equation so it starts to look like an inverse transform.
f_T(ν)e^{−iνt} = (1/2π) ∫_{−∞}^{∞} φ_T(τ)e^{iντ}e^{−iνt}dτ   (22.38)
(1 − |ν|/N) f_T(ν)e^{−iνt} = (1/2π) ∫_{−∞}^{∞} φ_T(τ)e^{iντ}e^{−iνt}(1 − |ν|/N)dτ   (22.39)
Integrating both sides of Eq. 22.39 over ν ∈ [−N, N] turns the kernel on the right-hand side into the Fejér kernel N[sin(N(τ − t)/2)/(N(τ − t)/2)]². Notice that this kernel is non-negative, that
lim_{t→τ} N [sin(N(τ − t)/2)/(N(τ − t)/2)]² = N
while, on the other hand, if τ ≠ t,
lim_{N→∞} N [sin(N(τ − t)/2)/(N(τ − t)/2)]² = 0
uniformly over any bounded interval in t. (You might find it instructive to try plotting this function; you will need to be careful near the origin!) In other words, this is, up to the factor 1/2π, a representation of the Dirac delta function, so that
lim_{N→∞} (1/2π) ∫_{−∞}^{∞} φ_T(τ) N [sin(N(τ − t)/2)/(N(τ − t)/2)]² dτ = φ_T(t)
that is,
φ_T(t) = lim_{N→∞} ∫_{−N}^{N} (1 − |ν|/N) f_T(ν)e^{−iνt}dν
Definition 288 (Jump of the Spectral Function) The jump of the spectral
function at ν, ∆S(ν), is S(ν + 0) − S(ν − 0).
using the fact that integration and expectation commute to (formally) bring the expectation inside the integral. Since X̃ has orthogonal increments, E[dX̃†_{ν₁}dX̃_{ν₂}] = 0 unless ν₁ = ν₂. This turns the double integral into a single integral, and kills the e^{−it(ν₁−ν₂)} factor, which had to go away because t was arbitrary.
Γ(τ) = ∫_{−∞}^{∞} e^{−iτν} E[d(X̃†_ν X̃_ν)]   (22.52)
= ∫_{−∞}^{∞} e^{−iτν} dV_ν   (22.53)
using the definition of the power spectrum. Since Γ(τ) = ∫_{−∞}^{∞} e^{−iτν}dV_ν, it
follows that Sν and Vν differ by a constant, namely the value of V (−∞), which
can be chosen to be zero without affecting the spectral representation of X.
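As a numerical illustration (mine, not the text's), one can estimate the spectral density of a weakly stationary sequence by averaging periodograms, which is essentially the finite-T quantity f_T of Eq. 22.31, and compare with what the Wiener-Khinchin relation predicts from the autocovariance; here for a discrete-time AR(1) with unit-variance innovations.

```python
# Averaged periodogram of an AR(1) versus its analytic spectral density
# f(nu) = 1 / (2 pi (1 - 2 rho cos nu + rho^2)).
import numpy as np

rng = np.random.default_rng(5)
rho, T, n_rep = 0.6, 1024, 400
avg = np.zeros(T)
for _ in range(n_rep):
    x = np.zeros(T)
    eps = rng.normal(size=T)
    for i in range(1, T):
        x[i] = rho * x[i - 1] + eps[i]
    avg += np.abs(np.fft.fft(x))**2 / (2 * np.pi * T)
avg /= n_rep

nu = 2 * np.pi * np.arange(T) / T
analytic = 1.0 / (2 * np.pi * (1 - 2 * rho * np.cos(nu) + rho**2))
print(np.abs(avg - analytic).max())   # small, up to sampling noise and finite-T bias
```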
(Only considering time averages starting from zero involves no loss of gen-
erality for weakly stationary processes: why?)
Definition 292 (Integral Time Scale) The integral time scale of a weakly-stationary random process is
τ_int ≡ (∫_0^∞ |Γ(τ)|dτ) / Γ(0)   (22.54)
3 Proverbially: “time averages converge on space averages”, the space in question being
the state space Ξ; or “converge on phase averages”, since physicists call certain kinds of state
space “phase space”.
4 See von Plato (1994, ch. 3) for a fascinating history of the development of ergodic theory
through the 1930s, and its place in the history of mathematical probability.
Proof: Since X(t) is centered, E[X̄(T)] = 0, and ‖X̄(T)‖₂² = Var[X̄(T)]. Everything else follows from re-arranging the bound in the proof of Theorem 293, Definition 292, and the fact that Γ(0) = Var[X(0)].
As a consequence of the corollary, if T ≫ τ_int, then the variance of the time average is negligible compared to the variance at any one time.
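The discrete-time analogue is easy to see numerically; the sketch below (my own, with an AR(1) whose integral time scale is about 1/(1 − ρ)) shows the variance of the time average scaling like Γ(0)τ_int/T.

```python
# Variance of time averages of an AR(1) with Gamma(k) = rho^|k| (unit variance):
# Var[mean over T] is approximately (1+rho)/((1-rho) T) = O(tau_int / T).
import numpy as np

rng = np.random.default_rng(6)
rho, T, n_rep = 0.8, 10_000, 300

means = []
for _ in range(n_rep):
    eps = rng.normal(0.0, np.sqrt(1 - rho**2), size=T)
    x = np.empty(T)
    x[0] = rng.normal()
    for i in range(1, T):
        x[i] = rho * x[i - 1] + eps[i]
    means.append(x.mean())

print("empirical Var[X_bar(T)] :", np.var(means))
print("(1+rho)/((1-rho) T)     :", (1 + rho) / ((1 - rho) * T))
```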
As T → ∞, these integrals pick out ΔX̃(f) and ΔX̃†(f). So the time average of e^{ift}X(t) converges in L² to ΔX̃(f).
Notice that the limit provided by the lemma is a random quantity. What’s
really desired, in most applications, is convergence to a deterministic limit,
which here would mean convergence (in L2 ) to zero.
Proof: Taking f = 0 in Lemma 297, X̄(T) → ΔX̃(0) in L², the jump of the spectral process at zero. Let's show that (i) the expectation of this jump is zero, and that (ii) its variance is given by the integral expression on the LHS of Eq. 22.66. For (i), because X̄(T) → Y in L², we know that E[X̄(T)] → E[Y]. But E[X̄(T)] is just the time average of E[X(t)], which is 0. So E[ΔX̃(0)] = 0. For (ii), Lemma 295, plus the fact that E[ΔX̃(0)] = 0, shows that the variance is equal to the jump in the spectrum at 0. But, by Lemma 296 with ν = 0, that jump is exactly the LHS of Eq. 22.66.
Remark 1: Notice that if the integral time is finite, then the integral condi-
tion on the autocovariance is automatically satisfied, but not vice versa, so the
hypotheses here are strictly weaker than in Theorem 293.
Remark 2: One interpretation of the theorem is that the time-average is con-
verging on the zero-frequency component of the spectral process. Intuitively,
all the other components are being killed off by the time-averaging, because the
because time-integral of a sinusoidal function is bounded, but the denominator
in a time average is unbounded. The only part left is the zero-frequency com-
ponent, whose time integral can also grow linearly with time. If there is a jump
at 0, then this has finite variance; if not, not.
Remark 3: Lemma 297 establishes the L2 convergence of time-averages of
the form
(1/T) ∫_0^T e^{ift}X(t)dt
for any real f . Specifically, from Lemma 295, the mean-square of this variable
is converging on the jump in the spectrum at f . Multiplying X(t) by eif t makes
the old frequency f component the new frequency zero component, so it is the
surviving term. While the ergodic theorem itself only needs the f = 0 case, this
result is useful in connection with estimating spectra from time series (Doob,
1953, ch. X, §7).
22.4 Exercises
Exercise 22.1 (Mean-Square Ergodicity in Discrete Time) It is often
convenient to have a mean-square ergodic theorem for discrete-time sequences
rather than continuous-time processes. If the dt in the definition of X is re-
interpreted as counting measure on N, rather than Lebesgue measure on R+ ,
does the proof of Theorem 293 remain valid? (If yes, say why; if no, explain
where the argument fails.)
Part V
Ergodic Theory
Chapter 23
Ergodic Properties and Ergodic Limits
Now, this is how Gnedenko and Kolmogorov introduced their classic study of the
limit laws for independent random variables, but most of the random phenomena
we encounter around us are not independent. Ergodic theory is a study of
how large-scale dependent random phenomena nonetheless create nonrandom
regularity. The classical limit laws for IID variables X₁, X₂, . . . assert that
(1/n) Σ_{i=1}^{n} X_i → E[X₁]
1 Among mathematical scientists, anyway.
where the sense of convergence can be “almost sure” (strong law of large num-
bers), “Lp ” (pth mean), “in probability” (weak law), etc., depending on the
hypotheses we put on the Xi . One meaning of this convergence is that suffi-
ciently large random samples are representative of the entire population — that
the sample mean makes a good estimate of E [X].
The ergodic theorems, likewise, assert that for dependent sequences X1 , X2 , . . .,
time averages converge on expectations
(1/t) Σ_{i=1}^{t} X_i → E[X_∞ | I]
where X∞ is some limiting random variable, or in the most useful cases a non-
random variable, and I is a σ-field representing some sort of asymptotic in-
formation. Once again, the mode of convergence will depend on the kind of
hypotheses we make about the random sequence X. Once again, the interpre-
tation is that a single sample path is representative of an entire distribution
over sample paths, if it goes on long enough. The IID laws of large numbers
are, in fact, special cases of the corresponding ergodic theorems.
Section 22.3 proved a mean-square (L2 ) ergodic theorem for weakly sta-
tionary continuous-parameter processes.2 The next few chapters, by contrast,
will develop ergodic theorems for non-stationary discrete-parameter processes.3
This is a little unusual, compared to most probability books, so let me say a
word or two about why. (1) Results we get will include stationary processes as
special cases, but stationarity fails for many applications where ergodicity (in a
suitable sense) holds. So this is more general and more broadly applicable. (2)
Our results will all have continuous-time analogs, but the algebra is a lot cleaner
in discrete time. (3) Some of the most important applications (for people like
you!) are to statistical inference and learning with dependent samples, and to
Markov chain Monte Carlo, and both of those are naturally discrete-parameter
processes. We will, however, stick to continuous state spaces.
Lemma 303 (Invariant Sets are a σ-Algebra) The class I of all measurable
invariant sets in Ξ forms a σ-algebra.
Proof: Clearly, Ξ is invariant. The other properties of a σ-algebra follow
because set-theoretic operations (union, complementation, etc.) commute with
taking inverse images.
Definition 305 (Infinitely Often, i.o.) For any set C ∈ X , the set C in-
finitely often, Ci.o. , consists of all those points in Ξ whose trajectories visit C
infinitely often, Ci.o. ≡ lim supt T −t C.
A measurable set is invariant µ-a.e., when its indicator function is almost in-
variant.
Remark 1: Some of the older literature annoyingly calls these objects totally
invariant.
Remark 2: Invariance implies invariance µ-almost everywhere, for any µ.
Lemma 309 (Invariance for simple functions) A simple function, f(x) = Σ_{k=1}^{m} a_k 1_{C_k}(x), is invariant if and only if all the sets C_k ∈ I. Similarly, a simple function is almost invariant iff all the defining sets are almost invariant.
Lemma 311 (Time averages are observables) For every t, the time-average
of an observable is an observable.
Proof: The class of measurable functions is closed under finite iterations
of arithmetic operations.
Definition 312 (Ergodic Property) An observable f has the ergodic prop-
erty when At f (x) converges as t → ∞ for µ-almost-all x. An observable has
the mean ergodic property when At f (x) converges in L1 (µ), and similarly for
the other Lp ergodic properties. If for some class of functions D, every f ∈ D
has an ergodic property, then the class D has that ergodic property.
Remark. Notice that what is required for f to have the ergodic property is
that almost every initial point has some limit for its time average,
µ{ x ∈ Ξ : ∃r ∈ R, lim_{t→∞} A_t f(x) = r } = 1   (23.3)
This does not mean that there is some common limit for almost every initial point,
∃r ∈ R : µ{ x ∈ Ξ : lim_{t→∞} A_t f(x) = r } = 1   (23.4)
Similarly, a class of functions has the ergodic property if all of their time averages
converge; they do not have to converge uniformly.
Definition 313 (Ergodic Limit) If an observable f has the ergodic property,
define Af (x) to be the limit of At f (x) where that exists, and 0 elsewhere. The
domain of A consists of all and only the functions with ergodic properties.
Observe that
Af(x) = lim_{t→∞} (1/t) Σ_{n=0}^{t} K^n f(x)   (23.5)
That is, A is the limit of an arithmetic mean of conditional expectations. This
suggests that it should itself have many of the properties of conditional expec-
tations. In fact, under a reasonable condition, we will see that Af = E [f |I],
expectation conditional on the σ-algebra of invariant sets. We’ll check first that
A has the properties we’d want from a conditional expectation.
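To make the ergodic limit concrete, here is a small numerical sketch (my addition, using the irrational rotation Tx = x + φ mod 1, which the notes later show is ergodic for Lebesgue measure): the time averages A_t f settle on the space average ∫_0^1 f(x)dx.

```python
# Time averages under the rotation T x = (x + phi) mod 1 converge to the
# space average; for f(x) = cos(2 pi x)^2 that average is 1/2.
import numpy as np

phi = (np.sqrt(5) - 1) / 2          # irrational angle
f = lambda x: np.cos(2 * np.pi * x)**2

x, running = 0.123, 0.0             # arbitrary starting point
for t in range(1, 100_001):
    running += f(x)
    x = (x + phi) % 1.0

print(running / 100_000)            # close to 0.5
```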
Proof: If c is any real number, then At cf (x) = cAt f (x), and so clearly, if
the limit exists, Acf (x) = cAf (x). Similarly, At (f + g)(x) = At f (x) + At g(x),
so if f and g both have ergodic properties, then so does f +g, and A(f +g)(x) =
Af (x) + Ag(x).
Proof: The event {Af(x) < 0} is a sub-event of ∪_n {f(T^n(x)) < 0}. Since the union of a countable collection of measure-zero events has measure zero, Af(x) ≥ 0 almost everywhere.
Notice that our hypothesis is that f T n ≥ 0 a.e. for all n, not just that f ≥ 0.
The reason we need the stronger assumption is that the transformation T might
map every point to the bad set of f ; the lemma guards against that. Of course,
if f (x) ≥ 0 for all, and not just almost all, x, then the bad set is non-existent,
and Af ≥ 0 follows automatically.
Lemma 316 (Constant functions have the ergodic property) The con-
stant function 1 has the ergodic property. Consequently, so does every other
constant function.
Proof: Start with n = 1, and show that the discrepancy goes to zero.
AKf(x) − Af(x) = lim_t (1/t) Σ_{i=0}^{t−1} [K^{i+1}f(x) − K^i f(x)]   (23.6)
= lim_t (1/t) [K^t f(x) − f(x)]   (23.7)
Since Af(x) exists a.e., we know that the series (1/t) Σ_{i=0}^{t−1} K^i f(x) converges a.e., implying that (t+1)^{−1}K^t f(x) → 0 a.e. But t^{−1} = [(t+1)/t](t+1)^{−1}, and for large t, (t+1)/t < 2. Hence (t+1)^{−1}K^t f(x) ≤ t^{−1}K^t f(x) ≤ 2(t+1)^{−1}K^t f(x), implying that t^{−1}K^t f(x) itself goes to zero (a.e.). Similarly, t^{−1}f(x) must go to zero. Thus, overall, we have AKf(x) = Af(x) a.e., and Kf(x) ∈ Dom(A).
Proof: From Lemma 321, lim E [At f (X)] = E [f (X)]. Since the variables
At f (X) are uniformly integrable (as we saw in the proof of that lemma), it
follows (Proposition 4.12 in Kallenberg, p. 68) that they also converge in L1 (µ).
where L (X) = µ.
But (previous lemma) these functions converge in L1 (µ), so the limit of the
norm of their difference is zero.
Boundedness is not essential.
Proof: Examining the proofs shows that the boundedness of f was impor-
tant only to establish the uniform integrability of At f .
where U is the operator taking measures to measures. This leads us to the next
proposition.
Lemma 326 (Stationary Implies Asymptotically Mean Stationary) If
a dynamical system is stationary, i.e., T preserves the measure µ, then it is
asymptotically mean stationary, with stationary mean µ.
Proof: If T preserves µ, then for every measurable set, µ(C) = µ(T −1 C).
Hence every term in the sum in Eq. 23.20 is µ(C), and consequently the limit
exists and is equal to µ(C).
Proposition 327 (Vitali-Hahn Theorem) If mt are a sequence of probabil-
ity measures on a common σ-algebra X , and m(C) is a set function such that
limt mt (C) = m(C) for all C ∈ X , then m is a probability measure on X .
Proof: This is a standard result from measure theory.
Theorem 328 (Stationary Means are Invariant Measures) If a dynam-
ical system is asymptotically mean stationary, then its stationary mean is an
invariant probability measure.
Proof: For every t, let m_t(C) = (1/t) Σ_{n=0}^{t−1} µ(T^{−n}C). Then m_t is a linear
combination of probability measures, hence a probability measure itself. Since,
for every C ∈ X , lim mt (C) = m(C), by Definition 325, Proposition 327 says
that m(C) is also a probability measure. It remains to check invariance.
m(C) − m(T^{−1}C)   (23.21)
= lim_t (1/t) Σ_{n=0}^{t−1} µ(T^{−n}C) − lim_t (1/t) Σ_{n=0}^{t−1} µ(T^{−n}(T^{−1}C))
= lim_t (1/t) Σ_{n=0}^{t−1} [µ(T^{−n}C) − µ(T^{−n−1}C)]   (23.22)
= lim_t (1/t) [µ(C) − µ(T^{−t}C)]   (23.23)
Since the probability measure of any set is at most 1, the difference between
two probabilities is at most 1, and so m(C) = m(T −1 C), for all C ∈ X . But
this means that m is invariant under T (Definition 53).
Remark: Returning to the symbolic manipulations, if µ is AMS with sta-
tionary mean m, then U m = m (because m is invariant), and so we can write
µ = m + (µ − m), knowing that µ − m goes to zero under averaging. Speaking
loosely (this can be made precise, at the cost of a fairly long excursion) m is
an eigenvector of U (with eigenvalue 1), and µ − m lies in an orthogonal di-
rection, along which U is contracting, so that, under averaging, it goes away,
leaving only m, which is like the projection of the original measure µ on to the
invariant manifold of U .
The relationship between an AMS measure µ and its stationary mean m
is particularly simple on invariant sets: they are equal there. A slightly more
general theorem is actually just as easy to prove, however, so we’ll do that.
Lemma 329 (Expectations of Almost-Invariant Functions) If µ is AMS
with limit m, and f is an observable which is invariant µ-a.e., then Eµ [f ] =
Em [f ].
Proof: Let C be any almost invariant set. Then, for any t, C and T −t C
differ by, at most, a set of µ-measure 0, so that µ(C) = µ(T −t C). The definition
of the stationary mean (Equation 23.20) then gives µ(C) = m(C), or Eµ [1C ] =
Em [1C ], i.e., the result holds for indicator functions. By Lemma 309, this then
extends to simple functions. The usual arguments then take us to all functions
which are measurable with respect to I 0 , the σ-field of almost-invariant sets,
but this (Lemma 308) is the class of all almost-invariant functions.
Lemma 330 (Limit of Cesàro Means of Expectations) If µ is AMS with
stationary mean m, and f is a bounded observable,
lim_{t→∞} E_µ[A_t f] = E_m[f]   (23.24)
23.5 Exercises
Exercise 23.1 (“Infinitely often” implies invariant) Prove Lemma 306.
Chapter 24
The Almost-Sure Ergodic Theorem
Lemma 336 (Lower and Upper Limiting Time Averages are Invariant) For each observable f, Āf and A̲f are invariant functions.
Proof: Use our favorite trick, and write A_t f(Tx) = [(t+1)/t] A_{t+1}f(x) − f(x)/t. Clearly, the lim sup and lim inf of this expression will equal the lim sup and lim inf of A_{t+1}f(x), which is the same as that of A_t f(x).
Proof: Since Āf and A̲f are both invariant, they are both measurable with respect to I (Lemma 304), so the set of x such that Āf(x) = A̲f(x) is in I, therefore it is invariant (Definition 303).
Af = Em [f |I] (24.4)
Proof: From Theorem 333 and its corollaries, it is enough to prove that all L¹(m) observables have ergodic properties to get Eq. 24.4. From Lemma 338, it is enough to show that the observables have ergodic properties in the stationary system Ξ, X, m, T. (Accordingly, all expectations in the rest of this proof will be with respect to m.) Since any observable can be decomposed into its positive and negative parts, f = f⁺ − f⁻, assume, without loss of generality, that f is positive. Write f̄ for Āf = lim sup_t A_t f, and f̲ for A̲f = lim inf_t A_t f. Since f̄ ≥ f̲ everywhere, it suffices to show that E[f̄ − f̲] ≤ 0. This in turn will follow from E[f̄] ≤ E[f] ≤ E[f̲]. (Since f is bounded, the integrals exist.)
We'll prove that E[f̄] ≤ E[f], by showing that the time average comes close to its lim sup, but from above (in the mean). Proving that E[f̲] ≥ E[f] will be entirely parallel.
Since f is bounded, we may assume that f ≤ M everywhere.
For every ε > 0, for every x there exists a finite t such that
f̄(x) ≤ ε + A_t f(x)   (24.5)
This is because f̄ is the limit of the least upper bounds. (You can see where this is going already — the time-average has to be close to its lim sup, but close from above.)
Define t(x, ε) to be the smallest t such that f̄(x) ≤ ε + A_t f(x). Then, since f̄ is invariant, we can add from time 0 to time t(x, ε) − 1 and get:
Σ_{n=0}^{t(x,ε)−1} K^n f(x) + εt(x, ε) ≥ Σ_{n=0}^{t(x,ε)−1} K^n f̄(x)   (24.6)
Define B_N^ε ≡ {x | t(x, ε) ≥ N}, the set of “bad” x, where the sample average fails to come within a reasonable distance (ε) of the lim sup before time N. Because t(x, ε) is finite, m(B_N^ε) goes to zero as N → ∞. Choose an N such that m(B_N^ε) ≤ ε/M, and, for the corresponding bad set, drop the sub- and superscripts. (We'll see why this level is important presently.)
We’ll find it convenient to not deal directly with f , but with a related func-
tion which is better-behaved on the bad set B. Set f˜(x) = M when x ∈ B,
and = f (x) elsewhere. Similarly, define t̃(x, ) to be 1 if x ∈ B, and t(x, )
elsewhere. Notice that t̃(x, ) ≤ N for all x. Something like Eq. 24.6 still holds
for the nice-ified function f˜, specifically,
t̃(x,)−1 t̃(x,)−1
X X
K n f (x) ≤ K n f˜(x) + t̃(x, ) (24.7)
n=0 n=0
contradicting the definition of t(x, ε). Consequently, there can be no n < t(x, ε) such that T^n x ∈ B.
We are now ready to consider the time average AL f over a stretch of time
of some considerable length L. We’ll break the time indices over which we’re
averaging into blocks, each block ending when T t x hits B again. We need to
make sure that L is sufficiently large, and it will turn out that L ≥ N/(ε/M) suffices, so that NM/L ≤ ε. The end-points of the blocks are defined recursively,
CHAPTER 24. THE ALMOST-SURE ERGODIC THEOREM 201
starting with b₀ = 0, b_{k+1} = b_k + t̃(T^{b_k}x, ε). (Of course the b_k are implicitly dependent on x and ε and N, but suppress that for now, since these are constant through the argument.) The number of completed blocks, C, is the largest k such that L − 1 ≥ b_k. Notice that L − b_C ≤ N, because t̃(x, ε) ≤ N, so if L − b_C > N, we could squeeze in another block after b_C, contradicting its definition.
Now let’s examine the sum of the lim sup over the trajectory of length L.
L−1
X C
X bk
X L−1
X
K n f (x) = K n f (x) + K n f (x) (24.11)
n=0 k=1 n=bk−1 n=bC
where the last step, going from b_C to L, uses the fact that both ε and f̃ are non-negative. Taking expectations of both sides,
E[Σ_{n=0}^{L−1} K^n f̄(X)] ≤ E[Lε + M(N − 1) + Σ_{n=0}^{L−1} K^n f̃(X)]   (24.19)
Σ_{n=0}^{L−1} E[K^n f̄(X)] ≤ Lε + M(N − 1) + Σ_{n=0}^{L−1} E[K^n f̃(X)]   (24.20)
L E[f̄(X)] ≤ Lε + M(N − 1) + L E[f̃(X)]   (24.21)
CHAPTER 24. THE ALMOST-SURE ERGODIC THEOREM 202
using the fact that f̄ is invariant on the left-hand side, and that m is stationary on the other. Now divide both sides by L.
E[f̄(X)] ≤ ε + M(N − 1)/L + E[f̃(X)]   (24.22)
≤ 2ε + E[f̃(X)]   (24.23)
since MN/L ≤ ε. Now let's bound E[f̃(X)] in terms of E[f]:
E[f̃] = ∫ f̃(x)dm   (24.24)
= ∫_{B^c} f̃(x)dm + ∫_B f̃(x)dm   (24.25)
= ∫_{B^c} f(x)dm + ∫_B M dm   (24.26)
≤ E[f] + ∫_B M dm   (24.27)
= E[f] + M m(B)   (24.28)
≤ E[f] + M(ε/M)   (24.29)
= E[f] + ε   (24.30)
using the definition of f̃ in Eq. 24.26, the non-negativity of f in Eq. 24.27, and the bound on m(B) in Eq. 24.29. Substituting into Eq. 24.23,
E[f̄] ≤ E[f] + 3ε   (24.31)
as was to be shown.
The proof of E[f̲] ≥ E[f] proceeds in parallel, only the nice-ified function f̃ is set equal to 0 on the bad set.
Since E[f̲] ≥ E[f] ≥ E[f̄], we have that E[f̲ − f̄] ≥ 0. Since however it is always true that f̄ − f̲ ≥ 0, we may conclude that f̄ − f̲ = 0 m-almost everywhere. Thus m(L_f) = 1, i.e., the time average converges m-almost everywhere. Since this is an invariant event, it has the same measure under µ and its stationary limit m, and so the time average converges µ-almost-everywhere as well. By Corollary 334, Af = E_m[f | I], as promised.
Proof: We need merely show that the ergodic properties hold, and then the equation follows. To do so, define f̄^M(x) ≡ f̄(x) ∧ M, an upper-limited version of the lim sup. Reasoning entirely parallel to the proof of Theorem 339 leads to the conclusion that E[f̄^M] ≤ E[f]. Then let M → ∞, and apply the monotone convergence theorem to conclude that E[f̄] ≤ E[f]; the rest of the proof goes through as before.
Chapter 25
Ergodicity and Metric Transitivity
“action-angle” variables, the speed of rotation of the angular variables being set
by the actions, which are conserved, energy-like quantities. See Mackey (1992);
Arnol’d and Avez (1968) for the ergodicity of rotations and its limitations, and
Arnol’d (1978) for action-angle variables. Astonishingly, the result for the one-
dimensional case was proved by Nicholas Oresme in the 14th century (von Plato,
1994).
25.4 Exercises
Exercise 25.1 (Ergodicity implies an approach to independence) Prove
Lemma 352.
Exercise 25.3 (Invariant events and tail events) Prove that every invari-
ant event is a tail event. Does the converse hold?
Chapter 26
Decomposition of Stationary Processes into Ergodic Components
being set by the way the state space breaks up into invariant sets of points with
the same long-run average behavior — the ergodic components. Put slightly
differently, the long-run behavior of an AMS system can be represented as a
mixture of stationary, ergodic distributions, and the ergodic components are, in
a sense, a minimal parametrically sufficient statistic for this distribution. (They
are not in general predictively sufficient.)
The idea of an ergodic decomposition goes back to von Neumann, but was
considerably refined subsequently, especially by the Soviet school, who seem to
have introduced most of the talk of predictions, and all of the talk of ergodic
components as minimal sufficient statistics. Our treatment will follow Gray
(1988, ch. 7), and Dynkin (1978).
so ν is also invariant.
Proposition 356 Ergodic invariant measures are extremal points of the convex
set of invariant measures, i.e., they cannot be written as combinations of other
invariant measures.
Notice that whether or not λ(x) exists depends only on x (and T and X );
the initial distribution has nothing to do with it. Let’s look at some properties
of the long-run distributions. (The name “ergodic point” is justified by one of
them, Proposition 359.)
Proof: For every t, the set function given by At 1B (x) is clearly a probability
measure. Since λ(x) is defined by passage to the limit, the Vitali-Hahn Theorem
(Proposition 327) says λ(x) must be as well.
Proof: For every invariant set I, 1I (T n x) = 1I (x) for all n. Hence A1I (x)
exists and is either 0 or 1. This means λ(x) assigns every invariant set either
probability 0 or probability 1, so by Definition 341 it is ergodic.
Proof: By Lemma 317, A1B (x) = A1B (T x), when the appropriate limit exists.
Since, by assumption, it does in this case, for every measurable set λ(x, B) =
λ(T x, B), and the set functions are thus equal.
Proof: This is true, by the definition of λ(x), for the indicator functions of
all measurable sets. Thus, by linearity of At and of expectation, it is true for
all simple functions. Standard arguments then let us pass to all the functions
integrable with respect to the long-run distribution.
At this point, you should be tempted to argue as follows. If µ is an AMS
distribution with stationary mean m, then Af (x) = Em [f |I] for almost all x.
So, it’s reasonable to hope that m is a combination of the λ(x), and yet further
that
Af (x) = Eλ(x) [f ]
for µ-almost-all x. This is basically true, but will take some extra assumptions
to get it to work.
Obviously, the ergodic components partition the set of ergodic points. (The
partition is not necessarily countable, and in some important cases, such as
that of Hamiltonian dynamical systems in statistical mechanics, it must be
uncountable (Khinchin, 1949).) Intuitively, they form the coarsest partition
which is still fully informative about the long-run distribution. It’s also pretty
clear that the partition is left alone by the dynamics.
Proof: By Lemma 360, λ(x) = λ(T x), and the result follows.
Notice that I have been careful not to say that the ergodic components are
invariant sets, because we’ve been using that to mean sets which are both left
alone by the dynamics and are measurable, i.e. members of the σ-field X , and
we have not established that any ergodic component is measurable, which in
turn is because we have not established that λ(x) is a measurable function.
gives the ergodic component to which x belongs. The difficulty is that the
intersection is over all measurable sets B, and there are generally an uncountable
number of them (even if Ξ is countable!), so we have no guarantee that the
intersection of uncountably many measurable sets is measurable. Consequently,
we can’t say that any of the ergodic components is measurable.
The way out, as so often in mathematics, is to cheat; or, more politely,
to make an assumption which is strong enough to force open an exit, but not
so strong that we can't support it or verify it.1 What we will assume is that
there is a countable collection of sets G such that λ(x) = λ(y) if and only if
λ(x, G) = λ(y, G) for every G ∈ G. Then the intersection in Eq. 26.4 need only
run over the countable class G, rather than all of X , which will be enough to
reassure us that φ(x) is a measurable set.
are measurable, but you will find it instructive to try to work out the consequences of this
assumption, and to examine whether it holds for the Borel σ-field B — say on the unit interval,
to keep things easy.
so that
m(B) = ∫ λ(x, B) dµ(x)   (26.6)
Proof: For every set G ∈ G, At 1G (x) converges for µ- and m- almost all
x (Theorem 339). Since there are only countably many G, the set on which
they all converge also has probability 1; this set is E. Since (Proposition 362)
Af (x) = Eλ(x) [f ], and (Theorem 339 again) Af (x) = Em [f |I] a.s., we have
that Eλ(x) [f ] = Em [f |I] a.s.
Now let f = 1B . As we know (Lemma 331), Eµ [A1B (X)] = Em [1B (X)] =
m(B). But, for each x, A1B (x) = λ(x, B), so m(B) = Eµ [λ(X, B)].
In words, we have decomposed the stationary mean m into the long-run
distributions of the ergodic components, with weights given by the fraction of
the initial measure µ falling into each component. Because of Propositions 354
and 356, we may be sure that by mixing stationary ergodic measures, we obtain
an ergodic measure, and that our decomposition is unique.
will not be able to get to. Also, note that most textbooks on theoretical statistics state things
in terms of random variables and measurable functions thereof, rather than σ-fields, but this
is the more general case (Blackwell and Girshick, 1954).
Proof: Nearly obvious. “Only if”: since the conditional probability exists,
there must be some such function (it’s a version of the conditional probabil-
ity), and since all the conditional probabilities are versions of one another, the
function cannot depend on θ. “If”: In this case, we have a single function
which is a version of all the conditional probabilities, so it must be true that
Pθ (A|S) = Pθ0 (A|S).
Proof: The set of distributions P is now the set of all long-run distributions
generated by the dynamics, and θ is an index which tracks them all unambigu-
ously. We need to show both sufficiency and necessity. Sufficiency: The σ-field
generated by φ is the one generated by the ergodic components, σ({Ci }). (Be-
cause the Ci are mutually exclusive, this is a particularly simple σ-field.) Clearly,
Pθ (A|σ({Ci })) = λ(φ(x), A) for all x and θ, so (Lemma 374), φ is a sufficient
statistic. Necessity: Follows from the fact that a given ergodic component con-
tains all the points with a given long-run distribution. Coarser σ-fields will not,
therefore, preserve conditional probabilities.
This theorem may not seem particularly exciting, because there isn’t, neces-
sarily, anything whose distribution matches the long-run distribution. However,
it has deeper meaning under two circumstances when λ(x) really is the asymp-
totic distribution of random variables.
1. If Ξ is really a sequence space, so that X = S1 , S2 , S3 , . . ., then λ(x)
really is the asymptotic marginal distribution of the St , conditional on
the starting point.
2. Even if Ξ is not a sequence space, if stronger conditions than ergodicity
known as “mixing”, “asymptotic stability”, etc., hold, there are reason-
able senses in which L (Xt ) does converge, and converges on the long-run
distribution.4
In both these cases, knowing the ergodic component thus turns out to be neces-
sary and sufficient for knowing the asymptotic distribution of the observables.
(Cf. Corollary 378 below.)
4 Lemma 352 already gave us a kind of distributional convergence, but it is of a very
weak sort, known as “convergence in Cesàro mean”, which was specially invented to handle
sequences which are not convergent in normal senses! We will see that there is a direct
correspondence between levels of distributional convergence and levels of decay of correlations.
Theorem 376 Let Ξ, X be a measurable space, and let µ0 and µ1 be two infinite-
dimensional distributions of one-sided, discrete-parameter strictly-stationary Σ-
valued stochastic processes, i.e., µ0 and µ1 are distributions on ΞN , X N , and
they are invariant under the shift operator. If they are also ergodic under the
shift, then there exists a sequence of sets Rt ∈ X t such that µ0 (Rt ) → 0 while
µ1 (Rt ) → 1.
Corollary 377 Let H0 be “Xi are IID with distribution p0 ” and H1 be “Xi are
IID with distribution p1 ”. Then, as t → ∞, there exists a sequence of tests of
H0 against H1 whose size goes to 0 while their power goes to 1.
Proof: From Theorem 52, we can write X1t = π1:t U , for a sequence-valued
random variable U , using the projection operators of Chapter 2. For each
ergodic component, by Theorem 376, there exists a sequence of sets Rt,i such
that P (X1t ∈ Rt,i ) → 1 if U ∈ Ci , and goes to zero otherwise. Let φ(X1t ) be the
5 Introduced in Chapters 2 and 3. It’s possible to give an alternative construction using the
Hilbert space of all square-integrable random variables, and then projecting onto the subspace
of those which are X t measurable.
set of all Ci for which X1t ∈ Rt,i . By Theorem 372, U is in some component with
probability 1, and, since there are only countably many ergodic components,
with probability 1 X1t will eventually leave all but one of the Rt,i . The remaining
one is the ergodic component.
26.4 Exercises
Exercise 26.1 Prove the converse to Proposition 356: every extremal point of
the convex set of invariant measures is an ergodic measure.
Chapter 27
Mixing
the purely statistical aspects of mixing processes, but the central limit theorem
at the end of this chapter will give some idea of the flavor of results in this area:
much like IID results, only with the true sample size replaced by an effective
sample size, with a smaller discount the faster the rate of decay of correlations.
Proof: Let A be any invariant set. By mixing, lim_t µ(T^{−t}A ∩ A) = µ(T^{−t}A)µ(A). But T^{−t}A = A for every t, so we have lim µ(A) = µ²(A), or µ(A) = µ²(A). This can only be true if µ(A) = 0 or µ(A) = 1, i.e., only if µ is T-ergodic.
Everything we have established about ergodic processes, then, applies to
mixing processes.
for all A, B ∈ X .
Proof: Directly from the fact that in this case m(B) = lim_t µ(T^{−t}B).
for all A, B ∈ G, then it holds for all pairs of sets in σ(G). If σ(G) = X , then
the process is mixing.
Proof (after Durrett, 1991, Lemma 6.4.3): Via the π-λ theorem, of course. Let
ΛA be the class of all B such that the equation holds, for a given A ∈ G. We
need to show that ΛA really is a λ-system.
Ξ ∈ ΛA is obvious. T −t Ξ = Ξ so µ(A ∩ Ξ) = µ(A) = µ(A)µ(Ξ).
Closure under complements. Let B1 and B2 be two sets in ΛA , and assume
B1 ⊂ B2 . Because set-theoretic operations commute with taking inverse images,
T −t (B2 \ B1 ) = T −t B2 \ T −t B1 . Thus
Taking limits of both sides, we get that lim |µ (A ∩ T −t (B2 \ B1 )) − µ(A)µ(T −t (B2 \ B1 ))| =
0, so that B2 \ B1 ∈ ΛA .
Closure under monotone limits: Let Bn be any monotone increasing sequence
in ΛA , with limit B. Thus, µ(Bn ) ↑ µ(B), and at the same time m(Bn ) ↑ m(B),
where m is the stationary limit of µ. Using Lemma 383, it is enough to show
that
lim_t µ(A ∩ T^{−t}B) = µ(A)m(B)   (27.7)
which again can be made less than any positive ε by taking n large. So, for sufficiently large n, lim_t µ(A ∩ T^{−t}B) is always within 2ε of µ(A)m(B). Since ε can be made arbitrarily small, we conclude that lim_t µ(A ∩ T^{−t}B) = µ(A)m(B). Hence, B ∈ Λ_A.
We conclude, from the π − λ theorem, that Eq. 27.4 holds for all A ∈ G and
all B ∈ σ(G). The same argument can be turned around for A, to show that
Eq. 27.4 holds for all pairs A, B ∈ σ(G). If G generates the whole σ-field X ,
then clearly Definition 379 is satisfied and the process is mixing.
Example 387 (Irrational Rotations of the Circle are Not Mixing) Irrational
rotations of the circle, T x = x + φ mod 1, φ irrational, are ergodic (Example
348), and stationary under the Lebesgue measure. They are not, however, mix-
ing. Recall that T t x is dense in the unit interval, for arbitrary initial x. Because
it is dense, there is a sequence tn such that tn φ mod 1 goes to 1/2. Now let
A = [0, 1/4]. Because T maps intervals to intervals (of equal length), it follows
that T −tn A becomes an interval disjoint from A, i.e., µ(A ∩ T −tn A) = 0. But
mixing would imply that µ(A∩T −tn A) → 1/16 > 0, so the process is not mixing.
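The failure of mixing is easy to see numerically; the sketch below (mine, not the text's) estimates µ(A ∩ T^{−t}A) by Monte Carlo and shows it wandering with t instead of approaching µ(A)² = 1/16.

```python
# For the rotation T x = (x + phi) mod 1 and A = [0, 1/4], estimate
# mu(A intersect T^{-t} A) = P(x in A and T^t x in A) with x ~ Uniform[0,1].
import numpy as np

phi = (np.sqrt(5) - 1) / 2
x = np.random.default_rng(7).uniform(size=200_000)

in_A = x < 0.25
for t in [10, 100, 1_000, 10_000]:
    in_A_later = ((x + t * phi) % 1.0) < 0.25
    print(t, np.mean(in_A & in_A_later))    # oscillates; never settles at 1/16
```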
Lemma 389 In any Markov process, U n d converges weakly to 1, for all initial
probability densities d, if and only if U n f converges weakly to Eµ [f ], for all
initial L1 functions f , i.e. Eµ [U n f (X)g(X)] → Eµ [f (X)] Eµ [g(X)] for all
bounded, measurable g.
Proof: Exercise. The way to go is to use the previous lemma, of course. With
that tool, one can prove that the convergence holds for indicator functions, and
then for simple functions, and finally, through the usual arguments, for all L1
densities.
Proof: Exercise, from the fact that convergence in distribution implies con-
vergence of expectations of all bounded measurable functions.
It is natural to ask what happens if U t ν → µ not weakly but strongly. This
is known as asymptotic stability or (especially in the nonlinear dynamics liter-
ature) exactness. Remarkably enough, it is equivalent to the requirement that
µ(T t A) → 1 whenever µ(A) > 0. (Notice that for once the expression involves
images rather than pre-images.) There is a kind of hierarchy here, where differ-
ent levels of convergence of distribution (Cesàro, weak, strong) match different
sorts of ergodicity (metric transitivity, mixing, exactness). For more details, see
Lasota and Mackey (1994).
the system depends on its past. However, there are at least four other mixing
coefficients (β, φ, ψ and ρ) regularly used in the literature. Since any of these
others going to zero implies that α goes to zero, we will stick with α-mixing, as
in Rosenblatt (1956).
Also notice that if X_t is a Markov process (e.g., a dynamical system) then the Markov property tells us that we only need to let the supremum run over measurable sets in σ(X_{t_1}) and σ(X_{t_2}).
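For a finite-state Markov chain, the supremum defining α(t) can be evaluated by brute force over events. Here is a sketch (Python with NumPy; the two-state transition matrix is a made-up example, and I take the chain to be started from its stationary distribution):

import itertools
import numpy as np

# alpha(t) = sup_{A,B} |P(X_0 in A, X_t in B) - P(X_0 in A) P(X_t in B)|,
# with A, B ranging over subsets of the (two-point) state space.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                                       # stationary distribution

def alpha(t):
    joint = pi[:, None] * np.linalg.matrix_power(P, t)   # P(X_0=i, X_t=j)
    best = 0.0
    for r in range(3):
        for A in itertools.combinations([0, 1], r):
            for s in range(3):
                for B in itertools.combinations([0, 1], s):
                    pAB = joint[np.ix_(list(A), list(B))].sum()
                    best = max(best, abs(pAB - pi[list(A)].sum() * pi[list(B)].sum()))
    return best

for t in [1, 2, 5, 10, 20]:
    print(t, alpha(t))     # decays like |second eigenvalue|^t, here 0.7^t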
Lemma 393 If a dynamical system is α-mixing, then it is mixing.
Proof: α is the supremum of the quantity appearing in the definition of mixing.
Notation: For the remainder of this section,

S_n ≡ Σ_{k=1}^n X_k    (27.21)
σ_n² ≡ Var[S_n]    (27.22)
Y_n(t) ≡ S_{⌊nt⌋} / σ_n    (27.23)
2 Doukhan (1995, p. 47) cites Jakubowski and Szewczak (1990) as the source, but I have
Definition 395 X_t obeys the functional central limit theorem or the invariance principle when

Y_n →_d W    (27.25)

where W is a standard Wiener process on [0, 1], and the convergence is in the Skorokhod topology of Sec. 15.1.
Then

lim_{n→∞} σ_n²/n = E[X_1²] + 2 Σ_{k=1}^∞ E[X_1 X_{1+k}] ≡ σ²    (27.28)

If σ² > 0, moreover, X_t obeys both the central limit theorem with variance σ², and the functional central limit theorem.
Proof: Complicated, and based on a rather technical central limit theorem for martingale difference arrays. See Doukhan (1995, sec. 1.5), or, for a simplified presentation, Durrett (1991, sec. 7.7).
For the rate of convergence of L(S_n/√n) to a Gaussian distribution, in the total variation metric, see Doukhan (1995, sec. 1.5.2), summarizing several works. Polynomially-mixing sequences converge polynomially in n, and exponentially-mixing sequences converge exponentially.
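To see the limiting variance formula at work, here is a small simulation sketch (Python with NumPy; the AR(1) process and its parameters are my own example of a geometrically mixing sequence, not anything from the text). For X_t = ρX_{t−1} + Z_t with IID standard Gaussian Z_t, E[X_1²] = 1/(1−ρ²) and E[X_1X_{1+k}] = ρ^k/(1−ρ²), so σ² = (1+ρ)/((1−ρ)(1−ρ²)):

import numpy as np

# Compare Var[S_n]/n for an AR(1) process against the series formula for sigma^2.
rng = np.random.default_rng(1)
rho, n, reps = 0.5, 10_000, 500

z = rng.standard_normal((reps, n))
x = np.empty((reps, n))
x[:, 0] = z[:, 0] / np.sqrt(1 - rho**2)       # start in stationarity
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + z[:, t]

print(np.var(x.sum(axis=1)) / n)               # empirical Var[S_n]/n
print((1 + rho) / ((1 - rho) * (1 - rho**2)))  # sigma^2 = 4 for rho = 0.5

The two numbers agree to within Monte Carlo error.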
There are a number of results on central limit theorems and functional cen-
tral limit theorems for deterministic dynamical systems. A particularly strong
one was recently proved by Tyran-Kamińska (2005), in a friendly paper which
should be accessible to anyone who’s followed along this far, but it’s too long
for us to do more than note its existence.
Part VI

Information Theory

Chapter 28

Entropy and Divergence
Section 28.1 introduces Shannon entropy and its most basic prop-
erties, including the way it measures how close a random variable is
to being uniformly distributed.
Section 28.2 describes relative entropy, or Kullback-Leibler di-
vergence, which measures the discrepancy between two probability
distributions, and from which Shannon entropy can be constructed.
Section 28.2.1 describes some statistical aspects of relative entropy,
especially its relationship to expected log-likelihood and to Fisher
information.
Section 28.3 introduces the idea of the mutual information shared
by two random variables, and shows how to use it as a measure of
serial dependence, like a nonlinear version of autocovariance (Section
28.3.1).
(1990).1 The entropy of a discrete random variable X is

H[X] ≡ −Σ_x P(X = x) log P(X = x)    (28.1)

when the sum exists. Entropy has units of bits when the logarithm has base 2, and nats when it has base e.
The joint entropy of two random variables, H[X, Y ], is the entropy of their
joint distribution.
The conditional entropy of X given Y, H[X|Y], is

H[X|Y] ≡ −Σ_y P(Y = y) Σ_x P(X = x|Y = y) log P(X = x|Y = y)    (28.2)
= −E[log P(X|Y)]    (28.3)
= H[X, Y] − H[Y]    (28.4)
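These identities are easy to check numerically. A sketch (Python with NumPy; the joint distribution is a made-up two-by-two example):

import numpy as np

# Verify H[X|Y] = H[X,Y] - H[Y] (Eq. 28.4) against the definition (Eq. 28.2).
pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])                  # hypothetical P(X=x, Y=y); rows are x

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))            # base-2 logs: entropy in bits

via_chain = H(pxy.ravel()) - H(pxy.sum(axis=0))        # H[X,Y] - H[Y]
direct = -sum(pxy[x, y] * np.log2(pxy[x, y] / pxy[:, y].sum())
              for x in range(2) for y in range(2))     # Eq. 28.2
print(via_chain, direct)                      # identical up to rounding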
Here are some important properties of the Shannon entropy, presented with-
out proofs (which are not hard).
1. H[X] ≥ 0
2. H[X] = 0 iff ∃x0 : X = x0 a.s.
3. If X can take on n < ∞ different values (with positive probability), then
H[X] ≤ log n. H[X] = log n iff X is uniformly distributed.
4. H[X]+H[Y ] ≥ H[X, Y ], with equality iff X and Y are independent. (This
comes from the logarithm in the definition.)
1 Remarkably, almost all of the post-1948 development has been either amplifying or refining
themes first sounded by Shannon. For example, one of the fundamental results, which we
will see in the next chapter, is the “Shannon-McMillan-Breiman theorem”, or “asymptotic
equipartition property”, which says roughly that the log-likelihood per unit time of a random
sequence converges to a constant, characteristic of the data-generating process. Shannon’s
original version was convergence in probability for ergodic Markov chains; the modern form
is almost sure convergence for any stationary and ergodic process. Pessimistically, this says
something about the decadence of modern mathematical science; optimistically, something
about the value of getting it right the first time.
5. H[X, Y ] ≥ H[X].
6. H[X|Y ] ≥ 0, with equality iff X is a.s. constant given Y , for almost all
Y.
7. H[X|Y ] ≤ H[X], with equality iff X is independent of Y . (“Conditioning
reduces entropy”.)
8. H[f (X)] ≤ H[X], for any measurable function f , with equality iff f is
invertible.
The first three properties can be summarized by saying that H[X] is max-
imized by a uniform distribution, and minimized, to zero, by a degenerate one
which is a.s. constant. We can then think of H[X] as the variability of X,
something like the log of the effective number of values it can take on. We can
also think of it as how uncertain we are about X’s value.2 H[X, Y ] is then how
much variability or uncertainty is associated with the pair variable X, Y , and
H[Y|X] is how much uncertainty remains about Y once X is known, averaging over X. Similar interpretations follow for the other properties. The fact that H[f(X)] = H[X] if f is invertible is nice, because then f just relabels the possible values, meshing nicely with this interpretation.
A simple consequence of the above results is particularly important for later use: the chain rule for entropy, H[X_1^n] = Σ_{i=1}^n H[X_i | X_1^{i−1}].
Proof: From the definitions, it is easily seen that H[X2 |X1 ] = H[X2 , X1 ] −
H[X1 ]. This establishes the chain rule for n = 2. A simple argument by
induction does the rest.
For non-discrete random variables, it is necessary to introduce a reference
measure, and many of the nice properties go away.
2 One can motivate the definition by saying that we are more surprised to find that X = x the less probable that event is, supposing that surprise should go as the log of one over that probability, and defining entropy as expected surprise. The choice of the logarithm, rather than any other increasing function, is of course retroactive, though one might cobble together some kind of psychophysical justification, since the perceived intensity of a sensation often grows logarithmically with the physical magnitude of the stimulus. More dubious, to my mind, is the idea that there is any surprise at all when a fair coin comes up heads.
when µ << ρ. Joint and conditional entropies are defined similarly. We will
also write Hρ [µ], with the same meaning. This is sometimes called differential
entropy when ρ is Lebesgue measure on Euclidean space, especially R, and then
is written h(X) or h[X].
Lemma 402 (Divergence and Total Variation) For any two distributions,
D(µ‖ν) ≥ (1/(2 ln 2)) ‖µ − ν‖_1².
Proof: Algebra. See, e.g., Cover and Thomas (1991, Lemma 12.6.1, pp. 300–301).
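The inequality is easy to probe numerically; a sketch (Python with NumPy; the random five-point distributions are my own choice of test case):

import numpy as np

# Check D(mu||nu) >= ||mu - nu||_1^2 / (2 ln 2), with D in bits (Lemma 402).
rng = np.random.default_rng(0)
for _ in range(5):
    mu = rng.dirichlet(np.ones(5))
    nu = rng.dirichlet(np.ones(5))
    D = np.sum(mu * np.log2(mu / nu))         # relative entropy, in bits
    tv = np.sum(np.abs(mu - nu))              # L1 (total variation) norm
    print(D >= tv**2 / (2 * np.log(2)))       # prints True every time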
Proof: Algebra.
Shannon entropy can be constructed from the relative entropy.
Proof: Algebra.
A similar result holds for the entropy of a variable which takes values in a finite subset, of volume V, of a Euclidean space, i.e., H_λ[X] = log V − D(µ‖υ), where λ is Lebesgue measure and υ is the uniform probability measure on the range of X.
Lemma 407 Suppose ν and µ are the distributions of two probability models, and ν ≪ µ. Then the cross-entropy is the expected negative log-likelihood of the model corresponding to ν, when the actual distribution is µ. The actual or empirical negative log-likelihood of the model corresponding to ν is Q_ρ(ν‖η), where η is the empirical distribution.
3 See Kass and Vos (1997) or Amari and Nagaoka (1993/2000). For applications to sta-
tistical inference for stochastic processes, see Taniguchi and Kakizawa (2000). For an easier
general introduction, Kulhavý (1996) is hard to beat.
Corollary 411 The Fisher information matrix is equal to the Hessian (second partial derivative) matrix of the relative entropy:

I_{ij}(θ_0) = ∂²/∂θ_i ∂θ_j D(ν_{θ_0}‖ν_θ)    (28.15)
4 If we did have a triangle inequality, then we could say D(µ‖ν) ≤ D(µ‖η) + D(η‖ν), and it would be enough to make sure that both the terms on the RHS went to zero, say by some combination of maximizing the likelihood in-sample, so that D(η‖ν) is small, and ergodicity, so that D(µ‖η) is small. While, as noted, there is no triangle inequality, under some conditions this idea is roughly right; there are nice diagrams in Kulhavý (1996).
Proof: It is a classical result (see, e.g., Lehmann and Casella (1998, sec. 2.6.1)) that I_{ij}(θ) = −E_{ν_θ}[∂²/∂θ_i ∂θ_j log p_θ]. The present result follows from this, Lemma 407, Lemma 408, and the fact that H_ρ[ν_{θ_0}] is independent of θ.
Proof: Because then the total variation distance between the joint distribution, L(X_{t_1}, X_{t_2}), and the product of the marginal distributions, L(X_{t_1}) L(X_{t_2}), is being forced down towards zero, which implies mixing (Definition 379).
Chapter 29

Rates and Equipartition
Proof: For every ε > 0, there is an N(ε) such that |a_n − a| < ε whenever n > N(ε). Now take b_n and break it up into two parts, one summing the terms below N(ε), and the other the terms above.

lim_n |b_n − a| = lim_n |n^{−1} Σ_{i=1}^n a_i − a|    (29.3)
≤ lim_n n^{−1} Σ_{i=1}^n |a_i − a|    (29.4)
≤ lim_n n^{−1} ( Σ_{i=1}^{N(ε)} |a_i − a| + (n − N(ε))ε )    (29.5)
≤ lim_n n^{−1} ( Σ_{i=1}^{N(ε)} |a_i − a| + nε )    (29.6)
= ε + lim_n n^{−1} Σ_{i=1}^{N(ε)} |a_i − a|    (29.7)
= ε    (29.8)

Since ε was arbitrary, it follows that lim_n b_n = a.
Theorem 421 (Entropy Rate) For a stationary sequence, if the limiting con-
ditional entropy exists, then it is equal to the entropy rate, h(X) = h0 (X).
Proof: Start with the chain rule to break the joint entropy into a sum of conditional entropies, use Lemma 419 to identify their limit as h′(X), and then use the result on Cesàro means above to conclude that n^{−1}H[X_1^n] converges to the same limit, as required.
Because h(X) = h0 (X) for stationary processes (when both limits exist), it is
not uncommon to find what I’ve called the limiting conditional entropy referred
to as the entropy rate.
Lemma 422 For a stationary sequence h(X) ≤ H[X1 ], with equality iff the
sequence is IID.
Proof: By the chain rule, n^{−1}H[X_1^n] is the average of the conditional entropies H[X_t|X_1^{t−1}], each of which is at most H[X_1], since conditioning reduces entropy; hence h(X) ≤ H[X_1]. For equality, which is what's needed for h(X) = H[X_1], all the values of the sequence must be independent of each other; since the sequence is stationary, this would imply that it's IID.
Lemma 427 For each k, the entropy rate of the order-k Markov approximation is equal to H[X_{k+1}|X_1^k].
Proof: Under the approximation (but not under the original distribution of X), H[X_1^n] = H[X_1^k] + (n − k)H[X_{k+1}|X_1^k], by the Markov property and stationarity (as in Examples 423 and 424). Dividing by n and taking the limit as n → ∞ gives the result.
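For a chain that really is first-order Markov, the lemma reduces to the familiar formula for the entropy rate. A sketch (Python with NumPy; the transition matrix is a made-up example):

import numpy as np

# Entropy rate of the order-1 Markov approximation: H[X_2 | X_1], computed as
# the stationary average of the row entropies of the transition matrix.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()                                # stationary distribution

h = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
         for i in range(2) for j in range(2))
print(h)                                      # bits per symbol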
Lemma 428 If X is a stationary two-sided sequence, then Y_t = f(X_{−∞}^t) defines a stationary sequence, for any measurable f. If X is also ergodic, then Y is ergodic too.
Proof: Because X is stationary, it can be represented as a measure-preserving shift on sequence space. Because the shift is measure-preserving, θX_{−∞}^t =_d X_{−∞}^t, so Y(t) =_d Y(t+1), and similarly for all finite-length blocks of Y. Thus, all of the finite-dimensional distributions of Y are shift-invariant, and these determine the infinite-dimensional distribution, so Y itself must be stationary.
To see that Y must be ergodic if X is ergodic, recall that a random sequence
is ergodic iff its corresponding shift dynamical system is ergodic. A dynamical
system is ergodic iff all invariant functions are a.e. constant (Theorem 350).
Because the Y sequence is obtained by applying a measurable function to the
X sequence, a shift-invariant function of the Y sequence is a shift-invariant
function of the X sequence. Since the latter are all constant a.e., the former are
too, and Y is ergodic.
Lemma 429 If X is stationary and ergodic, then, for every k,

P( lim_n −(1/n) log µ_k(X_1^n(ω)) = h_k ) = 1    (29.15)

i.e., −(1/n) log µ_k(X_1^n(ω)) converges a.s. to h_k.
Proof: Start by factoring the approximating Markov measure in the way suggested by its definition:

−(1/n) log µ_k(X_1^n) = −(1/n) log P(X_1^k) − (1/n) Σ_{t=k+1}^n log P(X_t|X_{t−k}^{t−1})    (29.16)

As n grows, (1/n) log P(X_1^k) → 0, for every fixed k. On the other hand, −log P(X_t|X_{t−k}^{t−1})

h_∞(X) ≤ lim inf_n −(1/n) log P(X_1^n)    (29.34)

a.s.
As hk ↓ h∞ , it follows that the liminf and the limsup of the normalized log
likelihood must be equal almost surely, and so equal to h∞ , which is to say to
h(X).
Why is this called the AEP? Because, to within an o(n) term, all sequences of length n have the same log-likelihood, if they have positive probability at all. In this sense, the likelihood is “equally partitioned” over those sequences.
Proof: See Algoet and Cover (1988, theorem 4), Gray (1990, corollary 8.4.1).
Remark. The usual AEP is in fact a consequence of this result, with the
appropriate reference measure. (Which?)
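The almost-sure convergence is easy to watch in the simplest case. A sketch (Python with NumPy; IID coin flips, so µ is a product measure and the entropy rate is just H[X_1]):

import numpy as np

# AEP for IID Bernoulli(0.3): -(1/n) log2 P(X_1^n) -> H[X_1] ~ 0.8813 bits.
rng = np.random.default_rng(42)
p = 0.3
x = rng.random(100_000) < p
for n in [100, 1_000, 10_000, 100_000]:
    k = x[:n].sum()                           # number of ones among the first n
    loglik = k * np.log2(p) + (n - k) * np.log2(1 - p)
    print(n, -loglik / n)                     # settles near 0.8813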
29.4 Exercises
Exercise 29.1 Markov approximations are maximum-entropy approximations.
(You may assume that the process X takes values in a finite set.)
a Prove that µ_k, as defined in Definition 426, gets the distribution of sequences of length k+1 correct, i.e., for any set A ∈ X^{k+1}, µ_k(A) = P(X_1^{k+1} ∈ A).

b Prove that µ_{k′}, for any k′ > k, also gets the distribution of length k+1 sequences right.
c In a slight abuse of notation, let H[ν(X_1^n)] stand for the entropy of a sequence of length n when distributed according to ν. Show that H[µ_k(X_1^n)] ≥ H[µ_{k′}(X_1^n)] if k′ > k. (Note that the n ≤ k case is easy!)

d Is it true that if ν is any other measure which gets the distribution of sequences of length k+1 right, then H[µ_k(X_1^n)] ≥ H[ν(X_1^n)]? If yes, prove it; if not, find a counter-example.
Part VII

Large Deviations

Chapter 30

Large Deviations: Basics
Lemma 443 (“Fastest rate wins”) For any two sequences of positive numbers, (a_n + b_n) ≃ a_n ∨ b_n.
−inf_{x∈int B} J(x) ≤ lim inf_{ε→0} ε log P(X_ε ∈ B) ≤ lim sup_{ε→0} ε log P(X_ε ∈ B) ≤ −inf_{x∈cl B} J(x)    (30.4)

for some non-negative function J : Ξ → [0, ∞], its raw rate function. If J is lower semi-continuous, it is just a rate function. If J is lower semi-continuous and has compact level sets, it is a good rate function.1 By a slight abuse of notation, we will write J(B) = inf_{x∈B} J(x).
1 Sometimes what Kallenberg and I are calling a “good rate function” is just “a rate func-
tion”, and our “rate function” gets demoted to “weak rate function”.
Lemma 448 If X_ε satisfies the LDP with rate function J, then for every J-continuous set B,

lim_{ε→0} ε log P(X_ε ∈ B) = −J(B)    (30.9)

Proof: By J-continuity, the right and left hand extremes of Eq. 30.4 are equal, so the limsup and the liminf sandwiched between them are equal; consequently the limit exists.
Remark: The obvious implication is that, for small ε, P(X_ε ∈ B) ≈ c_ε e^{−J(B)/ε}, which explains why we say that the LDP has rate 1/ε. (Actually, c_ε need not be constant, but it must be sub-exponential, i.e., ε log c_ε → 0 as ε → 0.)
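To make the rate concrete, here is a sketch (Python, assuming NumPy and SciPy are available; the Gaussian family is my own example). Take X_ε ∼ N(0, ε), which obeys the LDP with rate 1/ε and rate function J(x) = x²/2, and B = [a, ∞):

import numpy as np
from scipy.stats import norm

# eps * log P(X_eps >= a) should approach -J(B) = -inf_{x>=a} x^2/2 = -a^2/2.
a = 1.0
for eps in [1.0, 0.1, 0.01, 0.001]:
    log_p = norm.logsf(a, scale=np.sqrt(eps))   # log P(X_eps >= a)
    print(eps, eps * log_p)                     # tends to -0.5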
There are several equivalent ways of defining the large deviation principle.
The following is especially important, because it often simplifies proofs.
Lemma 449 X_ε obeys the LDP with rate 1/ε and rate function J(x) if and only if

lim sup_{ε→0} ε log P(X_ε ∈ C) ≤ −J(C)    (30.10)
lim inf_{ε→0} ε log P(X_ε ∈ O) ≥ −J(O)    (30.11)

for every closed Borel set C ⊆ Ξ and every open Borel set O ⊆ Ξ.
Proof: “If”: The closure of any set is closed, and the interior of any set is open, so Eqs. 30.10 and 30.11 imply

lim sup_{ε→0} ε log P(X_ε ∈ cl B) ≤ −J(cl B),  lim inf_{ε→0} ε log P(X_ε ∈ int B) ≥ −J(int B)

but P(X_ε ∈ B) ≤ P(X_ε ∈ cl B) and P(X_ε ∈ B) ≥ P(X_ε ∈ int B), so the LDP holds. “Only if”: every closed set is equal to its own closure, and every open set is equal to its own interior, so the upper bound in Eq. 30.4 implies Eq. 30.10, and the lower bound Eq. 30.11.
A deeply important consequence of the LDP is the following, which can be
thought of as a version of Laplace’s method for infinite-dimensional spaces.
Proof: We'll find the limsup and the liminf, and show that they are both equal to sup_x f(x) − J(x).
First the limsup. Pick an arbitrary positive integer n. Because f is continuous and bounded above, there exist finitely many closed sets, call them B_1, ..., B_m, such that f ≤ −n on the complement of ∪_i B_i, and within each B_i, f varies by at most 1/n. Now

lim sup_ε ε log E[e^{f(X_ε)/ε}] ≤ (−n) ∨ max_{i≤m} lim sup_ε ε log E[e^{f(X_ε)/ε} 1_{B_i}(X_ε)]    (30.15)
≤ (−n) ∨ max_{i≤m} ( sup_{x∈B_i} f(x) − inf_{x∈B_i} J(x) )    (30.16)
Since δ was arbitrary, we can let it go to zero, so (by continuity of f) inf_{y∈B_{δ,x}} f(y) → f(x), or

lim inf_ε ε log E[e^{f(X_ε)/ε}] ≥ f(x) − J(x)    (30.22)

Since this holds for arbitrary x, we can replace the right-hand side by a supremum over all x. Hence sup_x f(x) − J(x) is both the liminf and the limsup.
Remark: The implication of Varadhan's lemma is that, for small ε, E[e^{f(X_ε)/ε}] ≈ c(ε) e^{(sup_{x∈Ξ} f(x) − J(x))/ε}, where ε log c(ε) → 0. So, we can replace the exponential integral with its value at the extremal points, at least to within a multiplicative factor and to first order in the exponent.
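A Monte Carlo sketch of this (Python with NumPy; the Gaussian example and the choice f(x) = −x² are mine): with X_ε ∼ N(0, ε), so J(x) = x²/2, the exact value of ε log E[e^{f(X_ε)/ε}] is −(ε/2) log 3, which vanishes as ε → 0, matching sup_x f(x) − J(x) = 0:

import numpy as np

rng = np.random.default_rng(7)
for eps in [1.0, 0.1, 0.01]:
    x = rng.normal(0.0, np.sqrt(eps), size=1_000_000)
    print(eps, eps * np.log(np.mean(np.exp(-x**2 / eps))))   # tends to 0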
An important, if heuristic, consequence of the LDP is that “Highly im-
probable events tend to happen in the least improbable way”. Let us con-
sider two events B ⊂ A, and suppose that P(X_ε ∈ A) > 0 for all ε. Then P(X_ε ∈ B|X_ε ∈ A) = P(X_ε ∈ B)/P(X_ε ∈ A). Roughly speaking, then, this conditional probability will vanish exponentially, with rate J(B) − J(A). That
is, even if we are looking at an exponentially-unlikely large deviation, the vast
majority of the probability is concentrated around the least unlikely part of the
event. More formal statements of this idea are sometimes known as “conditional
limit theorems” or “the Gibbs conditioning principle”.
Proof: Since f is continuous, f^{−1} takes open sets to open sets, and closed sets to closed sets. Pick any closed C ⊂ Υ. Then

lim sup_{ε→0} ε log P(f(X_ε) ∈ C) = lim sup_{ε→0} ε log P(X_ε ∈ f^{−1}(C))    (30.23)
≤ −J(f^{−1}(C))    (30.24)
= −inf_{x∈f^{−1}(C)} J(x)    (30.25)
where the addition of indices for the delta functions is to be done modulo n, i.e., P̂_3² = (1/3)(δ_{(X_1,X_2)} + δ_{(X_2,X_3)} + δ_{(X_3,X_1)}). P̂_n^k takes values in P(Ξ^k).
Be careful not to confuse this “empirical process” with the quite distinct
“empirical process” of Example 43.
Proof: In each case, we obtain the lower-level statistic from the higher-level
one by applying a continuous function, hence the contraction principle applies.
For the distributions, the continuous functions are the projection operators of
Chapter 2.
Then Y_ε ∼ µ_ε f^{−1} obeys an LDP with rate 1/ε and rate function
Proof: See Kallenberg, Theorem 27.11 (ii). Notice, by the way, that the proof of the upper bound on probabilities (i.e., that lim sup ε log P(X_ε ∈ B) ≤ −J(B) for closed B ⊆ Ξ) does not depend on exponential tightness, just the continuity of f. Exponential tightness is only needed for the lower bound.
Theorem 460 (Bryc’s Theorem) If X_ε are exponentially tight, and, for all bounded continuous f, the limit

Λ_f ≡ lim_{ε→0} ε log E[e^{f(X_ε)/ε}]    (30.35)
Lemma 463 If X and Y are exponentially equivalent, one of them obeys the
LDP with a good rate function J iff the other does as well.
Proof: It is enough to prove that the LDP for X implies the LDP for Y , with
the same rate function. (Draw a truth-table if you don’t believe me!) As usual,
first we’ll get the upper bound, and then the lower.
Pick any closed set C, and let Cδ be its closed δ neighborhood, i.e., Cδ =
{x : ∃y ∈ C, d(x, y) ≤ δ}. Now
Using Eq. 30.38 from Definition 462, the LDP for X_ε, and Lemma 443
As usual, to obtain the lower bound on open sets, pick any open set O and any
point x ∈ O. Because O is open, there is a δ > 0 such that, for some open
(Notice that the initial arbitrary choice of δ has dropped out.) Taking the supremum over all x gives −J(O) ≤ lim inf ε log P(Y_ε ∈ O), as required.
Chapter 31

IID Large Deviations
ing functions, as E[e^{t(X_1+X_2)}] = E[e^{tX_1} e^{tX_2}] = E[e^{tX_1}] E[e^{tX_2}] if X_1 is independent of X_2. If X_1, X_2, ... are IID, then we can get a deviation bound for their sample mean X̄_n through the moment generating function:

P(X̄_n ≥ a) = P( Σ_{i=1}^n X_i ≥ na ) ≤ e^{−nta} (E[e^{tX_1}])^n

for any t > 0, by the exponential Markov inequality, so

(1/n) log P(X̄_n ≥ a) ≤ −ta + log E[e^{tX_1}]
(1/n) log P(X̄_n ≥ a) ≤ inf_t ( −ta + log E[e^{tX_1}] ) = −sup_t ( ta − log E[e^{tX_1}] )

This suggests that the functions log E[e^{tX}] and sup_t (ta − log E[e^{tX}]) will be useful to us. Accordingly, we encapsulate them in a pair of definitions.
then f ∗∗ is (in one dimension) something like the greatest convex lower bound
on f ; made precise, this statement even remains true in higher dimensions. I
make these remarks because of the following fact:
(1/n) log P(X̄_n ≥ a) ≤ −Λ*(a)    (31.8)
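For IID Bernoulli(p) variables the Legendre transform can be computed in closed form, Λ*(a) = a log(a/p) + (1−a) log((1−a)/(1−p)), which is exactly the relative entropy D(Bernoulli(a)‖Bernoulli(p)) in nats. A numeric sketch (Python, assuming NumPy and SciPy; the parameter values are arbitrary):

import numpy as np
from scipy.optimize import minimize_scalar

p = 0.3

def Lambda(t):                                 # cumulant generating function
    return np.log(1 - p + p * np.exp(t))

def Lambda_star(a):                            # numeric Legendre transform
    res = minimize_scalar(lambda t: -(t * a - Lambda(t)))
    return -res.fun

for a in [0.4, 0.5, 0.7]:
    exact = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))
    print(a, Lambda_star(a), exact)            # the two columns agree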
Lemma 469 (Donsker and Varadhan) The Legendre transform of the cumulant generating functional is the relative entropy:

Λ*(ν) = sup_{f∈C_b(Ξ)} ( E_ν[f] − Λ(f) ) = D(ν‖µ)    (31.10)
Proof: First of all, notice that the supremum in Eq. 31.10 can be taken over
all bounded measurable functions, not just functions in Cb , since Cb is dense.
This will let us use indicator functions and simple functions in the subsequent
argument.
If ν is not absolutely continuous with respect to µ, then D(ν‖µ) = ∞. But then there is also a set, call it B, with µ(B) = 0, ν(B) > 0. Take f_n = n1_B. Then E_ν[f_n] − Λ(f_n) = nν(B) − 0, which can be made arbitrarily large by taking n arbitrarily large, hence the supremum in Eq. 31.10 is ∞.
If ν ≪ µ, then show that D(ν‖µ) ≤ Λ*(ν) and D(ν‖µ) ≥ Λ*(ν), so they must be equal. To get the first inequality, start with the observation that dν/dµ exists, so set f = log dν/dµ, which is measurable. Then D(ν‖µ) is E_ν[f] − log E_µ[e^f]. If f is bounded, this shows that D(ν‖µ) ≤ Λ*(ν). If f is not bounded, approximate it by a sequence of bounded, measurable functions f_n with E_µ[e^{f_n}] → 1 and E_ν[f_n] → E_ν[f], again concluding that D(ν‖µ) ≤ Λ*(ν).
To go the other way, first consider the special case where X is finite, and so
generated by a partition, with cells B1 , . . . Bn . Then all measurable functions
are simple functions, and E_ν[f] − Λ(f) is

g(f) = Σ_{i=1}^n f_i ν(B_i) − log Σ_{i=1}^n e^{f_i} µ(B_i)    (31.12)

Taking the derivative,

∂g(f)/∂f_i = ν(B_i) − µ(B_i)e^{f_i} / Σ_{j=1}^n µ(B_j)e^{f_j}    (31.13)

so setting it to zero,

ν(B_i)/µ(B_i) = e^{f_i} / Σ_{j=1}^n µ(B_j)e^{f_j}    (31.14)

and

f_i = log ( ν(B_i)/µ(B_i) )    (31.15)
gives the maximum value of g(f). (Remember that 0 log 0 = 0.) But then g(f) = D(ν‖µ). So Λ*(ν) ≤ D(ν‖µ) when the σ-algebra is finite. In the general case, consider the case where f is a simple function. Then σ(f) is finite, and E_ν[f] − log E_µ[e^f] ≤ D(ν‖µ) follows by the finite case and smoothing. Finally, if f is not simple, but is bounded and measurable, there is a simple h such that E_ν[f] − log E_µ[e^f] ≤ E_ν[h] − log E_µ[e^h], so
Proof: The proof has three parts. First, the upper bound for closed sets;
second, the lower bound for open sets, under an additional assumption on Λ(t);
third and finally, lifting of the assumption on Λ by means of a perturbation
argument (related to Lemma 463).
To prove the upper bound for closed sets, we first prove the upper bound
for sufficiently small balls around arbitrary points. Then, we take our favorite
closed set, and divide it into a compact part close to the origin, which we can
cover by a finite number of closed balls, and a remainder which is far from the
origin and of low probability.
First the small balls of low probability. Because Λ*(x) = sup_u u·x − Λ(u), for any ε > 0, we can find some u such that u·x − Λ(u) > min(1/ε, Λ*(x) − ε). (Otherwise, Λ*(x) would not be the least upper bound.) Since u·y is continuous in y, it follows that there exists some open ball B of positive radius, centered on x, within which u·y − Λ(u) > min(1/ε, Λ*(x) − ε), or u·y > Λ(u) + min(1/ε, Λ*(x) − ε). Now use the exponential Markov inequality to get

P(X̄_n ∈ B) ≤ E[ e^{u·nX̄_n − n inf_{y∈B} u·y} ]    (31.17)
≤ e^{−n min(1/ε, Λ*(x) − ε)}    (31.18)
which is small. To get the compact set near the origin of high probability, use the exponential decay of the probability at large ‖x‖. Since Λ(t) < ∞ for all t, Λ*(x) → ∞ as ‖x‖ → ∞. So, using (once again) the exponential Markov inequality, for every ε > 0, there must exist an r > 0 such that

(1/n) log P( ‖X̄_n‖ > r ) ≤ −1/ε    (31.19)

for all n.
Now pick your favorite closed measurable set C ∈ B^d. Then C ∩ {x : ‖x‖ ≤ r} is compact, and I can cover it by m balls B_1, ..., B_m, with centers x_1, ..., x_m, of the sort built in the previous paragraph. So I can apply a union bound to P(X̄_n ∈ C), as follows.

P(X̄_n ∈ C) = P(X̄_n ∈ C ∩ {x : ‖x‖ ≤ r}) + P(X̄_n ∈ C ∩ {x : ‖x‖ > r})    (31.20)
≤ P( X̄_n ∈ ∪_{i=1}^m B_i ) + P(‖X̄_n‖ > r)    (31.21)
≤ Σ_{i=1}^m P(X̄_n ∈ B_i) + P(‖X̄_n‖ > r)    (31.22)
≤ Σ_{i=1}^m e^{−n min(1/ε, Λ*(x_i) − ε)} + e^{−n/ε}    (31.23)
≤ (m+1) e^{−n min(1/ε, Λ*(C) − ε)}    (31.24)

with Λ*(C) = inf_{x∈C} Λ*(x), as usual. So if I take the log, normalize, and go to the limit, I have

lim sup_n (1/n) log P(X̄_n ∈ C) ≤ −min(1/ε, Λ*(C) − ε)    (31.25)

Since ε was arbitrary to start with, this gives

lim sup_n (1/n) log P(X̄_n ∈ C) ≤ −Λ*(C)    (31.26)

and I've got the upper bound for closed sets.
To get the lower bound for open sets, pick your favorite open set O ∈ B^d, and your favorite x ∈ O. Suppose, for the moment, that Λ(t)/‖t‖ → ∞ as ‖t‖ → ∞. (This is the growth condition mentioned earlier, which we will lift at the end of the proof.) Then, because Λ(t) is smooth, there is some u such that ∇Λ(u) = x. (You will find it instructive to draw the geometry here.) Now let Y_i be a sequence of IID random variables, whose probability law is given by
P(Y_i ∈ B) = E[ e^{uX} 1_B(X) ] / E[ e^{uX} ] = e^{−Λ(u)} E[ e^{uX} 1_B(X) ]    (31.27)
lim inf_n (1/n) log P(X̄_n ∈ O) ≥ Λ(u) − u·x − δ‖u‖    (31.31)
≥ −Λ*(x) − δ‖u‖    (31.32)

Letting δ → 0,

lim inf_n (1/n) log P(X̄_n ∈ O) ≥ −Λ*(x)    (31.33)

and, since x ∈ O was arbitrary,

lim inf_n (1/n) log P(X̄_n ∈ O) ≥ −inf_{x∈O} Λ*(x) = −Λ*(O)    (31.34)
lim inf_n (1/n) log P( ‖X̄_n + σZ̄_n − x‖ ≤ δ ) ≥ −Λ*_{X+σZ}(x)    (31.35)
≥ −Λ*_X(x)    (31.36)

as required.
Lemma 473 Empirical means are expectations with respect to the empirical measure. That is, let f be a real-valued measurable function and Y_i = f(X_i). Then Ȳ_n = E_{P̂_n}[f(X)].
Proof: For each d, the sequence of vectors (f_1(X_i), ..., f_d(X_i)) are IID, so, by Cramér's Theorem (470), their empirical mean obeys the LDP with rate n and good rate function

J_d(x) = sup_{t∈R^d} t·x − log E[e^{t·f_1^d(X)}]    (31.47)

writing f_1^d for the vector of the first d functions.
But, by Lemma 473, the empirical means are expectations over the empirical
distributions, so the latter must also obey the LDP, with the same rate and rate
function.
Notice, incidentally, that the fact that the fi ∈ F isn’t relevant for the proof
of the lemma; it will however be relevant for the proof of the theorem.
Proof: Combining Lemma 474 and Theorem 461, we see that E_{P̂_n}[f_1^∞(X)] obeys the LDP with rate n and good rate function

= D(ν‖µ)    (31.54)
Notice however that this is an LDP in the space P (M ), not in P (Ξ). However,
the embedding taking P (Ξ) to P (M ) is continuous, and it is easily verified (see
Lemma 27.17 in Kallenberg) that P̂n is exponentially tight in P (Ξ), so another
application of the inverse contraction principle says that P̂n must obey the LDP
in the restricted space P (Ξ), and with the same rate.
Theorem 477 If X_i are IID in a Polish space, with a common measure µ, then the empirical process distribution P̂_n^∞ obeys an LDP with rate n and good rate function J_∞(ν) = d(ν‖µ^∞), the relative entropy rate, if ν is a shift-invariant probability measure, and = ∞ otherwise.
Proof: By Corollary 476 and the projective limit theorem 461, P̂_n^∞ obeys an LDP with rate n and good rate function
But, applying the chain rule for relative entropy (Lemma 404),
But lim n^{−1} D(π_n ν‖µ^n) is the relative entropy rate, d(ν‖µ^∞), and we've already identified the right-hand side as the rate function.
The strength of Theorem 477 lies in the fact that, via the contraction prin-
ciple (Theorem 451), it implies that the LDP holds for any continuous function
of the empirical process distribution. This in particular includes the finite-
dimensional distributions, the empirical mean, functions of finite-length trajec-
tories, etc. Moreover, Theorem 451 also provides a means to calculate the rate
function for all these quantities.
Chapter 32

Large Deviations for Markov Sequences
discrete. Then

P(X_1^t = x_1^t) = ν(x_1) Π_{i=1}^{t−1} µ(x_i, x_{i+1})    (32.1)
= ν(x_1) e^{Σ_{i=1}^{t−1} log µ(x_i, x_{i+1})}    (32.2)
= ν(x_1) e^{t Σ_{x,y∈Ξ²} T_{x,y}(x_1^t) log µ(x,y)}    (32.3)

where T_{x,y}(x_1^t) is the fraction of time-steps at which the state y follows the state x in the sequence x_1^t, i.e., it gives the (normalized) transition counts. What we have just established is that the Markov chains on Ξ with a given initial distribution form an exponential family, whose natural sufficient statistics are the transition counts, and whose natural parameters are the logarithms of the transition probabilities.
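The linearity of the log-likelihood in the transition counts is worth seeing directly. A sketch (Python with NumPy; the two-state transition matrix is made up):

import numpy as np

# The log-likelihood of a Markov sample path equals sum_{x,y} T_{x,y} log mu(x,y),
# where T_{x,y} are the raw transition counts.
rng = np.random.default_rng(3)
mu = np.array([[0.8, 0.2],
               [0.4, 0.6]])
t = 1000
path = [0]
for _ in range(t - 1):
    path.append(rng.choice(2, p=mu[path[-1]]))

T = np.zeros((2, 2))
for a, b in zip(path[:-1], path[1:]):
    T[a, b] += 1                               # transition counts

direct = sum(np.log(mu[a, b]) for a, b in zip(path[:-1], path[1:]))
print(direct, np.sum(T * np.log(mu)))          # identical up to rounding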
(If Ξ is not discrete, but we make the density assumptions mentioned at the beginning of this chapter, we can write

p_{X_1^t}(x_1^t) = n(x_1) Π_{i=1}^{t−1} m(x_i, x_{i+1})    (32.4)
= n(x_1) e^{(t−1) ∫_{Ξ²} dT(x_1^t) log m(x,y)}    (32.5)

where now T(x_1^t) puts probability mass (t−1)^{−1} at (x, y) for every i such that x_i = x, x_{i+1} = y.)
We can use this exponential family representation to establish the following
basic theorem.
Theorem 478 Let X_i be a Markov sequence obeying the assumptions set out at the beginning of this chapter, and furthermore that µ(x,y)/ρ(y) is bounded above (in the discrete-state case) or that m(x,y)/r(y) is bounded above (in the continuous-state case). Then the two-dimensional empirical distribution (“pair measure”) P̂_t² obeys an LDP with rate t and with rate function J_2(ψ) = D(ψ‖π_1ψ ⊗ µ) if ψ is shift-invariant, J_2(ψ) = ∞ otherwise.
Proof: I will just give the proof for the discrete case, since the modifications
for the continuous case are straightforward (given the assumptions made about
densities), largely a matter of substituting Roman letters for Greek ones.
First, modify the representation of the probabilities in Eq. 32.3 slightly, so that it refers directly to P̂_t² (as laid down in Definition 454), rather than to the transition counts.

P(X_1^t = x_1^t) = (ν(x_1)/µ(x_t, x_1)) e^{t Σ_{x,y∈Ξ} P̂_t²(x,y) log µ(x,y)}    (32.6)
= (ν(x_1)/µ(x_t, x_1)) e^{t E_{P̂_t²}[log µ(X,Y)]}    (32.7)

Now construct a sequence of IID variables Y_i, all distributed according to ρ, the invariant measure of the Markov chain:

P(Y_1^t = y_1^t) = e^{t E_{P̂_t²}[log ρ(Y)]}    (32.8)
Introduce a (final) proxy random sequence, also taking values in P(Ξ²), call it Z_t, with P(Z_t ∈ B) = ∫_B e^{tF(ψ)} dQ_{t,Y}(ψ). We know (Corollary 476) that, under Q_{t,Y}, the empirical pair measure satisfies an LDP with rate t and good rate function J_Y = D(ψ‖π_1ψ ⊗ ρ), so by Corollary 457, Z_t satisfies an LDP with rate t and good rate function
and the infimum is clearly zero. Since this is the rate function for Z_t, in view of Eqs. 32.13 and 32.14 it is also the rate function for P̂_t², which we have agreed to call J_2.
Remark 1: The key to making this work is the assumption that F is bounded
from above. This can fail if, for instance, the process is not ergodic, although
usually in that case one can rescue the general idea by some kind of ergodic
decomposition.
Remark 2: The LDP for the pair measure of an IID sequence can now
be seen to be a special case of the LDP for the pair measure of a Markov
sequence. The same is true, generally speaking, of all the other LDPs for IID
and Markov sequences. Calculations are almost always easier for the IID case,
however, which permits us to give explicit formulae for the rate functions of
empirical means and empirical distributions unavailable (generally speaking) in
the Markovian case.
Corollary 479 The minima of the rate function J2 are the invariant distribu-
tions.
Proof: The rate function is D(ψ‖π_1ψ ⊗ µ). Since relative entropy is ≥ 0, and equal to zero iff the two distributions are equal (Lemma 401), we get a minimum of zero in the rate function iff ψ = π_1ψ ⊗ µ, i.e., iff ψ = ρ ⊗ µ for some ρ ∈ P(Ξ) such that ρµ = ρ. Conversely, if ψ is of this form, then J_2(ψ) = 0.
Corollary 480 The empirical distribution P̂t obeys an LDP with rate t and
good rate function
D(ν‖π_{k−1}ν ⊗ µ)    (32.19)
Proof: An obvious extension of the argument for Theorem 478, using the
appropriate exponential-family representation of the higher-order process.
Whether all exponential-family stochastic processes (Küchler and Sørensen,
1997) obey LDPs is an interesting question; I’m not sure if anyone knows the
answer.
Theorem 484 The empirical process distribution obeys an LDP with rate t and good rate function J_∞(ψ) = d(ψ‖ρ), with ρ here standing for the stationary process distribution of the Markov sequence.
Proof: Entirely parallel to the proof of Theorem 477, with Theorem 483 sub-
stituting for Corollary 476.
Consequently, any continuous function of the empirical process distribution
has an LDP.
Chapter 33

The Gärtner-Ellis Theorem
Basically, the upper large deviation bound holds under substantially weaker
conditions than the lower bound does, and it’s worthwhile having the partial
results available to use in estimates even if the full large deviations principle
does not apply.
and its Legendre transform is written Λ̄*(x).
The point of this is that the limsup always exists, whereas the limit doesn't, necessarily. But we can show that the limsup has some reasonable properties, and in fact it's enough to give us an upper bound.
Lemma 486 Λ̄(t) is convex, and Λ̄*(x) is a convex rate function.
Proof: The proof of the convexity of Λ_ε(t) follows the proof in Lemma 466, and the convexity of Λ̄(t) follows by passing to the limit. To establish Λ̄*(x) as a rate function, we need it to be non-negative and lower semi-continuous. Since Λ_ε(0) = 0 for all ε, Λ̄(0) = 0. This in turn implies that Λ̄*(x) ≥ 0. Since the latter is the supremum of a class of continuous functions, namely t(x) − Λ̄(t), it must be lower semi-continuous. Finally, its convexity is implied by its being a Legendre transform.
Proof: Entirely parallel to the proof of the upper bound in Cramér’s Theorem
(470), up through the point where closed sets are divided into a compact part
and a remainder far from the origin, of exponentially-small probability. Because
K is compact, we can proceed as though the remainder is empty.
when the limit exists. Its domain of finiteness is D ≡ {t ∈ Ξ* : Λ(t) < ∞}. Its limiting Legendre transform is Λ*, with domain of finiteness D*.
Lemma 490 If Λ(t) exists, then it is convex, Λ∗ (x) is a convex rate function,
and Eq. 33.2 applies to the latter. If in addition the process is exponentially
tight, then Eq. 33.3 holds for Λ∗ (x).
Proof: Exercise.
Unfortunately, this is not good enough to get exponential tightness in arbi-
trary vector spaces.
Definition 492 (Exposed Point) A point x ∈ Ξ is exposed for Λ̄*(·) when there is a t ∈ Ξ* such that Λ̄*(y) − Λ̄*(x) > t(y − x) for all y ≠ x. t is the exposing hyper-plane for x.
In R¹, a point x is exposed if the curve Λ̄*(y) lies strictly above the line of slope t through the point (x, Λ̄*(x)). Similarly, in R^d, the Λ̄*(y) surface must lie strictly above the hyper-plane passing through (x, Λ̄*(x)) with surface normal t. Since Λ̄*(y) is convex, we could generally arrange this by making this the tangent hyper-plane, but we do not, yet, have any reason to think that the tangent is well-defined. Obviously, if Λ(t) exists, we can replace Λ̄*(·) by Λ*(·) in the definition and the rest of this paragraph.
Note: Most books on large deviations do not give this property any particular
name.
Proof: If we pick any nice, exposed point x ∈ O, we can repeat the proof of
the lower bound from Cramér’s Theorem (470). In fact, the point of Definition
493 is to ensure this. Taking the supremum over all such x gives the lemma.
Proof: The large deviations upper bound is Lemma 488. The large deviations
lower bound is implied by Lemma 494 and the additional hypothesis of the
theorem.
Matters can be made a bit more concrete in Euclidean space (the original
home of the theorem), using, however, one or two more bits of convex analysis.
Notice that intA ⊆ rintA, since the latter, in some sense, doesn’t care about
points outside of A.
In view of Proposition 498, it’s really just enough to show that Λ∗ (O∩rintD∗ ) ≤
Λ∗ (O). This is trivial when the intersection O ∩ D∗ is empty, so assume it
isn’t, and pick any x in that intersection. Because O is open, and because
of the definition of the relative interior, we can pick any point y ∈ rintD∗ ,
and, for sufficiently small δ, δy + (1 − δ)x ∈ O ∩ rintD∗ . Since Λ∗ is convex,
Λ∗ (δy + (1 − δ)x) ≤ δΛ∗ (y) + (1 − δ)Λ∗ (x). Taking the limit as δ → 0,
33.2 Exercises
Exercise 33.1 Show that Cramér’s Theorem (470), is a special case of the
Euclidean Gärtner-Ellis Theorem (499).
Exercise 33.4 Let Z_i be real-valued mean-zero stationary Gaussian variables, with Σ_{i=−∞}^∞ |cov(Z_0, Z_i)| < ∞. Let X_t = t^{−1} Σ_{i=1}^t Z_i. Show that these time averages obey an LDP with rate t and rate function x²/2Γ, where Γ = Σ_{i=−∞}^∞ cov(Z_0, Z_i). (Cf. the mean-square ergodic theorem 293 of chapter 22.) If Z_i are not Gaussian but are weakly stationary, find an additional hypothesis such that the time averages still obey an LDP.
Chapter 34

Freidlin-Wentzell Theory

In Chapter 21, we looked at how the diffusions X_ε which solve the SDE

dX_ε = a(X_ε) dt + ε dW,  X_ε(0) = x_0    (34.1)

converge on the trajectory x_0(t) solving the ODE

dx/dt = a(x(t)),  x(0) = x_0    (34.2)

in the “small noise” limit, ε → 0. Specifically, Theorem 270 gave a (fairly crude) upper bound on the probability of deviations ∆_ε(t) ≡ X_ε(t) − x_0(t):

lim_{ε→0} ε² log P( sup_{0≤t≤T} ∆_ε(t) > δ ) ≤ −δ² e^{−2K_a T}    (34.3)
where Ka depends on the Lipschitz coefficient of the drift function a. The the-
ory of large deviations for stochastic differential equations, known as Freidlin-
Wentzell theory for its original developers, shows that, using the metric implicit
in the left-hand side of Eq. 34.3, the family of processes X_ε obey a large deviations principle with rate ε^{−2}, and a good rate function.
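Before developing the theory, the predicted scaling can be glimpsed in simulation. A sketch (Python with NumPy; the linear drift a(x) = −x, the threshold, and the Euler-Maruyama discretization are all my own choices): if deviation probabilities behave like e^{−c/ε²}, then ε² log P should be roughly constant in ε:

import numpy as np

rng = np.random.default_rng(5)
delta, T, dt, reps = 0.5, 1.0, 0.01, 20_000
steps = int(T / dt)

for eps in [0.3, 0.25, 0.2]:
    x = np.zeros(reps)
    top = np.zeros(reps)
    for _ in range(steps):                    # Euler-Maruyama for dX = -X dt + eps dW
        x += -x * dt + eps * np.sqrt(dt) * rng.standard_normal(reps)
        np.maximum(top, np.abs(x), out=top)
    p_hit = np.mean(top > delta)              # estimate of P(sup_t |X_eps(t)| > delta)
    print(eps, p_hit, eps**2 * np.log(p_hit if p_hit > 0 else 1 / reps))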
(The full Freidlin-Wentzell theory actually goes somewhat further than just
SDEs, to consider small-noise perturbations of dynamical systems of many sorts,
perturbations by Markov processes (rather than just white noise), etc. Time
does not allow us to consider the full theory (Freidlin and Wentzell, 1998), or
its many applications to nonparametric estimation (Ibragimov and Has’minskii,
1979/1981), systems analysis and signal processing (Kushner, 1984), statistical
mechanics (Olivieri and Vares, 2005), etc.)
As in Chapter 31, the strategy is to first prove a large deviations principle
for a comparatively simple case, and then transfer it to more subtle processes
which can be represented as appropriate functionals of the basic case. Here,
the basic case is the Wiener process W (t), with t restricted to the unit interval
[0, 1].
dν/dµ (w) = exp( −(1/ε) ∫_0^1 ẇ(t)·dW − (1/2ε²) ∫_0^1 |ẇ(t)|² dt )    (34.6)

Proof: This is a special case of Girsanov's Theorem. See Corollary 18.25 on p. 365 of Kallenberg, or, more transparently perhaps, the relevant parts of Liptser and Shiryaev (2001, vol. I).
Lemma 504 (Exponential Bound on the Probability of Tubes Around Given Trajectories) For any δ, γ, K > 0, there exists an ε_0 > 0 such that, if ε < ε_0,

P(‖X_ε − x‖_∞ ≤ δ) ≥ e^{−(J_1(x)+γ)/ε²}    (34.7)

provided x(0) = 0 and J_1(x) < K.
Proof: Using Proposition 503,

P(‖X_ε − x‖_∞ ≤ δ) = P(‖Y − 0‖_∞ ≤ δ)    (34.8)
= ∫_{‖w‖_∞<δ} (dν/dµ)(w) dµ(w)    (34.9)
= e^{−J_1(x)/ε²} ∫_{‖w‖_∞<δ} e^{−(1/ε) ∫_0^1 ẋ·dW} dµ(w)    (34.10)
From Lemma 268 in Chapter 21, we can see that P(‖εW‖_∞ < δ) → 1 as ε → 0. So, if ε is sufficiently small, P(‖εW‖_∞ < δ) ≥ 3/4. Now, applying Chebyshev's inequality to the integrand,

P( −(1/ε) ∫_0^1 ẋ·dW ≤ −(2√2/ε) √(J_1(x)) )    (34.11)
≤ P( |∫_0^1 ẋ·dW| ≥ 2√2 √(J_1(x)) )    (34.12)
≤ E[ (∫_0^1 ẋ·dW)² ] / ( 8 J_1(x) )    (34.13)
= ∫_0^1 |ẋ|² dt / ( 8 J_1(x) ) = 1/4    (34.14)
Lemma 505 (Trajectories are Rarely Far from Action Minima) For every j > 0, δ > 0, let U(j, δ) be the open δ neighborhood of L_1(j), i.e., all the trajectories coming within δ of a trajectory whose action is less than or equal to j. Then for any γ > 0, there is an ε_0 > 0 such that, if ε < ε_0,

P(X_ε ∉ U(j, δ)) ≤ e^{−(j−γ)/ε²}    (34.18)
We will see that, for large enough n, this is exponentially close to X_ε. First, though, let's bound the probability in Eq. 34.18.

P(X_ε ∉ U(j, δ)) = P(X_ε ∉ U(j, δ), ‖X_ε − Y_{n,ε}‖_∞ < δ) + P(X_ε ∉ U(j, δ), ‖X_ε − Y_{n,ε}‖_∞ ≥ δ)    (34.20)
≤ P(X_ε ∉ U(j, δ), ‖X_ε − Y_{n,ε}‖_∞ < δ) + P(‖X_ε − Y_{n,ε}‖_∞ ≥ δ)    (34.21)
≤ P(J_1(Y_{n,ε}) > j) + P(‖X_ε − Y_{n,ε}‖_∞ ≥ δ)    (34.22)
where the ξ_i have the χ² distribution with one degree of freedom. Using our results on such distributions and their sums in Ch. 21, it is not hard to show that, for sufficiently small ε,

P(J_1(Y_{n,ε}) > j) ≤ (1/2) e^{−(j−γ)/ε²}    (34.25)
To estimate the probability that the distance between X_ε and Y_{n,ε} reaches or exceeds δ, start with the independent-increments property of X_ε, and the fact that the two processes coincide when t = i/n.

P(‖X_ε − Y_{n,ε}‖_∞ ≥ δ) ≤ Σ_{i=1}^n P( max_{(i−1)/n ≤ t ≤ i/n} |X_ε(t) − Y_{n,ε}(t)| ≥ δ )    (34.26)
= n P( max_{0≤t≤1/n} |X_ε(t) − Y_{n,ε}(t)| ≥ δ )    (34.27)
= n P( max_{0≤t≤1/n} |W(t) − ntW(1/n)| ≥ δ/ε )    (34.28)
≤ n P( max_{0≤t≤1/n} |W(t)| ≥ δ/2ε )    (34.29)
≤ 4dn P( W_1(1/n) ≥ δ/2dε )    (34.30)
≤ 4dn ( 2dε / (δ√(2πn)) ) e^{−nδ²/8d²ε²}    (34.31)

again freely using our calculations from Ch. 21. If n > 4d²j/δ², then P(‖X_ε − Y_{n,ε}‖_∞ ≥ δ) ≤ (1/2) e^{−(j−γ)/ε²}, and we have overall

P(X_ε ∉ U(j, δ)) ≤ e^{−(j−γ)/ε²}    (34.32)

as required.
Proposition 506 (Compact Level Sets of the Wiener Action) The Cameron-
Martin norm has compact level sets.
Proof: It is easy to show that Lemma 504 implies the large deviation lower
bound for open sets. (Exercise 34.2.) The tricky part is the upper bound. Pick
any closed set C and any γ > 0. Let s = J_1(C) − γ. By Proposition 506, the set
(writing |y(t)| for the norm of Euclidean vectors y, and ‖x‖ for the supremum norm of continuous curves). By Gronwall's Inequality (Lemma 258), then,

‖x_1 − x_2‖ ≤ ‖w_1 − w_2‖ e^{K_a T}    (34.46)

on every interval [0, T]. So we can make sure that ‖x_1 − x_2‖ is less than any desired amount by making sure that ‖w_1 − w_2‖ is sufficiently small, and so F is continuous.
where

L(q, p) = ½ (p_i − a_i(q)) B^{−1}_{ij}(q) (p_j − a_j(q))    (34.50)

and

B(q) = b(q) b^T(q)    (34.51)

with J(x) = ∞ if x ∈ C \ H_∞.
34.4 Exercises
Exercise 34.1 (Cameron-Martin Spaces are Hilbert Spaces) Prove Lemma
501.
Bibliography

Baldi, Pierre and Søren Brunak (2001). Bioinformatics: The Machine Learning Approach. Cambridge, Massachusetts: MIT Press, 2nd edn.
Banks, J., J. Brooks, G. Cairns, G. Davis and P. Stacy (1992). “On Devaney’s
Definition of Chaos.” American Mathematical Monthly, 99: 332–334.
Bartlett, M. S. (1955). An Introduction to Stochastic Processes, with Special
Reference to Methods and Applications. Cambridge, England: Cambridge
University Press.
Basharin, Gely P., Amy N. Langville and Valeriy A. Naumov (2004). “The Life and Work of A. A. Markov.” Linear Algebra and its Applications, 386: 3–26. URL http://decision.csl.uiuc.edu/~meyn/pages/Markov-Work-and-life.pdf.
Biggers, Earl Derr (1928). Behind That Curtain. New York: Grosset and
Dunlap.
Billingsley, Patrick (1961). Statistical Inference for Markov Processes. Chicago:
University of Chicago Press.
Blackwell, David and M. A. Girshick (1954). Theory of Games and Statistical
Decisions. New York: Wiley.
Bosq, Denis (1998). Nonparametric Statistics for Stochastic Processes: Estima-
tion and Prediction. Berlin: Springer-Verlag, 2nd edn.
Caires, S. and J. A. Ferreira (2005). “On the Non-parametric Prediction of
Conditionally Stationary Sequences.” Statistical Inference for Stochastic Pro-
cesses, 8: 151–184. Correction, vol. 9 (2006), pp. 109–110.
Charniak, Eugene (1993). Statistical Language Learning. Cambridge, Mas-
sachusetts: MIT Press.
Chomsky, Noam (1957). Syntactic Structures. The Hague: Mouton.
Courant, Richard and David Hilbert (1953). Methods of Mathematical Physics.
New York: Wiley.
Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory.
New York: Wiley.
Doukhan, Paul (1995). Mixing: Properties and Examples. New York: Springer-
Verlag.
Durbin, James and Siem Jan Koopman (2001). Time Series Analysis by State Space Methods. Oxford: Oxford University Press.
Eckmann, Jean-Pierre and David Ruelle (1985). “Ergodic Theory of Chaos and
Strange Attractors.” Reviews of Modern Physics, 57: 617–656.
Eichinger, L., J. A. Pachebat, G. Glockner et al. (2005). “The genome of the
social amoeba Dictyostelium discoideum.” Nature, 435: 43–57.
Ellis, Richard S. (1985). Entropy, Large Deviations, and Statistical Mechanics.
Berlin: Springer-Verlag.
— (1988). “Large Deviations for the Empirical Measure of a Markov Chain with an Application to the Multivariate Empirical Measure.” The Annals of Probability, 16: 1496–1508. URL http://links.jstor.org/sici?sici=0091-1798%28198810%2916%3A4%3C1496%3ALDFTEM%3E2.0.CO%3B2-S.
Embrechts, Paul and Makoto Maejima (2002). Selfsimilar processes. Princeton,
New Jersey: Princeton University Press.
Ethier, Stewart N. and Thomas G. Kurtz (1986). Markov Processes: Charac-
terization and Convergence. New York: Wiley.
Eyink, Gregory L. (1996). “Action principle in nonequilbrium statistical dy-
namics.” Physical Review E , 54: 3419–3435.
Fisher, Ronald Aylmer (1958). The Genetical Theory of Natural Selection. New
York: Dover, 2nd edn. First edition published Oxford: Clarendon Press, 1930.
Forster, Dieter (1975). Hydrodynamic Fluctuations, Broken Symmetry, and
Correlation Functions. Reading, Massachusetts: Benjamin Cummings.
Freidlin, M. I. and A. D. Wentzell (1998). Random Perturbations of Dynami-
cal Systems. Berlin: Springer-Verlag, 2nd edn. First edition first published
as Fluktuatsii v dinamicheskikh sistemakh pod deistviem malykh sluchainykh
vozmushchenii, Moscow: Nauka, 1979.
Frisch, Uriel (1995). Turbulence: The Legacy of A. N. Kolmogorov . Cambridge,
England: Cambridge University Press.