Stochastic Control Notes
Lecture 1
Review of probability theory
1.1. We denote by Z the set of integers, by N the set of positive integers, by Q the set of rational numbers and by R the set of real numbers.
1.2. A set containing a finite number of elements is said to be finite. A set containing
an infinite number of elements that can be put in one to one correspondence with the
positive integers is said to be countable. According to this definition, the sets Z and Q
are countable, as one can exhibit one-to-one correspondence between their elements and
the positive integers.
1.3. If A and B are subsets of a set S, we denote by A ∩ B and A ∪ B their intersection and union respectively. We denote by Ac = S − A the complement of A in S.
1.4. The set of all subsets of a set S is called the power set and is usually denoted as 2S .
1.5. If the set S is finite, so is its power set. Thus in that case, S and its power set are very similar objects. However, if S is countably infinite, its power set is not countable. Indeed, the power set of N can be put in one-to-one correspondence with R (by binary expansion of real numbers), and R can be shown to be non-countable (by what is called a "diagonalization argument"; for your own edification, you can check the paper of Cantor [1] or Rudin's book on Real Analysis [3], but this is beyond the scope of this course).
1.6. A.N. Kolmogorov introduced in 1933 [2] a precise mathematical model and axioms
for the subject of probability, which are now the foundation on which the subject is built.
We outline it here. The first step is to introduce the notion of probability space, which we
do now.
1.7. We start with an arbitrary set Ω whose elements are the possible states of the world
and subsets of Ω, called events, which correspond to a collection of states of the world.
For example, when throwing a die, the set Ω can be taken to be Ω = {1, 2, . . . , 6} and the
subset A = {1, 3, 5} is the event: "the outcome of the throw is odd".
A measure µ on Ω is a positive-real valued function defined on a subset S of the power set of Ω,

µ : S ⊆ 2^Ω → [0, 1],

which describes the 'chance' that an event occurs. Reasoning for the case of Ω finite, it is reasonable to ask that µ(Ω) = 1, µ(∅) = 0 (where ∅ is the empty set), and that µ be additive on disjoint sets: µ(A ∪ B) = µ(A) + µ(B) whenever A ∩ B = ∅.
When the cardinality of Ω is infinite, we have mentioned that its power set is uncount-
able. Requiring that µ be additive on uncountable unions is too restrictive however. We
thus require that µ be additive only on countable unions, that is if Ai , i ∈ N are disjoint
subsets of Ω, then
µ(∪_{i=0}^{∞} A_i) = ∑_{i=0}^{∞} µ(A_i).
1.9. Example Take Ω to be the set of binary numbers on n digits, that is an element of
Ω is of the form (a1 . . . an ) where ai ∈ {0, 1}. The set Ω has cardinality 2n and its power
n
set cardinality 22 . Given α ∈ [0, 1], we can define a class of measures on Ω by letting
µ((a_1, . . . , a_n)) = α^{∑_i a_i} (1 − α)^{∑_i (1 − a_i)}.
Summing this expression over all 2^n elements of Ω gives (α + (1 − α))^n = 1, and hence the probability measure is normalized. We can extend the definition of µ from Ω to 2^Ω via the relation µ(A ∪ B) = µ(A) + µ(B) if A ∩ B = ∅.
The Borel sets are the subsets of R generated by countable unions and complementation of open intervals of R. They form, by construction, a σ-algebra.
For the purpose of this course, it is often enough to consider the probability space (R, B, µ), where B is the Borel σ-field and µ is the measure obtained by extending the usual notion of measure obtained from the length of an interval.
From the Borel σ-algebra, one then constructs the larger σ-algebra of Lebesgue measurable sets. We do not go into these details here, but one should remember that the Lebesgue σ-field is larger than the Borel σ-field, and that both are strict subsets of the power set of R.
In this case, the measurement space is binary: M = {0, 1} and comes with the σ-field 2^M = {{0, 1}, ∅, {0}, {1}}.
Measurability
1.12. Even though you will often hear "a random variable X is measurable," it is clear
from the definition that measurability is dependent on the underlying σ-fields.
1.13. From an engineering point of view, one can think of a σ-field as the resolution
of the measurement instrument at our disposal and the measurable functions as the ques-
tions one can answer using this instrument. Indeed, assume that the states of a sys-
tem belong to the set Ω = {1, 2, . . . , 10}. If your measurement device can only answer whether the state is less than or equal to 5, then observe that you can also answer whether it is larger than 5 (this is the complementation requirement for sets in a σ-field). Hence the σ-field is {∅, Ω, {1, 2, . . . , 5}, {6, 7, . . . , 10}}. The question "is the state 2 or 4?" is clearly not answerable by this instrument, and indeed the corresponding indicator random variable is not measurable. If we buy a second instrument that can tell us whether the state is even or odd, the new σ-field at our disposal is the one generated by taking unions and complements of the sets ∅, Ω, {1, 3, 5, 7, 9}, {2, 4, 6, 8, 10}, {1, 2, . . . , 5} and {6, 7, . . . , 10}. Now observe that the set {2, 4} is in the σ-field generated by the sets above, and we can thus answer the question of whether the state is 2 or 4.
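For the reader who wishes to experiment, the closure operation just described is easy to carry out by computer for small Ω. Below is a minimal Python sketch (standard library only; the two generating sets are the two instruments above) that closes a family of sets under complements and unions and checks that {2, 4} is measurable.

    from itertools import combinations

    omega = frozenset(range(1, 11))
    gens = [frozenset({1, 2, 3, 4, 5}),      # first instrument: "state <= 5?"
            frozenset({1, 3, 5, 7, 9})]      # second instrument: "state odd?"

    def generate_sigma_field(generators):
        """Close a collection of subsets of omega under complement and union."""
        field = {frozenset(), omega} | set(generators)
        while True:
            new = set()
            for a in field:
                new.add(omega - a)                 # complements
            for a, b in combinations(field, 2):
                new.add(a | b)                     # (finite) unions
            if new <= field:
                return field
            field |= new

    sigma = generate_sigma_field(gens)
    print(len(sigma))                              # 16 measurable events
    print(frozenset({2, 4}) in sigma)              # True: "is the state 2 or 4?" is answerable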
Adapted σ-field

A_i = X^{-1}(B_i)
1.15. Given a probability space (Ω, S, µ), we say that two events A, B ∈ S are independent
if
µ(A ∩ B) = µ(A)µ(B).
1.16. To make sense of the above definition, consider again the case of throwing a die,
that is Ω = {1, 2, . . . , 6}. The events ’the outcome is odd’ (call it A) and ’the outcome
is less than or equal to 3' (call it B) are not independent. Indeed, A ∩ B = {1, 3} and thus µ(A ∩ B) = 1/3 ≠ 1/4 = µ(A)µ(B). Intuitively, we know that these events are not independent since, not knowing whether the outcome is less than or equal to 3, the chance that it is odd is 1/2. But if we know that the outcome is less than or equal to 3, then the chance that it is odd is now 2/3. If instead we take B to be 'the outcome is less than or equal to 4', then A and B are independent.
1.18. It is an easy exercise to show that if S is a σ-field, then for any given A ∈ S, the collection of sets {B ∩ A | B ∈ S} is also a σ-field. It is the σ-field on which the conditional measure µ(·|A) is defined.
1.20. The notion of independence of events can be applied to random variables: we say
that two random variables X and Y are independent if their adapted σ-fields are made of
independent sets.
1.21. Given two random variables, X,Y , their joint cumulative distribution function is
given by
ϕ(x, y) = µ(X < x,Y < y),
where X < x refers to the set {s ∈ Ω|X (s ) < x }.
1.22. If the random variables X and Y are independent, their joint-cumulative distribu-
tion factors as
ϕ(x, y) = ϕ1 (x)ϕ2 (y)
with ϕ1 (x) = µ(X < x) and ϕ2 (y) = µ(Y < y).
Density function
Independence
1.23. Recall the marginalization procedure: given a joint density, you can obtain the
density for one variable by integrating out the other
f_X(x) = ∫ f_{X,Y}(x, y) dy.
1.24. Let us return to the die tossing example with X the random variable which is 1 if
the outcome is odd and 0 otherwise; Y the random variable which is 1 if the outcome is
strictly larger than 3 and 0 otherwise.
The joint density of X and Y is thus

f_{X,Y}(0, 0) = 1/6;   f_{X,Y}(1, 0) = 2/6;   f_{X,Y}(0, 1) = 2/6;   f_{X,Y}(1, 1) = 1/6.
Observe that

∑_{i,j} f_{X,Y}(i, j) = 1.
f_X(0) = ∑_j f_{X,Y}(0, j) = (1 + 2)/6 = 1/2

f_X(1) = ∑_j f_{X,Y}(1, j) = (2 + 1)/6 = 1/2

as expected.
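As a sanity check, the joint and marginal densities above can be recomputed by brute-force enumeration of the six equally likely outcomes. A minimal Python sketch (the Fraction class is only used to keep exact sixths):

    from fractions import Fraction
    from collections import defaultdict

    joint = defaultdict(Fraction)
    for outcome in range(1, 7):
        x = 1 if outcome % 2 == 1 else 0      # X = 1 if the outcome is odd
        y = 1 if outcome > 3 else 0           # Y = 1 if the outcome is strictly larger than 3
        joint[(x, y)] += Fraction(1, 6)

    print(dict(joint))                         # (1,0): 1/3, (0,0): 1/6, (0,1): 1/3, (1,1): 1/6
    print(sum(joint.values()))                 # 1: the joint density is normalized
    fX = {x: sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
    print(fX)                                  # {0: 1/2, 1: 1/2}, marginalizing out Y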
1.26. We can relate the densities f_X and f_Y of two random variables related by Y = ϕ(X) as follows. First, assume that ϕ is one-to-one. Let ε > 0 be small and x_0 ∈ R. The probability of the event A = {x | |x − x_0| < ε} is, up to first order, f_X(x_0)ε. From basic considerations about probabilities of an event, it is clear that we want the image of this event under ϕ (that is, ϕ(A)) to have the same probability. Observe that the length of the interval ϕ(A) is given, up to first order, by |dϕ/dx|_{x_0} ε. Putting the above together, we have that, up to first order,

f_Y(y) = f_X(ϕ^{-1}(y)) |dϕ/dx|^{-1}.
1.27. If ϕ is not one-to-one, we simply take the sum over the inverse images ϕ^{-1}(y):

f_Y(y) = ∑_{x : y = ϕ(x)} f_X(x) |dϕ/dx|^{-1},

with the understanding that dϕ/dx is evaluated at the appropriate inverse image.
1.28. If X and Y are R^n-valued random variables related by Y = ϕ(X), then the formula becomes

f_Y(y) = ∑_{x : y = ϕ(x)} f_X(x) |det(∂ϕ/∂x)|^{-1},

where ∂ϕ/∂x is the Jacobian of ϕ.
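The change-of-variables formula is easy to verify numerically. The following sketch (NumPy assumed; the particular choice ϕ(x) = e^x with X a standard Gaussian is only an illustration, not taken from the notes) compares a histogram of Y = ϕ(X) against f_X(ϕ^{-1}(y)) |dϕ/dx|^{-1}.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)          # samples of X with density f_X
    y = np.exp(x)                             # Y = phi(X), phi(x) = e^x is one-to-one

    # predicted density: f_Y(y) = f_X(log y) * |d phi/dx|^{-1}, and |d/dx e^x| = y
    f_X = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    bins = np.linspace(0.05, 5.0, 101)
    counts, edges = np.histogram(y, bins=bins)
    width = edges[1] - edges[0]
    emp = counts / (len(y) * width)           # empirical density of Y on each bin
    centers = 0.5 * (edges[:-1] + edges[1:])
    pred = f_X(np.log(centers)) / centers

    print(np.max(np.abs(emp - pred)))         # small: Monte Carlo error only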
1.29. Consider the integral I = ∫_R e^{−x²} dx. We can evaluate it by first expressing its square as

I² = ∫_R ∫_R e^{−x² − y²} dx dy

and changing to polar coordinates to obtain

I² = ∫_0^∞ ∫_0^{2π} r e^{−r²} dθ dr = π.
1.30. Consider the density

f(x) = (1/√(2πσ)) e^{−x²/2σ},

where σ is a positive real. Using the point above, we conclude that the above density is normalized.
1.31. Because the integral of e^{−x_1²/2σ_1} · · · e^{−x_n²/2σ_n} over R^n is the product of integrals of the type of the one above, we obtain that

∫_R · · · ∫_R e^{−x_1²/2σ_1} · · · e^{−x_n²/2σ_n} dx_1 · · · dx_n = √((2π)^n σ_1 · · · σ_n).
1.32. We introduce the vector x = [x 1, . . . , x n ]′ and the diagonal matrix D with positive
eigenvalues d1, . . . , dn . We have the following identity:
∫_{R^n} e^{−x′(2D)^{−1}x} dx_1 · · · dx_n = √((2π)^n det(D)).
1.33. Let Q be a symmetric positive definite matrix. It is a fundamental fact from linear algebra that there exists an orthogonal matrix Θ such that Θ′QΘ is diagonal. If we set z = Θ′x, then the Jacobian of z with respect to x is Θ′, whose determinant has absolute value one. Thus the change of variables formula, along with the fact that the determinant of Q and the determinant of Θ′QΘ are the same, tells us that

∫_{R^n} e^{−z′Θ′(2D)^{−1}Θz} dz_1 · · · dz_n = ∫_{R^n} e^{−z′(2Q)^{−1}z} dz_1 · · · dz_n = √((2π)^n det(Q)).
1.34. We thus define

f_Z(z) = (1/√((2π)^n det(Q))) e^{−z′(2Q)^{−1}z},
which is well normalized as we have just shown.
1.35. Translating z by a constant value m does not change the value of the integral
above. We say that W is a multivariate Gaussian with mean m (and covariance Q ) if it
is distributed as
f_W(w) = (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)}.
1.36. We record the identities, which can be derived easily following the steps outlined
in this section:
m = ∫_{R^n} w (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)} dw

and

Q = ∫_{R^n} (w − m)(w − m)′ (1/√((2π)^n det(Q))) e^{−(w−m)′(2Q)^{−1}(w−m)} dw.
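A quick numerical check of the identities in 1.36 (a sketch assuming NumPy; the particular m and Q below are arbitrary illustrative choices): since f_W is exactly the Gaussian density with mean m and covariance Q, sampling from it and forming empirical averages should recover m and Q.

    import numpy as np

    rng = np.random.default_rng(1)
    m = np.array([1.0, -2.0])
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # symmetric positive definite

    w = rng.multivariate_normal(mean=m, cov=Q, size=500_000)
    print(w.mean(axis=0))                      # ~ m, the first identity
    print(np.cov(w, rowvar=False))             # ~ Q, the second identity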
1.37. The moment generating function (mgf) of a random variable X is defined as

m_X(t) ≜ E(e^{tX}).
Notice that this is rather similar to the Laplace transform of the density.
m_X(t) = 1 + t m_1 + t² m_2/2! + t³ m_3/3! + · · ·
The characteristic function of X is defined as

c_X(t) ≜ E(e^{itX}).
This is, up to a constant factor, the Fourier transform of the density of X .
1.43. Whereas the moment generating function can fail to exist, the characteristic func-
tion always exists.
If X and Y are independent, then m_{X+Y}(t) = m_X(t) m_Y(t).
Lecture 2
Basic notions from discrete-time stochastic processes
We cover aspects of the theory of stochastic processes in discrete time. A discrete-time stochastic process is a collection of random variables indexed by time: for each n, x(n) is a random variable.
Sample paths
2.2. Stochastic processes are used to model a wide variety of natural phenomena that
are either deterministic, but too complex to model exactly, or genuinely random. Typical
examples are the number of customers in a queue at time t , number of packets arriving
at a router at time t , stock prices, the outcomes of a fair game, etc.
2.3. Consider flipping a coin n times and let X (t ) be the result of the t th flip, then X (t ) is
a random process with Ω = {H H H . . . , H T H H . . . , . . . ,T T T T . . .} and we take Σ = 2Ω .
An element of Ω is a sample path.
2.4. Consider tossing a die 3 times. In that case Ω = {111, 112, . . . , 666}. We can define
the stochastic process X (t ) = 1 if the outcome of the t th toss is odd and 0 if it is even.
The possible sample paths for X are thus 000, 001, . . . where in the first case, the three
tosses are even, etc.
2.5. The events in the σ-field Σ contain information about the whole process. In partic-
ular, they contain information about future outcomes. Indeed, in the case of the three
tosses of a die above, a random variable that is measurable with respect to 2Ω will allow
us to decide future outcomes. We would like to find a way to force a random variable
to use the information that is available only up to time t. Consistent with what was done earlier, we encode the information a random variable has access to into a σ-field.
2.6. The notion of filtration is introduced to control the amount of information a random
variable has access to at time t.
Filtration
A filtration is an increasing family of σ-fields {F_t}: F_s ⊆ F_t if s ≤ t.
2.7. Example Let us consider again the case of tossing a coin n times. Define
F_1 = {Ω, ∅, A_1, A_2},   A_1 = {HH . . . H, H . . . HT, . . . , HTT . . . T},   A_2 = {TH . . . H, T . . . HT, . . . , TTT . . . T}.
In words, besides Ω and ∅, F_1 contains two sets: the set A_1 of all sample paths that start with H and the set A_2 of all sample paths that start with T. It is easy to see that these two sets are disjoint, and that their union is Ω. Hence F_1 is a σ-field.
Observe that F1 is the σ-field adapted to X (1), for X (1) being the outcome of the first
toss.
Define the four disjoint events:
B1 = {H H H H . . . , H H T H . . . , . . . , H H T T . . .}
B2 = {H T H H . . . , H T T T . . . , . . .}
B3 = {T H H H . . . ,T H T T . . . , . . .}
B4 = {T T H H . . . ,T T T T . . . , . . .}
20
2.1 Basic concepts
Hence, B_1 contains all the sample paths such that the first two tosses yield HH, B_2 all the sample paths such that the first toss yields H and the second T, etc. The events B_i are clearly disjoint.
We define F_2 as

F_2 = {Ω, ∅, B_1, B_2, B_3, B_4, B_1 ∪ B_2, B_1 ∪ B_2 ∪ B_3, . . .},

where we take all the possible twofold and threefold unions of the B_i. Then F_2 is a σ-field
(which has a finite number of elements). Observe that A 1 = B 1 ∪ B 2 and A2 = B 3 ∪ B 4 ,
hence
F1 ⊂ F2 .
The opposite inclusion does not hold, e.g. B_1 ∉ F_1.
Hence a random variable that is measurable with respect to F2 will have access to more
information than a random variable that is measurable with respect to F1 .
In this fashion, we can construct a filtration Fi . Observe that we have again that F2 is
the σ-field adapted to X (1), X (2).
2.8. We thus see that a filtration can be used to control the amount of information a
random variable has access to at time t .
2.9. A filtration thus formalizes the intuitive notion that a random variable Z (t ) defined
on process up to time t can only be a function of the observations X (1), X (2), . . . , X (t ). Z
is also sometimes called past-measurable. Ft is simply the σ-field adapted to X (1), . . . , X (t ),
and we rewrite Z = f (X (1), X (2), . . . , X (t )) for an arbitrary function f as "Z (t ) is adapted
to Ft ". Do not fail to grasp this point, as the language of filtration is the universal
language when talking about stochastic processes in most of the probability/statistics
literature.
2.10. Talking about filtrations allows us to make the definitions independent of the choice
of particular X .
2.11. Given a probability space (Ω, Σ, P) and a filtration F_t, we say that a process X(t) is F_t-adapted if, for each l, (X(1), . . . , X(l)) is F_l-measurable.
2.12. Consider flipping a biased coin with p(X = 1) = p and p(X = −1) = 1 − p. Let
S 0 ∈ Z (often, S 0 = 0.) Define
S(n) = S_0 + ∑_{i=1}^{n} X_i.
This process is called the simple random walk.
2.13. In general, we can define a random walk for X_k being i.i.d. but not necessarily Bernoulli. The definition of S(n) is the same as above.
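For illustration, here is a small Python sketch (NumPy assumed; p, the horizon and the number of paths are arbitrary choices) that simulates sample paths of the simple random walk.

    import numpy as np

    def random_walk_paths(n_steps, n_paths, p=0.5, s0=0, seed=0):
        """Simulate n_paths sample paths of S(n) = S0 + sum_i X_i with X_i = +/-1 i.i.d."""
        rng = np.random.default_rng(seed)
        steps = np.where(rng.random((n_paths, n_steps)) < p, 1, -1)
        return s0 + np.cumsum(steps, axis=1)

    paths = random_walk_paths(n_steps=100, n_paths=5)
    print(paths[:, -1])      # terminal values S(100) of five sample paths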
Sample mean
A familiar stochastic process, though often not thought of as such, is the sample mean

X̄(n) = (1/n) ∑_{i=1}^{n} X_i.
Gambler’s ruin
Let us answer the following question about the simple random walk S(t): given two numbers b < 0 < a, what is the probability that S(t) reaches a before it reaches b? It is called the Gambler's ruin problem because it can be used to model the following scenario: you have an amount |b| of money and are playing a fair game against someone with an amount a of money. Assume that at each round of the game, you lose or win one dollar with equal probability. What is the chance that you get ruined before your opponent?
2.14. To solve this problem we will, similarly to what we did when we derived an equation for the mean, seek to derive an equation for this probability. Define p_k to be the probability that S(t) reaches a before it reaches b given that S(0) = k. Then

p_k = (1/2) p_{k+1} + (1/2) p_{k−1}.

Indeed, starting from k, in order to reach a or b, we have to cross either k − 1 or k + 1 at the next step. We go to k − 1 with probability 1/2. From k − 1, the probability of reaching a before b is independent of the previous position (because the odds of winning a bet do not depend on how much money you have) and is equal, by definition, to p_{k−1}. We have a similar reasoning for the other term.
2.16. Formally, if we define A to be the event "reaching a before reaching b", we have
p_k = (1/2) p_{k+1} + (1/2) p_{k−1}.
2.17. We used in the derivation several tools that appear frequently in this type of computation:
• First, marginalize with respect to a chosen variable. The choice of this variable is
application-dependent.
• Then, condition on the past.
These two steps will often allow us to find a recursive equation for a quantity of interest.
In words, only the previous known state of the process matters, and not its whole history.
This fact is intuitively obvious in the case of gambling: whether you lost or won does not
affect the odds that a future coin flip will come up heads or tails. This property appears
often in the study of stochastic processes: it is called the Markov property. We will study
processes having this property, called Markov chains, in more details later in the course.
The general solution of the recursion

p_k = (1/2) p_{k−1} + (1/2) p_{k+1}

is given by

p_k = α + βk.
2.20. We clearly have p_a = 1 and p_b = 0. Plugging this into the general solution, we obtain

p_k = (k − b)/(a − b).

We observe that if |b| ≫ a, i.e. you are far richer than your opponent, the probability that you are ruined before your opponent decreases, as was expected.
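The formula p_k = (k − b)/(a − b) is simple to confirm by simulation; a minimal sketch (the particular a, b and starting point k are arbitrary):

    import numpy as np

    def prob_reach_a_before_b(k, a, b, n_trials=20_000, seed=0):
        """Estimate the probability that the simple random walk started at k hits a before b."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(n_trials):
            s = k
            while b < s < a:
                s += 1 if rng.random() < 0.5 else -1
            hits += (s == a)
        return hits / n_trials

    a, b, k = 5, -3, 0
    print(prob_reach_a_before_b(k, a, b))     # ~ (k - b)/(a - b) = 3/8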
The first passage time of the walk at x is defined as

τ(x) = min {t : S(t) = x}.

First passage times arise naturally in many contexts in which an event is triggered when a process reaches a certain value. Examples are in finance, when certain contracts or options may be activated when an indicator reaches a given value, in biology, when a reaction starts when the concentration of certain compounds is high enough, etc.
Computing τ = τ(1).

Define τ to be the first passage time at 1. Starting from zero, the process goes to 1 with probability 1/2, and thus P(τ = 1) = 1/2. With probability 1/2, the process goes to −1. If the process is at −1, it will take τ′ steps to return to 0, and then τ′′ further steps to go from 0 to 1. Hence

τ = 1 with probability 1/2,
τ = 1 + τ′ + τ′′ with probability 1/2,

where τ′ and τ′′ are independent random variables, distributed identically to τ.
since P(τ′ + τ′′ = 0) = 0. Observe that ∑_{n=1}^{∞} z^n P(τ′ + τ′′ = n) is the mgf of the sum of the two independent random variables τ′ and τ′′. Because τ, τ′ and τ′′ are all identically distributed, their mgfs are the same. We hence have

M_τ(z) = (1/2) z + (1/2) z M_τ(z)².   (2.1)
2.25. Solving Equation (2.1) for M_τ(z), we obtain M_τ(z) = (1 ± √(1 − z²))/z. Observe that for z ∈ (0, 1), M_τ(z) must take values in (0, 1), hence

M_τ(z) = (1 − √(1 − z²))/z.
We plot this function in Figure 2-1
We next draw some conclusions from this analysis:
1 = lim_{z→1⁻} M_τ(z) = lim_{z→1⁻} ∑_{n=1}^{∞} z^n P(τ = n) = ∑_{n=1}^{∞} P(τ = n) = P(τ < ∞).
Figure 2-1: M_τ(z) as a function of z ∈ [0, 1].
We deduce from this that every state is visited infinitely often. This is a consequence
of the two facts highlighted below. Indeed, because P (τ < ∞) = 1, the simple random
walk, starting from 0, will visit the state 1 in finite time almost surely. Because of the
Markov property, once we are at state 1, we can consider that it is the starting point
of the random walk: hence repeating the same reasoning, we will visit 2 in finite time
almost surely, then 3, etc. Because the walk is symmetric, the same applies to states −1,
−2, etc.
Expectation
Recall that

E(τ) = d/dz|_{z=1} M_τ(z).

Indeed,

d/dz|_{z=1} M_τ(z) = d/dz|_{z=1} ∑_{n=1}^{∞} z^n P(τ = n) = (∑_{n=1}^{∞} n z^{n−1} P(τ = n))|_{z=1} = ∑_{n=1}^{∞} n P(τ = n) = E(τ).
2.27. We thus observe that E(τ) = ∞: The expected time to visit 1 starting from 0 is
infinite.
τ(x)

We can derive the mgf of τ(x) using the above remarks and the generating function for τ. Indeed, we can write

τ(x) = τ_1 + τ_2 + . . . + τ_x,

where the τ_i are independent random variables with the same distribution as τ. Indeed, in order to get from 0 to x, we must get from 0 to 1, which takes τ_1 steps, then from 1 to 2, which takes τ_2 steps, and so on, the τ_i being identically distributed. Hence

M_{τ(x)}(z) = (M_τ(z))^x.
2.28. A skeptical mind may balk at the derivation of M_τ(z), finding the equation we established for its mgf arbitrary. For example, we can just as well write

τ = 1 with probability 1/2,
τ = 2 + τ′ + τ′′ + τ′′′ with probability 1/4,
τ = 2 + τ′′′′ with probability 1/4,
where in the first case, the walk goes to the right at the first time step, in the second case,
the walk goes twice to the left, and in the third case, the walk goes once to the left, then
once to the right. We have that τ, τ′, τ′′, τ′′′, τ′′′′ are all i.i.d.
A quick computation, in the spirit of the one of the previous point, yields:

M_τ(z) = z/2 + (1/4) z² M_τ(z)³ + (1/4) z² M_τ(z).

While finding the roots of this equation, which is cubic in M_τ, may be hard, it is simpler to check that replacing M_τ(z) by (1 − √(1 − z²))/z satisfies the equation. Hence we find the same moment generating function, as was expected.
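One can also check the closed form M_τ(z) = (1 − √(1 − z²))/z by direct simulation of τ. A sketch (the truncation at max_steps is harmless for z < 1 since z^τ is then negligible; the value of z is arbitrary):

    import numpy as np

    def mgf_tau(z, n_trials=20_000, max_steps=10_000, seed=0):
        """Monte Carlo estimate of E[z^tau], tau = first passage time of the walk at +1."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_trials):
            s, t = 0, 0
            while s < 1 and t < max_steps:
                s += 1 if rng.random() < 0.5 else -1
                t += 1
            if s == 1:
                total += z ** t
        return total / n_trials

    z = 0.9
    print(mgf_tau(z), (1 - np.sqrt(1 - z**2)) / z)   # both ~ 0.627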
Lecture 3
Poisson counters and stochastic differential equations
Figure 3-1: A sample path for a Poisson counter N(t) of rate λ. The times between jumps are distributed according to an exponential with parameter λ. The paths are continuous from the right, and the limit from the left exists.
3.1. Let x(t) be a non-decreasing process which takes on values in the positive integers N. A sample path for x(t) is depicted in Figure 3-1. Denote by p_n(t) the quantity P(x(t) = n). Let λ > 0 be a positive real number. We say that x(t) jumps at time t if for all ε > 0, x(t + ε) − x(t − ε) = 1. Assume that the probability that x(t) jumps during the time interval dt is given by λdt.
3.2. We have the following informal derivation for the evolution of p n (t ). Consider the
set of all possible sample paths that fit the description above (integer valued, strictly
increasing). The variation at time t of p n (t ) is the variation in the number of paths x(·)
that are equal to n at t divided by the total number of paths. Up to first order, there is
an increase in such paths stemming from paths that were at n − 1 before t and jump at t ("up to first order" because we discard paths that jump twice within a 'dt'). Similarly, there is a decrease in the number of paths stemming from paths that are equal to n and jump to n + 1. Because the probability of a jump during dt is λdt, the number of paths that jump to n from n − 1 is equal to the number of paths that are at n − 1 multiplied by λdt, that is

variation in #paths at n at time t = (#paths at n − 1) λdt − (#paths at n) λdt.

Dividing by the total number of paths, we obtain

(d/dt) p_n(t) = λ (p_{n−1}(t) − p_n(t)),   (3.1)

with the convention p_{−1}(t) = 0. The constant λ is called the counting rate and the process x(t) a Poisson counter. You can take (3.1) as the definition of the evolution of the probabilities for a Poisson counter.
3.4. From the above formulation, it is clear that these equations can be solved one by
one starting with p 0 (t ) = e −λt to obtain
p_0(t) = e^{−λt}
p_1(t) = λt e^{−λt}
p_2(t) = (λt)²/2! · e^{−λt}
. . .
p_k(t) = (λt)^k/k! · e^{−λt}
. . .
3.5. Observe that the probabilities p_k(t) are normalized:

∑_{i=0}^{∞} p_i(t) = e^{−λt} (∑_{k=0}^{∞} (λt)^k/k!) = e^{−λt} e^{λt} = 1.
3.6. We see that a Poisson counter x(t) strictly increases with time. We can compute its expectation as follows:

E x(t) = ∑_{k=0}^{∞} k p_k(t)
       = ∑_{k=0}^{∞} k (λt)^k/k! · e^{−λt}
       = ∑_{k=1}^{∞} k (λt)^k/k! · e^{−λt}
       = λt ∑_{k=1}^{∞} (λt)^{k−1}/(k − 1)! · e^{−λt}
       = λt ∑_{k′=0}^{∞} (λt)^{k′}/k′! · e^{−λt}
       = λt e^{−λt} e^{λt} = λt,

where we set k′ = k − 1.
3.8. What is the distribution of the times between jumps for a Poisson process? To evaluate it, first notice that the distribution of the time between jumps does not depend on the current value of the counter. Hence, we can focus on finding the distribution of the time of the first jump of a process that starts at 0. We have for t > 0 that P(no jump in [0, t]) = p_0(t) = e^{−λt}, so the time between jumps is exponentially distributed with parameter λ.
this latter referred to by saying that f is Lipschitz with constant C in its first argument,
then it is known that for each initial condition x(0) there exists a unique solution to (3.2).
3.10. We are interested in making sense of differential equations as above when there is
a stochastic term, that is we want to make sense of
x(t) = x(0) + ∫_0^t f(σ, x(σ)) dσ + ∫_0^t g(σ, x(σ)) dN(σ),   (3.3)
that is, f(t_1^+) is the limit at t_1 from the right and f(t_1^−) the limit from the left.
Functions x(t ) obeying the conditions above are said to be càdlàg (for continu à droite,
limite à gauche) or corlol (for continuous to the right, limit on the left).
3.13. In words, x(t ) is a solution in the sense of Itō if it behaves like a usual differential
equation (with ẋ = f (t, x)) when there is no jump in N (t ), and when there is a jump in
N (t ) at t = t1 , x(t ) jumps from its value right before t1 by an amount g (t1, x(t1− )). We
additionally require x(t ) to be continuous from the right with a well-defined limit from
the left.
3.14. Let us illustrate the above definition on a simple example. Consider the equation
dx = dN (t ), (3.4)
with initial condition x(0) = 0. When there is no jump in N , x(t ) remains constant
because f (t, x) = 0. When there is a jump in N at t1 , x(t ) jumps from its value before
the jump x(t1− ) to x(t1− ) + 1 since g = 1. It is clear that x(t ) = N (t ) in this case, which is
a comforting fact in view of (3.4).
3.15. Consider next the equation

dx = −x dt + dN,

with N(t) a Poisson counter of rate λ. When there is no jump, x(t) is a decaying exponential; when N(t) jumps, x(t) jumps by 1, since g(x) is in this case 1. We illustrate a sample path for this equation below.

[Figure: a sample path for the equation dx = −x dt + dN.]
3.16. Consider now the equation

dx = x dt + x dN.   (3.5)

When there is no jump, x(t) obeys ẋ = x, and when there is a jump, x(t) jumps by its value right before the jump, that is, x(t) doubles in size at each jump.
If t_1, t_2, . . . are the jump times for N(t) and x(0) = 1, the solution is

x(t) = e^t for 0 ≤ t ≤ t_1,
x(t) = 2e^t for t_1 < t ≤ t_2,
x(t) = 4e^t for t_2 < t ≤ t_3,
. . .
3.17. From (3.5), one should resist the urge to integrate the equation treating N (t ) as a
usual function and conclude that x(t ) = C e N (t )+t for a constant C . The ’solution’ obtained
in this fashion is not the Itō solution.
dx = f (x)dt + g (x)dN
3.18. If the vector field f(x) is well-behaved (e.g. not stiff), direct integration such as the Euler method can be used for f(x), and the interpretation of the probability of a jump during a time interval ∆t presented earlier can be used to handle g(x)dN. Explicitly, choose a small time-step ∆ > 0 (no larger than 10^{−3}, say). The probability of a jump between t and t + ∆ is λ∆. In view of this, the following integration scheme can be used (a code sketch follows the list).

1. Sample w uniformly at random on [0, 1].

2. If w ≤ λ∆ (there is a jump), set x(t + ∆) = x(t) + f(x(t))∆ + g(x(t)); otherwise set x(t + ∆) = x(t) + f(x(t))∆.
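A minimal Python sketch of this first integration scheme (the drift f, jump g, rate λ, initial condition and step size are placeholders to be supplied by the user):

    import numpy as np

    def simulate_psde(f, g, lam, x0, T, dt=1e-3, seed=0):
        """Integrate dx = f(x)dt + g(x)dN with an Euler step for the drift and a
        Bernoulli(lambda*dt) approximation for the probability of a jump in each step."""
        rng = np.random.default_rng(seed)
        n = int(T / dt)
        x = np.empty(n + 1)
        x[0] = x0
        for k in range(n):
            xk = x[k]
            jump = g(xk) if rng.random() <= lam * dt else 0.0   # step 2: w <= lambda*dt => jump
            x[k + 1] = xk + f(xk) * dt + jump                   # Euler step for the drift
        return x

    # Example: dx = -x dt + dN with rate lambda = 2 (cf. the sample path shown earlier)
    path = simulate_psde(f=lambda x: -x, g=lambda x: 1.0, lam=2.0, x0=0.0, T=5.0)
    print(path[-1])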
3.19. The second approach consists of first drawing the next jump time, which we know is exponentially distributed with parameter λ, integrating ẋ = f(x) up to that time, and then applying the jump x ← x + g(x).
Consider the equation

dx = f(t, x) dt + ∑_i g_i(t, x) dN_i,   (3.6)

where the vector fields f and g_i are differentiable and the N_i are independent Poisson counters of rates λ_i.
Let ψ : R^k → R^{k′} be a differentiable function; we seek to find an equation for dψ. To fix notation, we write dψ = h(t, x, dN_i); hence we seek to find an appropriate h. Given a particular realization of the processes N_i(t), we obtain a particular sample path x(t). An appropriate h is one such that, for the same realizations of the Poisson processes, the solution of (3.6) and the solution of dψ = h(t, x, dN_i) are the same, where solutions are taken in the Itō sense.
In between jumps, x(t) evolves according to dx = f(t, x)dt, and thus ψ(t, x) evolves according to dψ = (∂ψ/∂t) dt + ⟨∂ψ/∂x, f(t, x)⟩ dt. If one counter, say N_i, jumps at time t, recall that x(t) is càdlàg with x(t^+) = x(t^−) + g_i(t, x(t^−)). Hence we require ψ(t, x(t^+)) = ψ(t, x(t^−) + g_i(t, x(t^−))). From there, and because the probability of having two counters jump at the same time is zero, we conclude that

dψ(t, x) = (∂ψ/∂t) dt + ⟨∂ψ/∂x, f(t, x)⟩ dt + ∑_i (ψ(t, x + g_i(t, x)) − ψ(t, x)) dN_i.
3.23. Conservation of probability (that is, ∑_i p_i(t) = 1) requires that the columns of A sum to zero:

∑_i a_{ij} = 0.

Positivity of each entry of p(t) requires that

a_{ij} ≥ 0 for i ≠ j.

These two requirements together imply that a_{ii} ≤ 0. Matrices A satisfying the two conditions above are called intensity matrices or infinitesimal generators.
3.24. Consider the three states continuous-time Markov chain with pi = P (x(t ) = i )
evolving according to
(d/dt) p = A p,   where   p = (p_1, p_2, p_3)′   and

A =
[ −3    1     0  ]
[  2   −1    1.5 ]
[  1    0   −1.5 ]
We can represent the Markov chain graphically as a directed graph on the states {1, 2, 3}, with edges labeled by the transition rates (the off-diagonal entries of the matrix above); the diagram is omitted here.
3.25. The above representation of the Markov chain focuses on the evolution of the
probability that x(t ) be at a certain state. One might want to also have a representation
of the paths that x(t) follows. We can do so using Poisson driven stochastic equations. Starting slowly, consider the following two-state example:

dx = −2x dN,   x(0) ∈ {−1, 1},   (3.7)

where N(t) is a Poisson counter of rate λ; x(t) then stays in {−1, 1} and flips sign whenever N jumps.
3.26. Recall that, up to first order, the probability of a jump during the time interval ∆ is λ∆. Hence, using a reasoning similar to the one in Lecture 2, the probability that x(t) = 1 is modified by an amount −λ∆p_1, corresponding to jumps from 1 to −1, and by an amount λ∆p_{−1}, corresponding to jumps from −1 to 1. That is, dp_1 = −λp_1 dt + λp_{−1} dt. We can easily derive a similar equation for p_{−1} and we obtain

ṗ = [ −λ    λ ]
    [  λ   −λ ]  p.   (3.8)
3.27. We refer to (3.7) as a sample path description and to (3.8) as a probabilistic description
of a continuous-time, finite-state Markov process
3.29. Given a finite-state, continuous-time (FSCT) Markov chain with n states s_1, . . . , s_n evolving according to ṗ = Ap, where p_i is the probability of being in state s_i, we associate a vector-valued jump process which jumps between the unit vectors of R^n. Precisely, we associate to s_i the unit vector e_i. Observe that if x(t) = e_i, then

x(t) + (e_k − e_l) e_l′ x(t) = x(t)  if l ≠ i,   and   = e_k  if l = i.
3.30. Set G_{ij} = (e_i − e_j) e_j′. From the above we conclude that the Poisson driven process

dx = ∑_{i,j} G_{ij} x dN_{ij}   (3.9)

evolves in the set {e_1, e_2, · · · , e_n}. If we set the rates of the Poisson counters N_{ij} appearing in (3.9) to be

λ_{ij} = a_{ij},  i ≠ j,

then the probability that x(t) = e_i evolves according to ṗ = Ap, and (3.9) is thus a sample path description of the chain.
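A sketch of simulating this unit-vector representation and checking it against the probabilistic description ṗ = Ap (NumPy and SciPy's matrix exponential assumed; the intensity matrix is the three-state example of 3.24, and the time step and path count are arbitrary):

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[-3.0,  1.0,  0.0],
                  [ 2.0, -1.0,  1.5],
                  [ 1.0,  0.0, -1.5]])        # columns sum to zero, off-diagonals >= 0

    def simulate_state(A, i0, T, dt=5e-3, rng=None):
        """Simulate x(t) over the states {e_1,...,e_n}: jump j -> i occurs with rate a_{ij}."""
        rng = rng or np.random.default_rng()
        i = i0
        for _ in range(int(T / dt)):
            for j in range(A.shape[0]):
                if j != i and rng.random() < A[j, i] * dt:   # rate lambda_{ji} = a_{ji}
                    i = j
                    break
        return i

    rng = np.random.default_rng(0)
    T, n_paths = 2.0, 2000
    counts = np.bincount([simulate_state(A, 0, T, rng=rng) for _ in range(n_paths)], minlength=3)
    print(counts / n_paths)                        # empirical distribution of x(T)
    print(expm(A * T) @ np.array([1.0, 0.0, 0.0])) # p(T) = e^{AT} p(0): roughly matches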
3.31. Instead of assigning to each state in the FSCT Markov chain a unit vector in
R n , we can assign to it a real number and use Lagrange interpolation to define a Poisson
driven equation that only evolves in this finite set of numbers. Precisely, to each si , assign
pairwise distinct real numbers zi . Define
ϕ_{ij}(z) = 0 if z ≠ z_j,   and   ϕ_{ij}(z) = z_i − z_j if z = z_j.
3.33. Let us illustrate both representations on the FSCT Markov chain with probabilities
evolving according to
ṗ =
[ −2    0    3 ]
[  0   −1    0 ]
[  2    1   −3 ]  p.
3.34. We have now at our disposal a versatile modelling tool—that is, Poisson driven
stochastic differential equations—to describe in continuous time systems in which dis-
crete events occur at random times. We have also shown how to use that tool to represent
sample paths corresponding to a finite state, continuous-time Markov chain.
3.35. As was the case with discrete-time stochastic processes, recall that this sample path
description can be, from a certain point of view, understood as a convenient way to put
a measure on the usually hard to handle space of paths (that is, functions of time). In
the case of Poisson driven equations, this path space would be all the functions x(t ) for
which there exists a sequence of jumps so that x(t ) is a possible solution in the sense of
Itō.
3.36. While we do not want to work explicitly with the measure induced by a stochastic
differential equation, we nevertheless want to evaluate quantities that depend on it. We
show in this section and the next how to do so.
3.37. To this end, let us focus on the Itō equation (3.6). Recall that if N_i(t) is a Poisson counter of rate λ_i, then

E N_i(t) − λ_i t = 0.

The above property is often referred to in the literature as saying that the compensated Poisson process N_i(t) − λ_i t is a martingale with respect to its own filtration.
Let ∆ > 0; the probability that N_i(t) jumps between t and t + ∆ is independent of x(t). Hence, we have that

E(x(t + ∆) − x(t)) = E ∫_t^{t+∆} f(σ, x(σ)) dσ + ∑_i E ∫_t^{t+∆} g_i(σ, x(σ)) dN_i(σ).

Letting ∆ → 0, we obtain that dE x = E f(t, x(t)) dt + ∑_i E g_i(t, x(t)) λ_i dt. Hence

(d/dt) E x(t) = E f(t, x) + ∑_i E g_i(t, x) λ_i.
3.5.2 Examples
We illustrate the expectation rule on a couple of examples. In particular, we show how
it can be used in conjunction with the Itō rule to evaluate higher moments of x(t ).
3.38. Consider the equation dx = −x dt + dN, with N a Poisson counter of rate λ. The expectation rule yields

(d/dt) E x(t) = −E x(t) + λ,

from which we conclude that E x(t) = e^{−t} + λ. What is the variance of x(t)? Recall that var x(t) = E(x²(t)) − (E x(t))². In order to evaluate E x²(t), we use the Itō rule and the expectation rule. First, the Itō rule together with the expectation rule yields

(d/dt) E x²(t) = −2 E x²(t) + (2 E x(t) + 1) λ,

whose solution is

E x²(t) = (E x²(0) − λ) e^{−2t} + λ/2.
3.39. Consider now the system

dx = (−x + z) dt,   dz = −2z dN,

with z(0) ∈ {−1, 1} and N a Poisson counter of rate λ. Observe that z(t) ∈ {−1, 1} for all t. What is the variance of x(t)? Using the Itō rule, we get

dx² = −2x² dt + 2xz dt.

Hence, to apply the expectation rule to the above equation, we need E(xz). Hence, we use the Itō rule again to get

d(xz) = (−xz + z²) dt − 2xz dN.

Since z²(t) = 1, this equation did not introduce any new terms, and we can thus now use the expectation rule. We obtain the system of equations

(d/dt) E x² = −2 E x² + 2 E(xz),
(d/dt) E(xz) = −(1 + 2λ) E(xz) + 1.
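This linear system for the moments can be integrated directly. A small sketch in plain Python (simple Euler integration; λ and the initial moments are arbitrary choices consistent with x(0) = 0, z(0) = ±1):

    lam, dt, T = 1.0, 1e-3, 10.0
    m2, mxz = 0.0, 0.0                     # E x^2(0) and E x(0)z(0), assuming x(0) = 0
    for _ in range(int(T / dt)):
        m2 += dt * (-2.0 * m2 + 2.0 * mxz)
        mxz += dt * (-(1.0 + 2.0 * lam) * mxz + 1.0)
    print(mxz, m2)   # steady state: E(xz) = 1/(1 + 2*lam) and E x^2 = E(xz)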
The equation dx = −2x dN, for instance, is such that x(t) ∈ {−1, 1} if x(0) ∈ {−1, 1}, but this does not hold if x(0) is initialized outside of that set.
The equation
dx = xdt + dN ; x(0) = a > 0
has a density whose support is lower bounded by a.
3.40. Keeping in mind that for a given PSDE (Poisson-Driven Stochastic Equation),
determining what the support of the density of x(t ) is can itself be challenging, we show
how to derive an equation for the density assuming it exists.
We consider the general time-invariant PSDE:
dx = f(x) dt + ∑_{i=1}^{n} g_i(x) dN_i.

Let A_t be a σ-field and assume that there exists a differentiable function ρ(t, x) such that for all A ∈ A_t,

P(x(t) ∈ A) = ∫_A ρ(t, x) dx.
Let ψ : R^k → R be a smooth function; it is commonly referred to as a test function in this context. From the Itō rule, we obtain

dψ = ⟨∂ψ/∂x, f(x)⟩ dt + ∑_i (ψ(x + g_i(x)) − ψ(x)) dN_i.
Applying the expectation rule and writing the expectation in terms of the density, the left-hand side of the resulting equation is

(d/dt) E ψ(x) = (d/dt) ∫ ψ(x) ρ(t, x) dx = ∫ ψ(x) (∂/∂t) ρ(t, x) dx.
3.41. Because ψ is arbitrary, we can replace the integral equation (3.10) by the differential equation

(∂/∂t) ρ(t, x) = −(∂/∂x)(f(x) ρ(t, x)) + ∑_i λ_i ( ρ(t, g̃_i^{−1}(x)) |det(I + ∂g_i/∂x)|^{−1}_{g̃_i^{−1}(x)} − ρ(t, x) ),

where g̃_i(x) ≜ x + g_i(x). This is the density equation for PSDEs, provided that ρ(t, x) exists and is smooth enough, and provided that g̃_i is one-to-one. If it is not one-to-one, one can easily modify the argument to take into account the set of inverse images g̃_i^{−1}(x). We do not write it explicitly here.
While it might appear unnatural at first, the Fokker-Planck equation makes sense on an intuitive level and can be obtained using simple, though hardly rigorous, means. We give such an explanation here, first in the case of a process evolving in R, before extending it to more general processes.
Consider the equation

dx = f(x) dt + g(x) dN,

and let x_0 ∈ R be fixed. Let ε > 0 be small and consider the interval I = [x_0 − ε/2, x_0 + ε/2]. To save space, we write b = x_0 + ε/2 and a = x_0 − ε/2. We can interpret the density for x around a given point as the number of sample paths that are around that point, that is:

ρ(t, x_0) ε "=" (number of sample paths in I at time t) / (total number of paths),

where the quote signs around the equality sign indicate that we are not attempting to make sense of these quantities rigorously.
The density equation can be obtained by taking the difference in the number of paths
that enter I and the number of paths that leave I . To this end, we analyze all the ways
in which paths can leave or enter I .
There are the sample paths which do not jump around time t . These evolve according
to f (x)dt . We thus need to evaluate ’how many’ paths enter I following f (x) minus how
many leave I . For example, if x 0 = 0 and f (x) = −x, it is clear that paths are entering I
from the left and the right of the interval and hence ρ(t, 0) increases under the effect of
this term. For a short time interval ∆, a sample path will move under the effect of f by an
amount f ∆. We thus have that the number of paths entering/leaving at b during a short
time interval ∆ is the number of path that are within a distance f (b)∆ from b. There are
ρ(t, b)f (b)∆ such paths. The number of paths leaving/entering at a is f (a)ρ(t, a)∆. Now
regarding the sign to give to each contribution, observe that if f (b) points to the right
(that is, is positive), it points away from x 0 and thus its effect is to decrease the number
of paths in I . Reciprocally, if f (a) points to the right, its contribution is to increase the
number of paths in I . Hence, the contributions at the boundaries have opposite signs.
Putting these together, we find that

(∂/∂t) ρ(t, x) = −(∂/∂x)(ρ(t, x) f(x)).
Due to jumps of paths that are in I right before t, the number of paths in I decreases
by ρ(t, x)λ.
We now need to account for paths that are outside of I right before t, jump at t, and whose jumps are such that they land in I. To do so, consider a couple of examples. Assume that g(x) = 1. The jumps are all of magnitude 1. Hence, the paths that were at (x_0 − 1) ± ε/2 right before t and jump are contributing to the total number of paths. That term is ρ(t, x_0 − 1)λε. More generally, all the paths that are in the inverse image of I under x + g(x) (the effect of the jump) contribute to increasing the number of paths in I. For example, assume that g(x) = x; then the paths that are in x_0/2 ± ε/4 and jump contribute to paths in I. In general, we thus have to look at the inverse image under the function x + g(x) of the interval/region I. The volume of this interval/region is related to the volume of I by the absolute value of the determinant of the Jacobian of x + g(x), a fact usually seen in multivariable calculus. This corresponds to the term λ ρ(t, g̃^{−1}(x)) |det(∂g̃/∂x)|^{−1}.
3.43. Putting all these effects together, that is, the effect of the drift f(x), the effect of paths that enter I through jumps, and the effect of paths that leave I through jumps, we obtain the Fokker-Planck equation.
3.44. The derivation of the effect of the drift term can be made more rigorous and extended to the case of a density in R^n using Stokes' theorem. It usually goes under the name of Liouville equation or continuity equation. The idea is as follows. Consider the differential equation

ẋ = f(x)

with f a smooth vector field, and let ρ(t, x) be a density (you might think of it as a density of particles or a compressible fluid) that evolves following the vector field. By this, we mean that if there is a distribution of initial conditions ρ(0, x) for the differential equation ẋ = f(x), and if we solve the equation for each initial condition, the distribution of solutions at time t is given by ρ(t, x).
3.45. We have the following balance equation: if S is a closed surface enclosing a volume V in R^n and n(x) is the unit normal vector of S at x pointing outward, then

(d/dt) ∫_V ρ(t, x) dx = −∫_S (ρ(t, x) f(x)) · n dS.

The above equation says that the change in what is inside V is what comes in minus what comes out through the boundary S. Stokes' theorem says that

∫_S (ρ(t, x) f(x)) · n dS = ∫_V (∂/∂x) · (ρ(t, x) f(x)) dx.
Figure 3-3: In the interval I = [a, b] centered at x_0, there are roughly ρ(t, x_0)(b − a) sample paths at time t. We consider the sample path equation dx = −x dt + 2x dN. We depict in dashed line examples of paths that enter I at t and in dotted line examples of paths that exit I at t. There are two mechanisms through which paths enter/exit I: through the drift term (these are the paths around a and b) and through jumps. The density equation can be obtained by taking the number of entering paths minus the number of exiting paths.
Because the above holds for any V, we obtain the same relation as above.

For example, for the equation dx = −x dt + dN_1 − dN_2, with N_1 and N_2 independent Poisson counters of rate λ, the density equation reads

(∂/∂t) ρ(t, x) = (∂/∂x)(x ρ(t, x)) + λ ρ(t, x − 1) − 2λ ρ(t, x) + λ ρ(t, x + 1).

This equation can be solved using the Fourier transform.
3.47. PSDEs appear very frequently in the modelling of queues. In that context, knowledge of how the density for the length of the queue evolves is very useful to understand queue dynamics and ensure quality of service. We now investigate a simple example. Models in which the size of the queue is a real number are often referred to as fluid queue models.
Consider the following queue, in which tasks (or customers) are processed at rate 1/µ (µ is the mean service time). The arrival of new tasks in the queue is modelled as a Poisson process N(t) with rate λ. Denote by 1_{R+}(x) the indicator function of the positive reals, that is

1_{R+}(x) = 1 if x > 0, and 0 otherwise.

The size x(t) of the queue then evolves according to

dx = −(1/µ) 1_{R+}(x) dt + dN.
We illustrate a sample path for the evolution of the queue below.

[Figure: a sample path of the queue length x(t).]

The density equation is

(∂/∂t) ρ(t, x) = (1/µ)(∂/∂x)(1_{R+}(x) ρ(t, x)) + λ (ρ(t, x − 1) − ρ(t, x)).
First, observe that there might be a non-zero chance that the queue is empty. Taking expectations of the sample path equation, we see that

(d/dt) E x = −(1/µ) E 1_{R+}(x) + λ.

Hence, in steady state, E 1_{R+}(x) = λµ. Observe that this says that the probability that the queue is not empty is λµ. (In general, recall that the expectation of the indicator function of a set A is equal to the measure of that set A.) Hence, in steady state, the probability that the queue is empty is 1 − λµ. Of course, the above ceases to make sense if λµ > 1, in which case the assumption that a steady state exists is incorrect. This
agrees with our intuition that if the arrival rate λ is larger than the service rate 1/µ, the size of the queue will grow indefinitely.
Let us now focus on finding the steady-state density for the size of the queue. We know that the steady state has a δ measure at the origin of strength 1 − λµ. We thus look for a steady-state solution whose continuous part on (0, 1) is of the form ρ(x) = β e^{λµx}, and one finds

β = (1 − λµ) λµ.

We thus have a solution for the density over [0, 1). Now at x = 1, the density equation is

(1/µ) dρ/dx − λρ = λ(1 − λµ) δ(x − 1).

Hence, at x = 1, the solution jumps by an amount −λµ(1 − λµ). Now, on the interval (1, 2), we have

(1/µ) dρ/dx − λρ = (1 − λµ) λµ e^{−λµ} e^{λµx},   ρ(1^+) = λµ(1 − λµ)(e^{λµ} − 1).

This equation can be solved using standard means. The complete solution is obtained by continuing to integrate the equation piece by piece, as we have done above.
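The steady-state probability 1 − λµ that the queue is empty is simple to confirm by simulating the sample path equation. A sketch (λ and µ are arbitrary values with λµ < 1; the time step and horizon are crude choices):

    import numpy as np

    lam, mu = 0.5, 1.2          # arrival rate and mean service time, lam*mu = 0.6 < 1
    dt, T = 1e-2, 2000.0
    rng = np.random.default_rng(0)

    x, time_empty, t_meas = 0.0, 0.0, 0.0
    for k in range(int(T / dt)):
        if x > 0:
            x = max(x - dt / mu, 0.0)      # service at rate 1/mu while the queue is non-empty
        if rng.random() < lam * dt:        # arrival: jump of +1
            x += 1.0
        if k * dt > 100.0:                 # discard the transient
            t_meas += dt
            time_empty += dt * (x == 0.0)

    print(time_empty / t_meas, 1.0 - lam * mu)   # both ~ 0.4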
3.48. Consider again the system dx = (−x + z) dt, dz = −2z dN, with z(0) ∈ {−1, 1}. This system combines discrete and continuous variables; we derive the density equation for it.
Furthermore, g̃ = [x, z]′ + g(x, z) = [x, −z]′. Hence, g̃^{−1}([x, z]′) = [x, −z]′. Finally, observe that |det(I + ∂g/∂x)| = |−1| = 1. The density equation is thus

(d/dt) ρ(t, x, z) = −(∂/∂x)((−x + z) ρ(t, x, z)) + λ (ρ(t, x, −z) − ρ(t, x, z)).
Because z takes on the values −1 or +1, we can write the above equation as a system of equations with ρ_+(t, x) = ρ(t, x, 1) and ρ_−(t, x) = ρ(t, x, −1):

(∂/∂t) ρ_+(t, x) = (∂/∂x)((x − 1) ρ_+(t, x)) + λ (ρ_−(t, x) − ρ_+(t, x)),
(∂/∂t) ρ_−(t, x) = (∂/∂x)((x + 1) ρ_−(t, x)) + λ (ρ_+(t, x) − ρ_−(t, x)).
It is a good exercise to derive the above system of equations using the intuitive approach outlined above.
3.49. We are given a discrete-time, finite state Markov process whose probabilities evolve
according to
p(t + 1) = Ap(t )
where A is a given stochastic matrix. 1 For example, think of p as containing the probability
of sunny, overcast or rainy weather. We can think of the entry ij of A as being, e.g., the probability that it is rainy (corresponding to state i) tomorrow given that it is overcast (corresponding to state j) today. The backwards equation answers questions of the type: what is the probability that it was rainy yesterday given that it is overcast today?
3.50. Let us assume that there exists a backwards process, that is, a matrix Ã such that

p(t − 1) = Ã p(t).

Assuming that we are at state ω_i at time t + 1, we know that we could have ended up at this state coming from any state ω_j at time t, with probability

P(x(t + 1) = ω_i | x(t) = ω_j) / ∑_j P(x(t + 1) = ω_i | x(t) = ω_j).
1
Recall that A is stochastic if the columns of A sum to one and it has positive entries. If p is a probability
vector (that is the entries of p sum to one and are positive), then Ap is also a probability vector if A is
stochastic.
3.52. In the case of a finite-state process, one might take a direct approach as we now
explain. First, assume that the probabilistic description of the process is given by
p˙ = Ap
But p(x(t + τ) = x_j | x(t) = x_i) is nothing else than the ji-th entry of e^{Aτ}. Set ϕ(τ) = e^{Aτ}; we thus have that

E(x(t) x(t + τ)) = ∑_{i,j} p_i(t) ϕ_{ji}(τ) x_i x_j.
The idea is to write this equation at time τ, multiply by x(t), and take expectations. Because whether N(τ) jumps or not is independent of the value of N(t) for t < τ, the expectation E(x(t) dN(τ)) is simply E(x(t)) λ dτ if N has rate λ. Precisely, for τ > t,

(d/dτ) E(x(t) x(τ)) = E(x(t) f(x(τ))) + ∑_i E(x(t) g_i(x(τ))) λ_i.
Consider again

dx = −x dt + dN,

with x(0) = 0 and N a Poisson counter of rate λ. We evaluate lim_{t→∞} E x(t)x(t + τ). We evaluated the mean and variance of x for this process earlier and found

E x(t) = e^{−t} + λ,   E x²(t) = λ/2 − λ e^{−2t}.
(d/dτ) E(x(t) x(t + τ)) = −E(x(t) x(t + τ)) + E x(t) λ.

This is a linear differential equation that can easily be solved explicitly.
3.57. Consider the stochastic process z(t) which takes values +1 and −1 and whose probability vector p(t) = (P(z(t) = 1), P(z(t) = −1))′ evolves according to

ṗ = [ −a    b ]
    [  a   −b ]  p.
lim_{t→∞} E(x(t) x(t + τ)) = e^{−|τ|}.
3.58. We start by exhibiting a sample path description for y(t ). From the previous
sections, we deduce that
dx = (α − x) dN_{αβ} + (β − x) dN_{βα}
evolves in {α, β} if initialized in that set. Indeed, if x = α and dN α β jumps, then x does
not change. If dN βα jumps, then x(t − ) changes by an amount ( β − α) and thus x(t ) = β.
The rate of change from α to β is the rate of change from 1 to −1 for z (t ). Hence we
find that the rates of N α β and N βα are respectively λ α β = b and λ βα = a.
Using the expectation rule, we have

(d/dt) E x(t) = aβ + αb − (a + b) E x(t).

Hence lim_{t→∞} E x(t) = (aβ + αb)/(a + b), and thus the required condition is aβ + αb = 0.
3.59. Let us now evaluate E x²(t), which is necessary for answering the second question, as we will see. To this end, we use the Itō rule to get

dx² = ((x + (α − x))² − x²) dN_{αβ} + ((x + (β − x))² − x²) dN_{βα},

so that

(d/dt) E(x²(t)) = (α² − E(x²)) b + (β² − E(x²)) a = aβ² + bα² − (a + b) E(x²),

whose solution is

E(x²(t)) = C e^{−(a+b)t} + (aβ² + bα²)/(a + b).
3.60. To compute the correlation, we write the sample path equation with time variable
τ:
For τ > t, whether a counter jumps at time τ is independent of x(t), so E(x(t) dN_{βα}(τ)) = a E(x(t)) dτ, and similarly for other combinations. Using aβ + αb = 0, we obtain

(d/dτ) E(x(t) x(τ)) = −(a + b) E(x(t) x(τ)),

and thus

E(x(t) x(τ)) = E(x²(t)) e^{−(a+b)(τ−t)}.
3.61. If we now assume that τ < t, we can use the same reasoning (starting from E(x²(τ)) and integrating up to t) to get

E(x(t) x(τ)) = E(x²(τ)) e^{−(a+b)(t−τ)}.
3.62. We now go back to our original notation and change τ to t + τ; then E(x(t) x(τ)) becomes E(x(t) x(t + τ)) and we obtain

E(x(t) x(t + τ)) = (C e^{−(a+b)t} + (aβ² + bα²)/(a + b)) e^{−(a+b)τ}   for τ > 0,
E(x(t) x(t + τ)) = (C e^{−(a+b)(t+τ)} + (aβ² + bα²)/(a + b)) e^{(a+b)τ}   for τ < 0.
Lecture 4
Dynamic programming and optimal control
We now address optimal control of discrete and continuous-time Markov chains as well as Itō differential equations. We start by establishing a few known results regarding optimal control of deterministic systems.
4.1. At the basis of dynamic programming is the so-called principle of optimality. As any
good principle, it is almost a tautology:
From any point on an optimal trajectory, the remaining trajectory is optimal for the problem
initiated at this point
4.2. Example: Shortest path. We illustrate the use of this principle on the problem of
finding the shortest path in a graph. Consider the following graph:
[Graph: a layered graph with source S, intermediate layers {A1, A2, A3}, {B1, B2, B3}, {C1, C2, C3}, and sink T; the edge lengths used in the computation below are read off this figure.]
We want to find the shortest path from S to T. The principle of optimality tells us that
if S Ai B j C k T is a shortest path, then C k T is the shortest path from C k to T . Now there
is only one path from C γ to T for γ = 1, 2, 3. Let us record the shortest distance in a
function V . Hence
V (C 1 ) = 2; V (C 2 ) = 3 and V (C 3 ) = 1.
Applying the principle of optimality again, we know that the path B_1 C_γ T is a shortest path from B_1 to T only if C_γ T is a shortest path from C_γ to T. Now because there is only one path from C_γ to T, it is the shortest. There are, on the other hand, three paths from B_1 to T, passing through either C_1, C_2 or C_3. The shortest path from B_1 to T will be such that the distance from B_1 to C_γ plus the distance from C_γ to T is minimized. This latter distance is V(C_γ). Hence, the shortest distance from B_1 to T is min[4 + 2, 1 + 3, 3 + 1] = 4. Hence V(B_1) = 4. Now, a similar analysis yields V(B_2) and V(B_3). We summarize the result:

V(B_1) = 4,  V(B_2) = 5  and  V(B_3) = 3.
Applying the principle of optimality again, the shortest path from A_1 to T will contain the shortest path from some B_β to T. We have recorded the lengths of these paths in the function V. Hence V(A_1) is easily evaluated as the minimum over β of the distance from A_1 to B_β plus V(B_β). We get

V(A_1) = 7,  V(A_2) = 7  and  V(A_3) = 7.
Hence the shortest distance from A α to T is seven for all α. Applying the principle of
optimality one last time, we get that the shortest path from S to T will contain the shortest
path from A α to T. Hence we found that a shortest path from S to T is S A 1B 2C 2T , with
length 8.
4.3. The main operation we performed in the derivation above was of the type

V(A_α) = min_β [ (distance from A_α to B_β) + V(B_β) ].
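This backward recursion is a one-liner to implement. The Python sketch below runs it on a small layered graph; only the C→T lengths (2, 3, 1) and the B1→C lengths (4, 1, 3) are taken from the text above, while the remaining edge lengths are hypothetical stand-ins (the full figure did not survive extraction).

    # Hypothetical layered graph; see the caveat above about which lengths are from the notes.
    edges = {
        "S":  {"A1": 1, "A2": 3, "A3": 7},
        "A1": {"B1": 3, "B2": 2, "B3": 5},
        "A2": {"B1": 6, "B2": 7, "B3": 2},
        "A3": {"B1": 6, "B2": 5, "B3": 4},
        "B1": {"C1": 4, "C2": 1, "C3": 3},
        "B2": {"C1": 2, "C2": 2, "C3": 5},
        "B3": {"C1": 3, "C2": 5, "C3": 2},
        "C1": {"T": 2}, "C2": {"T": 3}, "C3": {"T": 1},
        "T":  {},
    }

    V, best = {"T": 0}, {}
    order = ["C1", "C2", "C3", "B1", "B2", "B3", "A1", "A2", "A3", "S"]
    for node in order:                       # backward pass: last layer first
        nxt = min(edges[node], key=lambda m: edges[node][m] + V[m])
        best[node] = nxt
        V[node] = edges[node][nxt] + V[nxt]

    print(V["S"])                            # length of a shortest path from S to T
    node, path = "S", ["S"]
    while node != "T":
        node = best[node]
        path.append(node)
    print(path)                              # one shortest path, recovered from the argmins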
4.4. We derive an equation from the above principle in the case of an n stage optimization
problem. Let x t be the state of the process at time t , u t be our control at time t and
assume that the process obeys an evolution equation
x t +1 = a(x t , ut , t )
The cost to be minimized is of the form

J = ∑_{t=0}^{n−1} c(x_t, u_t, t) + J_n(x_n).

Define

V(t, x) = inf_{u_t, . . . , u_{n−1}} [ ∑_{s=t}^{n−1} c(x_s, u_s, s) + J_n(x_n) ]   for x_t = x.
In the example of the shortest path problem, V is nothing else than the shortest distance
from x to the target point.
4.5. Observe that J is the sum over the time variable of functions of (x t , u t , t ). We say
that the cost function is separable to emphasize the fact that it can be written as such a
sum. The function V is called the value function.
4.6. We have

V(t, x) = inf_{u_t, . . . , u_{n−1}} [ ∑_{s=t}^{n−1} c(x_s, u_s, s) + J_n(x_n) ]
        = inf_{u_t} inf_{u_{t+1}, . . . , u_{n−1}} [ c(x_t, u_t, t) + ∑_{s=t+1}^{n−1} c(x_s, u_s, s) + J_n(x_n) ]
        = inf_{u_t} [ c(x_t, u_t, t) + inf_{u_{t+1}, . . . , u_{n−1}} ( ∑_{s=t+1}^{n−1} c(x_s, u_s, s) + J_n(x_n) ) ].

Observe that when x_{t+1} = a(x_t, u_t, t) is given, inf_{u_{t+1}, . . . , u_{n−1}} [ ∑_{s=t+1}^{n−1} c(x_s, u_s, s) + J_n(x_n) ] is V(t + 1, a(x, u, t)). Since x_{t+1} = a(x_t, u_t, t), we obtain that the value function obeys the recursion

V(t, x) = inf_u [ c(x, u, t) + V(t + 1, a(x, u, t)) ]

for t < n and with final boundary condition V(n, x) = J_n(x). This equation is called the Bellman equation.
4.7.
Example 4.1. You have started your life with a lofty inheritance of M dollars placed in a trust
fund that gives you a yearly income of q M dollars, where q is the interest rate. You are averse to
working, but can add to the capital in order to increase your income. You cannot take money out
of the capital, however. How much money should you save or spend at time t in order to maximize your total spending over your lifetime, which you estimate to be n?
We first derive the evolution equation for the income. Our income in year 0 is x_0 = qM. If we spend u_0, we can add x_0 − u_0 to the capital, and our income in year 1 is q(M + (x_0 − u_0)) = x_0 + q(x_0 − u_0). In year 2, if we consumed u_1 the previous year and added x_1 − u_1 to the capital, our income is q(M + (x_0 − u_0) + (x_1 − u_1)) = x_1 + q(x_1 − u_1). We can thus cast the problem as a dynamic programming problem where the evolution equation is

x_{t+1} = x_t + q(x_t − u_t)
and the objective is to maximize

J = ∑_{t=0}^{n−1} u_t.
We define the value function V(t, x) as being the maximal consumption we can afford from time t on if our income is x at time t. Bellman's equation tells us that

V(t, x) = max_{u_t} [ u_t + V(t + 1, x + q(x − u_t)) ].
Since by time n you cannot consume anymore, V(n, x) = 0. Solving the problem backwards from t = n, we get
V(n − 1, x) = max_{u_{n−1}} u_{n−1}.

Because u_{n−1} is bounded above by x (you cannot spend more than you earn), we obtain V(n − 1, x) = x. Next, we have

V(n − 2, x) = max_{u_{n−2}} [ u_{n−2} + V(n − 1, x + q(x − u_{n−2})) ] = max_u [ u + x + q(x − u) ].

Because the function we seek to maximize is linear in u, the optimal u will be either 0 or x. Hence V(n − 2, x) = max(1 + q, 2) x, the first entry in the max corresponding to u = 0 and the second to u = x.
We postulate that V(t, x) = k_t x, that is, V(t, x) is linear in x with a coefficient k_t. We will try to find a recurrence equation for k_t. We have that

V(t, x) = max_{u_t} [ u_t + k_{t+1}(x + q(x − u_t)) ] = max(k_{t+1}(1 + q), k_{t+1} + 1) x,

where we again used the fact that the maximum is obtained at u = 0 or u = x. Observe that the right-hand side is of the form k_t x, and thus our hypothesis is verified and we have the recurrence

k_t = max(k_{t+1}(1 + q), k_{t+1} + 1),   k_n = 0.
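The recurrence for k_t, together with the rule "spend everything (u = x) when k_{t+1} + 1 ≥ k_{t+1}(1 + q), i.e. when k_{t+1} ≤ 1/q, and save everything (u = 0) otherwise", is immediate to compute. A small sketch (q and n are arbitrary illustrative values):

    q, n = 0.1, 30                       # interest rate and horizon (arbitrary)
    k = [0.0] * (n + 1)                  # V(t, x) = k_t * x, with k_n = 0
    policy = [None] * n
    for t in range(n - 1, -1, -1):
        save, spend = k[t + 1] * (1 + q), k[t + 1] + 1
        k[t] = max(save, spend)
        policy[t] = "save" if save > spend else "spend"

    print(k[0])                          # total consumption per unit of initial income
    print(policy)                        # save early, spend during roughly the last 1/q years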
We now turn to the stochastic setting, in which the state evolves according to controlled transition probabilities P(x_{t+1} | x_t, u_t).
4.8. Observe that the expectation is taken with respect to a probability transition which
is itself dependent on the sequence of controls that is applied.
4.9. We so far assume perfect state observation, that is we observe the state x t at time
t and make a control choice ut based on that observation. Observe that, by convention,
the control u 0 takes us from time 0 to time 1, hence an n stage optimization problem
starting from an initial condition x 0 and ending at time n will require n decisions to be
made (or controls to be applied) u 0, . . . , u n−1 .
4.10. It is customary in this context to call the optimal control sequence u an optimal
policy, and use the letter π to denote it. That is, π = (u 0, . . . , u n−1 ). We denote by πt the
optimal policy from t on: πt = (u t , . . . , u n−1 ).
We define the value function V(t, x) as the optimal expected cost to go given that we start at x at time t. Because the process is Markov, the expected cost-to-go only depends on the current state and control and not on the past history of the process.
4.13. We now focus on the last term E(∑_{s=t+1}^{n−1} c(x_s, u_s, s) + J_n(x_n) | x_t = x, π_t), which we abbreviate as E(J_{t+1} | x_t = x, π_t). Let α_i range over the possible values of the state of the Markov chain. We can write this expectation as

∑_{α_{t+1}} · · · ∑_{α_n} P(x_{t+1} = α_{t+1}, · · · , x_n = α_n | x_t = x, π_t) J_{t+1},

where the sums range over all possible states for x_{t+i} (the sums can be replaced by integrals in the continuous state-space case). We now use Bayes' rule with respect to x_{t+1} to obtain

∑_{α_{t+1}} · · · ∑_{α_n} P(x_{t+2}, · · · , x_n | x_{t+1}, x_t = x, π_t) P(x_{t+1} | x_t = x, π_t) J_{t+1}.
We omitted the αt in the above equation to make it more readable. From now on, we
will explicitly write them only when they help the understanding. We now make two
observations:
• P (x t +1 |x t , πt ) only depends on u t and not on controls ut +1 etc. since they affect
transitions happening after t + 1. Hence
P (x t +1 |x t , πt ) = P (x t +1 |x t , ut )
P (x t +2 · · · x n | x t +1, x t = x, πt ) = P (x t +2 · · · x n | x t +1, πt +1 ),
65
4 Dynamic programming and optimal control
∑
= inf *.c (x, u, t ) +
u
E
inf ( Jt +1 |x t +1 = x αt +1, πt +1 )P (x t +1 = x αt +1 |x t , u)+/
πt +1
,( αt +1 -
∑ )
= inf c (x, u, t ) + V (t + 1x t +1 )P (x t +1 |x t , u)
u
E
= inf (c (x, u, t ) + V (t + 1, x t +1 )|x t , u))
u
We thus obtain
pt +1 = A(u)pt
where p(0) = p 0 is given and the transition matrix depends on a control variable u.
We label the states x 1, x 2, . . . , x n . We have the same cost function as above. Recall that
according to the convention used in this course (that is, the columns of A sum to one),
the j th column of A represents the probability distribution of the next state in the process
given that we are currently at state x j . That is, Ai j (u) = P (x t +1 = i | x t = j, u). Hence, if
we write the function V (t + 1, x) as a row vector
V (t + 1) = [V (t + 1, x 1 ),V (t + 1, x 2 ), . . . ,V (t + 1, x n )] ,
66
4.2 Markov decision processes
pt +1 = A(u)pt
where [ ]
.5 + u .7
A(u) =
.5 − u .3
′
where pt = p(x(t ) = a, p(x t ) = b
We seek to maximize the probability of being in state x 1 at time 3 and while exerting
the least effort in expectation. We thus introduce the function
∑
3
J = u t2 − ηp(x t = a)
t =0
where η is a positive real parameter and u ∈ [−.2, .2]. We start from state x(0) = a.
Define the value function V (t, x) as being the least achievable cost from state x at time
t . Then clearly, V (3, a) = −η and V (3, b) = 0. The value function obeys the recursion
[ ]
V (t, x) = inf u 2 +
u
E (V (t + 1, x1)|xt = x, u) .
Let us fix η = 1. Hence V (2, a) = inf u u 2 + (.5 + u)(−1) + (.5 − u)0. The minimum
is obatined at u = 0.2 and V (2, a) = −0.66. Similarly, V (2, b) = inf u u 2 + .3(−1) + .7 .
The minimal value is obtained at u = 0, for which V (2, b) = .4 Observe that if we are in
state b, our transitions are independent of u, and thus we can conclude that the optimal
control when we are in state b is zero, for all t : u(t, b) = 0.
The optimal controls for time 1 and 0 are obtained in a similar fashion.
67
4 Dynamic programming and optimal control
pt +1 = Apt ,
but now we have the option to stop the process at anytime. If we stop at time t and we
are at state x, the reward is r (t, x).
4.19. A typical reward would emphasize being at a certain state while penalizing long
E
waits. We seek to maximize (r (t, x)) over all stopping times 0 ≤ τ ≤ n.
4.20. Let V (t, x) be the optimal expected reward if we start from x at time t . This reward
E
is either r (t, x), if we decide to stop the process at t , or it is pt +1V (t + 1, x). The Bellman
equation is thus
E
V (t − 1, x) = max(r (t − 1, x), [V (t, x) | x t = x]
4.21. Starting with the boundary condition V (n, x) = r (n, x), we can evaluate V (t, x) for
t < n. The optimal stopping rule is thus the first time the expected reward is less or
equal than the current reward:
4.22. Example: The secretary problem. We now illustrate the above on the well-
known secretary problem. Consider the following game: your opponent has n tickets in
his hands with one number written on each ticket. You have no information about the
numbers. The deck is shuffled. Your opponent shows you the first ticket and you have to
decide on the spot if you select it or not. If you do not select it, he shows you the second
68
4.2 Markov decision processes
ticket and you again have to decide, etc. Once you discard a ticket, it cannot be selected
again. You win the game if you select the ticket with the highest number written on it.
The name of this example comes from the problem of interviewing candidates for a
secretary (or any other job) position. If you have n candidates, you want to pick the best
one by hiring her/him on the spot.
How to maximize your chance of winning/your probability of hiring the best candidate?
1. Observe that any stopping strategy that does not make use of the observation and
decides before hand to pick the k th ticket has a chance 1/n of winning.
2. It might not be obvious at first that one could do better, but consider the following
scheme: you discard the first n/2 tickets (assuming wlog that n is even), but record
the number on them, then pick the first ticket whose value is larger than any of the
first n/2. Notice that picking a ticket whose value is smaller than any of the first
n/2 uncovered tickets is surely a losing move. What is the probability of winning
of such a scheme?
3. If the ticket with highest number (let us call it the largest ticket) was in the first n/2
tickets, we lose. This happens with probability 1/2. If the second largest ticket is in
the first n/2 ticket, and the first one is not, we win. This happens with probability
1
4 . Hence with this simple scheme, we already have increased our probability of
winning to a constant factor of 14 ! We have the bounds
1 1
≥ P (winning) ≥ .
2 4
4. Let us find the optimal stopping time using dynamic programming. The first step
is to realize that there is an underlying Markov process in the problem, which
corresponds to how the tickets are shuffled. Define x t so that x t = 1 if the largest
ticket so far if the t th ticket, and 0 otherwise. Because we do not know anything
about the tickets a priori, this process is indeed Markov: the probability that the
current ticket is larger than the previous tickets is 1/t . We thus have a two-state
Markov process with probabilities P (x t ) being independent:
1 t −1
P (x t = 1) = , P (x t = 0) = . (4.1)
t t
5. The probability that we have the largest ticket in hand is zero if x t = 0. If x t = 1,
the probability that we have the largest ticket in hand, given our past observation
is t /n. To see this, we use Bayes’ rule with the following events: At is the event “I
have the largest ticket in hand at time t " and Bt is the event “I have the largest of
the first t tickets at time t . Hence, the probability of winning if we are at x t = 1
(which corresponds to event Bt ):
P (At ) 1/n
P (At | Bt ) = P (Bt | At ) =1 .
P (Bt ) 1/t
69
4 Dynamic programming and optimal control
6. The value function V (t, x) is the expected probability of winning given that I am
in state x at time t . Hence, if if have the largest ticket so far x = 1 at time t = n
(end -time), I have won and thus V (n, x = 1) = 1. Reciprocally, if I do not have the
largest ticket x = 0 at the end, I have lost with certainty: V (n, 0) = 0.
7. We can now set-up the Bellman recursion V (t − 1, x) = max(r (t − 1, x), V (t, x))E
where the expectation is taken with respect to the distribution given in (4.1). This
yields:
E
V (t − 1, 0) = max(0, V (t, x)) = t −1 t V (t, 0) + t V (t, 1)
1
E
V (t − 1, 1) = max(r (t − 1, 1), V (t, x)) = max( t −1 , t −1V (t, 0) + 1V (t, 1))
n t t
t −1
= max( n ,V (t − 1, 0))
(4.2)
and the optimal strategy is to take another ticket if x t = 0, or if x t = 1 and t/n <
V (t, 0). We stop and keep the ticket we have if x t = 1 and nt ≥ V (t, 0).
8. Because V (t, 1) ≥ V (t, 0) (from the second equation in (4.2)), we conclude from
the first equation in (4.2) that V (t, 0) > V (t + 1, 0).
Hence, the first argument in the max defining V (t, 1) is increasing to one whereas
the second is decreasing to zero. There exists thus a t ∗ at which they will cross,
that is there exists a stopping time t ∗ such that the optimal strategy is to discard
any ticket for the first t ∗ steps and then accept the first one that is larger than the
first t ∗ .
9. We can find this t ∗ by solving the Bellman recursion. For t > t ∗ , we have that
V (t, 1) = t/n, hence using the first equation in (4.2):
1t t −1 1 t −1
V (t − 1, 0) = + V (t, 0) = + V (t, 0).
tn t n t
This yields
V (t − 1, 0) 1 V (t, 0) 1 1 1
= + = ··· = + +···+ .
t −1 n(t − 1) t n(t − 1) nt n(n − 1)
Hence
t −1 ∑ 1
n−1
V (t − 1, 0) = , for t ≥ t ∗
n s =t −1 s
Let us now focus on large n, t . Recall that t ∗ is the smallest integer such that
t∗ ∗ ∗ ∑n−1 1
n ≥ V (t , 0). We deduce that t is the smallest integer such that s =t ∗ s ≤ 1. We
70
4.2 Markov decision processes
t ∗ ≃ n/e ≃ n/2.8
71
4 Dynamic programming and optimal control
4.23. We use the same set-up as before, that is we have a discrete-time, controlled Markov
process with given probability transitions P (x t +1 |x t , u).
4.24. Let γ be a real number and 0 < γ ≤ 1. Recall that πt = [u t , · · · , u n ] is the policy or
control law starting from time t . We write π for π0 . Consider the cost function
∑
n
J (n, x 0, π) = γt c (x t , ut ).
t =0
4.25. Define
V (n, x 0 ) = inf
π
E J (n, x0, π) | x0, π0 ,
the optimal expected cost for an horizon n and with initial condition x 0 .
72
4.3 Infinite-time horizon problems
Using a similar approach to the one used in a previous lecture, we can show that
inf
π1
E E
J (n − 1, x 1, π1 ) | x 0, π0 = [V (n − 1, x 1 ) | x 0, u 0 ] .
The idea is to write explicitly the expectation and condition on the state x n+1 . The
Bellman equation is thus
E
V (n, x) = inf [c (x, u) + γ (V (n − 1, x 1 ) | x 0, u)] .
u
4.29. There are three widely studied situations under which the limit will exist:
1. 0 < γ < 1 and |c (x, u)| < M for a given constant M and all admissible x, u.
2. 0 < γ ≤ 1 and c (x, u) ≥ 0.
3. 0 < γ ≤ 1 and c (x, u) ≤ 0.
The first case is called discounted programming, and is the one we will focus on. The other
cases are not much more difficult to handle (use the monotone convergence theorem to
exchange the order of integration), and go under the name of negative and positive
programming respectively. Note that the nomenclature is based on a definition of the
problem in which one tries to maximize a reward, not minimize a cost—the problems
are clearly equivalent, it suffices to take the reward to be minus the cost and change
minimization to maximization.
x t +1 = x t + u t + θt
4.31. The value function V (x) satisfies the equation (note that the costs are not neces-
sarily bounded, but are positive.)
73
4 Dynamic programming and optimal control
[ ]
E
V (x) = inf x 2 + u 2 + γ [V (x 1 )|x 0 = x, u 0 = u]
u
[ ∫ 1 ]
= inf x + u + γ
2 2
V (x + u + θ)dθ/2
u −1
4.32. The equation above does not lend itself to easy manipulations to find V (x). We
present in the next section a method to solve equations such as the one above, called
value iteration. Another common approach is to try a parametric form for the value
function and use the equation above to fit the parameters. The form of the cost function
is often a good starting point.
Differentiating the right-hand-side of the last equation with respect to u and setting the
result to zero, we get
2ax + b
u(x) = − .
2(1 + γa)
If we plug u(x) back in the previous equation, we see that the right-hand-side is a
quadratic polynomial in x. It thus suffices to equate the coefficients to the left and to the
right to obtain the value function.
74
4.4 Value iteration
E
V (x) = inf (c (x, u) + γ (V (x 1 )|x 0 = x, u 0 = u) .
u
u
E
Vn (x) = inf (c (x, u) + γ (Vn−1 (x 1 )|x 0 = x, u 0 = u) . (4.3)
This recursion is called value iteration; we can show that under some assumptions, Vn (x)
will converge to the value function V .
4.35. In case of a finite horizon, value iteration is nothing more than the usual dynamic
programming recursion if one takes V0 = 0.
4.36. We will show that it converges for the case of positive costs c with discounted
infinite horizon cost and give more general conditions for convergence, without proof,
later.
4.37. The idea is to show that the right-hand-side of (4.4) is a contraction and appeal to
fixed point theorems. To this end, define the operator
E
T (v (·))(x) : v (x) 7−→ inf (c (x, u) + γ (v (x 1 )|x 0 = x, u 0 = u) .
u
Observe that T takes a function as argument and outputs a function. We assume that
|c (x, u)| is bounded for all admissible x, u.
4.38. The value function V (x) is the solution of V = T (V ) and we can rewrite the value
iteration algorithm as
Vn = T (Vn−1 ). (4.4)
Hence, if we can show that T is a contraction, appealing to fixed point theorems, we
obtain that V = T V has a unique solution and the sequence Vn of (4.4) converges to this
solution.
R
4.39. In order to show that T is a contraction, first observe that if d ∈ , then T (V +d ) =
T (V ) + γd . Now let v 1 (x) and v 2 (x) be such that v 1 (x) ≤ v 2 (x) for all admissible x. Then
clearly T (v 1 )(x) ≤ T (v 2 )(x)—this property is called monotonicity.
75
4 Dynamic programming and optimal control
d = ∥v 1 (x) − v 2 (x)∥∞ .
It is easy to see that v 1 (x) − d ≤ v 2 (x) ≤ v 1 (x) + d , hence applying T to both sides and
using the monotonicity property:
T (v 1 (x)) − γd ≤ T (v 2 ) ≤ T (v 1 (x)) + γd .
We can rewrite this equation as
∥T (v 1 ) − T (v 2 (x))∥∞ ≤ γ∥v 2 − v 1 ∥∞
Hence, T is a contraction for the infinity norm (when γ < 1) and the value iteration
converges to a unique solution.
76
4.5 Dynamic programming in continuous-time
4.42. The principle of dynamic programming in continuous time take a form similar to
the one in discrete time. We show here how to derive it in the case that all the functions
involved are well-behaved, and the Taylor approximations made hold. We can obtain a
more formal derivation at little more cost by using the maximum principle of Pontryagin.
We will however not do this here.
V (t, x) = inf [c (x t , u t , t ) + V (t + 1, x t +1 )] .
u
under the condition that x(t ) = x. If we approximate the differential equation using Euler
integration with a step δ, that is
x t +1 ≃ x t + f (x t , u, t )δ.
4.45. For the discrete-time system obtained via this approximation, the Bellam principle
reads
V (t, x) = inf [c (x(t ), u(t ), t )δ + V (t + δ, x(t + δ))]
u
77
4 Dynamic programming and optimal control
is treated similarly. Observe that e −at ≃ 1−at up to first order. Hence in the discrete-time
approximation, it corresponds to a discount factor γ = 1−aδ. Recalling that the optimal-
ity condition in the case of a discounted cost was V (t, x) = inf u [c (x, u, t ) + γV (t + 1, x t +1 )],
we deduce that the equivalent relation for the continuous-time case is
[ ]
∂V ∂V
= − inf c (x(t ), u(t ), t ) − aV + (t, x(t ))f (x(t ), u(t ), t ) . (4.6)
∂t u ∂x
4.47. In case you find the above derivation of the Bellman equation unsatisfying, we can
easily show that if a control u has a value function V which satisfies (4.6) for all x and t ,
then u is optimal. To this end, consider the cost function
∫ T
J = c (x, u, t ) + e −aT JT (x(T ))
0
4.48. We will establish an inequality for the cost incurred by v when compared to the cost
incurred by y (that is, the value function), as follows: consider the trajectory obtained
from using v , we have
( )
d −at −at −at ∂V ∂V
e V (t, x) = −ae V + e + f (x, v, t ) .
dt ∂t ∂x
78
4.5 Dynamic programming in continuous-time
By definition of u, c (x, v, t ) − aV + ∂V ∂V
∂t + ∂x f (x, v, t ) > 0. To see this, call K (v ) =
c (x(t ), v (t ), t ) − aV + ∂V ∂
∂x (t, x(t ))f (x(t ), v (t ), t ). Hence, the Bellman equation reads ∂t V =
infw K (w). Since u = arg infw K (w) we have that K (u) ≤ K (v ) and thus ∂t∂ V + K (v ) ≥ 0.
d at
− e V (t, x) ≤ e −at c (x, v, t ).
dt
Integrating the above between 0 and T and rearranging terms, we obtain
∫ T
−aT
V (0, x(0)) ≤ e JT (x(T )) + e −at c (x, v, t )dt .
0
The quantity on the right is the cost incurred by v and the quantity on the left the cost
incurred by u. This hence proves our claim.
4.49. Example: LQ controller. We can apply the above relations to find the equation
of optimal least square control for linear dynamics. Consider the system
ẋ = Ax + Bu.
Let Q and R be symmetric positive definite matrices. We want to find the control u that
minimizes ∫ T
(x ′Rx + u ′Q u) dt
0
The dynamic programming equation is
[ ]
′ ′ ∂V ∂V ′
0 = inf x Rx + u Q u + + (Ax + Bu) . (4.7)
u ∂t ∂x
Differentiating the term in brackets in (4.7) with respect to u, we get that the minimizing
control obeys
∂V 1 ∂V
2Q u ∗ + B ′ ⇔ u ∗ = − Q −1B ′
∂x 2 ∂x
′
We try a value function of the form V (t, x) = x K (t )x where K is symmetric. We have
∂V ∂V
= x ′K̇ x and = 2K x .
∂t ∂x
79
4 Dynamic programming and optimal control
Observe that 2x ′K Ax = x ′K Ax +x ′A′K x. Because the above holds for all x, we conclude
that K obeys
K̇ = −R + K BQ −1B ′K − K A − A′K
with boundary condition K (T ). The equation above is called the Riccati equation.
80
4.6 Controlled jump processes
4.50. The type of problems we consider is the following: let S = {s1, . . . , sn } be the finite
state space and let x(t ) be a Markov process evolving in S with probabilities obeying
˙ ) = A(u)p(t ), p(0) = p 0
p(t
where u is a control term. We seek to minimize the expected value of a cost functional
∫ T
c (x, u)dt + C (x(T )).
0
4.51. Set [∫ ]
T
V (t, x) = inf
u
E c (x, u)dt + C (x(T ))|x(t ) = x, u .
t
Take δ small, we express inf u[t ..T ] as inf u[t ..t +δ] inf u[t +δ..T ] to obtain
[ ∫ T ]
V (t, x) ≃ inf
u[t,T ]
E c (x, u)dt + C (x(T )) | x(t ) = x, u
c (x(t ), u(t ))δ +
t +δ
[ (∫ T )]
≃ inf c (x(t ), u(t ))δ + inf
u[t,t +δ]
E
c (x, u)dt + C (x(T )) | x(t ) = x, u
u[t +δ,T ] t +δ
where we also used the fact that c (x, u, t ) is now determined by our current observation
x(t ) and control u(t ).
∫T
4.52. We introduce the function ϕ(t, x t )) = t c (x, u)dt +C (x(T )) where by a slight abuse
of notation, we denote by x t a random path x(.) starting at t . We will also write x(t ) = si
E
to denote random paths that start at x(t ) = si . We now evaluate (ϕ(t +δ, x t +δ |x(t ) = x, u).
Observe first that the expectation above is taken with respect to path that start at t
and end at T . We will split it up into paths from t to t + δ and then paths from t + δ to
T . We have
∑
E (ϕ(t + δ, x t +δ )|x(t ) = x, u) = ϕ(t + δ, x t +δ = si )p(x(t + δ) = si , x t +δ |x(t ) = x, u). (4.8)
si ∈S
Similarly to what we did in the discrete-time case, we use the Markov property to obtain
81
4 Dynamic programming and optimal control
4.53. We now focus on the second term of the last expression. Given p(t ), we have the
approximation
p(t + δ) ≃ p(t ) + A(u(t ))p(t )δ.
valid up to first order. Hence, if we know that x(t ) = si , p(t + δ) given that x(t ) = si is
A(u)e i δ + e i . We can thus rewrite (4.8) for the case x = s j as
∑
E(ϕ(t + δ, xt +δ )|x(t ) = s j , u) = ϕ(t + δ, x t +δ = si )(A(u)e j δ + e j )i p(x t +δ |x(t + δ) = si , ut +δ )
i
where we denote by (v )i the i th entry of the vector v . If we denote by ϕ′(t + δ, x) the row
vector [ϕ(t + δ, x t +δ = s1 ), . . . , ϕ(t + δ, x t +δ = sn )], we can rewrite the previous equation for
all j simultaneously as
4.54. Using the above relation, we can simplify the second term in our last expression
for the value function V (t, x) as follows:
[ ]
V (t, x) ≃ inf
u[t,t +δ]
c (x(t ), u(t ))δ + inf
u[t +δ,T ]
Et +δ..T [ϕ (t + δ, xt +δ )A(u)δ + ϕ (t + δ, xt +δ )] | xt +δ, ut +δ
′ ′
This yields
V (t, x) − V (t + δ, x t +δ )
≃ inf [c (x, u) + V ′(t + δ, x t +δ )A(u)]
δ u[t,t +δ]
Now recall that the sample paths x(t ) are right continuous, hence limδ→0 x(t + δ) = x(t ).
Moreover, the paths are piecewise constant. Hence, taking the limit as δ → 0, there is
no ∂V
∂x appearing on the left. We get
∂V ′(t, x)
= − inf [c (x, u) + V ′A(u)] .
∂t u
82
4.7 Infinite-horizon: uniformization
4.55. The Bellman equation is in the infinite horizon discounted cost case:
4.56. Recall that A is an infinitesimally stochastic matrix (or infinitesimal propagator for
the stochastic process), that is its columns sum to zero and its off-diagonal entries are all
positive. Let
m = sup |Aii (u)|.
i,u
Set
1
A˜ = (A + mI )
m
where I is the identity matrix of appropriate dimensions. Observe that the elements of A˜
are all positive and smaller than one by definition of m. Moreover, the sum of the entries
of each column is 1. Hence A˜ is a bona fide stochastic matrix.
4.57. We now add (m + α)V (x) to both sides of the Bellman equation
Dividing both sides by m + α, we get (we set c˜(x, u) = c (x, u)/(m + α))
where γ = m+α
m
. The above equation is the Bellman equation for a discrete-time infinite
˜
horizon Markov decision process with transition matrix A.
4.58. We have thus shown that continuous-time, discounted cost infinite-horizon Markov
decision processes could be solved by considering a discrete-time infinite-horizon prob-
83
4 Dynamic programming and optimal control
lem. Hence every tools that can be used in the discrete-time case (most notably, value
iteration) can be used in this continuous-time case.
4.59. We now explain informally what the uniformization procedure does and why it
works. Observe first that we are working exclusively in the infinite-horizon/steady-state
case. The value function V (x) does not depend on time in this case. It is thus reasonable
to expect that the exact timing of the jumps betweens states will not matter in that case,
but only the probability transitions of one state to another.
4.60. To elaborate on the previous point, consider the continuous time Markov process
−a1 − a 2 b1 c1
p˙ = a1 −b 1 − b 2 c2 p.
a2 b2 −c 1 − c 2
There are three states, s1, s2 and s3 . We know how to associate a sample path equation
to this process from Part 2. For example, let us associate −1 to s1 , 0 to s2 and 1 to s 3 . A
sample path equation is, e.g.
dx = (x − 1)dN 1 + (x + 1)dN 2 + · · ·
where we omitted the terms describing transitions starting at a state other than 0. We
have seen in Part 2 that if one sets the rate of N 1 at b 1 and the rate of N2 at b 2 , and
similarly for the terms not shown here, then the probability that x(t ) = −1 would be
p 1 (t ), p(x(t ) = 0) = p 2 (t ) and p(x(t ) = 1) = p 3 (t ). In that sense, the Itō equation is a
sample path realization of the Markov chain.
4.61. If for the sample path equation above, x(t ) = 0, what is the probability that x(t )
jumps t0 1? We know that x will jump to one if the Poisson counter N1 (t ) jumps before
the Poisson counter N 2 (t ). Hence this probability is P (T1 < T2 ) where T1 is the elapsed
time to the next jump of N 1 and similarly for T2 . Since N1 and N2 are independent and
exponentially distributed (with parameters b 1 and b 2 respectively), we have
∫ ∞ ∫ t2
P (T1 ≤ T2 ) = b 1b 2e −b 1t1 e −b 2t2 dt 1dt2
∫0 ∞ 0
= b 2e −b 2t2 (1 − e −b 1t2 )dt2
∫0 ∞
= b 2 (e −b 2t2 − e −(b 1 +b 2 )t2 )
0
b2
= 1−
b1 + b2
b1
=
b1 + b2
84
4.7 Infinite-horizon: uniformization
4.62. The above can be generalized to n possible jumps from state x = 0, where the
probability to jump from 0 to a particular state is proportional to the rate of the counter
associated to that state. Said otherwise, the probability of going from i to j is a j i /|aii |.
4.63. We can also evaluate the time that the chain will remain at state 0. If we let T be
a random variable representing the time before the next jump, we have that
4.64. We can summarize the points above by saying that once the chain reaches the
state i , it will remain at that state for a time T ≃ −aii e aii t and then jump to state j with
probability a j i /|aii |.
4.65. In light of the above, we can intuitively interpret uniformization as saying that in the
infinite horizon case, the exact timings of the jumps do not matter and we might as well
have them happen synchronously (that is, according to a given clock). The transitions
from state to state, however, do matter. It is easy to see that the probability that we
jump from i to j is the same for the continuous time chain described by A and for the
"uniformized" discrete-time chain A. ˜ Indeed, in the continuous time case, given that the
chain is at state i , when the chain jump, it will jump to state j with probability a j i /|aii |
as we have seen above. From the description of A˜ it is easy to see that given that the
chain jumps (to a state different from i ), the probability of landing at state j is similarly
a j i /|aii |.
85
Lecture 5
Wiener processes and stochastic differential equations
We now start the study of stochastic differential equations driven by Brownian motion
(or Wiener process). We will derive an Itō rule for this process, an expectation rule and
a Fokker-Planck (density) equation.
5.1 Diffusions
5.1. The study of what became to be known as Brownian motion started with the obser-
vation that some particles, that are large enough to be seen through a microscope, but
light enough to not sink when put in a body of water, would undergo what appears to be
a random motion. This phenomenon was observed many centuries ago. For example,
the following excerpt from a work by Lucretius
"Observe what happens when sunbeams are admitted into a building and shed light on its
shadowy places. You will see a multitude of tiny particles mingling in a multitude of ways...
their dancing is an actual indication of underlying movements of matter that are hidden from
our sight... It originates with the atoms which move of themselves [i.e., spontaneously]. Then
those small compound bodies that are least removed from the impetus of the atoms are set in
motion by the impact of their invisible blows and in turn cannon against slightly larger bodies.
So the movement mounts up from the atoms and gradually emerges to the level of our senses, so
that those bodies are in motion that we see in sunbeams, moved by blows that remain invisible."
87
5 Wiener processes and stochastic differential equations
5.2. In 1827, the botanist R. Brown observed particles of pollen in suspension in water,
and described this phenomena in more details, but was unable to identify the source of
these "invisible blows". The first to provide a mathematical analysis of the phenomenon
was the Danish scientist Thorvard Thiele. This work was followed by the works of
Bachelier (who analyzed the fluctuation of the stock market back in 1900), Einstein (who
used this diffusion process to evaluate other quantities of interests), Smoluchowski, etc..
5.3. Going back to the grains of pollen in suspension in water, what happens at the
microscopic level is that the grains of pollen, which are quite large compared to molecules
of water, are being bombarded at a very high rate by the molecules of water. In fact, the
rate of collision is estimated to be around 1021 collisions per second.
5.4. If studying the motion of the grains of pollen using classical mechanics is in theory
possible, the sheer size of the numbers involved make the success of such an approach
rather slim. On the flip side, the sheer size of these numbers hint to the fact that a
statistical analysis might be quite accurate.
5.5. Making the switch to a statistical thinking, consider dropping a large number of
grains of pollen in water and observing how the density of the grains evolves. The pollen
will undergo a diffusion in the water. Quite remarkably, a wide range of physical situations
involving diffusion are described by the same equation for the density/concentration. We
mention that diffusion does not only related to material quantities, but energy can also
be diffused.
5.6. Let ρ(t, x) be the density of grains of pollen at time t and position x. We can verify
experimentally that the density obeys the equation
∂ 1 ∂2
ρ(t, x) = ρ(t, x).
∂t 2 ∂x 2
This equation is called the heat equation or diffusion equation.
5.7. Observe that ψ(t, x) = √ 1 e −x /2t satisfies the heat equation for all t > 0. More
2
2πt
is true: if ρ(0, x) is a twice-differentiable initial density profile, then the density for any
t > 0 can be expressed at
∫
1 −(x−y)2 /2t
ρ(t, x) = ρ(0, y) √ e dy .
R 2πt
88
5.2 Brownian motion and Poisson counters
∂ ∑ ∂ ∂
ρ(t, x) = qi j ρ(t, x).
∂t ij
∂x i ∂x j
5.9. Because the Gaussian distribution is symmetric about the origin, its odd moments
vanish. The even moments can be evaluated using integration by parts:
∫
Ex p
= √
1
x p e −x /2σ dx
2
2πσ ∫R
1 1 p+1 x −x 2 /2σ
= √ x e dx
2πσ R p + 1 σ
=
1
(p + 1)σ
E
x p+2 .
5.10. We can evaluate all the moments starting from Ex 2 = σ using the relation Ex p =
E
(p − 1)σ x p−2 . We have
Ex 2 = σ
Ex 4 = 3σ 2
Ex 6 = 5 · 3σ 3
..
.
p! ( σ ) p/2
Ex p =
(p/2)! 2
89
5 Wiener processes and stochastic differential equations
5.11. We start from what we know: Poisson driven stochastic differential equations.
R
Consider a diffusion in . A particle gets hit from the left and from the right by molecules
of water. Because the number of hits is very high, we will ignore the motion of the particle
due to its inertia and solely consider its motion due to being hit. We assume that the
particle jumps by a small amount each time it gets hit. If we let x be the position of
the particle, an equation such as dx = dN 1 − dN 2 where N 1 and N 2 are independent
Poisson counters describes qualitatively the situation. We now need to let the rate of the
Poisson counter increase (to at least 1021 , which we will approximate by ∞.) We will
of course need to scale the size of the jumps appropriately, so that the motion becomes
independent of the rate λ.
λ
5.12. Let N1 (t ) and N 2 (t ) be independent Poisson counters of rate 2. We define the
process
1
dx λ (t ) = (dN 1 (t ) − dN 2 (t ))
s (λ)
with x λ (0) = 0. We want to find s (λ) so that the above relation yields at the macroscopic
level the statistics of a Gaussian distribution. Note that it is a priori not clear that this is
possible at all.
5.13. To fix s (λ), let us evaluate the second moment of x: using the Itō rule, we have
( ) ( )
1 2 1 2
dx = (x +
2
) − x dN 1 + (x −
2
) − x dN 2 .
2
s (λ) s (λ)
90
5.2 Brownian motion and Poisson counters
λ 2
d
dt
Ex2 =
2 s (λ)2
λ
=
s (λ)2
√
Hence, if we take s (λ) = λ, we see that as λ → ∞, the variance is independent of λ
E
(in fact, it is independent even at finite λ). More is true, x 2 = t . That is, the variance
of the motion increases linearly with time, as is required by the diffusion equation!
Using a symmetry argument, one can easily conclude that all the odd moments vanish.
Hence we get ( ) ( )
d xp
=
p Ex p−2
+
1 p
Ex p−4 + · · · E
dt 2 λ 4
where the omitted terms are in powers of 1/λ. We thus find by integrating the above
relation and taking the limit as λ goes to infinity
∫
1 t
λ→∞
p
E
lim x (t ) =
2 0
p(p − 1) lim x p−2dt
λ→∞
E
We can thus obtain the moments recursively starting from p = 2:
∫
1 t
lim
λ→∞
Ex (t ) =
2
2 0
2dt = t
∫
6 t
lim
λ→∞
E x (t ) =
4
3 0
tdt = 3t 2
lim E x 6 (t ) = 5 · 3 t 3
λ→∞
..
.
p! ( t ) 2
p
lim
λ→∞
Ex p
(t ) = (p − 1)(p − 3) · · · 3 · 1 t p/2
=
(p/2)! 2
91
5 Wiener processes and stochastic differential equations
Hence in the limit, the moments of x(t ) match the moments of the Gaussian distribu-
tion.
5.15. In fact, if we write the density equation for the process, we obtain
[ ( ) ( )]
∂ λ 1 1
ρ(t, x) = ρ t, x + √ − 2ρ(t, x) + ρ t, x − √
∂t 2 λ λ
In the limit as λ → ∞, the right-hand side tends to the second derivative 1 of ρ(t, x)
with respect to x and thus we recover the heat-equation:
∂ 1 ∂2
ρ(t, x) = ρ(t, x).
∂t 2 ∂x 2
5.16. Let τ > 0. Using the same approach as in Part 2, we have that
1
d τ x(t )x(t + τ) = x(t ) √ (dN 1 − dN 2 )
λ
and thus by taking expectations on both sides and recalling that Ni (t + τ) is independent
of x(t ), we get
d
E 1
E
(x(t )x(t + τ)) = x(t ) √ (dN 1 − dN 2 ) = 0 E
dt λ
Thus E(x(t )x(t + τ)) = Ex 2(t ). In general, one has the relation
Ex(t )x(s ) = Ex 2(min(t, s )).
1 1
x(t )−x(τ) = √ (dN 1 (t )−dN 2 (t ))−(dN 1 (τ)−dN 2 (τ)) = √ (dN 1 (t ) − dN 1 (τ)) − (dN 2 (t ) − dN 2 (τ))
λ λ
depends only on |t −τ|. Indeed, recall that the Poisson counters have the Markov property,
and hence each of the two terms in the last expression only depend on |t − τ|.
5.18. We can generalize the above by observing that if the intervals [t, τ] and [s, σ] do
1
Recall that d
dx 2
f (x) = limh→0 1
h2
(f (x − h) − 2f (x) + f (x + h)).
92
5.2 Brownian motion and Poisson counters
( )
lim
λ→ ∞
E (x(t ) − x(τ))2 = lim
λ→∞
E x 2 (t ) − 2x(t )x(τ)x 2 (τ)
= t − 2 min(t, τ) + τ
= |t − τ|
E
where we used the fact that lim x 2 (t ) = t and point 16 above.
93
5 Wiener processes and stochastic differential equations
1
dw(t ) = lim √ [dN 1 − dN 2 ]
λ→∞ λ
⟨ ⟩ ∑[ ]
∂ψ 1 √
dψ = , f (x) dt + ψ(x + √ g i (x)) − ψ(x) (dNi )/ λ
∂x i λ
∑[ 1
]
√
+ ψ(x − √ g i (x)) − ψ(x) dN −i / λ
i λ
94
5.3 Stochastic differential equations and the Itō rule
5.24. Before proceeding further, we need to understand the process dz = (dNi + dN −1 )/λ
in the limit as λ → ∞. We first evaluate its expectation. Using the expectation rule, we
get that
d
dt
z =1 E
E
and thus z (t ) = t . We now derive the variance of the process. First, using the Itō rule
for jump processes, we get that
( )2
1 2
dz = z +
2
− z (dNi + dN −i ) = (2z /λ + 1/λ 2 )(dNi + dN −i ).
λ
Hence,
d z2 E 1
E 1
= 2 z + = 2t + .
dt λ λ
Solving the above differential equation, we obtain
Ez 2 = t 2 + λt .
Hence, the variance of z (t ) is Ez 2 − (Ez )2 = t/λ.
5.25. The point above shows that, in the limit as λ → ∞, the process z (t ) evolves as t
with vanishingly small variance. Said otherwise, the process becomes deterministic. We
have thus established the very important relation
z (t ) = t or (dNi + dN −i )/λ = dt
95
5 Wiener processes and stochastic differential equations
in the limit as λ → ∞.
Using these two relations in (5.2), we obtain the Itō rule for stochastic differential equa-
tion with Wiener processes:
⟨ ⟩ ∑ [⟨ ∂ψ ⟩ ⟨ ⟩ ]
∂ψ 1 ∂2 ψ(x)
dψ = , f (x) dt + , g i (x) dw i + g i (x), g i (x) dt
∂t i
∂x 2 ∂x 2
5.27. Observe that from the above we can deduce the highly informal rule
dw 2 = dt .
The idea is to try to derive the right-hand-side of d ψ if we want that the solutions
obtained by first solving dx = f (x)dt + g (x)dw and then taking the function ψ is the
same as solving directly the equation for ψ. Starting from
dx = f (x)dt + g (x)dw,
one can derive the Itō rule by keeping the terms in dt of order no more than one for ψ.
A Taylor series of ψ yields
⟨ ⟩ ⟨ 2 ⟩
∂ψ 1 ∂ ψ
ψ(x + δ) = ψ(x) + ,δ + δ, 2 δ + h.o.t .
∂x 2 ∂x
Now for the case that interests us, δ = f (x)dt + g (x)dw and hence
⟨ ⟩ ⟨ ⟩
∂ψ 1 ∂2 ψ
ψ(x + δ) = ψ(x) + , f (x)dt + g (x)dw + f (x)dt + g (x)dw, 2 (f (x)dt + g (x)dw)
∂x 2 ∂x
⟨ ⟩ ⟨ ⟩ ⟨ ⟩
∂ψ 1 ∂ ψ
2 ∂2 ψ
= ψ(x) + , f (x)dt + g (x)dw + f (x), 2 f (x) dt + f (x), 2 g (x)dtdw
2
∂x 2 ∂x ∂x
⟨ ⟩
1 ∂ ψ
2
+ g (x), 2 g (x) dw 2 + h.o.t .
2 ∂x
Comparing the last relation to the Itō rule we have derived in the previous point, we
see that it implies that dw 2 = dt , and thus dtdw is of order 32 in dt and is ignored as
well as dt 2 .
96
5.3 Stochastic differential equations and the Itō rule
Examples
We now look at some examples
97
5 Wiener processes and stochastic differential equations
5.30. The expectation rule is easily obtained by recalling that dw = limλ √1 (dN 1 − dN 2 ).
λ
Hence
E
d x = E(f (x)dt + g (x)dw)
= E(f (x)dt + g (x) lim √
1
(dN 1 − dN 2 ))
λ λ
[ ]
= E f (x)dt + E(g (x))E 1
lim √ (dN 1 − dN 2 ))
λ λ
= E f (x)dt
where we used the fact that N1 and N2 are independent of x(t ). We can summarize this
by saying that
E
g (x)dw = 0.
dx = −xdt + αdw .
We obtain that
d
dt
E
(x) = − (x).E
E
Hence in steady-state, x = 0. Using the Itō rule, we can evaluate higher-order moments
of the process as follows:
1
dx 2 = 2x(−xdt + αdw) + α 2 2 = (α 2 − 2x 2 )dt + 2αxdw
2
Using the expectation rule, we get that
d
dt
E E
x 2 = α 2 − 2 (x 2 ).
Hence, the steady-state variance is α 2 /2. The third moment is obtained similarly:
α2
dx 3 = 3x 2 (−xdt + αdw) + 6xdt = (3α 2x − 3x 3 )dt + 3αx 2dw .
2
98
5.4 The expectation rule and examples
E
Using the expectation rule, we conclude that in steady-state x 3 = 0. Let p be a positive
integer, we have
( )
p(p − 1)α p−2 p(p − 1)α p−2
dx = px (−xdt + αdw) +
p p−1
x dt = x − px dt + αpx p−1dw .
p
2 2
5.32. Linear systems. Let x ∈ Rn, A ∈ Rn×n, B ∈ Rn and consider the Itō equation
dx = Axdt + Bdw .
This describes the dynamics of a linear system with a noise input. The expectation of x
is easily obtained using the expectation rule:
d
dt
E E E
x(t ) = A x(t ) ⇒ x(t ) = e At x(0).E
E
We now focus on the covariance of x. Define Σ(t ) = x(t )x ′(t ). We will find an equation
for the i, j th entry of Σ(t ) using the Itō and expectation rules. To this end, observe that
[ ] [ ∑ ] [ ]
xi ( l Ail xl ) b
d = ∑ dt + i dw .
xj A x
l jl l b j
In summary
Σ̇ = AΣ + ΣA′ + BB ′ .
99
5 Wiener processes and stochastic differential equations
100
5.5 Finite difference approximations
5.33. Consider the approximation scheme x(t + τ) = x(t ) + dx(t ), which applied to the
stochastic equation above yields
where τ > 0 is the time increment in the approximation and w(k τ) are independent
Gaussians.
This approximates x(τ) as
where w(0) is a Gaussian random variable with zero mean and variance τ.
5.34. We compare this approximation to the refined one obtained by going first through
τ/2:
τ τ
x(τ/2) = x(0) + f (x(0)) + g (x(0))w(0)
2 2
and
τ τ
x(τ) = x(τ/2) + f (x(τ/2)) + g (x(τ/2))w(τ/2).
2 2
If we expand f and g in their Taylor series we can express x(τ) in terms of x(0) up to
first order in τ/2 as
τ τ τ
x(τ) = x(0) + τ f (x(0)) + g (x(0))w(0) + g (x(0))w( ) + . . . (5.4)
2 2 2
5.35. It is understood that the quality of the solution increases as we take more interme-
diate steps between 0 and τ, but since we keep on adding random variables w(τ/k ), we
need to make sure that the variance at x(τ) remains constant, or, in other words, that
the statistical properties of x(τ) do not depend on the time-step chosen.
5.36. To this end, let us evaluate the variances of x(τ) obtained from the two rela-
tions (5.3) and (5.4).
We have
101
5 Wiener processes and stochastic differential equations
τ2 ( )
E E E
[(x(τ) − x(τ))(x(τ) − x(τ)) ] = g (x(0))g ′(x(0))
′
4
E w 2 (0) + w 2 (τ/2) . (5.6)
Hence, if we want the variances of (5.5) and (5.6) to be the same, we need the variances
of w(0) and w(τ/2) in (5.6) to be twice the variance of w(0) in (5.6). If we do not change
the variance of w(k τ) when τ decreases, the limit as τ → 0 that we will take below will
yield a deterministic process, as the noise variance goes to zero.
and in this way we can keep the variances of w(k τ) to be fixed as τ varies. If w(t ) is a
standard Brownian motion, that is with variance t , then we can take w(k τ) in (5.7) to be
independent Gaussians with zero mean and variance 1.
5.38. For example, the equation dx = dw implies that x(t ) ∼ N (0, t ). The approximation
scheme √
x(k τ) = x((k − 1)τ) + τw(τ)
with w(τ) ∼ N (0, 1) yields an exact solution for any τ.
102
5.6 First passage time
τa = min {t | w(t ) = a} ,
M (t ) = max(w(t ) for 0 ≤ s ≤ t ),
that is M (t ) is the largest value taken by w(t ) over the interval [0, t ]. We will derive
the distributions of M (t ) and τa . It should be clear that both M (t ) and τa are random
variables, and both are past-measurable. That is, we only need to know the process
w([0..t ]) in order to assign a value to these variables. Said otherwise, these random
variables are measurable with respect to the σ-field adapted to w(t ). We first emphasize
that using the material from the previous lecture, it should be clear that one can evaluate
the distributions by generating many random paths and recording, in case we care about
τa , the first time at which the process reaches a, or recording the maximum value of the
process over [0, t ] for the second case.
We claim that the following holds:
5.39. The first equality is rather easy to establish. Indeed, consider the event {M (t ) ≥ a}.
If the event happened, then because w(t ) is continuous with probability one (we will
derive this below, without using the material of this section since causality loops are
better avoided) we conclude that the event {τa ≤ t } happened. Reciprocally, if τa ≤ t ,
then we know w(t ) has reached a before t and hence M (t ) > a. This establishes the first
equality.
5.40. Let us focus on the second relation. First recall that w(t ) is symmetric, in the sense
that p(w(t ) > 0) = p(w(t ) < 0) = 1/2. The process is also Markov, or informally speaking,
future states only depend on the present state, and not on past states. This means that
if w(t ) = a, then the probability that w(t + τ) > a is the same as the probability that
w(t + τ) < a for τ > 0. We thus have, for s < t , that
1
P (w(t ) − w(τa ) > 0 | τa = s ) = P (w(t ) − w(τa ) < 0 | τa = s ) = .
2
If we integrate the above equation from s = 0..t with respect to the density for τa we
obtain
103
5 Wiener processes and stochastic differential equations
1
P (w(t ) − w(τa ) > 0 ∩ τa < t ) = P (w(t ) − w(τ) < 0 ∩ τa < t ) = P (τa < t ).
2
5.41. Finally, observe that the event A = {w(t ) − w(τa ) > 0} ∩ {τa < t } is the same as
the event B = {w(t ) > a}. To see this, observe that if A has taken place, then clearly
w(t ) > w(τa ) and w(τa ) = a. Reciprocally, if B as taken place, because w(t ) is continuous
w(t ) > a and τa < t . We conclude that
5.42. Let us assume that w(t ) is differentiable at t = 0. This means, since w(0) = 0,
that the quotients w(t )/t are bounded for all t close enough to zero. Hence if w(t ) is
differentiable at 0, there exists 0 < K < ∞ and ε > 0 such that
We want to show that, for K and ε fixed, the event {w(t ) < K t for all 0 ≤ t ≤ ε} has
probability zero. It is the same as showing that the event A = {∃t ∗ ∈ [0, ε] | w(t ∗ ) > K t ∗ }
has probability one.
Recall that M (t ) = max{w(s ) | 0 ≤ s ≤ t }. Observe that if B = {M (t ) > K t } happens,
this means that there exists 0 ≤ t ∗ ≤ t such that w(t ∗ ) ≥ K t ≥ K t ∗ . Said otherwise
B ⊆ A and thus P (B) ≤ P (A). We will show that as t → 0, P (B) tends to one and thus
P (A) = 1. From (5.8), we have that
P (M (t ) ≥ K t ) = 2P (w(t ) ≥ K t ) (5.9)
holds.
Recall the definition of the error function:
∫ t
2
e −x dx .
2
erf(t ) = √
π 0
It is related to the cumulative distribution function of a Gaussian with zero mean and
104
5.6 First passage time
∫x
√ 1 e −s /(2σ)ds ,
2
variance σ as follows: if Φ(x) = ∞ then
2πσ
1 1 √
Φ(x) = + erf(x/ 2σ)
2 2
Because w(t ) is Gaussian with variance t and zero mean, we can rewrite (5.9) in terms
of erf as
1 1 √ √ √
P (M (t ) ≥ K t ) = 2(1 − P (w(t ) < K t )) = 2 − 2( + erf(K t/ 2t ) = 1 − erf(K t/ 2)
2 2
and hence √ √
P (M (t ) ≥ K t ) = 1 − erf(K t/ 2).
The series development of erf(x) is
( )
2 x3 x5 x7
erf(x) = √ x − + − +···
π 3 · 1! 5 · 2! 7 · 3!
and hence as t → 0, P (M (t ) ≥ K t ) → 1.
5.43. We conclude from the above that for any fixed K < ∞ and ε > 0, we can make
P (w(t ) < K t for all 0 ≤ t ≤ ε) as close to zero as desired. Hence w(t ) is not differentiable
at 0 with probability 1.
105
5 Wiener processes and stochastic differential equations
The procedure is similar to the procedure used in the case of a Poisson process: we use
the Itō rule and the expectation rule to evaluate the expectation of a test function ψ.
∫
5.44. First, we have that Eψ(x) = ψ(x)ρ(t, x)dx . Hence,
Rn
∫
∂ ρ(t, x)
d
dt
Eψ(x) = ψ(x)
∂t
dx . (5.11)
Rn
106
5.7 The Fokker-Planck equation
5.48. Recall that f (x) and g (x) are vector valued. We denote by g i (x) the i th entry of
g (x). Integrating B by part twice, we get
∫ ⟨ ⟩
∂2 ψ
B = g, 2 g ρ(t, x)dx
Rn ∂x
∫ ∑∑ 2
∂ ψ
= g i (x)g j (x)ρ(t, x)dx
Rn i j
∂x i ∂x j
∑ ∑ [ [ ∂ψ ] ∫
∂ψ ∂
]
= g i g j ρ(t, x) − g i g j ρ(t, x)dx
i j
∂x i ∞ Rn ∂x i ∂x j
∑ ∑ [[ ∂
] ∫
∂ ∂
]
= ψ g i g j ρ(t, x) + ψ g i g j ρ(t, x)dx
i j
∂x j ∞ Rn ∂x i ∂x j
∫ ∑∑ ∂ ∂
= ψ g i g j ρ(t, x) dx
Rn i j
∂x i ∂x j
where we first integrated with respect to x i and used the fact that ρ(t, x) decays fast.
5.49. Putting Equation (5.11) and (5.12) together with the calculated values for A and
B, and appealing to the fact that ψ was arbitrary, we conclude that the integrands are
the same and thus
⟨ ⟩
∂ ∂ 1 ∑∑ ∂ ∂
ρ(t, x) = − , ρ(t, x)f + g i (x)g j (x)ρ(t, x) . (5.13)
∂t ∂x 2 i j ∂x i ∂x j
dx = dx .
Comparing to (5.10), we see that it corresponds to having f (x) = 0 and g (x) = 1. Hence
in this case, the density equation for x(t ) reads as
∂ 1 ∂2 ρ
ρ(t, x) = .
∂t 2 ∂x 2
dx = −xdt + αdw .
107
5 Wiener processes and stochastic differential equations
A direct application of (5.13) with f (x) = −x and g = α yields (here all the variables are
scalar)
∂ ∂x ρ(t, x) 1 2 ∂2 ρ(t, x)
ρ(t, x) = + α
∂t ∂x 2 ∂x 2
⟨ ⟩
∂ ρ(t, x 1, x 2 ) ∂ 1 ∂2 (g 12 ρ) ∂2 (g 1 g 2 ρ) ∂2 (g 22 ρ)
= − , ρf (x) + +2 +
∂t ∂x 2 ∂x 12 ∂x 1 ∂x 2 ∂x 22
∂(x 22 ρ) ∂(x 12 ρ) 1 ∂2 (x 22 ρ)
∂2 (x 1x 2 ρ(t, x)) ∂2 (x 12 ρ)
= − + − 2 +
∂x 1 ∂x 2 2 ∂x 12 ∂x 1 ∂x 2 ∂x 22
108
5.7 The Fokker-Planck equation
∑
m ∑
p
dx = f (x)dt + g i (x)dNi + hi (x)dw i
i =1 j =1
where w i (t ) are independent Brownian motions and Ni (t ) are independent Poisson coun-
ters of rates λ i , admits the following density equation:
⟨ ⟩ ∑ m ( ) −1 ∑p ∑ n
∂ ρ(t, x)
=−
∂
, f (x)ρ + λi ρ(t, g˜−1 (x)) det ∂ g˜i − ρ(t, x) + 1 ∂2 k l
h h ρ
∂t ∂x i ∂x 2 j =1 k,l =1 ∂x k xl j j
i =1
(5.14)
k
where h j is the k th entry of the vector h j and n is the dimension of x. We recall that
g˜i is defined as
g˜i (x) = x + g i (x)
and the above formula holds for g˜i one-to-one (and hence having a well-defined inverse)
—a straightforward modification can be applied if g˜i is not one-to-one (namely, consider
the inverse to be set-valued and integrate over this set, if the inverse image is a discrete
set, that integration is simply a sum.)
where z (0) = 1. This is a model for a switched linear system with Brownian noise. Indeed,
109
5 Wiener processes and stochastic differential equations
we know that z (t ) will take on two possible values z (t ) = ±1. For the case z = 1, the
dynamics is dx = xdt + bdw and for z = −1, it is dx = −3xdt + bdw. We could apply the
general formula (5.14) to obtain the density, but in this case, it is easier to proceed by
first noticing that since z takes on finite values, it is useful to use ρ+ (t, x) = ρ(t, x, z = 1)
and ρ− (t, x) = ρ(t, x, z = −1). Now one can deduce that the density at time t and x for
x = 1 will change according to the first and diffusion terms, and the jump terms (gains
from paths being at t and x but with z = −1 and just have jumped — which happens at
rate λ, and losses to z = −1 which happens at the same rate). The determinant involved
is easily seen to be equal to 1. Putting these together, we obtain the system
∂ ρ+ (t,x)
∂t = − ∂x∂xρ + b 2 ∂2 ρ − +
2 ∂x 2 + λ [ρ (t, x) − ρ (t, x)]
∂ ρ− (t,x)
= ∂3x ρ
+ b 2 ∂2 ρ + −
∂t ∂x 2 ∂x 2 + λ [ρ (t, x) − ρ (t, x)]
110
5.8 Stratonovich calculus
dx = f (x)dt + g (x)dw,
where the limit is understood as a mean-square limit. Keep in mind that w(t ) is a random
path, and thus the integral above is a random variable.
In the above definition of the Itō integral, the fact that we took the value of x(t ) at
the beginning of the discretization interval is important. While in the case of a usual
integration with respect to a ’nice’ function with bounded variations,the choice of where
x(·) is evaluated in a discretization interval ti −1, ti ] does not matter in the limit, because
w(ti ) is highly irregular, the choice does matter.
5.56. An alternate definition of the stochastic integral is due to Stratonovic and goes as
follows:
∫ ∑ x(ti ) + x(ti +1 )
¯ ) = lim
g (t )dw(t [w(ti +1 ) − w(ti )]
n→∞
i
2
Again, if w(t ) were a bounded variation function, this would of course yield the same
111
5 Wiener processes and stochastic differential equations
result as the discretization scheme used in the definition of the Itō integral.
5.57. We now look at how one can go from one integral to other. To this end, consider
the Itō equation
dx = f (x)dt + g (x)dw
where g (·) is a differentiable function and let fs (x), g s (x) be such that
¯
dx = fs (x)dt + g s (x)dw
has the same solution x(t ) as the Itō equation. We want to relate fs and g s to f and g .
We introduce the shorthand notation w i for w(ti ), x i for x(ti ), etc.
To establish the relation between the two, we start from the Stratonovic integral
∑[ (
x(ti ) + x(ti −1 )
) ]
x(t ) ≃ fs (ti )(ti − ti −1 ) + g s [w(ti ) − w(ti −1 )]
i
2
We can approximate dx i by
dx i = fs (x i −1 )(ti − ti −1 ) + g s (x i −1 ) [w i − w i −1 ] .
∑ ( ) ∑ ( )
dx i dx i
gs + x i −1 [w i − w i −1 ] ≃ g s x i −1 + [w i − w i −1 ]
i
2 i
2
∑[ 1 ∂g s
]
≃ g s (x i −1 ) + dx i [w i − w i −1 ]
i
2 ∂x
∑[ 1 ∂g s
]
≃ g (x i −1 ) + fs (x i −1 )(ti − ti −1 ) + g s (x i −1 ) [w i − w i −1 ]
i
2 ∂x
× [w i − w i −1 ]
112
5.8 Stratonovich calculus
∑ ( ) ∑[ ]
dx i + x i −1 1 ∂g s
gs [w i − w i −1 ] ≃ g s (x i −1 ) + g s (x i −1 ) [ti − ti −1 ]
i
2 i
2 ∂x
Putting the above together, we find that
∫ ( ) ∫
1 ∂g
fs (x) + g s (x) dt + g s (x)dw = fs (x) + g s (x)dw
Itō 2 ∂x Strat.
¯ = (f (x) + 1 ∂g
Stratonovic : dx = f (x)dt + g (x)dw ←→ Ito : dx g )dt + g (x)dw
2 ∂x
and
The Stratonovic differential, on the other hand, behaves like the usual differential from
calculus:
d¯ψ =
∂ψ ¯
f (x)dt + g (x)dw
∂x
We show that the above relation holds below:
1 dg
dx = f (x)dt + g (x)dt + g (x)dw (5.15)
2 dx
5.60. Let ψ be a one-to-one function with inverse ϕ and set y = ψ(x) (and thus x = ϕ(y)).
113
5 Wiener processes and stochastic differential equations
d ψ/dx = (d ϕ/dy)−1 .
Furthermore,
d 2ψ d −1 −1 d −1 ( ) −1 ( )
= (d ϕ/dy) = (d ϕ/dy) = d 2
ϕ/dy 2
d ψ/dx = d 2
ϕ/dy 2
dx 2 dx (d ϕ/dy)2 dx (d ϕ/dy)2 (d ϕ/dy)3
5.61. Let us write (5.15) in terms of y: first, from the Itō rule, we get
( )
dψ 1 dg 1 d 2ψ
d ψ(x) = f (x)dt + g (x)dt + g (x)dw + g 2 2 dt
dx 2 dx 2 dx
We let f¯(y) = f (ϕ(y)) and g¯(y) = g (ϕ(y)) and using the relations derived above, we
obtain
( ) −1 ( ) −1 ( ) −1 ( ) −3 2
( ) −1
d ϕ 1 d ϕ d ¯
g d ϕ 1 d ϕ d ϕ dt + d ϕ
dy = f¯ + g¯ − g¯2
g¯dw
dy 2 dy dy ∂y 2 dy dy
2 dy
5.62. Now we convert the last equation back to a Stratonovic equation. Observe that
( ) −1 ( ) −2 ( ) −1
d d ϕ
g¯ = −
dϕ
g¯ +
dϕ d g¯
dy dy dy dy dy
cancels out the second and third terms in the above expression for dy. We thus have the
Stratonovic equation
( ) −1
dϕ
dy = f¯dt + g¯dw .
dy
This shows that the Stratonovic differential obeys the same rules as the usual differen-
tial from calculus.
114
Lecture 6
System Concepts
We start in this chapter the study of control system with noise. After a brief overview
of controllability and observability for time-varying linear systems, we focus on linear
systems driven noise. It is well-known that studying linear dynamics in the Fourier and
Laplace transform domain is quite fruitful. We will see that a direct extension of this
theory to linear systems driven by Brownian motion is not possible, for the reason that
a typical Brownian motion sample paths does not have a finite L 2 norm. We will see
that one can nevertheless have a rather complete set of results if instead of focusing on
the frequency analysis of the energy in a signal (which is in the L 2 norm), we focus on
a frequency analysis of the power in the signal. This analysis is possible for stationary,
ergodic processes, two definitions that will be explained in this chapter. Building on
these, we will introduce the power spectrum of a signal, relate it to the Fourier transform
of the autocorrelation of the signal (that is the Wiener-Khinchin theorem) and close this
part by presenting stochastic realization theory. That is, we will characterize which power
spectra can be realized by a linear system.
115
6 System Concepts
6.2. A subclass of systems which enjoys a relatively complete theory is the class of linear
systems [though many open questions remain even for linear systems]. These are systems
of the type
ẋ = Ax
of course play a role in the analysis of (6.1). Because the system is linear, it is a good
idea to get a handle of the solutions for a basis of vectors of initial conditions. This is
exactly what the fundamental solution does for the canonical basis: if Φ(t, σ) satisfies
the equation
d
Φ(t, σ) = A(t )Φ(t, σ); Φ(σ, σ) = I
dt
then Φ(t, σ) is called a fundamental solution of ẋ = Ax.
The transition matrix has the following property
A2 A3
Φ(t, σ) = e A(t −σ) = I + A(t − σ) + (t − σ)2 + (t − σ)3 + . . .
2! 3!
116
6.1 Notions from deterministic systems
This does not hold if A is time-varying but there exists a similar iterated expansion called
the Peano-Baker series that handles that case.
6.5. From the fundamental solution of ẋ = Ax, we can obtain the solution of (6.1) for
any initial condition x(σ):
∫ t
x(t ) = Φ(t, σ)x(σ) + Φ(t, σ)B(σ)u(σ)dσ.
σ
The function
T (t, σ) = C (t )Φ(t, σ)B(σ)
is sometimes called the weighting pattern of (6.1).
6.7. A linear system is said to be controllable if for any x(σ) and t > σ, there exists a
control u(s ) defined for s ∈ [σ, t ] such that u(s ) drives the system (6.1) from x(σ) to zero.
It is important to realize that controllability is a question about the range space of an
R
operator that maps controls u to n , the state space. As it stands, the operator has an
infinite-dimensional domain and our first order of business is thus to find an equivalent
operator (in the sense that it has the same range space, since this is what we are concerned
about) with a finite-dimensional domain.
6.8. Consider the following mapping, which maps functions (controls) to vectors (think
of the state at a given time)
∫ t
L(u(t )) = B(t )u(t )dt .
σ
6.9. We claim that the range space of L and the range space of Q are the same:
1. Let y 1 be in the range space of Q . Hence there exists x 1 such that Q (σ, t )x 1 = y 1 .
117
6 System Concepts
2. Reciprocally, assume that y 1 is not in the range space of Q . Then, because the
complement of the range space is of codimension zero, there exists x 1 such that
Q x 1 = 0 and x 1′ y 1 , 0. Observe that
∫ t ∫ t
0= x 1′ Q x 1 = x 1′ B(s )B ′(s )x 1ds = ∥B ′(s )x 1 ∥ 2ds
σ σ
and hence B(s )x 1 = 0. Assume by contradiction that y 1 is in the range space of L.
Hence there exists u such that
∫ t
x 1′ B(s )u(s ) = x 1′ y 1 , 0.
σ
6.10. Using the above result, we can give conditions for a system to be controllable.
Consider the system of (6.1), and define the controllability gramian of this system to be
∫ t1
W (t0, t1 ) = Φ(t0, t )B(t )B ′(t )Φ′(t0, t )dt .
t0
We claim that we can drive the system from x 0 at t0 to x 1 at t1 if and only if x 0 −Φ(t0, t1 )x 1
is in the range space of W (t0, t1 ).
Hence the system is controllable if W (t0, t1 ) is of full rank.
1. The idea is to first get rid of the drift term Ax and then use the result above relating
range spaces of two operators. To this end, set
z (t ) = Φ(t0, t )x(t ).
d
0= (Φ(t0, t )Φ(t, t0 ))
dt
= Φ̇(t0, t )Φ(t, t0 ) + Φ(t0, t )Φ̇(t, t0 )
= Φ̇(t0, t )Φ(t, t0 ) + Φ(t0, t )A(t )Φ(t, t0 )
Hence
Φ̇(t0, t ) = −Φ(t0, t )A(t )
3. We now evaluate ż :
118
6.1 Notions from deterministic systems
6.11. It is not hard to see that the controllability gramian W (t0, t1 ) obeys the differential
equation
d
W (t0, t ) = A′(t )W (t0, t ) + W (t0, t )A(t ) + B(t )B ′(t )
dt
with W (t0, t0 ) = 0.
Indeed, we recall that the general solution of a matrix equation
Ṁ = A1 (t )M (t ) + M (t )A2 (t ) + B(t )
is given by
∫ t
′
M (t ) = Φ(t, t0 )M (t0 )Φ (t, t0 ) + Φ1 (t, σ)B(σ)Φ′2 (t, σ)dσ
t0
where Φ1 and Φ2 are the fundamental solutions of ẋ = A1 (t )x and ẋ = A′2 (t )x respectively.
Observability
6.12. We say that a system is observable at time t if we can determine x(t ) from the
knowledge of y(s ), s ∈ [t0, t1 ] with t0 < t1 < t .
119
6 System Concepts
ẋ = Ax + Bu; y = Cx (6.2)
where A, B and C are constant.
6.15. In the case of an LTI system, we have that Φ(t0, t ) = e A(t0 −t ) and thus
∫ t1
′
W (t0, t1 ) = e A(t0 −t )BB ′e A (t0 −t )dt .
t0
6.16. Using the series development of e At and the Cayley-Hamilton Theorem, it is not
hard to show that the range space and null space of W coincide with the range space
and null space of
Now WT is of full rank if and only if [B, AB, . . . , An−1B] is. Hence we conclude that a
LTI system is controllable if and only if
6.17. Using a similar approach, we conclude that a LTI system is observable if and only
if
[C ′, A′C ′, (A′)2C ′, . . . , (A′)n−1C ′] is of full rank.
Transfer functions
G (s ) = C (I s − A)−1B .
6.19. Recall that if the real parts of the eigenvalues of A are less than σ, then all x(t )e −σt
120
6.1 Notions from deterministic systems
6.21. Assume that A is stable (also called Hurwitz), that is that its eigenvalues have
negative real part. From the knowledge of the transfer function, we can easily evaluate
the response of the system to a sinusoidal input: precisely, if
u(t ) = u 0 cos(ωt ),
then asymptotically
x(t ) = x 1 cos(ωt ) + x 2 sin(ωt )
where x 1 = u 0 Re(G (iω)) and x 2 = u 0 Im(G (iω))
Stability
6.22. Stability of LTI systems can also be investigated through Lyapunov or energy
function. The idea is that if a system is stable, it should somehow dissipate its internal
energy, that is if E is the energy of the system, with E = 0 corresponding to the system
in its lowest energy level, d /dt (E) ≤ 0. One can show that for LTI systems, stability is
equivalent to the existence of a quadratic energy function. Precisely, if A is stable, there
exists Q a symmetric positive definite matrix such that
d/dt (x′Qx) ≤ 0.
6.23. We evaluate the total derivative of the above equation to obtain a condition on Q :
d/dt (x′Qx) = (Ax)′Qx + x′Q(Ax)
            = x′(A′Q + QA)x.

Hence the condition on Q is

A′Q + QA < 0.
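Numerically, a quadratic Lyapunov function for a stable A can be found by solving the linear matrix equation A′Q + QA = −I; a minimal sketch (hypothetical A) using SciPy:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])   # stable: eigenvalues -1 and -2

# solve_continuous_lyapunov(a, q) solves a X + X a^H = q;
# calling it with a = A' therefore solves A'Q + QA = -I.
Q = solve_continuous_lyapunov(A.T, -np.eye(2))

print("eigenvalues of Q:", np.linalg.eigvalsh(Q))              # all positive
print("residual:", np.abs(A.T @ Q + Q @ A + np.eye(2)).max())  # ~ 0
```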
Fourier transforms
6.24. We define the space of functions L²(R, C), or simply L²(R), to be the space of complex-valued, square integrable functions:

L²(R) = { f : R ⟼ C such that ∫_R |f(x)|² dx < ∞ }.
6.25. We define L²([−π, π]) to be the space of periodic functions, with period 2π, which are square integrable over a period:

L²([−π, π]) = { f : [−π, π] ⟼ C such that f(x) = f(x + 2π) and ∫_{−π}^{π} |f(x)|² dx < ∞ }.
6.26. The Fourier Transform of f(t) ∈ L²(R) is defined as

f̂(ω) = F(f)(ω) = ∫_R f(t)e^{−iωt} dt.
6.27. The inverse Fourier Transform of f̂(ω) ∈ L²(R) is given by

f(t) = F⁻¹(f̂)(t) = (1/2π) ∫_R f̂(ω)e^{iωt} dω.
6.2 Power Spectrum
6.30. In many situations, ∫_R f²(t) dt is proportional to the energy in a signal or system.
Think of f representing the amplitude of a sound wave or the voltage across the plates of
a capacitor. As a consequence of Parseval’s relation, given below, the Fourier transform
F (ω) of f (t ) can be thought of as describing how this energy is distributed amongst
frequencies present in the signal.
6.32. While Fourier theory has proven very useful in the study of deterministic linear systems, it cannot be applied to linear systems driven by noise without changes. Indeed, consider the LTI stochastic system

dx = Ax dt + B dw;   y = Cx.   (6.3)

We have seen in a previous lesson that the autocorrelation matrix Σ(t) = E(x(t)x′(t)) obeys the differential equation

Σ̇ = AΣ + ΣA′ + BB′.
¹ Precisely, (f1 ⋆ f2)(x) = ∫_R f1(y − x)f2(y) dy.
Recalling the definition of the controllability Gramian for the pair A, B, we conclude that
the covariance matrix of x(t ) is full rank if and only if the system is controllable.
6.34. Now let us focus our attention on the output y(t). We have that E(y(t)y′(t)) = CΣC′ in steady-state. If y(t) is scalar, then the variance of y(t) is CΣC′ in steady-state. This variance is nonzero if the system is controllable, and can be zero if the system is not. In the former case, we have signals with a constant variance in steady-state. Hence

∫_R E y²(t) dt diverges,

and the vast majority of signals y(t) are not in L²(R). In the latter case, the signal is zero almost everywhere. In both cases, Fourier theory is of no help in analyzing the signal.
6.35. Physically speaking, we see that we cannot apply Fourier theory since the signal y(t )
has infinite energy. However, observe that the following integral, where y(t ) is assumed
to be in steady-state,
∫_{−T}^{T} E y²(t) dt = 2T CΣC′
converges. Hence, we can make sense of the average power of y(t ) in the time interval
[−T ,T ]:
(1/2T) ∫_{−T}^{T} E y²(t) dt = CΣC′.
Generalized harmonic analysis deals with quantities such as the one above. In order to
continue further, we need to introduce two definitions related to stochastic processes.
6.36. A stochastic process x(t) is said to be stationary if for all positive integers k and real numbers t1, t2, …, tk, τ, the joint distribution of (x(t1 + τ), …, x(tk + τ)) is the same as that of (x(t1), …, x(tk)).
6.38. We show below, by computing its autocorrelation explicitly, that the process y(t )
defined in (6.3) is weak-sense stationary in steady-state if A is a stable matrix.
6.40. Ergodicity is a property that involves the sample paths of the random process,
whereas stationarity does not. In some sense, ergodicity says that all sample paths are
essentially the same and contain all statistical properties of the process. This is perhaps
best understood by exhibiting a process that is stationary but not ergodic. Consider the process
x(t ) = Y with Y ∼ N (0, 1).
The sample paths of this process are all constant and with a value Y , which is sampled
according to a N (0, 1). The process is obviously stationary since the distributions of x(t )
for all t are the same. The expectation of x(t ) is simply the expectation of Y and is thus
zero.
The sample average of any given path will, however, not be zero unless Y = 0, which happens with probability zero. Indeed,

lim_{T→∞} (1/2T) ∫_{−T}^{T} Y dt = Y.
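A small simulation (a sketch, not part of the notes) makes the point concrete: each sample path of the constant process has a time average equal to its own draw of Y, not to the ensemble mean 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-ergodic process: x(t) = Y for all t, with Y ~ N(0, 1).
# The time average along each path equals that path's Y; the ensemble mean is 0.
Ys = rng.standard_normal(5)
ts = np.linspace(-10.0, 10.0, 1001)
for Y in Ys:
    path = np.full_like(ts, Y)
    print(f"Y = {Y:+.3f}   time average = {path.mean():+.3f}")
```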
6.41. We now return to the spectral analysis of y(t ). We define the power spectrum of y(t )
as
Φ_y(ω) = lim_{T→∞} (1/2T) | ∫_{−T}^{T} y(t)e^{−iωt} dt |².   (6.5)
6.42. The following result relates the power spectrum of a real stationary, ergodic process to its autocorrelation function. The result goes under the name of Wiener-Khinchin theorem. Precisely, it says that

Φ_y(ω) = F(ϕ)(ω) = F(E y(t)y(t + τ)).   (6.6)

If the process is not real, the above equality holds with F(E y(t)y*(t + τ)) on the right-hand side. To prove that this equality holds, we start from the definition of the power spectrum, which we write as

Φ_y(ω) = lim_{T→∞} (1/2T) ∫_{−T}^{T} y(t)e^{−iωt} dt ∫_{−T}^{T} y(s)e^{iωs} ds.

Combining these integrals, we get

Φ_y(ω) = lim_{T→∞} (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} y(t)y(s)e^{−iω(t−s)} dt ds.

We now set τ = t − s and express the above double integral in terms of s and τ. We get

Φ_y(ω) = lim_{T→∞} (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} y(s + τ)y(s)e^{−iωτ} ds dτ.

Using (6.4) (ergodicity), the inner time average (1/2T) ∫_{−T}^{T} y(s + τ)y(s) ds converges to the autocorrelation of y, so that

Φ_y(ω) = lim_{T→∞} ∫_{−T}^{T} E[y(s)y(s + τ)]e^{−iωτ} dτ,
which proves the result.
6.3 Stochastic Realization
6.44. Using the expectation rule, we see that the expectation of x(t ) is
Ex(t ) = e At Ex(0).
Since A is stable, Ex(t ) is zero in steady-state.
6.45. We have shown in the previous lesson that Σ(t) = E x(t)x′(t) obeys the equation
Σ̇ = AΣ + ΣA′ + BB ′ .
6.46. We now evaluate the auto-correlation of x(t ). First, assume that τ > 0. We have
that
ds x(t + s ) = Ax(t + s )ds + Bdw t +s .
We multiply the above equation by x′(t) on the right and take expectations; since dw_{t+s} is independent of x(t), we get

d/ds E x(t + s)x′(t) = A E x(t + s)x′(t).
We know the initial condition for the above equation in steady-state at s = 0: this is
Ex(t )x ′(t ) = Σ. Hence
E x(t)x′(t + τ) = Σe^{A′τ} for τ ≥ 0.
6.47. Let us now look at the case τ < 0. The key point in the derivation above was that dw_{t+s} and x(t) were independent. We start by writing

d_s x(t + τ + s) = Ax(t + τ + s)ds + B dw_{t+τ+s}.

Now for s > 0, dw_{t+τ+s} and x(t + τ) are independent. The expectation rule applied to the above equation thus yields

d/ds E x(t + τ + s)x′(t + τ) = A E x(t + τ + s)x′(t + τ).

Again, we know that in steady-state, E x(t + τ)x′(t + τ) = Σ. Hence, integrating the above equation from s = 0 to s = −τ, we obtain

E x(t)x′(t + τ) = e^{−Aτ}Σ for τ ≤ 0.
6.48. We claim without proof that the processes x(t ) and y are ergodic.
6.49. We can evaluate the power spectrum of (6.7) from the autocorrelation function
using the Wiener-Khinchin theorem. To this end, first recall that
∫_0^∞ e^{At} e^{−st} dt = (Is − A)^{−1}.   (6.8)
6.50. Using the equation AΣ + ΣA′ + BB′ = 0, we can reexpress Φ_x(ω) in a slightly more useful form. To this end, add and subtract Σiω to the previous equation to obtain (we omit the identity matrices when adding scalars and matrices below)

(A + iω)Σ + Σ(A′ − iω) = −BB′.

Multiplying on the left by (A + iω)^{−1} and on the right by (A′ − iω)^{−1}, and observing that the left-hand side of the resulting equation is the power spectrum (6.9), we have proved that

Φ_x(ω) = (A + iω)^{−1}BB′(A′ − iω)^{−1}.
6.51. From the expression for the power spectrum of x, we can easily deduce the power
spectrum of y to be
Φy (ω) = C (A + iω)−1BB ′(A′ − iω)−1C ′ .
Because y is a scalar, C(A + iω)^{−1}B = B′(A′ + iω)^{−1}C′, a scalar being equal to its transpose. Set

ψ(ω) = B′(A′ − iω)^{−1}C′.

We have thus proved that the power spectrum of y has the following properties:
1. It is a rational function which can be expressed as

Φ_y(ω) = ψ(ω)ψ(−ω)

with ψ(ω) having its poles in the left half of the complex plane (recall that A is stable).
2. It is even, that is Φy (ω) = Φy (−ω).
3. It is positive.
Knowing the above, we now seek to answer the following question: "what kind of power spectra can we realize with a linear system?" The answer is given by the spectral factorization lemma.
6.52. Spectral factorization Lemma: Let q (s ) be an even, real, proper rational function that
is non-negative on s = iω. Then there exists r (s ), real, proper rational, and having no
poles in the half plane ℜ(s ) > 0 and such that
q (s ) = r (s )r (−s ).
6.53. The spectral factorization lemma thus tells us that we can realize every power spectrum q(s) satisfying the conditions above with a linear system.
6.54. Let r(s) be a real, proper rational function having no poles in the right half-plane:

r(s) = (q0 + q1 s + … + q_{n−1} s^{n−1}) / (p0 + p1 s + … + p_{n−1} s^{n−1} + s^n).
It is a well-known fact from linear systems theory that we can express this rational func-
tion as
r (s ) = C (I s − A)−1B
with
A = [ 0    1    0   ⋯   0
      0    0    1   ⋯   0
      ⋮              ⋱   ⋮
      0    0    0   ⋯   1
     −p0  −p1   ⋯  −p_{n−2}  −p_{n−1} ];   B = [0, 0, …, 0, 1]′;   C = [q0, q1, …, q_{n−2}, q_{n−1}].
The above is called the canonical controllable realization.
6.55. Hence, given a q (s ) that satisfies the conditions of the spectral factorization Lemma,
we can find r (s ) and define a LTI system driven by noise that has q (iω) as power spectrum.
Given an autocorrelation function ϕ(τ), we can take its Fourier transform to obtain
Φ(ω) and we know that we can design a linear system with this autocorrelation only if
Φ(ω) satisfies the conditions of the spectral factorization Lemma.
Consider, for example,

Φ(ω) = 1/(1 + ω²) + 1/(9 + ω²).
We can rewrite it as
Φ(ω) = (10 + 2ω²) / ((1 + ω²)(9 + ω²)) = [√2 (√5 + iω) / ((1 + iω)(3 + iω))] · [√2 (√5 − iω) / ((1 − iω)(3 − iω))].
We deduce that the following system is such that the power spectrum of y(t) is Φ(ω):

[dx1; dx2] = [ 0  1; −3  −4 ][x1; x2] dt + [0; 1] dw;   dy = [√10, √2][x1; x2] dt.
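We can check the factorization and the realization numerically; a minimal sketch comparing Φ(ω) with |G(iω)|² for G(s) = C(sI − A)^{−1}B, using the matrices above:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-3.0, -4.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[np.sqrt(10.0), np.sqrt(2.0)]])

for w in (0.5, 1.7, 4.0):
    G = (C @ np.linalg.inv(1j * w * np.eye(2) - A) @ B).item()
    phi = 1.0 / (1.0 + w**2) + 1.0 / (9.0 + w**2)
    print(w, abs(G) ** 2, phi)   # the last two columns agree
```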
6.57. Because ϕ(τ) is the inverse Fourier transform of a positive function (Φ(ω) is positive
by definition, see (6.5)), there should be some restrictions as to which functions of τ can
be autocorrelations of linear stationary systems. We can characterize these functions as
follows: let u(t ) be any function in L 2 . We have
∫_R Φ(ω)|û(ω)|² dω ≥ 0.   (6.10)
We use the following facts to find a characterization of the feasible autocorrelations:
1. From the Wiener-Khinchin theorem, we have that Φ(ω) is the Fourier transform of ϕ(τ) = E x(t)x(t + τ).
2. The Parseval relation tells us that (1/2π) ∫_R f̂(ω)ĝ(ω)* dω = ∫_R f(t)g(t)* dt.
3. The (inverse) Fourier transform of a product of functions is the convolution of their (inverse) Fourier transforms.
We now apply Parseval's relation to (6.10), where the integrand is written as the product (Φ(ω)û(ω))û(ω)*, and recall that the product becomes a convolution:

∫_R ∫_R ϕ(t − τ)u(τ)u(t) dτ dt ≥ 0.
Functions ϕ(s) that satisfy the above inequality for all u are called positive definite functions
in the sense of Bôchner. We will not continue with their study here, but mention that there
is a large literature dealing with such functions.
Lecture 7
Linear and Nonlinear filtering
We consider in this lecture the filtering problem for the system

dx = f(x)dt + g(x)dw
dy = h(x)dt + dν
where w and ν are independent Brownian motions. We have derived earlier in the course
an evolution equation for the density ρ(t, x) of x(t ): the Fokker-Planck equation. We take
a slightly unusual route: we first derive the evolution equation of a nonlinear filter
and then deduce from it the linear filtering (Kalman) equations. We will start by establish-
ing an evolution equation for the conditional density of a discrete-time Markov process
given an observation process. We also elaborate on smoothing in this section. We then
pass to the limit to obtain a stochastic differential equation for the conditional density
of a continuous-time Markov process. Said otherwise, we derive a discrete-state space
version of the Duncan-Mortensen-Zakai (DMZ) equation describing the unnormalized
conditional density. We derive both the Itō and Stratonovich formulations of this equation.
We then extend the results to continuous state-spaces and obtain the DMZ stochastic par-
tial differential equation. Lastly, we focus on linear systems and obtain the Kalman-Bucy
filter from the DMZ equation.
7.1 Conditional density for discrete Markov processes
7.1.1 Filtering
y(t ) ∼ c (y |x(t ), t )
where c(y|x, t) is the distribution of the observation conditioned on being in state x(t).
A very common model is the additive Gaussian noise model

y(t) = x(t) + n(t),

where the n(t) are independent, zero mean Gaussians with variance σ. In that case,

c(y|x(t), t) = (1/√(2πσ)) e^{−(y−x(t))²/(2σ)}.
Using the observation model and the transition probability matrix, we derive an equation for the evolution of the conditional probability p(x_t | y1, y2, …, y_t).
7.2. We start with an initial distribution p0 for the state x0 and we assume that we make a first observation y1 at time t = 1. What is the probability of x1 given this observation? Using Bayes' rule, we obtain

p(x1 = ωi | y1) = c(y1|ωi) p(x1 = ωi) / p(y1),   with p(x1 = ωi) = (Ap0)_i.
7.3. We now write a vector equation for the above relation. To this end, we define the column vector

p(x1|y1) = [p(x1 = ω1|y1), p(x1 = ω2|y1), …, p(x1 = ωn|y1)]′

and the diagonal matrix

B(y1) = diag(c(y1|ω1), c(y1|ω2), …, c(y1|ωn)).
In words, B is a diagonal matrix whose diagonal entries are the probabilities of the observation y1 given the possible states ωi.
We thus have
p(x 1 |y 1 ) = B(y 1 )Ap 0
7.4. To incorporate a second observation y2, we use the conditional independence relations

p(y2 | x2 = ωi, x1 = ωj, y1) = p(y2 | x2 = ωi)
p(y1 | x2 = ωi, x1 = ωj) = p(y1 | x1 = ωj).
p(x2 = ωi | y1, y2) = (1/p(y1, y2)) Σ_j c(y2|ωi) a_{ij} c(y1|ωj) p(x1 = ωj).
7.5. We can generalize the above formula to obtain the filtering or conditional density
equation:
p(x_k | y1, …, y_k) = (1/p(y1, …, y_k)) B(y_k)AB(y_{k−1})A ⋯ B(y1)Ap0.   (7.1)
7.6. Observe that the normalizing constant in Equation (7.1) is nothing less than the probability of observing the sequence of observables y1, …, y_k. In practice, one evaluates the products B(y_k)AB(y_{k−1}) ⋯ without normalizing the vectors until the last step. The normalizing constant (i.e. the sum of the entries of the vector B(y_k)AB(y_{k−1}) ⋯ B(y1)Ap0) is then p(y1, …, y_k).
7.7. The distinction between normalized density equation (that is Equation (7.1) above)
and its unnormalized counterpart (which is nothing else than (7.1) above without the
normalizing term p(y 1, . . . , y k )) is rather inconsequential now, but when dealing with con-
tinuous state-spaces, working with unnormalized densities greatly simplifies the evolution
equations. We get to this in the following sections.
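The recursion of 7.5 and 7.6 is straightforward to implement; the sketch below (a minimal example with a hypothetical two-state chain and Gaussian observation model) propagates the unnormalized vector and normalizes only at the end:

```python
import numpy as np

# Hypothetical two-state chain with states omega = (-1, +1),
# column-convention transition matrix A (p_{k+1} = A p_k) and
# Gaussian observation noise of variance sigma.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
omega = np.array([-1.0, 1.0])
sigma = 0.5
p0 = np.array([0.5, 0.5])

def B(y):
    """Diagonal matrix of observation likelihoods c(y | omega_i)."""
    c = np.exp(-(y - omega) ** 2 / (2 * sigma)) / np.sqrt(2 * np.pi * sigma)
    return np.diag(c)

ys = [0.8, 1.1, -0.7, 0.9]            # a made-up observation sequence
p = p0.copy()
for y in ys:
    p = B(y) @ A @ p                   # unnormalized filtering recursion (7.1)

print("p(y_1,...,y_k)       =", p.sum())        # normalizing constant, cf. 7.6
print("p(x_k | y_1,...,y_k) =", p / p.sum())
```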
7.1.2 Smoothing
Smoothing consists of estimating the value of the unknown signal x(t ) given future ob-
servations. Smoothing cannot, rather obviously, be performed in real time. Taking an
approach similar to the one taken for deriving the filtering equation (7.1), we derive an
equation for the update of p(x k |y k +1 ...y n ).
p(y 1 |x 0 = ωi , x 1 = ω j ) = p(y 1 |x 1 = ω j ) = c (y 1 |x = ω j ),
p(x0 | y1) = (1/p(y1)) ÃB(y1) p(x1),
where A˜ = A′D −1 with D the diagonal matrix with entries the sum of the columns of A′.
That is D is such that A˜ is a stochastic matrix.
7.9. Similarly to what was done in the previous section, we can show that in general

p(x0 | y1, …, y_k) = (1/N) ÃB(y1) ⋯ ÃB(y_{k−1}) ÃB(y_k) p(x_k),

where N is a normalizing constant. This is the smoothing or noncausal conditional estimation equation. Note that p(y1, …, y_N) may be different if evaluated using the smoothing or the filtering equation. This is because the equation requires the use of an initial condition, p(0) or p(n), that affects the result. If n grows very large, and the chain "mixes well", the effect of the initial conditions will disappear.
We now consider the problem of evaluating

p(x_k | y1, y2, …, y_n).   (7.2)
7.10. We derive first an update equation for p(y k +1, y k +2, . . . , y n |x k ), or the probability of
observing the remainder of the observation sequence given that we know the state at k .
This quantity will prove useful below to estimate (7.2). Following a standard notation,
we define
β(x k = ωi ) = p(y k +1, y k +2, . . . , y n |x k = ωi ).
We let β(x k ) denote the row vector whose i th entry is β(x k = ωi ). Observe that β(x k ) is
not a probability vector and its entries thus do not necessarily sum to one. We have
p(y1, y2, … y_n | x0 = ωi) = Σ_j p(y1, y2, … y_n, x1 = ωj | x0 = ωi)
  = Σ_j p(y1, y2, … y_n | x0 = ωi, x1 = ωj) p(x1 = ωj | x0 = ωi)
  = Σ_j p(y1, y2, … y_n | x1 = ωj) p(x1 = ωj | x0 = ωi)
  = Σ_j p(y2, … y_n | y1, x1 = ωj) p(y1 | x1 = ωj) p(x1 = ωj | x0 = ωi)
  = Σ_j p(y2, … y_n | x1 = ωj) p(y1 | x1 = ωj) p(x1 = ωj | x0 = ωi)
  = Σ_j β(x1 = ωj) c(y1|ωj) a_{ji}.
We thus see that the update for β is similar to the update for p(x0 | y1, …), except that the unnormalized backwards equation is used (i.e. we use A′ instead of Ã). Observe that an initial value of β is needed to start the recursion.
7.12. We have now the tools to address the problem of finding the most likely state given
the complete observation sequence:
p(x_k = ωi | y1, …, y_n) = (1/p(y1, …, y_n)) p(x_k = ωi, y1, …, y_n)
  = (1/p(y1, …, y_n)) p(y_{k+1}, …, y_n | x_k = ωi, y1, …, y_k) p(x_k = ωi, y1, …, y_k)
  = (1/p(y1, …, y_n)) p(y_{k+1}, …, y_n | x_k = ωi) α(x_k = ωi),

where α(x_k = ωi) := p(x_k = ωi, y1, …, y_k) is the unnormalized filtered density of 7.6.
We thus have
p(x_k = ωi | y1, …, y_n) = (1/p(y1, …, y_n)) β(x_k = ωi) α(x_k = ωi),
and we have seen how to evaluate all the quantities involved. Notice that if we only need
the most likely state, the knowledge of the normalizing constant is not necessary.
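Continuing the sketch given after 7.7, the backward vector β can be propagated with A′ and combined with the unnormalized forward vector α to obtain the smoothed marginals; the names and data below are again hypothetical:

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
omega = np.array([-1.0, 1.0])
sigma = 0.5
p0 = np.array([0.5, 0.5])
ys = [0.8, 1.1, -0.7, 0.9]

def c(y):
    return np.exp(-(y - omega) ** 2 / (2 * sigma)) / np.sqrt(2 * np.pi * sigma)

# forward pass: alpha_k(i) = p(x_k = omega_i, y_1, ..., y_k)
alphas, p = [], p0.copy()
for y in ys:
    p = c(y) * (A @ p)
    alphas.append(p.copy())

# backward pass: beta_k(i) = p(y_{k+1}, ..., y_n | x_k = omega_i), with beta_n = 1
betas, b = [None] * len(ys), np.ones(2)
for k in reversed(range(len(ys))):
    betas[k] = b.copy()
    b = A.T @ (c(ys[k]) * b)          # scalar form of the beta recursion above

for k in range(len(ys)):
    post = alphas[k] * betas[k]
    print(k, post / post.sum())        # p(x_k | y_1, ..., y_n)
```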
7.2 Conditional density for continuous-time Markov chains
7.13. The idea of continuous observations requires some explanation. First, recall that
if w t is a standard Brownian motion, we derived that
w t ∼ N (0, t )
7.14. To elaborate on the previous point, we first recall that, in the sense of distributions,

lim_{τ→0} (1/√(2πτ)) e^{−x²/(2τ)} = δ(x).

Formally, white noise is the derivative of Brownian motion:

ẇ = lim_{τ→0} (w_{t+τ} − w_t)/τ = lim_{τ→0} (1/√τ) · (w_{t+τ} − w_t)/√τ = lim_{τ→0} (1/√τ) N(0, 1),

since (w_{t+τ} − w_t)/√τ ∼ N(0, 1).
7.15. Observe that a white noise process is uncorrelated, in the sense that if t1 ≠ t2,

lim_{τ→0} E[ (1/τ)(w_{t1+τ} − w_{t1}) · (1/τ)(w_{t2+τ} − w_{t2}) ] = 0.
7.16. From the above, we conclude that the power spectrum of white noise, which by the Wiener-Khinchin theorem is the Fourier transform of E(ẇ(t)ẇ(t + τ)), is the Fourier transform of δ(τ), which is a constant. Hence we have shown that white noise has a flat power spectrum, that is

Φ_ẇ(ω) = 1.
The name white noise actually comes from this characterization of the process.
7.17. A noisy observation for a process x(t ) is called AWGN or additive white Gaussian
noise if it is of the form
ẏ = x(t ) + ẇ .
This notation is the most frequently used in the engineering literature. We will also
use the notation
dy = xdt + dw .
We point out that this latter equation is the one that should be used to simulate a white
noise process.
7.18. The possibility of instantaneous observations with fixed variance leads to perfect
observations via a simple average. The observation equation above should be interpreted
as follows: you can make a measurement of the system in a time τ (which you can assume
to be very small), your observation is then
(1/τ) ∫_0^τ dy = y(τ)/τ = (1/τ)( ∫_0^τ x dt + ∫_0^τ dw ) ≈ x + w_τ/τ.
τ 0 τ τ 0 0 τ τ
We now consider a continuous-time Markov chain x(t) with states ω1, …, ωn, whose probability vector evolves as

ṗ = Ap,

observed in additive white Gaussian noise:

ẏ = x(t) + ẇ(t).
7.20. We now derive a stochastic equation for the evolution of the vector p(x_{t+τ} | ẏ[0..t+τ]). To this end, assume that we know the conditional density p(x_t | y[0..t]). We can approximate p(x_{t+τ} | y[0..t+τ]) by making one discrete time-step of length τ and using the results of the previous section. The probability transition matrix is then e^{Aτ}.
e^{−(ẏ−ωi)²τ/2} ≃ 1 − (ẏ − ωi)²τ/2 + h.o.t.

Observe that we used here Stratonovich calculus, as there are no correction terms in the first order approximation.
7.25. If we approximate e^{Aτ} as I + Aτ, we get, using (7.5) (we write p_t for p(x_t | y[0..t])),

p_{t+τ} ≃ (I − (τ/2)ẏ²I − (τ/2)H² + τHẏ)(I + Aτ)p_t.
7.26. In (7.6), the term −(τ/2)ẏ² is independent of p and thus simply rescales p_t. To see this, consider the differential equations

ẋ = f(t)Ix + A(t)x

and

ż = A(t)z(t).

It is easy to verify that their respective solutions x(t) and z(t) are related by

x(t) = exp( ∫_0^t f(s) ds ) z(t).

Hence the term −(τ/2)ẏ² rescales the solution by exp( −(1/2) ∫_0^t ẏ² ds ). Since we are not keeping track of the normalizing constants, we ignore this term in (7.6).
7.27. If we replace ẏτ by d̄y (recall that we are using Stratonovich calculus), we obtain

d̄p_t = (A − ½H²)p_t dt + Hp_t d̄y.   (7.7)
The above equation is the unnormalized conditional density equation. If the observation model is

dy = h(x)dt + dw,

it is easy to see that the only modification to the above equation is in H, with

H = diag(h(ω1), h(ω2), …, h(ωn)).
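A time-discretized sketch of the recursion leading to (7.7): propagate with e^{Aτ}, reweight each state by the Gaussian observation likelihood, and renormalize for numerical stability. The chain, the function h, and the noise level below are hypothetical:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

# Hypothetical two-state continuous-time chain with generator A (columns sum to 0),
# observed through dy = h(x) dt + dw.
A = np.array([[-1.0, 2.0],
              [1.0, -2.0]])
h = np.array([0.0, 1.0])               # h(omega_1), h(omega_2)
tau, n_steps = 0.01, 2000

x = 0                                   # true (hidden) state index
p = np.array([0.5, 0.5])                # filtered probabilities
Phi = expm(A * tau)

for _ in range(n_steps):
    # simulate the chain over one step (approximate jump probability)
    if rng.random() < -A[x, x] * tau:
        x = 1 - x
    dy = h[x] * tau + np.sqrt(tau) * rng.standard_normal()

    # prediction + correction, as in 7.20-7.25
    p = Phi @ p
    p = p * np.exp(-(dy / tau - h) ** 2 * tau / 2.0)
    p = p / p.sum()                     # renormalize (we only need ratios)

print("final filtered probabilities:", p, " true state:", x)
```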
7.28. In order to obtain the Itō version of this equation, we can revisit our approximation
to e^{−(ẏ−h(ωi))²τ/2}. We have

e^{−(ẏ−h(ωi))²τ/2} = e^{−(ẏ² + h²(ωi) − 2ẏh(ωi))τ/2}
  = e^{−ẏ²τ/2} e^{−h²(ωi)τ/2} e^{ẏh(ωi)τ}
  = e^{−ẏ²τ/2} e^{−h²(ωi)τ/2} e^{h(ωi)dy}.

Expanding the last factor to first order in dt,

e^{h(ωi)dy} ≃ 1 + h(ωi)dy + ½h²(ωi)(dy)²
  ≃ 1 + h(ωi)dy + ½h²(ωi)(h(x)dt + dw)²
  ≃ 1 + h(ωi)dy + ½h²(ωi)(h²(x)dt² + 2h(x)dt dw + dw²)
  ≃ 1 + h(ωi)dy + ½h²(ωi)dt,

where we used the fact that dw² = dt and ignored the terms of order higher than one in dt. Thus the Itō correction term is ½H²dt, it cancels the −½H²dt factor above, and we conclude that the Itō formulation of (7.7) is simply

dp_t = Ap_t dt + Hp_t dy.
7.29. We can also obtain the Itō version directly from the Stratonovich form by first rewriting (7.7) as

d̄p_t = (A − ½H²)p dt + Hp(h(x)dt + dw).

Observe that the factor multiplying dw is Hp. Using results from a previous lesson, we know that the Itō formulation can be obtained from the Stratonovich formulation by adding ½ (∂(Hp)/∂p)(Hp) dt = ½H²p dt.
7.3 Nonlinear filtering: the Duncan-Mortensen-Zakai equation
7.30. Recall that if we have a continuous-time jump process with finite state space, described by

dx = Σ_i g_i(x)dN_i,

then the vector p of state probabilities obeys

ṗ = Ap.
7.31. The equivalent for a continuous state-space of the operator A, which describes the evolution of the density given the sample path equation, is the Fokker-Planck operator:

dx = f(x)dt + g(x)dw  ⟼  Lρ = −∂(fρ)/∂x + ½ Σ_{i,j} ∂²(g_i g_j ρ)/(∂x_i ∂x_j).

The unnormalized conditional density ρ(t, x) then obeys

∂ρ/∂t = (L − ½h²(x))ρ + d̄y h(x)ρ

∂ρ/∂t = Lρ + d̄y h(x)ρ
7.36. We will now derive an estimation procedure for x(t). To this end, we write the DMZ equation in Stratonovich form for the scalar system dx = dw, dy = x dt + dν:

∂ρ/∂t = ½ ∂²ρ/∂x² − ½x²ρ + ẏxρ.
7.37. Let us try with the parametric family of solutions ρ(t, x) = exp(a(t )x 2 +b(t )x +c (t )).
Plugging this value of ρ in the DMZ equation, we get
∂ρ/∂t (t, x) = (ȧx² + ḃx + ċ) e^{a(t)x² + b(t)x + c(t)}
for the time derivative.
Differentiating with respect to the state variable x, we obtain
∂ρ/∂x = (2ax + b)ρ
∂²ρ/∂x² = 2aρ + (2ax + b)²ρ.
Putting the above relations together, we get
(½ ∂²/∂x² − ½x²) e^{ax² + bx + c} = ½ (2a + (2ax + b)² − x²) e^{ax² + bx + c}.
7.38. We can now equate the coefficients of x², x¹ and x⁰ in the previous equation. We find

ȧ = 2a² − ½
ḃ = 2ab + ẏ
ċ = a + ½b².
7.39. The equations above give us the conditional density of x(t ) given the observation
process ẏ. We can rewrite the equations in a more enlightening form by rewriting the
Gaussian as
e^{ax² + bx + c} = (1/√(2πσ)) e^{−(x−x̂)²/(2σ)}.
If we take the above as a definition of σ and x̂, we find the relations
2σ = −1/a
x̂/σ = b
−x̂²/(2σ) − ½ log(2πσ) = c.
Using these relations, we can easily find the evolution equations for σ and x̂. For σ,
we have
σ̇ = ½ ȧ/a²
  = ½ (2a² − ½)/a²
  = ½ (2 − 1/(2a²))
  = 1 − σ².
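For completeness, this scalar Riccati equation can be solved in closed form (a short derivation, assuming |σ(0)| < 1):

```latex
\dot{\sigma} = 1 - \sigma^2
\quad\Longrightarrow\quad
\sigma(t) = \tanh\!\big(t + \operatorname{artanh}\,\sigma(0)\big),
```

so that σ(t) → 1 as t → ∞: the conditional variance of this scalar filter settles to the steady-state value 1, the positive root of 0 = 1 − σ².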
7.4 Linear filtering: the Kalman-Bucy filter
7.41. We start by establishing a simple relation that often goes under the name of mean-variance decomposition. Let z be a vector-valued random variable; we have

E∥z∥² = ∥E z∥² + E∥z − E z∥².

The mean-squared error of an estimator ĥ(y) of h(x) can be written as

E∥ĥ(y) − h(x)∥² = E_y e(h(y)),

where e(h(y)) is the conditional error e(h(y)) = ∫ ∥ĥ(y) − h(x)∥² p(x|y) dx. Hence, given a measurement y_m, we have e(h(y_m)) = E(∥ĥ(y) − h(x)∥² | y = y_m), and for each y_m we have to assign a value ĥ(y_m).
7.43. Applying the mean-variance decomposition to the conditional error above (with z = ĥ(y) − h(x) and the expectation conditioned on y), we see that the error is minimized by choosing

ĥ(y) = E(h(x) | y).
where A is the triangle in R2 with vertices at (0, 0), (1, 0) and (1, 1) as depicted below.
[Figure: the triangle A in the (x, y)-plane, with vertices (0, 0), (1, 0) and (1, 1).]
7.45. Orthogonal projection Let V be an inner product space, with inner product ⟨·, ·⟩
and W be a closed subspace of V . Let x ∈ V . The orthogonal projection of x onto W is
the vector π_W x of W satisfying

⟨x − π_W x, w⟩ = 0 for all w ∈ W.
⟨x − w1, x − w1⟩ ≤ ⟨x − π_W x, x − π_W x⟩.

We thus have

0 ≤ ⟨x − π_W x, x − π_W x⟩ − ⟨x − w1, x − w1⟩
  = ⟨w1 − π_W x, 2x − w1 − π_W x⟩
  = ⟨w1 − π_W x, x − π_W x⟩ + ⟨w1 − π_W x, x − w1⟩
  = ⟨w1 − π_W x, x − w1⟩,

where we used the fact that x − π_W x is orthogonal to all vectors in W. On the other hand,

⟨w1 − π_W x, x − w1⟩ = ⟨w1 − π_W x, x − π_W x + π_W x − w1⟩ = −∥w1 − π_W x∥²,

where we again used the orthogonality of x − π_W x to W. Combining the two relations gives ∥w1 − π_W x∥² ≤ 0, and hence w1 = π_W x.
7.47. Least squares and orthogonal projection Let W be a linear subspace of an inner
product space V . Let x ∈ V . We seek the closest point to x in W :
w* = arg min_{w ∈ W} ∥x − w∥².
7.48. In case V and W are finite dimensional, we can find an explicit formula for πW x.
Let w 1, . . . , w m be a basis for W . Then it is enough to check that ⟨x − πW x, w⟩ = 0 on a
basis of W . This yields
⟨x, w_i⟩ = ⟨π_W x, w_i⟩ for i = 1, …, m.

Now, because π_W x ∈ W, we can write it as a linear combination of the w_i's: π_W x = Σ_i a_i w_i. The coefficients a_i are then obtained by solving the linear system Σ_j ⟨w_j, w_i⟩ a_j = ⟨x, w_i⟩.
7.49. Theorem The conditional expectation X → E(X |G) is the orthogonal projection
from L 2 (Σ) onto L 2 (G).
7.50. MMSE Estimator From the above Theorem, we can immediately deduce that the
MMSE estimator, given observations y, is the conditional expectation of x given y:
x M M S E (y m ) = E(X |Y = y m ).
Assume that X = (X1, X2) is jointly Gaussian with mean (μ1, μ2) and covariance matrix with blocks Σ11, Σ12, Σ21, Σ22, that you can only measure X2, and that you are asked to estimate X1 given your observation of the realization of X2. The MMSE estimator of X1 given X2 is the conditional expectation of X1 given X2:

E(X1 | X2 = x2) = ∫ x1 p(x1|x2) dx1,

with p(x1|x2) = p(x1, x2)/p(x2) and p(x2) = ∫ p(x1, x2) dx1.
A straightforward, but lengthy, computation shows that X1 conditioned on X2 = x2 is distributed as a Gaussian with mean μ̄ and covariance Σ̄ obeying

μ̄ = μ1 + Σ12 Σ22^{−1} (x2 − μ2)
Σ̄ = Σ11 − Σ12 Σ22^{−1} Σ21.
We see that if X1 and X2 are uncorrelated, i.e. Σ12 = 0, then the conditional expectation of X1 given X2 is simply μ1. This should not come as a surprise, since if X1 and X2 are uncorrelated, then observing X2 should not affect the MMSE estimate of X1.
We also observe that Σ̄ is smaller than Σ11, and that the amount by which it is smaller
increases if Σ12 increases (i.e. if X 2 and X 1 are strongly correlated) and decreases if Σ22
is larger (i.e. if X 2 has a large variance, observing it does not help the estimation of X 1 ).
These relations are at the basis of the discrete-time Kalman filter.
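A quick numerical illustration of these formulas (hypothetical numbers): draw from a bivariate Gaussian and compare the empirical conditional mean and variance of X1 given X2 ≈ x2 with μ̄ and Σ̄:

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x2 = -1.5
mu_bar = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
Sigma_bar = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Monte-Carlo check: keep samples whose second coordinate is close to x2.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
mask = np.abs(samples[:, 1] - x2) < 0.05
print("formula:  ", mu_bar, Sigma_bar)
print("empirical:", samples[mask, 0].mean(), samples[mask, 0].var())
```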
7.53. We seek an unbiased estimator of x(t ) given past observations y. We look at the
class of estimators that obey the following equation:
ż = F (t )z (t ) + H (t )ẏ .
7.54. If we want E z(t) = E x(t) to hold, we need to impose some conditions on F and H. First, we need E z(0) = E x(0). Second, taking expectations in the equations for ẋ and ż and requiring that E ż = E ẋ, we obtain F + HC = A. Writing G = −H, this is

F(t) = A(t) + G(t)C(t).

Setting e = x − z, the error then obeys

ė = ẋ − ż = (A + GC)e + Gv̇ + Bẇ.
The covariance of the error, Σ_ee = E(ee′), hence obeys (see Lesson 4)
Σ̇ee = (A + GC )Σee + Σee (A + GC )′ + GG ′ + BB ′ .
7.56. Our objective is to find a value for G (t ) above so that the error covariance is small.
We show here that we can do that in a rather strong sense: we choose G so as to minimize
Σe e (t ) according to the Löwner partial order on positive definite matrices 1 .
To this end, consider the auxiliary Riccati equation
S˙ = AS + S A′ − SC ′C S + BB ′ (7.10)
S˙ = AS + S A′ − SC ′C S + BB ′ + GC S + S (GC )′ − GC S − S (GC )′
= (A + GC )S + S (A + GC )′ − SC ′C S + BB ′ − GC S − S (GC )′
¹ This partial order is such that A ≤ B ⇔ x′Ax ≤ x′Bx for all x ∈ Rⁿ.
In the above equation, S(t) is a known driving term (it obeys equation (7.10)). We can thus write the value of Σ_ee(t) − S(t) explicitly:

Σ_ee(t) − S(t) = ∫_0^t Φ_{A+GC}(t, s)(G + SC′)(G + SC′)′ Φ′_{A+GC}(t, s) ds,

where Φ_{A+GC}(t, s) is the fundamental solution of ẋ = (A + GC)x.
We conclude from the above expression that Σ_ee(t) − S(t) is positive semidefinite for every possible choice of G (observe that the integrand is a positive semidefinite matrix). The least value, in the Löwner partial order (and any other sensible order!), that we can achieve is zero, obtained by taking G = −SC′.
In that case, Σ_ee(t) = S(t).
7.58. We thus conclude that the causal, unbiased linear estimator of x that minimizes the variance of the error is

ẋ̂ = A(t)x̂ + S(t)C′(t)(ẏ − C(t)x̂),

where S(t) obeys the Riccati equation (7.10). This is the Kalman-Bucy filter.
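A minimal simulation sketch of the resulting filter (hypothetical system matrices; crude Euler discretization of the state, observation, Riccati and estimator equations):

```python
import numpy as np

rng = np.random.default_rng(4)

A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
dt, n_steps = 1e-3, 20_000

x = np.array([1.0, 0.0])          # true state
xhat = np.zeros(2)                 # estimate
S = np.eye(2)                      # error covariance, Riccati equation (7.10)

for _ in range(n_steps):
    dw = np.sqrt(dt) * rng.standard_normal(1)
    dv = np.sqrt(dt) * rng.standard_normal(1)
    x = x + (A @ x) * dt + (B @ dw)
    dy = (C @ x) * dt + dv                         # observation increment

    S = S + (A @ S + S @ A.T - S @ C.T @ C @ S + B @ B.T) * dt
    xhat = xhat + (A @ xhat) * dt + S @ C.T @ (dy - (C @ xhat) * dt)

print("estimation error:", np.linalg.norm(x - xhat))
print("steady-state S:\n", S)
```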
Lecture 8
Ergodicity and Markov Processes
We prove the ergodic theorem and show how it applies to Markov processes.
8.1. Let (X, B, µ) be a measure space. We call a map T : X ⟼ X measurable if for all A ∈ B, T⁻¹(A) ∈ B. We call a measurable map measure preserving if for all A ∈ B, µ(T⁻¹(A)) = µ(A). We call a function f : X ⟼ R integrable if

∫_X |f(x)| dµ < ∞.
8.2. We call a set A invariant for T , or simply invariant, if T −1 (A) = A. We claim that
invariant sets form a σ-field.
Lemma 8.1. Let X be a topological space and B a σ-field of sets of X . Let T : X 7−→ X be a
measurable map. The invariant sets of T form a σ-field.
Proof. We need to show that the countable union of invariant sets is invariant and that
the complement of an invariant set is invariant. Let Ai , i ≥ 0 be a collection of invariant
sets. Then T⁻¹(∪_i A_i) = {x ∈ X | T(x) ∈ ∪_i A_i} = ∪_i {x ∈ X | T(x) ∈ A_i} = ∪_i T⁻¹(A_i).
This proves the first part.
For the second part, denote by A^c the complement of A in X. Then T⁻¹(X) = T⁻¹(A ∪ A^c) = T⁻¹(A) ∪ T⁻¹(A^c) = X, where the last equality comes from T⁻¹(X) = X. Because T⁻¹(A) = A by assumption, we have A ∪ T⁻¹(A^c) = X; since T⁻¹(A^c) is disjoint from T⁻¹(A) = A, we conclude that T⁻¹(A^c) = A^c.
x t +1 = T (x t )
8.3. We call a map T ergodic if its σ-field I of invariant sets is {∅, X }. We now state
Birkhoff’s Ergodic Theorem.
8.1 Birkhoff’s ergodic theorem
Lemma 8.2 (Maximal inequality). Let (X, B, µ) be a measure space and T : X ⟼ X be a measure preserving map. Let f : X ⟼ R be an integrable function. Set f_0 = 0 and

f_n(x) = f + f∘T + ⋯ + f∘T^{n−1} = Σ_{i=0}^{n−1} f(Tⁱ(x)).   (8.1)

Furthermore, set

F_n(x) = max_{0≤j≤n} f_j(x)   (8.2)

and A_n = {x ∈ X | F_n(x) > 0}. Then ∫_{A_n} f dµ ≥ 0.
Proof. Because f is integrable, so are the fn and Fn . Hence An = Fn−1 (0, ∞) is measurable.
For the second part, first note that Fn ◦ T ≥ f j ◦ T for 0 ≤ j ≤ n by definition of Fn .
Hence
Fn ◦ T + f ≥ f j ◦ T + f = f j +1
where the last equality follows from the definition of f j . Because this holds for all 0 ≤
j ≤ n, we have
F_n∘T + f ≥ max_{1≤j≤n+1} f_j ≥ max_{1≤j≤n} f_j.
Note that we cannot conclude in general that the previous inequality holds for j = 0, but
if we restrict ourselves to the set An defined above, we have
Fn ◦ T + f ≥ Fn , on the set An .
8.5. We use the maximal inequality to establish the following result.

Lemma 8.3. Let g be an integrable, real-valued function on (X, B, µ). Let α ∈ R and set

B_α = { x ∈ X | sup_{n≥1} (1/n) Σ_{j=0}^{n−1} g(Tʲx) > α }.

Then, for any set A ∈ B with T⁻¹(A) = A, we have ∫_{A∩B_α} g dµ ≥ α µ(A ∩ B_α).

Proof. First observe that, writing g_n for the sums built from g as in Eq. (8.1),

B_α = ∪_{n=1}^∞ { x ∈ X | g_n(x) > nα }.

Define f = g − α and f_n and F_n as in Eqs. (8.1) and (8.2) respectively. Then we have

B_α = ∪_{n=1}^∞ { x ∈ X | f_n(x) > 0 }.
To see this, note that f_n(x) = g_n(x) − nα, so that f_n(x) > 0 for some n if and only if g_n(x) > nα for some n. These statements show that the last two expressions for B_α are equivalent.
As the reader can observe, the last expression for B_α begs the use of the maximal inequality. To this end, let

B_n := { x ∈ X | F_n(x) > 0 }

and denote by 1_{B_n} its indicator function. Observe that B_n ⊆ B_{n+1} and that ∪_n B_n = B_α. Since

|f 1_{B_n}| ≤ |f|,

we have

lim_{n→∞} ∫_{B_n} f dµ = ∫_{B_α} f dµ,

where we used the dominated convergence theorem (see Th. ?? below) for the last equality.
Finally, the maximal inequality of Lemma 8.2 says that for all n ≥ 1,

∫_{B_n} f dµ ≥ 0,

and thus ∫_{B_α} f dµ ≥ 0 or, equivalently,

∫_{B_α} g dµ ≥ α µ(B_α).
In case A ≠ X, note that since T⁻¹(A) = A, we can just apply the result just derived to the space (A, B_A, µ), where B_A is the σ-field obtained by intersecting all sets in B with A.
We thus have to show that f⁺ = f⁻ almost everywhere, and moreover that they are both equal to ∫_X f dµ.
Let α, β ∈ R and define

D_{α,β} = { x ∈ X | f⁻(x) < β and f⁺(x) > α }.
By definition, we have that f + (x) ≥ f− (x). We thus want to show that the set {x ∈ X |
f + (x) > f− (x)} is of measure zero. We can write this set as the (countable) union, over
β < α and α, β rational numbers, of D α, β . It thus suffices to show that µ(D α, β ) = 0 if
β < α.
From Eq. (8.4), we have that
T −1D α, β = D α, β .
We then have that D_{α,β} ∩ B_α = D_{α,β}; to see that this relation holds, note that sup_{n≥1} f̃_n(x) ≥ lim sup_n f̃_n(x) = f⁺(x) > α on D_{α,β}. We can apply Lemma 8.3 with A = D_{α,β} to obtain

∫_{D_{α,β}} f dµ = ∫_{D_{α,β} ∩ B_α} f dµ ≥ α µ(D_{α,β} ∩ B_α) = α µ(D_{α,β}).   (8.5)
Now apply the same reasoning to the integrable function g = −f and the real numbers α1 = −β, β1 = −α. The corresponding set is

E_{α1,β1} = { x ∈ X | −f⁺(x) < −α, −f⁻(x) > −β } = { x ∈ X | f⁺(x) > α, f⁻(x) < β } = D_{α,β},

and the analogue of Eq. (8.5) reads ∫_{D_{α,β}} (−f) dµ ≥ −β µ(D_{α,β}), that is, ∫_{D_{α,β}} f dµ ≤ β µ(D_{α,β}). Together with Eq. (8.5), this yields α µ(D_{α,β}) ≤ β µ(D_{α,β}). Since β < α, this is only possible if µ(D_{α,β}) = 0. We have thus shown that f⁺ = f⁻ almost everywhere, and thus that lim_{n→∞} f̃_n(x) is well-defined for almost all x ∈ X. It remains to show that it is equal to ∫_X f dµ.
We first show that f⁺ is integrable, and then show that ∫_X f⁺ dµ = ∫_X f dµ.
To show that f + is integrable, we use Fatou’s lemma (see Th. ?? below). To this end,
note that
∫ |f̃_n| dµ ≤ (1/n) Σ_{j=0}^{n−1} ∫ |f∘Tʲ| dµ = ∫ |f| dµ.
Finally, it remains to show that ∫_X f⁺ dµ = ∫_X f dµ. To do so, we shall use Lemma 8.3 again. First, introduce the set

C_{n,q} = { x ∈ X | q/n ≤ f⁺ ≤ (q+1)/n }.
Because we have shown that f⁺ = f⁺∘T, then T⁻¹C_{n,q} = C_{n,q}. Furthermore, for all δ > 0,

C_{n,q} ∩ B_{q/n−δ} = C_{n,q}

by definition of B_α. From Lemma 8.3 (with A = C_{n,q}),

∫_{C_{n,q}} f dµ ≥ (q/n − δ) µ(C_{n,q}).

Since this holds for any δ > 0, we have that ∫_{C_{n,q}} f dµ ≥ (q/n) µ(C_{n,q}). From the definition of C_{n,q}, we also have that ∫_{C_{n,q}} f⁺ dµ ≤ ((q+1)/n) µ(C_{n,q}) and thus

∫_{C_{n,q}} f⁺ dµ ≤ (1/n) µ(C_{n,q}) + (q/n) µ(C_{n,q}) ≤ (1/n) µ(C_{n,q}) + ∫_{C_{n,q}} f dµ.
Summing the previous inequality over q and noting that

X = ∪_{q∈Z} C_{n,q},

so that Σ_{q∈Z} µ(C_{n,q}) = µ(X), we obtain

∫_X f⁺ dµ ≤ (1/n) µ(X) + ∫_X f dµ.

Because the previous inequality holds for all n > 0, we obtain the inequality

∫_X f⁺ dµ ≤ ∫_X f dµ.   (8.7)
and thus f* = E(f | I). Finally, if I = {∅, X}, then f* is constant almost everywhere and

lim_{n→∞} (1/n) Σ_{i=0}^{n−1} f(Tⁱx) = ∫_X f dµ.
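A quick numerical illustration of the ergodic theorem (a sketch, not part of the notes): the irrational rotation T(x) = x + √2 mod 1 on [0, 1) preserves Lebesgue measure and is ergodic, so time averages of an observable converge to its space average:

```python
import numpy as np

alpha = np.sqrt(2.0)                    # irrational rotation angle
f = lambda x: np.cos(2 * np.pi * x)     # an integrable observable

x, total, n = 0.1, 0.0, 100_000
for _ in range(n):
    total += f(x)
    x = (x + alpha) % 1.0

print("time average :", total / n)      # -> 0
print("space average:", 0.0)            # \int_0^1 cos(2 pi x) dx = 0
```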
8.7. Given a measure preserving map T : X ⟼ X, we denote by M its set of invariant measures: hence a measure µ ∈ M if µ(T⁻¹(A)) = µ(A) for all A ∈ B. It is easy to see that invariant measures form a convex subset of the set of all measures. Indeed, for α1, α2 ≥ 0 and such that α1 + α2 = 1, we have

(α1µ1 + α2µ2)(T⁻¹(A)) = α1µ1(T⁻¹(A)) + α2µ2(T⁻¹(A)) = α1µ1(A) + α2µ2(A) = (α1µ1 + α2µ2)(A).
Recall that an extreme point of a convex set is a point µ that cannot be expressed as a
∑
convex combination i αi µi of µi without having for some i that αi = 1.
We say that a measure µ is ergodic for the transformation T if for any A ∈ I, the σ-field of invariant sets of T, µ(A) = 0 or µ(A) = 1. We have the following result:

Proposition 8.1. A measure µ ∈ M is ergodic for T if and only if it is an extreme point of M.
Proof. We first assume that µ ∈ M is not extremal and show that µ is not ergodic. Because
µ is assumed non-extremal, there exists α ∈ (0, 1) and µ1, µ2 ∈ M, µ1 ≠ µ2, such that
µ = α µ1 + (1 − α)µ2 . If µ were ergodic, then for all A ∈ I, µ(A) = 0 or µ(A) = 1. Because
α is positive, we conclude that either µ1 (A) = µ2 (A) = 0 or µ1 (A) = µ2 (A) = 1. Hence µ1
and µ2 agree on I. Now let f be an integrable function. From the ergodic theorem, we
know that
f⁺(x) := lim_{n→∞} (1/n) Σ_{j=1}^{n} f(Tʲ(x)) = E_{µi}(f | I),
where the expectation is over any measure µi ∈ M. From the last relation, we conclude
that f + (x) is measurable with respect to I. Integrating f + over X , we have
∫_X f⁺(x) dµi = ∫_X f dµi.
Since µ1 and µ2 agree on I and f⁺ is I-measurable, the left-hand side is the same for µ1 and µ2; hence ∫_X f dµ1 = ∫_X f dµ2. Since f is arbitrary, this yields µ1 = µ2, which is a contradiction. Hence µ is not ergodic.
We now assume that µ is not ergodic and show that it can be expressed as the non-
trivial convex combination of two invariant measures. Since µ is not ergodic, there exists
a set F ∈ I with 0 < µ(F ) < 1. We define the measures µ1 and µ2 as follows: for A ∈ B,
µ1(A) = µ(A ∩ F)/µ(F)   and   µ2(A) = µ(A ∩ F^c)/µ(F^c),

where F^c = X − F is the complement of F in X. We note that the µi are invariant; we show it for µ1, a similar proof holding for µ2:
x_t : Ω ⟼ Rⁿ.

In fact, we can think of Ω as (Rⁿ)^T and of B as the σ-field on Ω generated by products of sets F_i in the Borel σ-field of Rⁿ. Hence, x_t(ω) can be thought of as the value of the random path ω at t. Note that the set-up allows for more general situations, but keeping this simple scenario in mind can be helpful. In general, if F is the σ-field of Borel sets in Rⁿ, then B is generated by the sets

{ ω : x_{t1}(ω) ∈ F1, …, x_{tk}(ω) ∈ Fk },   t1, …, tk ∈ T, F_i ∈ F.
8.10. Having described the state space Ω and a σ-field which made our random variables
x t measurable, we now set-up to put a measure on Ω. Describing explicitly a measure
on a set of paths – an infinite-dimensional space – is rarely done without first passing by
finite-dimensional probability distributions and using Kolmogorov’s consistency theorem,
which we state below. The idea is that while it might be difficult to describe a generic
set in Ω and its measure, we can easily look at "time slices" and characterize the paths
x t (ω) such that x t1 (ω) ∈ F1 for F1 ∈ F . Generalizing this idea, we might want to try to
define P by giving finite dimensional probability measures on nk and declare: R
µt1,...,tk (F1 × F2 × · · · × Fk ) = P (x t1 ∈ F1, · · · , x tk ∈ Fk ).
One can easily construct such finite dimensional measures: the simplest example would be to make all time-slices independent: µ_{t1,…,tk}(F1 × F2 × ⋯ × Fk) = ∏ µ_{ti}(F_i). This
does not yield many interesting processes. Another way to obtain µ is by specifying a
distribution of increments of the path x t +δ − x t . This is what was done for the definition
of Brownian motion earlier in the notes.
The measures µ need to satisfy some consistency conditions. The first is that for any σ ∈ S_k, where S_k is the permutation group on k elements, we need to have

µ_{t_{σ(1)},…,t_{σ(k)}}(F_{σ(1)} × ⋯ × F_{σ(k)}) = µ_{t1,…,tk}(F1 × ⋯ × Fk).   (8.9)

The second is that, for m > k,

µ_{t1,…,tk}(F1 × ⋯ × Fk) = µ_{t1,…,tk,t_{k+1},…,t_m}(F1 × ⋯ × Fk × Rⁿ × ⋯ × Rⁿ).   (8.10)
The Kolmogorov consistency Theorem says that if the measures µ satisfy the above two requirements, then they can be used to define a measure on all of Ω.

Theorem 8.4 (Kolmogorov Consistency Theorem). For k ∈ N and t1, t2, ⋯, tk ∈ T, let µ_{t1,⋯,tk} be probability measures on R^{nk} satisfying the conditions of Eqs. (8.9) and (8.10). Then there exists a probability space (Ω, B, P) and a stochastic process x_t on Ω such that

P(x_{t1} ∈ F1, ⋯, x_{tk} ∈ Fk) = µ_{t1,⋯,tk}(F1 × ⋯ × Fk)

for all k, all t_i ∈ T and all Borel sets F_i.
8.11. We now turn our attention to Markov processes. Informally speaking, these are processes for which the future states only depend on the present state, and not on past states. Hence a Markov process should satisfy the relation

P(x_{t_{k+1}} ∈ F | x_{t1}, ⋯, x_{tk}) = P(x_{t_{k+1}} ∈ F | x_{tk})   for t1 < t2 < ⋯ < t_{k+1}.
8.12. Let t1 ≤ t2 ≤ t3 and let f, g be integrable functions. The Markov property for x_t implies that

E(g(x_{t3}) | F_{t1}^{t1} × F_{t2}^{t2}) = E(g(x_{t3}) | F_{t2}^{t2}).

The so-called time-reversed Markov property states that

E(f(x_{t1}) | F_{t2}^{t2} × F_{t3}^{t3}) = E(f(x_{t1}) | F_{t2}^{t2}).

We will show that the Markov property and time-reversed Markov property are equivalent.
Lemma 8.4. Let x t be a Markov process with distribution P on (Ω, B), let f and g be measur-
able functions and t1 < t2 < t3 . Then the following relations hold:
1. E(g(x_{t3}) | F_{t1}^{t1} × F_{t2}^{t2}) = E(g(x_{t3}) | F_{t2}^{t2}).
2. E(f(x_{t1}) | F_{t2}^{t2} × F_{t3}^{t3}) = E(f(x_{t1}) | F_{t2}^{t2}).
3. E(f(x_{t1})g(x_{t3}) | F_{t2}^{t2}) = E(f(x_{t1}) | F_{t2}^{t2}) E(g(x_{t3}) | F_{t2}^{t2}).
Proof. The first relation is equivalent to the Markov property of x_t. We now show that relation 3 holds:

E(g(x_{t3})f(x_{t1}) | F_{t2}^{t2}) = E( E(g(x_{t3})f(x_{t1}) | F_{t2}^{t2} × F_{t1}^{t1}) | F_{t2}^{t2} )
  = E( f(x_{t1}) E(g(x_{t3}) | F_{t2}^{t2} × F_{t1}^{t1}) | F_{t2}^{t2} )
  = E( f(x_{t1}) E(g(x_{t3}) | F_{t2}^{t2}) | F_{t2}^{t2} )
  = E( f(x_{t1}) | F_{t2}^{t2} ) E( g(x_{t3}) | F_{t2}^{t2} ),

where we used relation 1 to go from the second to third line. Reciprocally, since f and g are arbitrary, relation 3 implies that E(g(x_{t3}) | F_{t2}^{t2}) = E(g(x_{t3}) | F_{t1}^{t1} × F_{t2}^{t2}) almost everywhere. We can similarly show that relations 2 and 3 are equivalent.
8.13. We now show how to construct a Markov process via a transition probability function.
In words, it is a function that gives us a measure for the state of the process t seconds in
the future, given that we are in a given state x now. Precisely, it is a measure Pt (x, ·) for
all t ∈ T, x ∈ Rⁿ:

(t, x) ⟼ P_t(x, ·) is a measure on Rⁿ.
For a Borel set F of Rⁿ, P_t(x, F) is the probability that x_{s+t} ∈ F given that x_s = x. Said otherwise,

P_x(x_{t1} ∈ F) = P_{t1}(x, F),
where we use the subscript x to indicate the initial state of the process. If the initial state
of the process is not x, but is instead distributed according to a distribution µ(dx), we have

P_µ(x_{t1} ∈ F1, x_{t2} ∈ F2, ⋯, x_{tl} ∈ Fl) := ∫_{Rⁿ} µ(dx) P_x(x_{t1} ∈ F1, x_{t2} ∈ F2, ⋯, x_{tl} ∈ Fl).   (8.11)
To a Markov process we associate the family of operators E_t, acting on bounded measurable functions via

E_t f(x) = E_x(f(x_t)),   (8.12)

where x is the initial state of the process at 0. We have from the Chapman-Kolmogorov equation that

E_t f(x) = ∫_{Rⁿ} P_t(x, dx1) f(x1)
  = ∫_{Rⁿ} ∫_{Rⁿ} P_s(x, dx_s) P_{t−s}(x_s, dx1) f(x1)
  = ∫_{Rⁿ} P_s(x, dx_s) E_{t−s} f(x_s)
  = E_s(E_{t−s}(f))(x).

Hence

E_{t+s} = E_t ∘ E_s

and the semi-group property indeed holds.
Lemma 8.5. Let x t be a Markov process and Et the associated semi-group defined in Eq. (8.12).
Then the following properties hold:
1. If f is a positive function, so is Et f .
2. Et is a contraction for the L ∞ norm.
Proof. The proof of the first item is obvious from the definition of Et . For the second
item, we have
∥E_t f∥_∞ = sup_{x∈Rⁿ} | ∫_{Rⁿ} P_t(x, dx_t) f(x_t) | ≤ sup_{x∈Rⁿ} ∫_{Rⁿ} P_t(x, dx_t) · sup_{x∈Rⁿ} |f(x)| = ∥f∥_∞,
where we used the fact that Pt (x, dx) integrates to one.
8.15. We can introduce an operator similar to E_t but that acts on measures instead of on functions. Assume that a Markov process is initialized randomly at time 0, according to a distribution µ. What is the distribution of the states at time t? Denote by µ_t the distribution of the states at time t. We have already seen that for F a Borel set, we have

µ_t(F) = P(x_t ∈ F) = ∫_{Rⁿ} µ(dx) P_t(x, F).

We define

(M_t µ)(F) = ∫_{Rⁿ} µ(dx) P_t(x, F).
From the Chapman-Kolmogorov equation, we conclude that the operators Mt form a
semi-group:
Mt +s = Mt ◦ Ms .
The Brownian motion process introduced earlier in these notes is the Markov process
with transition density
p_t(x, x1) = (1/(2πt)^{n/2}) exp( −∥x − x1∥²/(2t) ).
Furthermore, one can make sense of the infinitesimal generators of M_t and E_t. We define

Lµ = lim_{t→0} (1/t)(M_t µ − µ).

Note that L is nothing more than the Fokker-Planck operator we have introduced in an earlier lecture. Its dual L* is

L*f = lim_{t→0} (1/t)(E_t f − f).
A measure µ is invariant for the Markov process if the process initialized with µ is stationary, that is, if for all h,

P(x_{t1+h} ∈ F1, …, x_{tk+h} ∈ Fk) = P(x_{t1} ∈ F1, …, x_{tk} ∈ Fk)

holds. Recall that P(x_t ∈ F) = M_tµ(F) = ∫_{Rⁿ} µ(dx) P_t(x, F). Thus if the process is invariant, we conclude that

M_tµ = M_{t+s}µ = µ

and hence

Lµ = 0.

Hence the invariant measures for the process are in the kernel of the Fokker-Planck operator. Note that if we initialize the process with an invariant measure µ at t = 0, we obtain a stationary process with distribution P_µ whose marginal at each time t ∈ R is µ, i.e., P_µ(x_t ∈ F) = µ(F) for all t and all Borel sets F.
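For a finite-state chain, where the generator A plays the role of the Fokker-Planck operator, the same statement reads Aµ = 0: the invariant distribution lies in the kernel of the generator. A small numerical sketch with a hypothetical generator:

```python
import numpy as np

# Hypothetical generator of a 3-state chain; columns sum to zero, so that
# p_dot = A p preserves total probability.
A = np.array([[-2.0, 1.0, 0.5],
              [1.5, -1.0, 0.5],
              [0.5, 0.0, -1.0]])

# invariant measure: the (suitably normalized) null vector of A
w, v = np.linalg.eig(A)
mu = np.real(v[:, np.argmin(np.abs(w))])
mu = mu / mu.sum()

print("invariant measure:", mu)
print("A @ mu =", A @ mu)        # ~ 0
```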
We denote by M̄ the set of stationary distributions of the Markov process:

M̄ := { µ | M_tµ = µ for all t ∈ R }.

The shift operator τ_s acts on sample paths via

τ_s(x_t(ω)) = x_{t+s}(ω),

where t0 is arbitrary because the process is stationary, and I is the σ-field of invariant sets for the shift-operator. Observe that since the marginal of P_µ for t0 is µ, the expectation of the right-hand-side of the above equation is taken with respect to µ. We call a distribution P_µ for a Markov process ergodic if for all A ∈ I, P_µ(A) = 0 or P_µ(A) = 1. We have seen in Prop. 8.1 that P_µ is ergodic if and only if it is an extreme point of the set of invariant measures of the shift-operator. We can characterize ergodic measures as follows:
Theorem 8.5. A stationary distribution µ for a Markov process is an extremal point of M̄ (the
set of stationary distributions) if and only if P µ is ergodic (an extremal point of the set M of
invariant measures of the shift-operator.)
Proof. Note that the map that sends µ to P µ , described in Eq. (8.11) is linear. Hence, if
P µ is an extreme point of M, then µ is an extreme point of M̄ (it can be shown via a
simple contradiction argument).
We now show that if µ is an extremal point of M̄, then so is P µ . To see this, let H ∈ I
be a non-trivial invariant set for the shift operator and recall that we set F_t := F_t^∞ and F^t := F_{−∞}^t. Because H is invariant for the shift operator, it is both in the future F_t and the past F^{−t} for t ≫ 0. From Lemma 8.4 (relation 3, applied with f = g = 1_H, viewing H both as an event of the past and of the future of time 0), we thus have that

P_µ(H | F_0^0) = P_µ(H ∩ H | F_0^0) = P_µ(H | F_0^0)².

Hence, P_µ(H | F_0^0) = 0 or P_µ(H | F_0^0) = 1. Hence there exists a Borel set F such that H = {ω : x_t(ω) ∈ F for all t ∈ R}. Since H is non-trivial, 0 < µ(F) < 1. Furthermore,
note that if the process starts in F (resp. F c ) at time 0, it never leaves F (resp. F c ), in
the sense that x t (ω) ∈ F for all t > 0. Hence Pt (x, F ) = 1 (resp. Pt (x, F c ) = 1) for all
t > 0. Hence µ is not extremal.
Bibliography