
STAT 333 Course Notes

University of Waterloo

The One And Only


Waterloo 76er
Bill Zhuo

Free Material & Not For Commercial Use


Contents

I Basic Settings
0.1 General Outline 7
0.2 Basic Settings 7

II STAT 333 Main Part


1 (Discrete-Time) Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1 Review on Conditional Probability 13
1.2 Markov Chains (DTMC) 13
1.3 Transitional Matrix 14
1.4 C-K Equation 15
1.4.1 Visualization of Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Vector-Matrix Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Expected Value of A Function of Xn 19
1.5.1 Review of Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.2 Expectation of f (Xn ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6 Stationary Distribution 24
1.6.1 Why such π is called stationary? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6.2 Existence & Uniqueness of π? Convergence to π? . . . . . . . . . . . . . . . . . . . . . . 24
1.7 Recurrence & Transience 25
1.8.1 Communicating Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.8.2 Strong Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.8.3 Alternative Characterizations for Recurrence/Transience . . . . . . . . . . . . . . . . . 31
1.10.1 Short Review on Indicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

1.12 Existence of a Stationary Distribution 34


1.13 The Period of A State 35
1.15 Main Theorems 37
1.19 Roles of Different Conditions 45
1.19.1 Irreducibility (I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.19.2 Aperiodicity (A) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.19.3 Recurrent (R) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.20 Special Examples 45
1.20.1 Detailed Balance Condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.20.2 Time Reversibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.20.3 The Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.21 Exit Distribution 48
1.21.1 Basic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.21.2 First-Step Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.22 Exit Time 52
1.22.1 Basic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
1.22.2 General Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.23 Infinite State Space 54
1.23.1 Basic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.23.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.24 Branching Process (Galton-Watson Process) 60
1.24.1 Basic Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1.24.2 Extinction Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.25 Review on DTMC 65
1.25.1 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.25.2 Classification and Class Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.25.3 Stationary Distribution and Limiting Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . 68
1.25.4 Exit Distribution and Exit Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.25.5 Two Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
I
Basic Settings

0.1 General Outline


0.2 Basic Settings
Review & Introduction

0.1 General Outline


1. Bi-weekly homework (10%), Monday before tutorial starts
2. Assessments:
(a) 6 Homework (drop lowest one) (10%)
(b) 2 Midterms (50% = 2 × 25%)
(c) Final (40%)

0.2 Basic Settings


Definition 0.2.1 — Probability Space. A probability space consists of a triplet (Ω, E , P) where
1. Ω is the sample space, the collection of all the possible outcomes of a random experiment
2. E is the σ -algebra, the collection of all the “events". An event E is a subset of Ω, for
which we can talk about probability.
3. P is a probability measure, that is a set function (a mapping from event to a real number)
P : E → R with P(E) is the result. This mapping needs to satisfy the probability axioms:
(a) ∀E ∈ E , 0 ≤ P(E) ≤ 1
(b) P(Ω) = 1
(c) For countably many disjoint events E1, E2, . . . , we have

P(∪_{i=1}^∞ Ei) = ∑_{i=1}^∞ P(Ei)

Definition 0.2.2 — Random Variable. A random variable (r.v.) X is a mapping from Ω to R.

X : Ω → R
ω ↦ X(ω)
Definition 0.2.3 — Stochastic Processes. 1. "Process" is the idea of change over time.
2. "Stochastic" is just the idea of randomness. The etymology is ancient Greek, meaning
"aim at" or "guess". The term was not used in this sense until the beginning of the 20th
century, by Khinchine. There is a political story behind this, which the notes will not go into.
The word was later translated into English by Feller and Doob. And the rest is history.
A stochastic process can be viewed as
(a) a sequence/family of random variables (simpler, so we take this as the definition), or
(b) a random function (harder to formulate).
3. Formal Definition: A stochastic process {Xt}t∈T is a collection of random variables
defined on a common probability space, where T is an index set. In most cases, T
corresponds to "time", which can be either discrete or continuous.
(a) In the discrete case, we rather write {Xn}n=0,1,...

Definition 0.2.4 — States. The possible values of Xt, t ∈ T are called the states of the process.
Their collection is called the state space, often denoted by S. The state space can be either
discrete or continuous. The most famous example of a continuous state space is Brownian motion.
In this course, however, we focus on discrete state spaces, so that we can relabel
the states in S into the so-called standardized state space

S∗ = {0, 1, 2, . . . } ←− Countable State Space

or
S∗ = {0, 1, 2, . . . , n} ←− Finite State Space

 Example 0.1 — White Noise. Let X0 , X1 , . . . be i.i.d r.v.s following certain distribution. Then,
{Xn }n=0,1,2,... is a stochastic process. This is also known as (“White Noise"). 

 Example 0.2 — Simple Random Walk. Let X1, X2, . . . be i.i.d r.v.s with, for each one of them,

P(Xi = 1) = p,   P(Xi = −1) = 1 − p

Define S0 = 0 and, for the other terms,

Sn = ∑_{i=1}^n Xi

Then, {Sn}n=0,1,2,... is a stochastic process, with state space

S = Z

This {Sn}n=0,1,2,... is called a "simple random walk". Note that

Sn = Sn−1 + 1 with probability p,   Sn = Sn−1 − 1 with probability 1 − p
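A short simulation sketch of this random walk (assuming Python with numpy; the value of p, the horizon and the seed below are illustrative choices, not part of the notes):

import numpy as np

def simple_random_walk(n_steps, p=0.5, seed=0):
    """Simulate S_0, S_1, ..., S_n for the simple random walk."""
    rng = np.random.default_rng(seed)
    # each step is +1 with probability p and -1 with probability 1 - p
    steps = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])
    # S_0 = 0, and S_n is the cumulative sum of the first n steps
    return np.concatenate(([0], np.cumsum(steps)))

path = simple_random_walk(10, p=0.5)
print(path)   # one realisation of S_0, ..., S_10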


1. Q: Now the question is why do we need the concept of stochastic process? Why don’t we
just look at the joint distribution of (X0 , X1 , . . . , Xn )?
2. A: The joint distribution of a large number of r.v.s is very complicated, because it does not
take advantage of the special structure of T , which is time. For example, for simple random
walk, the full distribution of (S0 , S1 , . . . , Sn ) is complicated for n large. However, the structure
is actually simple if we focus on adjacent terms.
Sn+1 = Sn + Xn+1

[Figure 0.2.1: Formulation of the stochastic process definition — adding randomness turns a number into a random variable; adding change over time to a random variable (a random function of time) gives a stochastic process.]

where Sn is the last value and Xn+1 is a fresh random input. Also, note that Sn and Xn+1
are actually independent by definition. By introducing time into the framework, things can
often be greatly simplified. More precisely, from this simple random walk example, we
found that for {Sn}n=0,1,2,..., if we know Sn, then the distribution of Sn+1 will not depend on
the history Si, ∀0 ≤ i ≤ n − 1. This is a very useful property and it motivates the notion of
Markov Chain.
II
STAT 333 Main Part

1 (Discrete-Time) Markov Chains . . . . . . . 13


1.1 Review on Conditional Probability
1.2 Markov Chains (DTMC)
1.3 Transitional Matrix
1.4 C-K Equation
1.5 Expected Value of A Function of Xn
1.6 Stationary Distribution
1.7 Recurrence & Transience
1.12 Existence of a Stationary Distribution
1.13 The Period of A State
1.15 Main Theorems
1.19 Roles of Different Conditions
1.20 Special Examples
1.21 Exit Distribution
1.22 Exit Time
1.23 Infinite State Space
1.24 Branching Process (Galton-Watson Process)
1.25 Review on DTMC
1. (Discrete-Time) Markov Chains

1.1 Review on Conditional Probability


Definition 1.1.1 — Conditional Probability. The conditional probability of an event B given an
event A with P(A) > 0 is given by

P(B|A) = P(B ∩ A) / P(A)

Theorem 1.1.1 Let A1, A2, . . . be disjoint events such that

∪_i Ai = Ω

then we have
1. Law of Total Probability:

P(B) = ∑_{i=1}^∞ P(B|Ai) · P(Ai)

2. Bayes' Rule:

P(Ai|B) = P(B|Ai) · P(Ai) / ∑_{j=1}^∞ P(B|Aj) · P(Aj)

1.2 Markov Chains (DTMC)


Definition 1.2.1 — Discrete-Time Markov Chain (DTMC). A discrete-time stochastic process
{Xn }n=0,1,2,... is called a discrete-time Markov Chain (DTMC) with transition matrix

P = [Pi j ]i, j∈S



if for any j, i, in−1 , . . . , i0 ∈ S,

P(Xn+1 = j|Xn = i, Xn−1 = in−1 , . . . , X0 = i0 ) = Pi j

R In a more general setting, the Markov property can be defined as

P(Xn+1 = j|Xn = i, Xn−1 = in−1 , . . . , X0 = i0 ) = P(Xn+1 = j|Xn = i)

The previous definition only depends on i, j, but this one depends on i, j and time n.

What does the Markov property tell us?


Note that Xn is the present state, Xn−1, . . . , X0 are the historical states, and Xn+1 is about
the future. The Markov property means the history has no impact on the future; all we care about is
the present. In other words, the past only influences the future through the current state.

[Yi Shen] “Given the present state, the history and the future are independent."

In our definition, we focus on the time-homogeneous case, i.e

P(Xn+1 = j|Xn = i, Xn−1 = in−1 , . . . , X0 = i0 ) = P(Xn+1 = j|Xn = i) = Pi j

this one does not depend on time n (probability structure is time-invariant).

1.3 Transitional Matrix


Proposition 1.3.1 — Transition Matrix Properties. We have a corresponding transition matrix
P = [Pij]i,j∈S with
1. Pij ≥ 0, ∀i, j ∈ S

2. Row sums of P are always 1:

∑_{j∈S} Pij = 1, ∀i ∈ S

The reason is

∑_{j∈S} Pij = ∑_{j∈S} P(Xn+1 = j|Xn = i) = P(Xn+1 ∈ S|Xn = i) = 1

3. Matrix form: P = [Pij], where the row index i gives the current state and the column index j gives the next state.
 Example 1.1 — Simple Random Walk Revisit. We have not yet shown simple random walk
is DTMC. Left as an exercise.
Exercise 1.1 Show that the simple random walk is a DTMC. 

Pi,i+1 = P(S1 = i + 1|S0 = i)
       = P(X1 = 1|S0 = i)      the first move is +1
       = P(X1 = 1)             X1 ⊥ S0
       = p

Pi,i−1 = P(S1 = i − 1|S0 = i)
       = P(X1 = −1|S0 = i)     the first move is −1
       = P(X1 = −1)            X1 ⊥ S0
       = 1 − p

Thus, Pij = 0 for all j ≠ i ± 1. The transition matrix P is infinite-dimensional; writing q = 1 − p,
each row has q just below the diagonal, 0 on the diagonal and p just above the diagonal:

      [ ⋱  ⋱  ⋱        ]
P =   [    q  0  p     ]
      [       q  0  p  ]
      [          ⋱  ⋱  ⋱]


 Example 1.2 — Ehrenfest's Urn. Two urns A, B containing M balls in total. At each step, pick one ball
uniformly at random and move it to the opposite urn.
1. Let Xn be the number of balls in A after step n.
2. The state space is S = {0, . . . , M}.
3. X0 = i means i balls in A and M − i balls in B, and

Pij = P(X1 = j|X0 = i) = i/M         if j = i − 1 (the ball is from A)
                       = (M − i)/M   if j = i + 1 (the ball is from B)
                       = 0           if j ≠ i ± 1

Then, the transition matrix is the (M+1) × (M+1) tridiagonal matrix

      [ 0     1                       ]
      [ 1/M   0     (M−1)/M           ]
P =   [       2/M   0        ⋱        ]
      [             ⋱        0    1/M ]
      [                      1    0   ]
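A sketch of how this transition matrix can be built numerically (Python/numpy assumed; the value M = 4 is chosen only so the matrix is small enough to print):

import numpy as np

def ehrenfest_matrix(M):
    """Transition matrix of Ehrenfest's urn with M balls, states 0, ..., M."""
    P = np.zeros((M + 1, M + 1))
    for i in range(M + 1):
        if i > 0:
            P[i, i - 1] = i / M        # a ball from urn A moves to B
        if i < M:
            P[i, i + 1] = (M - i) / M  # a ball from urn B moves to A
    return P

P = ehrenfest_matrix(4)
print(P)
print(P.sum(axis=1))   # every row sums to 1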


1.4 C-K Equation


Theorem 1.4.1 — Chapman-Kolmogorov Equation (C-K Equation).
The question: given the 1-step transition matrix P = [Pij]i,j∈S, how can we decide
the n-step transition probabilities? (with time-homogeneity)

P^(n)_ij := P(Xn = j|X0 = i) = P(Xm+n = j|Xm = i),   m = 0, 1, 2, . . .

Let us start with the 2-step transition matrix

P^(2) = {P^(2)_ij}i,j∈S

Condition on what happens at time 1:

P^(2)_ij = P(X2 = j|X0 = i) = ∑_{k∈S} P(X2 = j|X0 = i, X1 = k) P(X1 = k|X0 = i)    (∗)

this is nothing else but the conditional version of the law of total probability. Details are as
follows:

P(X2 = j|X0 = i) = ∑_{k∈S} P(X2 = j, X1 = k|X0 = i)
                 = ∑_{k∈S} P(X2 = j, X1 = k, X0 = i) / P(X0 = i)
                 = ∑_{k∈S} [P(X2 = j, X1 = k, X0 = i) / P(X1 = k, X0 = i)] · [P(X1 = k, X0 = i) / P(X0 = i)]
                 = ∑_{k∈S} P(X2 = j|X0 = i, X1 = k) P(X1 = k|X0 = i)

Now, we have (∗) proved. We have not yet used the Markov property! By the Markov property
(X2 ⊥ X0 given X1),

(∗) → P^(2)_ij = ∑_{k∈S} P(X2 = j|X1 = k) P(X1 = k|X0 = i) = ∑_{k∈S} Pkj Pik = ∑_{k∈S} Pik Pkj = (P^2)_ij

Or we can interpret

P^(2)_ij = (row i of P) · (column j of P)

Thus, P^(2) = P^2. Using the same idea, for n, m = 0, 1, 2, . . . , apply Xn+m ⊥ X0 given Xm:

P^(n+m)_ij = P(Xn+m = j|X0 = i)
           = ∑_{k∈S} P(Xn+m = j|Xm = k, X0 = i) · P(Xm = k|X0 = i)
           = ∑_{k∈S} P(Xn+m = j|Xm = k) · P(Xm = k|X0 = i)
           = ∑_{k∈S} P^(m)_ik P^(n)_kj
           = (P^(m) · P^(n))_ij  →  P^(n+m) = P^(n) P^(m)   ⇐= C-K Equation

We can also write the entry form P^(m+n)_ij = ∑_{k∈S} P^(m)_ik P^(n)_kj.
By definition,

P^(1) = P
P^(2) = P^(1) P^(1) = P^2
P^(3) = P^(1) P^(2) = P^3
...
P^(n) = P^n

The left hand side is the n-step transition matrix

P^(n) = {P^(n)_ij}i,j∈S,   P^(n)_ij = P(Xn = j|X0 = i)

while the right hand side is the n-th power of the one-step transition matrix

P^n = P · P · · · P

R

[Figure 1.4.1: Intuition of the C-K Equation — to go from i at time 0 to j at time m + n, the chain passes through some intermediate state k at time m, contributing P^(m)_ik P^(n)_kj; summing over k gives P^(m+n)_ij.]

[Yi Shen] "Condition at time m (on Xm) and sum up all the possibilities."
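A quick numerical check of P^(n+m) = P^(n) P^(m) (a sketch in Python/numpy; the 3-state matrix below is an arbitrary illustrative example, not one from the notes):

import numpy as np

# an arbitrary 3-state transition matrix, for illustration only
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])

P2 = np.linalg.matrix_power(P, 2)   # P^(2)
P3 = np.linalg.matrix_power(P, 3)   # P^(3)
P5 = np.linalg.matrix_power(P, 5)   # P^(5)

# Chapman-Kolmogorov: P^(5) = P^(2) P^(3)
print(np.allclose(P5, P2 @ P3))     # True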

1.4.1 Visualization of Markov Chain


Markov chains can be represented by weighted directed graphs.
1. We define states to be nodes
2. Possible (one-step) transitions to be directed edges
3. Draw an edge from i to j if and only if Pi j > 0
4. The transition probabilities to be the weights of the edges
 Example 1.3 — Markov Chain Visualization. With S = {0, 1, 2},

      [ 1/2  1/2  0   ]
P =   [ 0    0    1   ]
      [ 2/5  3/5  0   ]

[Figure 1.4.2: Visualization — nodes 0, 1, 2 with directed edges weighted by the transition probabilities above.]

 Example 1.4 — Simple Random Walk Revisit Again. Another example of Markov Chain
visualization with a graph representation: the states . . . , −2, −1, 0, 1, 2, . . . lie on a line, and each
state i has an edge to i + 1 with weight p and an edge to i − 1 with weight 1 − p.

[Figure 1.4.3: Random Walk Visualization]

1.4.2 Vector-Matrix Form


What is the distribution of Xn?
So far, we have seen transition probabilities

P^(n)_ij = P(Xn = j|X0 = i)

If this DTMC starts from state i for sure (i.e. P(X0 = i) = 1), then (P^(n)_ij)_j is also the distribution
of Xn at time n. That is, row i of P^(n) is the distribution of Xn if the chain starts from state i.
What if the chain has a random starting state, i.e., X0 is random?

Definition 1.4.1 — Initial Distribution. Let µ = (µ(0), µ(1), . . . ), where µ(i) = P(X0 = i). The
row vector µ is called the initial distribution of this Markov Chain.

R This is the initial distribution of the initial state X0 .



Definition 1.4.2 — µn. Similarly, we can define µn = (µn(0), µn(1), . . . ) to be the distribution
of Xn, where µn(i) = P(Xn = i). In this notation, we think of µn as a row vector representing
a distribution. Sometimes it is written as µn(Xn = i), in which case we think of it as a
probability.

R We note that µ = µ0 .

Proposition 1.4.2 — Property of µn. The row vector µn represents a distribution, hence we have
1. µn(i) ≥ 0, ∀i
2. ∑_{i∈S} µn(i) = µn · 1′ = 1

Q: µn =? given µ and P

Theorem 1.4.3 — A Piece of Useful Fact.

µn = µPn

Proof. For any j ∈ S,

µn(j) = P(Xn = j)
      = ∑_{i∈S} P(Xn = j|X0 = i) P(X0 = i)    Law of Total Probability
      = ∑_{i∈S} µ(i) P^(n)_ij                 A Bunch of Definitions
      = [µ P^(n)]_j
      = [µ P^n]_j                             C-K

Thus,
µn = µ P^n


Note that µn is the distribution of Xn , a row vector.


µ is the initial distribution, a row vector.
Pn is the transition matrix.

R The distribution of a DTMC is completely determined by two things


1. Initial distribution µ
2. Transition matrix P
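A minimal sketch of propagating the distribution with this fact (Python/numpy; the two-state chain and the initial distribution are illustrative choices):

import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])        # an illustrative two-state chain
mu = np.array([1.0, 0.0])         # start in state 0 for sure

def distribution_at(n, mu, P):
    """Return mu_n = mu P^n as a row vector."""
    return mu @ np.linalg.matrix_power(P, n)

for n in (1, 2, 10):
    print(n, distribution_at(n, mu, P))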

1.5 Expected Value of A Function of Xn


1.5.1 Review of Conditional Expectation
Definition 1.5.1 — Conditional Distribution. Let X, Y be discrete random variables. Then, the
conditional distribution of X given Y = y is given by

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y),   ∀x ∈ AX and P(Y = y) ≠ 0

where AX is the support of X, and we denote by

fX|Y=y(x) = P(X = x|Y = y) = fX|Y(x|y)

the conditional probability mass function.


Proposition 1.5.1 — Conditional pmf is a legitimate pmf. Given any y such that P(Y = y) ≠ 0,
1. fX|Y=y(x) ≥ 0, ∀x ∈ AX
2. ∑_{x∈AX} fX|Y=y(x) = 1

The continuous case is similar.

R Since the conditional pmf is a legitimate pmf, the conditional distribution is also a legitimate
probability distribution. (Potentially different from the unconditional distribution) As a result,
we can define expectation on the conditional distribution.

Definition 1.5.2 — Conditional Expectation. Let X, Y be discrete random variables and g a
function of X. Then, the conditional expectation of g(X) given Y = y is defined as

E(g(X)|Y = y) = ∑_{x∈AX} g(x) P(X = x|Y = y)

R For fixed y, the conditional expectation is nothing else but the expectation under the conditional
distribution.

Different Ways to Understand Conditional Expectations (Important!)


1. Value: Fix a value y, E (g(X)|Y = y) is a number
2. Function: As y changes,
h(y) = E (g(X)|Y = y)
is a function of y.
3. Random Variable: since Y is a random variable, we can define

E (g(X)|Y ) = h(Y )

thus, E (g(X)|Y ) is a function of Y . Hence, also a random variable. Then,

E (g(X)|Y ) (ω) = E (g(X)|Y = Y (ω)) , ω ∈ Ω

this random variable takes value E (g(X)|Y = y) when Y = y.


Proposition 1.5.2 — Properties of Conditional Expectation.

1. Linearity: inherited from expectation

E(aX + b|Y = y) = aE(X|Y = y) + b
and
E(X + Z|Y = y) = E(X|Y = y) + E(Z|Y = y)

2. Plug-in Property

E(g(X,Y)|Y = y) = E(g(X, y)|Y = y) ≠ E(g(X,Y))

Proof. Discrete case:

E(g(X,Y)|Y = y) = ∑_{xi} ∑_{yj} g(xi, yj) P(X = xi, Y = yj|Y = y)

where

P(X = xi, Y = yj|Y = y) = 0                                          if yj ≠ y
                        = P(X = xi, Y = y)/P(Y = y) = P(X = xi|Y = y) if yj = y

so

E(g(X,Y)|Y = y) = ∑_{xi} g(xi, y) P(X = xi|Y = y)
                = E(g(X, y)|Y = y)        g(X, y) is a function of X

In particular, for y in the support of Y,
E(g(X)h(Y)|Y = y) = h(y) E(g(X)|Y = y)
or, in random variable form,
E(g(X)h(Y)|Y) = h(Y) E(g(X)|Y)
3. Independence: If X ⊥ Y, then E(g(X)|Y) = E(g(X))

Proof. Due to independence, the conditional distribution is the same as the unconditional
distribution 
4. Law of Iterated Expectation
E(E(X|Y)) = E(X)
Alert: E(X|Y) is a random variable, as a function of Y.
Proof. Discrete case:
when Y = yj, the random variable

E(X|Y) = E(X|Y = yj) = ∑_{xi} xi P(X = xi|Y = yj)

This happens with probability P(Y = yj). Thus, we have the following

E(E(X|Y)) = ∑_{yj} E(X|Y = yj) P(Y = yj)
          = ∑_{yj} (∑_{xi} xi P(X = xi|Y = yj)) P(Y = yj)
          = ∑_{xi} xi ∑_{yj} P(X = xi|Y = yj) P(Y = yj)
          = ∑_{xi} xi P(X = xi)       Law of Total Probability
          = E(X)

 Example 1.5 Let Y be the number of claims received by an insurance company and X some
random parameter, such that

Y|X ∼ Poi(X),   X ∼ Exp(λ)

What is E(Y)?
Solution: E(Y) = E(E(Y|X)); since Y|X ∼ Poi(X), we have E(Y|X = x) = x, so E(Y|X) = X.
Thus, E(Y) = E(E(Y|X)) = E(X) = 1/λ. 
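A Monte Carlo sanity check of E(Y) = E(E(Y|X)) = 1/λ for this example (a sketch; λ = 2, the sample size and the seed are arbitrary choices, Python/numpy assumed):

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                       # rate of the exponential, so E(X) = 1/lam
n = 200_000

X = rng.exponential(scale=1.0 / lam, size=n)   # X ~ Exp(lam)
Y = rng.poisson(lam=X)                         # Y | X ~ Poisson(X)

print(Y.mean(), 1.0 / lam)      # both should be close to 0.5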

1.5.2 Expectation of f (Xn )

Let f : S → R be given by i ↦ f(i). We want to know the expectation of f(Xn).


Proposition 1.5.3 — Two Approaches.

1. Method 1:

E(f(Xn)) = ∑_{i∈S} f(i) P(Xn = i)
         = ∑_{i∈S} f(i) µn(i)
         = µn f′

where f′ = (f(0), f(1), . . . , f(i), . . . )′ is the column vector giving all the values of f on the
different states. Then,

E(f(Xn)) = µn f′ = µ P^n f′

where
(a) µ is the row vector, the initial distribution
(b) P^n is the transition matrix (n-th power)
(c) f′ is the column vector, the function of the states

R [What happens if we calculate P^n f′ first?] This corresponds to finding E(f(Xn)|X0 = i), i ∈ S, first.

2. Method 2:

E(f(Xn)) = E(E(f(Xn)|X0))
         = ∑_{i∈S} E(f(Xn)|X0 = i) P(X0 = i)
         = ∑_{i∈S} E(f(Xn)|X0 = i) µ(i)
         = ∑_{i∈S} f^(n)(i) µ(i)
         = µ (f^(n))′
         = µ P^n f′

Note that E(f(Xn)|X0 = i) is a function of X0, denoted by f^(n)(X0).

R [How to find (f^(n))′?]

f^(n)(i) = E(f(Xn)|X0 = i)
         = ∑_j f(j) P(Xn = j|X0 = i)
         = ∑_j P^n_ij f(j)
         = (P^n f′)_i

Thus, (f^(n))′ = P^n f′.

This agrees with what we get in Method 1.


R Hence, if we calculate P^n f′ first, what we get is

(f^(n))′ = (E(f(Xn)|X0 = 0), E(f(Xn)|X0 = 1), . . . , E(f(Xn)|X0 = i), . . . )′

which is a column vector representing f^(n)(i) = E(f(Xn)|X0 = i).

R Let's think about the interpretation of

E(f(Xn)) = µ P^n f′

1. Method 1: E(f(Xn)) = µn f′ — go directly to the states at time n, getting a row vector first.
2. Method 2: E(f(Xn)) = µ (f^(n))′ — stand still at time 0, but use f^(n) to project the
future expected states, getting the column vector first.
In any case, row vectors are distributions (µ, µ1, . . . , µn, . . . ) while column vectors are
functions (f′, (f^(n))′, . . . ).
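Both orders of computation in a short numerical sketch (Python/numpy; the chain, µ, f and n below are illustrative choices):

import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.4, 0.5]])
mu = np.array([0.2, 0.3, 0.5])    # initial distribution (row vector)
f = np.array([1.0, 4.0, 9.0])     # f(0), f(1), f(2) as a column vector
n = 7

Pn = np.linalg.matrix_power(P, n)
# Method 1: first the distribution mu_n = mu P^n, then mu_n f'
method1 = (mu @ Pn) @ f
# Method 2: first f^(n) = P^n f' (i.e. E(f(X_n)|X_0 = i)), then mu f^(n)
method2 = mu @ (Pn @ f)
print(method1, method2)           # identical up to rounding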

1.6 Stationary Distribution


Definition 1.6.1 — Stationary Distribution. A probability distribution π = (π0 , π1 , . . . ) is
called a stationary distribution (invariant distribution) of the DTMC {Xn }n=0,1,... with tran-
sition matrix P if
1. π = πP as a system of equations
2. ∑i∈S πi = 1 by the definition of probability distribution

1.6.1 Why is such a π called stationary?


Assume the DTMC starts with initial distribution µ = π. Then, the distribution of X1 is

µ1 = µP = πP = π = µ

the distribution of X2:
µ2 = µP^2 = πPP = π = µ
Thus, µn = µ for all n. If the Markov chain starts from a stationary distribution, its distribution will never
change (stationary/invariant).
Example 1.6 An electron has two states: ground state |0⟩ and excited state |1⟩. Let Xn ∈ {0, 1}
be the state at time n. At each step, the electron changes state with probability α if it is in state |0⟩,
and with probability β if it is in state |1⟩.
Then, {Xn} is a DTMC. Its transition matrix (rows and columns indexed by 0, 1) is

P = [ 1−α   α  ]
    [  β   1−β ]

Solve for the stationary distribution(s).

Solution: π = πP gives

(π0  π1) = (π0  π1) [ 1−α   α  ]
                    [  β   1−β ]

that is,

π0(1 − α) + π1 β = π0     (1)
π0 α + π1(1 − β) = π1     (2)

Even though we have 2 equations with 2 unknowns, note that they are not linearly
independent: subtracting (1) from the identity π0 + π1 = π1 + π0 gives (2). Hence, (2) is redundant.
From (1), we have απ0 = βπ1, i.e. π0/π1 = β/α. Now, we need π0 + π1 = 1. Thus,

π0 = β/(α + β),    π1 = α/(α + β)

Thus, there exists a unique stationary distribution π = (β/(α+β), α/(α+β)). 

R The stationary distribution example from before is typical.

1. Use π = πP to get the proportions between different components of π
2. Use π · 1′ = 1 to normalize and get the exact values
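The same computation can be done numerically (a sketch; the values α = 0.3, β = 0.1 are illustrative, Python/numpy assumed):

import numpy as np

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Solve pi = pi P together with the normalization pi . 1' = 1:
# stack (P' - I) with a row of ones and solve in the least-squares sense.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)                                             # numerical solution
print(beta / (alpha + beta), alpha / (alpha + beta))  # closed form from above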

1.6.2 Existence & Uniqueness of π? Convergence to π?


We can note that π = πP says that π is a left eigenvector of P with eigenvalue 1 (equivalently, π⊤ is an eigenvector of P⊤ with eigenvalue 1).

1.7 Recurrence & Transience


Definition 1.7.1 — First Revisit Time. Let y ∈ S be a state. Define

Ty = min {n ≥ 1 : Xn = y}

as the time of the first (re)visit to y. And we define

ρyy = Py (Ty < ∞) = P(Ty < ∞|X0 = y)

Definition 1.7.2 — Recurrent & Transient State. A state y ∈ S is called recurrent if ρyy = 1,
which means the chain always returns to y.
It is called transient if ρyy < 1, which means there is a positive probability that the chain never visits
y again.

R Note that for a transient state y,
1 − ρyy = Py(Ty = ∞) > 0

 Example 1.7 Consider the following transition matrix on S = {0, 1, 2}:

      [ 1/2  1/2  0   ]
P =   [ 0    1/2  1/2 ]
      [ 0    0    1   ]

Its graph representation is given below.

[Figure 1.7.1: Markov Chain Graph Representation — 0 has a self-loop of weight 1/2 and an edge to 1 of weight 1/2; 1 has a self-loop of weight 1/2 and an edge to 2 of weight 1/2; 2 has a self-loop of weight 1.]

Given X0 = 0, we have

P(X1 = 0|X0 = 0) = P(X1 = 1|X0 = 0) = 1/2

Note that X1 = 0 means T0 = 1 and X1 = 1 means T0 = ∞, since 1 and 2 will never go to 0. This
analysis implies that

ρ00 = P0(T0 < ∞) = 1/2 < 1

Thus, by definition, state 0 is transient. By a similar deduction, state 1 is also transient.
Given X0 = 2, we have

P(X1 = 2|X0 = 2) = 1,  so X1 = 2 ≡ T2 = 1  →  ρ22 = P2(T2 < ∞) = 1

thus, state 2 is recurrent. 



R This is an example for which recurrence/transience can be directly checked by definitions.


However, this is very rare, as the distribution of Ti is very hard to derive in general. Thus, we
need better criteria for recurrence/transience.

Definition 1.7.3 — x communicates to y. Let x, y ∈ S (x, y can be the same state). x is said to
communicate to y, denoted by x → y, if, starting from x, the probability that the chain
eventually (re)visits state y is positive, i.e.,

ρxy = Px(Ty < ∞) > 0

Note that this is equivalent to saying

∃n ≥ 1 such that P^n_xy > 0

or "x can go to y".

Lemma 1.8 — Transitivity of Communication. If x → y, y → z, then x → z.

Proof. Note that

x → y =⇒ ∃M ≥ 1 such that P^M_xy > 0,    y → z =⇒ ∃N ≥ 1 such that P^N_yz > 0

Then, by the C-K Equation, we have

P^{M+N}_xz = ∑_{k∈S} P^M_xk P^N_kz ≥ P^M_xy P^N_yz > 0  =⇒  x → z

This is true since P^M_xy P^N_yz only counts the (M+N)-step paths from x to z that pass through y
at time M, while the quantity P^{M+N}_xz captures all possible paths. 

Theorem 1.8.1 If ρxy > 0 but ρyx < 1, then x is transient.

Proof. Define κ = min{k : P^k_xy > 0}, the smallest length of a path from x to y. Since P^κ_xy > 0,
there are states

y1, . . . , yκ−1 with Pxy1 · Py1y2 · · · Pyκ−1y > 0

Note that none of y1, . . . , yκ−1 is x, since otherwise this would not be the shortest path. Now, we have

Px(Tx = ∞) ≥ Pxy1 · Py1y2 · · · Pyκ−1y · (1 − ρyx)

where Pxy1 · Py1y2 · · · Pyκ−1y is the probability of a path going from x to y without returning to x, and (1 − ρyx)
corresponds to the event that, once in y, the chain never goes back to x. This quantity is larger than 0 since
ρyx < 1. These two together describe one way of never visiting x again, via y. Thus,

ρxx = Px(Tx < ∞) < 1

This means x is transient. 

Corollary 1.8.2 If x is recurrent, and ρxy > 0, then ρyx = 1. (Result of contrapositive)

1.8.1 Communicating Class


Definition 1.8.1 — Communicating Class. A set of states C ⊆ S is called a communicating
class if it satisfies the following properties
1. ∀i, j ∈ C, i → j, j → i
2. ∀i ∈ C, j ∉ C, i ↛ j or j ↛ i.
The idea is

"States in the same class communicate with each other; states in different classes
do not communicate in both directions."

R Communication and communicating classes can be read off the graph: i → j means we can go from i to j by
following the arrows (directed edges).

How to find classes: "find the loops"

 Example 1.8 Consider a chain on S = {0, 1, 2, 3} with the graph representation below: states 0 and 1
move only between each other, state 2 can move to the other states but is never entered from any state,
and state 3 is absorbing (P33 = 1).

[Figure 1.8.1: Markov Chain Graph Representation]

Note that

ρ01 > 0, ρ10 > 0 =⇒ 0, 1 are in the same communicating class.

State 2 is not in any class, since ρi2 = 0, ∀i ∈ S.

State 3 is a class on its own, as 3 → 3 but ρ3i = 0 for all i ≠ 3.
Then, there are two recurrent classes, {0, 1} and {3}, with one transient state 2; 2 does not belong
to any class. 

 Example 1.9 We start from the graphical representation:

[Figure 1.8.2: Markov Chain Graph Representation — states 0, 1, 2, 3]

Note that ρ01, ρ12, ρ20 > 0 =⇒ 0, 1, 2 are in the same class;
ρ23, ρ32 > 0 =⇒ 2, 3 are in the same class. Then, by transitivity, 0, 1, 2, 3 are all in the same
class. 

Definition 1.8.2 — Irreducible Markov Chain. A Markov Chain is called irreducible if all the
states are in the same class. In other words, i ↔ j for all i, j ∈ S.
A set B is called irreducible if i ↔ j, ∀i, j ∈ B.

 Example 1.10 The chain from Example 1.8 (Figure 1.8.1) is not an irreducible Markov Chain. 

 Example 1.11 The chain from Example 1.9 (Figure 1.8.2) is indeed an irreducible Markov Chain. 



Theorem 1.8.3 Let i, j ∈ C be in the same communicating class. Then j is recurrent/transient if


and only if i is recurrent/transient.

“Recurrence/Transience are class properties."

Proof. The proof will be included later. 

R As a result of the theorem, we can call a class recurrent/transient if all its states are
recurrent/transient.
(This is equivalent to one state in the class being recurrent/transient.)
In order to decide whether a class is recurrent/transient, it is enough to check one of its elements.

Definition 1.8.3 — Closed Set. A set A is called closed if i ∈ A, j ∉ A implies Pij = 0 (cannot go
from i to j in one step). Equivalently, if i ∈ A and j ∉ A, then i ↛ j (cannot go from i to
j in any number of steps).

"Cannot get out once the chain goes into A."

[Figure 1.8.3: Once the chain gets into the outer class, it cannot get out.]

Theorem 1.8.4 — Decomposition of the State Space. The state space S can be written as a
disjoint union
S = T ∪ R1 ∪ R2 ∪ . . .
where T is the set of all transient states (T is not necessarily one class), and Ri, i = 1, 2, . . . are
closed recurrent classes (equivalently, in other textbooks, irreducible closed sets of recurrent
states).
(This proof is trivial if we recognize that communicating classes are equivalence classes under the
equivalence relation ↔, two-way communication.)

Proof. First we collect all the transient states and put them into T. Each recurrent state belongs
to at least one class, since it communicates with itself. We collect one class for each recurrent
state, remove the identical classes, and get {Rk}k=1,2,.... Also, since recurrence is a class
property, the Rk, k = 1, 2, . . . are all recurrent classes. By construction, we have

S = T ∪ R1 ∪ R2 ∪ . . .

1. Disjoint: We still need to show Rk ∩ Rk′ = ∅, ∀k ≠ k′ ∈ {1, 2, . . . }.
For the sake of contradiction, suppose ∃i ∈ S with i ∈ Rk and i ∈ Rk′. But then, for any
j ∈ Rk and j′ ∈ Rk′, we have

i ↔ j, i ↔ j′ =⇒ j ↔ j′

Thus, j, j′ are in the same class. Since j and j′ are arbitrary states in Rk and Rk′, we must
have Rk = Rk′, contradicting the construction that Rk ≠ Rk′ for k ≠ k′.
2. Closedness: for the sake of contradiction, say Rk is not closed; then there exist i ∈ Rk and
j ∉ Rk such that Pij > 0, so i → j, or equivalently ρij > 0. As j ∉ Rk, i ↮ j, which
implies that j ↛ i, so ρji = 0 < 1. Thus, i is transient (Theorem 1.8.1), which yields a contradiction.


 Example 1.12 For the chain of Example 1.8, we have T = {2}, R1 = {0, 1}, R2 = {3}, so S = T ∪ R1 ∪ R2. 

 Example 1.13 For the chain of Example 1.9, there is only one recurrent class: S = R = {0, 1, 2, 3}. 

1.8.2 Strong Markov Property


Recall that Ty = min {n ≥ 1 : Xn = y}.

Theorem 1.8.5 — Strong Markov Property for (time-homogeneous) MC. The process
{X_{Ty+k}}k=0,1,2,... behaves like the MC with initial state y (forget about the history and restart at
state y).

Proof. It suffices to show that, for T = Ty,

P(XT+1 = z|XT = y, T = n, Xn−1 = xn−1, . . . , X0 = x0) = Pyz

for all n and all xn−1, . . . , x1 ≠ y.

This is obvious since

P(XT+1 = z|XT = y, T = n, Xn−1 = xn−1, . . . , X0 = x0)
= P(Xn+1 = z|Xn = y, Xn−1 = xn−1, . . . , X0 = x0)     conditioning on T = n and T = Ty
= P(Xn+1 = z|Xn = y) = Pyz                            Markov Property

1.8.3 Alternative Characterizations for Recurrence/Transience


Definition 1.8.4 Define

T^k_y = min{n > T^{k−1}_y : Xn = y}

the time of the k-th visit to state y.

By the Strong Markov Property,

Py(T^k_y < ∞) = ρ^k_yy

where the LHS is the probability of revisiting y at least k times, while ρyy is the probability of
revisiting y for the first time.
Two possibilities:
1. ρyy < 1 and y is transient; then ρ^k_yy → 0 as k → ∞, and

Py(visits y an infinite number of times) = 0
⇐⇒ Py(there is a last visit to y) = 1

2. ρyy = 1 and y is recurrent; then ρ^k_yy = 1 for all k, and

Py(visits y an infinite number of times) = 1

Indeed, we know more. Let N(y) be the total number of visits to state y (at times n ≥ 1). Then,

Py(N(y) ≥ k) = Py(T^k_y < ∞) = ρ^k_yy
=⇒ Py(N(y) ≥ k + 1) = ρ^{k+1}_yy
=⇒ Py(N(y) ≤ k) = 1 − ρ^{k+1}_yy

This is the cdf of Geo(1 − ρyy), and

N(y)|X0=y ∼ Geo(1 − ρyy)

1. Expectation can be used to characterize transience/recurrence:

Ey N(y) = 1/(1 − ρyy) − 1 = ρyy/(1 − ρyy)

then

ρyy = 1 =⇒ Ey N(y) = ∞
ρyy < 1 =⇒ Ey N(y) < ∞

More generally, we have the following lemma.

Lemma 1.9

Ex N(y) = ρxy/(1 − ρyy)

Proof.

Ex N(y) = Px(the chain ever visits y) · Ex(N(y)|Ty < ∞) + Px(the chain never visits y) · 0
        = Px(Ty < ∞) · Ex(N(y)|Ty < ∞)
        = ρxy (1 + Ey N(y))                 Strong Markov Property: after the first visit, restart from y
        = ρxy (1 + ρyy/(1 − ρyy))
        = ρxy/(1 − ρyy)

Lemma 1.10

Ex N(y) = ∑_{n=1}^∞ P^n_xy

Proof.

N(y) = ∑_{n=1}^∞ 1{Xn=y}  =⇒  Ex N(y) = ∑_{n=1}^∞ Ex(1{Xn=y}) = ∑_{n=1}^∞ Px(Xn = y) = ∑_{n=1}^∞ P^n_xy

R Combining this lemma with the previous result that

ρyy = 1 =⇒ Ey N(y) = ∞
ρyy < 1 =⇒ Ey N(y) < ∞

we have the following theorem.

Theorem 1.10.1 y is recurrent if and only if Ey(N(y)) = ∑_{n=1}^∞ P^n_yy = ∞, i.e. the series diverges to infinity.
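A rough numerical illustration of Ex N(y) = ∑_n P^n_xy via a truncated sum (a sketch; the chain is the matrix of Example 1.7 and the truncation level N = 200 is an arbitrary choice, Python/numpy assumed):

import numpy as np

# transition matrix from Example 1.7 (states 0, 1, 2)
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])

def expected_visits(P, x, y, N=200):
    """Truncated approximation of E_x N(y) = sum over n >= 1 of (P^n)_{xy}."""
    total, Pn = 0.0, np.eye(len(P))
    for _ in range(N):
        Pn = Pn @ P
        total += Pn[x, y]
    return total

print(expected_visits(P, 0, 0))   # ~1: state 0 is transient, rho_00 = 1/2
print(expected_visits(P, 0, 2))   # keeps growing with N: state 2 is recurrent (absorbing)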

1.10.1 Short Review on Indicator


Let A be an event; then

1A(ω) = 1 if ω ∈ A,   1A(ω) = 0 if ω ∉ A

and we have
E(1A) = P(A)

Theorem 1.10.2 Let i, j ∈ C be in the same communicating class. Then j is recurrent/transient


if and only if i is recurrent/transient.

“Recurrence/Transience are class properties."



Proof. Now we can do this proof. Suppose x, y are in the same class, which means x ↔ y, and x is
recurrent. Since x → y and y → x, there exist M, N such that

P^(M)_xy > 0,   P^(N)_yx > 0

Note that

P^(M+N+k)_yy = P(X_{M+N+k} = y|X0 = y) ≥ P(X_{M+N+k} = y, X_{N+k} = x, X_N = x|X0 = y) = P^(N)_yx P^(k)_xx P^(M)_xy

Therefore,

∑_{l=1}^∞ P^(l)_yy ≥ ∑_{l=M+N+1}^∞ P^(l)_yy = ∑_{k=1}^∞ P^(M+N+k)_yy              (let k = l − M − N)
                  ≥ ∑_{k=1}^∞ P^(N)_yx P^(k)_xx P^(M)_xy = P^(N)_yx P^(M)_xy ∑_{k=1}^∞ P^(k)_xx = ∞    (x is recurrent)

Thus, y is recurrent. Therefore, x recurrent implies y recurrent, and recurrence is a class property.
This also implies that transience is a class property as well. 

Lemma 1.11 In a finite closed set, there has to be at least one recurrent state.

Proof. Let C be the finite closed set. For the sake of contradiction, suppose all states in C are transient. Then, for any two x, y ∈ C,

Ex(N(y)) = ρxy/(1 − ρyy) < ∞  =⇒  ∑_{y∈C} Ex(N(y)) < ∞

However,

∑_{y∈C} Ex(N(y)) = Ex(∑_{y∈C} N(y)) = Ex(∑_{y∈C} ∑_{n=1}^∞ 1{Xn=y}) = Ex(∑_{n=1}^∞ ∑_{y∈C} 1{Xn=y})

Since C is closed, the chain stays in C, so for each n exactly one indicator takes value 1 and the rest are 0; this implies

∑_{y∈C} 1{Xn=y} = 1

Then, we have

Ex(∑_{n=1}^∞ ∑_{y∈C} 1{Xn=y}) = Ex(∑_{n=1}^∞ 1) = ∞

This yields a contradiction. Hence, there must exist at least one recurrent state. 

Combining this result with the fact that recurrence/transience are class properties, we have

Theorem 1.11.1 A finite closed class must be recurrent. In particular, an irreducible Markov
Chain with a finite state space must be recurrent.

1.12 Existence of a Stationary Distribution


In this part, we show that an irreducible and recurrent DTMC "almost" has a stationary distribution.
If the state space is finite, then it has a stationary distribution.

Definition 1.12.1 — Stationary Measure (Invariant Measure). A row vector

µ∗ = (µ∗(0), µ∗(1), . . . , µ∗(i), . . . )

is called a stationary measure (invariant measure) if µ∗(i) ≥ 0, ∀i ∈ S and µ∗P = µ∗.

R A stationary measure is a stationary distribution without normalization. If ∑_i µ∗(i) < ∞, then
it can be normalized to get a stationary distribution.

Theorem 1.12.1 Let {Xn}n=0,1,2,... be an irreducible and recurrent DTMC with transition matrix
P. Let x ∈ S and Tx := min{n ≥ 1 : Xn = x}; then

µx(y) = ∑_{n=0}^∞ Px(Xn = y, Tx > n),   y ∈ S

defines a stationary measure with 0 < µx(y) < ∞, ∀y ∈ S.

Proof. Define P̄^n_xy = Px(Xn = y, Tx > n), so that

µx(y) = ∑_{n=0}^∞ P̄^n_xy

We have the following two cases:

1. For z ≠ x:

(µx P)(z) = ∑_y µx(y) Pyz = ∑_y (∑_{n=0}^∞ P̄^n_xy) Pyz = ∑_{n=0}^∞ ∑_y P̄^n_xy Pyz

then,

∑_y P̄^n_xy Pyz = ∑_y Px(Xn = y, Tx > n, Xn+1 = z) = Px(Tx > n + 1, Xn+1 = z) = P̄^{n+1}_xz

Thus,

∑_{n=0}^∞ ∑_y P̄^n_xy Pyz = ∑_{n=0}^∞ P̄^{n+1}_xz = ∑_{n=0}^∞ P̄^n_xz = µx(z)

since P̄^0_xz = Px(X0 = z, Tx > 0) = 0.

2. For z = x: similarly, we have

∑_y P̄^n_xy Pyx = ∑_y Px(Xn = y, Tx > n, Xn+1 = x) = Px(Tx = n + 1)

Thus,

(µx P)(x) = ∑_{n=0}^∞ ∑_y P̄^n_xy Pyx = ∑_{n=0}^∞ Px(Tx = n + 1) = 1       by total probability and recurrence

and

µx(x) = ∑_{n=0}^∞ Px(Xn = x, Tx > n) = 1

since the summand equals 1 only when n = 0 and is 0 for all other n.
Combining parts 1 and 2, we have

(µx P)(z) = µx(z), ∀z ∈ S =⇒ µx P = µx

Next, we show 0 < µx(y) < ∞ for all y ∈ S. First,

1 = µx(x) = (µx P^n)(x) = ∑_z µx(z) P^n_zx ≥ µx(y) P^n_yx

Irreducibility implies y → x, so there exists n such that P^n_yx > 0. This
implies that µx(y) < ∞.
Second, by irreducibility (x → y), the chain can visit y before returning to x with positive probability:

Px(number of visits to y before returning to x ≥ 1) > 0
=⇒ Ex(number of visits to y before returning to x) = µx(y) > 0

R Note that

µx(y) = ∑_{n=0}^∞ Px(Xn = y, Tx > n)
      = ∑_{n=0}^∞ Ex(1{Xn=y} 1{Tx>n}) = Ex(∑_{n=0}^{Tx−1} 1{Xn=y})
      = Ex(number of visits to y before returning to x)

1.13 The Period of A State


Definition 1.13.1 — Period of a State. The period of a state x is defined as

d(x) := gcd{n ≥ 1 : P^n_xx > 0}

R Note that we are taking the gcd over those step counts n for which the probability of state x going back to x is
not 0. There is no guarantee of return.
Definition 1.13.2 — Aperiodic. If x has period 1, then we say x is aperiodic. If all
states in a MC are aperiodic, then we call the MC aperiodic.
If Pxx > 0, then x is obviously aperiodic. But the converse is not true: x being aperiodic does
not imply Pxx > 0.

 Example 1.14 For the chain in Figure 1.13.1, P^3_00 > 0 and P^2_00 > 0, which means d(0) = 1. However, P00 = 0. 

[Figure 1.13.1: Aperiodic Graph Example]

 Example 1.15 — Simple Random Walk Revisit. Note that (for p = 1/2)

P^n_00 = 0                       if n is odd
P^n_00 = (n choose n/2) (1/2)^n  if n is even

where (n choose n/2) is the number of ways to choose the n/2 steps to the right, and (1/2)^n is the
probability of a particular ordering with n/2 steps to the right. By this analysis, we have

d(0) = 2

[Figure 1.13.2: Random Walk Visualization with p = 1/2]
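A direct way to compute d(x) numerically (a sketch: the gcd is taken over return times up to a finite horizon N, which is only a heuristic; the two matrices are illustrative, Python/numpy assumed):

import numpy as np
from math import gcd
from functools import reduce

def period(P, x, N=50):
    """gcd of {n <= N : (P^n)_{xx} > 0}; a finite-horizon heuristic for d(x)."""
    times, Pn = [], np.eye(len(P))
    for n in range(1, N + 1):
        Pn = Pn @ P
        if Pn[x, x] > 1e-12:
            times.append(n)
    return reduce(gcd, times) if times else 0

flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])      # deterministic 0 <-> 1 chain
lazy = np.array([[0.5, 0.5],
                 [0.5, 0.5]])      # has self-loops
print(period(flip, 0))   # 2
print(period(lazy, 0))   # 1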

Lemma 1.14 — "Period is a class property".

x → y, y → x =⇒ d(x) = d(y)

Proof. Since x → y and y → x, there exist m, n such that

P^m_xy > 0,   P^n_yx > 0

then

P^{m+n}_xx ≥ P^m_xy P^n_yx > 0

Moreover, for any l such that P^l_yy > 0, we have

P^{m+n+l}_xx ≥ P^m_xy P^l_yy P^n_yx > 0

As a result,

d(x) | m + n   and   d(x) | m + n + l

thus,

d(x) | l

This holds for every l such that P^l_yy > 0, so d(x) is a common divisor of all elements of

{l ≥ 1 : P^l_yy > 0}

Thus, d(x) | d(y). Symmetrically, d(y) | d(x). Thus,

d(x) = d(y)

as desired. 

1.15 Main Theorems


Overall Conditions:
1. I: The MC is irreducible (one and only one class, everything communicates with everything)
2. A: The MC is aperiodic (i.e, all the states have period 1)
3. R: All the states are recurrent
4. S: There exists a stationary distribution π.

Theorem 1.15.1 — Convergence Theorem. Suppose I, A, S. Then,

P^n_xy → π(y) as n → ∞,  ∀x, y ∈ S

no matter where you start; the limit only depends on the target state.

"The limiting transition probability, hence also the limiting distribution, does not
depend on where we start. (Under the conditions of I, A, S)"

Or we can write

lim_{n→∞} P^n_xy = π(y), ∀x, y ∈ S  =⇒  lim_{n→∞} P(Xn = y) = π(y)

Proof. To show this convergence theorem is true, we need to prove a lemma first.

Lemma 1.16 If there exists a stationary distribution π such that π(y) > 0, then y is recurrent.

Proof. Assume the DTMC {Xn}n=0,1,... starts from the stationary distribution π. This means that

P(Xn = y) = π(y) > 0, ∀n = 0, 1, . . .

so that

∞ = ∑_{n=1}^∞ P(Xn = y)
  = ∑_{n=1}^∞ E(1{Xn=y})
  = E(∑_{n=1}^∞ 1{Xn=y})
  = E(N(y))
  = ∑_{x∈S} π(x) Ex N(y)
  = ∑_{x∈S} π(x) ρxy/(1 − ρyy)
  ≤ ∑_{x∈S} π(x) · 1/(1 − ρyy)
  = 1/(1 − ρyy)

Thus,
ρyy = 1
This means the state y is recurrent. 

By taking the contrapositive of the lemma above, we have the following corollaries

Corollary 1.16.1 If y is transient, then π(y) = 0 for any stationary distribution π.

Corollary 1.16.2

I and S =⇒ R
Alright, back to the Convergence Theorem proof. The proof is freaking long, but the main idea is
"coupling".
Consider two independent DTMCs {Xn}n=0,1,... and {Yn}n=0,1,..., both having the same transition
matrix P, with arbitrary initial distributions.
It is easy to show that the pairs Zn = (Xn, Yn) also form a DTMC, with transition matrix

P̄_{(x1,y1),(x2,y2)} = P_{x1,x2} P_{y1,y2}

Next, we show that under I and A, {Zn}n=0,1,... is also irreducible. Consider the following lemma.

Lemma 1.17 If y is aperiodic, then there exists n0 such that P^n_yy > 0 for all n ≥ n0.

Proof. We will use a fact from number theory, a corollary of Bezout's Lemma. It states that:

If we have a set of numbers I whose gcd is 1, then there are integers i1, . . . , im from I and
an n0 such that any n ≥ n0 can be written as

n = a1 i1 + a2 i2 + · · · + am im

where the aj are nonnegative integers.

Here, I = {n ≥ 1 : P^n_yy > 0}; by aperiodicity of y, the gcd of I is 1. By the corollary,
there exists n0 ∈ N such that, for n ≥ n0,

P^n_yy ≥ (P^{i1}_yy)^{a1} (P^{i2}_yy)^{a2} · · · (P^{im}_yy)^{am} > 0

Since {Xn} is irreducible, for any x1, x2 there exists K such that

P^K_{x1,x2} > 0

and similarly, for any y1, y2, there exists L such that

P^L_{y1,y2} > 0

Since the DTMCs are aperiodic, we can apply the above lemma and take M large enough such that
P^m_{x2,x2} > 0 and P^m_{y2,y2} > 0 for any m ≥ M. In particular, for any m ≥ M + max{K, L},

P^m_{x1,x2} ≥ P^K_{x1,x2} P^{m−K}_{x2,x2} > 0   and   P^m_{y1,y2} ≥ P^L_{y1,y2} P^{m−L}_{y2,y2} > 0
=⇒ P̄^m_{(x1,y1),(x2,y2)} = P^m_{x1,x2} P^m_{y1,y2} > 0

Since this holds for any (x1, y1), (x2, y2), {Zn}n=0,1,... is irreducible. Moreover, we want to
show that {Zn}n=0,1,... is recurrent. To see this, note that π̄ given by π̄(x, y) = π(x)π(y) is a stationary
distribution for {Zn}. Take x such that π(x) > 0; then π̄(x, x) > 0. By the first lemma within this proof (Lemma 1.16), we
have that (x, x) is a recurrent state. And since {Zn} is irreducible, it must be recurrent as well.
Now, define T = min{n ≥ 0 : Xn = Yn}, the first time that the two chains meet, and

V(x, x) = min{n ≥ 0 : Xn = Yn = x} = min{n ≥ 0 : Zn = (x, x)}

Then, from what we have deduced so far,

T ≤ V(x, x) < ∞

with probability 1.
(Recall that if i is recurrent and ρij > 0, then ρji = 1.)
Thus we have proved that

"The two independent Markov Chains will eventually meet."

For any state y ∈ S, splitting over the values of T and XT, we have

P(Xn = y, T ≤ n) = ∑_{m=0}^n ∑_{x∈S} P(T = m, Xm = x, Xn = y)
                 = ∑_{m=0}^n ∑_{x∈S} P(T = m, Xm = x) P(Xn = y|Xm = x)      Markov Property
                 = ∑_{m=0}^n ∑_{x∈S} P(T = m, Ym = x) P(Yn = y|Ym = x)      on {T = m}, Xm = Ym
                 = P(Yn = y, T ≤ n)

"After meeting, they have the same distribution."

Then,

|P(Xn = y) − P(Yn = y)|
= |(P(Xn = y, T ≤ n) + P(Xn = y, T > n)) − (P(Yn = y, T ≤ n) + P(Yn = y, T > n))|
≤ P(Xn = y, T > n) + P(Yn = y, T > n)
≤ 2P(T > n) → 0 as n → ∞

Recall that we have the freedom to choose the initial distributions of {Xn} and {Yn}. Take
X0 = x and Y0 ∼ π. Then,

|P^n_xy − π(y)| = |P(Xn = y) − π(y)| = |P(Xn = y) − P(Yn = y)| → 0 as n → ∞

Thus, we have

P^n_xy → π(y) as n → ∞


Theorem 1.17.1 — Existence of Stationary Measure. Suppose I, R; then there exists a stationary
measure µ∗ with 0 < µ∗(x) < ∞, ∀x ∈ S.

Proof. Proved by Theorem 1.12.1. 

R Remember that a stationary measure is a weaker notion than a stationary distribution.

Theorem 1.17.2 — Asymptotic Frequency. Suppose I, R. If Nn(y) is the number of visits to
y up to time n, then

Nn(y)/n → 1/Ey(Ty) as n → ∞

where

Ty = min{n ≥ 1 : Xn = y}

We consider Nn(y)/n as the fraction of time spent in y (up to time n).

"The long-run fraction of time spent in y is 1/Ey(Ty)."

Here Ey(Ty) is the expected revisit time to y given that we start at y, which is also the
"expected cycle length".

Proof. We can chop the time line into different cycles. Let T^(1)_y, T^(2)_y, . . . be the times at which the
chain (re)visits y after time 0. By the Strong Markov Property, T^(2)_y − T^(1)_y, T^(3)_y − T^(2)_y, . . . are i.i.d

[Figure 1.17.1: Chop the Time Line]

random variables. By the Strong Law of Large Numbers,

Theorem 1.17.3 — Strong Law of Large Numbers. For X1, X2, . . . i.i.d r.v.s,

(1/n) ∑_{i=1}^n Xi → E(X1)   a.s.

Applying this here,

(1/(k−1)) ∑_{i=1}^{k−1} (T^(i+1)_y − T^(i)_y) → E(T^(i+1)_y − T^(i)_y) = Ey(Ty)

With negligible changes (THIS IS VERY ANNOYING),

T^(k)_y / k → Ey(Ty) as k → ∞

Observe that T^(Nn(y))_y ≤ n ≤ T^(Nn(y)+1)_y, so

T^(Nn(y))_y / Nn(y) ≤ n/Nn(y) ≤ (T^(Nn(y)+1)_y / (Nn(y)+1)) · ((Nn(y)+1)/Nn(y))

and both bounds converge to Ey(Ty) as n → ∞. By the Squeeze Theorem, we have

lim_{n→∞} n/Nn(y) = Ey(Ty)  =⇒  Nn(y)/n → 1/Ey(Ty) as n → ∞


Theorem 1.17.4 — How to find a stationary distribution?. Suppose I and S. Then,

π(y) = 1/Ey(Ty)

Proof. As I, S implies R, we can apply the result above with the initial distribution being π. Taking
expectation on both sides,

E(Nn(y)/n) → 1/Ey(Ty)

This is a consequence of the Dominated Convergence Theorem (DCT) from real analysis. Then, note that

E(Nn(y)) = E(∑_{m=1}^n 1{Xm=y})
         = ∑_{m=1}^n E(1{Xm=y})
         = ∑_{m=1}^n P(Xm = y)

Since the chain is stationary, we have

P(Xm = y) = P(X0 = y) = π(y)  =⇒  E(Nn(y)) = n π(y)

Thus,

π(y) = 1/Ey(Ty)


Corollary 1.17.5 — Nicest Case. Suppose I, A, S, (R). Then

π(y) = lim_{n→∞} P^n_xy = lim_{n→∞} Nn(y)/n = 1/Ey(Ty)

Stationary Distribution = Limiting transition probability
                        = Long-run fraction of time
                        = 1 / Expected revisit time

"Everything exists and things are all equal!"

This is very philosophical LMAO!

R We don't actually need R as an assumption, but we have not yet shown that relationship.

Theorem 1.17.6 — Long-run Average. Suppose I, S, and ∑_x |f(x)| π(x) < ∞. Then,

lim_{n→∞} (1/n) ∑_{m=1}^n f(Xm) = ∑_x f(x) π(x) = π f′

Proof. Recall that under I, R, from Theorem 1.12.1,

µx(y) = ∑_{n=0}^∞ Px(Xn = y, Tx > n),  y ∈ S
      = Ex(number of visits to y before returning to x)
      = Ex N_{Tx}(y)

forms a stationary measure. Then, note that ∑_y Ex N_{Tx}(y) = Ex(Tx) = 1/π(x). We need a lemma to
proceed.
Lemma 1.18 If I, S, then π(x) > 0 for all x ∈ S.

"This is pretty much telling you that the stationary distribution exists and it is unique."

Proof. Since π is a distribution, π(y) > 0 for some y ∈ S. Since the Markov Chain is
irreducible, we know that y → x for all x ∈ S, so there exists n ∈ N such that P^n_yx > 0. Then,
since π = πP^n,

π(x) = ∑_{z∈S} π(z) P^n_zx ≥ π(y) P^n_yx > 0

Then, we have Ex(Tx) = 1/π(x) < ∞. Thus,

(Ex(N_{Tx}(y)) / Ex(Tx))_{y∈S}

gives a stationary distribution. 

R This is an amazing result: normalizing the stationary measure gives a stationary
distribution.

Thus,

Ex(N_{Tx}(y)) / Ex(Tx) = π(y)

and since Ex(Tx) = 1/π(x), we have

Ex(N_{Tx}(y)) = π(y)/π(x)

Proposition 1.18.1 If I, then the stationary distribution is unique if it exists.

Proof. The proof is postponed to the "harmonic function" part. 


Oh boy, we still need to get back to our reward function proof... The reward collected in the k-th
cycle (defined by returns to x) is

Yk := ∑_{m=T_{k−1}+1}^{T_k} f(Xm)

where Tk = T^(k)_x. Taking the expectation of Yk,

E(Yk) = ∑_{y∈S} Ex(N_{Tx}(y)) f(y) = ∑_{y∈S} π(y) f(y) / π(x)

The average reward over time is

(∑_k Yk + negligible terms) / (∑_k (Tk − Tk−1) + negligible terms)

where the cycles are i.i.d, as in the time-line picture below, and the negligible terms come from the
first and last (incomplete) cycles.

[Figure 1.18.1: Chop the Time Line]

Then,

(∑_k Yk + negligible terms) / (∑_k (Tk − Tk−1) + negligible terms)
= ((1/k) ∑_k Yk + · · ·) / ((1/k) ∑_k (Tk − Tk−1) + · · ·)
→ E(Yk) / Ex(Tx)
= (∑_{y∈S} π(y) f(y) / π(x)) / (1/π(x))
= ∑_y π(y) f(y) = π f′



R
1. Reminder: f′ is the transpose of f, so f′ is a column vector. Thus, π f′ is a dot
product.
2. Interpretation: ∑_{m=1}^n f(Xm) is the total reward/cost up to time n based on
f. Then, we can interpret (1/n) ∑_{m=1}^n f(Xm) as the average reward/cost per step up to time n,
and the whole left-hand side lim_{n→∞} (1/n) ∑_{m=1}^n f(Xm) as the
long-run average reward/cost per step.
 Example 1.16 Consider the transition matrix (states 0, 1, 2, 3)

      [ 0.1  0.2  0.4  0.3 ]
P =   [ 0.1  0.2  0.4  0.3 ]
      [ 0.3  0.4  0.3  0   ]
      [ 0.1  0.2  0.4  0.3 ]

1. This is irreducible since P03, P32, P21, P10 > 0.
2. This is aperiodic since P00 > 0 and aperiodicity is a class property (check the diagonal
entries).
3. This is recurrent since there is only one irreducible class with a finite state space and, by
theorem, the whole class is then recurrent.
4. This has a stationary distribution (measure), found by brute force:

πP = π, ∑_x π(x) = 1  =⇒  π = (19/110, 30/110, 40/110, 21/110)

By using the previous results, we have
1. Limiting transition probability:

lim_{n→∞} P^n_xy = π(y)

for example, lim_{n→∞} P^n_12 = π(2) = 4/11.
Again, this limit does not depend on x.
2. Long-run fraction/frequency of visiting state y:

lim_{n→∞} Nn(y)/n = π(y)

for example, the long-run fraction of time that the chain visits/spends in state 0 is
π(0) = 19/110.
3. Expected cycle length given by visits to y, or the expected time until the chain visits state
y again given that it starts in state y:

Ey(Ty) = 1/π(y)

for example, given that the chain starts from state 3, the expected time until it returns to state 3 (by
default, for the first time) is

1/π(3) = 110/21

4. Long-run average: for example, suppose each state x ∈ {0, 1, 2, 3} incurs a holding cost of 2x;
then the average/mean holding cost per step in the long run is

π f′ = (19/110, 30/110, 40/110, 21/110) (0, 2, 4, 6)′
     = (19×0 + 30×2 + 40×4 + 21×6)/110 = 173/55
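A numerical check of the stationary distribution and the long-run average cost just computed (a sketch in Python/numpy):

import numpy as np

P = np.array([[0.1, 0.2, 0.4, 0.3],
              [0.1, 0.2, 0.4, 0.3],
              [0.3, 0.4, 0.3, 0.0],
              [0.1, 0.2, 0.4, 0.3]])

# Solve pi P = pi together with the normalization sum(pi) = 1.
A = np.vstack([P.T - np.eye(4), np.ones(4)])
b = np.array([0, 0, 0, 0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi * 110)              # ~ [19, 30, 40, 21]

f = np.array([0, 2, 4, 6])   # holding cost 2x in state x
print(pi @ f, 173 / 55)      # long-run average cost per step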

R Typically, we want to solve for the stationary distribution first for questions like this.

1.19 Roles of Different Conditions


1.19.1 Irreducibility (I)
I (irreducibility) is related to the uniqueness of the stationary distribution:

Irreducible =⇒ the stationary distribution is unique (if it exists)

We have seen an example in Assignment 2. Or, simpler:

 Example 1.17

P = [ 1  0 ]
    [ 0  1 ]

Both (1, 0) and (0, 1) are stationary distributions. Thus, any convex combination of them,

α(1, 0) + (1 − α)(0, 1) = (α, 1 − α), ∀0 ≤ α ≤ 1,

is a stationary distribution. Therefore, π is not unique. As a result, the limiting transition
probability lim_{n→∞} P^n_xy and the limiting distribution lim_{n→∞} P(Xn = y) will depend on the initial
state/distribution. 

1.19.2 Aperiodicity (A)


A (aperiodicity) is related to the existence of lim_{n→∞} P^n_xy (the limiting transition probability). In
particular,

Aperiodic =⇒ the limit exists

 Example 1.18

P = [ 0  1 ]
    [ 1  0 ]

We have d(0) = d(1) = 2. Note that P^2 = I, so P^{2n} = I and

P^{2n+1} = P = [ 0  1 ]
               [ 1  0 ]

Thus, by the uniqueness of limits,

lim_{n→∞} P^n_xy

does not exist for any x, y ∈ {0, 1}. 

1.19.3 Recurrent (R)


R (recurrence) is related to the existence of a stationary measure.
The MC is recurrent =⇒ a stationary measure exists

1.20 Special Examples


1.20.1 Detailed Balance Condition

Definition 1.20.1 — Detailed Balance Condition. A distribution π = {π(x)}x∈S is said to


satisfy the detailed balance condition if

π(x)Pxy = π(y)Pyx , ∀x, y ∈ S

Proposition 1.20.1 — Detailed Balance Condition =⇒ Stationary Distribution. If a distribu-


tion π satisfies the detailed balance condition, then π is a stationary distribution.

Proof. See Assignment 2 Q2 (a) 

Stationary distribution (overall balance): the total probability "flow" entering x equals
the total probability flow leaving x,

∑_z π(z) Pzx = (πP)(x) = π(x)

Detailed balance (pairwise balance): for every pair of states, the flow x → y equals the flow y → x,

π(x) Pxy = π(y) Pyx

R Overall balance is not pairwise balance.

1.20.2 Time Reversibility


Start with a DTMC {Xm}m=0,1,.... Fix n; then {Ym}m=0,1,...,n given by Ym = Xn−m is called the
reversed process of {Xm}.

Theorem 1.20.2 If {Xm} starts from a stationary distribution π satisfying π(i) > 0 for every i ∈ S,
then its reversed process {Ym} is a DTMC with transition matrix given by

P̂ij = P(Ym+1 = j|Ym = i) = π(j) Pji / π(i)

Proof. Check the Markov Property:

P(Ym+1 = im+1|Ym = im, . . . , Y0 = i0)
= P(Ym+1 = im+1, Ym = im, . . . , Y0 = i0) / P(Ym = im, . . . , Y0 = i0)
= P(X_{n−(m+1)} = im+1, X_{n−m} = im, . . . , Xn = i0) / P(X_{n−m} = im, . . . , Xn = i0)
= P(X_{n−(m+1)} = im+1) P_{im+1,im} P_{im,im−1} · · · P_{i1,i0} / [P(X_{n−m} = im) P_{im,im−1} · · · P_{i1,i0}]
= P(X_{n−(m+1)} = im+1) P_{im+1,im} / P(X_{n−m} = im)

Since {Xm} starts from a stationary distribution, P(X_{n−(m+1)} = im+1) = π(im+1) and
P(X_{n−m} = im) = π(im). Now, we have

P(Ym+1 = im+1|Ym = im, . . . , Y0 = i0) = π(im+1) P_{im+1,im} / π(im)

This shows:
1. This transition probability does not depend on the history im−1, . . . , i0. Hence, {Ym}_{m=0}^n is a
DTMC.
2. The transition probability is given by

P̂ij = P(Ym+1 = j|Ym = i) = π(j) Pji / π(i)


R
1. We note that this theorem means reversibility requires the stationary distribution to
exist.
2. We can check that P̂ = [P̂ij]i,j∈S is indeed a valid transition matrix:

P̂ij = π(j) Pji / π(i) ≥ 0

and

∑_{j∈S} P̂ij = ∑_{j∈S} π(j) Pji / π(i) = (πP)(i) / π(i) = π(i)/π(i) = 1

Definition 1.20.2 — Time-Reversible DTMC. A DTMC {Xm}m=0,1,... is called time-reversible
if its reversed chain {Ym}_{m=0}^n has the same distribution as {Xm}_{m=0}^n for all n.

R This is much stronger than reversibility: not every reversible DTMC is time-reversible, but the
other direction clearly holds. This is intuitively related to the detailed balance condition,
since detailed balance provides a two-way balance of flow between each pair of states, which is
what "time-reversibility" requires.

Proposition 1.20.3 A DTMC {Xm}m=0,1,... is time-reversible if and only if it satisfies the detailed
balance condition.
Proof. 1. Assume the Detailed Balance Condition: we get stationarity for free. Say {Xm} starts
from π and π(i)Pij = π(j)Pji. Then, Y0 = Xn starts from π and the transition matrix is

P̂ij = π(j) Pji / π(i) = Pij, ∀i, j ∈ S

Since the starting (stationary) distributions and the transition matrices are identical, the two
chains must have the same distribution.
2. Assume Time-Reversibility: by definition, X0 and Xn = Y0 have the same distribution
for all n. Then, X0 follows a stationary distribution, and

Pij = P̂ij = π(j) Pji / π(i)  =⇒  π(i) Pij = π(j) Pji, ∀i, j ∈ S

Thus, we have the detailed balance condition.
Thus, we have detailed balance condition.


1.20.3 The Metropolis-Hastings Algorithm


Goal:
to sample from a distribution π = {π(x)}x∈S when direct sampling is hard to implement. This is an “MCMC” algorithm, where MCMC stands for Markov Chain Monte Carlo.
Algorithm 1.1 — Metropolis-Hastings.

• Start an irreducible DTMC with transition matrix Q = {Qxy }x,y∈S and certain initial
distribution (typically, an initial state)
• In each time,
1. Propose a move from the current state x to state y ∈ S according to probability Qxy
2. Accept this move with probability
 
π(y)Qyx
rxy = min ,1
π(x)Qxy

if the move is rejected, stay in x


Wait for a long time, then sample from this MC.

R The transition matrix of this MC is given by

Pxy = Qxy rxy ,  x ≠ y


Pxx = 1 − ∑_{y≠x} Pxy

we show π is the stationary distribution of this MC.


Indeed, π satisfies the detailed balance condition:
For any two states x, y ∈ S, by symmetry assume π(y)Qyx ≥ π(x)Qxy, so that rxy = 1 and ryx = π(x)Qxy / (π(y)Qyx). Then

Pxy = Qxy rxy = Qxy   and   Pyx = Qyx ryx = π(x)Qxy / π(y)

=⇒ π(x)Pxy = π(x)Qxy = π(y)Pyx


Since the rejection rate is typically positive, the Markov chain is automatically aperiodic. Thus, convergence is guaranteed by the Convergence Theorem:

lim_{n→∞} P(Xn = x) = π(x)
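R A minimal sketch of the algorithm in Python (not part of the original notes; the target π below and the use of a symmetric proposal Q — which makes the acceptance ratio reduce to π(y)/π(x) — are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Target distribution on S = {0, 1, 2, 3}; unnormalized weights are enough.
weights = np.array([1.0, 2.0, 3.0, 4.0])
pi = weights / weights.sum()

def propose(x):
    # Symmetric proposal: move one step left or right on a cycle, so Q_xy = Q_yx.
    return (x + rng.choice([-1, 1])) % 4

x = 0
counts = np.zeros(4)
for _ in range(200_000):
    y = propose(x)
    r = min(pi[y] / pi[x], 1.0)               # acceptance probability r_xy
    if rng.random() < r:
        x = y                                  # accept the move; otherwise stay at x
    counts[x] += 1

print(counts / counts.sum())                   # close to pi = [0.1, 0.2, 0.3, 0.4]

After a long burn-in, states of the chain are (approximately) samples from π; here we simply compare the long-run occupation frequencies with π.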

1.21 Exit Distribution


1.21.1 Basic Setting
Let A, B be subsets of the state space such that C = S − (A ∪ B) is finite. The question is

“Starting in a state in C, what is the probability that the chain exits C by entering A or
B.”

Mathematical Formulation
We define VA = min {n ≥ 0 : Xn ∈ A} and VB = min {n ≥ 0 : Xn ∈ B}. Then, what is Px (VA < VB )?

 Example 1.19
Consider the transition matrix (states 1, 2, 3, 4)

P =
       1     2     3     4
  1   0.25  0.6   0     0.15
  2   0     0.2   0.7   0.1
  3   0     0     1     0
  4   0     0     0     1

then, let’s say

C = {1, 2}   A = {3}   B = {4}

The entries P33 = 1 and P44 = 1 are not that important, since we only care about the chain before it reaches 3 or 4.

Let h(1) = P1 (V3 < V4 ), h(2) = P2 (V3 < V4 ). Condition on the first step:

h(1) = P1 (V3 < V4 ) = ∑_{x=1}^{4} P(V3 < V4 | X1 = x, X0 = 1) P(X1 = x | X0 = 1)

where

P(V3 < V4 | X1 = x, X0 = 1) = h(1) if x = 1,  h(2) if x = 2,  1 if x = 3,  0 if x = 4

=⇒ h(1) = 0.25h(1) + 0.6h(2)

Similarly,

h(2) = 0.2h(2) + 0.7

Solving this system of equations, we have

h(1) = 0.7,  h(2) = 7/8

R To solve this question, we need to introduce the idea of First-Step Analysis.

1.21.2 First-Step Analysis


Theorem 1.21.1 Let S = A∪B∪C, where A, B,C are disjoint sets, and C is finite. If Px (VA ∧VB <
∞) > 0, for all x ∈ C. Then,
h(x) := Px (VA < VB )
is the unique solution of the system of equations

h(x) = ∑_{y} Pxy h(y),  x ∈ C

with boundary conditions

h(a) = 1, a ∈ A h(b) = 0, b ∈ B

Proof. By first-step analysis,


h(x) = P(VA < VB | X0 = x)
     = ∑_{y∈S} P(VA < VB | X0 = x, X1 = y) · P(X1 = y | X0 = x)
     = ∑_{y∈S} Pxy h(y)

boundary conditions hold trivially.


Hence, we only need to look at uniqueness. Note that the system of equations can be written as

h′ = Qh′ + R′_A

where h′ = (h(x1), h(x2), ...)′ for x1, x2, ... ∈ C, Q = (Pxy)_{x,y∈C} is the C × C block of P (ordering the states as C, A, B), and

R′_A = (∑_{y∈A} P_{x1,y}, ∑_{y∈A} P_{x2,y}, ...)′

The reason is that for x ∈ C,

h(x) = ∑_{y∈S} Pxy h(y) = ∑_{y∈C} Pxy h(y) + ∑_{y∈A} Pxy = (Qh′)(x) + (R′_A)(x)

Then,

I · h′ = Qh′ + R′_A  ⟺  (I − Q)h′ = R′_A  =⇒  h′ = (I − Q)^{−1} R′_A

is unique as long as I − Q is invertible. To see this, consider the modified transition matrix P′ in which the rows of A and B are changed so that all states in A ∪ B are absorbing; in block form (states ordered as C, then A ∪ B),

P′ = ( Q  R )
     ( 0  I )

Since for Px (VA < VB ) we only need to observe the chain before it hits A or B, changing the rows of P corresponding to A and B does not change the answer to this problem.
After this change, A and B are absorbing, and all the states in C become transient (since Px (VA ∧ VB < ∞) > 0).

R Hence, the “exit distributions/exit times" are also expressed as “absorption probability/absorption
time".

To show I − Q is invertible, note that since the states in C are transient (in P′),

0 = lim_{n→∞} Px (X′n ∈ C) = lim_{n→∞} ∑_{y∈C} (P′^n)_{xy} = lim_{n→∞} ∑_{y∈C} (Q^n)_{xy}

The last equality holds because of the block structure of P′: recall that

P′ = ( Q  R )
     ( 0  I )

so in order to have X′n ∈ C, we must have X′0, ..., X′_{n−1} ∈ C. This implies that

lim_{n→∞} Q^n = 0

(the zero matrix). Then all the eigenvalues of Q have norm smaller than 1, so 1 is not an eigenvalue of Q. Thus, there does not exist a non-zero f′ such that

I · f′ = f′ = Qf′  ⟺  (I − Q)f′ = 0

Thus, I − Q is invertible. We are done!

R
1. We see that the function h in the above theorem satisfies

h(x) = ∑_{y} Pxy h(y) = Ex (h(X1)) =: h^{(1)}(x),  ∀x ∈ C

Definition 1.21.1 — Harmonic Function. In general, a function h is called harmonic at state x if

h(x) = ∑_{y} Pxy h(y) = Ex (h(X1)) = h^{(1)}(x)

h is called harmonic in A ⊆ S if the above holds for all x ∈ A, and h is called harmonic if it holds for all x ∈ S, which in vector form (with h′ = (h(0), h(1), ...)′) reads

h′ = Ph′ = h^{(1)′}

2. Matrix Formula: In the proof we have seen h′ = (I − Q)^{−1} R′_A, where Q is the C × C block of P and

R′_A = (∑_{y∈A} P_{x1,y}, ∑_{y∈A} P_{x2,y}, ...)′ = (P(X1 ∈ A | X0 = x1), P(X1 ∈ A | X0 = x2), ...)′

This is the matrix formula to calculate

Px (VA < VB ) = h(x)
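R A quick numerical check of this matrix formula on Example 1.19 (an added sketch using numpy, not part of the original notes):

import numpy as np

# Transition matrix from Example 1.19, states ordered 1, 2, 3, 4.
P = np.array([[0.25, 0.6, 0.0, 0.15],
              [0.0,  0.2, 0.7, 0.1 ],
              [0.0,  0.0, 1.0, 0.0 ],
              [0.0,  0.0, 0.0, 1.0 ]])

C = [0, 1]                                # states 1, 2
A = [2]                                   # state 3
Q = P[np.ix_(C, C)]                       # C-to-C block
R_A = P[np.ix_(C, A)].sum(axis=1)         # one-step probability of entering A

h = np.linalg.solve(np.eye(len(C)) - Q, R_A)   # h' = (I - Q)^{-1} R'_A
print(h)                                  # [0.7, 0.875], i.e. h(1) = 0.7, h(2) = 7/8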



1.22 Exit Time


1.22.1 Basic Setting
Similar to the exit distribution part, but now we are interested in the expected time that the chain
exits a part of the state space.
More precisely, let S = A ∪C, A,C disjoint and C is finite. Define
VA := min {n ≥ 0 : Xn ∈ A}
which is the first time the chain exits C, which is also the first time the chain hits/visits A.
We want to know
Ex (VA ) = E(VA |X0 = x), x ∈ C
Example 1.20 — Same Example as for the Exit Distribution.

Consider again

P =
       1     2     3     4
  1   0.25  0.6   0     0.15
  2   0     0.2   0.7   0.1
  3   0     0     1     0
  4   0     0     0     1

then, let’s say

C = {1, 2}   A = {3, 4}

Now, we want to know

g(1) = E(VA | X0 = 1),  g(2) = E(VA | X0 = 2)

Note that g(3) = g(4) = 0.

First-step Analysis
1. Condition on the first step:

g(1) = E(VA | X0 = 1) = ∑_{x=1}^{4} E(VA | X1 = x, X0 = 1) P(X1 = x | X0 = 1)

where

E(VA | X1 = x, X0 = 1) = g(1) + 1 if x = 1,  g(2) + 1 if x = 2,  1 if x = 3,  1 if x = 4

Note that this “1” comes from the time step already taken.

=⇒ g(1) = 0.25(g(1) + 1) + 0.6(g(2) + 1) + 0.15 × 1 = 1 + 0.25g(1) + 0.6g(2)   [1]

2. Similarly, we have

g(2) = 1 + 0.2g(2)   [2]

We can solve for g(1), g(2) with [1], [2]. Then,

(g(1), g(2)) = (7/3, 5/4)

Thus, starting from state 1, the expected time until the chain reaches 3 or 4 is 7/3. Starting from state 2, the expected time until the chain reaches 3 or 4 is 5/4.

1.22.2 General Results


Theorem 1.22.1 — Unique Solution of g(x). Let S = A ∪ C where A,C are disjoint and C is
finite. If Px (VA < ∞) > 0 for any x ∈ C, then g(x) := Ex (VA ), x ∈ C is the unique solution of the
system of equations
g(x) = 1 + ∑_{y∈S} Pxy g(y),  x ∈ C

with the boundary conditions g(a) = 0, ∀a ∈ A.

Proof. 1. Existence: by first-step analysis, we have

g(x) = ∑_{y∈C} Pxy (g(y) + 1) + ∑_{y∈A} Pxy × 1
     = 1 + ∑_{y∈C} Pxy g(y)
     = 1 + ∑_{y∈S} Pxy g(y)     (S = A ∪ C and the boundary condition g(a) = 0, a ∈ A)

2. Uniqueness: observe that

g(x) = 1 + ∑_{y∈S} Pxy g(y),  x ∈ C

means

g′ = 1′ + Qg′

where g′ = (g(x1), g(x2), ...)′, 1′ = (1, 1, ...)′, and Q is the C × C block of P (ordering the states as C, A, the block form of P is ( Q | R )). Then,

Ig′ = 1′ + Qg′
(I − Q)g′ = 1′
g′ = (I − Q)^{−1} 1′

We are looking at exactly the same matrix I − Q as in the exit distribution part. By Theorem 1.21.1, I − Q is invertible, so g′ is the unique solution.


R
1. g′ is a |C| × 1 column vector
2. Q is a |C| × |C| matrix
3. 1′ is a |C| × 1 column vector
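R The same matrix formula, checked numerically for Example 1.20 (an added sketch, not part of the original notes):

import numpy as np

# Same transition matrix, now with C = {1, 2} and A = {3, 4}.
P = np.array([[0.25, 0.6, 0.0, 0.15],
              [0.0,  0.2, 0.7, 0.1 ],
              [0.0,  0.0, 1.0, 0.0 ],
              [0.0,  0.0, 0.0, 1.0 ]])

Q = P[:2, :2]                                    # C x C block
g = np.linalg.solve(np.eye(2) - Q, np.ones(2))   # g' = (I - Q)^{-1} 1'
print(g)                                         # [2.333..., 1.25] = (7/3, 5/4)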

1.23 Infinite State Space


1.23.1 Basic Setting
All the results covered in the previous parts hold for both finite and infinite state spaces (unless
otherwise specified).
There is one distinction (a pair of notions) which only becomes relevant for infinite state spaces.

Definition 1.23.1 — Positive Recurrent and Null Recurrent. A state x is called positive
recurrent if Ex (Tx ) < ∞ (recall Tx = min {n ≥ 1 : Xn = x}).
A recurrent state x is called null recurrent, if Ex (Tx ) = ∞.

R Recall that recurrence means Px (Tx < ∞) = 1 and transience means Px (Tx = ∞) > 0. The classification is as follows:

Category | Subcategory
Recurrent | Positive recurrent (Ex (Tx ) < ∞)
Recurrent | Null recurrent (Ex (Tx ) = ∞)
Transient | Transient

 Example 1.21 — St. Petersburg paradox. A random variable which is finite with probability
1, but has infinite mean. Let X = 2^n with probability 2^{−n} for n = 1, 2, .... Then,


∑_{n=1}^{∞} 2^{−n} = 1 =⇒ P(X < ∞) = 1

But

E(X) = 2 × 1/2 + 4 × 1/4 + 8 × 1/8 + · · · = 1 + 1 + 1 + · · · = ∞


1.23.2 Main Results


Theorem 1.23.1 For an irreducible MC, the following are equivalent:
1. Some state is positive recurrent
2. There exists a stationary distribution π
3. All the states are positive recurrent

Proof. 1. 3 → 1 : TRIVIAL!!!!!!!!
2. 1 → 2 : Let x be a positive recurrent state. By Theorem 1.12.1, we know that a recurrent state gives us a stationary measure

µx (y) = ∑_{n=0}^{∞} Px (Xn = y, Tx > n),  y ∈ S

which is, starting from x, the expected number of visits to y before returning to x.

It can be normalized to become a stationary distribution if and only if

∑_{y∈S} µx (y) < ∞

Note that

∑_{y∈S} µx (y) = ∑_{y∈S} ∑_{n=0}^{∞} Px (Xn = y, Tx > n)
             = ∑_{y∈S} ∑_{n=0}^{∞} Ex (1_{Xn=y} · 1_{Tx>n})
             = Ex ( ∑_{n=0}^{∞} 1_{Tx>n} ∑_{y∈S} 1_{Xn=y} )
             = Ex ( ∑_{n=0}^{∞} 1_{Tx>n} )
             = Ex (Tx ) < ∞      (since 1_{Tx>n} = 1 exactly for n = 0, 1, ..., Tx − 1)

because x is positive recurrent. Hence

π(y) = µx (y) / Ex (Tx )

defines a stationary distribution.


3. 2 → 3 : Recall that for an irreducible chain with a stationary distribution (I, S) we have

π(x) > 0, ∀x ∈ S   and   π(x) = 1/Ex (Tx )

Hence, Ex (Tx ) = 1/π(x) < ∞ for any x ∈ S. Thus, all x ∈ S are positive recurrent.


Corollary 1.23.2 Positive recurrence and null recurrence are class properties.

x ↔ y =⇒ (x is positive recurrent ⇐⇒ y is positive recurrent)

and
x ↔ y =⇒ (x is null recurrent ⇐⇒ y is null recurrent)

Proof. This is just a sketch of the proof. Let x be a positive recurrent state and C be the communicating class containing x. Since C is recurrent, it is closed. Hence, {Xn} restricted to C is a DTMC and its transition matrix is given by P|C = (Pij)_{i,j∈C}. Ordering the states as C, C^c, the block form of P is

P = ( P|C   0 )
    (  *    * )

The top right block is 0 since C is closed. Thus, P|C is a transition matrix. This restricted Markov chain is irreducible. By the theorem above, x being positive recurrent implies that all the states in C are positive recurrent. (Note that for every y ∈ C, Ey (Ty ) computed under P equals Ey (Ty ) computed under P|C, since starting from y the chain only moves within C.)

Since both positive recurrence and recurrence are class properties, so is null recurrence.

Corollary 1.23.3 A state x is positive recurrent if and only if there exists a stationary distribution
π such that π(x) > 0.

Proof. Note that in both directions x is recurrent. Hence, it suffices to prove the result for the case when the chain is irreducible (otherwise, we can consider the chain restricted to the closed class containing x).
1. =⇒: From a previous result, we know that

µ(y) = ∑_{n=0}^{∞} Px (Xn = y, Tx > n),  y ∈ S

gives a stationary measure. Then,

µ(x) = ∑_{n=0}^{∞} Px (Xn = x, Tx > n) = 1     (only the n = 0 term is non-zero)

∑_{y∈S} µ(y) = Ex (Tx ) < ∞

π(x) = µ(x)/Ex (Tx ) > 0

2. ⇐=: Given by the theorem above.




Corollary 1.23.4 A DTMC with finite state space must have at least one positive recurrent
state.

Proof. Again, we can assume the Markov chain is irreducible (and hence, having a finite state space, recurrent). Fix x ∈ S; then

µ(y) = ∑_{n=0}^{∞} Px (Xn = y, Tx > n)

gives a stationary measure. Moreover, since |S| < ∞,

∑_{y∈S} µ(y) < ∞

=⇒ π(y) = µ(y) / ∑_{y∈S} µ(y)

is a stationary distribution. Thus, by the theorem above, some state is positive recurrent.

Corollary 1.23.5 A DTMC with finite state space does not have a null recurrent state.

R Why is this? Consider a null recurrent class. Since it is recurrent, it is closed, so the class forms an irreducible Markov chain. If the state space (hence the class) only has finitely many states, then by the previous corollary there is a positive recurrent state, which contradicts the class being null recurrent. The corollary follows.
Moreover, this tells us that

“A null recurrent class must have infinitely many states.”



The intuition is

1/Ex (Tx ) = lim_{n→∞} Nn(x)/n

If Ex (Tx ) = ∞, this says that the long-run fraction of time spent in each state is 0, which is not possible with only finitely many states.

 Example 1.22 — Simple Random Walk (Revisit for n + 1 times).

S=Z
Pi,i+1 = p p ∈ (0, 1)
Pi,i−1 = 1 − p = q

[Figure 1.23.1: Random Walk Visualization — states ..., −2, −1, 0, 1, 2, ..., each moving right with probability p and left with probability 1 − p]

1. Irreducible
2. The period is 2
3. Fact: The simple random walk is transient for p ≠ 1/2 and null recurrent for p = 1/2.

Proof. 1. Case 1: p ≠ 1/2. By symmetry, assume p > 1/2. Then (starting from X0 = 0)

Xn = Y1 + · · · + Yn

where {Yn} are iid with distribution

Yn = 1 with probability p,  −1 with probability 1 − p

Then,

E(Yn) = 1 · p + (−1)(1 − p) = 2p − 1 > 0

By the Strong Law of Large Numbers,

Xn/n = (1/n) ∑_{m=1}^{n} Ym → E(Y1) = 2p − 1  almost surely

as n → ∞. Thus, Xn → ∞ as n → ∞. Hence, for any state i ≥ 0 (in particular, state 0), there is a last visit time to i. This implies that 0 is transient and {Xn} is transient.

[Figure 1.23.2: Simple Random Walk — a sample path of Xn plotted against time]

2. Case 2: p = 1/2. Recall that a state i is recurrent if and only if ∑_{n=0}^{∞} P^{(n)}_{ii} = ∞. For state 0, we have

P^{2n}_{00} = P(n steps to the left, n steps to the right) = (2n choose n) (1/2)^n (1/2)^n = (2n choose n) (1/4)^n

and P^{2n+1}_{00} = 0 since the period is 2.

This is hard to compute directly! But we have a good way to approximate it.

Theorem 1.23.6 — Stirling’s Formula.

n! ∼ √(2π) e^{−n} n^{n+1/2},  as n → ∞

where the notation a(n) ∼ b(n) means a(n)/b(n) → 1 as n → ∞.
Now, we have

(2n choose n) = (2n)! / (n! n!) ∼ [√(2π) e^{−2n} (2n)^{2n+1/2}] / [√(2π) e^{−n} n^{n+1/2}]² = (1/√(2π)) · 2^{2n+1/2} · (1/√n)

=⇒ (2n choose n) (1/4)^n ∼ 1/√(πn)

Since P^{2n}_{00} ∼ 1/√(πn) and ∑_{n≥1} 1/√(πn) = ∞, we get ∑_{n=0}^{∞} P^{(n)}_{00} = ∞, so state 0 is recurrent and, by irreducibility, the whole chain is recurrent. To see that it is in fact null recurrent, we show that when p = 1/2 there does not exist a stationary distribution.

Try to solve πP = π

The component corresponding to state i reads

π(i) = (1/2)π(i − 1) + (1/2)π(i + 1)  ⟺  π(i + 1) − π(i) = π(i) − π(i − 1)

Since this holds for all i ∈ Z, {π(i)} is an arithmetic sequence. The general form is

π(i) = π(0) + ia, ∀i,  where a = π(1) − π(0)

Also, π(i) ∈ [0, 1] for all i. This forces a = 0, which in turn implies π(i) = π(0) for all i.

[Figure 1.23.3: Simple Random Walk — the candidate solution π(i) with p = 1/2]

(a) If π(0) = 0, then π(i) = 0 for all i =⇒ ∑i π(i) = 0 ≠ 1
(b) If π(0) > 0, then ∑i π(i) = ∞ ≠ 1
Thus, the normalization condition ∑i π(i) = 1 cannot hold, and a stationary distribution does not exist. In particular, 0 is not a positive recurrent state and the whole chain is not positive recurrent. Since we have shown the chain is recurrent, it is null recurrent.
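R A numerical check of the Stirling approximation and the divergence of ∑n P^{2n}_{00} (an added sketch, not part of the original notes):

import numpy as np
from math import comb, pi, sqrt

# Exact return probabilities P^{2n}_{00} = (2n choose n)(1/4)^n vs. 1/sqrt(pi*n).
for n in [1, 10, 100]:
    exact = comb(2 * n, n) / 4 ** n
    approx = 1.0 / sqrt(pi * n)
    print(n, exact, approx)                # the ratio tends to 1

# Partial sums of 1/sqrt(pi*n) behave like 2*sqrt(N/pi), hence diverge;
# this is why sum_n P^{2n}_{00} = infinity and state 0 is recurrent.
ns = np.arange(1, 100_001)
print(np.sum(1.0 / np.sqrt(pi * ns)))      # about 357 for this cutoff, still growing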


 Example 1.23 — Simple Random Walk with A Reflecting Barrier. A random walk with a
reflecting barrier.
S = Z+ = {0, 1, 2, . . . }

and Pi,i+1 = p, Pi,i−1 = 1 − p for i ≥ 1, and P01 = 1. When p < 1/2, such a chain is positive recurrent.

[Figure 1.23.4: Random Walk Visualization — states 0, 1, 2, ... with P01 = 1 and, for i ≥ 1, probability p to the right and 1 − p to the left]



To see this, we solve for the stationary distribution. Since only Pi,i+1 and Pi,i−1 are non-zero (tridiagonal transition matrix), we can use the detailed balance condition. Then,

π(0) · 1 = π(1)(1 − p) =⇒ π(1) = π(0)/(1 − p)

π(i) · p = π(i + 1)(1 − p), i = 1, 2, ... =⇒ π(i + 1) = (p/(1 − p)) π(i)

Iterating,

π(i) = (p/(1 − p)) π(i − 1) = (p/(1 − p))² π(i − 2) = · · · = (p/(1 − p))^{i−1} π(1) = (p/(1 − p))^{i−1} (1/(1 − p)) π(0)

A stationary distribution exists if and only if ∑_{i=0}^{∞} π(i) < ∞. Here π(1), π(2), ... forms a geometric series with ratio p/(1 − p), which is smaller than 1 if and only if p < 1/2.

R The reflected simple random walk is
1. Positive recurrent ⇐⇒ p < 1/2
2. Null recurrent ⇐⇒ p = 1/2
3. Transient ⇐⇒ p > 1/2
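R A simulation sketch of the reflected walk (illustrative parameters, not from the notes), comparing the long-run occupation frequencies with the stationary distribution derived above:

import numpy as np

rng = np.random.default_rng(1)
p = 0.3                                    # p < 1/2, so the chain is positive recurrent
N = 500_000

x, visits = 0, {}
for _ in range(N):
    if x == 0:
        x = 1                              # reflecting barrier: P_{01} = 1
    else:
        x += 1 if rng.random() < p else -1
    visits[x] = visits.get(x, 0) + 1

# Stationary distribution from the detailed balance computation above.
r = p / (1 - p)
pi0 = 1.0 / (1 + 1.0 / ((1 - p) * (1 - r)))
pi = [pi0] + [pi0 * r ** (i - 1) / (1 - p) for i in range(1, 6)]

for i in range(6):
    print(i, visits.get(i, 0) / N, pi[i])  # empirical frequency vs. pi(i)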

1.24 Branching Process (Galton-Watson Process)


1.24.1 Basic Setup
Consider a population. Each organism, at the end of its life, produces a random number Y of offspring. The distribution of Y is denoted as

P(Y = k) = Pk , Pk ≥ 0, k = 0, 1, . . .

and ∑∞ k=0 Pk = 1.
Start from one common ancestor, X0 = 1. The number of offsprings of different individuals are
independent.
Let

Xn := the number fo individual (in the population) in the n-th generation.

Then,
(n) (n)
Xn+1 = Y1n +Y2 + · · · +YXn
(n) (n) (n)
where Y1n ,Y2 , . . . ,YXn are independent copies of Y and Yi is the ubmer of offsprings of the i-th
individual in the n-th generation.
One thing we care about is the expected size of the n-th generation:

E(Xn) = ?

Assume E(Y) = µ.

[Figure 1.24.1: Branching Process — a family tree showing generation 1 and generation 2]

Solution:

E(X_{n+1}) = E( Y_1^{(n)} + Y_2^{(n)} + · · · + Y_{Xn}^{(n)} )
           = E[ E( Y_1^{(n)} + Y_2^{(n)} + · · · + Y_{Xn}^{(n)} | Xn ) ]
           = E(µXn) = µ E(Xn)

R This result is known as Wald’s Identity in statistics

E(Xn+1 ) = µE(Xn )

We can continue this inductively to see that

E(Xn) = µ^n E(X0) = µ^n,  n = 0, 1, ...
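R A simulation sketch (with a made-up offspring distribution, not from the notes) checking E(Xn) = µ^n:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical offspring distribution: P(Y=0)=0.2, P(Y=1)=0.5, P(Y=2)=0.3, so mu = 1.1.
probs = np.array([0.2, 0.5, 0.3])
mu = float(np.dot(np.arange(3), probs))

def generation_sizes(n_gen):
    x, sizes = 1, [1]                      # X_0 = 1
    for _ in range(n_gen):
        x = int(rng.choice(3, size=x, p=probs).sum()) if x > 0 else 0
        sizes.append(x)
    return sizes

n_gen, n_rep = 10, 20_000
avg = np.mean([generation_sizes(n_gen) for _ in range(n_rep)], axis=0)
print(avg)                                  # roughly mu ** np.arange(n_gen + 1)
print(mu ** np.arange(n_gen + 1))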

1.24.2 Extinction Probability


As long as P0 > 0, state 0 is absorbing and all the other states are transient.
But this does not mean that the population will go extinct for sure.
If the population on average keeps growing and tends to infinity with positive probability, then the probability of extinction is smaller than 1.
To find the extinction probability, we introduce a mathematical tool: generating functions.

Definition 1.24.1 — Generating Function. Let P = {P0, P1, ...} be a distribution on {0, 1, ...}. Let η be a random variable following distribution P. That is,

P(η = i) = Pi

The generating function of η (or of P) is defined by

ϕ(s) = E(s^η) = ∑_{k=0}^{∞} Pk s^k,  0 ≤ s ≤ 1

Proposition 1.24.1 — Properties of Generating Functions. Let ϕ(s) be a generating function


1. ϕ(0) = P0,  ϕ(1) = ∑_{k=0}^{∞} Pk = 1
2. The generating function determines the distribution:

Pk = (1/k!) · (d^k ϕ(s)/ds^k)|_{s=0}

Reason (this is immediate by Taylor expansion; a short illustration is included here):

ϕ(s) = P0 + P1 s + · · · + P_{k−1} s^{k−1} + Pk s^k + P_{k+1} s^{k+1} + · · ·

then,

d^k ϕ(s)/ds^k = k! Pk + (· · · )s + (· · · )s² + · · ·

so

(d^k ϕ(s)/ds^k)|_{s=0} = k! Pk =⇒ Pk = (1/k!) · (d^k ϕ(s)/ds^k)|_{s=0}

In particular, given P0, P1, P2, · · · ≥ 0, this implies ϕ(s) is increasing and convex on [0, 1] (all of its derivatives are non-negative).
3. Let η1, ..., ηn be independent random variables with generating functions ϕ1, ..., ϕn. Then

X = η1 + · · · + ηn

has generating function

ϕX(s) = ϕ1(s) · · · ϕn(s)

Proof.

ϕX(s) = E(s^X) = E(s^{η1} · · · s^{ηn}) = E(s^{η1}) · · · E(s^{ηn}) = ϕ1(s) · · · ϕn(s)

using independence.


4. Pseudo moments (factorial moments): very useful!

(d^k ϕ(s)/ds^k)|_{s=1} = E( η(η − 1) · · · (η − k + 1) s^{η−k} )|_{s=1} = E( η(η − 1) · · · (η − k + 1) )

In particular,

E(η) = ϕ′(1)

and

Var(η) = ϕ″(1) + ϕ′(1) − (ϕ′(1))²

The Graph of a Generating Function


Back to extinction probability. Define

N = min {n ≥ 0 : Xn = 0}

to be the extinction time and


un = P(N ≤ n) = P(Xn = 0)

where P(N ≤ n) is the probability that extinction happens before or at time n. Note that {un} is an increasing sequence bounded above, so by the Monotone Convergence Theorem the following limit is well-defined:

u := lim_{n→∞} un = P(N < ∞) = P(the population eventually dies out) = extinction probability

[Figure 1.24.2: The Graph of a Generating Function — ϕη(s) on [0, 1] is increasing and convex, with ϕη(0) = P0 and ϕη(1) = 1]

Our goal is to find u.

Note that we have the following relation between un and un−1:

un = ∑_{k=0}^{∞} Pk (u_{n−1})^k = ϕ(u_{n−1})

where ϕ is the generating function of Y.


Reason:
Note that each sub-population has the same distribution as the whole population. The whole
population dies out in n steps if and only if each sub-population initiated by an individual in
generating 1 dies out in n − 1 steps.
Then,
un = P(N ≤ n)
= ∑ P(N ≤ n|X1 = k)P(X1 = k)
k
= ∑ P(N1 ≤ n − 1, . . . , Nk ≤ n − 1|X1 = k)Pk
k
= ∑ Pk ukn−1 = ϕ(un−1 )
k

where Nm is the number of steps for the sub-population to die out. And we can also write
k Y
un−1 = E un−1 .
Thus, the problem becomes:

With the initial value u0 = P(X0 = 0) = 0 (since X0 = 1) and the relation

un = ϕ(un−1 )

what is limn→∞ un = u?

Recall that
1. ϕ(0) = P0 > 0
2. ϕ(1) = 1
3. ϕ(s) is increasing

[Figure 1.24.3: We can divide the whole population into sub-populations]

4. ϕ(s) is convex
Draw ϕ(s) and the function f(s) = s between 0 and 1. We have two possibilities.

[Two plots of ϕ(s) against f(s) = s on [0, 1]. Case 1: the smallest intersection occurs at some u < 1. Case 2: u = 1, extinction happens for sure.]

Theorem 1.24.2 The extinction probability u will be the smallest intersection of ϕ(s) and f (s).
Equivalently, it is the smallest solution of ϕ(s) = s between 0 and 1.

Reason: see the dynamics of the fixed-point iteration in Figure 1.24.4 below. This dynamic process verifies the results for Case 1 and Case 2.

Q: How to tell whether we are in Case 1 or Case 2?

A: We can check the derivative at s = 1!

Note that ϕ′(1) = E(Y) and

ϕ′(1) > 1 =⇒ Case 1
ϕ′(1) ≤ 1 =⇒ Case 2

[Figure 1.24.4: Fixed Point Iteration — starting from u0 = 0, the iterates u1 = ϕ(u0), u2 = ϕ(u1), ... increase to the smallest fixed point u of ϕ]

Thus, we conclude that

E(Y) > 1 =⇒ the extinction probability u is less than 1 (and u is the unique solution of ϕ(s) = s in [0, 1))

We can think of E(Y) > 1 as: on average there is more than 1 offspring, so the population will probably explode, which diminishes the chance of wiping out the whole population. (Thanos has left the chat.)

E(Y) ≤ 1 =⇒ extinction happens for sure (with prob. 1)

We can think of E(Y) ≤ 1 as: on average there is at most 1 offspring, so there is always a risk that the population dies out.
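R A sketch of computing the extinction probability u by the fixed-point iteration un = ϕ(u_{n−1}) (the offspring distribution below is a made-up example with E(Y) > 1, so we expect u < 1):

# Hypothetical offspring distribution: P(Y=0)=0.2, P(Y=1)=0.3, P(Y=2)=0.5, so E(Y) = 1.3 > 1.
def phi(s):
    # Generating function phi(s) = P0 + P1*s + P2*s^2 for this example.
    return 0.2 + 0.3 * s + 0.5 * s ** 2

u = 0.0                        # u_0 = P(X_0 = 0) = 0 since X_0 = 1
for _ in range(200):
    u = phi(u)                 # u_n = phi(u_{n-1}) increases to the extinction probability
print(u)                       # smallest solution of phi(s) = s in [0, 1]; here u = 0.4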

1.25 Review on DTMC


Preliminaries
Probability space, stochastic process, index set, state space...

1.25.1 Basic Properties


Definition 1.25.1 — Discrete-Time Markov Chain (DTMC) and its transition matrix. A
discrete-time stochastic process {Xn }n=0,1,2,... is called a discrete-time Markov Chain (DTMC)
with transition matrix
P = [Pi j ]i, j∈S
if for any j, i, in−1 , . . . , i0 ∈ S,

P(Xn+1 = j|Xn = i, Xn−1 = in−1 , . . . , X0 = i0 ) = Pi j = P(Xn+1 = j|Xn = i)

Proposition 1.25.1 — Properties of Transition Matrix.

Pij ≥ 0 and ∑_{j∈S} Pij = 1 ⇐⇒ there exists a legitimate DTMC

Proposition 1.25.2 — Multi-Step Transition Probabilities.

P(n) = Pn

and the C-K Equation: P^{(n+m)} = P^{(n)} P^{(m)}, or entrywise (the form used more frequently),

P^{(n+m)}_{ij} = ∑_{k∈S} P^{(m)}_{ik} P^{(n)}_{kj}

R We also have graphical representation (this does not require justification)

Conditional Probability
1. E(g(X)|Y = y) is a number
2. E(g(X)|Y ) is a random variable (a function of Y )

R Make sure you check the properties of conditional expectations. Especially, the iterated
expectation
E(E(X|Y )) = E(X)

Distribution of Xn and E( f (Xn ))


Given an initial distribution µ = (µ(0), µ(1), . . . ) (a distribution of X0 ), then

µn = µPn

as the distribution at time n (Xn ). For f = ( f (0), f (1), . . . ), then


0
E( f (Xn )) = µPn f
0
= µn f
0
= µ f (n)

where  
E( f (Xn )|X0 = 0)
0
f (n) = E( f (Xn )|X0 = 1)
 
..
.
and µ and P completely characterize a DTMC.

R Distributions are written as row vectors; a function on the state space is written as a column vector when used.

1.25.2 Classification and Class Properties


Ty = min {n ≥ 1 : Xn = y} first visit (revisit) time

ρxy = Px (Ty < ∞) = P(Ty < ∞ | X0 = x)

We say x → y if ρxy > 0. We know that x ↔ y is an equivalence relation. This implies that there is a disjoint partition of the states into communicating classes. When there is only one communicating class, we call the chain irreducible. To identify the classes, we can find the loops.

[Figure 1.25.1: Summary — for a state y: ρyy = 1 means recurrent, further split into Ey (Ty ) < ∞ (positive recurrent) and Ey (Ty ) = ∞ (null recurrent); ρyy < 1 means transient]

Positive Recurrence/Null Recurrence/Transience


Positive recurrent, null recurrent, transience are class properties!

Criteria for Recurrence vs. Transience


Recurrent | Transient
ρyy = Py (Ty < ∞) = 1 | ρyy < 1
Py (N(y) = ∞) = 1 | Py (N(y) < ∞) = 1
Ey (N(y)) = ∞ | Ey (N(y)) < ∞
∑_{n=1}^{∞} P^n_{yy} = ∞ | ∑_{n=1}^{∞} P^n_{yy} < ∞

Proposition 1.25.3 1. ρxy > 0, ρyx < 1 =⇒ x is transient


2. x is recurrent and ρxy > 0 =⇒ ρyx = 1

Definition 1.25.2 — Closed Set. A set A is closed if i ∈ A, j ∉ A =⇒ Pij = 0 (equivalently, ρij = 0).

Proposition 1.25.4 A finite and closed class is positive recurrent.

Corollary 1.25.5 A DTMC with finite state space does not have a null recurrent state/class.

Proposition 1.25.6 If a class is recurrent, then it is closed.

Criteria for Positive Recurrent State y


1. Ey (Ty ) < ∞ (hard to do most of the time)
2. There exists a stationary distribution π such that π(y) > 0.

Decomposition of the State Space S


S = T ∪ R1 ∪ R2 ∪ . . .

where T is the set of transient states (it can contain classes and stand-alone states) and the Ri are the recurrent (closed) classes.

Proposition 1.25.7 — Starting from x, how many visits to y on average?

Ex (N(y)) = ρxy / (1 − ρyy)
Strong Markov Property

{X_{Ty+k}}_{k=0,1,2,...} behaves like the MC starting from X0 = y.

Period

d(x) = gcd {n ≥ 1 : P^n_{xx} > 0}

If d(x) = 1, x is aperiodic. If all states are aperiodic, then the MC is aperiodic. The easiest way to check is

Pxx > 0 =⇒ x is aperiodic

The other direction is false; check the example in the previous part.


Proposition 1.25.8 Period is a class property.

1.25.3 Stationary Distribution and Limiting Behaviour


Definition 1.25.3 — Stationary Distribution. A probability distribution π = (π0 , π1 , . . . ) is
called a stationary distribution (invariant distribution) of the DTMC {Xn }n=0,1,... with transition
matrix P if
1. π = πP as a system of equations
2. ∑i∈S πi = 1 by the definition of probability distribution

Proposition 1.25.9 If the DTMC is irreducible and recurrent, then

µx (y) = ∑_{n=0}^{∞} Px (Xn = y, Tx > n) = Ex (number of visits to y before returning to x)

gives a stationary measure with µx (x) = 1. It gives a stationary distribution if and only if

∑_{y∈S} µx (y) = Ex (Tx ) < ∞

which means x is a positive recurrent state.

Main Theorems
Theorem 1.25.10 — Convergence Theorem. Suppose I,A,S. Then,

P^n_{xy} → π(y) as n → ∞,  ∀x, y ∈ S

no matter where you start; the limit only depends on the target state.

“The limiting transition probability, hence also the limiting distribution, does not depend on where we start. (Under the conditions of I, A, S)”

Or we can write

lim_{n→∞} P^n_{xy} = π(y), ∀x, y ∈ S =⇒ lim_{n→∞} P(Xn = y) = π(y)

Theorem 1.25.11 — Asymptotic Frequency. Suppose I, R. If Nn(y) is the number of visits to y up to time n, then

Nn(y)/n → 1/Ey (Ty ) as n → ∞

where Ty = min {n ≥ 1 : Xn = y}. We consider Nn(y)/n as the fraction of time spent in y (up to time n).

“The long-run fraction of time spent in y is 1/Ey (Ty )”

where Ey (Ty ) is the expected revisit time to y given that we start at y, which is also the “expected cycle length”.

Theorem 1.25.12 — How to find a stationary distribution?. Suppose I and S, then,

π(y) = 1/Ey (Ty )

Corollary 1.25.13 — Nicest Case. Suppose I, A, S, (R), then

π(y) = lim_{n→∞} P^n_{xy} = lim_{n→∞} Nn(y)/n = 1/Ey (Ty )

Stationary distribution = limiting transition probability
                        = long-run fraction of time spent in the state
                        = 1 / expected revisit time

Theorem 1.25.14 — Long-run Average. Suppose I, S, and ∑x | f (x)|π(x) < ∞. Then,

lim_{n→∞} (1/n) ∑_{m=1}^{n} f(Xm) = ∑_x f(x)π(x) = π f′

Proposition 1.25.15 If MC is I, then

S ⇐⇒ positive recurrent

In this case, π(x) > 0 for all x ∈ S

Detailed Balance Condition


π(x)Pxy = π(y)Pyx , ∀x, y ∈ S

We know that

Detailed balance condition =⇒ stationary distribution

and, for a tridiagonal transition matrix,

Detailed balance condition ⇐= stationary distribution

Time-reversed Chain
For fixed n,
Ym = Xn−m , m = 0, 1, . . . , n
Proposition 1.25.16 {Ym } is a DTMC if X starts from a stationary distribution.

Definition 1.25.4 — Time-reversible. X is called time-reversible if {Xn} and {Ym} have the same distribution ({Xn} =d {Ym}).

Theorem 1.25.17 Time-reversible ⇐⇒ detailed balance condition holds

Metropolis-Hastings Algorithm (Omitted)


1.25.4 Exit Distribution and Exit Time
Exit Distribution
We have
S =C∪A∪B
h(x) = Px (VA < VB )
is the unique solution of the system of equations

h(x) = ∑_y Pxy h(y),  x ∈ C
h(a) = 1,  a ∈ A
h(b) = 0,  b ∈ B

Matrix Formula: In the proof we have seen h′ = (I − Q)^{−1} R′_A, where Q is the C × C block of P (ordering the states as C, A, B) and

R′_A = (∑_{y∈A} P_{x1,y}, ∑_{y∈A} P_{x2,y}, ...)′ = (P(X1 ∈ A | X0 = x1), P(X1 ∈ A | X0 = x2), ...)′

Exit Time
S = A ∪ C. We have that g(x) = Ex (VA ) is the unique solution of

g(x) = 1 + ∑_y Pxy g(y),  x ∈ C
g(a) = 0,  a ∈ A

and its matrix form:

g′ = (I − Q)^{−1} 1′

1.25.5 Two Examples


Simple Random Walk Without Barrier
1. p ≠ 1/2 =⇒ transient
2. p = 1/2 =⇒ null recurrent
If there is a reflecting barrier, p < 1/2 =⇒ positive recurrent.
Branching Process
Definition 1.25.5 — Generating Function.

ϕ(s) = E(s^η) = ∑_{k=0}^{∞} Pk s^k,  s ∈ [0, 1]

Need to know the properties of g.f.s


Galton-Watson Process
Given Y ∼ {Pn} as the distribution of the number of offspring of one individual,

E(Y) = µ =⇒ E(Xn) = µ^n

Given the assumption X0 = 1, we have

un = ϕ(u_{n−1})

where un is the probability of extinction before or at time n.