
STAT2003 Introduction to Applied Probability

STAT3102 Stochastic Processes

Lecture notes

Course details
• Course organiser: Dr. Elinor Jones (elinor.jones@ucl.ac.uk).

• Lectures: 9-11am TUESDAYS and 3-4pm FRIDAYS.


• Office hours: Please check Moodle.
• Useful books:
– ‘Introduction to Probability Models’ by S. Ross
– ‘Probability and Random Processes’ by Grimmett & Stirzaker (more mathematical).

Tutorials

• Exercises will be uploaded to Moodle each Friday afternoon.


• You must hand in your work by 12 noon the following THURSDAY. Please post your work in your
tutor's box in the Undergraduate Common Room. Make sure that your name and tutor group are clearly
printed on your work. There is no need to attach a copy of the homework sheet.
• Late submissions will not be marked, and scanned or emailed copies will not be considered.

Assessment
• ICA: Friday 3rd of March, 3-4pm (provisional) - open book, 45 minutes. This will be 10% of your final
mark.
• Final exam in Term 3 - closed book, 2 or 2.5 hours (for STAT3102 and STAT2003 respectively). This
will be 90% of your final mark.
• There is no choice of questions in the ICA or final written exam.

• College approved calculators only (but you will need one!).

Contents

1 Introduction (Important!)
  1.1 A course overview
  1.2 Mathematical/ Statistical Writing
  1.3 Random variables
      1.3.1 Definitions (optional)
      1.3.2 Discrete random variables
      1.3.3 Continuous random variables
      1.3.4 Important distributions
      1.3.5 Generating functions
      1.3.6 Probability generating functions
      1.3.7 Moment generating functions
      1.3.8 What else can we do with generating functions?
  1.4 Conditioning on events
      1.4.1 Conditional probability
      1.4.2 Conditional expectation
      1.4.3 Useful conditioning formulae
      1.4.4 The idea of 'first step decomposition'
  1.5 Exercises

2 What is a stochastic process?
  2.1 Definitions and basic properties
  2.2 The Markov property
  2.3 Why is the Markov property useful?
  2.4 How can I tell if a process is Markov?

3 Discrete-time Markov processes
  3.1 Introduction to discrete time Markov chains
      3.1.1 Time homogeneity and transition probabilities
      3.1.2 Transition matrices
      3.1.3 The n-step transition probabilities and the Chapman-Kolmogorov equations
      3.1.4 Marginal probabilities
      3.1.5 Duration of stay in a state
  3.2 Classification of Markov chains
      3.2.1 Irreducible classes of intercommunicating states
      3.2.2 Recurrence and transience
      3.2.3 Positive recurrence and null recurrence
      3.2.4 Periodicity
      3.2.5 Class properties
      3.2.6 Further properties of discrete time Markov chains
      3.2.7 Closed classes and absorption
      3.2.8 Decomposition of the states of a Markov chain
  3.3 Limiting behaviour
      3.3.1 Invariant distributions
      3.3.2 Equilibrium distributions
      3.3.3 Invariant and equilibrium distributions
      3.3.4 Ergodic Markov chains and the equilibrium distribution
      3.3.5 Extending the Main Limit Theorem
      3.3.6 When does an equilibrium distribution exist?

4 Continuous-time Markov processes
  4.1 Continuous-time Markov chains
  4.2 The importance of the exponential distribution and 'little-o' notation
      4.2.1 Order notation: o(h)
      4.2.2 The lack-of-memory property of the exponential distribution
      4.2.3 Other useful properties of the exponential distribution
  4.3 Breaking down the definition of a continuous time Markov process
      4.3.1 Holding times
      4.3.2 The jump chain: rates of change
      4.3.3 The jump chain: probability of going to a particular state
  4.4 Analysis of transition probabilities
      4.4.1 Transition probabilities over a small interval
      4.4.2 Transitions over longer periods: the Chapman-Kolmogorov equations
      4.4.3 Kolmogorov's forward equations
      4.4.4 Kolmogorov's backward equations
      4.4.5 Solving the KFDEs and KBDEs
      4.4.6 The generator matrix, Q
  4.5 Limiting behaviour
      4.5.1 Invariant distributions
      4.5.2 Equilibrium distributions
      4.5.3 A limit theorem for continuous time Markov chains

5 Important types of continuous-time processes
  5.1 The Poisson process
      5.1.1 The generator matrix for the Poisson process
      5.1.2 The distribution of the number of events by time t
      5.1.3 Other properties of the Poisson process
      5.1.4 Superposition and thinning of a Poisson process
  5.2 Birth and death processes
      5.2.1 Transition rates and the jump chain
      5.2.2 The generator matrix
      5.2.3 Important types of birth-death processes
      5.2.4 Transition probabilities (over a small interval)
      5.2.5 The equilibrium distribution of the (general) birth and death process
      5.2.6 The probability of extinction for linear birth-death process
  5.3 Exercises

1 Introduction (Important!)
1.1 A course overview
A stochastic process is a process which evolves randomly over time (or space, or both):

• ‘stochastic’: equivalent to ‘random’.


• ‘process’: occurring over time (or space, or both).
We will consider some simple stochastic processes which are used to model real-life situations. Our main
focus will be on developing mathematical tools to study their properties and learn about long-term behaviour
of such processes. Though this is a mathematical course, the applications of stochastic processes are wide-
ranging, e.g. in physics, engineering, the environment, social science, epidemics, genetics and finance. We’ll
see examples from some of these areas during the course.

1.2 Mathematical/ Statistical Writing


Mathematics is a language, and you must learn its grammar. Whether you are handing in a tutorial
sheet or writing an exam, you need to present your work in an appropriate manner. You could start by
thinking of the following:
• Think carefully about precisely what you are trying to calculate/ solve and what information you have
available to you in order to do so.
• Have you clearly defined all your notation? Be precise: are your definitions watertight?
• Is your notation appropriate?
• Does your solution flow logically, that is, does each step follow on logically from the previous step?
Again, be precise. The examiner is not a mind reader. YOU need to guide the reader through your
thought process.
• Have you explained your reasoning, not just written down some formulae?
• It is important to understand what you are doing; don’t just repeat a list of steps. Always ask yourself:
‘do I know what I’m doing?’ and ‘how would I explain this to a fellow student?’.

Being able to write clearly will help you to think clearly. It forces you to understand the material: if you
can’t write it well, you don’t understand it well enough. Learning mathematics is not about repetition, as
it was at A-Level, and so past papers are NOT A GOOD REVISION TOOL. In particular, I write ICA and
exam questions that probe your understanding, and not your ability to regurgitate methods. This reflects
the fact that proper mathematics is not about learning to apply a set of methods, but using logical reasoning
to find your way through a problem.

Some tips for writing in STAT2003/STAT3102 (based on previous cohorts of students):


• Random variables form the basis for most, if not every, question you will encounter. Make sure you
define each and every random variable you use. All random variables should be denoted by CAPITAL
letters.
• Probabilities should be written out in full - if you want to calculate P (X = 1) for some random variable
X, then write it as such. NEVER use notation such as P (1) to denote this. And, as the previous
bullet point states, writing P (x = 1) is plain wrong.

• The ‘equals sign’ has a very specific meaning. Please don’t abuse it.

• Do a reality check on your answer. For example, a probability should be between 0 and 1. An
expectation of how many time units it takes to reach a particular point should be at least 1 (assuming
moving between locations takes 1 unit of time, and that you’re not already there!).
• If your answer doesn’t contain written explanation, then it is not a complete solution. Think about
guiding the marker through your thought process, rather than expecting the marker to second guess
what you meant to say.
• Use tutorial sheets to practice good mathematical writing.

1.3 Random variables


Informally speaking, a random variable is a variable [i.e. can take different values], whose numerical values
are randomly chosen from a set of possible values, governed by a probability distribution. For example,
the time that a passenger must wait until the next southbound tube train at Goodge Street station is a
variable. Now suppose you arrive at Goodge Street station - the amount of time you must wait until the
next southbound tube arrives is a realisation of the random variable ‘waiting time’.

It is important that you can spot a random variable, and can also define appropriate random variables for
a given scenario.

1.3.1 Definitions (optional)


Ideally, we should think about random variables as functions. In order to do so, we must first build up some
terminology. We’ll use a coin tossing example to illustrate: suppose you toss a coin twice and record whether
it lands heads or tails at each toss. Note that the following is a simplification of ideas from Probability Theory.

• An experiment is any situation where there is a set of possible outcomes.


– Our coin-tossing example is an experiment.
– The possible outcomes are HH, HT, TH or TT.
• The set of possible outcomes is called the sample space, Ω.

– In our case, Ω = {HH, HT, TH, TT}


• An outcome is denoted ω and is an element of the sample space.
– Suppose we conduct the experiment, and the outcome is TH. This is our ω.
• An event, A, is a subset of the sample space.

– e.g. define an event as ‘at least one head’.


– This is satisfied by HH, HT and TH.
– This is a subset of Ω.

Probability model vs. real world:

• Sample space, Ω: 'set of all possible outcomes of the experiment.'

• Sample point, ω: 'outcome of the experiment.'

• Event A ⊆ Ω: 'property that the outcome of the experiment may possess.'

With a finite sample space Ω, we can assign probabilities to individual sample points ω via a 'weight function',
p : Ω → [0, 1]. This allocates a value p(ω) for each possible outcome ω, which we interpret as the probability
that ω occurs. We need

    ∑_{ω∈Ω} p(ω) = 1,

and then we may calculate the probability of an event A ⊆ Ω via

    P(A) = ∑_{ω∈A} p(ω).

Caution: this doesn’t always work with infinite sample spaces, but we won’t consider this further.

Notice that:
1. P(A) ≥ 0 for all events A,
2. if A1, A2, A3, . . . are events with Ai ∩ Aj = ∅ for all i ≠ j, then P(∪_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P(Ai), i.e. P is
countably additive,
3. P(Ω) = 1.
As the outcome of an experiment corresponds to a sample point ω ∈ Ω, we can make some numerical mea-
surements whose values depend on the outcome ω. This gives rise to a random variable, let’s call it X,
which is a function X : Ω → R. Its value at a sample point, X(ω), represents the numerical value of the
measurement when the outcome of the experiment is ω.

In our set-up, there are different random variables you could define. Let’s define X to be the number of
heads in the two tosses of a coin. In this case X maps Ω to 0 (if we see TT), 1 (if we see TH or HT) or 2 (if
we see HH).

    X(TT) = 0
    X(TH) = X(HT) = 1
    X(HH) = 2

All outcomes are equally likely here, and so p(ω) = 1/4 for all outcomes ω. If we define our event A = "two
heads", then since the only sample point in A is HH,

    ∑_{ω∈A} p(ω) = 1/4,

and we say that P(A) = 1/4.

Similarly, we can also define another event B =“at least one tail”, and this is given by P (X ≤ 1), which we
can calculate as:
P(B) = P(X ≤ 1) = P({ω ∈ Ω : X(ω) ≤ 1}) (1)
Which ω satisfy (1) in this case? Now you should be able to calculate the probability required.

Note: we have defined the distribution function of X above. In general, the distribution function of
a random variable X is given, for x ∈ R, by:
FX (x) = P(X ≤ x) = P({ω ∈ Ω : X(ω) ≤ x}).

1.3.2 Discrete random variables


X is a discrete random variable if it takes only finitely many or countably infinitely many values.

The probability mass function of a discrete random variable is


pX (x) = P(X = x) = P({ω : X(ω) = x}).
The expectation of X is

    E(X) = ∑_x x P(X = x).

For a function g we have

    E(g(X)) = ∑_x g(x) P(X = x).

The variance of X is

    var(X) = E[(X − E(X))²] = E(X²) − [E(X)]².

1.3.3 Continuous random variables


X is a continuous random variable if it has a probability density function, i.e., if there exists a
non-negative function fX with ∫_{−∞}^{∞} fX(x) dx = 1, such that for all x,

    FX(x) = ∫_{−∞}^{x} fX(u) du.

Then

    fX(x) = dFX(x)/dx.

The expectation of X is

    E(X) = ∫_{−∞}^{∞} x fX(x) dx

and the expectation of g(X), for some function g, is

    E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx,

provided that these integrals exist. As before, the variance of X is

    var(X) = E[(X − E(X))²] = E(X²) − [E(X)]².

1.3.4 Important distributions
There are several probability distributions which we will use repeatedly during the course. Make sure you
are very familiar with the following:

Discrete distributions

• Binomial: X ∼ Bin(n, p); support k = 0, 1, . . . , n; P(X = k) = n!/(k!(n−k)!) p^k q^{n−k} where q = 1 − p;
  expectation np; variance npq.

• Poisson: X ∼ Po(λ), where λ ≥ 0; support k = 0, 1, 2, . . . ; P(X = k) = exp(−λ) λ^k / k!;
  expectation λ; variance λ.

• Geometric: X ∼ Geom(p), where p ∈ [0, 1]; support k = 1, 2, 3, . . . ; P(X = k) = (1 − p)^{k−1} p;
  expectation 1/p; variance q/p², where q = 1 − p.

Continuous distributions

• Exponential: X ∼ Exp(λ), where λ > 0; support x > 0 (positive reals); fX(x) = λ e^{−λx};
  expectation 1/λ; variance 1/λ².

• Uniform: X ∼ Unif(a, b); support [a, b]; fX(x) = 1/(b − a) for a < x < b, and 0 otherwise;
  expectation (a + b)/2; variance (b − a)²/12.

• Gamma: X ∼ Gamma(α, β); support x > 0 (positive reals); fX(x) = β^α x^{α−1} exp(−βx)/Γ(α);
  expectation α/β; variance α/β².

1.3.5 Generating functions


Generating functions are USEFUL. Don’t think of them as baffling mathematical constructs; instead think
of them as marvelous time-saving devices.

Imagine you have a sequence of real numbers, z1 , z2 , z3 , .... It can be hard to keep track of such numbers and
all the information they contain. A generating function provides a way of wrapping up the entire sequence
into one expression. When we want to extract information about the sequence, we simply interrogate this
one expression in a particular way in order to get what we want.

There are many different types of generating function. These are different in the sense that we wrap up the
sequence in different ways, and so extracting particular pieces of information from them requires different
techniques. The two types of generating function that you should be familiar with are:

1. The probability generating function (PGF);
2. The moment generating function (MGF)

1.3.6 Probability generating functions


The probability generating function (p.g.f.) of X is defined to be

    GX(s) = E(s^X) = ∑_{k=0}^{∞} s^k P(X = k)   for |s| ≤ 1.

Basic requirement(s): the random variable X takes values in N0 = {0, 1, 2, . . .}.

Sequence to 'wrap up': P(X = 0), P(X = 1), P(X = 2), P(X = 3), ...

Method: ∑_{k=0}^{∞} s^k P(X = k) (for any s ∈ [−1, 1]).

Information stored:
· The probabilities P(X = 0), P(X = 1), P(X = 2), ...
· Some moments (expectations).

The information stored in probability generating functions can be accessed as follows.


1. Calculating moments.

       E(X) = dGX(s)/ds evaluated at s = 1, i.e. E(X) = G'X(1),
       E[X(X − 1)] = d²GX(s)/ds² evaluated at s = 1, i.e. E[X(X − 1)] = G''X(1), and so on.

2. Calculating probabilities. Suppose that X takes values in N0 = {0, 1, 2, . . .}.

   (a) GX(0) = ∑_{k=0}^{∞} 0^k P(X = k) = P(X = 0).

   (b) If we expand GX(s) in powers of s, the coefficient of s^k is equal to P(X = k), so we can also find
       P(X = k) for all k. (Both ideas are illustrated in the short sketch below.)
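To make both extraction methods concrete, here is a small computational sketch using Python's sympy library (an illustrative addition, not taken from the notes). It uses the Poisson PGF GX(s) = e^{λ(s−1)}, whose format is confirmed in Example 1.4 below, recovering moments by differentiating at s = 1 and probabilities as coefficients of s^k in the series expansion.

# Illustrative sketch: extracting moments and probabilities from a PGF with sympy.
import sympy as sp

s, lam = sp.symbols('s lambda', positive=True)
G = sp.exp(lam * (s - 1))                        # PGF of a Poisson(lambda) random variable

# 1. Moments: E[X] = G'(1) and E[X(X-1)] = G''(1)
EX = sp.diff(G, s).subs(s, 1)                    # -> lambda
EXX = sp.diff(G, s, 2).subs(s, 1)                # -> lambda**2
var = sp.simplify(EXX + EX - EX**2)              # var(X) = E[X(X-1)] + E[X] - E[X]^2 -> lambda

# 2. Probabilities: P(X = k) is the coefficient of s**k in the expansion of G(s)
expansion = sp.series(G, s, 0, 4).removeO()
probs = [sp.simplify(expansion.coeff(s, k)) for k in range(4)]   # exp(-lambda)*lambda**k/k!

print(EX, var, probs)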

Example 1.1
You can also work ‘backwards’: given a PGF for a random variable X, what is the distribution of
X?

For example, suppose you are told that GX (s) = ps/[1 − (1 − p)s]. We can convert this expression
into the summation format of a PGF and spot its distribution:
    GX(s) = ps / [1 − (1 − p)s]
          = ps ∑_{k=0}^{∞} (1 − p)^k s^k
          = ∑_{j=1}^{∞} s^j (1 − p)^{j−1} p,   where j = k + 1.

Therefore, P(X = j) = (1 − p)^{j−1} p, so X ∼ Geometric(p).

In fact, different distributions have different PGF formats but random variables with the same distribution
have a PGF of the same format. You can therefore spot the distribution of a random variable instantly from
the format of its PGF.

Exercise 1.2
Find the format of the following PGFs, either by calculating the PGF directly or looking in a text
book!

Distribution of X Probability generating function, GX (s)

X ∼ Geometric(p)

X ∼ Poisson(λ)

X ∼ Binomial(n, p)

X ∼ Bernoulli(p)

1.3.7 Moment generating functions


The moment generating function (m.g.f ) of X is defined to be

    MX(t) = E[exp(tX)] = ∫_R exp(tx) fX(x) dx   for any t ∈ R such that the integral converges.

Basic requirement(s): a real-valued random variable.

Quantity to 'wrap up': fX(x).

Method: ∫_R exp(tx) fX(x) dx (for any t ∈ R such that the integral converges).

Information stored: moments.

We can access all moments of X from MX(t):

    E[X^n] = MX^(n)(0) = d^n MX(t)/dt^n evaluated at t = 0.

Just as with PGFs, you can spot the distribution of a random variable instantly from the format of its
MGF.
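As a brief illustration (again an added sketch rather than part of the notes), the same moment-extraction idea can be checked with sympy, here using the standard Exponential(λ) MGF MX(t) = λ/(λ − t), valid for t < λ:

# Illustrative sketch: recovering moments from an MGF by differentiating at t = 0.
import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)                    # MGF of an Exponential(lambda) random variable

EX = sp.diff(M, t, 1).subs(t, 0)       # first moment  -> 1/lambda
EX2 = sp.diff(M, t, 2).subs(t, 0)      # second moment -> 2/lambda**2
var = sp.simplify(EX2 - EX**2)         # -> 1/lambda**2, matching the table in Section 1.3.4

print(EX, EX2, var)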

Exercise 1.3
Find the format of the following MGFs, either by calculating the MGF directly or looking in a
textbook!

Distribution of X            Moment generating function, MX(t)

X ∼ Normal(µ, σ 2 )

X ∼ Exponential(λ)

X ∼ Binomial(n, p)

X ∼ Bernoulli(p)

X ∼ Geometric(p)

Notice that a discrete or continuous random variable can have an MGF, whereas a PGF is for discrete random
variables only (i.e. those with countable state space).

1.3.8 What else can we do with generating functions?
We’ll concentrate here on (useful) things you can do with probability generating functions, though similar
conclusions can be reached using moment generating functions too.
1. Calculating the distribution of sums of random variables.
Suppose that X1 , . . . , Xn are i.i.d. random variables with common PGF GX (s). The PGF GY (s) of
Y = X1 + · · · + Xn is given by

    GY(s) = E(s^{X1+···+Xn})
          = ∏_{i=1}^{n} E(s^{Xi})
          = [GX(s)]^n,

and by looking at [GX(s)]^n, we can spot the distribution of Y, which may otherwise be difficult. Also,
if Z1 ⊥⊥ Z2 (that is, Z1 and Z2 are independent but do not necessarily have the same distribution) then

    GZ1+Z2(s) = GZ1(s) GZ2(s).

By looking at the PGF of Z1 + Z2 , GZ1 +Z2 (s), we can then deduce the distribution Z1 + Z2 . This can
be an easier strategy than trying to deduce its distribution directly.

Example 1.4
X and Y are both Poisson random variables, with parameters λ1 and λ2 , respectively. Assume that
X and Y are also independent of each other. What is the distribution of X + Y ?

Solution: Use probability generating functions!

The PGF of a Poisson random variable with parameter λ is

    G(s) = e^{λ(s−1)}.

Therefore, the PGF of X + Y is

    GX+Y(s) = e^{λ1(s−1)} e^{λ2(s−1)} = e^{(λ1+λ2)(s−1)}.

We recognise the format of this PGF as the PGF of a Poisson distribution with parameter (λ1 + λ2 ).
Therefore,
X + Y ∼ Poisson(λ1 + λ2 )

2. Calculating the PGF of a random sum (that is, the sum of a random number of random variables).
Suppose that X1 , . . . , XN are i.i.d. random variables with common PGF GX (s) and that N has PGF
GN (s).

The PGF GY (s) of Y = X1 + · · · + XN is given by


    GY(s) = E[ E(s^Y | N) ].

We'll see this technique of expanding an expectation again later (you have already seen it if you studied
STAT2001 or STAT3101), and it will prove to be a very useful technique in solving problems. Using
this, we can deduce that

    E[ E(s^Y | N) ] = E[ (GX(s))^N ] = GN(GX(s)).

Similarly, it can be shown (challenge!) that the m.g.f. MY(t) of Y is given by

    MY(t) = E[exp(tY)] = GN[MX(t)].


Example 1.5
The number of people who enter a particular bookstore on Gower Street, per day, follows a Poisson
distribution with parameter 300. Of those who visit, 60% will make a purchase. What is the
distribution of the number of people who make a purchase at the bookstore in one day?

Solution: Use probability generating functions!

Notice that whether a visitor makes a purchase is a Bernoulli trial, with probability of success 0.6.
Let Xi = 1 if the ith customer makes a purchase, and Xi = 0 otherwise,
so that we are looking to find the distribution of Y = X1 + ... + XN where N ∼ Poisson(300). Notice
that the PGF of the Bernoulli random variable X is GX(s) = (0.4 + 0.6s), and the PGF of N is
e^{300(s−1)}.

By the result above, we know that the PGF of Y is GY(s) = GN(GX(s)). Think of this as using
GX(s) in place of s in the PGF of N:

    GY(s) = GN(GX(s)) = e^{300(0.4+0.6s−1)} = e^{300(0.6s−0.6)} = e^{180(s−1)}.

We recognise this as the PGF of a Poisson random variable with mean 180. Therefore, the distribution
of those who make a purchase is Poisson(180).
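A quick simulation is a useful sanity check on this random-sum result (the sketch below is an illustrative addition, not part of the notes): generate a Poisson(300) number of visitors each day, thin them with purchase probability 0.6, and compare the mean and variance of the resulting counts with 180 (for a Poisson distribution the two should agree).

# Illustrative simulation: a Poisson(300) number of Bernoulli(0.6) trials behaves like Poisson(180).
import numpy as np

rng = np.random.default_rng(0)
n_days = 100_000

visitors = rng.poisson(300, size=n_days)       # N, the number of visitors each day
purchasers = rng.binomial(visitors, 0.6)       # Y = X_1 + ... + X_N, purchasers each day

print(purchasers.mean(), purchasers.var())     # both should be close to 180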

1.4 Conditioning on events


Calculating probabilities that are conditional on a particular event or occurrence is a natural requirement.
Make sure you understand the rationale behind conditional probability, don’t just learn the formulae!

Easy exercise
You have a deck of playing cards (52 cards in total, split equally into four suits: hearts, diamonds,
spades and clubs, with the first two considered ‘red’ cards, and the others considered ‘black’ cards).
• I pick a card at random. What is the probability that the chosen card is from the diamond
suit?

• I pick a card at random and tell you that it is a red card (i.e. either hearts or diamonds). What
is the probability that the chosen card is from the diamond suit?

The second question requires a conditional probability (it is conditional on knowing that the card is red),
but you most likely calculated the probability intuitively without using any formulae. How did you do that?
You applied Bayes’s theorem intuitively, without even realising it.

In fact, Bayes’s theorem for calculating conditional probabilities is perfectly intuitive. Suppose we are
calculating P (A|C) for some events A and C:
• This is asking for the probability of A occurring, given that C occurs.
• Assuming C occurs, the part of the state space Ω which relates to C becomes the 'new' state space,
Ω′ (which is just the event C for our purposes).

• What we are asking here, in effect, is to calculate the probability of A in the 'new' sample space Ω′
(i.e. P(A ∩ C)), scaled to Ω′ (i.e. divided by P(C)).
• The final equation is therefore P(A|C) = P(A ∩ C)/P(C).

Why scale by P (C)?


Simply because the probability of all disjoint events in any sample space must sum to 1. Our sample
space is now Ω′ = C, so the new (probability) weight function we apply to all events in Ω′ must sum to
1. We scale all probabilities by P(C) to ensure that this is the case.

The following graphic may help you to visualise why Bayes’s theorem ‘works’.

1.4.1 Conditional probability
Let A and C be events with P(C) > 0. The conditional probability of A given C is

    P(A | C) = P(A ∩ C) / P(C).

It is easy to verify that

1. P(A | C) ≥ 0 ,
2. if A1 , A2 , . . . are mutually exclusive events, then
    P(A1 ∪ A2 ∪ . . . | C) = ∑_i P(Ai | C),

(try drawing diagrams like the above to visualise this)

3. P(C | C) = 1.
Conditional probability for random variables instead of events works in the same way.

Discrete RVs X and Y:

• Joint probability mass function: pX,Y(x, y) = P(X = x, Y = y).
• X and Y are independent if and only if, for all x, y: pX,Y(x, y) = pX(x) pY(y) = P(X = x) P(Y = y).
• Marginal probability mass function (of X): pX(x) = P(X = x) = ∑_y P(X = x, Y = y).
• Conditional PMF of X given Y = y: pX|Y=y(x) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y).
• Conditional distribution function of X given Y = y: FX|Y=y(x) = P(X ≤ x | Y = y).

Continuous RVs X and Y:

• Joint density function: fX,Y(x, y).
• X and Y are independent if and only if, for all x, y: fX,Y(x, y) = fX(x) fY(y).
• Marginal density (of Y): fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.
• Conditional density of X given Y = y (for fY(y) > 0): fX|Y=y(x) = fX,Y(x, y) / fY(y).
• Conditional distribution function of X given Y = y: FX|Y=y(x) = P(X ≤ x | Y = y).

1.4.2 Conditional expectation


Just like calculating an expectation of, say, X, requires knowing the distribution of X, the conditional ex-
pectation of X, given that we know Y = y requires knowing the distribution of (X|Y = y).

The conditional expectation of X given Y = y is


    E(X | Y = y) = ∑_x x P(X = x | Y = y)                   for discrete RVs,
    E(X | Y = y) = ∫_{−∞}^{∞} x fX|Y=y(x) dx                for continuous RVs.

The conditional expectation of g(X) given Y = y, for some function g, is

    E(g(X) | Y = y) = ∑_x g(x) P(X = x | Y = y)             for discrete RVs,
    E(g(X) | Y = y) = ∫_{−∞}^{∞} g(x) fX|Y=y(x) dx          for continuous RVs.

Example 1.6
Suppose that Ω = {ω1 , ω2 , ω3 } and P(ωi ) = 1/3 for i = 1, 2, 3. Suppose also that X and Y are
random variables with X(ω1 ) = 2, X(ω2 ) = 3, X(ω3 ) = 1, and Y (ω1 ) = 2, Y (ω2 ) = 2, Y (ω3 ) = 1.
Find the conditional pmf pX|Y =2 (x) and the conditional expectation E(X | Y = 2).

Possible values of X are 1, 2, 3.

    pX|Y=2(1) = P(X = 1 | Y = 2) = 0.

    pX|Y=2(2) = P(X = 2 | Y = 2) = P(X = 2, Y = 2)/P(Y = 2) = P(ω1)/[P(ω1) + P(ω2)] = 1/2.

    pX|Y=2(3) = P(X = 3 | Y = 2) = P(X = 3, Y = 2)/P(Y = 2) = P(ω2)/[P(ω1) + P(ω2)] = 1/2.

Thus the conditional pmf of X given Y = 2 equals 1/2 for x = 2 and for x = 3, and is 0 otherwise. The
conditional expectation is

    E(X | Y = 2) = ∑_x x P(X = x | Y = 2) = (1 × 0) + (2 × 1/2) + (3 × 1/2) = 5/2.

An important concept
So far, we have only calculated conditional expectations when conditioning on specific values, e.g.
E[X|Y = 2]. We can extend this idea to conditioning on random variables without equating them to
specific values.

The notation E(X | Y ) is used to denote a random variable that takes the value E(X | Y = y) with
probability P(Y = y).

Note that E(X | Y ) is a function of the random variable Y , and is itself a random variable.

Example 1.7
In Example 1.6 we found that E(X | Y = 2) = 5/2. It can similarly be shown that E(X | Y = 1) = 1.
You can also check that P(Y = 1) = 1/3 and P(Y = 2) = 2/3. Thus, the random variable E(X | Y)
has two possible values, 1 and 5/2, with probabilities 1/3 and 2/3, respectively. That is,

    E[X|Y] = 1    with probability 1/3,
    E[X|Y] = 5/2  with probability 2/3.

1.4.3 Useful conditioning formulae
Three important formulae:
(i) Law of total probability. Let A be an event and Y be ANY random variable. Then

    P(A) = ∑_y P(A | Y = y) P(Y = y)                 if Y is discrete,
    P(A) = ∫_{−∞}^{∞} P(A | Y = y) fY(y) dy          if Y is continuous.

(ii) Modification of (i): if Z is a third discrete random variable,

    P(X = x | Z = z) = ∑_y P(X = x, Y = y | Z = z)
                     = ∑_y P(X = x, Y = y, Z = z) / P(Z = z)
                     = ∑_y [P(X = x, Y = y, Z = z) / P(Y = y, Z = z)] × [P(Y = y, Z = z) / P(Z = z)]
                     = ∑_y P(X = x | Y = y, Z = z) P(Y = y | Z = z),

assuming the conditional probabilities are all defined. Note the similarity with (i).

(iii) Law of conditional (iterated) expectations

    E[X] = E[E(X | Y)] = ∑_y E(X | Y = y) P(Y = y)               if Y is discrete,
    E[X] = E[E(X | Y)] = ∫_{−∞}^{∞} E(X | Y = y) fY(y) dy        if Y is continuous,

and, more generally,

    E[g(X)] = E[ E[g(X) | Y] ].

Note in particular:

(i) The law of total probability applies to ANY event A and ANY random variable Y .
(ii) The law of the conditional (iterated) expectation, E[X] = E[E[X|Y ]] applies to ANY random
variables X and Y , and we will be using this identity heavily during the course.

Example 1.8
In Example 1.7 we found the distribution of the random variable E(X | Y ) was
    E[X|Y] = 1    w.p. 1/3,
    E[X|Y] = 5/2  w.p. 2/3.

The expectation of this random variable is

    E[E(X|Y)] = 1 × 1/3 + 5/2 × 2/3 = 2.

Also, the expectation of X is given by

    E(X) = 1 × 1/3 + 2 × 1/3 + 3 × 1/3 = 2,
so that we see that the law of the conditional (iterated) expectation, i.e. that E[X] = E[E[X|Y ]],
holds.

Example 1.9
Let X and Y have joint density
    fX,Y(x, y) = (1/y) e^{−x/y} e^{−y},   0 < x, y < ∞.

Find fY (y), E(X | Y ) and hence E(X).

Solution
For any y > 0,

    fY(y) = ∫_0^∞ (1/y) e^{−x/y} e^{−y} dx = e^{−y} [−e^{−x/y}]_0^∞ = e^{−y},
so Y has an exponential distribution with parameter 1.

Also, for any fixed y > 0 and any x > 0,

    fX|Y(x | y) = fX,Y(x, y) / fY(y) = e^{−x/y} e^{−y} / (y e^{−y}) = (1/y) e^{−x/y},

so X | Y = y ∼ Exp(1/y).

It follows that E(X | Y = y) = y and so E(X | Y ) = Y . Using the Law of Conditional Expectations
we then have
E[X] = E[E(X | Y )] = E(Y ) = 1 .
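For readers who like to double-check integrals, here is a brief sympy sketch (an added illustration, not from the notes) that reproduces the three results of Example 1.9 symbolically:

# Illustrative check of Example 1.9 with sympy.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f_xy = sp.exp(-x / y) * sp.exp(-y) / y                       # joint density

f_y = sp.integrate(f_xy, (x, 0, sp.oo))                      # marginal of Y -> exp(-y)
E_X_given_y = sp.integrate(x * f_xy / f_y, (x, 0, sp.oo))    # E(X | Y = y)  -> y
E_X = sp.integrate(E_X_given_y * f_y, (y, 0, sp.oo))         # E[E(X | Y)]   -> 1

print(sp.simplify(f_y), sp.simplify(E_X_given_y), E_X)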

1.4.4 The idea of ‘first step decomposition’


The ‘first step decomposition’ technique is a very useful method for computing certain types of expectations
and probabilities. We will be using this method extensively throughout the course.

Example 1.10
Roads connecting three locations in a city are shown in the following diagram, the time taken to cycle
along any of the roads being 1 minute. Geraint is a bicycle courier whose office is located at location
A. He has a parcel that he needs to deliver to location C. However, he is a law abiding cyclist who
will only follow the direction of travel permitted on each road, as indicated by the arrows. Whenever
he comes to a junction, he selects any of the routes available to him with probabilities as given in the
diagram below.

[Diagram: one-way roads A → C (probability 1/3), A → B (probability 2/3), B → C (probability 1/2)
and B → A (probability 1/2); each road takes 1 minute to cycle.]

(i) How long, on average, will it take Geraint to deliver the parcel?
(ii) How many times, on average, will Geraint visit location B before delivering the parcel?

Now suppose that another location, D, is added to the mix, but that all couriers arriving at D are
stuck there indefinitely.

[Diagram: one-way roads A → C (probability 1/3), A → B (probability 2/3), and from B three exits,
B → C, B → A and B → D, each with probability 1/3.]

(iii) With what probability does the parcel never get delivered?

Some initial comments on these type of questions:


(i) How long, on average, will it take Geraint to deliver the parcel?
– If there is a positive probability of never getting to location C, then the expectation will be infinite.
– In the first diagram, the parcel will be delivered eventually as in this case P (Parcel not delivered by time n) →
0 as n → ∞.
– For this question, the minimum possible time taken is 1 minute (if he goes directly from A to C).
Therefore the expectation should be greater than 1 (this is a handy reality check at the end of
such questions).

(ii) How many times, on average, will Geraint visit location B before delivering the parcel?
– Such expectations must be positive as Geraint will visit location B with positive probability.
– This expectation cannot be negative! Again, this is a good reality check.

– This expectation can be anything in the interval (0, ∞). Compare this with the previous question.
(iii) For the updated map, with what probability does the parcel never get delivered?
– This must be between 0 and 1. This may be obvious, but I have known (good) students to forget
that they are calculating a probability and give either a negative answer or something greater
than 1.
There are many ways of solving these questions, but a convenient method is to use ‘first step decomposition’:
think about ‘where do we go first’ ?

First step decomposition: what is it, and why is it useful?

This technique can be adapted to solve many different problems, and it will be used frequently in this
course. In fact, all three questions posed about Geraint’s cycling trip can be solved using the general
idea of first step decomposition.

The idea is to decompose the process on the basis of where the process goes first (or its first step, hence
the name). This, combined with the iterated law of conditional expectation (when we want to compute
an expectation), or the law of total probability (for computing probabilities), proves to be a rather
powerful method.

Typical questions which can be solved via first step decomposition include:
1. How long does it take, on average, to reach a particular point for the first time?

2. How often, on average, do we visit a certain point before a particular event occurs?
3. What’s the probability that we ever reach/ never reach a particular point?

Example 1.10 continued
How long, on average, will it take Geraint to deliver the parcel?

We’ll apply first step decomposition to solve this question. Let Xn denote Geraint’s location at time
n, and let T denote the time it takes for Geraint to deliver the parcel to location C.

We need to calculate E[T ], and we will do this by conditioning on Geraint’s next move. In the first
instance, he starts in location A, so X0 = A. Using the iterated law of conditional expectation, we
get:
E[T ] = E[E[T |X1 ]] = E[T |X1 = B]P (X1 = B) + E[T |X1 = C]P (X1 = C).
Since X1 = B with probability 2/3 and X1 = C with probability 1/3, we get:

    E[T] = (2/3) E[T | X1 = B] + (1/3) E[T | X1 = C].

Now notice that E[T | X1 = C] = 1 because we are conditioning on Geraint reaching location C for
the first time at time 1, so this simplifies to

    E[T] = (2/3) E[T | X1 = B] + 1/3.     (2)

Now compute E[T | X1 = B] by applying first step decomposition again: from B, where can Geraint
go next?

    E[T | X1 = B] = E[E[T | X2, X1 = B]] = (1/2) E[T | X2 = C, X1 = B] + (1/2) E[T | X2 = A, X1 = B].

Now notice that E[T | X2 = C, X1 = B] = 2 and E[T | X2 = A, X1 = B] = 2 + E[T] (the latter because
Geraint used 2 minutes going to B and then back to A, and once back in A it is as if the process
starts over from scratch).

Therefore E[T | X1 = B] = 1 + (1/2)(2 + E[T]). Substituting this back into (2), we have:

    E[T] = (2/3) E[T | X1 = B] + 1/3 = (1/3) E[T] + 5/3.

Solving gives E[T] = 2.5 minutes.

You may want to consider simplifying the notation of the solution to Example 1.10 above, by defining
new quantities. For example, let Li denote the expected time to reach C, given that we start in state i.
Then, the solution above simplifies to

    LA = 1 + (2/3) LB + (1/3) LC.     (3)

We must add 1 since we are using one minute to move from state A to another state. Now we know
that LC = 0 (if we start in C, it takes no time to get there!), and LB = 1 + LA/2 + LC/2 = 1 + LA/2.
Therefore, substituting into (3), we need to solve

    LA = 1 + (2/3)(1 + LA/2) = (1/3) LA + 5/3.

Solving gives LA = 2.5 minutes, as before.
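A simulation provides a reassuring check of this answer. The sketch below (an illustrative addition rather than part of the notes) simulates Geraint's journey on the first diagram many times and averages the delivery times; the result should be close to 2.5 minutes.

# Illustrative Monte Carlo check of Example 1.10(i).
import random

def delivery_time():
    state, minutes = 'A', 0
    while state != 'C':
        minutes += 1
        if state == 'A':                      # from A: C with prob 1/3, otherwise B
            state = 'C' if random.random() < 1/3 else 'B'
        else:                                 # from B: C with prob 1/2, otherwise back to A
            state = 'C' if random.random() < 1/2 else 'A'
    return minutes

random.seed(0)
times = [delivery_time() for _ in range(100_000)]
print(sum(times) / len(times))                # should be close to E[T] = 2.5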

Exercise 1.11
Use a similar method to solve parts (ii) and (iii) of Example 1.10.

1.5 Exercises
The following problems can be solved by a variety of methods. Try to solve them using conditioning
arguments.

Exercise 1.12
A coin, with probability of heads equal to p, is tossed repeatedly until the first head is obtained.
What is the expected number of throws required?

Exercise 1.13
Two UCL students are trapped in a corridor in Torrington Place after 7 p.m. The corridor has three
exits. The first exit leads out of the building in 2 minutes, the second leads back to the original
corridor after 5 minutes and the third is a dead end that leads back after 1 minute. Assume that
the students are trying to get out of the building, but are so engrossed in a statistical discussion that
they fail to learn from experience, and they choose an exit at random each time they return to the
corridor.
(a) Show that the students are certain to get out of the building eventually.

(b) Find the expected length of time until they get out.
(c) Find the variance of the time until they get out.
(d) Find the probability generating function of the time until they get out.

Exercise 1.14
The number of claims arriving at an insurance company in a week has a Poisson distribution with
mean λ. For any claim form, there is a probability p that it is completed incorrectly, independently
of other claim forms and of the number of claims.

(a) Find the distribution of the number of forms completed incorrectly in a week.
(b) Deduce the distribution of the number of forms completed incorrectly in a year.
(c) The sizes of the claims are independent random variables X1 , X2 , X3 , . . . that have the same
distribution, and that are independent of the number of claims. Let T be the total amount
claimed in one week (including claims on incorrect forms). Obtain formulae for the expectation
and variance of T .

2 What is a stochastic process?
2.1 Definitions and basic properties
Definition: Stochastic process

A stochastic process is a collection of random variables {Xt , t ∈ T } taking values in the state space
S. The parameter, or index, set T often represents time.

Notation
• Discrete-time processes: {Xt , t ∈ T }, where T = {0, 1, 2, . . .}. This is often written in terms of the
natural numbers (including zero) rather than T , i.e. {Xn , n ∈ N}, where N = {0, 1, 2, . . .}
• Continuous-time processes: {X(t), t ∈ T }, where T = R+ (i.e. t ≥ 0).
• Notice that the state space can be discrete or continuous. In this course we will only consider discrete
state spaces.

Examples of stochastic processes.


Below are examples of stochastic processes, categorised according to whether they are continuous
or discrete time processes, and whether the state space is discrete or continuous. We only consider
discrete state space processes in this course, in both discrete and continuous time.

• Discrete time, discrete state space: how many more heads than tails in total so far ('time' here is the
number of coin tosses).

• Discrete time, continuous state space: a gambler's winnings and losses after each bet ('time' here is the
bet number).

• Continuous time, discrete state space: arrivals over time (the process increases by 1 at each arrival).

• Continuous time, continuous state space: Brownian motion (used to model stock prices).

NOTE: Continuous-time processes can change their value/state (‘jump’) at any instant of time; discrete-
time processes can only do this at a discrete set of time points. For discrete-time processes, when we use the
word ‘time’ we mean ‘number of transitions/steps’.

2.2 The Markov property


Definition: Markov process

A stochastic process {Xt , t ∈ T } is a Markov process if, for any sequence of times
0 < t1 < · · · < tn < tn+1 and any n ≥ 0,

P(Xtn+1 = j | Xtn = in , · · · , Xt0 = i0 ) = P(Xtn+1 = j | Xtn = in ),

for any j, i0 , . . . , in in S.

The Markov property - an intuitive explanation.


In other words, the Markov property states that the value (or state) of the process at time tn+1 depends
only on its value (state) at time tn , and not on its behaviour previous to time tn . That is, the probability
of any future event depends only on the most recent information we have about the process.

Let
• Xn be the present state;
• A be a past event (involving X0 , . . . , Xn−1 );
• B be a future event (involving Xn+1 , Xn+2 , . . ..)
The Markov property (MP) states that given the present state (Xn ), the future B is conditionally independent
of the past A, denoted
    B ⊥⊥ A | Xn.

Therefore,

P(A ∩ B | Xn = i) = P(A | Xn = i) P(B | Xn = i).


Similarly, on applying the Markov property,
P(B | Xn = i, A) = P(B | Xn = i),
for any n ≥ 0, any i and any A and B. Notice here that we can use the Markov property because the most
recent information about the chain, its location at time n, is given exactly as Xn = i.

An important note about the Markov property.


Think carefully about the definition of the Markov property. It states that
P(Xtn+1 = j | Xtn = in , · · · , Xt0 = i0 ) = P(Xtn+1 = j | Xtn = in ).
Notice the term I highlight above. We can only use the Markov property if the most recent information
(highlighted in bold) gives us specific information about the location of the process. We could not apply
the Markov property if, for example, we wanted to simplify
P(Xtn+1 = j | Xtn = in or jn , · · · , Xt0 = i0 ),
since the most recent information (the location of the process at time tn ) does not tell us where exactly
the Markov chain is at time tn .

2.3 Why is the Markov property useful?
We’ll be using the Markov property repeatedly throughout the course, so make sure you are familiar with it!
The Markov property is a strong independence assumption. It is useful because it simplifies probability
calculations. It enables joint probabilities to be expressed as a product of simple conditional probabilities.

Example 2.1
Suppose that X is a (discrete time) Markov process taking values 0, 1 and 2. Write

P(X0 = 1, X1 = 2, X2 = 0)

as a product of conditional probabilities, simplifying your answer as much as possible, justifying each
step of your argument.

P(X0 = 1, X1 = 2, X2 = 0) = P(X2 = 0 | X1 = 2, X0 = 1) P(X1 = 2, X0 = 1) Bayes’s theorem


= P(X2 = 0 | X1 = 2) P(X1 = 2, X0 = 1) Markov property
= P(X2 = 0 | X1 = 2) P(X1 = 2 | X0 = 1) P(X0 = 1), Bayes’s theorem

Therefore, P(X0 = 1, X1 = 2, X2 = 0) = P(X2 = 0 | X1 = 2) P(X1 = 2 | X0 = 1) P(X0 = 1).

Exercise 2.2
Make sure you understand how to apply the Markov property (repeatedly!) by answering the
following.

Suppose that X is a (discrete time) Markov process taking values in some statespace S. Show that
    P(X0 = i0, . . . , Xn = in) = [ ∏_{k=1}^{n} P(Xk = ik | Xk−1 = ik−1) ] P(X0 = i0).

This is a very useful identity which we’ll be making use of throughout the course.

2.4 How can I tell if a process is Markov?


You will come across many ICA and exam questions asking you to explain whether or not a particular process
is Markov. There are two ways that you could establish whether a process is Markov.
1. Without using any mathematics, deduce that the process is/ isn’t Markov logically.
2. Checking (mathematically) that the Markov property holds for all states in the statespace at all times.

The second strategy is particularly hard, so we mainly use this technique when we suspect that the Markov
property does not hold by finding a counter example which does not satisfy the Markov property. To show
that the Markov property does not hold, find one set of states a, b, c and perhaps d such that one of the
following holds (you do not need to show both hold - one will do!):

• P(Yn+1 = a | Yn = b, Yn−1 = c) ≠ P(Yn+1 = a | Yn = b) for some a, b, c;

• P(Yn+1 = a | Yn = b, Yn−1 = c) ≠ P(Yn+1 = a | Yn = b, Yn−1 = d) for some a, b, c, d with c ≠ d. This
shows that the state of the process at time (n − 1) affects the state of the process at time (n + 1),
violating the Markov property.

Exploiting features of the process to decide whether it is Markov

Many questions that you will encounter will provide you with a process {Xn , n = 0, 1, 2, ...}
which is known to be Markov, and will define a further process {Yn , n = 0, 1, 2, ...} as a function of the
original Markov process. The question is usually whether the ‘new’ process is also a Markov process.
That is, if {Xn , n = 0, 1, 2, ...} is a Markov process, and {Yn , n = 0, 1, 2, ...} is a function of this Markov
process, then does the Markov property hold for {Yn , n = 0, 1, 2, ...} too?

In this kind of situation, it is often expected (or even necessary) to take a mathematical approach to
proving or disproving that the Markov property holds in the ‘new’ process. Two strategies to keep in
mind:
• Is the function in question bijective (one-to-one)? If it is, then the new process will also be Markov.
(Think carefully as to why this has to be the case!).

• If the function in question is not bijective, then the new process may, or may not, be Markov.
In this case, if you are trying to prove that the new process is not Markov, a good strategy is to
exploit the fact that you do not have uniqueness.
We’ll see these strategies being used in the next example.

Example 2.3.

Let X0, X1, X2, . . . be a sequence of independent random variables such that for n = 0, 1, 2, . . . , we
have P(Xn = +1) = P(Xn = −1) = 1/2. For n = 1, 2, 3, . . . define:

    Yn = (Xn−1 + Xn)/2 ;        Zn = (Xn−1 + 2Xn)/3.
(i) Establish that X0 , X1 , ... is a Markov process.
(ii) Is {Yn , n = 0, 1, 2, ...} a Markov process?

(iii) Is {Zn , n = 0, 1, 2, ...} a Markov process?

Solution:
(i) It is easy to show that ANY sequence of i.i.d. random variables satisfies the Markov property
(try it), therefore part (i) is complete.
(ii) The state space for {Yn , n = 0, 1, 2, ...} is S = {−1, 0, 1}. Notice that
– Yn = −1 only if Xn−1 = −1 and Xn = −1;
– Yn = 1 only if Xn−1 = 1 and Xn = 1;
– Yn = 0 if Xn−1 = −1 and Xn = 1 OR Xn−1 = 1 and Xn = −1;
The transformation function from X to Y is not bijective because there is ambiguity in the value
of the pair (Xn , Xn−1 ) when Yn = 0. Let’s try to show that the Markov property does not hold
by finding a counter example. After a trial-and-error approach I find a counter example, which
proves that the Markov property does not hold (note: this is not the only counter example in
this case):

P (Yn+1 = 1|Yn = 0, Yn−1 = −1) = 0.5


P (Yn+1 = 1|Yn = 0, Yn−1 = 1) = 0

Therefore, {Yn , n = 0, 1, 2, ...} is not a Markov process. (A quick simulation check of this counter example is sketched after the solution.)


Notice that my strategy here was to let Yn = 0: this prompts ambiguity in the value of the
original process at time n, and hence lets me construct a counter example.

(iii) The state space for {Zn , n = 0, 1, 2, ...} is S = {−1, −1/3, 1/3, 1}. Notice that
– Zn = −1 only if Xn−1 = −1 and Xn = −1;
– Zn = −1/3 only if Xn−1 = 1 and Xn = −1;
– Zn = 1/3 only if Xn−1 = −1 and Xn = 1;
– Zn = 1 only if Xn−1 = 1 and Xn = 1.
Therefore, whatever the value of Z, we can deduce exactly the values of the relevant random
variables X: we have a one-to-one relationship between pairs of X variables and values for Z.
We must therefore conclude that {Zn , n = 0, 1, 2, ...} is a Markov process.
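The counterexample in part (ii) can also be checked empirically. The sketch below (an illustrative addition, not part of the notes) simulates a long run of the X process, builds Y, and estimates the two conditional probabilities used above:

# Illustrative simulation supporting the counterexample in Example 2.3(ii).
import numpy as np

rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=1_000_000)
Y = (X[:-1] + X[1:]) / 2                      # Y_n = (X_{n-1} + X_n) / 2

prev, curr, nxt = Y[:-2], Y[1:-1], Y[2:]
for history in (-1, 1):
    mask = (curr == 0) & (prev == history)
    # estimate P(Y_{n+1} = 1 | Y_n = 0, Y_{n-1} = history)
    print(history, np.mean(nxt[mask] == 1))   # roughly 0.5 for history = -1, exactly 0 for history = 1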

Exercise 2.4.
Try the following exercises to make sure that you are comfortable in establishing whether the Markov
property holds for the following processes.

???

Let X0 , X1 , X2 , . . . be a sequence of independent random variables such that for n = 0, 1, 2, . . . , we


have P(Xn = 0) = 0.75 and P(Xn = 1) = 0.25. For n = 1, 2, 3, . . . define

    Yn = (Xn + Xn−1)²

and Y0 = 0. State whether or not the stochastic process {Yn , n = 0, 1, 2, ...} is a Markov chain. If it is
a Markov chain, state why it must be so. If the process is not a Markov chain, then give an example
where the Markov property breaks down.

???

DNA consists of four nucleotide bases - A, G, C, and T.

Suppose that at each locus (location) along a DNA strand:

• Given the base at locus n, the probability that it will not change at locus n + 1 is 0.5.
• If the base at locus n + 1 is different from that at locus n, the base at locus n + 1 is chosen randomly from
the remaining bases.

Let Zn be the base at locus n. Is {Zn } a Markov chain? [HINT: No maths required! Deduce that the
process must satisfy the Markov property logically.]

3 Discrete-time Markov processes
3.1 Introduction to discrete time Markov chains
Definition: Discrete-time Markov chain
A discrete-time Markov chain, often abbreviated to Markov chain, is a sequence of random
variables X0 , X1 , X2 , . . . taking values in a finite or countable state space S such that, for all
n, i, j, i0 , i1 , . . . , in−1

P(Xn+1 = j | Xn = i, Xn−1 = in−1 , . . . , X0 = i0 ) = P(Xn+1 = j | Xn = i),

that is, a discrete-time stochastic process satisfying the Markov property.

3.1.1 Time homogeneity and transition probabilities

Definition: Time homogeneous Markov chain

A Markov chain is time-homogeneous if

P(Xn+1 = j | Xn = i)

does not depend on n for all i, j ∈ S.

For a time homogeneous Markov chain, we have

P(Xn+1 = j | Xn = i) = P(Xm+1 = j | Xm = i)

even if m ≠ n. In effect, this allows us to forget about when the chain moved from i to j, all that matters is
that this occurred in one time step. When a chain is time homogeneous, we also have that for any integer
r ≥ 1,
P(Xn+r = j | Xn = i) = P(Xm+r = j | Xm = i).
That is, the probability of moving from state i to state j in r time steps is the same regardless of when
this happened. In this course, we will assume that all discrete time Markov processes are time homogeneous
unless otherwise stated.

The probabilities

    pij = P(Xn+1 = j | Xn = i), for i, j ∈ S,


are called the (1-step) transition probabilities of the Markov chain. These are very important and will help us
to solve many interesting questions about the behaviour of Markov chains. Notice that pij does not depend
on n as we are assuming that X is time homogeneous.

3.1.2 Transition matrices


The pij s are often collected into a matrix P called the transition matrix. The transition probability pij is
the element of P in the ith row and the j th column, that is, the (i, j)th element of P .

Important properties of a transition matrix include:

(i) All entries are non-negative (and, because they are probabilities, are ≤ 1).
(ii) Each of the rows sums to 1, that is, ∑_j pij = 1, for all i ∈ S.
Why must each row sum to 1? Why needn’t each column sum to 1?

Exercise 3.1
A rat is put in room 1 in the maze illustrated.

[Maze diagram: four rooms, labelled 1 and 2 on the top row and 3 and 4 on the bottom row, connected by doorways.]

Every minute the rat changes rooms, choosing its exit at random from the exits available. Let Zn
be the number of the room occupied just after the nth transition.

Justify that {Zn } is a Markov chain, and find its transition matrix.

3.1.3 The n-step transition probabilities and the Chapman-Kolmogorov equations

Definition: n-step transition probabilities

The n-step transition probabilities are

    pij^(n) = P(Xn = j | X0 = i) = P(Xm+n = j | Xm = i)   for all m, if {Xn} is time-homogeneous.

So pij^(n) is the probability that a process in state i will be in state j after n 'steps'. Note that due to time
homogeneity, this does not depend on m.

Definition: n-step transition matrix


The n-step transition matrix P^(n) has pij^(n) as its (i, j)th element.

The Chapman-Kolmogorov (C-K) equations show us how to compute pij^(n+m) or P^(n+m) for m ≥ 0, n ≥ 0.
These are useful equations which will enable us to calculate these quantities quickly and easily.

Consider time points 0, m and m + n, where m ≥ 0, n ≥ 0 and suppose that we want to compute the
probability that we go from state i to state j in (n + m) steps,
    pij^(n+m) = P(Xn+m = j | X0 = i).

The C-K equations are derived using the simple observation that at time m, we must be in some state (let’s
call it k). By summing over all possibilities for k, we derive the C-K equations.

[Timeline: state i at time 0, state k at time m, state j at time m + n.]

By conditioning on the state k at time m we derive pij^(m+n):

    pij^(n+m) = P(Xm+n = j | X0 = i)
              = ∑_{k∈S} P(Xm+n = j, Xm = k | X0 = i)          [see (ii) in 'Useful conditioning formulae']
              = ∑_{k∈S} P(Xm+n = j | Xm = k, X0 = i) P(Xm = k | X0 = i)   [Markov property and time homogeneity]
              = ∑_{k∈S} pik^(m) pkj^(n),   for all n, m ≥ 0, all i, j ∈ S.

Definition: Chapman-Kolmogorov equations

The Chapman-Kolmogorov equations are


    pij^(n+m) = ∑_{k∈S} pik^(m) pkj^(n),   for all n, m ≥ 0, all i, j ∈ S.

Note the plural - there is one equation for each (i, j) pair.

Since {pik^(m), k ∈ S} is the ith row of P^(m) and {pkj^(n), k ∈ S} is the jth column of P^(n),
the Chapman-Kolmogorov equations can be written in matrix form as

    P^(m+n) = P^(m) · P^(n),


where · denotes matrix multiplication. Note that this single equation encodes all the individual C-K
equations above.

This may not seem particularly helpful, but note the following:
1. P^(1) is the transition matrix P.
2. P^(0) is the identity matrix I.
3. Note that

       P^(2) = P^(1+1) = P · P = P².
       P^(3) = P^(1+2) = P · P^(2) = P³, etc.

By using the C-K equations repeatedly, we find that:

    P^(n) = Pⁿ.

That is, the n-step transition matrix P^(n) is equal to the nth matrix power of P.

This is enormously helpful - given a Markov chain with (one-step) transition matrix P , we can now compute
its n-step transition matrix by simply multiplying P with itself n times.
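As a quick numerical sketch (Python with NumPy; the 3-state transition matrix below is invented for illustration, not taken from the notes), the n-step transition matrix is simply a matrix power:

    import numpy as np

    # Hypothetical 3-state transition matrix (each row sums to 1).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.4, 0.4, 0.2]])

    # n-step transition matrix P^(n) = P^n (Chapman-Kolmogorov applied repeatedly).
    n = 5
    P_n = np.linalg.matrix_power(P, n)

    # P_n[i, j] is P(X_n = j | X_0 = i).
    print(P_n)
    print(P_n.sum(axis=1))   # each row still sums to 1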

Exercise 3.2
The weather on a particular day is either ‘rainy’ (R) or ‘fine’ (F ). We assume that a Markov chain
model is appropriate. Let Xn be the state of weather on day n. The state space is {R, F }.

The Markov chain assumption means that, given information on whether today is rainy or not, the
probability that tomorrow is a rainy day does not depend on the weather on days prior to today.
That is,

    X_{n+1} ⫫ {X_{n−1}, . . . , X_0} | X_n.

If it rains today, then it will rain tomorrow with probability 0.6.


If it is fine today then it will be fine tomorrow with probability 0.5.

(a) Find the 2-step transition matrix. [HINT: Find the (1-step) transition matrix first, then apply
Chapman-Kolmogorov.]
(b) Find P(X4 = R, X3 = R | X0 = R).

3.1.4 Marginal probabilities


So far, we have only considered conditional probabilities. However, we may also be interested in marginal
probabilities such as P(Xn = j).
Let p_j^{(n)} = P(X_n = j). If, for example, S = N_0 = {0, 1, 2, . . .}, then

    p^{(n)} = (p_0^{(n)}, p_1^{(n)}, . . .) = (P(X_n = 0), P(X_n = 1), . . .)

is a probability row vector (i.e. a row vector with non-negative entries summing to 1) specifying the distri-
bution of Xn .

Do not confuse p_ij^{(n)} and p_j^{(n)}. The former is a conditional probability, and the latter a marginal
probability.

Definition: Initial distribution
(0) (0) (0)
When n = 0, and the state space is S = {0, 1, 2, ...}, then p(0) = (p0 , p1 , p2 , . . .) = (P(X0 =
0), P(X0 = 1), P(X0 = 2), . . .) is the distribution of X0 . This is the initial distribution: the distribu-
tion of states in which we start the Markov chain. This tells us the probability of the chain starting
in any particular state.

If we look at p_j^{(n)}, the probability that the chain is in state j at time n, then

    p_j^{(n)} = P(X_n = j) = \sum_i P(X_n = j | X_0 = i) P(X_0 = i) = \sum_i p_i^{(0)} p_ij^{(n)}.

Equivalently, in matrix notation,

    p^{(n)} = p^{(0)} · P^{(n)} = p^{(0)} · P^n.

Thus, the initial distribution, p(0) , and the transition matrix, P , together contain all the
probabilistic information about the Markov chain. That is, this is all we need in order to
compute or derive any quantity of interest related to the Markov chain.

For example,

    P(X_0 = i_0, X_1 = i_1, . . . , X_n = i_n) = p_{i_0}^{(0)} p_{i_0 i_1} p_{i_1 i_2} · · · p_{i_{n−1} i_n}
is a useful identity to compute a joint probability using only the (one-step) transition probabilities and
initial distribution.
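As a rough illustration in Python/NumPy, using the weather chain of Exercises 3.2 and 3.3 (with state 0 = R and state 1 = F; the particular path below is just an example):

    import numpy as np

    P = np.array([[0.6, 0.4],    # weather chain: rows/columns ordered (R, F)
                  [0.5, 0.5]])
    p0 = np.array([0.7, 0.3])    # initial distribution (p_R, p_F) from Exercise 3.3

    # Marginal distribution at time n: p^(n) = p^(0) P^n.
    n = 4
    p_n = p0 @ np.linalg.matrix_power(P, n)

    # Joint probability of a path i_0, i_1, ..., i_n:
    # P(X_0 = i_0, ..., X_n = i_n) = p^(0)_{i_0} * p_{i_0 i_1} * ... * p_{i_{n-1} i_n}.
    path = [0, 1, 1, 0]          # an example path R, F, F, R
    prob = p0[path[0]]
    for a, b in zip(path[:-1], path[1:]):
        prob *= P[a, b]

    print(p_n, prob)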

Exercise 3.3
In Exercise 3.2, suppose p_R^{(0)} = 0.7 and p_F^{(0)} = 0.3.

1. Find p(0) , p(1) and p(2) .


2. Compute P (X0 = R, X1 = F ).
3. Compute P (X0 = R, X1 = F, X2 = F, X3 = R, X4 = F ).

3.1.5 Duration of stay in a state


Remember that a Markov chain is memoryless in the sense that its future is independent of its past, con-
ditional on its current state (or the most recent information that we have about its location). Therefore, if
the chain is currently in state i, the further amount of time that it will remain there does not depend on
how long it has already been in state i.

For any Markov chain the duration of stay in state i satisfies

    P(stay in state i for n steps | chain is currently in state i)
        = P(stay in i for a further n − 1 steps, then leave i)
        = p_ii^{n−1} (1 − p_ii).

Therefore, the duration of stay in state i has a geometric distribution with parameter (1 − p_ii), from which
we can deduce that the expected amount of time that the process remains in state i is 1/(1 − p_ii).

Where will the Markov chain go when it leaves i?

For j ≠ i,

    P(MC moves to j | MC leaves state i) = P(X_{n+1} = j | X_n = i, X_{n+1} ≠ i)
        = P(X_{n+1} = j, X_{n+1} ≠ i | X_n = i) / P(X_{n+1} ≠ i | X_n = i)
        = P(X_{n+1} = j | X_n = i) / P(X_{n+1} ≠ i | X_n = i)
        = p_ij / (1 − p_ii).

This answer makes sense: the probability is proportional to p_ij and is scaled so that

    \sum_{j ≠ i} p_ij / (1 − p_ii) = 1.

Summary:
The duration of stay of the Markov chain in state i has a geometric distribution with expected value
1/(1 − p_ii) steps.

When the chain moves from state i, it goes to state j (≠ i) with probability p_ij / (1 − p_ii).
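A quick simulation sketch of the first of these facts (Python/NumPy; the value of p_ii is invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    p_ii = 0.7            # hypothetical probability of remaining in state i at each step
    n_runs = 10_000

    durations = []
    for _ in range(n_runs):
        steps = 1
        # keep 'staying' in state i with probability p_ii at each transition
        while rng.random() < p_ii:
            steps += 1
        durations.append(steps)

    print(np.mean(durations))    # simulated mean duration of stay in state i
    print(1 / (1 - p_ii))        # theoretical mean 1/(1 - p_ii)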

3.2 Classification of Markov chains


Suppose we wish to study a particular Markov chain, summarised by its transition matrix P and an initial
distribution p(0) . We may be interested in the long-run probabilistic behaviour of the chain or in the
probability that some future event of interest happens.

Even if nobody has studied this particular Markov chain before, we are able to use some general results
on the behaviour of Markov chains to help us. In order to use this theory we need to classify each state
of the chain as being one of a number of types, and hence classify the type of Markov chain that we have.

The first step in classifying Markov chains is to split the state space into non-overlapping groups (classes)
of states. We will see later that states in the same class are of the same type. This simplifies the problem of
classifying all the states in the chain because we only need to classify one state in each class.

3.2.1 Irreducible classes of intercommunicating states

Definition: Communicating and intercommunicating states

State i communicates with state j (we write i → j) if

    p_ij^{(n)} > 0 for some n ≥ 0,

that is, starting from state i, it is possible that the chain will eventually enter state j.

States i and j intercommunicate (we write i ↔ j) if i → j and j → i.

Exercise 3.5
A Markov chain has the following state-space and transition matrix.
 
 1/2 1/2 0 0 
 
 1 0 0 0
 

S = {1, 2, 3, 4}, P = 
.

 0 1/2 1/3 1/6 
 
 
0 0 0 1

Draw a diagram to summarise the possible moves which can be made in one step. Which states
intercommunicate?

Properties of ↔
(i) i ↔ i for all states i.

(ii) if i ↔ j then j ↔ i.
(iii) if i ↔ j and j ↔ k then i ↔ k.
(i)+(ii)+(iii) mean that ↔ is an equivalence relation on S.

Definition: Irreducible classes and irreducible chains


We can partition the state space S into non-overlapping classes of intercommunicating states.
Within a particular class, every pair of states intercommunicate; states in different classes do not
intercommunicate.

Each class is called an irreducible class of states.

If there is only one irreducible class, then the Markov chain is said to be irreducible.

Example 3.6
For the Markov chain with state space S = {0, 1, 2} and transition matrix
 
    P = [ 1/4  1/4  1/2 ]
        [  0    1    0  ]
        [ 1/2   0   1/2 ]

the state space can be partitioned into the irreducible classes {0, 2} and {1}. In particular, the
Markov chain is not irreducible because there are two irreducible classes.
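Numerically, the irreducible classes can be found by computing which states are reachable from which (i → j) and grouping mutually reachable states. A minimal sketch in Python/NumPy (the helper function below is hypothetical, not part of the notes):

    import numpy as np

    def irreducible_classes(P):
        """Partition {0, ..., k-1} into classes of intercommunicating states."""
        k = P.shape[0]
        reach = (P > 0) | np.eye(k, dtype=bool)   # one-step reachability, plus i -> i
        # Transitive closure (Warshall): reach[i, j] True iff i -> j in some number of steps.
        for m in range(k):
            reach |= reach[:, [m]] & reach[[m], :]
        inter = reach & reach.T                    # i <-> j
        classes, seen = [], set()
        for i in range(k):
            if i not in seen:
                cls = {j for j in range(k) if inter[i, j]}
                classes.append(cls)
                seen |= cls
        return classes

    # The chain of Example 3.6:
    P = np.array([[1/4, 1/4, 1/2],
                  [0,   1,   0  ],
                  [1/2, 0,   1/2]])
    print(irreducible_classes(P))   # expected: [{0, 2}, {1}]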

Exercise 3.6
Find the irreducible classes for each of the Markov chains given below.
 
(a) S = {0, 1, 2},
        P = [  1    0    0  ]
            [ 1/2   0   1/2 ]
            [  0    1    0  ]

(b) S = {0, 1, 2, 3},
        P = [  0    0    0    1  ]
            [  0    0    0    1  ]
            [ 1/2  1/2   0    0  ]
            [  0    0    1    0  ]

3.2.2 Recurrence and transience


Let {Xn } be a Markov chain with state space S and let i ∈ S. Let
fi = P(ever return to i | X0 = i) = P(Xn = i for some n ≥ 1 | X0 = i).

Definition: Recurrence and transience

If fi = 1 then i is recurrent (eventual return to state i is certain).


If fi < 1 then i is transient (eventual return to state i is uncertain).

Exercise 3.7
1. For the Markov chain defined in Exercise 3.6(a), deduce whether fi = 1 or fi < 1 for i = 0, 1, 2,
and hence classify each state as either recurrent or transient.

2. For the Markov chain defined in Exercise 3.6(b), deduce whether fi = 1 or fi < 1 for i = 0, 1, 2, 3,
and hence classify each state as either recurrent or transient.

Studying recurrence and transience enables us to ask further interesting questions of our Markov chain, for
example, how many times will a Markov chain hit state i, given it starts in state i?

(i) If i is transient
    Let N be the number of hits on i (including the hit from the fact that X_0 = i). Then

        P(N = n | X_0 = i) = P(return n − 1 times to i, then never return | X_0 = i)
                           = f_i^{n−1} (1 − f_i).

    So, given X_0 = i, N has a geometric distribution with parameter 1 − f_i. So N is finite with probability
    1 and

        E(N | X_0 = i) = 1/(1 − f_i)   (< ∞).

(ii) If i is recurrent
With probability 1 the chain will eventually return to state i.
By time homogeneity and the Markov property, the chain ‘starts afresh’ on return to the initial state,
so that state i will eventually be visited again (with probability 1).
Repeating the argument shows that the chain will return to i infinitely often (with probability 1).
We also have
E(N | X0 = i) is infinite.

Another idea connected with recurrence and transience is that of first passage times. This describes the time
it takes for a Markov chain to return to state i, given that it started there.

Definition: First passage time

Let
Tii = min{n ≥ 1 : Xn = i | X0 = i}.
Tii is the first passage time from state i to itself, that is, the number of steps until the chain first
returns to state i given that it starts in state i.

If the chain never returns to state i we set Tii = ∞.

Note that:
fi = P(ever return to i | X0 = i) = P(Tii < ∞)
and
1 − fi = P(never return to i | X0 = i) = P(Tii = ∞).
Connecting this with the idea of transience and recurrence yields:
• if i is recurrent then P(Tii < ∞) = 1, that is, Tii is finite with probability 1.
• if i is transient then P(Tii < ∞) < 1, or equivalently P(Tii = ∞) = 1 − P(Tii < ∞) > 0, that is, Tii is
infinite with positive probability. (Therefore, µi = E[Tii ] is infinite.)

                                              | i recurrent     | i transient

    P(ever return to i | X_0 = i)             | f_i = 1         | f_i < 1

    P(never return to i | X_0 = i)            | 1 − f_i = 0     | 1 − f_i > 0

    (number of hits on i | X_0 = i)           | ∞ (w.p. 1)      | < ∞ (w.p. 1)

    E[number of hits on i | X_0 = i]          | ∞               | 1/(1 − f_i) < ∞

    P(T_ii < ∞)                               | f_i = 1         | f_i < 1

    P(T_ii = ∞)                               | 1 − f_i = 0     | 1 − f_i > 0

3.2.3 Positive recurrence and null recurrence
In fact, there are two types of recurrent state. Recall that the first passage time (also called the recurrence
time) Tii of a state i is the number of steps until the chain first returns to state i given that it starts in state i.

Let µi = E(Tii ) be the mean recurrence time of state i. We have already seen that if i is transient then
µi = ∞. We have not computed µi when i is recurrent. In fact, the value of µi when i is recurrent enables
us to distinguish between two types of recurrence: positive recurrence and null recurrence.

Definition: Positive and null recurrence


A recurrent state i is

positive recurrent if µi < ∞ (that is, µi is finite),

null recurrent if µi = ∞ (that is, µi is infinite).

At first glance it may seem strange that a state i can be null recurrent: return to state i is certain, but the
expected time (number of steps) to return to i is infinite. In other words, the random variable Tii is finite
with probability 1, but its mean, E(Tii ), is infinite. Such states do exist and we will see examples later in
the course.

3.2.4 Periodicity

Definition: Periodicity
The period of a state i is the greatest common divisor of the set of integers n ≥ 1 such that p_ii^{(n)} > 0,
that is,

    d_i = gcd{ n ≥ 1 : p_ii^{(n)} > 0 }.

If d_i = 1 then state i is aperiodic.

If p_ii^{(n)} = 0 for all n ≥ 1 (i.e. return to state i is impossible) then, by convention, state i is also said
to be aperiodic.
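For a finite chain the period can be computed straight from this definition by taking a gcd over return times with positive probability. A minimal sketch (Python/NumPy; the cutoff n_max and the example chain are assumptions made for illustration):

    import numpy as np
    from math import gcd

    def period(P, i, n_max=50):
        """gcd of { n >= 1 : (P^n)[i, i] > 0 }, checking n up to n_max (0 if no return seen)."""
        d = 0
        P_n = np.eye(P.shape[0])
        for n in range(1, n_max + 1):
            P_n = P_n @ P
            if P_n[i, i] > 1e-12:
                d = gcd(d, n)
        return d

    # Two-state chain that alternates deterministically: both states have period 2.
    P = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    print(period(P, 0), period(P, 1))   # 2 2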

3.2.5 Class properties

Definition: Class property

A class property is a property which is shared by all the states in an irreducible class. If it
can be shown that one of the states in a class has a certain class property then all the states in the
irreducible class must also have that same property.

I give sketch proofs for some of the following results relating to class properties. The other proofs can be
found in any good textbook on Markov processes.

Result: Recurrence/transience is a class property

A pair of intercommunicating states are either both recurrent or both transient. That is, recurrence/
transience is a class property. It is not possible to have a mixture of recurrent and transient states
in an irreducible class.

Proof: Look in ‘Introduction to Probability Models’ (Sheldon Ross).

Result
Null recurrence is a class property.

Proof:
Recall that µi = E(Tii ) is the expected first passage time or the mean recurrence time of state i, and that
    µ_i is  finite     if i is positive recurrent,
            infinite   if i is null recurrent or transient.

It can be shown (proof omitted) that

    i aperiodic            ⇒  p_ii^{(n)} → 1/µ_i as n → ∞,
    i periodic, period d   ⇒  p_ii^{(nd)} → d/µ_i as n → ∞ (and p_ii^{(m)} = 0 if m is not a multiple of d).

So, as n → ∞:

    • if i is aperiodic, positive recurrent then p_ii^{(n)} → 1/µ_i (positive, finite),
    • if i is periodic, positive recurrent then p_ii^{(nd)} → d/µ_i (positive, finite),
    • if i is null recurrent then p_ii^{(n)} → 0,
    • if i is transient then (proof omitted) \sum_{n=0}^{∞} p_ii^{(n)} < ∞, and p_ii^{(n)} → 0.
Now let i ↔ j, with p_ij^{(n)} > 0 and p_ji^{(m)} > 0. Suppose that i is null recurrent. Then

    p_ii^{(m+k+n)} ≥ p_ij^{(n)} p_jj^{(k)} p_ji^{(m)}.

As k → ∞,

    p_ii^{(m+k+n)} → 0

(because i is null recurrent), and so

    p_jj^{(k)} → 0

(because p_ij^{(n)} > 0 and p_ji^{(m)} > 0). So j is either transient or null recurrent. But i is recurrent and i ↔ j,
so j must be recurrent as well (because recurrence/transience is a class property). Hence j must be null
recurrent.

Result
Positive recurrence is a class property.

Proof:
Suppose that i ↔ j and that i is positive recurrent. Then j must be recurrent (because recurrence is a
class property). If j is null recurrent then i is also null recurrent. This is a contradiction. Hence j must be
positive recurrent.

Result
Intercommunicating states have the same period. That is, periodicity is a class property.

Proof:
Let i and j be states with i ↔ j. Let d_i be the period of i and d_j be the period of j. Let n and m be such
that p_ij^{(n)} > 0 and p_ji^{(m)} > 0. Then

    p_jj^{(n+m)} ≥ p_ji^{(m)} p_ij^{(n)} > 0,

because there is at least one way of returning to j in n + m steps: j to i in m steps and then i to j in a
further n steps. Therefore,

    d_j | n + m   (d_j is a divisor of n + m).

Let k be such that p_ii^{(k)} > 0. Then

    p_jj^{(n+k+m)} ≥ p_ji^{(m)} p_ii^{(k)} p_ij^{(n)} > 0,

so

    d_j | n + k + m   (d_j is a divisor of n + k + m).

These two facts together imply that

    d_j | k   (d_j is a divisor of k).

That is, d_j is a factor of any k in {k ≥ 1 : p_ii^{(k)} > 0}.

However, by definition d_i is the greatest common divisor of such k. Therefore, d_j ≤ d_i. Reversing the roles
of i and j gives d_i ≤ d_j. Hence d_i = d_j.

Summary

The following are class properties:


• Null recurrence;
• Positive recurrence;
• Transience;

• Periodicity.

Exercise 3.8
Find and classify the irreducible classes of these Markov chains.
(a) S = {1, 2, 3, 4},
        P = [ 1/2  1/2   0    0  ]
            [  1    0    0    0  ]
            [  0   1/2  1/3  1/6 ]
            [  0    0    0    1  ]

(b) S = {0, 1, 2},
        P = [  1    0    0  ]
            [ 1/2   0   1/2 ]
            [  0    1    0  ]

3.2.6 Further properties of discrete time Markov chains

Result
A finite irreducible Markov chain cannot be null recurrent.
Proof: Recall that p_j^{(n)} = P(X_n = j). Suppose that the chain is null recurrent. It can be shown that
p_j^{(n)} → 0 as n → ∞ for all j ∈ S.
We know that

    \sum_{j ∈ S} p_j^{(n)} = 1 for all n,

so

    lim_{n→∞} \sum_{j ∈ S} p_j^{(n)} = 1.

But S is finite, so if the chain is null recurrent

    lim_{n→∞} \sum_{j ∈ S} p_j^{(n)} = \sum_{j ∈ S} lim_{n→∞} p_j^{(n)} = 0.

This is a contradiction, so the Markov chain cannot be null recurrent.

Result
A finite Markov chain cannot contain any null recurrent states.

Proof:
Suppose that the chain contains a state i which is null recurrent. Then, since null recurrence is a class
property, state i is in some irreducible finite closed class of null recurrent states. This is not possible (see
above).

Result: Finite Markov chains
It is not possible for all states in a finite state space Markov chain to be transient.

Note that this implies that all states in a finite irreducible Markov chain are recurrent (because
recurrence is a class property). That is, finite and irreducible ⇒ recurrent Markov chain.

Proof:
Suppose all states are transient and S = {0, 1, . . . , M }. A transient state will be visited only a finite number
of times. Therefore for each state i there exists a time Ti after which i will never be visited again. Therefore,
after time T = max{T0 , . . . , TM } no state will be visited again. BUT the Markov chain must be in some
state. Therefore we have a contradiction. Therefore, at least one state must be recurrent.

3.2.7 Closed classes and absorption

Definition: Closed classes and absorbing states

A class C of intercommunicating states is closed if p_ij = 0 for all i ∈ C and j ∉ C, that is, once the
chain gets into C, it never leaves C.

If a single state forms a closed class then it is called an absorbing state.

Exercise 3.9
Find all closed classes and absorbing states for the following Markov chains.
(a) S = {1, 2, 3, 4},
        P = [ 1/2  1/2   0    0  ]
            [  1    0    0    0  ]
            [  0   1/2  1/3  1/6 ]
            [  0    0    0    1  ]

(b) S = {0, 1, 2},
        P = [  1    0    0  ]
            [ 1/2   0   1/2 ]
            [  0    1    0  ]

Note the following important observations:


(i) A finite closed class must be recurrent: once entered, the class behaves like a finite irreducible Markov
chain.

(ii) An absorbing state is positive recurrent: its mean recurrence time is 1.

Result
An irreducible class C of recurrent states is closed.

Proof:
Suppose that C is not closed. Then there exist i ∈ C and j ∉ C such that i → j. But j ↛ i, otherwise
i and j would intercommunicate and j would be in C. Thus, there is positive probability that the chain
leaves i and never returns to i. This means that i is transient. This is a contradiction, as i ∈ C and C is
recurrent.

The next three exercises ask you to find quantities relating to closed classes and absorption. Try to use first
step decomposition to solve these questions (see section 1.4.4).
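For absorption probabilities, first step decomposition leads to a small linear system that can also be solved numerically. A minimal sketch in Python/NumPy, using an invented 4-state chain (not one of the exercises), where states 0 and 3 are absorbing and h_i = P(absorbed in state 3 | X_0 = i):

    import numpy as np

    # Hypothetical chain on S = {0, 1, 2, 3}; states 0 and 3 are absorbing.
    P = np.array([[1,   0,   0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/2, 0,   1/2],
                  [0,   0,   0,   1  ]])

    # First step decomposition: h_i = sum_j p_ij h_j, with h_0 = 0 and h_3 = 1,
    # i.e. (I - P_TT) h = (probability of jumping straight into {3}) for transient states.
    transient = [1, 2]
    A = np.eye(len(transient)) - P[np.ix_(transient, transient)]
    b = P[np.ix_(transient, [3])].sum(axis=1)
    h = np.linalg.solve(A, b)
    print(dict(zip(transient, h)))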

Exercise 3.10
Suppose that X is a discrete time Markov chain with state-space S = {0, 1, 2, 3, 4} and transition
matrix P given by

    P = [  1    0    0    0    0  ]
        [ 1/2  1/4  1/4   0    0  ]
        [  0   1/2   0   1/2   0  ]
        [  0    0    0   1/2  1/2 ]
        [  0    0    0   1/2  1/2 ]

(a) Suppose that X0 = 1. Find the probability that the chain is eventually absorbed in C3 = {3, 4}.

(b) If p(0) = (0, 1/3, 2/3, 0, 0), calculate the probability that the chain is absorbed in C3 .
(c) If the chain is absorbed in C3 = {3, 4}, what is the probability that X0 = 2 ?

Exercise 3.11
Now suppose that X is a discrete time Markov chain with state-space S = {0, 1, 2, 3} and transition
matrix P given by

    P = [ 1/3  2/3   0    0  ]
        [  1    0    0    0  ]
        [ 1/4  1/4  1/4  1/4 ]
        [  0    0    1    0  ]

(a) Suppose that X0 = 2. Find the expected time until absorption into {0, 1}.
(b) Find the expected time to absorption in one or other of the two closed classes, starting from
state 1.

Exercise 3.12
Now suppose that X is a discrete time Markov chain with state-space S = {0, 1, 2, 3} and transition
matrix P given by

    P = [  0   1/2  1/2   0  ]
        [ 1/3   0    0   2/3 ]
        [ 1/2  1/2   0    0  ]
        [  0    0    1    0  ]

Find the expected first passage time from state 0 to state 3.

3.2.8 Decomposition of the states of a Markov chain


Putting the previous results together shows that the state space S of a Markov chain can be decomposed as

S = T ∪ Cm1 ∪ Cm2 ∪ · · · ,

where
• T is a set of transient states;

• Cm1 , Cm2 , . . . are irreducible closed classes of recurrent states.

Each Cmi is either null recurrent or positive recurrent, and all states in a particular Cmi have the same
period (different Cmi ’s can have different periods). The following observations apply:

(a) If X0 ∈ Cmi then the chain stays in Cmi forever.


(b) If X0 ∈ T then either:
(i) the chain always stays in T (only possible if the set T of states is infinite); or
(ii) the chain is eventually absorbed in one of the Cmi ’s (where it stays forever).

(c) If S is finite then it is impossible for the chain to remain forever in the (finite) set T of transient states.
(d) If S is finite then at least one state must be visited infinitely often and so there must be at least one
recurrent state.

We also know that if S is finite then there are no null recurrent states. The consequences of this are:
(a) If a Markov chain has a finite state space then there must be at least one positive recurrent state.
(b) If a Markov chain has a finite state space and is irreducible, then it must be positive recurrent.
(c) A finite closed irreducible class must be positive recurrent.

Exercise 3.13
Find and classify the irreducible classes of the following Markov chains.
(a) S = {0, 1, 2, 3, 4},
        P = [ 1/2  1/2   0    0    0  ]
            [ 1/2  1/2   0    0    0  ]
            [  0    0    0    1    0  ]
            [  0    0    1    0    0  ]
            [ 1/4   0   1/4   0   1/2 ]

(b) S = {0, 1, 2, . . .},
        P = [ 1/2  1/2   0    0    0   ··· ]
            [ 1/3   0   2/3   0    0   ··· ]
            [ 1/4   0    0   3/4   0   ··· ]
            [  ⋮    ⋮    ⋮    ⋮    ⋮        ]

    that is,

        p_i0 = 1/(i+2) and p_{i,i+1} = (i+1)/(i+2), for i = 0, 1, 2, . . . .

(c) S = {1, 2, 3, 4},
        P = [  0   1/2  1/2   0  ]
            [ 1/2   0    0   1/2 ]
            [  1    0    0    0  ]
            [  0    1    0    0  ]

(d) For p + q = 1, 0 < p < 1,

    S = {0, 1, 2, . . .},
        P = [ q  p  0  0  0  ··· ]
            [ 0  q  p  0  0  ··· ]
            [ 0  0  q  p  0  ··· ]
            [ ⋮  ⋮  ⋮  ⋮  ⋮      ]

    that is,

        p_ii = q and p_{i,i+1} = p, for i = 0, 1, 2, . . . .

3.3 Limiting behaviour
This is the crux of the first half of the course: understanding a Markov chain’s behaviour in the long run
(limiting behaviour).

Our goal here is to find whether the distribution of states, p(n) , settles down (i.e. converges) as n → ∞.
If it does, what is this limit?

We now have enough tools at our disposal. It will turn out that to fully determine its long term behaviour
we must classify the Markov chain in terms of:
(a) Irreducible classes;
(b) Transience and positive/null recurrence;
(c) Periodicity;

3.3.1 Invariant distributions


We will start with a somewhat simpler question than that of limiting behaviour: are there particular initial
distributions for a Markov chain, such that when the Markov chain starts in any of these distributions, we
can determine the limit of p(n) as n → ∞?

The answer to this question is ‘yes’, and these distributions are known as ‘invariant’.

Definition: Invariant distribution

The probability row vector π is an invariant distribution of a discrete-time process {Xt , t ∈ T } if

π = π P.

If π is an invariant distribution, and the initial distribution of the Markov chain is π, why does this imply
that we can study its long term behaviour?

The idea is that if the chain starts in π (or equivalently, we find at some later time that the distribution
of states is π), then this tells us that the distribution of states is π forever: we’ve uncovered the long term
behaviour of the chain in this specific case.

This works since, suppose that an invariant distribution π exists and that p(0) = π, then

p(1) = p(0) P = π P = π;
p(2) = p(1) P = π P = π;
p(3) = p(2) P = π P = π; etc.

In fact p(n) = π, for n = 1, 2, . . . and we have found the limit of p(n) as n → ∞, as required!

If our initial distribution is π, with π = πP , then the distribution of the states at any given future time
n is also π. This is why it is called an invariant distribution.

An invariant distribution always exists if the state space S is finite, but it need not be unique. If
S is infinite then an invariant distribution need not exist.

Assuming that the transition matrix P is a k × k matrix, and the state space is S = {1, 2, ..., k}, solving
π = πP is a matter of solving a set of k simultaneous equations, each of the form

πi = π1 p1i + π2 p2i + ... + πk pki .

If you try this you will find that one equation is redundant (try it!): we are therefore trying to solve for k
unknowns (π_1, ..., π_k) using only (k − 1) equations, which cannot pin down a unique solution. However, we have
the advantage that since π is itself a probability distribution, we must have π_1 + ... + π_k = 1. This added
requirement means that we can now solve for π_1, ..., π_k.
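Numerically, one convenient way to impose the normalisation is to replace one of the redundant equations by π_1 + ... + π_k = 1. A minimal sketch in Python/NumPy, with an invented 3-state chain (the helper assumes the solution is unique):

    import numpy as np

    def invariant_distribution(P):
        """Solve pi = pi P together with sum(pi) = 1 (assumes a unique solution)."""
        k = P.shape[0]
        # pi (P - I) = 0  <=>  (P - I)^T pi^T = 0; replace the last equation by sum(pi) = 1.
        A = (P - np.eye(k)).T
        A[-1, :] = 1.0
        b = np.zeros(k)
        b[-1] = 1.0
        return np.linalg.solve(A, b)

    # Hypothetical 3-state chain (rows sum to 1).
    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])
    pi = invariant_distribution(P)
    print(pi, pi @ P)   # the second print returns pi again, confirming pi = pi P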

Exercise 3.14
Find an invariant distribution for the Markov chain in Exercise 3.2.

3.3.2 Equilibrium distributions


Simply looking at invariant distributions in order to determine a Markov chain’s long term behaviour is
very restrictive. We want a far more general idea of long term behaviour. Can we determine whether p(n)
converges as n → ∞, regardless of the initial distribution? If this limit exists, what is it?

Example 3.15
Recall the weather example (Exercise 3.2), with state space S = { R, F } and transition matrix P ,
 
    P = [ 0.6  0.4 ]
        [ 0.5  0.5 ]

We examine how P^{(n)} = P^n, the n-step transition matrix, changes as n increases:

    P^2 = [ 0.56  0.44 ]    P^3 = [ 0.556  0.444 ]    P^6 = P^3 · P^3 = [ 0.555556  0.444444 ]
          [ 0.55  0.45 ]          [ 0.555  0.445 ]                      [ 0.555555  0.444445 ]

Looking at different initial distributions for this Markov chain and what happens in the ‘long run’
from these starting distributions:

    Initial distribution | Distribution of X_1 | Distribution of X_2 | ... | Distribution of X_6
    (0, 1)               | (0.5, 0.5)          | (0.55, 0.45)        | ... | (0.55556, 0.44444)
    (1, 0)               | (0.6, 0.4)          | (0.56, 0.44)        | ... | (0.55556, 0.44444)
    (0.7, 0.3)           | (0.57, 0.43)        | (0.557, 0.443)      | ... | (0.55556, 0.44444)
    (0.5, 0.5)           | (0.55, 0.45)        | (0.555, 0.445)      | ... | (0.55556, 0.44444)

Notice that as n → ∞, P^{(n)} is tending to a limiting matrix whose rows are identical to each other, and
p^{(n)} seems to be tending to a limiting probability distribution which is the same regardless of the initial
distribution. This is a very important phenomenon, which we describe in the next definition.

Definition: Equilibrium distribution

A probability row vector π = {πj , j ∈ S} is an equilibrium distribution for the discrete-time Markov
chain {Xn } if

p(n) −→ π as n −→ ∞,
independently of the initial distribution p(0) (i.e. for any initial distribution, the long-run distribution
of Xn as n → ∞ is the same).

When an equilibrium distribution exists,


    π_j = lim_{n→∞} p_ij^{(n)},

for all i, j ∈ S.

Notice what this says: π is an equilibrium distribution if p(n) → π as n → ∞ regardless of the start
distribution, p(0) .

It is important to note that there cannot be two or more of these limits: either such a limit does not exist,
or exactly one exists. Think about this carefully!

3.3.3 Invariant and equilibrium distributions


We know that p(n) = p(n−1) P . If this chain has an equilibrium distribution then p(n) −→ π as n → ∞, and
π is the equilibrium distribution. Therefore, letting n → ∞ we have π = πP, so that π is also an invariant
distribution.

Though the concepts of an invariant distribution and an equilibrium distribution are therefore closely related,
they are NOT the same thing: an equilibrium distribution must also be an invariant distribution, BUT an
invariant distribution is not necessarily an equilibrium distribution.

    EQUILIBRIUM ⇒ INVARIANT,   but   INVARIANT ⇏ EQUILIBRIUM.

Invariant distribution π:
- If the chain starts in (or equivalently, gets to) π, the distribution of states at all further times is also
π.

Equilibrium distribution π:
- If the chain is run for long enough (i.e. forever) its probabilistic behaviour settles down to that of π.
- This is regardless of the state in which the chain starts.

Notice:

• No invariant distribution ⇒ no equilibrium distribution.

• Exactly 1 invariant distribution ⇒ there may or may not be an equilibrium distribution.


• Two or more invariant distributions ⇒ no equilibrium distribution.

Some other important things to note:

• Not all Markov chains have equilibrium distributions.


• If a Markov chain has an equilibrium distribution, then it is unique (cannot have more than one
equilibrium distribution).
• If a Markov chain has an equilibrium distribution, then it also has an invariant distribution and these
distributions are identical.
• If a Markov chain does not have an equilibrium distribution, then it may or may not have an invariant
distribution.
• A Markov chain can have any number of invariant distributions (including no invariant distribution).

• It is possible for a Markov chain to have more than one invariant distribution but not have an equilib-
rium distribution.
• It is possible for a Markov chain to have neither an invariant distribution nor an equilibrium distribu-
tion.

3.3.4 Ergodic Markov chains and the equilibrium distribution

Definition: Ergodic state

An ergodic state is one that is both positive recurrent and aperiodic.

Result: Main limit theorem

Let {Xn } be an irreducible ergodic Markov chain. Without loss of generality we assume that S =
{0, 1, 2, . . .}. Then

(a) there is a unique probability row vector π = (π0 , π1 , π2 , . . .) satisfying

π = πP ,

(b) p_ij^{(n)} → π_j as n → ∞, for all i, j ∈ S,

(c) π satisfies

        π_j = 1/µ_j, for all j ∈ S,

    where µ_j is the mean recurrence time of state j.

Note that this theorem tells us (among other things) that


an irreducible, ergodic Markov chain has an equilibrium distribution.

Some remarks on the Main Limit Theorem


1. Proof omitted.
2. Part (b) of the result shows that the n-step transition matrix P^{(n)} tends to

        [ π_0  π_1  π_2  ··· ]
        [ π_0  π_1  π_2  ··· ]
        [ π_0  π_1  π_2  ··· ]
        [  ⋮    ⋮    ⋮        ]

   as n → ∞.
3. Part (b) of the result implies that π is the equilibrium distribution of the chain (since p(n) = p(0) P (n)
tends to π as n → ∞).
4. The following interpretations of π are important.
(i) Observation 3 above implies that πj is the limiting probability that the chain is in state j at time
n.
(ii) It can be shown that πj is also equal to the long-run proportion of time that the chain spends in
state j.
5. Note that part (a) of the result can be strengthened to:

An irreducible aperiodic Markov chain has an invariant distribution if and only if the chain is
positive recurrent.

This implies that π is unique and satisfies πj = 1/µj , j ∈ S.
This provides an alternative way to show that a Markov chain is positive recurrent: if the chain is
irreducible, aperiodic and has an invariant distribution (that is, there exists a π such that π = πP) then
it must be positive recurrent.
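A small numerical sketch of what the theorem asserts (Python/NumPy, with a hypothetical irreducible, aperiodic finite chain, which is therefore ergodic): the rows of P^n all converge to π, π is invariant, and the mean recurrence times can be read off as µ_j = 1/π_j.

    import numpy as np

    # Hypothetical irreducible, aperiodic chain on a finite state space.
    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.3, 0.3, 0.4]])

    P_n = np.linalg.matrix_power(P, 50)
    pi = P_n[0]                   # any row of P^n approximates the equilibrium distribution

    print(P_n)                    # all rows are (nearly) identical
    print(pi @ P)                 # pi is invariant: pi P = pi (up to rounding)
    print(1 / pi)                 # mean recurrence times mu_j = 1 / pi_j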

Exercise 3.16
Determine whether the Markov chains with the following features have an equilibrium distribution
and/ or invariant distribution. [HINT: are the following irreducible, ergodic Markov chains? How
many invariant distributions does each one have?]
 
(a) S = {R, F},   P = [ 0.6  0.4 ]
                      [ 0.5  0.5 ]

(b) S = {0, 1},   P = [ 1  0 ]
                      [ 0  1 ]

3.3.5 Extending the Main Limit Theorem


Though the Main Limit Theorem (MLT) tells us that an equilibrium distribution exists for an irreducible
and ergodic Markov chain, we want to be able to deduce whether any discrete time Markov chain has an
equilibrium distribution. We will do this by relaxing each assumption of the Main Limit Theorem and assess
whether an equilibrium distribution is still guaranteed to exist.

Periodicity
The MLT assumed that the chain was ergodic, and therefore aperiodic. What happens if we relax this
assumption, so that we have an irreducible positive recurrent Markov chain {Xn , n = 0, 1, 2, . . .} with period
d > 1?

Let state j have mean recurrence time µj . We construct a new process {Yn , n = 0, 1, 2, . . .} with Yn = Xnd .
Then {Yn , n = 0, 1, 2, . . .} is an ergodic Markov chain and in this new chain, state j has mean recurrence
time µ_j/d. From the Main Limit Theorem it follows that

    P(Y_n = j | Y_0 = j) → d/µ_j as n → ∞,

that is,

    P(X_{nd} = j | X_0 = j) = p_jj^{(nd)} → d/µ_j as n → ∞.

However,

    p_jj^{(n)} = 0 for n ∉ {0, d, 2d, 3d, . . .}.

Hence {X_n} has no equilibrium distribution.
Hence {Xn } has no equilibrium distribution.

Positive recurrence
The MLT assumed that the chain was ergodic, and therefore positive recurrent. What happens if we relax

this assumption so that we have an irreducible aperiodic Markov chain {Xn , n = 0, 1, 2, . . .} with transient
or null-recurrent states only?

Suppose that a Markov chain consists only of transient and/ or null-recurrent states. Then all states have
µj = ∞. It follows from the proof that null recurrence is a class property that for all i, j

    p_ij^{(n)} → 1/µ_j = 0 as n → ∞,
that is, all limits exist but are zero. This limit is not a proper probability distribution so no equilibrium
distribution exists.

Irreducibility
The MLT assumed that the chain was irreducible. What happens if we relax this assumption so that the
Markov chain has more than one class? We can split this into two possibilities:

(a) Suppose that a Markov chain contains 2 or more closed classes, C_{m_1}, C_{m_2}, . . ..
    If, for example, class C_{m_1} is ergodic then for all j ∈ C_{m_1},

        p_ij^{(n)} → 1/µ_j as n → ∞,   if i ∈ C_{m_1},
        p_ij^{(n)} = 0 for all n,      if i ∈ C_{m_2}.

    Hence there is no equilibrium distribution (the limit depends on the initial state).
(b) Suppose that a Markov chain consists of a finite number of transient states and a single closed class
C. Then eventually the chain will be absorbed into C.

    - If C is ergodic then there is an equilibrium distribution, given for all states j by:

          π_j = 0        if j is transient,
          π_j = 1/µ_j    otherwise.

    - For a transient state j, µ_j = ∞, so we can write

          π_j = 1/µ_j   for all j.

      If C is not ergodic then this limit does not give a proper probability distribution, and so no
      equilibrium distribution exists.

3.3.6 When does an equilibrium distribution exist?


To summarise these results:
(a) More than 1 closed class ⇒ no equilibrium distribution exists (the long-term behaviour of the chain
depends on p(0) ).

(b) No closed classes (i.e. all states are transient) ⇒ no equilibrium distribution exists.
(c) If there exists exactly one closed class C, is it certain that the Markov chain will eventually be absorbed
into C?

– If “no” then no equilibrium distribution exists (the long-term behaviour of the chain depends on
p(0) ).
– If “yes” then an equilibrium distribution exists if and only if C is ergodic.

An equilibrium distribution exists if and only if there is an ergodic class C and the
Markov chain is certain to be absorbed into C eventually, wherever it starts.

Exercise 3.17
Three cards, labelled A,B and C, are placed in a row and their order changed as follows. A random
choice is made of either the left hand card or the right hand card, either card having probability 1/2
of being chosen, and this card is then placed between the other two. This process is repeated, the
successive random choices being independent.

(a) Find the transition matrix for the six-state Markov chain of successive orders X1 , X2 , . . . of the
cards.
(b) Is the chain irreducible, aperiodic?
(c) Find the two-step transition matrix.

(d) If the initial order is ABC, find approximately the probability that after 2n changes, where n
is large, the order is ABC.

Exercise 3.18

For each of these Markov chains state, with a reason, whether an equilibrium distribution exists. If
it does exist then find it.
(a) S = {0, 1, 2, 3, 4},
        P = [ 1/2  1/2   0    0    0  ]
            [ 1/2  1/2   0    0    0  ]
            [  0    0    0    1    0  ]
            [  0    0    1    0    0  ]
            [ 1/4   0   1/4   0   1/2 ]

(b) S = {0, 1, 2, . . .},
        P = [ 1/2  1/2   0    0    0   ··· ]
            [ 1/3   0   2/3   0    0   ··· ]
            [ 1/4   0    0   3/4   0   ··· ]
            [  ⋮    ⋮    ⋮    ⋮    ⋮        ]

(c) S = {1, 2, 3, 4},
        P = [  0   1/2  1/2   0  ]
            [ 1/2   0    0   1/2 ]
            [  1    0    0    0  ]
            [  0    1    0    0  ]

(d) S = {0, 1, 2, 3},
        P = [ 1/3  2/3   0    0  ]
            [  1    0    0    0  ]
            [ 1/4  1/4  1/4  1/4 ]
            [  0    0    1    0  ]

(e) S = {0, 1, 2, . . .},
        P = [ q  p  0  0  0  ··· ]
            [ 0  q  p  0  0  ··· ]
            [ 0  0  q  p  0  ··· ]
            [ ⋮  ⋮  ⋮  ⋮  ⋮      ]

4 Continuous-time Markov processes

Notation

Discrete-time processes: {Xt , t ∈ T }, where T = {0, 1, 2, . . .}.

Continuous-time processes: {X(t), t ∈ T }, where T = R+ (i.e. t ≥ 0).

Continuous-time processes can change their value/state (‘jump’) at ANY instant of time.

Remember that in this course we only consider stochastic processes {X(t), t ≥ 0} with finite or count-
able state space S.

4.1 Continuous-time Markov chains


Definition: Continuous time Markov process

{X(t), t ≥ 0} is a continuous-time Markov process if, for all t ≥ 0, 0 ≤ t0 < t1 < · · · < tn < s and for
all states i, j, i0 , i1 , . . . , in ∈ S

P(X(s + t) = j | X(s) = i, X(tn ) = in , . . . , X(t0 ) = i0 ) = P(X(s + t) = j | X(s) = i),


that is, the continuous-time process {X(t), t ≥ 0} satisfies the Markov property.

As in the discrete time case, we only consider time homogeneous processes, which implies that:

P(X(s + t) = j | X(s) = i) = P(X(t) = j | X(0) = i)

for all s, t and all i, j ∈ S.

The difficulty with continuous time processes is that there is no ‘smallest’ unit of time. In the discrete time
case, we talked about 1-step transition probabilities and extended these to more general n−step transition
probabilities. Here, we have no such ‘minimum time unit’ and so our notation for the transition probabilities
must clearly show the amount of time taken to reach one state from another. We will use the notation:

pij (t) = P(X(t) = j | X(0) = i)

for transition probabilities, and store these in a transition matrix P(t) (note again here that we are being
careful with our unit of time). If S = {0, 1, 2, . . .} this will be of the form
 
    P(t) = [ p_00(t)  p_01(t)  p_02(t)  ··· ]
           [ p_10(t)  p_11(t)  p_12(t)  ··· ]
           [ p_20(t)  p_21(t)  p_22(t)  ··· ]
           [    ⋮        ⋮        ⋮         ]

The same ‘rules’ apply here as for the discrete time case: the elements of P(t) must satisfy, for all t ≥ 0,

    p_ij(t) ≥ 0   and   \sum_j p_ij(t) = 1.

In particular, when t = 0,

    p_ij(0) = P(X(0) = j | X(0) = i) = 1   if j = i,
                                       0   if j ≠ i,

so P(0) = I (the identity matrix). This makes sense - the process stays in the same state if we allow no time
to pass!

4.2 The importance of the exponential distribution and ‘little-o’ notation


The exponential distribution will prove to be fundamental for continuous time Markov processes, as it is
the only probability distribution to display the lack of memory property. Before going any further, we need
more notation (this will make calculations simpler).

4.2.1 Order notation: o(h)

Definition: An o(h) function

Let f (x) be a function. We define f to be o(h) (we say “f is order h”) if

    lim_{h→0} f(h)/h = 0,
that is,
as h → 0, f (h) → 0 faster than h does.

Example 4.1
• f(x) = x².

      f(h)/h = h²/h = h → 0 as h → 0.

  So x² is o(h).

• f(x) = √x.

      f(h)/h = √h/h = 1/√h → ∞ as h → 0.

  So √x is not o(h).

Example 4.2 Suppose that X ∼ exponential(λ). Then for h > 0,

    P(X ∈ (t, t + h] | X > t) = P(X ∈ (t, t + h], X > t) / P(X > t)
                              = P(X ∈ (t, t + h]) / P(X > t)
                              = [P(X > t) − P(X > t + h)] / P(X > t)
                              = [exp(−λt) − exp(−λ(t + h))] / exp(−λt)
                              = 1 − exp(−λh)                                            (4)
                              = 1 − (1 − λh + (λh)²/2! − (λh)³/3! + ···)
                              = λh − (λh)²/2! + (λh)³/3! − ···
                              = λh + o(h).

4.2.2 The lack-of-memory property of the exponential distribution

Definition: Memoryless distribution

A random variable X is memoryless if for all s, t ≥ 0,

P (X > s + t|X > t) = P (X > s)

Exercise 4.3
Show that if a random variable X is memoryless, then

P (X > s + t) = P (X > s)P (X > t)

It turns out that the exponential distribution is memoryless (in fact, it is the only memoryless distribution).
Example 4.2 demonstrates this clearly: we showed that if X ∼ exponential(λ), then, according to (4) above,

P(X ∈ (t, t + h] | X > t) = 1 − exp (−λh), for h > 0,

or equivalently,
P(X − t ∈ (0, h] | X > t) = 1 − exp(−λh), for h > 0. (5)
The right hand side of (5) is the distribution function of an exponential(λ) distribution. This implies that

{(X − t) | X > t} ∼ exponential(λ).

That is, if X ∼ exponential(λ) then {(X − t) | X > t} ∼ exponential(λ) too. Check back to the definition
of a memoryless distribution to see that the exponential distribution is therefore memoryless.

The identity in exercise 4.3 shows the impact of the memoryless property. Suppose that we are waiting for
an event to occur (e.g. a train arriving at Goodge Street station), the time we must wait being exponentially
distributed with parameter λ. If we wait s minutes without a train arriving, the remaining time until the
event occurs is still exponential(λ). That is, the fact that we have already waited s minutes without an event
is ‘forgotten’ and we essentially start our wait again.

Example 4.4

Question: Service times at a bank are exponentially and independently distributed with parameter
µ. One queue feeds two clerks. When you arrive both clerks are busy, but the queue is empty. You
will be next to be served; what is the probability that, of the three customers present, you will be
last to leave?

Answer: When the first customer leaves you are served. Your service time T1 has an exponential(µ)
distribution and by the lack-of-memory property of the exponential distribution so does the further
service time T2 of the other customer. T1 and T2 are independent. Therefore

    P(you are last to leave) = P(T_1 > T_2)
                             = ∫_0^∞ P(T_1 > T_2 | T_2 = t) f_{T_2}(t) dt
                             = ∫_0^∞ P(T_1 > t) f_{T_2}(t) dt
                             = ∫_0^∞ exp(−µt) µ exp(−µt) dt = µ/(2µ) = 1/2.

4.2.3 Other useful properties of the exponential distribution


It is helpful to note the following result before we proceed.

Result: Distribution of the minimum of exponential random variables

Let Y1 , ..., Yn be independent random variables such that Yi ∼ exponential(λi ). Then:


• The distribution of the minimum of these random variables is also exponential, with parameter
  (λ_1 + ... + λ_n):

      min{Y_1, ..., Y_n} ∼ exponential(λ_1 + ... + λ_n)                    (6)

• The probability that Y_k is the minimum of these random variables can be shown to be

      λ_k / (λ_1 + ... + λ_n)                                              (7)

Challenge
Prove these assertions about the minimum of exponential random variables!

4.3 Breaking down the definition of a continuous time Markov process
A continuous-time Markov chain can be described by the distribution of holding times for each state (i.e.
the distribution of the length of time the chain stays in a particular state), together with the jump chain
(i.e. when the continuous-time Markov chain does leave a state, to which other state does it go?).

4.3.1 Holding times


Remember that for a continuous time Markov process with discrete state-space, the chain can ‘jump’ to
another state at any point in time. We’ll first look at how long a chain stays in a state before moving.
It can be shown that if the chain starts in state i then it stays there for a random length of time, the holding
time, that is exponentially distributed with parameter, say, qi (i.e. with mean 1/qi ).

Why is the distribution of the holding time exponential?

Suppose that the process starts in state i, and let T (i) be the holding time in state i. We will show that
P (T (i) > u + s|T (i) > u) = P (T (i) > s) for u, s ≥ 0, and from this we can deduce that the distribution
of T (i) is memoryless (and hence exponential).

Notice that the event {T (i) > u} is identical to the event {X(t) = i for all 0 ≤ t ≤ u}, for any u ≥ 0.
Now:

P (T (i) > s + t|T (i) > s) =P (X(u) = i for all 0 ≤ u ≤ t + s | X(u) = i for all 0 ≤ u ≤ s)
↓ Using information from the conditioning
=P (X(u) = i for all s ≤ u ≤ t + s | X(u) = i for all 0 ≤ u ≤ s)
↓ Markov property
=P (X(u) = i for all s ≤ u ≤ t + s | X(s) = i)
↓ Time homogeneity
=P (X(u) = i for all 0 ≤ u ≤ t | X(0) = i)
↓ Equivalent statements
=P (T (i) > t)

So T (i) is memoryless, and therefore exponentially distributed. Supposing that we let qi denote the
parameter of this exponential distribution, then qi can also be thought of as the rate at which the chain
leaves state i.

4.3.2 The jump chain: rates of change


The sequence of states visited by a continuous-time Markov chain forms a discrete-time Markov chain called
the embedded jump chain.

In the previous section we defined the rate at which the chain leaves state i to be qi . Notice that this makes
no assumptions about the state which the chain goes to, simply that the chain moves from state i. To which
state does the process move when it leaves state i?

A simple way of thinking about this question is with the ‘alarm clock’ analogy. Suppose that you are cur-
rently in state i, and from here, you could potentially go to one of states j, k, l, .... Each of these states carries

with it an alarm clock, which will ring at an exponentially distributed time with parameter qij , qik , qil ... re-
spectively, independently of each other. For example, the time at which the alarm clock in state j will ring,
Tij , is exponential with parameter qij .

Whichever alarm clock rings first, the process will next move to that state and will do so immediately. Once
the chain moves to this state, we reset all alarm clocks again, one for each of the states, noting that there is
no alarm clock for the state that the chain is currently in (why?).

Also note that if I am currently in state i, the time at which the alarm clock in state j rings (i 6= j), is
Tij ∼ exponential(qij ). However, if I am in another state, say k (with k 6= j and k 6= i), then the time
at which the alarm clock in state j rings, is Tkj ∼ exponential(qkj ). Note the change in parameter- the
distribution for the time at which the alarm clock in state j rings changes depending on where the process
is currently.

Why is the distribution of the time at which an alarm clock rings exponential?

Suppose the time at which alarm clocks ring is not exponentially distributed. It can be shown that this
implies that the holding time in a state is not memoryless, and therefore not exponential. This is a
contradiction.

Therefore, the time at which the alarm clocks ring must be exponentially distributed.

Example 4.5
A continuous time Markov chain with three states, S = {1, 2, 3}, is currently in state 1.
[Diagram: the three states, with a green dot marking the process in state 1 and alarm clocks in states 2 and 3.]
An exponential alarm clock is placed in each of states 2 and 3, and whichever of these rings first, the process
will immediately go to that state. Notice that the times at which they can ring, T_12 and T_13, are independent
of one another.

It is clear that the holding time in state 1, T (1), can be expressed as:

T (1) = min{T12 , T13 }.

Now suppose that the process is in state 3. From state 3, the process can only go to state 2 next and
so only one alarm clock is set (i.e. there is no alarm clock in state 1), and the time that this alarm
clock rings is the time that the process moves to state 2.

Generalising from this example, since

    T(i) = min_{j ≠ i} T_ij

where such direct transitions are possible (see Example 4.5 above for a case where not all transitions are
possible), and

    T(i) ∼ exponential(q_i),
    T_ij ∼ exponential(q_ij),

it makes sense that the parameter q_i and the parameters {q_ij ; j ≠ i} are also related. But how?

Look back at the result in (6). This tells us exactly how the parameter q_i and the parameters
{q_i1, ..., q_{i,i−1}, q_{i,i+1}, ..., q_ik} are related:

    q_i = \sum_{j ≠ i} q_ij.                                               (8)

This also makes intuitive sense: the rate at which we leave state i is the sum of the rates at which we
enter other states.

Before moving on, note one more property of these exponential times. Suppose that the process is in state
i currently and can from there move to any of the states {1, ..., i − 1, i + 1, ..., k}. What if we conditioned
on the process going to state k next? Does this affect the holding time in state i? Is the holding time
in state i, under this assumption, exponential with parameter qi as per usual, or does the conditioning im-
ply that it is now exponential with parameter qik (as this is the time at which the alarm clock in state k rings)?

The answer is that the holding time in state i is still exponential with parameter qi , even if we condition on
going to a particular state k next. Let the events A and B be defined as follows:
• A is the event min{Ti1 , ..., Ti,i−1 , Ti,i+1 , ..., Tik } > t;
• B is the event min{Ti1 , ..., Ti,i−1 , Ti,i+1 , ..., Tik } = Tik ;
• Note that we want to show that P (A|B) = P (A);

• It is easy to see that P(B|A) = P(B), and so

      P(A|B) = P(B|A) P(A)/P(B) = P(B) P(A)/P(B) = P(A),

  as required.
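The alarm clock description translates directly into a simulation: in state i, draw independent exponential(q_ij) times for each possible destination j, move to whichever rings first, and advance the clock by that holding time. A minimal sketch in Python/NumPy (the rate matrix below is invented for illustration):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical transition rates q_ij (off-diagonal entries; diagonal entries are ignored).
    q = np.array([[0.0, 2.0, 1.0],
                  [3.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0]])

    def simulate_ctmc(q, state, t_end):
        """Simulate {X(t)} via competing exponential 'alarm clocks'; returns jump times and states."""
        t, times, states = 0.0, [0.0], [state]
        while True:
            rates = q[state].copy()
            rates[state] = 0.0                     # no clock for the current state
            if rates.sum() == 0:                   # absorbing state: nothing more happens
                return times, states
            clocks = np.full(len(rates), np.inf)
            clocks[rates > 0] = rng.exponential(1.0 / rates[rates > 0])
            j = int(np.argmin(clocks))             # the clock that rings first
            t += clocks[j]                         # holding time = min of the clocks ~ exponential(q_i)
            if t > t_end:
                return times, states
            state = j
            times.append(t)
            states.append(state)

    times, states = simulate_ctmc(q, state=0, t_end=5.0)
    print(list(zip(np.round(times, 3), states)))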

4.3.3 The jump chain: probability of going to a particular state


These ideas give rise to some interesting problems. For example, if the chain is in state i, what’s the prob-
ability that the alarm clock for state j will be the first to ring? This is equivalent to asking ‘what’s the
probability that when we move from state i, we go to state j?’. The second assertion above helps to answer
this question.

Using these results, can we find the probability that the alarm clock for state j will be the first to ring,
given that the process is currently in state i? This is equivalent to asking ‘what’s the probability that
when we move from state i, we go to state j?’. Since the time until the alarm clock for state j rings is

exponential with parameter q_ij, then

    p_ij := P(from state i, move to state j)
          = P(alarm for state j rings first | currently in state i)
          = P(T_ij < T_ik for all k ≠ j | currently in state i)
          = P(T_ij = min_k{T_ik} | currently in state i)
          = q_ij / \sum_{k ≠ i} q_ik =: q_ij / q_i.

Therefore, the probability that from state i, the process goes to state j is given by p_ij = q_ij/q_i.

Note: Suppose that the continuous-time Markov chain has an absorbing state, state i. If the chain enters
state i then it will stay there forever. In this case we would set pii = 1, rather than using the equation given
above.

Notice that the jump chain is itself a discrete-time Markov chain, and therefore we can construct its transition
matrix.

Example 4.6
Suppose that a continuous time Markov chain has state-space S = {1, 2, 3, 4}, and that qij denotes
the rate at which the chain moves from state i to state j (for j 6= i). Let
X
qi = qij for i = 1, 2, 3, 4.
j6=i

The transition matrix of its jump chain is

    P = [     0      q_12/q_1   q_13/q_1   q_14/q_1 ]
        [ q_21/q_2       0      q_23/q_2   q_24/q_2 ]
        [ q_31/q_3   q_32/q_3       0      q_34/q_3 ]
        [ q_41/q_4   q_42/q_4   q_43/q_4       0    ]

Notice that the diagonal of the transition matrix for a jump chain is ALWAYS ZERO, since we are
assuming that the chain moves to another state. Check that all rows sum to 1!
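As a small sketch of this construction (Python/NumPy; the rates are invented), the jump chain matrix can be built directly from the q_ij exactly as in Example 4.6:

    import numpy as np

    # Hypothetical transition rates q_ij for a 4-state chain (diagonal entries unused, set to 0).
    q = np.array([[0.0, 1.0, 2.0, 1.0],
                  [0.5, 0.0, 0.5, 1.0],
                  [2.0, 1.0, 0.0, 1.0],
                  [1.0, 1.0, 2.0, 0.0]])

    q_i = q.sum(axis=1)                # q_i = sum over j != i of q_ij (diagonal is zero here)
    jump_P = q / q_i[:, None]          # p_ij = q_ij / q_i
    np.fill_diagonal(jump_P, 0.0)      # the jump chain never 'moves' to its current state

    print(jump_P)
    print(jump_P.sum(axis=1))          # every row sums to 1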

The jump chain can be used to answer questions which involve the states that the chain enters but not the
times at which they are entered:
• What is the probability that {X(t), t ≥ 0} ever enters a particular state?

• If a chain's states are thought of as the number of individuals in a population (e.g. if the process is in
state 5, then there are 5 individuals in the population), what is the probability that a population ever
becomes extinct? That is, what is the probability that the chain ever enters state 0?

4.4 Analysis of transition probabilities
So far, we have only considered the rate at which a continuous time Markov chain moves from one state
to another. For discrete time processes, our interest was initially in transition probabilities. We will now
explore transition probabilities for continuous time Markov chains.

The obvious ‘problem’ with continuous time Markov chains is that, unlike in the discrete time setting, there
is no “smallest time” until the next transition. Instead, we can jump to another state at any time. The
transition probability pij (t), for states i and j (j 6= i), is a function of t which in principle can be studied.
This is far more complex than in the discrete time case, and our aim is to dispose of the need to work with
transition probabilities to study the long term behaviour of the chain.

It turns out that this is possible, and we will later encounter the generator matrix of a continuous time
Markov chain, which will be much easier to work with than transition probabilities pij (t). To understand
where the generator matrix comes from, we will first investigate the nature of transition probabilities for
continuous time Markov processes.

We will make use of T (i), the holding time in state i, throughout this section. First we will consider transition
probabilities pij (h) over a very small period of time of length h. This will help us in studying transition
probabilities over longer periods of time, and we will be able to develop further tools, namely:
1. The continuous time version of the Chapman-Kolmogorov equations;
2. Kolmogorov’s forward equations;
3. Kolmogorov’s backward equations
Don’t forget that this is all in the hope of understanding the long term behaviour of continuous time Markov
chains!

4.4.1 Transition probabilities over a small interval


We’ll first look at transition probabilities pij (h) for very small h > 0. To get from state i to state j in a time
interval of length h, there are two broad options:
1. Go directly from state i to state j, with the transition occurring sometime in the interval [0, h].
2. Go from state i to state j via another state k, with k 6= i, j, (or via more than one other state) with
all transitions occurring within the interval [0, h].
In fact, we will show that only the first option is possible, that is, the probability of making more than one
transition in a very short time interval is negligible.

Suppose that we start in state i. The probability of one transition in [0, h] is

    P(T(i) < h) = 1 − exp(−q_i h) = q_i h + o(h),

while the probability of one transition in [0, h] and that the transition is to state j ≠ i is

    p_ij(h) = P(T(i) < h | move from i to j in the interval [0, h]) P(move from i to j in the interval [0, h])
            = P(T(i) < h) P(move from i to j in the interval [0, h])
            = (1 − exp(−q_i h)) q_ij/q_i
            = (q_i h + o(h)) q_ij/q_i = q_ij h + o(h).

The probability of two transitions in [0, h] is

    \sum_{k ≠ i} P(T(i) + T(k) < h | move from i to k first) P(move from i to k first)
        ≤ \sum_{k ≠ i} P(T(i) < h, T(k) < h | move from i to k first) q_ik/q_i
        ≤ \sum_{k ≠ i} P(T(i) < h | move from i to k first) P(T(k) < h | move from i to k first) q_ik/q_i
        = \sum_{k ≠ i} (q_i h + o(h)) (q_k h + o(h)) q_ik/q_i = o(h).

On the other hand, the probability of two transitions in [0, h] and that the chain at time h is in state j is

    P(two transitions in [0, h] and the chain at time h is in state j) ≤ P(two transitions in [0, h]) = o(h).

Similarly, we can show that for more than two transitions, P(more than two transitions in [0, h]) = o(h).

Putting this together,

    p_ii(h) = P(no transitions in [0, h] | X(0) = i)
            = 1 − P(one transition in [0, h] | X(0) = i) − P(two transitions in [0, h] | X(0) = i) − ...
            = 1 − q_i h + o(h).

In summary:
• In a very small time interval of length h, the probability of making more than one transition is
o(h). This is negligible, and so we consider that AT MOST one transition will occur in such a
small time interval.

• The probability of moving from state i to state j in the small time interval [0, h] is

pij (h) = qij h + o(h)

and that this transition happens directly from i to j.


• The probability of being in state i at time h, given that the chain starts in state i, is

pii (h) = 1 − qi h + o(h)

and that no transitions occur during this interval (otherwise, to get back to state i, we would
require at least two transitions which has negligible probability).

Putting this together we have that

    p_ij(h) = 1 − q_i h + o(h)   for i = j,
              q_ij h + o(h)      for i ≠ j.

4.4.2 Transitions over longer periods: the Chapman-Kolmogorov equations


It turns out that the Chapman-Kolmogorov equations are just as important in continuous time as they were
in discrete time. They will become our main tool in proving two crucial results on transition probabilities
later on: the Kolmogorov forward and backward equations.

Result: Chapman-Kolmogorov equations in continuous time

The Chapman-Kolmogorov equations in continuous time are

P (s + t) = P (s) P (t).

The proof is as in discrete time: condition on the state at time s.

[Timeline: state i at time 0, state k at time s, state j at time s + t.]

Individual conditional probabilities:

    p_ij(s + t) = \sum_{k ∈ S} p_ik(s) p_kj(t).

Matrix format:

    P(s + t) = P(s) P(t).

We also have

    p^{(t)} = p^{(0)} P(t),

where p^{(0)} is the distribution of X(0) (the initial distribution) and p^{(t)} = (p_0^{(t)}, p_1^{(t)}, . . .) (the
distribution of the possible location of the process at time t).

4.4.3 Kolmogorov’s forward equations


The next results, Kolmogorov’s equations, show us how to compute the transition matrix P (t) for any t > 0.
Their more useful role, however, is in proving later results on the long-term behaviour of continuous time
processes. There are two versions of Kolmogorov’s equations: the forward and the backward equations.

Result: Kolmogorov forward differential equations (KFDEs)

The Kolmogorov forward differential equations (KFDEs) are

    p'_ij(t) = −p_ij(t) q_j + \sum_{k ∈ S, k ≠ j} p_ik(t) q_kj

for all pairs (i, j), or, in matrix notation,

    P'(t) = P(t) Q,

where, for a continuous time Markov chain with state space S = {1, 2, 3, ...},

    Q = [ −q_1   q_12   q_13   ... ]
        [ q_21   −q_2   q_23   ... ]
        [ q_31   q_32   −q_3   ... ]
        [   ⋮      ⋮      ⋮        ]

where q_ij is the transition rate from state i to state j and q_i = \sum_{j ≠ i} q_ij (the rate at which the process
leaves state i, as before).

The KFDEs are rather easy to prove, the main tool being the Chapman-Kolmogorov equations for continuous
time processes. We take an initial time point (0) followed later by two time points (t and t + h) with h small.
Notice that the times t and t + h are therefore close together.

[Timeline: state i at time 0, state k at time t, state j at time t + h.]

By the Chapman-Kolmogorov equations, we know that

    p_ij(t + h) = \sum_{k ∈ S} p_ik(t) p_kj(h),

and recalling that for small h > 0,

    p_ij(h) = 1 − q_i h + o(h)   for i = j,
              q_ij h + o(h)      for i ≠ j.

Therefore, substituting this in and separating the case k = j from k ≠ j,

    p_ij(t + h) = (1 − q_j h + o(h)) p_ij(t) + \sum_{k ∈ S, k ≠ j} p_ik(t) (q_kj h + o(h))
                = p_ij(t) − h p_ij(t) q_j + h \sum_{k ∈ S, k ≠ j} p_ik(t) q_kj + o(h).

Now subtract pij (t) from each side, and divide each side by h:

(pij(t + h) − pij(t)) / h = −pij(t) qj + Σ_{k∈S, k≠j} pik(t) qkj + o(h)/h.

Letting h ↓ 0 gives

p′ij(t) = −pij(t) qj + Σ_{k∈S, k≠j} pik(t) qkj,   for all i, j ∈ S,

which shows that the KFDEs hold.

In matrix notation
P′(t) = P(t) Q.

Exercise 4.7
Convince yourself that the matrix version of the KFDEs contains all the individual differential equations

p′ij(t) = −pij(t) qj + Σ_{k∈S, k≠j} pik(t) qkj,

for all i, j in S.

The importance of the KFDEs

If we wish to find pij (t) = P (X(t) = j | X(0) = i), for j = 0, 1, 2 . . . then we have only one set of
equations to solve.

If we are able to do this we will get a general solution. Then we use the initial distribution p(0) (often of
the form X(0) = i with probability 1) to find the particular solution which satisfies this initial condition.

4.4.4 Kolmogorov’s backward equations

Result: Kolmogorov backward differential equations (KBDEs)

The Kolmogorov backward differential equations (KBDEs) are

p′ij(t) = −pij(t) qi + Σ_{k∈S, k≠i} qik pkj(t)

or, in matrix notation,

P′(t) = Q P(t),

where, for a continuous time Markov chain with state space S = {1, 2, 3, ...},

        ⎛ −q1   q12   q13   ··· ⎞
    Q = ⎜  q21  −q2   q23   ··· ⎟
        ⎜  q31   q32  −q3   ··· ⎟
        ⎝   ⋮     ⋮     ⋮    ⋱  ⎠

where qij is the transition rate from state i to state j and qi = Σ_{j≠i} qij (the rate at which the process
leaves state i, as before).

Exercise 4.8
Compare Kolmogorov’s forward and backward equations.

The KBDEs can be shown to hold by conditioning on the state of the process at time h (compare this with
the forward equations). We take two initial time points (0 and h) close together followed later by a single
time point (t + h).

(Timeline: the chain is in state i at time 0, some intermediate state k at time h, and state j at time t + h.)

By the Chapman-Kolmogorov equations, we know that


pij(t + h) = Σ_{k∈S} pik(h) pkj(t)

and recalling that for small h > 0,

pik(h) = 1 − qi h + o(h)   for i = k,
pik(h) = qik h + o(h)      for i ≠ k.

Therefore, substituting this in and separating the case i = k from i ≠ k,

pij(t + h) = (1 − qi h + o(h)) pij(t) + Σ_{k∈S, k≠i} (qik h + o(h)) pkj(t)
           = pij(t) − h pij(t) qi + h Σ_{k∈S, k≠i} qik pkj(t) + o(h).

Now subtract pij (t) from each side, and divide each side by h:

(pij(t + h) − pij(t)) / h = −pij(t) qi + Σ_{k∈S, k≠i} qik pkj(t) + o(h)/h.

Letting h ↓ 0 gives

p′ij(t) = −pij(t) qi + Σ_{k∈S, k≠i} qik pkj(t),   for all i, j ∈ S,

which shows that the KBDEs hold. The KBDEs in matrix notation are written as

P′(t) = Q P(t).

4.4.5 Solving the KFDEs and KBDEs


The KFDEs are P′(t) = P(t) Q and the KBDEs are P′(t) = Q P(t). As these are differential equations, we
can solve them to find an expression for P(t).

An initial condition is required to obtain a specific, as opposed to a general, solution. It makes sense to use
P(0) = I as the initial condition (why?). In doing so, both the KFDEs and KBDEs give the same solution
for P(t) (this is a good thing; otherwise something very strange would be happening!).

Result: Solution to the KFDEs and KBDEs

Subject to the initial condition P(0) = I, both KFDEs and KBDEs have the same solution for P(t),
namely

P(t) = Σ_{n=0}^{∞} (t^n Q^n) / n! = exp(tQ).
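
Optional numerical aside: if you want to see this solution in action, the short sketch below computes P(t) = exp(tQ) numerically for a made-up two-state generator matrix, assuming Python with NumPy/SciPy is available, and checks the Chapman-Kolmogorov property P(s + t) = P(s)P(t). The rates are chosen arbitrarily for illustration.

```python
import numpy as np
from scipy.linalg import expm   # matrix exponential

# A made-up generator matrix for a two-state chain (rows sum to zero)
Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])

def transition_matrix(t, Q):
    """P(t) = exp(tQ): the common solution of the KFDEs and KBDEs with P(0) = I."""
    return expm(t * Q)

s, t = 0.3, 0.7
P_s, P_t, P_st = transition_matrix(s, Q), transition_matrix(t, Q), transition_matrix(s + t, Q)

print(P_st)                              # each row is a probability distribution (sums to 1)
print(np.allclose(P_st, P_s @ P_t))      # Chapman-Kolmogorov check: True
```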

Exercise 4.9
Using the solution to the differential equations, i.e.
P(t) = Σ_{n=0}^{∞} (t^n Q^n) / n!,

show that for a small time h > 0, we have

P (h) = I + hQ + o(h),

where I denotes the identity matrix.

4.4.6 The generator matrix, Q
We’ve seen that the transition rates introduced in section 4.3 play a significant role for continuous
time Markov processes, appearing in the KFDEs and KBDEs. The solution to these differential equations
shows us that the transition matrix P(t) is entirely dependent on these rates through the matrix Q.

In fact, the matrix Q is so important that it has a name - the generator matrix. We’ll now have a look at
some properties of this matrix Q in preparation for the next section, where we will use it extensively.

For a continuous time Markov process with state-space S = {1, 2, 3, ...}, the generator matrix Q is defined
as

        ⎛ −q1   q12   q13   ··· ⎞
    Q = ⎜  q21  −q2   q23   ··· ⎟
        ⎜  q31   q32  −q3   ··· ⎟
        ⎝   ⋮     ⋮     ⋮    ⋱  ⎠

Since the qik are rates, they are non-negative, and therefore qi must be so too. Also recall that in
section 4.2, we saw

qi = Σ_{k≠i} qik.

Consider the rows of Q, and you will notice that they sum to zero. Therefore, the diagonal entries of the matrix
Q are non-positive (they can be zero), while all the off-diagonal entries are non-negative (they can also be zero).

In fact, the matrix Q is so important that together with the initial distribution of the continuous time Markov
chain, p(0) , we can specify the process exactly and can compute all that we’re interested in, including the
long term behaviour of the chain. Compare this to the discrete time chain, where we required the initial
distribution together with the transition matrix:

Discrete time process specified by: Continuous time process specified by:

Initial distribution, p(0) Initial distribution, p(0)

Transition matrix, P Generator matrix, Q

In effect, for continuous time Markov processes, the generator matrix Q takes the place of the transition
matrix P in the discrete time case.

Note: throughout this chapter, we assume that the qij s satisfy certain technical conditions which are
sufficient to prevent a Markov chain ‘behaving badly’, for example, the elements in the matrix Q (the rates)
exist and are finite. These conditions will be met in all our examples.

So far, we have thought about the generator matrix, Q, as a matrix of rates. In fact, we could also show
that
Q := dP(t)/dt |_{t=0} = P′(0).
That is, Q contains the rates of change of the transition probabilities P (X(t) = j | X(0) = i) at
t = 0: just as we already know. In order for this differentiation to make sense, we assume that P (t) is

continuous at 0, that is,

pij(t) → 1 if j = i, and pij(t) → 0 if j ≠ i, as t ↓ 0.
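
Optional numerical aside: the identity Q = P′(0) is easy to check numerically. Assuming SciPy is available, and using a made-up two-state generator, the finite-difference quotient (P(h) − I)/h approaches Q as h shrinks, in line with P(h) = I + hQ + o(h) from Exercise 4.9.

```python
import numpy as np
from scipy.linalg import expm

# Made-up two-state generator matrix
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])
I = np.eye(2)

for h in (0.1, 0.01, 0.001):
    # finite-difference approximation to P'(0); the error shrinks as h decreases
    print(h, np.max(np.abs((expm(h * Q) - I) / h - Q)))
```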

4.5 Limiting behaviour


As we did for the discrete time processes, we will study both invariant and equilibrium distributions. We
interpret invariant and equilibrium distributions in the same way as we did for discrete time processes.

4.5.1 Invariant distributions


Definition: Invariant distribution for continuous time Markov chains

The probability row vector π is an invariant distribution of a continuous-time Markov chain {X(t), t ≥
0} with transition matrix P (t) if

π = π P (t), for all t ≥ 0.

Using P(t) = exp(tQ) gives

π exp(tQ) = π
π (I + tQ + t²Q²/2! + ···) = π
t πQ + t² πQ²/2! + ··· = 0,

which holds for all t ≥ 0 if and only if πQ = 0.

π = π P (t) is equivalent to π Q = 0.

The consequence of this is as follows. If p(0) = π then p(t) = π for all t ≥ 0, that is, if we start the chain
using the distribution π, the distribution of the chain at any future time t is also π.

More generally, if at any time the distribution of the chain is π, then the distribution of the chain will be π
at all future times.

Finding the invariant distribution of a continuous time Markov chain requires finding a vector π which
satisfies πQ = 0. This is much easier than solving π = π P (t) as P (t) itself is difficult to compute!
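
Optional numerical aside: in practice we can find an invariant distribution by solving the linear system πQ = 0 together with the normalisation Σj πj = 1. A minimal sketch in Python follows (the three-state generator below is made up purely for illustration).

```python
import numpy as np

# A made-up generator matrix for a three-state chain (rows sum to zero)
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -4.0,  3.0],
              [ 2.0,  2.0, -4.0]])

# Solve pi Q = 0 subject to sum(pi) = 1: append the normalisation as an extra equation.
n = Q.shape[0]
A = np.vstack([Q.T, np.ones(n)])          # (pi Q)^T = Q^T pi^T
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)                                 # invariant distribution (non-negative, sums to 1)
print(np.allclose(pi @ Q, 0.0))           # check pi Q = 0 -> True
```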

4.5.2 Equilibrium distributions

Definition: Equilibrium distribution for continuous time Markov chains

An equilibrium distribution π = {πj , j ∈ S} exists if pij (t) → πj as t → ∞ for each j ∈ S, where π


is a probability distribution that does not depend on the initial state i.

Suppose that such a distribution exists. Then since pij (t) → πj (a constant) as t → ∞ we expect

p′ij(t) → 0 as t → ∞.

The KFDEs also give

p′ij(t) = −pij(t) qj + Σ_{k∈S, k≠j} pik(t) qkj.

Letting t → ∞ in the KFDE above gives

lim_{t→∞} p′ij(t) = lim_{t→∞} (−pij(t) qj) + lim_{t→∞} Σ_{k∈S, k≠j} pik(t) qkj = −πj qj + Σ_{k∈S, k≠j} πk qkj.

Therefore, since p′ij(t) → 0 as t → ∞, this is equivalent to

0 = −πj qj + Σ_{k∈S, k≠j} πk qkj,   or   πQ = 0.

What does this tell us?


The fact that an equilibrium distribution π satisfies πQ = 0 implies that an equilibrium distribution
is also an invariant distribution (just as was the case in discrete time). Therefore an equilibrium
distribution is an invariant distribution of {X(t), t ≥ 0}.

Note that it is not always true the other way around: an invariant distribution of {X(t), t ≥ 0} is not
necessarily an equilibrium distribution.

We used the KFDEs to establish this result. Since the forward and backward equations give the same solu-
tion for P (t), could we have used the KBDEs to do the same?

In this case, the answer is ‘no’. Though the KFDEs and KBDEs look very similar, it is sometimes the case
that only one of these is useful in solving a particular problem. Here, using the KFDEs allowed us to see the
connection between the invariant and equilibrium distributions. However, if we had tried to use the KBDEs
instead, we would get the following:
 
0 = −πj qi + Σ_{k∈S, k≠i} qik πj = −πj qi + πj Σ_{k∈S, k≠i} qik = πj ( Σ_{k∈S, k≠i} qik − qi ) = 0.

Obviously, the equation is perfectly valid and correct, but 0 = 0 is of no use whatsoever!

In other cases, which we will not consider in this course, the backward equations prove to be useful while
the forward equations are not.

4.5.3 A limit theorem for continuous time Markov chains


Think back to the discrete time case: we were able to establish a theorem that told us exactly when an
equilibrium distribution existed for a particular discrete time Markov chain. We would like to do the same
here, too, for continuous time Markov chains. In the continuous time case, things are more difficult, and we
end up with a result that is a little more restrictive than we did in the discrete time case.

The theorem will depend on the Markov chain being irreducible, which we define next.

Definition: Irreducible continuous time Markov chains

A continuous-time Markov chain is irreducible if, for every i and j in S, pij (t) > 0 for some t. That
is, all states in the chain intercommunicate.

Result: Existence of an equilibrium distribution for irreducible chains

Suppose that {X(t), t ≥ 0} is an irreducible continuous-time Markov chain.

(i) If there exists an invariant distribution π then it is unique and pij (t) → πj as t → ∞ for all
i, j ∈ S. (i.e. π is the equilibrium distribution of the chain.)
(ii) If there is no invariant distribution then pij (t) → 0 as t → ∞ for all i, j ∈ S. (i.e. no equilibrium
distribution exists.)

Note:

• For continuous-time Markov chains we do not need to worry about periodicity.


• If the chain is positive recurrent then such an invariant distribution does exist. Therefore, if an
irreducible Markov chain has no invariant distribution it must be either null recurrent or transient.
It does not follow that only irreducible processes can have an equilibrium distribution. In the next section
we will show that a process which is not irreducible does have an equilibrium distribution under certain
conditions. However, to prove this, we cannot rely on the result above: we do the maths by ‘brute force’.

5 Important types of continuous-time processes
5.1 The Poisson process
This is a very simple but important continuous time Markov process which is the building block for
many complex processes. A Poisson process, {N (t), t ≥ 0}, counts the number of events in (0, t]. These
events occur one at a time and independently of each other.
• The state space of a Poisson process is infinite and consists of the natural numbers together with {0}:
S = {0, 1, 2, 3, ...}.

• The Poisson process always starts in state 0: N (0) = 0.


• As it is a counting process, when it leaves state 0 (no events) it moves to state 1 (one event has
occurred).
• From state 1 (one event has occurred), it proceeds to state 2 (two events have occurred).

• And so on...
• These ‘events’ occur uniformly, i.e. at a constant rate, λ, per unit time.
You may find it helpful to visualise a Poisson process using the following state space diagram,

0 → 1 → 2 → 3 → 4 → ···

Notice that there are no probabilities on the arrows here, unlike the discrete time case. Remember that
moving to the next state can occur at any time. Equivalently, we can visualise a Poisson process using a
graph which plots the realisation of the process; each ‘step’ upward denotes the occurrence of an event.

Formally, a Poisson process is defined as follows.

Definition: Poisson process

Let N (t) denote the number of occurrences of some event in (0, t] for which there exists λ > 0 such
that for h > 0:
(i) P (1 event in (t, t + h]) = λh + o(h);
(ii) P (no events in (t, t + h]) = 1 − λh + o(h);
(iii) The number of events in (t, t + h] is independent of the process in (0, t].

Then {N (t), t ≥ 0} is a Poisson process of rate λ.

Notes
1. Statement (i) can be written as P(N (t + h) − N (t) = 1) = λh + o(h).
Statement (ii) can be written as P(N (t + h) − N (t) = 0) = 1 − λh + o(h).
2. Statement (i) says that P (1 event in (t, t+h]) is approximately λh (that is, approximately proportional
to the length h of the time interval (t, t + h]) for small h.
Statement (ii) says that P (0 events in (t, t + h]) ≈ 1 − λh, for small h. Therefore, P (more than 1 event
in (t, t + h]) is o(h).
3. Statement (iii) says that the process has independent increments.
4. Note that the Poisson process satisfies the Markov property: consider times t1 < t2 < · · · < tn < tn+1 .
Then

P(N (tn+1 ) = in+1 | N (tn ) = in , . . . , N (t1 ) = i1 )


=P(in+1 − in events in (tn , tn+1 ] | N (tn ) = in , . . . , N (t1 ) = i1 )
=P(in+1 − in events in (tn , tn+1 ]) by (iii)
=P(N (tn+1 ) = in+1 | N (tn ) = in ).

5.1.1 The generator matrix for the Poisson process


Remember that a Poisson process currently in state i must next move to state i + 1: there is no choice of
states. How long does the Poisson process stay in state i? This is the holding time, and we already know this
is exponentially distributed with some parameter qi . What is qi in this case? Recall that for small h > 0,

pii (h) = 1 − qi h + o(h).

Compare this to the definition of the Poisson process, which states that

pii (h) = P (no events in (t, t + h]) = 1 − λh + o(h),

and that this holds for all i = 0, 1, 2, .... Together, these imply that qi = λ: the rate of leaving state i is the
rate of the Poisson process.

Notice also that:


• The rate of leaving state i is equivalent to the rate at which we go from state i to state i + 1, therefore
qi,i+1 = qi = λ.

• For k < i, qik = 0 since the Poisson process cannot go down in value - it is an increasing process!
• For k > (i + 1), qik = 0 since the Poisson process cannot jump directly from state i to state k > i + 1:
it must jump to state i + 1 first.

Putting this together, the generator matrix for a Poisson process is

        ⎛ −λ    λ    0    0    0   ··· ⎞
        ⎜  0   −λ    λ    0    0   ··· ⎟
    Q = ⎜  0    0   −λ    λ    0   ··· ⎟
        ⎜  0    0    0   −λ    λ   ··· ⎟
        ⎝  ⋮    ⋮    ⋮    ⋮    ⋮    ⋱  ⎠

Notice that since we now have the generator matrix, Q, for a Poisson process, and we know that its initial
distribution is always p(0) = (1, 0, 0, ...), we can use the results from Chapter 4 on continuous time processes.
In particular, test your understanding of these results by answering the following.

Exercise 5.1
Does a Poisson process have an invariant distribution? An equilibrium distribution?

Exercise 5.2
Write down the forward and backward equations for the Poisson process (you need not solve them).

Exercise 5.3
Find the transition matrix of the embedded jump chain of the Poisson process. Describe the embedded
jump chain in words.

5.1.2 The distribution of the number of events by time t


Now we’ll look at further properties of the Poisson process.

What is the probability that, by time t, k events have occurred? In other words, can we calculate p0k (t) =:
P (N (t) = k)? Note that this is the same as asking for the probability that the Poisson process has moved
from state 0 to state k in a time interval of length t.

We can use the forward equations for a Poisson process to establish the answer to this question. The forward
equations are

p′i,i+1(t) = −pi,i+1(t) qi+1 + Σ_{k∈S, k≠i+1} pik(t) qk,i+1
           = −pi,i+1(t) qi+1 + pii(t) qi,i+1
           = −λ pi,i+1(t) + λ pii(t)
           = λ (pii(t) − pi,i+1(t)),

for all i = 0, 1, 2, .... We can solve these differential equations, as before, and we use

 1

if k = 0
p0k (0) = P(N (0) = k) =
 0

for k = 1, 2, 3, . . . ,

as the initial condition, i.e. N (0) = 0. The solution is

p0k(t) = P(N(t) = k) = (λt)^k exp(−λt) / k!,   k = 0, 1, 2, . . . ,
i.e.
N (t) ∼ Poisson(λt).
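
Optional numerical aside: a quick simulation check of this result, assuming NumPy is available (the rate λ = 3 and horizon t = 2 below are arbitrary). We build N(t) from i.i.d. exponential(λ) inter-event times (a property covered formally in section 5.1.3) and compare the empirical distribution with the Poisson(λt) probabilities.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(42)
lam, t, reps = 3.0, 2.0, 100_000      # arbitrary rate, time horizon and number of replications

def n_events_by(t, lam, rng):
    """Simulate N(t) by summing i.i.d. exponential(lam) inter-event times until t is exceeded."""
    total, count = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)   # numpy parametrises the exponential by its mean 1/lam
        if total > t:
            return count
        count += 1

samples = np.array([n_events_by(t, lam, rng) for _ in range(reps)])
print(samples.mean(), lam * t)                              # both close to E[N(t)] = lam * t = 6
print(np.mean(samples == 4),                                # empirical P(N(t) = 4) ...
      (lam * t)**4 * exp(-lam * t) / factorial(4))          # ... vs the Poisson(lam*t) pmf
```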

Exercise 5.4
Show that P(N(t) = k) = (λt)^k exp(−λt) / k! is indeed the solution to the forward equations, by substituting
this into both sides of the equation.

Exercise 5.5
Tourists arrive at the departure point of a sightseeing bus according to a Poisson process of rate
λ per minute. Denote by N (t) the number of tourists that arrive at the departure point of the bus
during a time interval of length t minutes.
(i) For t > 0, name the distribution of N (t) and state its mean and variance.

(ii) The bus driver gets bored and drives off after 10 minutes. If λ = 3, what is the probability that
there will be 32 tourists on the bus? What is the expected number of passengers that will be
on the bus?
(iii) Given that the first tourist arrived during the first 5 minutes, find the probability that he or
she arrived during the first 2 minutes.

5.1.3 Other properties of the Poisson process

Result: Number of events

The number of events in (s, s + t] has a Poisson(λt) distribution (for all s).

Statements (i), (ii) and (iii) in the definition of the Poisson process do not depend on the time at which the interval starts, so

P(k events in (s, s + t]) = P(k events in (0, t]) = (λt)^k exp(−λt) / k!.
[The Poisson process has stationary increments: the distribution of the number of events in a time interval
does not depend on when the interval starts. This is equivalent to time homogeneity. ]

Result: Number of events in a small interval

In a small interval of length h, the probability of exactly one event is λh + o(h). The probability of two or
more events in a small interval of length h is o(h).

P(0 events in (t, t + h]) = P(T1 > h) = exp(−λh) = 1 − λh + o(h).
P(≥ 1 event in (t, t + h]) = P(T1 < h) = 1 − exp(−λh) = λh + o(h).
P(≥ 2 events in (t, t + h]) = P(T1 + T2 < h) = ∫_0^h λ² x exp(−λx) dx = o(h),

where T1 and T2 denote the first two inter-event times (see below).

Result: Disjoint intervals

If (a, b] and (c, d] are non-overlapping intervals, then the number of events in (a, b] is independent of
the number of events in (c, d].

This follows from condition (iii).

Result: Time to first event


The time T1 to the first event is exponentially distributed with parameter λ.

P(T1 > t) = P(N (t) = 0) = exp(−λt), t > 0.


Therefore,
FT1 (t) = P(T1 ≤ t) = 1 − exp(−λt) and fT1 (t) = λ exp(−λt), t > 0.
The random variable T1 has an exponential distribution with rate λ.

Result: Time between successive events

The times between successive events are i.i.d. exponential(λ) random variables.

Let T2 be the time between the first and second events. Firstly, we find the marginal distribution of T2 :

P(T2 > t) = ∫_0^∞ P(T2 > t | T1 = s) fT1(s) ds
          = ∫_0^∞ P(no events in (s, s + t]) λ exp(−λs) ds
          = ∫_0^∞ exp(−λt) λ exp(−λs) ds
          = exp(−λt).
Therefore, T2 ∼ exp(λ). Now we show that T1 and T2 are independent.
P(T1 > v, T2 > t) = ∫_v^∞ P(T2 > t | T1 = u) fT1(u) du
                  = ∫_v^∞ P(no events in (u, u + t]) λ exp(−λu) du
                  = ∫_v^∞ exp(−λt) λ exp(−λu) du
                  = exp(−λt) exp(−λv) = P(T2 > t) P(T1 > v).
Therefore, T1 and T2 are independent. Repeating the argument for (T2 , T3 ), (T3 , T4 ), . . . gives the result.

Result: Time to rth event

The time to the rth event has a Gamma(r, λ) distribution.

Let Sr be the time to the rth event. Then Sr = T1 + · · · + Tr . This is the sum of r i.i.d. exponential(λ)
random variables, which has a Gamma(r, λ) distribution.

Result: Time to next event from arbitrary time point

The time from an arbitrary time point t to the next event is an exponential(λ) random variable.

This follows directly from the lack-of-memory property of the exponential distribution and is consistent with
the definition (i), (ii), (iii) of a Poisson process.

These properties are summarised in the following diagram.

A further very useful property is the distribution of the times at which events occur when we know exactly
the number of events that have happened. Notice that this is a result which is conditional on knowing the
number of events that have occurred.

Result: Distribution of times of events when k events have occurred

Given that exactly k events occur in (0, t], the members of the set (U1, . . . , Uk) of k (unordered)
arrival times are i.i.d. random variables, each of which has distribution U(0, t). That is,

U1, . . . , UN(t) | N(t) = k  ∼ i.i.d.  U(0, t).

This makes sense: the k events are randomly scattered in the interval (0, t]. You do not need to be able to
prove this result, but you will need to use it.
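
Optional numerical aside: a simulation sketch of this conditioning result, assuming NumPy (the rate, horizon and k below are arbitrary). We keep only those realisations with exactly k events in (0, t] and check that the pooled arrival times behave like U(0, t) draws.

```python
import numpy as np

rng = np.random.default_rng(7)
lam, t, k, reps = 2.0, 5.0, 10, 50_000   # arbitrary rate, horizon, conditioning value and replications

pooled = []
for _ in range(reps):
    times = np.cumsum(rng.exponential(1.0 / lam, size=40))   # 40 gaps comfortably covers (0, 5] here
    times = times[times <= t]
    if times.size == k:                    # condition on N(t) = k
        pooled.extend(times)

pooled = np.array(pooled)
print(pooled.mean(), t / 2)                # U(0, t) has mean t/2 = 2.5
print(np.mean(pooled <= t / 4), 0.25)      # and P(U <= t/4) = 1/4
```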

5.1.4 Superposition and thinning of a Poisson process

Result: Superposition of Poisson processes

Suppose that {N1 (t), t ≥ 0} and {N2 (t), t ≥ 0} are independent Poisson processes with rates λ1 and λ2
respectively, and N (t) = N1 (t)+N2 (t) for all t ≥ 0 (so that {N (t)} counts the total number of events).

Then {N (t), t ≥ 0} is a Poisson process of rate λ1 + λ2 .

We’ll show that the three conditions, (i), (ii) and (iii), required of a Poisson process, hold.
(i)

P(1 event in (t, t + h])
= P(1 event from process 1) P(0 events from process 2) + P(0 events from process 1) P(1 event from process 2)
= (λ1 h + o(h))(1 − λ2 h + o(h)) + (1 − λ1 h + o(h))(λ2 h + o(h))
= λ1 h − λ1 λ2 h² + o(h) + λ2 h − λ1 λ2 h² + o(h)
= (λ1 + λ2) h + o(h).
(ii)

P(0 events in (t, t + h])
= P(0 events in process 1) P(0 events in process 2)
= (1 − λ1 h + o(h))(1 − λ2 h + o(h))
= 1 − (λ1 + λ2) h + o(h).

(iii) Independent increments. This holds because it holds for each process individually and the two processes
are independent.

Result: Thinning of a Poisson process

{N (t), t ≥ 0} is a Poisson process of rate λ. Each event of the process is deleted with probability
(1 − p) and kept with probability p, independently of all other events in the process.

Let {M (t), t ≥ 0} be the process containing only the events which are kept, then {M (t), t ≥ 0} is a
Poisson process of rate pλ.

We’ll show that the three conditions, (i), (ii) and (iii), required of a Poisson process, hold.

(i)

P(1 event in (t, t + h])


=P(1 event in {N (t)} in (t, t + h], event not deleted) + o(h)
=(λh + o(h)) × p + o(h) = pλh + o(h).

(ii)

P(0 events in (t, t + h])


=P(0 events in {N (t)} in (t, t + h])
+ P(1 event in {N (t)} in (t, t + h], event deleted) + o(h)
=1 − λh + o(h) + (λh + o(h)) × (1 − p) + o(h)
=1 − pλh + o(h).

(iii) Deleting events doesn’t affect the independent increments property because events are deleted inde-
pendently of each other.
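
Optional numerical aside: a quick simulation check of the thinning result, assuming NumPy (the rate λ, keep-probability p and horizon below are arbitrary). The kept events occur at an empirical rate close to pλ.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, t_max = 10.0, 0.25, 1000.0     # arbitrary rate, keep-probability and horizon

# Simulate event times of a rate-lam Poisson process on (0, t_max] via exponential gaps
gaps = rng.exponential(1.0 / lam, size=int(2 * lam * t_max))
times = np.cumsum(gaps)
times = times[times <= t_max]

# Thin: keep each event independently with probability p
kept = times[rng.random(times.size) < p]

print(times.size / t_max)   # close to lam = 10
print(kept.size / t_max)    # close to p * lam = 2.5, as the thinning result predicts
```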

Example 5.6
Outside Heals on Tottenham Court Road a charity worker attempts to stop people in order to
have a conversation. People walking along the street arrive at Heals in a Poisson process of rate
40 per minute. On average only 1 in every 20 people stop to talk. Each person’s decision is taken
independently of all other people. Each conversation takes a time which is exponentially distributed
with mean 1 minute, and is independent of other conversations. If the charity worker is busy, people
passing Heals do not stop.

Let X(t) denote whether the charity worker is busy at time t, with S = {0, 1}. State Q and the
forward equation for p01 (t). If the charity worker is free at time t = 0 minutes, show that the
probability that they are busy at time t > 0 is
(2/3) − (2/3) exp(−3t).
Hence, or otherwise, find the long-run proportion of the time that the charity worker is busy.

Solution
Firstly, pn(t) = P(X(t) = n | X(0) = 0) with X(0) = 0 so that p0(0) = 1. People who stop to talk
arrive in a Poisson process of rate 40 × (1/20) = 2 per minute. [Thinning of a Poisson process.]

The generator matrix is therefore

    Q = ⎛ −2    2 ⎞
        ⎝  1   −1 ⎠ ,

and the forward equation for p01(t) is

p′01(t) = 2 p00(t) − p01(t).

Notice that p00 (t) = 1 − p01 (t), and substituting this into the forward equation for p01 (t) yields

p′01(t) = 2 (1 − p01(t)) − p01(t) = 2 − 3 p01(t).

We now need to show that

p01(t) = (2/3) − (2/3) exp(−3t).

Using this identity, the left hand side of the forward equation above is

p′01(t) = (6/3) exp(−3t) = 2 exp(−3t),

and the right hand side of the forward equation above is also 2 exp(−3t). Therefore, (2/3) − (2/3) exp(−3t)
satisfies the forward equation for p01(t) (and the initial condition p01(0) = 0), and so

p01(t) = (2/3) − (2/3) exp(−3t),

as required.

We get the long-run proportion of time that the charity worker is busy by letting t → ∞, giving the
answer 2/3. Don’t forget that we could have obtained this answer more easily by solving πQ = 0,
yielding π1 = 2/3, as before.
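
Optional numerical aside: a quick check of this example, assuming NumPy/SciPy. We compute P(t) = exp(tQ) for the generator above, compare its (0, 1) entry with (2/3) − (2/3) exp(−3t), and recover the long-run answer from πQ = 0.

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-2.0,  2.0],
              [ 1.0, -1.0]])   # generator from the example (rates per minute)

t = 1.5                                            # any time point
P_t = expm(t * Q)                                  # transition matrix P(t)
print(P_t[0, 1], 2/3 - (2/3) * np.exp(-3 * t))     # both equal p01(t)

# Long-run behaviour: solve pi Q = 0 with sum(pi) = 1
A = np.vstack([Q.T, np.ones(2)])
pi, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)
print(pi)                                          # approximately [1/3, 2/3]
```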

Exercise 5.7
A bicycle factory has three independent production lines, each of which produces bicycles according
to a Poisson Process. The first, for racing bikes, has a time in minutes between completed machines
that is exponential (1/5). The interarrival times for the second, for BMX bikes, are exponential (1/4)
and for the third, for mountain bikes, are exponential (1/6).

(a) What is the expected number of bikes produced in an 8 hour shift?


(b) What is the probability that exactly fourteen BMX bikes are made in the first hour of produc-
tion?
(c) What is the variance of the time in minutes until the third mountain bike is made from the
beginning of the shift?
(d) Sir Bradley Wiggins arrives during a shift. How long would he expect to wait until he sees the
first BMX bicycle roll off the production line?
(e) Given that 20 bikes are produced in an hour, what is the mean number that will be produced
in the period [21, 36) minutes?

Exercise 5.8
On a single telephone line, calls arrive according to a Poisson process of rate λ per minute.

(a) Name the distribution of the number of phone calls that arrive during a time interval of length
t minutes (t > 0), and state its mean.
(b) Name the distribution of the time (in minutes) between the arrivals of two successive phone
calls, and state its mean.
If a caller finds the line free then his/her call is ‘effective’; effective calls have durations that are
independent, exponentially distributed random variables with a mean of 1/µ minutes, independently
of the arrival process. If a caller finds the line busy then this call is lost.
(c) Let Y be the time (in minutes) from the beginning of an effective phone call to the arrival of
the next effective phone call. Write down the mean and the variance of Y .

(d) State, with a reason, whether or not the arrivals of the effective phone calls follow a Poisson
process.

5.2 Birth and death processes


This is another important class of continuous-time Markov processes in which jumps are always one step
up (‘birth’) or one step down (‘death’). Births and deaths occur singly and independently of each other.
The state space is S = {0, 1, 2, . . .}. A graph showing the path of a birth-death process is shown below. A
step upward corresponds to a birth, while a step downward corresponds to a death. Compare this to the
Poisson process!

We think about the ‘states’ of a birth death process as the size of the population: if the process is ‘in
state n’ then ‘the size of the population is n’. The state identifies the number of individuals in the
population (though note that a birth-death process doesn’t have to describe the size of a population -
it can describe any process which goes up or down by one unit at each transition).

Important questions to ask of birth-death processes include:


• What is the probability of a birth (death) in a small time interval?
• What is the probability that the next event is a birth? A death?
• How long must we wait until the next event?

• How do these processes behave in the long run?


• Will the population ever die out (become extinct)?

5.2.1 Transition rates and the jump chain


Let N(t) denote the size of a population at time t ≥ 0 for a birth-death process. If at time t the size of the
population is n, then the next transition is:

• to state (n + 1) (a birth) OR
• to state (n − 1) (a death).

Notation: if the size of the population is n, then, using the same notation as in Chapter 4,
Birth rate: qn,n+1 = λn
Death rate: qn,n−1 = µn

• Obviously we must have µ0 = 0. Why?

• The birth and death rates can depend on the population size. That is, they can change as the
population size changes which is why the general birth and death rates are indicated with a
subscript n, the population size.

How long does the process stay in state n? We know from Chapter 4 that the distribution of this time is
exponential, and since the only transitions allowable are to (n + 1) (with rate qn,n+1 = λn ) or to (n − 1) (with
rate qn,n−1 = µn ) then we have that the rate at which the process leaves state n is qn = qn,n+1 + qn,n−1 =
(λn + µn ). Therefore, the time T until the next ‘event’ will be exponential with rate (λn + µn ),

T ∼ exponential(λn + µn ).

From this we can construct the jump chain of a birth-death process. Using the birth and death rates λn and
µn , the next event will be a:
• birth, with probability λn /(λn + µn );
• death, with probability µn /(λn + µn ).
The transition matrix of the embedded jump chain is therefore

        ⎛      ?            ?            0            0            0       ··· ⎞
        ⎜ µ1/(λ1+µ1)        0       λ1/(λ1+µ1)        0            0       ··· ⎟
    P = ⎜      0       µ2/(λ2+µ2)        0       λ2/(λ2+µ2)        0       ··· ⎟
        ⎜      0            0       µ3/(λ3+µ3)        0       λ3/(λ3+µ3)   ··· ⎟
        ⎝      ⋮            ⋮            ⋮            ⋮            ⋮        ⋱  ⎠

Why is there ambiguity in the first line of the transition matrix of the embedded jump chain?

5.2.2 The generator matrix


We can immediately construct the generator matrix, Q, for the birth-death process using the transition rates
λn and µn , and the following observations:
• For k < (i − 1), qik = 0 since the birth-death process cannot jump down by more than 1.

• For k > (i + 1), qik = 0 since the birth-death process cannot jump directly from state i to state
k > i + 1: it must jump to state i + 1 first.

Definition: Generator matrix for the birth-death process

The generator matrix for a birth-death process is

        ⎛ −λ0        λ0           0            0         ··· ⎞
        ⎜  µ1    −(λ1 + µ1)      λ1            0         ··· ⎟
    Q = ⎜  0         µ2      −(λ2 + µ2)       λ2         ··· ⎟
        ⎜  0         0           µ3       −(λ3 + µ3)     ··· ⎟
        ⎝  ⋮         ⋮           ⋮            ⋮           ⋱  ⎠

We could also have constructed the generator matrix by computing the transition probabilities pij (h) for
small h, and reading off the associated transition rates. Let pij (h) = P(N (t + h) = j | N (t) = i). Then
pn,n+1(h) = {λn h + o(h)}{1 − µn h + o(h)} + o(h) = λn h + o(h)
pn,n−1(h) = {1 − λn h + o(h)}{µn h + o(h)} + o(h) = µn h + o(h)
pnn(h) = {1 − µn h + o(h)}{1 − λn h + o(h)} + o(h) = 1 − (µn + λn) h + o(h)
pnm(h) = o(h), if m ∉ {n − 1, n, n + 1}.
Now compare this to the general result for any continuous time Markov process with discrete state space:
pij (h) = qij h + o(h)
pii (h) = 1 − qi h + o(h)
Match up the rates with the qi and qij , to construct the generator matrix. Of course, this is a far more
cumbersome way of extracting the generator matrix!
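
Optional numerical aside: for computations it can be convenient to build the generator matrix of a birth-death process directly from the rate sequences λn and µn. A minimal sketch follows, assuming NumPy; the infinite state space is truncated for numerical work only, and the immigration and linear-death rates at the end are made up for illustration.

```python
import numpy as np

def bd_generator(birth, death, n_states):
    """Generator matrix of a birth-death process truncated to states 0, ..., n_states - 1.

    birth(n) and death(n) should return the rates lambda_n and mu_n."""
    Q = np.zeros((n_states, n_states))
    for n in range(n_states):
        lam = birth(n) if n < n_states - 1 else 0.0   # no births out of the last kept state
        mu = death(n) if n > 0 else 0.0               # mu_0 = 0
        Q[n, n] = -(lam + mu)
        if n < n_states - 1:
            Q[n, n + 1] = lam
        if n > 0:
            Q[n, n - 1] = mu
    return Q

# Example: immigration at rate alpha with linear death at rate mu per individual (rates made up)
alpha, mu = 2.0, 0.5
Q = bd_generator(lambda n: alpha, lambda n: n * mu, n_states=6)
print(Q)
print(Q.sum(axis=1))    # rows sum to zero
```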

5.2.3 Important types of birth-death processes


(i) IMMIGRATION at rate α: λn = α, n = 0, 1, 2 . . . .
(ii) LINEAR BIRTH at rate λ per individual (that is, every individual alive at time t has probability
λh + o(h) of giving birth to a new individual in (t, t + h]):
λn = nλ, n = 0, 1, 2, . . . .

(iii) EMIGRATION at rate β: µn = β, n = 1, 2, . . . .


(iv) LINEAR DEATH at rate µ per individual (that is, every individual alive at time t has probability
µh + o(h) of dying in (t, t + h]):
µn = nµ, n = 1, 2, . . . .

Easy exercise
Which types of birth-death processes (listed above) have birth or death rates which depend on the
current population size?

We could define a process that has (i), (ii), (iii) and (iv), that is,

λn = α + nλ, n = 0, 1, 2, . . . , and µn = β + nµ, n = 1, 2, . . . ,

although it would be difficult to study this process algebraically.

Some comments on particular types of birth-death process.

• A Poisson process of rate λ is a pure immigration process, i.e.

λn = λ, n = 0, 1, . . . and µn = 0, n = 1, 2, . . . .

Therefore, an immigration process of rate α is equivalent to a Poisson process of rate α.

• A pure death process has λn = 0, for n = 0, 1, 2, . . . .


Recall that the time until the next jump is exponential(qi ), so that under pure linear death the
time until the next jump is exponential(nµ).
This makes sense because the time until a given individual dies is exponential(µ) and all individuals
are independent of each other.
[Recall from section 4.3.2 that if Ti ∼ i.i.d. exponential(µ), i = 1, . . . , n, then min(T1 , . . . , Tn ) ∼
exponential(nµ).]
• A pure birth process has µn = 0, for n = 1, 2, . . .
(. . . and we can make similar statements to pure linear death with ‘death’ replaced by ‘birth’).

5.2.4 Transition probabilities (over a small interval)


We already know from Chapter 4 that for any continuous time, discrete state space, Markov process we have:

pij (h) = qij h + o(h)

for small h. For the birth-death process,

qn,n+1 = λn , qn,n−1 = µn and qnm = 0 for m ≠ (n − 1), (n + 1),

therefore, supposing that the process is currently in state n (n individuals in the population):
• the probability of a birth in (t, t + h], is pn,n+1 (h) = λn h + o(h).
• the probability of a death in (t, t + h], is pn,n−1 (h) = µn h + o(h).

Exercise 5.9
Fill in the table below, assuming that the population size at time t is n.

Immigration-Emigration process Immigration-Emigration w. linear birth

Birth rate

Death rate

P (Birth ∈ (t, t + h])

(small h)

P (Death ∈ (t, t + h])

(small h)

5.2.5 The equilibrium distribution of the (general) birth and death process
Do all birth-death processes have an equilibrium distribution? Our result in Chapter 4 only specified whether
an irreducible continuous time Markov chain had an equilibrium distribution:

Result from Chapter 4


Suppose that {X(t), t ≥ 0} is an irreducible continuous-time Markov chain.

(i) If there exists an invariant distribution π then it is unique and pij (t) → πj as t → ∞ for all i, j ∈ S.
– That is, π is the equilibrium distribution of the chain.
(ii) If there is no invariant distribution then pij (t) → 0 as t → ∞ for all i, j ∈ S.

– No equilibrium distribution exists.

Are all birth-death processes irreducible?

No (consider, for example, a Poisson process). When a process is not irreducible, we cannot guarantee that
an invariant distribution is an equilibrium distribution. Which birth-death processes are irreducible (and so
we can apply the result above)?

Exercise 5.10

Fill in the table to show which birth-death processes are irreducible, and which are not.

Columns: Linear birth | Immigration | Linear birth with immigration.
Rows: Linear death | Emigration | Linear death with emigration.

Note that for a process to have an equilibrium distribution, it must have exactly one invariant distribution.
Recall that the existence of exactly one invariant distribution does not necessarily imply that the process
has an equilibrium distribution, though! Furthermore, on applying the result in Chapter 4,

• If the process is irreducible - existence of invariant distribution implies existence of equilibrium distri-
bution.
• If the process is NOT irreducible - cannot yet tell if existence of an invariant distribution implies the
existence of an equilibrium distribution.

First of all, we’ll search for conditions under which an invariant distribution exists for a birth-death process.
Once this has been established, we will show that for one particular non-irreducible birth-death process
(for which exactly one invariant distribution exists) the invariant distribution is in fact an equilibrium
distribution.

So when does a birth-death process have an invariant distribution?

Solving πQ = 0 gives

−λ0 π0 + µ1 π1 = 0
λj−1 πj−1 − (λj + µj ) πj + µj+1 πj+1 = 0, j = 1, 2, . . . .

The first equation gives

π1 = (λ0 / µ1) π0.

Solving the second set of equations iteratively gives

πj = (λj−1 · · · λ0) / (µj · · · µ1) π0,   j = 2, 3, . . . .   (9)

Therefore, π defined by

π1 = (λ0 / µ1) π0;
πj = (λj−1 · · · λ0) / (µj · · · µ1) π0,   for j = 2, 3, . . .,

satisfies πQ = 0.

For π to be a probability distribution it must also satisfy Σ_{j=0}^{∞} πj = 1:

π0 ( 1 + Σ_{j=1}^{∞} (λj−1 · · · λ0) / (µj · · · µ1) ) = 1,

π0 = ( 1 + Σ_{j=1}^{∞} (λj−1 · · · λ0) / (µj · · · µ1) )^{−1}.   (10)

These equations define a probability distribution if and only if the sum in (10) is finite. If the state space S
is finite, then the sum in (10) must be finite.

In summary, if (10) is finite, then an invariant distribution exists. This is guaranteed for birth-death
processes with finite state-space.
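
Optional numerical aside: equations (9) and (10) translate directly into a numerical recipe. The sketch below, assuming NumPy and with made-up rates, computes the invariant distribution of an immigration-death process (λn = α, µn = nµ), truncating the sum in (10) for numerical work; for this particular process the answer agrees with a Poisson(α/µ) distribution (compare Exercise 5.12(d)).

```python
import numpy as np
from math import factorial

# Immigration-death process: lambda_n = alpha, mu_n = n * mu (rates chosen arbitrarily)
alpha, mu, n_max = 2.0, 0.5, 60       # n_max truncates the infinite sum in (10)

birth = lambda n: alpha
death = lambda n: n * mu

# Unnormalised weights from equation (9): prod_{k=1}^{j} lambda_{k-1} / mu_k
w = np.ones(n_max)
for j in range(1, n_max):
    w[j] = w[j - 1] * birth(j - 1) / death(j)

pi = w / w.sum()                       # normalising as in equation (10)

# For this process the invariant distribution is Poisson(alpha/mu); compare the first few terms
rho = alpha / mu
poisson = np.array([rho**j * np.exp(-rho) / factorial(j) for j in range(6)])
print(pi[:6])
print(poisson)
```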

If an invariant distribution exists, is it an equilibrium distribution?

Notice that if the sum in (10) diverges, an invariant distribution (and therefore an equilibrium distribution) cannot
exist.

If the process is irreducible, the equilibrium and invariant distributions coincide (if the latter exists).

What if the process is not irreducible? Can an equilibrium distribution exist? The answer is yes. We will
now look for conditions under which an equilibrium distribution exists for a linear birth-death process
only (we will not look at other types of non-irreducible birth-death processes). This equilibrium distribution
will be linked to the probability of extinction for the process, which we look at next.

5.2.6 The probability of extinction for linear birth-death process


The linear birth-death process is not irreducible because state 0 does not communicate with any other state.
Note that for a linear birth-death process, 0 is an absorbing state: once the process reaches 0, there are no
individuals in the population to give birth and the population is therefore ‘extinct’. It is interesting in its
own right to study this probability, but we’ll be interested in it because of its link with the invariant and
equilibrium distribution for this process.

How are ‘extinction probability’ and invariant distributions linked in this case?
For linear birth-death processes, once extinction (reaching state 0) occurs, we stay in this state forever.
Therefore, π = (1, 0, 0, . . .) is an invariant distribution for the process.

Under what circumstances is π = (1, 0, 0, . . .) also an equilibrium distribution? This is equivalent to asking
‘under which conditions does the population die out (become extinct) with probability 1?’. Alternatively,
‘under what circumstances is it certain that the chain will reach the state 0?’.

An elegant way of solving this is via generating functions. Let N (t) denote the state of the linear birth-
death process at time t (i.e. the number of individuals in the population at time t). Then
G(s, t) = E[ s^N(t) ] = Σ_{n=0}^{∞} s^n P(N(t) = n).

It can be shown that

G(s, t) = [ ( µ(1 − s) − (µ − λs) e^{−(λ−µ)t} ) / ( λ(1 − s) − (µ − λs) e^{−(λ−µ)t} ) ]^{N(0)}   if λ ≠ µ,

G(s, t) = [ ( λt(1 − s) + s ) / ( λt(1 − s) + 1 ) ]^{N(0)}   if λ = µ.

Now, the probability that extinction occurs at or before time t is given by P(N (t) = 0), which can be
extracted from the generating function via

G(0, t) = Σ_{n=0}^{∞} 0^n P(N(t) = n) = P(N(t) = 0),

so that

G(0, t) = [ ( µ − µ exp(−(λ−µ)t) ) / ( λ − µ exp(−(λ−µ)t) ) ]^{N(0)}   if λ ≠ µ,

G(0, t) = [ λt / (1 + λt) ]^{N(0)}   if λ = µ.

If we let t → ∞ we get

P(eventual extinction) = 1   if λ ≤ µ,
P(eventual extinction) = (µ/λ)^{N(0)}   if λ > µ.

What does all this mean?

If the birth rate is no larger than the death rate, the population is certain to become extinct eventually.

This implies that π = (1, 0, 0, . . .) is an equilibrium distribution if λ ≤ µ.

If, on the other hand, λ > µ, then the expected population size increases to ∞ as t → ∞. You can see this
by using G(s, t) to show that

E[N(t)] = N(0) exp((λ − µ)t).
However, even under these conditions, extinction is still possible and so the population size will either
increase without limit, or become extinct. As there is uncertainty as to how the process behaves in the
future, there is no equilibrium distribution.
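
Optional numerical aside: the extinction probability can also be checked by simulation, assuming NumPy (the rates, starting size and cap below are arbitrary). For linear rates nλ and nµ the embedded jump chain takes an upward step with probability λ/(λ + µ) regardless of the current size, so following the jump chain until it hits 0 (extinction) or a large cap (treated as escaping extinction) is enough.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu, n0, reps, cap = 1.5, 1.0, 2, 10_000, 200   # arbitrary rates with lam > mu; cap treats a large
                                                    # population as having escaped extinction

def goes_extinct(lam, mu, n, rng, cap):
    """Follow the embedded jump chain of a linear birth-death process started at n."""
    while 0 < n < cap:
        # next event is a birth with probability lam/(lam+mu), a death otherwise
        n += 1 if rng.random() < lam / (lam + mu) else -1
    return n == 0

est = np.mean([goes_extinct(lam, mu, n0, rng, cap) for _ in range(reps)])
print(est, (mu / lam) ** n0)    # estimate vs (mu/lambda)^{N(0)} = (2/3)^2, roughly 0.44
```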

5.3 Exercises
The next exercises are exam-type questions. They are hard, but will test your understanding of birth-death
processes.

Exercise 5.11
Suppose that arrivals to a utopian island occur as a Poisson process with rate λ per year. Those who
arrive on the island can never leave, but each person on the island at time t has probability µh + o(h)
of dying in the interval (t, t + h] years. Any individual on the island at time t has probability
θh + o(h) of giving birth in the interval (t, t + h] years.

Arrivals, births and deaths are independent and the probability of two or more of these occurring in
(t, t + h] is o(h). The parameters λ, µ, θ are all positive. Let Xt denote the population size of the
island by time t.

(a) Give the generator matrix for the process X.


(b) Given that Xt = 0, state the distribution of the time until the next immigration, stating its
mean.
(c) Given that Xt = i, consider Ti , the time that it takes until the population size is (i + 1) for the
first time since time t. Justify that
E[Ti | next event is a birth or immigration] = 1 / (λ + iθ + iµ);
E[Ti | next event is a death] = 1 / (λ + iθ + iµ) + E[Ti−1] + E[Ti].

(d) Using the expressions given in part (c), calculate the expected time until the population reaches
size (i + 1) for the first time, for i = 0, 1, 2.
(e) Given that Xt = 0, give an expression for how long, on average, it takes for the population to
reach size 2 for the first time.

Exercise 5.12
The population of an island evolves according to an immigration-death process. Specifically, the
population evolves according to the following rules.

People immigrate onto the island in a Poisson process of rate α per year. Each person on the
island at time t has probability µh + o(h) of dying in the time interval (t, t + h] years. There is no
emigration from the island and there are no births.

Immigrations and deaths are independent and the probability of two or more events (immigrations
or deaths) in (t, t + h] is o(h). The parameters α and µ are positive.

Let N (t) denote the number of people on the island at time t.


(a) Write down the state space S and the generator matrix Q of transition rates of the continuous-
time Markov chain {N (t), t ≥ 0}.
(b) Let {Yn , n = 0, 1, 2, . . .} be the embedded discrete-time jump chain of {N (t); t ≥ 0}.

(i) Write down the transition matrix P of {Yn ; n = 0, 1, 2, . . .}.


(ii) Let Ti be the holding time in state i, that is, the time spent in state i. State the distribution
of Ti , for i = 0, 1, 2, . . ..
(c) Suppose that N (0) = 0. Calculate the expected time until there are two people on the island.

(d) You are given that, if N(0) = 0, the probability generating function G(s, t) = E[ s^N(t) ] of N(t)
is given by

G(s, t) = exp( (α/µ)(1 − e^{−µt})(s − 1) ).
(i) Use G(s, t) to infer the distribution of N (t) | N (0) = 0.
(ii) Hence, or otherwise, infer the equilibrium distribution of {N (t); t ≥ 0}.

(e) Suppose now that the island has an active volcano. When the volcano erupts the whole
population of the island is killed, reducing the population size to zero. Eruptions occur in a
Poisson process of rate φ per year. Otherwise the population of the island evolves according to
the immigration-death process described above.

Using your answer to (d)(i), or otherwise, calculate the expected number of people killed by the
next eruption to occur after time 0, given that N (0) = 0.

