
Approximate Inference

9.520 Class 19

Ruslan Salakhutdinov
BCS and CSAIL, MIT
Plan
1. Introduction/Notation.

2. Examples of successful Bayesian models.

3. Laplace and Variational Inference.

4. Basic Sampling Algorithms.

5. Markov chain Monte Carlo algorithms.

References/Acknowledgements
• Chris Bishop's book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book).

• David MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32.

• Radford Neal's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods.

• Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning:
  http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html

• Ian Murray's tutorial on Sampling Methods:
  http://www.cs.toronto.edu/~murray/teaching/
Basic Notation
P(x)      probability of x
P(x|θ)    conditional probability of x given θ
P(x, θ)   joint probability of x and θ

Bayes Rule:

    P(θ|x) = P(x|θ)P(θ) / P(x)

where

    P(x) = ∫ P(x, θ) dθ    (Marginalization)

I will use probability distribution and probability density interchangeably. It should be obvious from the context.
Inference Problem
Given a dataset D = {x1, ..., xn}:

Bayes Rule:

    P(θ|D) = P(D|θ)P(θ) / P(D)

where

    P(D|θ)    Likelihood function of θ
    P(θ)      Prior probability of θ
    P(θ|D)    Posterior distribution over θ

Computing the posterior distribution is known as the inference problem.

But:

    P(D) = ∫ P(D, θ) dθ

This integral can be very high-dimensional and difficult to compute.
Prediction
Bayes Rule:

    P(θ|D) = P(D|θ)P(θ) / P(D)

where P(D|θ) is the likelihood function of θ, P(θ) is the prior probability of θ, and P(θ|D) is the posterior distribution over θ.

Prediction: Given D, computing the conditional probability of x* requires computing the following integral:

    P(x*|D) = ∫ P(x*|θ, D) P(θ|D) dθ
            = E_{P(θ|D)}[ P(x*|θ, D) ]

which is sometimes called the predictive distribution.

Computing the predictive distribution requires the posterior P(θ|D).
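To make these integrals concrete, here is a minimal Python sketch (not from the slides) of the one standard case where everything is tractable: a Beta prior over a coin's bias with Bernoulli observations, where both the posterior and the predictive distribution have closed forms. The prior parameters and the data are made-up illustrations.

    # Beta-Bernoulli: posterior and predictive are available in closed form.
    a, b = 1.0, 1.0                     # Beta(a, b) prior over theta = P(x = 1)
    data = [1, 0, 1, 1, 0, 1, 1, 1]     # hypothetical observed coin flips D
    h = sum(data)                       # number of heads
    t = len(data) - h                   # number of tails

    # Posterior: P(theta|D) = Beta(a + h, b + t)
    post_a, post_b = a + h, b + t

    # Predictive: P(x* = 1|D) = E_{P(theta|D)}[theta] = post_a / (post_a + post_b)
    print(post_a / (post_a + post_b))   # 0.7 for this data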
Model Selection
Compare model classes, e.g. M1 and M2. Need to compute posterior
probabilities given D:
    P(M|D) = P(D|M)P(M) / P(D)

where

    P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ

is known as the marginal likelihood or evidence.
Computational Challenges
• Computing marginal likelihoods often requires computing very high-dimensional integrals.

• Computing posterior distributions (and hence predictive distributions) is often analytically intractable.

• In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference.

• First, let us look at some specific examples:
  – Bayesian Probabilistic Matrix Factorization
  – Bayesian Neural Networks
  – Dirichlet Process Mixtures (last class)
Bayesian PMF
[Figure: a sparse N × M rating matrix R (users × movies, entries such as 1-5 with many "?" missing values), factorized as R ≈ UᵀV.]

We have N users, M movies, and integer rating values from 1 to K.

Let r_ij be the rating of user i for movie j, and let U ∈ R^{D×N} and V ∈ R^{D×M} be the latent user and movie feature matrices:

    R ≈ UᵀV

Goal: Predict missing ratings.
Bayesian PMF
[Figure: graphical model with hyperpriors α_U, α_V over hyperparameters Θ_U, Θ_V, latent features U_i (i = 1, ..., N) and V_j (j = 1, ..., M), observed ratings R_ij, and noise parameter σ.]

Probabilistic linear model with Gaussian observation noise. Likelihood:

    p(r_ij | u_i, v_j, σ²) = N(r_ij | u_iᵀv_j, σ²)

Gaussian priors over parameters:

    p(U | µ_U, Σ_U) = ∏_{i=1}^{N} N(u_i | µ_U, Σ_U),
    p(V | µ_V, Σ_V) = ∏_{j=1}^{M} N(v_j | µ_V, Σ_V).

Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters Θ_U = {µ_U, Σ_U} and Θ_V = {µ_V, Σ_V}.

Hierarchical Prior.
Bayesian PMF

Predictive distribution: Consider predicting a rating r*_ij for user i and query movie j:

    p(r*_ij | R) = ∫∫ p(r*_ij | u_i, v_j) p(U, V, Θ_U, Θ_V | R) d{U, V} d{Θ_U, Θ_V}

where p(U, V, Θ_U, Θ_V | R) is the posterior over parameters and hyperparameters.

Exact evaluation of this predictive distribution is analytically intractable.

The posterior distribution p(U, V, Θ_U, Θ_V | R) is complicated and does not have a closed-form expression.

Need to approximate.
Bayesian Neural Nets
Regression problem: Given a set of i.i.d. observations X = {x_n}, n = 1, ..., N, with corresponding targets D = {t_n}, n = 1, ..., N.

Likelihood:

    p(D | X, w) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β²)

The mean is given by the output of the neural network:

    y_k(x, w) = Σ_{j=0}^{M} w_{kj}^(2) σ( Σ_{i=0}^{D} w_{ji}^(1) x_i )

where σ(x) is the sigmoid function.

Gaussian prior over the network parameters: p(w) = N(0, α²I).
Bayesian Neural Nets
Likelihood:

    p(D | X, w) = ∏_{n=1}^{N} N(t_n | y(x_n, w), β²)

Gaussian prior over parameters:

    p(w) = N(0, α²I)

Posterior is analytically intractable:

    p(w | D, X) = p(D | w, X) p(w) / ∫ p(D | w, X) p(w) dw

Remark: Under certain conditions, Radford Neal (1994) showed that, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.
Undirected Models
x is a binary random vector with x_i ∈ {+1, −1}:

    p(x) = (1/Z) exp( Σ_{(i,j)∈E} θ_ij x_i x_j + Σ_{i∈V} θ_i x_i )

where Z is known as the partition function:

    Z = Σ_x exp( Σ_{(i,j)∈E} θ_ij x_i x_j + Σ_{i∈V} θ_i x_i )

If x is 100-dimensional, we need to sum over 2^100 terms.

The sum might decompose (e.g. junction tree). Otherwise we need to approximate.

Remark: Compare to the marginal likelihood.
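For a tiny model the sum can be done by brute force, which makes the blow-up explicit. A minimal sketch, assuming an illustrative chain graph with made-up parameters θ:

    import itertools, math

    # Brute-force partition function of a tiny Ising-style chain model.
    # The graph, the couplings, and n = 10 are illustrative assumptions.
    n = 10
    edges = [(i, i + 1) for i in range(n - 1)]  # chain: decomposable, but we ignore that
    theta_edge, theta_node = 0.5, 0.1

    Z = 0.0
    for x in itertools.product([-1, +1], repeat=n):  # 2^n configurations
        s = sum(theta_edge * x[i] * x[j] for i, j in edges) + \
            sum(theta_node * xi for xi in x)
        Z += math.exp(s)
    print(Z)  # exact, but the loop runs 2^n times -- hopeless for n = 100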
Inference
For most situations we will be interested in evaluating the expectation:

    E[f] = ∫ f(z) p(z) dz

We will use the following notation: p(z) = p̃(z)/Z.

We can evaluate p̃(z) pointwise, but cannot evaluate Z.

• Posterior distribution: P(θ|D) = (1/P(D)) P(D|θ)P(θ)

• Markov random fields: P(z) = (1/Z) exp(−E(z))
Laplace Approximation
Consider:

    p(z) = p̃(z)/Z

Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z).

[Figure: a skewed one-dimensional density with a Gaussian approximation centered at its mode.]

At a stationary point z_0 the gradient ∇p̃(z) vanishes. Consider a Taylor expansion of ln p̃(z):

    ln p̃(z) ≈ ln p̃(z_0) − (1/2)(z − z_0)ᵀ A (z − z_0)

where A is the Hessian matrix:

    A = −∇∇ ln p̃(z) |_{z = z_0}
Laplace Approximation
Consider:

    p(z) = p̃(z)/Z

Goal: Find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z).

Exponentiating both sides:

    p̃(z) ≈ p̃(z_0) exp( −(1/2)(z − z_0)ᵀ A (z − z_0) )

We get a multivariate Gaussian approximation:

    q(z) = ( |A|^{1/2} / (2π)^{D/2} ) exp( −(1/2)(z − z_0)ᵀ A (z − z_0) )
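A minimal one-dimensional sketch of the recipe, using the unnormalized Gamma-shaped density p̃(z) = z^a exp(−bz) so that the mode, the Hessian, and the exact normalizer are all available in closed form (a and b are illustrative choices):

    import math

    # Laplace approximation of Z for p~(z) = z^a exp(-b z), z > 0.
    a, b = 5.0, 2.0                       # illustrative shape and rate

    # Mode: d/dz ln p~(z) = a/z - b = 0  =>  z0 = a/b
    z0 = a / b
    # A = -d^2/dz^2 ln p~(z) at z0 = a / z0^2
    A = a / z0**2

    p_tilde_z0 = z0**a * math.exp(-b * z0)
    Z_laplace = p_tilde_z0 * math.sqrt(2.0 * math.pi / A)  # p~(z0) (2 pi / A)^{1/2}
    Z_exact = math.gamma(a + 1.0) / b**(a + 1.0)           # known Gamma-integral answer
    print(Z_laplace, Z_exact)  # ~1.84 vs 1.875: close, and closer as the mode sharpens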
Laplace Approximation
Remember p(z) = p̃(z)/Z, where we approximate:

    Z = ∫ p̃(z) dz ≈ p̃(z_0) ∫ exp( −(1/2)(z − z_0)ᵀ A (z − z_0) ) dz = p̃(z_0) (2π)^{D/2} / |A|^{1/2}

Bayesian Inference: P(θ|D) = (1/P(D)) P(D|θ)P(θ).

Identify p̃(θ) = P(D|θ)P(θ) and Z = P(D):

• The posterior is approximately Gaussian around the MAP estimate θ_MAP:

    p(θ|D) ≈ ( |A|^{1/2} / (2π)^{D/2} ) exp( −(1/2)(θ − θ_MAP)ᵀ A (θ − θ_MAP) )
Laplace Approximation
Remember p(z) = p̃(z)/Z, where we approximate:

    Z = ∫ p̃(z) dz ≈ p̃(z_0) ∫ exp( −(1/2)(z − z_0)ᵀ A (z − z_0) ) dz = p̃(z_0) (2π)^{D/2} / |A|^{1/2}

Bayesian Inference: P(θ|D) = (1/P(D)) P(D|θ)P(θ).

Identify p̃(θ) = P(D|θ)P(θ) and Z = P(D):

• Can approximate the model evidence:

    P(D) = ∫ P(D|θ) P(θ) dθ

• Using the Laplace approximation:

    ln P(D) ≈ ln P(D|θ_MAP) + [ ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A| ]

where the bracketed term is the Occam factor: it penalizes model complexity.
Bayesian Information Criterion
BIC can be obtained from the Laplace approximation:

    ln P(D) ≈ ln P(D|θ_MAP) + ln P(θ_MAP) + (D/2) ln 2π − (1/2) ln |A|

by taking the large sample limit (N → ∞), where N is the number of data points:

    ln P(D) ≈ ln P(D|θ_MAP) − (1/2) D ln N

• Quick, easy, and does not depend on the prior.

• Can use the maximum likelihood estimate of θ instead of the MAP estimate.

• D denotes the number of "well-determined parameters".

• Danger: Counting parameters can be tricky (e.g. infinite models).
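As a small usage sketch: comparing two fitted models by BIC needs only each model's maximized log-likelihood, its parameter count D, and N. All numbers below are made up.

    import math

    def bic_log_evidence(log_lik, D, N):
        # ln P(D) ~= ln P(D|theta) - (1/2) D ln N
        return log_lik - 0.5 * D * math.log(N)

    N = 1000
    # Hypothetical fits: model 2 fits slightly better but uses far more parameters.
    m1 = bic_log_evidence(log_lik=-2104.0, D=5, N=N)
    m2 = bic_log_evidence(log_lik=-2098.0, D=25, N=N)
    print(m1, m2)  # m1 wins: 6 extra nats of fit do not pay for 20 extra parameters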
Variational Inference
Key Idea: Approximate intractable distribution p(θ|D) with simpler, tractable
distribution q(θ).
We can lower bound the marginal likelihood using Jensen's inequality:

    ln p(D) = ln ∫ p(D, θ) dθ = ln ∫ q(θ) ( p(D, θ)/q(θ) ) dθ

            ≥ ∫ q(θ) ln ( p(D, θ)/q(θ) ) dθ = ∫ q(θ) ln p(D, θ) dθ + ∫ q(θ) ln (1/q(θ)) dθ

where the second term is the entropy functional, and the right-hand side is the variational lower bound L(q), which satisfies

    L(q) = ln p(D) − KL( q(θ) || p(θ|D) )

where KL(q||p) is the Kullback-Leibler divergence. It is a non-symmetric measure of the difference between two probability distributions q and p.

The goal of variational inference is to maximize the variational lower bound w.r.t. the approximate distribution q, or equivalently to minimize KL(q||p).
Variational Inference
Key Idea: Approximate intractable distribution p(θ|D) with simpler, tractable
distribution q(θ) by minimizing KL(q(θ)||p(θ|D)).
We can choose a fully factorized distribution: q(θ) = ∏_{i=1}^{D} q_i(θ_i), also known as a mean-field approximation.

The variational lower bound takes the form:

    L(q) = ∫ q(θ) ln p(D, θ) dθ + ∫ q(θ) ln (1/q(θ)) dθ

         = ∫ q_j(θ_j) [ ∫ ln p(D, θ) ∏_{i≠j} q_i(θ_i) dθ_i ] dθ_j + Σ_i ∫ q_i(θ_i) ln (1/q_i(θ_i)) dθ_i

where the term in square brackets is E_{i≠j}[ ln p(D, θ) ].

Suppose we keep {q_i, i ≠ j} fixed and maximize L(q) w.r.t. all possible forms for the distribution q_j(θ_j).
Variational Approximation
[Figure: the original distribution (yellow), along with the Laplace (red) and variational (green) approximations.]

By maximizing L(q) w.r.t. all possible forms for the distribution q_j(θ_j) we obtain a general expression:

    q_j*(θ_j) = exp( E_{i≠j}[ln p(D, θ)] ) / ∫ exp( E_{i≠j}[ln p(D, θ)] ) dθ_j

Iterative Procedure: Initialize all q_j and then iterate through the factors, replacing each in turn with a revised estimate.

Convergence is guaranteed as the bound is convex w.r.t. each of the factors q_j (see Bishop, chapter 10).
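A minimal sketch of this iterative procedure on the factorized approximation of a two-dimensional Gaussian (the worked example in Bishop, chapter 10), where each factor update has a closed form; the target mean and precision below are illustrative:

    import numpy as np

    # Mean-field VI for a 2-D Gaussian target p(z) = N(z | mu, inv(Lam)).
    # With q(z) = q1(z1) q2(z2), each optimal factor is Gaussian and its
    # mean update has a closed form (Bishop, section 10.1.2):
    #   m1 = mu1 - (Lam12 / Lam11) (m2 - mu2)
    #   m2 = mu2 - (Lam21 / Lam22) (m1 - mu1)
    mu = np.array([1.0, -1.0])                # illustrative target mean
    Lam = np.array([[2.0, 0.8],
                    [0.8, 1.5]])              # illustrative precision matrix

    m = np.zeros(2)                           # initialize the factor means
    for _ in range(50):                       # iterate the coordinate updates
        m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
        m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])
    print(m)  # converges to mu; the factor variances 1/Lam_ii underestimate
              # the true marginals, a known property of minimizing KL(q||p)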
Inference: Recap
For most situations we will be interested in evaluating the expectation:

    E[f] = ∫ f(z) p(z) dz

We will use the following notation: p(z) = p̃(z)/Z.

We can evaluate p̃(z) pointwise, but cannot evaluate Z.

• Posterior distribution: P(θ|D) = (1/P(D)) P(D|θ)P(θ)

• Markov random fields: P(z) = (1/Z) exp(−E(z))
Simple Monte Carlo
General Idea: Draw independent samples {z^(1), ..., z^(N)} from the distribution p(z) to approximate the expectation:

    E[f] = ∫ f(z) p(z) dz ≈ (1/N) Σ_{n=1}^{N} f(z^(n)) = f̂

Note that E[f̂] = E[f], so the estimator f̂ has the correct mean (it is unbiased).

The variance:

    var[f̂] = (1/N) E[ (f − E[f])² ]

Remark: The accuracy of the estimator does not depend on the dimensionality of z.
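A minimal sketch, assuming a target we can already sample from directly (a standard Gaussian) and f(z) = z², whose true expectation is 1:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000
    z = rng.standard_normal(N)    # independent samples z^(n) ~ p(z) = N(0, 1)
    f_hat = np.mean(z**2)         # (1/N) sum_n f(z^(n)) with f(z) = z^2
    print(f_hat)                  # ~1.0; the error shrinks like 1/sqrt(N),
                                  # independently of the dimension of z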
Simple Monte Carlo
In general:

    ∫ f(z) p(z) dz ≈ (1/N) Σ_{n=1}^{N} f(z^(n)),    z^(n) ∼ p(z)

Predictive distribution:

    P(x*|D) = ∫ P(x*|θ, D) P(θ|D) dθ
            ≈ (1/N) Σ_{n=1}^{N} P(x*|θ^(n), D),    θ^(n) ∼ p(θ|D)

Problem: It is hard to draw exact samples from p(z).
Basic Sampling Algorithm
How can we generate samples from simple non-uniform distributions, assuming we can generate samples from the uniform distribution?

Define: h(y) = ∫_{−∞}^{y} p(ŷ) dŷ

Sample: z ∼ U[0, 1].

Then: y = h⁻¹(z) is a sample from p(y).

Problem: Computing the cumulative h(y) is just as hard!
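One case where h⁻¹(z) is available in closed form is the exponential distribution p(y) = λ exp(−λy), a standard textbook instance; λ below is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0                           # rate of the exponential target
    z = rng.uniform(0.0, 1.0, 100_000)  # z ~ U[0, 1]
    # h(y) = 1 - exp(-lam y), so y = h^{-1}(z) = -ln(1 - z) / lam
    y = -np.log1p(-z) / lam
    print(y.mean())                     # ~ 1/lam = 0.5, as expected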
Rejection Sampling
Sampling from the target distribution p(z) = p̃(z)/Z_p is difficult. Suppose we have an easy-to-sample proposal distribution q(z), such that kq(z) ≥ p̃(z) for all z.

Sample z_0 from q(z).

Sample u_0 from Uniform[0, kq(z_0)].

The pair (z_0, u_0) has a uniform distribution under the curve of kq(z).

If u_0 > p̃(z_0), the sample is rejected; otherwise z_0 is retained as a sample from p(z).
Rejection Sampling
The probability that a sample is accepted is:

    p(accept) = ∫ ( p̃(z) / kq(z) ) q(z) dz = (1/k) ∫ p̃(z) dz

The fraction of accepted samples depends on the ratio of the areas under p̃(z) and kq(z).

It is hard to find an appropriate q(z) with an optimal k.

Rejection sampling is a useful technique in one or two dimensions, and is typically applied as a subroutine in more advanced algorithms.
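A minimal sketch with the unnormalized target p̃(z) = z(1 − z) on [0, 1] and a uniform proposal, where k = max_z p̃(z) = 1/4 can be read off directly (the target is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(z):                       # unnormalized Beta(2, 2)-shaped target
        return z * (1.0 - z)

    k = 0.25                              # k q(z) >= p~(z) with q = Uniform[0, 1]
    z0 = rng.uniform(0.0, 1.0, 100_000)   # proposals from q(z)
    u0 = rng.uniform(0.0, k, 100_000)     # heights under the curve k q(z0)
    samples = z0[u0 <= p_tilde(z0)]       # keep the points that fall under p~(z)

    print(samples.mean())                 # ~0.5, the target mean
    print(len(samples) / len(z0))         # accept rate = (1/k) * integral of p~ = 2/3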
Importance Sampling
Suppose we have an easy-to-sample proposal distribution q(z), such
that q(z) > 0 if p(z) > 0.
    E[f] = ∫ f(z) p(z) dz

         = ∫ f(z) ( p(z)/q(z) ) q(z) dz

         ≈ (1/N) Σ_{n=1}^{N} ( p(z^(n))/q(z^(n)) ) f(z^(n)),    z^(n) ∼ q(z)

The quantities w^(n) = p(z^(n))/q(z^(n)) are known as importance weights. Unlike rejection sampling, all samples are retained.

But wait: we cannot compute p(z), only p̃(z).
Importance Sampling
Let our proposal be of the form q(z) = q̃(z)/Z_q:

    E[f] = ∫ f(z) p(z) dz = ∫ f(z) ( p(z)/q(z) ) q(z) dz = (Z_q/Z_p) ∫ f(z) ( p̃(z)/q̃(z) ) q(z) dz

         ≈ (Z_q/Z_p) (1/N) Σ_{n=1}^{N} ( p̃(z^(n))/q̃(z^(n)) ) f(z^(n)) = (Z_q/Z_p) (1/N) Σ_{n=1}^{N} w^(n) f(z^(n)),    z^(n) ∼ q(z)

But we can use the same importance weights to approximate Z_p/Z_q:

    Z_p/Z_q = (1/Z_q) ∫ p̃(z) dz = ∫ ( p̃(z)/q̃(z) ) q(z) dz ≈ (1/N) Σ_{n=1}^{N} w^(n)

Hence:

    E[f] ≈ Σ_{n=1}^{N} ( w^(n) / Σ_{m=1}^{N} w^(m) ) f(z^(n))    Consistent, but biased.
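A minimal sketch of the resulting self-normalized estimator: an unnormalized Gaussian target p̃(z) = exp(−z²/2) with a wider normalized Gaussian proposal (so Z_q = 1); the proposal width is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(z):                       # unnormalized N(0, 1) target
        return np.exp(-0.5 * z**2)

    sigma_q = 3.0                         # proposal N(0, sigma_q^2), wider than target
    z = rng.normal(0.0, sigma_q, 100_000)
    q = np.exp(-0.5 * (z / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))
    w = p_tilde(z) / q                    # importance weights w^(n) = p~(z^n)/q(z^n)

    print(np.sum(w * z**2 / w.sum()))     # self-normalized estimate of E[z^2], ~1.0
    print(w.mean())                       # (1/N) sum w^(n) ~ Z_p/Z_q = sqrt(2 pi) ~ 2.507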
Problems
If our proposal distribution q(z) poorly matches our target distribution p(z), then:

• Rejection Sampling: almost always rejects.

• Importance Sampling: has large, possibly infinite, variance (an unreliable estimator).

For high-dimensional problems, finding good proposal distributions is very hard. What can we do?

Markov chain Monte Carlo.
Markov Chains
A first-order Markov chain: a series of random variables {z^(1), ..., z^(N)} such that the following conditional independence property holds for n ∈ {1, ..., N − 1}:

    p(z^(n+1) | z^(1), ..., z^(n)) = p(z^(n+1) | z^(n))

We can specify a Markov chain by:

• the probability distribution for the initial state p(z^(1));

• the conditional probabilities for subsequent states, in the form of transition probabilities T(z^(n+1) ← z^(n)) ≡ p(z^(n+1) | z^(n)).

Remark: T(z^(n+1) ← z^(n)) is sometimes called a transition kernel.
Markov Chains
The marginal probability of a particular state can be computed as:

    p(z^(n+1)) = Σ_{z^(n)} T(z^(n+1) ← z^(n)) p(z^(n))

A distribution π(z) is said to be invariant or stationary with respect to a Markov chain if each step in the chain leaves π(z) invariant:

    π(z) = Σ_{z'} T(z ← z') π(z')

A given Markov chain may have many stationary distributions. For example, T(z ← z') = I{z = z'} is the identity transformation, and then any distribution is invariant.
Detailed Balance
A sufficient (but not necessary) condition for ensuring that π(z) is invariant is to choose a transition kernel that satisfies the detailed balance property:

    π(z') T(z ← z') = π(z) T(z' ← z)

A transition kernel that satisfies detailed balance will leave that distribution invariant:

    Σ_{z'} π(z') T(z ← z') = Σ_{z'} π(z) T(z' ← z)
                           = π(z) Σ_{z'} T(z' ← z) = π(z)

A Markov chain that satisfies detailed balance is said to be reversible.
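A minimal numerical sketch on a three-state chain: build a Metropolis-style transition kernel for an arbitrary target π and verify both detailed balance and invariance (the target values are illustrative):

    import numpy as np

    pi = np.array([0.2, 0.5, 0.3])      # illustrative target distribution

    # Metropolis kernel on 3 states with a uniform symmetric proposal:
    # propose j != i with prob 1/2 each, accept with prob min(1, pi_j / pi_i).
    T = np.zeros((3, 3))                # T[j, i] = T(j <- i)
    for i in range(3):
        for j in range(3):
            if i != j:
                T[j, i] = 0.5 * min(1.0, pi[j] / pi[i])
        T[i, i] = 1.0 - T[:, i].sum()   # rejected proposals stay put

    # Detailed balance: pi_i T(j <- i) == pi_j T(i <- j) for all i, j.
    print(np.allclose(pi * T, (pi * T).T))  # True
    # Invariance: applying one step to pi returns pi.
    print(np.allclose(T @ pi, pi))          # True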
Recap
We want to sample from a target distribution π(z) = π̃(z)/Z (e.g. a posterior distribution).

Obtaining independent samples is difficult.

• Set up a Markov chain with a transition kernel T(z' ← z) that leaves our target distribution π(z) invariant.

• If the chain is ergodic, i.e. it is possible to get from every state to any other state (not necessarily in one move), then the chain will converge to this unique invariant distribution π(z).

• We obtain dependent samples drawn approximately from π(z) by simulating a Markov chain for some time.

Ergodicity: There exists K such that, for any starting z, T^K(z' ← z) > 0 for all z' with π(z') > 0.
Metropolis-Hastings Algorithm

A Markov chain transition operator from the current state z to a new state z' is defined as follows:

• A new 'candidate' state z* is proposed according to some proposal distribution q(z*|z), e.g. N(z, σ²).

• The candidate state z* is accepted with probability:

    min( 1, ( π̃(z*) q(z|z*) ) / ( π̃(z) q(z*|z) ) )

• If accepted, set z' = z*. Otherwise z' = z, i.e. the next state is a copy of the current state.

Note: there is no need to know the normalizing constant Z.
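A minimal sketch of the algorithm for a one-dimensional unnormalized target with a symmetric Gaussian random-walk proposal, so the q ratio cancels; the target (a two-component mixture) and step size are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def pi_tilde(z):
        # Unnormalized target: a two-component Gaussian mixture.
        return np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)

    sigma = 1.0                                  # random-walk step size (tune this)
    z, samples = 0.0, []
    for t in range(50_000):
        z_star = rng.normal(z, sigma)            # propose from q(z*|z) = N(z, sigma^2)
        # Symmetric proposal: the q ratio cancels in the acceptance probability.
        if rng.uniform() < pi_tilde(z_star) / pi_tilde(z):
            z = z_star                           # accept; otherwise keep a copy of z
        samples.append(z)
    print(np.mean(samples))                      # ~2/3, the mean of this mixture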
Metropolis-Hastings Algorithm

We can show that the M-H transition kernel leaves π(z) invariant by showing that it satisfies detailed balance:

    π(z) T(z' ← z) = π(z) q(z'|z) min( 1, ( π(z') q(z|z') ) / ( π(z) q(z'|z) ) )
                   = min( π(z) q(z'|z), π(z') q(z|z') )
                   = π(z') min( ( π(z) q(z'|z) ) / ( π(z') q(z|z') ), 1 )
                   = π(z') T(z ← z')

Note that whether the chain is ergodic will depend on the particulars of π and the proposal distribution q.
Metropolis-Hastings Algorithm

[Figure: using the Metropolis algorithm to sample from a Gaussian distribution with proposal q(z'|z) = N(z, 0.04); accepted moves shown in green, rejected moves in red.]
Choice of Proposal
Proposal distribution: q(z'|z) = N(z, ρ²).

ρ large - many rejections.

ρ small - the chain moves too slowly.

The specific choice of proposal can greatly affect the performance of the algorithm.
Gibbs Sampler
Consider sampling from p(z_1, ..., z_N).

Initialize z_i, i = 1, ..., N.

For t = 1, ..., T:
    Sample z_1^(t+1) ∼ p(z_1 | z_2^t, ..., z_N^t)
    Sample z_2^(t+1) ∼ p(z_2 | z_1^(t+1), z_3^t, ..., z_N^t)
    ...
    Sample z_N^(t+1) ∼ p(z_N | z_1^(t+1), ..., z_{N−1}^(t+1))

The Gibbs sampler is a particular instance of the M-H algorithm with proposals p(z_n | z_{i≠n}), which are accepted with probability 1. We apply a series of these (component-wise) operators.
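A minimal sketch for a bivariate Gaussian target with correlation ρ, where both conditionals are univariate Gaussians; ρ is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8                            # correlation of the 2-D Gaussian target
    T = 20_000

    z1, z2 = 0.0, 0.0
    samples = np.empty((T, 2))
    for t in range(T):
        # Conditionals of a standard bivariate Gaussian with correlation rho:
        #   p(z1|z2) = N(rho * z2, 1 - rho^2),  p(z2|z1) = N(rho * z1, 1 - rho^2)
        z1 = rng.normal(rho * z2, np.sqrt(1.0 - rho**2))
        z2 = rng.normal(rho * z1, np.sqrt(1.0 - rho**2))
        samples[t] = z1, z2
    print(np.corrcoef(samples.T)[0, 1])  # ~0.8; every conditional draw is exact,
                                         # so each component-wise move is accepted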
Gibbs Sampler
Applicability of the Gibbs sampler depends on how easy it is to sample from the conditional probabilities p(z_n | z_{i≠n}).

• For discrete random variables with a few discrete settings:

    p(z_n | z_{i≠n}) = p(z_n, z_{i≠n}) / Σ_{z_n} p(z_n, z_{i≠n})

The sum can be computed analytically.

• For continuous random variables:

    p(z_n | z_{i≠n}) = p(z_n, z_{i≠n}) / ∫ p(z_n, z_{i≠n}) dz_n

The integral is univariate and is often analytically tractable or amenable to standard sampling methods.
Bayesian PMF
Remember the predictive distribution? Consider predicting a rating r*_ij for user i and query movie j:

    p(r*_ij | R) = ∫∫ p(r*_ij | u_i, v_j) p(U, V, Θ_U, Θ_V | R) d{U, V} d{Θ_U, Θ_V}

where p(U, V, Θ_U, Θ_V | R) is the posterior over parameters and hyperparameters.

Use a Monte Carlo approximation:

    p(r*_ij | R) ≈ (1/N) Σ_{n=1}^{N} p(r*_ij | u_i^(n), v_j^(n)).

The samples (u_i^(n), v_j^(n)) are generated by running a Gibbs sampler whose stationary distribution is the posterior distribution of interest.
Bayesian PMF
Monte Carlo approximation:

    p(r*_ij | R) ≈ (1/N) Σ_{n=1}^{N} p(r*_ij | u_i^(n), v_j^(n)).

The conditional distributions over the user and movie feature vectors are Gaussian → easy to sample from:

    p(u_i | R, V, Θ_U, α) = N( u_i | µ_i*, Σ_i* )
    p(v_j | R, U, Θ_V, α) = N( v_j | µ_j*, Σ_j* )

The conditional distributions over the hyperparameters also have closed form → easy to sample from.

Netflix dataset: Bayesian PMF can handle over 100 million ratings.
MCMC: Main Problems
Main problems of MCMC:

• Hard to diagnose convergence (burn-in).

• Hard to sample from isolated modes.

More advanced MCMC methods for sampling from distributions with isolated modes:

• Parallel tempering

• Simulated tempering

• Tempered transitions

Hamiltonian Monte Carlo methods (make use of gradient information).

Nested Sampling, Coupling from the Past, and many others.