MCMC and Bayesian Modeling
1 Bayesian Modeling
Not surprisingly, Bayes’s Theorem is the key result that drives Bayesian modeling and statistics. Let S be a sample space and let B_1, . . . , B_K be a partition of S so that (i) ∪_k B_k = S and (ii) B_i ∩ B_j = ∅ for all i ≠ j.
Theorem 1 (Bayes’s Theorem) Let A be any event. Then for any 1 ≤ k ≤ K we have
P(B_k | A) = P(A | B_k)P(B_k) / P(A) = P(A | B_k)P(B_k) / ∑_{j=1}^K P(A | B_j)P(B_j).
Of course there is also a continuous version of Bayes’s Theorem with sums replaced by integrals. Bayes’s
Theorem provides us with a simple rule for updating probabilities when new information appears. In Bayesian
modeling and statistics this new information is the observed data and it allows us to update our prior beliefs
about parameters of interest which are themselves assumed to be random variables.
Figure 20.1 (Taken from Ruppert’s Statistics and Data Analysis for FE): Prior and posterior densities
for α = β = 2 and n = x = 5. The dashed vertical lines are at the lower and upper 0.05-quantiles of the
posterior, so they mark off a 90% equal-tailed posterior interval. The dotted vertical line shows the location
of the posterior mode at θ = 6/7 = 0.857.
Bayesian inference is based on the posterior distribution of θ given the observed data X,

π(θ | x) = π(θ)p(x | θ) / ∫ π(θ)p(x | θ) dθ.   (1)

Much of Bayesian analysis is concerned with “understanding” the posterior π(θ | x). Note that
π(θ | x) ∝ π(θ)p(x | θ)
which is what we often work with in practice. Sometimes we can recognize the form of the posterior by simply
inspecting π(θ)p(x | θ). But typically we cannot recognize the posterior and cannot compute the denominator
in (1) either. In such cases approximate inference techniques such as MCMC are required. We begin with a
simple example.
Example 1 (Beta prior, binomial likelihood) We assume θ ∼ Beta(α, β) so that

π(θ) = θ^{α−1}(1 − θ)^{β−1} / B(α, β),   0 < θ < 1.
We also assume that X | θ ∼ Bin(n, θ) so that p(x | θ) = (n choose x) θ^x (1 − θ)^{n−x}, x = 0, . . . , n. The posterior then
satisfies
p(θ | x) ∝ π(θ)p(x | θ)
= [θ^{α−1}(1 − θ)^{β−1} / B(α, β)] · (n choose x) θ^x (1 − θ)^{n−x}
∝ θ^{α+x−1}(1 − θ)^{β+n−x−1}
which we recognize as the Beta(α + x, β + n − x) distribution! See Figure 20.1 from Statistics and Data
Analysis for Financial Engineering by David Ruppert for a numerical example and a visualization of how the data
and prior interact to produce the posterior distribution.
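The conjugate update is easy to check numerically. Here is a short Python sketch (a sketch only, assuming NumPy and SciPy are available) that uses the Figure 20.1 values α = β = 2 and n = x = 5 and recovers the posterior mode and the 90% equal-tailed interval:

    import numpy as np
    from scipy import stats

    alpha, beta, n, x = 2, 2, 5, 5           # Figure 20.1: Beta(2, 2) prior, n = x = 5

    # By conjugacy the posterior is Beta(alpha + x, beta + n - x) = Beta(7, 2).
    a, b = alpha + x, beta + n - x
    posterior = stats.beta(a, b)

    print("mode:", (a - 1) / (a + b - 2))               # (a-1)/(a+b-2) = 6/7 = 0.857...
    print("90% interval:", posterior.ppf([0.05, 0.95]))  # dashed lines in Figure 20.1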
We say the prior π(θ; α_0) is a conjugate prior for the likelihood p(x | θ) if the posterior satisfies

π(θ | x) = π(θ; α(x))

for some parameter α(x) depending on the data, so that the observations influence the posterior only via a parameter change α_0 → α(x). In particular, the form
or type of the distribution is unchanged. In Example 1, for example, we saw the beta distribution is conjugate
for the binomial likelihood. Here are two further examples.
Suppose, for example, that x = (x_1, . . . , x_n) are IID N(θ, σ²) with σ² known, and that the prior is θ ∼ N(µ_0, γ_0²). Then the posterior satisfies

π(θ | x) ∝ π(θ) ∏_{i=1}^n p(x_i | θ)
∝ exp( −(θ − µ_1)² / 2γ_1² )

where

γ_1^{−2} := γ_0^{−2} + nσ^{−2}   and   µ_1 := γ_1² (µ_0 γ_0^{−2} + ∑_{i=1}^n x_i σ^{−2})

so that θ | x ∼ N(µ_1, γ_1²) and the normal prior is conjugate for the normal likelihood with known variance.
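As a quick numerical sanity check of these formulas, here is a minimal sketch; the data and hyper-parameters are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    mu0, gamma0, sigma = 0.0, 2.0, 1.0     # made-up prior mean/std and known noise std
    x = rng.normal(1.5, sigma, size=20)    # simulated data with true theta = 1.5

    prec1 = gamma0 ** -2 + len(x) * sigma ** -2            # gamma_1^{-2}
    gamma1_sq = 1.0 / prec1
    mu1 = gamma1_sq * (mu0 * gamma0 ** -2 + x.sum() * sigma ** -2)
    print(f"posterior: N({mu1:.3f}, {gamma1_sq:.4f})")     # concentrates near 1.5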
When both µ and σ² are unknown, a conjugate joint prior for (µ, σ²) is available whose density we recognize as the N-Inv-χ²(µ_0, σ_0²/κ_0; ν_0, σ_0²) PDF. Note that µ and σ² are not independent under this joint prior.
Exercise 2 Show that multiplying this prior by the normal likelihood yields a N-Inv-χ2 distribution.
More generally, suppose the likelihood belongs to the exponential family

p(x | θ) = e^{θᵀu(x) − ψ(θ)}   (2)

where θ ∈ R^m is a parameter vector and u(x) = (u_1(x), . . . , u_m(x)) is the vector of sufficient statistics. The exponential family includes the Normal, Gamma, Beta, Poisson, Dirichlet, Wishart and Multinomial distributions as
special cases. The exponential family is also essentially the only distribution with a non-trivial conjugate prior.
This conjugate prior takes the form
π(θ; α, γ) ∝ e^{θᵀα − γψ(θ)}.   (3)
Combining (2) and (3) we see the posterior takes the form
p(θ | x, α, γ) ∝ e^{θᵀu(x) − ψ(θ)} e^{θᵀα − γψ(θ)} = e^{θᵀ(α + u(x)) − (γ+1)ψ(θ)}
= π(θ | α + u(x), γ + 1)
without knowing the constant of proportionality in (4). This leads to the general sampling problem: simulate from a distribution of the form

p(z) = p̃(z) / Z_p   (5)

where p̃(z) ≥ 0 is easy to compute but Z_p is (too) hard to compute. This very important situation arises in
several contexts:
1. In Bayesian models where p̃(θ) := p(x | θ)π(θ) is easy to compute but Z_p := p(x) = ∫_θ π(θ)p(x | θ) dθ can be very difficult or impossible to compute.
2. In models from statistical physics, e.g. the Ising model, we only know p̃(z) = e−E(z) , where E(z) is an
“energy” function. (The Ising model is an example of a Markov network or an undirected graphical model.)
3. Dealing with evidence in directed graphical models such as belief networks aka directed acyclic graphs.
The sampling problem is the problem of simulating from p(z) in (5) without knowing the constant Zp . While
the acceptance-rejection algorithm can be used, it is very inefficient in high dimensions and an alternative
approach is required. That alternative approach is Markov Chain Monte-Carlo (MCMC).
Note it’s easy to check that [P(X_{t+1} = x′ | X_{t−1} = x)]_{(x,x′)∈Ω×Ω} = P².
Definition 2 A Markov chain with transition matrix P is called ergodic if there exists r such that P^r > 0 element-wise.
We note that the ergodicity of a Markov chain is equivalent to the Markov chain being:
1. Irreducible: For all x, y ∈ Ω, there exists r(x, y) s.t. P^{r(x,y)}(x, y) > 0.
2. Aperiodic: For all x ∈ Ω, GCD{r : P^r(x, x) > 0} = 1.
Definition 4 The total variation distance, dT V (µ, ν), between two probability measures µ, ν on Ω is defined as
‖µ − ν‖_TV := max_{S⊂Ω} {µ(S) − ν(S)} = (1/2) ∑_{z∈Ω} |µ(z) − ν(z)|.
The mixing time function, τ_mix(ε), is defined as the time until the total variation distance to π is below ε.
We say that π is a stationary distribution of the chain if

π(y) = ∑_{x∈Ω} P(y | x)π(x) for all y ∈ Ω,   (6)

and that the chain is reversible with respect to π if

P(y | x)π(x) = P(x | y)π(y) for all x, y ∈ Ω.   (7)

It’s easy to check that if π satisfies (7) then it is the stationary distribution of the Markov chain since then we have

∑_x P(y | x)π(x) = ∑_x P(x | y)π(y) = π(y)
which is (6). Note that (7) implies the chain moves from x to y at the same rate as it moves from y to x (when in equilibrium). For this reason (7) is often called the detailed balance equation. Satisfying the detailed balance equation is a sufficient (but not necessary) condition for π to be a stationary distribution. We will also want ergodicity, which guarantees that π is the unique stationary distribution and that the chain converges to it.
Exercise 3 What is the stationary distribution for a reversible symmetric Markov chain?
There are analogous definitions and results for Markov chains on continuous state spaces that we will not state
here.
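The following sketch makes these definitions concrete on a made-up three-state chain: it builds a Metropolis-style transition matrix for a given target π and verifies detailed balance (7), stationarity (6) and ergodicity numerically:

    import numpy as np

    pi = np.array([0.2, 0.3, 0.5])        # target distribution on Omega = {0, 1, 2}
    Q = np.full((3, 3), 1.0 / 3.0)        # symmetric proposal: pick a state uniformly

    # Metropolis chain: from i, propose j and accept w.p. min(pi_j / pi_i, 1).
    P = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            if i != j:
                P[i, j] = Q[i, j] * min(pi[j] / pi[i], 1.0)
        P[i, i] = 1.0 - P[i].sum()

    F = pi[:, None] * P                   # flow matrix: F_ij = pi_i P_ij
    assert np.allclose(F, F.T)            # detailed balance (7)
    assert np.allclose(pi @ P, pi)        # stationarity (6)
    assert (P > 0).all()                  # ergodic: P^r > 0 holds already for r = 1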
Proof of Claim: We simply check that p(x) satisfies the detailed balance equations. Writing P(y | x) = α(y | x)Q(y | x) for the Metropolis-Hastings transition density with acceptance probability α(y | x) := min{ p(y)Q(x | y) / (p(x)Q(y | x)), 1 }, we have for y ≠ x

P(y | x)p(x) = min{ p(y)Q(x | y) / (p(x)Q(y | x)), 1 } Q(y | x)p(x)
= min{ p(y)Q(x | y), p(x)Q(y | x) }
= min{ 1, p(x)Q(y | x) / (p(y)Q(x | y)) } Q(x | y)p(y)
= P(x | y)p(y)
as desired. There are still some important questions that need to be addressed:
1. How do we determine when stationarity is achieved?
- In general it is difficult to provide a theoretical answer to this question. Instead, we check for
convergence to stationarity on a case-by-case basis using convergence diagnostics. We will
discuss this further in Section 4.2.
2. There are many possible choices of proposal distribution, Q(· | ·). Which one should we use?
- This is an important question since Q(· | ·) influences how much time is required to reach stationarity. There appear to be relatively few theoretical results on this question although rules of thumb and experience / experimentation do provide (partial) answers. See also the related discussion in Example 5 below and the code sketch that follows.
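The following is a minimal sketch of the random-walk Metropolis algorithm in the spirit of Figure 11.9 below; the correlated Gaussian target, the proposal scale and the sample count are illustrative assumptions rather than the figure’s exact setup:

    import numpy as np

    def metropolis(log_p, x0, step, n_samples, rng):
        """Random-walk Metropolis with symmetric proposal Q(. | x) = N(x, step^2 I)."""
        x = np.asarray(x0, dtype=float)
        samples, accepted = [], 0
        for _ in range(n_samples):
            y = x + step * rng.standard_normal(x.shape)     # propose
            # Q is symmetric, so accept w.p. min(1, p(y)/p(x)).
            if np.log(rng.uniform()) < log_p(y) - log_p(x):
                x, accepted = y, accepted + 1
            samples.append(x.copy())                        # rejections repeat x
        return np.array(samples), accepted / n_samples

    # Target: zero-mean bivariate Gaussian with correlation 0.9 (log-density up to a constant).
    prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
    log_p = lambda z: -0.5 * z @ prec @ z

    rng = np.random.default_rng(1)
    samples, acc = metropolis(log_p, [0.0, 0.0], step=0.2, n_samples=150, rng=rng)
    print("acceptance rate:", acc)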
Figure 11.9 (Taken from Bishop’s Pattern Recognition and Machine Learning): A simple illustration of using the Metropolis algorithm to sample from a Gaussian distribution whose one standard-deviation contour is shown
by the ellipse. The proposal distribution is an isotropic Gaussian distribution whose standard deviation is
0.2. Steps that are accepted are shown as green lines, and rejected steps are shown in red. A total of 150
candidate samples are generated, of which 43 are rejected.
Figure 11.9 from Bishop’s Pattern Recognition and Machine Learning displays samples from a Gaussian
distribution that were generated using the Metropolis algorithm with an isotropic Gaussian distribution as the
proposal distribution, Q(· | ·). Specifically Q(· | x_t) ∼ N_2(x_t, 0.2² × I_2) where I_n denotes the n-dimensional
identity matrix.
Exercise 5 Can you explain the pattern of accepted and rejected samples in Figure 11.9? This is a general
phenomenon and is important to understand. See also Example 5 below.
Figure 27.8 (Taken from Barber’s Bayesian Reasoning and Machine Learning): Metropolis-Hastings sam-
ples from a bi-variate distribution p(x_1, x_2) using a proposal q̃(x′ | x) = N(x′ | x, I). We also plot the iso-
probability contours of p. Although p(x) is multi-modal, the dimensionality is low enough and the modes
sufficiently close such that a simple Gaussian proposal distribution is able to bridge the two modes. In higher
dimensions, such multi-modality is more problematic.
Figure 27.8 from Barber’s Bayesian Reasoning and Machine Learning displays samples from a bi-modal density
that were generated using a bivariate normal proposal. In general, simulating from multi-modal distributions
using MCMC can be challenging, particularly in high-dimensional problems.
Exercise 6 Consider carefully the following questions all of which refer to Figure 27.8. (Understanding them is
key to understanding the issues that arise when simulating from multi-modal distributions.)
1. Suppose that instead of using a N(x′ | x, I) proposal we used a N(x′ | x, σI) proposal with σ ≪ 1. How do you think the algorithm would perform then? Specifically, do you think convergence to stationarity would happen “quickly”?
3 Gibbs Sampling
Gibbs sampling2 is an MCMC sampler introduced by Geman and Geman in 1984. Let x^{(t)} ∈ R^m denote the current sample. Then Gibbs sampling proceeds as follows: select a component k ∈ {1, . . . , m} and sample x_k^{(t+1)} ∼ p(x_k | x_{−k}^{(t)}), leaving the other components unchanged so that x_{−k}^{(t+1)} = x_{−k}^{(t)}.

In Gibbs sampling only one component of x is updated at a time. It is common to simply order the m components and update them sequentially. We can then let x^{(t+1)} be the value of the chain after all m updates rather than after each individual update. Gibbs sampling is a very popular method for applications where the conditional distributions, p(x_k | x_{−k}), are easy to simulate from. This is the case for conditionally conjugate models, among others.
It is easy to see that Gibbs sampling is a special case of Metropolis-Hastings sampling with
Q_k(y | x) = p(y_k | x_{−k}) if y_{−k} = x_{−k}, and Q_k(y | x) = 0 otherwise,
and that each component update will be accepted with probability 1. One must be careful, however, to ensure that the component-wise Markov chain is ergodic, as discussed earlier. See Barber’s Figure 27.5 in Section 3.1 for an example of a non-ergodic chain for which the Gibbs sampler fails to converge to the desired stationary distribution.
Consider, for example, the joint density

p(x, y) ∝ (n choose x) y^{x+α−1}(1 − y)^{n−x+β−1},   x = 0, . . . , n, 0 ≤ y ≤ 1.   (8)

It is hard to simulate directly from p(x, y) but the conditional distributions are easy to work with. We see that
• x | y ∼ Bin(n, y)
• y | x ∼ Beta(x + α, n − x + β)
and since it’s easy to simulate from each conditional, it’s easy to run a Gibbs sampler to simulate from the joint
distribution.
Exercise 7 Can you identify a situation where the distribution of (8) might arise? Hint: Refer to one of our
earlier examples. (Note that the marginal distribution of x has a beta-binomial distribution.)
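A Gibbs sampler for this example takes only a few lines; the values of n, α and β below are illustrative:

    import numpy as np

    def gibbs_beta_binomial(n, alpha, beta, n_iter, rng):
        """Alternate x | y ~ Bin(n, y) and y | x ~ Beta(x + alpha, n - x + beta)."""
        x, y = 0, 0.5                        # arbitrary starting point
        xs = np.empty(n_iter, dtype=int)
        for t in range(n_iter):
            x = rng.binomial(n, y)
            y = rng.beta(x + alpha, n - x + beta)
            xs[t] = x
        return xs

    rng = np.random.default_rng(2)
    xs = gibbs_beta_binomial(n=10, alpha=2.0, beta=3.0, n_iter=10_000, rng=rng)
    # The marginal of x is beta-binomial with mean n * alpha / (alpha + beta) = 4.
    print(xs[1000:].mean())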
Table 11-2 (Taken from Bayesian Data Analysis, 2nd edition by Gelman et al.): Coagulation time in
seconds for blood drawn from 24 animals randomly allocated to four different diets. Different treatments
have different numbers of observations because the randomization was unrestricted. From Box, Hunter, and
Hunter (1978), who adjusted the data so that the averages are integers, a complication we ignore in our
analysis.
Gibbs sampling is particularly suited to hierarchical models, an important class of models throughout statistics and machine learning. We consider here an example from Bayesian Data Analysis by Gelman et al.; the data are presented in Table 11-2 above. The data-points y_ij, for i = 1, . . . , n_j and j = 1, . . . , J, are assumed to be
independently normally distributed within each of J groups with means θj and common variance σ 2 . That is,
yij | θj ∼ N (θj , σ 2 ).
The total number of observations is n = ∑_{j=1}^J n_j. Group means are assumed to follow a normal distribution with unknown mean µ and variance τ²:

θ_j ∼ N(µ, τ²).
A uniform prior is assumed3 for (µ, log σ, τ ) which is equivalent to assuming (why?) that p(µ, log σ, log τ ) ∝ τ
The posterior is then given by

p(θ, µ, log σ, log τ | y) ∝ τ ∏_{j=1}^J N(θ_j | µ, τ²) ∏_{j=1}^J ∏_{i=1}^{n_j} N(y_ij | θ_j, σ²).   (9)
We will see from (9) that all the conditional distributions required for the Gibbs sampler have simple conjugate forms:

θ_j | (µ, σ, τ, y) ∼ N(θ̂_j, V_{θ_j})   (10)

where

θ̂_j := (µ/τ² + n_j ȳ_{·j}/σ²) / (1/τ² + n_j/σ²)   and   V_{θ_j} := 1 / (1/τ² + n_j/σ²).
These conditional distributions are independent so generating the θj ’s one at a time is equivalent to
drawing θ all at once.
µ | (θ, σ, τ, y) ∼ N(µ̂, τ²/J)   (11)

where µ̂ := (1/J) ∑_{j=1}^J θ_j.
σ² | (θ, µ, τ, y) ∼ Inv-χ²(n, σ̂²)   (12)

where σ̂² := (1/n) ∑_{j=1}^J ∑_{i=1}^{n_j} (y_ij − θ_j)².
τ² | (θ, µ, σ, y) ∼ Inv-χ²(J − 1, τ̂²)   (13)

where τ̂² := (1/(J−1)) ∑_{j=1}^J (θ_j − µ)².
To start the Gibbs sampler we only (why?) need starting points for θ and µ and then we use (10) to (13) to
repeatedly generate samples from the conditional distributions.
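Here is a sketch of the full Gibbs sampler using the conditionals (10) to (13) and the coagulation data of Table 11-2 (as transcribed from BDA; note the group means are the integers 61, 66, 68 and 61, consistent with the table’s caption). Draws from the scaled Inv-χ²(ν, s²) distribution are generated as νs²/X with X ∼ χ²_ν:

    import numpy as np

    rng = np.random.default_rng(3)

    # Coagulation times (seconds) for the four diets of Table 11-2.
    groups = [np.array(g, dtype=float) for g in (
        [62, 60, 63, 59],
        [63, 67, 71, 64, 65, 66],
        [68, 66, 71, 67, 68, 68],
        [56, 62, 60, 61, 63, 64, 63, 59])]
    J = len(groups)
    n_j = np.array([len(g) for g in groups])
    ybar = np.array([g.mean() for g in groups])    # group means: 61, 66, 68, 61
    n = n_j.sum()

    def inv_chi2(nu, s2):
        """Scaled Inv-chi^2(nu, s2) draw: nu * s2 / chi^2_nu."""
        return nu * s2 / rng.chisquare(nu)

    # Starting points are needed only for theta and mu; the remaining
    # conditionals are then drawn in turn, exactly as in (10) to (13).
    theta, mu = ybar.copy(), ybar.mean()
    draws = []
    for _ in range(5000):
        sigma2 = inv_chi2(n, sum(((g - t) ** 2).sum()                 # (12)
                                 for g, t in zip(groups, theta)) / n)
        tau2 = inv_chi2(J - 1, ((theta - mu) ** 2).sum() / (J - 1))   # (13)
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))              # (11)
        V = 1.0 / (1.0 / tau2 + n_j / sigma2)                         # (10)
        theta = rng.normal(V * (mu / tau2 + n_j * ybar / sigma2), np.sqrt(V))
        draws.append(theta.copy())
    print("posterior means of theta:", np.array(draws)[2500:].mean(axis=0))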
3 If a uniform prior were assigned to log τ then the posterior would be improper, as discussed in Gelman et al. This emphasizes the importance of understanding the issues associated with choosing priors. We have not discussed these issues in these lecture notes but they are important.
Exercise 8 Does the collapsed Gibbs sampler remind you of any variance reduction technique? If so, which
one and why?
Figure 27.5 (Taken from Barber’s Bayesian Reasoning and Machine Learning): A two dimensional distribution for which Gibbs sampling fails. The distribution has mass only in the shaded quadrants. Gibbs sampling proceeds from the l-th sample state (x_1^l, x_2^l) by sampling from p(x_2 | x_1^l), giving (x_1^{l+1}, x_2^{l+1}) where x_1^{l+1} = x_1^l. One then continues with a sample from p(x_1 | x_2 = x_2^{l+1}), etc. If we start in the lower left quadrant and proceed this way, the upper right region is never explored.
A second problem that often arises with Gibbs sampling is that successive samples may be strongly correlated (negatively or positively). In that event the chain may take a very long time to reach the stationary distribution and to explore it fully. This phenomenon is discussed in the captions of Figure 27.7 from Barber’s BRML and Figure 11.11 from Bishop’s PRML, both of which are displayed below.
When the variables are very correlated a common strategy (to overcome this weakness) is to perform a simple
transformation of variables so that the transformed variables are approximately independent.
Exercise 9 Suppose the random variables x1 , . . . , xd are independent. How long do you think it will take the
Gibbs sampler to reach stationarity in that case?
Figure 11.11 (Taken from Bishop’s Pattern Recognition and Machine Learning): Illustration of Gibbs
sampling by alternate updates of two variables whose distribution is a correlated Gaussian. The step size is
governed by the standard deviation of the conditional distribution (green curve), and is O(l), leading to slow
progress in the direction of elongation of the joint distribution (red ellipse). The number of steps needed to
obtain an independent sample from the distribution is O((L/l)²).
Figure 27.7 (Taken from Barber’s Bayesian Reasoning and Machine Learning): Two hundred Gibbs samples
for a two dimensional Gaussian. At each stage only a single component is updated. (a): For a Gaussian
with low correlation, Gibbs sampling can move through the likely regions effectively. (b): For a strongly
correlated Gaussian, Gibbs sampling is less effective and does not rapidly explore the likely regions.
The following example highlights the dangers of blindly running a Gibbs sampler for a given set of conditional distributions.
so that both conditionals are exponential distributions (and therefore well-defined). If we apply a Gibbs sampler
here, however, we will not obtain a sample from any marginal or joint distribution. This is because (14) and
(15) do not correspond to any joint distribution on (x, y).
Figure 11.2 (Taken from Gelman et al.’s BDA, 2nd ed.): Five independent sequences of a Markov chain
simulation for the bivariate unit normal distribution, with over-dispersed starting points indicated by solid
squares. (a) After 50 iterations, the sequences are still far from convergence. (b) After 1000 iterations, the
sequences are nearer to convergence. Figure (c) shows the iterates from the second halves of the sequences.
The points in Figure (c) have been jittered so that steps in which the random walk stood still are not hidden.
When monitoring convergence it is preferable to work with transformed scalar estimands so they are approximately normal. We can achieve this by, for example, taking logs of strictly positive quantities and taking logits of quantities that must lie in (0, 1).
Let ψ_ij for i = 1, . . . , n and j = 1, . . . , m be the MCMC samples computed after the burn-in period and after splitting the non-burn-in component of each chain in two. The between- and within-sequence variances, B and W, are computed as4

B := (n/(m−1)) ∑_{j=1}^m (ψ̄_{·j} − ψ̄_{··})²

W := (1/m) ∑_{j=1}^m s_j²   where   s_j² := (1/(n−1)) ∑_{i=1}^n (ψ_ij − ψ̄_{·j})²

and where ψ̄_{·j} := (1/n) ∑_{i=1}^n ψ_ij and ψ̄_{··} := (1/m) ∑_{j=1}^m ψ̄_{·j}. We can estimate Var(ψ | X) as a weighted average of W and B with

V̂ar^+(ψ | X) := ((n−1)/n) W + (1/n) B.   (16)
Note that V̂ar^+(ψ | X) overestimates the marginal posterior variance (of ψ) since the starting distribution is over-dispersed. But it will be unbiased when sampling from the desired stationary distribution.
We also note that for any finite n, it should be the case that W is an underestimate of Var (ψ | X). This follows
since each individual sequence may not yet have had time to explore all of the target, i.e. stationary, distribution.
But W should approach Var (ψ | X) in the limit as n → ∞. We therefore monitor convergence through
R̂ := √( V̂ar^+(ψ | X) / W ).

Note that by the above argument, we should have R̂ > 1 for any finite n but we also have R̂ → 1 as n → ∞.
This leads to the following rule of thumb for diagnosing convergence:
Rule of Thumb: Values of R̂ < 1.1 are acceptable but the closer R̂ is to 1 the better. We then monitor R̂ for all quantities ψ of interest.
4 B contains a factor of n because it is based on the variance of the within-sequence means, ψ̄_{·j}, each of which is an average of n values.
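These formulas translate directly into code. In the sketch below the two “chains” are simulated AR(1) sequences started from over-dispersed points, purely to give the diagnostic something to detect:

    import numpy as np

    def r_hat(chains):
        """chains has shape (m, n): m sequences of length n, burn-in already removed."""
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)     # between-sequence variance
        W = chains.var(axis=1, ddof=1).mean()       # within-sequence variance
        var_plus = (n - 1) / n * W + B / n          # (16)
        return np.sqrt(var_plus / W)

    rng = np.random.default_rng(4)

    def ar1(x0, n, rho=0.95):
        out = np.empty(n)
        out[0] = x0
        for t in range(1, n):
            out[t] = rho * out[t - 1] + rng.standard_normal()
        return out

    # Two sequences started from the over-dispersed points +10 and -10.
    print(r_hat(np.stack([ar1(10.0, 200), ar1(-10.0, 200)])))        # well above 1
    print(r_hat(np.stack([ar1(10.0, 20_000), ar1(-10.0, 20_000)])))  # close to 1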
Example 9 (Data Augmentation for Binary Response Regression with a Probit Link6 )
We have binary response variables y := (y1 , . . . , ym ) and corresponding to the ith response we have a covariate
vector xi := (xi1 , . . . , xik ). The probit regression model is, like logistic regression, a generalized linear model
(GLM) except that the probability that y_i = 1 satisfies

p_i := P(y_i = 1 | x_i, β) = Φ(x_iᵀβ)   (17)

where Φ is the CDF of the standard normal distribution. The goal is to estimate β := (β_1, . . . , β_k) and this can
be done using standard GLM software using the ‘probit’ link function. We will use a Bayesian approach here,
however. If we assume a prior π(β) on β then the posterior density is given by
g(β | y) ∝ π(β) ∏_{i=1}^n p_i^{y_i}(1 − p_i)^{1−y_i}
= π(β) ∏_{i=1}^n Φ(x_iᵀβ)^{y_i} (1 − Φ(x_iᵀβ))^{1−y_i}.   (18)
It is not clear how to generate samples of β from the posterior in (18) in a Gibbs sampling framework. A clever
way to resolve this problem is to define latent, i.e. unobserved, variables
z_i := x_{i1}β_1 + · · · + x_{ik}β_k + ε_i
5 See Appendix A.3 for a description of HMC.
6 This example is taken from Bayesian Analysis of Binary and Polychotomous Response Data by Albert and Chib (1993).
where the ε_i’s are IID N(0, 1) for i = 1, . . . , n. Note that (why?) p_i = P(z_i > 0) = Φ(x_iᵀβ). We can now
regard the problem as a missing data problem where instead of observing the zi ’s we only observe the indicators
yi := 1{zi >0} and our posterior distribution is now over β and z := (z1 , . . . , zn ). This posterior is given by
g(β, z | y) ∝ g(β, z, y)
= π(β) ∏_{i=1}^n [ 1{z_i > 0} 1{y_i = 1} + 1{z_i ≤ 0} 1{y_i = 0} ] φ(z_i; x_iᵀβ, 1)   (19)
where φ(· ; µ, σ 2 ) denotes the PDF for a normal random variable with mean µ and variance σ 2 . The posterior in
(19) is in a particularly convenient form for Gibbs sampling if we assume π(β) ≡ 1, i.e. a uniform prior on β. In
that case we can use a block Gibbs sampler where we simulate successively from g(β | z, y) and g(z | β, y).
When π(β) ≡ 1 it is relatively(!) easy to see that

β | (z, y) ∼ MVN_k( (XᵀX)^{−1}Xᵀz, (XᵀX)^{−1} )   (20)

where MVN_k(µ, Σ) denotes a k-dimensional multivariate normal distribution with mean vector µ and covariance matrix Σ, and X is the design matrix for the problem.
Exercise 10 Justify (20) and then explain how we can also simulate from g(z | β, y).
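A sketch of the resulting block Gibbs sampler is given below on simulated data (not the Donner data); the truncated normal draws for z use inverse-CDF sampling and we take π(β) ≡ 1 as above. All names and sizes are illustrative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # Simulated data from a "true" probit model.
    n, beta_true = 500, np.array([-1.0, 2.0])
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
    y = (X @ beta_true + rng.standard_normal(n) > 0).astype(int)

    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)

    beta, draws = np.zeros(2), []
    for _ in range(4000):
        # g(z | beta, y): z_i ~ N(x_i' beta, 1) truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0; drawn here by inverse-CDF sampling.
        mu = X @ beta
        lo = stats.norm.cdf(-mu)                 # P(eps <= -mu)
        u = np.where(y == 1, rng.uniform(lo, 1.0), rng.uniform(0.0, lo))
        z = mu + stats.norm.ppf(u)
        # g(beta | z, y) = MVN((X'X)^{-1} X'z, (X'X)^{-1}) as in (20).
        beta = XtX_inv @ (X.T @ z) + chol @ rng.standard_normal(2)
        draws.append(beta)
    print("posterior mean of beta:", np.array(draws)[1000:].mean(axis=0))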
As a specific example, we consider the data-set on the Donner party, a group of wagon trail emigrants who
struggled to cross the Sierra Nevada mountains in California in 1846-47 with the result being that a large
number of them starved to death. We are interested in estimating the model

P(y_i = 1) = Φ(β_1 + β_2 Male_i + β_3 Age_i)   (21)
where yi = 1 denotes the death of the ith person in the party and yi = 0 denotes their survival. We have two
covariates, Male (1 for males, 0 for females) and Age (in years).

Figure 1: Median, 5th and 95th percentile survival rates as a function of age for men.

Figure 1 displays estimated percentile survival rates for men of various ages in the Donner party. These quantities were computed by running the block
Gibbs sampler as described above and using the β samples (after convergence had been diagnosed) together
with (21).
In addition to demonstrating the power of data augmentation, it is also worth noting that the survival curves of Figure 1 would be extremely difficult to construct in a non-Bayesian framework, especially when there are relatively few data-points so that large-n asymptotic results do not apply.
Remark 2 In non-Bayesian problems with latent / hidden variables it is very common to estimate parameters
via the EM algorithm. In Bayesian versions of these problems it is typically the case that a 2-stage Gibbs
sampler can easily be implemented. The first stage simulates the unknown parameters given the data (observed
and hidden) while the second stage simulates the unobserved data given the parameters and observed data.
f(X_{t+1} | V = v) ∝ f(X_{t+1}, v)
= f(v | X_{t+1}) f(X_{t+1})   (22)

where f(v | X_{t+1}) is easily computed given the (user-specified) conditional distribution of V given X_{t+1}, and f(X_{t+1}) is the objective distribution of the risk-factor returns discussed above. We can use MCMC to simulate many samples from (22) which can then be used to construct an optimal portfolio.
Note that we obtain the famous Black-Litterman model when Xt+1 is the vector of security returns, g(·) is
linear, and all distributions are multivariate normal. In this case the posterior can be calculated analytically.
Figure taken from “The Markov Chain Monte Carlo Revolution”, by Persi Diaconis in the Bulletin of the
American Mathematical Society (2008).
7 This example is based on the paper “The Markov Chain Monte Carlo Revolution”, by Persi Diaconis in the Bulletin of the American Mathematical Society (2008).
The goal then was to crack this cipher, i.e. to find the decryption function

f : {code symbols} → {English letters and symbols}.   (23)

To each candidate f we assign the plausibility

Pl(f) := ∏_i M(f(s_i), f(s_{i+1}))

where s_i runs over all the symbols that appear in the coded message and M(a, b) records the frequency with which the letter a is followed by the letter b in a long reference text. The idea here is that functions with high values of Pl(f) are good candidates for the decryption code in (23).
3. We therefore search for plausibility-maximizing f’s by running the following MCMC algorithm:
• Start with an initial guess f .
• Compute Pl(f ).
• Change to f∗ by making a random transposition of the values f assigns to two symbols.
• Compute Pl(f∗ ); if this is larger than Pl(f ) accept f∗ .
• If not, flip a coin where the probability of heads is Pl(f∗ )/Pl(f ). If the coin toss comes up heads
accept f∗ . Otherwise stay at f .
Exercise 11 What type of MCMC algorithm is described in Step 3? Explain what each step is doing.
By running the algorithm for sufficiently many iterations, and possibly from randomly chosen starting points, we hope that the algorithm will identify regions of high probability, i.e. plausibility.
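A compact sketch of Step 3 is given below. The reference text, coded message and symbol set are left as inputs; the function names, the add-one smoothing and the best-state bookkeeping are illustrative choices rather than Diaconis’s exact implementation, and we assume the number of distinct code symbols does not exceed the alphabet size:

    import numpy as np

    def train_bigrams(reference_text, alphabet):
        """Log first-order transition frequencies M(a, b) from a reference text."""
        idx = {c: i for i, c in enumerate(alphabet)}
        chars = [c for c in reference_text if c in idx]
        M = np.ones((len(alphabet), len(alphabet)))        # add-one smoothing
        for a, b in zip(chars, chars[1:]):
            M[idx[a], idx[b]] += 1
        return np.log(M / M.sum(axis=1, keepdims=True)), idx

    def log_pl(f, msg, logM, idx):
        """log Pl(f) = sum_i log M(f(s_i), f(s_{i+1})) over consecutive symbols."""
        decoded = [f[s] for s in msg]
        return sum(logM[idx[a], idx[b]] for a, b in zip(decoded, decoded[1:]))

    def crack(msg, symbols, alphabet, logM, idx, n_iter, rng):
        """Step 3: Metropolis search over decryption functions f."""
        symbols = list(symbols)
        f = dict(zip(symbols, rng.permutation(list(alphabet))))   # initial guess
        lp = log_pl(f, msg, logM, idx)
        best_f, best_lp = dict(f), lp
        for _ in range(n_iter):
            s1, s2 = rng.choice(symbols, size=2, replace=False)   # random transposition
            f_star = dict(f)
            f_star[s1], f_star[s2] = f[s2], f[s1]
            lp_star = log_pl(f_star, msg, logM, idx)
            # Accept if Pl(f*) > Pl(f); otherwise accept w.p. Pl(f*)/Pl(f).
            if np.log(rng.uniform()) < lp_star - lp:
                f, lp = f_star, lp_star
                if lp > best_lp:
                    best_f, best_lp = dict(f), lp
        return best_f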
In the latent Dirichlet allocation (LDA) topic model, a corpus of documents over a vocabulary of M words is generated from K topics as follows:
1. A topic mixture θ_d for each document d is drawn independently from a Dir_K(α1) distribution, where Dir_K(φ) is a Dirichlet distribution over the K-dimensional simplex with parameters φ = (φ_1, . . . , φ_K).
2. Each of the K topics {β_k}_{k=1}^K is drawn independently from a Dir_M(γ1) distribution.
3. Then for each of the i = 1, . . . , N_d words in document d, an assignment variable z_id is drawn from Mult(θ_d).
4. Conditional on the assignment variable z_id, word i in document d, denoted by w_id, is drawn independently from Mult(β_{z_id}).
This is a hierarchical model and it’s straightforward to write out the joint distribution of all the data. Only the
wid ’s are observed, however, and so we need to use the corresponding conditional distribution to learn the topic
mixtures for each document, the K topic distributions and the latent variables zid . This is typically done via
Gibbs sampling or variational Bayes. Figure 2 displays some of the main topics found in a sample from the
conditional distribution.
Figure 2: Taken from Introduction to Probabilistic Topic Models by D.M. Blei (2011).
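The generative process in steps 1 to 4 above is straightforward to simulate, which is a useful sanity check before attempting inference; all sizes and hyper-parameters in this sketch are illustrative:

    import numpy as np

    rng = np.random.default_rng(7)
    K, M, D, N_d = 3, 50, 10, 40      # topics, vocabulary size, documents, words/doc
    alpha, gamma = 0.5, 0.1           # symmetric Dirichlet hyper-parameters

    beta = rng.dirichlet(gamma * np.ones(M), size=K)  # step 2: topic-word distributions
    docs = []
    for d in range(D):
        theta_d = rng.dirichlet(alpha * np.ones(K))   # step 1: topic mixture theta_d
        z = rng.choice(K, size=N_d, p=theta_d)        # step 3: assignment variables z_id
        w = np.array([rng.choice(M, p=beta[k]) for k in z])   # step 4: words w_id
        docs.append(w)
    # Inference goes the other way: given only `docs`, recover the theta's, beta's
    # and z's, e.g. via collapsed Gibbs sampling or variational Bayes.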
A graphical model contains nodes and (directed or undirected) edges. Each node in the graph corresponds to a
random variable with the edge structure of the graph (and edge direction in case of directed graphs) determining
the various conditional independence / dependence relationships between the random variables. These
relationships often enable inference, e.g. computation of conditional distributions, to be performed very
efficiently. We only consider directed graphical models here.
Note the ordering of nodes in the DAG of Figure 8.2, which was taken from Bishop’s PRML. This ordering can be used to write

p(x) = ∏_{k=1}^m p(x_k | x_1, . . . , x_{k−1}) = ∏_{k=1}^m p(x_k | pa(x_k))   (24)

Note that it is by definition that (24) must hold for any DAG representing p(x). Specifically, the DAG structure models the fact that for all k we have p(x_k | x_1, . . . , x_{k−1}) = p(x_k | pa(x_k)). It’s easy (why?) to simulate from a
DAG using (24). Indeed simulating using the representation in (24) is called ancestral sampling. It is not so
easy, however, to simulate from the joint conditional distribution when some nodes are observed but we will see
that Gibbs sampling is easy to implement in that case.
Suppose now that x_3, x_5 and x_6 are observed. Then we must work with

p(x_1, x_2, x_4, x_7 | x_3, x_5, x_6) = p(x) / ∑_{x_1, x_2, x_4, x_7} p(x)   (25)

where x_3, x_5 and x_6 are “clamped” at their observed values in (25). Computing the normalizing factor, i.e. the denominator, in (25) can be computationally demanding, especially for very large DAGs. Note also that the conditional independence structure of the original DAG (with no observed variables) is now lost: e.g. x_1 and x_3 are no longer independent once x_5 has been observed.
Exercise 12 Can we still use ancestral sampling to simulate from p(x_1, x_2, x_4, x_7 | x_3, x_5, x_6)? If so, is it efficient?
In fact we can simulate efficiently from p(x1 , x2 , x4 , x7 | x3 , x5 , x6 ) using Gibbs sampling. To see this note that
at each step of the Gibbs sampler we need to simulate from p(xi | x−i ) where any observed values in x−i are
clamped at these values throughout the simulation. But it’s easy to see (why?) that
p(x_i | x_{−i}) = (1/Z) p(x_i | pa(x_i)) ∏_{j∈ch(i)} p(x_j | pa(x_j))
where pa(xi ) and ch(i) are the parent and children nodes, respectively, of xi , and Z is the (usually easy to
compute) normalization constant
Z = ∑_{x_i} p(x_i | pa(x_i)) ∏_{j∈ch(i)} p(x_j | pa(x_j)).
Figure 3: Taken from “Strategies for Petroleum Exploration Based on Bayesian Networks: a Case Study”,
by Martinelli et al. (2012).
Note that xi ∈ pa(xj ) for each j ∈ ch(i) and so the product term in the above expression for Z is required. The
parents of xi , the children of xi and the parents of the children of xi are known collectively as the Markov
blanket of xi .
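To make this concrete, here is a sketch of Gibbs sampling with evidence on a tiny made-up binary DAG x1 → x2 → x3 with x3 observed; each full conditional is exactly the Markov-blanket formula above, and the CPT values are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(8)

    # Made-up binary DAG x1 -> x2 -> x3 with CPTs p(x1), p(x2 | x1), p(x3 | x2).
    p1 = np.array([0.4, 0.6])
    p2 = np.array([[0.9, 0.1],     # p(x2 | x1 = 0)
                   [0.2, 0.8]])    # p(x2 | x1 = 1)
    p3 = np.array([[0.7, 0.3],     # p(x3 | x2 = 0)
                   [0.1, 0.9]])    # p(x3 | x2 = 1)
    x3 = 1                         # observed evidence, clamped throughout

    x1, x2, counts = 0, 0, np.zeros(2)
    for it in range(50_000):
        w = p1 * p2[:, x2]                 # p(x1 | x2) prop. to p(x1) p(x2 | x1)
        x1 = rng.choice(2, p=w / w.sum())
        w = p2[x1, :] * p3[:, x3]          # p(x2 | x1, x3) prop. to p(x2 | x1) p(x3 | x2)
        x2 = rng.choice(2, p=w / w.sum())
        if it >= 1000:                     # discard burn-in
            counts[x1] += 1
    print("Gibbs estimate of P(x1 = 1 | x3 = 1):", counts[1] / counts.sum())

    # Exact answer by enumeration over (x1, x2) with x3 clamped, for comparison.
    joint = p1[:, None] * p2 * p3[:, x3][None, :]
    print("exact:", joint[1].sum() / joint.sum())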
Exercise 13 When using a Gibbs sampler to simulate from a DAG given some nodes have been observed, is
the sampler guaranteed to succeed? If not, what can go wrong?
A Appendix
We briefly discuss a few other important topics in Bayesian modeling and MCMC here.
2. Simulating samples from the posterior predictive distribution and checking them for “reasonableness”. We
can do this by first simulating θ from the posterior distribution (we already have these samples from the
MCMC!) and then simulating Xrep | θ.
3. Posterior predictive checking: in this case we design test statistics of interest and compare their
posterior predictive distributions (using simulated samples) to observed values of these test statistics. This
can be viewed as a form of internal model validation.
Bayesian Data Analysis (BDA) by Gelman et al. should be consulted for a far more detailed introduction to
model checking as well as many more examples.
(i) The deviance information criterion (DIC). This is only suitable for certain types of Bayesian models.
(ii) The Watanabe-Akaike information criterion (WAIC). This is a recently developed criterion and is
more generally applicable than DIC. It is not suitable, however, for models where the data is
dependent (given θ) like time-series and spatial models.
Note that pD is a random variable that depends on the data and it’s estimated differently for DIC and
WAIC. When comparing models, a smaller DIC or WAIC is “better”. Both DIC and WAIC are easily
estimated from the output of an MCMC which is a useful feature given the computational demands of
Bayesian modeling.
2. Bayesian cross-validation where the data is divided into K folds. The error on each fold is computed
by fitting the model on the remaining K − 1 folds. The error can be computed using either of:
(i) The mean-squared prediction error which requires the predicted values of the hold-out data. We can
use the posterior predictive mean which can often be estimated from MCMC.
(ii) The log posterior predictive distribution evaluated at the hold-out data.
Cross-validation can clearly be computationally very demanding.
3. Bayes factors can also be useful when choosing among competing models. Specifically, given two
models H1 and H2 , the Bayes factor, B(H2 ; H1 ), is
B(H_2; H_1) := p(X | H_2) / p(X | H_1) = ∫_{θ_2} p(X | θ_2, H_2)p(θ_2 | H_2) dθ_2 / ∫_{θ_1} p(X | θ_1, H_1)p(θ_1 | H_1) dθ_1   (26)
Note that the Bayes factor is not defined if the priors p(θ i | Hi ) are not proper. In general we need to
estimate the two integrals in (26) in order to estimate B(H2 ; H1 ).
Bayesian Model Averaging (BMA) is a related technique that performs inference using a weighted
average of several “good” models with the weights computed via Bayes factors.
It is perhaps worth emphasizing that Bayesian methods and classical frequentist methods differ8 significantly
from each other on the topic of model comparison and selection. In contrast, Bayesian and frequentist
approaches often lead to similar results when evaluating a fixed and given model.
8 See, for example, Chapters 28 and 37 of David MacKay’s excellent text Information Theory, Inference, and Learning Algorithms.
Suppose now that we wish to simulate from a distribution

p(x) = (1/Z_x) e^{H_x(x)}

where as usual Z_x is unknown. We now introduce a new auxiliary variable / vector y with

p(y) = (1/Z_y) e^{H_y(y)}.
We typically choose y to be Gaussian so that H_y(y) = −(1/2)yᵀy. We also assume x and y are independent so that

p(x, y) = p(x)p(y) = (1/(Z_x Z_y)) e^{H_x(x) + H_y(y)} = (1/Z) e^{H(x,y)}
where Z := Zx Zy and H(x, y) := Hx (x) + Hy (y). The goal is to define an MCMC algorithm for generating
samples of (x, y) with the stationary distribution p(x, y). Then once stationarity is reached we can simply
discard the y samples. The “trick” is to define the proposal distribution so that we can easily jump from one
mode (of p(x)) to another.
We can achieve this as follows: given the current sample x, we first draw a new y from the Gaussian p(y), giving a point (x, y). We then propose a new sample (x′, y′) satisfying

H(x′, y′) ≈ H(x, y)

so that it will be accepted with high probability in the M-H algorithm. We can achieve this by moving (approximately) along a contour of H from (x, y) to (x′, y′) = (x + ∆x, y + ∆y). A first-order Taylor approximation implies

H(x + ∆x, y + ∆y) ≈ H(x, y) + ∆xᵀ ∂H/∂x + ∆yᵀ ∂H/∂y.   (27)
To move (approximately) along a contour of H we would like to set the sum of the last two terms in (27) to 0. This is a 1-dimensional constraint so many solutions are possible. To identify a particular solution it is customary to use so-called Hamiltonian dynamics whereby

∆x = ε ∂H/∂y   and   ∆y = −ε ∂H/∂x

so that H(x′, y′) ≈ H(x, y) as desired. We take L such Hamiltonian steps, all with the same value of ε, which is drawn randomly according to

ε = +ε_0 with prob. 0.5,   ε = −ε_0 with prob. 0.5

so that the proposal distribution, Q(· | ·), is symmetric.
The variable x has the interpretation of position and the auxiliary variable y has the interpretation of
momentum. Typically, y has the same dimension as x so there is one momentum variable for each space
variable. The Hamiltonian dynamics, i.e. movement along a contour of H, can be implemented in a more sophisticated way than (27) via so-called leapfrog discretization. See, for example, Bishop’s PRML for details. In order to implement the algorithm we need to specify the parameters L and ε_0. The success10 of the algorithm is quite sensitive to these choices. A high-level version of the HMC algorithm is given in Algorithm 27.4 below, which is taken from Barber’s BRML.
Figure 27.9 (also taken from from Barber’s BRML) displays HMC in action in a one-dimensional example where
the distribution is bimodal. The distribution becomes bivariate with the addition of the auxiliary variable y and
we see in part (c) how the Hamiltonian dynamics enables the sampler to easily cross between the two islands of
high probability, i.e. the two modes.
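Here is a minimal one-dimensional HMC sketch in the spirit of Figure 27.9, with the leapfrog discretization playing the role of the Hamiltonian steps described above; the bimodal target and the tuning constants L and ε_0 are illustrative:

    import numpy as np

    rng = np.random.default_rng(9)

    # Bimodal target: H_x(x) is the log of an unnormalized two-component mixture.
    def Hx(x):
        return np.log(np.exp(-0.5 * (x + 3.0) ** 2) + np.exp(-0.5 * (x - 3.0) ** 2))

    def grad_Hx(x, h=1e-5):
        return (Hx(x + h) - Hx(x - h)) / (2 * h)   # numerical gradient (fine in 1-d)

    def hmc_step(x, L, eps0):
        y = rng.standard_normal()                    # fresh momentum: H_y(y) = -y^2/2
        eps = eps0 if rng.uniform() < 0.5 else -eps0 # +/- eps_0 w.p. 1/2: symmetric proposal
        H_old = Hx(x) - 0.5 * y ** 2                 # H(x, y) = H_x(x) + H_y(y)
        x_new, y_new = x, y
        y_new += 0.5 * eps * grad_Hx(x_new)          # leapfrog: half momentum step,
        for i in range(L):
            x_new += eps * y_new                     # ... full position steps,
            if i < L - 1:
                y_new += eps * grad_Hx(x_new)
        y_new += 0.5 * eps * grad_Hx(x_new)          # ... final half momentum step.
        H_new = Hx(x_new) - 0.5 * y_new ** 2
        # Metropolis correction: accept w.p. min(1, exp(H_new - H_old)).
        return x_new if np.log(rng.uniform()) < H_new - H_old else x

    x, samples = 0.0, []
    for _ in range(5000):
        x = hmc_step(x, L=20, eps0=0.2)
        samples.append(x)
    print("fraction of samples in the right mode:", np.mean(np.array(samples) > 0))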
10 These tuning issues are handled automatically in the new and popular STAN software, which was developed mainly by a team at Columbia University!
11 Our brief development of empirical Bayes follows Section 6.1 of the recent text Computer Age Statistical Inference: Al-
gorithms, Evidence, and Data Science by Efron and Hastie. The rest of that chapter as well as Chapter 21 and other more
advanced applications elsewhere in the text demonstrate the now widespread applicability of the empirical Bayes approach.
The text is free to download from Cambridge University Press if you’re on the Columbia network.
Figure 27.9 (Taken from Barber’s BRML): Hybrid Monte Carlo. (a): Multi-modal distribution p(x) for
which we desire samples. (b): HMC forms the joint distribution p(x)p(y) where p(y) is Gaussian. (c): This
is a plot of (b) from above. Starting from the point x, we first draw a y from the Gaussian p(y), giving
a point (x, y), given by the green line. Then we use Hamiltonian dynamics (white line) to traverse the
distribution at roughly constant energy for a fixed number of steps, giving x′, y′. We accept this point if H(x′, y′) > H(x, y) and make the new sample x′ (red line). Otherwise this candidate is accepted with probability exp(H(x′, y′) − H(x, y)). If rejected, the new sample x′ is taken as a copy of x.
Claims x       0      1      2      3     4     5     6     7
Counts y_x     7840   1317   239    42    14    4     4     1
Formula (31)   .168   .363   .527   1.33  1.43  6.00  1.25  -
Gamma MLE      .164   .398   .633   .87   1.10  1.34  1.57  -
Table 1: Counts yx of number of claims x made in a single year by 9461 automobile insurance policy holders.
Robbins’ formula (31) estimates the number of claims expected in a succeeding year, for instance 0.168 for
a customer in the x = 0 category. Parametric maximum likelihood analysis based on a gamma prior gives
less noisy estimates.
The number of claims X_k made in a single year by policy holder k is assumed to be Poisson(θ_k):

P(X_k = x) = p_{θ_k}(x) := e^{−θ_k} θ_k^x / x!,   x = 0, 1, 2, . . . .   (28)
We also assume that the θk ’s are random with prior g(θ). Consider now an individual customer with number of
claims x last year. Then we have (why?)
E[θ | x] = ∫_0^∞ θ p_θ(x) g(θ) dθ / ∫_0^∞ p_θ(x) g(θ) dθ.   (29)
Note that (29) would also yield the expected number of claims made by the customer next year since (why?)
E[θ | x] = E[X | x]. So formula (29) is what the insurance company needs to answer its question if it already
knows the prior g(·). For example, if the company assumes g is Gamma(ν, σ) with ν and σ known, then there is
no problem calculating (29). But how would we choose “good” values of ν and σ? A typical Bayesian approach
would in fact assume they are unknown and would therefore place a hyper-prior (with known parameters) on
(ν, σ). In that case considerably more work would be required to compute g and calculate (29).
Alternatively we can be a little clever! Using (28) and (29) we have

E[θ | x] = ∫_0^∞ [e^{−θ} θ^{x+1}/x!] g(θ) dθ / ∫_0^∞ [e^{−θ} θ^x/x!] g(θ) dθ
= (x + 1) ∫_0^∞ [e^{−θ} θ^{x+1}/(x + 1)!] g(θ) dθ / ∫_0^∞ [e^{−θ} θ^x/x!] g(θ) dθ
= (x + 1) f(x + 1) / f(x)   (30)
where f(x) = ∫_0^∞ p_θ(x) g(θ) dθ is the marginal density of X. From (30) it is clear that to answer the insurance company’s question we only need f(·) and not g(·). But we have a lot of data and can easily estimate f(·) directly to obtain Robbins’ approximation

Ê[θ | x] = (x + 1) f̂(x + 1) / f̂(x) = (x + 1) y_{x+1} / y_x   (31)

with y_x denoting the number of observations with x claims. That is, we estimate f(x) with f̂(x) = y_x/N where N = 9461. We see the values of Ê[θ | x] in the third row of Table 1.
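Reproducing the “Formula (31)” row of Table 1 is a one-liner once the counts are entered:

    import numpy as np

    # Counts y_x of policy holders making x = 0, 1, ..., 7 claims (Table 1).
    y = np.array([7840, 1317, 239, 42, 14, 4, 4, 1])

    # Robbins' formula (31): E-hat[theta | x] = (x + 1) * y_{x+1} / y_x.
    x = np.arange(len(y) - 1)
    print(np.round((x + 1) * y[1:] / y[:-1], 3))  # compare with the Formula (31) row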
Note that the values at the end of the third row in Table 1 seem to go awry. This is because formula (31) has
become unstable at that point due to the small count numbers in the data for policies that had 5 or more
claims. We can help resolve this issue by using a parametric empirical Bayesian approach in contrast to the
non-parametric approach outlined above.
Example 16 (A parametric empirical Bayes approach) We now assume the prior is a gamma distribution

g(θ) = θ^{ν−1} e^{−θ/σ} / (σ^ν Γ(ν)),   θ ≥ 0
with (ν, σ) unknown. Instead of placing a (hyper-) prior on (ν, σ) we can estimate them from the data by
explicitly computing (how?) the marginal density f (x) which now has parameters ν and σ. We then simply
compute12 the maximum likelihood estimators ν̂ and σ̂ to obtain

Ê[θ | x] = (x + 1) f_{ν̂,σ̂}(x + 1) / f_{ν̂,σ̂}(x)   (32)
as our estimator. The fourth row of Table 1 was obtained using (32).
Exercise 14 Explain how you would compute an explicit expression for f_{ν,σ}(x) in Example 16.
According to Efron and Hastie, Robbins’ formula came as a surprise to the statistical world since Ê[θ_k | x_k], previously unavailable without the prior g, suddenly became available by leveraging the information in data from (a large number of) similar cases. It’s interesting to note that many eminent statisticians including Robbins, Fisher, von Mises and others developed empirical Bayesian estimators, but the approach, which was often criticized for being neither Bayesian nor frequentist, is now quite standard and has grown in popularity in the “big-data” era where massive parallel data-sets are now quite common.
Section 6.2 of Efron and Hastie describes the first known application of empirical Bayes. It was developed by Ronald Fisher, who used it to solve a missing-species problem concerned with estimating the number of butterfly species in Malaysia during World War II. Efron and Hastie then go on to describe how the same methods can be (and have been) used to estimate the total number of words in Shakespeare’s vocabulary.