We use the generic p(·) notation for densities when there is no danger of confusion.
where

    s = t(y) = \sum_{i=1}^{n} y_i
is the observed number of successes. Here the likelihood depends on the data y
only through the value of t(y), which is said to be a sufficient statistic. Since
the likelihood depends on the data only through t(y), the posterior, too, depends
on the data only through the value of t(y).
In a more general situation, a statistic t(Y) is called sufficient if the likelihood
can be factored as

    p(y \mid θ) = g(t(y), θ)\, h(y),

and therefore the posterior depends on the data only through the value t(y) of
the sufficient statistic.
In other words, we might as well throw away the original data as soon as we
have calculated the value of the sufficient statistic. (Do not try this at home.
You might later want to consider other likelihoods for your data!) Sufficient
statistics are very convenient, but not all likelihoods admit a sufficient statistic
of fixed dimension when the sample size is allowed to vary. Such sufficient
statistics exist only in what are known as exponential families; see, e.g., the text
of Schervish [6, Ch. 2] for a discussion.
In the Bernoulli trial example, the random variable S corresponding to the
sufficient statistic
    S = t(Y) = \sum_{i=1}^{n} Y_i
has the binomial distribution Bin(n, θ) with sample size n and success proba-
bility θ. That is, if we observe only the number of successes s (but not the order in
which the successes and failures happened), then the likelihood is given by

    p(s \mid θ) = \binom{n}{s} θ^{s} (1 − θ)^{n−s}, \qquad 0 < θ < 1.   (5.2)
The two functions (5.1) and (5.2) describe the same experiment, and are
proportional to each other (as functions of θ). The difference stems from the
fact that there are exactly \binom{n}{s} equally probable sequences y_1, \dots, y_n which sum
to a given value of s, where 0 ≤ s ≤ n. Since the two functions are proportional
to each other, we will get the same posterior with either of them if we use the
same prior. Therefore it does not matter which of the expressions (5.1) and (5.2)
we use as the likelihood for a binomial experiment.
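To make the proportionality argument concrete, here is a minimal sketch (in Python, with assumed illustrative data) that evaluates both forms of the likelihood on a grid and checks that they lead to the same normalized posterior under a uniform prior; the data, grid size and prior are hypothetical choices, not part of the text.

import numpy as np
from scipy.stats import binom

# Assumed data: n = 10 Bernoulli trials with s = 7 successes.
y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])
n, s = y.size, y.sum()

theta = np.linspace(0.001, 0.999, 1000)    # grid over the parameter
h = theta[1] - theta[0]
prior = np.ones_like(theta)                # uniform prior, for illustration

lik_bernoulli = theta**s * (1 - theta)**(n - s)   # form (5.1): product of Bernoulli terms
lik_binomial = binom.pmf(s, n, theta)             # form (5.2): includes the factor binom(n, s)

def normalize(unnorm):
    """Normalize an unnormalized density evaluated on the equally spaced grid."""
    return unnorm / (unnorm.sum() * h)

post1 = normalize(lik_bernoulli * prior)
post2 = normalize(lik_binomial * prior)
print(np.allclose(post1, post2))   # True: the constant binom(n, s) cancels in the normalization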
Observations.
• When calculating the posterior, you can always omit from the likelihood any
factors that depend only on the data and not on the parameter. Doing so does
not affect the posterior.
where n_j is the number of the y_i's which take on the value j. This is the
multinomial likelihood. Clearly the frequencies n_1, \dots, n_k form a sufficient statistic.
Notice that \sum_j n_j = n.
In this case it is possible to work out the distribution of the sufficient statistic,
i.e., the random frequency vector N = (N1 , . . . , Nk ), where
Nj = #{i = 1, . . . , n : Yi = j}, j = 1, . . . , k.
    P(N_1 = n_1, N_2 = n_2, \dots, N_k = n_k \mid θ_1, θ_2, \dots, θ_k)
        = \binom{n}{n_1, n_2, \dots, n_k}\, θ_1^{n_1} θ_2^{n_2} \cdots θ_k^{n_k},   (5.4)
when the integers 0 ≤ n_1, \dots, n_k ≤ n and \sum_j n_j = n. Here
    \binom{n}{n_1, n_2, \dots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}   (5.5)
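As a small illustration of (5.4) and (5.5), the following sketch (with assumed counts and probabilities) evaluates the multinomial probability of one frequency vector directly and checks it against scipy.

import numpy as np
from math import factorial
from scipy.stats import multinomial

def multinomial_pmf(counts, probs):
    """Evaluate (5.4) for one frequency vector (n_1, ..., n_k)."""
    counts = np.asarray(counts)
    coeff = factorial(counts.sum())
    for nj in counts:
        coeff //= factorial(nj)                      # multinomial coefficient (5.5)
    return coeff * np.prod(np.asarray(probs, dtype=float) ** counts)

# Assumed illustrative numbers: n = 10 observations over k = 3 categories.
print(multinomial_pmf([5, 3, 2], [0.5, 0.3, 0.2]))
print(multinomial.pmf([5, 3, 2], n=10, p=[0.5, 0.3, 0.2]))   # same value, via scipy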
    \{θ ↦ f(θ \mid φ) : φ \in S\},   (5.6)

    θ ↦ p(θ \mid y) = f(θ \mid φ_1),

where φ_1 ∈ S. In order to find the posterior, we only need to find the value of
the updated hyperparameter vector φ_1 = φ_1(y).
If the densities f (θ | φ) of the conjugate family have an easily understood
form, then Bayesian inference is simple, provided we can approximate our prior
where

    c(y) = \int_0^1 θ^{a+s−1} (1 − θ)^{b+n−s−1}\, dθ = B(a + s, b + n − s),
where the last step is immediate, since the integral is the normalizing
constant of the beta density Be(θ | a_1, b_1), where a_1 = a + s and b_1 =
b + n − s. Therefore the posterior is the beta distribution Be(a + s, b + n − s).
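The conjugate update amounts to a few lines of code. The sketch below (with assumed prior hyperparameters and data) computes the updated hyperparameters a_1 = a + s and b_1 = b + n − s and the corresponding normalizing constant.

import numpy as np
from scipy.stats import beta
from scipy.special import betaln

# Assumed numbers: prior Be(a, b), binomial data with n trials and s successes.
a, b = 2.0, 2.0
n, s = 10, 7

a1, b1 = a + s, b + n - s              # updated hyperparameters
posterior = beta(a1, b1)               # posterior is Be(theta | a1, b1)

print("posterior mean:", posterior.mean())
print("log B(a1, b1):", betaln(a1, b1))   # the integral c(y) above equals B(a1, b1)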
which is shorthand notation for the statement that the RVs Yi , i = 1, . . . , n are
independently Poisson distributed with parameter θ. Then
    p(y_i \mid θ) = \frac{1}{y_i!}\, θ^{y_i} e^{−θ}, \qquad y_i = 0, 1, 2, \dots
The likelihood has the functional form of a gamma density. If the prior for θ is
the gamma distribution Gam(a, b) with known hyperparameters a, b > 0, i.e., if
    p(θ) = \frac{b^a}{Γ(a)}\, θ^{a−1} e^{−bθ}, \qquad θ > 0,
then
and from this we recognize that the posterior is the gamma distribution

    \mathrm{Gam}\!\left(a + \sum_{i=1}^{n} y_i, \; b + n\right).
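A corresponding sketch of the Poisson–gamma update, with assumed hyperparameters and counts; note that scipy parametrizes the gamma distribution by shape and scale = 1/rate.

import numpy as np
from scipy.stats import gamma

# Assumed numbers: prior Gam(a, b) (shape a, rate b) and observed counts y_1, ..., y_n.
a, b = 2.0, 1.0
y = np.array([3, 0, 2, 4, 1])

a1, b1 = a + y.sum(), b + y.size              # posterior is Gam(a + sum(y), b + n)
posterior = gamma(a=a1, scale=1.0 / b1)       # scipy uses shape and scale = 1/rate

print("posterior mean:", posterior.mean())    # (a + sum y) / (b + n)
print("95% equal-tail interval:", posterior.interval(0.95))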
Suppose that the prior is N(µ_0, σ_0^2) with known constants µ_0 and σ_0^2. Then the
posterior is
where

    q(θ) = \frac{1}{τ^2}(y − θ)^2 + \frac{1}{σ_0^2}(θ − µ_0)^2
is a second degree polynomial in θ, and the coefficient of θ^2 in q(θ) is positive.
Therefore the posterior is a certain normal distribution. However, we need to
calculate its mean µ_1 and variance σ_1^2. This we achieve by completing the square
in the quadratic polynomial q(θ). However, we need only to keep track of the
first and second degree terms.
Developing the density N(θ | µ_1, σ_1^2) as a function of θ, we obtain

    N(θ \mid µ_1, σ_1^2) = \frac{1}{σ_1 \sqrt{2π}} \exp\!\left(−\frac{1}{2}\, \frac{(θ − µ_1)^2}{σ_1^2}\right)
        ∝ \exp\!\left(−\frac{1}{2}\left(\frac{1}{σ_1^2}\, θ^2 − 2\, \frac{µ_1}{σ_1^2}\, θ\right)\right)
Next, we equate the coefficients of θ^2 and θ, firstly, in q(θ) and, secondly, in the
previous formula to find out that we have

    p(θ \mid y) = N(θ \mid µ_1, σ_1^2),
where

    \frac{1}{σ_1^2} = \frac{1}{τ^2} + \frac{1}{σ_0^2}, \qquad \frac{µ_1}{σ_1^2} = \frac{y}{τ^2} + \frac{µ_0}{σ_0^2},   (5.7)
from which we can solve first σ_1^2 and then µ_1.
In Bayesian inference it is often convenient to parametrize the normal distri-
bution by its mean and precision, where precision is defined as the reciprocal of
the variance. We have just shown that the posterior precision equals the prior
precision plus the datum precision.
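In code, the update (5.7) is just an addition of precisions; the sketch below uses assumed illustrative numbers.

import numpy as np

def normal_update(mu0, sigma0_sq, y, tau_sq):
    """One observation y ~ N(theta, tau_sq) with prior theta ~ N(mu0, sigma0_sq).
    Returns the posterior mean and variance from (5.7), using precisions."""
    prior_prec = 1.0 / sigma0_sq
    datum_prec = 1.0 / tau_sq
    post_prec = prior_prec + datum_prec          # posterior precision = prior + datum precision
    mu1 = (y * datum_prec + mu0 * prior_prec) / post_prec
    return mu1, 1.0 / post_prec

# Assumed illustrative numbers:
print(normal_update(mu0=0.0, sigma0_sq=4.0, y=2.5, tau_sq=1.0))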
If we have n independent observations Y_i ∼ N(θ, τ^2) with a known variance,
then it is a simple matter to show that the same updating formulas apply with
the single observation y replaced by the sample mean

    \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

and the datum variance τ^2 replaced by τ^2/n.
Suppose now that

    Y_i \mid θ \;\overset{\text{i.i.d.}}{\sim}\; N\!\left(µ, \frac{1}{θ}\right), \qquad i = 1, \dots, n,
where the mean µ is known but the variance 1/θ is unknown. Notice that
we parametrize the sampling distribution using the precision θ instead of the
variance 1/θ. Then
    p(y_i \mid θ) = \frac{\sqrt{θ}}{\sqrt{2π}} \exp\!\left(−\frac{1}{2}\, θ (y_i − µ)^2\right),
The previous result can be expressed also in terms of the variance φ = 1/θ.
The variance has what is known as the inverse gamma distribution with density
    \mathrm{Gam}\!\left(\frac{1}{φ} \,\Big|\, a_1, b_1\right) \frac{1}{φ^2}, \qquad φ > 0,
where a1 and b1 are the just obtained updated parameters, as can be established
by the change of variable φ = 1/θ in the posterior density. The inverse gamma
distribution is also called the scaled inverse chi-square distribution, using a
certain other convention for the parametrization.
In the present case, this integral can be solved analytically, and the marginal
posterior of φ can be shown to be a t-distribution.
That is,

    (x − µ)^T Q (x − µ) = x^T Q x − 2 x^T Q µ + µ^T Q µ

(for any symmetric matrix Q), which should be compared with the familiar
formula (a − b)^2 = a^2 − 2ab + b^2 valid for scalars a and b.
Therefore, as a function of x,

    N_d(x \mid µ, Q^{−1}) ∝ \exp\!\left(−\frac{1}{2}\left(x^T Q x − 2 x^T Q µ\right)\right).   (5.8)
where the scalar c does not depend on θ. Comparing this result with (5.8), we
see that the posterior is the multivariate normal N_d(µ_1, Q_1^{−1}), where

    Q_1 = Q_0 + R, \qquad Q_1 µ_1 = Q_0 µ_0 + R y.   (5.9)
Again, posterior precision equals the prior precision plus the datum precision.
In this manner one can identify the parameters of a multivariate normal distri-
bution, by completing the square.
As in the univariate case, this result can be extended to several (condi-
tionally) independent observations, and also to the case where both the mean
vector and the precision matrix are (partially) unknown, when we employ an
appropriate conjugate prior.
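A minimal sketch of the multivariate update (5.9), with assumed two-dimensional numbers; the posterior mean is obtained by solving the linear system Q_1 µ_1 = Q_0 µ_0 + R y rather than by inverting Q_1.

import numpy as np

def mvn_posterior(mu0, Q0, y, R):
    """Posterior of theta for y ~ N_d(theta, R^{-1}) with prior theta ~ N_d(mu0, Q0^{-1}).
    Implements (5.9): Q1 = Q0 + R and Q1 mu1 = Q0 mu0 + R y."""
    Q1 = Q0 + R
    mu1 = np.linalg.solve(Q1, Q0 @ mu0 + R @ y)
    return mu1, Q1

# Assumed illustrative numbers in d = 2 dimensions (precision matrices, not covariances):
mu0 = np.zeros(2)
Q0 = np.eye(2)                       # prior precision
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # datum precision
y = np.array([1.0, -0.5])
mu1, Q1 = mvn_posterior(mu0, Q0, y, R)
print(mu1)
print(Q1)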
    Y_i = x_i^T β + ε_i, \qquad i = 1, \dots, n,

    p(β) = N_p(β \mid µ_0, Q_0^{−1})

    p(β \mid y) = N_p(β \mid µ_1, Q_1^{−1}),

with

    Q_1 = Q_0 + τ X^T X, \qquad Q_1 µ_1 = Q_0 µ_0 + τ X^T y.   (5.11)
This follows similarly as in the previous section, namely by completing the square
in the exponent. Comparing the result with (5.8), we get the previously announced
formulas (5.11).
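The regression update (5.11) can be sketched in the same way; the design matrix, noise precision τ and the weak prior below are assumed for illustration.

import numpy as np

def bayes_linreg(X, y, tau, mu0, Q0):
    """Posterior of beta in y = X beta + eps with known noise precision tau,
    prior beta ~ N_p(mu0, Q0^{-1}); implements (5.11)."""
    Q1 = Q0 + tau * X.T @ X
    mu1 = np.linalg.solve(Q1, Q0 @ mu0 + tau * X.T @ y)
    return mu1, Q1

# Assumed illustrative data: n = 50 points, p = 2 coefficients.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta_true = np.array([1.0, 2.0])
tau = 4.0                                       # noise precision (variance 1/tau)
y = X @ beta_true + rng.normal(scale=1 / np.sqrt(tau), size=50)

mu1, Q1 = bayes_linreg(X, y, tau, mu0=np.zeros(2), Q0=0.01 * np.eye(2))
print("posterior mean:", mu1)                   # close to beta_true under this weak prior
print("posterior covariance:", np.linalg.inv(Q1))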
    \mathrm{Wish}_d(X \mid α, B) = \frac{\det(B)^{α}}{Γ_d(α)}\, \det(X)^{α−(d+1)/2} \exp(−\operatorname{tr}(B X)), \qquad X > 0.
Here Γ_d(·) is the generalized gamma function (see (A.6)), and tr(M) denotes the
trace of the square matrix M, i.e.,

    \operatorname{tr}(M) = \sum_i m_{ii}.
Further, the qualification X > 0 means that the above expression applies when
X is not only symmetric but also positive definite, otherwise the Wishart pdf
is zero. The Wishart density is the joint pdf of the d(d + 1)/2 distinct entries
of the symmetric matrix X, e.g., the elements xij , i ≥ j on or below the main
diagonal. When d = 1, then Wishd (x | α, β) reduces to Gam(x | α, β).
E.g., when d = 2, the symmetric matrix X can be written using only
the elements x_{11}, x_{21} and x_{22} as follows,

    X = \begin{pmatrix} x_{11} & x_{21} \\ x_{21} & x_{22} \end{pmatrix}.
A symmetric matrix is positive definite if and only if all of its eigenvalues are
positive, or equivalently, if and only if all of its leading principal minors are positive. In this
2 × 2 case, X = [x_{ij}] is positive definite if and only if x_{11} > 0 and x_{11} x_{22} − x_{21}^2 > 0.
Writing out the determinant and the trace in terms of the matrix elements, we
obtain the following expression for the joint density of the Wishart distribution,
    \mathrm{Wish}_2(x_{11}, x_{21}, x_{22} \mid α, B) = \frac{\det(B)^{α}}{\sqrt{π}\, Γ(α)\, Γ(α − \tfrac{1}{2})}\,
        (x_{11} x_{22} − x_{21}^2)^{α − \frac{3}{2}} \exp\!\left(−(β_{11} x_{11} + 2 β_{21} x_{21} + β_{22} x_{22})\right),
if x_{11} > 0 and x_{11} x_{22} − x_{21}^2 > 0, and the pdf is zero otherwise.
Now we assume that

    Y_i \mid Q \;\overset{\text{i.i.d.}}{\sim}\; N_d(µ, Q^{−1}), \qquad i = 1, \dots, n.
Then the likelihood is

    p(y \mid Q) ∝ \det(Q)^{n/2} \exp\!\left(−\frac{1}{2} \sum_{i=1}^{n} (y_i − µ)^T Q (y_i − µ)\right).
Recall the trace identity

    \operatorname{tr}(C D) = \operatorname{tr}(D C).

Let us denote

    S = \sum_{i=1}^{n} (y_i − µ)(y_i − µ)^T.
When we use the above trace identity in the likelihood and combine likelihood
with the prior, we get
    p(Q \mid y) ∝ \det(Q)^{α − \frac{1}{2}(d+1)} \exp(−\operatorname{tr}(B Q)) \cdot \det(Q)^{n/2} \exp\!\left(−\frac{1}{2}\operatorname{tr}(S Q)\right)
        = \det(Q)^{α + n/2 − \frac{1}{2}(d+1)} \exp\!\left(−\operatorname{tr}\!\left(\left(B + \tfrac{1}{2} S\right) Q\right)\right),
which shows that the posterior is

    \mathrm{Wish}_d\!\left(α + \frac{n}{2}, \; B + \frac{1}{2} \sum_{i=1}^{n} (y_i − µ)(y_i − µ)^T\right).
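In code the update only requires accumulating the matrix S of outer products; the sketch below uses assumed data, and the posterior mean formula α_1 B_1^{−1} in the final comment refers to the parametrization used above (it reduces to the gamma mean a/b when d = 1).

import numpy as np

def wishart_posterior(alpha, B, y, mu):
    """Posterior hyperparameters for the precision matrix Q of N_d(mu, Q^{-1}) data
    under the Wish_d(alpha, B) prior in the parametrization used above:
    the posterior is Wish_d(alpha + n/2, B + S/2), with S = sum (y_i - mu)(y_i - mu)^T."""
    resid = y - mu                               # n x d array of residuals
    S = resid.T @ resid                          # sum of outer products
    n = y.shape[0]
    return alpha + n / 2.0, B + 0.5 * S

# Assumed illustrative data in d = 2 dimensions.
rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0])
y = rng.multivariate_normal(mu, np.diag([1.0, 0.5]), size=100)
alpha1, B1 = wishart_posterior(alpha=3.0, B=np.eye(2), y=y, mu=mu)
print(alpha1 * np.linalg.inv(B1))                # posterior mean of Q, i.e. alpha1 * B1^{-1}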
p(φ | ψ),
p(φ | ψ, y).
where the mean and the precision matrix of the conditional prior of β are arbi-
trary functions of τ , then the full conditional of β can be obtained from equa-
tions (5.11), since τ is considered known in p(β | τ, y).
On the other hand, if the prior distribution is of the form
then an easy calculation shows that the full conditional of τ is also a gamma
distribution.
5.7 Reparametrization
Suppose that we have formulated a Bayesian statistical model in terms of a
parameter vector θ with a continuous distribution, but then want to reformu-
late it in terms of a new parameter vector φ, where there is a diffeomorphic
correspondence between θ and φ. I.e., the correspondence
φ = g(θ) ⇔ θ = h(φ)
Suppose that the prior is specified only up to proportionality,

    p(θ) ∝ h(θ),

where the function h integrates to infinity. Then there does not exist a constant of proportionality that will allow p(θ) to
be a proper density, i.e., to integrate to one. In that case we have an improper
prior. Notice that this is different from expressing the prior by means of
an unnormalized density h which can be normalized to be a proper density.
Sometimes we get a proper posterior if we multiply an improper prior with the
likelihood and then normalize.
For example, consider one normally distributed observation Y ∼ N(θ, τ^2)
with a known variance τ^2, and take

    p(θ) ∝ 1, \qquad θ ∈ \mathbb{R}.
This prior is intended to represent complete prior ignorance about the unknown
mean: all possible values are deemed equally likely. Calculating formally,
    p(θ \mid y) ∝ p(y \mid θ)\, p(θ) ∝ \exp\!\left(−\frac{1}{2τ^2}(y − θ)^2\right) ∝ N(θ \mid y, τ^2).
We obtain the same result in the limit, if we take N(µ_0, σ_0^2) as the prior and
then let the prior variance σ_0^2 go to infinity.
One often uses improper priors in a location-scale model, with a location
parameter µ and a scale parameter σ. Then it is conventional to take the prior
of the location parameter to be uniform, to let the logarithm of the scale
parameter σ have a uniform distribution, and to take the two to be independent in
their improper prior. This translates to an improper prior of the form
    p(µ, σ) ∝ \frac{1}{σ}, \qquad µ ∈ \mathbb{R}, \; σ > 0,   (5.13)

by using (formally) the change of variables formula,

    p(σ) = p(τ)\, \frac{dτ}{dσ} ∝ \frac{1}{σ},
Here 0 < α < 1 is some fixed number, such that the required coverage proba-
bility is 1 − α; usual choices are α = 0.05 or α = 0.1 corresponding to 95 % and
90 % probability intervals, respectively. Some authors call such intervals prob-
ability intervals, credible intervals (or credibility intervals) or Bayesian
confidence intervals.
The posterior intervals have the direct probabilistic interpretation (5.14).
In contrast, the confidence intervals of frequentist statistics have probability
interpretations only with reference to (hypothetical) sampling of the data under
identical conditions.
Within the frequentist framework, the parameter is an unknown determinis-
tic quantity. A frequentist confidence interval either covers or does not cover the
true parameter value. A frequentist statistician constructs a frequentist confi-
dence interval at significance level 100α % in such a way that if it were possible
to sample the data repeatedly under identical conditions (i.e., using the same
value for the parameter), then the relative frequency of coverage in a long run
of repetitions would be about 1 − α. But for the data at hand, the calculated
frequentist confidence interval still either covers or does not cover the true pa-
rameter value, and we do not have guarantees for anything more. Many naive
users of statistics (and even some textbook authors) mistakenly believe that
their frequentist confidence intervals have the simple probability interpretation
belonging to posterior intervals.
The coverage requirement (5.14) needs to be supplemented by other crite-
ria. To demonstrate why this is so, let q be the quantile function of the posterior
(which may be explicitly available in simple conjugate situations and may be
approximated with the empirical quantile function in other situations). Then
the interval
[q(t), q(1 − (α − t))]
has the required coverage probability 1 − α for any value 0 < t < α; cf. (2.5)
and (2.6).
In practice it is easiest to use the equal tail (area) interval (or central
interval), whose end points are selected so that α/2 of the posterior probability
lies to the left and α/2 to the right of the interval. By the definition of the
quantile function, the equal tail posterior interval is given by
[q(α/2), q(1 − α/2)]. (5.15)
If the quantile function is not available, but one has available a sample θ1 , . . . , θN
from the posterior, then one can use the empirical quantiles calculated from the
sample.
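Computing the equal tail interval (5.15) from a posterior sample takes one line with empirical quantiles; the Be(8, 5) sample below is an assumed stand-in for draws from an actual posterior.

import numpy as np

def equal_tail_interval(theta_sample, alpha=0.05):
    """Equal tail posterior interval (5.15) from a posterior sample, via empirical quantiles."""
    return np.quantile(theta_sample, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(2)
theta_sample = rng.beta(8, 5, size=10_000)      # assumed posterior sample
print(equal_tail_interval(theta_sample, alpha=0.05))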
Many authors recommend the highest posterior density (HPD) region,
which is defined as the set
Ct = {θ : fΘ|Y (θ | y) ≥ t},
where the threshold t is chosen so that P(Θ ∈ C_t) = 1 − α.
Often (but not always) the HPD region turns out to be an interval. Then it
can be proven to be the shortest interval with the desired coverage 100(1 − α) %.
However, calculating an HPD interval is more difficult than calculating an equal
tail interval.
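When a posterior sample is available and the HPD region is an interval, a simple approximation is to take the shortest interval containing a fraction 1 − α of the sorted draws; this is a sketch of that idea, again with an assumed Be(8, 5) sample.

import numpy as np

def hpd_interval(theta_sample, alpha=0.05):
    """Approximate HPD interval from a posterior sample: among all intervals containing
    a fraction 1 - alpha of the sorted draws, return the shortest one.
    (Assumes the HPD region is an interval.)"""
    x = np.sort(theta_sample)
    n = x.size
    m = int(np.ceil((1 - alpha) * n))            # number of draws the interval must contain
    widths = x[m - 1:] - x[:n - m + 1]
    i = np.argmin(widths)
    return x[i], x[i + m - 1]

rng = np.random.default_rng(3)
theta_sample = rng.beta(8, 5, size=10_000)       # assumed posterior sample
print(hpd_interval(theta_sample, alpha=0.05))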
In a multiparameter situation one usually examines one parameter at a time.
Let φ be the scalar parameter of interest in θ = (φ, ψ), and suppose that we
have available a sample
5.11 Literature
See, e.g., Bernardo and Smith [1] for further results on conjugate analysis. The
books by Gelman et al. [4], Carlin and Louis [2] and O’Hagan and Forster [5]
are rich sources of ideas on Bayesian modeling and analysis. Sufficiency is a
central concept in parametric statistics. See, e.g., Schervish [6] for a discussion.
Bibliography
[1] José M. Bernardo and Adrian F. M. Smith. Bayesian Theory. John Wiley
& Sons, 2000. First published in 1994.
[2] Bradley P. Carlin and Thomas A. Louis. Bayesian Methods for Data Anal-
ysis. Chapman & Hall/CRC, 3rd edition, 2009.
[3] Ronald Christensen, Wesley Johnson, Adam Branscum, and Timothy E.
Hanson. Bayesian Ideas and Data Analysis: An Introduction for Scientists
and Statisticians. Texts in Statistical Science. CRC Press, 2011.
[4] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC Press, 2nd edition, 2004.
[5] Anthony O’Hagan and Jonathan Forster. Bayesian Inference, volume 2B of
Kendall’s Advanced Theory of Statistics. Arnold, second edition, 2004.
[6] Mark J. Schervish. Theory of Statistics. Springer series in statistics.
Springer-Verlag, 1995.
Chapter 6
Approximations
Throughout this chapter we write the posterior in the form

    f_{Θ|Y}(θ \mid y) = \frac{1}{c(y)}\, q(θ \mid y),
where we know how to evaluate the unnormalized density q(θ | y), but do not
necessarily know the value of the normalizing constant c(y).
Instead of the original parameter space, we consider a finite interval [a, b],
which should cover most of the mass of the posterior distribution. We divide
[a, b] evenly into N subintervals

    B_i = [a + (i − 1)h, \; a + ih], \quad i = 1, \dots, N,

of length h = (b − a)/N, and denote their midpoints by

    θ_i = a + (i − \tfrac{1}{2})\, h, \quad i = 1, \dots, N.
We use the midpoint rule for numerical integration. This means that we
approximate the integral over the i'th subinterval of any function g by the rule

    \int_{B_i} g(θ)\, dθ ≈ h\, g(θ_i).   (6.1)
Using the midpoint rule on each of the subintervals, we get the following
approximation for the normalizing constant,

    c(y) = \int q(θ \mid y)\, dθ ≈ h \sum_{j=1}^{N} q(θ_j \mid y).   (6.2)
Using this approximation, we can approximate the value of the posterior density
at the point θ_i,

    f_{Θ|Y}(θ_i \mid y) = \frac{1}{c(y)}\, q(θ_i \mid y) ≈ \frac{q(θ_i \mid y)}{h \sum_{j=1}^{N} q(θ_j \mid y)}.   (6.3)
By following the same reasoning which led to (6.2), we may form the approximation

    \int k(θ)\, q(θ \mid y)\, dθ ≈ h \sum_{i=1}^{N} k(θ_i)\, q(θ_i \mid y)

basically for any function k such that k(θ) q(θ | y) differs appreciably from
zero only on the interval (a, b). This can be used to approximate the posterior
expectation of an arbitrary function k(θ) of the parameter, by
    E(k(Θ) \mid Y = y) = \int k(θ)\, f_{Θ|Y}(θ \mid y)\, dθ = \frac{\int k(θ)\, q(θ \mid y)\, dθ}{\int q(θ \mid y)\, dθ}
        ≈ \frac{\sum_{i=1}^{N} k(θ_i)\, q(θ_i \mid y)}{\sum_{j=1}^{N} q(θ_j \mid y)}.   (6.5)
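The whole grid scheme (6.1)–(6.5) fits in a few lines. The sketch below uses an assumed unnormalized posterior with a Be(8, 5) shape, so the approximate posterior mean can be checked against the exact value 8/13.

import numpy as np

def q(theta):
    """Assumed unnormalized posterior: q(theta | y) = theta^7 (1 - theta)^4."""
    return theta**7 * (1 - theta)**4

a, b, N = 0.0, 1.0, 1000
h = (b - a) / N
theta = a + (np.arange(1, N + 1) - 0.5) * h      # midpoints theta_i

q_vals = q(theta)
post_density = q_vals / (h * q_vals.sum())       # (6.3): normalized density on the grid

def post_mean(k):
    """Posterior expectation of k(theta) via (6.5); the factor h cancels in the ratio."""
    return np.sum(k(theta) * q_vals) / q_vals.sum()

print(post_mean(lambda t: t))          # approx. 8/13, the exact Be(8, 5) mean
print(post_mean(lambda t: t**2))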
and

    A = −\frac{∂^2}{∂θ^2} \log f_{Θ|Y}(θ \mid y)\Big|_{θ=\hat{θ}} = −\frac{∂^2}{∂θ^2} \log q(θ \mid y)\Big|_{θ=\hat{θ}}.
• The first and higher order (partial) derivatives with respect to θ of log q(θ | y)
and log f_{Θ|Y}(θ | y) agree, since these functions differ only by an additive
constant (which depends on y but not on θ).
• The first order term of the Taylor expansion disappears, since θ̂ is also the
mode of the log-posterior log fΘ|Y (θ | y).
at least in the vicinity of the mode θ̂. Luckily, we recognize that πapprox (θ)
is an unnormalized form of the density of the normal distribution with mean
θ̂ and variance 1/A. The end result is that the posterior distribution can be
approximated with the normal distribution
    N\!\left(\hat{θ}, \; \frac{1}{−L''(\hat{θ})}\right),   (6.7)

where L(θ) = \log q(θ \mid y) and L''(\hat{θ}) is the second derivative of L(θ) evaluated at the mode θ̂.
The multivariate analog of the result starts with the second degree expansion
of the log-posterior centered on its mode θ̂,
    \log f_{Θ|Y}(θ \mid y) ≈ \log f_{Θ|Y}(\hat{θ} \mid y) + 0 − \frac{1}{2}\, (θ − \hat{θ})^T A\, (θ − \hat{θ}),
where A is the negative Hessian matrix of L(θ) = log q(θ | y) evaluated at the
mode,
    A_{ij} = −\frac{∂^2}{∂θ_i\, ∂θ_j} \log f_{Θ|Y}(θ \mid y)\Big|_{θ=\hat{θ}} = −\frac{∂^2}{∂θ_i\, ∂θ_j} L(θ)\Big|_{θ=\hat{θ}} = −\left[\frac{∂^2}{∂θ\, ∂θ^T} L(θ)\Big|_{θ=\hat{θ}}\right]_{ij}.
The first degree term of the expansion vanishes, since θ̂ is the mode of the log-
posterior. Here A is at least positive semidefinite, since θ̂ is a maximum. If A
is positive definite, we can proceed with the normal approximation.
Exponentiating, we find out that approximately (at least near the mode)

    f_{Θ|Y}(θ \mid y) ∝ \exp\!\left(−\frac{1}{2}\, (θ − \hat{θ})^T A\, (θ − \hat{θ})\right).
Therefore we can approximate the posterior with the corresponding multivariate
normal distribution with mean θ̂ and covariance matrix given by A−1 , i.e., the
approximating normal distribution is
    N\!\left(\hat{θ}, \; \left(−L''(\hat{θ})\right)^{−1}\right),   (6.8)
where L''(\hat{θ}) is the Hessian matrix of the logarithm of the unnormalized posterior,
L(θ) = \log q(θ \mid y), evaluated at its mode θ̂. The precision matrix of the
approximating normal distribution is thus A = −L''(\hat{θ}).
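A sketch of the multivariate normal approximation (6.8): maximize L(θ) = log q(θ | y) numerically and estimate the negative Hessian at the mode by finite differences. The two-dimensional unnormalized log-posterior below is an assumed toy function, and the finite-difference Hessian is one simple choice among many.

import numpy as np
from scipy.optimize import minimize

def log_q(theta):
    """Assumed toy unnormalized log-posterior in d = 2 dimensions."""
    x, z = theta
    return -0.5 * (x**2 + 2.0 * (z - x**2)**2)

res = minimize(lambda t: -log_q(t), x0=np.array([1.0, 1.0]))
theta_hat = res.x                                # mode of the unnormalized posterior

def num_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of f at x."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

A = -num_hessian(log_q, theta_hat)               # negative Hessian at the mode
cov = np.linalg.inv(A)                           # covariance of the approximating normal (6.8)
print("mode:", theta_hat)
print("covariance:", cov)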
where y = (y1 , y2 , y3 , y4 ) = (13, 1, 2, 3). The mode and the second derivative of
L(θ) = log q(θ | y) evaluated at the mode are given by
    φ = \operatorname{logit}(θ) = \ln\frac{θ}{1 − θ} \quad\Longleftrightarrow\quad θ = \frac{e^{φ}}{1 + e^{φ}}.
The given unnormalized posterior for θ transforms to the following unnormalized
posterior for φ,
    \tilde{q}(φ \mid y) = q(θ \mid y)\, \frac{dθ}{dφ}
        = \left(\frac{e^{φ}}{1 + e^{φ}}\right)^{y_4} \left(\frac{1}{1 + e^{φ}}\right)^{y_2+y_3} \left(\frac{2 + 3e^{φ}}{1 + e^{φ}}\right)^{y_1} \frac{e^{φ}}{(1 + e^{φ})^2}.
The mode and the second derivative of L̃(φ) = log q̃(φ | y) evaluated at the
mode are given by
φ̂ ≈ 0.582, L̃00 (φ̂) ≈ −2.259.
(Also φ̂ can be found by solving a quadratic.) This results in the normal ap-
proximation N (0.582, 1/2.259) for the logit of θ.
When we translate that approximation back to the original parameter space,
we get the approximation
    f_{Θ|Y}(θ \mid y) ≈ N(φ \mid 0.582, 1/2.259)\, \frac{dφ}{dθ},
Figure 6.1: The exact posterior density (solid line) together with its normal
approximation (dashed line) and the approximation based on the normal ap-
proximation for the logit of θ. The last approximation is markedly non-normal
on the original scale, and it is able to capture the skewness of the true posterior
density.
i.e.,

    f_{Θ|Y}(θ \mid y) ≈ N(\operatorname{logit}(θ) \mid 0.582, 1/2.259)\, \frac{1}{θ(1 − θ)}.
Both of these approximations are plotted in Figure 6.1 together with the
true posterior density (whose normalizing constant can be found exactly).
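The numbers 0.582 and 2.259 can be reproduced directly from the transformed unnormalized posterior q̃(φ | y) displayed above; the sketch below maximizes its logarithm numerically and uses a finite-difference second derivative (the optimizer bounds and step size are arbitrary choices).

import numpy as np
from scipy.optimize import minimize_scalar

y1, y2, y3, y4 = 13, 1, 2, 3

def log_q_tilde(phi):
    """Logarithm of the transformed unnormalized posterior displayed above."""
    e = np.exp(phi)
    return (y4 * (phi - np.log1p(e)) - (y2 + y3) * np.log1p(e)
            + y1 * (np.log(2 + 3 * e) - np.log1p(e)) + phi - 2 * np.log1p(e))

res = minimize_scalar(lambda p: -log_q_tilde(p), bounds=(-5.0, 5.0), method="bounded")
phi_hat = res.x

eps = 1e-4
second = (log_q_tilde(phi_hat + eps) - 2 * log_q_tilde(phi_hat)
          + log_q_tilde(phi_hat - eps)) / eps**2

print(phi_hat)     # approx. 0.582
print(second)      # approx. -2.259, giving the approximation N(0.582, 1/2.259)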
Here the negative Hessian of the log-likelihood is called the observed (Fisher)
information (matrix), and we denote it by J(θ),
    J(θ) = −\ell''(θ) = −\frac{∂^2}{∂θ\, ∂θ^T} \log p(y \mid θ).   (6.9)
The negative Hessian of the log-posterior equals the sum of the observed infor-
mation and the negative Hessian of the log-prior.
If the sample size is large, then the likelihood dominates the prior in the
sense that the likelihood is highly peaked while the prior is relatively flat in the
region where the posterior density is appreciable. In large samples the mode of
the log-posterior θ̂ and the mode of the log-likelihood (the maximum likelihood
estimator, MLE) θ̂MLE are approximately equal, and also the Hessian matrix of
the log-posterior is approximately the same as the Hessian of the log-likelihood.
Combining these two approximations, we get
    \hat{θ} ≈ \hat{θ}_{\mathrm{MLE}}, \qquad −L''(\hat{θ}) ≈ J(\hat{θ}_{\mathrm{MLE}}).
When we plug these approximations in the normal approximation (6.8), we see
that in large samples the posterior is approximately normal with mean equal to
the MLE and covariance matrix given by the inverse of the observed information,
    p(θ \mid y) ≈ N\!\left(θ \mid \hat{θ}_{\mathrm{MLE}}, \, [J(\hat{θ}_{\mathrm{MLE}})]^{−1}\right).   (6.10)
Here Y is a random vector from the sampling distribution of the data, and so
θ̂MLE (Y ) is the maximum likelihood estimator considered as a random variable
(or random vector). In contrast, θ̂MLE (y) is the maximum likelihood estimate
calculated from the observed data y.
Comparing equations (6.10) and (6.11) we see that for large samples the
posterior distribution can be approximated using the same formulas that (fre-
quentist) statisticians use for the maximum likelihood estimator. In large sam-
ples the influence of the prior vanishes, and then one does not need to spend much
effort on formulating the prior distribution so that it would reflect all available
prior information. However, in small samples careful formulation of the prior is
important.
Heuristically, the integrand is negligible when we go far away from θ̂, and so we
should be able to approximate the integral I by a simpler integral, where we
take into account only the local behavior of L(θ) around its mode. To this end,
we first approximate L(θ) by its second degree Taylor polynomial centered at
the mode θ̂,
    L(θ) ≈ L(\hat{θ}) + 0 \cdot (θ − \hat{θ}) + \frac{1}{2} L''(\hat{θ}) (θ − \hat{θ})^2.
Since g(θ) is slowly varying, we may approximate the integrand as follows
    g(θ)\, e^{L(θ)} ≈ g(\hat{θ}) \exp\!\left(L(\hat{θ}) − \frac{1}{2} Q (θ − \hat{θ})^2\right),

where

    Q = −L''(\hat{θ}).
For the following, we must assume that L''(\hat{θ}) < 0. Integrating the approximation, we obtain

    I ≈ g(\hat{θ})\, e^{L(\hat{θ})} \int \exp\!\left(−\frac{1}{2} Q (θ − \hat{θ})^2\right) dθ = \frac{\sqrt{2π}}{\sqrt{Q}}\, g(\hat{θ})\, e^{L(\hat{θ})}.   (6.14)
This is the univariate case of Laplace’s approximation. (Actually, it is just the
leading term in a Laplace expansion, which is an asymptotic expansion for the
integral.)
To handle the multivariate result, we use the normalizing constant of the
Nd (µ, Q−1 ) distribution to evaluate the integral
    \int \exp\!\left(−\frac{1}{2} (x − µ)^T Q (x − µ)\right) dx = \frac{(2π)^{d/2}}{\sqrt{\det Q}}.   (6.15)
This result is valid for any symmetric and positive definite d × d matrix Q.
Integrating the multivariate second degree approximation of g(θ) exp(L(θ)), we
obtain
    I = \int g(θ)\, e^{L(θ)}\, dθ ≈ \frac{(2π)^{d/2}}{\sqrt{\det(Q)}}\, g(\hat{θ})\, e^{L(\hat{θ})},   (6.16)
    E[k(Θ) \mid Y = y] ≈ \frac{\dfrac{(2π)^{d/2}}{\sqrt{\det(Q)}}\, k(\hat{θ})\, e^{L(\hat{θ})}}{\dfrac{(2π)^{d/2}}{\sqrt{\det(Q)}}\, e^{L(\hat{θ})}} = k(\hat{θ}),   (6.17)

where

    \hat{θ} = \arg\max L(θ), \qquad Q = −L''(\hat{θ}).
Here we need a single maximization, and do not need to evaluate the Hessian
at all.
A less obvious approach is to apply the approximation (6.16) with g ≡ 1 separately
to the numerator ∫ k(θ) q(θ | y) dθ and the denominator ∫ q(θ | y) dθ, which leads to

    E[k(Θ) \mid Y = y] ≈ \left(\frac{\det(Q)}{\det(Q^*)}\right)^{1/2} \frac{k(\hat{θ}^*)\, q(\hat{θ}^* \mid y)}{q(\hat{θ} \mid y)},   (6.18)

where

    \hat{θ}^* = \arg\max\,[k(θ)\, q(θ \mid y)], \qquad \hat{θ} = \arg\max\, q(θ \mid y),

and Q^* and Q are the negative Hessians

    Q^* = −(L^*)''(\hat{θ}^*), \qquad Q = −L''(\hat{θ}),

where

    L^*(θ) = \log(k(θ)\, q(θ \mid y)), \qquad L(θ) = \log q(θ \mid y).

We need two separate maximizations and need to evaluate two Hessians for this
approximation.
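The two approximations can be compared on an assumed toy example with a Be(8, 5)-shaped unnormalized posterior and k(θ) = θ, where the exact posterior expectation is 8/13; the optimizer bounds and finite-difference step below are arbitrary choices.

import numpy as np
from scipy.optimize import minimize_scalar

def log_q(t):
    """Assumed toy unnormalized log-posterior, Be(8, 5) shape."""
    return 7 * np.log(t) + 4 * np.log(1 - t)

def log_qstar(t):
    """L*(theta) = log(k(theta) q(theta | y)) with k(theta) = theta."""
    return np.log(t) + log_q(t)

def mode_and_neg_hess(logf, eps=1e-4):
    res = minimize_scalar(lambda t: -logf(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
    t_hat = res.x
    Q = -(logf(t_hat + eps) - 2 * logf(t_hat) + logf(t_hat - eps)) / eps**2
    return t_hat, Q

t_hat, Q = mode_and_neg_hess(log_q)
t_star, Q_star = mode_and_neg_hess(log_qstar)

first = t_hat                                                             # (6.17)
second = np.sqrt(Q / Q_star) * np.exp(log_qstar(t_star) - log_q(t_hat))   # (6.18)
print(first, second, 8 / 13)          # the second value is much closer to 8/13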
Tierney and Kadane analyzed the errors committed in these approximations
in the situation where we have n (conditionally) i.i.d. observations and the
sample size n grows. The first approximation (6.17) has relative error of order O(n^{−1}),
while the second approximation (6.18) has relative error of order O(n^{−2}). That is,

    E[k(Θ) \mid Y = y] = k(\hat{θ}) \left(1 + O(n^{−1})\right)
and

    E[k(Θ) \mid Y = y] = \left(\frac{\det(Q)}{\det(Q^*)}\right)^{1/2} \frac{k(\hat{θ}^*)\, q(\hat{θ}^* \mid y)}{q(\hat{θ} \mid y)} \left(1 + O(n^{−2})\right).
Hence the second approximation is much more accurate (at least asymptoti-
cally).
where p(φ, ψ | y) is the normalized posterior. The main difference with approximating
a posterior expectation is the fact that now we are integrating only
over the component(s) ψ of θ = (φ, ψ).
Fix the value of φ for the moment. Let ψ ∗ (φ) be the maximizer of the
function
    ψ ↦ \log p(φ, ψ \mid y),

and let Q(φ) be the negative Hessian matrix of this function evaluated at
ψ = ψ^*(φ). Notice that we can equally well calculate ψ^*(φ) and Q(φ) as the
maximizer and the negative of the d_2 × d_2 Hessian matrix of ψ ↦ \log q(φ, ψ \mid y),
respectively,
    ψ^*(φ) = \arg\max_{ψ}\, \log q(φ, ψ \mid y) = \arg\max_{ψ}\, q(φ, ψ \mid y)   (6.19)

    Q(φ) = −\frac{∂^2}{∂ψ\, ∂ψ^T} \log q(φ, ψ \mid y)\Big|_{ψ=ψ^*(φ)}.   (6.20)
evaluated at the MAP, the maximum point of the same function. However, it
is often enough to approximate the functional form of the marginal posterior.
When considered as a function of φ, we have, approximately,
    p(φ \mid y) = \frac{p(φ, ψ \mid y)}{p(ψ \mid φ, y)} ∝ \frac{q(φ, ψ \mid y)}{p(ψ \mid φ, y)}.
This result is valid for any choice of ψ. Let us now form a normal approximation
for the denominator for a fixed value of φ, i.e.,

    p(ψ \mid φ, y) ≈ N(ψ \mid ψ^*(φ), Q(φ)^{−1}).

However, this approximation is accurate only in the vicinity of the mode ψ^*(φ),
so let us use it only at the mode. The end result is the following approximation,
    p(φ \mid y) ∝ \frac{q(φ, ψ \mid y)}{N(ψ \mid ψ^*(φ), Q(φ)^{−1})}\bigg|_{ψ=ψ^*(φ)}
        = (2π)^{d_2/2} \det(Q(φ))^{−1/2}\, q(φ, ψ^*(φ) \mid y)
        ∝ q(φ, ψ^*(φ) \mid y)\, (\det Q(φ))^{−1/2},   (6.23)
If we want to simulate from the approximate marginal posterior, then we can use the
unnormalized approximation (6.23) directly, together with accept–reject, SIR
or the grid-based simulation method of Sec. 6.1. See the articles by H. Rue and
coworkers [2, 3] for imaginative applications of these ideas.
Another possibility for approximating the marginal posterior would be to
build a normal approximation to the joint posterior, and then marginalize.
However, a normal approximation to the marginal posterior would only give the
correct result with absolute error of order O(n^{−1/2}), so the accuracies of both
of the Laplace approximations are much better. Since the Laplace approxima-
tions yield good relative instead of absolute error, the Laplace approximations
maintain good accuracy also in the tails of the densities. In contrast, the normal
approximation is accurate only in the vicinity of the mode.
Example 6.2. Consider normal observations
    [Y_i \mid µ, τ] \;\overset{\text{i.i.d.}}{\sim}\; N\!\left(µ, \frac{1}{τ}\right), \qquad i = 1, \dots, n,

together with the non-conjugate prior

    p(µ, τ) = p(µ)\, p(τ) = N\!\left(µ \,\Big|\, µ_0, \frac{1}{ψ_0}\right) \mathrm{Gam}(τ \mid a_0, b_0).
The full conditional of µ is readily available,
    p(µ \mid τ, y) = N\!\left(µ \,\Big|\, µ_1, \frac{1}{ψ_1}\right),

where

    ψ_1 = ψ_0 + nτ, \qquad ψ_1 µ_1 = ψ_0 µ_0 + τ \sum_{i=1}^{n} y_i.
The maximizer of µ ↦ p(µ | τ, y) is µ^*(τ) = µ_1, and µ^*(τ) is also the maximizer of
µ ↦ p(µ, τ | y) for any τ. We also need the second derivative
    \frac{∂^2}{∂µ^2} \log p(µ, τ \mid y) = \frac{∂^2}{∂µ^2} \log p(µ \mid τ, y) = −ψ_1,
for µ = µ∗ (τ ), but the derivative does not in this case depend on the value of
µ at all. An unnormalized form of the Laplace approximation to the marginal
posterior of τ is therefore
    p(τ \mid y) ∝ \frac{q(µ^*(τ), τ \mid y)}{\sqrt{ψ_1}}, \qquad \text{where} \quad q(µ, τ \mid y) = p(y \mid µ, τ)\, p(µ)\, p(τ).
In this toy example, the Laplace approximation (6.23) for the functional
form of the marginal posterior p(τ | y) is exact, since by the multiplication rule,

    p(τ \mid y) = \frac{p(µ, τ \mid y)}{p(µ \mid τ, y)}.
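For completeness, here is a sketch of the resulting approximation to p(τ | y) evaluated on a grid of τ values; the prior hyperparameters, the data and the grid range are assumed for illustration.

import numpy as np
from scipy.stats import norm, gamma

mu0, psi0 = 0.0, 1.0                 # assumed prior N(mu0, 1/psi0) for mu
a0, b0 = 2.0, 1.0                    # assumed prior Gam(a0, b0) for the precision tau
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=30)     # assumed data
n = y.size

def log_q(mu, tau):
    """log q(mu, tau | y) = log p(y | mu, tau) + log p(mu) + log p(tau)."""
    return (norm.logpdf(y, loc=mu, scale=1 / np.sqrt(tau)).sum()
            + norm.logpdf(mu, loc=mu0, scale=1 / np.sqrt(psi0))
            + gamma.logpdf(tau, a=a0, scale=1 / b0))

tau = np.linspace(0.05, 4.0, 400)
dtau = tau[1] - tau[0]
psi1 = psi0 + n * tau
mu_star = (psi0 * mu0 + tau * y.sum()) / psi1            # conditional mode mu*(tau)

# Laplace approximation: p(tau | y) proportional to q(mu*(tau), tau | y) / sqrt(psi1)
log_marg = np.array([log_q(m, t) for m, t in zip(mu_star, tau)]) - 0.5 * np.log(psi1)
marg = np.exp(log_marg - log_marg.max())
marg /= marg.sum() * dtau                                # normalize on the grid
print("approximate E(tau | y):", np.sum(tau * marg) * dtau)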
Bibliography
[1] Tom Leonard. A simple predictive density function: Comment. Journal of
the American Statistical Association, 77:657–658, 1982.
Figure 6.2: (a) Marginal posterior density of τ and (b) a sample drawn from the
approximate joint posterior together with contours of the true joint posterior
density.