
Chapter 5

More Bayesian Inference

We use the generic p(·) notation for densities, if there is no danger of confusion.

5.1 Likelihoods and sufficient statistics


Let us consider n (conditionally) independent Bernoulli trials Y1 , . . . , Yn with
success probability θ. That is, the RVs Yi are independent and Yi takes on
the value 1 with probability θ (success in the i’th Bernoulli experiment) and
otherwise is zero (failure in the i’th Bernoulli experiment). Having observed the
values y1 , . . . , yn , the likelihood corresponding to y = (y1 , . . . , yn ) is given by
\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{s} (1-\theta)^{n-s}, \qquad 0 < \theta < 1, \tag{5.1}
\]

where
\[
s = t(y) = \sum_{i=1}^{n} y_i
\]

is the observed number of successes. Here the likelihood depends on the data y
only through the value of t(y), which is said to be a sufficient statistic. Since

\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) = \theta^{t(y)} (1-\theta)^{\,n-t(y)}\, p(\theta),
\]

the posterior depends on the data only through the value of t(y).
In a more general situation, a statistic t(Y ) is called sufficient, if the likeli-
hood can be factored as

p(y | θ) = g(t(y), θ) h(y)

for some functions g and h. Then (as a function of θ)

p(θ | y) ∝ p(y | θ) p(θ) ∝ g(t(y), θ) p(θ)

and therefore the posterior depends on the data only through the value t(y) of
the sufficient statistic.


In other words, we might as well throw away the original data as soon as we
have calculated the value of the sufficient statistic. (Do not try this at home.
You might later want to consider other likelihoods for your data!) Sufficient
statistics are very convenient, but not all likelihoods admit a sufficient statistic
of a fixed dimension, when the sample size is allowed to vary. Such sufficient
statistics exist only in what are known as exponential families, see, e.g., the text
of Schervish [6, Ch. 2] for a discussion.
In the Bernoulli trial example, the random variable S corresponding to the
sufficient statistic
\[
S = t(Y) = \sum_{i=1}^{n} Y_i
\]

has the binomial distribution Bin(n, θ) with sample size n and success probability θ. That is, if we observe only the number of successes s (but not the order in which the successes and failures occurred), then the likelihood is given by
\[
p(s \mid \theta) = \binom{n}{s} \theta^{s} (1-\theta)^{n-s}, \qquad 0 < \theta < 1. \tag{5.2}
\]

The two functions (5.1) and (5.2) describe the same experiment, and are
proportional to each other (as functions of θ). The difference stems from the
fact that there are exactly $\binom{n}{s}$ equally probable sequences $y_1, \ldots, y_n$ which sum
to a given value of s, where 0 ≤ s ≤ n. Since the two functions are proportional
to each other, we will get the same posterior with either of them if we use the
same prior. Therefore it does not matter which of the expressions (5.1) and (5.2)
we use as the likelihood for a binomial experiment.
Observations.

• When calculating the posterior, you can always drop from the likelihood any factors that depend only on the data and not on the parameter. Doing so does not affect the posterior.

• If your model admits a convenient sufficient statistic, you do not need


to work out the distribution of the sufficient statistic in order to write
down the likelihood. You can always use the likelihood of the underlying
repeated experiment, even if the original data has been lost and only the
sufficient statistic has been recorded.

• However, if you do know the density of the sufficient statistic (conditionally


on the parameter), you can use that as the likelihood. (This is tricky;
consult, e.g., Schervish [6, Ch. 2] for a proof.)

We can generalize the Bernoulli experiment (or binomial experiment) to the case where there are k ≥ 2 possible outcomes instead of two. Consider an i.i.d. sample $Y_1, \ldots, Y_n$ from the discrete distribution with k different values $1, \ldots, k$ and respective probabilities $\theta_1, \ldots, \theta_k$, where $0 < \theta_j < 1$ and $\sum_j \theta_j = 1$. (Because of the sum constraint, there are actually only k − 1 free parameters.) The likelihood corresponding to the data $y = (y_1, \ldots, y_n)$ is given by
\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \theta_j^{1(y_i = j)} = \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_k^{n_k}, \tag{5.3}
\]


where $n_j$ is the number of the $y_i$ which take on the value j. This is the multinomial likelihood. Clearly the frequencies $n_1, \ldots, n_k$ form a sufficient statistic. Notice that $\sum_j n_j = n$.
In this case it is possible to work out the distribution of the sufficient statistic,
i.e., the random frequency vector N = (N1 , . . . , Nk ), where

Nj = #{i = 1, . . . , n : Yi = j}, j = 1, . . . , k.

Using combinatorial arguments it can be easily proven that

\[
P(N_1 = n_1, N_2 = n_2, \ldots, N_k = n_k \mid \theta_1, \theta_2, \ldots, \theta_k)
= \binom{n}{n_1, n_2, \ldots, n_k} \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_k^{n_k}, \tag{5.4}
\]
when the integers $0 \le n_1, \ldots, n_k \le n$ satisfy $\sum_j n_j = n$. Here
\[
\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!} \tag{5.5}
\]

is called a multinomial coefficient. The multivariate discrete distribution


with pmf (5.4) is called the multinomial distribution with sample size pa-
rameter n and probability vector parameter (θ1 , . . . , θk ). The binomial distri-
bution is a special case of the multinomial distribution: if S ∼ Bin(n, p), then
the vector (S, n − S) has the multinomial distribution with parameters n and
(p, 1 − p).
Notice that we can use the simple expression (5.3) for the likelihood of a
multinomial observation even when we know very well that the pmf of the
random vector (N1 , . . . , Nk ) involves the multinomial coefficient.

5.2 Conjugate analysis


Some likelihoods have the property that if the prior is selected from a certain
family of distributions, then the posterior also belongs to the same family. Such
a family is called closed under sampling or a conjugate family (for the likelihood
under consideration). A trivial and useless example of a conjugate family is
provided by the set of all distributions. The useful conjugate families can be
described by a finite number of hyperparameters, i.e., they are of the form

\[
\{\theta \mapsto f(\theta \mid \phi) : \phi \in S\}, \tag{5.6}
\]

where S is a set in a Euclidean space, and $\theta \mapsto f(\theta \mid \phi)$ is a density for each


value of the hyperparameter vector φ ∈ S. If the likelihood p(y | θ) admits this
conjugate family, and if the prior p(θ) is f (θ | φ0 ) with a known value φ0 , then
the posterior is of the form

\[
\theta \mapsto p(\theta \mid y) = f(\theta \mid \phi_1),
\]

where φ1 ∈ S. In order to find the posterior, we only need to find the value of
the updated hyperparameter vector φ1 = φ1 (y).
If the densities f (θ | φ) of the conjugate family have an easily understood
form, then Bayesian inference is simple, provided we can approximate our prior


knowledge by some member f (θ | φ0 ) of the conjugate family and provided


we know how to calculate the updated hyperparameters φ1 (y). However, nice
conjugate families of the form (5.6) are possible only when the likelihood belongs
to the exponential family, see, e.g., Schervish [6, Ch. 2].
The prior knowledge of the subject matter expert on θ is, unfortunately,
usually rather vague. Transforming the subject matter expert’s prior knowledge
into a prior distribution is called prior elicitation. See, e.g., [3] for examples of
how this could be achieved for certain important statistical models. Supposing
we are dealing with a scalar parameter, the expert might only have a feeling
for the order of magnitude of the parameter, or might be able to say, which
values would be surprisingly small or surprisingly large for the parameter. One
approach for constructing the prior would then be to select from the family (5.6) a prior which satisfies those kinds of prior summaries.
As an example of conjugate analysis, consider the binomial likelihood (5.1)
corresponding to sample size n and success probability θ. Recall that the beta
density with (hyper)parameters a, b > 0 is given by
\[
\mathrm{Be}(\theta \mid a, b) = \frac{1}{B(a,b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad 0 < \theta < 1.
\]
Suppose that the parameter θ has the beta prior Be(a, b) with known hyperpa-
rameters a and b. Then
\[
\begin{aligned}
p(\theta \mid y) &\propto p(y \mid \theta)\, p(\theta) \\
&\propto \theta^{s}(1-\theta)^{n-s}\, \theta^{a-1}(1-\theta)^{b-1} \\
&\propto \mathrm{Be}(\theta \mid a+s,\, b+n-s), \qquad 0 < \theta < 1.
\end{aligned}
\]

Therefore we claim that the posterior is Be(a + s, b + n − s), where s is the


number of successes (and n − s is the number of failures). Notice the following
points.
• We developed the posterior density, as a function of the parameter θ,
dropping any constants (i.e., factors not involving θ).
• It is important to keep in mind which variable we are interested in and which other variables we treat as constants. The variable of interest is the one whose posterior distribution we want to calculate.
• We finished the calculation by recognizing that the posterior has a familiar
functional form. In the present example we obtained a beta density except
that it did not have the right normalizing constant. However, the only
probability density on 0 < θ < 1 having the derived functional form is the
beta density Be(θ | a+s, b+n−s), and therefore the posterior distribution
is this beta distribution.
• In more detail: from our calculations, we know that the posterior has the unnormalized density $\theta^{a+s-1}(1-\theta)^{b+n-s-1}$ on 0 < θ < 1. Since we know that the posterior density is a density on (0, 1), we can find the normalizing constant by integration:
\[
p(\theta \mid y) = \frac{1}{c(y)}\, \theta^{a+s-1} (1-\theta)^{b+n-s-1}, \qquad 0 < \theta < 1,
\]


where
\[
c(y) = \int_0^1 \theta^{a+s-1} (1-\theta)^{b+n-s-1}\, d\theta = B(a+s,\, b+n-s),
\]

where the last step is immediate, since the integral is the normalizing
constant of the beta density Be(θ | a1 , b1 ), where a1 = a + s and b1 =
b + n − s. Therefore

p(θ | y) = Be(θ | a + s, b + n − s).

• As soon as we have recognized the functional form of the posterior, we


have recognized the posterior distribution.
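As a concrete illustration of this conjugate update, the following sketch (not part of the original notes; it assumes SciPy is available and uses made-up values for a, b, n and s) computes the Be(a + s, b + n − s) posterior and a few of its summaries.

```python
# Minimal sketch of the beta-binomial conjugate update (hypothetical numbers).
from scipy import stats

a, b = 1.0, 1.0          # hyperparameters of the Be(a, b) prior (assumed values)
n, s = 20, 14            # number of trials and observed successes (made-up data)

posterior = stats.beta(a + s, b + n - s)      # the Be(a + s, b + n - s) posterior

print("posterior mean:", posterior.mean())
print("posterior mode:", (a + s - 1) / (a + b + n - 2))
print("95% equal tail interval:", posterior.interval(0.95))
```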

5.3 More examples of conjugate analysis


5.3.1 Poisson sampling model and gamma prior
Suppose that
\[
Y_i \mid \theta \overset{\text{i.i.d.}}{\sim} \mathrm{Poi}(\theta), \qquad i = 1, \ldots, n,
\]

which is shorthand notation for the statement that the RVs Yi , i = 1, . . . , n are
independently Poisson distributed with parameter θ. Then

\[
p(y_i \mid \theta) = \frac{1}{y_i!}\, \theta^{y_i} e^{-\theta}, \qquad y_i = 0, 1, 2, \ldots,
\]

and the likelihood is given by
\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) \propto \theta^{\sum_1^n y_i}\, e^{-n\theta}.
\]

The likelihood has the functional form of a gamma density. If the prior for θ is
the gamma distribution Gam(a, b) with known hyperparameters a, b > 0, i.e., if

\[
p(\theta) = \frac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta}, \qquad \theta > 0,
\]

then

\[
\begin{aligned}
p(\theta \mid y) &\propto p(y \mid \theta)\, p(\theta) \\
&\propto \theta^{\sum_1^n y_i} e^{-n\theta}\, \theta^{a-1} e^{-b\theta} \\
&\propto \theta^{a + \sum_1^n y_i - 1}\, e^{-\theta(b+n)}, \qquad \theta > 0,
\end{aligned}
\]
and from this we recognize that the posterior is the gamma distribution
\[
\mathrm{Gam}\Bigl(a + \sum_{i=1}^{n} y_i,\; b + n\Bigr).
\]


5.3.2 Exponential sampling model and gamma prior


Suppose that
\[
Y_i \mid \theta \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\theta), \qquad i = 1, \ldots, n,
\qquad\qquad \Theta \sim \mathrm{Gam}(a, b),
\]
where a, b > 0 are known constants. Then
\[
p(y_i \mid \theta) = \theta e^{-\theta y_i}, \qquad y_i > 0,
\]
and the likelihood is
\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \theta^{n} \exp\Bigl(-\theta \sum_{i=1}^{n} y_i\Bigr).
\]
We obtain $\mathrm{Gam}\bigl(a + n,\; b + \sum_i y_i\bigr)$ as the posterior.
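Both gamma updates amount to adding the sufficient statistics to the hyperparameters. The sketch below is a hedged illustration with made-up data and prior values (note that SciPy's gamma is parametrized by a shape and a scale = 1/rate); it carries out the Poisson and exponential cases side by side.

```python
# Conjugate gamma updates for the Poisson and exponential sampling models
# (hypothetical hyperparameters and data).
from scipy import stats

a0, b0 = 2.0, 1.0                    # Gam(a0, b0) prior (assumed values)

# Poisson sampling: posterior is Gam(a0 + sum(y), b0 + n)
y_pois = [3, 0, 2, 4, 1]
post_pois = stats.gamma(a=a0 + sum(y_pois), scale=1.0 / (b0 + len(y_pois)))

# Exponential sampling: posterior is Gam(a0 + n, b0 + sum(y))
y_exp = [0.8, 1.9, 0.4, 2.7]
post_exp = stats.gamma(a=a0 + len(y_exp), scale=1.0 / (b0 + sum(y_exp)))

print("Poisson model, posterior mean of theta:", post_pois.mean())
print("Exponential model, posterior mean of theta:", post_exp.mean())
```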

5.4 Conjugate analysis for univariate normal observations
5.4.1 Known variance but unknown mean
Suppose that we have one normally distributed observation Y ∼ N (θ, τ 2 ), where
the mean θ is unknown but the variance τ 2 is a known value. Then
\[
p(y \mid \theta) = \frac{1}{\tau\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\,\frac{(y-\theta)^2}{\tau^2}\Bigr).
\]

Suppose that the prior is N (µ0 , σ02 ) with known constants µ0 and σ02 . Then the
posterior is

\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)
\propto \exp\Bigl(-\frac{1}{2\tau^2}(y-\theta)^2 - \frac{1}{2\sigma_0^2}(\theta-\mu_0)^2\Bigr)
= \exp\Bigl(-\frac{1}{2}\, q(\theta)\Bigr),
\]
where
\[
q(\theta) = \frac{1}{\tau^2}(y-\theta)^2 + \frac{1}{\sigma_0^2}(\theta-\mu_0)^2
\]
is a second degree polynomial in θ, and the coefficient of $\theta^2$ in q(θ) is positive.
Therefore the posterior is a certain normal distribution. However, we need to
calculate its mean µ1 and variance σ12 . This we achieve by completing the square
in the quadratic polynomial q(θ). However, we need only to keep track of the
first and second degree terms.
Developing the density N (θ | µ1 , σ12 ) as a function of θ, we obtain

\[
N(\theta \mid \mu_1, \sigma_1^2) = \frac{1}{\sigma_1\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\,\frac{(\theta-\mu_1)^2}{\sigma_1^2}\Bigr)
\propto \exp\Bigl(-\frac{1}{2}\Bigl(\frac{1}{\sigma_1^2}\theta^2 - 2\,\frac{\mu_1}{\sigma_1^2}\theta\Bigr)\Bigr).
\]


Next, we equate the coefficients of $\theta^2$ and θ, firstly in q(θ) and, secondly, in the previous formula to find out that we have
\[
p(\theta \mid y) = N(\theta \mid \mu_1, \sigma_1^2),
\]
where
\[
\frac{1}{\sigma_1^2} = \frac{1}{\tau^2} + \frac{1}{\sigma_0^2}, \qquad
\frac{\mu_1}{\sigma_1^2} = \frac{y}{\tau^2} + \frac{\mu_0}{\sigma_0^2}, \tag{5.7}
\]
from which we can solve first $\sigma_1^2$ and then $\mu_1$.
In Bayesian inference it is often convenient to parametrize the normal distri-
bution by its mean and precision, where precision is defined as the reciprocal of
the variance. We have just shown that the posterior precision equals the prior
precision plus the datum precision.
If we have n independent observations Yi ∼ N (θ, τ 2 ) with a known variance,
then it is a simple matter to show that
\[
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\]

is a sufficient statistic. In this case we know the distribution of the corresponding RV $\bar{Y}$ conditionally on θ,
\[
[\bar{Y} \mid \theta] \sim N\Bigl(\theta, \frac{\tau^2}{n}\Bigr).
\]
From these two facts we get immediately the posterior distribution from (5.7), when the prior is again $N(\mu_0, \sigma_0^2)$. (Alternatively, we may simply multiply the likelihood with the prior density, and examine the resulting expression.)
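In code, the update (5.7) is just a precision-weighted average; a minimal sketch with made-up prior settings and data (not part of the original notes):

```python
# Posterior for a normal mean with known variance, via the precision update (5.7).
# Hypothetical prior settings and data.
import numpy as np

mu0, sigma0_sq = 0.0, 4.0            # prior N(mu0, sigma0^2) (assumed values)
tau_sq = 1.0                         # known observation variance
y = np.array([1.2, 0.7, 1.9, 1.4])   # made-up observations
n, ybar = len(y), y.mean()

prior_prec = 1.0 / sigma0_sq
data_prec = n / tau_sq               # precision of ybar, since Var(Ybar) = tau^2 / n

post_prec = prior_prec + data_prec                               # 1 / sigma1^2
post_mean = (prior_prec * mu0 + data_prec * ybar) / post_prec    # mu1

print("posterior mean:", post_mean, "posterior sd:", post_prec ** -0.5)
```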

5.4.2 Known mean but unknown precision


Suppose that the RVs Yi are independently normally distributed,

\[
Y_i \mid \theta \overset{\text{i.i.d.}}{\sim} N\Bigl(\mu, \frac{1}{\theta}\Bigr), \qquad i = 1, \ldots, n,
\]
where the mean µ is known but the variance 1/θ is unknown. Notice that
we parametrize the sampling distribution using the precision θ instead of the
variance 1/θ. Then
\[
p(y_i \mid \theta) = \frac{\sqrt{\theta}}{\sqrt{2\pi}} \exp\Bigl(-\frac{1}{2}\,\theta (y_i - \mu)^2\Bigr),
\]

and the likelihood is
\[
p(y \mid \theta) \propto \theta^{n/2} \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^2\, \theta\Bigr).
\]

If the prior is Gam(a, b), then the posterior is evidently
\[
\mathrm{Gam}\Bigl(a + \frac{n}{2},\; b + \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\Bigr).
\]


The previous result can be expressed also in terms of the variance φ = 1/θ.
The variance has what is known as the inverse gamma distribution with density
\[
\mathrm{Gam}\Bigl(\frac{1}{\phi} \Bigm| a_1, b_1\Bigr)\frac{1}{\phi^2}, \qquad \phi > 0,
\]
where a1 and b1 are the just obtained updated parameters, as can be established
by the change of variable φ = 1/θ in the posterior density. The inverse gamma
distribution is also called the scaled inverse chi-square distribution, using a
certain other convention for the parametrization.

5.4.3 Both the mean and the precision are unknown


Suppose that the RVs Yi are independently normally distributed with unknown
mean φ and unknown precision τ ,
\[
[Y_i \mid \phi, \tau] \overset{\text{i.i.d.}}{\sim} N\Bigl(\phi, \frac{1}{\tau}\Bigr), \qquad i = 1, \ldots, n.
\]
In this case the likelihood for θ = (φ, τ) admits a conjugate prior of the form
\[
p(\phi, \tau \mid a_0, b_0, \mu_0, n_0) = \mathrm{Gam}(\tau \mid a_0, b_0)\, N\Bigl(\phi \Bigm| \mu_0, \frac{1}{n_0\tau}\Bigr).
\]
Notice that the precision and the mean are dependent in this prior. This kind
of a dependent prior may be natural in some problems but less natural in some
other problems.
Often the interest centers on the mean φ while the precision τ is regarded as a
nuisance parameter. The marginal posterior of φ (i.e., the marginal distribution
of φ in the joint posterior) is obtained from the joint posterior by integrating
out the nuisance parameter,
\[
p(\phi \mid y) = \int p(\phi, \tau \mid y)\, d\tau.
\]

In the present case, this integral can be solved analytically, and the marginal
posterior of φ can be shown to be a t-distribution.

5.5 Conjugate analysis for the multivariate normal sampling model
When dealing with the multivariate instead of the univariate normal distribu-
tion, it is even more convenient to parametrize the normal distribution using the
precision matrix, which is defined as the inverse of the covariance matrix, which
we assume to be non-singular. Like the covariance matrix, also the precision
matrix is a symmetric and positive definite matrix.
The density of the multivariate normal Nd (µ, Q−1 ) with mean µ and preci-
sion matrix Q (i.e., of Nd (µ, Σ), where the covariance matrix Σ = Q−1 ) is given
by
\[
N_d(x \mid \mu, Q^{-1}) = (2\pi)^{-d/2} (\det Q)^{1/2} \exp\Bigl(-\frac{1}{2}(x-\mu)^T Q (x-\mu)\Bigr),
\]
where d is the dimensionality of x.


5.5.1 Unknown mean vector but known precision matrix


Expanding the quadratic form inside the exponential function in Nd (x | µ, Q−1 ),
we get
\[
(x-\mu)^T Q (x-\mu) = x^T Q x - x^T Q \mu - \mu^T Q x + \mu^T Q \mu.
\]
Now, the precision matrix Q is symmetric, and a scalar equals its transpose, so
\[
\mu^T Q x = (\mu^T Q x)^T = x^T Q^T \mu = x^T Q \mu.
\]
That is,
\[
(x-\mu)^T Q (x-\mu) = x^T Q x - 2 x^T Q \mu + \mu^T Q \mu
\]
(for any symmetric matrix Q), which should be compared with the familiar formula $(a-b)^2 = a^2 - 2ab + b^2$ valid for scalars a and b. Therefore, as a function of x,
\[
N_d(x \mid \mu, Q^{-1}) \propto \exp\Bigl(-\frac{1}{2}\bigl(x^T Q x - 2\, x^T Q \mu\bigr)\Bigr). \tag{5.8}
\]

Suppose that we have a single multivariate observation $Y \sim N(\theta, R^{-1})$, where the precision matrix R is known, and suppose that the prior for the parameter vector θ is the normal distribution $N(\mu_0, Q_0^{-1})$ with known hyperparameters $\mu_0$ and $Q_0$. Then
\[
p(y \mid \theta) \propto \exp\Bigl(-\frac{1}{2}(y-\theta)^T R (y-\theta)\Bigr).
\]
The prior is
\[
p(\theta) \propto \exp\Bigl(-\frac{1}{2}(\theta-\mu_0)^T Q_0 (\theta-\mu_0)\Bigr).
\]
The posterior is proportional to their product,
\[
p(\theta \mid y) \propto \exp\Bigl(-\frac{1}{2}(\theta-y)^T R (\theta-y) - \frac{1}{2}(\theta-\mu_0)^T Q_0 (\theta-\mu_0)\Bigr) = \exp\Bigl(-\frac{1}{2}\, q(\theta)\Bigr).
\]
Here
\[
\begin{aligned}
q(\theta) &= (\theta-y)^T R (\theta-y) + (\theta-\mu_0)^T Q_0 (\theta-\mu_0) \\
&= \theta^T R \theta - 2\theta^T R y + y^T R y + \theta^T Q_0 \theta - 2\theta^T Q_0 \mu_0 + \mu_0^T Q_0 \mu_0 \\
&= \theta^T (R + Q_0)\theta - 2\theta^T (R y + Q_0 \mu_0) + c,
\end{aligned}
\]
where the scalar c does not depend on θ. Comparing this result with (5.8), we see that the posterior is the multivariate normal $N_d(\mu_1, Q_1^{-1})$, where
\[
Q_1 = Q_0 + R, \qquad Q_1 \mu_1 = Q_0 \mu_0 + R y. \tag{5.9}
\]

Again, posterior precision equals the prior precision plus the datum precision.
In this manner one can identify the parameters of a multivariate normal distri-
bution, by completing the square.
As in the univariate case, this result can be extended to several (condi-
tionally) independent observations, and also to the case where both the mean
vector and the precision matrix are (partially) unknown, when we employ an
appropriate conjugate prior.
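A small numerical sketch of the update (5.9), with made-up two-dimensional quantities, shows that the whole computation is one matrix addition and one linear solve (this is an illustration, not part of the original notes).

```python
# Multivariate normal mean with known precision: posterior via (5.9).
# All numerical values below are made up for illustration.
import numpy as np

Q0 = np.array([[1.0, 0.0], [0.0, 1.0]])      # prior precision
mu0 = np.array([0.0, 0.0])                   # prior mean
R = np.array([[4.0, 1.0], [1.0, 3.0]])       # observation precision (known)
y = np.array([1.5, -0.5])                    # a single multivariate observation

Q1 = Q0 + R                                  # posterior precision
mu1 = np.linalg.solve(Q1, Q0 @ mu0 + R @ y)  # posterior mean solves Q1 mu1 = Q0 mu0 + R y
Sigma1 = np.linalg.inv(Q1)                   # posterior covariance, if needed

print("posterior mean:", mu1)
```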


5.5.2 Linear regression when the error variance is known


The sampling model in linear regression can be described by stating that

\[
Y_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, \ldots, n,
\]
where the errors $\varepsilon_i \sim N(0, \sigma^2)$ independently of each other and independently of β, and where we assume that the error variance $\sigma^2$ is known. Here $x_i$ is the known covariate vector corresponding to the i'th response $Y_i$, and $\beta \in \mathbb{R}^p$ is the vector of regression coefficients.
It is more convenient to rewrite the linear regression model in matrix form, as follows,
\[
[Y \mid \beta] \sim N_n\Bigl(X\beta, \frac{1}{\tau} I\Bigr). \tag{5.10}
\]
Here X is the known n × p model matrix, which has $x_i^T$ as its i'th row vector, and $\tau = 1/\sigma^2$ is the precision parameter of the error distribution.
If the prior is multivariate normal,
\[
p(\beta) = N_p(\beta \mid \mu_0, Q_0^{-1}),
\]
then the posterior is also multivariate normal,
\[
p(\beta \mid y) = N_p(\beta \mid \mu_1, Q_1^{-1}),
\]
with
\[
Q_1 = Q_0 + \tau X^T X, \qquad Q_1 \mu_1 = Q_0 \mu_0 + \tau X^T y. \tag{5.11}
\]
This follows similarly as in the previous section, namely from
\[
\begin{aligned}
p(\beta \mid y) &\propto p(y \mid \beta)\, p(\beta) \\
&= N_n\Bigl(y \Bigm| X\beta, \frac{1}{\tau} I\Bigr)\, N_p(\beta \mid \mu_0, Q_0^{-1}) \\
&\propto \exp\Bigl(-\frac{1}{2}\tau (y - X\beta)^T (y - X\beta) - \frac{1}{2}(\beta - \mu_0)^T Q_0 (\beta - \mu_0)\Bigr) \\
&= \exp\Bigl(-\frac{1}{2}\, q(\beta)\Bigr),
\end{aligned}
\]
where the quadratic form q(β) is
\[
q(\beta) = \tau (y - X\beta)^T (y - X\beta) + (\beta - \mu_0)^T Q_0 (\beta - \mu_0)
= \beta^T (\tau X^T X + Q_0)\beta - 2\beta^T (\tau X^T y + Q_0 \mu_0) + c_1.
\]
Comparing this result with (5.8), we get the previously announced formulas (5.11).
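The formulas (5.11) translate directly into a few lines of linear algebra; the sketch below uses simulated toy data and an assumed vague prior, and is only meant to illustrate the update.

```python
# Posterior of the regression coefficients with known error precision, via (5.11).
# Simulated toy data; prior settings are assumed values.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # model matrix with intercept
beta_true = np.array([1.0, 2.0])
sigma = 0.5
tau = 1.0 / sigma**2                                     # known error precision
y = X @ beta_true + rng.normal(scale=sigma, size=n)

Q0 = 0.01 * np.eye(p)        # vague prior precision (assumed)
mu0 = np.zeros(p)            # prior mean

Q1 = Q0 + tau * X.T @ X                               # posterior precision
mu1 = np.linalg.solve(Q1, Q0 @ mu0 + tau * X.T @ y)   # posterior mean

print("posterior mean of beta:", mu1)
```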

5.5.3 Known mean but unknown precision matrix


At this point we need to introduce the Wishart distribution which is a joint
distribution for the distinct entries of a symmetric random d × d matrix X.
When we use the parametrization of Bernardo and Smith [1], then the Wishart


distribution with parameters α > (d − 1)/2 and B (a symmetric positive definite matrix) has the density
\[
\mathrm{Wish}_d(X \mid \alpha, B) = \frac{\det(B)^{\alpha}}{\Gamma_d(\alpha)}\, \det(X)^{\alpha-(d+1)/2} \exp\bigl(-\operatorname{tr}(B X)\bigr), \qquad X > 0.
\]

Here Γd (·) is the generalized gamma function, see (A.6) and tr(M ) denotes the
trace of the square matrix M , i.e.,
\[
\operatorname{tr}(M) = \sum_i m_{ii}.
\]

Further, the qualification X > 0 means that the above expression applies when
X is not only symmetric but also positive definite, otherwise the Wishart pdf
is zero. The Wishart density is the joint pdf of the d(d + 1)/2 distinct entries
of the symmetric matrix X, e.g., the elements xij , i ≥ j on or below the main
diagonal. When d = 1, then Wishd (x | α, β) reduces to Gam(x | α, β).
E.g., when d = 2 then the symmetric matrix X can be written using only
the elements $x_{11}$, $x_{21}$ and $x_{22}$ as follows,
\[
X = \begin{pmatrix} x_{11} & x_{21} \\ x_{21} & x_{22} \end{pmatrix}.
\]

A symmetric matrix is positive definite if and only if all of its eigenvalues are
positive and if and only if all of its leading principal minors are positive. In this
2 × 2 case, $X = [x_{ij}]$ is positive definite if and only if
\[
x_{11} > 0, \qquad x_{11} x_{22} - x_{21}^2 > 0.
\]
Writing out the determinant and the trace in terms of the matrix elements, we obtain the following expression for the joint density of the Wishart distribution,
\[
\mathrm{Wish}_2(x_{11}, x_{21}, x_{22} \mid \alpha, B) = \frac{\det(B)^{\alpha}}{\sqrt{\pi}\,\Gamma(\alpha)\,\Gamma(\alpha - \tfrac{1}{2})}\,
(x_{11} x_{22} - x_{21}^2)^{\alpha - \frac{3}{2}} \exp\bigl(-(\beta_{11} x_{11} + 2\beta_{21} x_{21} + \beta_{22} x_{22})\bigr)
\]
if $x_{11} > 0$ and $x_{11} x_{22} - x_{21}^2 > 0$, and the pdf is zero otherwise.
Now we assume that
\[
Y_i \mid Q \overset{\text{i.i.d.}}{\sim} N_d(\mu, Q^{-1}), \qquad i = 1, \ldots, n.
\]
Here $\mu \in \mathbb{R}^d$ is a known vector, and $Q \in \mathbb{R}^{d\times d}$ is an unknown symmetric positive definite matrix. The prior density is $\mathrm{Wish}_d(Q \mid \alpha, B)$.
Since
\[
p(y_i \mid Q) = (2\pi)^{-d/2} \det(Q)^{1/2} \exp\Bigl(-\frac{1}{2}(y_i - \mu)^T Q (y_i - \mu)\Bigr),
\]
the likelihood is
\[
p(y \mid Q) \propto \det(Q)^{n/2} \exp\Bigl(-\frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)^T Q (y_i - \mu)\Bigr).
\]


To proceed, we need to use certain properties of the trace of a matrix.


The matrix trace is obviously a linear function, the trace of a scalar γ is tr(γ) = γ, and, what is more, the trace also satisfies the identity

tr(C D) = tr(D C)

whenever C D is a square matrix (the factors need not be square matrices).


Therefore we can write
\[
\begin{aligned}
\sum_{i=1}^{n} (y_i - \mu)^T Q (y_i - \mu)
&= \sum_{i=1}^{n} \operatorname{tr}\bigl((y_i - \mu)^T Q (y_i - \mu)\bigr) \\
&= \sum_{i=1}^{n} \operatorname{tr}\bigl((y_i - \mu)(y_i - \mu)^T Q\bigr) \\
&= \operatorname{tr}\Bigl(\Bigl[\sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^T\Bigr] Q\Bigr).
\end{aligned}
\]

Let us denote
\[
S = \sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^T.
\]
When we use the above trace identity in the likelihood and combine the likelihood with the prior, we get
\[
\begin{aligned}
p(Q \mid y) &\propto \det(Q)^{\alpha - \frac{1}{2}(d+1)} \exp\bigl(-\operatorname{tr}(B Q)\bigr)\,
\det(Q)^{n/2} \exp\Bigl(-\frac{1}{2}\operatorname{tr}(S Q)\Bigr) \\
&= \det(Q)^{\alpha + n/2 - \frac{1}{2}(d+1)} \exp\Bigl(-\operatorname{tr}\Bigl(\bigl(B + \tfrac{1}{2}S\bigr) Q\Bigr)\Bigr),
\end{aligned}
\]
which shows that the posterior is
\[
\mathrm{Wish}_d\Bigl(\alpha + \frac{n}{2},\; B + \frac{1}{2}\sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^T\Bigr).
\]

If need be, this can be translated to the traditional parametrization of the


Wishart distribution with the aid of Appendix A.7.
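As a rough numerical illustration (not from the original notes), the sketch below computes the posterior hyperparameters α + n/2 and B + S/2 for simulated data. The point estimate printed at the end uses the mean formula αB⁻¹ for this parametrization, which follows from matching it to the traditional (ν, V) parametrization with ν = 2α and V = (2B)⁻¹; treat that mapping as an assumption to be checked against Appendix A.7.

```python
# Wishart posterior update for an unknown precision matrix with known mean
# (Bernardo-Smith parametrization used in the text; toy data, assumed prior values).
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 100
mu = np.zeros(d)                                  # known mean
Q_true = np.array([[2.0, 0.5], [0.5, 1.0]])       # true precision (for simulation only)
y = rng.multivariate_normal(mu, np.linalg.inv(Q_true), size=n)

alpha0, B0 = 1.0, np.eye(d)                       # prior Wish_d(alpha0, B0) (assumed)

resid = y - mu
S = resid.T @ resid                               # S = sum_i (y_i - mu)(y_i - mu)^T

alpha_n = alpha0 + n / 2.0                        # posterior hyperparameters
B_n = B0 + 0.5 * S

# Assumed mean formula for this parametrization: E[Q | y] = alpha_n * B_n^{-1}
print("posterior point estimate of Q:\n", alpha_n * np.linalg.inv(B_n))
```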

5.6 Conditional conjugacy


In multiparameter problems it may be difficult or impossible to use conjugate
priors. However, some benefits of conjugate families can be retained, if one has
conditional conjugacy in the Bayesian statistical model.
Suppose we have parameter vector θ, which we partition as θ = (φ, ψ), where
the components φ and ψ are not necessarily scalars. Then the full conditional (density) of φ in the prior distribution is defined as

p(φ | ψ),

and the full conditional (density) of φ in the posterior is defined as

p(φ | ψ, y).


Then φ exhibits conditional conjugacy, if the full conditional of φ in the prior


and in the posterior belong to the same family of distributions.
In practice, one notices the conditional conjugacy of φ as follows. The prior
full conditional of φ is
p(φ | ψ) ∝ p(φ, ψ),
when we regard the joint prior as a function of φ. Similarly, the posterior full
conditional of φ is

p(φ | ψ, y) ∝ p(φ, ψ, y) = p(φ, ψ) p(y | φ, ψ),

when we regard the joint distribution p(φ, ψ, y) as a function of φ. If we rec-


ognize the functional forms of the prior full conditional and the posterior full
conditional, then we have conditional conjugacy.
If we partition the parameter vector into k components, θ = (θ1 , . . . , θk )
(which are not necessarily scalars), then sometimes all the components are con-
ditionally conjugate. In other cases, only some of the components turn out to
be conditionally conjugate.
Example 5.1. [Conditional conjugacy in linear regression] Consider the linear
regression model
\[
[Y \mid \beta, \tau] \sim N_n\Bigl(X\beta, \frac{1}{\tau} I\Bigr), \tag{5.12}
\]
where now both β and the precision of the error distribution τ are unknown. If
the prior distribution is of the form

p(β, τ ) = p(τ ) Np (β | µ0 (τ ), (Q0 (τ ))−1 ),

where the mean and the precision matrix of the conditional prior of β are arbi-
trary functions of τ , then the full conditional of β can be obtained from equa-
tions (5.11), since τ is considered known in p(β | τ, y).
On the other hand, if the prior distribution is of the form

p(β, τ ) = p(β) Gam(τ | a0 (β), b0 (β)),

then an easy calculation shows that the full conditional of τ is also a gamma
distribution. △

5.7 Reparametrization
Suppose that we have formulated a Bayesian statistical model in terms of a
parameter vector θ with a continuous distribution, but then want to reformu-
late it in terms of a new parameter vector φ, where there is a diffeomorphic
correspondence between θ and φ. I.e., the correspondence

φ = g(θ) ⇔ θ = h(φ)

is one-to-one and continuously differentiable in both directions. What happens


to the prior, likelihood and the posterior under such a reparametrization?
We get the prior of φ using the change of variables formula for densities:
\[
f_\Phi(\phi) = f_\Theta(\theta)\,\Bigl|\frac{\partial\theta}{\partial\phi}\Bigr| = f_\Theta(h(\phi))\, |J_h(\phi)|.
\]


If we know φ then we also know θ = h(φ). Therefore the likelihood stays the same in the sense that
\[
f_{Y\mid\Phi}(y \mid \phi) = f_{Y\mid\Theta}(y \mid h(\phi)).
\]
Finally, the posterior density changes in the same way as the prior density (by the change of variables formula), i.e.,
\[
f_{\Phi\mid Y}(\phi \mid y) = f_{\Theta\mid Y}(\theta \mid y)\,\Bigl|\frac{\partial\theta}{\partial\phi}\Bigr| = f_{\Theta\mid Y}(h(\phi) \mid y)\, |J_h(\phi)|.
\]

5.8 Improper priors


Sometimes one specifies a prior by stating that

p(θ) ∝ h(θ),

where h(θ) is a non-negative function whose integral is infinite,
\[
\int h(\theta)\, d\theta = \infty.
\]

Then there does not exist a constant of proportionality that will allow p(θ) to
be a proper density, i.e., to integrate to one. In that case we have an improper
prior. Notice that this is different from expressing the prior by the means of
an unnormalized density h, which can be normalized to be a proper density.
Sometimes we get a proper posterior, if we multiply an improper prior with the
likelihood and then normalize.
For example, consider one normally distributed observation Y ∼ N (θ, τ 2 )
with a known variance τ 2 , and take

p(θ) ∝ 1, θ ∈R.

This prior is intended to represent complete prior ignorance about the unknown
mean: all possible values are deemed equally likely. Calculating formally,
\[
p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) \propto \exp\Bigl(-\frac{1}{2\tau^2}(y-\theta)^2\Bigr) \propto N(\theta \mid y, \tau^2).
\]

We obtain the same result in the limit, if we take N (µ0 , σ02 ) as the prior and
then let the prior variance σ02 go to infinity.
One often uses improper priors in a location-scale model, with a location
parameter µ and a scale parameter σ. Then it is conventional to take the prior
of the location parameter to be uniform and to let the logarithm of the scale
parameter σ have a uniform distribution and to take them to be independent in
their improper prior. This translates to an improper prior of the form
\[
p(\mu, \sigma) \propto \frac{1}{\sigma}, \qquad \mu \in \mathbb{R},\; \sigma > 0, \tag{5.13}
\]
by using (formally) the change of variables formula,
\[
p(\sigma) = p(\tau)\,\Bigl|\frac{d\tau}{d\sigma}\Bigr| \propto \frac{1}{\sigma},
\]


when τ = log σ and p(τ ) ∝ 1.


Some people use the so called Jeffreys’ prior, which is designed to have a
form which is invariant with respect to one-to-one reparametrizations. Also
this typically leads to an improper prior. There are also other procedures which attempt to produce non-informative priors, and these often turn out to be improper.
(A prior is called non-informative, vague, diffuse or flat, if it plays a minimal
role in the posterior distribution.)
Whereas the posterior derived from a proper prior is automatically proper,
the posterior derived from an improper prior can be either proper or improper.
Notice, however, that an improper posterior does not make any sense.
If you do use an improper prior, it is your duty to check that the posterior is
proper.

5.9 Summarizing the posterior


The posterior distribution gives a complete description of the uncertainty con-
cerning the parameter after the data has been observed. If we use conjugate
analysis inside a well-understood conjugate family, then we need only report
the hyperparameters of the posterior. E.g., if the posterior is a multivariate
normal (and the dimensionality is low) then the best summary is to give the
mean and the covariance matrix of the posterior. However, in more complicated
situations the functional form of the posterior may be opaque, and then we need
to summarize the posterior.
If we have a univariate parameter, then the best description of the posterior
is the plot of its density function. Additionally, we might want to calculate such
summaries as the posterior mean, the posterior variance, the posterior mode, the
posterior median, and other selected posterior quantiles. If we cannot plot the
density, but are able to simulate from the posterior, we can plot the histogram
and calculate summaries (mean, variance, quantiles) from the simulated sample.
If we have a two-dimensional parameter, then we can still make contour
plots or perspective plots of the density, but in higher dimensions such plots
are not possible. One practical approach in a multiparameter situation is to
summarize the one-dimensional marginal posteriors of the scalar components of
the parameter.
Suppose that (after a rearrangement of the components) θ = (φ, ψ), where
φ is the scalar component of interest. Then the marginal posterior of φ is
\[
p(\phi \mid y) = \int p(\phi, \psi \mid y)\, d\psi.
\]

The indicated integration may be very difficult to perform analytically. However,


if one has available a sample

(φ1 , ψ1 ), (φ2 , ψ2 ), ..., (φN , ψN )

from the posterior of θ = (φ, ψ), then φ1 , φ2 , . . . , φN is a sample from the


marginal posterior of φ. Hence we can summarize the marginal posterior of
φ based on the sample φ1 , . . . , φN .


5.10 Posterior intervals


We consider a univariate parameter θ which has a continuous distribution. One
conventional summary of the posterior is a 100(1 − α)% posterior interval of
the parameter θ, which is any interval C in the parameter space such that
\[
P(\Theta \in C \mid Y = y) = \int_C p(\theta \mid y)\, d\theta = 1 - \alpha. \tag{5.14}
\]

Here 0 < α < 1 is some fixed number, such that the required coverage proba-
bility is 1 − α; usual choices are α = 0.05 or α = 0.1 corresponding to 95 % and
90 % probability intervals, respectively. Some authors call such intervals prob-
ability intervals, credible intervals (or credibility intervals) or Bayesian
confidence intervals.
The posterior intervals have the direct probabilistic interpretation (5.14).
In contrast, the confidence intervals of frequentist statistics have probability
interpretations only with reference to (hypothetical) sampling of the data under
identical conditions.
Within the frequentist framework, the parameter is an unknown determinis-
tic quantity. A frequentist confidence interval either covers or does not cover the
true parameter value. A frequentist statistician constructs a frequentist confi-
dence interval at significance level α100% in such a way that if it were possible
to sample repeatedly the data under identical conditions (i.e., using the same
value for the parameter), then the relative frequency of coverage in a long run
of repetitions would be about 1 − α. But for the data at hand, the calculated
frequentist confidence interval still either covers or does not cover the true pa-
rameter value, and we do not have guarantees for anything more. Many naive
users of statistics (and even some textbook authors) mistakenly believe that
their frequentist confidence intervals have the simple probability interpretation
belonging to posterior intervals.
The coverage requirement (5.14) needs to be supplemented by other crite-
ria. To demonstrate why this is so, let q be the quantile function of the posterior (which may be explicitly available in simple conjugate situations and may be
approximated with the empirical quantile function in other situations). Then
the interval
[q(t), q(1 − (α − t))]
has the required coverage probability 1 − α for any value 0 < t < α; cf. (2.5)
and (2.6).
In practice it is easiest to use the equal tail (area) interval (or central
interval), whose end points are selected so that α/2 of the posterior probability
lies to the left and α/2 to the right of the interval. By the definition of the
quantile function, the equal tail posterior interval is given by
[q(α/2), q(1 − α/2)]. (5.15)
If the quantile function is not available, but one has available a sample θ1 , . . . , θN
from the posterior, then one can use the empirical quantiles calculated from the
sample.
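For instance, a central 95% interval can be read off a posterior sample with one quantile call; in the sketch below the draws are simulated from a known beta distribution purely as a stand-in for posterior output.

```python
# Equal-tail 95% posterior interval from a sample of posterior draws, cf. (5.15).
# The draws are simulated from a known beta posterior purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
draws = rng.beta(a=15.0, b=7.0, size=10_000)     # stand-in for posterior draws

alpha = 0.05
lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
print(f"{100 * (1 - alpha):.0f}% equal tail interval: ({lower:.3f}, {upper:.3f})")
```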
Many authors recommend the highest posterior density (HPD) region,
which is defined as the set
Ct = {θ : fΘ|Y (θ | y) ≥ t},


where the threshold t has to be selected so that

P (Θ ∈ Ct ) = 1 − α.

Often (but not always) the HPD region turns out to be an interval. Then it
can be proven to be the shortest interval with the desired coverage 100(1 − α)%.
However, calculating a HPD interval is more difficult than calculating an equal
tail interval.
In a multiparameter situation one usually examines one parameter at a time.
Let φ be the scalar parameter of interest in θ = (φ, ψ), and suppose that we
have available a sample

(φ1 , ψ1 ), (φ2 , ψ2 ), ..., (φN , ψN )

from the posterior. Then φ1 , φ2 , . . . , φN is a sample from the marginal posterior


of φ. Hence the central marginal posterior interval of φ can be calculated as
in (5.15), when q is the empirical quantile function based on φ1 , . . . , φN .

5.11 Literature
See, e.g., Bernardo and Smith [1] for further results on conjugate analysis. The
books by Gelman et al. [4], Carlin and Louis [2] and O’Hagan and Forster [5]
are rich sources of ideas on Bayesian modeling and analysis. Sufficiency is a
central concept in parametric statistics. See, e.g., Schervish [6] for a discussion.

Bibliography
[1] José M. Bernardo and Adrian F. M. Smith. Bayesian Theory. John Wiley
& Sons, 2000. First published in 1994.
[2] Bradley P. Carlin and Thomas A. Louis. Bayesian Methods for Data Anal-
ysis. Chapman & Hall/CRC, 3rd edition, 2009.
[3] Ronald Christensen, Wesley Johnson, Adam Branscum, and Timothy E.
Hanson. Bayesian Ideas and Data Analysis: An Introduction for Scientists
and Statisticians. Texts in Statistical Science. CRC Press, 2011.
[4] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC Press, 2nd edition, 2004.
[5] Anthony O’Hagan and Jonathan Forster. Bayesian Inference, volume 2B of
Kendall’s Advanced Theory of Statistics. Arnold, second edition, 2004.
[6] Mark J. Schervish. Theory of Statistics. Springer series in statistics.
Springer-Verlag, 1995.

Chapter 6

Approximations

6.1 The grid method


When one is confronted with a low-dimensional problem with a continuous pa-
rameter, then it is usually easy to approximate the posterior density on a dense
grid of points which covers the relevant part of the parameter space. We discuss
the method for a one-dimensional parameter θ.
We suppose that the posterior is available in the unnormalized form

\[
f_{\Theta\mid Y}(\theta \mid y) = \frac{1}{c(y)}\, q(\theta \mid y),
\]

where we know how to evaluate the unnormalized density q(θ | y), but do not
necessarily know the value of the normalizing constant c(y).
Instead of the original parameter space, we consider a finite interval [a, b],
which should cover most of the mass of the posterior distribution. We divide
[a, b] evenly into N subintervals

Bi = [a + (i − 1)h, a + ih], i = 1, . . . , N.

The width h of one subinterval is
\[
h = \frac{b-a}{N}.
\]
Let θi be the midpoint of the i’th subinterval,

\[
\theta_i = a + \Bigl(i - \frac{1}{2}\Bigr)h, \qquad i = 1, \ldots, N.
\]
We use the midpoint rule for numerical integration. This means that we ap-
proximate the integral over the i’th subinterval of any function g by the rule
\[
\int_{B_i} g(\theta)\, d\theta \approx h\, g(\theta_i). \tag{6.1}
\]

Using the midpoint rule on each of the subintervals, we get the following


approximation for the normalizing constant,
\[
c(y) = \int q(\theta \mid y)\, d\theta \approx \int_a^b q(\theta \mid y)\, d\theta
= \sum_{i=1}^{N} \int_{B_i} q(\theta \mid y)\, d\theta
\approx h \sum_{i=1}^{N} q(\theta_i \mid y). \tag{6.2}
\]
Using this approximation, we can approximate the value of the posterior density at the point $\theta_i$,
\[
f_{\Theta\mid Y}(\theta_i \mid y) = \frac{1}{c(y)}\, q(\theta_i \mid y) \approx \frac{1}{h}\,\frac{q(\theta_i \mid y)}{\sum_{j=1}^{N} q(\theta_j \mid y)}. \tag{6.3}
\]

We also obtain approximations for the posterior probabilities of the subintervals,
\[
P(\Theta \in B_i \mid Y = y) = \int_{B_i} f_{\Theta\mid Y}(\theta \mid y)\, d\theta
\approx h\, f_{\Theta\mid Y}(\theta_i \mid y)
\approx \frac{q(\theta_i \mid y)}{\sum_{j=1}^{N} q(\theta_j \mid y)}. \tag{6.4}
\]

By following the same reasoning which led to (6.2), we may form the approximation
\[
\int k(\theta)\, q(\theta \mid y)\, d\theta \approx h \sum_{i=1}^{N} k(\theta_i)\, q(\theta_i \mid y)
\]
basically for any function k such that $k(\theta)\, q(\theta \mid y)$ differs appreciably from zero only on the interval (a, b). This can be used to approximate the posterior expectation of an arbitrary function k(θ) of the parameter, by
\[
E(k(\Theta) \mid Y = y) = \int k(\theta)\, f_{\Theta\mid Y}(\theta \mid y)\, d\theta
= \frac{\int k(\theta)\, q(\theta \mid y)\, d\theta}{\int q(\theta \mid y)\, d\theta}
\approx \frac{\sum_{i=1}^{N} k(\theta_i)\, q(\theta_i \mid y)}{\sum_{j=1}^{N} q(\theta_j \mid y)}. \tag{6.5}
\]

These approximations can be surprisingly accurate even for moderate values of N, provided we are able to identify an interval [a, b] which covers the essential part of the posterior distribution.
To summarize, the grid method for approximating the posterior density or for simulating from it is the following.
• First evaluate the unnormalized posterior density q(θ | y) at a regular
grid of points θ1 , . . . , θN with spacing h. The grid should cover the main
support of the posterior density.
• If you want to plot the posterior density, normalize these values by dividing
by their sum and additionally by the bin width h as in eq. (6.3). This gives
an approximation to the posterior ordinates p(θi | y) at the grid points θi .
• If you want a sample from the posterior, sample with replacement from
the grid points θi with probabilities proportional to the numbers q(θi | y),
cf. (6.4).


• If you want to approximate the posterior expectation E[k(θ) | y], calculate


the weighted average of the values k(θi ) using the values q(θi | y) as
weights, cf. eq. (6.5).
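The recipe above fits in a few lines of code. The sketch below (with a made-up unnormalized density standing in for q(θ | y)) produces normalized ordinates, a posterior expectation, and a resampled posterior sample, following eqs. (6.3)–(6.5).

```python
# Grid approximation of a univariate posterior (the recipe in the list above).
# The unnormalized density below is a made-up example; any q(theta | y) would do.
import numpy as np

def q(theta):
    # unnormalized posterior, e.g. theta^s (1-theta)^(n-s) times a flat prior
    s, n = 14, 20
    return theta**s * (1.0 - theta)**(n - s)

a, b, N = 0.0, 1.0, 1000                 # grid interval and number of bins
h = (b - a) / N
theta = a + (np.arange(N) + 0.5) * h     # midpoints theta_i

qvals = q(theta)
density = qvals / (h * qvals.sum())      # approximate posterior ordinates, eq. (6.3)
probs = qvals / qvals.sum()              # bin probabilities, eq. (6.4)

# posterior expectation of an arbitrary function k (here k(theta) = theta), eq. (6.5)
post_mean = np.sum(theta * qvals) / qvals.sum()

# simulate from the posterior by resampling grid points, cf. the last bullet above
rng = np.random.default_rng(3)
sample = rng.choice(theta, size=5000, replace=True, p=probs)

print("density at its maximum:", density.max())
print("approximate posterior mean:", post_mean, "sample mean:", sample.mean())
```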

The midpoint rule is considered a rather crude method of numerical integration. In the numerical analysis literature, much more sophisticated methods of numerical integration (or numerical quadrature) are available, and
they can be used in a similar manner. Besides dimension one, these kinds of
approaches can be used in dimensions two or three. However, as the dimen-
sionality of the parameter space grows, computing at every point in a dense
multidimensional grid becomes more and more expensive.

6.2 Normal approximation to the posterior


We now try to approximate a posterior density by a normal density based on
the behavior of the posterior density at its mode. This approximation can
be quite accurate when the sample size is large, provided the posterior is
unimodal. We will call the resulting approximation a normal approximation to
the posterior, but the result is sometimes also called a Laplace approximation
or a modal approximation. A normal approximation can be used directly as
an approximate description of the posterior. However, such an approximation
can be utilized also indirectly, e.g., to form a good proposal distribution for the
Metropolis–Hastings method.
We first discuss normal approximation in the univariate situation. The sta-
tistical model has a single parameter θ, which has a continuous distribution. We
do know an unnormalized version q(θ | y) of the posterior density, but the nor-
malizing constant is usually unknown. We consider the case, where θ 7→ q(θ | y)
is unimodal: i.e., it has only one local maximum. We suppose that we have
located the mode θ̂ of the unnormalized posterior q(θ | y). Notice that θ̂ is
also the posterior mode, which is also called the MAP (maximum a posteriori)
estimate. Actually, θ̂ depends on the data y, but we suppress this dependence
in our notation. Usually we would have to run some numerical optimization
algorithm in order to find the mode.
The basic idea of the method is to use the second degree Taylor polynomial of
the log-posterior (the logarithm of the posterior density) centered on the mode
θ̂,
\[
\log f_{\Theta\mid Y}(\theta \mid y) \approx \log f_{\Theta\mid Y}(\hat\theta \mid y) + b(\theta - \hat\theta) - \frac{1}{2} A (\theta - \hat\theta)^2, \tag{6.6}
\]
where
\[
b = \frac{\partial}{\partial\theta} \log f_{\Theta\mid Y}(\theta \mid y)\Big|_{\theta=\hat\theta}
= \frac{\partial}{\partial\theta} \log q(\theta \mid y)\Big|_{\theta=\hat\theta} = 0,
\]
and
\[
A = -\frac{\partial^2}{\partial\theta^2} \log f_{\Theta\mid Y}(\theta \mid y)\Big|_{\theta=\hat\theta}
= -\frac{\partial^2}{\partial\theta^2} \log q(\theta \mid y)\Big|_{\theta=\hat\theta}.
\]

Notice the following points.

• The first and higher order (partial) derivatives with respect to θ of log q(θ | y)
and log fΘ|Y (θ | y) agree, since these functions differ only by an additive
constant (which depends on y but not on θ).


• The first order term of the Taylor expansion disappears, since θ̂ is also the
mode of the log-posterior log fΘ|Y (θ | y).

• A ≥ 0, since θ̂ is a maximum of q(θ | y). For the following, we need to


assume that A > 0.
Taking the exponential of the second degree Taylor approximation (6.6), we
see that we may approximate the posterior by the function
\[
\pi_{\text{approx}}(\theta) \propto \exp\Bigl(-\frac{A}{2}(\theta - \hat\theta)^2\Bigr),
\]

at least in the vicinity of the mode θ̂. Luckily, we recognize that πapprox (θ)
is an unnormalized form of the density of the normal distribution with mean
θ̂ and variance 1/A. The end result is that the posterior distribution can be
approximated with the normal distribution
\[
N\Bigl(\hat\theta,\; \frac{1}{-L''(\hat\theta)}\Bigr), \tag{6.7}
\]
where L(θ) is the logarithm of the unnormalized posterior,
\[
L(\theta) = \log q(\theta \mid y),
\]
and $L''(\hat\theta)$ is the second derivative of L(θ) evaluated at the mode $\hat\theta$.
The multivariate analog of the result starts with the second degree expansion
of the log-posterior centered on its mode θ̂,
\[
\log f_{\Theta\mid Y}(\theta \mid y) \approx \log f_{\Theta\mid Y}(\hat\theta \mid y) + 0 - \frac{1}{2}(\theta - \hat\theta)^T A (\theta - \hat\theta),
\]
where A is the negative Hessian matrix of $L(\theta) = \log q(\theta \mid y)$ evaluated at the mode,
\[
A_{ij} = -\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f_{\Theta\mid Y}(\theta \mid y)\Big|_{\theta=\hat\theta}
= -\frac{\partial^2}{\partial\theta_i\, \partial\theta_j} L(\theta)\Big|_{\theta=\hat\theta}
= -\Bigl[\frac{\partial^2}{\partial\theta\, \partial\theta^T} L(\theta)\Big|_{\theta=\hat\theta}\Bigr]_{ij}.
\]

The first degree term of the expansion vanishes, since θ̂ is the mode of the log-
posterior. Here A is at least positive semidefinite, since $\hat\theta$ is a maximum. If A is positive definite, we can proceed with the normal approximation.
Exponentiating, we find out that approximately (at least near the mode)
\[
f_{\Theta\mid Y}(\theta \mid y) \propto \exp\Bigl(-\frac{1}{2}(\theta - \hat\theta)^T A (\theta - \hat\theta)\Bigr).
\]
Therefore we can approximate the posterior with the corresponding multivariate
normal distribution with mean θ̂ and covariance matrix given by A−1 , i.e., the
approximating normal distribution is
\[
N\Bigl(\hat\theta,\; \bigl(-L''(\hat\theta)\bigr)^{-1}\Bigr), \tag{6.8}
\]
where $L''(\hat\theta)$ is the Hessian matrix of the logarithm of the unnormalized posterior, $L(\theta) = \log q(\theta \mid y)$, evaluated at its mode $\hat\theta$. The precision matrix of the


approximating normal distribution is the negative Hessian of the log-posterior


evaluated at the posterior mode. Another characterization for the precision
matrix is that it is the Hessian of the negative log-posterior evaluated at the
posterior mode. The covariance matrix of the normal approximation is the
inverse of its precision matrix.
Typically the mode of the log-posterior (or the maximum point of the neg-
ative log-posterior) would be calculated using some numerical optimization al-
gorithm. The Hessian would then be calculated using numerical differentiation,
see Sec. B.7 for an example.
Before using the normal approximation, it is often advisable to reparameter-
ize the model so that the transformed parameters are defined on the whole real
line and have roughly symmetric distributions. E.g., one can use logarithms of
positive parameters and apply the logit function to parameters which take val-
ues on the interval (0, 1). The normal approximation is then constructed for the
transformed parameters, and the approximation can then be translated back to
the original parameter space. One must, however, remember to multiply by the
appropriate Jacobians.
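A minimal univariate sketch of the construction (6.7): find the mode with a numerical optimizer and estimate −L″ at the mode by finite differences. The unnormalized posterior used here is a made-up example, and the bracket passed to the optimizer is assumed to contain the mode.

```python
# Normal (Laplace) approximation (6.7) to a univariate posterior:
# find the mode numerically, then estimate -L''(mode) by finite differences.
# The unnormalized posterior below is a made-up example.
import numpy as np
from scipy.optimize import minimize_scalar

def log_q(theta):
    # log of an unnormalized posterior; here a Gam(3, 2)-shaped example
    return 2.0 * np.log(theta) - 2.0 * theta

# maximize log_q on a bracket assumed to contain the mode
res = minimize_scalar(lambda t: -log_q(t), bounds=(1e-6, 50.0), method="bounded")
mode = res.x

eps = 1e-5
second_deriv = (log_q(mode + eps) - 2.0 * log_q(mode) + log_q(mode - eps)) / eps**2
approx_var = -1.0 / second_deriv          # 1 / (-L''(mode))

print("mode:", mode, "approximate posterior variance:", approx_var)
```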
Example 6.1. We consider the unnormalized posterior

\[
q(\theta \mid y) = \theta^{y_4} (1-\theta)^{y_2+y_3} (2+\theta)^{y_1}, \qquad 0 < \theta < 1,
\]
where $y = (y_1, y_2, y_3, y_4) = (13, 1, 2, 3)$. The mode and the second derivative of $L(\theta) = \log q(\theta \mid y)$ evaluated at the mode are given by
\[
\hat\theta \approx 0.677, \qquad L''(\hat\theta) \approx -37.113.
\]

(The mode $\hat\theta$ can be found by solving a quadratic equation.) The resulting normal approximation in the original parameter space is N(0.677, 1/37.113).
We next reparametrize by defining φ as the logit of θ,
\[
\phi = \operatorname{logit}(\theta) = \ln\frac{\theta}{1-\theta} \quad\Leftrightarrow\quad \theta = \frac{e^\phi}{1+e^\phi}.
\]
The given unnormalized posterior for θ transforms to the following unnormalized posterior for φ,
\[
\tilde q(\phi \mid y) = q(\theta \mid y)\,\frac{d\theta}{d\phi}
= \Bigl(\frac{e^\phi}{1+e^\phi}\Bigr)^{y_4} \Bigl(\frac{1}{1+e^\phi}\Bigr)^{y_2+y_3} \Bigl(\frac{2+3e^\phi}{1+e^\phi}\Bigr)^{y_1}\, \frac{e^\phi}{(1+e^\phi)^2}.
\]

The mode and the second derivative of L̃(φ) = log q̃(φ | y) evaluated at the
mode are given by
\[
\hat\phi \approx 0.582, \qquad \tilde L''(\hat\phi) \approx -2.259.
\]
(Also φ̂ can be found by solving a quadratic.) This results in the normal ap-
proximation N (0.582, 1/2.259) for the logit of θ.
When we translate that approximation back to the original parameter space, we get the approximation
\[
f_{\Theta\mid Y}(\theta \mid y) \approx N(\phi \mid 0.582,\, 1/2.259)\,\Bigl|\frac{d\phi}{d\theta}\Bigr|,
\]
i.e.,
\[
f_{\Theta\mid Y}(\theta \mid y) \approx N\bigl(\operatorname{logit}(\theta) \mid 0.582,\, 1/2.259\bigr)\, \frac{1}{\theta(1-\theta)}.
\]

[Figure 6.1 appears here.]

Figure 6.1: The exact posterior density (solid line) together with its normal approximation (dashed line) and the approximation based on the normal approximation for the logit of θ. The last approximation is markedly non-normal on the original scale, and it is able to capture the skewness of the true posterior density.
Both of these approximations are plotted in Figure 6.1 together with the
true posterior density (whose normalizing constant can be found exactly). △
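The numbers quoted in Example 6.1 can be checked numerically; the following sketch is an aside (not part of the original text) that reproduces the mode and the curvature on the original scale.

```python
# Numerical check of the mode and curvature reported in Example 6.1.
import numpy as np
from scipy.optimize import minimize_scalar

y1, y2, y3, y4 = 13, 1, 2, 3

def L(theta):
    # log of q(theta | y) = theta^y4 (1-theta)^(y2+y3) (2+theta)^y1
    return y4 * np.log(theta) + (y2 + y3) * np.log(1 - theta) + y1 * np.log(2 + theta)

res = minimize_scalar(lambda t: -L(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_hat = res.x

eps = 1e-5
L2 = (L(theta_hat + eps) - 2 * L(theta_hat) + L(theta_hat - eps)) / eps**2

print("theta_hat ~", round(theta_hat, 3))   # should be close to 0.677
print("L''(theta_hat) ~", round(L2, 3))     # should be close to -37.113
```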

6.3 Connection to the traditional frequentist asymptotics
Here we discuss the relationship of the normal approximation (6.8) to the frequentist asymptotics of the maximum likelihood estimator. The unnormalized version of the posterior density is of the form
\[
q(\theta \mid y) = k(y)\, f_{Y\mid\Theta}(y \mid \theta)\, f_\Theta(\theta) = k(y)\, p(y \mid \theta)\, p(\theta),
\]
where p(θ) is the prior, p(y | θ) is the likelihood, and k(y) is any convenient constant which may depend on the data but not on the parameter vector. Therefore the logarithm of the unnormalized posterior is
\[
L(\theta) = \log q(\theta \mid y) = \log k(y) + \ell(\theta) + \log p(\theta),
\]
where $\ell(\theta) = \log p(y \mid \theta)$ is the log-likelihood. Therefore the negative Hessian of L(θ) is
\[
-L''(\theta) = -\ell''(\theta) - \frac{\partial^2}{\partial\theta\,\partial\theta^T} \log p(\theta).
\]


Here the negative Hessian of the log-likelihood is called the observed (Fisher)
information (matrix), and we denote it by J(θ),
\[
J(\theta) = -\ell''(\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta^T} \log p(y \mid \theta). \tag{6.9}
\]
The negative Hessian of the log-posterior equals the sum of the observed infor-
mation and the negative Hessian of the log-prior.
If the sample size is large, then the likelihood dominates the prior in the
sense that the likelihood is highly peaked while the prior is relatively flat in the
region where the posterior density is appreciable. In large samples the mode of
the log-posterior θ̂ and the mode of the log-likelihood (the maximum likelihood
estimator, MLE) θ̂MLE are approximately equal, and also the Hessian matrix of
the log-posterior is approximately the same as the Hessian of the log-likelihood.
Combining these two approximations, we get
\[
\hat\theta \approx \hat\theta_{\mathrm{MLE}}, \qquad -L''(\hat\theta) \approx J(\hat\theta_{\mathrm{MLE}}).
\]
When we plug these approximations in the normal approximation (6.8), we see
that in large samples the posterior is approximately normal with mean equal to
the MLE and covariance matrix given by the inverse of the observed information,
\[
p(\theta \mid y) \approx N\bigl(\theta \mid \hat\theta_{\mathrm{MLE}},\, [J(\hat\theta_{\mathrm{MLE}})]^{-1}\bigr). \tag{6.10}
\]

This approximation should be compared with the well-known frequentist


asymptotic distribution results for the maximum likelihood estimator. Loosely,
these results can be summarized so that the sampling distribution of the maxi-
mum likelihood estimator is asymptotically normal with mean equal to the MLE
and covariance matrix equal to the inverse of the observed information. In order
to write this approximation as a formula, we need to indicate the dependence
of the maximum likelihood estimator on the data as follows,
\[
\hat\theta_{\mathrm{MLE}}(Y) \overset{d}{\approx} N\bigl(\hat\theta_{\mathrm{MLE}}(y),\, [J(\hat\theta_{\mathrm{MLE}}(y))]^{-1}\bigr). \tag{6.11}
\]

Here Y is a random vector from the sampling distribution of the data, and so
θ̂MLE (Y ) is the maximum likelihood estimator considered as a random variable
(or random vector). In contrast, θ̂MLE (y) is the maximum likelihood estimate
calculated from the observed data y.
Comparing equations (6.10) and (6.11) we see that for large samples the
posterior distribution can be approximated using the same formulas that (fre-
quentist) statisticians use for the maximum likelihood estimator. In large sam-
ples the influence of the prior vanishes, and then one does not need to spend much
effort on formulating the prior distribution so that it would reflect all available
prior information. However, in small samples careful formulation of the prior is
important.

6.4 Posterior expectations using Laplace approximation
Laplace showed in the 1770’s how one can form approximations to integrals of
highly peaked positive functions by integrating analytically a suitable normal


approximation. We will now apply this idea to build approximations to posterior


expectations. We assume that the posterior density is highly peaked while the function k, whose posterior expectation we seek, is relatively flat. The posterior density is typically known only in the unnormalized form q(θ | y), and then
\[
E[k(\Theta) \mid Y = y] = \frac{\int k(\theta)\, q(\theta \mid y)\, d\theta}{\int q(\theta \mid y)\, d\theta}. \tag{6.12}
\]
Tierney and Kadane [4] approximated separately the numerator and the denom-
inator of eq. (6.12) using Laplace’s method, and analyzed the resulting error.
To introduce the idea of Laplace’s approximation (or Laplace’s method),
consider a highly peaked function L(θ) of a scalar variable θ such that L(θ) has
a unique mode (i.e., a maximum) at θ̂. Suppose that g(θ) is a function, which
varies slowly. We seek an approximation to the integral
\[
I = \int g(\theta)\, e^{L(\theta)}\, d\theta. \tag{6.13}
\]

Heuristically, the integrand is negligible when we go far away from θ̂, and so we
should be able to approximate the integral I by a simpler integral, where we
take into account only the local behavior of L(θ) around its mode. To this end,
we first approximate L(θ) by its second degree Taylor polynomial centered at
the mode θ̂,
\[
L(\theta) \approx L(\hat\theta) + 0\cdot(\theta - \hat\theta) + \frac{1}{2} L''(\hat\theta)(\theta - \hat\theta)^2.
\]
Since g(θ) is slowly varying, we may approximate the integrand as follows,
\[
g(\theta)\, e^{L(\theta)} \approx g(\hat\theta) \exp\Bigl(L(\hat\theta) - \frac{1}{2} Q (\theta - \hat\theta)^2\Bigr),
\]
where
\[
Q = -L''(\hat\theta).
\]
For the following, we must assume that $L''(\hat\theta) < 0$. Integrating the approximation, we obtain
\[
I \approx g(\hat\theta)\, e^{L(\hat\theta)} \int \exp\Bigl(-\frac{1}{2} Q (\theta - \hat\theta)^2\Bigr)\, d\theta
= \frac{\sqrt{2\pi}}{\sqrt{Q}}\, g(\hat\theta)\, e^{L(\hat\theta)}. \tag{6.14}
\]
This is the univariate case of Laplace’s approximation. (Actually, it is just the
leading term in a Laplace expansion, which is an asymptotic expansion for the
integral.)
To handle the multivariate result, we use the normalizing constant of the
Nd (µ, Q−1 ) distribution to evaluate the integral
\[
\int \exp\Bigl(-\frac{1}{2}(x-\mu)^T Q (x-\mu)\Bigr)\, dx = \frac{(2\pi)^{d/2}}{\sqrt{\det Q}}. \tag{6.15}
\]
This result is valid for any symmetric and positive definite d × d matrix Q.
Integrating the multivariate second degree approximation of g(θ) exp(L(θ)), we
obtain
\[
I = \int g(\theta)\, e^{L(\theta)}\, d\theta \approx \frac{(2\pi)^{d/2}}{\sqrt{\det(Q)}}\, g(\hat\theta)\, e^{L(\hat\theta)}, \tag{6.16}
\]


where d is the dimensionality of θ, and Q is the negative Hessian of L evaluated at the mode,
\[
Q = -L''(\hat\theta),
\]
and we must assume that the d × d matrix Q is positive definite.
Using these tools, we can approximate the posterior expectation of k(θ) (see (6.12)) in several different ways. One idea is to approximate the numerator by choosing
\[
g(\theta) = k(\theta), \qquad e^{L(\theta)} = q(\theta \mid y)
\]
in eq. (6.16), and then to approximate the denominator by choosing
\[
g(\theta) \equiv 1, \qquad e^{L(\theta)} = q(\theta \mid y).
\]
These choices yield the approximation
\[
E[k(\Theta) \mid Y = y] \approx \frac{\dfrac{(2\pi)^{d/2}}{\sqrt{\det(Q)}}\, k(\hat\theta)\, e^{L(\hat\theta)}}{\dfrac{(2\pi)^{d/2}}{\sqrt{\det(Q)}}\, e^{L(\hat\theta)}} = k(\hat\theta), \tag{6.17}
\]
where
\[
\hat\theta = \arg\max L(\theta), \qquad Q = -L''(\hat\theta).
\]
Here we need a single maximization, and do not need to evaluate the Hessian
at all.
A less obvious approach is to choose
\[
g(\theta) \equiv 1, \qquad e^{L(\theta)} = k(\theta)\, q(\theta \mid y)
\]
to approximate the numerator, and
\[
g(\theta) \equiv 1, \qquad e^{L(\theta)} = q(\theta \mid y)
\]
to approximate the denominator. Here we need to assume that k is a positive function, i.e., k > 0. The resulting approximation is
\[
E[k(\Theta) \mid Y = y] \approx \Bigl(\frac{\det(Q)}{\det(Q^*)}\Bigr)^{1/2} \frac{k(\hat\theta^*)\, q(\hat\theta^* \mid y)}{q(\hat\theta \mid y)}, \tag{6.18}
\]
where
\[
\hat\theta^* = \arg\max\,[k(\theta)\, q(\theta \mid y)], \qquad \hat\theta = \arg\max\, q(\theta \mid y),
\]
and $Q^*$ and Q are the negative Hessians
\[
Q^* = -L^{*\prime\prime}(\hat\theta^*), \qquad Q = -L''(\hat\theta),
\]
where
\[
L^*(\theta) = \log\bigl(k(\theta)\, q(\theta \mid y)\bigr), \qquad L(\theta) = \log q(\theta \mid y).
\]
We need two separate maximizations and need to evaluate two Hessians for this
approximation.
Tierney and Kadane analyzed the errors committed in these approximations
in the situation, where we have n (conditionally) i.i.d. observations, and the


sample size n grows. The first approximation (6.17) has relative error of or-
der O(n−1 ), while the second approximation (6.18) has relative error of order
O(n−2 ). That is,

\[
E[k(\Theta) \mid Y = y] = k(\hat\theta)\,\bigl(1 + O(n^{-1})\bigr)
\]
and
\[
E[k(\Theta) \mid Y = y] = \Bigl(\frac{\det(Q)}{\det(Q^*)}\Bigr)^{1/2} \frac{k(\hat\theta^*)\, q(\hat\theta^* \mid y)}{q(\hat\theta \mid y)}\,\bigl(1 + O(n^{-2})\bigr).
\]

Hence the second approximation is much more accurate (at least asymptoti-
cally).
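The two approximations are easy to compare on a toy problem where the exact posterior expectation is known. In the sketch below (a made-up gamma-shaped unnormalized posterior, k(θ) = θ, numerical optimization and finite-difference Hessians), the second approximation should land noticeably closer to the exact value a/b.

```python
# The two Laplace approximations (6.17) and (6.18) to a posterior expectation,
# illustrated on a made-up unnormalized posterior where the exact answer is known.
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 5.0, 2.0

def log_q(theta):
    # unnormalized Gam(a, b) posterior; exact E[theta | y] = a / b = 2.5
    return (a - 1.0) * np.log(theta) - b * theta

def k(theta):
    return theta            # the function whose posterior expectation we want

def log_kq(theta):
    return np.log(k(theta)) + log_q(theta)

def neg_hess(f, x, eps=1e-5):
    # finite-difference estimate of -f''(x)
    return -(f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

def maximize(f):
    res = minimize_scalar(lambda t: -f(t), bounds=(1e-8, 100.0), method="bounded")
    return res.x

theta_hat = maximize(log_q)
approx_1 = k(theta_hat)                                  # eq. (6.17)

theta_star = maximize(log_kq)
Q = neg_hess(log_q, theta_hat)
Q_star = neg_hess(log_kq, theta_star)
approx_2 = (np.sqrt(Q / Q_star)
            * k(theta_star) * np.exp(log_q(theta_star) - log_q(theta_hat)))  # eq. (6.18)

print("exact:", a / b, "first approx:", approx_1, "second approx:", approx_2)
```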

6.5 Posterior marginals using Laplace approximation
Tierney and Kadane discuss also an approximation to the marginal posterior,
when the parameter vector θ is composed of two vector components θ = (φ, ψ).
The form of the approximation is easy to derive, and was earlier discussed
by Leonard [1]. However, Tierney and Kadane [4, Sec. 4] were the first to
analyze the error in this Laplace approximation. We first derive the form of the
approximation, and then make some comments on the error terms based on the
discussion of Tierney and Kadane.
Let q(φ, ψ | y) be an unnormalized form of the posterior density, based on
which we try to approximate the normalized marginal posterior p(φ | y). Let
the dimensions of φ and ψ be d1 and d2 , respectively. We have
\[
p(\phi \mid y) = \int p(\phi, \psi \mid y)\, d\psi = \int \exp\bigl(\log p(\phi, \psi \mid y)\bigr)\, d\psi,
\]

where p(φ, ψ | y) is the normalized posterior. The main difference with approx-
imating a posterior expectation is the fact that now we are integrating only
over the component(s) ψ of θ = (φ, ψ).
Fix the value of φ for the moment. Let ψ ∗ (φ) be the maximizer of the
function
\[
\psi \mapsto \log p(\phi, \psi \mid y),
\]
and let Q(φ) be the negative Hessian matrix of this function evaluated at
ψ = ψ ∗ (φ). Notice that we can equally well calculate ψ ∗ (φ) and Q(φ) as the
maximizer and the negative of the $d_2 \times d_2$ Hessian matrix of $\psi \mapsto \log q(\phi, \psi \mid y)$,
respectively,

\[
\psi^*(\phi) = \arg\max_{\psi}\,\bigl(\log q(\phi, \psi \mid y)\bigr) = \arg\max_{\psi}\, q(\phi, \psi \mid y), \tag{6.19}
\]
\[
Q(\phi) = -\frac{\partial^2}{\partial\psi\,\partial\psi^T} \log q(\phi, \psi \mid y)\Big|_{\psi=\psi^*(\phi)}. \tag{6.20}
\]
For fixed φ, we have the second degree Taylor approximation in ψ,
\[
\log p(\phi, \psi \mid y) \approx \log p(\phi, \psi^*(\phi) \mid y) - \frac{1}{2}(\psi - \psi^*(\phi))^T Q(\phi)(\psi - \psi^*(\phi)), \tag{6.21}
\]


and we assume that matrix Q(φ) is positive definite.


Next we integrate the exponential function of the approximation (6.21) with respect to ψ, with the result
\[
p(\phi \mid y) \approx p(\phi, \psi^*(\phi) \mid y)\, (2\pi)^{d_2/2}\, (\det Q(\phi))^{-1/2}.
\]
To evaluate this approximation, we need the normalizing constant of the unnormalized posterior q(φ, ψ | y), which we obtain by another Laplace approximation, and the end result is
\[
p(\phi \mid y) \approx (2\pi)^{-d_1/2}\, q(\phi, \psi^*(\phi) \mid y)\, \sqrt{\frac{\det Q}{\det Q(\phi)}}, \tag{6.22}
\]

where Q is the negative of the $(d_1 + d_2) \times (d_1 + d_2)$ Hessian of the function
\[
(\phi, \psi) \mapsto \log q(\phi, \psi \mid y)
\]
evaluated at the MAP, the maximum point of the same function. However, it is often enough to approximate the functional form of the marginal posterior. When considered as a function of φ, we have, approximately,
\[
p(\phi \mid y) \propto q(\phi, \psi^*(\phi) \mid y)\, (\det Q(\phi))^{-1/2}. \tag{6.23}
\]

The unnormalized Laplace approximation (6.23) can be given another interpretation (see, e.g., [2, 3]). By the multiplication rule,
\[
p(\phi \mid y) = \frac{p(\phi, \psi \mid y)}{p(\psi \mid \phi, y)} \propto \frac{q(\phi, \psi \mid y)}{p(\psi \mid \phi, y)}.
\]
This result is valid for any choice of ψ. Let us now form a normal approximation for the denominator for a fixed value of φ, i.e.,
\[
p(\psi \mid \phi, y) \approx N(\psi \mid \psi^*(\phi),\, Q(\phi)^{-1}).
\]
However, this approximation is accurate only in the vicinity of the mode $\psi^*(\phi)$, so let us use it only at the mode. The end result is the following approximation,
\[
\begin{aligned}
p(\phi \mid y) &\propto \frac{q(\phi, \psi \mid y)}{N(\psi \mid \psi^*(\phi),\, Q(\phi)^{-1})}\bigg|_{\psi = \psi^*(\phi)} \\
&= (2\pi)^{d_2/2} \det(Q(\phi))^{-1/2}\, q(\phi, \psi^*(\phi) \mid y) \\
&\propto q(\phi, \psi^*(\phi) \mid y)\, (\det Q(\phi))^{-1/2},
\end{aligned}
\]

which is the same as the unnormalized Laplace approximation (6.23) to the


marginal posterior of φ.
Tierney and Kadane show that the relative error in the approximation (6.22)
is of the order O(n−1 ), when we have n (conditionally) i.i.d. observations, and
that most of the error comes from approximating the normalizing constant.
They argue that the approximation (6.23) captures the correct functional form
of the marginal posterior with relative error O(n−3/2 ) and recommend that
one should therefore use the unnormalized approximation (6.23), which can
then be normalized by numerical integration, if need be. For instance, if we


want to simulate from the approximate marginal posterior, then we can use the
unnormalized approximation (6.23) directly, together with accept–reject, SIR
or the grid-based simulation method of Sec. 6.1. See the articles by H. Rue and
coworkers [2, 3] for imaginative applications of these ideas.
Another possibility for approximating the marginal posterior would be to
build a normal approximation to the joint posterior, and then marginalize.
However, a normal approximation to the marginal posterior would only give the
correct result with absolute error of order O(n−1/2 ), so the accuracies of both
of the Laplace approximations are much better. Since the Laplace approxima-
tions yield good relative instead of absolute error, the Laplace approximations
maintain good accuracy also in the tails of the densities. In contrast, the normal
approximation is accurate only in the vicinity of the mode.
Example 6.2. Consider normal observations

\[
[Y_i \mid \mu, \tau] \overset{\text{i.i.d.}}{\sim} N\Bigl(\mu, \frac{1}{\tau}\Bigr), \qquad i = 1, \ldots, n,
\]
together with the non-conjugate prior
\[
p(\mu, \tau) = p(\mu)\, p(\tau) = N\Bigl(\mu \Bigm| \mu_0, \frac{1}{\psi_0}\Bigr)\, \mathrm{Gam}(\tau \mid a_0, b_0).
\]
The full conditional of µ is readily available,
\[
p(\mu \mid \tau, y) = N\Bigl(\mu \Bigm| \mu_1, \frac{1}{\psi_1}\Bigr),
\]
where
\[
\psi_1 = \psi_0 + n\tau, \qquad \psi_1 \mu_1 = \psi_0 \mu_0 + \tau \sum_{i=1}^{n} y_i.
\]
The mode of the full conditional p(µ | τ, y) is
\[
\mu^*(\tau) = \mu_1 = \frac{\psi_0 \mu_0 + \tau \sum_{i=1}^{n} y_i}{\psi_0 + n\tau}.
\]
We now use this knowledge to build a Laplace approximation to the marginal
posterior of τ .
Since, as a function of µ,

\[
p(\mu, \tau \mid y) \propto p(\mu \mid \tau, y),
\]
$\mu^*(\tau)$ is also the mode of p(µ, τ | y) for any τ. We also need the second derivative
\[
\frac{\partial^2}{\partial\mu^2}\bigl(\log p(\mu, \tau \mid y)\bigr) = \frac{\partial^2}{\partial\mu^2}\bigl(\log p(\mu \mid \tau, y)\bigr) = -\psi_1,
\]
for $\mu = \mu^*(\tau)$, but the derivative does not in this case depend on the value of µ at all. An unnormalized form of the Laplace approximation to the marginal posterior of τ is therefore
\[
p(\tau \mid y) \propto \frac{q(\mu^*(\tau), \tau \mid y)}{\sqrt{\psi_1}}, \qquad \text{where } q(\mu, \tau \mid y) = p(y \mid \mu, \tau)\, p(\mu)\, p(\tau).
\]


In this toy example, the Laplace approximation (6.23) for the functional form of the marginal posterior p(τ | y) is exact, since by the multiplication rule,
\[
p(\tau \mid y) = \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)}
\]
for any choice of µ, in particular for $\mu = \mu^*(\tau)$. Here the numerator is known only in an unnormalized form.
Figure 6.2 (a) illustrates the result using data y = (−1.4, −1.6, −2.4, 0.7, 0.6) and hyperparameters $\mu_0 = 0$, $\psi_0 = 0.5$, $a_0 = 1$, $b_0 = 0.1$. The unnormalized (approximate) marginal posterior has been drawn using the grid method of Sec. 6.1. Figure 6.2 (b) shows an i.i.d. sample drawn from the approximate posterior
\[
\tilde p(\tau \mid y)\, p(\mu \mid \tau, y),
\]
where $\tilde p(\tau \mid y)$ is a histogram approximation to the true marginal posterior p(τ | y), which has been sampled using the grid method. △
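The two-stage simulation described in the example can be sketched as follows: approximate p(τ | y) on a grid using the formula above, resample τ from the grid, and then draw µ from its exact full conditional. The data and hyperparameters are those quoted in the example; everything else (grid range, sample sizes, seeds) is an illustrative assumption.

```python
# Composition sampling for Example 6.2: approximate p(tau | y) on a grid,
# then draw mu from its exact full conditional N(mu1, 1/psi1).
import numpy as np

y = np.array([-1.4, -1.6, -2.4, 0.7, 0.6])
mu0, psi0, a0, b0 = 0.0, 0.5, 1.0, 0.1
n, s = len(y), y.sum()

def log_q_marginal(tau):
    # log of q(mu*(tau), tau | y) / sqrt(psi1), cf. the Laplace formula above
    psi1 = psi0 + n * tau
    mu_star = (psi0 * mu0 + tau * s) / psi1
    ss = ((y[:, None] - mu_star) ** 2).sum(axis=0)      # sum_i (y_i - mu*)^2
    log_lik = 0.5 * n * np.log(tau) - 0.5 * tau * ss
    log_prior = -0.5 * psi0 * (mu_star - mu0) ** 2 + (a0 - 1.0) * np.log(tau) - b0 * tau
    return log_lik + log_prior - 0.5 * np.log(psi1)

# grid approximation of p(tau | y), cf. Sec. 6.1
tau_grid = np.linspace(1e-3, 4.0, 2000)
lq = log_q_marginal(tau_grid)
w = np.exp(lq - lq.max())
w /= w.sum()

rng = np.random.default_rng(4)
tau_draws = rng.choice(tau_grid, size=2000, replace=True, p=w)
psi1 = psi0 + n * tau_draws
mu1 = (psi0 * mu0 + tau_draws * s) / psi1
mu_draws = rng.normal(mu1, 1.0 / np.sqrt(psi1))        # mu | tau, y

print("posterior means (mu, tau):", mu_draws.mean(), tau_draws.mean())
```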

Bibliography
[1] Tom Leonard. A simple predictive density function: Comment. Journal of
the American Statistical Association, 77:657–658, 1982.

[2] H. Rue and S. Martino. Approximate Bayesian inference for hierarchical


Gaussian Markov random fields models. Journal of Statistical Planning and
Inference, 137(10):3177–3192, 2007.
[3] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for la-
tent Gaussian models using integrated nested Laplace approximations. Jour-
nal of the Royal Statistical Society: Series B, 2009. to appear.
[4] Luke Tierney and Joseph B. Kadane. Accurate approximations for poste-
rior moments and marginal densities. Journal of the American Statistical
Association, 81:82–86, 1986.


Figure 6.2: (a) Marginal posterior density of τ and (b) a sample drawn from the
approximate joint posterior together with contours of the true joint posterior
density.

[Plots omitted: panel (a) shows the posterior density against τ; panel (b) shows the sample in the (µ, τ) plane.]

