Bayesian Data Analysis
Herbert Hoijtink
Utrecht University
Abstract
This chapter will provide an introduction to Bayesian data anal-
ysis. Using an analysis of covariance model as the point of depar-
ture, Bayesian parameter estimation (based on the Gibbs sampler),
Bayesian hypothesis testing (using posterior predictive inference), and
Bayesian model selection (via the Bayes factor) will be introduced.
The chapter will be concluded with a short discussion of Bayesian
hierarchical modelling and references for further reading.
1 Introduction
It is impossible to give a comprehensive introduction to Bayesian data anal-
ysis in just one chapter. In the sequel I will present what I consider to be
the most important components of Bayesian data analysis: parameter esti-
mation based on the Gibbs sampler; the Bayesian counterpart of hypothesis
testing (posterior predictive inference); and model selection using the Bayes
factor. The chapter will be concluded with a short discussion of Bayesian
hierarchical modelling and references to topics that will not be discussed in
this chapter. For accessible introductions to Bayesian data analysis the in-
terested reader is referred to Gill (2002) and Lee (1997). Throughout the
chapter references for further reading will be given both to these two books
and to more advanced material.
It would be easy to fill a whole chapter with a description and discussion
of the differences between Bayesian data analysis and the classical frequentist
data analysis that most readers will be acquainted with. Since this chapter is
rather applied in nature (how to do Bayesian estimation, hypothesis testing
and model selection) I will here and in the sequel highlight two differences
that are important for these applications. Consider, for example, a simple
estimation problem: estimate the mean weight (the parameter of interest)
of 18 year old Dutch females. A frequentist would obtain a sample from
the population of 18 year old Dutch females, compute the sample average
and use this as an estimate of the mean weight in the population. Besides
this sample, a Bayesian would also use his prior expectations (that is, his
expectations with respect to the mean weight before the data are sampled)
to estimate the mean weight. These expectations are quantified in a so-called
prior distribution. For the example at hand this prior distribution could be
a normal distribution with a mean of 60 kilogram and a standard deviation
of 5 kilogram. Bayesians combine the information in the sample and the
prior distribution to estimate average weight. Suppose, for example, that
the average weight in the sample would be 58 kilogram with a standard error
of 2 kilogram; in that case the Bayesian estimate would be an average weight
of 58.27 kilogram (a weighted average of 60 and 58 with weights proportional
to the inverse variances 1/5² and 1/2², respectively). Stated otherwise, Bayesians use two sources of information
when making inferences: the data and prior distributions. Throughout this
chapter this difference between Bayesian and classical frequentist inference
will be highlighted.
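The weighted average in the weight example can be sketched as follows. This is a minimal Python sketch of the standard normal-normal conjugate result: each source of information is weighted by its precision (one over the squared standard deviation), which reproduces the 58.27 in the text up to rounding.

```python
# Sketch of the weight example: a N(60, 5^2) prior combined with a sample
# mean of 58 whose standard error is 2.  The posterior mean weights each
# source of information by its precision (inverse variance).

def posterior_mean(prior_mean, prior_sd, sample_mean, se):
    w_prior = 1.0 / prior_sd ** 2     # precision of the prior
    w_data = 1.0 / se ** 2            # precision of the sample mean
    return (w_prior * prior_mean + w_data * sample_mean) / (w_prior + w_data)

print(round(posterior_mean(60.0, 5.0, 58.0, 2.0), 2))  # -> 58.28
```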
The second difference between frequentist and Bayesian data analysis is
the computational means that are used to obtain estimates, p-values and
other quantities that are useful when making statistical inferences. Where
maximum likelihood is the main tool in classical inference, Bayesians pre-
fer sampling methods. Sampling methods will be elaborated in each of the
sections dealing with estimation, model checking and model selection in this
chapter.
All the concepts and procedures to be introduced in this chapter will be
discussed in the context of and illustrated with a data set previously discussed
by Tabachnick and Fidell (1996, pp. 426-428, 436-437). They use analysis of
covariance (Tabachnick and Fidell, 1996, Chapter 8) to determine whether
or not the self-esteem of women depends on the degree of femininity (which
is coded low/high) and masculinity (also coded low/high) of the women.
Note that the observed scores for self-esteem are in the range 8-29, where 8
denotes a high and 29 a low self-esteem. Social economic status (observed
scores in the range 0-81 where 0 denotes a low social economic status) will
be used as a covariate. Observed means, standard deviations and sample
sizes are presented in Table 1. The main research questions for these data
are: (a) whether high (h) feminine women have a higher self-esteem than low
(l) feminine women; (b) whether high masculine women have a higher self-
esteem than low masculine women; and, (c) whether there is a joint effect
of scoring high or low on both variables. Note that self-esteem is scored
inversely, that is, higher values denote a smaller self-esteem. Let µ denote the
mean of self-esteem adjusted for the covariate social economic status. The
hypotheses corresponding to (a), (b) and (c) are then: H1a : {µhl , µhh } <
{µll , µlh }, where the first index denotes the degree of femininity and the
second index the degree of masculinity; H1b : {µlh , µhh } < {µll , µhl } ; and,
H1c : µhh < {µhl , µlh } < µll , respectively. The traditional null-hypothesis
H0 : µhh = µhl = µlh = µll represents the possibility that neither the degree
of femininity nor masculinity has an effect on self-esteem.
Note that the set of hypotheses specified differs from the traditional null-
hypothesis H0 , that is, “nothing is going on”, and the alternative hypothesis
H2 : not H0 , that is, “something is going on but I don’t know what”. Loosely
formulated, if H2 is preferred over H0 it is still not clear what is going on,
however, if either one of H1a , H1b , H1c is preferred over H0 it is clear which
of the underlying theories is the best. This is an example of the use of prior
knowledge (what is the relative order of the four adjusted means) in statistical
inference. Instead of having a rather general and non-specific alternative like
H2 , prior knowledge with respect to the possible state of affairs in the
population can be used to formulate specific hypotheses like H1a , H1b and H1c .
Table 1: Sample Means, Standard Deviations and Sample Sizes
binomial distribution:
f (x | N, π) = (N choose x) π^x (1 − π)^(N −x) .    (2)
Figure 1: Likelihood, Prior and Posterior Densities for the Binomial Example
Carlin, Stern and Rubin, 2004, pp. 576-577). The mode is obtained at
π = (8 − 1)/(8 − 1 + 14 − 1) = .35, the expectation is obtained at π = 8/22 = .363. The
mode is an equally weighted average (both the sample size of the data and
the prior distribution are equal to 10) of the value of π in the sample (.2)
and prior (.5). This illustrates how the posterior combines the information
available in the distribution of the data and the prior distribution.
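This conjugate updating can be sketched numerically. The Beta(6, 6) prior below is an assumption chosen so that, after observing x = 2 successes in N = 10 trials, the posterior is the Beta(8, 14) with mode .35 and expectation 8/22 used in the text.

```python
# Sketch of the conjugate updating behind the binomial example; the
# Beta(6, 6) prior is an assumption consistent with the posterior Beta(8, 14).

def beta_posterior(a0, b0, x, n):
    a, b = a0 + x, b0 + (n - x)
    mode = (a - 1) / (a + b - 2)   # posterior mode
    mean = a / (a + b)             # posterior expectation
    return a, b, mode, mean

print(beta_posterior(6, 6, 2, 10))  # Beta(8, 14): mode .35, mean 8/22
```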
to the unknown model parameters. For (1) the distribution is
f (y | D, x, µ, β, σ²) = ∏_{i=1}^{N} N (yi | ∑_{g=1}^{G} µg dig + βxi , σ²)    (6)
As can be seen, the same prior is used for each µg , that is, a normal dis-
tribution with mean µ0 and variance τ02 . A vague prior for µg is obtained
using e.g. τ02 = 100000. A normal distribution with such a large variance is
almost flat, implying that a priori each possible value of µg is equally likely.
The prior for β is also a normal distribution with mean β0 and variance γ02 .
Again a vague prior is obtained using e.g. γ02 = 100000. The prior for σ 2
is a so called scaled inverse chi-square distribution. The interested reader is
referred to Gelman, Carlin, Stern and Rubin (2004, pp. 50, 547, 580) for a
further specification of the scaled inverse chi-square distribution with scale
parameter λ20 and degrees of freedom ν0 . A vague prior is obtained using
ν0 = 1, see, for example, the figures in Lee (1997, pp. 51-53).
Prior distributions for inequality constrained and null models can easily
be derived from the prior distribution of the unconstrained model. Let θm
denote {µ, β, σ 2 ∈ Hm }, that is, the set of parameter values allowed given
the restriction imposed by model Hm , then
h(µ, β, σ² | Hm ) = h(µ, β, σ² | H2 ) I_{θm ∈ Hm} / ∫_{θm} h(µ, β, σ² | H2 ) I_{θm ∈ Hm} dθm ,    (8)
is to obtain a sample from the posterior, and to use this sample to compute
parameter estimates and credibility intervals (the Bayesian counterpart of
a confidence interval). For the simple binomial example this sample could
consist of, for example, 1000 values of π sampled from the posterior distri-
bution g(π | N, x). The expected value of π (called the expected a posteriori
(EAP) estimate) is then simply the average of these 1000 values. A 95% cen-
tral credibility interval is obtained using the 2.5-th and 97.5-th percentile of
the 1000 values ordered from smallest to largest. The error in estimate and
credibility interval caused by using a sample from the posterior is called the
Monte Carlo error (Gelman, Carlin, Stern and Rubin, 2004, pp. 277-278).
Increasing the sample size will reduce this error.
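The Monte Carlo computation of the EAP estimate and the central credibility interval can be sketched as follows, using 1000 draws from the Beta(8, 14) posterior of the binomial example (the exact numbers depend on the random seed).

```python
import random

# Sketch: EAP estimate and 95% central credibility interval computed from
# an ordered sample of 1000 posterior draws, as described in the text.

random.seed(1)
draws = sorted(random.betavariate(8, 14) for _ in range(1000))

eap = sum(draws) / len(draws)           # expected a posteriori estimate
lower, upper = draws[24], draws[974]    # 2.5-th and 97.5-th percentiles
print(round(eap, 3), round(lower, 3), round(upper, 3))
```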
Obtaining a sample from the posterior is not always as easy as in the sim-
ple binomial example. The latter can be obtained from many software pack-
ages, for example, in SPSS using COMPUTE with RV.BETA(.). A popular
method to obtain a sample from a multidimensional posterior distribution
is the Gibbs sampler (Gelman, Carlin, Stern and Rubin, 2004, pp. 287-289;
Gill, 2002, pp. 311-313; Lee, 1997, pp. 259-268; Hoijtink, 2000). Gibbs sam-
plers can be programmed using, for example, Fortran or C++, or, using pack-
ages especially developed for the construction of Gibbs samplers like Win-
bugs (Spiegelhalter, Thomas, Best and Lunn, 2004) or MCMCpack (Martin
and Quinn, 2005) combined with the R-package (http://www.r-project.org/)
and OpenBugs (Thomas, 2004) in combination with the R-package (BRugs,
http://cran.r-project.org/src/contrib/Descriptions/BRugs.html). The Gibbs
sampler is an iterative procedure. Each iteration consists of a number of steps
in which each parameter is sampled from its distribution conditional on the
current values of the other parameters. This will be exemplified using (9). For
notational convenience, let g = 1, . . . , 4 = ll, hl, lh, hh.
Subsequently the Gibbs sampler iterates across the following three steps for
t = 1, . . . , T iterations:
which can be shown (Klugkist, Laudy and Hoijtink, 2005) to be a
N (µg | ag , bg , L, U ) distribution where ag and bg denote the mean and
variance of this normal distribution, respectively, and
L = max{µhl , µhh } if g ∈ {ll, lh} and −∞ otherwise
denotes the lower bound on µg implied by the restriction H1a , and
U = min{µll , µlh } if g ∈ {hl, hh} and ∞ otherwise
denotes the upper bound on µg . The mean and variance are:
ag = [ µ0 /τ0² + (1/σ²)( ∑_{i=1}^{N} dig yi − β ∑_{i=1}^{N} dig xi ) ] / [ 1/τ0² + (1/σ²) ∑_{i=1}^{N} dig ] ,
and
bg = 1 / [ 1/τ0² + (1/σ²) ∑_{i=1}^{N} dig ] .
Using inverse probability sampling it is easy to sample a deviate from
this truncated distribution: (a) sample a random number u from a
uniform distribution on the interval [0,1]; (b) compute the proportions
v and w that are not admissible due to L and U :
v = ∫_{−∞}^{L} N (µg | ag , bg ) dµg ,    (11)
and,
w = ∫_{U}^{∞} N (µg | ag , bg ) dµg ;    (12)
(c) compute µg such that it is the deviate associated with the u-th
percentile of the admissible part of the posterior of µg :
v + u(1 − v − w) = ∫_{−∞}^{µg} N (µg | ag , bg ) dµg .    (13)
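Steps (a)-(c) can be sketched directly. This is a Python sketch of inverse probability sampling from a normal density truncated to [L, U]; the parameter values and bounds below are arbitrary illustrative numbers, not quantities from the self-esteem data.

```python
import random
from statistics import NormalDist

# Sketch of inverse probability sampling from N(a_g, b_g) truncated to
# [L, U]; NormalDist supplies the normal cdf and its inverse.

def truncated_normal_draw(a, b, L, U):
    dist = NormalDist(a, b ** 0.5)
    u = random.random()                 # (a) uniform deviate on [0, 1]
    v = dist.cdf(L)                     # (b) inadmissible mass below L, (11)
    w = 1.0 - dist.cdf(U)               #     inadmissible mass above U, (12)
    return dist.inv_cdf(v + u * (1.0 - v - w))   # (c) percentile step, (13)

random.seed(0)
draws = [truncated_normal_draw(0.0, 1.0, -0.5, 1.5) for _ in range(5)]
print(all(-0.5 <= d <= 1.5 for d in draws))  # -> True
```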
Table 2: Gibbs Sample and EAP estimates for the Parameter of model H1a
with Social Economic Status as a Covariate
and
d = 1 / [ 1/γ0² + (1/σ²) ∑_{i=1}^{N} xi² ] .
g(σ² | µ, β, y, D, x, Hm ),    (15)
In Table 2 a part of the sample obtained for the ’self-esteem’ data using
social economic status as a covariate is displayed for H1a . The number of
iterations was T = 6000, of which the first 1000 were used as the burn-in
period (see the next section).
As can be seen in Table 2, the 95% central credibility interval for β
contains the value zero. This implies that the adjusted means will not change
a lot if the covariate social economic status is removed from the model. As
can be seen from the observed means in Table 1, the restriction µhl < µlh
does not appear to be in accordance with the data. This is nicely reflected in
Table 2 where µhl is forced to be smaller than µlh , but is never much smaller
than µlh . This is also reflected by the EAP estimates (simply the average
of the corresponding column), and the largely overlapping central credibility
intervals (simply the 2.5-th and 97.5-th percentile of the corresponding column)
for µhl and µlh . Note that the EAP estimates were computed after deletion
of 1000 iterations burn-in, and, after a check of convergence of the Gibbs
sampler. Both burn-in and convergence will be elaborated in the next section.
Figure 2: The First 2000 Iterations of the Gibbs Sampler for µll
in Cowles and Carlin (1996) and Gill (2002, Chapter 11). Especially in more
complicated models, there is always the possibility that the Gibbs sampler did
not visit the whole domain of the posterior distribution. The consequence is
that some regions may be under-represented in the Gibbs sample. The prob-
ability that this happens can be reduced by running k = 1, . . . , K parallel Gibbs
samplers, each starting from different initial values. For each parameter this
would result (after discarding a burn-in phase) in, for example, k = 1, . . . , 5
vectors of sampled values, that can be summarized in a matrix with elements
ξtk for t = 1, . . . , 1000. If the posterior distribution of a model is a uni-modal
distribution (as is the case for all the models discussed in this chapter) there
is no need for multiple parallel chains of the Gibbs sampler. If the chain
is long enough (usually a few thousand iterations of the Gibbs sampler is
sufficient) the Gibbs sampler will almost certainly converge to the desired
posterior distribution. However, in order to check convergence, that is, to
check whether the number of iterations is large enough, it is still convenient to
collect the values sampled in a matrix with elements ξtk . In this case k = 1
refers to iterations 1001,...,2000, k = 2 to 2001,...,3000 etc., that is, for each
of the sequences T = 1000.
Iterations 1001,...,2000 are displayed in the bottom panel of Figure 2.
For k = 2, . . . , 5 almost identical displays are obtained. That is, according
to an eyeball test the Gibbs sampler has converged. Gelman, Carlin, Stern
and Rubin (2004, pp. 294-299) present a diagnostic that has become quite
popular as a more formal way to check convergence. First of all, for each
parameter the so-called between and within sequence variance is computed:
B = [ T /(K − 1) ] ∑_{k=1}^{K} (ξ̄.k − ξ̄.. )² ,    (16)
where ξ̄.k = (1/T ) ∑_{t=1}^{T} ξtk and ξ̄.. = (1/K) ∑_{k=1}^{K} ξ̄.k , and
W = (1/K) ∑_{k=1}^{K} [ 1/(T − 1) ] ∑_{t=1}^{T} (ξtk − ξ̄.k )² .    (17)
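The between and within sequence variances, and the diagnostic built from them, can be sketched as follows. The combination step (the square root of the pooled-to-within variance ratio) is the standard R̂ of Gelman, Carlin, Stern and Rubin (2004) and is assumed here; the chains are simulated placeholders.

```python
import random

# Sketch of B (16), W (17) and the Gelman-Rubin R-hat for K chains of
# length T; values close to 1 indicate convergence.

def gelman_rubin(chains):
    """chains: K lists, each with T sampled values of one parameter."""
    K, T = len(chains), len(chains[0])
    means = [sum(c) / T for c in chains]                 # per-chain means
    grand = sum(means) / K                               # overall mean
    B = T / (K - 1) * sum((m - grand) ** 2 for m in means)       # eq. (16)
    W = sum(sum((x - m) ** 2 for x in c) / (T - 1)
            for c, m in zip(chains, means)) / K                  # eq. (17)
    var_plus = (T - 1) / T * W + B / T                   # pooled variance
    return (var_plus / W) ** 0.5

random.seed(2)
chains = [[random.gauss(10, 2) for _ in range(1000)] for _ in range(5)]
print(round(gelman_rubin(chains), 2))
```

Values close to 1, such as those in Table 3, indicate that the chains are indistinguishable and the sampler has converged.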
Table 3: R̂ for H1a Using Social Economic Status as a Covariate
Parameter R̂
µll 1.01
µhl 1.01
µlh 1.02
µhh 1.01
β 1.01
σ2 1.03
(Chib and Greenberg, 1995; Tierney, 1998; Gelman, Carlin, Stern and Rubin,
2004, pp. 290-292; Gill, 2002, pp. 317-325) it is easy to sample from non-
standard distributions. Here we will focus on the Metropolis Hastings within
Gibbs algorithm. In this algorithm within one or more steps of the Gibbs
sampler the Metropolis Hastings algorithm is used to sample the conditional
distribution at hand (Gelman, Carlin, Stern and Rubin, 2004, p. 292).
Suppose, for example, that the conditional distribution in Step 2 of our
Gibbs sampler cannot be sampled from directly. What often can be done in such
a situation is evaluation of g(β | µ, σ 2 , y, D, x, Hm ) = g(β | .) for each value of
β (just evaluate (14) for a specific value of β with all the other parameters
fixed at their current values). What subsequently is needed is an approximation
of the target distribution g(β | .) by means of a standard distribution.
Especially for models that contain many parameters the choice of the ap-
proximating distribution is important: the closer the resemblance between
approximation and target the faster the Metropolis-Hastings within Gibbs
sampler will converge (Gelman, Carlin, Stern and Rubin, 2004, pp. 305-307).
A basic idea is to use an approximating distribution depending on the
values sampled in the previous iteration, q(β t | β t−1 ). The interested reader
is referred to Robert and Casella (2004, Chapter 7.3) for an elaboration of
this idea. A so-called independent Metropolis-Hastings algorithm (Robert
and Casella, 2004, Chapter 7.4) is obtained if the approximating distribution
does not depend on the values sampled in the previous iteration. A rather
unintelligent idea that would nevertheless work quite well in the situation at
hand is to use an approximation q(β t | β t−1 ) = q(β t ) ∼ N (0, 1). After spec-
ification of the approximating distribution three steps are needed to sample
a value from the target distribution:
1. In iteration t, sample a value β t from q(β t | β t−1 ).
2. Compute the ratio r = [ g(β t )/q(β t | β t−1 ) ] / [ g(β t−1 )/q(β t−1 | β t ) ].
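The complete algorithm can be sketched as follows. The target g below is an assumed unnormalised N(0.3, 0.2²) density standing in for g(β | .), and the acceptance step (accept β t with probability min(1, r), otherwise keep β t−1) is the standard third step of the Metropolis-Hastings algorithm, assumed here.

```python
import math, random

# Sketch of an independent Metropolis-Hastings sampler with the N(0, 1)
# proposal of the text; the target density g is an assumed stand-in.

def g(beta):                       # unnormalised target density (assumed)
    return math.exp(-(beta - 0.3) ** 2 / (2 * 0.2 ** 2))

def q_pdf(beta):                   # N(0, 1) proposal density
    return math.exp(-beta ** 2 / 2) / math.sqrt(2 * math.pi)

random.seed(3)
beta, draws = 0.0, []
for _ in range(20000):
    cand = random.gauss(0.0, 1.0)                            # step 1
    r = (g(cand) / q_pdf(cand)) / (g(beta) / q_pdf(beta))    # step 2
    if random.random() < min(1.0, r):                        # step 3: accept
        beta = cand
    draws.append(beta)

burned = draws[2000:]
print(round(sum(burned) / len(burned), 2))  # close to the target mean .3
```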
H0 : µ1 = µ2 , which is evaluated using the test statistic
t(y) = (ȳ1 − ȳ2 ) / √( s21 /N1 + s22 /N2 ),
where N1 and N2 denote the sample sizes in group 1 and 2, respectively, and
y 1 , y 2 , s21 and s22 the corresponding sample averages and variances. Note that
in the null-population µ1 = µ2 , that is, both means have the same value. In
the sequel this value will be denoted by µ. As can be seen in Figure 3, first
of all data matrices have to be sampled from the null-population. This is
problematic because under H0 the values µ and σ 2 have to be known in order
to be able to sample data. Here µ and σ 2 are nuisance parameters, stated
otherwise, there are many values for µ and σ 2 that are in accordance with
H0 , which leaves open the question from which of the many null-populations
the data matrices should be sampled.
In many standard situations (analysis of variance, multiple regression)
nuisance parameters can easily be handled because the test statistic is a
pivot, that is, the distribution of the test statistic does not depend on the
actual values of the nuisance parameters. This is illustrated in Figure 3:
whatever the actual values of µ and σ 2 the t-test always has a t-distribution
with N1 + N2 − 2 degrees of freedom. Stated otherwise, the two-sided p-value
does not depend on the actual null-population from which data matrices are
replicated, because the distribution of T (·) is always tN1 +N2 −2 .
Pivots are among the most elegant achievements of classical statistics.
For many situations pivotal test statistics do not exist. Classical solutions
for this situation are so called plug-in p-values (Bayarri and Berger, 2000)
or asymptotic p-values (Robins, van der Vaart and Ventura, 2000), that is,
p-values computed assuming that the sample size is very large. However,
since this chapter is on Bayesian data analysis we will limit ourselves to the
Bayesian way to deal with nuisance parameters in the absence of pivotal test
statistics: posterior predictive p-values.
that is, in accordance with the Bayesian tradition computations are per-
formed conditional on the data that are observed. This opens the possibility
[Figure: schematic of the computation of the posterior predictive p-value: parameter values θ 0,1 , . . . , θ 0,T are sampled from g(θ 0 | Z, H0 ); for each draw a data matrix Z rep is replicated and T (θ 0,t , Z rep ) and t(θ 0,t , Z) are computed; the p-value is the proportion of T (·) larger than t(·).]
Table 4: The Computation of Posterior Predictive P-values

t    µ1     µ2     µ3     µ4     σ2     β    T (·)  t(·)
1    18.20  16.62  12.46  13.18  12.62  .00  1.86   1.60
...
6    18.44  16.51  14.97  13.02  12.25  .01  1.76   1.64
...
residual variances are equal. This will be investigated using the following
discrepancy measure:
t(·) = s2largest /s2smallest , (25)
where
s2g = (1/Ng ) ∑_{i=1}^{Ng} (yi − µg − βxi )²    (26)
denotes the within group residual variance of which the smallest and largest
observed in the four groups are used in the test statistic. Note that Ng
denotes the sample size in group g. Note furthermore that (26), and thus
(25), depends both on the data and the unknown model parameters µ and β.
This measure is chosen to show that the posterior predictive approach
enables a researcher to construct any test statistic without having to derive
its distribution under the null-hypothesis. As will be elaborated in the next
section, t(.) can be evaluated using a posterior predictive p-value, or using
its distance to the distribution of T (.). The latter approach is called model
checking (Gelman, Meng and Stern, 1996): even if the p-value is rather small,
a researcher may conclude that the distance between t(.) and the distribution
of T (.) is so small, that it is not necessary to adjust the model used, e.g.,
that it is not necessary to use a model with group dependent within group
residual variances. It is interesting to note a rule of thumb existing in the
context of analysis of variance (Tabachnick and Fidell, 1996, p. 80): if the
sample sizes per group are within a ratio of 4:1, t(.) may be as large as 10,
before heterogeneity of within group variances becomes a problem.
First of all the Gibbs sampler was used to obtain a sample from the pos-
terior distribution of the null-model. A part of the results is displayed in
Table 4. Subsequently, for t = 1, . . . , T a data matrix is replicated from the
null-population. Finally, t(·) and T (·) are computed using the observed and
replicated data matrices, respectively, and µ and β. The posterior predictive
p-value is then simply the proportion of T (·) larger than t(·) resulting in the
value .88. This implies that the discrepancies computed for the observed data
are in accordance with the posterior predictive distribution of the discrep-
ancies under the hypothesis of equal within group residual variances. The
range of the observed discrepancies was [1.51,1.72], the range of the repli-
cated discrepancies [1.02,3.58]. As can be seen the observed discrepancies
are well within the range of the replicated discrepancies. Furthermore, the
values of the observed discrepancies are much smaller than the rule of thumb
that t(.) may be as large as 10 (for analysis of variance, here we look at an
analysis of covariance) if the group sizes differ less than a factor 4:1. The
conclusion is that a model with equal within group residual variances is more
than reasonable.
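The whole procedure can be sketched as follows. The group count, group size, data and the posterior draws of the group means below are all simulated placeholders (the covariate term of (26) is omitted for brevity); only the logic follows the text: for each posterior draw, replicate a data matrix, compute the variance-ratio discrepancy for the replicated and the observed data, and count how often T(·) exceeds t(·).

```python
import random

# Sketch of a posterior predictive check of equal within-group residual
# variances using the discrepancy (25); all numbers are assumptions.

random.seed(4)
G, n_g, sigma = 4, 25, 3.5
observed = [[random.gauss(15, sigma) for _ in range(n_g)] for _ in range(G)]

def variance_ratio(groups, means):
    s2 = [sum((y - m) ** 2 for y in grp) / len(grp)
          for grp, m in zip(groups, means)]
    return max(s2) / min(s2)       # largest over smallest residual variance

count, T = 0, 1000
for _ in range(T):
    # assumed posterior draws of the group means around the sample means
    mus = [sum(grp) / n_g + random.gauss(0, sigma / n_g ** 0.5)
           for grp in observed]
    replicated = [[random.gauss(m, sigma) for _ in range(n_g)] for m in mus]
    if variance_ratio(replicated, mus) > variance_ratio(observed, mus):
        count += 1

print(count / T)   # the posterior predictive p-value
```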
measures were (almost) uniform, that is, that P (p < α | H0 ) ≈ α.
Also in other situations researchers can execute such a simulation study
to determine if their posterior predictive p-values have acceptable fre-
quency properties or not.
• Bayarri and Berger (2000) present two new types of p-values that ex-
plicitly account for the fact that the data are used twice: the conditional
predictive p-value and the partial posterior predictive p-value. In their
examples the frequency properties of these p-values are excellent. How-
ever, their examples are rather simple, and it may be difficult or even
impossible to compute these p-values for more elaborate examples like
the example given in the previous section.
• Bayarri and Berger (2000) note and exemplify that so-called ’plug-in’
p-values appear to have better frequency properties than posterior pre-
dictive p-values. These p-values can be obtained using the parametric
bootstrap, that is, replace θ 0,1 , . . . , θ 0,T in Step 1 of the computation
of the posterior predictive p-value by the maximum likelihood estimate
of the model parameters θ̂ 0 for t = 1, . . . , T . Note, that although the
frequency properties of plug-in p-values appear to be better than those
of posterior predictive p-values, it has to be determined for each new
situation how good they actually are.
• Bayarri and Berger (2000) also note that p-values can be calibrated. In
its simplest form this entails the simulation of a sequence Z 1 , . . . , Z J
from a null population, and subsequent computation of the sequence
p1 , . . . , pJ . If the latter is not uniformly distributed, it does not hold
that P (p < α | H0 ) = α. However, using the sequence p1 , . . . , pJ for
each α a value α∗ can be found such that P (p < α∗ | H0 ) = α. If,
subsequently, it is desired to test the null-hypothesis with α = .05 for
empirical data, the null hypothesis should be rejected if the p-value is
smaller than the α∗ corresponding to α = .05.
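The calibration step can be sketched as follows. The null p-values below are drawn from an assumed non-uniform Beta(2, 1) distribution (mimicking a conservative p-value); α∗ is then the empirical quantile for which the null rejection rate equals the nominal α.

```python
import random

# Sketch of p-value calibration: simulate p-values under H0, then find
# alpha* such that P(p < alpha* | H0) equals the nominal alpha = .05.

random.seed(5)
p_null = sorted(random.betavariate(2, 1) for _ in range(100000))

alpha = 0.05
alpha_star = p_null[int(alpha * len(p_null))]   # empirical 5% quantile
print(round(alpha_star, 2))
```

For the Beta(2, 1) example the exact calibrated value is √.05 ≈ .224, so a p-value below .224 rather than .05 would be required for rejection at the 5% level.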
• Last but not least, Gelman, Meng and Stern (1996) are not in the least
worried about the frequency properties of posterior predictive p-values.
They suggest to use discrepancies simply to assess the discrepancy be-
tween a model and the data. A quote from Tiao and Xu (1993) clarifies
what they mean: ”... development of diagnostic tools with a greater
emphasis on assessing the usefulness of an assumed model for specific
purposes at hand, rather than on whether the model is true”. They
also suggest not to worry about the power that can be achieved using a
specific discrepancy, but, to choose the discrepancy such that it reflects
”how the model fits in aspects that are important for our problems at
hand”. Stated otherwise, although posterior predictive inference is not
a straightforward alternative for the classical approach with respect to
hypothesis testing (is H0 true or not?), it can be used for model check-
ing. It allows researchers to define discrepancies between model and
data such that they are relevant for the problem at hand (as was done
in the previous section to investigate equality of within group residual
variances). Subsequently the observed size of these discrepancies can
be compared with the sizes that are expected if the model is true via
the posterior predictive distribution of these discrepancies. Finally, the
researcher at hand has to decide whether the differences between the
observed and replicated discrepancies are so large that it is worthwhile
to adjust the model.
[Figure 5: the likelihood f (y | D, µ1 , µ2 ) over the square −2 ≤ µ1 , µ2 ≤ 2, with the diagonal µ1 = µ2 dividing the parameter space into an upper triangle (µ1 < µ2 ) and a lower triangle (µ1 > µ2 ).]
models. Classical information criteria like AIC (Akaike, 1987) and CAIC
(Bozdogan, 1987) consist of two parts:
• The first part is −2 log f (Z | θ̂ m ), that is, the distribution of the data
or likelihood evaluated using the maximum likelihood estimate θ̂ m of
θ m . The smaller the value of the first part, the better the fit of the
model.
• The second part is a penalty for model size which is a function of the
number of parameters P in a model. For AIC this penalty is 2P , for
CAIC the penalty is (log N + 1)P . The smaller the penalty, the more
parsimonious the model.
An information criterion results from the addition of fit and penalty, the
smaller the resulting number, the better the model at hand.
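The two criteria can be sketched directly. The log-likelihood value, P and N below are arbitrary placeholders.

```python
import math

# Sketch of the two information criteria: fit (-2 times the maximised
# log-likelihood) plus a penalty, 2P for AIC and (log N + 1)P for CAIC.

def aic(loglik, P):
    return -2.0 * loglik + 2.0 * P

def caic(loglik, P, N):
    return -2.0 * loglik + (math.log(N) + 1.0) * P

loglik, P, N = -150.0, 6, 100
print(aic(loglik, P), round(caic(loglik, P, N), 2))  # -> 312.0 333.63
```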
As will now be illustrated, fit and penalty are (although implicitly) also
important parts of the marginal likelihood (28). It is therefore a fully auto-
matic Ockham’s razor (Smith and Spiegelhalter, 1980; Jefferys and Berger,
1992; Kass and Raftery, 1995) in the sense that model fit and model size are
automatically accounted for. Consider, for example, the situation displayed
in Figure 5. There are two models under investigation:
H1 : yi = ∑_{g=1}^{2} µg dig + ei , with ei ∼ N (0, 1),    (29)
and,
H2 : yi = ∑_{g=1}^{2} µg dig + ei , with ei ∼ N (0, 1) and µ1 > µ2 .    (30)
As can be seen in Figure 5, the fit of both models is the same because, loosely
speaking, both H1 and H2 support the maximum of f (y | D, µ1 , µ2 ). However,
when (31) is evaluated it turns out that it is larger for H2 than for H1 , that
is, H2 is preferred to H1 . This can be seen as follows: denote the integrated
density of f (·) over the upper triangle by a and over the lower triangle by
b. Since a is smaller than b, it follows that m(y | H1 ) = (1/16)a + (1/16)b is
smaller than m(y | H2 ) = (2/16)b. Stated otherwise, the marginal likelihood
prefers H2 over H1 because the fit of both models is about the same, but the
parameter space of H2 is smaller than the parameter space of H1 .
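This automatic Ockham's razor can be sketched numerically. The likelihood surface f below is an assumed stand-in that peaks at (µ1, µ2) = (1, −1), inside the region µ1 > µ2; Monte Carlo integration against the uniform prior on the square (density 1/16, H1) and against the twice as dense prior on the triangle µ1 > µ2 (density 2/16, H2) then reproduces m(y | H2) > m(y | H1).

```python
import math, random

# Sketch of the marginal likelihood comparison of Figure 5; f is an
# assumed likelihood surface, not derived from the self-esteem data.

def f(mu1, mu2):                       # assumed likelihood surface
    return math.exp(-((mu1 - 1) ** 2 + (mu2 + 1) ** 2) / 2)

random.seed(6)
n, m1, m2 = 200000, 0.0, 0.0
for _ in range(n):
    mu1, mu2 = random.uniform(-2, 2), random.uniform(-2, 2)
    m1 += f(mu1, mu2)                  # H1: prior density 1/16 on the square
    if mu1 > mu2:
        m2 += 2.0 * f(mu1, mu2)        # H2: prior density 2/16 on the triangle
m1, m2 = m1 / n, m2 / n
print(m2 > m1)  # -> True
```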
The ratio of two marginal likelihoods is called the Bayes factor (Kass
and Raftery, 1995; Gill, 2002, Chapter 7; Lee, 1997, Chapter 4; Lavine and
Schervish, 1999), that is,
BFm,m′ = m(Z | Hm ) / m(Z | Hm′ ) = [ P (Hm | Z) / P (Hm′ | Z) ] / [ P (Hm ) / P (Hm′ ) ] .    (32)
As can be seen, the Bayes factor is equal to the ratio of posterior to prior
model-odds. This means that the Bayes factor represents the change in
belief from prior to posterior model odds. Stated otherwise, if BFm,m′ = 4
model m has become four times as likely as model m′ after observing the data.
A more straightforward interpretation of the marginal likelihood is obtained
using posterior model probabilities computed under the assumption that the
prior model probabilities P (Hm ) = 1/M for m = 1, . . . , M :
P (Hm | Z) = m(Z | Hm ) / ∑_{m′=1}^{M} m(Z | Hm′ ) .    (33)
If BFm,m′ = 4 then with equal prior probabilities the posterior probabilities
of model m and m′ are .80 and .20, respectively.
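For two models with equal prior probabilities, (33) reduces to a one-line computation:

```python
# Sketch of (33) for two models with equal prior probabilities: a Bayes
# factor BF translates into posterior probabilities BF/(BF+1) and 1/(BF+1).

def posterior_probs(bf):
    return bf / (bf + 1.0), 1.0 / (bf + 1.0)

print(posterior_probs(4.0))  # -> (0.8, 0.2)
```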
[Figure: two densities for µ on the interval [−3, 3], labelled d = 2 and d = 3, with maxima of about .4 and .3, respectively.]
data (Berger and Pericchi, 1996, 2004; Perez and Berger, 2002) is to use as
small a part of the data as possible to construct a prior distribution for each
model under consideration. This will render a prior distribution that is in
agreement with the population from which the data are sampled, and, that
is informative enough to avoid the Bartlett-Lindley paradox. The as small
as possible part of the data is called a minimal training sample and will be
denoted by Z(l). A minimal training sample is the smallest sample for which
the posterior prior is proper:
h(θ m | Z(l), Hm ) = f (Z(l) | θ m ) h(θ m | Hm ) / m(Z(l) | Hm ) .    (36)
A standard (but not the only possible) choice for h(θ m | Hm ) is a reference
prior (Kass and Wasserman, 1996). For the example in (34) and (35) the size
of the minimal training sample is one, because one observation is sufficient
to obtain a proper posterior prior for µ: f (z(l) | µ) = N (z(l) | µ, 1), h(µ |
H2 ) ∝ constant, resulting in h(µ | z(l), H2 ) = N (µ | z(l), 1). Note that for
H1 the (posterior) prior is a point mass of one at µ = 0.
The posterior prior distribution depends on the training sample chosen.
One way to avoid this arbitrariness is to randomly select many training sam-
ples from the observed data. The two most important ways in which these
training samples can be processed to render one Bayes factor are averaged
intrinsic Bayes factors (Berger and Pericchi, 1996, 2004) and expected pos-
terior priors (Berger and Pericchi, 2004; Perez and Berger, 2002). For each
training sample the intrinsic Bayes factor of model m to m0 can be computed:
IBFm,m′ = ∫_{θm} f (Z(−l) | θ m ) h(θ m | Z(l), Hm ) dθ m / ∫_{θm′} f (Z(−l) | θ m′ ) h(θ m′ | Z(l), Hm′ ) dθ m′ ,    (37)
where Z(−l) denotes the data matrix excluding the observations that are
part of the training sample. The average of the IBF’s resulting for each of
the training samples is the averaged intrinsic Bayes factor. Bayes factors can
also be computed using (28) for each model m with h(θ m | Hm ) replaced by
the expected posterior prior:
(1/L) ∑_{l=1}^{L} h(θ m | Z(l), Hm ),    (38)
where L denotes the number of training samples.
Both intrinsic Bayes factors and the approach using expected posterior
priors are general methods that can be applied in many situations. The
encompassing prior approach (Klugkist, Laudy and Hoijtink, 2005; Klugk-
ist, Kato and Hoijtink, 2005; Kato and Hoijtink, 2006; Laudy and Hoijtink,
2006) was developed specifically to deal with the selection of the best of a
set of inequality constrained hypotheses (see Section 1 for an elaboration
of inequality constrained hypotheses for the self-esteem data). Since (8) is
used to derive the prior distribution for constrained models, only the encom-
passing prior (7), that is, the prior for the unconstrained model, has to be
specified. This is in agreement with the principle of compatibility (Dawid
and Lauritzen, 2000) which is best illustrated using a quote from Leucari
and Consonni (2003) and Roverato and Consonni (2004): ”If nothing was
elicited to indicate that the two priors should be different, then it is sensible
to specify [the prior of constrained models] to be, . . ., as close as possible
to [the prior of the unconstrained model]. In this way the resulting Bayes
factor should be least influenced by dissimilarities between the two priors due
to differences in the construction processes, and could thus more faithfully
represent the strength of the support that the data lend to each model."
As can be seen in (7), each mean has the same prior distribution; this ensures that the encompassing prior does not favor any of the models being
compared. Furthermore, the encompassing prior should assign a substantial
probability to values of µ, β and σ 2 that are in agreement with the data at
hand, and very small probabilities to values that are not. Since it is a priori
unknown which values are in agreement with the data, these values will be
derived using the data. This is reasonable because the compatibility of the
priors ensures that this information is used in a similar manner for each of
the models under investigation. The following procedure is used:
• The prior distribution for σ² is an Inv-χ²(σ² | ν0, λ0²). We will use ν0 = 1 and λ0² = 12.1 (the least squares estimate of σ²).
• The prior distribution for β is a N(β | β0, γ0²). The lower (l) and upper (u) bound of the 99.7% confidence interval for the least squares estimate of β are used to determine the prior distribution: β0 = (u + l)/2 and γ0² = ((u − l)/2)². Stated otherwise, the prior mean and variance are chosen such that the mean minus one standard deviation equals l, and the mean plus one standard deviation equals u. The resulting numbers for β0 and γ0² are 0 and .0004, respectively.
• For g = 1, . . . , G the prior distribution for µg is N(µg | µ0, τ0²). As for β, for each mean the lower and upper bound of the 99.7% confidence interval for the least squares estimate are determined. The smallest lower bound becomes l and the largest upper bound u. Subsequently, µ0 and τ0² are determined in the same way as β0 and γ0². The resulting numbers for µ0 and τ0² are 15.7 and 13.7, respectively.
To summarize this section: researchers who want to use Bayes factors to select the best of a number of competing models should not choose reference, vague or uninformative priors. This was exemplified using the Bartlett-Lindley paradox. Instead they should use either subjective priors, or priors constructed using the data, like the expected posterior prior or the encompassing prior.
rather inefficient, that is, often a huge sample from h(θm | Hm) is needed to avoid that m̂(·) depends strongly on the sample at hand. An improvement of (39) is the harmonic mean estimator (Kass and Raftery, 1995):

\[
\hat{m}(Z \mid H_m) = \left( \frac{1}{T} \sum_{t=1}^{T} f(Z \mid \theta_{m,t})^{-1} \right)^{-1}, \qquad (40)
\]

where θm,t denotes the t-th of T draws from the posterior g(θm | Z, Hm).
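For a toy model where the exact marginal likelihood is available in closed form (N(θ, 1) data with a N(0, 1) prior; my choice, not a model from this chapter), the harmonic mean estimator can be sketched with draws from the conjugate posterior:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 20
z = rng.normal(0.5, 1.0, size=n)     # z_i ~ N(theta, 1)

def log_lik(theta):
    # log f(Z | theta) for an array of theta values
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * ((z[:, None] - theta) ** 2).sum(axis=0))

# Conjugate prior theta ~ N(0, 1) gives posterior N(n*zbar/(n+1), 1/(n+1))
theta = rng.normal(n * z.mean() / (n + 1), np.sqrt(1 / (n + 1)), size=50_000)

# Harmonic mean estimator: inverse of the average inverse likelihood
ll = log_lik(theta)
log_m_hat = -(np.logaddexp.reduce(-ll) - np.log(len(theta)))

# Exact log marginal likelihood for comparison (Z ~ N(0, I + 11'))
exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
         - 0.5 * (np.sum(z ** 2) - np.sum(z) ** 2 / (n + 1)))
print(log_m_hat, exact)
```

As Kass and Raftery (1995) note, the harmonic mean estimator can itself be unstable (its variance may be infinite), so the estimate typically agrees with the exact value only roughly even for a large number of draws.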
Table 5: Posterior Probabilities for Four Models for the Self-Esteem Data
5.4 Example
In the introduction of this chapter the self-esteem data were introduced. The
four hypotheses that were specified for these data are listed in Table 5. As
can be seen, the hypothesis that the four means are equal is replaced by the
hypothesis that the four means are about equal. The main reason for this
substitution is that (42) is not defined for models in which two or more of
the parameters are exactly equal. Another reason is that the traditional null-
hypothesis does not always describe a state of affairs in the population that
is relevant for the research project at hand. See, for example, Cohen (1994)
for an elaboration of this point of view. In these situations the traditional
null-hypothesis can be replaced by a hypothesis that states that the four
means are about equal, where about equal is operationalized as:
Further motivation for this choice can be found in Berger and Delampady
(1987). For the computation of the posterior probabilities presented in Table
5, ε = .1. This number is about one quarter of the posterior standard error
the means if (1) is used to analyze the self-esteem data without constraints
on the parameters. Results in Berger and Delampady (1987) suggest that
use of such small values of ε in (44) renders results that are similar to using
ε = 0, that is, using exact equality constraints. As can be seen, the data
provide support for H1b and H1c , and not for H1a and H0 . Given posterior
probabilities of .40 and .60 for H1b and H1c , respectively, it is hard to decide
which hypothesis is the best. Choosing H1c implies that the probability
of incorrectly rejecting H1b is .40, which is a rather large conditional error
probability. It is much more realistic to acknowledge that both models have
their merits, or, to use a technique called model averaging (Hoeting, Madigan,
Raftery and Volinsky, 1999) which can, loosely speaking, be used to combine
both models. Whatever method is used, looking at Table 5 it can be seen
that both models agree that µhh < µll . However, there is disagreement about
the position of µlh and µhl . It is interesting to see (note that the EAP of
β was about zero for all models under investigation) that the restrictions
of both H1b and H1c are in agreement with the observed means in Table 1.
Probably H1c has a higher posterior probability than H1b because it contains
one more inequality constraint, that is, it is a smaller model and thus the
implicit penalty for model size in the marginal likelihood is smaller.
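Posterior model probabilities such as those in Table 5 are obtained by weighting each model's marginal likelihood by its prior model probability and renormalizing. A minimal sketch, with four invented log marginal likelihoods chosen only to mimic a .40/.60-style split between the two best models:

```python
import numpy as np

# Hypothetical log marginal likelihoods for four competing models
log_m = np.array([-130.0, -124.0, -112.9, -112.5])
prior = np.full(4, 0.25)                  # equal prior model probabilities

# P(Hm | Z) proportional to m(Z | Hm) * P(Hm); normalize on the log scale
w = log_m + np.log(prior)
post = np.exp(w - np.logaddexp.reduce(w))
print(post.round(3))
```

Working on the log scale via `logaddexp` avoids underflow, since marginal likelihoods of realistic data sets are far too small to represent directly.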
6 Further Reading
This chapter provided an introduction, discussion and illustration of Bayesian
estimation using the Gibbs sampler, model checking using posterior predic-
tive inference, and model selection using posterior probabilities. As noted
before, I consider these to be the most important components of Bayesian
data analysis. Below I will shortly discuss other important components of
Bayesian data analysis that did not receive attention in this chapter.
Hierarchical modelling (Gill, 2002, Chapter 10; Gelman, Carlin, Stern
and Rubin, 2004, Chapter 5; Lee, 1997, Chapter 8) is an important Bayesian
tool for model construction. Consider, for example, a sample of g = 1, . . . , G schools; within each school the IQ (denoted by yi|g) of i = 1, . . . , N children is measured. For g = 1, . . . , G it can be assumed that yi|g ∼ N(yi|g | µg, 15).
A hierarchical model is obtained if it is assumed that the µg have a common
distribution: µg ∼ N(µg | µ, σ²). For µ and σ² a so-called hyper prior has to be specified, e.g., h(µ, σ²) = N(µ | µ0, τ0²) Inv-χ²(σ² | ν0, λ0²). This setup
renders the joint posterior distribution of µ1 , . . . , µG , µ and σ 2 as:
\[
g(\mu_1, \ldots, \mu_G, \mu, \sigma^2 \mid y_1, \ldots, y_G) \propto
\prod_{g=1}^{G} \left[ \prod_{i=1}^{N} N(y_{i|g} \mid \mu_g, 15) \right] N(\mu_g \mid \mu, \sigma^2)
\times N(\mu \mid \mu_0, \tau_0^2)\, \mathrm{Inv}\text{-}\chi^2(\sigma^2 \mid \nu_0, \lambda_0^2). \qquad (45)
\]
Using a data augmented Gibbs sampler this posterior is easily sampled by iterating across the following two steps:
• Data augmentation: for g = 1, . . . , G sample µg from

\[
g(\mu_g \mid \mu, \sigma^2, y_{1|g}, \ldots, y_{N|g}) \propto \prod_{i=1}^{N} N(y_{i|g} \mid \mu_g, 15)\, N(\mu_g \mid \mu, \sigma^2). \qquad (46)
\]
• Sample µ and σ² from

\[
g(\mu, \sigma^2 \mid \mu_1, \ldots, \mu_G) \propto \prod_{g=1}^{G} N(\mu_g \mid \mu, \sigma^2)\, N(\mu \mid \mu_0, \tau_0^2)\, \mathrm{Inv}\text{-}\chi^2(\sigma^2 \mid \nu_0, \lambda_0^2). \qquad (47)
\]
As illustrated in this chapter, this sample can be used for estimation, model
checking and model selection.
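A runnable sketch of this two-step sampler for simulated school data follows. All data sizes and hyperparameter values below are my own assumptions, the second argument of N(· | µ, ·) is read as a variance, and µ and σ² are updated in turn from their full conditionals rather than jointly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated school data (sizes are assumptions): G schools, N children
# each, y_ig ~ N(mu_g, 15) with the within-school variance fixed at 15
G, N = 8, 30
true_mu = rng.normal(100, 5, size=G)
y = rng.normal(true_mu[:, None], np.sqrt(15), size=(G, N))

mu0, tau0_sq = 100.0, 100.0   # hyper prior for mu (assumed values)
nu0, lam0_sq = 1.0, 25.0      # hyper prior for sigma^2 (assumed values)

mu, sigma_sq = mu0, lam0_sq
draws = []
for t in range(2000):
    # Step 1 (data augmentation): mu_g | mu, sigma^2, y -- conjugate normal
    prec_g = N / 15 + 1 / sigma_sq
    mean_g = (N * y.mean(axis=1) / 15 + mu / sigma_sq) / prec_g
    mu_g = rng.normal(mean_g, np.sqrt(1 / prec_g))
    # Step 2a: mu | sigma^2, mu_1..mu_G -- conjugate normal
    prec_mu = G / sigma_sq + 1 / tau0_sq
    mu = rng.normal((mu_g.sum() / sigma_sq + mu0 / tau0_sq) / prec_mu,
                    np.sqrt(1 / prec_mu))
    # Step 2b: sigma^2 | mu, mu_1..mu_G -- scaled inverse chi-square
    scale = nu0 * lam0_sq + np.sum((mu_g - mu) ** 2)
    sigma_sq = scale / rng.chisquare(nu0 + G)
    if t >= 500:              # discard burn-in
        draws.append(mu)
print(np.mean(draws))         # posterior mean of the grand mean mu
```

The retained draws of µ1, . . . , µG, µ and σ² can then be used for estimation, model checking and model selection exactly as illustrated earlier in the chapter.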
In Section 4 posterior predictive inference was presented. The interested
reader is referred to Box (1980), who discusses prior predictive inference.
Prior predictive inference is obtained if in Figure 4 the posterior distribution
g(θ0 | Z, H0) is replaced by the prior distribution h(θ0 | H0). See Gelman,
Meng and Stern (1996) for comparisons of both methods.
Besides posterior probabilities there are other Bayesian methods that
can be used for model selection. The Bayesian information criterion (BIC,
Kass and Raftery, 1995; Gill, 2002, pp. 223-224) is an approximation of
−2 log m(Z | Hm) that is similar to the CAIC: −2 log f(Z | θ̂m) + P log N.
The deviance information criterion (DIC, Spiegelhalter, Best, Carlin and van
der Linde, 2002) is an information criterion that can be computed using a
sample of parameter vectors from g(θm | Z, Hm). Like the marginal likelihood, its penalty for model complexity does not have to be specified in terms of the number of parameters, but is determined using "the mean of the deviances minus the deviance of the mean" as a measure of the size of the parameter space.
space. The posterior predictive L-criterion (Laud and Ibrahim, 1995; Gelfand
and Gosh, 1998) is a measure of the distance between the observed data and
the posterior predictive distribution of the data for each model under inves-
tigation. It can be used to select the model that best predicts the observed
data in terms of the specific L-criterion chosen.
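The BIC approximation mentioned above is straightforward to compute. A toy sketch (the data-generating value is my choice) compares H0: θ = 0 against H1: θ free for N(θ, 1) data:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100
z = rng.normal(1.0, 1.0, size=n)   # N(theta, 1) data, true theta = 1

def max_log_lik(theta_hat):
    # log f(Z | theta_hat) for known unit variance
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((z - theta_hat) ** 2)

# BIC = -2 log f(Z | theta_hat) + P log N
bic_h0 = -2 * max_log_lik(0.0)                      # H0: theta = 0, P = 0
bic_h1 = -2 * max_log_lik(z.mean()) + np.log(n)     # H1: theta free, P = 1
print(bic_h0, bic_h1)   # the model with the smaller BIC is preferred
```

Because the data are generated far from θ = 0, the improvement in fit under H1 easily outweighs the P log N penalty, and H1 obtains the smaller BIC.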
References
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.
Bayarri, M.J. and Berger, J.O. (2000). P-values for composite null models.
Journal of the American Statistical Association, 95, 1127-1142.
Berger, J.O. and Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 3, 317-352.
Berger, J.O. and Pericchi, L. (1996). The intrinsic Bayes factor for model
selection and prediction. Journal of the American Statistical Associa-
tion, 91, 109-122.
Cowles, M.K. and Carlin, B.P. (1996). Markov chain Monte Carlo methods:
a comparative review. Journal of the American Statistical Association,
91, 883-904.
Gelfand, A.E. and Ghosh, S.K. (1998). Model choice, a minimum posterior predictive loss approach. Biometrika, 85, 1-11.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2004). Bayesian
Data Analysis, London: Chapman and Hall.
Gelman, A., Meng, X.L. and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807.
Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian
model averaging, a tutorial. Statistical Science, 14, 382-417.
Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the Ameri-
can Statistical Association, 90, 773-795.
Laud, P. and Ibrahim, J. (1995). Predictive model selection. Journal of the
Royal Statistical Society, Series B, 57, 247-262.
Laudy, O. and Hoijtink, H. (2006). Bayesian methods for the analysis of inequality constrained contingency tables. Statistical Methods in Medical Research, 15, 1-16.
Lavine, M. and Schervish, M.J. (1999). Bayes factors: what they are and
what they are not. The American Statistician, 53, 119-122.
Martin, A.D. and Quinn, K.M. (2005). MCMCpack: Markov chain Monte
Carlo (MCMC) Package. URL http://mcmcpack.wustl.edu. R package
version 0.6-3.
Perez, J.M. and Berger, J.O. (2002). Expected posterior prior distributions
for model selection. Biometrika, 89, 491-511.
Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods. New
York: Springer.
Robins, J.M., van der Vaart, A. and Ventura, V. (2000). Asymptotic distribution of p-values in composite null models. Journal of the American Statistical Association, 95, 1143-1156.
Smith, A.F.M. and Spiegelhalter, D.J. (1980). Bayes factors and choice cri-
teria for linear models. Journal of the Royal Statistical Society, Series
B, 42, 213-220.
Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004). WinBUGS.
URL http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/. Version 1.4.1.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002).
Bayesian measures of model complexity and fit. Journal of the Royal
Statistical Society, Series B, 64, 583-639.
Tanner, M.A. and Wong, W.H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528-550.
Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random
effects: a Gibbs sampling approach. Journal of the American Statistical
Association, 86, 79-86.