Interplay of Bayes

Statistical Science
2004, Vol. 19, No. 1, 5880

DOI 10.1214/088342304000000116
Institute of Mathematical Statistics, 2004
The Interplay of Bayesian and
Frequentist Analysis
M. J. Bayarri and J. O. Berger
Abstract. Statistics has struggled for nearly a century over the issue of
whether the Bayesian or frequentist paradigm is superior. This debate is
far from over and, indeed, should continue, since there are fundamental
philosophical and pedagogical issues at stake. At the methodological level,
however, the debate has become considerably muted, with the recognition
that each approach has a great deal to contribute to statistical practice and
each is actually essential for full development of the other approach. In
this article, we embark upon a rather idiosyncratic walk through some of
these issues.
Key words and phrases: Admissibility, Bayesian model checking, condi-
tional frequentist, condence intervals, consistency, coverage, design, hierar-
chical models, nonparametric Bayes, objective Bayesian methods, p-values,
reference priors, testing.
CONTENTS
1. Introduction
2. Inherently joint Bayesianfrequentist situations
2.1. Design or preposterior analysis
2.2. The meaning of frequentism
2.3. Empirical Bayes, gamma minimax, restricted risk
Bayes
3. Estimation and condence intervals
3.1. Computation with hierarchical, multilevel or mixed
model analysis
3.2. Assessment of accuracy of estimation
3.3. Foundations, minimaxity and exchangeability
3.4. Use of frequentist methodology in prior develop-
ment
3.5. Frequentist simplications and asymptotic approxi-
mations
4. Testing, model selection and model checking
4.1. Conditional frequentist testing
4.2. Model selection
4.3. p-Values for model checking
M. J. Bayarri is Professor, Department of Statistics,
University of Valencia, Av. Dr. Moliner 50, 46100
Burjassot, Valencia, Spain (e-mail: susie.bayarri@
uv.es). J. O. Berger is Director, Statistical and Applied
Mathematical Sciences Institute, and Professor, Duke
University, P.O. Box 90251, Durham, North Carolina
27708-0251, USA (e-mail: berger@stat.duke.edu).
5. Areas of current disagreement
6. Conclusions
Acknowledgments
References
1. INTRODUCTION
Statisticians should readily use both Bayesian and
frequentist ideas. In Section 2 we discuss situations
in which simultaneous frequentist and Bayesian think-
ing is essentially required. For the most part, how-
ever, the situations we discuss are situations in which
it is simply extremely useful for Bayesians to use fre-
quentist methodology or frequentists to use Bayesian
methodology.
The most common scenarios of useful connections
between frequentists and Bayesians are when no exter-
nal information (other than the data and model itself)
is to be introduced into the analysison the Bayesian
side, when objective prior distributions are used.
Frequentists are usually not interested in subjective,
informative priors, and Bayesians are less likely to be
interested in frequentist evaluations when using sub-
jective, highly informative priors.
We will, for the most part, avoid the question
of whether the Bayesian or frequentist approach to
statistics is philosophically correct. While this is a
valid question, and research in this direction can be of
58
BAYESIAN AND FREQUENTIST ANALYSIS 59
fundamental importance, the focus here is simply on
methodology. In a related vein, we avoid the question
of what is pedagogically correct. If pressed, we
would probably argue that Bayesian statistics (with
emphasis on objective Bayesian methodology) should
be the type of statistics that is taught to the masses,
with frequentist statistics being taught primarily to
advanced statisticians, but that is not an issue for
this paper.
Several caveats are in order. First, we primarily focus
on the Bayesian and frequentist approaches here; these
are the most generally applicable and accepted statisti-
cal philosophies, and both have features that are com-
pelling to most statisticians. Other statistical schools,
such as the likelihood school (see, e.g., Reid, 2000),
have many attractive features and vocal proponents, but
have not been as extensively developed or utilized as
the frequentist and Bayesian approaches.
A second caveat is that the selection of topics
here is rather idiosyncratic, being primarily based on
situations and examples in which we are currently
interested. Other Bayesianfrequentist synthesis works
(e.g., Pratt, 1965; Barnett, 1982; Rubin, 1984; and
even Berger, 1985a) focus on a quite different set of
situations. Furthermore, we almost completely ignore
many of the most time-honored Bayesianfrequentist
synthesis topics, such as empirical Bayes analysis.
Hence, rather than being viewed as a comprehensive
review, this paper should be thought of more as a
personal view of current interesting issues in the
Bayesianfrequentist synthesis.
2. INHERENTLY JOINT BAYESIAN
FREQUENTIST SITUATIONS
There are certain statistical scenarios in which a
joint frequentistBayesian approach is arguably re-
quired. As illustrations of this, we rst discuss the
issue of designin which the notion should not be
controversialand then discuss the basic meaning of
frequentism, which arguably should be (but is not
typically perceived as) a joint frequentistBayesian
endeavor.
2.1 Design or Preposterior Analysis
Frequentist design focuses on planning of experi-
mentsfor instance, the issue of choosing an appro-
priate sample size. In Bayesian analysis this is often
called preposterior analysis, because it is done before
the data is collected (and, hence, before the posterior
distribution is available).
EXAMPLE 2.1. Suppose X
1
, . . . , X
n
are i.i.d.
Poisson random variables with mean , and that it is
desired to estimate under the weighted squared er-
ror loss (
)
2
/
and using the classical estima-

tor

=

X. This estimator has frequentist expected loss
E
[(

X )
2
/
] =
/n.
A typical design problem would be to choose the
sample size n so that the expected loss is less than
some prespecied limit C. (An alternative formulation
might be to minimize C +nc, where c is the cost of an
observation, but this would not signicantly alter the
discussion here.) This is clearly not possible, for all ;
hence we must bring prior knowledge about into play.
A primitive recommendation that one often sees, in
such situations, is to make a best guess
0
for ,
and then choose n so that

0
/n C; that is, choose
n
0
/C. This is needlessly dogmatic, in that one
rarely believes particularly strongly in a particular
value
0
.
A common primitive recommendation in the oppo-
site direction is to choose an upper bound
U
for ,
and then choose n so that

U
/n C; that is, choose
n
U
/C. This is needlessly conservative, in that
the resulting n will typically be much larger than
needed.
The Bayesian approach to the design question is
to elicit a subjective prior distribution () for ,
and then to choose n so that
_

n
() d C; that
is, choose n
_
() d/C. This is a reasonable
compromise between the above two extremes and will
typically result in the most reasonable values of n.
Classical design texts often focus on the very spe-
cial situations in which the design criterion is constant
in the unknown model parameter , and hence fail to
clarify the philosophical centrality of Bayesian issues
in design. The basic fact is that, before experimenta-
tion, one knows neither the data nor , and so expec-
tations over both (i.e., both frequentist and Bayesian
expectations) are needed for design. See Chaloner and
Verdinelli (1995) and Dawid and Sebastiani (1999).
A very common situation in which design evaluation
is not constant is classical testing, in which the sample
size is often chosen to achieve a given power at a
specied value
of the parameter under the alternative

hypothesis. Again, specifying a specic
is very
crude when viewed from a Bayesian perspective. Far
more reasonable for a classical tester would be to
specify a prior distribution for under the alternative,
and consider the average power with respect to this
60 M. J. BAYARRI AND J. O. BERGER
distribution. (More controversial would be to consider
an average Type I error.)
2.2 The Meaning of Frequentism
There is a sense in which essentially everyone should
ascribe to frequentism:
FREQUENTIST PRINCIPLE. In repeated practical
use of a statistical procedure, the long-run average
actual accuracy should not be less than (and ideally
should equal) the long-run average reported accuracy.
This version of the frequentist principle is actually
a joint frequentistBayesian principle. Suppose, for in-
stance, that we decide it is relevant to statistical prac-
tice to repeatedly use a particular statistical model and
procedurefor instance, a 95% classical condence
interval for a normal mean. This procedure will, in
practice, be used on a series of different problems
involving a series of different normal means with a cor-
responding series of data. Hence, in evaluating the pro-
cedure, we should simultaneously be averaging over
the differing means and data.
This is in contrast to textbook statements of the
frequentist principle which tend to focus on xing
the value of, say, the normal mean, and imagining
repeatedly drawing data from the given model and
utilizing the condence procedure repeatedly on this
data. The word imagining is emphasized, because this
is solely a thought experiment. What is done in practice
is to use the condence procedure on a series of
different problemsnot use the condence procedure
for a series of repetitions of the same problem with
different data (which would typically make no sense
in practice).
Neyman himself repeatedly pointed out (see, e.g.,
Neyman, 1977) that the motivation for the frequentist
principle is in its use on differing real problems, and
not imaginary repetitions for one problem with a xed
true parameter. Of course, the reason textbooks typi-
cally give the latter (philosophically misleading) ver-
sion is because of the convenient mathematical fact
that if, say, a condence procedure has 95% frequen-
tist coverage for each xed parameter value, then it will
necessarily also have 95% coverage when used repeat-
edly on a series of differing problems. Thus (as with
design), whenever the frequentist evaluation is constant
over the parameter space, one does not need to also
do a Bayesian average over the parameter space; but,
conceptually, it is the combined frequentistBayesian
average that is practically relevant.
The impact of this real frequentist principle thus
arises when the frequentist evaluation of a procedure
is not constant over the parameter space. Here is
an example.
EXAMPLE 2.2. Binomial condence interval.
Brown, Cai and DasGupta (2001, 2002) considered the
problem of observing X Binomial(n, ) and deter-
mining a 95% condence interval for the unknown
success probability . We consider here the special
case of n = 50, and two condence procedures. The
rst is C
J
(x), dened as the Jeffreys equal-tailed
95% condence interval, given by
C
J
(x) =
_
q
0.025
(x), q
0.975
(x)
_
, (2.1)
where q
(x) is the th-quantile of the Beta(x + 0.5,

50.5 x) distribution. The second condence proce-
dure we consider is the modied Jeffreys equal-tailed
95% condence interval, given by
C
J
(x) =
_
_
_
q
0.025
(x), q
0.975
(x)
_
, if x =0
and x =n,
_
0, q
0.975
(x)
_
, if x =0,
_
q
0.025
(x), 1
_
, if x =n.
(2.2)
For the moment, simply consider these as formulae for
condence intervals; we later discuss their motivation.
Brown, Cai and Dasgupta (2001) provide the graph
of the coverage probability of C
J
given in Figure 1.
Note that, while roughly close to the target 95%, the
coverage probability varies considerably as a function
of , going from a high of 1 at = 0 and =1
to a low of 0.884 at = 0.049 and = 0.951.
A textbook frequentist might then assert that this
is only an 88.4% condence procedure, since the
coverage cannot be guaranteed to be higher than
this limit. But would the practical frequentist agree
with this?
The practical frequentist evaluates how C
J
would
work for a sequence {
1
,
2
, . . . ,
m
} of parameters
(and corresponding data) encountered in a series of real
problems. If m is large, the law of large numbers guar-
antees that the coverage that is actually experienced
will be the average of the coverages obtained over the
sequence of problems. Thus we should be considering
averages of the coverage in Figure 1 over sequences
of
j
.
One could, of course, choose the sequence of
j
to
all be 0.049 and/or 0.951, but this is not very realistic.
One might consider global averages with respect to
sequences generated from prior distributions (), but
a practical frequentist presumably does not want to
FIG. 1. Frequentist coverage of the C
J
intervals, as a function
of when n =50.
spend much time thinking about prior distributions.
One plausible solution is to look at local average
coverage, dened via a local smoothing of the binomial
coverage function. A convenient computational kernel
for this problem, when smoothing at a point is
desired and when a smoothing kernel having standard
deviation is desired (so that 2 can roughly be
thought of as the range over which the smoothing is
performed), is the Beta(a(), a(1 )) distribution
k
,
() where
a() =
_
_
1 2, if ,
[(1 )
2
1],
if < <1 ,
1
3 +2, if 1 .
(2.3)
Writing the standard frequentist coverage as 1 (),
this leads to the -local average coverage
1
() =
_
1
0
[1 ()]k
,
() d
=
n
x=0
_
n
x
_
_
_
a() +a(1 )
_

_
a() +x
_

_
a(1 ) +n x
__
_
(a())
_
a(1 )
_

_
a() +a(1 ) +n
__
1
_
C
J
Beta
_
|a(), a(1 )
_
d,
the last equation following from the standard expres-
sion for the betabinomial predictive distribution [and
FIG. 2. Local average coverage of the C
J
intervals, as a
function of when n =50 and =0.05.
with Beta(|a(), a(1 )) denoting the beta density
with given parameters].
For the binomial example we are considering, this
is graphed in Figure 2, for = 0.05. (Such a value
of could be interpreted as implying that one is
sure that the sequence of practical problems, for
which the binomial condence interval will be used,
has
j
varying by at least 0.05.) Note that this local
average coverage is always close to 0.95, so that a
practical frequentist would be quite pleased with the
condence interval.
One could imagine a textbook frequentist arguing
that, sometimes, a particular value, such as =0.049,
could be of special interest in repeated investigations,
the value perhaps corresponding to some important
physical theory concerning that science will repeat-
edly investigate. In such a situation, however, it is ar-
guably not appropriate to utilize condence intervals;
that there is a special value of of interest should
be acknowledged via some type of testing procedure.
Even if there were a distinguished value of and it
was erroneously handled by nding a condence inter-
val, the practical frequentist has one more arrow in his
or her quiver: it is not likely that a series of experiments
investigating this particular physical theory would all
choose the same sample size, so one should consider
practical averaging over sample size. For instance,
suppose sample sizes would vary between 40 and 60
for the binomial problem we have been considering.
Then one could reasonably consider average coverage
over these sample sizes, the result of which is given in
Figure 3. While not always as close to 0.95 as was the
local average coverage, it would still strike most peo-
ple as reasonable to call C
J
a 95% condence interval
when averaged over reasonable sample sizes.
FIG. 3. Average coverage over n between 40 and 60 of the C
J
intervals, as a function of .
A similar idea concerning local averages of
frequentist properties was employed by Woodroofe
(1986), who called the concept very weak expan-
sions. Brown, Cai and DasGupta (2002), for the
binomial problem, considered the average coverage
dened as the smooth part of their asymptotic expan-
sion of coverage, yielding a result similar to that in
Figure 2. Rousseau (2000) took a different approach,
considering slight adjustment of the Bayesian intervals
through randomization to achieve the correct frequen-
tist coverage.
So far the discussion has been in terms of the
practical frequentist acknowledging the importance of
considering averages over . We would also claim,
however, that Bayesians should ascribe to the above
version of the frequentist principle. If (say) a Bayesian
were to repeatedly construct purported 90% credible
intervals in his or her practical work, yet they only con-
tained the unknowns about 70% of the time, something
would be seriously wrong. A Bayesian might feel that
the practical frequentist principle will automatically be
satised if he or she does a good Bayesian job of sep-
arately analyzing each individual problem, and hence
that it is not necessary to specically worry about the
principle, but that does not mean that the principle
is invalid.
EXAMPLE 2.3. In this regard, let us return to
the binomial example to discuss the origin of the
condence intervals C
J
(x) and C
J
(x). The inter-
vals C
J
(x) arise as the Bayesian equal-tailed credi-
ble sets obtained from use of the Jeffreys prior (see
Jeffreys, 1961) ()
1/2
(1 )
1/2
for . (In
particular, the intervals are formed by the upper and
lower /2-quantiles of the resulting posterior distribu-
tion for .) This is the prior that is customary for an
objective Bayesian to use for the binomial problem.
(See Section 3.4.3 for further discussion of the Jeffreys
prior.) Note that, because of the derivation of the cred-
ible set from the objective Bayesian perspective, there
is strong reason to believe that conditionally on the
given situation and data, the accuracy assignment of
95% is reasonable. See Section 3.2.2 for discussion of
conditional performance.
The frequentist coverage of the intervals C
J
(x) is
given in Figure 4. Apure frequentist might well be con-
cerned with the raw coverage of this credible interval
because it goes to zero at =0 and =1. A moments
reection reveals why this is the case: the equal-tailed
Bayesian credible intervals purposely exclude values
in the left and right tails of the posterior distribution
and, hence, will always exclude = 0 and =1. The
modication of this interval employed in Brown, Cai
and DasGupta (2001) is C
J
(x) in (2.2): for the ob-
servations x = 0 or x = n, one simply extends the
Jeffreys equal-tailed credible intervals to include 0 or 1.
Of course, from a conditional Bayesian perspective,
these intervals then have posterior probability 0.975,
so a Bayesian would no longer call them 95% credible
intervals.
While the raw frequentist coverage of the Jeffreys
equal-tailed credible intervals might seem unappeal-
ing, their 0.05-local average coverage is excellent,
virtually the same as that in Figure 2 for the modied
interval; indeed, the difference is not visually apparent,
so that we do not separately include a graph of this lo-
cal average coverage. Hence the practical frequentist
would be quite happy with use of C
J
(x), even if it has
low coverage right at the endpoints.
FIG. 4. Coverage of the C
J
intervals, as a function of
when n =50.
The issue of Bayesians achieving good pure fre-
quentist coverage near a nite boundary of a para-
meter space is an interesting issue; our guess is that
this is often not possible. In the above example, for
instance, whether a Bayesian includes, or excludes,
= 0 or =1 in a credible interval is rather arbi-
trary and will depend on, for example, a choice such as
that between an equal-tailed or highest posterior den-
sity (HPD) interval. (The HPD intervals for x =0 and
x = n would include =0 and =1, respectively.)
Furthermore, this choice will typically lead to either
0 frequentist coverage or coverage of 1 at the end-
points, unless something unnatural to a Bayesian, such
as randomization, were incorporated. Hence the recog-
nition of the centrality to frequentist practice of some
type of average coverage, rather than pointwise cover-
age, can be important in such problems to achieve si-
multaneously acceptable Bayesian and frequentist per-
formance.
2.3 Empirical Bayes, Gamma Minimax,
Restricted Risk Bayes
Several approaches to statistical analysis have been
proposed which are inherently a mixture of Bayesian
and frequentist analysis. These approaches have leng-
thy histories and extensive literature and we so we
can do little more here than simply give pointers to
the areas.
Robbins (1955) introduced the empirical Bayes
approach, in which one species a class of prior
distributions , but assumes that the prior is other-
wise unknown. The data is then used to help de-
termine the prior and/or to directly nd the optimal
Bayesian answer. Frequentist reasoning was intimately
involved in Robbins original formulation of empiri-
cal Bayes, and in signicant implementations of the
paradigm, such as Morris (1983) for hierarchical mod-
els. More recently, the name empirical Bayes is often
used in association with approximate Bayesian analy-
ses which do not specically involve frequentist mea-
sures. (Simply using a maximumlikelihood estimate of
a hyperparameter does not make a technique frequen-
tist.) For modern reviews of empirical Bayes analysis
and previous references, see Carlin and Louis (2000)
and Robert (2001).
In the gamma minimax approach, one again has a
class of possible prior distributions and considers
the frequentist Bayes risk (the expected loss over
both the data and unknown parameters) of the Bayes
procedure for priors in the class. One then chooses
that prior which minimizes this frequentist Bayes risk.
For examples and references, see Berger (1985a) and
Vidakovic (2000).
In the restricted risk Bayes approach, one has a sin-
gle prior distribution, but can only consider statisti-
cal procedures whose frequentist risk (expected loss)
is constrained in some fashion. The idea is that one
can utilize the prior information, but in a way that
will be guaranteed to be acceptable to the frequentist
who wants to limit frequentist risk. (See Berger, 1985a,
for discussion and earlier references.) This approach is
actually not inherently Bayesianfrequentist, but is
more what could be termed a hybrid approach, in the
sense that it seeks some type of formal compromise be-
tween Bayesian and frequentist positions. There have
been many other attempts at such compromises, but
none has seemed to signicantly affect statistical prac-
tice.
There are many other important areas in which
joint frequentistBayesian evaluation is used. Some
were even developed primarily from the Bayesian
perspective, such as the prequential approach of Dawid
(cf. Dawid and Vovk, 1999).
3. ESTIMATION AND CONFIDENCE INTERVALS
In statistical estimation (including development of
condence intervals), objective Bayesian and frequen-
tist methods often give similar (or even identical)
answers in standard parametric problems with contin-
uous parameters. The standard normal linear model
is the prototypical example: frequentist estimates and
condence intervals coincide exactly with the stan-
dard objective Bayesian estimates and credible inter-
vals. Indeed, this occurs more generally in situations
that exhibit an invariance structure, provided objec-
tive Bayesians use the right-Haar prior density; see
Berger (1985a), Eaton (1989) and Robert (2001) for
discussion and earlier references.
This dual frequentistBayesian interpretation of ma-
ny textbook estimation procedures has a number of
important implications, not the least of which is that
much of standard textbook statistical methodology
(and standard software) can alternatively be presented
and described from the objective Bayesian perspective.
In particular, one can teach much of elementary statis-
tics from this alternative perspective, without changing
the procedures that are taught.
In more complicated situations, it is still usually
possible to achieve near-agreement between frequen-
tist and Bayesian estimation procedures, although this
may require careful utilization of the tools of both.
A number of situations requiring such cross-utilization
of tools are discussed in this section.
3.1 Computation with Hierarchical, Multilevel or
Mixed Model Analysis
With the advent of Gibbs sampling and other Markov
chain Monte Carlo (MCMC) methods of analysis (cf.
Robert and Casella, 1999), it has become relatively
standard to deal with models that go under any of the
names listed in the above title as Bayesian methods.
This popularity of the Bayesian methods is not nec-
essarily because of their intrinsic virtues, but rather
because the Bayesian computation is now much easier
than computation via more classical routes. See Hobert
(2000) for an overview and other references.
On the other hand, any MCMC method relies fun-
damentally on frequentist reasoning to do the com-
putation. An MCMC method generates a sequence of
simulated values
1
,
2
, . . . ,
m
of an unknown quan-
tity , and then relies upon a law of large num-
bers or ergodic theorem (both frequentist) to assert
that

m
=
1
m
m
i=1

i
. Furthermore, diagnostics
for MCMC convergence are almost universally based
on frequentist tools. There is a purely Bayesian way
of looking at such computation problems, which goes
under the heading Bayesian numerical analysis (cf.
Diaconis, 1988a; OHagan, 1992), but in practice
it is typically much simpler to utilize the frequen-
tist reasoning.
In conclusion for much of modern statistical analysis
in hierarchical models, we already see an inseparable
joining of Bayesian and frequentist methodology.
3.2 Assessment of Accuracy of Estimation
Frequentist methodology for point estimation of un-
known model parameters is relatively straightforward
and successful. However, assessing the accuracy of the
estimates is considerably more challenging and is a
problem for which frequentists should draw heavily on
Bayesian methodology.
3.2.1 Finding good condence intervals in the pres-
ence of nuisance parameters. Condence intervals for
a model parameter are a common way of indicating
the accuracy of an estimate of the parameter. Find-
ing good condence intervals when there are nuisance
parameters is very challenging within the frequen-
tist paradigm, unless one utilizes objective Bayesian
methodology, in which case the frequentist problem
becomes relatively straightforward. Indeed, here is a
rather general prescription for nding condence in-
tervals using objective Bayesian methods:
Begin with a reasonable objective prior distrib-
ution. (See Section 3.4 for discussion of objective
priors, and note that a reasonable objective prior
may well depend on which parameter is the parame-
ter of interest.)
By simulation, obtain a (large) sample from the
posterior distribution of the parameter of interest:
Option 1. If a predetermined condence interval
C(X) is of interest, simply approximate the pos-
terior probability of the interval by the fraction
of the samples from the posterior distribution that
fall in the interval.
Option 2. If the condence interval is not predeter-
mined, nd the /2 upper and lower fractiles of
the posterior sample; the interval between these
fractiles approximates the 100(1 )% equal-
tailed posterior credible interval for the parameter
of interest. (Alternative forms for the condence
set can be considered, but the equal-tailed interval
is ne for most applications.)
Assert that the obtained interval is the frequen-
tist condence interval, having frequentist coverage
given by the posterior probability of the interval.
There is a large body of theory, discussed in Sec-
tion 3.4, as well as considerable practical experience,
supporting the validity of constructing frequentist con-
dence intervals in this way. Here is one example from
the practical experience side.
EXAMPLE 3.1. Medical diagnosis (Mossman and
Berger, 2001). Within a population for which p
0
=
Pr( Disease D), a diagnostic test results in either a
Positive (+) or Negative () reading. Let p
1
=Pr(+|
patient hasD) and p
2
=Pr(+|patient does not haveD).
By Bayes theorem,
=Pr(D|+) =
p
0
p
1
p
0
p
1
+(1 p
0
)p
2
.
In practice, the p
i
are typically unknown, but for
i = 0, 1, 2 there are available (independent) data x
i
having Binomial(n
i
, p
i
) densities. It is desired to nd
a 100(1 )% condence set for that has good
conditional and frequentist properties.
A simple objective Bayesian approach to this prob-
lem is to utilize the Jeffreys priors (p
i
) p
1/2
i

(1 p
i
)
1/2
for each of the p
i
, and compute the
100(1 )% equal-tailed posterior credible interval
for . A suitable implementation of the algorithm pre-
sented above is as follows:
Draw random p
i
from the Beta(x
i
+
1
2
, n
i
x
i
+
1
2
)
posterior distributions, i =0, 1, 2.
Compute the associated
=
p
0
p
1
p
0
p
1
+(1 p
0
)p
2
for each random triplet.
Repeat this process 10,000 times.
The /2 and 1 /2 fractiles of these 10,000
generated form the desired condence interval.
[In other words, simply order the 10,000 values
of , and let the condence interval be the interval
between the (10,000

2
)th and (10,000
1
2
)th
values.]
The proposed objective Bayesian procedure is clear-
ly simple to use, but is the resulting condence in-
terval a satisfactory frequentist interval? To provide
perspective on this question, note that the above prob-
lem has also been studied in the frequentist literature,
using standard log-odds and delta-method procedures
to develop condence intervals, as well as more sophis-
ticated approaches such as the GartNam (Gart and
Nam, 1988) procedure. For a description of these clas-
sical methods, as applied to this problem of medical
diagnosis, see Mossman and Berger (2001).
Table 1 gives an indication of the frequentist perfor-
mance of the condence intervals developed by these
four methods. It is based on a simulation that repeat-
edly generates data from binomial distributions with
sample sizes n
i
= 20 and the indicated values of the
parameters (p
0
, p
1
, p
2
). For each generated triplet of
data in the simulation, the 95% condence interval is
computed using the objective Bayesian algorithm or
one of the three classical methods. It is then noted
whether the computed interval contains the true , or
misses to the left or right. The entries in the table are
the long run proportion of misses to the left or right.
Ideally, these proportions should be 0.025 and, at the
least, their sum should be 0.05.
Clearly the objective Bayes interval has quite good
frequentist performance, better than any of the classi-
cally derived condence intervals. Furthermore, it can
be seen that the objective Bayes intervals are, on av-
erage, smaller than the classically derived intervals.
(See Mossman and Berger, 2001, for these and more
extensive computations.) Finally, the objective Bayes
condence intervals were the simplest to derive and
will automatically be conditionally appropriate (see
Section 3.2.2), because of their Bayesian derivation.
The nding in the above example, that objective
Bayesian analysis very easily provides small con-
dence sets with excellent frequentist coverage, has
been repeatedly shown to happen. See Section 3.4 for
additional discussion.
3.2.2 Obtaining good conditional measures of accu-
racy. Developing frequentist condence intervals us-
ing the Bayesian approach automatically provides an
additional signicant benet: the condence statement
will be conditionally appropriate. Here is a simple arti-
cial example.
EXAMPLE 3.2. Two observations, X
1
and X
2
, are
to be taken, where
X
i
=
_
+1, with probability 1/2,
1, with probability 1/2.
Consider the condence set for the unknown ,
C(X
1
, X
2
)
=
_
the point
_
1
2
(X
1
+X
2
)
_
, if X
1
=X
2
,
the point {X
1
1}, if X
1
=X
2
.
The frequentist coverage of this condence set can eas-
ily be shown to be P
(C(X
1
, X
2
) contains ) =0.75.
This is not at all a sensible report, once the data is at
hand. To see this, observe that, if x
1
= x
2
, then we
know for sure that their average is equal to , so that
the condence set is then actually 100% accurate. On
the other hand, if x
1
= x
2
, we do not know if is the
datas common value plus 1 or their common value mi-
nus 1, and each of these possibilities is equally likely
to have occurred.
To obtain sensible frequentist answers here, one
must dene the conditioning statistic S = |X
1
X
2
|,
which can be thought of as measuring the strength of
evidence in the data (S =2 reecting data with maxi-
mal evidential content and S =0 being data of minimal
TABLE 1
The probability that the nominal 95% interval misses the true on the left and on the right, for the
indicated parameter values and when n
0
=n
1
=n
2
=20
(p
0
, p
1
, p
2
) O-Bayes Log odds GartNam Delta
_
1
4
,
3
4
,
1
4
_
0.0286, 0.0271 0.0153, 0.0155 0.0277, 0.0257 0.0268, 0.0245
_
1
10
,
9
10
,
1
10
_
0.0223, 0.0247 0.0017, 0.0003 0.0158, 0.0214 0.0083, 0.0041
_
1
2
,
9
10
,
1
10
_
0.0281, 0.0240 0.0004, 0.0440 0.0240, 0.0212 0.0125, 0.0191
evidential content). Then one denes frequentist cover-
age conditional on the strength of evidence S. For the
example, an easy computation shows that this condi-
tional condence equals
P
_
C(X
1
, X
2
) contains |S =2
_
=1,
P
_
C(X
1
, X
2
) contains |S =0
_
=
1
2
,
for the two distinct cases, which are the intuitively
correct answers.
It is important to realize that conditional frequentist
measures are fully frequentist and (to most people)
clearly better than unconditional frequentist measures.
They have the same unconditional property (e.g., in the
above example one will report 100% condence half
the time, and 50% condence half the time, resulting
in an average of 75% condence, as must be the
case for a frequentist measure), yet give much better
indications of the accuracy for the type of data that one
has actually encountered.
In the above example, nding the appropriate con-
ditioning statistic was easy but, in more involved sit-
uations, it can be a challenging undertaking. Luckily,
intervals developed via the Bayesian approach will au-
tomatically condition appropriately. For instance, in
the above example, the objective Bayesian approach
assigns the standard objective prior (for a location
parameter) () = 1, from which is easy to com-
pute that the posterior probability assigned to the set
C(X
1
, X
2
) is 1 or 0.5 as the observations differ or are
the same. (This is essentially Option 1 of the algo-
rithm described at the beginning of the Section 3.2.1,
although here the posterior probabilities can be com-
puted analytically.)
General theory about conditional condence can be
found in Kiefer (1977); see also Robinson (1979),
Berger (1985b), Berger and Wolpert (1988), Casella
(1988) and Lehmann and Casella (1998). In Sec-
tion 4.1, we will return to this dual theme that (i) it is
crucial for frequentists to condition appropriately;
(ii) this is technically most easily accomplished by us-
ing Bayesian tools.
3.2.3 Accuracy assessment in hierarchical models.
As mentioned earlier, the utilization of hierarchical
or random effects or mixed or multilevel models has
increasingly taken a Bayesian avor in practice, in
part driven by the computational advantages of Gibbs
sampling and MCMC analysis. Another reason for this
greatly increasing utilization of the Bayesian approach
to such problems is that practitioners are nding
the inferences that arise from the Bayesian approach
to be considerably more realistic than those from
competitors, such as various versions of maximum
likelihood estimation (or empirical Bayes estimation)
or (often worse) unbiased estimation.
One of the potentially severe problems with the
maximum likelihood or empirical Bayes approach is
that maximum likelihood estimates of variances in
hierarchical models (or variance component models)
can easily be zero, especially when there are numer-
ous variances in the model that are being estimated.
(Unbiased estimation will be even worse in such sit-
uations; if the mle is zero, the unbiased estimate will
be negative.)
EXAMPLE 3.3. Suppose, for i = 1, . . . , p, that
X
i
Normal(
i
, 1) and
i
Normal(0,
2
), all ran-
dom variables being independent. Then, marginally,
X
i
Normal(0, 1 +
2
), so that the likelihood func-
tion of
2
can be written
L(
2
)
1
(1 +
2
)
p/2
exp
_
S
2
2(1 +
2
)
_
, (3.1)
where S
2
=
X
2
i
. The mle for
2
is easily calculated
to be
2
= max{0,
S
2
p
1}. Thus, if S
2
< p, the
mle would be
2
= 0 (and the unbiased estimate
would be negative). While a value of S
2
< p is
somewhat unusual here [if, e.g., p =4 and
2
=1, then
Pr(S
2
<p) =0.264], it is quite common in problems
with numerous variance components to have at least
one mle variance estimate equal to 0.
For p = 4 and S
2
= 4, the likelihood function
in (3.1) is graphed in Figure 5. While L(
2
) is de-
creasing away from 0, it does not decrease particularly
FIG. 5. Likelihood function of
2
when p = 4 and S
2
= 4
is observed.
quickly, clearly indicating that there is considerable
uncertainty as to the true value of
2
even though the
mle is 0.
Utilizing an mle of 0 as a variance estimate can be
quite dangerous, because it will typically affect the
ensuing analysis in an incorrectly aggressive fashion.
In the above example, for instance, setting
2
to 0 is
equivalent to stating that all the
i
are exactly equal to
each other. This is clearly unreasonable in light of the
fact that there is actually great uncertainty about
2
,
as reected in Figure 5. Since the likelihood maximum
is occurring at the boundary of the parameter space, it
is also very difcult to utilize likelihood or frequentist
methods to attempt to incorporate uncertainty about
2
into the analysis.
None of these difculties arises in the Bayesian
approach, and the vague nature of the information in
the data about such variances will be clearly reected
in the posterior distribution. For instance, if one were to
use the constant prior density (
2
) = 1 in the above
example, the posterior density would be proportional
to the likelihood in Figure 5, and the signicant
uncertainty about
2
would permeate the analysis.
3.3 Foundations, Minimaxity and Exchangeability
There are numerous ties between frequentist and
Bayesian analysis at the foundational level. The foun-
dation of frequentist statistics typically focuses on the
class of optimal procedures in a given situation,
called a complete class of procedures. Through the
work of Wald (1950) and others, it has long been
known that a complete class of procedures is identi-
cal to the class of Bayes procedures or certain limits
thereof. Furthermore, in proving frequentist optimal-
ity of a procedure, it is typically necessary to employ
Bayesian tools. (See Berger, 1985a; Robert, 2001, for
many examples and references.) Hence, at a fundamen-
tal level, the frequentist paradigm is intertwined with
the Bayesian paradigm.
Interestingly, this fundamental duality has not had a
pronounced effect on the Bayesian versus frequentist
debate. In part, this is because many frequentists
nd the search for optimal frequentist procedures to
be of limited practical utility (since such searches
usually take place in rather limited settings, from the
perspective of practice), and hence do not themselves
pursue optimality and thereby come into contact with
the Bayesian equivalence. Even among frequentists
who are signicantly concerned with optimality, it is
typically perceived that the relationship with Bayesian
analysis is a nice mathematical coincidence that can be
used to eliminate inferior frequentist procedures, but
that Bayesian ideas should not formthe basis for choice
among acceptable frequentist procedures. Still, the
complete class theorems provide a powerful underlying
link between frequentist and Bayesian statistics.
One of the most prominent frequentist principles for
choosing a statistical procedure is that of minimaxity;
see Brown (1994, 2000) and Strawderman (2000) for
reviews on the important impact of this concept on
statistics. Bayesian analysis again provides the most
useful tool for deriving minimax procedures: one
nds the least favorable prior distribution, and the
minimax procedure is the resulting Bayes rule.
To many Bayesians, the most compelling foundation
of statistics is that based on exchangeability, as de-
veloped in de Finetti (1970). From the assumption of
exchangeability of an innite sequence, X
1
, X
2
, . . . ,
of observations (essentially the assumption that the
distribution of the sequence remains the same under
permutation of the coordinates), one can sometimes de-
duce the existence of a particular statistical model, with
unknown parameters, and a prior distribution on the
parameters. By considering an innite series of obser-
vations, frequentist reasoningor at least frequentist
mathematicsis clearly involved. Reviews of more re-
cent developments and other references can be found in
Diaconis (1988b) and Lad (1996).
There are many other foundational arguments that
begin with axioms of rational behavior and lead to the
conclusion that some type of Bayesian behavior is im-
plied. (See Bernardo and Smith, 1994, for review and
references.) Many of these effectively involve simul-
taneous frequentistBayesian evaluations of outcomes,
such as Rubin (1987), which is perhaps the weakest set
of axioms that implies Bayesian behavior.
3.4 Use of Frequentist Methodology in
Prior Development
In principle, a subjective Bayesian need not worry
about frequentist ideasif a prior distribution is elici-
ted and accurately reects prior beliefs, then Bayes
theorem guarantees that any resulting inference will
be optimal. The hitch is that it is not very common to
have a prior distribution that accurately reects all prior
beliefs. Suppose, for instance, that the only unknown
model parameter is a normal mean . Complete assess-
ment of the prior distribution for involves an innite
number of judgments [e.g., specication of the prob-
ability of the interval (, r) for any rational num-
ber r]. In practice, of course, only a few assessments
are ever made, with the others being made conven-
tionally (e.g., one might specify the rst quartile and
the median, but then choose a Cauchy density for the
prior). Clearly one should worry about the effect of fea-
tures of the prior that were not elicited.
Even more common in practice is to utilize a default
or objective prior distribution, and Bayess theorem
does not then provide any guarantee as to performance.
It has proved to be very useful to evaluate partially
elicited and objective priors by utilizing frequentist
techniques to evaluate their properties in repeated use.
3.4.1 Information-based developments. A number
of developments of prior distributions utilize informa-
tion-based arguments that rely on frequentist measures.
Consider the reference prior theory, for instance, ini-
tiated in Bernardo (1979) and rened in Berger and
Bernardo (1992). The reference prior is dened to
be that distribution which minimizes the asymptotic
KullbackLeibler divergence between the posterior
distribution and the prior distribution, thus hopefully
obtaining a prior that minimizes information in an
appropriate sense. This divergence is calculated with
respect to a joint frequentistBayesian computation
since, as in design, it is being computed before any data
has been obtained.
The reference prior approach has arguably been
the most generally successful method of obtaining
Bayes rules that have excellent frequentist performance
(see Berger, Philippe and Robert, 1998, as but one
example). There are, furthermore, many other features
of reference priors that are inuenced by frequentist
matters. One such feature is that the reference prior
typically depends not only on the model, but also on
which parameter is the inferential focus. Without such
dependence on the parameter of interest, optimal
frequentist performance is typically not attainable by
Bayesian methods.
A number of other information-based priors have
also been derived. See Soo (2000) for an overview
and references.
3.4.2 Consistency. Perhaps the simplest frequentist
estimation tool that a Bayesian can usefully employ is
consistency: as the sample size grows to , does the
estimate being studied converge to the true value (in
a suitable sense of convergence). Bayes estimates are
virtually always consistent if the parameter space is
nite-dimensional (see Schervish, 1995, for a typical
result and earlier references), but this need not be
true if the parameter space is not nite-dimensional
or in irregular cases (see Ghosh, Ghosal and Samanta,
1994). Here is an example of the former.
EXAMPLE 3.4. In numerous models in use today,
the number of parameters increases with the amount
of data. The classic example of this is the Neyman
Scott problem (Neyman and Scott, 1948), in which
one observes
X
ij
N(
i
,
2
), i =1, . . . , n, j =1, 2,
and is interested in estimating
2
. Dening x
i
=
(x
i1
+x
i2
)/2, x =( x
1
, . . . , x
n
), S
2
=
n
i=1
(x
i1
x
i2
)
2
and = (
1
, . . . ,
n
), the likelihood function can
be written
L(, )
1
2n
exp
_
2
_
| x |
2
+
S
2
4
__
.
Until relatively recently, the most commonly used
objective prior was the Jeffreys-rule prior (Jeffreys,
1961), here given by
J
(, ) = 1/
n+1
. The re-
sulting posterior distribution for is proportional to
the likelihood times the prior, which, after integrating
out , is
(|x)
1
2n+1
exp
_
S
2
4
2
_
.
One common Bayesian estimate of
2
is the poste-
rior mean, which here is S
2
/[4(n 1)]. This estimate
is inconsistent, as can be seen by applying simple fre-
quentist reasoning to the situation. Indeed, note that
(X
i1
X
i2
)
2
/(2
2
) is a chi-squared random variable
with one degree of freedom, and hence that S
2
/(2
2
)
is chi-squared with n degrees of freedom. It follows by
the law of large numbers that S
2
/(2n)
2
, so that
the Bayes estimate converges to
2
/2, the wrong value.
(Any other natural Bayesian estimate, such as the pos-
terior median or posterior mode, can also be seen to
be inconsistent.)
The problem in the above example is that the
Jeffreys-rule prior is often inappropriate in multidi-
mensional settings, yet it can be difcult or impossible
to assess this problem within the Bayesian paradigm
itself. Indeed, the inadequacy of the multidimensional
Jeffreys-rule prior has led to a search for improved ob-
jective priors in multivariable settings. The reference
prior approach, mentioned earlier, has been one suc-
cessful solution. [For the NeymanScott problem, the
reference prior is
R
(, ) = 1/, which results in
a consistent posterior mean and, indeed, yields infer-
ences that are numerically equal to the classical in-
ferences for
2
.] Another approach to developing im-
proved priors is discussed in Section 3.4.3.
3.4.3 Frequentist performance: coverage and ad-
missibility. Consistency is a rather crude frequentist
criterion, and more sophisticated frequentist evalua-
tions of performance of Bayesian procedures are of-
ten considered. For instance, one of the most common
approaches to evaluation of an objective prior distribu-
tion is to see if it yields posterior credible sets that have
good frequentist coverage properties. We have already
seen examples of this method of evaluation in Exam-
ples 2.2 and 3.1.
Evaluation by frequentist coverage has actually been
given a formal theoretical denition and is called the
frequentist-matching approach to developing objective
priors. The idea is to look at one-sided Bayesian cred-
ible sets for the unknown quantity of interest, and
then seek that prior distribution for which the credible
sets have optimal frequentist coverage asymptotically.
Welch and Peers (1963) developed the rst extensive
results in this direction, essentially showing that, for
one-dimensional continuous parameters, the Jeffreys
prior is frequentist-matching. There is an extensive lit-
erature devoted to nding frequentist-matching priors
in multivariate contexts; see Efron (1993), Rousseau
(2000), Ghosh and Kim (2001), Datta, Mukerjee,
Ghosh and Sweeting (2000) and Fraser, Reid, Wong
and Yi (2003) for some recent results and earlier
references.
Other frequentist properties have also been used to
help in the choice of an objective prior. For instance, if
estimation is the goal, it has long been common to uti-
lize the frequentist concept of admissibility to help in
the selection of the prior. The idea behind admissibility
is to dene a loss function in estimation (e.g., squared
error loss), and then see if a proposed estimator can be
beaten in terms of frequentist expected loss (e.g., mean
squared error). If so, the estimator is said to be inad-
missible; if it cannot be beaten, it is admissible. For
instance, in situations having what is known as a group
invariance structure, it has long been known that the
prior distribution dened by the right-Haar measure
will typically yield Bayes estimates that are admis-
sible from a frequentist perspective, while the seem-
ingly more natural (to a Bayesian) left-Haar measure
will typically fail to yield admissible estimators. Thus
use of the right-Haar priors has become standard. See
Berger (1985a) and Robert (2001) for general discus-
sion and many examples of the use of admissibility.
Another situation in which admissibility has played
an important role in prior development is in choice of
Bayesian priors in hierarchical modeling. In a sense,
this topic was initiated in Stein (1956), which effec-
tively showed that the usual constant prior for a multi-
variate normal mean would result in an inadmissible
estimator under quadratic loss (in three or more di-
mensions). One of the rst Bayesian works to ad-
dress this issue was Hill (1974). To access the huge
resulting literature on the role of admissibility in
choice of hierarchical priors, see Brown (1971), Berger
and Robert (1990), Berger and Strawderman (1996),
Robert (2001) and Tang (2001).
Here is an example where initial admissibility con-
siderations led to signicant Bayesian developments.
EXAMPLE 3.5. Consider estimation of a covari-
ance matrix , based on i.i.d. multivariate normal
data (x
1
, . . . , x
n
), where each column vector x
i
arises
from the N
k
(0, ) density. The sufcient statistic for
is S =

n
i=1
x
i
x
i
. Since Stein (1975), it has been
understood that the commonly used estimates of ,
which are various multiples of S (depending on the loss
function considered) are seriously inadmissible. Hence
there has been a great effort in the frequentist literature
(see Yang and Berger, 1994, for references) to develop
better estimators of .
The interest in this from the Bayesian perspective is
that by far the most commonly used subjective prior
for a covariance matrix is the inverse Wishart prior (for
subjectively specied a and b)
() ||
a/2
exp
_
1
2
tr[b
1
]
_
. (3.2)
A frequently used objective version of this prior is the
Jeffreys-rule prior given by choosing a = k + 1 and
b =0. When one notes that the Bayesian estimates
arising from these priors are linear functions of S,
which were deemed to be seriously inadequate by the
frequentists, there is clear cause for concern in the
routine use of these priors.
In this case, it is possible to also indicate the problem
with these priors utilizing Bayesian reasoning. Indeed,
write = H
t
DH, where H is an orthogonal matrix
and D is a diagonal matrix with diagonal entries being
the eigenvalues of the matrix, d
1
> d
2
> > d
k
.
A change of variables yields
() d =|D|
a/2
exp
_
1
2
tr[bD
1
]
_

i<j
(d
i
d
j
) I
[d
1
>>d
k
]
dDdH,
where I
[d
1
>>d
k
]
denotes the indicator function on
the given set. Since

i<j
(d
i
d
j
) is near zero when
any two eigenvalues are close, it follows that the
conjugate priors (and the Jeffreys-rule prior) tend to
force apart the eigenvalues of the covariance matrix;
the priors give near-zero density to close eigenvalues.
This is contrary to typical prior beliefs. Indeed, often
in modelling, one is debating between assuming an
exchangeable covariance structure (and hence equal
eigenvalues) or allowing a more general structure.
When one is contemplating whether or not to assume
equal eigenvalues, it is clearly inappropriate to use a
prior distribution that gives essentially no weight to
equal eigenvalues, and instead forces them apart.
As an alternative objective prior here, the reference
prior was derived in Yang and Berger (1994) and is
given by
(D, H) = |D|
1
dDdH; this clearly elim-
inates the forcing apart of eigenvalues. Furthermore, it
is shown in Yang and Berger (1994) that use of the ref-
erence prior often results in improvements in estimat-
ing on the order of 50%over use of the Jeffreys prior.
Motivated, in part, by the signicant inferiority of
the standard inverse Wishart and Jeffreys-rule priors
for , a large Bayesian literature has developed in
recent years that provides alternative prior distribu-
tions for a covariance matrix. See Tang (2001) and
Daniels and Pourahmadi (2002) for examples and ear-
lier references.
Note that we are not only focusing on objective pri-
ors here. Even proper priors that are commonly used by
subjectivists can have hidden and highly undesirable
featuressuch as the forcing apart of the eigenvalues
for the inverse Wishart priors in the above example
and frequentist (and objective Bayesian) tools can ex-
pose these features and allow for development of better
subjective priors.
3.4.4 Robust Bayesian analysis. Robust Bayesian
analysis formally recognizes the impossibility of com-
plete subjective specication of the model and prior
distribution; as mentioned earlier, complete specica-
tion would involve an innite number of assessments,
even in the simplest situations. It follows that one
should, ideally, work with a class of prior distribu-
tions with the class reecting the uncertainty remain-
ing after the (nite) elicitation efforts. ( could also
reect the differing judgments of various individuals
involved in the decision process.)
While much of robust Bayesian analysis takes place
in a purely Bayesian framework (e.g., determining the
range of the posterior mean as the prior ranges over ),
it also has strong connections with the empirical Bayes,
gamma minimax and restricted risk Bayes approaches,
discussed in Section 2.3. See Berger (1985a, 1994),
Delampady et al. (2001) and Ros, Insua and Ruggeri
(2000) for discussion and references.
3.4.5 Nonparametric Bayesian analysis. In nonpa-
rametric statistical analysis, the unknown quantity in a
statistical model is a function or a probability distrib-
ution. A Bayesian approach to such problems requires
placing a prior distribution on this space of functions or
space of probability distributions. Perhaps surprisingly,
Bayesian analysis of such problems is computationally
quite feasible and is seeing signicant practical imple-
mentation; cf. Dey, Mller and Sinha (1998).
Function spaces and spaces of probability measures
are enormous spaces, and subjective elicitation of a
prior on these spaces is not really feasible. Thus, in
practice, it is typical to use a convenient form for
a nonparametric prior (typically chosen for computa-
tional reasons), with perhaps a small number of fea-
tures of the prior being subjectively specied. Thus,
much as in the case of the NeymanScott example,
one worries that the unspecied features of the prior
may overwhelm the data and result in inconsistency
or poor frequentist performance. Furthermore, there is
evidence (e.g., Freedman, 1999) that Bayesian credi-
ble sets and frequentist condence sets need not agree
in nonparametric problems, making it more difcult to
judge performance.
There is a long-time literature on such issues, the
earlier period going from Freedman (1963) through
Diaconis and Freedman (1986). To access the more re-
cent literature, see Barron (1999), Barron, Schervish
and Wasserman (1999), Ghosal, Ghosh and van der
Vaart (2000), Zhao (2000), Kim and Lee (2001),
Belitser and Ghosal (2003) and Ghosh and
Ramamoorthi (2003).
3.4.6 Impropriety and identiability. One of the
most crucial problems that Bayesians face in dealing
with complex modeling situations is that of ensuring
that the posterior distribution is proper; use of improper
objective priors can result in improper posterior distrib-
utions. (Use of vague proper priors in such situations
will formally result in proper posterior distributions,
but these posteriors will essentially be meaningless if
the limiting improper objective prior would have re-
sulted in an improper posterior distribution.)
One of the major situations in which impropriety
can arise is when there is a problem of parameter
identiability, as in the following example.
EXAMPLE 3.6. Suppose, for i = 1, . . . , p, that
X
i
Normal(
i
,
2
) and
i
Normal(0,
2
), all
randomvariables being independent. Then, marginally,
X
i
Normal(0,
2
+
2
), and it is clear that we
cannot separately estimate
2
and
2
(although we can
estimate their sum); in classical language,
2
and
2
are not identiable. Were a Bayesian to attempt to
utilize an improper objective prior here, such as (
2
,
2
) =1, the posterior distribution would be improper.
The point here is that frequentist insight and litera-
ture about identiability can be useful to a Bayesian in
determining whether there is a problem with posterior
propriety. Thus, in the above example, upon recogniz-
ing the identiability problem, the Bayesian will know
not to use the improper objective prior and will attempt
to elicit a true subjective proper prior for at least one of
2
or
2
. (Of course, more data, such as replications at
the rst stage of the model, could also be sought.)
3.5 Frequentist Simplications and
Asymptotic Approximations
Situations can occur in which straightforward use of
frequentist intuition directly yields sensible answers. In
the NeymanScott problem, for instance, consideration
of the paired differences, x
i1
x
i2
, directly yielded
a sensible answer. In contrast, a fairly sophisticated
objective Bayesian analysis (use of the reference prior)
was required for a satisfactory answer.
This is not to say that classical methodology is
universally better in such situations. Indeed, Neyman
and Scott created this example primarily to show that
use of maximum likelihood methodology can be very
inadequate; it essentially leads to the same bad
answer in the example as the Bayesian analysis based
on the Jeffreys-rule prior. This points out the dilemma
facing Bayesians in use of frequentist simplications:
a frequentist answer might be simple, but a Bayesian
might well feel uneasy in its utilization unless it were
felt to approximate a Bayesian answer. (For instance,
is the answer conditionally sound, as discussed in
Section 3.2.2.) Of course, if only the frequentist answer
is available, the issue is moot.
It would be highly useful to catalogue situations in
which direct frequentist reasoning is arguably simpler
than Bayesian methodology, but we do not attempt
to do so. Discussion of this and examples can be
found in Robins and Ritov (1997) and Robins and
Wasserman (2000).
Outside of standard models (such as the normal lin-
ear model), it is unfortunately rather rare to be able to
obtain exact frequentist answers for small or moderate
sample sizes. Hence much of frequentist methodology
relies on asymptotic approximations, based on assum-
ing that the sample size is large.
Asymptotics can also be used to provide an approx-
imation to Bayesian answers for large sample sizes;
indeed, Bayesian and frequentist asymptotic answers
are often (but not always) the same; see Schervish
(1995) for an introduction to Bayesian asymptotics and
Le Cam (1986) for a high-level discussion. One might
conclude that this is thus another signicant poten-
tial use of frequentist methodology by Bayesians. It is
rather rare for Bayesians to directly use asymptotic an-
swers, however, since Bayesians can typically directly
compute exact small sample size answers, often with
less effort than derivation of the asymptotic approxi-
mation would require.
Still, asymptotic techniques are useful to Bayesians,
in a variety of approximations and theoretical develop-
ments. For instance, the popular Laplace approxima-
tion (cf. Schervish, 1995) and BIC (cf. Schwarz, 1978)
are based on an asymptotic arguments. Important
Bayesian methodological developments, such as the
denition of reference priors, also make considerable
use of asymptotic theory, as was mentioned earlier.
4. TESTING, MODEL SELECTION AND
MODEL CHECKING
Unlike estimation, frequentist reports and conclu-
sions in testing (and model selection) are often in con-
ict with their Bayesian counterparts. For a long time
it was believed that this was unavoidablethat the two
paradigms are essentially irreconcilable for testing.
Berger, Brown and Wolpert (1994) showed, however,
that this is not necessarily the case; that the main dif-
culty with frequentist testing was an inappropriate lack
of conditioning which could, in a variety of situations,
be xed. This is the focus of the next section, after
which we turn to more general issues involving the
interaction of frequentist and Bayesian methodology in
testing and model selection.
4.1 Conditional Frequentist Testing
Unconditional NeymanPearson testing, in which
one reports the same error probability regardless of the
size of the test statistic (as long as it is in the rejection
region), has long been viewed as problematical by most
statisticians. To Fisher, this was the main inadequacy
of NeymanPearson testing, and one of the chief mo-
tivations for his championing p-values in testing and
model checking. Unfortunately (as Neyman would ob-
serve), p-values do not have a frequentist justication
in the sense, say, of the frequentist principle in Sec-
tion 2.2. For more extensive discussion of the perceived
inadequacies of these two approaches to testing, see
Berger (2003).
The solution proposed in Berger, Brown and
Wolpert (1994) for testing, following earlier devel-
opments in Kiefer (1977), was to use the Neyman
Pearson approach of formally dening frequentist er-
ror probabilities of Type I and Type II, but to do so
conditional on the observed value of a statistic measur-
ing the strength of evidence in the data, as was done
in Example 3.2. (Other proposed solutions to this prob-
lem have been considered in, e.g., Hwang et al., 1992.)
For illustration, suppose that we wish to test that
the data X arises from the simple (i.e., completely
specied) hypotheses H
0
: f = f
0
or H
1
: f = f
1
. The
idea is to select a statistic S = S(X) which measures
the strength of the evidence in X, for or against
the hypotheses. Then, conditional error probabilities
(CEPs) are computed as
(s) =P(Type I error |S =s)
P
0
_
reject H
0
|S(X) =s
_
,
(4.1)
(s) =P(Type II error|S =s)
P
1
_
accept H
0
|S(X) =s
_
,
where P
0
and P
1
refer to probability under H
0
and H
1
,
respectively.
The proposed conditioning statistic S and associated
test utilize p-values to measure the strength of the
evidence in the data. Specically (see Wolpert, 1996;
Sellke, Bayarri and Berger, 2001), we consider
S =max{p
0
, p
1
},
where p
0
is the p-value when testing H
0
versus H
1
,
and p
1
is the p-value when testing H
1
versus H
0
.
[Note that the use of p-values in determining eviden-
tiary equivalence is much weaker than their use as
an absolute measure of signicance; in particular, use
of (p
i
), where is any strictly increasing function,
would determine the same conditioning.] The corre-
sponding conditional frequentist test is then as follows:
if p
0
p
1
reject H
0
and
report Type I CEP (s);
(4.2)
if p
0
>p
1
accept H
0
and
report Type II CEP (s);
where the CEPs are given in (4.1).
To this point, there has been no connection with
Bayesianism. Conditioning, as above, is completely
allowed (and encouraged) within the frequentist para-
digm. The Bayesian connection arises because Berger,
Brown and Wolpert (1994) show that
(s) =
B(x)
1 +B(x)
and (s) =
1
1 +B(x)
, (4.3)
where B(x) is the likelihood ratio (or Bayes factor),
and these expressions are precisely the Bayesian poste-
rior probabilities of H
0
and H
1
, respectively, assuming
the hypotheses have equal prior probabilities of 1/2.
Therefore, a conditional frequentist can simply com-
pute the objective Bayesian posterior probabilities of
the hypotheses, and declare that they are the condi-
tional frequentist error probabilities; there is no need
to formally derive the conditioning statistic or perform
the conditional frequentist computations. (There are
some technical details concerning the denition of the
rejection region, but these have almost no practical im-
pact; see Berger, 2003, for further discussion.)
The value of having a merging of the frequentist and
objective Bayesian answers in testing goes well beyond
the technical convenience of computation; statistics as
a whole is the big winner because of the unication that
results. But we are trying to avoid philosophical issues
here and so will simply focus on the methodological
advantages that will accrue to frequentism.
Dass and Berger (2003) and Paulo (2002) extend this
result to many classical testing scenarios; here is an
example from the former.
EXAMPLE 4.1. McDonald, Vance and Gibbons
(1995) studied car emission data X = (X
1
, . . . , X
n
),
testing whether the i.i.d. X
i
follow the Weibull or
Lognormal distribution, given, respectively, by
H
0
: f
W
(x; , ) =

_
x
_
1
exp
_
_
x
_
,
H
1
: f
L
(x; ,
2
) =
1
x
2
2
exp
_
(lnx )
2
2
2
_
.
There are several difculties with classical analysis of
this situation. First, there are no low-dimensional suf-
cient statistics, and no obvious test statistics; indeed,
McDonald, Vance and Gibbons (1995) simply consider
a variety of generic tests, such as the likelihood ra-
tio test (MLR), which they eventually recommended as
being the most powerful. Second, it is not clear which
hypothesis to make the null hypothesis, and the clas-
sical conclusion can depend on this choice (although
not signicantly, in the sense of the choice allowing
differing conclusions with low error probabilities). Fi-
nally, computation of unconditional error probabilities
requires a rather expensive simulation and, once com-
puted, one is stuck with error probabilities that do not
vary with the data.
For comparison, the conditional frequentist test when
n =16 (one of the cases considered by McDonald,
Vance and Gibbons, 1995) results in the following test:
T
C
=
_
_
if B(x) 0.94,
reject H
0
and report Type I CEP
(x) =B(x)/(1 +B(x)),
if B(x) >0.94,
accept H
0
and report Type II CEP
(x) =1/(1 +B(x)),
where
B(x) =
2(n)n
n/2
((n 1)/2)
(n1)/2
(4.4)
_

0
_
y
n
n
i=1
exp
_
z
i
z
ys
z
_
_
n
dy,
with z
i
=lnx
i
, z =
1
n
n
i=1
z
i
and s
2
z
=
1
n
n
i=1
(z
i
z)
2
.
In comparison with the situation for the unconditional
frequentist test:
There is a well-dened test statistic B(x).
If one switches the null hypothesis, the new Bayes
factor is simply B(x)
1
, which will clearly lead to
the same CEPs (i.e., the CEPs do not depend on
which hypothesis is called the null hypothesis).
Computation of the CEPs is almost trivial, requiring
only a one-dimensional integration.
Above all, the CEPs vary continuously with the data.
In elaboration of the last point, consider one of the
testing situations considered by McDonald, Vance and
Gibbons (1995), namely, testing for the distribution of
carbon monoxide emission data, based on a sample of
size n = 16. Data was collected at the four different
mileage levels indicated in Table 2, with (b) and (a)
indicating before or after scheduled vehicle main-
tenance. Note that the decisions for both the MLR and
the conditional test would be to accept the lognormal
model for the data. McDonald, Vance and Gibbons
(1995) did not give the Type II error probability associ-
ated with acceptance (perhaps because it would depend
on the unknown parameters for many of the test statis-
tics they considered) but, even if Type II error had been
provided, note that it would be constant. In contrast, the
conditional test has CEPs (here, the conditional Type II
errors) that vary fully with the data, usefully indicating
the differing certainties in the acceptance decision for
TABLE 2
For CO data, the MLR test at level =0.05, and the conditional
test of H
0
: Lognormal versus H
1
: Weibull
Mileage 0 4,000 24,000 (b) 24,000 (a)
MLR decision A A A A
B(x) 2.436 9.009 6.211 2.439
T
C
decision A A A A
CEP 0.288 0.099 0.139 0.291
the considered mileages. Further analyses and compar-
isons can be found in Dass and Berger (2003).
Derivation of the conditional frequentist test. The
conditional frequentist analysis here depends on recog-
nizing an important fact: both the Weibull and the
lognormal distributions are locationscale distributions
(the Weibull after suitable transformation). In this case,
the objective Bayesian (and, hence, conditional fre-
quentist) solution to the problem is to utilize the right-
Haar prior density for the distributions [(, ) =
1
for the lognormal problem] and compute the result-
ing Bayes factor. (Unconditional frequentists could, of
course, have recognized the invariance of the situation
and used this as a test statistic, but they would have
faced a much more difcult computational challenge.)
By invariance, the distribution of B(X) under ei-
ther hypothesis does not depend on model parameters,
so that the original testing problem can be reduced
to testing two simple hypotheses, namely H
0
: B(X)
has distribution F
W
versus H
1
: B(X) has distrib-
ution F
L
, where F
W
and F
L
are the distribution
functions of B(X) under the Weibull and Lognormal
distributions, respectively, with an arbitrary choice of
the parameters (e.g., = = 1 for the Weibull, and
=0, =1 for the Lognormal). Recall that the CEPs
happen to equal the objective Bayesian posterior prob-
abilities of the hypotheses.
4.2 Model Selection
Clyde and George (2004) give an excellent review of
Bayesian model selection. Frequentists have not typ-
ically used Bayesian arguments in model selection,
although that may be changing, in part due to the pro-
nounced practical success that Bayesian model aver-
aging has achieved. Bayesians often use frequentist
arguments to develop approximate model selection
methods (such as BIC), to evaluate performance of
model selection methods and to develop default pri-
ors for model selection. There is a huge list of arti-
cles of this type, including many listed in Clyde and
George (2004). Robert (2001), Berger and Pericchi
(2001, 2004) and Berger, Ghosh and Mukhopadhyay
(2003) also have general discussions and numerous
other recent references. The story here is far from set-
tled, in that there is no agreement on the immediate
horizon as to even a reasonable method of model se-
lection. It seems highly likely, however, that any such
agreement will be based on a mixture of frequentist and
Bayesian arguments.
4.3 p-Values for Model Checking
Both classical statisticians and Bayesians routinely
use p-values for model checking. We rst consider
their use by classical statisticians and showthe value of
Bayesian methodology in the computation of proper
p-values; then we turn to Bayesian p-values and the
importance of frequentist ideas in their evaluation.
4.3.1 Use of Bayesian methodology in computing
classical p-values. Suppose that a statistical model
H
0
: X f (x|) is being entertained, data x
obs
is
observed and it is desired to check whether the model
is adequate, in light of the data. Classical statisticians
have long used p-values for this purpose. A strict
frequentist would not do so (since p-values do not
satisfy the frequentist principle), but most frequentists
relax their morals a bit in this situation (i.e., when
there is no alternative hypothesis which would allow
construction of a NeymanPearson test).
The common approach to model checking is to
choose a statistic T = t (X), where (without loss of
generality) large values of T indicate less compatibility
with the model. The p-value is then dened as
p =Pr
_
t (X) t (x
obs
)|
_
. (4.5)
When is known, this probability computation is with
respect to f (x|). The crucial question is what to do
when is unknown.
For future reference, we note a key property of
the p-value when is known: considered as a ran-
dom function of X in the continuous case, p(X) has
a Uniform(0, 1) distribution under H
0
. The implica-
tion of this property is that p-values then have a com-
mon interpretation across statistical problems, making
their general use feasible. Indeed, this property, or its
asymptotic version for composite H
0
, has been used
to characterize proper, well-behaved p-values (see
Robins, van der Vaart and Ventura, 2000).
When is unknown, computation of a p-value
requires some way of eliminating from (4.5). There
are many non-Bayesian ways of doing so, some of
which are reviewed in Bayarri and Berger (2000) and
Robins, van der Vaart and Ventura (2000). Here we
consider only the most common method, which is to
replace in (4.5) by its mle,

. The resulting p-value
will be called the plug-in p-value (p
plug
). Henceforth
using a superscript to denote the density with respect
to which the p-value in (4.5) is computed, the plug-in
p-value is thus dened as
p
plug
=Pr
f (|
)
_
t (X) t (x
obs
)
_
. (4.6)
Although very simple to use, there is a worrisome
double use of the data in p
plug
, rst to estimate and
then to compute the tail area corresponding to t (x
obs
)
in that distribution.
Bayarri and Berger (2000) proposed the following
alternative way of eliminating , based on Bayesian
methodology. Begin with an objective prior densi-
ty (); we recommend, as before, that this be a
reference prior (when available), but even a constant
prior density will usually work ne. Next, dene the
partial posterior density (Bayesian motivation will be
given in the next section)
(|x
obs
\ t
obs
) f (x
obs
|t
obs
, )()
(4.7)
f (x
obs
|)()
f (t
obs
|)
,
resulting in the partial posterior predictive density
of T ,
m(t |x
obs
\t
obs
) =
_
f (t|)(|x
obs
\t
obs
) d. (4.8)
Since this density is free of , it can be used in (4.5) to
compute the partial posterior predictive p-value,
p
ppp
=Pr
m(|x
obs
\t
obs
)
(T t
obs
). (4.9)
Note that p
ppp
uses only the information in x
obs
that
is not in t
obs
= t (x
obs
) to train the prior and to then
eliminate . Intuitively this avoids double use of the
data because the contribution of t
obs
to the posterior is
removed before is eliminated by integration. (The
notation x
obs
\ t
obs
was chosen to indicate this.)
When T is not ancillary (or approximately so), the
double use of the data can cause the plug-in p-value to
fail dramatically, in the sense of being moderately large
even when the null model is clearly wrong. Examples
and earlier references are given in Bayarri and Berger
(2000). Here is another interesting illustration, taken
from Bayarri and Castellanos (2004).
EXAMPLE 4.2. Consider the hierarchical (or ran-
dom effects) model
X
ij
|
i
N(
i
,
2
i
) for i =1, . . . , I,
j =1, . . . , n
i
,
i
|, N(,
2
) for i =1, . . . , I,
(4.10)
where all variables are independent and, for simplicity,
we assume that the variances
2
i
at the rst level
are known. Suppose that we are primarily interested
in investigating whether the normality assumption
for the means
i
is compatible with the data. We
choose the test statistic T = max{

X
1
, . . . ,

X
I
}, where
X
i
denotes the usual group sample means, which here
are sufcient statistics for the
i
. When no specic
alternative hypothesis is postulated, optimal test
statistics do not exist, and casual choices (such as this)
are not uncommon. The issue under study is whether
such (easy) choices of the test statistic can usefully be
used for computing p-values.
Since the
i
are random effects, it is well known
that tests should be based on the marginal densities
of the sufcient statistics

X
i
, with the
i
integrated
out. (Replacing the
i
by their mles would result in a
vacuous inference here.) The resulting null distribution
can be represented
X
i
|, N(,
2
i
+
2
) for i =1, . . . , I. (4.11)
Thus p
plug
is computed with respect to this distri-
bution, with the mles ,
2
[numerically computed
from (4.11)] inserted back into (4.11) and (4.10).
To compute the partial posterior predictive p-value,
we begin with a common objective prior for (,
2
),
namely (,
2
) = 1/ (not 1/
2
as is sometimes
done, which would result in an improper posterior).
The computation of p
plug
is based on an MCMC,
discussed in Bayarri and Castellanos (2004).
Both p-values were computed for a simulated data
set, in which one of the groups comes from a distribu-
tion with a much larger mean than the other groups. In
particular, the data was generated from
X
ij
|
i
N(
i
, 4) for i =1, . . . , 5,
j =1, . . . , 8,
i
N(1, 1) for i =1, . . . , 4,
5
N(5, 1).
(4.12)
The resulting sample means were 1.560, 0.641, 1.982,
0.014 and 6.964. Note that the sample mean of the fth
group is 6.65 standard deviations away from the mean
of the other four groups. With this data, p
plug
=0.130,
dramatically failing to clearly indicate that the as-
sumption of i.i.d. normality of the
i
is wrong, while
p
ppp
=0.010. Many other similar examples can be
found in Castellanos (2002).
A strong indication that a proposed p-value is
inappropriate is when it fails to be asymptotically
Uniform(0, 1) under the null hypothesis for all values
of , as a proper p-value should. Robins, van der Vaart
and Ventura (2000) prove that the plug-in p-value often
fails to be asymptotically proper in this sense, while the
partial posterior predictive p-value is asymptotically
proper. Furthermore, they show that the latter p-value
is uniformly most powerful with respect to Pitman al-
ternatives, lending additional powerful frequentist sup-
port to the methodology. Note that this is completely
a frequentist evaluation; no Bayesian averaging is in-
volved. Numerical comparisons in the above example
of the behavior of p
plug
and p
ppp
under the assumed
(null) model is deferred to Section 4.3.2.
4.3.2 Evaluating Bayesianp-values. Most Bayesian
p-values are dened analogously to (4.8) and (4.9),
but with alternatives to (|x
obs
\ t
obs
). Subjective
Bayesians simply use the prior distribution () di-
rectly to integrate out . The resulting p-value, called
the predictive p-value and popularized by Box (1980),
has the property, when considered as a randomvariable
of X, of being uniformly distributed under the null pre-
dictive distribution. (See Meng, 1994, for discussion of
the importance of this property.) This p-value is thus
Uniform[0, 1] in an average sense over , which is pre-
sumably satisfactory (for consistency of interpretation
across problems) to Bayesians who believe in their sub-
jective prior.
Much of model checking, however, takes place in
scenarios in which the model is quite tentative and,
hence, for which serious subjective prior elicitation
(which is invariably highly costly) is not feasible.
Hence model checking is more typically done by
Bayesians using objective prior distributions; guaran-
tees concerning average uniformity of p-values then
no longer apply and, indeed, the resulting p-values are
not even dened if the objective prior distribution is
improper (since the predictive distribution of T will
then be improper). The solution that has become
quite popular is to utilize objective priors, but to use the
posterior distribution (|x
obs
), instead of the prior, in
dening the distribution used to compute a p-value.
Formally, this leads to the posterior predictive p-value,
dened in Guttman (1967) and popularized in Rubin
(1984) and Gelman, Carlin, Stern and Rubin (1995),
given by
p
post
=Pr
m(|x
obs
)
(T t
obs
),
(4.13)
m(t |x
obs
) =
_
f (t |)(|x
obs
) d.
Note that there is also double use of the data
in p
post
, rst to convert the (possibly improper) prior
into a proper distribution for determining the predic-
tive distribution, and then for computing the tail area
corresponding to t (x
obs
) in that distribution. The detri-
mental effect of this double use of the data was dis-
cussed in Bayarri and Berger (2000) and arises again in
Example 4.2. Indeed, computation yields that the pos-
terior predictive p-value for the given data is 0.409,
which does not at all suggest a problem with the ran-
dom effects normality assumption; yet, recall that one
of the means was more than six standard deviations
away from the others.
The point of this section is to observe that the fre-
quentist property of uniformity of a p-value under
the null hypothesis provides a useful discriminatory
tool for judging the adequacy of Bayesian (and other)
p-values. Note that if a p-value is uniform under the
null hypothesis in the frequentist sense for any ,
then it has the strong Bayesian property of being mar-
ginally Uniform[0, 1] under any proper prior distribu-
tion. More important, if a proposed p-value is always
either conservative or anticonservative in a frequentist
sense (see Robins, van der Vaart and Ventura, 2000,
for denitions), then it is likewise guaranteed to be
conservative or anticonservative in a Bayesian sense,
no matter what the prior. If the conservatism (or anti-
conservatism) is severe, then the p-value cannot cor-
respond to any approximate true Bayesian p-value.
In this regard, Robins, van der Vaart and Ventura
(2000) show that p
post
is often severely conservative
(and, surprisingly, is worse in this regard than is p
plug
),
while p
ppp
is asymptotically Uniform[0, 1].
It is also of interest to numerically study the nonas-
ymptotic null distribution of the three p-values con-
sidered in this section. We thus return to the random
effects example.
EXAMPLE 4.3. Consider again the situation de-
scribed in Example 4.2. We consider p
plug
(X), p
ppp
(X)
and p
post
(X) as random variables and simulate their
distribution under an instance of the null model.
Specically, we chose the random effects mean and
variance to be 0 and 1, respectively, thus simulat-
ing X
ij
as in (4.12), but nowgenerating all ve
i
from
the N(0, 1) distribution (so that the null normal hierar-
chical model is correct). Figure 6 shows the resulting
sampling distributions of the three p-values.
Note that the distribution of p
ppp
(X) is quite close
to uniform, even though only ve means were in-
volved. In contrast, the distributions of p
plug
(X) and
p
post
(X) are quite far from uniform, with the latter
being the worst. Even for larger numbers of means
(e.g., 25), p
plug
(X) and p
post
(X) remained signi-
cantly nonuniform, indicating a serious and inappro-
priate conservatism.
Of course, Bayesians frequently criticize the direct
use of p-values in testing a precise hypothesis, in that
p-values then fail to correspond to natural Bayesian
measures. There are, however, various calibrations
of p-values that have been suggested (see Good,
1983; Sellke, Bayarri and Berger, 2001). But these are
higher level considerations in that, if a purported
Bayesian p-value fails to adequately criticize a null
hypothesis, the higher level concerns are irrelevant.
FIG. 6. Null distribution of p
plug
(X) (left), p
post
(X) (center) and p
ppp
(X) (right) when the null normal hierarchical model is correct.
5. AREAS OF CURRENT DISAGREEMENT
It is worth mentioning some aspects of inference in
which it seems very difcult to reconcile the frequentist
and Bayesian approaches. Of course, as discussed in
Section 4, it was similarly believed to be difcult
to reconcile frequentist and Bayesian testing until
recently, so it may simply be a matter of time until
reconciliation also occurs in these other areas.
Multiple comparisons. When doing multiple tests,
such as occurs in the variable selection problem in
regression, classical statistics performs some type of
adjustment (e.g., the Bonferonni adjustment to the
signicance level) to account for the multiplicity of
tests. In contrast, Bayesian analysis does not explicitly
adjust for multiplicity of tests, the argument being
that a correct adjustment is automatic within the
Bayesian paradigm.
Unfortunately, the duality between frequentist and
Bayesian methodology in testing, discussed in Sec-
tion 4, has not been extended to the multiple hypothesis
testing framework, so we are left with competing and
quite distinct methodologies. Worse, there are multi-
ple competing frequentist methodologies and multiple
competing Bayesian methodologies, a situation that is
professionally disconcerting, yet common when there
is no frequentistBayesian consensus. (To access some
of these methodologies, see http://www.ba.ttu.edu/isqs/
westfall/mcp2002.htm.)
Sequential analysis. The stopping rule principle says
that once the data have been obtained, the reasons
for stopping experimentation should have no bearing
on the evidence reported about unknown model pa-
rameters. This principle is automatically satised by
Bayesian analysis, but is viewed as crazy by many fre-
quentists. Indeed frequentist practice in clinical trials
is to spend for looks at the data; that is, if there
are to be interim analyses during the clinical trial, with
the option of stopping the trial early should the data
look convincing, frequentists feel that it is then manda-
tory to adjust the allowed error probability (down) to
account for the multiple analyses.
This issue is extensively discussed in Berger and
Berry (1988), which has many earlier references. That
it is a controversial and difcult issue is admirably
expressed by Savage (1962): I learned the stopping
rule principle from Professor Barnard, in conversation
in the summer of 1952. Frankly, I then thought it a
scandal that anyone in the profession could advance an
idea so patently wrong, even as today I can scarcely
believe that people resist an idea so patently right.
An interesting recent development is the nding,
in Berger, Boukai and Wang (1999), that optimal
conditional frequentist testing also essentially obeys
the stopping rule principle. This suggests the admit-
tedly controversial speculation that optimal (condi-
tional) frequentist procedures will eventually be found
to essentially satisfy the stopping rule principle; and
that we currently have a controversy only because sub-
optimal (unconditional) frequentist procedures are be-
ing used.
It is, however, also worth noting that in prior de-
velopment Bayesians sometimes utilize the stopping
rule to help dene an objective prior, for example,
the Jeffreys or reference prior (see Ye, 1993; Sun and
Berger, 2003, for examples and motivation). Since the
statistical inference will then depend on the stopping
rule, objective Bayesian analysis can involve a (proba-
bly slight) violation of the stopping rule principle. (See
Sweeting, 2001, for related discussion concerning the
extent of violation of the likelihood principle by objec-
tive Bayesian methods.)
Finite population sampling. In this central area of
statistics, classical inference is primarily based on fre-
quentist averaging with respect to the sampling proba-
bilities by which units of the population are selected for
inclusion in the sample. In contrast, Bayesian analy-
sis asserts that these sampling probabilities are irrele-
vant for inference, once the data is at hand. (See Ghosh,
1988, which contains excellent essays by Basu on this
subject.) Hence we, again, have a fundamental philo-
sophical and practical conict.
There have been several arguments (e.g., Rubin,
1984; Robins and Ritov, 1997) to the effect that there
are situations in which Bayesians do need to take into
account the sampling probabilities, to save themselves
from a too-difcult (and potentially nonrobust) prior
development. Conversely, there has been a very signif-
icant growth in use of Bayesian methodology in nite
population contexts, such as in the use of small area
estimation methods (see, e.g., Rao, 2003). Small area
estimation methods actually occur in the broader con-
text of the model-based approach to nite population
sampling (which mostly ignores the sampling probabil-
ities), and this model-based approach also has frequen-
tist and Bayesian versions (the differences, however,
being much smaller than the differences arising from
use, or not, of the sampling probabilities).
6. CONCLUSIONS
It seems quite clear that both Bayesian and frequen-
tist methodology are here to stay, and that we should
not expect either to disappear in the future. This is not
to say that all Bayesian or all frequentist methodol-
ogy is ne and will survive. To the contrary, there are
many areas of frequentist methodology that should be
replaced by (existing) Bayesian methodology that pro-
vides superior answers, and the verdict is still out on
those Bayesian methodologies that have been exposed
as having potentially serious frequentist problems.
Philosophical unication of the Bayesian and fre-
quentist positions is not likely, nor desirable, since
each illuminates a different aspect of statistical infer-
ence. We can hope, however, that we will eventually
have a general methodological unication, with both
Bayesians and frequentists agreeing on a body of stan-
dard statistical procedures for general use.
ACKNOWLEDGMENTS
Research supported by Spanish Ministry of Science
and Technology Grant SAF2001-2931 and by NSF
Grants DMS-01-03265 and DMS-01-12069. Part of
the work was done while the rst author was visiting
the Statistical and Applied Mathematical Sciences
Institute and ISDS, Duke University.
REFERENCES
BARNETT, V. (1982). Comparative Statistical Inference, 2nd ed.
Wiley, New York.
BARRON, A. (1999). Information-theoretic characterization of
Bayes performance and the choice of priors in paramet-
ric and nonparametric problems. In Bayesian Statistics 6
(J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith,
eds.) 2752. Oxford Univ. Press.
BARRON, A., SCHERVISH, M. J. and WASSERMAN, L. (1999).
The consistency of posterior distributions in nonparametric
problems. Ann. Statist. 27 536561.
BAYARRI, M. J. and BERGER, J. (2000). P-values for composite
null models (with discussion). J. Amer. Statist. Assoc. 95
11271170.
BAYARRI, M. J. and CASTELLANOS, M. E. (2004). Bayesian
checking of hierarchical models. Technical report, Univ. Va-
lencia.
BELITSER, E. and GHOSAL, S. (2003). Adaptive Bayesian infer-
ence on the mean of an innite-dimensional normal distribu-
tion. Ann. Statist. 31 536559.
BERGER, J. (1985a), Statistical Decision Theory and Bayesian
Analysis, 2nd ed. Springer, New York.
BERGER, J. (1985b). The frequentist viewpoint and conditioning.
In Proceedings of the Berkeley Conference in Honor of Jerzy
Neyman and Jack Kiefer (L. Le Cam and R. Olshen, eds.)
1544. Wadsworth, Monterey, CA.
BERGER, J. (1994). An overview of robust Bayesian analysis (with
discussion). Test 3 5124.
BERGER, J. (2003). Could Fisher, Jeffreys and Neyman have
agreed on testing (with discussion)? Statist. Sci. 18 132.
BERGER, J. and BERNARDO, J. (1992). On the development of
reference priors. In Bayesian Statistics 4 (J. M. Bernardo,
J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 3560.
Oxford Univ. Press.
BERGER, J. and BERRY, D. (1988). The relevance of stopping
rules in statistical inference (with discussion). In Statisti-
cal Decision Theory and Related Topics IV (S. Gupta and
J. Berger, eds.) 1 2972. Springer, New York.
BERGER, J., BOUKAI, B. and WANG, Y. (1999). Simultaneous
Bayesianfrequentist sequential testing of nested hypotheses.
Biometrika 86 7992.
BERGER, J., BROWN, L. D. and WOLPERT, R. (1994). A unied
conditional frequentist and Bayesian test for xed and sequen-
tial simple hypothesis testing. Ann. Statist. 22 17871807.
BERGER, J., GHOSH, J. K. and MUKHOPADHYAY, N. (2003).
Approximations and consistency of Bayes factors as model
dimension grows. J. Statist. Plann. Inference 112 241258.
BERGER, J. and PERICCHI, L. (2001). Objective Bayesian meth-
ods for model selection: Introduction and comparison (with
discussion). In Model Selection (P. Lahiri, ed.) 135207. IMS,
Beachwood, OH.
BERGER, J. and PERICCHI, L. (2004). Training samples in
objective Bayesian model selection. Ann. Statist. 32 841869.
BERGER, J., PHILIPPE, A. and ROBERT, C. (1998). Estimation of
quadratic functions: Noninformative priors for non-centrality
parameters. Statist. Sinica 8 359376.
BERGER, J. and ROBERT, C. (1990). Subjective hierarchical Bayes
estimation of a multivariate normal mean: On the frequentist
interface. Ann. Statist. 18 617651.
BERGER, J. and STRAWDERMAN, W. (1996). Choice of hierarchi-
cal priors: Admissibility in estimation of normal means. Ann.
Statist. 24 931951.
BERGER, J. and WOLPERT, R. L. (1988). The Likelihood Prin-
ciple: A Review, Generalizations, and Statistical Implications,
2nd ed. IMS, Hayward, CA. (With discussion.)
BERNARDO, J. M. (1979). Reference posterior distributions for
Bayesian inference (with discussion). J. Roy. Statist. Soc.
Ser. B 41 113147.
BERNARDO, J. M. and SMITH, A. F. M. (1994). Bayesian Theory.
Wiley, New York.
BOX, G. E. P. (1980). Sampling and Bayes inference in scientic
modeling and robustness (with discussion). J. Roy. Statist. Soc.
Ser. A 143 383430.
BROWN, L. D. (1971). Admissible estimators, recurrent diffusions,
and insoluble boundary value problems. Ann. Math. Statist. 42
855903.
BROWN, L. D. (1994). Minimaxity, more or less. In Statistical
Decision Theory and Related Topics V (S. Gupta and J. Berger,
eds.) 118. Springer, New York.
BROWN, L. D. (2000). An essay on statistical decision theory.
J. Amer. Statist. Assoc. 95 12771281.
BROWN, L. D., CAI, T. T. and DASGUPTA, A. (2001). Interval
estimation for a binomial proportion (with discussion). Sta-
tist. Sci. 16 101133.
BROWN, L. D., CAI, T. T. and DASGUPTA, A. (2002). Con-
dence intervals for a binomial proportion and asymptotic ex-
pansions. Ann. Statist. 30 160201.
CARLIN, B. P. and LOUIS, T. A. (2000). Bayes and Empirical
Bayes Methods for Data Analysis, 2nd ed. Chapman and
Hall, London.
CASELLA, G. (1988). Conditionally acceptable frequentist solu-
tions (with discussion). In Statistical Decision Theory and
Related Topics IV (S. Gupta and J. Berger, eds.) 1 73117.
Springer, New York.
CASTELLANOS, M. E. (2002). Diagnstico Bayesiano de mode-
los. Ph.D. dissertation, Univ. Miguel Hernndez, Spain.
CHALONER, K. and VERDINELLI, I. (1995). Bayesian experimen-
tal design: A review. Statist. Sci. 10 273304.
CLYDE, M. and GEORGE, E. (2004). Model uncertainty. Sta-
tist. Sci. 19 8194.
DANIELS, M. and POURAHMADI, M. (2002). Bayesian analysis of
covariance matrices and dynamic models for longitudinal data.
Biometrika 89 553566.
DASS, S. and BERGER, J. (2003). Unied conditional frequentist
and Bayesian testing of composite hypotheses. Scand. J. Sta-
tist. 30 193210.
DATTA, G. S., MUKERJEE, R., GHOSH, M. and SWEETING, T. J.
(2000). Bayesian prediction with approximate frequentist va-
lidity. Ann. Statist. 28 14141426.
DAWID, A. P. and SEBASTIANI, P. (1999). Coherent dispersion
criteria for optimal experimental design. Ann. Statist. 27
6581.
DAWID, A. P. and VOVK, V. G. (1999). Prequential probability:
Principles and properties. Bernoulli 5 125162.
DE FINETTI, B. (1970). Teoria delle Probabilit 1, 2. Einaudi,
Torino. [English translations published (1974, 1975) as Theory
of Probability 1, 2. Wiley, New York.]
DELAMPADY, M., DASGUPTA, A., CASELLA, G., RUBIN, H. and
STRAWDERMAN, W. E. (2001). A new approach to default
priors and robust Bayes methodology. Canad. J. Statist. 29
437450.
DEY, D., MLLER, P. and SINHA, D., eds. (1998). Practical Non-
parametric and Semiparametric Bayesian Statistics. Lecture
Notes in Statist. 133. Springer, New York.
DIACONIS, P. (1988a). Bayesian numerical analysis. In Statisti-
cal Decision Theory and Related Topics IV (S. Gupta and
J. Berger, eds.) 1 163175. Springer, New York.
DIACONIS, P. (1988b). Recent progress on de Finettis notion
of exchangeability. In Bayesian Statistics 3 (J. M. Bernardo,
M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.)
111125. Oxford Univ. Press.
DIACONIS, P. and FREEDMAN, D. (1986). On the consistency of
Bayes estimates (with discussion). Ann. Statist. 14 167.
EATON, M. L. (1989). Group Invariance Applications in Statistics.
IMS, Hayward, CA.
EFRON, B. (1993). Bayes and likelihood calculations from con-
dence intervals. Biometrika 80 326.
FRASER, D. A. S., REID, N., WONG, A. and YI, G. Y. (2003).
Direct Bayes for interest parameters. In Bayesian Statistics 7
(J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid,
D. Heckerman, A. F. M. Smith and M. West, eds.) 529534.
Oxford Univ. Press.
FREEDMAN, D. A. (1963). On the asymptotic behavior of
Bayes estimates in the discrete case. Ann. Math. Statist. 34
13861403.
FREEDMAN, D. A. (1999). On the Bernsteinvon Mises the-
orem with innite-dimensional parameters. Ann. Statist. 27
11191140.
GART, J. J. and NAM, J. (1988). Approximate interval estimation
of the ratio of binomial parameters: A review and corrections
for skewness. Biometrics 44 323338.
GELMAN, A., CARLIN, J. B., STERN, H. and RUBIN, D. B.
(1995). Bayesian Data Analysis. Chapman and Hall, London.
GHOSAL, S., GHOSH, J. K. and VAN DER VAART, A. W. (2000).
Convergence rates of posterior distributions. Ann. Statist. 28
500531.
GHOSH, J. K., ed. (1988). Statistical Information and Likelihood.
A Collection of Critical Essays. Lecture Notes in Statist. 45.
Springer, New York.
GHOSH, J. K., GHOSAL, S. and SAMANTA, T. (1994). Stability
and convergence of the posterior in non-regular problems. In
Statistical Decision Theory and Related Topics V (S. S. Gupta
and J. Berger, eds.) 183199. Springer, New York.
GHOSH, J. K. and RAMAMOORTHI, R. V. (2003). Bayesian
Nonparametrics. Springer, New York.
GHOSH, M. and KIM, Y.-H. (2001). The BehrensFisher problem
revisited: A Bayesfrequentist synthesis. Canad. J. Statist. 29
517.
GOOD, I. J. (1983). Good Thinking: The Foundations of Probabil-
ity and Its Applications. Univ. Minnesota Press, Minneapolis.
GUTTMAN, I. (1967). The use of the concept of a future observa-
tion in goodness-of-t problems. J. Roy. Statist. Soc. Ser. B 29
83100.
HILL, B. (1974). On coherence, inadmissibility and inference
about many parameters in the theory of least squares. In
Studies in Bayesian Econometrics and Statistics (S. Fienberg
and A. Zellner, eds.) 555584. North-Holland, Amsterdam.
HOBERT, J. (2000). Hierarchical models: A current computational
perspective. J. Amer. Statist. Assoc. 95 13121316.
HWANG, J. T., CASELLA, G., ROBERT, C., WELLS, M. T.
and FARRELL, R. (1992). Estimation of accuracy in testing.
Ann. Statist. 20 490509.
JEFFREYS, H. (1961). Theory of Probability, 3rd ed. Oxford
Univ. Press.
KIEFER, J. (1977). Conditional condence statements and con-
dence estimators (with discussion). J. Amer. Statist. Assoc. 72
789827.
KIM, Y. and LEE, J. (2001). On posterior consistency of survival
models. Ann. Statist. 29 666686.
LAD, F. (1996). Operational Subjective Statistical Methods:
A Mathematical, Philosophical and Historical Introduction.
Wiley, New York.
LE CAM, L. (1986). Asymptotic Methods in Statistical Decision
Theory. Springer, New York.
LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point
Estimation, 2nd ed. Springer, New York.
MCDONALD, G. C., VANCE, L. C. and GIBBONS, D. I. (1995).
Some tests for discriminating between lognormal and Weibull
distributionsan application to emissions data. In Recent Ad-
vances in Life-Testing and ReliabilityA Volume in Honor of
Alonzo Clifford Cohen, Jr. (N. Balakrishnan, ed.) Chapter 25.
CRC Press, Boca Raton, FL.
MENG, X.-L. (1994). Posterior predictive p-values. Ann. Statist.
22 11421160.
MORRIS, C. (1983). Parametric empirical Bayes inference: Theory
and applications (with discussion). J. Amer. Statist. Assoc. 78
4765.
MOSSMAN, D. and BERGER, J. (2001). Intervals for post-test
probabilities: A comparison of ve methods. Medical Decision
Making 21 498507.
NEYMAN, J. (1977). Frequentist probability and frequentist statis-
tics. Synthse 36 97131.
NEYMAN, J. and SCOTT, E. L. (1948). Consistent estimates based
on partially consistent observations. Econometrica 16 132.
OHAGAN, A. (1992). Some Bayesian numerical analysis. In
Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid
and A. F. M. Smith, eds.) 345363. Oxford Univ. Press.
PAULO, R. (2002). Problems on the Bayesian/frequentist interface.
Ph.D. dissertation, Duke Univ.
PRATT, J. W. (1965). Bayesian interpretation of standard inference
statements (with discussion). J. Roy. Statist. Soc. Ser. B 27
169203.
RAO, J. N. K. (2003). Small Area Estimation. Wiley, New York.
REID, N. (2000). Likelihood. J. Amer. Statist. Assoc. 95
13351340.
ROS INSUA, D. and RUGGERI, F., eds. (2000). Robust Bayesian
Analysis. Lecture Notes in Statist. 152. Springer, New York.
ROBBINS, H. (1955). An empirical Bayes approach to statistics.
Proc. Third Berkeley Symp. Math. Statist. Probab. 1 157164.
Univ. California Press, Berkeley.
ROBERT, C. P. (2001). The Bayesian Choice, 2nd ed. Springer,
New York.
ROBERT, C. P. and CASELLA, G. (1999). Monte Carlo Statistical
Methods. Springer, New York.
ROBINS, J. M. and RITOV, Y. (1997). Toward a curse of di-
mensionality appropriate (CODA) asymptotic theory for semi-
parametric models. Statistics in Medicine 16 285319.
ROBINS, J. M., VAN DER VAART, A. and VENTURA, V. (2000).
Asymptotic distribution of p-values in composite null models.
ROBINS, J. and WASSERMAN, L. (2000). Conditioning, likelihood
and coherence: A review of some foundational concepts.
ROBINSON, G. K. (1979). Conditional properties of statistical
procedures. Ann. Statist. 7 742755.
ROUSSEAU, J. (2000). Coverage properties of one-sided intervals
in the discrete case and applications to matching priors.
Ann. Inst. Statist. Math. 52 2842.
RUBIN, D. B. (1984). Bayesianly justiable and relevant fre-
quency calculations for the applied statistician. Ann. Statist.
12 11511172.
RUBIN, H. (1987). A weak system of axioms for rational
behavior and the non-separability of utility from prior. Statist.
Decisions 5 4758.
SAVAGE, L. J. (1962). The Foundations of Statistical Inference.
Methuen, London.
SCHERVISH, M. (1995). Theory of Statistics. Springer, New York.
SCHWARZ, G. (1978). Estimating the dimension of a model.
Ann. Statist. 6 461464.
SELLKE, T., BAYARRI, M. J. and BERGER, J. (2001). Calibration
of p-values for testing precise null hypotheses. Amer. Statist.
55 6271.
SOOFI, E. (2000). Principal information theoretic approaches.
STEIN, C. (1956). Inadmissibility of the usual estimator for
the mean of a multivariate normal distribution. Proc. Third
Berkeley Symp. Math. Statist. Probab 1 197206. Univ. Cal-
ifornia Press, Berkeley.
STEIN, C. (1975). Estimation of a covariance matrix. Reitz Lec-
ture, IMSASA Annual Meeting. (Also unpublished lec-
ture notes.)
STRAWDERMAN, W. (2000). Minimaxity. J. Amer. Statist. Assoc.
95 13641368.
SUN, D. and BERGER, J. (2003). Objective priors under sequential
experimentation. Technical report, Univ. Missouri.
SWEETING, T. J. (2001). Coverage probability bias, objective
Bayes and the likelihood principle. Biometrika 88 657675.
TANG, D. (2001). Choice of priors for hierarchical models: Admis-
sibility and computation. Ph.D. dissertation, Purdue Univ.
VIDAKOVIC, B. (2000). Gamma-minimax: A paradigm for conser-
vative robust Bayesians. In Robust Bayesian Analysis. Lecture
Notes in Statist. 152 241259. Springer, New York.
WALD, A. (1950). Statistical Decision Functions. Wiley, New
York.
WELCH, B. and PEERS, H. (1963). On formulae for condence
points based on integrals of weighted likelihoods. J. Roy.
Statist. Soc. Ser. B 25 318329.
WOLPERT, R. L. (1996). Testing simple hypotheses. In Data
Analysis and Information Systems: Statistical and Conceptual
Approaches (H.-H. Bock and W. Polasek, eds.) 289297.
Springer, Berlin.
WOODROOFE, M. (1986). Very weak expansions for sequential
condence levels. Ann. Statist. 14 10491067.
YANG, R. and BERGER, J. (1994). Estimation of a covariance
matrix using the reference prior. Ann. Statist. 22 11951211.
YE, K. (1993). Reference priors when the stopping rule depends on
the parameter of interest. J. Amer. Statist. Assoc. 88 360363.
ZHAO, L. (2000). Bayesian aspects of some nonparametric prob-
lems. Ann. Statist. 28 532552.

Interplay of Bayes

Uploaded by

Copyright:

Available Formats

Interplay of Bayes

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Interplay of Bayes

Uploaded by

Copyright:

Available Formats

Statistical Science

2004, Vol. 19, No. 1, 5880

and using the classical estima-

of the parameter under the alternative

(x) is the th-quantile of the Beta(x + 0.5,

You might also like