Bayesian Model Averaging for Linear Regression Models
Adrian E. RAFTERY, David MADIGAN, and Jennifer A. HOETING
We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model
ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest.
A Bayesian solution to this problem involves averaging over all possible models (i.e., combinations of predictors) when making
inferences about quantities of interest. This approach is often not practical. In this article we offer two alternative approaches.
First, we describe an ad hoc procedure, "Occam's window," which indicates a small set of models over which a model average
can be computed. Second, we describe a Markov chain Monte Carlo approach that directly approximates the exact solution. In the
presence of model uncertainty, both of these model averaging procedures provide better predictive performance than any single
model that might reasonably have been selected. In the extreme case where there are many candidate predictors but no relationship
between any of them and the response, standard variable selection procedures often choose some subset of variables that yields a
high R2 and a highly significant overall F value. In this situation, Occam's window usually indicates the null model (or a small
number of models including the null model) as the only one (or ones) to be considered, thus largely resolving the problem of
selecting significant models when there is no signal in the data. Software to implement our methods is available from StatLib.
KEY WORDS: Bayes factor; Markov chain Monte Carlo model composition; Model uncertainty; Occam's window; Posterior
model probability.
Adrian E. Raftery is Professor of Statistics and Sociology, and David Madigan is Assistant Professor of Statistics, Department of Statistics, University of Washington, Seattle, WA 98195. Jennifer Hoeting is Assistant Professor of Statistics, Department of Statistics, Colorado State University, Fort Collins, CO 80523. The research of Raftery and Hoeting was partially supported by Office of Naval Research contract N-00014-91-J-1074. Madigan's research was partially supported by National Science Foundation grant DMS 92111627. The authors are grateful to Danika Lew for research assistance and to the editor, the associate editor, two anonymous referees, and David Draper for very helpful comments that greatly improved the article.

© 1997 American Statistical Association. Journal of the American Statistical Association, March 1997, Vol. 92, No. 437, Theory and Methods.

1. INTRODUCTION

Given a dependent variable Y and a set of k candidate predictors X1, X2, ..., Xk, the standard variable selection problem is to find the "best" model of the form

\[ Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon, \]

where X1, X2, ..., Xp is a subset of X1, X2, ..., Xk. Here "best" may have any of several meanings; for example, the model providing the most accurate predictions for new cases exchangeable with those used to fit the model.

A typical approach to data analysis is to carry out a model selection exercise leading to a single "best" model and then to make inferences as if the selected model were the true model. However, this ignores a major component of uncertainty, namely uncertainty about the model itself (Draper 1995; Hodges 1987; Leamer 1978; Moulton 1991; Raftery 1988, 1996). As a consequence, uncertainty about quantities of interest can be underestimated. (For striking examples of this, see Draper 1995, Kass and Raftery 1995, Madigan and York 1995, Miller 1984, Raftery 1996, and Regal and Hook 1991.) A complete Bayesian solution to this problem involves averaging over all possible combinations of predictors when making inferences about quantities of interest. Indeed, this approach provides optimal predictive ability (Madigan and Raftery 1994). However, in many situations this is not practical.

In this article we offer two alternative approaches. First, we apply the Occam's window algorithm of Madigan and Raftery (1994) to linear regression models. We refer to this algorithm as "Occam's window." This approach involves averaging over a reduced set of models. Second, we directly approximate the complete solution by applying the Markov chain Monte Carlo model composition (MC3) approach of Madigan and York (1995) to linear regression models. In this approach the posterior distribution of a quantity of interest is approximated by a Markov chain Monte Carlo method that generates a process that moves through model space. We show in an example that both of these model averaging approaches provide better predictive performance than any single model that might reasonably have been selected.

Freedman (1983) pointed out that when there are many predictors and there is no relationship between the predictors and the response, variable selection techniques can lead to a model with a high R2 and a highly significant overall F value. By contrast, when a dataset is generated with no relationship between the predictors and the response, Occam's window typically indicates the null model as the "best" model or as one of a small set of "best" models, thus largely resolving the problem of selecting a significant model for a null relationship.

The background literature for our approach includes several areas of research: the selection of subsets of predictor variables in linear regression models (Breiman 1992, 1995; Breiman and Spector 1992; Draper and Smith 1981; Hocking 1976; Linhart and Zucchini 1986; Miller 1990; Shibata 1981), Bayesian approaches to the selection of subsets of predictor variables in linear regression models (George and McCulloch 1993; Laud and Ibrahim 1995; Mitchell and Beauchamp 1988; Schwarz 1978), and model uncertainty
(Freedman, Navidi, and Peters 1986; Leamer 1978; Madigan and Raftery 1994; Stewart 1987; Stewart and Davis 1986).

In the next section we outline the philosophy underlying our approach. In Section 3 we describe how we selected prior distributions, and we outline the two model averaging approaches in Section 4. In Section 5 we provide an example and describe our assessment of predictive performance. In Section 6 we compare the performance of Occam's window to that of standard variable selection methods when there is no relationship between the predictors and the response. Finally, in Section 7 we discuss related work and suggest future directions.

2. ACCOUNTING FOR MODEL UNCERTAINTY USING BMA

As described previously, basing inferences on a single selected model ignores model uncertainty. The Bayesian solution is to average over the candidate models: if Δ is a quantity of interest, then its posterior distribution given data D is

\[ \Pr(\Delta \mid D) = \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D), \qquad (1) \]

an average of the posterior distributions of Δ under each of the models M1, ..., MK considered, weighted by their posterior model probabilities. In (1),

\[ \Pr(M_k \mid D) = \frac{\Pr(D \mid M_k)\,\Pr(M_k)}{\sum_{l=1}^{K} \Pr(D \mid M_l)\,\Pr(M_l)}, \qquad (2) \]

where

\[ \Pr(D \mid M_k) = \int \Pr(D \mid \theta_k, M_k)\,\Pr(\theta_k \mid M_k)\, d\theta_k \qquad (3) \]

is the marginal likelihood of model Mk, θk is the vector of parameters of model Mk, Pr(θk | Mk) is the prior density of θk under model Mk, Pr(D | θk, Mk) is the likelihood, and Pr(Mk) is the prior probability that Mk is the true model. All probabilities are implicitly conditional on M, the set of all models being considered. In this article we consider M to be equal to the set of all possible combinations of predictors.

Averaging over all of the models in this fashion provides better predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj:

\[ -E\left[ \log \left\{ \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D) \right\} \right] \le -E\left[ \log \Pr(\Delta \mid M_j, D) \right], \qquad j = 1, \ldots, K, \]

where Δ is the observable to be predicted and the expectation is with respect to \( \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D) \). This follows from the nonnegativity of the Kullback-Leibler information divergence.

Implementation of Bayesian model averaging is difficult for two reasons. First, the integrals in (3) can be hard to compute. Second, the number of terms in (1) can be enormous. In this article we present solutions to both of these problems.
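To make Equations (1)-(3) concrete, the following sketch shows how posterior model probabilities and a model-averaged estimate could be computed once the marginal likelihoods are available (for linear regression they come from Equation (5) below). This is an illustrative sketch, not the authors' StatLib software; the function names and the numbers in the example are ours.

```python
import numpy as np

def posterior_model_probs(log_marglik, log_prior):
    """Normalize Pr(Mk|D) from log marginal likelihoods and log prior model
    probabilities, working on the log scale to avoid underflow."""
    log_post = np.asarray(log_marglik) + np.asarray(log_prior)
    log_post -= log_post.max()          # stabilize before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

def model_average(per_model_estimates, post_probs):
    """Model-averaged posterior mean of a quantity of interest:
    E[Delta|D] = sum_k E[Delta|Mk,D] Pr(Mk|D), as in Equation (1)."""
    return np.dot(per_model_estimates, post_probs)

# Hypothetical numbers for three models, with a uniform prior over models:
log_ml = [-102.3, -101.7, -105.9]
log_pr = np.log(np.full(3, 1.0 / 3.0))
probs = posterior_model_probs(log_ml, log_pr)
print(probs, model_average([0.42, 0.37, 0.55], probs))
```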
3. BAYESIAN FRAMEWORK

3.1 Modeling Framework

Each model that we consider is of the form

\[ Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon = X\beta + \varepsilon, \qquad (4) \]

with ε normally distributed with mean 0 and variance σ². We use a conjugate prior of the normal-gamma form,

\[ \frac{\nu\lambda}{\sigma^2} \sim \chi^2_\nu, \qquad \beta \mid \sigma^2 \sim N(\mu, \sigma^2 V). \]

Here ν, λ, the (p + 1) × (p + 1) matrix V, and the (p + 1)-vector μ are hyperparameters to be chosen.

The marginal likelihood for Y under a model Mi based on the proper priors described earlier is given by

\[ P(Y \mid \mu_i, V_i, X_i, M_i) = \frac{\Gamma\!\left(\frac{\nu+n}{2}\right)(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\,\left| I + X_i V_i X_i' \right|^{1/2}} \left[ \lambda\nu + (Y - X_i\mu_i)'\left(I + X_i V_i X_i'\right)^{-1}(Y - X_i\mu_i) \right]^{-(\nu+n)/2}, \qquad (5) \]

where Xi is the design matrix and Vi is the covariance matrix for β corresponding to model Mi (Raiffa and Schlaifer 1961). The Bayes factor for M0 versus M1, the ratio of
Equation (5) for i = 0 and i = 1, is then given by

\[ B_{01} = \left( \frac{\left| I + X_1 V_1 X_1' \right|}{\left| I + X_0 V_0 X_0' \right|} \right)^{1/2} \left( \frac{a_1}{a_0} \right)^{(\nu+n)/2}, \qquad (6) \]

where \( a_i = \lambda\nu + (Y - X_i\mu_i)'\left(I + X_i V_i X_i'\right)^{-1}(Y - X_i\mu_i) \), i = 0, 1.
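A direct, numerically naive transcription of Equations (5) and (6) (it forms the n × n matrix I + XVX' explicitly) might look as follows. This is our own illustration rather than the MC3.REG program mentioned in the Appendix; mu, V, nu, and lam stand for μi, Vi, ν, and λ.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, mu, V, nu, lam):
    """Log of Equation (5): marginal likelihood of y under one model with the
    conjugate prior  nu*lam/sigma^2 ~ chi^2_nu,  beta|sigma^2 ~ N(mu, sigma^2 V).
    X is the n x (p+1) design matrix, including the intercept column."""
    n = len(y)
    S = np.eye(n) + X @ V @ X.T                  # I + X V X'
    resid = y - X @ mu
    quad = resid @ np.linalg.solve(S, resid)     # (y - X mu)' S^{-1} (y - X mu)
    _, logdet = np.linalg.slogdet(S)
    return (gammaln((nu + n) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * n * np.log(np.pi) + 0.5 * nu * np.log(nu * lam)
            - 0.5 * logdet - 0.5 * (nu + n) * np.log(lam * nu + quad))

def log_bayes_factor_01(y, X0, mu0, V0, X1, mu1, V1, nu, lam):
    """Log Bayes factor for M0 versus M1: the ratio of Equation (5) at i = 0 and i = 1."""
    return (log_marginal_likelihood(y, X0, mu0, V0, nu, lam)
            - log_marginal_likelihood(y, X1, mu1, V1, nu, lam))
```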
3.2 Selection of Prior Distributions

The Bayesian framework described earlier gives the BMA user the flexibility to modify the prior setup as desired. In this section we describe the prior distribution setup that we adopt in our examples below.

For noncategorical predictor variables, we assume the individual β's to be independent a priori. We center the distribution of β on zero (apart from β0) and choose μ = (β̂0, 0, ..., 0), where β̂0 is the ordinary least squares estimate of β0. The prior covariance matrix of β is equal to σ² multiplied by a diagonal matrix with entries (s_Y², φ²s_1^{-2}, φ²s_2^{-2}, ..., φ²s_p^{-2}), where s_Y² denotes the sample variance of Y, s_i² denotes the sample variance of X_i for i = 1, ..., p, and φ is a hyperparameter to be chosen. The prior variance of β0 is chosen conservatively and represents an upper bound on the reasonable variance for this parameter. The variances of the remaining β parameters are chosen to reflect increasing precision about each β_i as the variance of the corresponding X_i increases, and to be invariant to scale changes in both the predictor variables and the response variable.

For a categorical predictor variable X_i with (c + 1) possible outcomes (c ≥ 2), the Bayes factor should be invariant to the selection of the corresponding dummy variables (X_{i1}, ..., X_{ic}). To this end, we set the prior variance of (β_{i1}, ..., β_{ic}) equal to σ²φ²[(1/n) X_i' X_i]^{-1}, where X_i is the n × c design matrix for the dummy variables, in which each dummy variable has been centered by subtracting its sample mean. This is related to the g prior of Zellner (1986). The complete prior covariance matrix for β is then block diagonal, with the entries given above for β0 and the noncategorical predictors and a c × c block σ²φ²[(1/n) X_i' X_i]^{-1} for each categorical predictor.
To choose the remaining hyperparameters ν, λ, and φ, we define a number of reasonable desiderata and attempt to satisfy them. In what follows we assume that all the variables have been standardized to have mean zero and sample variance 1. We would like the following desiderata to hold:

1. The prior density Pr(β1, ..., βp) is reasonably flat over the unit hypercube [-1, 1]^p.
2. Pr(σ²) is reasonably flat over (a, 1) for some small a.
3. Pr(σ² < 1) is large.

The order of importance of these desiderata is roughly the order in which they are listed. More formally, we maximize Pr(σ² < 1) subject to the following:

a. Pr(β1 = 0, ..., βp = 0) / Pr(β1 = 1, ..., βp = 1) ≤ K1. (Following Jeffreys (1961), we choose K1 = 10.)
b. {max_{a ≤ σ² ≤ 1} Pr(σ²)} / Pr(σ² = a) ≤ K2.
c. {max_{a ≤ σ² ≤ 1} Pr(σ²)} / Pr(σ² = 1) ≤ K2.

Because desideratum 2 is less important than desideratum 1, we have chosen K2 = 10.

For a = .05, this yields ν = 2.58, λ = .28, and φ = 2.85. For this set of hyperparameters, Pr(σ² < 1) = .81. We use these settings of the hyperparameters in the examples that follow.

To compare our prior for βi, i = 1, ..., p, for a noncategorical predictor with the actual distribution of coefficients from real data, we collected 13 datasets from several regression textbooks (see App. A). Figure 1 shows a histogram of the 100 coefficients from the standardized data plotted with the prior distribution resulting from the hyperparameters that we use. As desired, the prior density is relatively flat over the range of observed values.
4. TWO APPROACHES TO BAYESIAN MODEL AVERAGING

4.1 Occam's Window

Our first method for accounting for model uncertainty starting from Equation (1) involves applying the Occam's window algorithm of Madigan and Raftery (1994) to linear regression models. Two basic principles underlie this ad hoc approach.
First, if a model predicts the data far less well than the model that provides the best predictions, it has effectively been discredited and should no longer be considered. Thus we exclude from Equation (1) the models not belonging to

\[ A' = \left\{ M_k : \frac{\max_l \{\Pr(M_l \mid D)\}}{\Pr(M_k \mid D)} \le C \right\}, \qquad (7) \]

where C is chosen by the data analyst. Second, appealing to Occam's razor, we also exclude complex models that receive less support from the data than their simpler counterparts; that is, we exclude the models belonging to

\[ B = \left\{ M_k : \exists\, M_l \in A',\ M_l \subset M_k,\ \Pr(M_l \mid D) > \Pr(M_k \mid D) \right\}. \qquad (8) \]

Equation (1) is then replaced by

\[ \Pr(\Delta \mid D) = \frac{\sum_{M_k \in A} \Pr(\Delta \mid M_k, D)\,\Pr(D \mid M_k)\,\Pr(M_k)}{\sum_{M_k \in A} \Pr(D \mid M_k)\,\Pr(M_k)}, \qquad (9) \]

where

\[ A = A' \setminus B \subset M. \qquad (10) \]

This greatly reduces the number of models in the sum in Equation (1), and now all that is required is a search strategy to identify the models in A. A graphical interpretation of the posterior odds for nested models, with cut-offs O_L and O_R, is given in Figure 2.

[Figure 2. Occam's Window: Interpreting the Posterior Odds for Nested Models. Region labels in the original figure: Strong Evidence for M1, Inconclusive Evidence, Evidence for M0, separated by the cut-offs O_L and O_R.]
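Once posterior model probabilities are available for a collection of candidate models, the set A of Equations (7)-(10) can be computed directly. The sketch below is ours: it uses a naive enumeration rather than the "up"/"down" search strategy of Madigan and Raftery (1994), represents a model by the set of its predictor indices, and takes C = 20 purely as an illustrative cut-off for (7).

```python
def occams_window(models, post_prob, C=20.0):
    """models: list of frozensets of predictor indices; post_prob: the matching
    posterior model probabilities (they need only be proportional to Pr(Mk|D)).
    Returns the indices of the models in A = A' \\ B of Equations (7)-(10)."""
    best = max(post_prob)
    # Principle 1 / Eq. (7): drop models that predict far worse than the best model.
    A_prime = [k for k in range(len(models))
               if post_prob[k] > 0 and best / post_prob[k] <= C]
    # Principle 2 / Eq. (8): drop models with a better-supported strict submodel in A'.
    A = [k for k in A_prime
         if not any(models[l] < models[k] and post_prob[l] > post_prob[k]
                    for l in A_prime)]
    return A

# Toy usage with three nested models on predictors {1}, {1,2}, and {1,2,3}:
models = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
print(occams_window(models, [0.50, 0.45, 0.05]))   # -> [0]
```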
4.2 Markov Chain Monte Carlo Model Composition

Our second approach approximates (1) directly using the Markov chain Monte Carlo model composition (MC3) method of Madigan and York (1995). We generate a Markov chain {M(t), t = 1, 2, ...} with state space M and equilibrium distribution Pr(Mi | D); for any function g(Mi) defined on M, the average

\[ \hat{G} = \frac{1}{N} \sum_{t=1}^{N} g(M(t)) \]

converges almost surely to E(g(M)) as N → ∞ (Smith and Roberts 1993). To compute (1) in this fashion, set g(M) = Pr(Δ | M, D).

To construct the Markov chain, we define a neighborhood nbd(M) for each M ∈ M that consists of the model M itself and the set of models with either one variable more or one variable fewer than M. Define a transition matrix q by setting q(M → M') = 0 for all M' ∉ nbd(M) and q(M → M') constant for all M' ∈ nbd(M). If the chain is currently in state M, then we proceed by drawing M' from q(M → M'). It is then accepted with probability

\[ \min\left\{ 1, \frac{\Pr(M' \mid D)}{\Pr(M \mid D)} \right\}; \]

otherwise the chain stays in state M. Madigan and York (1995) described MC3 for discrete graphical models. Software for implementing the MC3 algorithm is described in the Appendix.
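A compact sketch of the MC3 sampler just described, in our own notation, is given below. It assumes a user-supplied function log_post(model) returning log Pr(M | D) up to an additive constant (for example, the log of Equation (5) plus a log prior model probability); the "stay at M" element of nbd(M) is omitted from the proposal because proposing it would leave the state unchanged. This is an illustration, not the MC3.REG program.

```python
import numpy as np

def mc3(log_post, n_predictors, n_iter=30000, seed=0):
    """Markov chain Monte Carlo model composition: a Metropolis sampler whose
    state is a model (a frozenset of predictor indices) and whose proposal
    adds or deletes one predictor chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    current = frozenset()                      # start from the null model
    lp_cur = log_post(current)
    visits = {}
    for _ in range(n_iter):
        j = int(rng.integers(n_predictors))    # one variable more or one variable fewer
        proposal = current - {j} if j in current else current | {j}
        lp_prop = log_post(proposal)
        # accept with probability min{1, Pr(M'|D)/Pr(M|D)}
        if np.log(rng.random()) < lp_prop - lp_cur:
            current, lp_cur = proposal, lp_prop
        visits[current] = visits.get(current, 0) + 1
    return {m: c / n_iter for m, c in visits.items()}   # relative visit frequencies
```

The relative visit frequencies estimate Pr(M | D), and averaging g(M) over the visited states approximates the sum in (1).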
5. MODEL UNCERTAINTY AND PREDICTION

5.1 Example: Crime and Punishment

5.1.1 Crime and Punishment: Overview. Up to the 1960s, criminal behavior was traditionally viewed as deviant and linked to the offender's presumed exceptional psychological, social, or family circumstances (Taft and England 1964). Becker (1968) and Stigler (1970) argued that, on the contrary, the decision to engage in criminal activity is a rational choice determined by its costs and benefits relative to other (legitimate) opportunities.

In an influential article, Ehrlich (1973) developed this argument theoretically, specified it mathematically, and tested it empirically using aggregate data from 47 U.S. states in 1960. Errors in Ehrlich's empirical analysis were corrected by Vandaele (1978), who gave the corrected data, which we use here (see also Cox and Snell 1982). (Ehrlich's study has been much criticized (see, e.g., Brier and Fienberg 1980), and we cite it here for purely illustrative purposes. For economy of expression, we use causal language and speak of "effects," even though the validity of this language for these data is dubious. Because people, not states, commit crimes, these data may reflect aggregation bias.)

Ehrlich's theory goes as follows. The costs of crime are related to the probability of imprisonment and the average time served in prison, which in turn are influenced by police expenditures, which may themselves have an independent deterrent effect. The benefits of crime are related to both the aggregate wealth and the income inequality in the surrounding community. The expected net payoff from alternative legitimate activities is related to educational level and the availability of employment, the latter being measured by the unemployment and labor force participation rates. The payoff from legitimate activities was expected to be lower (in 1960) for nonwhites and for young males than for others, so that states with high proportions of these groups were expected also to have higher crime rates. Vandaele (1978) also included an indicator variable for southern states, the sex ratio, and the state population as control variables, but the theoretical rationale for inclusion of these predictors is unclear.

We thus have 15 candidate predictors of crime rate (Table 4), and so potentially 2^15 = 32,768 different models. As in the original analyses, all data were transformed logarithmically. Standard diagnostic checking (see, e.g., Draper and Smith 1981) did not reveal any gross violations of the assumptions underlying normal linear regression.

Ehrlich's analysis concentrated on the relationship between crime rate and predictors 14 and 15 (probability of imprisonment and average time served in state prisons). In his original analysis, Ehrlich (1973) focused on two regression models, consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), which were chosen in advance based on theoretical grounds.

To compare Ehrlich's results with models that might be selected using standard techniques, we chose three popular variable selection techniques: Efroymson's stepwise method (Miller 1990), minimum Mallows' Cp, and maximum adjusted R2 (Weisberg 1985). Efroymson's stepwise method is like forward selection except that when a new variable is added to the subset, partial correlations are considered to see whether any of the variables currently in the subset should be dropped. Similar hybrid methods are found in most standard statistical computer packages. Problems with stepwise regression, Mallows' Cp, and adjusted R2 are well known (see, e.g., Weisberg 1985).

Table 1 displays the results from the full model with all 15 predictors, three models selected using standard variable selection techniques, and the two models chosen by Ehrlich on theoretical grounds. The three models chosen using variable selection techniques (models 2, 3, and 4) share many of the same variables and have high values of R2. Ehrlich's theoretically chosen models fit the data less well. There are striking differences, indeed conflicts, between the results from the different models. Even the models chosen using statistical techniques lead to conflicting conclusions about the main questions of interest, despite the models' superficial similarity.

Consider first the predictor for probability of imprisonment, X14. This is a significant predictor in all six models, so interest focuses on estimating the size of its effect. To aid interpretation, recall that all variables have been transformed logarithmically, so that when all other predictors are held fixed, β14 = -.30 means roughly that a 10% increase in the probability of imprisonment produces a 3% reduction in the crime rate. The estimates of β14 fluctuate wildly between models. The stepwise regression model gives an estimate about one-third lower in absolute value than the full model, enough to be of policy importance; this difference is equal to about 1.7 standard errors. The Ehrlich models give estimates that are about one-half higher than the full model, and more than twice as big as those from stepwise regression (in absolute value). There is clearly considerable model uncertainty about this parameter.
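The elasticity interpretation quoted above follows from the log-log form of the model: with all variables logged, β14 is the elasticity of the crime rate with respect to the probability of imprisonment, so for β14 = -.30 and all other predictors held fixed,

\[ \frac{\text{new crime rate}}{\text{old crime rate}} = (1.10)^{-0.30} = \exp(-0.30 \ln 1.10) \approx \exp(-0.029) \approx 0.97, \]

that is, roughly a 3% reduction for a 10% increase in the probability of imprisonment.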
[Table 1. Crime Data: Comparison of Selected Models. Columns: Method, Variables, R2 (%), Number of variables, β14, β15, p15. NOTE: p15 is the p value from a two-sided t test for testing β15 = 0.]
Now consider β15, the coefficient of the average time served in state prisons. Whether this is significant at all is not clear, and t tests based on different models lead to conflicting conclusions. In the full model, β15 has a nonsignificant p value of .133, whereas stepwise regression leads to a model that does not include this variable. On the other hand, Mallows' Cp leads to a model in which the p value for β15 is significant at the .05 level, whereas with adjusted R2 it is again not significant. In contrast, in Ehrlich's models it is highly significant.

Together these results paint a confused picture about β14 and β15. Later we argue that the confusion can be resolved by taking explicit account of model uncertainty.

5.1.2 Crime and Punishment: Model Averaging. For the model averaging strategies, we assumed that all possible combinations of predictors were equally likely a priori. To implement Occam's window, we started from the null model and used the "up" algorithm only (see Madigan and Raftery 1994). The selected models and their posterior model probabilities are shown in Table 2. The models with posterior model probabilities of 1.2% or larger as indicated by MC3 are shown in Table 3. In total, 1,772 different models were visited during 30,000 iterations of MC3. Occam's window chose 22 models in this example, clearly indicating model uncertainty. Choosing any one model and making inferences as if it were the "true" model ignores model uncertainty. In the next section we further explore the consequences of basing inferences on a single model.

Table 2. Crime Data: Occam's Window Posterior Model Probabilities

Model (predictors included)    Posterior model probability (%)
1 3 4 9 11 13 14               12.6
1 3 4 11 13 14                  9.0
1 3 4 9 13 14                   8.4
1 3 5 9 11 13 14                8.0
3 4 8 9 13 14                   7.6
1 3 4 13 14                     6.3
1 3 4 11 13                     5.8
1 3 5 11 13 14                  5.7
1 3 4 13                        4.9
1 3 5 9 13 14                   4.8
3 5 8 9 13 14                   4.4
3 4 9 13 14                     4.1
3 5 9 13 14                     3.6
1 3 5 13 14                     3.5
2 3 4 13 14                     2.0
1 3 5 11 13                     1.9
3 4 13 14                       1.6
3 5 13 14                       1.6
3 4 13                          1.4
1 3 5 13                        1.4
3 5 13                           .7
1 4 12 13                        .7

Table 3. Crime Data: MC3, Models With Posterior Model Probabilities of 1.2% or Larger

Model (predictors included)    Posterior model probability (%)
1 3 4 9 11 13 14                2.6
1 3 4 11 13 14                  1.8
1 3 4 9 13 14                   1.7
1 3 4 5 9 13 14                 1.6
1 3 4 9 11 13 14 15             1.6
1 3 4 9 13 14 15                1.6
3 4 8 9 13 14                   1.5
1 3 4 13 14                     1.3
1 3 4 11 13                     1.2
1 3 5 11 13 14                  1.2

The top models indicated by the two methods (Tables 2 and 3) are quite similar. The posterior probabilities are normalized over all selected models for Occam's window and over all possible combinations of the 15 predictors for MC3. So the posterior probabilities for the same models differ across the model averaging methods, but this has little effect on the relationship between the models as measured by the Bayes factor.

Table 4 shows the posterior probability that the coefficient for each predictor does not equal 0, that is, Pr(βi ≠ 0 | D), obtained by summing the posterior model probabilities across models for each predictor. The results from Occam's window and MC3 are fairly close for most of the predictors. Predictors with high Pr(βi ≠ 0 | D) include proportion of young males, mean years of schooling, police expenditure, income inequality, and probability of imprisonment.

Comparing the two models analyzed by Ehrlich (1973), consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), with the results in Table 4, we see that several predictors included in Ehrlich's analysis receive little support from the data. The estimated Pr(βi ≠ 0 | D) is quite small for predictors 6, 10, 12, and 15. Two predictors (3 and 4) have empirical support but were not included by Ehrlich. Indeed, Ehrlich's two selected models have very low posterior probabilities.

Ehrlich's work attracted attention primarily because of his conclusion that both the probability of imprisonment (predictor 14) and the average prison term (predictor 15) influenced the crime rate. The posterior distributions for the coefficients of these predictors, based on the model averaging results of MC3, are shown in Figures 3 and 4. The MC3 posterior distribution for β14 is indeed centered away from 0, with a small spike at 0 corresponding to Pr(β14 = 0 | D). The posterior distribution for β14 based on Occam's window is quite similar. The spike at 0 is an artifact of our approach, in which it is possible to consider models with a predictor fully removed from the model. This is in contrast to the practice of setting the predictor close to 0 with high probability (as in George and McCulloch 1993). In contrast to Figure 3, the MC3 posterior distribution for the coefficient corresponding to the average prison term is centered close to 0 and has a large spike at 0 (Fig. 4). Occam's window indicates a spike at 0 only, that is, no support for inclusion of this predictor. By averaging over all models, our results indicate support for a relationship between crime rate and predictor 14, but not predictor 15. Our model averaging results are consistent with those of Ehrlich for the probability of imprisonment, but not for the average prison term.
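Summaries such as those in Table 4, and the union probability for the two police-expenditure measures discussed below, are simple sums of posterior model probabilities. A sketch (ours; the three models and their probabilities are just the three largest entries of Table 2, renormalized to sum to 1 for the sake of the toy example) is:

```python
def inclusion_probability(models, post_prob, predictors):
    """Pr(beta_i != 0 for at least one i in `predictors` | D): the sum of
    Pr(Mk|D) over models Mk containing any of the given predictors.
    With a single predictor this is the Pr(beta_i != 0 | D) of Table 4."""
    return sum(p for m, p in zip(models, post_prob)
               if any(i in m for i in predictors))

# Models as frozensets of predictor indices, with posterior probabilities:
models = [frozenset({1, 3, 4, 9, 11, 13, 14}),
          frozenset({1, 3, 4, 11, 13, 14}),
          frozenset({1, 3, 4, 9, 13, 14})]
post_prob = [0.42, 0.30, 0.28]
print(inclusion_probability(models, post_prob, {14}))    # Pr(beta_14 != 0 | D)
print(inclusion_probability(models, post_prob, {4, 5}))  # Pr[(beta_4 != 0) or (beta_5 != 0) | D]
```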
[Table 4. Crime Data: Pr(βi ≠ 0 | D) for the 15 candidate predictors under Occam's window and MC3. NOTE: The last column indicates the predictors included in the two models considered by Ehrlich; * corresponds to Ehrlich model 1 and ** corresponds to Ehrlich model 2.]

The model averaging results for the predictors for police expenditures lead to an interesting interpretation. Police expenditure was measured in two successive years, and the two measures are highly correlated (r = .993). The data show clearly that the 1960 crime rate is associated with police expenditures, and that only one of the two measures (X4 and X5) is needed, but they do not say for sure which measure should be used. Each model in Occam's window contains one predictor or the other, but not both. For both Occam's window and MC3, Pr[(β4 ≠ 0) ∪ (β5 ≠ 0) | D] = 1, so the data provide very strong evidence for an association with police expenditures.

Among the variables that measure the expected benefits from crime, Ehrlich concluded that both wealth and income inequality had an effect; we found this to be true for income inequality but not for wealth. For the predictors that represent the payoff from legitimate activities, Ehrlich found the effects of variables 1, 6, 10, and 11 to be unclear; he did not include mean schooling in his model. We found strong evidence for the effect of some of these variables, notably the percent of young males and mean schooling, but the effects of unemployment and labor force participation are either unproven or unlikely. Finally, the "control" variables that have no theoretical basis (2, 7, 8) turned out, satisfyingly, to have no empirical support either.

In summary, we found strong support for some of Ehrlich's conclusions but not for others. In particular, by averaging over all models, our results indicate support for a relationship between crime rate and probability of imprisonment, but not for average time served in state prisons.
The predictive coverage results for the crime data are given in Table 5. All of the individual models chosen using standard techniques performed considerably worse than the model averaging approaches, with prediction coverage ranging from 58% to 67%. Thus the model averaging strategies improved predictive coverage substantially as compared to any single model that might reasonably have been chosen. A sensitivity analysis for priors chosen within the framework described in Section 3.2 indicates that the results for ...

[Table 5. Crime Data: Predictive Coverage. Columns: Method, Model, Predictive coverage (%).]

In the foregoing example, the true answer is unknown. To further demonstrate the usefulness of BMA, we use several simulated examples. In our examples, we follow the format of George and McCulloch (1993).

Example 5.2.1. In this example we investigate the impact of model averaging on predictive performance when there is little model uncertainty. For the training set, we simulated p = 15 predictors and n = 50 observations as independent standard normal vectors. We generated the response ...
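The predictive coverage figures reported here can be estimated by counting how often held-out responses fall inside their prediction intervals. The helper below is ours; it takes the per-observation interval endpoints as given, and the nominal level mentioned in the comment is an assumption for illustration rather than a value recovered from the text.

```python
import numpy as np

def predictive_coverage(y_test, lower, upper):
    """Fraction of held-out observations falling inside their prediction
    intervals [lower_i, upper_i]; compare with the nominal level (e.g., 90%)."""
    y_test, lower, upper = map(np.asarray, (y_test, lower, upper))
    return np.mean((y_test >= lower) & (y_test <= upper))

# Usage with hypothetical intervals from two procedures on the same test set:
# coverage_bma      = predictive_coverage(y_test, bma_lower, bma_upper)
# coverage_stepwise = predictive_coverage(y_test, step_lower, step_upper)
```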
[Figure 5. Crime Data: Calibration Plot. The solid line denotes model averaging (Occam's window); the dashed line, the model with predictors 3, 4, 8, 9, 13, 15 (stepwise).]

6. SUCCESSFUL IDENTIFICATION OF THE NULL MODEL
[Table: Predictive Coverage. Columns: Method, Model, Predictive coverage (%). NOTE: Predictive coverage for BMA (all models) is estimated using the 371 models with posterior model probabilities greater than .0001; see Table 5.]
The first variable selection method applied to these data was the screening procedure described by Freedman (1983), in which all predictors with p values of .25 or lower were included in a second pass over the data. The results from this method were as follows:

* R2 = .40 and p = .0003.
* 17 coefficients out of 18 were significant at the .25 level.
* 10 coefficients out of 18 were significant at the .05 level.

These results are highly misleading, as they indicate a definite relationship between the response and the predictors, whereas in fact the data are all noise.

The second model selection method used on the full dataset was Efroymson's stepwise method. This indicated a model with 15 predictors, with the following results:

* R2 = .40 and p = .0001.
* All 15 predictors were significant at the .25 level.
* 10 coefficients out of 15 were significant at the .05 level.

Again a model is chosen that misleadingly appears to have a great deal of explanatory power.

The third variable selection method that we used was Occam's window. The only model chosen by this method was the null model.

We repeated the foregoing procedure 10 times with similar results. In five simulations, Occam's window chose only the null model. For the remaining simulations, three models or fewer were chosen along with the null model. All the nonnull models chosen had R2 values less than .15. For all of the simulations, the selection procedure used by Freedman (1983) and the stepwise method chose models with many predictors and highly significant R2 values.

At best, Occam's window correctly indicates that the null model is the only model that should be chosen when there is no signal in the data. At worst, Occam's window chooses the null model along with several other models. The presence of the null model among those chosen by Occam's window should indicate to a researcher the possibility that there is no signal in the data that he or she is analyzing.

To examine the possibility that our Bayesian approach favors parsimony to the extent that Occam's window finds no signal even when one exists, we did an additional simulation study. We generated 3,000 observations from a standard normal distribution to create a dataset with 100 observations and 30 candidate predictors. We allowed the response Y to depend only on X1, where Y = .5X1 + ε with ε ~ N(0, .75). Thus Y still has unit variance, and the "true" R2 for the model equals .20.

For these simulated data, Occam's window contained one model only: the correct model with X1. In contrast, the screening method used by Freedman produced a model with six predictors, including X1, with four of these significant at the .1 level. Stepwise regression indicated a model with two predictors, including X1, both of them significant at the .025 level. So the two standard variable selection methods indicated evidence for variables that in fact were not at all associated with the dependent variable, whereas Occam's window chose the correct model.

These examples provide evidence that Occam's window overcomes the problem of selecting apparently significant models when there is no signal in the data.
The program MC3.REG performs MCMC model composition for linear regression. The set of programs fully implements the MC3 algorithm described in Section 4.2.

[Received November 1993. Revised June 1996.]

REFERENCES

Becker, G. S. (1968), "Crime and Punishment: An Economic Approach," Journal of Political Economy, 76, 169-217.
Breiman, L. (1968), Probability, Reading, MA: Addison-Wesley.
Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738-754.
Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373-384.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression," International Statistical Review, 60, 291-319.
Brier, S. S., and Fienberg, S. E. (1980), "Recent Econometric Modeling of Crime and Punishment: Support for the Deterrence Hypothesis?," Evaluation Review, 4, 147-191.
Chatterjee, S., and Price, B. (1991), Regression Analysis by Example (2nd ed.), New York: Wiley.
Chung, K. L. (1967), Markov Chains With Stationary Transition Probabilities (2nd ed.), Berlin: Springer-Verlag.
Cox, D. R., and Snell, E. J. (1982), Applied Statistics: Principles and Examples, New York: Chapman and Hall.
Draper, D. (1995), "Assessment and Propagation of Model Uncertainty" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 45-97.
Draper, N. R., and Smith, H. (1981), Applied Regression Analysis (2nd ed.), New York: Wiley.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.
Ehrlich, I. (1973), "Participation in Illegitimate Activities: A Theoretical and Empirical Investigation," Journal of Political Economy, 81, 521-565.
Freedman, D. A. (1983), "A Note on Screening Regression Equations," The American Statistician, 37, 152-155.
Freedman, D. A., Navidi, W. C., and Peters, S. C. (1986), "On the Impact of Variable Selection in Fitting Regression Equations," in On Model Uncertainty and Its Statistical Implications, ed. T. K. Dijkstra, Berlin: Springer-Verlag.
George, E. I., and McCulloch, R. E. (1993), "Variable Selection via Gibbs Sampling," Journal of the American Statistical Association, 88, 881-890.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107-114.
Hamilton, L. C. (1993), Statistics With Stata 3, Belmont, CA: Duxbury Press.
Hocking, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-51.
Hodges, J. S. (1987), "Uncertainty, Policy Analysis, and Statistics," Statistical Science, 2, 259-291.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1995), "Simultaneous Variable and Transformation Selection in Linear Regression," Technical Report 9506, Colorado State University, Dept. of Statistics.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1996), "A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression," Computational Statistics and Data Analysis, 22, 251-270.
Jeffreys, H. (1961), Theory of Probability (3rd ed.), London: Oxford University Press.
Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S., and Peters, S. C. (1980), "Interactive Elicitation of Opinion for a Normal Linear Model," Journal of the American Statistical Association, 75, 845-854.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247-262.
Leamer, E. E. (1978), Specification Searches, New York: Wiley.
Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535-1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
Miller, A. J. (1984), "Selection of Subsets of Regression Variables" (with discussion), Journal of the Royal Statistical Society, Ser. A, 147, 389-425.
Miller, A. J. (1990), Subset Selection in Regression, New York: Chapman and Hall.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression" (with discussion), Journal of the American Statistical Association, 83, 1023-1036.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Moulton, B. R. (1991), "A Bayesian Approach to Regression Selection and Estimation With Application to a Price Index for Radio Services," Journal of Econometrics, 49, 169-193.
Murphy, A. H., and Winkler, R. L. (1977), "Reliability of Subjective Probability Forecasts of Precipitation and Temperature," Applied Statistics, 26, 41-47.
Neter, J., Wasserman, W., and Kutner, M. (1990), Applied Linear Statistical Models, Homewood, IL: Irwin.
Raftery, A. E. (1988), "Approximate Bayes Factors for Generalized Linear Models," Technical Report 121, University of Washington, Dept. of Statistics.
Raftery, A. E. (1996), "Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models," Biometrika, 83, 251-266.
Raiffa, H., and Schlaifer, R. (1961), Applied Statistical Decision Theory, Cambridge, MA: MIT Press.
Regal, R., and Hook, E. B. (1991), "The Effects of Model Selection on Confidence Intervals for the Size of a Closed Population," Statistics in Medicine, 10, 717-721.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45-54.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Stewart, L. (1987), "Hierarchical Bayesian Analysis Using Monte Carlo Integration: Computing Posterior Distributions When There Are Many Possible Models," The Statistician, 36, 211-219.
Stewart, L., and Davis, W. W. (1986), "Bayesian Posterior Distributions Over Sets of Possible Models With Inferences Computed by Monte Carlo Integration," The Statistician, 35, 175-182.
Stigler, G. J. (1970), "The Optimum Enforcement of Laws," Journal of Political Economy, 78, 526-536.
Taft, D. R., and England, R. W. (1964), Criminology (4th ed.), New York: Macmillan.
Vandaele, W. (1978), "Participation in Illegitimate Activities: Ehrlich Revisited," in Deterrence and Incapacitation, eds. A. Blumstein, J. Cohen, and D. Nagin, Washington, D.C.: National Academy of Sciences Press, pp. 270-335.
Weisberg, S. (1985), Applied Linear Regression (2nd ed.), New York: Wiley.
Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, Amsterdam: North-Holland, pp. 233-243.