Bayesian Model Averaging for Linear Regression Models
Adrian E. RAFTERY, David MADIGAN, and Jennifer A. HOETING
We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model
ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest.
A Bayesian solution to this problem involves averaging over all possible models (i.e., combinations of predictors) when making
inferences about quantities of interest. This approach is often not practical. In this article we offer two alternative approaches.
First, we describe an ad hoc procedure, "Occam's window," which indicates a small set of models over which a model average
can be computed. Second, we describe a Markov chain Monte Carlo approach that directly approximates the exact solution. In the
presence of model uncertainty, both of these model averaging procedures provide better predictive performance than any single
model that might reasonably have been selected. In the extreme case where there are many candidate predictors but no relationship
between any of them and the response, standard variable selection procedures often choose some subset of variables that yields a
high R2 and a highly significant overall F value. In this situation, Occam's window usually indicates the null model (or a small
number of models including the null model) as the only one (or ones) to be considered, thus largely resolving the problem of
selecting significant models when there is no signal in the data. Software to implement our methods is available from StatLib.
KEY WORDS: Bayes factor; Markov chain Monte Carlo model composition; Model uncertainty; Occam's window; Posterior
model probability.
Adrian E. Raftery is Professor of Statistics and Sociology, and David Madigan is Assistant Professor of Statistics, Department of Statistics, University of Washington, Seattle, WA 98195. Jennifer Hoeting is Assistant Professor of Statistics, Department of Statistics, Colorado State University, Fort Collins, CO 80523. The research of Raftery and Hoeting was partially supported by Office of Naval Research contract N-00014-91-J-1074. Madigan's research was partially supported by National Science Foundation grant DMS 92111627. The authors are grateful to Danika Lew for research assistance and to the editor, the associate editor, two anonymous referees, and David Draper for very helpful comments that greatly improved the article.

© 1997 American Statistical Association. Journal of the American Statistical Association, March 1997, Vol. 92, No. 437, Theory and Methods.

1. INTRODUCTION

Given a dependent variable Y and a set of k candidate predictors X1, X2, ..., Xk, the standard variable selection problem is to find the "best" model of the form

\[ Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon, \]

where X1, X2, ..., Xp is a subset of X1, X2, ..., Xk. Here "best" may have any of several meanings; for example, the model providing the most accurate predictions for new cases exchangeable with those used to fit the model.

A typical approach to data analysis is to carry out a model selection exercise leading to a single "best" model and then to make inferences as if the selected model were the true model. However, this ignores a major component of uncertainty, namely uncertainty about the model itself (Draper 1995; Hodges 1987; Leamer 1978; Moulton 1991; Raftery 1988, 1996). As a consequence, uncertainty about quantities of interest can be underestimated. (For striking examples of this, see Draper 1995, Kass and Raftery 1995, Madigan and York 1995, Miller 1984, Raftery 1996, and Regal and Hook 1991.) A complete Bayesian solution to this problem involves averaging over all possible combinations of predictors when making inferences about quantities of interest. Indeed, this approach provides optimal predictive ability (Madigan and Raftery 1994). However, in many situations this is not practical.

In this article we offer two alternative approaches. First, we apply the Occam's window algorithm of Madigan and Raftery (1994) to linear regression models. We refer to this algorithm as "Occam's window." This approach involves averaging over a reduced set of models. Second, we directly approximate the complete solution by applying the Markov chain Monte Carlo model composition (MC3) approach of Madigan and York (1995) to linear regression models. In this approach the posterior distribution of a quantity of interest is approximated by a Markov chain Monte Carlo method that generates a process that moves through model space. We show in an example that both of these model averaging approaches provide better predictive performance than any single model that might reasonably have been selected.

Freedman (1983) pointed out that when there are many predictors and there is no relationship between the predictors and the response, variable selection techniques can lead to a model with a high R2 and a highly significant overall F value. By contrast, when a dataset is generated with no relationship between the predictors and the response, Occam's window typically indicates the null model as the "best" model or as one of a small set of "best" models, thus largely resolving the problem of selecting a significant model for a null relationship.

The background literature for our approach includes several areas of research: the selection of subsets of predictor variables in linear regression models (Breiman 1992, 1995; Breiman and Spector 1992; Draper and Smith 1981; Hocking 1976; Linhart and Zucchini 1986; Miller 1990; Shibata 1981), Bayesian approaches to the selection of subsets of predictor variables in linear regression models (George and McCulloch 1993; Laud and Ibrahim 1995; Mitchell and Beauchamp 1988; Schwarz 1978), and model uncertainty
(Freedman, Navidi, and Peters 1986; Leamer 1978; Madigan and Raftery 1994; Stewart 1987; Stewart and Davis 1986).

In the next section we outline the philosophy underlying our approach. In Section 3 we describe how we selected prior distributions, and we outline the two model averaging approaches in Section 4. In Section 5 we provide an example and describe our assessment of predictive performance. In Section 6 we compare the performance of Occam's window to that of standard variable selection methods when there is no relationship between the predictors and the response. Finally, in Section 7 we discuss related work and suggest future directions.

2. ACCOUNTING FOR MODEL UNCERTAINTY USING BMA

As described previously, basing inferences on a single selected model ignores model uncertainty. The Bayesian solution is to average over the candidate models: if Δ is a quantity of interest, then its posterior distribution given data D is

\[ \Pr(\Delta \mid D) = \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D), \qquad (1) \]

an average of the posterior distributions of Δ under each of the models M1, ..., MK considered, weighted by their posterior model probabilities. In (1),

\[ \Pr(M_k \mid D) = \frac{\Pr(D \mid M_k)\,\Pr(M_k)}{\sum_{l=1}^{K} \Pr(D \mid M_l)\,\Pr(M_l)}, \qquad (2) \]

where

\[ \Pr(D \mid M_k) = \int \Pr(D \mid \theta_k, M_k)\,\Pr(\theta_k \mid M_k)\, d\theta_k \qquad (3) \]

is the marginal likelihood of model Mk, θk is the vector of parameters of model Mk, Pr(θk | Mk) is the prior density of θk under model Mk, Pr(D | θk, Mk) is the likelihood, and Pr(Mk) is the prior probability that Mk is the true model. All probabilities are implicitly conditional on M, the set of all models being considered. In this article we consider M to be equal to the set of all possible combinations of predictors.

Averaging over all of the models in this fashion provides better predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj:

\[ -E\left[ \log \left\{ \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D) \right\} \right] \le -E\left[ \log \Pr(\Delta \mid M_j, D) \right], \qquad j = 1, \ldots, K, \]

where Δ is the observable to be predicted and the expectation is with respect to \( \sum_{k=1}^{K} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D) \). This follows from the nonnegativity of the Kullback-Leibler information divergence.

Implementation of Bayesian model averaging is difficult for two reasons. First, the integrals in (3) can be hard to compute. Second, the number of terms in (1) can be enormous. In this article we present solutions to both of these problems.
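To make Equations (1)-(3) concrete, the following sketch shows how posterior model probabilities and a model-averaged estimate could be computed once the marginal likelihoods are available (for linear regression they come from Equation (5) below). This is an illustrative sketch, not the authors' StatLib software; the function names and the numbers in the example are ours.

```python
import numpy as np

def posterior_model_probs(log_marglik, log_prior):
    """Normalize Pr(Mk|D) from log marginal likelihoods and log prior model
    probabilities, working on the log scale to avoid underflow."""
    log_post = np.asarray(log_marglik) + np.asarray(log_prior)
    log_post -= log_post.max()          # stabilize before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

def model_average(per_model_estimates, post_probs):
    """Model-averaged posterior mean of a quantity of interest:
    E[Delta|D] = sum_k E[Delta|Mk,D] Pr(Mk|D), as in Equation (1)."""
    return np.dot(per_model_estimates, post_probs)

# Hypothetical numbers for three models, with a uniform prior over models:
log_ml = [-102.3, -101.7, -105.9]
log_pr = np.log(np.full(3, 1.0 / 3.0))
probs = posterior_model_probs(log_ml, log_pr)
print(probs, model_average([0.42, 0.37, 0.55], probs))
```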
3. BAYESIAN FRAMEWORK

3.1 Modeling Framework

Each model that we consider is of the form

\[ Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon = X\beta + \varepsilon, \qquad (4) \]

with ε normally distributed with mean 0 and variance σ². We use a conjugate prior of the normal-gamma form,

\[ \frac{\nu\lambda}{\sigma^2} \sim \chi^2_\nu, \qquad \beta \mid \sigma^2 \sim N(\mu, \sigma^2 V). \]

Here ν, λ, the (p + 1) × (p + 1) matrix V, and the (p + 1)-vector μ are hyperparameters to be chosen.

The marginal likelihood for Y under a model Mi based on the proper priors described earlier is given by

\[ P(Y \mid \mu_i, V_i, X_i, M_i) = \frac{\Gamma\!\left(\frac{\nu+n}{2}\right)(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma\!\left(\frac{\nu}{2}\right)\,\left| I + X_i V_i X_i' \right|^{1/2}} \left[ \lambda\nu + (Y - X_i\mu_i)'\left(I + X_i V_i X_i'\right)^{-1}(Y - X_i\mu_i) \right]^{-(\nu+n)/2}, \qquad (5) \]

where Xi is the design matrix and Vi is the covariance matrix for β corresponding to model Mi (Raiffa and Schlaifer 1961). The Bayes factor for M0 versus M1, the ratio of
Equation (5) for i = 0 and i = 1, is then given by

\[ B_{01} = \left( \frac{\left| I + X_1 V_1 X_1' \right|}{\left| I + X_0 V_0 X_0' \right|} \right)^{1/2} \left( \frac{a_1}{a_0} \right)^{(\nu+n)/2}, \qquad (6) \]

where \( a_i = \lambda\nu + (Y - X_i\mu_i)'\left(I + X_i V_i X_i'\right)^{-1}(Y - X_i\mu_i) \), i = 0, 1.
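A direct, numerically naive transcription of Equations (5) and (6) (it forms the n × n matrix I + XVX' explicitly) might look as follows. This is our own illustration rather than the MC3.REG program mentioned in the Appendix; mu, V, nu, and lam stand for μi, Vi, ν, and λ.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, mu, V, nu, lam):
    """Log of Equation (5): marginal likelihood of y under one model with the
    conjugate prior  nu*lam/sigma^2 ~ chi^2_nu,  beta|sigma^2 ~ N(mu, sigma^2 V).
    X is the n x (p+1) design matrix, including the intercept column."""
    n = len(y)
    S = np.eye(n) + X @ V @ X.T                  # I + X V X'
    resid = y - X @ mu
    quad = resid @ np.linalg.solve(S, resid)     # (y - X mu)' S^{-1} (y - X mu)
    _, logdet = np.linalg.slogdet(S)
    return (gammaln((nu + n) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * n * np.log(np.pi) + 0.5 * nu * np.log(nu * lam)
            - 0.5 * logdet - 0.5 * (nu + n) * np.log(lam * nu + quad))

def log_bayes_factor_01(y, X0, mu0, V0, X1, mu1, V1, nu, lam):
    """Log Bayes factor for M0 versus M1: the ratio of Equation (5) at i = 0 and i = 1."""
    return (log_marginal_likelihood(y, X0, mu0, V0, nu, lam)
            - log_marginal_likelihood(y, X1, mu1, V1, nu, lam))
```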
3.2 Selection of Prior Distributions

The Bayesian framework described earlier gives the BMA user the flexibility to modify the prior setup as desired. In this section we describe the prior distribution setup that we adopt in our examples below.

For noncategorical predictor variables, we assume the individual β's to be independent a priori. We center the distribution of β on zero (apart from β0) and choose μ = (β̂0, 0, ..., 0), where β̂0 is the ordinary least squares estimate of β0. The prior covariance matrix of β is equal to σ² multiplied by a diagonal matrix with entries (s_Y², φ²s_1^{-2}, φ²s_2^{-2}, ..., φ²s_p^{-2}), where s_Y² denotes the sample variance of Y, s_i² denotes the sample variance of X_i for i = 1, ..., p, and φ is a hyperparameter to be chosen. The prior variance of β0 is chosen conservatively and represents an upper bound on the reasonable variance for this parameter. The variances of the remaining β parameters are chosen to reflect increasing precision about each β_i as the variance of the corresponding X_i increases, and to be invariant to scale changes in both the predictor variables and the response variable.

For a categorical predictor variable X_i with (c + 1) possible outcomes (c ≥ 2), the Bayes factor should be invariant to the selection of the corresponding dummy variables (X_{i1}, ..., X_{ic}). To this end, we set the prior variance of (β_{i1}, ..., β_{ic}) equal to σ²φ²[(1/n) X_i' X_i]^{-1}, where X_i is the n × c design matrix for the dummy variables, in which each dummy variable has been centered by subtracting its sample mean. This is related to the g prior of Zellner (1986). The complete prior covariance matrix for β is then block diagonal, with the entries given above for β0 and the noncategorical predictors and a c × c block σ²φ²[(1/n) X_i' X_i]^{-1} for each categorical predictor.
To choose the remaining hyperparameters ν, λ, and φ, we define a number of reasonable desiderata and attempt to satisfy them. In what follows we assume that all the variables have been standardized to have mean zero and sample variance 1. We would like the following desiderata to hold:

1. The prior density Pr(β1, ..., βp) is reasonably flat over the unit hypercube [-1, 1]^p.
2. Pr(σ²) is reasonably flat over (a, 1) for some small a.
3. Pr(σ² < 1) is large.

The order of importance of these desiderata is roughly the order in which they are listed. More formally, we maximize Pr(σ² < 1) subject to the following:

a. Pr(β1 = 0, ..., βp = 0) / Pr(β1 = 1, ..., βp = 1) ≤ K1. (Following Jeffreys (1961), we choose K1 = 10.)
b. {max_{a ≤ σ² ≤ 1} Pr(σ²)} / Pr(σ² = a) ≤ K2.
c. {max_{a ≤ σ² ≤ 1} Pr(σ²)} / Pr(σ² = 1) ≤ K2.

Because desideratum 2 is less important than desideratum 1, we have chosen K2 = 10.

For a = .05, this yields ν = 2.58, λ = .28, and φ = 2.85. For this set of hyperparameters, Pr(σ² < 1) = .81. We use these settings of the hyperparameters in the examples that follow.

To compare our prior for βi, i = 1, ..., p, for a noncategorical predictor with the actual distribution of coefficients from real data, we collected 13 datasets from several regression textbooks (see App. A). Figure 1 shows a histogram of the 100 coefficients from the standardized data plotted with the prior distribution resulting from the hyperparameters that we use. As desired, the prior density is relatively flat over the range of observed values.
4. TWO APPROACHES TO BAYESIAN MODEL AVERAGING

4.1 Occam's Window

Our first method for accounting for model uncertainty starting from Equation (1) involves applying the Occam's window algorithm of Madigan and Raftery (1994) to linear regression models. Two basic principles underlie this ad hoc approach.
First, if a model predicts the data far less well than the model that provides the best predictions, it has effectively been discredited and should no longer be considered. Thus we exclude from Equation (1) the models not belonging to

\[ A' = \left\{ M_k : \frac{\max_l \{\Pr(M_l \mid D)\}}{\Pr(M_k \mid D)} \le C \right\}, \qquad (7) \]

where C is chosen by the data analyst. Second, appealing to Occam's razor, we also exclude complex models that receive less support from the data than their simpler counterparts; that is, we exclude the models belonging to

\[ B = \left\{ M_k : \exists\, M_l \in A',\ M_l \subset M_k,\ \Pr(M_l \mid D) > \Pr(M_k \mid D) \right\}. \qquad (8) \]

Equation (1) is then replaced by

\[ \Pr(\Delta \mid D) = \frac{\sum_{M_k \in A} \Pr(\Delta \mid M_k, D)\,\Pr(D \mid M_k)\,\Pr(M_k)}{\sum_{M_k \in A} \Pr(D \mid M_k)\,\Pr(M_k)}, \qquad (9) \]

where

\[ A = A' \setminus B \subset M. \qquad (10) \]

This greatly reduces the number of models in the sum in Equation (1), and now all that is required is a search strategy to identify the models in A. A graphical interpretation of the posterior odds for nested models, with cut-offs O_L and O_R, is given in Figure 2.

[Figure 2. Occam's Window: Interpreting the Posterior Odds for Nested Models. Region labels in the original figure: Strong Evidence for M1, Inconclusive Evidence, Evidence for M0, separated by the cut-offs O_L and O_R.]
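Once posterior model probabilities are available for a collection of candidate models, the set A of Equations (7)-(10) can be computed directly. The sketch below is ours: it uses a naive enumeration rather than the "up"/"down" search strategy of Madigan and Raftery (1994), represents a model by the set of its predictor indices, and takes C = 20 purely as an illustrative cut-off for (7).

```python
def occams_window(models, post_prob, C=20.0):
    """models: list of frozensets of predictor indices; post_prob: the matching
    posterior model probabilities (they need only be proportional to Pr(Mk|D)).
    Returns the indices of the models in A = A' \\ B of Equations (7)-(10)."""
    best = max(post_prob)
    # Principle 1 / Eq. (7): drop models that predict far worse than the best model.
    A_prime = [k for k in range(len(models))
               if post_prob[k] > 0 and best / post_prob[k] <= C]
    # Principle 2 / Eq. (8): drop models with a better-supported strict submodel in A'.
    A = [k for k in A_prime
         if not any(models[l] < models[k] and post_prob[l] > post_prob[k]
                    for l in A_prime)]
    return A

# Toy usage with three nested models on predictors {1}, {1,2}, and {1,2,3}:
models = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3})]
print(occams_window(models, [0.50, 0.45, 0.05]))   # -> [0]
```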
4.2 Markov Chain Monte Carlo Model Composition

Our second approach approximates (1) directly using the Markov chain Monte Carlo model composition (MC3) method of Madigan and York (1995). We generate a Markov chain {M(t), t = 1, 2, ...} with state space M and equilibrium distribution Pr(Mi | D); for any function g(Mi) defined on M, the average

\[ \hat{G} = \frac{1}{N} \sum_{t=1}^{N} g(M(t)) \]

converges almost surely to E(g(M)) as N → ∞ (Smith and Roberts 1993). To compute (1) in this fashion, set g(M) = Pr(Δ | M, D).

To construct the Markov chain, we define a neighborhood nbd(M) for each M ∈ M that consists of the model M itself and the set of models with either one variable more or one variable fewer than M. Define a transition matrix q by setting q(M → M') = 0 for all M' ∉ nbd(M) and q(M → M') constant for all M' ∈ nbd(M). If the chain is currently in state M, then we proceed by drawing M' from q(M → M'). It is then accepted with probability

\[ \min\left\{ 1, \frac{\Pr(M' \mid D)}{\Pr(M \mid D)} \right\}; \]

otherwise the chain stays in state M. Madigan and York (1995) described MC3 for discrete graphical models. Software for implementing the MC3 algorithm is described in the Appendix.
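A compact sketch of the MC3 sampler just described, in our own notation, is given below. It assumes a user-supplied function log_post(model) returning log Pr(M | D) up to an additive constant (for example, the log of Equation (5) plus a log prior model probability); the "stay at M" element of nbd(M) is omitted from the proposal because proposing it would leave the state unchanged. This is an illustration, not the MC3.REG program.

```python
import numpy as np

def mc3(log_post, n_predictors, n_iter=30000, seed=0):
    """Markov chain Monte Carlo model composition: a Metropolis sampler whose
    state is a model (a frozenset of predictor indices) and whose proposal
    adds or deletes one predictor chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    current = frozenset()                      # start from the null model
    lp_cur = log_post(current)
    visits = {}
    for _ in range(n_iter):
        j = int(rng.integers(n_predictors))    # one variable more or one variable fewer
        proposal = current - {j} if j in current else current | {j}
        lp_prop = log_post(proposal)
        # accept with probability min{1, Pr(M'|D)/Pr(M|D)}
        if np.log(rng.random()) < lp_prop - lp_cur:
            current, lp_cur = proposal, lp_prop
        visits[current] = visits.get(current, 0) + 1
    return {m: c / n_iter for m, c in visits.items()}   # relative visit frequencies
```

The relative visit frequencies estimate Pr(M | D), and averaging g(M) over the visited states approximates the sum in (1).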
5. MODEL UNCERTAINTY AND PREDICTION

5.1 Example: Crime and Punishment

5.1.1 Crime and Punishment: Overview. Up to the 1960s, criminal behavior was traditionally viewed as deviant and linked to the offender's presumed exceptional psychological, social, or family circumstances (Taft and England 1964). Becker (1968) and Stigler (1970) argued that, on the contrary, the decision to engage in criminal activity is a rational choice determined by its costs and benefits relative to other (legitimate) opportunities.

In an influential article, Ehrlich (1973) developed this argument theoretically, specified it mathematically, and tested it empirically using aggregate data from 47 U.S. states in 1960. Errors in Ehrlich's empirical analysis were corrected by Vandaele (1978), who gave the corrected data, which we use here (see also Cox and Snell 1982). (Ehrlich's study has been much criticized (see, e.g., Brier and Fienberg 1980), and we cite it here for purely illustrative purposes. For economy of expression, we use causal language and speak of "effects," even though the validity of this language for these data is dubious. Because people, not states, commit crimes, these data may reflect aggregation bias.)

Ehrlich's theory goes as follows. The costs of crime are related to the probability of imprisonment and the average time served in prison, which in turn are influenced by police expenditures, which may themselves have an independent deterrent effect. The benefits of crime are related to both the aggregate wealth and the income inequality in the surrounding community. The expected net payoff from alternative legitimate activities is related to educational level and the availability of employment, the latter being measured by the unemployment and labor force participation rates. The payoff from legitimate activities was expected to be lower (in 1960) for nonwhites and for young males than for others, so that states with high proportions of these groups were expected also to have higher crime rates. Vandaele (1978) also included an indicator variable for southern states, the sex ratio, and the state population as control variables, but the theoretical rationale for inclusion of these predictors is unclear.

We thus have 15 candidate predictors of crime rate (Table 4), and so potentially 2^15 = 32,768 different models. As in the original analyses, all data were transformed logarithmically. Standard diagnostic checking (see, e.g., Draper and Smith 1981) did not reveal any gross violations of the assumptions underlying normal linear regression.

Ehrlich's analysis concentrated on the relationship between crime rate and predictors 14 and 15 (probability of imprisonment and average time served in state prisons). In his original analysis, Ehrlich (1973) focused on two regression models, consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), which were chosen in advance based on theoretical grounds.

To compare Ehrlich's results with models that might be selected using standard techniques, we chose three popular variable selection techniques: Efroymson's stepwise method (Miller 1990), minimum Mallows' Cp, and maximum adjusted R2 (Weisberg 1985). Efroymson's stepwise method is like forward selection except that when a new variable is added to the subset, partial correlations are considered to see whether any of the variables currently in the subset should be dropped. Similar hybrid methods are found in most standard statistical computer packages. Problems with stepwise regression, Mallows' Cp, and adjusted R2 are well known (see, e.g., Weisberg 1985).

Table 1 displays the results from the full model with all 15 predictors, three models selected using standard variable selection techniques, and the two models chosen by Ehrlich on theoretical grounds. The three models chosen using variable selection techniques (models 2, 3, and 4) share many of the same variables and have high values of R2. Ehrlich's theoretically chosen models fit the data less well. There are striking differences, indeed conflicts, between the results from the different models. Even the models chosen using statistical techniques lead to conflicting conclusions about the main questions of interest, despite the models' superficial similarity.

Consider first the predictor for probability of imprisonment, X14. This is a significant predictor in all six models, so interest focuses on estimating the size of its effect. To aid interpretation, recall that all variables have been transformed logarithmically, so that when all other predictors are held fixed, β14 = -.30 means roughly that a 10% increase in the probability of imprisonment produces a 3% reduction in the crime rate. The estimates of β14 fluctuate wildly between models. The stepwise regression model gives an estimate about one-third lower in absolute value than the full model, enough to be of policy importance; this difference is equal to about 1.7 standard errors. The Ehrlich models give estimates that are about one-half higher than the full model, and more than twice as big as those from stepwise regression (in absolute value). There is clearly considerable model uncertainty about this parameter.
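The elasticity interpretation quoted above follows from the log-log form of the model: with all variables logged, β14 is the elasticity of the crime rate with respect to the probability of imprisonment, so for β14 = -.30 and all other predictors held fixed,

\[ \frac{\text{new crime rate}}{\text{old crime rate}} = (1.10)^{-0.30} = \exp(-0.30 \ln 1.10) \approx \exp(-0.029) \approx 0.97, \]

that is, roughly a 3% reduction for a 10% increase in the probability of imprisonment.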
[Table 1. Crime Data: Comparison of Selected Models. Columns: Method, Variables, R2 (%), Number of variables, β14, β15, p15. NOTE: p15 is the p value from a two-sided t test for testing β15 = 0.]
Now consider β15, the coefficient of the average time served in state prisons. Whether this is significant at all is not clear, and t tests based on different models lead to conflicting conclusions. In the full model, β15 has a nonsignificant p value of .133, whereas stepwise regression leads to a model that does not include this variable. On the other hand, Mallows' Cp leads to a model in which the p value for β15 is significant at the .05 level, whereas with adjusted R2 it is again not significant. In contrast, in Ehrlich's models it is highly significant.

Together these results paint a confused picture about β14 and β15. Later we argue that the confusion can be resolved by taking explicit account of model uncertainty.

5.1.2 Crime and Punishment: Model Averaging. For the model averaging strategies, we assumed that all possible combinations of predictors were equally likely a priori. To implement Occam's window, we started from the null model and used the "up" algorithm only (see Madigan and Raftery 1994). The selected models and their posterior model probabilities are shown in Table 2. The models with posterior model probabilities of 1.2% or larger as indicated by MC3 are shown in Table 3. In total, 1,772 different models were visited during 30,000 iterations of MC3. Occam's window chose 22 models in this example, clearly indicating model uncertainty. Choosing any one model and making inferences as if it were the "true" model ignores model uncertainty. In the next section we further explore the consequences of basing inferences on a single model.

Table 2. Crime Data: Occam's Window Posterior Model Probabilities

Model (predictors included)    Posterior model probability (%)
1 3 4 9 11 13 14               12.6
1 3 4 11 13 14                  9.0
1 3 4 9 13 14                   8.4
1 3 5 9 11 13 14                8.0
3 4 8 9 13 14                   7.6
1 3 4 13 14                     6.3
1 3 4 11 13                     5.8
1 3 5 11 13 14                  5.7
1 3 4 13                        4.9
1 3 5 9 13 14                   4.8
3 5 8 9 13 14                   4.4
3 4 9 13 14                     4.1
3 5 9 13 14                     3.6
1 3 5 13 14                     3.5
2 3 4 13 14                     2.0
1 3 5 11 13                     1.9
3 4 13 14                       1.6
3 5 13 14                       1.6
3 4 13                          1.4
1 3 5 13                        1.4
3 5 13                           .7
1 4 12 13                        .7

Table 3. Crime Data: MC3, Models With Posterior Model Probabilities of 1.2% or Larger

Model (predictors included)    Posterior model probability (%)
1 3 4 9 11 13 14                2.6
1 3 4 11 13 14                  1.8
1 3 4 9 13 14                   1.7
1 3 4 5 9 13 14                 1.6
1 3 4 9 11 13 14 15             1.6
1 3 4 9 13 14 15                1.6
3 4 8 9 13 14                   1.5
1 3 4 13 14                     1.3
1 3 4 11 13                     1.2
1 3 5 11 13 14                  1.2

The top models indicated by the two methods (Tables 2 and 3) are quite similar. The posterior probabilities are normalized over all selected models for Occam's window and over all possible combinations of the 15 predictors for MC3. So the posterior probabilities for the same models differ across the model averaging methods, but this has little effect on the relationship between the models as measured by the Bayes factor.

Table 4 shows the posterior probability that the coefficient for each predictor does not equal 0, that is, Pr(βi ≠ 0 | D), obtained by summing the posterior model probabilities across models for each predictor. The results from Occam's window and MC3 are fairly close for most of the predictors. Predictors with high Pr(βi ≠ 0 | D) include proportion of young males, mean years of schooling, police expenditure, income inequality, and probability of imprisonment.

Comparing the two models analyzed by Ehrlich (1973), consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), with the results in Table 4, we see that several predictors included in Ehrlich's analysis receive little support from the data. The estimated Pr(βi ≠ 0 | D) is quite small for predictors 6, 10, 12, and 15. Two predictors (3 and 4) have empirical support but were not included by Ehrlich. Indeed, Ehrlich's two selected models have very low posterior probabilities.

Ehrlich's work attracted attention primarily because of his conclusion that both the probability of imprisonment (predictor 14) and the average prison term (predictor 15) influenced the crime rate. The posterior distributions for the coefficients of these predictors, based on the model averaging results of MC3, are shown in Figures 3 and 4. The MC3 posterior distribution for β14 is indeed centered away from 0, with a small spike at 0 corresponding to Pr(β14 = 0 | D). The posterior distribution for β14 based on Occam's window is quite similar. The spike at 0 is an artifact of our approach, in which it is possible to consider models with a predictor fully removed from the model. This is in contrast to the practice of setting the predictor close to 0 with high probability (as in George and McCulloch 1993). In contrast to Figure 3, the MC3 posterior distribution for the coefficient corresponding to the average prison term is centered close to 0 and has a large spike at 0 (Fig. 4). Occam's window indicates a spike at 0 only, that is, no support for inclusion of this predictor. By averaging over all models, our results indicate support for a relationship between crime rate and predictor 14, but not predictor 15. Our model averaging results are consistent with those of Ehrlich for the probability of imprisonment, but not for the average prison term.
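Summaries such as those in Table 4, and the union probability for the two police-expenditure measures discussed below, are simple sums of posterior model probabilities. A sketch (ours; the three models and their probabilities are just the three largest entries of Table 2, renormalized to sum to 1 for the sake of the toy example) is:

```python
def inclusion_probability(models, post_prob, predictors):
    """Pr(beta_i != 0 for at least one i in `predictors` | D): the sum of
    Pr(Mk|D) over models Mk containing any of the given predictors.
    With a single predictor this is the Pr(beta_i != 0 | D) of Table 4."""
    return sum(p for m, p in zip(models, post_prob)
               if any(i in m for i in predictors))

# Models as frozensets of predictor indices, with posterior probabilities:
models = [frozenset({1, 3, 4, 9, 11, 13, 14}),
          frozenset({1, 3, 4, 11, 13, 14}),
          frozenset({1, 3, 4, 9, 13, 14})]
post_prob = [0.42, 0.30, 0.28]
print(inclusion_probability(models, post_prob, {14}))    # Pr(beta_14 != 0 | D)
print(inclusion_probability(models, post_prob, {4, 5}))  # Pr[(beta_4 != 0) or (beta_5 != 0) | D]
```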
[Table 4. Crime Data: Pr(βi ≠ 0 | D) for the 15 candidate predictors under Occam's window and MC3. NOTE: The last column indicates the predictors included in the two models considered by Ehrlich; * corresponds to Ehrlich model 1 and ** corresponds to Ehrlich model 2.]

The model averaging results for the predictors for police expenditures lead to an interesting interpretation. Police expenditure was measured in two successive years, and the two measures are highly correlated (r = .993). The data show clearly that the 1960 crime rate is associated with police expenditures, and that only one of the two measures (X4 and X5) is needed, but they do not say for sure which measure should be used. Each model in Occam's window contains one predictor or the other, but not both. For both Occam's window and MC3, Pr[(β4 ≠ 0) ∪ (β5 ≠ 0) | D] = 1, so the data provide very strong evidence for an association with police expenditures.

Among the variables that measure the expected benefits from crime, Ehrlich concluded that both wealth and income inequality had an effect; we found this to be true for income inequality but not for wealth. For the predictors that represent the payoff from legitimate activities, Ehrlich found the effects of variables 1, 6, 10, and 11 to be unclear; he did not include mean schooling in his model. We found strong evidence for the effect of some of these variables, notably the percent of young males and mean schooling, but the effects of unemployment and labor force participation are either unproven or unlikely. Finally, the "control" variables that have no theoretical basis (2, 7, 8) turned out, satisfyingly, to have no empirical support either.

In summary, we found strong support for some of Ehrlich's conclusions but not for others. In particular, by averaging over all models, our results indicate support for a relationship between crime rate and probability of imprisonment, but not for average time served in state prisons.
The predictive coverage results for the crime data are given in Table 5. All of the individual models chosen using standard techniques performed considerably worse than the model averaging approaches, with prediction coverage ranging from 58% to 67%. Thus the model averaging strategies improved predictive coverage substantially as compared to any single model that might reasonably have been chosen. A sensitivity analysis for priors chosen within the framework described in Section 3.2 indicates that the results for ...

[Table 5. Crime Data: Predictive Coverage. Columns: Method, Model, Predictive coverage (%).]

In the foregoing example, the true answer is unknown. To further demonstrate the usefulness of BMA, we use several simulated examples. In our examples, we follow the format of George and McCulloch (1993).

Example 5.2.1. In this example we investigate the impact of model averaging on predictive performance when there is little model uncertainty. For the training set, we simulated p = 15 predictors and n = 50 observations as independent standard normal vectors. We generated the response ...
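The predictive coverage figures reported here can be estimated by counting how often held-out responses fall inside their prediction intervals. The helper below is ours; it takes the per-observation interval endpoints as given, and the nominal level mentioned in the comment is an assumption for illustration rather than a value recovered from the text.

```python
import numpy as np

def predictive_coverage(y_test, lower, upper):
    """Fraction of held-out observations falling inside their prediction
    intervals [lower_i, upper_i]; compare with the nominal level (e.g., 90%)."""
    y_test, lower, upper = map(np.asarray, (y_test, lower, upper))
    return np.mean((y_test >= lower) & (y_test <= upper))

# Usage with hypothetical intervals from two procedures on the same test set:
# coverage_bma      = predictive_coverage(y_test, bma_lower, bma_upper)
# coverage_stepwise = predictive_coverage(y_test, step_lower, step_upper)
```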
[Figure 5. Crime Data: Calibration Plot. The solid line denotes model averaging (Occam's window); the dashed line, the model with predictors 3, 4, 8, 9, 13, 15 (stepwise).]

6. SUCCESSFUL IDENTIFICATION OF THE NULL MODEL
[Table: Predictive Coverage. Columns: Method, Model, Predictive coverage (%). NOTE: Predictive coverage for BMA (all models) is estimated using the 371 models with posterior model probabilities greater than .0001; see Table 5.]
The first variable selection method applied to these data was the screening procedure described by Freedman (1983), in which all predictors with p values of .25 or lower were included in a second pass over the data. The results from this method were as follows:

* R2 = .40 and p = .0003.
* 17 coefficients out of 18 were significant at the .25 level.
* 10 coefficients out of 18 were significant at the .05 level.

These results are highly misleading, as they indicate a definite relationship between the response and the predictors, whereas in fact the data are all noise.

The second model selection method used on the full dataset was Efroymson's stepwise method. This indicated a model with 15 predictors, with the following results:

* R2 = .40 and p = .0001.
* All 15 predictors were significant at the .25 level.
* 10 coefficients out of 15 were significant at the .05 level.

Again a model is chosen that misleadingly appears to have a great deal of explanatory power.

The third variable selection method that we used was Occam's window. The only model chosen by this method was the null model.

We repeated the foregoing procedure 10 times with similar results. In five simulations, Occam's window chose only the null model. For the remaining simulations, three models or fewer were chosen along with the null model. All the nonnull models chosen had R2 values less than .15. For all of the simulations, the selection procedure used by Freedman (1983) and the stepwise method chose models with many predictors and highly significant R2 values.

At best, Occam's window correctly indicates that the null model is the only model that should be chosen when there is no signal in the data. At worst, Occam's window chooses the null model along with several other models. The presence of the null model among those chosen by Occam's window should indicate to a researcher the possibility that there is no signal in the data that he or she is analyzing.

To examine the possibility that our Bayesian approach favors parsimony to the extent that Occam's window finds no signal even when one exists, we did an additional simulation study. We generated 3,000 observations from a standard normal distribution to create a dataset with 100 observations and 30 candidate predictors. We allowed the response Y to depend only on X1, where Y = .5X1 + ε with ε ~ N(0, .75). Thus Y still has unit variance, and the "true" R2 for the model equals .20.

For these simulated data, Occam's window contained one model only: the correct model with X1. In contrast, the screening method used by Freedman produced a model with six predictors, including X1, with four of these significant at the .1 level. Stepwise regression indicated a model with two predictors, including X1, both of them significant at the .025 level. So the two standard variable selection methods indicated evidence for variables that in fact were not at all associated with the dependent variable, whereas Occam's window chose the correct model.

These examples provide evidence that Occam's window overcomes the problem of selecting apparently significant models when there is no signal in the data.
The program MC3.REG performs MCMC model composition for linear regression. The set of programs fully implements the MC3 algorithm described in Section 4.2.

[Received November 1993. Revised June 1996.]

REFERENCES

Becker, G. S. (1968), "Crime and Punishment: An Economic Approach," Journal of Political Economy, 76, 169-217.
Breiman, L. (1968), Probability, Reading, MA: Addison-Wesley.
Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738-754.
Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373-384.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression," International Statistical Review, 60, 291-319.
Brier, S. S., and Fienberg, S. E. (1980), "Recent Econometric Modeling of Crime and Punishment: Support for the Deterrence Hypothesis?," Evaluation Review, 4, 147-191.
Chatterjee, S., and Price, B. (1991), Regression Analysis by Example (2nd ed.), New York: Wiley.
Chung, K. L. (1967), Markov Chains With Stationary Transition Probabilities (2nd ed.), Berlin: Springer-Verlag.
Cox, D. R., and Snell, E. J. (1982), Applied Statistics: Principles and Examples, New York: Chapman and Hall.
Draper, D. (1995), "Assessment and Propagation of Model Uncertainty" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 45-97.
Draper, N. R., and Smith, H. (1981), Applied Regression Analysis (2nd ed.), New York: Wiley.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.
Ehrlich, I. (1973), "Participation in Illegitimate Activities: A Theoretical and Empirical Investigation," Journal of Political Economy, 81, 521-565.
Freedman, D. A. (1983), "A Note on Screening Regression Equations," The American Statistician, 37, 152-155.
Freedman, D. A., Navidi, W. C., and Peters, S. C. (1986), "On the Impact of Variable Selection in Fitting Regression Equations," in On Model Uncertainty and Its Statistical Implications, ed. T. K. Dijkstra, Berlin: Springer-Verlag.
George, E. I., and McCulloch, R. E. (1993), "Variable Selection via Gibbs Sampling," Journal of the American Statistical Association, 88, 881-890.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107-114.
Hamilton, L. C. (1993), Statistics With Stata 3, Belmont, CA: Duxbury Press.
Hocking, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-51.
Hodges, J. S. (1987), "Uncertainty, Policy Analysis, and Statistics," Statistical Science, 2, 259-291.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1995), "Simultaneous Variable and Transformation Selection in Linear Regression," Technical Report 9506, Colorado State University, Dept. of Statistics.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1996), "A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression," Computational Statistics and Data Analysis, 22, 251-270.
Jeffreys, H. (1961), Theory of Probability (3rd ed.), London: Oxford University Press.
Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S., and Peters, S. C. (1980), "Interactive Elicitation of Opinion for a Normal Linear Model," Journal of the American Statistical Association, 75, 845-854.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247-262.
Leamer, E. E. (1978), Specification Searches, New York: Wiley.
Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535-1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
Miller, A. J. (1984), "Selection of Subsets of Regression Variables" (with discussion), Journal of the Royal Statistical Society, Ser. A, 147, 389-425.
Miller, A. J. (1990), Subset Selection in Regression, New York: Chapman and Hall.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression" (with discussion), Journal of the American Statistical Association, 83, 1023-1036.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Moulton, B. R. (1991), "A Bayesian Approach to Regression Selection and Estimation With Application to a Price Index for Radio Services," Journal of Econometrics, 49, 169-193.
Murphy, A. H., and Winkler, R. L. (1977), "Reliability of Subjective Probability Forecasts of Precipitation and Temperature," Applied Statistics, 26, 41-47.
Neter, J., Wasserman, W., and Kutner, M. (1990), Applied Linear Statistical Models, Homewood, IL: Irwin.
Raftery, A. E. (1988), "Approximate Bayes Factors for Generalized Linear Models," Technical Report 121, University of Washington, Dept. of Statistics.
Raftery, A. E. (1996), "Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models," Biometrika, 83, 251-266.
Raiffa, H., and Schlaifer, R. (1961), Applied Statistical Decision Theory, Cambridge, MA: MIT Press.
Regal, R., and Hook, E. B. (1991), "The Effects of Model Selection on Confidence Intervals for the Size of a Closed Population," Statistics in Medicine, 10, 717-721.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45-54.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Stewart, L. (1987), "Hierarchical Bayesian Analysis Using Monte Carlo Integration: Computing Posterior Distributions When There Are Many Possible Models," The Statistician, 36, 211-219.
Stewart, L., and Davis, W. W. (1986), "Bayesian Posterior Distributions Over Sets of Possible Models With Inferences Computed by Monte Carlo Integration," The Statistician, 35, 175-182.
Stigler, G. J. (1970), "The Optimum Enforcement of Laws," Journal of Political Economy, 78, 526-536.
Taft, D. R., and England, R. W. (1964), Criminology (4th ed.), New York: Macmillan.
Vandaele, W. (1978), "Participation in Illegitimate Activities: Ehrlich Revisited," in Deterrence and Incapacitation, eds. A. Blumstein, J. Cohen, and D. Nagin, Washington, D.C.: National Academy of Sciences Press, pp. 270-335.
Weisberg, S. (1985), Applied Linear Regression (2nd ed.), New York: Wiley.
Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, Amsterdam: North-Holland, pp. 233-243.