A Review of Bayesian Variable Selection
A Review of Bayesian Variable Selection
A Review of Bayesian Variable Selection
85–118
1 Introduction
An important problem in statistical analysis is the choice of an optimal model from a
set of a priori plausible models. In many analyses, this reduces to the choice of which
subset of variables should be included into the model. This problem has exercised the
minds of many statisticians, leading to a variety of algorithms for searching the model
space and selection criteria for choosing between competing models (e.g. Miller 2002;
Broman and Speed 2002; Sillanpää and Corander 2002). In the Bayesian framework,
the model selection problem is transformed to the form of parameter estimation: rather
than searching for the single optimal model, a Bayesian will attempt to estimate the
posterior probability of all models within the considered class of models (or in practice,
of all models with non-negligible probability). In many cases, this question is asked in
variable-specific form: i.e. the task is to estimate the marginal posterior probability
that a variable should be in the model.
At present, the computational method most commonly used for fitting Bayesian
models is Markov chain Monte Carlo (MCMC) technique (Robert and Casella 2004).
Variable selection methods are therefore needed that can be implemented easily in the
MCMC framework. In particular, having these models implemented in the BUGS lan-
guage (either in WinBUGS or OpenBUGS) means that the methods can easily be slotted
into different models. The purpose of this paper is to review the different methods that
∗ Departmentof Mathematics and Statistics, University of Helsinki, Finland, mailto:bob.ohara@
helsinki.fi
∗ Department of Mathematics and Statistics, University of Helsink i, Finland, http://www.rni.
helsinki.fi/~mjs/
have been suggested for variable selection, and to present BUGS code for their imple-
mentation, which may also help to clarify the ideas presented. Some of these methods
have been reviewed by Dellaportas et al. (2000). We do not consider some methods, such
as Bayesian approximative computational approaches (e.g. Ball 2001; Sen and Churchill
2001) or methods based on calculating model choice criteria, such as DIC (Spiegelhal-
ter et al. 2002), because these are only feasible to use with a maximum of dozens of
candidate models. We also omit the machine learning literature focusing on finding
maximum a posteriori estimates for parameters (e.g. Tipping 2004).
Although methods for variable selection are reviewed here, this should not be taken
to imply that they should be used uncritically. In many studies, the variables in the
regression have been chosen because there is a clear expectation that they will influence
the response, and the problem is one of inferring the strength of influence. For these
studies, the best strategy may therefore be to fit the full model, and then interpret the
sizes of the posterior estimates of the parameters in terms of their importance. In other
studies the purpose is more exploratory: seeing what the analysis throws out from a
large number of candidates. This is not always an unreasonable approach. One clear
example where this is a sensible way to proceed is in gene mapping, where it is assumed
that there are only a small number of genes that have a large effect on a trait, while most
of the genes have little or no effect. The underlying biology is therefore sparse: only a
few factors (i.e. genes) are expected to influence the trait. The distinction between this
case, and many other regression problems is in the prior distribution of effect sizes: for
gene mapping, the distribution is extremely leptokurtic, with a few large effects, but
most regression coefficients being effectively zero. The prior is also exchangeable over
the loci (i.e. we are ignorant a priori of where any influential gene might be located).
Conversely, in many studies the expectation may be of a slow tapering of effects, with
no clear tail to the distribution, and often more substantial information about the likely
size of the effects. This implies that a different set of priors should be considered in these
two situations: this paper examines different options for the exploratory case where the
prior may be leptokurtic.
This review is structured so that we first set out the general regression model. We
then describe the different variable selection methods, and some of their properties.
Then we describe three examples, using simulated and real data sets, to illustrate the
performance of the different methods. Finally, we wrap up by discussing the relative
merits of these methods, and indicate when different methods might be preferred. BUGS
code for the methods is given in Supplementary Material online, as are some supple-
mentary figures.
R.B. O’Hara and M. J. Sillanpää 87
One aspect of the problem is the well known trade-off between bias and variance. In
general, a large set of variables is desirable for prediction and a small set of variables
(that have a meaningful interpretation) for inference. Another aspect that influences
the number of variables in the model is the number of observations in the data set. As
a rule of thumb for shrinkage methods, one can safely consider only problems where
there are maximally 10-15 times more candidates than observations (Zhang and Xu
2005; Hoti and Sillanpää 2006). However, where the true and safe upper limit exists,
is very problem specific and depends on degree of correlation (co-linearity) among the
candidate variables.
p
X
yi = α + θj xi,j + ei . (1)
j=1
Here, α is the intercept and ei ∼ N (0, σ 2 ) are the errors. We can assume more
generally that yi is a member of the exponential family of distributions, giving us a
generalized linear model. In such a case, we take the usual link function g(·) so that
E(g(yi )) = ηi and ηi equals the right hand side of the linear model (1) without the
error term. The variable selection procedure can be seen as one of deciding which of
the regression parameters, the θj s, are equal to zero. Each θj should therefore have a
“slab and spike” prior (Miller 2002), with a spike (the probability mass) either exactly
at or around zero, and a flat slab elsewhere. For this, we may use an auxiliary indi-
cator variable Ij (where Ij = 1 indicates presence, and Ij = 0 absence of covariate j
88 Bayesian Variable Selection Methods
in the model) to denote whether the variable is in the slab or spike part of the prior.
A second auxiliary variable, the effect size βj , is also needed for most of the methods,
where βj = θj when Ij = 1 (e.g. by defining θj = Ij βj ). When Ij = 0, the variable βj
can be defined in several ways, and this leads to the differences between the methods
outlined below. For the moment we will assume θj = Ij βj , to simplify the explanation,
and change it later on when needed (e.g. for SSVS below).
Once the model has been set up, it is usually fitted using MCMC, and many of the
methods outlined below use a single-site Gibbs sampler to do this. The variable selection
part of the model entails estimating Ij and θj . From this, the posterior probability that
a variable is “in” the model, i.e. the posterior inclusion probability, can simply be
calculated as the mean value of the indicator Ij . The methods outlined below vary in
how they treat Ij , βj and θj .
Sparseness
The degree of sparseness required, i.e. how complex the model should be is an impor-
tant property. In some cases, the sparseness may be set according to some optimality
criterion (e.g. the best predictive abilities, (Burnham and Anderson 2002)). Taking a
decision analytic perspective, we can view the prior as providing a loss function, so un-
less an objective optimality criterion can be found, it is not clear that one loss function
is appropriate for all circumstances. Therefore, some flexibility in the amount of model
complexity allowable is needed. An obvious approach to this is to use P (Ij = 1), the
prior probability of variable inclusion, to set the sparseness: a smaller value of P (Ij = 1)
leading to sparser models. Typically, this will be independent across the js, so that the
prior distribution for the number of covariates is binomial, with mean P (Ij = 1). A
value of 0.5 for this has been suggested for P (Ij = 1) (e.g. George and McCulloch
1993), which makes all models equiprobable. Whilst this may improve mixing proper-
ties of the MCMC sampler and may appear attractive as a null prior, it is informative
in that it favours models where about half of the variables are selected. In many cases,
only a small proportion of variables are likely to be required in the model, so this prior
may often bias the model towards being too complex. The choice of value for P (Ij = 1)
is then up to the investigator, in some cases a decision analytic approach may be a good
way of eliciting the prior.
R.B. O’Hara and M. J. Sillanpää 89
Tuning
A practical problem in implementing variable selection is tuning the model (by adjust-
ing different components, such as the prior distribution) to ensure good mixing of the
MCMC chains, i.e. letting the sampler jump efficiently between the slab and spike. If
single-site updaters are being used, this means updating Ij given a value of βj . This
relies on confounding of Ij and βj , so that θj ≈ 0 for both Ij = 1 and Ij = 0, and hence
the updater for Ij can jump between states easily (and θj be updated subsequently).
In general, this will depend on the prior for θj , so the mixing properties of the sampler
depend on the prior distributions. This has lead to the suggestion that data-based pri-
ors should be designed with the purpose of improving mixing (see references below) or
giving good centering and scaling properties (e.g. a fractional prior, Smith and Kohn
2002). Although this is attractive from the computational point of view, it contravenes
the Bayesian philosophy, as the prior should be a summary of the beliefs of the analyst
(before seing the data), not a reflection of the ability of the fitting algorithms to do
their job. One goal of this review is to find out under what circumstances the different
methods work efficiently, so that philosophically correct (subjective) priors can be de-
signed properly. This may require a trade-off, with a sub-optimal model being used, in
order to accommodate better (philosophically plausible) priors.
Several of the methods below may exhibit problems in the marginal identifiability
(i.e. confounding) of variables Ij and βj . This can occur because almost identical likeli-
hoods can be obtained for Ij = 1 and Ij = 0 when βj is near zero, as illustrated above.
Deliberate confounding of variables can thus be used as a strategy to improve mixing.
Because of this, though, it may be more informative to monitor the posterior of the
product θj = Ij × βj instead of the individual variables (Sillanpää and Bhattacharjee
2005), i.e. monitor the parameter that appears in equation 1.
Global adaptation
A natural Bayesian strategy for building a model would be to place a normal prior on
βj | (Ij = 1). If the variance of this prior is fixed at a constant, the model for the data
would be equivalent to a classical fixed effects model, a terminology we will adopt here.
But variable selection approaches can be developed where the variance is estimated as
well. In some circumstances, this can be done by extending the model (1) above as
a hierarchical model; considering the regression coefficients to be exchangeable and be
drawn from a common distribution, e.g. βj | (Ij = 1) is drawn from N (0, τ 2 ), where τ 2
is an unknown parameter to be estimated. We will follow the terminology in classical
statistics and refer to this as a random effect model. This approach has the advantage
of helping tuning, for example if we define θj = βj Ij , then βj | (Ij = 0) will also depend
on τ 2 . The distribution of θj is thus shrunk towards the correct region of the param-
eter space by the other θj s. This can help circumvent tuning problems, at the cost of
increasing the confounding of Ij and βj , as is discussed more below.
90 Bayesian Variable Selection Methods
Local adaptation
Instead of fitting one common parameter for all the regression coefficients, as in global
adaptation, one can use different variance parameters for different covariates (or for co-
variate groups). This helps tuning and allows heterogeneity, i.e., if there are some local
characteristics among covariates. An example of this is where βj | (Ij = 1) is drawn
from N (0, τj2 ), where τj2 is an covariate-specific variance parameter to be estimated (cf.
Example 2 in (Iswaran and Rao 2005)).
Analytical integration
To speed up the convergence of the MCMC (and mixing with respect to selected co-
variates) in some of the model selection methods described above, it is possible to
analytically integrate over the effects θ and σ 2 in model (1), and then use Gibbs sam-
pling for the indicator variables (George and McCulloch 1997). Fast mixing is possible
because updating an indicator does not depend on values of the effect coefficients. If
preferred, the posterior for effect coefficients can still be obtained. Additionally in high
dimensional problems, the coefficients βj only need to be updated for covariates with
Ij = 1 (e.g. Yi 2004).
One characteristic of the Bayesian approach is the ability to marginalize over nuisance
parameters. This carries over into model selection, where posterior distributions of pa-
rameters (including indicators for models) can be calculated by averaging over all of the
other variables, i.e. over the different models. For example, as we will do below, the
probability that a single variable should enter a model can be averaged over all of the
models. Of course, this is convenient in the MCMC framework as it just means that
the calculations can be done on the MCMC output for each indicator (i.e. Ij ) separately.
It is also well known that, with many covariates, it is the ones that have a large effect
that are selected, even if support for estimated effect size is large by chance (e.g. Miller
2002; Lande and Thompson 1990; Göring et al. 2001). Hence, if the same data is used
for estimating both the the model (i.e. variable selection) and individual contributions
(effect sizes), overestimation of effect sizes will almost certainly occur. Fortunately,
robust estimation of effect sizes can be done in a Bayesian setting by averaging the
effect size over several different models (e.g. Ball 2001; Sillanpää and Corander 2002).
Given posterior model probabilities, model-specific effect estimates are weighted by the
probability of the corresponding model. This is an especially useful technique if there
are several competing models which all show high posterior probabilities.
R.B. O’Hara and M. J. Sillanpää 91
The most direct approach to variable selection is to set the slab, θj | (Ij = 1) equal to
βj , and the spike, θj | (Ij = 0) equal to 0. This approach has spawned two methods,
differing in the way they treat βj | (Ij = 0):
Kuo & Mallick. The first method simply sets θj = Ij βj (Kuo and Mallick
(1998)). This assumes that the indicators and effects are independent a priori, so
P (Ij , βj ) = P (Ij )P (βj ), and independent priors are simply placed on each Ij and βj .
The MCMC algorithm to fit the model does not require any tuning, but when Ij = 0,
the updated value of βj is sampled from the full conditional distribution, which is its
prior distribution. Mixing will be poor if this is too vague, as the sampled values of βj
will only rarely be in the region where θj has high posterior support, so the sampler
will only rarely flip from Ij = 0 to Ij = 1. This method has been used for applications
in genetics by Uimari and Hoeschele (1997) and with local adaptation by Sillanpää and
Bhattacharjee (2005, 2006). Smith and Kohn (2002) use this approach to model sparse
covariance matrices for longitudinal data (where they use a Cholesky decomposition of
the covariance matrix, which reduces the problem to one of variable selection).
GVS. An alternative model formulation called Gibbs variable selection (GVS) was
suggested by Dellaportas et al. (1997), extending a general idea of Carlin and Chib
(1995). It attempts to circumvent the problem of sampling βj from too vague a prior
by sampling βj | (Ij = 0) from a “pseudo-prior”, i.e. a prior distribution which
has no effect on the posterior. This is done by setting θj = Ij βj as before, but
now the prior distributions of indicator and effect are assumed to depend on each
other, i.e. P (Ij , βj ) = P (βj | Ij )P (Ij ). In effect, a mixture prior is assumed for βj :
P (βj | Ij ) = (1 − Ij )N (µ̃, S) + Ij N (0, τ 2 ) (here and elsewhere we will loosely use N (·, ·)
to denote both a normal distribution and its density function), where constants (µ̃, S)
are user-defined tuning parameters, and τ 2 is a fixed prior variance of βj . The intuitive
idea is to use a prior for βj | (Ij = 0) which is concentrated around the posterior density
of θ, so that when Ij = 0, P (βj | Ij = 1) is reasonable large, and hence there is a good
probability that the chain will move to Ij = 1. The algorithm does require tuning, i.e.
(µ̃, S) need to be chosen so that good values of βj are proposed when Ij = 0. The data
will determine which values are good but without directly influencing the posterior, and
hence tuning can be done to improve mixing without changing the model’s priors.
92 Bayesian Variable Selection Methods
In this approach, the spike is a narrow distribution concentrated around zero. Here
θj = βj and the indicators affect the prior distribution of βj , i.e., P (Ij , βj ) = P (βj |
Ij )P (Ij ). A mixture prior for β is used: P (βj | Ij ) = (1 − Ij )N (0, τ 2 ) + Ij N (0, gτ 2 ),
where the first density (the spike) is centred around zero and has a small variance. This
model gives identifiability for variables Ij and βj , but in order to obtain convergence the
algorithm requires tuning - specification of fixed prior parameters (τ 2 and gτ 2 ) which
are data (or at least context) dependent. Note that unlike in GVS, values of the prior
parameters when Ij = 0 influence the posterior. Tuning is not easy, as P (βj | Ij = 0)
needs to be very small but at the same time not too restricted around zero (otherwise
Gibbs sampler moves between states Ij = 0 and Ij = 1 are not possible in practice).
The technique was introduced by George and McCulloch (1993) and extended for mul-
tivariate case by Brown et al. (1998). It has seen extensive use, for example see Yi et al.
(2003); Meuwissen and Goddard (2004) for applications to gene mapping.
Meuwissen and Goddard (2004) introduced (in multivariate context) a random ef-
fects variant of SSVS where τ 2 was taken as a parameter to be estimated in the model
with own prior, and g was fixed at 100. A natural alternative would be to fix τ 2 , and
(in effect) estimate g, in practice by placing a prior on the product gτ .
Adaptive shrinkage
A different approach to inducing sparseness is not to use indicators in the model, but
instead to specify a prior directly on θj that approximates the “slab and spike” shape.
Hence, θj = βj , with prior βj | τj2 ∼ N (0, τj2 ), and a suitable prior is placed on τj2 to
give the appropriate shape to P (βj ). The prior should work by shrinking values of βj
towards zero if there is no evidence in the data for non-zero values (i.e. the likelihood
is concentrated around zero). Conversely, there should be practically no shrinkage for
data-supported values of covariates that are non-zero. The method is adaptive in the
sense that a degree of sparseness is defined by the data, through the way it shrinks the
covariates effects towards zero. The degree of sparseness of the model can be adjusted
by changing the prior distribution of τj2 (either by changing the form of the distribution
or the parameters). Tuning in this way may also affect the mixing of the MCMC chains.
A problem is that there is no indicator variable to show when a variable is ’in’ the model,
however one can be constructed by setting a standardised threshold, c, such that Ij = 1
if | βj |> c (cf. Hoti and Sillanpää 2006).
Jeffreys’ prior. A scale invariant Jeffreys’ prior, P (τj2 ) ∝ τ12 , provides one method
j
for adaptive shrinkage. Theoretically, the resulting posterior is not proper
(e.g. Hopert and Casella 1996; ter Braak et al. 2005), although a proper approximation
can be made by giving finite limits to P (τj2 ) (see below). There is no tuning parameter in
the model, which is either good or bad: the slab part of the prior is then uninformative
but cannot be adjusted. See Xu (2003) for introduction and application of this method
R.B. O’Hara and M. J. Sillanpää 93
to gene mapping. See Zhang and Xu (2005) for penalized ML equivalent of the method.
Laplacian shrinkage. An alternative to using the Jeffreys’ prior on the variance
is to use an exponential prior for τj2 with a parameter µ. After analytical integration
over the variance components, we obtain a Laplacian double exponential distribution
for P (βj | µ); for details, see Figueiredo (2003). The degree of sparseness is controlled
by µ which has a data dependent scale and requires tuning. The random effect variant
of the method, where µ is a parameter and has its own prior, is better known as the
Bayesian Lasso (Park and Casella 2008; Yi and Xu 2008). The Lasso (Tibshirani 1996)
is the frequentist equivalent of this approach.
The models above are developed through placing priors on the individual covariates
θj s. An alternative approach is to view the model as a whole, and place priors on Nv ,
the number of covariates selected in the model and their coefficients (θ1 , . . . , θNv ), and
then allow the choice of which covariate it is that is in the model to be a secondary
problem. This approach can reduce to the models above, if the number of covariates
in the model is chosen to be binomially distributed with Nmax equal to the number
of candidate covariates, p. However, it is often computationally more convenient to
use a lower Nmax , i.e. to restrict the maximum number of covariates possible. The
advantage of this approach is that the likelihood is smaller, as one only
PNv needs to sum
over the selected variables, replacing the summation in model (1) by k=1 θk xi,lk . The
number of selected variables, Nv , is then itself a random variable, and sparseness can
be controlled through the prior distribution of Nv .
Reversible jump MCMC. Reversible jump MCMC is a flexible technique for
model selection introduced by Green (1995), which lets the Markov chain explore spaces
of different dimension. For variable selection, the positions (indices) of the selected vari-
ables are defined as l1 , .., lNv , and the model is updated by randomly selecting variable j
and then proposing either addition to (Nv := Nv +1) or deletion from (Nv := Nv −1) the
model of the corresponding effect. The length of the vector of θj s is therefore not fixed
but varies during the estimation. The updating is done using a Metropolis-Hastings
algorithm, but with the acceptance ratio adjusted for the change in dimension. The
degree of sparseness can be controlled by setting the prior for Nv : using a binomial
distribution is then approximately the same as setting a constant, independent, prior
for each Ij . Reversible jump MCMC has been applied in many areas, and its use is
wider than just selecting variables in a regression. See Sillanpää and Arjas (1998) for
application to gene mapping and the paper by Lunn et al. (2006) for a WinBUGS ap-
plication. See also the brief note of Sillanpää et al. (2004).
Composite model space (CMS). A problem with reversible jump MCMC is that
the change in dimension increases the complexity of the algorithm. This can be circum-
vented by fixing Nmax to something less than p, but to use indicator variables to allow
94 Bayesian Variable Selection Methods
Table 1: The qualitative classification of the variable selection methods with respect to
speed, mixing and separation in our tested examples (AVE: average, EXC: Excellent).
P
covariates to enter or leave the model (with the constraint j Ij ≤ Nmax ). As in indi-
cator model selection above, θj = Ij βj and both a priori independence or dependence
between indicators and effects can be assumed. Because the maximum dimension is
fixed,
P the indicators, Ij , are mutually dependent, with maximum and minimum values
of j Ij being set. Variable selection is then performed by randomly selecting com-
ponent j and then proposing a change of the indicator value Ij (this corresponds to
adding or deleting the component). The prior for the number of components can be set
in the same way as in reversible jump MCMC. The method was introduced by Godsill
(2001), and has been used by Yi (2004) in a gene mapping application. See Kilpikari
and Sillanpää (2003) for a closely related approach from the reversible jump MCMC
stand point.
The different variable selection methods work by specifying the priors for βj , and
possibly other auxiliary variables. Interest lies in both the estimates of the parame-
ters (in particular, whether they are consistent across models) and the behaviour of
the MCMC, i.e. how long the runs take, how well the chains mix, etc. For all three
examples, short runs were used to estimate running time and quality of mixing, and
then a longer run (chosen to give reasonable level of mixing), was used to obtain pos-
terior distributions of the parameters, which could then be examined to see how well
the methods classified variables as being included in the models, and also whether the
R.B. O’Hara and M. J. Sillanpää 95
For all three sets of data, the equation (1) formed the basis of the model. However,
for the simulated data where a generalized linear model was used, equation 1 gives the
expected value on the linear scale for each datum.
p
X
ηi = α + θj xi,j + ei , (2)
j=1
where log(λi ) = ηi with the observed counts yi ∼ P oisson(λi ) and the (over-
dispersion) errors ei ∼ N (0, σe2 ).
For the simulations, known values of α = ln(10) and σe2 = 0.752 were used. The
covariate values, the xi,j ’s, were simulated independently from a standard normal dis-
tribution, N(0,1). The regression parameters, θj , were generated according to a tapered
model, i.e. θj = a + b(j/10.5 − 1), with a = 0.3 and b = 0, 0.05, 0.1, ..., 0.3 for each data
set: this gave them a mean of 0.3, and a range between 0 and 0.6.
The same model (2), was used to analyse the simulated data sets with prior distri-
butions specified as below and in Table 2.
127-way ANOVA, with a normally distributed response and 127 two-level factors. Be-
cause the model is almost saturated (127 covariates, 150 data points), this is the type
of problem where an efficient variable selection scheme is necessary.
Some discrete marker genotypes are missing (in total 5% of the covariates values,
with all individuals having at least 79% of their covariate information observed). A
model for the missing covariate data (i.e. xi,j s) is therefore needed. Because of the de-
sign of the crosses, for each covariate, the two alleles (i.e. genotype classes) are equally
likely, so we assume that the missing data are missing completely at random (MCAR),
and assume xi,j ∼ Bern(0.5). For simplicity we assume that the covariates are in-
dependent, although in reality dependence will be present as the genetic markers are
sometimes close to each other on the chromosome (Fig. 4). A more complex model (e.g.
Knapp et al. 1990; Sillanpää and Arjas 1998) would be preferable for a “real” analysis.
We did not try local adaptation in any of the methods as it is likely to behave very
similarly to adaptive shrinkage. However, we tried two versions of the methods, with
and without global adaptation, i.e., varying the way we treated σβ2 above. For the fixed
R.B. O’Hara and M. J. Sillanpää 97
effects model σβ2 is given a constant value, but in the random effects model it has a
distribution, so the standard deviation is given a uniform prior:
The form of the prior distribution for βj and the indicator Ij depends on which vari-
able selection method is used. Because the adaptive shrinkage method with the Jeffreys’
prior has no parameters that can be adjusted to change the degree of sparseness, this
model was used as a benchmark for the analysis of each set of data. For this model, the
posterior of several of the parameters is bimodal (this corresponds to covariates where
P (Ij = 1 | data) is not near 0 or 1), and a suitable cut-off, c could be chosen by visual
examination, so that |βj | < c would be equivalent to Ij = 0 (cf. Hoti and Sillanpää
2006). From this, the number of non-zero components (i.e. number of estimated values
of |βj | above c) was estimated and rounded to give a prior for Ij . P (Ij = 1)(= p) and
c are also given in Table 2. This approach to prior specification was taken to help give
consistency in the comparisons: clearly it should not be used for actual analyses.
No Selection
The model with no model selection was used as a baseline for comparison. The vague
priors were essentially those for Ij = 1, ∀j, i.e. equation 4 for β for the fixed effect, and
equations 4 and 5 for the random effect model (viz., similar to ridge regression).
The method of Kuo and Mallick (1998) was implemented using Ij as a number (0 or 1),
and setting θj = Ij βj . A mathematically equivalent implementation would use Ij as an
98 Bayesian Variable Selection Methods
Table 2: Parameters of prior distributions for different variable selection methods for
three different sets of data.
indicator:
½
0 if Ij = 0
θj = (6)
βj if Ij = 1
This was also investigated, but the performance was the same in either case, so only
results from the first implementation are reported. The other priors (e.g. for βj ) are as
above, for both the fixed and random effect models.
GVS
For GVS a pseudo-prior is needed for Ij = 0, otherwise the model is the same as
the Kuo & Mallick model. For this, for both the fixed and random effect models,
2
βj | (Ij = 0) ∼ N (0, σGV S ) was used.
SSVS
The priors for βj | (Ij = 1) are as above for both the fixed and random effect models.
Both random effect models suggested above were tried. For the fixed effect model and
the first random effect model, for Ij = 0 the prior for βj was constructed so that
P (|βj | < c) < 0.01, by setting it to be 3 standard deviations away from the mean,
i.e. βj | (Ij = 0) ∼ N (0, (3 × c)2 ). For the second random effect model (i.e. due to
Meuwissen and Goddard 2004) we used g = 10−3 . This second model is referred to as
M & G.
Only a single version of this adaptive shrinkage method is possible. The prior for τj2
was log(τj2 ) ∼ U (−50, 50), for all sets of data, which is an finite approximation to the
fully correct method and should cover the realistic range of τj2 (approximately 10−22 to
1022 ).
R.B. O’Hara and M. J. Sillanpää 99
Laplacian shrinkage
A prior on the scale (µ) is needed for this model. For the fixed effect, this was designed
so that a priori P (βj > c) ≈ p. This lead to the prior |βj | ∼ Exp(−log(1 − p)/c).
For the random effect version (i.e. the Bayesian LASSO), a uniform prior U(0,20) was
placed on µ.
The priors specified above were used for both fixed and random effects. The prior is
given on the number of variables, Nv , in the model, so this was a binomial distribution
with P (Nv ) ∼ Bin(m, p).
The priors defined above were used, but a maximum of 40 variables was set. The results
of the short run for the Barley data showed that composite model space was too slow to
be useful, being about three times slower than any of the other methods, and with poor
mixing (the fixed effect model had not even converged after 1500 MCMC iterations).
Full runs were therefore not attempted.
The speed of the Composite Model Space in BUGS is due to the way that BUGS im-
plements the model, rather than an intrinsic problem with the model. BUGS compiles
all logical nodes fully, so that for each variable in the model, the node has to include
every covariate in its calculation. Hence, each of the possible combinations of covariates
is included, and so the likelihood quickly becomes excessively large. Implementations
coded from scratch will therefore be much quicker.
From the short runs, the full runs were designed to have a burn-in and thinning suf-
100 Bayesian Variable Selection Methods
ficient to give good mixing (a minimum burn-in of 500 MCMC iterations was used, and
thinning to keep between every fifth and every fiftieth MCMC iteration). The choice
of thinning depended on the effective number of MCMC samples, the number of runs,
and a visual inspection of the MCMC chain histories, to check the mixing. On the basis
of the results, the full runs were thinned to every 20 iterations, with the exception of
the No Selection (fixed effect version) and adaptive shrinkage with Jeffreys’ prior (both
thinned to every 10 iterations), and the random effect version of Kuo & Mallick, fixed
effect version of SSVS and both reversible jump MCMC methods (all thinned to 50).
The same thinning was used for both sets of priors.
The full data were examined to see how well the methods worked. In particular,
how efficiently they separated the variables into those to be included in the model,
and those to be excluded. Ideally P (Ij | data) should be close to 0 or 1, with very
few intermediate values. The posterior probabilities can also be represented as Bayes
Factors (e.g. Kass and Raftery 1995), and the categories of Kass & Raftery can be used
to judge the strength of evidence that a variable should be included in a model. The
estimates of the regression parameters should also be consistent across methods: we
would expect the different methods to give the same estimates.
Influence of Tuning
As indicated above, some of the methods require tuning to obtain good mixing. These
were examined, in particular looking at the following properties by varying the variables:
The mixing of the MCMC chain in the GVS and SSVS models was measured by the
number of runs in all of the indicators, Ij whilst the effect of the scale of the Laplacian
prior was investigated by examining the posterior distribution of the βj s.
4 Results
An overall qualitative assesment of different aspects (computational speed, efficiency
of mixing and separation) of the performance of the methods in the three data sets is
summarized in Table 1.
R.B. O’Hara and M. J. Sillanpää 101
10
150
d
b
Effective Number of
8
Samples for α
100
6
log(Total Number of Runs for I )
4
50
2
0
0
250
Pollution Data
Barley Data
200
6
150
4
100
2
50
a
c
0
0
Reversible Jump, Random
Reversible Jump, Fixed
Laplacian, Random
Laplacian, Fixed
Jeffreys
SSVS, M & G
SSVS, Random
SSVS, Fixed
GVS, Random
GVS, Fixed
Kuo & Mallick, Random
Kuo & Mallick, Fixed
No Selection, Random
No Selection, Fixed
Figure 1: Statistics for short runs for three data sets (simulated data as boxplots, Barley
and Pollution data as lines) and 11 variable selection methods. (a) Standardized times to
run 1000 MCMC iterations (standardised to have a mean of 1), (b) Estimated effective
number of MCMC samples for α (for adaptive shrinkage using Jeffreys’ prior applied to
Simulated data and all Pollution results > 150), (c) Total numbers of runs for indicator
variables, Priors set 1, (d) Total numbers of runs for indicator variables, Priors set 2.
The fixed effect variant of Kuo & Mallick method was not run for the simulated data.
102 Bayesian Variable Selection Methods
The number of changes in state of the indicators was generally similar. The fixed
effect variants tended to perform poorly, so using a random effects prior (global adapta-
tion) improved the mixing. The effect of the hierarchical variance is to pull the posteriors
for the βj ’s towards the right part of the parameter space, so that when Ij = 0, βj is
being sampled from close to the correct part of the parameter space. It is interesting
that the fixed effect GVS method does not exhibit good mixing.
The posterior distribution of the variable with the largest true effect size is shown
in Fig. 3. The conditional distributions (i.e. P (βj | Ij = 1, data)) are similar: the
principal difference being the larger uncertainty in the Laplacian estimates. The figures
are similar for the second set of priors (see Fig. 2 in Supplementary Material), except
that in the case of the random effects SSVS P (βj | Ij = 0, data) and P (βj | Ij = 1, data)
are very similar.
Fixed
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
1.0 1.0 1.0 1.0
Random
0.8 0.8 0.8 0.8
0.6 0.6 0.6 0.6
0.4 0.4 0.4 0.4
True Effect Size
Fixed
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
1.0 1.0 1.0
Random
0.8 0.8 0.8
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
Pr(Ij=1 | data)
Figure 2: Posterior inclusion probabilities for simulated data plotted against true values
of the coefficients. Lines show the fitted curves (quasi-binomial generalized linear model
with a logit link). Black and dots: priors set 1, cyan and crosses: priors set 2.
104 Bayesian Variable Selection Methods
0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 −0.10 0.00 0.05 0.10 0.15
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.0 0.2 0.4 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0 0.2 0.4 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.1 0.2 0.3 0.4 0.5
0.0 0.2 0.4 0.6 0.8 0.0 0.1 0.2 0.3 0.4 0.5 0.6
βj | Ij, data
Figure 3: Posterior distributions for regression coefficients, βj , for the variable with the
largest true effect size in the Simulated data, with vague priors. Posterior mode and
70% highest posterior density interval. Black: fixed effect models, cyan: random effect
models. Strong colours (i.e. black and red) denote P (βj | Ij = 1, data), Lighter colours
(i.e. grey and light cyan) denote P (βj | Ij = 0, data). The dotted line shows the true
effect size of the simulated coefficients. Numbers in plots are b in the simulations of the
data: a higher number equates to a larger variation in the true coefficients (see text for
details).
R.B. O’Hara and M. J. Sillanpää 105
Priors 1 Priors 2
1.00
No Selection
0.75
0.50
0.25
0.00
1.00
0.50
0.25
0.00
1.00
0.75
GVS
0.50
0.25
0.00
1.00
Pr(Ij=1 | data)
0.75
SSVS
0.50
0.25
0.00
1.00
Jeffreys
0.75
0.50
0.25
0.00
1.00
Laplacian
0.75
0.50
0.25
0.00
Reversible Jump
1.00
0.75
0.50
0.25
0.00
1 2 3 4 5 6 7 1 2 3 4 5 6 7
Chromosome
Figure 4: Posterior inclusion probabilities for Barley data. Dotted line: prior proba-
bility, light grey region: 3 <Bayes Factor < 20, mid-grey region: 20 <Bayes Factor
< 150, dark grey region: 150 <Bayes Factor. Solid line: fixed effect models, dashed
line: random effect models.
106 Bayesian Variable Selection Methods
fixed effect methods and the adaptive shrinkage with Jeffreys’ prior tended to give low
estimates for P (Ij = 1 | data), with only a few variables likely to be in the model, and
even these have intermediate probabilities. The random effects methods, wih the excep-
tion of the Kuo & Mallick and reversible jump MCMC methods, gave poor separation of
the variables, the marginal posterior inclusion probability for variables apparently hav-
ing no effect was similar to the prior probability: this behavior is clear with the more
informative priors too. For the fixed effect variants, the use of informative priors lead
to similar posterior distributions, in particular the Kuo & Mallick and reversible jump
MCMC methods achieved better identification of variables that should be included in
the model.
The posterior distributions of βj for the 10 loci showing the highest posterior QTL
occupancy P (Ij = 1 | data) (averaged over the models) are similar (Fig. 5), although the
random effects variants shrink the estimates towards zero, as might be expected (these
variables are chosen to be extreme, so the common variance shrinks them towards the
other variables). The estimates from the analyses with the second set of priors show a
similar pattern (results not shown).
−0.8 −0.4 0.0 −1.0 −0.5 0.0 −1.0 0.0 0.5 1.0 1.5 −1.0 −0.6 −0.2 0.2 0.0 0.4 0.8 1.2
0.0 0.2 0.4 −0.8 −0.4 0.0 −2.0 −1.0 0.0 0.0 0.5 1.0 1.5 −0.6 −0.4 −0.2 0.0
βj | Ij, data
density interval. Black: fixed effect models, cyan: random effect models. Strong colours
in the Barley data set, with vague priors. Posterior mode and 70% highest posterior
panel) showing the highest marginal posterior inclusion probability, P (Ij = 1 | data),
Figure 5: Posterior distributions for regression coefficients, βj , for the 10 loci (one per
107
(i.e. black and cyan) denote P (βj | Ij = 1, data), Lighter colours (i.e. grey and light
108 Bayesian Variable Selection Methods
Priors 1 Priors 2
RelHumid
SO2
NOx
No Selection
HydCarb
% Poor
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
RelHumid
SO2
RelHumid
SO2
NOx
HydCarb
% Poor
GVS
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
RelHumid
SO2
NOx
HydCarb
% Poor
SSVS
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
RelHumid
SO2
NOx
HydCarb
Jeffreys
% Poor
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
RelHumid
SO2
NOx
HydCarb
Laplacian
% Poor
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
RelHumid
Reversible Jump
SO2
NOx
HydCarb
% Poor
W Collar
% NWhite
Pop. Density
% Gd House
Yrs Educ
Ave. HH Size
% Over 65
Jul Temp
Jan Temp
Rainfall
Pr(Ij=1 | data)
Figure 6: Posterior inclusion probabilities for Pollution data. Dashed line: prior prob-
ability, light grey region: 3 <Bayes Factor < 20, mid-grey region: 20 <Bayes Factor
< 150, dark grey region: 150 <Bayes Factor. Solid line: fixed effect models, dashed
line: random effect models.
R.B. O’Hara and M. J. Sillanpää 109
Increasing the variance of the GVS pseudo-prior had varying effects: increasing the
number of runs for the Pollution data, and the Barley data up to a plateau, had little
effect on the Simulations, except for an outlier (Fig. 7, first column). Increasing the
width of the spike in the fixed effect SSVS increased the number of runs to a peak,
and then decreased it (Fig. 7, second column), with the Pollution data having a second
increase. The decrease is related to the change in the posterior probability: as the indi-
cators are more often in state 0, they will flip less often, and hence there will be fewer
runs. Because most of the posterior probabilities for the simulated data are close to 1,
increasing the width decreases them all towards 0.5, so increases the flipping between
states, and hence the number of runs.
Increasing the ratio in the M & G model caused a decline in the number of runs
(Fig. 7, third column). If we assume that the width of the ”slab” is determined by the
data (i.e. it barely changes with g), then the spike much be becoming thinner. This is
similar to the fixed effect SSVS: there is less overlap with the likelihood, so it is harder
for the indicator to flip between states.
Reducing the prior variance of the regression coefficients in the Laplacian shrinkage
method had little effect on the posterior modes (Fig. 8), other than the small prior vari-
ances tending to reduce the coefficients towards zero. This effect is clearer when much
smaller prior variances are used (data not shown). Unfortunately, it shrinks all of the
coefficients, including those that the other methods suggest should be non-zero. Hence,
the method gives bad separation of the variables, which is not the desired behaviour.
5 Discussion
The variety of methods available for variable selection is a tribute to the ingenuity of
those who have been working on these problems. Each method has its own properties,
and it is unlikely that any one will be optimal for all situations. Our conclusions about
the methods are summarized in Table 1. Some recommendations can be made on the
basis of these conclusions, especially when BUGS is being used. Some caveats are nec-
essary, and these will be explored below.
Firstly, it should be observed that methods based on placing a prior point mass at
zero (i.e. the Kuo & Mallick and GVS methods) tended to behave poorly in these tests,
being slow and performing no better than the other methods. With non-informative pri-
ors, the adaptive shrinkage approach using Jeffreys’ prior appeared to work best, being
roughly as fast as the other methods, and also mixing well and providing a good separa-
tion between variables ‘in’ and ‘out’ of the model. If informative priors and a fixed effect
model are to be used, then SSVS becomes more attractive: mixing was improved and
110 Bayesian Variable Selection Methods
6500
1200
2000
Pollution Data
6000
1000
TotRunsMandGPollute
TotRunsSSVSPollute
TotRunsGVSPollute
1500
5500
800
1000
600
5000
400
500
4500
5 10 15 20 0.2 0.4 0.6 0.8 1.0 200 400 600 800 1000
35000
300
of Runs
Barley Data
TotRunsMandGBarley
TotRunsSSVSBarley
1200
TotRunsGVSBarley
250
200
30000
Total Number
1000
150
100
800
25000
50
600
0.2 0.4 0.6 0.8 1.0 0.05 0.10 0.15 0.20 0.25 200 400 600 800 1000
20000
Simulation Data
3000
15000
TotRunsMandGSims1
TotRunsSSVSSims1
TotRunsGVSSims1
300
2500
10000
2000
200
1500
5000
100
1000
0.2 0.4 0.6 0.8 1.0 0.05 0.10 0.15 0.20 0.25 200 400 600 800 1000
Pseudo−prior
(1:10)/10
Spike
(1:10)/40 g
Ratio
Standard Deviation Standard Deviation
Figure 7: Effect of tuning parameters on the number of runs for three data sets: effect
of (1) the width of the pseudo-prior in GVS, (2) the spike width in the fixed effect SSVS,
and (3) the ratio of slab to spike variances in the M & G SSVS.
R.B. O’Hara and M. J. Sillanpää 111
0.02 0.08 0.02 0.08 0.02 0.08 0.02 0.08 0.02 0.08
Figure 8: Effect of prior mean (expressed through the “cut-off”, c) for Laplacian
shrinkage on the posterior modes of βj .
112 Bayesian Variable Selection Methods
good separation achieved (as, indeed, it was with the Kuo & Mallick and GVS methods).
The Laplacian method performed poorly, with the results looking similar to the no
selection results. This is perhaps not surprising as the prior distributions are similar,
without a marked spike. Yi and Xu (2008) investigated the same model, but placed a
slightly different prior on µ. Using their prior in our analyses gave similar results to
those presented here. The Bayesian LASSO appears not to shrink the variables with
small effects, as the “spike” is relatively flat. Yi and Xu (2008) also found this behaviour
with a t-distributed prior, and estimating the shape parameter (i.e. the “degrees of free-
dom”). This suggests that sparsity has to be forced onto a model; the data themselves
may not demand it.
For random effect models (utilising global adaptation), all methods give similar re-
sults when they work, so the faster speed of SSVS makes it more attractive. The
Meuwissen & Goddard implementation can behave poorly (e.g. Fig. 2), and may be
sensitive to the prior parameters. One problem in the random effect models is that
separation is poorer. This is a result of the posterior variance, τ 2 , being pulled down,
so that there is less difference between P (βj | Ij = 0, data) and P (βj | Ij = 1, data)
in the region where the MCMC chain is sampling. Hence, if in reality Ij = 0, the
posterior distribution of Ij is largely determined by the prior, a problem which may be
particularly bad when there is little data. One strategy that might help here is to use
block-local adaptation: splitting the covariates up into groups which may be expected
to react in a similar way, and using a different random effect within each (local) group.
Reversible jump MCMC can run very quickly per iteration but this is offset by the
poorer mixing, so that overall it performs similarly to the other methods. Curiously,
for the Barley data it estimated a smaller number of variables than the other methods,
and seems to perform best (i.e. give a clear separation of P (Ij = 1 | data) into values
close to 0 or 1 when prior P (Ij = 1) = 0.5. Implementing a reversible jump MCMC
scheme that analytically integrated out the βj ’s would improve the mixing performance
of the sampler.
To what extent are the conclusions outlined above generalisable? Firstly, it must be
noted that some of the results are specific to BUGS: in particular, the Composite Model
Space will be faster than most of the other methods if it is programmed from scratch.
Similarly, analytic integration over the βj ’s should provide better mixing properties and
performance of the methods. Optimisation of code, use of block-updates (e.g. blocking
each Ij and βj together; Meuwissen et al. 2001; Geweke 1996), and local adaptation
may also improve performance of the methods. It should also be acknowledged that we
considered only three data sets, and that other tests of the methods may lead to dif-
ferent conclusions. This is probably mainly relevant for the comparisons of the mixing
performances of the methods, where the data and priors can have a large influence.
R.B. O’Hara and M. J. Sillanpää 113
The importance of these results to other models also depends on what else are in
these other models. The models examined here only had a regression with variable se-
lection, and no other sub-model. For the Barley data there are over 100 variables, so the
differences in timings are particularly severe. Hence, in cases where variable selection
is carried out for fewer variables, and where the variable selection is part of a larger
model, any differences between models may not be of practical significance. This may
mean that differences in timings and mixing may be small, so GVS or Kuo & Mallick
(for example) will perform just as well as SSVS. One advantage of GVS and Kuo &
Mallick is that when Ii = 0, θj = 0, so no tuning of the width of the spike is required.
The results here suggest that the random effect approach (i.e. global adaptation)
shows promise as a method that tunes the model to make variable selection easier, but
it perhaps needs further study. As already noted, the posterior probability is approxi-
mately equal to the prior probability when there is no evidence that the variable should
be included, so Bayes factors should probably be used to present the results. The ran-
dom effect methods were less effective when there are fewer variables in the model. In
this case, the variance of the random effect is being more poorly estimated, and hence
the tuning is less accurate, as can be seen in the Pollution example where GVS, SSVS,
and reversible jump MCMC gave very high posterior probabilities to all variables, with
the distributions of βj | Ij being very similar for Ij = 0 and Ij = 1. It is possible that,
in some cases, the slab part of the model will become shrunk to close to zero, in which
case the model may become stuck suggesting that many variables are in the model, but
all with very small effects. Here we have assumed that p, the prior inclusion probability,
was fixed. But another form of global adaptation would be to place a prior on this, and
hence estimate a global inclusion probability (i.e. common over all js) (Iswaran and
Rao 2005) or analytically integrate it out of the analysis (Smith and Kohn 2002). Our
experiments with this idea suggest that it will not perform better than the methods
investigated here (results not shown).
Using informative priors and local adaptation will also help the mixing, so it is ad-
visable to consider if they can be elicited. In principle there is considerable flexibility
in the priors that could be used, for example Sillanpää and Bhattacharjee (2005, 2006)
used the Kuo & Mallick approach, with local adaptation using a Cauchy distribution
as a prior for βj . Because this distribution also has a large amount of probability mass
around zero, there was considerable confounding between the parameters (in the manner
discussed above). The authors therefore focused on estimating θj rather than Ij , and
the method is perhaps better seen as a variant of adaptive shrinkage. Even when infor-
mative priors cannot be justified from prior knowledge, their use may still be considered
as long as the computational advantages sufficiently out-weigh the disadvantages in the
bias they induce.
Whilst the use of variable selection can be criticised as being hypothesis testing in a
fake beard and glasses, there are still occasions when it can be useful, in particular when
the purpose of the analysis is exploratory. The example of QTL analysis (i.e. the Barley
114 Bayesian Variable Selection Methods
data) is typical here, where there is often little a priori reason to expect any particular
marker to behave differently. It is then perhaps not surprising that much of the recent
development has been in this area, although the wider use of these methods will depend
on either software being written specifically for an application, or the porting of these
methods into general purpose software such as OpenBUGS. It is pleasing, then, that
the simpler methods do seem to work well, suggesting that even if they are not optimal,
they are still a useful part of the Bayesian’s armoury.
References
Ball, R. D. (2001). “Bayesian methods for quantitative trait loci based on model se-
lection: approximate analysis using Bayesian information criteria.” Genetics, 159:
1351–1364.
Broman, K. W. and Speed, T. P. (2002). “A model selection approach for identification
of quantitative trait loci in experimental crosses.” J. Roy. Stat. Soc. B., 64: 641–656.
Brown, P. J., Vannucci, M., and Fearn, T. (1998). “Multivariate Bayesian variable
selection and prediction.” J. Roy. Stat. Soc. B., 60: 627–641.
Burnham, K. and Anderson, D. (2002). Model Selection and Multimodel Inference: A
Practical Information-Theoretic Approach. New York, U.S.A.: Springer, 2nd edition.
Carlin, B. R. and Chib, S. (1995). “Bayesian model choice via Markov Chain Monte
Carlo methods.” J. Roy. Stat. Soc. B., 57: 473–484.
Dellaportas, P., Forster, J. J., and Ntzoufras, I. (2000). “Bayesian variable selection
using the Gibbs sampling.” In Dey, D. K., Ghosh, S. K., and Mallick, B. K. (eds.),
Generalized linear models: a Bayesian perspective, 273–286. Marcel Dekker, Inc.,New
York.
Dellaportas, P., J.J.Forster, and Ntzoufras, I. (1997). “On Bayesian model and variable
selection using MCMC. Technical report.” Technical report, Department of Statistics,
Athens University of Economics and Business, Athens Greece.
Figueiredo, M. A. T. (2003). “Adaptive sparseness for supervised learning.” IEEE
Transactions on Pattern Analysis and Machine Intelligence, 25: 1150–1159.
Gelman, A. (2006). “Prior distributions for variance parameters in hierarchical models
(comment on article by Browne and Draper).” Bayesian Analysis, 1: 515–534.
George, E. I. and McCulloch, R. E. (1993). “Variable selection via Gibbs sampling.” J.
Am. Stat. Assoc., 85: 398–409.
— (1997). “Approaches for Bayesian variable selection.” Statistica Sinica, 7: 339–373.
Geweke, J. (1996). “Variable selection and model comparison in regression.” In
Bernando, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (eds.), Bayesian
Statistics 5, 609–620. Oxford University Press, Oxford, UK.
R.B. O’Hara and M. J. Sillanpää 115
Geyer, C. J. (1992). “Practical Markov chain Monte Carlo.” Statist. Sci., 7: 473–483.
Godsill, S. J. (2001). “On the relationship between MCMC model uncertainty methods.”
J. Comput. Graph. Stat., 10: 230–248.
Göring, H. H. H., Terwilliger, J. D., and Blangero, J. (2001). “Large upward bias in
estimation of locus-specific effects from genomewide scans.” Am. J. Hum. Genet., 69:
1357–1369.
Green, P. J. (1995). “Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination.” Biometrika, 82: 711–732.
Hopert, J. P. and Casella, G. (1996). “The effect of improper priors on Gibbs sampling
in hierarchical mixed models.” J. Am. Stat. Assoc., 91: 1461–1473.
Hoti, F. and Sillanpää, M. J. (2006). “Bayesian mapping of genotype x expression
interactions in quantitative and qualitative traits.” Heredity, 97: 4–18.
Iswaran, H. and Rao, J. S. (2005). “Spike and slab variable selection: frequentist and
Bayesian strategies.” The Annals of Statistics, 33: 730–773.
Kass, R. E. and Raftery, A. E. (1995). “Bayes factors.” J. Am. Stat. Assoc., 90:
773–795.
Kilpikari, R. and Sillanpää, M. J. (2003). “Bayesian analysis of multilocus association
in quantitative and qualitative traits.” Genet. Epidemiol., 25: 122–135.
Knapp, S., Bridges, W., and Birkes, D. (1990). “Mapping quantitative trait loci using
molecular marker linkage maps.” Theor. Appl. Genet., 79: 583–592.
Kuo, L. and Mallick, B. (1998). “Variable selection for regression models.” Sankhya
Ser. B, 60: 65–81.
Lande, R. and Thompson, R. (1990). “Efficiency of marker-assisted selection in the
improvement of quantitative traits.” Genetics, 124: 743–756.
Lunn, D. J., Whittaker, J. C., and Best, N. (2006). “A Bayesian toolkit for genetic
association studies.” Genet. Epidemiol., 30: 231–247.
McDonald, G. C. and Schwing, R. C. (1973). “Instabilities of regression estimates
relating air pollution to mortality.” Technometrics, 15: 463–482.
Meuwissen, T. H. E. and Goddard, M. E. (2004). “Mapping multiple QTL using linkage
disequilibrium and linkage analysis information and multitrait data.” Genet. Sel.
Evol., 36: 261–279.
Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). “Prediction of total
genetic value using genome-wide dense marker map.” Genetics, 157: 1819–1829.
Miller, A. (2002). Subset Selection in Regression. Boca Raton, Florida, U.S.A.: Chap-
man & Hall/CRC.
116 Bayesian Variable Selection Methods
Park, T. and Casella, G. (2008). “The Bayesian Lasso.” J. Am. Stat. Assoc., 103:
681–686.
Plummer, M., Best, N., Cowles, K., and Vines, K. (2008). coda: Output analysis and
diagnostics for MCMC. R package version 0.13-3.
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. New York:
Springer-Verlag, 2nd edition.
Sen, S. and Churchill, G. A. (2001). “A statistical framework for quantitative trait
mapping.” Genetics, 159: 371–387.
Sillanpää, M., Gasbarra, D., and Arjas, E. (2004). “Comment on “On the Metropolis-
Hastings acceptance probability to add or drop a quantitative trait locus in Markov
Chain Monte Carlo-based Bayesian analyses”.” Genetics, 167: 1037.
Sillanpää, M. J. and Arjas, E. (1998). “Bayesian mapping of multiple quantitative trait
loci from incomplete inbred line cross data.” Genetics, 148: 1373–1388.
Sillanpää, M. J. and Bhattacharjee, M. (2005). “Bayesian association-based fine map-
ping in small chromosomal segments.” Genetics, 169: 427–439.
— (2006). “Association mapping of complex trait loci with context-dependent effects
and unknown context-variable.” Genetics, 174: 1597–1611.
Sillanpää, M. J. and Corander, J. (2002). “Model choice in gene mapping: what and
why.” Trends Genet., 18: 301–307.
Smith, M. and Kohn, R. (2002). “Parsimonious covariance matrix estimation for longi-
tudinal data.” J. Am. Stat. Assoc., 97: 1141–1153.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and der Linde, A. V. (2002). “Bayesian
measures of model complexity and fit (with discussion).” J. R. Stat. Soc. B., 64:
583–616.
Sturtz, S., Ligges, U., and Gelman, A. (2005). “R2WinBUGS: A Package for Running
WinBUGS from R.” Journal of Statistical Software, 12: 1–16.
ter Braak, C. J. F., Boer, M. P., and Bink, M. C. A. M. (2005). “Extending Xu’s
Bayesian model for estimating polygenic effects using markers of the entire genome.”
Genetics, 170: 1435–1438.
Thomas, A., O’Hara, B., Ligges, U., and Sturtz, S. (2006). “Making BUGS Open.” R
News, 9: 12–17.
Tibshirani, R. (1996). “Regression shrinkage and selection via the lasso.” J. R. Stat.
Soc. B., 58: 267–288.
Tinker, N. A., Mather, D. E., Rossnagel, B. G., Kasha, K. J., and Keinhofs, A. (1996).
“Regions of the genome that affect agronomic performance in two-row barley.” Crop.
Sci., 36: 1053–1062.
R.B. O’Hara and M. J. Sillanpää 117
Xu, S. (2003). “Estimating polygenic effects using markers of the entire genome.”
Genetics, 163: 789–801.
Yi, N. (2004). “A unified Markov chain Monte Carlo framework for mapping multiple
quantitative trait loci.” Genetics, 167: 967–975.
Yi, N., George, V., and Allison, D. B. (2003). “Stochastic search variable selection for
identifying multiple quantitative trait loci.” Genetics, 164: 1129–1138.
Yi, N. and Xu, S. (2008). “Bayesian LASSO for quantitative trait loci mapping.”
Genetics, 179: 1045–1055.
Zhang, Y.-M. and Xu, S. (2005). “A penalized maximum likelihood method for esti-
mating epistatic effects of QTL.” Heredity, 95: 96–104.
Acknowledgments
We would like to thank Nengjun Yi and Roderick D. Ball for their comments on the manuscript.
This work was supported by research grants (202324 and 205371) from the Academy of Fin-
land.
118 Bayesian Variable Selection Methods