Bayesian Analysis
Volume 4, Number 4, 2009

Hierarchical Bayesian Modeling of Hitting Performance in Baseball
    S. T. Jensen, B. B. McShane and A. J. Wyner . . . . . . . . . . . . . . 631
Comment on Article by Jensen et al.
    J. Albert and P. Birnbaum . . . . . . . . . . . . . . . . . . . . . . . 653
Comment on Article by Jensen et al.
    M. E. Glickman . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
Comment on Article by Jensen et al.
    F. A. Quintana and P. M. Müller . . . . . . . . . . . . . . . . . . . . 665
Rejoinder
    S. T. Jensen, B. B. McShane and A. J. Wyner . . . . . . . . . . . . . . 669
Bayesian Inference for Directional Conditionally Autoregressive Models
    M. Kyung and S. K. Ghosh . . . . . . . . . . . . . . . . . . . . . . . 675
Spiked Dirichlet Process Prior for Bayesian Multiple Hypothesis Testing in Random Effects Models
    S. Kim, D. B. Dahl and M. Vannucci . . . . . . . . . . . . . . . . . . 707
Modeling space-time data using stochastic differential equations
    J. A. Duan, A. E. Gelfand and C. F. Sirmans . . . . . . . . . . . . . . 733
Inconsistent Bayesian Estimation
    R. Christensen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
Sample Size Calculation for Finding Unseen Species
    H. Zhang and H. Stern . . . . . . . . . . . . . . . . . . . . . . . . . 763
Markov Switching Dirichlet Process Mixture Regression
    M. A. Taddy and A. T. Kottas . . . . . . . . . . . . . . . . . . . . . 793
A Case for Robust Bayesian Priors with Applications to Clinical Trials
    J. A. Fúquene, J. D. Cook and L. R. Pericchi . . . . . . . . . . . . . 817
Editor-in-Chief's Note
    B. P. Carlin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
Bayesian Analysis (2009) 4, Number 4, pp. 631–652
Hierarchical Bayesian Modeling of Hitting
Performance in Baseball
Shane T. Jensen∗, Blakeley B. McShane† and Abraham J. Wyner‡
Abstract. We have developed a sophisticated statistical model for predicting the
hitting performance of Major League baseball players. The Bayesian paradigm
provides a principled method for balancing past performance with crucial covariates, such as player age and position. We share information across time and across
players by using mixture distributions to control shrinkage for improved accuracy.
We compare the performance of our model to current sabermetric methods on a
held-out season (2006), and discuss both successes and limitations.
Keywords: baseball, hidden Markov model, hierarchical Bayes
1 Introduction and Motivation
There is substantial public and private interest in the projection of future hitting performance in baseball. Major league baseball teams award large monetary contracts to top free agent hitters on the assumption that past success will continue into the future. Of course, future performance is expected to vary, but for the most part it appears that teams are often quite foolishly seduced by a fine performance over a single season. There are many questions: How
should past consistency be balanced with advancing age when projecting future hitting
performance? In young players, how many seasons of above-average performance need
to be observed before we consider a player to be a truly exceptional hitter? What is
the effect of a single sub-par year in an otherwise consistent career? We will attempt to
answer these questions through the use of fully parametric statistical models for hitting
performance.
Modeling and prediction of hitting performance is an area of very active research
within the quantitatively-oriented baseball community. Popular current methods include PECOTA (Silver 2003) and MARCEL (Tango 2004). PECOTA is considered a “gold-standard” tool in the sabermetrics community and its predictions are billed by Baseball Prospectus as being “deadly accurate”. It is a sophisticated commercial product managed by a team of statisticians that incorporates proprietary data, minor league histories, and detailed injury reports. Since PECOTA is proprietary, we cannot
∗ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, stjensen@wharton.upenn.edu
† Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, mcshaneb@wharton.upenn.edu
‡ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, ajw@wharton.upenn.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA424
say exactly what methods they use, though we know the general method is based on matching a player's past career performance to the careers of a set of comparable major league ballplayers. For each player, the set of comparable players is found by a
nearest neighbor analysis of past players (both minor and major league) with similar
performance at the same age. Once a comparison set is found, the future performance
prediction for the player is based on the historical performance of those past comparable
players. Factors such as park effects, league effects and physical attributes of the player
are also taken into account. PECOTA also makes use of substantial manual curation
both to the matching process and to introduce new information as it becomes available.
We have observed that the pre-season PECOTA predictions are adjusted on a daily
basis as news (e.g., injury information, pre-season performance, etc.) is released.
In contrast, our focus is on a model-based approach to prediction of hitting performance that is fully automated and based on publicly available data. Thus, a more
appropriate benchmark for our analysis is MARCEL, a publicly available prediction engine based on the same freely available dataset (Lahman 2006) as our model. MARCEL
is a simple two-stage system for prediction. First, MARCEL takes a weighted average
of the performance of the player over the previous three years, giving more weight to the
most recent seasons. Then, it shrinks this weighted average to the overall league mean
based on the number of plate appearances. Thus, the more data for a given player, the
less shrinkage. Over several seasons, MARCEL has performed well against more elaborate competitors (Tango 2004), but should be outperformed by our principled approach.
Although it is less of a fair benchmark, we will also compare with PECOTA in order
to assess how well our model does against the best available proprietary commercial
product.
In Section 2, we present a Bayesian hierarchical model for the evolution of hitting
performance throughout the careers of individual players. Bayesian or Empirical Bayes
approaches have recently been used to model individual hitting events based on various
within-game covariates (Quintana et al. 2008) and for prediction of within-season performance (Brown 2008). We are addressing a different question: how can we predict the
course of a particular hitter’s career based on the seasons of information we have observed thus far? Our model includes several covariates that are crucial for the accurate
prediction of hitting for a particular player in a given year. A player's age and home ballpark certainly have an influence on his hitting; we will include this information
among the covariates in our model. We will also include player position in our model,
since we believe that position is an important proxy for hitting performance (e.g., second
basemen have a generally lower propensity for home runs than first basemen). Finally,
our model will factor the past performance of each player into future predictions. In Section 3, we test our predictions against a held-out data set, and compare our performance
with several competing methods. A major advantage of our model-based approach is
the ability to move beyond the point predictions offered by other engines to the incorporation of variability via calibrated predictive intervals. We examine our results not
only in terms of the accuracy of our point predictions, but also the quality of the prediction
intervals produced by our model. We also investigate several other interesting aspects
of our model in Section 3 and then conclude with a brief discussion in Section 4.
2 Model and Implementation
Our data comes from the publicly-available Lahman Baseball Database (Lahman 2006),
which contains hitting totals for each major league baseball player from 1871 to the
present day, though we will fit our model using only seasons from 1990 to 2005. In
total, we have 10,280 player-years of information from major league baseball between
1990 and 2005 that will be used for model estimation. Within each season j, we will
use the following data for each player i:
1. Home Run Total : Yij
2. At Bat Total : Mij
3. Age : Aij
4. Home Ballpark : Bij
5. Position : Rij
As an example, Barry Bonds in 2001 had Yij = 73 home runs out of Mij = 476 at bats.
We excluded pitchers from our model, leaving us with nine positions: first baseman (1B), second baseman (2B), third baseman (3B), shortstop (SS), left fielder (LF), center fielder (CF), right fielder (RF), catcher (C), and designated hitter (DH). There
were 46 different home ballparks used in major league baseball between 1990 and 2005.
Player ages ranged between 20 and 49, though the vast majority of player ages were
between 23 and 44.
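As an illustrative sketch (not part of the original analysis), the player-year records above can be assembled from a CSV export of the Lahman hitting table; the file name and column names (playerID, yearID, AB, HR) follow recent Lahman releases and may differ in the 2006 version, and age, ballpark, and position would be merged in from the database's other tables.

```python
import pandas as pd

# Batting.csv is the hitting table from the Lahman database; column names
# are an assumption based on recent releases of the database.
batting = pd.read_csv("Batting.csv")
batting = batting[batting["yearID"].between(1990, 2005)]

# Sum over stints so that each row is one player-year (i, j),
# with Y_ij = HR and M_ij = AB.
player_years = (batting.groupby(["playerID", "yearID"], as_index=False)
                       [["AB", "HR"]].sum())
print(len(player_years), "player-years")
```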
2.1 Hierarchical Model for Hitting
Our outcome of interest for a given player i in a given year (season) j is their home run
total Yij, which we model as a Binomial variable:

Yij ∼ Binomial(Mij, θij)    (1)
where θij is a player- and year-specific home run rate, and Mij are the number of
opportunities (at bats) for player i in year j. Note that by using at-bats as our number
of opportunities, we are excluding outcomes such as walks, sacrifice flies and hit-by-pitches. We will assume that the number of opportunities Mij is fixed and known
so we focus our efforts on modeling each home run rate θij . The i.i.d. assumption
underlying the binomial model has already been justified for hitting totals within a
single season (Brown 2008), and so seems reasonable for hitting totals across an entire
season.
We next model each unobserved player-year rate θij as a function of home ballpark
b = Bij , position k = Rij and age Aij of player i in year j.
log( θij / (1 − θij) ) = αk + βb + fk(Aij)    (2)
The parameter vector α = (α1, . . . , α9) contains the position-specific intercepts for each of the nine player positions. The function fk(Aij) is a smooth trajectory in age Aij that is different for each position k. We allow a flexible model for fk(·) by using a cubic
B-spline (de Boor 1978) with different spline coefficients γ estimated for each position.
The age trajectory component of this model involves the estimation of 36 parameters:
four B-spline coefficients per position × nine different positions.
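As an illustration of this component, the following sketch constructs a cubic B-spline basis over the observed age range with scipy; the clamped knot placement with no interior knots is our assumption (the paper does not specify the knots), chosen so that each position gets the four basis coefficients quoted above.

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3
# Clamped knots with no interior knots give 8 - 3 - 1 = 4 cubic basis
# functions, matching "four B-spline coefficients per position".
knots = np.concatenate(([20.0] * (degree + 1), [49.0] * (degree + 1)))
n_basis = len(knots) - degree - 1

ages = np.arange(20, 50)
basis = np.column_stack([BSpline(knots, np.eye(n_basis)[b], degree)(ages)
                         for b in range(n_basis)])
# The trajectory for position k is then f_k(age) = basis @ gamma_k.
```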
We call the parameter vector β the “team effects” since these parameters are shared
by all players with the same team and home ballpark. However, these coefficients β
cannot be interpreted as a true “ballpark effect” since they are confounded with the
effect of the team playing in that ballpark. If a particular team contains many home
run hitters, then that can influence the effect of their home ballpark. Separating the
effect of team versus the effect of ballpark would require examining hitting data at the
game level instead of the seasonal level we are using for our current model.
There are two additional aspects of hitting performance that are not captured by the
model outlined in (1)-(2). Firstly, conditional on the covariates age, position, and ballpark, our model treats the home run rate θij as independent and identically-distributed
across players i and years j. However, we suspect that not all hitters are created equal:
we posit that there exists a sub-group of elite home run hitters within each position that
share a higher mean home run rate. We can represent this belief by placing a mixture
model on the intercept term αk dictated by a latent variable Eij in each player-year. In
other words,
αk = αk0 if Eij = 0,    αk = αk1 if Eij = 1,
where we force αk0 < αk1 for each position k. We call the latent variable Eij the elite
status for player i in year j. Players with elite status are modeled as having the same
shape to their age trajectory, but with an extra additive term (on the log-odds scale)
that increases their home run rate. However, we have a different elite indicator Eij
for each player-year, which means that a particular player i can move in and out of
elite status during the course of his career. Thus, the elite sub-group is maintained in
the player population throughout time even though this sub-group will not contain the
exact same players from year to year.
The second aspect of hitting performance that needs to be addressed is that the
past performance of a particular player should contain information about his future
performance. One option would be to use player-specific intercepts in the model to allow
each player to have a different trajectory. However, this model choice would involve a
large number of parameters, even if these player-specific intercepts were assumed to
share a common prior distribution. In addition, many of these intercepts would be
subject to over-fitting due to the small number of observed years of data for many players.
We instead favor an approach that involves fewer parameters (to prevent over-fitting)
while still allowing different histories for individual players. We accomplish this goal by
building the past performance of each player into our model through a hidden Markov
model on the elite status indicators Eij for each player i. Specifically, our probability
model of the elite status indicator for player i in year j + 1 is allowed to depend on the
Figure 1: Hidden Markov Model for Elite Status
elite status indicator for player i in year j:
p(Ei,j+1 = b | Eij = a, Rij = k) = νabk,    a, b ∈ {0, 1}    (3)
where Eij is the elite status indicator and Rij is the position of player i in year j. This
relationship is also graphically represented in Figure 1. The Markovian assumption
induces a dependence structure on the home run rates θi,j over time for each player
i. Players that show elite performance up until year j are more likely to be predicted
as elite at year j + 1. The transition parameters ν k = (ν00k , ν01k , ν10k , ν11k ) for each
position k = 1, . . . , 9 are shared across players at their position, but can differ between
positions, which allows for a different proportion of elite players in each position. We
initialize each player’s Markov chain by setting Ei0 = 0 for all i, meaning that each player
starts their career in non-elite status. This initialization has the desired consequence
that young players must show consistently elite performance in multiple years in order
to have a high probability of moving to the elite group.
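A toy sketch (with illustrative transition probabilities, not estimates from the paper) makes the dynamics of equation (3) concrete: starting from Ei0 = 0, a player's elite indicators evolve as a two-state Markov chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_elite_path(n_years, nu01, nu11):
    """Simulate E_i1, ..., E_in from the chain in (3), starting at E_i0 = 0.
    nu01 = P(elite | previously non-elite); nu11 = P(elite | previously elite).
    The values passed below are illustrative only."""
    E, prev = [], 0
    for _ in range(n_years):
        p_elite = nu11 if prev == 1 else nu01
        prev = int(rng.random() < p_elite)
        E.append(prev)
    return E

print(simulate_elite_path(10, nu01=0.1, nu11=0.8))
```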
In order to take a fully Bayesian approach to this problem, we must specify prior
distributions for all of our unknown parameters. The forty-eight different ballpark
coefficients β in our model all share a common Normal distribution,
βl ∼ Normal(0, τ²)    ∀ l = 1, . . . , 48    (4)
The spline coefficients γ needed for the modeling of our age trajectories also share a
common Normal distribution,
γkl ∼ Normal(0, τ²)    ∀ k = 1, . . . , 9,  l = 1, . . . , L    (5)
where L is the number of spline coefficients needed in the modeling of the age trajectory fk(Aij) for each position. In our latent mixture model, we also have two intercept coefficients for each position, αk = (αk0, αk1), which share a truncated Normal
distribution,
αk ∼ MVNormal(0, τ²I2) · Ind(αk0 < αk1)    ∀ k = 1, . . . , 9    (6)
where 0 is the 2 × 1 vector of zeros and I2 is the 2 × 2 identity matrix. This bivariate
distribution is truncated by the indicator function Ind(·) to ensure that αk0 < αk1
for each position k. We make each of the prior distributions (4)-(6) non-informative
by setting the variance hyperparameter τ 2 to a very large value (10000 in this study).
Finally, for the position-specific transition parameters ν of our elite status, we use flat
Dirichlet prior distributions,
(ν00k, ν01k) ∼ Dirichlet(ω, ω)    ∀ k = 1, . . . , 9,
(ν10k, ν11k) ∼ Dirichlet(ω, ω)    ∀ k = 1, . . . , 9.    (7)
These prior distributions are made non-informative by setting ω to a small value (ω = 1
in this study). We also examined other values for ω and found that using different
values had no influence on our posterior inference, which is to be expected considering
the dominance of the data in equation (9). Combining these prior distributions together
with equations (1)-(3) gives us the full posterior distribution of our unknown parameters,

p(α, β, γ, ν, E | X) ∝ ∏i,j [ p(Yij | Mij, θij) · p(θij | Rij, Aij, Bij, Eij, α, β, γ) · p(Eij | Ei,j−1, ν) ] · p(α, β, γ, ν)    (8)

where we use X to denote our entire set of observed data Y and covariates (A, B, M, R).
2.2 MCMC Implementation
We estimate our posterior distribution (8) by a Gibbs sampling strategy (Geman and
Geman 1984). We iteratively sample from the following conditional distributions of each
set of parameters given the current values of the other parameters:
1. p(α | β, γ, ν, E, X) = p(α | β, γ, E, X)
2. p(β | α, γ, ν, E, X) = p(β | α, γ, E, X)
3. p(γ | β, α, ν, E, X) = p(γ | β, α, E, X)
4. p(ν | β, γ, α, E, X) = p(ν | E)
5. p(E | β, γ, α, ν, X)

where again X denotes our entire set of observed data Y and covariates (A, B, M, R).
Combined together, steps 1-3 of the Gibbs sampler represent the usual estimation of regression coefficients (α, β, γ) in a Bayesian logistic regression model. The conditional posterior distributions for these coefficients are complicated and we employ the common strategy of using the Metropolis-Hastings algorithm to sample each coefficient (see, e.g., Gelman et al. (2003)). The proposal distribution for a particular coefficient is a Normal distribution centered at the maximum likelihood estimate of that coefficient. The
variance of this Normal proposal distribution is a tuning parameter that was adaptively
adjusted to provide a reasonable rejection/acceptance ratio (Gelman et al. 1996). Step
4 of the Gibbs sampler involves standard distributions for our transition parameters
νk = (ν00k, ν01k, ν10k, ν11k) for each position k = 1, . . . , 9. The conditional posterior distributions for our transition parameters implied by (8) are

(ν00k, ν01k) | E ∼ Dirichlet(N00k + ω, N01k + ω)
(ν11k, ν10k) | E ∼ Dirichlet(N11k + ω, N10k + ω)    (9)

where Nabk = Σi Σt I(Ei,t = a, Ei,t+1 = b), with the sums taken over all players i in position k and over the years t = 1, . . . , ni of each career, and where
ni represents the number of years of observed data for player i’s career. Finally, step
5 of our Gibbs sampler involves sampling the elite status Eij for each year j of player
i, which can be done using the “Forward-summing Backward-sampling” algorithm for
hidden Markov models (Chib 1996). For a particular player i, this algorithm “forward-sums” by recursively calculating
p(Eit | Xi,1:t, Θ) ∝ p(Xi,t | Eit, Θ) · p(Eit | Xi,1:t−1, Θ)
         ∝ p(Xi,t | Eit, Θ) Σe∈{0,1} p(Eit | Ei,t−1 = e, Θ) p(Ei,t−1 = e | Xi,1:t−1, Θ)    (10)

for all t = 1, . . . , ni, where Xi,1:t denotes the observed data for player i up until year t, Xi,t denotes only the observed data for player i in year t, and Θ represents all other parameters. The algorithm then “backward-samples” by sampling the terminal elite state Ei,ni from the distribution p(Ei,ni | Xi,1:ni, Θ) and then sampling Ei,t−1 | Ei,t for t = ni back to t = 1. Repeating this algorithm for each player i gives us a complete
sample of our elite statuses E . We ran multiple chains from different starting values
to evaluate convergence of our Gibbs sampler. Our results are based on several chains
where the first 1000 iterations were discarded as burn-in. Our chains were also thinned,
taking only every eighth iteration, in order to eliminate autocorrelation.
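A compact sketch of this forward-summing backward-sampling recursion for one player's two-state chain follows; the array layout and interface are our own, and in practice log_lik would come from the binomial likelihood (1) evaluated at the elite and non-elite rates.

```python
import numpy as np

rng = np.random.default_rng(2)

def ffbs(log_lik, trans, p0):
    # log_lik[t, e] = log p(X_{i,t} | E_{i,t} = e); trans[a, b] = nu_ab;
    # p0 = distribution of E_{i,0} (a point mass at 0 in the paper).
    T = log_lik.shape[0]
    filt = np.zeros((T, 2))
    pred = p0 @ trans                              # p(E_{i,1} | E_{i,0})
    for t in range(T):                             # forward-summing, eq. (10)
        f = pred * np.exp(log_lik[t] - log_lik[t].max())
        filt[t] = f / f.sum()
        pred = filt[t] @ trans
    E = np.zeros(T, dtype=int)                     # backward-sampling
    E[-1] = rng.choice(2, p=filt[-1])
    for t in range(T - 2, -1, -1):
        b = filt[t] * trans[:, E[t + 1]]
        E[t] = rng.choice(2, p=b / b.sum())
    return E

# Example with made-up inputs: 8 seasons, sticky elite states.
E = ffbs(rng.normal(size=(8, 2)), np.array([[0.9, 0.1], [0.2, 0.8]]),
         np.array([1.0, 0.0]))
```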
2.3 Model Extension: Player-Specific Transition Parameters
In Section 2.1, we introduced a hidden Markov model that allows the past performance
of each player to influence predictions for future performance. If we infer player i to
have been elite in year t (Ei,t = 1), then this inference influences the elite status of
that player in his next year, Ei,t+1 through the transition parameters ν k . However, one
potential limitation of these transition parameters ν k is that they are shared globally
across all players at that position: each player at position k has the same probability
of transitioning from elite to non-elite and vice versa. This model assumption allows
us to pool information across players for the estimation of our transition parameters
in (9), but may lead to loss of information if players are truly heterogeneous with
respect to the probability of transitioning between elite and non-elite states. In order
to address this possibility, we consider extending our model to allow player-specific
transition parameters in our hidden Markov model.
Our proposed extension, which we call the PSHMM, has player-specific transition parameters ν^i = (ν^i_00, ν^i_01, ν^i_10, ν^i_11) for each player i, that share a common prior distribution,

(ν^i_00, ν^i_01) ∼ Dirichlet(ω00k, ω01k)
(ν^i_11, ν^i_10) ∼ Dirichlet(ω11k, ω10k)    (11)

where k is the position of player i. The global parameters ωk = (ω00k, ω01k, ω11k, ω10k) are now allowed to vary with flat prior distributions. This new hierarchical structure allows the transition probabilities ν^i to vary between players, but still imposes some shrinkage towards a common distribution controlled by the global parameters ωk that are shared across players with position k. Under this model extension, the new conditional
posterior distribution for each ν^i is

(ν^i_00, ν^i_01) | E ∼ Dirichlet(Ni00 + ω00k, Ni01 + ω01k)
(ν^i_11, ν^i_10) | E ∼ Dirichlet(Ni11 + ω11k, Ni10 + ω10k)    (12)

where Niab = Σt I(Ei,t = a, Ei,t+1 = b), with the sum taken over t = 1, . . . , ni − 1.
To implement this extended model, we must replace step 4 in our Gibbs sampler with a step where we draw ν^i from (12) for each player i. We must also insert a new step in our Gibbs sampler where we sample the global parameters ωk given the sampled values of ν^i for all players at position k. This added step requires sampling
(ω00k, ω01k) from the following conditional distribution:

p(ω00k, ω01k | ν) ∝ [ Γ(ω00k + ω01k) / (Γ(ω00k)Γ(ω01k)) ]^nk × [ ∏i ν^i_00 ]^(ω00k − 1) × [ ∏i ν^i_01 ]^(ω01k − 1)    (13)

where each product is only over players i at position k and nk is the number of players at position k. We accomplish this sampling by using a Metropolis-Hastings step with true distribution (13) and Normal proposal distributions: ω00k^prop ∼ N(ω̂00k, σ²) and ω01k^prop ∼ N(ω̂01k, σ²). The means of these proposal distributions are:

ω̂00k = ν̄00k ( ν̄00k(1 − ν̄00k)/s²0k − 1 )  and  ω̂01k = (1 − ν̄00k) ( ν̄00k(1 − ν̄00k)/s²0k − 1 )    (14)

with

ν̄00k = Σi ν^i_00 / nk  and  s²0k = Σi (ν^i_00 − ν̄00k)² / nk

where each sum is over all players i at position k and nk is the number of players at position k. These estimates ω̂00k and ω̂01k were calculated by equating the sample mean ν̄00k and sample variance s²0k with the mean and variance of the Dirichlet distribution (13). Similarly, we sample (ω11k, ω10k) with the same procedure but with obvious substitutions.
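The moment-matching in (14) is simple to express in code; this sketch (with an interface and illustrative values of our own) computes the proposal means from the sampled ν^i_00 at one position.

```python
import numpy as np

def proposal_means(nu00):
    # nu00: sampled values nu^i_00 for all players i at one position.
    nu_bar = nu00.mean()
    s2 = nu00.var()                  # divides by n_k, as in the text
    common = nu_bar * (1.0 - nu_bar) / s2 - 1.0
    return nu_bar * common, (1.0 - nu_bar) * common  # omega00_hat, omega01_hat

print(proposal_means(np.array([0.85, 0.90, 0.80, 0.95])))
```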
3 Results and Model Comparison
Our primary interest is the prediction of future hitting events Y⋆t+j for years j = 1, 2, . . . based on our model and observed data up to year t. We estimate the full posterior distribution (8) and then use this posterior distribution to predict home run totals Y⋆i,2006 for each player i in the 2006 season. The 2006 season serves as an external validation of our method, since this season is not included in our model fit. We use our predicted home run totals Y⋆2006 for the 2006 season to compare our performance to several previous methods (Section 3.2) as well as evaluate several internal model choices (Section 3.1). In Section 3.3, we present inference for other parameters of interest from our model, such as the position-specific age curves.
3.1 Prediction of 2006 Home Run Totals: Internal Comparisons
We can use our posterior distribution (8) based on data from MLB seasons up to 2005
to calculate the predictive distribution of the 2006 hitting rate θi,2006 for each player i.
p(θi,2006 | X) = ∫ p(θi,2006 | Ri,2006, Ai,2006, Bi,2006, Ei,2006, α, β, γ) · p(Ei,2006 | Ei, ν) · p(α, β, γ, ν, Ei | X) dα dβ dγ dν dEi    (15)

where X represents all observed data up to 2005. This integral is estimated using the sampled values from our posterior distribution p(α, β, γ, ν, Ei | X) that were generated via our Gibbs sampling strategy.
We can use the posterior predictive distribution (15) of each 2006 home run rate θi,2006 to calculate the distribution of the home run total Y⋆i,2006 for each player in the 2006 season.

p(Y⋆i,2006 | X) = ∫ p(Y⋆i,2006 | Mi,2006, θi,2006) · p(θi,2006 | X) dθi,2006    (16)
However, the issue with prediction of home run totals is that we must also consider the
number of opportunities Mi,2006 . Since our overall focus has been on modeling home run
rates θi,2006 , we will use the true value of Mi,2006 for the 2006 season in equation (16).
Using the true value of each Mi,2006 gives a fair comparison of the rate predictions θi,2006
for each model choice, since it is a constant scaling factor. This is not a particularly
realistic scenario in a prediction setting since the actual number of opportunities will
not be known ahead of time.
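In code, the integral (16) reduces to pushing posterior draws of θi,2006 through the binomial likelihood while holding the true 2006 at-bats fixed; in the sketch below a Beta sample stands in for the actual Gibbs draws from (15).

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in posterior sample of theta_{i,2006}; in practice these are the
# Gibbs draws from (15).
theta_draws = rng.beta(40.0, 960.0, size=5000)

# Monte Carlo version of (16), holding the true 2006 at-bats M fixed.
y_star = rng.binomial(500, theta_draws)
print(y_star.mean(), np.percentile(y_star, [10, 90]))  # mean and 80% interval
```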
Based on the predictive distribution p(Y⋆i,2006 | X), we can report either a predictive mean E(Y⋆i,2006 | X) or a predictive interval C⋆i such that p(Y⋆i,2006 ∈ C⋆i | X) ≥ 0.80. We can examine the accuracy of our model predictions by comparing to the observed home run totals Yi,2006 for the 559 players in the 2006 season, which we did not include in our model fit. We use the following three comparison metrics:
1. RMSE: root mean square error of predictive means,

   RMSE = sqrt( (1/n) Σi ( E(Y⋆i,2006 | X) − Yi,2006 )² )

2. Interval Coverage: fraction of 80% predictive intervals C⋆i covering the observed Yi,2006

3. Interval Width: average width of the 80% predictive intervals C⋆i
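These three metrics are straightforward to compute from the predictive summaries; a sketch with an array-based interface of our own follows.

```python
import numpy as np

def validation_metrics(pred_mean, lo, hi, y_true):
    # RMSE of predictive means, coverage and average width of the 80%
    # predictive intervals [lo, hi]; all arguments are arrays over players.
    rmse = np.sqrt(np.mean((pred_mean - y_true) ** 2))
    coverage = np.mean((y_true >= lo) & (y_true <= hi))
    width = np.mean(hi - lo)
    return rmse, coverage, width
```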
In Table 1, we evaluate our full model outlined in Section 2.1 relative to several
simpler modeling choices. Specifically, we examine a simpler version of our model without positional information or the mixture model on the α coefficients. We see from
Table 1 that our full model gives proper coverage and a substantially lower RMSE than
the version of our model without positional information or the elite/non-elite mixture
model. We also examine a truly simplistic strawman, which is to take last year's home run total as the prediction for this year's home run total (i.e., Y⋆i,2006 = Yi,2005). Since this strawman is only a point estimate, that comparison is made based solely on the
RMSE. As expected, the relative performance of this strawman model is terrible, with
a substantially higher RMSE compared to our full model. Of course, this simple strawman alternative is rather naive and in Section 3.2, we compare our performance to more
sophisticated external prediction approaches.
Table 1: Internal comparison of different model choices. Measures are calculated over 559 players from the 2006 season.

Model                               RMSE   Coverage of 80% Intervals   Average Interval Width
Full Model                          5.30   0.855                       9.81
No Position or Elite Indicators     6.87   0.644                       6.56
Strawman: Y⋆i,2006 = Yi,2005        8.24   NA                          NA
Player-Specific Transitions         5.45   0.871                       10.36
We also considered an extended model in Section 2.3 with player-specific transition
parameters for the hidden Markov model on elite status, and the validation results from
this model are also given in Table 1. Our motivation for this extension was that allowing
player-specific transition parameters might reduce the interval width for players that
have displayed consistent past performance. However, we see that the overall prediction
accuracy was not improved with this model extension, suggesting that there is not
enough additional information in the personal history of most players to noticeably
improve the model predictions. Somewhat surprisingly, we also see that the width of
our 80% predictive intervals are not actually reduced in this extended model. The
reason is that, even for players with long careers of data, the player-specific transition parameters ν^i fit by this extended model are not extreme enough to force all sampled elite indicators Ei,2006 to be either 0 or 1, and so the predictive interval is still wide enough to include both possibilities.
3.2 Prediction of 2006 Home Run Totals: External Comparisons
Similarly to Section 3.1, we use hold-out home run data for the 2006 season to evaluate our model predictions compared to the predictions from two external methods,
PECOTA (Silver 2003) and MARCEL (Tango 2004), both described in Section 1. We
view MARCEL as the primary competitor of our approach, as it also is a fully-automated
method based on publicly available data. However, out of general interest we also compare our prediction accuracy to the proprietary and manually-curated PECOTA system.
For a reasonable comparison set, we focus our external validation on hitters with an
empirical home run rate of at least 1 home run every 40 at-bats in at least one season up
to 2005 (minimum of 300 at-bats in that season). This restriction reduces our dataset
for model fitting down to 118 top home run hitters who all have predictions from the
competing methods PECOTA and MARCEL. As noted above, our predicted home run
totals for 2006 are based on the true number of at bats for 2006. In order to have a
fair comparison to external methods such as PECOTA or MARCEL, we also scale the
predictions from these methods by the true number of at bats in 2006.
Our approach has the advantage of producing the full predictive distribution of
future observations (summarized by our predictive intervals). However, the external
methods do not produce comparable intervals, so we only compare to other approaches
in terms of prediction accuracy. We expand our set of accuracy measures to include not
only the root mean square error (RMSE), but also the median absolute error (MAE).
In addition to comparing the predictions from each method using overall error rates, we
also calculated “% BEST” which is, for each method, the percentage of players for which
the predicted home run total Y⋆i,2006 is the closest to the true home run total among all methods. Each of these comparison statistics is given in Table 2. In addition to
giving these validation measures for all 118 players, we also separate our comparison for
young players (age ≤ 26 years in 2006) versus older players (age > 26 years in 2006).
The age cut-off of 26 years was used in order to isolate the small subset of players that
were just beginning their careers and for which each player had little personal history of
performance. It is worth noting that only 8 out of the 118 players (around 7%) in our
2006 test dataset were classified as young by this criterion, so the vast majority (110
out of 118) of players are in the “older” category.
We see from Table 2 that our model is extremely competitive with the external
methods PECOTA and MARCEL. When examining all 118 players, our model has the
smallest median absolute error and the highest “% Best” measure, suggesting that our
predictions are superior on these absolute scales. Our performance is more striking when
we examine only the small subset of young players in our dataset. We have the best
prediction on 62% of all young players, and for these young players, both the RMSE and
MAE from our method is substantially lower than either PECOTA or MARCEL. We
credit this superior performance to our sophisticated hierarchical approach that builds in information via position instead of relying solely on limited past personal performance.

Table 2: Comparison of our model to two external methods on the 2006 predictions of 118 top home run hitters. We also provide this comparison for only young players (age ≤ 26 years) versus only older players (age > 26 years).

             All Players             Young Players           Older Players
Method       RMSE   MAE   % BEST     RMSE   MAE   % BEST     RMSE   MAE   % BEST
Our Model    7.33   4.40  41%        2.62   1.93  62%        7.56   4.48  39%
PECOTA       7.11   4.68  28%        4.62   3.44  0%         7.26   4.79  30%
MARCEL       7.82   4.41  31%        4.15   2.17  38%        8.02   4.57  31%
All eight young players had played three seasons or less before 2006, and six of the
eight players had two seasons or less before 2006. For these players, very little past
information is available about their performance and so the model must rely heavily on
position, where information is shared between players.
However, our method is not completely dominant: we have a larger root mean square
error than PECOTA for older players (and overall), which suggests that our model might
be making large errors on a small number of players. Further investigation shows that
our model commits its largest errors for players in the designated hitter (DH) position.
This is somewhat expected, since our model seems to perform best for young players and
DH is a position almost always occupied by an older player. Beyond this, the model
appears to be over-shrinking predictions for players in the DH role, perhaps because
this player position is rather unique and does not fit our model assumptions as well
as the other positions. Also, PECOTA is a manually-curated system that can account
for the latest information in terms of injuries and playing time adjustments, which
can greatly benefit their predictions. Overall, the validation results are generally very
encouraging for our approach compared to our nearest competitor, MARCEL, as well as
the proprietary system PECOTA. Our performance is especially good among younger
players where a principled balance of positional information with past performance is
most advantageous.
We further investigate our model dynamics among young players by examining how
many years of observed performance are needed to decide that a player is an elite home
run hitter. This question was posited in Section 1 and we now address the question using
our elite status indicators Eij . Taking all 559 available players examined in Section 3.1,
we focus our attention on the subset of players that were determined by our model to be
in the elite group (P(Eij = 1) ≥ 0.5) for at least two years in their career. For each elite
home run hitter, we tabulate the number of years of observed data that were needed
before they were declared elite. The distribution of the number of years needed is given
in Figure 2. We see that although some players are determined to be elite based on
just one year of observed data, most players (74%) need more than one year of observed
performance to determine that they are elite home run hitters. In fact, almost half
of players (46%) need more than two years of observed performance to determine that
they are elite home run hitters.
Figure 2: Distribution of number of seasons of observed data needed to infer elite status
(P(Eij = 1) ≥ 0.5) among all players determined by our model to be elite during their career.
Note that increasing the cut-off for elite states (e.g. P(Eij = 1) ≥ 0.75) shifts the distribution
towards a higher number of seasons needed, whereas decreasing the cut-off for elite states (e.g.
P(Eij = 1) ≥ 0.25) shifts the distribution towards a lower number of seasons needed.
[Figure 2 here: a histogram titled “Distribution of Number of Years Until Elite”; x-axis: Years In Baseball (1–11), y-axis: Frequency.]
We also investigated our model dynamics among older players by examining the
balancing of past consistency with advancing age, which was also posited as a question
in Section 1. Specifically, for the older players (age ≥ 35) in our dataset, we examined
the differences between the 2006 home run rate predictions θ̂i,2006 = E(θi,2006 | X) from our model versus the naive prediction based entirely on the previous year, θ̃i,2006 = Yi,2005/Mi,2005. Is our model contribution for a player (which we define as the difference
between our model prediction θ̂i,2006 and the naive prediction θ̃i,2006 ) more a function
of advancing age or past consistency of that player? Both age and past consistency
(measured as the standard deviation of their past home run rates) were found to be
equally good predictors of our model contribution, which suggests that both sources of
information are being evenly balanced in the predictions produced by our model.
3.3 Age Trajectory Curves
In addition to validating our model in terms of prediction accuracy, we can also examine
the age trajectory curves that are implied by our estimated posterior distribution (8).
We will examine these curves on the scale of the home run rate θij which is a function
of age Aij , ball-park b, and elite status Eij for player i in year j (with position k):
θij = exp[ (1 − Eij)·αk0 + Eij·αk1 + βb + fk(Aij) ] / ( 1 + exp[ (1 − Eij)·αk0 + Eij·αk1 + βb + fk(Aij) ] )    (17)
The shape of these curves can differ by position k, ballpark b and also can differ between
elite and non-elite status as a consequence of having a different additive effect αk0 vs.
αk1 . In Figure 3, we compare the age trajectories for two positions, DH and SS, for
both elite player-years (Eij = 1) vs. non-elite player-years (Eij = 0) for an arbitrary
ballpark. Each graph contains multiple curves (100 in each graph), each of which is the
curve implied by the sampled values (α, γ) from a single iteration of our converged and
thinned Gibbs sampling output. Examining the curves from multiple samples gives us
an indication of the variability in each curve.
We see a tremendous difference between the two positions DH and SS in terms of
the magnitude and shape of their age trajectory curves. This is not surprising, since
home run hitting ability is known to be quite different between designated hitters and
shortstops. In fact, DH and SS were chosen specifically to illustrate the variability
between position with regards to home run hitting. For the DH position, we also see
that elite vs. non-elite status show a substantial difference in the magnitude of the
home run rate, though the overall shape across age is restricted to be the same by the
fact that players of both statuses share the same fk (Aij ) in equation (17). There is
less difference between elite and non-elite status for shortstops, in part due to the lower
range of values for shortstops overall. Not surprisingly, the variability in the curves
grows with the magnitude of the home run rate.
We also perform a comparison across all positions by examining the elite vs. non-elite intercepts (α0, α1) that were allowed to vary by position. We present the posterior
distribution of each elite and non-elite intercept in Figure 4. For easier interpretation,
the values of each αk0 and αk1 have been transformed into the implied home run rate θij
for very young (age = 23) players in our dataset. We see in Figure 4 that the variability is
higher for the elite intercept in each position, and there is even more variability between
positions. The ordering of the positions is not surprising: the corner outfielders and
infielders have much higher home run rates than the middle infielder and centerfielder
positions.
For a player at a specific position, such as DH, our predictions of his home run
rate for a future season is a weighted mixture of elite and non-elite DH curves given
in Figure 3. The amount of weight given to elite vs. non-elite for a given player will
be determined by the full posterior distribution (8) as a function of that player’s past
performance. We illustrate this characteristic of our model in more detail in Figure 5
by examining six different hypothetical scenarios for players at the 2B position. Each
plot in Figure 5 gives several seasons of past performance for a single player, as well
Figure 3: Age trajectories fk(·) for two positions and elite vs. non-elite status. X-axis is age and Y-axis is the rate θij.
Figure 4: Distribution of the elite vs. non-elite intercepts (α0, α1) for each position. The distributions of each (α0, α1) are presented in terms of the home run rate θij for very young (age = 23) players. The posterior mean is given as a black dot, and the 95% posterior interval as a black line.

[Figure 4 here: intervals ordered by position within the elite panel (LF, DH, RF, 1B, 3B, C, SS, 2B, CF) and the non-elite panel (DH, 1B, RF, LF, C, 3B, 2B, SS, CF); x-axis: Home Run Rate (0.02–0.08).]
as predictions for an additional season (age 30). Predictions are given both in terms
of posterior draws of the home run rate as well as the posterior mean of the home run
rate. The elite and non-elite age trajectories for the 2B position are also given in each
plot. We focus first on the left column of plots, which shows hypothetical players with
consistently high (top row), average (middle row), and poor (bottom row) past home
run rates. We see in each of these left-hand plots that our posterior draws (gray dots) for
the next season are a mixture of posterior samples from the elite and non-elite curves,
though each case has a different proportion of elite vs. non-elite, as indicated by the
posterior mean of those draws (black ×).
Now, what would happen if each of these players was not so consistent? In Section 1,
we asked about the effect of a single sub-par year on our model predictions. The plots in
the right column show the same three hypothetical players, but with their most recent
past season replaced by a season with distinctly different (and relatively poor) home
run hitting performance. We see from the resulting posterior means in each case that
only the average player (middle row) has his predictions substantially affected by the
one season of relatively poor performance. Despite the one year of poor performance,
the player in the top row of Figure 5 is still considered to be elite in the vast majority
of posterior draws. Similarly, the player in the bottom row of Figure 5 is going to be
considered non-elite regardless of that one year of extra poor performance. The one
season of poor performance has the most influence on the player in the middle row,
since the model has the most uncertainty with regards to the elite vs. non-elite status
of this average player.
4 Discussion
We have presented a sophisticated Bayesian hierarchical model for home run hitting
among major league baseball players. Our principled approach builds upon information
about past performance, age, position, and home ballpark to estimate the underlying
home run hitting ability of individual players, while sharing information across players.
Our primary outcome of interest is the prediction of future home run hitting, which we
evaluated on a held-out season of data (2006). When compared to the previous methods, PECOTA (Silver 2003) and MARCEL (Tango 2004), we perform well in terms of prediction accuracy, especially on our “% BEST” measure, which tabulates the percentage of players for which our predictions are the closest to the truth. Our prediction accuracy completely dominates the MARCEL procedure, which represents our closest natural competitor, since it is also fully automated and based on publicly-available
data. Our prediction accuracy is also competitive with the proprietary PECOTA system
which is especially impressive given that PECOTA is manually curated based on the
latest information about injuries and playing time. Our approach does especially well
among young players, where a principled balance of positional information with past
performance seems most helpful. In addition, our method has the advantage of estimating the full posterior predictive distribution of each player, which provides additional
information in the form of posterior intervals. Beyond our primary goal of prediction,
our model-based approach also allows us to answer interesting supplemental questions such as the ones posed in Section 1.

Figure 5: Six different hypothetical scenarios for a player at the 2B position. Black curves indicate the elite and non-elite age trajectories for the 2B position. Black points represent several seasons of past performance for a single player. Predictions for an additional season are given as posterior draws (gray points) of the home run rate and the posterior mean of the home run rate (black ×). The left column of plots gives hypothetical players with consistently high (top row), average (middle row), and poor (bottom row) past home run rates. The right column of plots shows the same hypothetical players, but with their most recent past season replaced by a relatively poor home run hitting performance.
We have illustrated our methodology using home runs as the hitting event since
they are a familiar outcome that most readers can calibrate with their own anecdotal
experience. However, our approach could easily be adapted to other hitting outcomes of
interest, such as on-base percentage (rate of hits or walks) which has become a popular
tool for evaluating overall hitting quality. Also, although our procedure is presented in
the context of predicting a single hitting event, we can also extend our methodology
in order to model multiple hitting outcomes simultaneously. In this more general case,
there are several possible outcomes of an at-bat (out, single, double, etc.). Our units
of observation for a given player i in a given year j is now a vector of outcome totals Yij, which can be modeled as a multinomial outcome: Yij ∼ Multinomial(Mij, θij), where Mij is the number of opportunities (at bats) for player i in year j and θij is the vector of player- and year-specific rates for each outcome. Our underlying model for the scalar rate θij as a function of position, ballpark and past performance could be extended to the vector of rates θij. Our preliminary experience with this type of multinomial
model indicates that single-event predictions (such as home runs) are not improved
by considering multiple outcomes simultaneously, though one could argue that a more
honest assessment of the variance in each event would result from acknowledging the
possibility of multiple events from each at-bat.
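For concreteness, a toy sketch of the multinomial outcome model follows; the outcome categories and rates are illustrative only, not estimates from the data.

```python
import numpy as np

rng = np.random.default_rng(4)

# One season for one player under the multinomial extension, with an
# assumed outcome set and illustrative rates theta_ij summing to one.
outcomes = ["out", "single", "double", "triple", "home run", "other"]
theta_ij = np.array([0.70, 0.15, 0.05, 0.005, 0.045, 0.05])
Y_ij = rng.multinomial(500, theta_ij)   # M_ij = 500 at-bats
print(dict(zip(outcomes, Y_ij)))
```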
An important element of our approach was the use of mixture modeling of the player
population to further refine our estimated home run rates. Sophisticated statistical
models have been used previously to model the careers of baseball hitters (Berry et al.
1999), but these approaches have not employed mixtures for the modeling of the player
population. Our internal model comparisons suggest that this mixture model component
is crucial for the accuracy of our model, dominating even information about player
position. Using a mixture of elite and non-elite players limits the shrinkage towards
the population mean of consistently elite home run hitters, leading to more accurate
predictions. Our fully Bayesian approach also allows us to investigate the dynamics of
our elite status indicators directly, as we do in Section 3.2.
In addition to our primary goal of home run prediction, our model also estimates
several secondary parameters of interest. We estimate career trajectories for both elite
and non-elite players within each position. In addition to evaluating the dramatic differences between positions in terms of home run trajectories, our fully Bayesian model
also has the advantage of estimating the variability in these trajectories, as can be seen
in Figure 3. It is worth noting that our age trajectories do not really represent the
typical major league baseball career, especially at the higher values of age. More accurately, our trajectories represent the typical career conditional on the player staying in
baseball, which is one reason why we do not see a dramatic dropoff in Figure 3. Since our
primary goal is prediction, the fact that our trajectories are conditional is acceptable,
since one would presumably only be interested in prediction for baseball players that
are still in the major leagues. However, if one were more interested in estimating unconditional trajectories, then a more sophisticated modeling of the drop-out/censoring
process would be needed.
Our focus in this paper has been the modeling of home run rates θij and so we have
made an assumption throughout our analysis that the number of plate appearances,
or opportunities, for each player is a known quantity. This is a reasonable assumption
when retrospectively estimating past performance, but when predicting future hitting
performance the number of future opportunities is not known. In order to maintain a fair
comparison between our method and previous approaches for prediction of future totals,
we have used the future number of opportunities, which is not a reasonable strategy
for real prediction. A focus of future research is to adapt our sophisticated hierarchical
approach to the modeling and prediction of plate appearances Mij in addition to our
current modeling of hitting rates θij .
References

Berry, S. M., Reese, S., and Larkey, P. D. (1999). “Bridging Different Eras in Sports.” Journal of the American Statistical Association, 94: 661–686.

Brown, L. D. (2008). “In-Season Prediction of Batting Averages: A Field-test of Simple Empirical Bayes and Bayes Methodologies.” Annals of Applied Statistics, 2: 113–152.

Chib, S. (1996). “Calculating posterior distributions and modal estimates in Markov mixture models.” Journal of Econometrics, 75: 79–97.

de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2003). Bayesian Data Analysis. Boca Raton, FL: Chapman and Hall/CRC, 2nd edition.

Gelman, A., Roberts, G., and Gilks, W. (1996). “Efficient Metropolis jumping rules.” In Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.), Bayesian Statistics 5, 599–608. Oxford University Press.

Geman, S. and Geman, D. (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6: 721–741.

Lahman, S. (2006). “Baseball Archive.” Lahman's Baseball Database, Version 5.5. URL http://www.baseball1.com/

Quintana, F. A., Mueller, P., Rosner, G. L., and Munsell, M. (2008). “Semi-parametric Bayesian Inference for Multi-Season Baseball Data.” Bayesian Analysis, 3: 317–338.

Silver, N. (2003). “Introducing PECOTA.” Baseball Prospectus, 2003: 507–514.

Tango, T. (2004). “Marcel The Monkey Forecasting System.” Tangotiger.net, March 10, 2004. URL http://www.tangotiger.net/archives/stud0346.shtml
Acknowledgments
We would like to thank Dylan Small and Larry Brown for helpful discussions.
Bayesian Analysis (2009) 4, Number 4, pp. 653–660
Comment on Article by Jensen et al.
Jim Albert∗ and Phil Birnbaum†
1 Introduction
Prediction of future batting performance is an important problem in baseball. Due to
trades and the free agent system, there is a good deal of player movement between teams in
the “hot-stove league” (the baseball off-season) and teams will acquire new players with
the hope that they will achieve particular performances in the following season. The
authors propose a Bayesian hierarchical modeling framework for estimating home run
hitting probabilities and making predictions of future home run hitting performance.
Generally, this is an attractive methodology, especially when one is collecting data
from many players who have similar home run hitting abilities. By use of hierarchical
modeling, the estimates of the home run probabilities shrink or adjust the observed
rates towards a combined regression estimate. One attractive feature of the Bayesian
approach is that it is straightforward to obtain predictions from the posterior predictive
distribution and the authors test the value of their method by comparing it with two
alternative prediction systems MARCEL and PECOTA. It is straightforward to fit these
hierarchical models by MCMC algorithms and the authors provide the details of this
fitting algorithm.
Although we admire the authors’ paper from a Bayesian modeling/computation
perspective, it seems deficient from the application (baseball) perspective. There is substantial research on home run hitting and on the modeling of career trajectories of
ballplayers and we believe this research should be helpful in defining relevant covariates
and proposing realistic models for trajectories. In the following comments, we discuss
several concerns with the basic modeling framework, focus on the choice of suitable
adjustments and suggest a more flexible framework for modeling career trajectories.
2 Data
The authors use data from the Lahman database, where the counts of home runs and at-bats are collected for each player for each season in the period 1990 to 2005. Although
this is a rich dataset, we are puzzled that the authors did not use the more detailed
play-by-play data available from the Retrosheet organization (www.retrosheet.org).
This dataset is easy to access and manipulate. As will be seen shortly, this richer dataset
would allow for the inclusion of suitable covariates in the adjustment of the home run
rates.
∗ Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH,
http://www-math.bgsu.edu/~albert/
† Society for American Baseball Research, http://philbirnbaum.com/
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA424A
3 Adjustments for Home Run Rates
In comparing baseball hitters across eras, Schell (2005) explains the importance of
adjusting home run rates for the era of play, the distribution of league-wide talent, the
ballpark effect, and a player’s late-career decline. Adjustments for league-wide talent
and the ballpark are also crucial in the modeling of a player’s hitting trajectory and
the prediction of future performance. There have been dramatic changes in home run
hitting from 1990 to 2005. The overall major league home run rate increased by 26%
between 1992 and 1993, and the rate has shown a 50% increase over this 15-year period.
Schell documents the significant impact of ballparks on the pattern of home run hitting.
In the current baseball season, it appears to be much easier to hit home runs in the new
Yankee Stadium in New York. The park factor for the new Yankee Stadium is currently
1.295, which means that the rate of home run hitting in the Yankee home games is
about 30% higher than the rate of home run hitting in the Yankees away games.
One can understand changes in league-wide hitting talent by the fitting of a random
effects model. For a given season, we observe the number of home runs and at-bats
(yi , ni ) for all batters. We assume that yi is binomial(ni , pi ) and then we assume the
home run probabilities {pi } follow a beta distribution with shape parameters a and b.
The fitted values â and b̂ are informative about the location and shape of the home
run abilities of the batters. This random effects model is fit separately for each season,
obtaining estimates âj and b̂j for season j. The top graph in Figure 1 displays the
median home run ability of the players for the seasons 1990 to 2005, and the bottom
graph plots the interquartile spread of the home run ability distribution against season.
This figure shows dramatic changes in the location and spread of talent of hitting home
runs in this 15 year period. One way of adjusting a player’s season home run rate
compares his rate relative to the distribution of home run rates hit for that particular
season. Specifically, one can compute a predictive standardized score as in Albert (2009)
using the average and standard deviation of the predictive distribution.
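For readers who wish to reproduce this kind of season-by-season summary, the following sketch fits the beta shape parameters by marginal (beta-binomial) maximum likelihood; the starting values are arbitrary and the interface is our own, not taken from the comment.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_beta_binomial(y, n):
    # Marginal likelihood of y_i ~ Binomial(n_i, p_i), p_i ~ Beta(a, b);
    # the binomial coefficient is constant in (a, b) and can be dropped.
    def neg_loglik(log_ab):
        a, b = np.exp(log_ab)
        return -np.sum(betaln(y + a, n - y + b) - betaln(a, b))
    res = minimize(neg_loglik, x0=np.log([2.0, 60.0]), method="Nelder-Mead")
    return np.exp(res.x)  # (a_hat, b_hat) for one season
```

Running this separately for each season yields the âj and b̂j whose implied medians and interquartile spreads are plotted in the commenters' Figure 1.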
The paper does include some adjustments in the regression model (2), specifically covariates for the home ballpark and fielding position. As the authors explain, the data do not break down a player's home run totals by home and away games, and so the "home ballpark" covariate actually confounds two variables: the ballpark effect and the
team hitting ability. One could define a true ballpark effect by using the Retrosheet
data. We are puzzled by the inclusion of the fielding position covariate. Although
there are some tendencies, for example, first-basemen tend to hit more home runs than
second-basemen, modern hitters of all non-pitching positions are proficient in hitting
home runs. Why do the authors believe that fielding position is an important covariate?
More importantly, why do the authors believe that players of different positions have
different home run trajectories?
Another possible regression adjustment is the number of opportunities AB. There
is a general positive correlation between AB and home run rate – players with more
at-bats tend to hit a higher rate of home runs. Also, if a young player has a limited
number of AB one season, it is more likely that he will have a small number of home
runs and be sent back to the minors the following season. Also, the number of AB and the player's career trajectory provide a good prediction of the player's AB in a future season. (The authors assume that the player's 2006 AB is the same as the AB in the previous season.)
4 Elite/Non-Elite Players
The authors introduce a latent elite variable in their model with the justification that "there exists a sub-group of elite home run hitters within each position that share
a higher mean home run rate”. The authors do not present any evidence in the paper
that home run rates cluster in two groups of non-elite players and elite players. In our
exploration of these data, there appears to be a continuum of home run ability that is
right skewed with a few possible large outliers. It seems that the latent elite variable is
introduced not because the data suggest two clusters, but rather to induce some
dependence in the home run rates for the same player. There is a more straightforward
way to model this dependence, specifically to assume that each player has a unique trajectory, where the individual player regression coefficient vectors are assumed to follow
a common distribution. This comment relates to the authors’ approach for modeling
trajectories which will be described next.
5 Modeling Career Trajectories
In the motivation for the career trajectories, the authors say that they “favor an approach that involves fewer parameters (to prevent over-fitting)”. But they make the
very restrictive assumption that players of a particular fielding position share the same
career trajectory. This assumption does not reflect the variable trajectory patterns of
home run hitting. To illustrate the variability in trajectories, consider the home run
hitting patterns of the Hall of Fame players Mickey Mantle and Hank Aaron, both of whom played the same outfield position in the same era. Figure 2 plots standardized home run rates for both players as a function of age, where the rates have been
standardized using the predictive distribution as described above. Note that Mantle
peaked in his late 20’s and declined quickly until retirement. In contrast, Aaron peaked
in home run hitting ability much later in his career and showed a more gradual decline
towards the end of his career.
It can be difficult to estimate the player trajectories individually using regression
models due to the high variability of the observed rates as shown in Figure 1. But one
can obtain good smoothed estimates of the individual trajectories by use of a multilevel
model. If the vector of regression coefficients for the ith player is represented by βi, then one can assume that the {βi} are a random sample from a common normal distribution with mean vector β and variance-covariance matrix Σ, and the hyperparameters β, Σ are assigned a vague prior at the second stage. The posterior estimates smooth the
individual trajectory estimates towards a common trajectory. This multilevel model
is shown to be successful in smoothing trajectories of WHIP (walks plus hits per inning pitched) rates for
pitchers in Albert (2009). We have also used it for estimating trajectories of batter on-base percentages, and we would expect similar good results for estimating trajectories of
home run rates. This analysis would lead to more realistic estimates of career trajectories
and likely better predictions of future home run hitting. Certainly, one should make
different predictions for the home run hitting for a 35-year old Mickey Mantle and a
35-year old Hank Aaron since their patterns of decline were very different.
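In symbols, the multilevel model we have in mind is (our formulation, with a binomial likelihood and logistic link to match the setting of the paper)

\[
y_{ij} \mid p_{ij} \sim \text{Binomial}(n_{ij}, p_{ij}), \qquad \operatorname{logit}(p_{ij}) = x_{ij}^T \beta_i, \qquad \beta_i \mid \beta, \Sigma \sim \mathrm{N}(\beta, \Sigma),
\]

where x_{ij} collects the age terms for player i in season j. The posterior mean of βi is a weighted compromise between the player's own trajectory and the common trajectory determined by β, with heavier shrinkage for players with few seasons of data.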
6 A Sabermetrics Perspective
Sabermetrics is the scientific search for objective knowledge about baseball, and the search for better predictions of future performance is certainly something that sabermetricians (especially those who may be employed by major league clubs) are interested in. But they are concerned with more than just accurate predictions; they are concerned with what the projection reveals about players and changes in their performance.
Bill James, in a discussion about the existence of clutch hitting in James (1984),
says “How is it that a player who possesses the reflexes and the batting stroke and
the knowledge and the experience to be a .260 hitter in other circumstances magically
becomes a .300 hitter when the game is on the line? How does that happen? What
is the process? What are the effects? Until we can answer those questions, I see little
point in talking about clutch ability.” Likewise, sabermetricians are interested in the
process that leads to a prediction of home run hitting.
Sabermetricians are unsatisfied with mere predictions, no matter how accurate.
Given an accurate prediction of future performance, they ask, “what is it about that
prediction that makes it accurate? What does it tell us about the relationship of past
performance to future performance?”
One attractive feature of MARCEL is that it gives us clues to what might be going on.
Tango (2004) gives the full MARCEL algorithm, in which we can see the assumptions
that went into the formula. We see how it weights recent performance relative to more distant performance, how much one should regress to the mean, and how one adjusts the predictions for changes in league norms. These individual assumptions
can be adjusted in order to minimize prediction error, and, in so doing, we would come
closer to learning objective information about player hitting.
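For readers unfamiliar with MARCEL, the following Python sketch conveys the flavor of such a forecast. The 5/4/3 season weights and the regression "ballast" are the commonly cited ingredients; the exact constants and the simple age adjustment below are illustrative stand-ins rather than Tango's published values.

```python
import numpy as np

def marcel_style_rate(rates, at_bats, league_rate, age,
                      weights=(5.0, 4.0, 3.0), ballast=1200.0, peak=29):
    """Stylized MARCEL-type forecast of next season's home run rate.

    rates, at_bats: the player's last three seasons, most recent first.
    Constants are illustrative; see Tango (2004) for the real algorithm.
    """
    w = np.asarray(weights) * np.asarray(at_bats, dtype=float)
    blended = np.sum(w * np.asarray(rates)) / w.sum()   # weighted recent rate
    # Regress toward the league mean: the ballast acts like prior at-bats
    # of league-average performance.
    shrunk = (blended * w.sum() + league_rate * ballast) / (w.sum() + ballast)
    # Simple age adjustment around a nominal peak age.
    adj = 1 + 0.006 * (peak - age) if age < peak else 1 - 0.003 * (age - peak)
    return shrunk * adj

# A 26-year-old with rates .050/.045/.040 over 550/520/480 at-bats:
print(marcel_style_rate([0.050, 0.045, 0.040], [550, 520, 480], 0.032, 26))
```

Every ingredient is visible and tunable, which is precisely the transparency being praised here.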
The Bayesian modeling approach presented in this paper, however, is more complex
and opaque. It performs only marginally better than MARCEL, while using more
information such as home team scoring and player position. It is uncertain what an
experienced sabermetrician would learn from the Bayesian process, and it is uncertain
whether the (marginally) improved predictions are the result of a better model, or
simply the result of the additional information being used.
Further, while the Bayesian model has shown itself to be successful in predicting, certain of its assumptions are almost certainly false. As has been noted, the classification of hitters into only two categories, elite and non-elite, is certainly false, as home-run-hitting ability appears to be a continuum; there is no evidence that the distribution of home run rates, even by position, is bimodal.
The fact that the Bayesian model gives reasonable estimates cannot be taken as
evidence that the assumptions are correct. For instance, a black-box model that predicts swine-flu infection rates may be valuable, but only if the assumptions that went into the model are correct is the model useful in predicting future outbreaks. If the assumptions are incorrect, the predictions based on the model may be inaccurate.
Sabermetricians would be very interested in the success of the Bayesian model in predicting home run rates for younger hitters; as Table 2 of the paper shows, the Bayesian
algorithm beats MARCEL 62% of the time, and beats PECOTA 100% of the time. We
note, however, that this is based only on a sample of eight players. Still, one could discover possible attributes of the prediction methodology by a case-by-case exploration. It would be useful to see the full list of players and their estimates, along with a discussion of what kinds of players, such as power hitters or high-average players, are better estimated than other types of players. This would provide a useful comparison of the
methods, and provide a direction for future research to improve the knowledge that the
field of sabermetrics has compiled about the aging process.
As it stands now, the Bayesian method has made sabermetrics aware that slight
improvements over MARCEL are possible, but, without further exploration, we are
left with little understanding of where the improvements came from, where MARCEL
is weak, what assumptions need to be refined, or, indeed, how the aging process in
baseball can better be explained.
Figure 1: Fitted home run talent distributions for the seasons 1990 to 2005. The top graph displays the median home run ability and the bottom graph displays the interquartile range of the talent distribution.
Figure 2: Standardized home run rates for Mickey Mantle and Hank Aaron plotted as a function of age. The lowess smooths show that the home run trajectories of the two players were significantly different.
7 Summing Up
The authors have proposed a useful hierarchical modeling framework and illustrated the
potential benefits of Bayesian modeling in predicting future home run counts. But we
believe the methods could be substantially improved by the proper adjustment of the
home run rates, the inclusion of useful covariates, and more realistic modeling of the
career trajectories. From the viewpoint of a baseball general manager, the prediction
of a particular player’s future performance is very important and it seems that this
prediction has to allow for the player's unique career trajectory pattern. For the problem of individual predictions, we don't believe this methodology will be very helpful, since all players of a particular fielding position are assumed to have the same trajectory and are lumped into the broad elite/non-elite classes. But we do believe that this general
approach, with the changes described above, can be used to make helpful predictions of
offensive performance.
References

Albert, J. (2009). "Is Roger Clemens' WHIP Trajectory Unusual?" Chance, 22: 8–22.

James, B. (1984). The 1984 Baseball Abstract. Ballantine Books.

Schell, M. (2005). Baseball's All-Time Best Sluggers: Adjusted Batting Performance from Strikeouts to Home Runs. Princeton University Press.

Tango, T. (2004). "Marcel the Monkey Forecasting System." http://www.tangotiger.net/archives/stud0346.shtml
Bayesian Analysis (2009) 4, Number 4, pp. 661–664
Comment on Article by Jensen et al.
Mark E. Glickman∗
I offer my congratulations to Jensen, McShane and Wyner (hereafter JMW) on
their paper modeling home run frequencies of Major League Baseball (MLB) players.
It is always refreshing to read such a clearly written, well-organized paper on a topic
of interest to a broad audience and one that illustrates cutting edge modeling and
computational tools in Bayesian Statistics. It is also worth noting that the first author
is becoming an accomplished researcher in quantitative aspects of baseball, most recently
having developed complex statistical models for evaluating fielding (Jensen et al. 2009).
The current paper adds to his accruing and impressive list of work on Statistics in
sports.
In the current paper, the authors develop and investigate a model for home run
frequencies for MLB seasons from 1990 through 2005 based on publicly available data.
The data contains player performance information aggregated by season, so examining
within-season variation is not possible. Home run frequencies for a player within a
season are modeled as binomial counts (out of the total number of at-bats, appropriately
defined), and the probability of a home run during a season is a function of the player’s
position, team, and age. The authors make some interesting specific assumptions that
result in a unique model. First, they posit that the effect of age on the log-odds of
the probability of a home run follows a cubic B-spline relationship for a given field
position. Second, they assume a latent categorization of each player in a given season
as elite versus non-elite, essentially treating a player’s home run frequency as a mixture
of two binomial components with different probabilities. Third, the latent elite status
for each player is assumed to follow a Markov process with transition probabilities that
are common for all players at the given field position. The authors also investigate
a generalization of their basic model in which the transition probabilities can vary
by player through model components specific to players at that position. The entire
model is fit via MCMC simulation from the posterior distribution, and performance
of their approach is evaluated through measures that compare model predictions in
2006 to observed home run frequencies. They conclude that their basic model fares
well against existing competitor approaches that are not nearly as sophisticated. The
authors deserve credit for constructing a model that is competitive with one that makes
use of data obtained on a daily basis. It is also particularly impressive that their model
predicts well given the paucity of covariate information.
One can raise minor quibbles with the authors’ approach, but many of the concerns
are an artifact of the constraints on the data available to them. For example, the inability to account for within-season variation strikes me as a clear deficiency in modeling home
run probabilities. Given that players are generally improving from year to year in their
twenties, it is not unreasonable to speculate that some of this improvement is occurring
within a season rather than between seasons. Because the data JMW use is aggregated
∗ Boston University School of Public Health, Boston, MA, mailto:mg@bu.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA424B
by season, it is impossible to infer such changes. The authors also incorporate a team
indicator in their model, which ostensibly is a proxy for playing half of the time in their
own ballpark, though this does not account for minor artifacts such as within-season
player trades. As JMW note, this team parameter may be difficult to interpret when
it applies to a whole season of games. If individual game-specific data were available,
then the impact of the actual ballpark could be incorporated into the model which
may have a profound effect on inferences. My own bias is to wonder whether modeling
and predicting home run frequencies is a question that baseball front office staff or
other professionals really want answered. While forecasting home run probabilities
seems like an interesting theoretical question, various metrics to measure hitting rates
might be of greater practical utility. The authors do mention at the conclusion of the
paper their interest in pursuing such activities. I also found it curious that the expanded
model involving Markov transition probabilities that varied by player produced worse
predictions than the simpler model in which the transition probabilities were constrained
to vary only by player position. This may suggest either a model not
sufficiently capturing important features of the data, or an expanded model that is too
highly parameterized.
To me, the most interesting aspect of the paper is the decision to incorporate a latent
indicator of elite status into the model, and the accompanying stochastic process. On
the one hand, JMW are able to account for variation in home run rates and improve
predictions by introducing a 2-state hidden Markov model (HMM). One clear benefit of
incorporating this model component is that it allows answering questions about when
certain players can be considered elite versus non-elite. On the other hand, I wonder
whether a 2-state Markov model is the most appropriate and most flexible for predicting
home run frequencies. The authors consider a HMM in which players at the same
position share the same transition probabilities, and another in which the transition
probabilities vary by player but are centered at position-specific distributions. In both
cases, the size of the effect of being elite for all players at the specified position is the
same. I realize that JMW are focused on keeping the model as simply parameterized as
possible, but the question arises whether accuracy (especially predictive accuracy, one
of the main implied goals of the paper) is being sacrificed. Given that all the parameters
of the HMM are integrated out of the posterior distribution in making predictions, it
is the structure of the HMM that is most crucial, and not inferences about any of the
HMM parameters.
The authors’ HMM assumes that players at any given time are in one of two states,
once accounting for age, position and team. However, it strikes me that player effects
(beyond the effect of age, position and team) more justifiably fall on a continuum. A
natural way to modify JMW’s model is to assume
\[
\operatorname{logit}\,\theta_{ijkb} = \alpha_k + \beta_b + f_k(A_{ij}) + \delta_{ijk} \tag{1}
\]
where θijkb is the home run probability for player i with home ballpark b in season j
at position k; αk , βb and fk (Aij ) are as defined in JMW; and δijk is a player-specific
effect following a stochastic process with a continuous state-space, such as
\[
\delta_{ijk} \sim \mathrm{N}(\delta_{i,j-1,k}, \psi^2), \tag{2}
\]
where initial player effects may be assumed drawn from a common distribution centered
at a position-specific model component,
\[
\delta_{i1k} \sim \mathrm{N}(\eta_k, \phi^2) \tag{3}
\]
with position-specific effects ηk . This model assumes that, beyond the effects of ballpark, position and age, an individual player effect in a given season is drawn from a
distribution centered at last season’s mean, thus inducing a time-correlation particular
to that player. Such an approach can represent trajectories of not only elite players, but
also better-than-average players as well as worse-than-average players. Similar models
for binomial data in a game/sports context have been examined by Fahrmeir and Tutz
(1994) and Glickman (1999), among others, though these approaches do not include an
additive spline component for age. Various changes to the assumptions in (2) and (3)
could be considered, such as having the innovation variance, ψ 2 , depend on player position (that is, ψk2 ), the transition model could be heavy-tailed, such as a t-distribution
instead of normal (which would account for occasional bursts of improvement in home
run probability), or having the prior variance, φ2 , depend on the player position (that
is, φ2k ).
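To see what this implies, here is a small Python simulation of (1)-(3); collapsing αk + βb + fk(Aij) into a single fixed offset, and the particular values of ψ, φ and ηk, are illustrative assumptions rather than estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_players, n_seasons = 50, 10
psi, phi, eta_k = 0.15, 0.40, -0.50   # innovation sd, initial sd, position mean

# Latent player effects: eq. (3) initializes, eq. (2) evolves as a random walk.
delta = np.empty((n_players, n_seasons))
delta[:, 0] = rng.normal(eta_k, phi, n_players)
for j in range(1, n_seasons):
    delta[:, j] = rng.normal(delta[:, j - 1], psi)

offset = -2.8  # stand-in for alpha_k + beta_b + f_k(A_ij), held fixed here
theta = 1.0 / (1.0 + np.exp(-(offset + delta)))      # eq. (1), inverse logit
at_bats = rng.integers(200, 600, (n_players, n_seasons))
home_runs = rng.binomial(at_bats, theta)             # season-level counts
```

Each player's home run probability now drifts smoothly across seasons rather than switching between two levels, which is the essential contrast with the 2-state HMM.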
An advantage to a continuous state-space compared to a 2-state system is that it
recognizes varying degrees of improvement and worsening over time beyond what is
captured by age-specific effects. Substituting the HMM in the authors’ framework with
that in (2) should involve straightforward modifications to the MCMC algorithm, so
the computational details ought to involve tractable calculations. Again, because the
parameters of a continuous state-space model are integrated out of the posterior distribution to obtain predictive inferences, or even age-curve estimates, the richer structure
compared to the 2-state HMM may result in more reliable inferences. The richer structure may also more appropriately calibrate the levels of uncertainty in predictions which
appear overly conservative as evidenced in Table 1 of their paper. Of course, one needs
to fit such a model to the data to be convinced of such speculation.
Notwithstanding some of my suggestions for alternative directions the authors could
take in further refining their model, I think that their approach makes an important
contribution to a growing literature on sophisticated methods in analyzing sports data.
Modeling the effect of age through a cubic B-spline is a nice feature of their approach,
and accounting for time dependence in home run rates through a hidden Markov model
is a novel addition, even though my feeling is that a continuous state-space Markov
model may be more promising. I look forward to the continued success and insightful
work from this productive group of researchers.
References

Fahrmeir, L. and Tutz, G. (1994). "Dynamic stochastic models for time-dependent ordered paired comparison systems." Journal of the American Statistical Association, 89: 1438–1449.

Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments." Applied Statistics, 48: 377–394.

Jensen, S. T., Shirley, K., and Wyner, A. J. (2009). "Bayesball: a Bayesian hierarchical model for evaluating fielding in major league baseball." Annals of Applied Statistics, 3: 491–520.
Bayesian Analysis (2009) 4, Number 4, pp. 665–668
Comment on Article by Jensen et al.
Fernando A. Quintana∗ and Peter Müller†

1 Introduction
We congratulate Shane T. Jensen, Blakeley B. McShane and Abraham J. Wyner (henceforth
JMW) for a very well written and interesting modeling and analysis of hitting performance for Major League Baseball players. JMW proposed a hierarchical model for data
extracted from the Lahman Baseball Database. They model the player/year-specific
home run rate using covariate information such as the player’s age, home ballpark,
and position. The proposed approach successfully strikes a balance of parsimonious
assumptions where detail does not matter versus structure where it is important for the
underlying decision problem. An interesting feature of the model is the time-dependence
that is induced by assuming the existence of a hidden Markov chain that drives the transition of players between “elite” and “non-elite” conditions. In the former case, JMW
postulate that the home run rate is increased by a certain position-dependent quantity. The model is used to predict home run totals for the 2006 season, and the results are compared to some external methods (MARCEL and PECOTA). The comparison gives somewhat mixed results, with the proposed method rating generally well compared to its competitors.
2 Some general comments
Inference for the Lahman baseball data raises a number of practical challenges. The
data include records on over 2,000 players, but for many of them there is information for
only a couple of years. In many cases there are several years with missing information.
As usual in sports data, there is tremendous heterogeneity and imbalance among the
experimental units (players). We suspect this is partly the reason why the focus is
on predictions for a subset of players. However, this opens the question of whether
the model actually provides a good fit for all the players. We believe an interesting
challenge is to extend the modeling approach to larger subsets, and maybe all players.
For such extended inference the model needs to be extended to properly reflect the
increased heterogeneity across all players. We propose a possible approach below. Also,
the inference focus would shift from prediction to more emphasis on an explanatory
model.
Model (2) and the proposed variations have the interesting feature of incorporating
in the home run rates θij an explicit dependence on player position k, home ballpark
∗ Departamento de Estadística, Pontificia Universidad Católica de Chile, Chile, mailto:quintana@mat.puc.cl
† Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, mailto:pmueller@mdanderson.org
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA424C
b and a smooth position-specific age trajectory, expressed as a hypothesized linear combination on the logit scale. The smooth function of age seems to capture interesting nonlinear features of the evolution of home run rates over time, as seen in Figures 3 and 5.
One may even venture the existence of an “optimal” age for hitting, and a natural decay
in abilities with progressing age. In fact, such conclusions have been reached elsewhere,
and even if not the target of this work, it is a nice feature of the analysis that the same
kind of findings are uncovered.
The hidden Markov model for “elite” status is the model component that is responsible for introducing dependence across seasons for a given player. The extended
model allows for player-specific transition parameters, i.e., individual trajectories for the binary elite indicator variables. Concretely, JMW assume the parameters (ν₀₀ⁱ, ν₁₁ⁱ) controlling these transitions to be a priori independent and Beta-distributed, with conditional independence across players sharing the same position k. These assumptions
imply flexibility in the evolution of the {Eij } elite indicators, which are well defined
regardless of missing data patterns along the sequences of home runs. Looking at the
results of the analysis, it is quite remarkable that a large number of players achieve elite
status after only one or two major league seasons, as seen in Figure 2. Intuitively one
would have expected a peak more likely around 3-5 years. JMW seem to be equally
surprised at such findings, when they comment that the sum over years 2 through 11
still represents 75% of the cases considered.
Another consequence of the elite/non-elite model is that the effect on home run rates
θij is only through a position-specific added term αk = αk0 (1 − Eij ) + αk1 Eij on the
logit scale. While this has the advantage of borrowing strength across players with the
same position, it may not be flexible enough to capture highly heterogeneous home run
profiles.
3 Extending the proposed approach
The latent elite indicator Eij defines a mixture model for the observed home run totals.
The use of Eij is an elegant way to formalize inference about top players. The model
balances parsimony with sufficient structure to achieve the desired inference. The authors correctly point out some of the remaining limitations. Perhaps the most important
limitation is that the model reduces the heterogeneity of the population of all players
to a mixture of only two homogeneous subpopulations. This is particularly of concern
in the light of the underlying decision problem. The resulting inference only informs
us about the probability of a player being in the elite group. Some evidence for more
heterogeneity beyond the mixture of only two subpopulations is seen in Figure 4. The
wide separation of the credible intervals suggests scope for intermediate performance
groups in the model. The population of players is highly heterogeneous, but not in such
a sharply bimodal fashion. It is also interesting to note in the same figure the almost
preserved ordering across positions between elite and non-elite groups.
A minor extension of the model could generalize the mixture to a random partition
into H subpopulations, which could help close the gap just pointed out. Each cluster
could have a cluster-specific set of intercepts αkh , h = 0, . . . , H − 1 for the logistic
regression prior (2) of player-season home run rates θij. As in JMW's model, the intercepts remain ordered αkh ≤ αk,h+1, k = 1, . . . , 9. This allows us to interpret the cluster labels h = 0, . . . , H − 1 as latent player performance.
Formally the model extension would replace (2) by
\[
\operatorname{logit}(\theta_{ij}) = \alpha_{kh} + \beta_b + f_k(A_{ij}), \tag{1}
\]
where βb and fk (Aij ) are as earlier, and h = Eij is the imputed cluster membership
for player i in season j. The prior for αk = (αkh , h = 0, . . . , H − 1) is similar to (9),
now for the H-dimensional vector αk. The prior for the latent cluster membership Eij
remains as in (3), extended to transitions between H states. The number of transition
parameters νrs remains unchanged with prior probability ν01 for upgrades in elite level,
prior probability ν10 for downgrades and ν00 for the probability of remaining in state
Eij = 0 and ν11 for the probability of remaining in a performance state E > 0. As in (7), the transition probabilities are position-specific.
The number of states H would itself be treated as unknown, with a geometric prior
p(H) = (1 − p)^{H−1} p and a hyperparameter p. The only additional step in the MCMC
implementation is a transition probability to change H. We consider two transitions,
“birth” of an additional performance level by splitting an existing level h into two new
levels and the reverse “death” move. This could be implemented as a reversible jump
move.
The generalized model defines a random partition of the player-years (ij) into performance clusters h = 0, . . . , H − 1. The unique features of this random partition model
would be the ordering of the clusters and the dependence across j. Both features are
naturally accommodated by the outlined model-based clustering. We see it as an interesting and challenging application of model-based clustering. In contrast to much of the recent Bayesian literature on clustering, the use of standard clustering models such as the ubiquitous Polya urn would be inappropriate here. The Polya urn
model does not naturally allow the desired ordering of cluster-specific parameters and
time-dependence of cluster membership indicators.
4 Final words
We realize the above proposal can be extended/modified in many different ways, the
main point being the possibility of improving on the analysis and model proposed by
JMW. Our aim here was not to criticize the model but to help improve it. We indeed
think the hidden Markov component is a very nice feature which, combined with a flexible extension, could motivate further analysis of the data under a more general
framework.
Acknowledgments
Fernando Quintana was partially funded by Fondecyt grant 1060729.
Bayesian Analysis (2009) 4, Number 4, pp. 669–674
Rejoinder
Shane T. Jensen∗ , Blakeley B. McShane† and Abraham J. Wyner‡
We thank each discussant for his insightful comments and suggestions for improvement. We are pleased by the positive reception of our current endeavor towards model-based prediction of hitting performance. It is our belief that academic statisticians can
serve a leadership role in the transition of quantitative analysis of baseball from simple
tabulations to sophisticated model-based approaches.
1 Alternative Models for Latent Variables
A clear theme of this discussion is the flexibility of the Bayesian hierarchical framework
as a principled means for prediction in this application. Of course, the other side of
that coin is that our model can always be improved by more sophisticated extensions.
The discussants offer several valuable suggestions for improvements to our methodology. A first step in this effort is suggested by multiple discussants: extensions of the latent "elite" mixture model. These proposals are promising directions for future research, and we briefly discuss the prospects of each below.
Albert & Birnbaum question our employment of a latent mixture model, citing the
fact that these mixture components are not self-evident from the raw home-run rate
distributions. However, they also note the presence of skewness and outliers. We argue
that latent mixture models are a common strategy for addressing skewness and outliers.
In fact, our original motivation for a latent mixture model was the observation that
hitters with consistently high home run rates were over-shrunk in a model that did not
allow for subpopulations of extreme home run performance.
Both Quintana & Müller and Glickman discuss the limitation of our mixture model to
two latent states. In our original analysis, we experimented with the addition of a third
latent state which was intended to capture players that showed inferior performance
relative to their position. However, the estimated models that included this third state
did not show any greater predictive power than the two-state model.
Quintana & Müller suggest a more comprehensive amelioration of our mixture model:
allowing the number of latent states to be unknown and estimated. Certainly, this proposal is the most natural extension of the current approach and would help address the
concerns raised by the discussants about the imposed “elite” vs. “non-elite” framework.
The hurdle would be implementation of this more complicated model, as the reversible-jump approach proposed by Quintana & Müller could be difficult in practice.
∗ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, mailto:stjensen@wharton.upenn.edu
† Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, mailto:mcshaneb@wharton.upenn.edu
‡ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, mailto:ajw@wharton.upenn.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA424REJ
Glickman proposes a model extension that is further afield. Instead of a discrete state
space model, he proposes a latent state that evolves continuously in an autoregressive
fashion. In our opinion, this continuous state space model would perform well for players
with a long and consistent history of performance. However, we are skeptical there would
be enough autocorrelated signal for younger players with very little personal history.
For these cases with sparser information, we believe our simpler model is better able to
pool information between players.
We have a similar concern about Albert & Birnbaum’s proposal to fit random effects
for each player. We concede that players (at the same position) can have very different
trajectories, as illustrated by their comparison of Mickey Mantle and Hank Aaron.
However, although there is enough information to model players with long careers in
this way, we suspect that these random effects would be too variable for players who
have only played a few seasons. For such players, the enforced shrinkage of our model
is beneficial.
Furthermore, while the selection of Mantle and Aaron nicely illustrates the benefits of
modeling trajectories individually, it also illustrates some of the pitfalls. Though Mantle
and Aaron were both towering sluggers of their era, we contend that both players are
unusually deviant from what is generally observed and their careers represent extreme
points in the space of individual trajectories. Mantle suffered a precipitous decline
due to debilitating injury while Aaron had an almost miraculously steady and lengthy
career.
Thus, we are not sure it is a criticism to point out that we would have failed to
predict Aaron’s unusual performance into his forties or Mantle’s steep early decline,
unaided by health information. For the purposes of prediction, discounting unusual
individual career trajectories and being guided mainly by position is a sound strategy,
and we remind the reader that center fielders like Mantle are more likely to experience
sharp declines in production than corner outfielders like Aaron. That said, the random
effects framework is a great idea, and we are currently investigating extending our model
to allow more flexible trajectories within each position.
There are of course many other generalizations and improvements not raised by
discussants which we will consider in future work. Most promising is the extension of
the usual first order Markov model to higher order or even variable order. This direction
has the potential to more accurately model an individual player’s trajectory.
2 Position and Other Potential Covariates
Beyond the latent mixture model, the discussants provide several suggestions for additional data and/or covariates that could further improve our predictions. Specifically, Albert & Birnbaum suggest the Retrosheet database, which provides more detailed within-season information for each player. We agree that the additional detail within the Retrosheet database could improve our modeling efforts. One immediate advance,
as proposed by Albert & Birnbaum, would be to divide each hitter’s season into home
Figure 1: Boxplots of empirical home run rates by position. Each point gives HR/AB for a
given player-season for all player-seasons with 300 or more at-bats from 1990-2006.
Table 1: Analysis of Variance Table

Source       DF    Sum Sq.   Mean Sq.    F Ratio    Prob > F
Position      8    0.31486   0.03936    120.6345    <2e-16
Year         16    0.05922   0.00370     11.3446    <2e-16
Age          25    0.00670   0.00027      0.8214    0.7178
Residuals  3801    1.24009   0.00033
versus away games, thus enabling the estimation of true ballpark effects. We would
favor estimation of ballpark effects in this way rather than the use of external park
factors, which is also proposed by Albert & Birnbaum. In our experience, external park
factors are highly inconsistent from year to year and do not seem to contain much signal
except in some extreme cases (e.g., Coors Field or Citizens Bank Park).
Albert & Birnbaum question the use of position as a covariate in our model, claiming
that it is not immediately evident what information is being added by position. They
are correct to assert that there is heterogeneity of home run talent within each position,
but there is large variation in home run rates across position as can be seen in Figure 1.
In fact, we perform an analysis of the variance of home run rates by the nine positions,
seventeen years, and twenty-six ages in our dataset in Table 1. Position accounts for
20% of the total variation in home run rates, far more than any other factor.
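This variance decomposition is a standard sequential ANOVA; a sketch of the computation in Python, on a synthetic stand-in for the player-season data frame, is below (the real analysis uses the Lahman-derived rates):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic stand-in: one row per player-season (rate = HR/AB, position,
# year, age); replace with the actual 1990-2006 data with AB >= 300.
rng = np.random.default_rng(0)
n = 3851
df = pd.DataFrame({
    "position": rng.integers(1, 10, n),
    "year": rng.integers(1990, 2007, n),
    "age": rng.integers(20, 46, n),
})
df["rate"] = rng.normal(0.01 * df["position"], 0.018).clip(0)

fit = smf.ols("rate ~ C(position) + C(year) + C(age)", data=df).fit()
table = anova_lm(fit, typ=1)                      # sequential sums of squares
print(table)
print(table["sum_sq"] / table["sum_sq"].sum())    # share of variation by factor
```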
These results suggest that position is a very informative covariate for home run
ability. In our view, position serves as a proxy variable for several player characteristics,
such as body type and speed, that cannot be directly observed from the data. Scouts and
managers incorporate many of these unobserved variables into their personnel decisions
in terms of where to place players. By assigning a particular player to traditional
power positions such as first base, managers are adding information about that player’s
propensity to hit home runs. We think this information is especially important for
younger players who have less performance history upon which to base predictions.
Albert & Birnbaum also point out that our model does not address major shifts in
hitting performance between different eras in baseball. We do not argue the point, as it
was not the goal of our paper (though we note that Table 1 shows that the year factor
accounts for a modest 3.6% of the variance in home run rates). Our priority is the
prediction of future hitting performance, which motivated our focus on the current era.
The comparison of hitting performance in different eras is also an interesting question,
and has been addressed in the past with sophisticated Bayesian approaches (Berry et al.
1999).
We did investigate, somewhat indirectly, the possible effects of different eras on our
predictions. We fit our full model on a larger dataset consisting of all seasons from
1970 to 2005, in addition to our presented analysis based on seasons from 1990 to 2005.
We saw very little difference in the predictions between these two analyses, suggesting
that any large-scale changes in hitting dynamics over the past forty years do not have
a major impact on future hitting predictions.
Albert & Birnbaum also suggest using at-bats as a covariate for the modeling of
home run rates. This is a good suggestion and we have investigated the modeling of
at-bats as a means for improving the prediction of hitting totals. However, we need
to correct one statement made by Albert & Birnbaum: we do not assume that each
player’s 2006 at-bats are the same as the at-bats in the previous season. Rather, we
scale the predictions of hitting rates from our model (and the two external methods) by
the actual 2006 at-bat totals in our comparisons.
3 Focus on Prediction
Glickman suggests that home run totals may not be the most interesting outcome to
people in baseball. We certainly agree that home runs are not the best measure of
overall hitting performance, and we emphasize that our methodology can be adapted
to any other hitting event. Home runs were chosen for illustration since we believe that
most readers have a good intuition about the scale and variation of home run totals. We
also have experimented with a multinomial extension of our procedure that would model
each hitting outcome (i.e., singles, doubles, etc.) simultaneously, and this remains an
area of future research.
More generally, Albert & Birnbaum call for greater focus on model interpretation.
Despite our emphasis on prediction, there are elements of our model that are interesting
in their own right. The position-specific aging curves provide an interesting contrast in
the aging process between players at these different positions. Our "elite" versus "non-elite" indicators also provide a means for separating out consistently over-performing
players relative to their position.
Quintana & Müller also inquire about the predictive power of our model for all
players, not just the subset of players examined in our analysis. Our primary motivation was to have a set of common players for comparison with the external methods.
However, we concede that the players excluded from our analysis probably represent
an even tougher challenge for prediction. Albert & Birnbaum also suggest that extra
insight would be gained from a case-by-case exploration and comparison of our predictions. To this end, we have made available the entire set of our predictions for the
2006 season at the following website: http://stat.wharton.upenn.edu/~stjensen/research/predictions.2006.xlsx
References
Berry, S. M., Reese, S., and Larkey, P. D. (1999). “Bridging Different Eras in Sports.”
Journal of the American Statistical Association, 94: 661–686.
Bayesian Analysis (2009) 4, Number 4, pp. 675–706
Bayesian Inference for Directional Conditionally
Autoregressive Models
Minjung Kyung∗ and Sujit K. Ghosh†
Abstract. Counts or averages over arbitrary regions are often analyzed using conditionally autoregressive (CAR) models. The neighborhoods within CAR models
are generally determined using only the inter-distances or boundaries between the
sub-regions. To accommodate spatial variations that may depend on directions,
a new class of models is developed using different weights given to neighbors in
different directions. By accounting for such spatial anisotropy, the proposed model
generalizes the usual CAR model that assigns equal weight to all directions. Within
a fully hierarchical Bayesian framework, the posterior distributions of the parameters are derived using conjugate and non-informative priors. Efficient Markov
chain Monte Carlo (MCMC) sampling algorithms are provided to generate samples from the marginal posterior distribution of the parameters. Simulation studies
are presented to evaluate the performance of the estimators and are used to compare results with traditional CAR models. Finally, the method is illustrated using
data sets on local crime frequencies in Columbus, OH and on the elevated blood
lead levels of children under the age of 72 months observed in Virginia counties
for the year of 2000.
Keywords: Anisotropy; Bayesian estimation; Conditionally autoregressive models;
Lattice data; Spatial analysis.
1 Introduction
In many studies, counts or averages over arbitrary regions, known as lattice or area
data (Cressie 1993), are observed and spatial analysis is performed. Given a set of
geographical regions, observations collected over regions nearer to each other tend to
have similar characteristics, as compared to distant regions. In geography, this feature
is known as Tobler’s First Law (Miller 2004). From a statistical perspective, this feature
is attributed to the fact that the autocorrelation between pairs of regions tends to be
higher for regions near one another than for those farther apart. Thus, this spatial
process observed over a lattice or a set of irregular regions is usually modeled using
autoregressive models.
In general, given a set of sub-regions S1 , . . . , Sn , we consider a generalized linear
model for the aggregated responses, Yi = Y (Si ), as
E[Y|Z] = g(Z) and Z = µ + η,    (1)

∗ Department of Statistics, University of Florida, Gainesville, FL, mailto:kyung@stat.ufl.edu
† Department of Statistics, North Carolina State University, Raleigh, NC, mailto:ghosh@stat.ncsu.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA425
where Y = (Y1 , . . . , Yn ) = (Y (S1 ), . . . , Y (Sn )), Z = (Z1 , . . . , Zn ) = (Z(S1 ), . . . , Z(Sn )),
µ = (µ1 , . . . , µn ) = (µ(S1 ), . . . , µ(Sn )) and η = (η1 , . . . , ηn ) = (η(S1 ), . . . , η(Sn )). Here,
g(·) is a suitable link function, µ represents a vector of large-scale variations (or trends
over geographical regions) and η denotes a vector of small-scale variations (or spatial
random effects) with mean 0 and the variance-covariance matrix Σ.
Usually, the large scale variations, µi ’s, are modeled as a deterministic function of
some explanatory variables (e.g., latitudes, longitudes and other area level covariates)
using a parametric or semiparametric regression model (see van der Linde et al. 1995)
involving a finite dimensional parameter β. However, a more difficult issue is to develop
suitable models for the spatial random effects ηi ’s, as they are spatially correlated and
model specifications are required to satisfy the positive definiteness condition of the
induced covariance structure. Popular approaches to estimate such spatial covariances
are based on choosing suitable parametric forms so that the n × n covariance matrix
Σ = Σ(ω) is a deterministic function of a finite dimensional parameter ω and then
ω is estimated from data. It is essential that any such deterministic function should
lead to a positive definite matrix for any sample size n and for all allowable parameter
values ω. For example, several geostatistical models are available for point-reference
observations assuming that the spatial process is weakly stationary and isotropic (see
Cressie 1993). Several extensions to model nonstationary and anisotropic processes have
also been recently developed (see Higdon 1998; Higdon et al. 1999; Fuentes and Smith
2001; Fuentes 2002, 2005; Paciorek and Schervish 2006; Hughes-Oliver et al. 2009).
Once a valid model for µ and η is specified, parameter estimates can be obtained
using maximum likelihood methods, weighted least squares methods or the posterior
distribution of (β, ω) (see Schabenberger and Gotway 2005). Once the point-referenced
data are aggregated to the sub-regions (Si ’s), the process representing the aggregated
data is modeled using integrals of a spatial continuous process (Journel and Huijbregts
1978). In this paper, the focus is on the estimation of ω with the model chosen for Σ.
In practice there are two distinct approaches to develop models for spatial covariance
based on areal data. A suitably aggregated geostatistical model directly specifies a
deterministic function of the elements of the Σ matrix. On the contrary, the conditional
autoregressive models involve specifying a deterministic function of elements of the
inverse of the covariance, Σ−1 (ω) (e.g., see Besag 1974; Besag and Kooperberg 1995).
There have been several attempts to explore the possible connections between these
approaches of spatial modeling (e.g., see Griffith and Csillag 1993; Rue and Tjelmeland
2002; Hrafnkelsson and Cressie 2003). Recently, Song et al. (2008) proposed that these
Gaussian geostatistical models can be approximately represented by Gaussian Markov
Random fields (GMRFs) and vice versa by using spectral densities. However, so far most of the GMRFs that are available in the literature do not specifically take into account the
anisotropic nature of areal data.
In practice, statistical practitioners are accustomed to the exploration of relationships among variables, modeling these relationships with regression and classification
models, testing hypotheses about regression and treatment effects, developing meaningful contrasts, and so forth (Schabenberger and Gotway 2005). For these spatial linear
models, we usually assume a correlated relationship among sub-regions and study how a
particular region is influenced by its “neighboring regions” (Cliff and Ord 1981). Therefore, we consider generalized linear mixed models for the area aggregate data. In these
models, the latent spatial process Zi ’s can be treated as a random effect and to model it,
conditionally autoregressive (CAR) models (Besag 1974, 1975; Cressie and Chan 1989)
and simultaneously autoregressive (SAR) models (Ord 1975) have been used widely.
Gaussian CAR models have been used as random effects within generalized mixed
effects models (Breslow and Clayton 1993; Clayton and Kaldor 1987). The Gaussian CAR process has the merit that, under fairly general regularity conditions (e.g., positivity conditions), the lower dimensional conditional Gaussian distributions uniquely determine the joint Gaussianity of the spatial CAR process. Thus, the maximum
likelihood (ML) and the Bayesian estimates can be easily obtained. However, one of
the major limitations of the CAR model is that the neighbors are formed using some
form of a distance metric and the effect of direction is completely ignored. In recent
years, there have been some attempts to use different CAR models for different parts
of the region. For instance, Reich et al. (2007) presented a novel model for periodontal
disease and used separate CAR models for separate jaws. White and Ghosh (2008)
used a stochastic parameter within the CAR framework to determine effects of the
neighbors. Nevertheless, if the underlying spatial process is anisotropic, the magnitude
of autocorrelation between the neighbors might be different in different directions. This
limitation serves as our main motivation and an extension of the regular CAR process
is proposed that can capture such inherent anisotropy. In this article, we focus on
developing and exploring more flexible models for the spatial random effects ηi ’s and
the newly proposed spatial process will be termed the directional CAR (DCAR) model.
In Section 2, we define the new spatial process and present statistical inferences
for the parameters based on samples obtained from the posterior distribution of the
parameters using suitable Markov chain Monte Carlo (MCMC) methods. In Section 3,
the finite sample performance of the Bayesian estimators is explored using simulated
data and the newly proposed DCAR models are compared to the regular CAR models
in terms of popular information theoretic criteria and various tests. In Section 4, the
proposed method is demonstrated and compared with regular CAR using data sets of
the crime frequencies in Columbus, OH and of the elevated blood lead levels of children
under the age of 72 months observed in Virginia in the year 2000. Finally, in Section 5,
some possible extensions of the DCAR model are discussed.
2 Directional CAR models
In this section, we develop a new model for the latent spatial process, Zi ’s, described
in (1). For simplicity of illustration and notation, we assume that the Si are sub-regions in a two-dimensional space, i.e., Si ⊆ R², ∀i. However, the proposed model and
associated statistical inference presented in this article can easily be extended to higher
dimensional data. First, we consider how to define a neighbor structure that depends
on the directions between centroids for any pair of sub-regions. Let si = (s1i , s2i ) be a
centroid of the sub-region Si , where s1i corresponds to the horizontal coordinate (x-axis)
Figure 1: The angle (in radians) αij between the centroids of Si and Sj.
and s2i corresponds to the vertical coordinate (y-axis). The angle (in radians) between
Si and Sj is defined as
\[
\alpha_{ij} = \alpha(S_i, S_j) =
\begin{cases}
\left|\tan^{-1}\!\left(\dfrac{s_{2j}-s_{2i}}{s_{1j}-s_{1i}}\right)\right| & \text{if } s_{2j}-s_{2i} \ge 0,\\[8pt]
2\pi - \left(\pi - \left|\tan^{-1}\!\left(\dfrac{s_{2j}-s_{2i}}{s_{1j}-s_{1i}}\right)\right|\right) & \text{if } s_{2j}-s_{2i} < 0,
\end{cases}
\]
for all j ≠ i. We consider directions of neighbors from the centroid of each sub-region Si. For example, in Figure 1, Sj is in the north-east (NE) region of Si and hence α(Si, Sj) is in [0, π/2). Let Ni represent the set of indices (j's) of the neighbors of the ith region Si that are based on some form of distance metric (say, as in a regular CAR model). We can now create new sub-neighborhoods, for each i, as follows:

\[
\begin{aligned}
N_{i1} &= \{j : j \in N_i,\ 0 \le \alpha_{ij} < \tfrac{\pi}{2}\},\\
N_{i2} &= \{j : j \in N_i,\ \tfrac{\pi}{2} \le \alpha_{ij} < \pi\},\\
N_{i3} &= \{j : j \in N_i,\ \pi \le \alpha_{ij} < \tfrac{3\pi}{2}\},\\
N_{i4} &= \{j : j \in N_i,\ \tfrac{3\pi}{2} \le \alpha_{ij} < 2\pi\}.
\end{aligned}
\]
These directional neighborhoods should be chosen carefully so that, for each i, they
form a clique. Recall that a clique is any set of sites which either consists of a single site
or else in which every site is a neighbor of every other site in the set (Besag 1974). This
would allow us to show the existence of the spatial process by using the Hammersley-Clifford Theorem (Besag 1974, p. 197-198) and to derive the finite dimensional joint distribution of the process using only a set of (lower dimensional) full conditional distributions. For instance, if j ∈ N_{i1}, then it should be ensured that i ∈ N_{j3}. For the above four neighbor sets, we can combine each pair of the diagonally opposite neighbor sets to form a new neighborhood. It means that we can create N*_{i1} = N_{i1} ∪ N_{i3} and N*_{i2} = N_{i2} ∪ N_{i4} for i = 1, . . . , n. Now it is easy to check that if j ∈ N*_{i1}, then i ∈ N*_{j1}. Thus, we redefine two subsets of the Ni's as follows:

\[
\begin{aligned}
N_{i1}^{*} &= \{j : j \in N_i \text{ and } (0 \le \alpha_{ij} < \tfrac{\pi}{2} \text{ or } \pi \le \alpha_{ij} < \tfrac{3\pi}{2})\},\\
N_{i2}^{*} &= \{j : j \in N_i \text{ and } (\tfrac{\pi}{2} \le \alpha_{ij} < \pi \text{ or } \tfrac{3\pi}{2} \le \alpha_{ij} < 2\pi)\}.
\end{aligned}
\tag{2}
\]

Then, each of N*_{i1} and N*_{i2} forms a clique and it can be shown that N_i = N*_{i1} ∪ N*_{i2}.
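To make the construction concrete, the following Python sketch partitions a given CAR neighborhood into the two directional cliques of (2). It computes the angle with arctan2 mapped to [0, 2π), which serves the same purpose as the case-wise definition above; the function name and the centroids/neighbors inputs are hypothetical.

```python
import numpy as np

def directional_split(centroids, neighbors):
    """Split each CAR neighborhood N_i into the directional cliques of eq. (2).

    centroids: (n, 2) array of centroids s_i = (s_1i, s_2i)
    neighbors: dict mapping i to its (symmetric) set of neighbor indices N_i
    Returns dicts N1, N2 with N1[i] | N2[i] == neighbors[i] for every i.
    """
    N1, N2 = {}, {}
    for i, Ni in neighbors.items():
        N1[i], N2[i] = set(), set()
        for j in Ni:
            dx, dy = centroids[j] - centroids[i]
            alpha = np.arctan2(dy, dx) % (2 * np.pi)  # alpha_ij in [0, 2*pi)
            # NE [0, pi/2) is paired with SW [pi, 3*pi/2), so j in N1[i]
            # implies i in N1[j]: each set stays a clique under symmetry.
            if alpha < np.pi / 2 or np.pi <= alpha < 3 * np.pi / 2:
                N1[i].add(j)
            else:
                N2[i].add(j)
    return N1, N2
```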
A centroid of the sub-region Si might not be given or available in some situations, for example, when neighbor relationships are defined via adjacencies instead of distances between centroids. In this case, the directions of neighbors for each sub-region Si are not clear. One suggestion in this situation is to define the directions of neighbors intuitively based on the direction of the adjacencies; this topic requires further study. Throughout this paper, however, we assume that the directions of neighbors can be defined for each sub-region.
Before fitting a DCAR model, we would need to define these directional neighborhoods, just as we need to define the CAR weights before fitting a CAR model. Note that with a defined directional adjacency, we can easily rotate the directional category boundaries while maintaining the clique property. For example, we can easily assign different weights to the neighbors in the north-south region compared to those in the east-west region.
The above scheme of creating new neighborhoods based on the inter-angles αij can be extended beyond just two sub-neighborhoods, so that each of the new sub-neighborhoods forms a clique. For example, we can extend the directional cliques with 4 sub-sets of neighborhoods as

\[
\begin{aligned}
N_{i1}^{*} &= \{j : j \in N_i \text{ and } (0 \le \alpha_{ij} < \tfrac{\pi}{4} \text{ or } \pi \le \alpha_{ij} < \tfrac{5\pi}{4})\},\\
N_{i2}^{*} &= \{j : j \in N_i \text{ and } (\tfrac{\pi}{4} \le \alpha_{ij} < \tfrac{\pi}{2} \text{ or } \tfrac{5\pi}{4} \le \alpha_{ij} < \tfrac{3\pi}{2})\},\\
N_{i3}^{*} &= \{j : j \in N_i \text{ and } (\tfrac{\pi}{2} \le \alpha_{ij} < \tfrac{3\pi}{4} \text{ or } \tfrac{3\pi}{2} \le \alpha_{ij} < \tfrac{7\pi}{4})\},\\
N_{i4}^{*} &= \{j : j \in N_i \text{ and } (\tfrac{3\pi}{4} \le \alpha_{ij} < \pi \text{ or } \tfrac{7\pi}{4} \le \alpha_{ij} < 2\pi)\}.
\end{aligned}
\]
However, it should be noted that anisotropic specifications for the geostatistical covariance functions are quite different from the directional specification of neighborhood
cliques used to define the inverse of the covariance. In this regard, the directional adjustments within the CAR framework allow the anisotropy parameters to capture the local
(neighboring) directional effects whereas the anisotropy parameters of a geostatistical
model generally capture the overall global directional effects. Finally, it is possible to increase the number of sub-neighborhoods to more than 2 or 4. However, we cautiously note that if we keep increasing the number of sub-neighborhoods, the number of parameters increases whereas the number of observations available within a sub-neighborhood decreases. Thus, we need to restrict the number of sub-neighborhoods by introducing some form of penalty term (e.g., via the prior distributions of the anisotropy parameters) and use some form of information criterion to choose the number of sub-neighborhoods. This is an important but open issue within our DCAR framework.
Hence for the rest of the article, for simplicity, we restrict our attention to the case with
only two sub-neighborhoods as described in (2).
Based on the subsets of associated neighborhoods, N*i1 and N*i2, we can construct directional weight matrices W(1) = ((wij(1))) and W(2) = ((wij(2))), respectively. For instance, we define the directional proximity matrices by wij(1) = 1 if j ∈ N*i1 and wij(2) = 1 if j ∈ N*i2. Notice that W = W(1) + W(2) reproduces the commonly used proximity matrix as in a regular CAR model.
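As a quick illustration building on the earlier sketch (again ours, with hypothetical names), the binary directional proximity matrices can be assembled as follows; their sum recovers the usual CAR proximity matrix W:

```python
import numpy as np

def directional_proximity(N1, N2, n):
    """Binary directional proximity matrices built from the cliques;
    W1 + W2 recovers the usual CAR proximity matrix W."""
    W1, W2 = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        W1[i, N1[i]] = 1.0   # w_ij^(1) = 1 if j in N*i1
        W2[i, N2[i]] = 1.0   # w_ij^(2) = 1 if j in N*i2
    return W1, W2
```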
In order to model the large-scale variations, we assume a canonical generalized linear
model, µi = xTi β, where xi ’s are vectors of predictor variables specific to the sub-region
Si and β = (β1 , . . . , βq )T is a vector of regression coefficients. Notice that nonlinear
regression functions, including smoothing splines and polynomials, can be re-written in
the above canonical form (e.g., see Wahba 1977; van der Linde et al. 1995). From model
(1) it follows that
E[Z] = Xβ and Var[Z] = Σ(ω),        (3)
where ω denotes the vector of spatial autocorrelation parameters and other variance
components. Notice that along with (3), the model (1) can be used for discrete responses
using a generalized linear model framework (Schabenberger and Gotway 2005, p.353).
Now, we develop a model for Σ(ω) that accounts for anisotropy.
Let δ1 and δ2 denote the directional spatial effects corresponding to the Ni1's and Ni2's, respectively. We define the distribution of Zi conditional on the rest of the Zj's for j ≠ i using only the first two moments:

E[Zi | Zj = zj, j ≠ i, xi] = xTi β + Σ_{k=1}^{2} δk Σ_{j=1}^{n} wij(k) (zj − xTj β),
Var[Zi | Zj = zj, j ≠ i, xi] = σ²/mi,        (4)

where wij(k) ≥ 0 and wii(k) = 0 for k = 1, 2, and mi = Σ_{j=1}^{n} wij.
The joint distribution based on a given set of full conditional distributions can be derived using Brook's Lemma (Brook 1964) provided the positivity condition is satisfied (e.g., see Besag 1974; Besag and Kooperberg 1995). For the DCAR model, by construction, it follows that each of N*i1 and N*i2 defined in (2) forms a clique for i = 1, . . . , n. Thus, it follows from the Hammersley-Clifford Theorem that the latent spatial process Zi of a DCAR model exists and is a Markov Random Field (MRF). Therefore, we can
derive the exact joint distribution of the DCAR process, the Zi's, by assuming that each of the full conditional distributions is Gaussian.
2.1 Gaussian DCAR models
The Gaussian CAR model has been used widely as a suitable model for the latent spatial
process Zi . In this section, to derive the joint distribution of the Zi ’s from a set of given
full conditional distributions, we use Brook’s Lemma.
Assume that the full conditional distributions of the Zi's are given as

Zi | Zj = zj, j ≠ i, xi ∼ N( xTi β + Σ_{k=1}^{2} δk Σ_{j=1}^{n} wij(k) (zj − xTj β), σ²/mi ),        (5)
(k)
where wij for k = 1, 2 are the directional weights. It can be shown that this latent
spatial DCAR process Zi ’s is a MRF. Thus, by Brook’s Lemma and the HammersleyClifford Theorem, it follows that the finite dimensional joint distribution is a multivariate Gaussian distribution given by
µ
³
´−1 ¶
2
(1)
(2)
D ,
Z ∼ Nn Xβ, σ I − δ1 W − δ2 W
where Z = (Z1 , . . . , Zn )T and D = diag( m11 , . . . , m1n ). For simplicity, we denote the
variance-covariance matrix of the DCAR process by ΣZ ≡ σ 2 (I−δ1 W(1) −δ2 W(2) )−1 D.
For a proper Gaussian model, the variance-covariance matrix ΣZ needs to be positive definite. First, notice that if we use the standardized directional proximity matrices W̃(k) = ((w̃ij(k) = wij(k)/mi)), k = 1, 2, it can easily be shown that ΣZ is symmetric. Thus, the finite dimensional joint distribution is given by

Z ∼ Nn( Xβ, σ² (I − δ1 W̃(1) − δ2 W̃(2))⁻¹ D ).        (6)
Next, we derive a sufficient condition that ensures that the variance-covariance matrix ΣZ is non-singular, hence making it a positive definite matrix. As D is a diagonal matrix, we only require suitable conditions on W̃(k) and on δk for k = 1, 2. The following result provides a sufficient condition:

Lemma 1. Let A = (aij) be an n × n symmetric matrix. If aii > Σ_{j≠i} |aij| for all i, then A is positive definite.

Proof: See Ortega (1987, p. 226). ✷
Lemma 2. Let A = I − Σ_{k=1}^{K} δk W̃(k) be an n × n matrix, where Σ_{k=1}^{K} W̃(k) is a symmetric matrix with non-negative entries, zero diagonal, and each row sum equal to unity. If max_{1≤k≤K} |δk| < 1, then the matrix A is positive definite.
Proof: Let aij denote the (i, j)th element of A. Notice that for each i = 1, 2, . . . , n, we have

Σ_{j≠i} |aij| = Σ_{j≠i} |Σ_{k=1}^{K} δk w̃ij(k)| ≤ Σ_{k=1}^{K} |δk| Σ_{j≠i} w̃ij(k) < Σ_{k=1}^{K} Σ_{j≠i} w̃ij(k) = 1 = aii.

Hence it follows from Lemma 1 that A is positive definite. ✷
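Lemma 2 is also easy to verify numerically. The following sketch (ours, with hypothetical names) builds A = I − Σk δk W̃(k) from standardized weight matrices and tests positive definiteness through the eigenvalues of the symmetric part of A:

```python
import numpy as np

def is_positive_definite(delta, W_tilde):
    """Numerical check of Lemma 2: A = I - sum_k delta_k * W_tilde[k]
    should be positive definite when max_k |delta_k| < 1 and the summed
    standardized weights have zero diagonal and unit row sums."""
    n = W_tilde[0].shape[0]
    A = np.eye(n)
    for d, W in zip(delta, W_tilde):
        A -= d * W
    # test the eigenvalues of the symmetric part of A
    return np.linalg.eigvalsh((A + A.T) / 2).min() > 0.0
```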
Notice that when δ1 = δ2 = ρ, DCAR(δ1, δ2, σ²) reduces to CAR(ρ, σ²), and hence the regular CAR model is nested within the DCAR model provided we use a prior that puts positive mass on the line δ1 = δ2. The next step of our statistical analysis is to estimate the unknown parameters of the DCAR model based on the observed responses and the explanatory variables, enabling us to stabilize estimates within the regions using the estimated spatial correlation. In the next section, we discuss Bayesian methods for these spatial autoregressive models.
2.2 Parameter estimation using Bayesian methods
Given the Gaussian DCAR model for the latent spatial process Zi, we describe how to estimate the parameters and associated measures of uncertainty using Bayesian methods.

Bayesian inference has proved useful for statistical models whose likelihood functions are analytically intractable, whether because of high-dimensional parameters or because the likelihood involves high-dimensional integration (e.g., when the Yi's are discrete valued). In the Gaussian DCAR model, because the likelihood function may involve high-dimensional integration, posterior estimation is not easy to achieve analytically. In particular, the joint posterior density of δ1 and δ2 does not have a closed form. Also, when a generalized mixed model is used with the random spatial effects having a DCAR structure, analytical exploration of the posterior distribution becomes almost prohibitive. Thus, the Gaussian DCAR model leads to an intractable posterior density, and numerical methods are needed for inference about the unknown parameters.
Let θ = (β T , σ 2 , δ T )T , where β = (β1 , . . . , βp ) and δ = (δ1 , δ2 ). The posterior
density π(θ|z) is proportional to the product of the prior distribution π(θ) of unknown
parameters and the sampling density of data Z given θ. Therefore, by using Markov
chain Monte Carlo (MCMC) methods, we can obtain samples from the path of Markov
chains whose stationary density is the posterior density.
For the DCAR process, under the joint multivariate Gaussian distribution, the likelihood function is given by

L(θ|X, z) ∝ |σ² A*(δ)⁻¹ D|^(−1/2) exp{ −(1/(2σ²)) (z − Xβ)T D⁻¹ A*(δ) (z − Xβ) },        (7)

where A*(δ) = I − δ1 W̃(1) − δ2 W̃(2) and D = diag(1/m1, . . . , 1/mn). We consider a class of
prior distributions that ensure that the posterior distribution is proper even when the priors are improper. One such class of prior distributions is given by

π(β|σ², δ) ≡ 1,
π(σ²|δ) ∝ (1/σ²)^(a+1) e^(−b/σ²),   a, b > 0, and
π(δ) = (1/4) I[max(|δ1|, |δ2|) < 1].
As the prior distribution of β is improper, we need to ensure that the posterior is proper. Given the prior distributions above, the joint posterior distribution of θ can be shown to have the following form:
π(θ|X, z) ∝ L(θ|X, z) π(β|σ², δ) π(σ²|δ) π(δ)
          ∝ (σ²)^(−n/2−a−1) |A*(δ)⁻¹ D|^(−1/2) exp{ −(1/(2σ²)) [(z − Xβ)T D⁻¹ A*(δ)(z − Xβ) + 2b] } × I[max(|δ1|, |δ2|) < 1].        (8)
Here, if δ is known, then as in the regular regression model, the posterior distribution of β given σ² and δ is a Gaussian distribution with a complicated mean and variance, and the posterior of σ² given δ is an inverse gamma distribution. For the Gaussian DCAR model, assume that the design matrix X is of full rank and that the variance-covariance matrix ΣZ is positive definite. Then, based on the sufficient conditions given by Sun et al. (1999), one can easily deduce that the posterior distribution of θ|z is proper.
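For concreteness, an unnormalized log posterior corresponding to (8) can be evaluated as in the following sketch (our illustration; the function and argument names are hypothetical, and the hyperparameters a and b are placeholders):

```python
import numpy as np

def log_posterior(beta, sigma2, delta, X, z, W1t, W2t, D, a=2.0, b=1.0):
    """Unnormalized log posterior of (8): likelihood (7) plus the improper
    priors above.  W1t, W2t are the standardized directional matrices and
    D = diag(1/m_1, ..., 1/m_n)."""
    if max(abs(delta[0]), abs(delta[1])) >= 1.0:
        return -np.inf                       # outside the support of pi(delta)
    n = len(z)
    A = np.eye(n) - delta[0] * W1t - delta[1] * W2t
    r = z - X @ beta
    d = np.diag(D)                           # the diagonal entries 1/m_i
    # -0.5*log|sigma^2 A^{-1} D| = -(n/2)log(sigma^2) + 0.5 log|A| - 0.5 log|D|
    _, logdetA = np.linalg.slogdet(A)
    quad = r @ ((A @ r) / d)                 # (z - Xb)^T D^{-1} A (z - Xb)
    loglik = (-0.5 * n * np.log(sigma2) + 0.5 * logdetA
              - 0.5 * np.log(d).sum() - 0.5 * quad / sigma2)
    logprior = -(a + 1.0) * np.log(sigma2) - b / sigma2  # pi(beta), pi(delta) const
    return loglik + logprior
```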
We can also consider conditionally conjugate priors for β and σ². Given the value of δ, the likelihood function of the DCAR model (7) is like that of a regression model with mean Xβ and variance-covariance σ² A*(δ)⁻¹ D. Thus, a conditional conjugate prior given δ can be specified in two stages according to

β|σ², δ ∼ N( β0, σ² A*(δ)⁻¹ D ),
σ²|δ ∼ IG(a0, b0).
Given the conditionally conjugate prior distributions and a marginal prior for δ, the joint posterior distribution of θ is given by

π(θ|X, z) ∝ L(θ|X, z) π(β|σ², δ) π(σ²|δ) π(δ)
          ∝ (σ²)^(−(n/2+p/2+a0)−1) × I[max(|δ1|, |δ2|) < 1]
            × exp{ −(1/(2σ²)) [ s̃ + (β − β̃)T ( XT D⁻¹ A*(δ) X + (A*(δ)⁻¹ D)⁻¹ ) (β − β̃) ] },        (9)
where

β̃ = ( XT D⁻¹ A*(δ) X + (A*(δ)⁻¹ D)⁻¹ )⁻¹ ( (A*(δ)⁻¹ D)⁻¹ β0 + XT D⁻¹ A*(δ) X β̂ ),
s̃ = σ̂²(n − p) + 2b0 + (β0 − β̃)T (A*(δ)⁻¹ D)⁻¹ β0 + (β̂ − β̃)T XT D⁻¹ A*(δ) X β̂,
β̂ = ( XT D⁻¹ A*(δ) X )⁻¹ XT D⁻¹ A*(δ) z,
σ̂² = (z − Xβ̂)T D⁻¹ A*(δ) (z − Xβ̂) / (n − p),

and p is the dimension of β. Then the conditional posterior distributions of the parameters are given by

β|σ², δ, Z ∼ N( β̃, σ² ( XT D⁻¹ A*(δ) X + (A*(δ)⁻¹ D)⁻¹ )⁻¹ ) and
σ²|δ, Z ∼ IG( n/2 + p/2 + a0, s̃/2 ).
However, as we discussed above, there is no closed form for the posterior distribution
of δ. Therefore, we need numerical methods to obtain the posterior summaries (e.g.,
suitable moments) of θ.
In order to obtain samples from the path of the Markov chains, we need to consider starting values for each parameter. In the DCAR model of the latent spatial process, the parameter space Θ of θ can be defined as Θ = {θ : β ∈ Rp, σ² ∈ (0, ∞), max(|δ1|, |δ2|) < 1}. Within Θ, we can choose several starting points and run parallel chains. After burn-in, we obtain approximate samples from the posterior density π(θ|z).
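The paper's scheme draws β and σ² from their closed-form conditionals and updates δ numerically; as a simpler self-contained illustration of the same idea, the generic random-walk Metropolis sketch below (ours; names hypothetical) targets the log posterior defined earlier, with θ packing (β, log σ², δ1, δ2). Running it from several dispersed starting points in Θ mimics the parallel-chain strategy just described.

```python
import numpy as np

def rw_metropolis(logpost, theta0, step, n_iter=3000, burn=1000, seed=1):
    """Generic random-walk Metropolis sampler.  Start inside the parameter
    space so logpost(theta0) is finite; retained draws approximate the
    posterior after burn-in."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = logpost(theta)
    draws = []
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            theta, lp = prop, lp_prop
        if t >= burn:
            draws.append(theta.copy())
    return np.array(draws)
```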
2.3 A simulation study
In order to study the finite sample performance of the Bayesian estimators, we conduct a simulation study focusing on the behavior of the Gaussian DCAR model for the latent spatial process Z = (Z1, . . . , Zn) in (6).
Mardia and Marshall (1984) conducted a simulation study on a 10 × 10 unit-spacing lattice, based on samples generated from a normal distribution with mean zero and a spherical covariance model. The sampling distribution of the MLEs of the parameters was studied based on 300 Monte Carlo samples. Following a similar setup, for our simulation study we selected a 15 × 15 unit-spacing lattice and generated N = 100 data sets, each of size n = 225, from a multivariate normal distribution with mean Xβ and variance-covariance matrix σ² A*(δ)⁻¹ D, where A*(δ) = I − δ1 W̃(1) − δ2 W̃(2). The X matrix was chosen to consist of the coordinates of latitude and longitude in addition to a column of ones to represent an intercept. The true values of the parameters were fixed at β = (1, −1, 2)T and σ² = 2. For the above DCAR model, to study
the behavior of posterior distributions for δ1 and δ2 , we consider four different test cases
of δ’s:
Case 1: δ1 = −0.95 & δ2 = −0.97
Case 2: δ1 = −0.30 & δ2 = 0.95
Case 3: δ1 = −0.95 & δ2 = 0.97
Case 4: δ1 = 0.95 & δ2 = 0.93.
Following Lemma 2, we restrict our choices of δ1 and δ2 to satisfy the sufficient condition max_{1≤k≤K} |δk| < 1. For near-boundary values of δ1 and/or δ2, such as −0.95 and 0.93, there might be some unexpected behavior of the sampling distribution. We therefore generate data sets with two negative near-boundary values of δ (Case 1) and two positive near-boundary values of δ (Case 4) in order to explore the extreme cases. In our applications we have found that such extreme values are quite common within CAR or DCAR models (see Section 4); Besag and Kooperberg (1995) also discussed similar situations. We also consider a setting with one positive near-boundary weight in one direction and one negative near-boundary weight in the other, for an extremely differently weighted situation (Case 3). Finally, we give a somewhat mild weight to one direction and a positive boundary weight to the other to study the behavior of a strong positive spatial effect in one direction only (Case 2). Thus, with extreme boundary values of δ, we study the sampling distributions of the directional spatial effect parameters.
As discussed in Section 2.2, for the Bayesian estimates we consider three sets of initial values and run three parallel chains. We use a burn-in of B = 1000 for each of the three chains followed by M = 2000 iterations. This scheme produces a sample of 6000 (correlated) values from the joint posterior distribution of the parameter vector. As Bayesian estimation involves computationally intensive MCMC methods, we studied the finite sample performance of the Bayes estimates with only N = 100 repetitions. The (coordinatewise) posterior median of the parameter vector is used as the Bayes estimate because of its robustness compared to the posterior mean, especially when the posterior distribution is skewed. Also, for each coordinate of the parameter vector, we computed a 95% equal-tail credible set (CS) using the 2.5 and 97.5 percentiles as an interval estimate. Then, we computed the 95% nominal coverage probability (CP) using the following rule:

95% CP = (1/N) Σ_{i=1}^{N} I(θ0 ∈ 95% CS).
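In code, these summaries amount to the following (a sketch with hypothetical names; draws_by_rep would hold the 6000 retained draws of one parameter for each of the N = 100 replications):

```python
import numpy as np

def coverage_summary(draws_by_rep, theta0):
    """Bias, MCSE, and 95% CP for one scalar parameter, where draws_by_rep
    holds the retained posterior draws from each simulated data set and
    theta0 is the true value."""
    med = np.array([np.median(d) for d in draws_by_rep])
    lo = np.array([np.percentile(d, 2.5) for d in draws_by_rep])
    hi = np.array([np.percentile(d, 97.5) for d in draws_by_rep])
    bias = med.mean() - theta0                       # empirical bias
    mcse = med.std(ddof=1)                           # SE of posterior medians
    cp = np.mean((lo <= theta0) & (theta0 <= hi))    # 95% CP rule above
    return bias, mcse, cp
```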
We summarize the sampling distribution of the parameters numerically in Table 1. The bias represents the empirical bias of the posterior medians compared to the true value, the Monte Carlo Standard Error (MCSE) is the standard error of the posterior medians, the p-value is based on testing the average of the posterior medians against the true value, and the 95% CP represents the percentage of times the true value was included within the 95% CS. All of these summaries are based on 100 replications. We observe that for all four choices of δ there are no significant biases in the Bayesian estimates with MC repetitions of size 100 (e.g., all p-values are bigger than 0.18). When the true δ1 or δ2 is positive, the bias of the Bayesian estimate tends to be slightly negative, except for δ2 in Case 4. For Case 3, the nominal 95% coverage probabilities (CPs) of δ1 and δ2 are away from 0.95 and the MCSEs are not small.
Case 1      δ1 (True -0.95)   δ2 (True -0.97)   σ² (True 2.00)
bias             0.23              0.27              0.17
MCSE             0.22              0.20              0.20
P-value          0.29              0.19              0.40
95% CP           0.99              1.00              0.92

Case 2      δ1 (True -0.30)   δ2 (True 0.95)    σ² (True 2.00)
bias             0.15             -0.21              0.11
MCSE             0.32              0.18              0.21
P-value          0.64              0.24              0.60
95% CP           0.97              0.93              0.92

Case 3      δ1 (True -0.95)   δ2 (True 0.97)    σ² (True 2.00)
bias             0.20             -0.18              0.26
MCSE             0.15              0.17              0.21
P-value          0.18              0.28              0.21
95% CP           0.89              0.93              0.84

Case 4      δ1 (True 0.95)    δ2 (True 0.93)    σ² (True 2.00)
bias            -0.01              0.03             -0.05
MCSE             0.03              0.03              0.37
P-value          0.74              0.28              0.89
95% CP           0.98              0.89              0.67

Table 1: Finite sample performance of posterior estimates of the parameters of DCAR models (based on 100 replications).
Also, from Figure 2, we observe that for the extremely different near-boundary weights of opposite signs (δ1 and δ2), the posterior estimates seem to estimate the true values with somewhat less precision. Moreover, the distribution of δ is skewed when the true value is near the boundary, which may explain the somewhat larger MCSE values. For Case 4, the nominal 95%
CPs of δ1 and δ2 are 0.98 and 0.89, respectively, and the biases and MCSEs are smaller than in any other case. Thus, Bayesian methods based on posterior medians tend to estimate the true values quite well even when the true values of δ1 and δ2 are near the positive boundary. The higher than nominal coverage probability of the Bayesian
interval estimates based on equal tail CS may be due to the skewness of the sampling
distribution that we have observed in our empirical studies. Alternatively, a 95% HPD
interval can be obtained using the algorithm of Hyndman (1996). It was observed that
the posterior distributions of δ1 or δ2 are skewed to the right and to the left for the
negative extreme value and the positive extreme value, respectively.
The bias of the posterior median of σ² tends to be slightly negative for Case 4, but for the other cases the biases are slightly positive. Again, note that these biases are not statistically significant (all four p-values are greater than 0.21). Thus, in these cases, Bayesian estimation tends to estimate the true value quite well. However, for Case 4, the MCSE of the Bayesian estimate of σ² is bigger than for the other cases and the nominal coverage is only 0.67 compared to the targeted value of 0.95.
For the estimation of the β's, the estimates had small MCSEs and did not have any significant bias, except for β0 in Case 4. We also observed that the posterior distributions of the β's are fairly symmetric (results not reported due to lack of space but available in Kyung 2006).
[Figure 2: two histograms of the posterior medians of δ1 (true δ1 = −0.95) and δ2 (true δ2 = 0.97).]

Figure 2: Histograms of the 100 posterior estimates of δ1 and δ2 based on the DCAR process data with true δ1 = −0.95 and δ2 = 0.97 (posterior medians of M = 6000 Gibbs samples with 100 replications).
3 Comparing the performances of DCAR and CAR models using an information criterion
In Section 2.1 we showed that the DCAR model is a generalization of the CAR model, and hence the DCAR model is expected to provide a reasonable fit to a given data set, possibly at the cost of some loss of efficiency, especially when the data arise from a CAR model. So it is of interest to explore the loss (or gain) in efficiency of a DCAR model over the regular CAR model when the data arise from a CAR (DCAR) model. There are several criteria (e.g., information criteria, cross-validation measures, hypothesis tests, etc.) to
compare the performances across several competing models. Given the popularity of the
Deviance Information Criterion (DIC) originally proposed by Spiegelhalter et al. (2002)
we use DIC to compare the performance of fitting DCAR and CAR models to data
generated from a CAR model and then also from a DCAR model. Another advantage
of using DIC is that this criterion is already available within the WinBUGS software. To
calculate DIC, first we define the deviance as
D(θ) = −2 log L(θ|X, z) + 2 log h(z),
where h(z) is standardizing function of the data z only and remains the same across
all competing models. In general it is difficult to find the normalizing function h(z) for
models involving spatial random effects. However given that we are interested in the
differences of DIC between the models, we may use the following definition of deviance
by dropping the h(z) term:
D(θ) = −2 log L(θ|X, z),
as the normalizing term cancels anyway when we take the difference between two DICs with the same h function. Based on the deviance, the effective number of parameters, denoted by pD, is defined as

pD = E[D(θ)|z] − D(E[θ|z]) = D̄ − D(θ̄),

where θ̄ = E[θ|z] is the posterior mean of θ. The DIC is then defined as
DIC = D(θ̄) + 2pD .
We select the model with the smaller DIC value. DIC and pD are easily computed from MCMC output (a sketch is given below). We consider two cases based on data generated from (i) a CAR model and (ii) a DCAR model.
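A minimal sketch of this computation (ours; the names are hypothetical):

```python
import numpy as np

def dic(theta_draws, deviance_fn):
    """DIC = D(theta_bar) + 2*pD, with pD = mean D(theta) - D(theta_bar).
    theta_draws has one posterior draw per row; deviance_fn(theta) returns
    D(theta) = -2 log L(theta | X, z)."""
    D_bar = np.mean([deviance_fn(t) for t in theta_draws])
    D_at_mean = deviance_fn(theta_draws.mean(axis=0))
    pD = D_bar - D_at_mean
    return D_at_mean + 2.0 * pD, pD
```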
3.1 Results based on data generated from CAR model
True ρ     Parameter    bias    MCSE    P-value   95% CP
-0.95      δ1            0.01    0.02    0.67      1.00
           δ2            0.01    0.02    0.66      1.00
           ρ             0.02    0.00    0.00      1.00
-0.25      δ1            0.14    0.29    0.63      1.00
           δ2            0.15    0.26    0.57      1.00
           ρ             0.08    0.33    0.81      0.99
 0.00      δ1           -0.02    0.30    0.95      1.00
           δ2            0.00    0.26    1.00      1.00
           ρ            -0.01    0.30    0.97      1.00
 0.25      δ1           -0.10    0.31    0.85      0.99
           δ2           -0.09    0.26    0.72      1.00
           ρ             0.01    0.32    0.98      0.98
 0.95      δ1           -0.15    0.11    0.17      1.00
           δ2           -0.06    0.08    0.45      1.00
           ρ            -0.06    0.04    0.14      1.00

Table 2: Performance of Bayesian estimates of the δ1's and δ2's (DCAR) and the ρ's (CAR) based on data generated from a CAR model.

With samples from a Gaussian CAR process, we fit both CAR and DCAR models. Notice that if there is no directional difference in the observed spatial data, then the estimate of δ1 should be very similar to the estimate of δ2. Thus we might expect very similar estimates for δ1 and δ2 based on a sample from a CAR process, because CAR(ρ, σ²) = DCAR(ρ, ρ, σ²). In fact, it might be a good idea to use a prior
on (δ1, δ2) which puts positive mass on the diagonal line δ1 = δ2, so as to capture a CAR model with positive probability. To study the performance of the model as a function of the key parameter ρ of the CAR model, we consider five different values of ρ: Case 1: ρ = −0.95, Case 2: ρ = −0.25, Case 3: ρ = 0, Case 4: ρ = 0.25, and Case 5: ρ = 0.95. For each case, we generate 100 data sets, each of sample size n = 225, from a CAR model with ρ taking one of the above five values, while the other parameters (β and σ) are fixed at their true values (see Section 2.3). First we compare the posterior estimates of ρ when we fit a CAR model with those of δ1 and δ2 when we fit a DCAR model to data generated from one of the five CAR models. In Table 2 we compare the bias (of the posterior median), the Monte Carlo Standard Error (MCSE) of the posterior median, the p-value for testing the null hypothesis that the (MC) average of the 100 posterior medians equals the true value, and the 95% nominal coverage probability (CP) of the 95% posterior intervals constructed from the 2.5% and 97.5% percentiles of the posterior distribution of the parameters.
From the results (presented under the columns ρ in Table 2) based on the posterior
estimates (median and 95% equal-tail intervals) obtained by fitting a CAR model, we
observe that for all cases, the biases of ρ are slightly positive except Case 5, but such
empirical biases are not statistically significant (all p-values being bigger than 0.06).
The nominal 95% CP’s of ρ’s are higher than their targeted value of 0.95 for all cases.
For all cases, the biases of the posterior medians of σ² tend to be slightly positive. However, again we found that these empirical biases are not significant in any case, because all calculated p-values are at least 0.5 (results not reported). Finally, with regard to the performance of the posterior medians of the β's, we did not find any significant biases (all p-values being bigger than 0.32). We have not presented the details of these results (for σ² and the β's) due to lack of space, but detailed results are available online in the doctoral thesis of the first author (Kyung 2006). Next we compare the results obtained by fitting a DCAR model to the same data sets generated from the five CAR models (as described above).

DGP                 PCD: CAR   PCD: DCAR   P-value
CAR(ρ = -0.95)      100%        0%         0.64
CAR(ρ = -0.25)       51%       49%         0.50
CAR(ρ = 0.00)        34%       66%         0.47
CAR(ρ = 0.25)        55%       45%         0.50
CAR(ρ = 0.95)       100%        0%         0.68

Table 3: Comparison of DIC between CAR and DCAR models with data sets from the CAR process (PCD = percentage of correct decisions).
Performance of DCAR model under mis-specification
In Table 2 we also present the bias of the posterior medians of δ1 and δ2, their MCSEs, p-values (for testing δ1 = δ2 = ρ), and the 95% CPs of the 95% equal-tailed posterior intervals when a DCAR model is fitted to each of the same 100 data sets generated from each of the five CAR models (as described in the previous section). Although DCAR is not the true model that generated the data in these cases, we observe that the biases of the posterior medians of δ1 and δ2 are marginally positive for Cases 1, 2, and 3, whereas the biases are slightly negative for Cases 4 and 5. However, these biases are not statistically significant (all p-values being greater than 0.09). This indicates that even when the data arise from a CAR model, the posterior medians of the δ's can well approximate the true ρ value of the CAR model. As expected, the MCSEs of the posterior medians of the δ's are slightly larger than those of the ρ's, but the loss in efficiency from fitting the incorrect model is not prominent either. Finally, in terms of maintaining the nominal coverage of the posterior intervals, the results from both model fits are comparable. Thus, in summary, when we fit a DCAR model to data sets generated from a CAR model, the posterior estimates obtained from the DCAR model are approximately unbiased and there is no substantial loss in efficiency.
In addition to comparing the parameter estimates based on fitting both CAR and
DCAR models to data generated from a CAR model we have also used DIC (as defined
earlier in this section) to compare the overall performance of these models. By a data
generating process (DGP) we mean the true model that generates data for our simulation
study, and we use the notation FIT to denote the model that was fitted to a simulated data set. So in this case CAR is the DGP while FIT can be either a CAR or a DCAR model. We measure the performance of the FIT by computing the percentage of correct decisions (PCD) made by DIC in selecting one of the two models. In other words, PCD represents the percentage of times the DIC value based on fitting a CAR model is lower than that of fitting a DCAR model to the same sets of data obtained from a CAR model. We also report the p-values from a two-sample test that compares the average DIC values (over 100 replications) between the CAR and DCAR models when the true data are generated from a CAR model.

DGP                 E(Var(DCAR)/Var(CAR))
CAR(ρ = -0.95)      1.009 (0.000)
CAR(ρ = -0.25)      0.999 (0.003)
CAR(ρ = 0.00)       0.998 (0.003)
CAR(ρ = 0.25)       0.999 (0.004)
CAR(ρ = 0.95)       1.009 (0.001)

Table 4: The average ratio of posterior variances for DCAR and CAR models, Average(Var(DCAR)/Var(CAR)), based on the Gibbs sampler from data sets of the CAR process.
From Table 3, we observe that when the DGP is a CAR with ρ = −0.95 (negative boundary) or ρ = 0.95 (positive boundary), the PCD based on DIC is 100%, which means that DIC correctly identifies a CAR model every time when data are generated from a CAR model with ρ = ±0.95. However, for the other cases, the PCDs are not that strongly in favor of a CAR model (when compared against a DCAR model) even when the data arise from a CAR model. Thus, when the spatial dependence in a CAR model is weak, DIC will not be able to distinguish between the CAR and DCAR models. Again, such a phenomenon is expected as DCAR nests CAR when δ1 = δ2 = ρ, and this is further evidenced by looking at the p-values, which suggest that we cannot reject the null hypotheses that the DIC values are the same for both models.
As a measure of relative efficiency, the average ratio of posterior variances for the DCAR and CAR models based on data sets from the CAR processes is reported in Table 4. From Table 4, we observe that there are no differences between the posterior variances for the DCAR and CAR models based on the Gibbs sampler from the data sets of each CAR process. Again, such a phenomenon is expected, as DCAR nests CAR when δ1 = δ2 = ρ.
3.2 Results based on data generated from DCAR model
In this section, our DGP (data generating process, as defined in earlier sections) is a
DCAR model while the FIT is again either a CAR or a DCAR model. Here again we
use the data sets generated from four DCAR models (as defined in Section 2.3) but fit
a CAR model in addition to the DCAR models that we fitted earlier (see Section 2.3
for details). In this case it is of interest to find out how the posterior estimates of ρ of a
CAR model behave, especially when the data arise from a DCAR model with δ values
well separated (e.g., for Cases 2 and 3 of Section 2.3).

DGP: DCAR                  δ1 = -0.95,   δ1 = -0.30,   δ1 = -0.95,   δ1 = 0.95,
                           δ2 = -0.97    δ2 = 0.95     δ2 = 0.97     δ2 = 0.93
True ρ0 = (δ1 + δ2)/2      -0.96          0.33          0.01          0.94
bias                        0.25          0.07          0.03          0.03
MCSE                        0.13          0.31          0.38          0.02
P-value                     0.09          0.83          0.94          0.13
95% CP                      1.00          0.99          0.98          0.85

Table 5: Fitting a CAR model to data generated from the DCAR process.
Performance of CAR model under mis-specification
Based on generating 100 data sets from the different DCAR models, we observed that the posterior median of ρ seems to estimate the average of the true values of δ1 and δ2 of the DCAR models. Therefore, we define a pseudo-true value of ρ as ρ0 = (δ1 + δ2)/2 and compare the performance of the posterior median of ρ to this so-called “pseudo-true”
value ρ0. In Table 5 we list the empirical bias of the posterior median of ρ, the MCSE of these posterior medians, the p-value for testing the null hypothesis ρ = ρ0, and the 95% nominal CP based on the 95% equal-tail posterior intervals of ρ when a CAR model is fitted to data from the four DCAR models (as described in Section 2.3). It is clear from the results reported in this table that the ρ parameter of the CAR model attempts to estimate (δ1 + δ2)/2 of the DCAR model and thus will lead to misleading conclusions, especially when the δ's are of opposite signs but with large absolute values (e.g., Cases 2 and 3 of Section 2.3). In other words, when there are strong spatial dependencies, possibly in orthogonal directions, the CAR model will fail to capture such dependencies, as opposed to a DCAR model. On the other hand, when the DGP is a CAR model, the DCAR model still provides a very reasonable approximation to that DGP (see the results in Section 3.1). This is one of the main advantages of fitting a DCAR model over a regular CAR model.
In Table 6, we compare the performance of DIC in choosing the correct model (a DCAR model in this case) when both CAR and DCAR models are fitted. The numbers reported in this table have similar interpretations as in Table 3. As expected, for Cases 1 and 4 (where δ1 ≈ δ2), DIC more often chooses the CAR model as the more parsimonious model even though the data arise from a DCAR model. However, the p-values (for testing the null hypothesis of no difference in average DIC values) indicate that the DIC values are not statistically significantly different. On the other hand, when the DCAR model is sharply different from a CAR model (e.g., in Cases 2 and 3, where δ1δ2 < 0), DIC correctly picks DCAR as the better model more frequently (e.g., 99% of the time in Case 3) compared to a CAR model. Moreover, the p-values suggest that in these two cases the DIC values obtained by fitting the CAR and DCAR
models are significantly different in favor of the DCAR model when the DGP is indeed a DCAR model.

DGP                                 PCD: CAR   PCD: DCAR   P-value
DCAR(δ1 = -0.95, δ2 = -0.97)        91%         9%         0.59
DCAR(δ1 = -0.30, δ2 = 0.95)         30%        70%         0.03
DCAR(δ1 = -0.95, δ2 = 0.97)          1%        99%         0.01
DCAR(δ1 = 0.95, δ2 = 0.93)          56%        44%         0.73

Table 6: Comparison of DIC between CAR and DCAR models with data sets from the DCAR process (PCD = percentage of correct decisions).

DGP                                 E(Var(DCAR)/Var(CAR))
DCAR(δ1 = -0.95, δ2 = -0.97)        1.002 (0.003)
DCAR(δ1 = -0.30, δ2 = 0.95)         0.994 (0.007)
DCAR(δ1 = -0.95, δ2 = 0.97)         0.983 (0.010)
DCAR(δ1 = 0.95, δ2 = 0.93)          1.041 (0.247)

Table 7: The average ratio of posterior variances for DCAR and CAR models, Average(Var(DCAR)/Var(CAR)), based on the Gibbs sampler from data sets of the DCAR process.
The average ratio of posterior variances for the DCAR and CAR models based on data sets from the DCAR process is reported in Table 7 as a measure of relative efficiency. From Table 7, we observe that there are no differences in the posterior variances for the DCAR and CAR models for Cases 1 and 4 (where δ1 ≈ δ2). Also, for Case 2, the posterior variances are not different for the DCAR and CAR models. However, for the extreme case (Case 3), the posterior variances of the DCAR model are smaller than those of the CAR model. Thus, when there are strong spatial dependencies, possibly in orthogonal directions, the DCAR model captures such dependencies more precisely than a CAR model.
From our extensive simulation studies we can make the following fairly general conclusions: (i) DCAR models provide a reasonably good fit and approximately unbiased
parameter estimates even when the data arise from a CAR model, (ii) CAR models
cannot provide an adequate fit for data sets arising from a DCAR model, especially
when there are strong spatial dependencies in opposite directions, (iii) DIC performs
reasonably well in choosing a parsimonious model when CAR and DCAR models are
compared.
4 Data analysis
We illustrate the fitting of DCAR and CAR models using real data sets. For each data set, we consider a linear regression model with iid errors and with correlated errors (modeled by CAR and DCAR processes). We obtain Gibbs samples of ρ, σ², β = (β0, β1, β2)T, and δ = (δ1, δ2) under different modeling assumptions. We consider the following models:

Zi = Xi β + εi,   i = 1, . . . , n,

Model 1. εi ∼ N(0, σ²): iid errors
Model 2. ε ∼ N(0, σ²(I − ρW̃)⁻¹D): CAR errors
Model 3. ε ∼ N(0, σ²(I − δ1W̃(1) − δ2W̃(2))⁻¹D): DCAR errors,

where Zi = f(Yi) and f(·) is a transformation of the response Yi.
In addition to using DIC to compare the models with CAR and DCAR error structures, we also computed a cross-validation measure, the leave-one-out mean square predictive error (MSPE), defined as

MSPE = (1/n) Σ_{i=1}^{n} (yi − ŷ−i)²,

where ŷ−i = E(Yi|y−i) is the posterior predictive mean of Yi obtained by fitting the model to the reduced data set consisting of all n − 1 observations, leaving out the ith observation yi.
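Computationally, this measure is a straightforward (if expensive) loop over held-out observations; a sketch with hypothetical names:

```python
import numpy as np

def leave_one_out_mspe(y, predict_without):
    """MSPE = (1/n) sum_i (y_i - yhat_{-i})^2, where predict_without(i)
    refits the model without observation i and returns the posterior
    predictive mean of Y_i."""
    y = np.asarray(y, dtype=float)
    preds = np.array([predict_without(i) for i in range(len(y))])
    return np.mean((y - preds) ** 2)
```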
4.1 Crime distribution in Columbus, Ohio
We illustrate the performance of fitting CAR and DCAR models using a real data set on the crime distribution in Columbus, Ohio, collected during 1980. The original data set can be found in Table 12.1 of Anselin (1988, p. 189). Using this interesting data set, Anselin (1988) illustrated the presence of separate levels of spatial dependence by fitting two separate regression curves with simultaneous autoregressive (SAR) error models for the east and west sides of Columbus. The author concluded that when a SAR error model is used, there exists structural instability in the regression models. In this paper, we fit the proposed models to this data set. Each model is a single regression curve but allows spatial anisotropy in the errors by modeling the errors as a CAR or DCAR process.
The data set consists of observations collected in 49 contiguous Planning Neighborhoods of Columbus, Ohio. Neighborhoods correspond to census tracts, or aggregates of a small number of census tracts. In this data set, the crime variable represents the total number of residential burglaries and vehicle thefts per thousand households (henceforth denoted by Yi for the ith neighborhood). As possible predictors for the crime variable, we use the income level and housing values for each of these 49 neighborhoods. The income and housing values are measured in thousands of dollars.
[Figure 3: left panel, a map of the crime distribution (residential burglaries and vehicle thefts per thousand households, 1980) across the 49 Columbus, OH neighborhoods, shaded by 20% quantile intervals; right panel, the directional variogram of the GLM residuals (semivariance versus distance) at 0°, 45°, 90°, and 135°.]

Figure 3: The crime distribution of 49 neighborhoods in Columbus, OH, and the correlogram of the deviance residuals after fitting a Poisson regression model.
As part of our preliminary exploratory data analysis, in Figure 3 we plot the crime counts divided into 5 intervals based on 20% quantiles. During our initial analysis we observed that Y4 and Y17 have extremely small values and hence could possibly be eliminated as outliers or incorrectly recorded values (these two values were less than the 2.5% percentile of the Yi's). We use the remaining n = 47 neighborhoods for the rest of the analysis. From the map in Figure 3 we observe that crime frequencies seem relatively higher in the NW/SE direction than in the orthogonal direction, though such differences in the crime distribution are not strikingly evident from this plot.
From the estimated directional spatial correlogram in Figure 3, it appears that the spatial correlations are not particularly strong. However, as the distance between neighbors increases, the estimated spatial correlation differs across directions. Because there may be hidden effects of different directional spatial correlation, we assume a Gaussian DCAR spatial structure.
As our response (the crime variable) is a count variable, we assume that Yi ∼ Poisson(λi) for i = 1, . . . , n. Also, let x1i and x2i represent the housing value and the income, both in thousands of dollars, respectively; thus xi = (1, x1i, x2i)T represents the intercept and predictors for neighborhood i. We consider three over-dispersed Poisson regression models using the latent variables Zi as follows:

Yi ∼ Poisson(λi),   log(λi) = Zi = xTi β + εi,   β = (β0, β1, β2)T,   i = 1, . . . , n.
Posterior estimates consisting of the posterior median (denoted by Est.) and the posterior standard deviation (denoted by Std.Err.) of the parameters under these models are displayed in Table 8.

Parameter   iid Est. (Std.Err.)   CAR Est. (Std.Err.)   DCAR Est. (Std.Err.)
ρ           -                     0.974 (0.021)         -
δ1          -                     -                     0.962 (0.031)
δ2          -                     -                     0.960 (0.032)
σ²          2.576 (-)             0.358 (0.250)         0.230 (0.183)
β0          4.568 (0.074)         4.197 (0.264)         4.147 (0.243)
β1          -0.003 (0.001)        -0.004 (0.003)        -0.004 (0.003)
β2          -0.064 (0.006)        -0.056 (0.012)        -0.055 (0.011)
DIC         -                     335.76                336.79

Table 8: Posterior estimates based on fitting different models to the crime frequency data.

Model              MSPE
Model 1 (iid)      0.084
Model 2 (CAR)      0.053
Model 3 (DCAR)     0.050

Table 9: Mean Squared Predictive Error from the leave-one-out method (MSPE).

In this table, we observe that
the posterior estimates of the regression coefficients (β’s) are very similar across all
three models. As expected, the negative posterior medians of the β1 and β2 indicate
that crime frequencies are expected to be lower in neighborhoods with higher income
level and housing values. Next we turn our attention to the error part of the three
models. First, significantly lower values of the posterior medians of σ 2 under both the
CAR and DCAR models indicate that greater variability is explained by the models
with spatially correlated errors (i.e., by the CAR and DCAR models in this case) than
the corresponding model with independent errors. This is further evidenced from the
“deviance residual” (as defined by McCullagh and Nelder 1989) plot in Figure 4 which
also suggests that these residuals are not randomly scattered around the horizontal
line at the origin. Among the spatially correlated error models, the difference between
the DCAR and CAR models is negligible. Also, from the scatterplot of the predicted values from the DCAR spatial structure versus those from CAR in Figure 4, we observe that there are many points which are far from the straight line. The straight line has slope 1, so if the predicted values are similar to the original data, the points lie close to the straight line. However, the predicted values from DCAR and CAR are not different from each other, which is also evident by comparing their corresponding DIC values.
In addition to using DIC to compare the models with CAR and DCAR error structures, we also computed the leave-one-out mean square predictive error (MSPE), where ẑ−i = E(Zi|z−i) is the predictive mean of Zi obtained by fitting the model to the reduced data set consisting of all n − 1 observations, leaving out the ith observation. In Table 9 we present the MSPEs for the three models.

[Figure 4: scatterplot of the predicted values from the CAR and DCAR models against the log-transformed crime rates, with a reference line of slope 1.]

Figure 4: Scatterplot of regional estimated frequencies from DCAR versus those from CAR for the log-transformed crime frequencies. The straight line has slope 1; thus, if the predicted values are similar to the original data, points are close to the straight line.
Again it is evident that the spatially correlated error models perform much better than
the independent error model. Among the two spatial models, DCAR performs slightly
better than the CAR model in terms of having a lower MSPE. Thus, we conclude that although there is possibly no separate directional spatial correlation, there is a strong spatial correlation in both directions among the neighborhoods.
4.2 Elevated Blood Lead Levels in Virginia
We also illustrate the fitting of the CAR and DCAR models using a second data set,
estimating the rate, per thousand, of children under the age of 72 months with elevated
blood lead levels observed in Virginia in the year 2000. As predictors for the rate
of children with elevated blood lead levels, we consider the median housing value in
$100,000 and the number of children under 17 years of age living in poverty in 2000, per 100,000 children at risk. These observations were collected in 133 counties in Virginia in the year 2000, with coordinates given by the centroids of each county. The aggregated data for each county are counts: the number of children under 6 years of age with elevated blood lead levels in county i and the number of children under 6 years of age tested.
In Schabenberger and Gotway (2005), the original data set was used to illustrate the
percentage of children under the age of 6 years with elevated blood lead levels by using
a Poisson-based generalized linear model (GLM) and a Poisson-based generalized linear
mixed model (GLMM) in the analysis of spatial data. Schabenberger and Gotway
(2005) illustrated spatial dependence by comparing predictions from a marginal spatial
GLM, a conditional spatial GLMM, a marginal spatial GLM using geostatistical variance
structure, and a marginal GLM using a CAR variance structure. For the CAR variance
structure, they used binary sets of neighbors which share a common border. They
mentioned that because of this choice of adjacency weights, the model with the CAR
variance smoothes the data much more than the model with the geostatistical variance.
Instead of using a generalized linear model for the count data, we consider the Freeman-Tukey (FT) square-root transformation of the Yi's. There are zero values in some counties, and the FT square-root transformation is more stable than the usual square-root transformation (Freeman and Tukey 1950; Cressie and Chan 1989). With the FT square-root transformed elevated blood lead level rate, we assume a Gaussian distribution with CAR and DCAR spatial structure. For the neighbor structure, we compute distances among the centroids of each geographical group as measured in latitude and longitude. So as not to have any counties reporting zero neighbors, we include counties whose centroid distance is within a 54.69-unit radius of another county.
For this data set, we denote

Zi = √(1000·Yi/Ti) + √(1000·(Yi + 1)/Ti),   i = 1, 2, . . . , n,

where Yi is the number of children under the age of 72 months with elevated blood lead levels and Ti is the number of children under the age of 72 months who have been tested in Virginia in the year 2000. Thus, Zi is the FT square-root transformed elevated blood lead level rate of sub-area Si. There exists significant correlation between the median housing value in $100,000 and the number of children under 17 years of age living in poverty in 2000 per 100,000 children at risk. Thus, we only include the centered housing value in $100,000 (X).
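The transformation is simple to apply; a minimal sketch (ours, with hypothetical names) for rates per thousand:

```python
import numpy as np

def freeman_tukey_rate(y, t, scale=1000.0):
    """Freeman-Tukey square-root transform of the rate per `scale`:
    Z_i = sqrt(scale*Y_i/T_i) + sqrt(scale*(Y_i+1)/T_i).  The shifted
    second term keeps the transform stable when a county reports Y_i = 0."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.sqrt(scale * y / t) + np.sqrt(scale * (y + 1.0) / t)
```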
We plot the FT square-root transformed elevated blood lead level rates, divided into 5 intervals based on 20% quantiles, in Figure 5. In Figure 5, it appears that the spatial correlation in the northeast (NE) direction is strong. However, from the estimated correlogram in Figure 5, we observe that the spatial correlations in the four different directions do not seem to be very different from each other, though there appear to be different amounts of correlation for the 45° and 135° directions compared to no directional correlation. Thus, we assume a DCAR process as the hidden spatial structure.
The posterior estimates, with standard deviations, under the iid error, CAR error, and DCAR error models are displayed in Table 10.
[Figure 5: left panel, a map of the FT-transformed elevated blood lead level rates across Virginia counties, shaded by 20% quantile intervals; right panel, the directional variogram of the linear-model residuals (semivariance versus distance) at 0°, 45°, 90°, and 135°.]

Figure 5: The elevated blood lead level rate per thousand children under the age of 72 months observed in Virginia in the year 2000, and the correlogram of the residuals after fitting a linear model.

Parameter   iid Est. (Std.Err.)   CAR Est. (Std.Err.)   DCAR Est. (Std.Err.)
ρ           -                     0.792 (0.120)         -
δ1          -                     -                     0.450 (0.236)
δ2          -                     -                     0.896 (0.105)
σ²          88.52 (11.15)         574.1 (72.170)        564.0 (74.2)
β0          17.46 (0.822)         18.78 (2.822)         17.42 (2.315)
β1          -0.624 (2.103)        -3.295 (3.017)        -3.072 (2.756)
DIC         -                     940.532               938.854

Table 10: Bayesian estimates based on fitting different models to the elevated blood lead level data.

[Figure 6: two maps of the predicted FT-transformed elevated blood lead levels, from the CAR model (Model 2) and from the DCAR model (Model 3), shaded by the same quantile intervals.]

Figure 6: Predicted elevated blood lead level rates of children in Virginia (Model 2 and Model 3).

In Table 10, we observe that the posterior estimates of the intercept (β0) are very similar across all three models. However, the estimate of the regression coefficient of median housing value under iid
errors is different from the posterior estimates under CAR and DCAR errors. As expected, the negative posterior medians of β1 indicate that the rates per thousand of children under the age of 72 months with elevated blood lead levels are expected to be lower in counties with higher housing values. The estimate of the error variance (σ²) with independent errors is significantly lower than the corresponding estimates under spatially correlated errors. However, the posterior mode of β1 (−0.624) is not significant under iid errors, having a large standard error (2.103). As discussed in Section 3.2, the posterior mode of ρ (0.792) appears to estimate the average of the DCAR estimates of δ1 and δ2 (0.450 and 0.896). There exists a positive spatial relationship for elevated blood lead levels among counties in Virginia. However, there are different amounts of positive spatial correlation among neighbors in the northeast-southwest and northwest-southeast directions: the spatial correlation among neighbors in the northeast-southwest direction (δ̂2 = 0.896) is stronger than that in the northwest-southeast direction (δ̂1 = 0.450). Among the spatially correlated error models, DCAR explains slightly more variability than CAR, though the difference between these models is negligible, as is also evident from comparing their corresponding DIC values. This is further evidenced from the residual plots in Figure 6, which also suggest that the residuals based on the DCAR error model do not have a trend over the study region. Also, in Figure 7, we observe that most of the predicted values from the DCAR spatial structure are bigger than those from CAR. This means that for the FT-transformed elevated blood lead levels, the DCAR model captures more variability than the CAR model in stabilizing estimates within the regions using the estimated spatial correlation.
To compare the models with CAR and DCAR error structures, we also computed
leave-one-out mean square predictive error (MSPE). In Table 11 we present the MSPEs
for the three models.

Model              MSPE
Model 1 (iid)      88.155
Model 2 (CAR)      66.822
Model 3 (DCAR)     64.269

Table 11: Mean Squared Predictive Error (elevated blood lead level data).

[Figure 7: scatterplot of the predicted values from DCAR against the predicted values from CAR, with a reference line of slope 1.]

Figure 7: Scatterplot of regional estimated rates from DCAR versus those from CAR for the FT-transformed original elevated blood lead level rates. The straight line has slope 1; thus, if the predicted values from DCAR are similar to the predicted values from CAR, points are close to the straight line.

Again it is evident that the spatially correlated error models perform
much better than the independent error model. Among the two spatial models, DCAR
(64.269) performs slightly better than the CAR model (66.822) in terms of having
lower MSPE, but the difference is negligible. Thus, we conclude that there are strong
spatial correlations with some evidence of differing strengths of correlation in different
directions.
5 Extensions and future work
DCAR models capture the directional spatial dependence in addition to distance specific
correlation, thus they are an extension of regular CAR models, which can often fail to
capture strong but directionally orthogonal spatial correlations. The DCAR model is
also found to be nearly as efficient as the CAR model even when data are generated
from the CAR model. However, CAR models usually fail to capture the directional
effects when data are generated from DCAR or other anisotropic models, particularly
when the anisotropy is pronounced.
Our model proposed in (6) can be extended to M (M ≥ 2) directions, and can be expressed as

Z ∼ Nn( Xβ, σ² (I − Σ_{k=1}^{M} δk W̃(k))⁻¹ D ),

where W̃(k) denotes the matrix of weights specific to the kth directional effect. In this
paper we used only M = 2 sub-neighborhoods for a simpler illustration. However, we
note that if we keep increasing the number of sub-neighborhoods, the number of parameters increases, and the number of observations available within a sub-neighborhood
decreases. Thus, we need to restrict the number of sub-neighborhoods by introducing
a penalty term (or prior) and use some form of information criterion to choose the
number of sub-neighborhoods. This is an important but open issue within our DCAR
framework and we leave its further exploration as a part of our future research.
References
Anselin, L., 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers.

Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B. 36, 192-236 (with discussion).

Besag, J., 1975. Spatial analysis of non-lattice data. The Statistician. 24, 179-195.

Besag, J. and Kooperberg, C., 1995. On conditional and intrinsic autoregression. Biometrika. 82, 733-746.

Breslow, N. E. and Clayton, D. G., 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 88, 9-25.

Brook, D., 1964. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika. 51, 481-483.

Clayton, D. and Kaldor, J., 1987. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 43, 671-681.

Cliff, A. D. and Ord, J. K., 1981. Spatial Processes: Models & Applications. Pion Limited.

Cressie, N., 1993. Statistics for Spatial Data. John Wiley & Sons, Inc.

Cressie, N. and Chan, N. H., 1989. Spatial modeling of regional variables. Journal of the American Statistical Association. 84, 393-401.

Freeman, M. F. and Tukey, J. W., 1950. Transformations related to the angular and the square root. Annals of Mathematical Statistics. 21, 607-611.

Fuentes, M., 2002. Spectral methods for nonstationary spatial processes. Biometrika. 89, 197-210.

Fuentes, M., 2005. A formal test for nonstationarity of spatial stochastic processes. Journal of Multivariate Analysis. 96, 30-54.

Fuentes, M. and Smith, R., 2001. A new class of nonstationary spatial models. Technical report, North Carolina State University, Department of Statistics.

Griffith, D. A. and Csillag, F., 1993. Exploring relationships between semi-variogram and spatial autoregressive models. Papers in Regional Science. 72, 283-295.

Higdon, D., 1998. A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Journal of Environmental and Ecological Statistics. 5, 173-190.

Higdon, D., Swall, J. and Kern, J., 1999. Non-stationary spatial modeling. In Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith. Oxford: Oxford University Press, 761-768.

Hrafnkelsson, B. and Cressie, N., 2003. Hierarchical modeling of count data with application to nuclear fall-out. Journal of Environmental and Ecological Statistics. 10, 179-200.

Hughes-Oliver, J. M., Heo, T. Y. and Ghosh, S. K., 2009. An autoregressive point source model for spatial processes. Environmetrics. 20, 575-594.

Hyndman, R. J., 1996. Computing and graphing highest density regions. The American Statistician. 50, 120-126.

Journel, A. G. and Huijbregts, C. J., 1978. Mining Geostatistics. London: Academic.

Kyung, M., 2006. Generalized Conditionally Autoregressive Models. Ph.D. Thesis, North Carolina State University, Department of Statistics.

Mardia, K. V. and Marshall, R. J., 1984. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika. 71, 135-146.

McCullagh, P. and Nelder, J. A., 1989. Generalized Linear Models. Chapman and Hall, London.

Miller, H. J., 2004. Tobler's first law and spatial analysis. Annals of the Association of American Geographers. 94, 284-295.

Ord, K., 1975. Estimation methods for models of spatial interaction. Journal of the American Statistical Association. 70, 120-126.

Ortega, J. M., 1987. Matrix Theory. New York: Plenum Press.

Paciorek, C. J. and Schervish, M. J., 2006. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics. 17, 483-506.

Reich, B. J., Hodges, J. S. and Carlin, B. P., 2007. Spatial analyses of periodontal data using conditionally autoregressive priors having two classes of neighbor relations. Journal of the American Statistical Association. 102, 44-55.

Rue, H. and Tjelmeland, H., 2002. Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian Journal of Statistics. 29, 31-49.

Schabenberger, O. and Gotway, C. A., 2005. Statistical Methods for Spatial Data Analysis. Chapman & Hall/CRC.

Song, H. R., Fuentes, M. and Ghosh, S., 2008. A comparative study of Gaussian geostatistical models and Gaussian Markov random field models. Journal of Multivariate Analysis. 99, 1681-1697.

Spiegelhalter, D. J., Best, N. J., Carlin, B. P. and van der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B. 64, 583-639 (with discussion).

Sun, D., Tsutakawa, R. K. and Speckman, P. L., 1999. Posterior distribution of hierarchical models using CAR(1) distributions. Biometrika. 86, 341-350.

van der Linde, A., Witzko, K.-H. and Jöckel, K.-H., 1995. Spatio-temporal analysis of mortality using splines. Biometrics. 4, 1352-1360.

Wahba, G., 1977. Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis. 14, 651-667.

White, G. and Ghosh, S. K., 2008. A stochastic neighborhood conditional autoregressive model for spatial data. Computational Statistics and Data Analysis. 53, 3033-3046.
Acknowledgments
The authors are grateful to a referee, the associate editor and the editor for their careful
reading of the paper and their constructive comments. We also thank Professor George Casella,
Department of Statistics, University of Florida for comments and suggestions that led to a much
improved version of the paper.
Bayesian Analysis (2009)
4, Number 4, pp. 707–732
Spiked Dirichlet Process Prior for Bayesian Multiple Hypothesis Testing in Random Effects Models

Sinae Kim∗, David B. Dahl† and Marina Vannucci‡

∗ Department of Biostatistics, University of Michigan, Ann Arbor, MI, mailto:sinae@umich.edu
† Department of Statistics, Texas A&M University, College Station, TX, mailto:dahl@stat.tamu.edu
‡ Department of Statistics, Rice University, Houston, TX, mailto:marina@rice.edu

© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA426
Abstract. We propose a Bayesian method for multiple hypothesis testing in random effects models that uses Dirichlet process (DP) priors for a nonparametric
treatment of the random effects distribution. We consider a general model formulation which accommodates a variety of multiple treatment conditions. A key
feature of our method is the use of a product of spiked distributions, i.e., mixtures
of a point-mass and continuous distributions, as the centering distribution for the
DP prior. Adopting these spiked centering priors readily accommodates sharp
null hypotheses and allows for the estimation of the posterior probabilities of such
hypotheses. Dirichlet process mixture models naturally borrow information across
objects through model-based clustering while inference on single hypotheses averages over clustering uncertainty. We demonstrate via a simulation study that our
method yields increased sensitivity in multiple hypothesis testing and produces a
lower proportion of false discoveries than other competitive methods. While our
modeling framework is general, here we present an application in the context of
gene expression from microarray experiments. In our application, the modeling
framework allows simultaneous inference on the parameters governing differential
expression and inference on the clustering of genes. We use experimental data on
the transcriptional response to oxidative stress in mouse heart muscle and compare
the results from our procedure with existing nonparametric Bayesian methods that
provide only a ranking of the genes by their evidence for differential expression.
Keywords: Bayesian nonparametrics; differential gene expression; Dirichlet process prior; DNA microarray; mixture priors; model-based clustering; multiple hypothesis testing
1 Introduction
This paper presents a semiparametric Bayesian approach to multiple hypothesis testing
in random effects models. The model formulation borrows strength across similar objects (here, genes) and provides probabilities of sharp hypotheses regarding each object.
Much of the literature in multiple hypothesis testing has been driven by DNA microarray studies, where the expression of tens of thousands of genes is measured
simultaneously (Dudoit et al. 2003). Multiple testing procedures seek to ensure that
the family-wise error rate (FWER) (e.g., Hochberg (1988), Hommel (1988), Westfall and
Young (1993)), the false discovery rate (FDR) (e.g., Benjamini and Hochberg (1995),
Storey (2002), Storey (2003), and Storey et al. (2004)), or similar quantities (e.g., Newton et al. (2004)) are below a nominal level without greatly sacrificing power. Accounts of the Bayesian perspective on multiple testing are provided by Berry and Hochberg (1999) and Scott and Berger (2006).
There is a great variety of modeling settings that accommodate multiple testing
procedures. The simplest approach, extensively used in the early literature on microarray data analysis, is to apply standard statistical procedures (such as the t-test)
separately and then combine the results for simultaneous inference (e.g., Dudoit et al.
2002). Westfall and Wolfinger (1997) recommended procedures that incorporate dependence. Baldi and Long (2001), Newton et al. (2001), Do et al. (2005) and others have
sought prior models that share information across objects, particularly when estimating
object-specific variance across samples. Yuan and Kendziorski (2006) use finite mixture
models to model dependence. Classical approaches that have incorporated dependence
in the analysis of gene expression data include Tibshirani and Wasserman (2006), Storey
et al. (2007), and Storey (2007) who use information from related genes when testing
for differential expression of individual genes.
Nonparametric Bayesian approaches to multiple testing have also been explored
(see, for example, Gopalan and Berry (1998), Dahl and Newton (2007), MacLehose
et al. (2007), Dahl et al. (2008)). These approaches model the uncertainty about the
distribution of the parameters of interest using Dirichlet process (DP) prior models that
naturally incorporate dependence in the model by inducing clustering of similar objects.
In this formulation, inference on single hypotheses is typically done by averaging over
clustering uncertainty. Dahl and Newton (2007) and Dahl et al. (2008) show that this
approach leads to increased power for hypothesis testing. However, the methods provide
posterior distributions that are continuous, and cannot therefore be used to directly test
sharp hypotheses, which have zero posterior probability. Instead, decisions regarding
such hypotheses are made based on calculating univariate scores that are context specific. Examples include the sum-of-squares of the treatment effects (to test a global
ANOVA-like hypothesis) and the probability that a linear combination of treatment
effects exceeds a threshold.
In this paper we build on the framework of Dahl and Newton (2007) and Dahl et al.
(2008) to show how the DP modeling framework can be adapted to provide meaningful
posterior probabilities of sharp hypotheses by using a mixture of a point-mass and a
continuous distribution as the centering distribution of the DP prior on the coefficients
of a random effects model. This modification retains the increased power of DP models
but also readily accommodates sharp hypotheses. The resulting posterior probabilities
have a very natural interpretation in a variety of uses. For example, they can be used to
rank objects and define a list according to a specified expected number of false discoveries. We demonstrate via a simulation study that our method yields increased sensitivity
in multiple hypothesis testing and produces a lower proportion of false discoveries than
other competitive methods, including standard ANOVA procedures. In our application, the modeling framework we adopt simultaneously infers the parameters governing
differential expression and clusters the objects (i.e., genes). We use experimental data
on the transcriptional response to oxidative stress in mouse heart muscle and compare
results from our procedure with that of existing nonparametric Bayesian methods which
only provide a ranking of the genes by their evidence for differential expression.
Recently Cai and Dunson (2007) independently proposed the use of similar spiked
priors in DP priors in a Bayesian nonparametric linear mixed model where variable
selection is achieved by modeling the unknown distribution of univariate regression
coefficients. Similarly, MacLehose et al. (2007) used this formulation in their DP mixture
model to account for highly correlated regressors in an observational study. There, the
clustering induced by the Dirichlet process is on the univariate regression coefficients
and strength is borrowed across covariates. Finally, Dunson et al. (2008) use a similar
spiked centering distribution of univariate regression coefficients in a logistic regression.
In contrast, our goal is nonparametric modeling of multivariate random effects which
may equal the zero vector. That is, we do not share information across univariate
covariates but rather seek to leverage similarities across genes by clustering vectors of
regression coefficients associated with the genes.
The remainder of the paper is organized as follows. Section 2 describes our proposed
modeling framework and the prior model. In Section 3 we discuss the MCMC algorithm
for inference. Using simulated data, we show in Section 4.1 how to make use of the posterior probabilities of hypotheses of interest to aid the interpretation of the hypothesis
testing results. Section 4.2 describes the application to DNA microarrays. In both
Sections 4.1 and 4.2, we compare our proposed method to LIMMA (Smyth 2004), to the SIMTAC method of Dahl et al. (2008), and to a standard ANOVA procedure.
Section 5 concludes the paper.
2 Dirichlet Process Mixture Models for Multiple Testing

2.1 Random Effects Model
Suppose there are K observations on each of G objects and T∗ treatments. For each object g, with g = 1, . . . , G, we model the data vector dg with the following K-dimensional multivariate normal distribution:

$$ d_g \mid \mu_g, \beta_g, \lambda_g \sim N_K\!\left( d_g \mid \mu_g j + X \beta_g,\ \lambda_g M \right), \qquad (1) $$

where µg is an object-specific mean, j is a vector of ones, X is a K × T design matrix, βg is a vector of T regression coefficients specific to object g, M is the inverse of a correlation matrix of the K observations from an object, and λg is an object-specific precision (i.e., inverse of the variance). We are interested in testing a hypothesis for each of the G objects, of the form:

$$ H_{0,g}: \beta_{1,g} = \cdots = \beta_{T^*,g} = 0 \qquad \text{vs.} \qquad H_{a,g}: \beta_{t,g} \neq 0 \ \text{for some } t = 1, \ldots, T^*, \qquad (2) $$
for g = 1, . . . , G.
Object-specific intercept terms are µg j, so the design matrix X does not contain the usual column of ones and T is one less than the number of treatments (i.e., T = T∗ − 1). Also, d1, . . . , dG are assumed to be conditionally independent given all model parameters. In the example of Section 4.2, the objects are genes with dg being the background-adjusted and normalized expression data for a gene g under T∗ treatments, G being the number of genes, and K being the number of microarrays. In the example, we have K = 12 since there are 3 replicates for each of T∗ = 4 treatments, and the X matrix is therefore:

$$ X = \begin{pmatrix} 0_3 & 0_3 & 0_3 \\ j_3 & 0_3 & 0_3 \\ 0_3 & j_3 & 0_3 \\ 0_3 & 0_3 & j_3 \end{pmatrix}, $$

where j3 is a 3-dimensional column vector of ones and 03 a 3-dimensional column vector of zeroes. If there are other covariates available, they would be placed as extra columns in X. Note that the design matrix X and the correlation matrix M are known and common to all objects, whereas µg, βg, and λg are unknown object-specific parameters. For experimental designs involving independent sampling (e.g., the typical time-course microarray experiment in which subjects are sacrificed rather than providing repeated measures), M is simply the identity matrix.
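To make the model concrete, here is a minimal sketch (our own Python/NumPy code, not the authors' Matlab implementation) that builds the 12 × 3 design matrix X above and simulates one data vector dg from model (1) with M = I; the parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix for T* = 4 treatments with 3 replicates each (K = 12, T = 3);
# the reference treatment contributes rows of zeros.
j3 = np.ones((3, 1))
z3 = np.zeros((3, 1))
X = np.block([[z3, z3, z3],
              [j3, z3, z3],
              [z3, j3, z3],
              [z3, z3, j3]])                      # 12 x 3

# One object: mean mu_g, coefficients beta_g, precision lambda_g, M = I.
mu_g, beta_g, lam_g = 10.0, np.array([0.5, 0.0, -0.3]), 2.0
K = X.shape[0]
mean = mu_g * np.ones(K) + X @ beta_g
# lambda_g * M is the precision matrix, so the covariance is (1/lambda_g) I.
d_g = rng.multivariate_normal(mean, np.eye(K) / lam_g)
```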
2.2 Prior Model
We take a nonparametric Bayesian approach to model the uncertainty on the distribution of the random effects. The modeling framework we adopt allows for simultaneous inference on the regression coefficients and on the clustering of the objects (i.e., genes). We achieve this by placing a Dirichlet process (DP) prior (Antoniak 1974) with a spiked centering distribution on the distribution function of the regression coefficient vectors β1, . . . , βG:

$$ \beta_1, \ldots, \beta_G \mid G_\beta \sim G_\beta, \qquad G_\beta \sim DP(\alpha_\beta, G_{0\beta}), $$

where Gβ denotes a distribution function of β, DP stands for a Dirichlet process, αβ is a precision parameter, and G0β is a centering distribution, i.e., E[Gβ] = G0β. Sampling from a DP induces ties among β1, . . . , βG, since there is a positive probability that βi = βj for every i ≠ j. Two objects i ≠ j are said to be clustered in terms of their regression coefficients if and only if βi = βj. The clustering of the objects encoded by the ties among the regression coefficients will simply be referred to as the “clustering of the regression coefficients,” although it should be understood that it is the data themselves that are clustered. The fact that our model induces ties among the regression coefficients β1, . . . , βG is the means by which it borrows strength across objects for estimation.
Set partition notation is helpful throughout the paper. A set partition ξ = {S1, . . . , Sq} of S0 = {1, 2, . . . , G} has the following properties: each component Si is non-empty, the intersection of two components Si and Sj is empty, and the union of all components is S0. A cluster S in the set partition ξ for the regression coefficients is a set of indices such that, for all i ≠ j ∈ S, βi = βj. Let βS denote the common value of the regression coefficients corresponding to cluster S. Using this set partition notation, the regression coefficient vectors β1, . . . , βG can be reparametrized as a partition ξβ and a collection of unique model parameters φβ = (βS1, . . . , βSq). We will use the terms clustering and set partition interchangeably.
Spiked Prior Distribution on the Regression Coefficients
Similar modeling frameworks and inferential goals to the ones we describe in this paper were considered by Dahl and Newton (2007) and Dahl et al. (2008). However, their prior formulation does not naturally permit testing of sharp hypotheses, i.e., it cannot provide Pr(Ha,g | data) = 1 − Pr(H0,g | data), where hypotheses are defined as in (2), since the posterior distribution of βt,g is continuous. Therefore, they must rely on univariate scores capturing evidence for these hypotheses. The prior formulation we adopt below, instead, allows us to estimate the probability of sharp null hypotheses directly from the MCMC samples.

Spiked distributions are a mixture of two distributions: the “spike” refers to a point-mass distribution at zero, and the other component is a continuous distribution for the parameter when it is not zero. These distributions have been widely used as priors in the Bayesian variable selection literature (George and McCulloch 1993; Brown et al. 1998). Here we employ these priors to perform nonparametric multiple hypothesis testing by specifying a spiked distribution as the centering distribution for the DP prior on the regression coefficient vectors β1, . . . , βG. Adopting a spiked centering distribution in the DP allows for a positive posterior probability on βt,g = 0, so that our proposed model is able to provide probabilities of sharp null hypotheses (e.g., H0,g : β1,g = . . . = βT∗,g = 0 for g = 1, . . . , G) while simultaneously borrowing strength from objects likely to have the same value of the regression coefficients.
We also adopt a “super-sparsity” prior on the probability of βt,g = 0 (defined as πt for all g), since it is not uncommon for changes in expression to be minimal across treatments for many genes. The idea of the “super-sparsity” prior was investigated in Lucas et al. (2006). By using another layer in the prior for πt, the probability of βt,g = 0 will be shrunk toward one for genes showing no changes in expression across treatment conditions.
Specifically, our model uses the following prior for the regression coefficients β1, . . . , βG:

$$
\begin{aligned}
\beta_1, \ldots, \beta_G \mid G_\beta &\sim G_\beta, \\
G_\beta &\sim DP(\alpha_\beta, G_{0\beta}), \\
G_{0\beta} &= \prod_{t=1}^{T} \left\{ \pi_t\, \delta_0(\beta_{t,g}) + (1 - \pi_t)\, N(\beta_{t,g} \mid m_t, \tau_t) \right\}, \\
\pi_1, \ldots, \pi_T \mid \rho_1, \ldots, \rho_T &\sim \prod_{t=1}^{T} \left\{ (1 - \rho_t)\, \delta_0(\pi_t) + \rho_t\, \mathrm{Beta}(\pi \mid a_\pi, b_\pi) \right\}, \\
\rho_1, \ldots, \rho_T &\sim \mathrm{Beta}(\rho \mid a_\rho, b_\rho), \\
\tau_1, \ldots, \tau_T &\sim \mathrm{Gamma}(\tau \mid a_\tau, b_\tau).
\end{aligned}
$$
Note that a spiked formulation is used for each element of the regression coefficient
vector and πt = p(βt,1 = 0) = . . . = p(βt,G = 0). Typically, mt = 0, but other values
may be desired. We use the parameterization of the gamma distribution where the
expected value of τt is aτ bτ . For simplicity, let π = (π1 , · · · , πT ) and τ = (τ1 , · · · , τT ).
After marginalizing over πt for all t, G0β becomes

$$ G_{0\beta} = \prod_{t=1}^{T} \left\{ \rho_t r_\pi\, \delta_0(\beta_t) + (1 - \rho_t r_\pi)\, N(\beta_t \mid m_t, \tau_t) \right\}, \qquad \rho_1, \ldots, \rho_T \sim \mathrm{Beta}(\rho \mid a_\rho, b_\rho), $$

where rπ = aπ/(aπ + bπ). As noted in the equation above, ρt rπ is now specified as the probability that βt,g = 0 for all g.
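To illustrate the spiked centering distribution, the following sketch (our own Python code; the hyperparameter values are hypothetical, chosen only for the example) draws a single coefficient vector from the marginalized G0β above: each element is zero with probability ρt rπ and Gaussian otherwise.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 5
a_pi, b_pi = 1.0, 0.15            # hypothetical values (cf. Section 4.1)
a_rho, b_rho = 1.0, 0.005
r_pi = a_pi / (a_pi + b_pi)

rho = rng.beta(a_rho, b_rho, size=T)               # rho_t ~ Beta(a_rho, b_rho)
m = np.zeros(T)                                    # m_t = 0, as is typical
tau = rng.gamma(1.0, 0.5, size=T)                  # tau_t are precisions

# Spiked mixture: point mass at 0 w.p. rho_t * r_pi, else N(m_t, 1/tau_t).
is_zero = rng.random(T) < rho * r_pi
beta = np.where(is_zero, 0.0, rng.normal(m, 1.0 / np.sqrt(tau)))
```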
Prior Distribution on the Precisions
Our model accommodates heteroscedasticity while preserving parsimony by placing a DP prior on the precisions λ1, . . . , λG:

$$ \lambda_1, \ldots, \lambda_G \mid G_\lambda \sim G_\lambda, \qquad G_\lambda \sim DP(\alpha_\lambda, G_{0\lambda}), \qquad G_{0\lambda} = \mathrm{Gamma}(\lambda \mid a_\lambda, b_\lambda). $$
Note that the clustering of the regression coefficients is separate from that of the precisions. Although this treatment for the precisions also has the effect of clustering the
data, we are typically more interested in the clustering from the regression coefficients
since they capture changes across treatment conditions. We let ξλ denote the set partition for the precisions λ1 , . . . , λG and let φλ = (λS1 , . . . , λSq ) be the collection of unique
precision values.
Prior Distribution on the Precision Parameters for DP
Following Escobar and West (1995), we place independent Gamma priors on the precision parameters αβ and αλ of the DP priors:

$$ \alpha_\beta \sim \mathrm{Gamma}(\alpha_\beta \mid a_{\alpha_\beta}, b_{\alpha_\beta}), \qquad \alpha_\lambda \sim \mathrm{Gamma}(\alpha_\lambda \mid a_{\alpha_\lambda}, b_{\alpha_\lambda}). $$
Prior Distribution on the Means
We assume a Gaussian prior on the object-specific mean parameters µ1, . . . , µG:

$$ \mu_g \sim N(\mu_g \mid m_\mu, p_\mu). \qquad (3) $$

3 Inferential Procedures
In this section, we describe how to conduct multiple hypothesis tests and clustering
inference in the context of our model. We treat the object-specific means µ1 , . . . , µG as
nuisance parameters since they are not used either in forming clusters or for multiple
testing. Thus, we integrate the likelihood with respect to their prior distribution in (3).
Simple calculations lead to the following integrated likelihood (Dahl et al. 2008):

$$ d_g \mid \beta_g, \lambda_g \sim N_K\!\left( d_g \,\middle|\, X\beta_g + E_g^{-1} f_g,\ \frac{E_g}{\lambda_g j' M j + p_\mu} \right), \qquad (4) $$

where

$$ E_g = \lambda_g (\lambda_g j' M j + p_\mu) M - \lambda_g^2 M j j' M, \qquad f_g = \lambda_g m_\mu p_\mu M j. \qquad (5) $$
Inference is based on the marginal posterior distribution of the regression coefficients, i.e., p(β1, . . . , βG | d1, . . . , dG) or, equivalently, p(ξβ, φβ | d1, . . . , dG). This distribution is not available in closed form, so we use Markov chain Monte Carlo (MCMC) to sample from the full posterior distribution p(ξβ, φβ, ξλ, φλ, ρ, τ | d1, . . . , dG) and marginalize over the parameters ξλ, φλ, ρ, and τ.
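The quantities in (4)–(5) are simple matrix expressions; a minimal sketch (our Python, assuming M, λg, mμ, and pμ are already in hand) computes Eg, fg, and the precision matrix appearing in (4):

```python
import numpy as np

def integrated_likelihood_terms(M, lam_g, m_mu, p_mu):
    """Return E_g, f_g, and the precision Q_g = E_g/(lam_g j'Mj + p_mu)
    appearing in equations (4)-(5)."""
    K = M.shape[0]
    j = np.ones(K)
    c = lam_g * (j @ M @ j) + p_mu            # lambda_g j'Mj + p_mu (scalar)
    Mj = M @ j
    E_g = lam_g * c * M - lam_g**2 * np.outer(Mj, Mj)
    f_g = lam_g * m_mu * p_mu * Mj
    Q_g = E_g / c                             # precision matrix in (4)
    return E_g, f_g, Q_g
```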
3.1 MCMC Scheme
Our MCMC sampling scheme updates each of the following parameters, one at a time:
ξβ , φβ , ξλ , φλ , ρ, and τ . Recall that βS is the element of φβ associated with cluster
S ∈ ξβ , with βSt being element t of that vector. Likewise, λS is the element of φλ
associated with cluster S ∈ ξλ . Given starting values for these parameters, we propose
the following MCMC sampling scheme. Details for the first three updates are available
in the Appendix.
(1) Obtain draws of ρ = (ρ1, . . . , ρT) from its full conditional distribution by the following procedure. First, sample Yt = rπ ρt from its conditional distribution:

$$ y_t \mid \cdot \;\sim\; p(y_t)\; y_t^{\sum_{S \in \xi} I(\beta_{St} = 0)}\, (1 - y_t)^{\sum_{S \in \xi} I(\beta_{St} \neq 0)}, \qquad \text{with} \quad p(y_t) \propto y_t^{\,a_\rho + b_\rho - 1} (r_\pi - y_t)^{b_\rho - 1}, $$

which does not have a known distributional form; a grid-based inverse-cdf method has been adopted for sampling yt. Once we have drawn Yt, we obtain ρt as Yt / rπ.
(2) Draw samples of τ = (τ1, · · · , τT) from their full conditional distributions:

$$ \tau_t \mid \cdot \;\sim\; \mathrm{Gamma}\!\left( a_\tau + \frac{|\zeta_t|}{2},\ \left[ \frac{1}{b_\tau} + \frac{1}{2} \sum_{S \in \zeta_t} (\beta_{St} - m_t)^2 \right]^{-1} \right), \qquad (6) $$

where ζt = {S ∈ ξβ | βSt ≠ 0} and |ζt| is its cardinality.
(3) Draw samples of βS = (βS1, . . . , βST) from their full conditional distributions:

$$ \beta_{St} \mid \cdot \;\sim\; \pi_{St}\, \delta_0 + (1 - \pi_{St})\, N(h_t^{-1} z_t,\ h_t), \qquad (7) $$

where

$$ h_t = \tau_t + \sum_{g \in S} x_t' Q_g x_t, \qquad z_t = m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g, $$
$$ Q_g = (\lambda_g j' M j + p_\mu)^{-1} E_g, \qquad A_g = d_g - X_{(-t)} \beta_{S(-t)} - E_g^{-1} f_g, $$

and the probability πSt is

$$ \pi_{St} = \frac{y_t}{y_t + (1 - y_t) \sqrt{h_t^{-1} \tau_t}\, \exp\!\left\{ -\tfrac{1}{2} \tau_t m_t^2 + \tfrac{1}{2} h_t^{-1} z_t^2 \right\}}, $$

where yt = ρt rπ with rπ = aπ/(aπ + bπ), and X(−t) and βS(−t) denote X and βS with element t removed, respectively.
(4) Since a closed-form full conditional for λS is not available, update λS using a
univariate Gaussian random walk.
(5) Update ξβ using the Auxiliary Gibbs algorithm (Neal 2000).
S. Kim, D.B. Dahl and M. Vannucci
715
(6) Update αβ from its conditional distribution:

$$ \alpha \mid \eta, k \;\sim\; \begin{cases} \mathrm{Gamma}(a_\alpha + k,\ b_\alpha^*) & \text{with probability } p_\alpha, \\ \mathrm{Gamma}(a_\alpha + k - 1,\ b_\alpha^*) & \text{with probability } 1 - p_\alpha, \end{cases} $$

where k is the current number of clusters, n = G, and

$$ b_\alpha^* = \left( \frac{1}{b_\alpha} - \log(\eta) \right)^{-1}, \qquad p_\alpha = \frac{a_\alpha + k - 1}{a_\alpha + k - 1 + n / b_\alpha^*}. $$

Also,

$$ \eta \mid \alpha, k \sim \mathrm{Beta}(\alpha + 1, n). $$
(7) Update ξλ using the Auxiliary Gibbs algorithm.

(8) Update αλ using the same procedure as in (6) above; a code sketch of this update follows the list.
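The update in steps (6) and (8) is the standard Escobar and West (1995) augmentation. A minimal sketch (our Python; here n is the number of objects G, k is the current number of clusters, and the prior is Gamma(shape = a, scale = b)):

```python
import numpy as np

def update_alpha(alpha, k, n, a, b, rng):
    """One Escobar-West update of a DP precision parameter alpha."""
    eta = rng.beta(alpha + 1.0, n)                    # eta | alpha, k
    b_star = 1.0 / (1.0 / b - np.log(eta))            # updated scale
    p = (a + k - 1.0) / (a + k - 1.0 + n / b_star)    # mixture weight
    shape = a + k if rng.random() < p else a + k - 1.0
    return rng.gamma(shape, b_star)

# usage: alpha_beta = update_alpha(alpha_beta, k, G, 5.0, 1.0,
#                                  np.random.default_rng())
```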
3.2 Inference from MCMC Results
Due to our formulation for the centering distribution of the DP prior on the regression
coefficients, our model can estimate the probability of sharp null hypotheses, such as
H0,g : β1,g = . . . = βT ∗ ,g = 0 for g = 1, . . . , G. Other hypotheses may be specified,
depending on the experimental goals. We estimate these probabilities by simply finding
the relative frequency that the hypotheses hold among the states of the Markov chains.
Our prior model formulation also permits inference on clustering of the G objects.
Several methods are available in the literature on DP models to estimate the cluster
memberships based on posterior samples. (See, for example, Medvedovic and Sivaganesan 2002; Dahl 2006; Lau and Green 2007.) In the examples below we adopt the
least-squares clustering estimation of Dahl (2006) which finds the clustering configuration among those sampled by the Markov chain that minimizes a posterior expected
loss proposed by Binder (1978) with equal costs of clustering mistakes.
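A sketch of this least-squares clustering estimate (our own Python; `draws` is an array of cluster-label vectors saved from the Markov chain) follows: it selects the sampled partition whose association matrix is closest, in squared error, to the matrix of posterior pairwise co-clustering probabilities.

```python
import numpy as np

def least_squares_clustering(draws):
    """Dahl (2006): choose the sampled partition minimizing
    ||A(draw) - Pbar||^2, where Pbar holds the posterior pairwise
    co-clustering probabilities estimated from the draws."""
    draws = np.asarray(draws)                           # (n_draws, G)
    assoc = (draws[:, :, None] == draws[:, None, :]).astype(float)
    pbar = assoc.mean(axis=0)                           # co-clustering probs
    losses = ((assoc - pbar) ** 2).sum(axis=(1, 2))
    return draws[np.argmin(losses)]

# usage: labels = least_squares_clustering([[0, 0, 1, 1],
#                                           [0, 0, 0, 1],
#                                           [0, 0, 1, 1]])
```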
3.3 Hyperparameter Settings

Our recommendation for setting the hyperparameters is based on computing, for each object, the least-squares estimates of the regression coefficients, the y-intercept, and the mean-squared error. We then set mµ to be the mean of the estimated y-intercepts and pµ to be the inverse of their variance. We also use the method of moments to set (aτ, bτ). This requires solving the following two equations:

$$ a_\tau b_\tau = \text{mean of the variances of the least-squares regression coefficients}, $$
$$ a_\tau b_\tau^2 = \text{sample variance of the variances of the least-squares regression coefficients}. $$
Cluster   Size   β1           β2           β3           β4           β5
1         300    0            0            0            0            0
2         50     0            0            ∼N(0, 1/4)   0            ∼N(0, 1/4)
3         50     0            0            0            0            ∼N(0, 1/4)
4         25     ∼N(0, 1/4)   ∼N(0, 1/4)   0            0            0
5         25     0            0            ∼N(0, 1/4)   ∼N(0, 1/4)   0
6         25     ∼N(0, 1/4)   ∼N(0, 1/4)   ∼N(0, 1/4)   ∼N(0, 1/4)   0
7         25     ∼N(0, 1/4)   0            ∼N(0, 1/4)   0            ∼N(0, 1/4)

Table 1: Schematic for the simulation of the regression coefficient vectors in the first alternative scenario.
Likewise, aλ and bλ are set using method of moments estimation, assuming that the inverses of the mean-squared errors are random draws from a gamma distribution having mean aλ bλ. As for (aπ, bπ) and (aρ, bρ), a specification such that $(r_\pi E[\rho_t])^T = \prod_{t=1}^{T} p(\beta_{t,g} = 0) = 0.50$ is recommended if there is no prior information available.
We refer to these recommended hyperparameter settings as the method of moments
(MOM) settings. The MOM recommendations are based on a thorough sensitivity
analysis we performed on all the hyperparameters using simulated data. Some results
of this simulation study are described in Section 4.1.
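The MOM settings can be computed directly from per-object least-squares summaries; a sketch (our Python, with our own function and argument names) is:

```python
import numpy as np

def mom_settings(intercepts, coef_vars, inv_mse):
    """Method-of-moments hyperparameters from least-squares summaries.

    intercepts: estimated y-intercepts, one per object
    coef_vars:  variances of the LS regression coefficients
    inv_mse:    inverses of the per-object mean-squared errors
    """
    m_mu = np.mean(intercepts)
    p_mu = 1.0 / np.var(intercepts)

    def gamma_mom(v):
        # Solve a*b = mean(v) and a*b^2 = var(v) for Gamma(shape=a, scale=b).
        b = np.var(v) / np.mean(v)
        return np.mean(v) / b, b

    a_tau, b_tau = gamma_mom(coef_vars)
    a_lam, b_lam = gamma_mom(inv_mse)   # same matching, by analogy
    return m_mu, p_mu, a_tau, b_tau, a_lam, b_lam
```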
4 Applications

We first demonstrate the performance in a simulation study and then apply our method to gene expression data analysis.

4.1 Simulation Study
Data Generation
In an effort to imitate the structure of the microarray data experiment examined in the
next section, we generated 30 independent datasets with 500 objects measured at two
treatments and three time points, having three replicates at each of the six treatment
combinations. Since the model includes an object-specific mean, we set β6,g = 0 so that
the treatment index t ranges from 1 to 5.
We simulated data in which the regression coefficients β for each cluster are distributed as described in Table 1. Similarly, the three pre-defined precisions λ1 = 1.5, λ2 = 0.2 and λ3 = 3.0 are randomly assigned to 180, 180, and 140 objects, respectively (500 objects in total).
Sample-specific means µg were generated from a univariate normal distribution with
mean 10 and precision 0.2. Finally, each vector dg was sampled from a multivariate
normal distribution with mean µg j + Xβg and precision matrix λg I, where I is an
identity matrix.
We repeated the procedure above to create 30 independent datasets. Our interest
lies in testing the null hypothesis H0,g : β1,g = . . . = β6,g = 0. All the computational
procedures were coded in Matlab.
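A condensed version of this data-generating scheme, following Table 1, might look as follows (the authors used Matlab; this Python sketch, with our own variable names, is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2023)
G, T = 500, 5

# Design: 2 treatments x 3 time points, 3 replicates each; the reference
# combination is absorbed by the object-specific mean (K = 18 rows).
X = np.kron(np.vstack([np.zeros((1, T)), np.eye(T)]), np.ones((3, 1)))

# Table 1: which coefficients are nonzero in each of the 7 clusters;
# each cluster shares one common value per nonzero coefficient.
sizes = [300, 50, 50, 25, 25, 25, 25]
nonzero = [[], [3, 5], [5], [1, 2], [3, 4], [1, 2, 3, 4], [1, 3, 5]]
beta = np.zeros((G, T))
start = 0
for size, idx in zip(sizes, nonzero):
    for t in idx:
        beta[start:start + size, t - 1] = rng.normal(0.0, 0.5)  # N(0, 1/4)
    start += size

lam = rng.permutation(np.repeat([1.5, 0.2, 3.0], [180, 180, 140]))
mu = rng.normal(10.0, 1.0 / np.sqrt(0.2), size=G)   # mean 10, precision 0.2
D = np.stack([rng.multivariate_normal(mu[g] + X @ beta[g],
                                      np.eye(X.shape[0]) / lam[g])
              for g in range(G)])
```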
Results
We applied the proposed method to the 30 simulated datasets. The model involves
several hyperparameters: mµ , pµ , aπ , bπ , aρ , bρ , aτ , bτ , aλ , bλ , aαβ , bαβ , aαλ , bαλ , and
mt. We set (aπ, bπ) = (1, 0.15) and (aρ, bρ) = (1, 0.005). The prior probability of the null hypothesis (i.e., that all the regression coefficients are zero) for an object is then about 50%: it equals (rπ E[ρ])^5, with rπ = aπ/(aπ + bπ) and E[ρ] = aρ/(aρ + bρ), the product across the T treatment conditions of Bernoulli success probabilities, each equal to rπ E[ρ]. We calculated the MOM recommendations from Section 3.3 to set (aτ, bτ) and (aλ, bλ); these recommendations are based on the sensitivity analysis described later in the paper. We somewhat arbitrarily set (aαβ, bαβ) = (5, 1) and (aαλ, bαλ) = (1, 1), so that the prior expected numbers of clusters are about 24 and 7 for the regression coefficients and precisions, respectively. We show the robustness of the choice of these parameters in a later section. For each dataset, we ran two Markov chains for 5,000 iterations with different starting clustering configurations.
A trace plot of the number of clusters of β from the two different starting points for one of the simulated datasets, as well as a similar plot for λ, is shown in Figure 1. Similar trace plots of the generated αβ and αλ are shown in Figure 2. They do not indicate any convergence or mixing problems, and the other datasets also had plots indicating good mixing. For each chain, we discarded the first 3,000 iterations as burn-in and pooled the results from the two chains.
Our interest in the study is to see whether there are changes between the two groups within a time point and across time points. Specifically, we considered the null hypothesis that all regression coefficients are equal to zero: for g = 1, . . . , 500,

$$ H_{0,g}: \beta_{1,g} = \cdots = \beta_{6,g} = 0 \qquad \text{vs.} \qquad H_{a,g}: \beta_{t,g} \neq 0 \ \text{for some } t = 1, \ldots, 6. $$
We ranked the objects by their posterior probabilities of alternative hypotheses Ha,g ,
which equal 1−Pr(H0,g |data). A plot of the ranked posterior probability for each object
is shown in Figure 3.
Figure 1: Trace plots of the number of clusters for the regression coefficients (a) and the precisions (b) when fitting a simulated dataset.

Figure 2: Trace plots of generated αβ and αλ when fitting a simulated dataset.

Figure 3: Probability of the alternative hypothesis (i.e., 1 − Pr(H0,g : β1,g = . . . = β6,g = 0 | data)) for each object of a simulated dataset of 500 objects.

Bayesian False Discovery Rate

Many multiple testing procedures seek to control some type of false discovery rate (FDR) at a desired value. The Bayesian FDR (Genovese and Wasserman 2003; Müller
et al. 2004; Newton et al. 2004) can be obtained by
$$ \widehat{\mathrm{FDR}}(c) = \frac{\sum_{g=1}^{G} D_g (1 - v_g)}{\sum_{g=1}^{G} D_g}, $$

where vg = Pr(Ha,g | data) and Dg = I(vg > c). We reject H0,g if the posterior probability vg is greater than the threshold c. The optimal threshold is the minimum value of c in the set $\{c : \widehat{\mathrm{FDR}}(c) \leq \alpha\}$ for a pre-specified error rate α. We
on average, is found to be 0.7 for an Bayesian FDR of 0.05. The Bayesian FDR has also
been compared with the true proportion of false discoveries (labeled as “Realized FDR”
in the plot) and is displayed in Figure 4. In this simulation, our Bayesian approach is
slightly anti-conservative. As shown in Dudoit et al. (2008), anti-conservative behavior
in FDR controlling approaches is often observed for data with high correlation structure
and a high proportion of true null hypotheses.
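In code, the Bayesian FDR and the corresponding cutoff take only a few lines (our Python sketch; `v` holds the posterior probabilities Pr(Ha,g | data) estimated from the MCMC output):

```python
import numpy as np

def bayes_fdr(v, c):
    """Estimated Bayesian FDR when rejecting H0_g whenever v_g > c."""
    rejected = v[v > c]
    return np.mean(1.0 - rejected) if rejected.size else 0.0

def optimal_cutoff(v, alpha=0.05):
    """Smallest grid value c with estimated Bayesian FDR <= alpha."""
    for c in np.linspace(0.0, 1.0, 1001):
        if bayes_fdr(v, c) <= alpha:
            return c
    return 1.0
```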
Comparisons with Other Methods
We assessed the performance of the proposed method by comparing it with three other methods: a standard analysis of variance (ANOVA), the SIMTAC method of Dahl et al. (2008), and LIMMA (Smyth 2004). The LIMMA procedure is set in the context of a general linear model and provides, for each gene, an F-statistic to test for differential expression at one or more time points. These F-statistics were used to rank the genes.
Figure 4: Plot of proportion of false discoveries and Bayesian FDR averaged over 30 datasets.
The SIMTAC method uses a modeling framework similar to the one we adopt, but it is not able to provide estimates of probabilities for H0,g since its posterior density is continuous. We used the univariate score suggested by Dahl et al. (2008) which captures support for the hypothesis of interest, namely $q_g = \sum_{t=1}^{T} \beta_{t,g}^2$. For the ANOVA procedure, we ranked objects by their p-values associated with H0,g; small p-values indicate little support for H0,g.
For each of the 30 datasets and each method, we ranked the objects as described above. These lists were truncated at 1, 2, . . . , 200 samples. At each truncation, the proportion of false discoveries was computed and averaged over the 30 datasets. Results are displayed in Figure 5. It is clear that our proposed method exhibits a lower proportion of false discoveries; its performance is substantially better than ANOVA and LIMMA and noticeably better than the SIMTAC method.

Figure 5: Average proportion of false discoveries for the three methods based on the 30 simulated datasets.
Sensitivity Analysis
The model involves several hyperparameters: mµ , pµ , aπ , bπ , aρ , bρ , aτ , bτ , aλ , bλ ,
aαβ , bαβ , aαλ , bαλ , and mt . In order to investigate the sensitivity to the choice of these
hyperparameters, we randomly selected one of the 30 simulated datasets for a sensitivity
analysis.
We considered ten different hyperparameter settings. In the first scenario, called the “MOM” setting, we used all the MOM estimates of the hyperparameters together with (aπ, bπ) = (1, 0.15), (aρ, bρ) = (1, 0.005), and (aαβ, bαβ) = (5, 1). The other nine scenarios, each changing one set of parameters while keeping all other parameters as in the first scenario, were:
i. (aπ , bπ ) = (15, 15), so that p(βt,g = 0) = 0.50.
ii. (aπ , bπ ) = (1, 9), so that p(βt,g = 0) = 0.10.
iii. (aρ , bρ ) = (1, 2), so that E[rπ ρt ] = 0.25.
iv. (aτ , bτ ) = (1, 0.26), to have smaller variance than MOM estimate.
v. (aτ , bτ ) = (1, 0.7), to have larger variance than MOM estimate.
vi. (aλ , bλ ) = (1, 0.5), to have smaller variance than MOM estimate.
vii. (aλ , bλ ) = (1, 3), to have larger variance than MOM estimate.
viii. (aαβ , bαβ ) = (25, 1), to have E[αβ ] = 25, so that prior expected number of clusters
is about 77.
ix. (aαβ , bαβ ) = (1, 1), to have E[αβ ] = 1, so that prior expected number of clusters
is about 7.
We set mt = 0. Also, the mean mµ of the distribution of µ was set to the mean of the estimated least-squares intercepts and the precision pµ to the precision of the estimated intercepts. An identity matrix was used for M since we assume independent sampling. We fixed αλ = 1 throughout the sensitivity analysis; we expect the sensitivity to this parameter to be similar to that for αβ. We ran two MCMC chains with different starting values: one chain started from one cluster (for both β and λ) and the other from G clusters (for both). Each chain was run for 5,000 iterations.
We assessed the sensitivity of the hyperparameter settings in two ways. Figure 6
shows that the proportion of false discoveries is remarkably consistent across the ten
different hyperparameter settings. We also identified, for each hyperparameter setting,
the 50 objects most likely to be “differentially expressed”. In other words, those 50 have
the smallest probability for the hypothesis H0 . Table 2 gives the number of common
objects among all the pairwise intersections from the various parameter settings. These
results indicate a high degree of concordance among the hyperparameter scenarios. We
are confident in recommending, in the absence of prior information, the use of the MOM estimates for (aτ, bτ) and (aλ, bλ), and choosing (aπ, bπ) and (aρ, bρ) such that p(βt,g = 0) = 0.50. The choice of (aαβ, bαβ) does not make a difference in the results.
Figure 6: Proportion of false discoveries under several hyperparameter settings based on one dataset.
Table 2: Among the 50 most likely differentially expressed objects, the number in common among the pairwise intersections of the samples identified under the ten hyperparameter settings.

             (i)   (ii)  (iii)  (iv)  (v)   (vi)  (vii)  (viii)  (ix)
MOM (both)   41    37    41     38    39    41    39     39      42
(i)                42    45     45    45    42    45     43      46
(ii)                     43     44    43    42    44     42      43
(iii)                           45    45    43    45     46      46
(iv)                                  44    40    47     42      44
(v)                                         45    44     44      45
(vi)                                              42     44      45
(vii)                                                    45      44
(viii)                                                           44
4.2 Gene expression study
We illustrate the advantage of our method in a microarray data analysis. The dataset was used in Dahl and Newton (2007). Researchers were interested in the transcriptional response to oxidative stress in mouse heart muscle and how that response changes with age. The data were obtained from two age groups of mice, young (5 months old) and old (25 months old), which were treated with an injection of paraquat (50 mg/kg). Mice were killed at 1, 3, 5 and 7 hours after the treatment or were killed without having received paraquat (called baseline), so the mice yield independent measurements rather than repeated measurements. Gene expression was measured 3 times under each treatment. Originally, gene expression was measured on 10,043 probe sets; we randomly selected G = 1,000 genes out of the 10,043 to reduce computation time. We also chose the first two treatments, baseline and 1 hour after injection, from both groups, since it is often of interest to see whether gene expression has changed within 1 hour of injection. Old mice at baseline were designated as the reference treatment. While the analysis is not invariant to the choice of the reference treatment, we show in Section 5 that the results are robust to this choice. The data were background-adjusted and normalized using the Robust Multichip Averaging (RMA) method of Irizarry et al. (2003).
Our two main biological goals are to identify genes which are either:

1. Differentially expressed in some way across the four treatment conditions, i.e., genes having small probability of H0,g : β1,g = β2,g = β3,g = 0, or

2. Similarly expressed at baseline between old and young mice, but differentially expressed 1 hour after injection, i.e., genes having large probability of Ha,g : |β1,g − β3,g| = 0 & |β2,g − β4,g| > c, for some threshold c, such as 0.1.
Assuming that information on how many genes are differentially expressed is not available, we set a prior on π by defining (aπ, bπ) = (10, 3) and (aρ, bρ) = (100, 0.05), which implies a belief that about 50% of genes are differentially expressed. We set (aαβ, bαβ) = (5, 5) and (aαλ, bαλ) = (1, 1) so that the expected numbers of clusters are 93 and 8 for the regression coefficients and precisions, respectively. Other parameters are estimated as we recommended in the simulation study. We ran two chains starting from two different initial states: (i) all genes in a single cluster and (ii) each gene in its own cluster. The Markov chain Monte Carlo (MCMC) sampler was run for 10,000 iterations, with the first 5,000 discarded as burn-in. Figure 7 shows trace plots of the number of clusters for both the regression coefficients and the precisions. The plots do not indicate convergence or mixing problems. The least-squares clustering method found a clustering for the regression coefficients with 14 clusters and a clustering for the precisions with 11 clusters.
Figure 7: Trace plots of the number of clusters for the regression coefficients and the precisions when fitting the gene expression data.

There were six large clusters for β with sizes greater than 50; together these clusters included 897 genes. The average gene expressions for each of the six clusters are shown in Figure 8(a). The y-axis indicates the average gene expression, and the x-axis indicates the treatments. Each cluster shows a unique profile. We found one cluster of 18 genes with all regression coefficients equal to zero (Figure 8(b)).
For hypothesis testing, we ranked genes by their posterior probabilities, identifying the genes least supportive of the null hypothesis H0,g : β1,g = β2,g = β3,g = β4,g = 0. We listed the fifty genes least supportive of this hypothesis; Figure 9 shows the heatmap of those fifty genes.
Finally, in order to identify genes satisfying the second hypothesis of interest, Ha,g : |β1,g − β3,g| = 0 & |β2,g − β4,g| > 0.1, we similarly identified the top fifty ranked genes. For this hypothesis, our approach clearly finds genes following the desired pattern, as shown in Figure 10.

Figure 8: Average expression profiles for (a) six large clusters; (b) the cluster with estimated β = 0.

Figure 9: Heatmap of the 50 top-ranked genes which are least supportive of the assertion that β1 = β2 = β3 = β4 = 0.

Figure 10: (a) Average gene expressions of the 50 top-ranked genes supportive of |β1 − β3| = 0 & |β2 − β4| > 0.1; (b) heatmap of those genes.

5 Discussion
We have proposed a semiparametric Bayesian method for random effects models in the
context of multiple hypothesis testing. A key feature of the model is the use of a spiked
centering distribution for the Dirichlet process prior. Dirichlet process mixture models
naturally borrow information across similar observations through model-based clustering, gaining increased power for testing. This centering distribution in the DP allows
the model to accommodate the estimation of sharp hypotheses. We have demonstrated
via a simulation study that our method yields a lower proportion of false discoveries
than other competitive methods. We have also presented an application to microarray
data where our method readily infers posterior probabilities of genes being differentially
expressed.
One issue with our model is that the results are not necessarily invariant to the
choice of the reference treatment. Consider, for example, the gene expression analysis of Section 4.2, in which we used the group (Old, Baseline) as the reference group. To investigate robustness, we reanalyzed the data using (Young, Baseline) as the reference group. We found that the rankings from the two analyses are very close to each other (Spearman's correlation = 0.9937762; Figure 11).

Figure 11: Scatter plot of the rankings of genes resulting from using the two reference treatments.
Finally, as we mentioned in Section 2.1, our current model can easily accommodate covariates by placing them in the X matrix. Such covariates might include, for example, demographic variables regarding the subject or environmental conditions (e.g., temperature in the lab) that affect each array measurement. Adjusting for such covariates has the potential to increase the statistical power of the tests.
1 Appendix

1.1 Full Conditional for Precision
$$
\begin{aligned}
p(\tau \mid d, \beta, \lambda) &\propto p(\tau) \prod_{S \in \xi} p(\beta_S \mid \pi, \tau)\, p(d_S \mid \beta_S, \lambda_S, \pi, \tau) \\
&\propto \prod_{t=1}^{T} \left\{ p(\tau_t) \prod_{S \in \zeta_t} p(\beta_{St} \mid \pi_t, \tau_t) \right\} \\
&\propto \prod_{t=1}^{T} p(\tau_t) \prod_{S \in \zeta_t} N(\beta_{St} \mid m_t, \tau_t) \\
&\propto \prod_{t=1}^{T} \tau_t^{\,a_\tau + |\zeta_t|/2 - 1} \exp\left[ -\tau_t \left( \frac{1}{b_\tau} + \frac{1}{2} \sum_{S \in \zeta_t} (\beta_{St} - m_t)^2 \right) \right].
\end{aligned}
$$
1.2 Full Conditional for the New Probability yt = ρt rπ of the Spike

Note that the modified prior is ρt rπ = p(βt = 0), where rπ = aπ/(aπ + bπ); thus we need the posterior ρt | rest ∝ p(βt = 0 | rest). Set Yt = rπ ρt. Then the density of Yt is

$$
p(y_t) = \frac{1}{B(a_\rho, b_\rho)} \frac{1}{r_\pi} \left( \frac{y_t}{r_\pi} \right)^{a_\rho - 1} \left( 1 - \frac{y_t}{r_\pi} \right)^{b_\rho - 1}
       = \frac{1}{B(a_\rho, b_\rho)} \left( \frac{1}{r_\pi} \right)^{a_\rho + b_\rho - 1} y_t^{\,a_\rho - 1} (r_\pi - y_t)^{b_\rho - 1}.
$$

Now, we draw Yt, not ρt, from its conditional distribution: for each t,

$$ p(y_t \mid \text{rest}) \propto p(y_t)\; y_t^{\sum_{S \in \xi} I(\beta_{St} = 0)}\, (1 - y_t)^{\sum_{S \in \xi} I(\beta_{St} \neq 0)}, $$

which is not of a known distributional form. We used a grid-based inverse-cdf method for sampling Yt; once we have drawn Yt, we obtain ρt as Yt / rπ.
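A grid-based inverse-cdf draw of Yt might look as follows (our own Python sketch; n0 and n1 denote the counts of clusters with βSt = 0 and βSt ≠ 0):

```python
import numpy as np

def draw_y(n0, n1, a_rho, b_rho, r_pi, n_grid=1000, rng=None):
    """Grid-based inverse-cdf draw from
    p(y) ∝ y^(a_rho + b_rho - 1 + n0) (r_pi - y)^(b_rho - 1) (1 - y)^n1
    on the interval (0, r_pi)."""
    rng = rng or np.random.default_rng()
    y = np.linspace(1e-8, r_pi - 1e-8, n_grid)
    log_p = ((a_rho + b_rho - 1 + n0) * np.log(y)
             + (b_rho - 1) * np.log(r_pi - y)
             + n1 * np.log1p(-y))
    p = np.exp(log_p - log_p.max())          # normalize in log space first
    cdf = np.cumsum(p)
    cdf /= cdf[-1]
    return y[np.searchsorted(cdf, rng.random())]

# rho_t is then recovered as draw_y(...) / r_pi
```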
1.3 Full Conditional for Regression Coefficients

$$
\begin{aligned}
p(\beta_{St} \mid \lambda_S, d_S, y_t, \beta_{S(-t)}) &\propto p(\beta_{St} \mid y_t) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S) \\
&= y_t\, \delta_0(\beta_{St}) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S)
 + (1 - y_t)\, N(\beta_{St} \mid m_t, \tau_t) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S).
\end{aligned}
$$

The first part is immediate, so consider the second part. Set x_t = (X_{1t}, · · · , X_{Kt})', X_{(−t)} = (x_1, · · · , x_{t−1}, x_{t+1}, · · · , x_T), and β_{S(−t)} = (β_{S1}, · · · , β_{S(t−1)}, β_{S(t+1)}, · · · , β_{ST})'. The second part is proportional to

$$
\begin{aligned}
& \exp\left\{ -\frac{1}{2} \tau_t (\beta_{St} - m_t)^2 \right\} \times \exp\left\{ -\frac{1}{2} \sum_{g \in S} D_g' Q_g D_g \right\},
\quad \text{where } D_g = d_g - x_t \beta_{St} - X_{(-t)} \beta_{S(-t)} - E_g^{-1} f_g, \\
&\propto \exp\left\{ -\frac{1}{2} \left( \tau_t \beta_{St}^2 - 2 \tau_t m_t \beta_{St} \right) \right\} \exp\left\{ -\frac{1}{2} \sum_{g \in S} (x_t \beta_{St} - A_g)' Q_g (x_t \beta_{St} - A_g) \right\} \\
&\propto \exp\left\{ -\frac{1}{2} \left[ \beta_{St}^2 \left( \tau_t + \sum_{g \in S} x_t' Q_g x_t \right) - 2 \beta_{St} \left( m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g \right) \right] \right\}.
\end{aligned}
$$

Therefore, for each t,

$$
\beta_{St} \mid \cdot \;=\;
\begin{cases}
0 & \text{with probability } \pi_{St}, \\[4pt]
\sim N\!\left( \dfrac{m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g}{\tau_t + \sum_{g \in S} x_t' Q_g x_t},\ \tau_t + \sum_{g \in S} x_t' Q_g x_t \right) & \text{with probability } 1 - \pi_{St}.
\end{cases}
$$
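This calculation translates directly into code; a sketch of the update for a single pair (S, t) (our own Python; Q_list and A_list hold Qg and Ag for g ∈ S, and all Gaussians are parameterized by their precision):

```python
import numpy as np

def update_beta_St(x_t, Q_list, A_list, y_t, m_t, tau_t, rng):
    """Draw beta_St from its spiked full conditional (Appendix 1.3)."""
    h = tau_t + sum(x_t @ Q @ x_t for Q in Q_list)
    z = m_t * tau_t + sum(x_t @ Q @ A for Q, A in zip(Q_list, A_list))
    # Probability of the point mass at zero:
    log_ratio = 0.5 * (np.log(tau_t) - np.log(h)
                       - tau_t * m_t**2 + z**2 / h)
    pi_St = y_t / (y_t + (1.0 - y_t) * np.exp(log_ratio))
    if rng.random() < pi_St:
        return 0.0
    return rng.normal(z / h, 1.0 / np.sqrt(h))   # h is a precision
```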
References
Antoniak, C. E. (1974). “Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems.” The Annals of Statistics, 2: 1152–1174.

Baldi, P. and Long, A. D. (2001). “A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes.” Bioinformatics, 17: 509–519.

Benjamini, Y. and Hochberg, Y. (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B: Methodological, 57: 289–300.

Berry, D. A. and Hochberg, Y. (1999). “Bayesian Perspectives on Multiple Comparisons.” Journal of Statistical Planning and Inference, 82: 215–227.

Binder, D. A. (1978). “Bayesian Cluster Analysis.” Biometrika, 65: 31–38.

Brown, P., Vannucci, M., and Fearn, T. (1998). “Multivariate Bayesian variable selection and prediction.” Journal of the Royal Statistical Society, Series B, 60: 627–641.

Cai, B. and Dunson, D. (2007). “Variable selection in nonparametric random effects models.” Technical report, Department of Statistical Science, Duke University.

Dahl, D. B. (2006). “Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model.” In Do, K.-A., Müller, P., and Vannucci, M. (eds.), Bayesian Inference for Gene Expression and Proteomics, 201–218. Cambridge University Press.

Dahl, D. B., Mo, Q., and Vannucci, M. (2008). “Simultaneous Inference for Multiple Testing and Clustering via a Dirichlet Process Mixture Model.” Statistical Modelling: An International Journal, 8: 23–39.

Dahl, D. B. and Newton, M. A. (2007). “Multiple Hypothesis Testing by Clustering Treatment Effects.” Journal of the American Statistical Association, 102(478): 517–526.

Do, K.-A., Müller, P., and Tang, F. (2005). “A Bayesian mixture model for differential gene expression.” Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3): 627–644.

Dudoit, S., Gilbert, H. N., and van der Laan, M. J. (2008). “Resampling-Based Empirical Bayes Multiple Testing Procedures for Controlling Generalized Tail Probability and Expected Value Error Rates: Focus on the False Discovery Rate and Simulation Study.” Biometrical Journal, 50: 716–744.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). “Multiple Hypothesis Testing in Microarray Experiments.” Statistical Science, 18(1): 71–103.

Dudoit, S., Yang, Y. H., Callow, M. J., and Speed, T. P. (2002). “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.” Statistica Sinica, 12(1): 111–139.

Dunson, D. B., Herring, A. H., and Engel, S. A. (2008). “Bayesian Selection and Clustering of Polymorphisms in Functionally Related Genes.” Journal of the American Statistical Association, in press.

Escobar, M. D. and West, M. (1995). “Bayesian Density Estimation and Inference Using Mixtures.” Journal of the American Statistical Association, 90: 577–588.

Genovese, C. and Wasserman, L. (2003). “Bayesian and Frequentist Multiple Testing.” In Bayesian Statistics 7, 145–161. Oxford University Press.

George, E. and McCulloch, R. (1993). “Variable selection via Gibbs sampling.” Journal of the American Statistical Association, 88: 881–889.

Gopalan, R. and Berry, D. A. (1998). “Bayesian Multiple Comparisons Using Dirichlet Process Priors.” Journal of the American Statistical Association, 93: 1130–1139.

Hochberg, Y. (1988). “A Sharper Bonferroni Procedure for Multiple Tests of Significance.” Biometrika, 75: 800–802.

Hommel, G. (1988). “A Stagewise Rejective Multiple Test Procedure Based on a Modified Bonferroni Test.” Biometrika, 75: 383–386.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U., and Speed, T. (2003). “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics, 4: 249–264.

Lau, J. W. and Green, P. J. (2007). “Bayesian model based clustering procedures.” Journal of Computational and Graphical Statistics, 16: 526–558.

Lucas, J., Carvalho, C., Wang, Q., Bild, A., Nevins, J. R., and West, M. (2006). “Sparse Statistical Modelling in Gene Expression Genomics.” In Do, K.-A., Müller, P., and Vannucci, M. (eds.), Bayesian Inference for Gene Expression and Proteomics, 155–174. Cambridge University Press.

MacLehose, R. F., Dunson, D. B., Herring, A. H., and Hoppin, J. A. (2007). “Bayesian methods for highly correlated exposure data.” Epidemiology, 18(2): 199–207.

Medvedovic, M. and Sivaganesan, S. (2002). “Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles.” Bioinformatics, 18: 1194–1206.

Müller, P., Parmigiani, G., Robert, C., and Rousseau, J. (2004). “Optimal Sample Size for Multiple Testing: The Case of Gene Expression Microarrays.” Journal of the American Statistical Association, 99: 990–1001.

Neal, R. M. (2000). “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” Journal of Computational and Graphical Statistics, 9: 249–265.

Newton, M., Kendziorski, C., Richmond, C., Blattner, F., and Tsui, K. (2001). “On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data.” Journal of Computational Biology, 8: 37–52.

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). “Detecting differential gene expression with a semiparametric hierarchical mixture method.” Biostatistics, 5: 155–176.

Scott, J. G. and Berger, J. O. (2006). “An Exploration of Aspects of Bayesian Multiple Testing.” Journal of Statistical Planning and Inference, 136: 2144–2162.

Smyth, G. K. (2004). “Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.” Statistical Applications in Genetics and Molecular Biology, 3: No. 1, Article 3.

Storey, J. (2007). “The optimal discovery procedure: A new approach to simultaneous significance testing.” Journal of the Royal Statistical Society, Series B, 69: 347–368.

Storey, J., Dai, J. Y., and Leek, J. T. (2007). “The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments.” Biostatistics, 8: 414–432.

Storey, J. D. (2002). “A Direct Approach to False Discovery Rates.” Journal of the Royal Statistical Society, Series B: Statistical Methodology, 64(3): 479–498.

Storey, J. D. (2003). “The Positive False Discovery Rate: A Bayesian Interpretation and the q-value.” The Annals of Statistics, 31(6): 2013–2035.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). “Strong Control, Conservative Point Estimation and Simultaneous Conservative Consistency of False Discovery Rates: a Unified Approach.” Journal of the Royal Statistical Society, Series B: Statistical Methodology, 66(1): 187–205.

Tibshirani, R. and Wasserman, L. (2006). “Correlation-sharing for Detection of Differential Gene Expression.” Technical Report 839, Department of Statistics, Carnegie Mellon University.

Westfall, P. H. and Wolfinger, R. D. (1997). “Multiple Tests with Discrete Distributions.” The American Statistician, 51: 3–8.

Westfall, P. H. and Young, S. S. (1993). Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. John Wiley & Sons.

Yuan, M. and Kendziorski, C. (2006). “A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification.” Biometrics, 62: 1089–1098.
Acknowledgments

Marina Vannucci is supported by NIH/NHGRI grant R01HG003319 and by NSF award DMS-0600416. The authors thank the Editor, the Associate Editor and the referee for their comments and constructive suggestions to improve the paper.
Bayesian Analysis (2009)
4, Number 4, pp. 733–758
Modeling space-time data using stochastic differential equations

Jason A. Duan∗, Alan E. Gelfand† and C. F. Sirmans‡

∗ Department of Marketing, McCombs School of Business, University of Texas, Austin, TX, mailto:duanj@mccombs.utexas.edu
† Department of Statistical Science, Duke University, Durham, NC, mailto:alan@stat.duke.edu
‡ Department of Risk Management, Insurance, Real Estate and Business Law, Florida State University, Tallahassee, FL, mailto:cfsirmans@cob.fsu.edu

© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA427
Abstract. This paper demonstrates the use and value of stochastic differential
equations for modeling space-time data in two common settings. The first consists
of point-referenced or geostatistical data where observations are collected at fixed
locations and times. The second considers random point pattern data where the
emergence of locations and times is random. For both cases, we employ stochastic differential equations to describe a latent process within a hierarchical model
for the data. The intent is to view this latent process mechanistically and endow
it with appropriate simple features and interpretable parameters. A motivating
problem for the second setting is to model urban development through observed
locations and times of new home construction; this gives rise to a space-time point
pattern. We show that a spatio-temporal Cox process whose intensity is driven
by a stochastic logistic equation is a viable mechanistic model that affords meaningful interpretation for the results of statistical inference. Other applications of
stochastic logistic differential equations with space-time varying parameters include modeling population growth and product diffusion, which motivate our first,
point-referenced data application. We propose a method to discretize both time
and space in order to fit the model. We demonstrate the inference for the geostatistical model through a simulated dataset. Then, we fit the Cox process model
to a real dataset taken from the greater Dallas metropolitan area.
Keywords: geostatistical data, point pattern, hierarchical model, stochastic logistic
equation, Markov chain Monte Carlo, urban development
1 Introduction
The contribution of this paper is to demonstrate the use and value of stochastic differential equations (SDE) to yield mechanistic and physically interpretable models for
space-time data. We consider two common settings: (i) real-valued and point-referenced
geostatistical data where observations are taken at non-random locations s and times
t and (ii) spatio-temporal point patterns where the locations and times themselves are
random. In either case, we assume s ∈ D ⊂ R2 where D is a fixed compact region and
t ∈ (0, T ] where T is specified.
Examples of spatio-temporal geostatistical data abound in the literature. Examples
appropriate to our objectives include ecological process models such as photosynthesis,
transpiration, and soil moisture; diffusion models for populations, products or technologies; financial processes such as house price and/or land values over time. Here, we
employ a customary geostatistical modeling specification, i.e., noisy space-time data
are modeled by
$$ Y(s, t) = \Lambda(s, t) + \epsilon(s, t), \qquad (1) $$

where ε(s, t) is a space-time noise/error process (contributing the “nugget”) and the
process of interest is Λ(s, t). For us, Λ(s, t) is a realization of a space-time stochastic
process generated by a stochastic differential equation.
Space-time point patterns also arise in many settings: in ecology, where we might seek the evolution of the range of a species over time by observing the locations of its presences; in disease incidence, examining the pattern of cases over time; and in urban development, explained using, say, the pattern of single-family homes constructed over time. The
random locations and times of these events are customarily modeled with an inhomogeneous intensity surface denoted again by Λ(s, t). Here, the theory of point processes
provides convenient tools; the most commonly used and easily interpretable model is
the spatio-temporal Poisson process: for any region in the area under study and any
specified time interval, the total number of observed points is a Poisson random variable
with mean equal to the integrated intensity over that region and time interval.
There is a substantial literature on modeling point-referenced space-time data. The
most common approach is the introduction of spatio-temporal random effects described
through a Gaussian process with a suitable space-time covariance function (e.g., Brown
et al. 2000; Gneiting 2002; Stein 2005). If time is discretized, we can employ dynamic
models as in Gelfand et al. (2005). If locations are on a lattice (or are projected to
a lattice), we can employ Gaussian Markov random fields (Rue and Held 2005). For
general discussion of such space-time modeling see Banerjee et al. (2004).
There is much less statistical literature on space-time point patterns. However, the
mathematical theory of point processes on a general carrying space is well established
(Daley and Vere-Jones 1988; Karr 1991). Cressie (1993) and Møller and Waagepetersen
(2004) focus primarily on two-dimensional spatial point processes. Recent developments
in spatio-temporal point process modeling include Ogata (1998) with application to
statistical seismology and Brix and Møller (2001) with application in modeling weeds.
Brix and Diggle (2001), in modeling a plant disease, extend the log Gaussian Cox process
(Møller et al. 1998) to a space-time version by using a stochastic differential equation
model. See Diggle (2005) for a comprehensive review of this literature.
In either of the settings (i) and (ii) above, we propose to work with stochastic differential equation models. That is, we intentionally specify Λ (s, t) through a stochastic
differential equation rather than a spatio-temporal process (see, e.g., Banerjee et al.
2004 and references therein). We specifically introduce a mechanistic modeling scheme
where we are directly interested in the parameters that convey physical meanings in the
mechanism described by a stochastic differential equation. For example, the prototype
of our study is the logistic equation
\[
\frac{\partial \Lambda(s,t)}{\partial t} = r(s,t)\,\Lambda(s,t)\left[1 - \frac{\Lambda(s,t)}{K(s)}\right],
\]
where K (s) is the “carrying capacity” (assuming it is time-invariant) and r (s, t) is
the “growth rate”. Spatially and/or temporally varying parameters, such as growth
rate and carrying capacity, can be modeled by spatio-temporal processes. In practice,
the logistic equation finds various applications, e.g., population growth in ecology (Kot
2001), product and technology diffusion in economics (Mahajan and Wind 1986), and
urban development (see Section 4.2).
We recognize the flexibility that comes with a “purely empirical” model such as a
realization of a stationary or nonstationary space-time Gaussian process or smoothing
splines and that such specifications can be made to fit a given dataset at least as well
as a specification limited to a given class of stochastic differential equations. However,
for space-time data collected from physical systems it may be preferable to view them
as generated by appropriate simple mechanisms with necessary randomness. That is,
we particularly seek to incorporate the features of the mechanistic process into the
model for the space-time data enabling interpretation of the spatio-temporal parameter
processes that is more natural and intuitive. We also demonstrate, through a simulation
example, that, when such differential equations generate the data (up to noise), the
data can inform about the parameters in these equations and that model performance
is preferable to that of a customary model employing a random process realization.
In this regard, we must resort to discretization of time to work with these models,
i.e., we actually fit stochastic finite difference models. In other words, the continuous-time specification describes the process we seek to capture, but to fit this specification
with observed data, we employ first order (Euler) approximation. Evidently, this raises
questions regarding the effect of discretization. Theoretical discussion in terms of weak
convergence is presented, for instance, in Kloeden and Platen (1992) while practical
refinements through latent variables have been proposed in, e.g., Elerian et al. (2001).
In any event, Euler approximation is widely applied in the literature and it is beyond our
contribution here to explore its impact. Moreover, the results of our simulation example
in Section 3.3 reveal that we recover the true parameters of the stochastic PDE quite
well under the discretization.
Indeed, the real issue for us is how to introduce randomness into a selected differential
equation specification. Section 2 is devoted to a brief review of available options and
their associated properties. Section 3 develops the geostatistical setting and provides
the aforementioned simulation illustration.
An early motivating problem for our research was the modeling of new home constructions in a fixed region, e.g., within a city boundary. Mathematically and conceptually, the continuous trend that drives the construction of new houses can be captured
by the logistic equation, where there is a rate for the growth of the number of houses
and a carrying capacity that limits the total number. When new houses can be built at
any locations and times, yielding a space-time point pattern, a spatio-temporal point
process governed by a version of the stochastic logistic equation becomes an appealing
mechanistic model. However, we also acknowledge its limitations, suggesting possible
extensions at the end of the paper.
Section 4.1 details the modeling for space-time point patterns and addresses formal
modeling and computational issues. Section 4.2 provides a careful analysis of the house
construction dataset. Finally, in Section 5, we conclude with a summary and some
future directions.
2 SDE models for spatio-temporal data
Ignoring location for the moment, a usual nonlinear (non-autonomous) differential equation subject to the initial condition takes the form
\[
d\Lambda(t) = g(\Lambda(t), t, r(t))\,dt \quad \text{and} \quad \Lambda(0) = \Lambda_0. \tag{2}
\]
A natural way to add randomness is to model the random parameter r(t) with an SDE:

\[
dr(t) = a(r(t), t, \beta)\,dt + b(r(t), t)\,dB(t)
\]

where B(t) is an independent-increment process on ℝ¹.¹
Analytic solutions of SDEs are rarely available, so we usually employ a first order Euler approximation of the form

\[
\begin{aligned}
\Lambda(t + \Delta t) &= \Lambda(t) + g(\Lambda(t), t, r(t))\,\Delta t\\
\text{and} \quad r(t + \Delta t) &= r(t) + a(r(t), t, \beta)\,\Delta t + b(r(t), t)\,[B(t + \Delta t) - B(t)]
\end{aligned}
\]
where ∆t is the interval between time points and B (t + ∆t) − B (t) ∼ N (0, ∆t) if B(t)
is a Brownian motion process. Higher order (Runge-Kutta) approximations can be
introduced but these do not seem to be employed in the statistics literature. Rather,
recent work introduces latent variables Λ(t′ )’s between Λ(t) and Λ(t + ∆t). See, e.g.,
Elerian et al. (2001); Golightly and Wilkinson (2008) and Stramer and Roberts (2007).
Our prototype is the logistic equation
\[
d\Lambda(t) = r(t)\,\Lambda(t)\left[1 - \frac{\Lambda(t)}{K}\right]dt \quad \text{and} \quad \Lambda(0) = \Lambda_0. \tag{3}
\]

To introduce systematic randomness into this model, we specify a mean-reverting Ornstein-Uhlenbeck process for r(t):

\[
dr(t) = \alpha\,(\mu_r - r(t))\,dt + \sigma_\zeta\, dB(t). \tag{4}
\]
Under Brownian motion it is known that r(t) is a stationary Gaussian process with cov(r(t), r(t′)) = (σζ²/2α) exp(−α|t − t′|).
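To make the preceding specification concrete, the following sketch applies the first-order Euler scheme above to the prototype (3)–(4). It is an illustrative simulation only; the parameter values are loosely inspired by the simulated example in Section 3.3 and are not prescribed by the model.

```python
import numpy as np

def simulate_logistic_ou(n_steps, dt, lam0, K, mu_r, alpha, sigma_zeta, seed=0):
    """Euler approximation of the logistic SDE (3), with the growth rate r(t)
    following the mean-reverting Ornstein-Uhlenbeck process (4)."""
    rng = np.random.default_rng(seed)
    lam = np.empty(n_steps + 1)
    r = np.empty(n_steps + 1)
    lam[0], r[0] = lam0, mu_r          # start r(t) at its stationary mean
    for j in range(n_steps):
        # Logistic drift step for Lambda(t).
        lam[j + 1] = lam[j] + r[j] * lam[j] * (1.0 - lam[j] / K) * dt
        # OU step: B(t + dt) - B(t) ~ N(0, dt) for Brownian motion.
        r[j + 1] = r[j] + alpha * (mu_r - r[j]) * dt \
                   + sigma_zeta * rng.normal(0.0, np.sqrt(dt))
    return lam, r

lam, r = simulate_logistic_ou(n_steps=300, dt=0.1, lam0=0.02, K=1.0,
                              mu_r=0.24, alpha=0.6, sigma_zeta=0.08)
```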
To extend model (3) to a spatio-temporal setting, we can model Λ (s, t) at every
location s with a SDE
\[
\frac{\partial \Lambda(s,t)}{\partial t} = r(s,t)\,\Lambda(s,t)\left[1 - \frac{\Lambda(s,t)}{K(s)}\right] \tag{5}
\]
¹ This is a very general specification. For example, a common SDE model is dΛ(t) = f(Λ(t), t)dt + h(Λ(t), t)dB(t), where B(t) is Brownian motion over ℝ¹ with f and h the “drift” and “volatility” respectively. This model can be considered as model (2) with g(Λ(t), t, r(t)) = f(Λ(t), t) + r(t)h(Λ(t), t) and r(t)dt = dB(t), which implies dΛ(t) = f(Λ(t), t)dt + h(Λ(t), t)dB(t).
subject to the initial conditions Λ(s, 0) = Λ₀(s). Expression (5) is derived directly from (3) as follows. Assume model (3) for the aggregate Λ(D, t) = ∫_D Λ(s, t) ds:

\[
\frac{\partial \Lambda(D,t)}{\partial t} = r(D,t)\,\Lambda(D,t)\left[1 - \frac{\Lambda(D,t)}{K(D)}\right]. \tag{6}
\]

Here r(D, t) is the average growth rate of Λ(D, t), i.e., r(D, t) = (∫_D r(s, t) ds)/|D|, where |D| is the area of D. K(D) is the aggregate carrying capacity, i.e., K(D) = ∫_D K(s) ds.

The model for Λ(s, t) at any location s can be considered as the infinitesimal limit of the model (6) when D is a neighborhood δ_s of s whose area goes to zero. Then,

\[
\begin{aligned}
\lim_{|\delta_s|\to 0} \Lambda(\delta_s, t)/|\delta_s| &= \lim_{|\delta_s|\to 0}\Big(\int_{\delta_s}\Lambda(s', t)\,ds'\Big)\Big/|\delta_s| = \Lambda(s, t);\\
\lim_{|\delta_s|\to 0} K(\delta_s)/|\delta_s| &= \lim_{|\delta_s|\to 0}\Big(\int_{\delta_s}K(s')\,ds'\Big)\Big/|\delta_s| = K(s);\\
\lim_{|\delta_s|\to 0} r(\delta_s, t) &= \lim_{|\delta_s|\to 0}\int_{\delta_s} r(s', t)\,ds'\Big/|\delta_s| = r(s, t).
\end{aligned}
\]
Plugging δs into (6) and passing to the limit, we obtain our local model (5).
Model (5) specifies an infinite-dimensional SDE model for the random field Λ(s, t), s ∈ D. Similar to (4), we can add randomness to (5) by extending the Ornstein-Uhlenbeck process to the case of infinite dimension,

\[
\frac{\partial r(s,t)}{\partial t} = L\,(\mu_r(s) - r(s,t)) + \frac{\partial B(s,t)}{\partial t}, \tag{7}
\]

where L(s) is a spatial linear operator given by

\[
L(s) = a(s) + \sum_{l=1}^{2} b_l(s)\,\frac{\partial}{\partial s_l} - \frac{1}{2}\sum_{l=1}^{2} c_l(s)\,\frac{\partial^2}{\partial s_l^2} \tag{8}
\]
where a(s), b_l(s) and c_l(s) are positive deterministic functions, with s₁ and s₂ the coordinates of location s, and B(s, t) is a spatially correlated Brownian motion. Here, equations (5) and (7) define a spatio-temporal model with a nonstationary and non-Gaussian Λ(s, t) and a latent stationary Gaussian r(s, t). Note that a well-specified covariance C_B will guarantee mean-square continuity and differentiability of r(s, t), s ∈ D. Because the logistic equation is Lipschitz, Λ(s, t) will also be mean-square continuous and differentiable.
The simplest Ornstein-Uhlenbeck process model for r(s, t) sets b_l and c_l to zero (see Brix and Diggle 2001); the resulting covariance is separable in space and time. For example, with the Matérn spatial covariance function, we have

\[
C_r(s - s', t - t') = \sigma^2 \exp(-a|t - t'|)\,(\phi_\zeta |s - s'|)^{\nu}\, \kappa_\nu(\phi_\zeta |s - s'|), \tag{9}
\]
where κν(·) is the modified Bessel function of the second kind. When b_l and c_l are not equal to zero, r(s, t) defined by equation (7) is a blur-generated process in the sense of Brown et al. (2000), with a nonseparable, non-explicit spatio-temporal covariance function. Whittle (1963) and Jones and Zhang (1997) propose other stochastic partial differential equation models, which are shown by Brown et al. (2000) to be special examples of the Ornstein-Uhlenbeck process model above.
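As an illustration of the separable case, the sketch below builds the covariance matrix implied by (9) for ν = 3/2, for which the Matérn form has the normalized closed expression (1 + φd)e^{−φd}. The Kronecker-product construction anticipates the separable matrix σ²Σt ⊗ Σs used in Section 3.3; the function and argument names are our own.

```python
import numpy as np
from scipy.spatial.distance import cdist

def matern32(d, phi):
    """Matern correlation with smoothness nu = 3/2: the normalized
    closed form of (phi d)^nu kappa_nu(phi d) in equation (9)."""
    return (1.0 + phi * d) * np.exp(-phi * d)

def separable_cov(coords, times, sigma2, phi, alpha):
    """Separable space-time covariance sigma2 * Sigma_t (x) Sigma_s:
    exponential decay in time, Matern-3/2 in space."""
    d_s = cdist(coords, coords)                    # pairwise spatial distances
    d_t = np.abs(times[:, None] - times[None, :])  # pairwise time lags
    sigma_s = matern32(d_s, phi)
    sigma_t = np.exp(-alpha * d_t)
    return sigma2 * np.kron(sigma_t, sigma_s)

# Example: 5 random sites observed at 4 time points -> a 20 x 20 matrix.
rng = np.random.default_rng(0)
C = separable_cov(rng.uniform(0, 10, size=(5, 2)), np.arange(4.0),
                  sigma2=0.08**2, phi=0.7, alpha=0.6)
print(C.shape)
```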
Returning to the discussion in the Introduction, conceptually, Λ(s, t_j) is generated by the continuous-time process defined by (5) and (7). However, the exact solution of this infinite-dimensional SDE and the transition probability of the resulting Markov process for Λ(s, t_j) are not generally known in closed form given the error process B(s, t). Hence, for handling these models, time-discretization is usually required to compute Λ(s, t) in simulation and estimation. So, we use the Euler approximation to discretize the SDE model for Λ(s, t_j). The Euler scheme is broadly employed in simulating SDEs because it is the simplest method that is proven to generate a stochastic process with both strong convergence of order 1/2 and weak convergence of order 1 (see Kloeden and Platen 1992 for theoretical discussion). That is, the stochastic difference equation resulting from the Euler discretization scheme generates a process that converges to the process defined by the stochastic differential equation as the length of the time steps goes to zero.
3 Geostatistical models using SDE

3.1 A discretized space-time model with white noise
We assume time is discretized to small, equally spaced intervals of length Δt, indexed as t_j, j = 0, 1, ..., J. The data are taken to be Y(s_i, t_j); i.e., an observation at location s at any t ∈ (t_j, t_j + Δt) is labeled as Y(s, t_j). Then, we assume

\[
Y(s, t_j) = \Lambda(s, t_j) + \varepsilon(s, t_j),
\]

where ε(s, t_j) models either sampling or measurement errors, because a researcher cannot directly observe Λ(s, t_j).
The dynamics of the discretized Λ (s, tj ) is therefore modeled by a difference equation
using Euler’s approximation applied to (5):
\[
\Delta\Lambda(s, t_j) = r(s, t_{j-1})\,\Lambda(s, t_{j-1})\left[1 - \frac{\Lambda(s, t_{j-1})}{K(s)}\right]\Delta t, \tag{10}
\]

\[
\Lambda(s, t_j) \approx \Lambda(s, 0) + \sum_{l=1}^{j} \Delta\Lambda(s, t_l). \tag{11}
\]
We do not have to discretize the space-time model for r(s, t) if the stationary spatio-temporal Gaussian process ζ(s, t) allows direct evaluation of its covariance function. For example, the model (7) with constant L(s) = a_r has the closed-form separable
covariance function given in (9) which can be directly used in modeling and can be
estimated. Using this form saves one approximation step.
We still need to model the initial Λ (s, 0) and K (s) if they are not known. For
example, because Λ (s, 0) and K (s) are positive in the logistic equation, we can model
them by the log-Gaussian spatial processes with regression forms for the means,
\[
\begin{aligned}
\log \Lambda(s, 0) &= \mu_\Lambda(X_\Lambda(s), \beta_\Lambda) + \theta_\Lambda(s), \quad \theta_\Lambda(s) \sim GP(0, C_\Lambda(s - s'; \varphi_\Lambda));\\
\log K(s) &= \mu_K(X_K(s), \beta_K) + \theta_K(s), \quad \theta_K(s) \sim GP(0, C_K(s - s'; \varphi_K)).
\end{aligned}
\]
Similarly, µr (s) below (7) can be modeled as µr (Xr (s) , βr ).
Conditioned on Λ (s, tj ), the Y (s, tj ) are mutually independent. With data Y (si , tj )
at locations {si, i = 1, . . . , n} ⊂ D, we can provide a hierarchical model based on the
evolution of Λ (s, t) and the space-time parameters. We fit this model within a Bayesian
framework so completion of the model specification requires introduction of suitable
priors on the hyper-parameters.
For simplicity, we suppress the indices t and s and let our observations at time tj be
yj = {yj1 , . . . , yjn } at the corresponding s1 , . . . , sn locations. Accordingly, we let Λj ,
∆Λj , rj , K, µΛ (βΛ ), µK (βK ), µr (βr ), θΛ , θK and ζ be the vectors of the corresponding
functions and processes in our continuous model evaluated at si ∈ {s1 , . . . , sn }. Note
that we begin with the initial observations y0 . The hierarchical model for y0 , . . . , yJ
becomes
\[
\begin{aligned}
y_j \mid \Lambda_j &\sim N(\Lambda_j, \sigma_\varepsilon^2 I_n), \quad j = 0, \ldots, J,\\
\Delta\Lambda_j &= r_{j-1}\,\Lambda_{j-1}\left[1 - \frac{\Lambda_{j-1}}{K}\right]\Delta t,\\
\Lambda_j &= \Lambda_0 + \sum_{l=1}^{j} \Delta\Lambda_l,\\
\log \Lambda_0 &= \mu_\Lambda(\beta_\Lambda) + \theta_\Lambda, \quad \theta_\Lambda \sim N(0, C_\Lambda(s - s'; \varphi_\Lambda)),\\
\log K &= \mu_K(\beta_K) + \theta_K, \quad \theta_K \sim N(0, C_K(s - s'; \varphi_K)),\\
r &= \mu_r(\beta_r) + \zeta, \quad \zeta \sim N(0, C_r(s - s', t - t'; \varphi_r)),\\
\beta_\Lambda, \beta_r, \beta_K, \varphi_\Lambda, \varphi_K, \varphi_r &\sim \text{priors},
\end{aligned} \tag{12}
\]
where the β(·) are the parameters in the mean surface functions, and CΛ, CK and Cr are the covariance matrices.² In this model, Λ0, r and K are latent variables. Note that the Λj's are deterministic functions of Λ0, r and K. The joint likelihood for the J + 1 conditionally independent observations and the latent variables is
\[
\prod_{j=0}^{J}\left\{N\!\left(y_j \mid \Lambda_j(\Lambda_{j-1}, r_j, K),\, \sigma_\varepsilon^2 I_n\right)\right\} N(\log\Lambda_0 \mid \mu_\Lambda, C_\Lambda)\,N(\log K \mid \mu_K, C_K)\,N(r \mid \mu_r, C_r), \tag{13}
\]
where we let r = {r0 , . . . , rJ−1 }.
² We will write µΛ(βΛ), µK(βK), {µr^{(j)}(βr); j = 1, . . . , J}, CΛ(ϕΛ), CK(ϕK) and Cr(ϕr) as µΛ, µK, µr, CΛ, CK and Cr when there is no ambiguity.
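Since the Λj in (13) are deterministic functions of the latent variables, the data factor of the likelihood can be evaluated by running the recursion (10)–(11) forward. A minimal sketch, assuming r is stored as a J × n array of growth rates and the remaining arguments are vectors over the n sites (names ours):

```python
import numpy as np
from scipy.stats import norm

def data_loglik(y, lam0, r, K, dt, sigma_eps):
    """Log of the first factor of (13): Euler recursion (10)-(11) for the
    Lambda_j, plus Gaussian log-densities for the observations.
    y: (J+1, n) data; lam0, K: (n,) latent vectors; r: (J, n) growth rates."""
    lam = lam0.copy()
    loglik = norm.logpdf(y[0], loc=lam, scale=sigma_eps).sum()
    for j in range(1, y.shape[0]):
        # (10): compute the increment; (11): accumulate it into Lambda_j.
        lam = lam + r[j - 1] * lam * (1.0 - lam / K) * dt
        loglik += norm.logpdf(y[j], loc=lam, scale=sigma_eps).sum()
    return loglik
```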
3.2 Bayesian inference and prediction
With regard to inference for the model in (12), there are three latent vectors: r, K and
Λ0 . The hyper-parameters in this model include the βr , βK and βΛ in the parametric
trend surfaces, the spatial random effects ζ, θK , θΛ and the hyper-parameters ϕr , ϕK ,
ϕΛ in the covariance functions.
The priors for the hyper-parameters are assumed to have the form
\[
\beta_r, \beta_K, \beta_\Lambda \sim \pi(\beta_r)\cdot\pi(\beta_K)\cdot\pi(\beta_\Lambda); \qquad \varphi_r, \varphi_K, \varphi_\Lambda \sim \pi(\varphi_r)\cdot\pi(\varphi_K)\cdot\pi(\varphi_\Lambda), \tag{14}
\]
where each of βr, βK, βΛ, ϕr, ϕK, ϕΛ may represent multiple parameters. For example, we have ϕr = {αr, φr, σr², ν} in the Matérn class covariance function for the separable model (9). Exact specifications of the priors for the β's and ϕ's depend on the particular application. For example, if we take µr(s; βr) = X(s)βr, we adopt a weak normal prior N(0, Σβ) for βr. The parameter σr² receives the usual Inverse-Gamma prior. Note that the ∆Λj in the likelihood (13) for the discretized model are deterministic functions of r, K and Λ0 defined by (10) and (11). Therefore the joint posterior is proportional to
\[
\prod_{j=0}^{J}\left\{N\!\left(y_j \mid \Lambda_j(\Lambda_{j-1}, r_j, K),\, \sigma_\varepsilon^2 I_n\right)\right\} N(\log\Lambda_0 \mid \mu_\Lambda, C_\Lambda)\,N(\log K \mid \mu_K, C_K)\,N(r \mid \mu_r, C_r)\;\cdot
\]
\[
\pi(\beta_r)\,\pi(\beta_K)\,\pi(\beta_\Lambda)\,\pi(\varphi_r)\,\pi(\varphi_K)\,\pi(\varphi_\Lambda). \tag{15}
\]
We simulate the posterior distributions of the model parameters and latent variables in (15) using a Markov chain Monte Carlo algorithm. Because the intensities in the likelihood function are irregular, recursive and nonlinear functions of the model parameters and latent variables, it is very difficult to obtain the derivatives needed for an MCMC with directional moves, such as the Langevin method. Instead, we use a random-walk Metropolis-Hastings algorithm in the posterior simulation. Each parameter is updated in turn in every iteration of the simulation.
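A single update of this sampler has the generic random-walk Metropolis form sketched below; log_post stands for the log of the unnormalized posterior (15) viewed as a function of the block being updated, and the proposal standard deviation is a tuning constant (the names are ours).

```python
import numpy as np

def rw_metropolis_step(x, log_post, step_sd, rng):
    """One random-walk Metropolis update of a parameter block x.
    log_post: callable returning the unnormalized log posterior at x."""
    proposal = x + rng.normal(0.0, step_sd, size=np.shape(x))
    # Symmetric proposal, so the acceptance ratio is just a posterior ratio.
    if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
        return proposal, True
    return x, False
```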
The prediction problem concerns (i) interpolating the past at new locations and (ii) forecasting the future at current and new locations. Indeed, we can hold out the observed data at new locations or in a future time period to validate our model. For the logistic growth function, conditioning on the posterior samples of Λ0, K, r and βr, βK, βΛ, ϕr, ϕK, ϕΛ, we can use spatio-temporal interpolation and temporal extrapolation to obtain ∆Λ_{J+∆J}(s) in period J + ∆J at any new location s ∈ D: we calculate µr(s, βr), µK(s, βK) and µΛ(s, βΛ), obtain ζ(s, t), t = 1, . . . , J + ∆J, θK(s) and θΛ(s) by spatio-temporal prediction, and then use (10) and (11) recursively. Because we can obtain a predictive sample for ∆Λ_{J+∆J}(s) from the posterior samples of the model fitting, we can infer about any feature of interest associated with the predictive distribution of ∆Λ_{J+∆J}(s). The spatial interpolation of past observations at new locations is demonstrated in the subsection below using a simulated example. We will also demonstrate temporal prediction when we apply a Cox-process version of our model to the house construction data in Section 4.
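For the spatial random effects (e.g., θΛ at a new site), the interpolation step reduces to the usual conditional-normal (kriging) formula, applied once per posterior draw. A sketch under the assumption that cov_fn evaluates the fitted covariance function between two sets of sites (the names here are ours):

```python
import numpy as np

def gp_interpolate(theta_obs, coords_obs, coord_new, cov_fn):
    """Conditional mean and variance of a zero-mean Gaussian spatial effect
    at a new site, given its values theta_obs at the observed sites.
    coord_new should be a (1, 2) array of coordinates."""
    C = cov_fn(coords_obs, coords_obs)        # n x n covariance
    c_star = cov_fn(coords_obs, coord_new)    # n x 1 cross-covariance
    c_new = cov_fn(coord_new, coord_new)      # 1 x 1 variance at the new site
    w = np.linalg.solve(C, c_star)            # kriging weights
    mean = float(w.T @ theta_obs)
    var = float(c_new - c_star.T @ w)
    return mean, var
```

The interpolated effects are then pushed through the recursion (10)–(11) to produce the predictive growth curve at the new site.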
3.3 A simulated data example
In order to see how well we can learn about the true process, we illustrate the fitting of
the models in (12) with a simulated data set. In a study region D of 10×10 square units
shown as the block in Figure 1, we simulate 44 locations at which spatial observations are
collected over 30 periods. Therefore our observed spatio-temporal data constitute
a 44×30 matrix. The data are sampled using (12) where we fix the carrying capacity to
be one at all locations. We may envision that the data simulate the household adoption
rates for a certain durable product (e.g., air conditioners, motorcycles) in 44 cities
over 30 months. A capacity of one means 100% adoption. Household adoption rates are
collected by surveys with measurement errors. The initial condition Λ0 is simulated as a
log-Gaussian process with a constant mean surface µΛ and the Matérn class covariance
whose smoothness parameter ν is set to be 3/2. The spatio-temporal growth rate
r is simulated using a constant mean µr and the separable covariance function (9),
where the Matérn smoothness parameter ν is also set to be 3/2. This separable model
induces a convenient covariance matrix as the Kronecker product of the temporal and
spatial correlation matrices: σr² Σt ⊗ Σs. The values of the fixed parameters in our data
simulation are presented in Table 1.
Model Parameter    True Value    Posterior Mean    95% Equal-tail Interval
µΛ                 -4.2          -4.14             (-4.88, -3.33)
σΛ                  1.0           0.91             (0.62, 1.46)
φΛ                  0.7           0.77             (0.50, 1.20)
σε                  0.05          0.049            (0.047, 0.052)
µr                  0.24          0.24             (0.22, 0.26)
σr                  0.08          0.088            (0.077, 0.097)
φr                  0.7           0.78             (0.60, 1.10)
αr                  0.6           0.64             (0.51, 0.98)
Table 1: Parameters and their posterior inference for the simulated example
We use the simulated r and Λ0 and the transition equation (10) recursively to obtain
∆Λj and Λj for each of the 30 periods. The observed data yj are sampled as mutually
independent given Λj with the random noise εj . The data at four selected locations
(marked as 1, 2, 3, and 4 in Figure 1) are shown as small circles in Figure 2. We leave
out the data at four randomly chosen locations (shown in diamond shape and marked
as A, B, C and D in Figure 1) for spatial prediction and out-of-sample validation for
our model.
We fit the same model (12) to the data at the remaining 40 locations (hence a 40×30 spatio-temporal data set). We use very vague priors for the constant means: π(µΛ) ∼ N(0, 10⁸) and π(µr) ∼ N(0, 10⁸). We use natural conjugate priors for the precision parameters (inverses of the variances) of r and Λ0: π(1/σr²) ∼ Gamma(1, 1) and π(1/σΛ²) ∼ Gamma(1, 1). The positive parameter for the temporal correlation of r also has a vague log-normal prior: π(αr) ∼ log-N(0, 10⁸). Because the spatial range parameters φr and φΛ are only weakly identified (Zhang 2004), we only use informative, discrete priors for them. Indeed, we have chosen 20 values (from 0.1 to 2.0) and assume uniform priors over them for both φr and φΛ.
We use the random-walk Metropolis-Hastings algorithm to simulate posterior samples of r and Λ0. We draw the entire vector of Λ0 for all forty locations as a single block in every iteration. Because r is very high-dimensional (r being a 40×30 matrix), we cannot draw the entire matrix of r as one block and maintain a satisfactory acceptance rate (between 20% and 40%). Our algorithm divides r into 40 row blocks (location-wise) in every odd-numbered iteration and 30 column blocks (period-wise) in every even-numbered iteration. Each block is drawn in one Metropolis step. We find the posterior samples start to converge after about 30,000 iterations. Given the sampled r and Λ0, the mean parameters µr, µΛ and the precision parameters 1/σr² and 1/σΛ² all have conjugate priors, and therefore their posterior samples are drawn with Gibbs samplers; φr and φΛ have discrete priors and are therefore drawn with discrete Gibbs samplers as well. We also use a random-walk Metropolis-Hastings algorithm to draw αr.
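The alternating blocking scheme for r can be organized as below. This is a schematic of the location-wise/period-wise alternation rather than our production code; log_post_r and step_sd are placeholders for the conditional posterior of r and a tuned proposal scale.

```python
import numpy as np

def update_r_blocks(r, log_post_r, step_sd, iteration, rng):
    """Block random-walk Metropolis updates of the growth-rate matrix r
    (locations x periods): row blocks on odd iterations, column blocks
    on even iterations."""
    axis = 0 if iteration % 2 == 1 else 1
    for b in range(r.shape[axis]):
        idx = (b, slice(None)) if axis == 0 else (slice(None), b)
        proposal = r.copy()
        proposal[idx] += rng.normal(0.0, step_sd, size=proposal[idx].shape)
        if np.log(rng.uniform()) < log_post_r(proposal) - log_post_r(r):
            r = proposal
    return r
```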
We obtain 200,000 samples from the algorithm and discard the first 100,000 as burn-in. For the posterior inference, we use 4,000 subsamples from the remaining 100,000
samples, with a thinning equal to 25. It takes about 15 hours to finish the computation
using the R statistical software on an Intel Pentium 4 3.4GHz computer with 2GB of
memory. The posterior means and 95% equal-tail Bayesian posterior predictive intervals
for the model parameters are presented in Table 1. Evidently we are recovering the true
parameter values very well. Figure 2 displays the posterior mean of the growth curves
and 95% Bayesian predictive intervals for the four locations (1, 2, 3 and 4), compared
with the actual latent growth curve Λ (t, s) and observed data. Up to the uncertainty
in the model we approximate the actual curves very well. The fitted mean growth
curves almost perfectly overlap with the actual simulated growth curves. The empirical
coverage rate of the Bayesian predictive bounds is 93.4%.
We use the Bayesian spatial interpolation in Section 3.2 to obtain the predictive
growth curve for four new locations (A, B, C and D). In Figure 3 we display the means
of the predicted curves and 95% Bayesian predictive intervals, together with the hold-out data. We can see the spatial prediction captures the patterns of the hold-out data
very well. The predicted mean growth curves overlap with the actual simulated growth
curves very well except for location D, because location D is rather far from all the
observed locations. The empirical coverage rate of the Bayesian predictive intervals is
95.8%.
We also fit the following customary process realization model with space-time random effects to the simulated data set:

\[
y_j = \mu + \xi_j + \varepsilon_j; \quad \varepsilon_j \sim N(0, \sigma_\varepsilon^2 I_n), \quad j = 0, \ldots, J, \tag{16}
\]

where the random effects ξ = [ξ0, . . . , ξJ] come from a Gaussian process with a separable spatio-temporal correlation of the form

\[
C_\xi(t - t', s - s') = \sigma_\xi^2 \exp(-\alpha_\xi|t - t'|)\,(\phi_\xi|s - s'|)^{\nu}\,\kappa_\nu(\phi_\xi|s - s'|), \quad \nu = \tfrac{3}{2}. \tag{17}
\]
Comparison of model performance between our model in (12) and the model in
(16) is conducted using spatial prediction at the 4 new locations in Figure 1. The
computational cost of the model in (16) is, of course, much lower; this model can be
fitted with a Gibbs sampler and requires one hour for 100,000 iterations. After we
discard 20,000 as burn-in and thin the remaining samples to 4,000, we conduct the
prediction on the four new sites (A, B, C and D). In Figure 4 we display the means of
the predicted curves and 95% Bayesian predictive intervals, together with the hold-out
data. For the four hold-out sites, the average mean square error of model (12) is 1.75×10⁻³ versus 3.34×10⁻³ for model (16); the average length of the 95% predictive intervals for model (12) is 0.29 versus 0.72 for model (16). It is evident that
the prediction results under the benchmark model are substantially worse than those
under our model (12); the mean growth curves are less accurate and less smooth, and
the 95% predictive intervals are much wider.
4 Space-time Cox process models using SDE

4.1 The model
Here, we turn to the use of a SDE to provide a Cox process model for space-time point
patterns. Let D again be a fixed region and let XT denote an observed space-time
point pattern within D over the time interval [0, T ]. The Cox process model assumes
a positive space-time intensity that is a realization of a stochastic process. Denote the
stochastic intensity by Ω (s, t) , s ∈ D, t ∈ [0, T ]. In practice, we may only know the
spatial coordinates of all the points whereas the time coordinates are only known to be
in the time interval [0, T ]. For example, in our house construction data, for Irving, TX,
we only have the geo-coded locations of the newly constructed houses within a year. The
exact time when the construction of a new house starts is not available. The integrated
process Λ(s, T) = ∫₀ᵀ Ω(s, t) dt, provided that Ω(s, t) is integrable over [0, T], is the intensity for this kind of point pattern. We may also know multiple subintervals of [0, T], [t₁ = 0, t₂), . . . , [t_{J−1}, t_J = T], and observe a point pattern in each subinterval. These data constitute a series of discrete-time spatio-temporal point patterns, which we denote by X_{[t₁=0,t₂)}, . . . , X_{[t_{J−1},t_J=T]}. The integrated process also provides stochastic
intensities for these point patterns
\[
\Delta\Lambda_j(s) = \Lambda(s, t_j) - \Lambda(s, t_{j-1}) = \int_{t_{j-1}}^{t_j} \Omega(s, \tau)\, d\tau.
\]
In this paper, we will model the dynamics of these point patterns by an infinite-dimensional SDE subject to the initial condition for Λ(s, t). Note that an equivalent infinite-dimensional SDE for Ω(s, t) can also be derived from the equation for Λ(s, t).

If we observed the complete space-time data X_T(s, t), the temporally dependent X_{[t₁=0,t₂)}, . . . , X_{[t_{J−1},t_J=T]} would still provide a good approximation to X_T(s, t) when the time intervals are sufficiently small (Brix and Diggle 2001). Moreover, this will also facilitate
the use of the approximated intensity
\[
\Delta\Lambda_j(s) = \Lambda(s, t_j) - \Lambda(s, t_{j-1}) = \int_{t_{j-1}}^{t_j} \Omega(s, \tau)\, d\tau \approx \Omega(s, t_{j-1})\,(t_j - t_{j-1}).
\]
As a concrete example, we return to the house construction dataset mentioned in
Section 1. Let X_j = X_{[t_{j−1},t_j)} = x_j be the observed set of locations of new houses built in region D during period j = [t_{j−1}, t_j). We can apply the Cox process model to X_j and
assume that the stochastic intensity Λ (s, t) follows the logistic equation model (5). We
can also apply the discretized version (10) to ∆Λj (s).
Let our initial point pattern be x₀ and the intensity be Λ₀(s) = ∫_{−∞}^{0} Ω(s, τ) dτ. The hierarchical model for the space-time point patterns is merely the model (12) with the first stage of the hierarchy replaced by the following:

\[
\begin{aligned}
x_j \mid \Delta\Lambda_j &\sim \text{Poisson Process}(D, \Delta\Lambda_j), \quad j = 1, \ldots, J,\\
x_0 \mid \Lambda_0 &\sim \text{Poisson Process}(D, \Lambda_0),
\end{aligned} \tag{18}
\]
where we suppress the indices t and s again for the periods t1 , . . . , tJ . Note that, unlike
in (12), the intensity ∆Λj for xj must be positive. Therefore, we model the log growth
rate, that is
\[
\log r(s, t) = \mu_r(s; \beta_r) + \zeta(s, t), \quad \zeta \sim GP(0, C_\zeta(s - s', t - t'; \varphi_r)). \tag{19}
\]
The J spatial point patterns are conditionally independent given the space-time
intensity, so the likelihood is
\[
\prod_{j=1}^{J}\left\{\exp\left(-\int_D \Delta\Lambda_j(s)\,ds\right)\prod_{i=1}^{n_j}\Delta\Lambda_j(x_{ji})\right\} \cdot \exp\left(-\int_D \Lambda_0(s)\,ds\right)\prod_{i=1}^{n_0}\Lambda_0(x_{0i}). \tag{20}
\]
This likelihood is more difficult to work with than that in (13). There is a stochastic integral in (20), ∫_D ∆Λ_j(s) ds, which must be approximated in model fitting by a Riemann sum. To do this, we divide the geographical region D into M cells and assume the intensity is homogeneous within each cell. Let ∆Λ_j(m) and Λ₀(m) denote this average intensity in cell m. Let the area of cell m be A(m). Then, the likelihood becomes
\[
\prod_{j=1}^{J}\left[\exp\left(-\sum_{m=1}^{M}\Delta\Lambda_j(m)\,A(m)\right)\prod_{m=1}^{M}\Delta\Lambda_j(m)^{n_{jm}}\right] \cdot \exp\left(-\sum_{m=1}^{M}\Lambda_0(m)\,A(m)\right)\prod_{m=1}^{M}\Lambda_0(m)^{n_{0m}} \tag{21}
\]
where njm is the number of points in cell m in period j. Our parameter processes
r (s, tj ) and K (s) are also approximated accordingly as rj (m) and K (m), which are
constant in each cell m.
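In log form, (21) is straightforward to evaluate from the cell counts; a minimal sketch (our own variable names, assuming the intensities have already been computed on the grid):

```python
import numpy as np

def cox_loglik(counts0, counts, lam0, d_lam, areas):
    """Log of the discretized Cox process likelihood (21).
    counts0: (M,) initial-period counts; counts: (J, M) counts per period/cell;
    lam0: (M,) initial intensity; d_lam: (J, M) Delta-Lambda_j(m); areas: (M,)."""
    ll = np.sum(counts * np.log(d_lam)) - np.sum(d_lam * areas)
    ll += np.sum(counts0 * np.log(lam0)) - np.sum(lam0 * areas)
    return ll
```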
4.2 Modeling house construction data for Irving, TX
Our house construction dataset consists of the geo-coded locations and years of the
newly constructed residential houses in Irving, TX from 1901 to 2002. Figure 5 shows
how the city grows from the early 1950’s to the late 1960’s. Irving started to develop in
the early 1950’s and the outline of the city was already in its current shape by the late
1960's. The city became almost fully developed by the early 1970's, with far fewer new constructions after that era. Therefore, for our data analysis, we select the period
from 1951 through 1969 when there was rapid urban development. In our analysis, we
use the data from year 1951–1966 to fit our model and hold out the last three years
(1967, 1968 and 1969) for prediction and model validation.
As shown in the central block of Figure 6, our study region D in this example
is a 5.6 × 5.6 mile square with Irving, TX in the middle. This region is
geographically disconnected from other major urban areas in Dallas County, which
enables us to isolate Irving for analysis. We divide the region into 100 (10×10) equally
spaced grid cells shown in Figure 6. Within each cell, we model the point pattern with
a homogeneous Poisson process given ∆Λj (m). The corresponding Λ0 (m), K (m) and
rj (m) are collected into vectors Λ0 , K, and r which are modeled as follows.
\[
\begin{aligned}
\log \Lambda_0 &= \mu_\Lambda + \theta_\Lambda, \quad \theta_\Lambda \sim N(0, C_\Lambda),\\
\log K &= \mu_K + \theta_K, \quad \theta_K \sim N(0, C_K),\\
\log r &= \mu_r + \zeta, \quad \zeta \sim N(0, C_r),
\end{aligned}
\]
where the spatial covariance matrices CΛ and CK are constructed using the Matérn class covariance function with distances between the centroids of the cells. The smoothness parameter ν is set to be 3/2. The variances σΛ², σK² and range parameters φΛ and φK are to be estimated. The spatio-temporal log growth rate r is assumed to have a separable covariance matrix Cr = σr² Σt ⊗ Σs, where the spatial correlation Σs is also constructed as a Matérn class function of the distances between cell centroids with smoothness parameter ν set to 3/2. The temporal correlation Σt is of exponential form as in (9). The variance σr² and the spatial and temporal correlation parameters φr and αr are to be estimated.
We use very vague priors for the parameters in the mean function: π(µΛ), π(µK), π(µr) ∼ind N(0, 10⁸). We use natural conjugate priors for the precision parameters (inverses of the variances) of Λ0, K and r: π(1/σΛ²), π(1/σK²), π(1/σr²) ∼ind Gamma(1, 1). The temporal correlation parameter of r also has a vague log-normal prior: π(αr) ∼ log-N(0, 10⁸). Again, the spatial range parameters φΛ, φK and φr are only weakly identified (Zhang 2004), so we use informative, discrete priors for them. Indeed, we have chosen 40 values (from 1.1 to 5.0) and assume uniform priors over them for φΛ, φK and φr.
We use the same random-walk Metropolis-Hastings algorithm as in the simulation
example to simulate posterior samples with the same tuning of acceptance rates. As
a production run we obtain 200,000 samples from the algorithm and discard the first
100,000 as burn-in. For the posterior inference, we use 4,000 subsamples from the
remaining 100,000 samples, with a thinning equal to 25. The posterior means and 95%
equal-tail posterior intervals for the model parameters are presented in Table 2.
Model Parameter    Posterior Mean    95% Equal-tail Interval
µΛ                  2.78             (2.15, 3.40)
σΛ                  1.77             (1.49, 2.11)
φΛ                  3.03             (2.70, 3.20)
µr                 -2.76             (-3.24, -2.29)
σr                  2.48             (2.32, 2.68)
φr                  4.09             (3.70, 4.30)
αr                  0.52             (0.43, 0.62)
µK                  6.49             (5.93, 7.01)
σK                  1.17             (1.02, 1.44)
φK                  1.91             (1.60, 2.20)
Table 2: Posterior inference for the house construction data
Figure 7 shows the posterior mean growth curves and 95% Bayesian predictive intervals for the intensity in the four blocks (marked as blocks 1, 2, 3 and 4) in Figure 6. Comparing with the observed number of houses in the four blocks from 1951 to 1966, we can see the estimated curves fit the data very well.³

³ The growth curves for the house construction data are much smoother than those in our simulated data example in Section 3.3. Although our fitted mean growth curves seem to match the data too perfectly, we do not think we overfit, because our hold-out prediction results are very accurate as well.
In Figure 8 we display the posterior mean intensity surface for year 1966 and the predictive mean intensity surfaces for years 1967, 1968 and 1969. We also overlay the actual point patterns of the new homes constructed in those four years on the intensity surfaces. Figure 8 shows that our model can forecast the major areas of high intensity, and hence high growth, very well. For example, in the upper left corner, the intensity continues rising from 1966 to 1968 and starts to wane in 1969. We can see increasing numbers of houses built from 1966 to 1968 and far fewer built in 1969. In the lower left part of the plots, near the bottom, we can see areas of high intensity gradually shift south, and the house construction pattern confirms this trend too.
5 Discussion
We have illustrated the use of stochastic differential equations to model both geostatistical and point pattern space-time data. Our examples demonstrate that the proposed
hierarchical modeling can accommodate the complicated model structure and achieve
good estimation and prediction. The major challenges in fitting our proposed models
are: (i) the evaluation of a likelihood that involves discretization of the SDE in time and
stochastic integrals and (ii) a likelihood that does not allow an easy formulation of an efficient Metropolis-Hastings algorithm. In dealing with the first challenge, we utilize the
Euler approximation and the space discretization method in Benes et al. (2002). Though
the simulation results are encouraging, further investigation of these approximations or
alternatives would be of interest. For the second, we apply the random-walk Metropolis
algorithm to the posterior simulation, which is liable to create large auto-correlation in
the sampling chain. The nonlinear and recursive structure of our likelihood makes most
of the current Metropolis methods inapplicable, encouraging future research for a more
efficient Metropolis-Hastings algorithm for this class of problems.
Our application to the house construction data is really only a first attempt to
incorporate a structured growth model into a spatio-temporal point process to afford
insight into the mechanism of urban development. However, if it is plausible to assume
that the damping effect of growth is controlled by the carrying capacity of a logistic
model, then it is not unreasonable to assume the growth rate is mean-reverting. Of
course, we can envision several ways to make the model more realistic and these suggest
directions for future work. We might have additional information at the locations to
enable a so-called marked point process. For instance, we might assign the house to a
group according to its size. Fitting the resultant multivariate Cox process can clarify
the intensity of development. We could also have useful covariate information on zoning
or introduction of roads which could be incorporated into the modeling for the rates.
We can expect “holes” in the region (parks, lakes, etc.) where no construction can occur. For locations in these regions, we should impose zero growth. Finally, it may be that growth triggers more growth, so that so-called self-exciting process specifications might be worth exploring.
References

Banerjee, S., Carlin, B., and Gelfand, A. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC.

Benes, V., Bodlak, K., Møller, J., and Waagepetersen, R. (2002). “Bayesian Analysis of Log Gaussian Cox Processes for Disease Mapping.” Technical Report R-02-2001, Department of Mathematical Sciences, Aalborg University.

Brix, A. and Diggle, P. (2001). “Spatiotemporal prediction for log-Gaussian Cox processes.” Journal of the Royal Statistical Society: Series B, 63: 823–841.

Brix, A. and Møller, J. (2001). “Space-time multi type log Gaussian Cox processes with a view to modelling weeds.” Scandinavian Journal of Statistics, 28: 471–488.

Brown, P., Karesen, K., Roberts, G., and Tonellato, S. (2000). “Blur-generated non-separable space-time models.” Journal of the Royal Statistical Society: Series B, 62: 847–860.

Cressie, N. (1993). Statistics for Spatial Data. New York: Wiley, 2nd edition.

Daley, D. and Vere-Jones, D. (1988). Introduction to the Theory of Point Processes. New York: Springer Verlag.

Diggle, P. (2005). “Spatio-temporal point processes: methods and applications.” Dept. of Biostatistics Working Paper 78, Johns Hopkins University.

Elerian, O., Chib, S., and Shephard, N. (2001). “Likelihood inference for discretely observed non-linear diffusions.” Econometrica, 69: 959–993.

Gelfand, A. E., Banerjee, S., and Gamerman, D. (2005). “Spatial process modelling for univariate and multivariate dynamic spatial data.” Environmetrics, 16: 465–479.

Gneiting, T. (2002). “Nonseparable, stationary covariance functions for space-time data.” Journal of the American Statistical Association, 97: 590–600.

Golightly, A. and Wilkinson, D. (2008). “Bayesian inference for nonlinear multivariate diffusion models observed with error.” Computational Statistics and Data Analysis, 52: 1674–1693.

Jones, R. and Zhang, Y. (1997). “Models for continuous stationary space-time processes.” In Gregoire, G., Brillinger, D., Diggle, P., Russek-Cohen, E., Warren, W., and Wolfinger, R. (eds.), Modelling Longitudinal and Spatially Correlated Data, 289–298. Springer, New York.

Karr, A. (1991). Point Processes and Their Statistical Inference. New York: Marcel Dekker, 2nd edition.

Kloeden, P. and Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer.

Kot, M. (2001). Elements of Mathematical Ecology. Cambridge Press.

Mahajan, V. and Wind, Y. (1986). Innovation Diffusion Models of New Product Acceptance. Harper Business.

Møller, J., Syversveen, A., and Waagepetersen, R. (1998). “Log Gaussian Cox processes.” Scandinavian Journal of Statistics, 25: 451–482.

Møller, J. and Waagepetersen, R. (2004). Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC Press.

Ogata, Y. (1998). “Space-time point-process models for earthquake occurrences.” Annals of the Institute of Statistical Mathematics, 50: 379–402.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC.

Stein, M. L. (2005). “Space-time covariance functions.” Journal of the American Statistical Association, 100: 310–321.

Stramer, O. and Roberts, G. (2007). “On Bayesian analysis of nonlinear continuous-time autoregression models.” Journal of Time Series Analysis, 28: 744–762.

Whittle, P. (1963). “Stochastic processes in several dimensions.” Bulletin of the International Statistical Institute, 40: 974–994.

Zhang, H. (2004). “Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics.” Journal of the American Statistical Association, 99: 250–261.
Acknowledgments
The authors thank Thomas Thibodeau for providing the Dallas house construction data.
Figure 1: Locations for the simulated data example in Section 3.3.
Figure 2: Observed space-time geostatistical data at 4 locations, actual (dashed line) and
fitted mean growth curves (solid line), and 95% predictive intervals (dotted line) by our model
(12) for the simulated data example.
Figure 3: Hold-out space-time geostatistical data at 4 locations, actual (dashed line) and
predicted mean growth curves (solid line) and 95% predictive intervals (dotted line) by our
model (12) for the simulated data example.
Figure 4: Hold-out space-time geostatistical data at 4 locations, actual (dashed line) and
predicted mean growth curves (solid line) and 95% predictive intervals (dotted line) by the
benchmark model (16) for the simulated data example.
Figure 5: Growth of residential houses in Irving, TX.
Figure 6: The gridded study region encompassing Irving, TX.
Figure 7: Mean growth curves (solid line) and their corresponding 95% predictive intervals
(dotted lines) for the intensity for the four blocks marked in Figure 6.
Figure 8: Posterior and predictive mean intensity surfaces for the years 1966, 1967, 1968 and 1969.
Bayesian Analysis (2009)
4, Number 4, pp. 759–762
Inconsistent Bayesian Estimation
Ronald Christensen∗
Abstract. A simple example is presented using standard continuous distributions
with a real valued parameter in which the posterior mean is inconsistent on a dense
subset of the real line.
Keywords: Dirichlet process, Posterior mean
1 Introduction
There has been extensive work on inconsistent Bayesian estimation. Early work was
done by Halpern (1974), Stone (1976), and Meeden and Ghosh (1981). An important
paper was Diaconis and Freedman (1986a), henceforth referred to as DFa, with extensive
references and discussion by Barron; Berger; Clayton; Dawid; Doksum and Lo; Doss;
Hartigan; Hjort; Krasker and Pratt; LeCam; and Lindley. Follow up work includes
Diaconis and Freedman (1986b, 1990, 1993), Datta (1991), Berliner and MacEachern
(1993), and Rukhin (1994).
DFa require consistency for every parameter value. They also point out that if their
definition of consistency holds, then the posterior mean is consistent (“minor technical
details apart”). The purpose of this note is to provide a particularly simple example
of an inconsistent Bayes estimate and to draw some conclusions from that example. In
particular, the example has a posterior mean that is inconsistent on a dense subset of
the real line.
Consider y1 , . . . , yn a random sample from a density f (y|θ). The distribution of
f (y|θ) is Cauchy with median θ when θ is a rational number and Normal with mean θ
and variance 1 when θ is irrational. In other words,

\[
f(y \mid \theta) = \begin{cases} \text{Cauchy}(\theta) & \theta \text{ rational} \\ N(\theta, 1) & \theta \text{ irrational.} \end{cases}
\]
For the prior density, we take g(θ) to be absolutely continuous. For the sake of simplicity,
take it to be N (µ0 , 1).
We will show that the posterior distribution of θ given the data is the same as
if the entire conditional distribution of y were N (θ, 1). In other words, the posterior
distribution is

\[
f(\theta \mid y_1, \ldots, y_n) \sim N\!\left(\frac{\mu_0 + n\bar{y}}{n+1},\; \frac{1}{n+1}\right).
\]
The standard Bayes estimate is the posterior mean, (µ₀ + nȳ)/(n + 1), which behaves asymptotically like ȳ. If the true value of θ is an irrational number, the true sampling
∗Department of Mathematics and Statistics, University of New Mexico, fletcher@stat.unm.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA428
distribution is normal and the Bayes estimate is consistent. However, if the true value
of θ is a rational number, the true sampling distribution is Cauchy(θ), for which it is
well known that ȳ is an inconsistent estimate of θ. Thus we have the Bayes estimate
inconsistent on a dense set, but a set of prior probability zero.
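A quick illustrative simulation (not from the paper) makes the point: under N(θ, 1) sampling the mean ȳ settles down to θ, while under Cauchy(θ) sampling ȳ is itself Cauchy(θ) distributed for every n and never concentrates.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.5   # a rational theta, so f(y|theta) is Cauchy(theta)
for n in (10**2, 10**4, 10**6):
    ybar_normal = rng.normal(theta, 1.0, size=n).mean()
    ybar_cauchy = (theta + rng.standard_cauchy(size=n)).mean()
    # The normal-sample mean converges to theta; the Cauchy-sample
    # mean does not, no matter how large n grows.
    print(f"n={n:>8}: normal ybar={ybar_normal:+.4f}, "
          f"cauchy ybar={ybar_cauchy:+.4f}")
```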
As the editor has pointed out, except in neighborhoods of θ = 0, the example works
just as well with the Cauchy(θ) replaced by a N (−θ, 1). Then, the posterior mean is
consistent but for the wrong value of θ.
Obviously, the key to this example is that, by virtually any concept of proximity for
distributions, the conditional distributions f (y|θ) are discontinuous on a dense set of θs.
Not only is the mean function E(y|θ) discontinuous everywhere in θ, but if F(y|θ) is the cdf of f(y|θ), measures such as the Kolmogorov-Smirnov distance sup_y |F(y|θ) − F(y|θ′)| are never uniformly small in any neighborhood of a given θ. An interesting aspect of DFa
is that, while generally it is possible to get discrete distributions arbitrarily close to
continuous ones, DFa illustrate that you cannot always get Ferguson distributions close
enough to a continuous target.
It seems quite clear from the calculus behind this example that the proper concern
for Bayesians is whether their procedures are consistent with prior probability one.
Doob’s theorem, see DFa’s Corollary A.2, establishes precisely this result. Moreover,
there seems to be little remedy for Bayesian inconsistency if one has postulated a prior
distribution for which all interesting parameters have collective prior probability zero.
We have done that here. Who ever reports numerical values to clients that are not
rational numbers? This also seems to be the argument of DFa, that Dirichlet priors
put zero prior probability on continuous distributions and therefore the inconsistency
of Dirichlet priors with respect to continuous distributions in some applications is a
problem. Others might argue that the distribution of any observable phenomenon must
be discrete and that continuous models are merely useful approximations, in which case
the issue being called in question for Dirichlet processes is the usefulness of continuous
approximations.
Nothing in the Bayesian machinery will ensure conditional consistency everywhere.
That requires assumptions on the conditional distributions over and above the Bayesian
paradigm. However, such assumptions may well be valid considerations when developing
models for data.
2 Technical Details
Let Y = (y1 , . . . , yn )′ and consider the probability Pr[θ ∈ A and Y ∈ B] for arbitrary
Borel sets A and B. Let 1[A×B] (θ, Y ) be the indicator function of the set A × B. The
conditional probability Pr[θ ∈ A | Y = w] can be defined as a Y-measurable function such that for any set B
\[
\int_B \Pr[\theta \in A \mid Y = w]\, dP(\theta, Y) = \int 1_{[A \times B]}(\theta, Y)\, dP(\theta, Y), \tag{1}
\]
see Rao (1973, p. 91) or Berry and Christensen (1979).
First of all, the joint distribution of (θ, Y) exists. The joint density of (θ, Y) is h(θ, Y) ≡ f(Y|θ)g(θ). This is clearly dominated by keeping g(θ) the same and replacing f(Y|θ) with a finite multiple of a Cauchy(θ) density. Since the integral exists, we can apply Fubini's theorem.
Let f∗(y|θ) be the density for a N(θ, 1) distribution. We show that

\[
\Pr[\theta \in A \mid Y = w] = \int_A f(\theta \mid Y)\, d\theta,
\]

where

\[
f(\theta \mid Y) = \frac{f^*(Y \mid \theta)\, g(\theta)}{\int f^*(Y \mid \theta)\, g(\theta)\, d\theta}.
\]
Thus, this version of the posterior probability behaves as if there were no Cauchy components to the sampling distribution at all. The claims of the previous section follow
immediately from this result. To see the validity of the result, observe that
\[
\int 1_{[A \times B]}(\theta, Y)\, dP(\theta, Y) = \int_A \int_B f(Y \mid \theta)\, g(\theta)\, dY\, d\theta.
\]

However, f(Y|θ) and f∗(Y|θ) are equal almost everywhere, so ∫_B f(Y|θ) dY = ∫_B f∗(Y|θ) dY almost everywhere and

\[
\int_A \int_B f(Y \mid \theta)\, g(\theta)\, dY\, d\theta = \int_A \int_B f^*(Y \mid \theta)\, g(\theta)\, dY\, d\theta.
\]

The distribution associated with f∗(Y|θ) is perfectly well behaved, so Bayes theorem can be applied to give

\[
\int_A \int_B f^*(Y \mid \theta)\, g(\theta)\, dY\, d\theta = \int_B \int_A f(\theta \mid Y)\, f(Y)\, d\theta\, dY.
\]
It follows that equation (1) holds.
References

Berliner, L. M. and MacEachern, S. N. (1993). “Examples of inconsistent Bayes procedures based on observations on dynamical systems.” Statistics and Probability Letters, 17: 355–360.

Berry, D. A. and Christensen, R. (1979). “Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet processes.” Annals of Statistics, 7: 558–568.

Datta, S. (1991). “On the consistency of posterior mixtures and its applications.” Annals of Statistics, 19: 338–353.

Diaconis, P. and Freedman, D. (1986a). “On the consistency of Bayes estimates.” Annals of Statistics, 14: 1–26.

— (1986b). “On inconsistent Bayes estimates of location.” Annals of Statistics, 14: 68–87.

— (1990). “On the uniform consistency of Bayes estimates for multinomial probabilities.” Annals of Statistics, 18: 1317–1327.

— (1993). “Nonparametric binary regression: A Bayesian approach.” Annals of Statistics, 21: 2108–2137.

Halpern, E. F. (1974). “Posterior consistency for coefficient estimation and model selection in the general linear hypothesis.” Annals of Statistics, 2: 703–712.

Meeden, G. and Ghosh, M. (1981). “Admissibility in finite problems.” Annals of Statistics, 9: 846–852.

Rao, C. R. (1973). Linear Statistical Inference and its Applications. New York: John Wiley and Sons, second edition.

Rukhin, A. L. (1994). “Recursive testing of multiple hypotheses: Consistency and efficiency of the Bayes rule.” Annals of Statistics, 22: 616–633.

Stone, M. (1976). “Strong inconsistency from uniform priors.” Journal of the American Statistical Association, 71: 114–116.
Bayesian Analysis (2009)
4, Number 4, pp. 763–792
Sample Size Calculation for Finding Unseen
Species
Hongmei Zhang∗ and Hal Stern†
Abstract. Estimation of the number of species extant in a geographic region has
been discussed in the statistical literature for more than sixty years. The focus
of this work is on the use of pilot data to design future studies in this context.
A Dirichlet-multinomial probability model for species frequency data is used to
obtain a posterior distribution on the number of species and to learn about the distribution of species frequencies. A geometric distribution is proposed as the prior
distribution for the number of species. Simulations demonstrate that this prior distribution can handle a wide range of species frequency distributions including the
problematic case with many rare species and a few exceptionally abundant species.
Monte Carlo methods are used along with the Dirichlet-multinomial model to perform sample size calculations from pilot data, e.g., to determine the number of
additional samples required to collect a certain proportion of all the species with
a pre-specified coverage probability. Simulations and real data applications are
discussed.
Keywords: Generalized multinomial model, Bayesian hierarchical model, Markov
Chain Monte Carlo (MCMC), Dirichlet distribution, geometric distribution.
1 Introduction
The “species problem” is a term used to refer to studies in which objects are sampled
and categorized with interest on the number of categories represented. Research related to the species problem dates back to the 1940’s. Corbet (1942) proposed that a
mathematical relation exists between the number of sampled individuals and the total
number of observed species in a random sample of insects or other animals. Fisher et al.
(1943) developed an expression for the relationship using a negative binomial model.
Their proposed relationship works well over the whole range of observed abundance,
and gives a very good fit to practical situations.
The focus of most statistical research on the species problem has been to estimate
the number of unseen species. Bunge and Fitzpatrick (1993) give a review of numerous
statistical methods to estimate the number of unseen species. Some notable references
are mentioned briefly here. Good and Toulmin (1956) address the estimation of the
expected number of unseen species based on a Poisson model. Efron and Thisted (1976)
use two different empirical Bayes approaches, both based on a similar Poisson model,
to estimate the number of unseen words in Shakespeare’s vocabulary. Pitman (1996)
proposes species sampling models utilizing a Dirichlet random measure. The negative
∗Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, hzhang@sc.edu
†Department of Statistics, University of California, Irvine, CA, sternh@uci.edu
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA429
binomial model proposed by Fisher et al. (1943) is also discussed there. Boender and
Rinnooy Kan (1987) suggest a Bayesian analysis of a multinomial model that can be
used to estimate the number of species. Their model is the starting point for the work
presented in this paper.
As seen from the references above, previous applications of the species problem have
included animal ecology where individual animals are sampled and categorized into
species (Fisher, Corbet, and Williams (1943)), and word usage where individual words
are sampled and each word defines its own category (Efron and Thisted (1976)). More
recent studies extend the species problem to applications in bioinformatics (Morris,
Baggerly, and Coombes (2003)), where the sample items might be DNA fragments and
each sequenced DNA segment represents a unique sequence. Our work is motivated by
a bioinformatics problem of this type. Some other studies focus on drawing inferences
for abundant species or rare species, e.g. Cao et al. (2001) and Zhang (2007). For
consistency with the earlier literature, we use the familiar terminology of animals and
species.
In this paper, we use the model of Boender and Rinnooy Kan (1987), a generalized
multinomial model, as our starting point. The major contribution of this paper is to
address sample size calculation for future data collection based on a pilot study. The
goal is to determine the sample size in order to achieve a specified degree of population
coverage. Non-parametric Bayesian methods have been developed for a related problem,
inferring from a given data set the probability of discovering new species, e.g. Tiwari
and Tripathi (1989) and Lijoi et al. (2007). In these studies the total number of species
is either assumed to be known or to be infinite. The method proposed in this paper,
on the other hand, is a two-phase design with the first phase used to infer the number
of species and the second phase to estimate the required sample size. The sample size
required to achieve a specified degree of population coverage is obtained by Monte Carlo
simulations.
The first step is a fully Bayesian approach to drawing inferences regarding the parameters for a generalized Dirichlet-multinomial model for species frequency data. The
posterior distribution of the model parameters is used in our Monte Carlo simulation
method for sample size determination. For parametric Bayesian analysis of species frequency data selecting an appropriate prior distribution for the number of species in the
population is very important (see, for example, Zhang and Stern (2005)). The prior
distributions proposed by previous studies (Zhang and Stern (2005); Boender and Rinnooy Kan (1987)) perform poorly in situations in which a population has many rare
species (each with a very small number of representatives) and a few abundant species.
In this case, as indicated by Sethuraman (1994) and discussed in Zhang and Stern
(2005), the proportions of each species in the population are crowded at the vertices of
a multi-dimensional simplex such that most proportions are close to zero. For this type
of population, inferences for the number of species in the population are often unrealistic. In this paper, we propose to use a geometric distribution as the prior distribution
for the number of species. Geometric distributions have been used in many studies,
but we have not seen any applications to the species problem. The geometric prior
distribution can be used to reflect prior beliefs about the minimum number of species
in the population and prior belief about the range within which the number of species
is believed to lie. The flexibility provided in this manner seems to allow the geometric
prior distribution to adapt well to different species frequency distributions.
The rest of the paper is organized as follows. In Section 2, we review the hierarchical
Bayesian model for species data, describe our choice of prior distributions, and state the
conditions required to guarantee a proper posterior distribution for our model. Section
3 focuses on posterior inferences for the model’s parameters. Issues related to the
implementation of MCMC are also discussed. In Section 4, we develop a Monte Carlo
simulation approach for designing future data collection. Section 5 provides results for
a simulated data set where the proposed approach works reasonably well. Sensitivity of
results to the choice of prior distribution is also discussed. We apply our method to a
bioinformatics data set in Section 6. Finally we summarize our results in Section 7.
2 A Dirichlet-multinomial model

2.1 The likelihood function
Let yi denote the number of observed animals of species i in a sample of size N . Suppose
so is the number of different species observed and S is the number of species in a
population. Then y = {y1 , y2 , ..., yso } is one way to represent the observed sample.
An alternative description for data of this type based on frequency counts has often
been used in the literature. Let xo ≤ N be the maximum frequency over all observed
species and nx be the number of species captured x times, x = 1, 2, ..., xo . Then n =
(n1 , n2 , · · · , nxo ) is another way to represent the sample with
$$N \;=\; \sum_{x=1}^{x_o} x\, n_x \;=\; \sum_{i=1}^{s_o} y_i.$$
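As a concrete illustration of the two representations, the following sketch (Python, with hypothetical toy counts) converts a count vector y into the frequency counts n and checks the identity above.

```python
from collections import Counter

# Hypothetical counts y_i for s_o = 5 observed species.
y = [4, 2, 2, 1, 1]

N = sum(y)                          # total sample size
freq = Counter(y)                   # n_x: number of species seen exactly x times
x_o = max(y)                        # largest observed frequency
n = [freq.get(x, 0) for x in range(1, x_o + 1)]

# The identity N = sum_x x * n_x = sum_i y_i.
assert N == sum(x * n_x for x, n_x in enumerate(n, start=1))
print(N, n)  # 10 [2, 2, 0, 1]
```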
Here we motivate and describe the generalized multinomial probability model for y
of Boender and Rinnooy Kan (1987). To start we introduce the notation ycomplete for the S-dimensional vector of species counts. The basic sampling model for the counts ycomplete
is multinomial with the probability for species i to be captured as θi , i = 1, · · · , S.
There are several possible interpretations for the θi ’s. If we assume all animals are
equally likely to be caught, then θi represents the relative proportion of species i among
the animal population. If not, then θi combines the likelihood of being caught and the
abundance. If the number of species S is known, the population size of animals is large,
and each species has a reasonably large number of representatives in the population, then
a plausible model for ycomplete is the multinomial distribution with parameters N and
θ = {θ1 , · · · , θS }, i.e. ycomplete | θ, S ∼ Mult(N, θ), where ycomplete = {y1 , y2 , ..., yS }.
When S is not known, however, we don’t know the dimension of ycomplete . Then it makes
sense to consider the observed data y = {y1 , y2 , ..., yso } which only indicates counts for
the so species that have been observed. There is a complication in that y provides counts
but does not indicate which elements of θ correspond to the observed species. Though
subtle, this point is important in that it invalidates the usual multinomial likelihood.
Since the correspondence between the yi ’s and θi ’s can not be determined, the data y
represent a generalized multinomial distribution where we sum over all possible choices
for the so observed species. Let W (so ) denote all subsets {i1 , . . . , iso } of so distinct
species labels from {1, . . . , S}, then the conditional distribution of y given θ and S can
be expressed as
$$\Pr(y \mid \theta, S) \;=\; \frac{N!}{\prod_{x=1}^{N} n_x!\; y_1! \cdots y_{s_o}!} \sum_{\{i_1,\cdots,i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}} \qquad (1)$$
which is the same result as the one given by Boender and Rinnooy Kan (1987).
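For very small S and so the sum over W (so ) in (1) can be evaluated by brute-force enumeration, which is useful as a check on any implementation. A sketch with toy values (the function name is ours; the enumeration runs over ordered tuples of distinct labels, with the product of nx ! factors correcting for relabelings among species sharing a count):

```python
import math
from collections import Counter
from itertools import permutations

def gen_multinomial_pmf(y, theta):
    """Generalized multinomial Pr(y | theta, S) of eq. (1), by enumeration.

    y: observed counts for the s_o observed species (labels unknown).
    theta: full probability vector of length S (sums to one).
    Feasible only for tiny S and s_o.
    """
    N, s_o, S = sum(y), len(y), len(theta)
    coef = math.factorial(N)
    for yi in y:
        coef //= math.factorial(yi)
    n = Counter(y)                       # frequency-of-frequencies n_x
    dup = math.prod(math.factorial(v) for v in n.values())
    # Sum over ordered tuples of distinct labels; dividing by prod_x n_x!
    # corrects for relabelings among species with equal counts.
    total = sum(
        math.prod(theta[i] ** yi for i, yi in zip(labels, y))
        for labels in permutations(range(S), s_o)
    )
    return coef * total / dup

print(gen_multinomial_pmf([2, 1], [0.5, 0.3, 0.2]))  # 0.66
```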
The above model assumes infinite population sizes, which can produce limitations
in practice. A hypergeometric formulation, which recognizes the finiteness of the populations, seems more appropriate in this context. However, due to computational inefficiency of using hypergeometric models, we use model (1) to describe the distribution
of counts y, implicitly assuming that population sizes are large enough to validate the
assumptions of multinomial distributions.
2.2 Prior distribution for θ
We model θ given S as a random draw from a symmetric Dirichlet distribution with
parameter α, which is a conjugate prior distribution for the multinomial distribution.
We write
$$\theta \mid S, \alpha \;\sim\; \text{Dirichlet}(1_S\, \alpha) \qquad (2)$$
with
$$p(\theta \mid S, \alpha) = \frac{\Gamma(S\alpha)}{\Gamma(\alpha)^S} \prod_{i=1}^{S} \theta_i^{\alpha - 1},$$
where 1S is a vector of length S with all entries equal to one, and θ = {θ1 , · · · , θS } with $\sum_{i=1}^{S} \theta_i = 1$. For a symmetric Dirichlet distribution, E(θi ) = 1/S, so the prior distribution of θ assumes all species are a priori equally likely (in expectation) to be captured. The prior variance for each θi is Var(θi ) = (1/S)(1 − 1/S)(1/(Sα + 1)). The variance of θi becomes smaller as α grows, and tends to 0 as α approaches infinity. In the limiting case of infinite α, θi = 1/S and animals from each species are equally likely to be captured. Smaller values of α correspond to greater variation among the
elements of θ. Small α can yield many small elements in the vector θ, which corresponds
to the case in which the population has many rare species. The reason for this is that as
α gets smaller, the vector of θi ’s generated from Dirichlet(1S α) is more concentrated
on the vertices of the S-dimensional simplex containing vectors θ that sum to one
(Sethuraman (1994); Zhang and Stern (2005)). For instance, with S = 2 the Dirichlet
distribution reduces to a Beta distribution. When α is small, the density function is
U-shaped, with density concentrated near 0 and 1 for both of the proportions. We
obtain further insight by considering the distribution of θ in three dimensions. Figure 1
H. Zhang and H. Stern
767
shows the distribution of θ in three dimensions for α = 1, 0.01, 0.001. When α is larger (e.g. α = 1), the probability values are distributed evenly on the simplex. As α gets smaller, the θi 's tend to move toward the vertices of the simplex, which have value 1 or 0, implying that more small elements in the vector θ will be generated from the Dirichlet distribution.
Figure 1: Distribution of θ on the three-dimensional simplex for (a) α = 1, (b) α = 0.01, and (c) α = 0.001
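This vertex-concentration effect is easy to verify by simulation. A small sketch (ours, not the authors') that counts near-zero coordinates in symmetric Dirichlet draws:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 3
for alpha in (1.0, 0.1, 0.01):
    draws = rng.dirichlet(np.full(S, alpha), size=10000)
    near_zero = (draws < 1e-3).mean()
    print(f"alpha={alpha}: fraction of coordinates below 0.001 = {near_zero:.2f}")
# As alpha decreases, more coordinates collapse toward zero, i.e. the mass
# moves to the vertices of the simplex.
```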
One might expect this prior to be a bit unrealistic in that we likely know that some
species are a priori more likely to be observed than others. The reason for choosing a
symmetric Dirichlet distribution is that we do not know S, so we have no information
to distinguish any of the θi 's, i = 1, · · · , S, from any of the others. In this case, the prior
distribution for θ has to be exchangeable. A possible solution that can address known
heterogeneity but retain exchangeability is to consider a mixture of two symmetric
Dirichlet distributions corresponding to two different subpopulations, abundant species
and scarce species. This approach is used by Morris et al. (2003) in the case when S is
known.
2.3 Prior distribution for S and α
We apply a fully Bayesian approach to analyzing species frequency data and conclude
model specification by giving prior distributions for S and α.
We specify independent prior distributions for S and α. The two parameters become
dependent in their joint posterior distribution as is shown below in Section 2.4. For S
we would like to use a relatively flat prior distribution without specifying a strict upper
bound on the number of species. We find it useful to have the prior probability density
be a decreasing function of S so that there is a slight preference for a smaller number of
species (this is discussed further in Section 5.5). A prior distribution for S with these
characteristics is the geometric distribution with probability mass function
$$\Pr(S) = f\,(1-f)^{S - S_{\min}}, \qquad S \ge S_{\min}, \qquad (3)$$
where Smin is a specified minimum number of species and f is the geometric probability
parameter. Because of Theorem 1 (below) we generally take Smin = 2, the smallest value
for the number of observed species that yields a proper posterior distribution for our
model. One interpretation for the parameter f is as the prior probability that there
are exactly Smin species but this would not ordinarily be a quantity that scientists
are able to specify. Instead we propose to obtain a suitable value of f by specifying
a plausible maximum value Smax for the number of unique species and a measure of
prior certainty that S lies between Smin and Smax . The value of Smax will usually
be suggested by scientific collaborators as in our application. Under the geometric
distribution, $\Pr(S_{\min} \le S \le S_{\max}) = 1 - (1-f)^{S_{\max} - S_{\min}}$, which can be inverted to
find f for specified values of Smin , Smax and the prior certainty. For instance, if we
would like to express high confidence, say probability .999 that S is between Smin = 2
and Smax = 10000, then we find f = .0007. On the other hand if we are less confident,
say 95% certain that the true number of species is in this interval, then f = .0003. Note
that despite the name we have assigned, we do not assume that Smax is an actual upper
bound. Smax is a device used for elicitation of the geometric probability parameter
f . Alternative prior distributions for S (and α) are described below. The sensitivity
of posterior inferences to different choices of f and to different prior distributions is
considered in Section 5.
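The inversion for f is a one-line computation. A small helper (our name, not the paper's) reproduces the values quoted above:

```python
def geometric_f(s_min, s_max, certainty):
    """Solve Pr(s_min <= S <= s_max) = 1 - (1 - f)**(s_max - s_min) = certainty for f."""
    return 1.0 - (1.0 - certainty) ** (1.0 / (s_max - s_min))

print(geometric_f(2, 10000, 0.999))  # ~0.0007
print(geometric_f(2, 10000, 0.95))   # ~0.0003
```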
The parameter α is important in characterizing the distribution of frequencies. As
discussed in an earlier section, large α values lead to uniform distributions and small
α leads to a skewed distribution with a few popular species and many rare species. As
we don’t have much information on α we follow an approach that appears to provide
a relatively noninformative hyperprior distribution. We note that the prior standard deviation for each element of θ is roughly proportional to $1/\sqrt{\alpha}$. By setting a noninformative prior distribution on this quantity, $p(1/\sqrt{\alpha}) \propto 1$, and doing a change of variable, we obtain a hyperprior distribution for α, i.e.
$$p(\alpha) = \alpha^{-3/2}, \qquad \alpha > 0. \qquad (4)$$
This is not a proper hyperprior distribution but the following theorem indicates that
the posterior distribution is a proper distribution under fairly weak conditions.
Theorem 1: For the model defined by (1) through (4), the posterior distribution
p(S, α|y) is proper if at least two species are captured, i.e. so ≥ 2.
Proof: The proof is included in the Appendix.
Naturally other prior distributions are possible and several have appeared in the
literature. For example, Zhang and Stern (2005) use a noninformative prior for S, which
is a discrete uniform distribution on an interval of plausible values, and they use the same
prior distribution of α that we use here. However, as discussed by Zhang and Stern
(2005), this set of prior distributions can provide misleading posterior inferences when
data are consistent with a small value of α. Another set of prior distributions is given
by Boender and Rinnooy Kan (1987), where independent proper prior distributions on
S and α are proposed:
$$\Pr(S) \;\propto\; \begin{cases} 1, & S < S_{cut} \\[4pt] \dfrac{1}{(S - S_{cut} + 1)^2}, & S \ge S_{cut}, \end{cases} \qquad (5)$$
where $S_{cut}$ is a positive number to be set, and
$$p(\alpha) \;=\; \begin{cases} 1/2, & \alpha \le 1 \\[4pt] \tfrac{1}{2}\,\alpha^{-2}, & \alpha > 1, \end{cases} \qquad (6)$$
which was earlier proposed by Good (1965). When using this set of prior distributions,
as indicated by Boender and Rinnooy Kan (1987) and also by our later simulation
results, the posterior inferences can be very sensitive to the choice of Scut , especially
when data suggest small values of α. We comment on the sensitivity of results to the
prior distributions P (S, α) further in the data analyses of Section 5.
2.4 The posterior distribution
The joint posterior distribution of θ, S, and α for the probability model specified in (1)
through (4) is, up to a normalizing constant,
$$\begin{aligned} p(\theta, S, \alpha \mid y) \;\propto\;& \Pr(y \mid \theta, S)\, p(\theta \mid S, \alpha)\, \Pr(S)\, p(\alpha) \\ \propto\;& \frac{N!}{\prod_{x=1}^{N} n_x!\; y_1! \cdots y_{s_o}!} \Bigl[ \sum_{\{i_1,\cdots,i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}} \Bigr] \frac{\Gamma(S\alpha)}{\Gamma(\alpha)^S} \prod_{i=1}^{S} \theta_i^{\alpha - 1}\, (1-f)^{S - S_{\min}}\, \alpha^{-3/2}, \end{aligned}$$
where $S \in \{\max(s_o, S_{\min}),\, \max(s_o, S_{\min}) + 1, \cdots\}$ and $\alpha > 0$. It should be noted that the
posterior distribution is defined over both continuous (for α and θ) and discrete (for S)
sample spaces.
The joint posterior distribution can be factored as
$$p(\theta, S, \alpha \mid y) = p(\theta \mid y, S, \alpha)\, p(S, \alpha \mid y), \qquad (7)$$
where p(θ|y, S, α) is the conditional posterior distribution of θ given S and α,
$$\begin{aligned} p(\theta \mid y, S, \alpha) \;\propto\;& \Bigl[ \sum_{\{i_1,\cdots,i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}} \Bigr] \prod_{i=1}^{S} \theta_i^{\alpha - 1} \\ =\;& \sum_{\{i_1,\cdots,i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1 + \alpha - 1} \cdots \theta_{i_{s_o}}^{y_{s_o} + \alpha - 1} \prod_{\substack{j=1 \\ j \notin \{i_1,\dots,i_{s_o}\}}}^{S} \theta_j^{\alpha - 1}. \qquad (8) \end{aligned}$$
Note that the conditional posterior distribution of θ is proportional to the sum of S!/(S−
so )! Dirichlet densities. Also note that every Dirichlet distribution in the summation is
identical up to permutation of the species indices.
The other factor in (7), p(S, α|y), is
$$p(S, \alpha \mid y) \;\propto\; \frac{S!}{(S - s_o)!}\, \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)}\, \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{\Gamma(\alpha)^{s_o}}\, (1-f)^S\, \alpha^{-3/2}, \qquad (9)$$
$$S \in \{\max(s_o, S_{\min}),\, \max(s_o, S_{\min}) + 1, \cdots\}, \quad \alpha > 0.$$
This can be obtained in either of two ways, as the quotient p(θ, S, α|y)/p(θ|S, α, y),
or by integrating out θ from the joint distribution p(y, θ|S, α) and working with the
reduced likelihood p(y|S, α).
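Numerically, (9) is best evaluated on the log scale, since the factorials and gamma ratios overflow for realistic N and S. A minimal sketch (our function name, assuming numpy and scipy):

```python
import numpy as np
from scipy.special import gammaln

def log_post_S_alpha(S, alpha, y, f, s_min=2):
    """Unnormalized log p(S, alpha | y) of eq. (9)."""
    y = np.asarray(y, dtype=float)
    N, s_o = y.sum(), len(y)
    if S < max(s_o, s_min) or alpha <= 0:
        return -np.inf
    return (gammaln(S + 1) - gammaln(S - s_o + 1)           # log S!/(S - s_o)!
            + gammaln(S * alpha) - gammaln(N + S * alpha)   # log Gamma(S a)/Gamma(N + S a)
            + gammaln(y + alpha).sum() - s_o * gammaln(alpha)
            + S * np.log(1.0 - f)
            - 1.5 * np.log(alpha))
```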
3 Posterior inferences

3.1 Posterior inferences for S and α
The posterior distribution of S and α as given by (9) is difficult to study analytically. Instead we use MCMC, specifically a Gibbs sampling algorithm with Metropolis-Hastings
steps for each parameter, to generate draws from the joint posterior distribution. In
applications we run multiple chains from dispersed starting values. Convergence of the
sampled sequences is evaluated using the methods developed by Gelman and Rubin
(1992a,b) and described for example by Gelman et al. (2003).
The conditional posterior distribution of S given y and α and the conditional posterior distribution of α given y and S, up to a normalizing constant, are
$$\Pr(S \mid y, \alpha) \;\propto\; \frac{S!}{(S - s_o)!}\, \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)}\, (1-f)^S, \qquad S \in \{\max(s_o, S_{\min}),\, \max(s_o, S_{\min}) + 1, \cdots\},$$
$$p(\alpha \mid y, S) \;\propto\; \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)}\, \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{\Gamma(\alpha)^{s_o}}\, \alpha^{-3/2}, \qquad \alpha > 0, \qquad (10)$$
respectively. For Metropolis-Hastings steps for these parameters we used jumping or
transition distributions that are essentially random walks. Specifically, the jumping
function for iteration t for S is a discrete uniform distribution centered at the (t − 1)th
sampled point; and the jumping function for α is selected as a log-normal distribution with location parameter being the logarithm of the (t − 1)th draw. The jumping
distributions are discussed more fully in the Appendix.
3.2 Posterior inference for θ
In this paper, posterior inference of θ is not of interest, but as it may be relevant for other
applications we discuss it briefly. The conditional posterior distribution p(θ|S, α, y)
given by (8) is a mixture of S!/(S − so )! Dirichlet distributions, one for each choice
of the so observed species from among the S total species. The component Dirichlet
distributions are identical up to permutation of the category indices. Because of this
feature of the mixture, each θi actually has the same marginal posterior distribution.
This makes interpretation of θi difficult. We can however talk about posterior inference
for a θ corresponding to a particular value of yi > 0. For example, if we define θyi
as the θ corresponding to an observed species with frequency yi then p(θyi |y, S, α) is
Beta(yi +α, N −yi +(S −1)α). The marginal posterior distribution, p(θyi |y), is obtained
by averaging this beta distribution over the posterior distribution of S and α that is
obtained in Section 3.1.
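Given posterior draws of (S, α) from Section 3.1, this averaging is straightforward. A sketch with hypothetical argument names:

```python
import numpy as np
from scipy.stats import beta

def theta_marginal_density(grid, y_i, N, S_draws, alpha_draws):
    """p(theta_{y_i} | y) on a grid: average Beta(y_i + a, N - y_i + (S-1) a) over draws."""
    dens = np.zeros_like(grid, dtype=float)
    for S, a in zip(S_draws, alpha_draws):
        dens += beta.pdf(grid, y_i + a, N - y_i + (S - 1) * a)
    return dens / len(S_draws)
```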
4 Planning for future data collection
The previous section describes an approach to obtaining posterior inferences for S, α, θ.
This section considers an additional inference question. Suppose it is possible to collect
additional data beyond the initial N observations. Then one might be interested in
questions related to the design of future data collection efforts, such as, “What is the
probability of observing at least 90% of all species if the current data are augmented
by an additional M animals?”, or the closely related question “How large an additional sample is required in order to observe at least 90% of all species with a specified
confidence level?”. This section addresses these types of questions.
4.1 A relevant probability calculation
Let p denote the proportion of species we want to capture (e.g. p = 0.9), then the
probability of capturing at least pS species conditional on the N observed animals and
M additional animals, denoted as π(M ), can be written as
$$\pi(M) = \Pr\bigl((s_o + S_{new}) \ge pS \mid M, y\bigr), \qquad (11)$$
where Snew is the number of previously unseen species observed in the M additional
samples. Let y∗ denote the additional data from the M additional observations. The
probability (11) can be expressed as an integral over the unknown parameters S, α, θ,
and the yet-to-be-observed data y∗ ,
$$\pi(M) = \int_{y^*} \int_{\theta} \int_{\alpha} \int_{S} I(s_o + S_{new} \ge pS)\, p(S, \alpha, \theta, y^* \mid M, y)\; dS\, d\alpha\, d\theta\, dy^*. \qquad (12)$$
Here I is an indicator function which is easily determined given the counts y and y∗ ,
and the value of S. To describe a Monte Carlo approach to evaluating this integral we
first observe that
$$p(S, \alpha, \theta, y^* \mid M, y) \;=\; p(y^* \mid \theta, S, M)\, p(\theta \mid y, S, \alpha)\, p(S, \alpha \mid y),$$
where p(y∗ | θ, S, M ) is a multinomial density function, p(θ | y, S, α) is a mixture of Dirichlet distributions, and p(S, α | y) is given above in (9). Given this factorization, the integration (summation in the case of S) in (12) can be carried out by first obtaining
posterior draws of S and α and then applying the specified conditional distributions for
θ and y∗ . As is shown below in Section 4.2, sampling θ from a mixture of Dirichlet distributions is no more difficult than sampling θ from a single Dirichlet distribution. We do not
expect that the high dimension of θ will cause any problem in the numerical integration
process.
Carrying out the integration for a variety of values of M identifies a π(M ) curve and
allows us to identify the smallest sample size for which π(M ) exceeds a given target.
This approach provides a point estimate for the needed sample size but does not provide
a great deal of information about the uncertainty in such an estimate. Instead, we find
it useful to examine the function π(M ) for a variety of S, α values, i.e.
$$\pi(M \mid S, \alpha) \;=\; \Pr\bigl((s_o + S_{new}) \ge pS \mid M, y, S, \alpha\bigr) \;=\; \int_{y^*} \int_{\theta} I(s_o + S_{new} \ge pS)\, p(\theta, y^* \mid M, y, S, \alpha)\, d\theta\, dy^*. \qquad (13)$$
Examining π(M ) in this way allows us to use the posterior distribution of S, α to convey
uncertainty about our estimate of M . The function π(M |S, α) is a complicated function
of S and α, and an analytical form of its posterior distribution is not possible. Instead,
we use a Monte Carlo approach to estimate the posterior distribution of π(M |S, α).
Specifically, for each posterior draw of S and α, we estimate the quantity π(M |S, α)
by averaging over θ and y∗ . The posterior distribution of π(M |S, α) is obtained by
repeating the Monte Carlo evaluation for the available draws of S and α.
4.2 Monte Carlo simulation procedure
The Monte Carlo approach to evaluating π(M |S, α) in (13) is made explicit by applying the identity p(θ, y∗ |M, y, S, α) = p(y∗ |θ, S, M )p(θ|y, S, α), where we have assumed
that y∗ is conditionally independent of α and y given M, θ, and S. The assumption
of conditional independence is based on the consideration that θ and S fully define
the probability vector for multinomial sampling of y∗ . The algorithm for computing
π(M |S, α) for a given S, α pair is then given by the following steps. For t = 1, · · · , T ,
1) generate θ (t) from p(θ|y, S, α) (a mixture of Dirichlet distributions)
2) generate y∗ (t) from a Multinomial distribution with parameters M and θ (t)
3) define It = 1, if (so + Snew ) ≥ pS, and It = 0 otherwise.
Estimate π(M |S, α) with $\frac{1}{T} \sum_{t=1}^{T} I_t$, and repeat steps 1 to 3 for as many different values of M as desired.
For each given pair of S, α, the result can be viewed as a curve giving the probability
of covering a proportion p of the species as a function of M . If k posterior draws of S, α
are available, then there are k such curves in total.
The Monte Carlo algorithm is conceptually straightforward but a number of implementation details are noteworthy. First, recall that the posterior distribution of θ given
y, S, α is a mixture of Dirichlet distributions. All of the Dirichlet distributions in the
mixture are identical up to permutation of the indices (i1 , i2 , · · · , iS ). Sampling from
the mixture distribution requires that one pick a set of labels from W (so ) to correspond
to the observed species and then simulate from the relevant component of the mixture
distribution. The subsequent steps in the algorithm would then be done conditional on
this choice of labels. In practice, because we are not interested in a specific θi or yi , it is
equally valid to arbitrarily assign labels to the observed species and proceed. A second
noteworthy detail concerns efficiency. Steps 1 and 2 of the algorithm propose to use
only a single draw of y∗ for each θ. It is natural to ask whether the algorithm might
be improved by selecting multiple y∗ vectors for each θ, perhaps thereby estimating a
separate curve for each θ. Our simulation results suggest however that variation among
the curves corresponding to different θ’s for fixed S and α is relatively small, and consequently the single-draw algorithm described above works well. Lastly, we note that step 3 of the
Monte Carlo simulation procedure requires determining the number of new species by
counting the number of positive yi∗ ’s whose θi ’s correspond to species with yi = 0 (or
equivalently to Dirichlet parameter α). In practice it is possible to save a considerable
amount of computing time by embedding iteration over the sample size M within the
above loop (instead of running the above loop separately for each M ).
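A compact sketch of steps 1–3, using the arbitrary-labeling simplification and the embedded M -loop just described (names are ours, not the paper's):

```python
import numpy as np

def coverage_curve(y, S, alpha, M_grid, p=0.9, T=1500, rng=None):
    """Estimate pi(M | S, alpha) of eq. (13) for each M in M_grid."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y, dtype=float)
    s_o = len(y)
    # Arbitrary labeling: observed species get Dirichlet parameters y_i + alpha,
    # the S - s_o unseen species all get parameter alpha.
    dir_par = np.concatenate([y + alpha, np.full(S - s_o, alpha)])
    hits = np.zeros(len(M_grid))
    for _ in range(T):
        theta = rng.dirichlet(dir_par)                      # step 1
        for m, M in enumerate(M_grid):
            y_star = rng.multinomial(M, theta)              # step 2
            S_new = np.count_nonzero(y_star[s_o:])          # previously unseen species
            hits[m] += (s_o + S_new) >= p * S               # step 3
    return hits / T
```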
4.3 The probability of species coverage and the required sample size
The Monte Carlo algorithm yields a collection of curves, each showing π(M |S, α) as a
function of M . For any M these curves yield a posterior distribution of the quantity
π(M ) = P r((so + Snew ) ≥ pS|M, y). Posterior summaries, e.g., point estimates or
posterior intervals can be constructed from this estimated posterior distribution.
Another practical question is how to find the minimum sample size required to
observe at least a proportion p of all species with a specified probability q. We denote
this value as Mq ; it too can be viewed as a function of S and α. The posterior distribution
of Mq is determined easily using the Monte Carlo approach. For each (S, α) pair we
have developed an estimated curve showing π(M ) vs M . For each such curve we identify
the smallest value of M such that π(M |S, α) ≥ q. The collection of identified sample
sizes provides an estimated posterior distribution for Mq .
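Extracting Mq from an estimated curve is then a one-line search, and applying it to each of the k curves yields the posterior sample of Mq . A hypothetical helper:

```python
import numpy as np

def required_sample_size(M_grid, curve, q=0.9):
    """Smallest M in M_grid with pi(M | S, alpha) >= q, or None if never reached."""
    idx = np.nonzero(np.asarray(curve) >= q)[0]
    return M_grid[idx[0]] if idx.size else None
```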
5 Simulations
To demonstrate our method we begin by simulating a single data set with N = 2000
observations for which S is known. In a later section, we consider the effect of increasing
the sample size.
5.1 Data
The data are N = 2000 observations simulated from a multinomial distribution with S = 2000 species in the population and θ a random sample from a Dirichlet distribution with α = 1. The distribution of θ is then uniform over all vectors with $\sum_{i=1}^{S} \theta_i = 1$. Table 1 and Figure 2 describe the data as the number of species that appear exactly x times, x = 1, 2, · · · , xo . In this sample, the largest frequency is xo = 11 and the number of observed species is so = 965.
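The simulated data set can be reproduced in outline as follows (the seed, and hence the exact values of so and xo , will differ from the draw used in the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
S_true, N, alpha_true = 2000, 2000, 1.0

theta = rng.dirichlet(np.full(S_true, alpha_true))   # uniform on the simplex for alpha = 1
counts = rng.multinomial(N, theta)
y = counts[counts > 0]

print("observed species s_o:", len(y))   # roughly 1000 in expectation; the paper's draw gave 965
print("largest frequency x_o:", y.max())
```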
Table 1: Species frequencies

x           1    2    3    4   5   6   7  8  9  10  11
# species   451  268  116  61  35  16  5  7  2  3   1

Figure 2: The distribution of frequencies for the simulated data

5.2 Posterior inference for S
We used the approach described in Section 3 to find the posterior distribution of S
and α given the simulated data. We assume the plausible upper limit on S is Smax = 10000 with prior belief of about 0.999 that Smin ≤ S ≤ Smax . This gives the value of
f in our geometric prior distribution as f = 0.0007. In a later section, Section 5.5,
we evaluate the effect of choosing different values of f (i.e. different geometric prior
distributions). The posterior inferences in this section are based on 4000 draws from
the posterior distribution after an MCMC burn-in period of 4000 iterations. Figure 3 shows a contour plot of the joint posterior distribution of S and α. The distribution has a single mode around S = 1800 and α = 1.0. Figures 4 and 5 are histograms of the posterior distributions of S and α. The posterior mean of S is 1844. A 95% central posterior interval for S is (1559, 2301). The true value, S = 2000, is contained in the interval. Note that the method of Efron and Thisted (1976) based on the Poisson-Gamma model yields a similar estimate of S, namely Ŝ = 1639 with standard error 226. A 95% central posterior interval of α is (0.64, 1.69), which includes the true value
α = 1. The posterior mean of α is 1.07.
5.3 Sample size calculation for future sampling
As discussed in Section 4, we can estimate the probability of observing a proportion
p of the total number of species given an additional M animals, and the sample size
required to ensure that future sampling covers a specified proportion of the species with
a given probability of coverage. As a first step we estimate π(M |S, α) for M = 2000
to 30000 in steps of size 20 for a number of (S, α) pairs. Each point of the π(M |S, α)
vs M curve is based on T = 1500 Monte Carlo evaluations in order to make the Monte
Carlo simulation error for a given probability less than 0.015.
Figure 3: Contour plot for the data from S = 2000, α = 1 with N = 2000 (f = 0.0007)

Figure 6 is a plot giving π(M |S, α) for 100 draws of (S, α) from the posterior distribution p(S, α|y) (including more curves makes the figure more difficult to read). Each curve in the figure shows the relationship between the probability of seeing 90% of the
species and the additional sample size M for a single posterior draw of S and α. From
the figure, we can see the curves are spread out especially for larger coverage probabilities, which implies large uncertainty about the probability of seeing 90% of the species
with a given M , and also large uncertainty about the minimum sample size required
to see at least 90% of the species for a specified confidence level π. This reflects the
uncertainty about the parameter α which has a very substantial impact on the species
frequency distribution. Posterior draws with large α values will tend to have smaller S
values, and hence greater likelihood of observing 90% of the species with M additional
animals. These values correspond to the curves on the left in Figure 6. A small α
suggests the true S is larger, so we are less likely to observe 90% of the species with M
additional animals.
Probability of observing at least pS species with M additional animals
We next use the curves in Figure 6 to draw posterior inference for the probability
of observing at least 0.9S species with M additional observations. Table 2 gives the
posterior median of π(M ) and a 90% central posterior interval for π(M ) for a range of M values.

Figure 4: Histogram of the posterior of S

These inferences are based on k = 100 (S, α) pairs chosen randomly among the
4000 posterior samples. Figure 7 shows the posterior median and pointwise 90% central
posterior intervals graphically for M ranging from 400 to 20000. Posterior intervals for
a given value of M tend to be wide when M is relatively small (e.g. M = 5000), but
the length of the intervals decreases quickly with the increase of M . This reflects the
form of Figure 6; for a given M most curves have probability values of attaining the
target number of species near zero (if that value of M is relatively small compared to
the value of S on which the curve is based) or near one (if the value of M is relatively
large compared to the relevant value of S).
Table 2: Estimates of π(M ) for different M values

M                     5000    8000       10000      12000
π̂(M )                0.52    1          1          1
90% emp. post. int.   (0, 1)  (0.02, 1)  (0.48, 1)  (0.75, 1)
Figure 5: Histogram of the posterior distribution of α
Required sample size to capture at least pS species with coverage probability 0.9
Recall that for the simulated data the initial sample of 2000 animals captured approximately half the species. How many additional animals have to be collected if we want
to see at least 0.9S species with probability q = 0.9? Once again the Monte Carlo simulations in Figure 6 can be used to answer the question. Drawing a horizontal line with
coverage probability q = 0.9, the intersection points of the horizontal line and the curves
give estimates of Mq , one from each curve. The distribution of these values provides the
desired posterior inferences. The posterior median is 5330, and an empirical 90% central posterior interval is (2980, 13080). Table 3 gives the posterior median of the needed
sample size together with 90% central posterior intervals for various values of the target
proportion of species (p) and the desired coverage probability (q). As the proportion
of the species to be captured (p) increases, the number of additional samples required
increases quickly, which is natural because the more common species have undoubtedly
been observed. Further, as indicated in the table, for each p, the sample size required
to achieve different coverage probabilities (q) changes slowly. This is once again due to
the pattern observed in Figure 6, in which all curves display a similar trend: there is a
steep increase in the probability of coverage q near a threshold value of M (though this
threshold varies depending on S and α).
Figure 6: The relationship between probability π(M |S, α) and additional sample size M (N = 2000)
Table 3: Size of additional sample required to obtain a fraction p of the total species with probability q

                          Probability of covering specified fraction (q)
Fraction of species (p)   0.5              0.7              0.9
0.7                       810              830              870
                          (290, 1570)      (310, 1630)      (330, 1700)
0.8                       1940             2000             2090
                          (1060, 3540)     (1100, 3660)     (1120, 3840)
0.9                       5010             5190             5330
                          (3020, 10960)    (3140, 11760)    (2980, 13080)
Figure 7: Trace of the estimated coverage probabilities. The middle line connects the posterior median values of π(M |S, α), and the two dotted lines beside it are the 90% pointwise central posterior intervals
5.4 Effect of sample size
Having demonstrated the approach for a sample of size 2000 from our hypothetical
population in Sections 5.1 through 5.3, we now demonstrate the impact of increasing
the sample size. One would expect the inferences to become more precise. We simulate
a data set with size N = 10000 from the same population as before, i.e. α = 1, S = 2000.
For this sample, the highest frequency is xo = 45, and the number of observed species
is so = 1663, which is more than 80% of the total number of species.
In this example, the value of f is also selected as f = 0.0007. The posterior mean for
S is 2030 with 95% central posterior interval (1948, 2129). The posterior mean for α is
0.93, and a 95% central posterior interval for α is (0.80, 1.08). Both intervals are much
narrower than for the case with N = 2000. Figure 8 shows the posterior distribution
of π(M |S, α) with p = 0.9 (i.e. capturing 90% of all species) for 100 (S, α) pairs and a
number of M values. There is much less variation than is present in Figure 6.
Figure 8: The relationship between probability π(M |S, α) and additional sample size M (N = 10000)
The posterior median of Mq for q = 0.9 is 1880 and a 90% central posterior interval
is (1240, 2700). The required sample size is smaller because a larger number of species
are observed in the pilot sample. In addition the posterior interval is much narrower
than that based on N = 2000 observations.
5.5 Effect of prior distribution
As in any Bayesian analysis, it is critical to consider the impact of the choice of prior
distribution on the inferences obtained. That is especially true here with the prior distribution for S and α. This section addresses comparisons between our prior distribution
and others in the literature, as well as some practical issues associated with the use of
our prior distribution.
In Section 2.3, two other choices of prior distributions for S and α were mentioned. One is the proposal of Boender and Rinnooy Kan (1987) to use proper prior distribution functions for S and α; the other is that of Zhang and Stern (2005), where a
uniform distribution with an upper bound is assigned to S and a vague prior distribution
is given to α (the same as the prior distribution for α that is used here). We applied
the Dirichlet-generalized multinomial model with their prior distributions on the same
simulated data discussed in Section 5.1. The posterior inferences for S and α are listed
in Table 4, which indicates that the results from these two alternative prior distributions are
consistent with the results using the geometric prior distribution for S. The findings
regarding species coverage and sample size are also similar across different methods.
We next discuss the effect of different choices of f on the posterior inferences for
the simulated data. As noted earlier, different values of f imply different degrees of
confidence that we might have with respect to the suggested range of S, with larger
values of f corresponding to higher prior confidence of S being between Smin and the
plausible Smax . Table 4 lists various choices of f for Smin = 2 and Smax = 10000,
the corresponding probability of S ∈ (Smin , Smax ), and the posterior inferences of S
and α that result from this choice of f . The results suggest that as long as f is not
too big (i.e. our prior belief in Smax is not too extreme), the posterior inferences for
the parameters are consistent across different values of f and all agree reasonably well
with the true values. We also observe that the larger the value of f , the more strongly our prior information favors small values of S; this is reflected in the inferences, as the posterior mean decreases with increasing f because the prior puts more probability mass on smaller values of S.
The preceding discussion concerns the simulated data discussed in Section 5.1 where
the population essentially does not have any rare species. The three different prior
distribution choices give similar results in this case. However, as noted in Section 2.3,
the methods of Boender and Rinnooy Kan (1987) and Zhang and Stern (2005) both
have difficulty in inferring the number of species if the sample is consistent with small
α values in the population. We use simulated data to demonstrate and compare the
three methods in this context. A random sample is drawn from a population with a
large number of rare or infrequent species. The same scenario given in section 5.1 is
applied and a data set with N = 2000 observations is drawn from a population with
α = 0.01, S = 2000. In this data set, the number of observed species is so = 94 and the
highest frequency is xo = 155, which implies the population has some very abundant
species but many more rare or hard to capture species (Zhang and Stern (2005)). The
value of f is selected as before, i.e., f = .0007 corresponding to high confidence (.999)
that the true number of species is between Smin = 2 and the suggested Smax = 10000.
The posterior inferences from the three methods are listed in Table 5; posterior means
are given along with 95% central posterior intervals in parentheses. The results listed
in Table 5 demonstrate the poor performance using the prior distributions of Boender
and Rinnooy Kan (1987) and Zhang and Stern (2005). For the new prior distribution,
with f = 0.0007, the posterior inferences are consistent with the true values. We notice
that the posterior inferences seem more sensitive to the choice of f in this context than
in the high α case. We also find that the posterior interval of S using the geometric
prior distribution is wide, which implies large uncertainty on the value of S due to the
large number of rare species in the population. The inferences can be improved if more
information is available to help construct an informative prior distribution of S.
Table 4: Posterior inferences on S and α from different methods (95% posterior intervals are in parentheses)

                                   Ŝ                   α̂
Boender and Rinnooy Kan (1987)
  Scut = 500                       1833 (1597, 2160)   1.08 (0.72, 1.55)
  Scut = 1000                      1817 (1573, 2122)   1.1 (0.74, 1.62)
Zhang and Stern (2005)             1866 (1576, 2337)   1.09 (0.64, 1.75)

Geometric prior
p(Smin ≤ S ≤ Smax)    f         Ŝ                   α̂
0.63                  0.0001    1870 (1556, 2320)   1.03 (0.61, 1.69)
0.90                  0.00023   1865 (1574, 2335)   1.04 (0.61, 1.69)
0.999                 0.0007    1844 (1559, 2301)   1.07 (0.64, 1.69)
0.9999                0.001     1842 (1537, 2252)   1.07 (0.64, 1.70)
≈1                    0.002     1803 (1537, 2252)   1.12 (0.70, 1.32)
≈1                    0.003     1782 (1530, 2153)   1.16 (0.72, 1.81)
≈1                    0.006     1721 (1501, 2022)   1.26 (0.82, 1.93)
≈1                    0.01      1660 (1467, 1909)   1.39 (0.93, 2.09)
Table 5: Posterior inferences on S and α from different methods when population α is small (α = 0.01)

                                   Ŝ                    α̂
Boender and Rinnooy Kan (1987)
  Scut = 50                        297 (129, 1021)      0.12 (0.021, 0.31)
  Scut = 500                       318 (151, 490)       0.087 (0.045, 0.23)
  Scut = 5000                      1506 (174, 4611)     0.023 (0.0044, 0.18)
Zhang and Stern (2005)             5159 (451, 9822)     0.0054 (0.0019, 0.052)
Geometric prior
  f = 0.0002                       2548 (247, 9100)     0.013 (0.0022, 0.11)
  f = 0.0005                       2307 (270, 7323)     0.014 (0.0027, 0.096)
  f = 0.0007                       2180 (267, 6734)     0.014 (0.0030, 0.098)
  f = 0.001                        1650 (255, 5280)     0.017 (0.0038, 0.11)
6 Application to sequence data

6.1 Description of the data
This work was motivated by a bioinformatics problem arising during a genome sequencing project. Details of the technological approach are not particularly crucial here – for
one thing, the approach is no longer used by the company. A key issue that came up during the project was the desire to identify the unique elements in a set of DNA fragments.
The unique elements could be easily determined by sequencing all of the fragments but
this is not necessarily cost effective if there is a lot of duplication. One strategy under
consideration proposed sequencing a small sample of fragments and recording the frequency with which each unique sequence was found. Framed in this way the problem is
directly analogous to our species problem. The hope is that based on the small sample
it will be possible to determine how large a sequencing effort to mount.
A prototype data set was provided with sample size N = 1677 and so = 644, in which
there were 440 species each observed once and 1 species observed 76 times. Figure 9
shows the pattern of the data in terms of frequencies. The figure shows a very sharp
decreasing pattern in the distribution of frequencies, which is different from that of our
simulated data in Section 5.1 and more like the small α case discussed at the end of
Section 5.5. A few “species” occur with high frequencies, and a very high proportion of
the observed species only occur once. This is the type of data that typically indicates a
small value of α that can cause difficulties for the generalized multinomial model.
Figure 9: The distribution of the frequencies for the DNA sequence data
Figure 10: Contour plot for the true data
6.2 Applying the model
We apply the method proposed in previous sections to the DNA segments data. Our collaborator suggested the maximum value Smax = 10000. With the value of f selected as f = 0.0007, as discussed earlier, our prior confidence that S is between 2 and 10000 is 0.999. Figure 10 is a contour plot for the posterior distribution of S and α,
which clearly shows one mode around S = 12000. The posterior mean for S is 12111. A
95% central posterior interval of S is (7246, 19637). The posterior mean for α is 0.033
and its 95% central posterior interval is (0.020, 0.056).
Note that although the prior confidence of S in the interval (2,10000) is 0.999, the
posterior distribution is concentrated above 10000. This suggests that the data provide
overwhelming evidence of a large number of rare species. As seen in the next section,
this inference results in our determination that extremely large sample sizes are required
to collect even a small fraction of the total number of species.
6.3 Sample size calculation for future sampling
We use the Monte Carlo simulation approach discussed in Section 4.2 to carry out
the sample size calculation. The posterior inferences for S suggest a large number of
distinct DNA sequences in the population. The posterior inference of α implies that the
population has a large number of rare species. We thus expect a large sample is needed
even for modest species coverage. Table 6 lists the estimated sample sizes in order to see
10% or 15% of all the distinct DNA sequences with different probabilities of coverage.
Similar patterns are observed as in the simulations: the change in the required sample
sizes across different coverage probabilities (q) is small for a given target fraction (p).
On the other hand, the required sample sizes and the uncertainty both increase quickly
with a small increase in the target fraction of species (p). Due to the large number of rare
species in the population the inferences obtained here are of limited value commercially;
an extremely large sample size is required to see a substantial fraction of the species.
Table 6: Additional sample sizes needed to collect 10% or 15% of all distinct DNA sequences

                          Probability of covering specified fraction (q)
Fraction of species (p)   0.5              0.9
0.10                      1900             2200
                          (400, 6425)      (450, 7488)
0.15                      7000             7500
                          (2600, 23500)    (2800, 27000)

7 Summary
A multinomial-Dirichlet model is proposed for the analysis of data in which individual objects belong to different categories. The prior distribution for the number of
categories is selected to be a geometric distribution with probability parameter set to
reflect our confidence that the number of categories lies in a predetermined range. The
multinomial-Dirichlet model with this prior distribution seems to work well over a range
of scenarios. A new Monte Carlo simulation algorithm is introduced for determining
the minimum size of an additional sample required to capture a certain proportion of
categories in the population with specified coverage probability. Simulation results show
that sample size calculation in this way is feasible. An application to a DNA segments
data set indicates the applicability of the proposed method but also suggests continued
difficulty with the problematic case of many rare species. Future study is needed
to extend the model to address situations where the distribution of species is not well
approximated by our model, e.g. where the relative proportion of rare species is high.
Appendix
A.1 Proof of Theorem 1
The joint posterior distribution of S and α, as derived in Section 2, is
$$p(S, \alpha \mid y) \;\propto\; \frac{S!}{(S - s_o)!}\, \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)}\, \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{\Gamma(\alpha)^{s_o}}\, \alpha^{-3/2}, \qquad (14)$$
for S ≥ so and 0 < α < ∞. We find the conditions required to ensure that
$$\sum_{S=s_o}^{\infty} \int_0^{\infty} p(S, \alpha \mid y)\, d\alpha \;<\; \infty,$$
by obtaining an upper bound on the integral over α for each S.
For each S, choose ε > 0 such that Sε < 1. We then consider the integral over the two intervals (0, ε) and (ε, ∞).
On the interval (0, ε):
Recall that for the gamma function we have Γ(1 + z) = zΓ(z). This and other
properties of the gamma function yield the following results.
1. Γ(α) = Γ(1 + α)/α (α > 0)
2. If yi ≥ 1 and α < ε < 1, then Γ(yi + α) < max(Γ(yi + 1), 1)
3. Define γmin > 0 as the minimum value of the gamma function on the interval
(1,2), then Γ(1 + α) ≥ γmin for 0 ≤ α < 1.
4. Γ(Sα) = Γ(1 + Sα)/(Sα) < 1/(Sα) since Sα < Sε < 1.
5. If so ≥ 2, then we must have N ≥ 2 so that Γ(N + Sα) > Γ(N )
Applying these equalities and inequalities gives
$$\sum_{S=s_o}^{\infty} \int_0^{\varepsilon} p(S, \alpha \mid y)\, d\alpha \;<\; \sum_{S=s_o}^{\infty} \frac{S!}{(S - s_o)!}\,(1-f)^S \int_0^{\varepsilon} \frac{\prod_{i=1}^{s_o} \max(\Gamma(y_i + 1), 1)}{\gamma_{\min}^{s_o}}\, \frac{1}{S\alpha}\, \frac{\alpha^{s_o}\, \alpha^{-3/2}}{\Gamma(N)}\, d\alpha \;=\; \sum_{S=s_o}^{\infty} \frac{(S-1)!}{(S - s_o)!}\,(1-f)^S \int_0^{\varepsilon} C_y\, \alpha^{s_o - 5/2}\, d\alpha,$$
where $C_y = \bigl[\prod_{i=1}^{s_o} \max(\Gamma(y_i + 1), 1)\bigr]/\bigl[\gamma_{\min}^{s_o}\, \Gamma(N)\bigr]$ is a constant depending only on y. For so ≥ 2 the integral near zero is finite, and thus so is the sum, since the prior distribution of S is proper.
On the interval (ε, ∞):
Repeated application of the recurrence Γ(1 + z) = zΓ(z) yields
$$\begin{aligned} p(S, \alpha \mid y) \;\propto\;& \frac{S!}{(S - s_o)!}\, \frac{\prod_{i=1}^{s_o} \prod_{j=1}^{y_i} (y_i + \alpha - j)}{\prod_{j=1}^{N} (S\alpha + N - j)}\, \alpha^{-3/2}\, (1-f)^S \\ =\;& \frac{S!\,(1-f)^S\, \alpha^{-3/2}}{(S - s_o)!\; S^N}\, \frac{\prod_{i=1}^{s_o} \prod_{j=1}^{y_i} \bigl(1 + \frac{y_i - j}{\alpha}\bigr)}{\prod_{j=1}^{N} \bigl(1 + \frac{N - j}{S\alpha}\bigr)} \\ <\;& \frac{S!\,(1-f)^S\, \alpha^{-3/2}}{(S - s_o)!\; S^N}\, \prod_{i=1}^{s_o} \prod_{j=1}^{y_i} \Bigl(1 + \frac{y_i - j}{\varepsilon}\Bigr). \end{aligned}$$
The final product is a constant in terms of S and α and the remaining terms yield
a finite integral over α and sum over S.
Combining the information from the two intervals, we conclude that the posterior
distribution is proper if so ≥ 2, i.e., there are at least two categories observed.
A.2 Jumping functions for S and α
Metropolis-Hastings jumping function for S
The jumping function for S is a symmetric discrete uniform distribution centered at
S (t−1) (an asymmetric distribution is used if S (t−1) is near the limit of its range) with
width parameter b(t−1) . Take S (∗) as the proposed value of S when jumping from S (t−1) .
The jumping distribution can be written as
$$S^{(*)} \mid S^{(t-1)} \sim \begin{cases} \mathrm{DUNIF}\bigl(S^{(t-1)} - b^{(t-1)},\; S^{(t-1)} + b^{(t-1)}\bigr), & S^{(t-1)} \ge s_o + b^{(t-1)} \\[4pt] \mathrm{DUNIF}\bigl(s_o,\; S^{(t-1)} + b^{(t-1)}\bigr), & S^{(t-1)} < s_o + b^{(t-1)}, \end{cases}$$
where the second case covers draws near the boundary of the parameter space. The width parameter $b^{(t-1)}$ is selected to be proportional to the current value $S^{(t-1)}$.
Metropolis-Hastings jumping function for α
We use a normal jumping distribution on the logarithm of α. Define φ = log(α). Let
φ(t−1) denote the current sampled point, and φ(∗) be the candidate point generated
from the jumping distribution. The jumping distribution for φ is
$$\phi^{(*)} \mid \phi^{(t-1)} \sim \mathrm{N}\bigl(\phi^{(t-1)}, V^2\bigr),$$
where the standard deviation V is chosen to make the jumping function efficient. In
practice, V is selected based on a pilot sample to achieve an acceptance rate near 0.44, the
optimal rate suggested by Gelman et al. (2003).
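For concreteness, a condensed sketch of one Metropolis-within-Gibbs scan built on these jumping functions follows. It assumes a log_post function such as the one sketched after equation (9); for simplicity it uses a fixed width b rather than one proportional to the current S, so the only proposal asymmetry to correct is the truncation at so . All names and tuning values are ours, not the paper's.

```python
import numpy as np

def gibbs_scan(S, alpha, y, f, log_post, rng, b=50, V=0.5):
    """One Metropolis-within-Gibbs update of (S, alpha)."""
    s_o = len(y)

    # Metropolis-Hastings step for S: truncated discrete uniform walk.
    lo_cur = max(s_o, S - b)
    S_prop = int(rng.integers(lo_cur, S + b + 1))
    lo_prop = max(s_o, S_prop - b)
    log_r = (log_post(S_prop, alpha, y, f) - log_post(S, alpha, y, f)
             + np.log(S + b - lo_cur + 1)         # Hastings correction for the
             - np.log(S_prop + b - lo_prop + 1))  # boundary-truncated proposal
    if np.log(rng.uniform()) < log_r:
        S = S_prop

    # Metropolis-Hastings step for alpha: log-normal random walk.
    alpha_prop = alpha * np.exp(V * rng.normal())
    # The Jacobian of the log transform contributes log(alpha_prop/alpha).
    log_r = (log_post(S, alpha_prop, y, f) - log_post(S, alpha, y, f)
             + np.log(alpha_prop) - np.log(alpha))
    if np.log(rng.uniform()) < log_r:
        alpha = alpha_prop

    return S, alpha
```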
References
Boender, C. G. E. and Rinnooy Kan, A. H. G. (1987). “A multinomial Bayesian approach to the estimation of population and vocabulary size.” Biometrika, 74(4): 849–856.

Bunge, J. and Fitzpatrick, M. (1993). “Estimating the number of species: a review.” Journal of the American Statistical Association, 88: 364–373.

Cao, Y., Larsen, D. P., and Thorne, R. S.-J. (2001). “Rare species in multivariate analysis for bioassessment: some considerations.” Journal of the North American Benthological Society, 21: 144–153.

Corbet, A. S. (1942). “The distribution of butterflies in the Malay Peninsula.” Proceedings of the Royal Entomological Society of London (A), 16: 101–116.

Efron, B. and Thisted, R. (1976). “Estimating the number of unseen species: How many words did Shakespeare know?” Biometrika, 63: 435–448.

Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943). “The relation between the number of species and the number of individuals in a random sample of an animal population.” Journal of Animal Ecology, 12: 42–58.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman & Hall/CRC.

Gelman, A. and Rubin, D. B. (1992a). “Inference from iterative simulation using multiple sequences.” Statistical Science, 7: 457–511.

Gelman, A. and Rubin, D. B. (1992b). “A single series from the Gibbs sampler provides a false sense of security.” In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (eds.), Bayesian Statistics 4. Proceedings of the Fourth Valencia International Meeting, 625–631. Clarendon Press [Oxford University Press].

Good, I. J. (1965). The Estimation of Probabilities; An Essay on Modern Bayesian Methods. Cambridge, Mass.: MIT Press.

Good, I. J. and Toulmin, G. H. (1956). “The number of new species, and the increase in population coverage, when a sample is increased.” Biometrika, 43: 45–63.

Lijoi, A., Mena, R. H., and Prünster, I. (2007). “Bayesian nonparametric estimation of the probability of discovering new species.” Biometrika, 94: 769–786.

Morris, J. S., Baggerly, K. A., and Coombes, K. R. (2003). “Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE.” Biometrics, 59(3): 476–486.

Pitman, J. (1996). “Some Developments of the Blackwell-MacQueen Urn Scheme.” In Ferguson, T. S., Shapley, L. S., and MacQueen, J. B. (eds.), Statistics, Probability and Game Theory (IMS Lecture Notes Monograph Series, Vol. 30), 245–267. Institute of Mathematical Statistics.

Sethuraman, J. (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 4: 639–650.

Tiwari, R. C. and Tripathi, R. C. (1989). “Nonparametric Bayes estimation of the probability of discovering a new species.” Communications in Statistics: Theory and Methods, 18: 877–895.

Zhang, H. (2007). “Inferences on the number of unseen species and the number of abundant/rare species.” Journal of Applied Statistics, 34(6): 725–740.

Zhang, H. and Stern, H. (2005). “Investigation of a generalized multinomial model for species data.” Journal of Statistical Computing and Simulation, 75: 347–362.
Bayesian Analysis (2009) 4, Number 4, pp. 793–816
Markov Switching Dirichlet Process Mixture
Regression
Matthew A. Taddy∗ and Athanasios Kottas†
Abstract. Markov switching models can be used to study heterogeneous populations that are observed over time. This paper explores modeling the group
characteristics nonparametrically, under both homogeneous and nonhomogeneous
Markov switching for group probabilities. The model formulation involves a finite
mixture of conditionally independent Dirichlet process mixtures, with a Markov
chain defining the mixing distribution. The proposed methodology focuses on
settings where the number of subpopulations is small and can be assumed to be
known, and flexible modeling is required for group regressions. We develop Dirichlet process mixture prior probability models for the joint distribution of individual
group responses and covariates. The implied conditional distribution of the response given the covariates is then used for inference. The modeling framework allows for both non-linearities in the resulting regression functions and non-standard
shapes in the response distributions. We design a simulation-based model fitting
method for full posterior inference. Furthermore, we propose a general approach
for inclusion of external covariates dependent on the Markov chain but conditionally independent from the response. The methodology is applied to a problem
from fisheries research involving analysis of stock-recruitment data under shifts in
the ecosystem state.
Keywords: Dirichlet process prior; hidden Markov model; Markov chain Monte
Carlo; multivariate normal mixture; stock-recruitment relationship.
1 Introduction
The focus of this work is to develop a flexible approach to nonparametric switching regression which combines Dirichlet process (DP) mixture nonparametric regression with
a hidden Markov model. Switching regression, a modeling framework for data that have been drawn from a number of unobserved states (or regimes), where each state defines a different relationship between response and covariates, was originally developed in the context of econometrics (Goldfeld and Quandt 1973; Quandt and Ramsey 1978) and has primarily been approached through likelihood-based estimation. A hidden Markov
mixture model in this context holds that the state vector constitutes a Markov chain,
and thus introduces an underlying dependence into the data. In such models, the regression functions corresponding to individual population regimes are typically linear
with additive error, and may or may not include an explicit time-series component
(e.g., Hamilton 1989; McCulloch and Tsay 1994). The work presented here has a different focus: flexible nonparametric inference within regimes, guided by an informative
parametric hidden Markov model for regime state switching. Such approaches reveal
a baseline inference: the posterior distribution for individual regression functions when
informed by little more than the state switching model. The proposed posterior simulation algorithms will also serve as a useful framework for more general inference about
mixtures of conditionally independent nonparametric processes.
Bayesian nonparametrics, and DP mixtures in particular, provide highly flexible
models for inference. Indeed, the practical implication of this flexibility is that, for
inference based on small to moderate sample sizes, a certain amount of prior information
must be provided to avoid a uselessly diffuse posterior. The DP hyperparameters provide
the natural mechanism for introducing prior information. However, it is also possible to
constrain inference by embedding the nonparametric component within a larger model.
The typical semiparametric extension to linear regression – nonparametric modeling for
the additive error distribution – is a familiar example of this approach. One can afford
to be very noninformative about the error distribution only because linearity of the
mean imposes a substantial constraint on model flexibility.
This paper explores one such class of semiparametric inference settings: nonparametric density or regression estimation for heterogeneous populations, using a DP mixture
framework, nested within an informative parametric model for the group membership,
using either a homogeneous or a nonhomogeneous hidden Markov switching model. Although this framework applies generally to nonparametric density estimation, our particular focus is Markov switching nonparametric regression, specified in detail in Section
2, including model elaboration for the inclusion of external covariates. Following this,
Section 3 describes efficient forward-backward posterior simulation methodology for dependent mixtures of nonparametric mixture models, along with details for full posterior
inference.
In Section 4, the methods are illustrated with an application from fisheries research
involving analysis of stock-recruitment data under shifts in the ecosystem state, which
can be characterized as regimes that are either favorable or unfavorable for reproduction.
Here, the Markov switching nonparametric regression framework enables simultaneous
inference for the regime-specific biological stock-recruitment relationship and for the
probability of regime switching. Moreover, the DP mixture regression approach relaxes
parametric regression assumptions for the stock-recruitment relationships, and yields
inference that can capture non-standard response density shapes. These are important features of the proposed model, since they can improve predictive inference for
years beyond the end of the observed time series, a key inferential objective for fishery
management. Finally, Section 5 concludes with a summary and discussion of possible
extensions.
2 Markov Switching Nonparametric Regression
In Section 2.1, we introduce the two building blocks upon which our modeling approach
is based: Markov switching mixtures of DP mixtures, and fully nonparametric implied
conditional regression. Section 2.2 presents the hidden Markov DP mixture model, and
Section 2.3 extends the model to include external variables that are correlated with the
underlying Markov chain, but conditionally independent of the joint covariate-response
distribution.
2.1 Mixtures of Conditionally Independent Dirichlet Process Mixtures
The generic nonparametric DP mixture model is written as \( f(z; G) = \int k(z; \theta)\, dG(\theta) \) for the density of z, with a parametric kernel density, k(z; θ), and a random mixing
for the density of z, with a parametric kernel density, k(z; θ), and a random mixing
distribution G that is assigned a DP prior (Ferguson 1973; Antoniak 1974). In particular,
G ∼ DP(α, G0 ), where α is the precision parameter, and G0 is the centering distribution.
More specifically, the starting point for our approach is Bayesian nonparametric
implied conditional regression, wherein DP mixtures are used to model the joint distribution of response and covariates, from which full inference is obtained for the desired
conditional distribution for response given covariates. Both the response distribution
and, implicitly, the regression function are modeled nonparametrically, thus providing
a flexible framework for the general regression problem. In particular, working with
(real-valued) continuous variables, DP mixtures of multivariate normal densities can
be used to model the joint density of the covariates, X, and response Y (as in, e.g.,
Müller et al. 1996). Hence, the normal DP mixture regression model can be described
as follows:
\[
f(z; G) = \int \mathrm{N}(z; \mu, \Sigma)\, dG(\mu, \Sigma), \qquad G \mid \alpha, \psi \sim \mathrm{DP}(\alpha, G_0), \tag{1}
\]
where z = (x, y), and G0 can be built from independent normal and inverse Wishart
components for µ and Σ, respectively. Inference for the implied conditional response
distribution under our Markov switching regression model is discussed in Section 3,
following the development in Taddy and Kottas (2009), where full inference about f (y |
x; G) was required to estimate quantile regression functions.
A model for multiple heterogeneous populations may be built upon the DP mixture
platform under either the density estimation or regression setting discussed above. Assume R distinct random mixing distributions G1 , . . . , GR , each characterized as a DP in
the prior, such that, for observations \( z_1, \ldots, z_n \) with population membership vector \( h = (h_1, \ldots, h_n) \), \( f(z_i; G_{h_i}) = \int k(z_i; \theta)\, dG_{h_i}(\theta) \). This leads to the Gr being independent in
the posterior full conditional (due, in particular, to conditioning on h), which is both
conceptually important and, in Markov chain Monte Carlo (MCMC) simulation, practically useful. Model specification is completed with a state probability vector, pi =
(pi,1 , ..., pi,R ), defining the probability that the i-th observation was drawn from the DP
mixture corresponding to each of the Gr . The goal of this framework is to introduce
information into the model through the pi .
One way to inform pi,r is to incorporate temporal structure, and a natural way to do
so is by assuming that the hi constitute a Markov chain. Robert et al. (1993) and Chib
(1996) discuss such hidden Markov models in the estimation of mixtures of parametric densities. Moreover, the basic Markov switching regression model defines distinct
regression functions for data that have been drawn from populations corresponding to
a number of unobserved states (see, e.g., Chapters 10 and 11 in Frühwirth-Schnatter
2006). Following the early work of Goldfeld and Quandt (1973) and Quandt and Ramsey (1978), the more recent literature includes, for instance, approaches for switching
dynamic linear models (Shumway and Stoffer 1991) and switching ARMA models (Billio et al. 1999). Moreover, Hurn et al. (2003) describe a Bayesian decision theoretic
approach to estimation for mixtures of linear regressions, whereas the approach of Shi
et al. (2005) offers a departure from the linear regression assumption through a mixture
of Gaussian process regressions.
Since, in our context, the Gr are modeled nonparametrically, this leads to inference
that is driven primarily by state membership and, in particular, the Markov transition probabilities. Taking this approach further, the proposed nonparametric switching
regression methodology will be most effective when state membership probabilities are
informed by external covariates. Hughes and Guttorp (1994) and Berliner and Lu (1999)
have proposed nonhomogeneous hidden Markov models where each observation’s state
probability vector pi is regressed onto a set of external covariates, ui . In Section 2.3, we
obtain a similar model by assuming that the external ui are randomly distributed according to a state dependent density function, phi (ui ). Conditioning on ui then implies
a nonhomogeneous hidden Markov model for h.
Hence, our methodological framework involves a known (small) number of states
where prior information is available on the properties of the underlying state Markov
chain, but there is a need for nonparametric modeling within each subpopulation. The
assumption that the number of mixture states is known fits within the general premise
of informative state estimation coupled with flexible nonparametric modeling for
regression estimation. Thus, while the methodology is not generally suitable for settings with little information about state membership, it offers a practical solution to
switching regression problems that lack prior information about the shape of the individual regression functions and/or the form of the corresponding conditional response
densities.
2.2 Model Specification for Hidden Markov Nonparametric Switching Regression
Mixtures of regressions are used to study multiple populations each of which involves a
different conditional relationship between response and covariates. The generic mixtures
of regressions setting holds that the response Y given covariates X has been drawn
from a member of a heterogeneous set of R conditional distributions defined by the
densities \( f_1(y \mid x), \ldots, f_R(y \mid x) \), and hence that \( \Pr(y \mid x) = p_1 f_1(y \mid x) + \cdots + p_R f_R(y \mid x) \), where \( \sum_{r=1}^{R} p_r = 1 \). We propose a departure from this standard form, wherein the response and covariates are jointly distributed according to one of the densities \( f_1(x, y), \ldots, f_R(x, y) \) – i.e., now \( \Pr(x, y) = p_1 f_1(x, y) + \cdots + p_R f_R(x, y) \) – and therefore \( \Pr(y \mid x) = \rho_1 f_1(x, y) + \cdots + \rho_R f_R(x, y) \), where \( \rho_r = p_r / \sum_{\ell=1}^{R} p_\ell f_\ell(x) \). Thus,
the approach is particularly appropriate whenever mixture component probabilities for
a given x and y should be dependent upon the joint distribution for response and
covariates, even though primary interest is in the regression relationship for response
given covariates.
Specifically, we develop the extension of DP mixture implied conditional regression to
the context of time dependent switching regression. The data consist of covariate vectors
\( x_t = (x_{1t}, \ldots, x_{d_x t}) \), and corresponding responses \( y_t \) observed at times t = 1, . . . , T,
where dx is the dimension of the covariate space. The data from each time point
are associated with a hidden state variable, ht ∈ {1, . . . , R}, such that, given ht , the
response-covariate joint distribution is defined by a state-specific density fht (xt , yt ). We
begin by describing density estimation in the d = dx + 1 dimensional setting, with data
D = {zt = (xt , yt ) : t = 1, . . . , T }. Now, however, the successive observations zt are
correlated through dependence in state membership h = (h1 , . . . , hT ), which constitutes
a stationary Markov chain defined by an R × R transition matrix Q. Although we
consider only first-order dependence in the Markov chain, the model and posterior
simulation methods can be extended to handle higher order Markov chains.
The first-order hidden Markov location-scale normal DP mixture model (referred to
as model M1) can then be expressed as follows,
\[
\begin{aligned}
z_t \mid h_t, G_{h_t} &\overset{\text{ind}}{\sim} f_{h_t}(z_t) \equiv f(z_t; G_{h_t}) = \int \mathrm{N}(z_t; \mu, \Sigma)\, dG_{h_t}(\mu, \Sigma), \quad t = 1, \ldots, T \\
G_r \mid \alpha_r, \psi_r &\overset{\text{ind}}{\sim} \mathrm{DP}\left(\alpha_r, G_0(\psi_r)\right), \quad r = 1, \ldots, R \\
h \mid Q &\sim \Pr(h \mid Q) = \prod_{t=2}^{T} Q_{h_{t-1}, h_t},
\end{aligned} \tag{2}
\]
where we denote the r-th row of Q by Qr = (Qr,1 , . . . , Qr,R ), with Qr,s = Pr(ht = s |
ht−1 = r), for r, s = 1, ..., R (and t = 2, ..., T ). Moreover, the DP centering distributions,
\( G_0(\mu, \Sigma; \psi_r) = \mathrm{N}(\mu; m_r, V_r)\, W_{\nu_r}(\Sigma^{-1}; S_r^{-1}) \), with \( \psi_r = (m_r, V_r, S_r) \). Here, \( W_v(\cdot\,; M) \)
denotes the Wishart distribution with v degrees of freedom and expectation vM .
Applying the regression approach discussed in Section 2.1, the joint response-covariate
density specification in (2) yields our proposed hidden Markov switching regression
model. In particular, for state r, the prior model for the marginal density for X can be written as \( f(x; G_r) = \int \mathrm{N}(x; \mu^x, \Sigma^{xx})\, dG_r(\mu, \Sigma) \), after the mean vector and covariance
matrix of the normal kernel have been partitioned. In particular, µ comprises the \( (d_x \times 1) \) vector \( \mu^x \) and the scalar \( \mu^y \), and Σ is a square block matrix with diagonal elements given by the \( (d_x \times d_x) \) covariance matrix \( \Sigma^{xx} \) and the scalar variance \( \Sigma^{y} \), and above- and below-diagonal vectors \( \Sigma^{xy} \) and \( \Sigma^{yx} \), respectively.
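For concreteness, the following Python sketch (our own illustration, not part of the paper; it assumes the response is stored as the last coordinate of z) extracts these blocks from a kernel mean vector and covariance matrix:

    import numpy as np

    def partition_kernel(mu, Sigma, dx):
        """Split a (d = dx + 1)-dimensional kernel mean and covariance into the
        covariate (x) and response (y) blocks described above."""
        mu_x, mu_y = mu[:dx], mu[dx]
        Sigma_xx = Sigma[:dx, :dx]    # (dx x dx) covariate block
        Sigma_xy = Sigma[:dx, dx]     # above-diagonal vector
        Sigma_yx = Sigma[dx, :dx]     # below-diagonal vector
        Sigma_y = Sigma[dx, dx]       # scalar response variance
        return mu_x, mu_y, Sigma_xx, Sigma_xy, Sigma_yx, Sigma_y

These blocks reappear below in the implied conditional density and conditional mean computations of Section 3.2.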
We assume that, in the prior, each state is equally likely for \( h_1 \). For r = 1, . . . , R, we place hyperpriors on \( \psi_r \) and \( \alpha_r \) such that \( \pi(\psi_r) = \mathrm{N}(m_r; a_{m_r}, B_{m_r})\, W_{a_{V_r}}(V_r^{-1}; B_{V_r}^{-1})\, W_{a_{S_r}}(S_r; B_{S_r}) \), and \( \pi(\alpha_r) = \Gamma(\alpha_r; a_{\alpha_r}, b_{\alpha_r}) \). The prior for Q is built from independent Dirichlet distributions, \( \pi(Q_r) = \mathrm{Dir}(Q_r; \lambda_r) \), where \( \mathrm{Dir}(Q_r; \lambda_r) \), with \( \lambda_r = (\lambda_{r,1}, \ldots, \lambda_{r,R}) \), denotes the Dirichlet distribution such that \( \mathrm{E}[Q_{r,s}] = \lambda_{r,s} / (\sum_{i=1}^{R} \lambda_{r,i}) \).
In practice, the hyperparameters for the αr , ψr and for Q need to be carefully chosen;
our approach to prior specification is detailed in Appendix A.
2.3 Extension to Semiparametric Modeling with External Covariates
In the spirit of allowing the switching probabilities to drive the nonparametric regression,
we extend here model M1 to include additional information about the state vector in
the form of an external covariate, U , with values u = {u1 , ..., uT }. (Although we
present the methodology for a single covariate, the work can be readily extended to
the setting with multiple external covariates.) The modeling extension involves a nonhomogeneous Markov mixture where the hidden state provides a link between the joint
covariate-response random variable and the external covariate.
The standard non-homogeneous hidden Markov model holds that the transition
probabilities are dependent upon the external covariates, such that Pr(ht | h1 , . . . , ht−1 ,
u) = Pr(ht | ht−1 , ut ). Berliner and Lu (1999) present a Bayesian parametric approach
to non-homogeneous hidden Markov models in which Pr(ht | ht−1 , ut ) is estimated
through probit regression. Also related is the likelihood analysis of Hughes and Guttorp (1994), wherein a heuristic argument, using Bayes theorem, is proposed to justify
the model Pr(ht | ht−1 , ut ) ∝ Pr(ht | ht−1 )L(ht ; ut ), where the likelihood L(ht ; ut ) in
their example is normal with state dependent mean.
Treating each ut as the realization of a random variable yields a natural modeling
framework in the context of our approach. Hence, we obtain a semiparametric extension
of model M1 (referred to as model M2) by adding a further stage, \( u_t \mid h_t \overset{\text{ind}}{\sim} p(u_t \mid \gamma_{h_t}) \),
to the model, along with hyperpriors for γ = {γr : r = 1, ..., R}, the state-specific
parameters of the distribution for the external covariate. Moreover, we assume that u
is conditionally independent of {z1 , ..., zT } given h. Thus, for model M2, the first stage
in (2) is replaced with
\[
z_t, u_t \mid h_t, G_{h_t}, \gamma \overset{\text{ind}}{\sim} p(u_t \mid \gamma_{h_t})\, f(z_t; G_{h_t}), \quad t = 1, \ldots, T.
\]
Clearly, the formulation of model M2 implies that the hidden Markov chain is nonhomogeneous conditional on u. However, unconditionally in the prior, it is more accurate to say that {z1 , ..., zT } and u are dependent upon a shared homogeneous Markov
chain, and that they are conditionally independent given h. In Section 4, we illustrate
with a Gaussian form for p(ut | γht ). More general examples, with multiple external
covariates, could incorporate dependence relationships, or even model some subset of
the vector of external covariates as a function of the others.
3 Efficient Posterior Simulation
Here, we present MCMC methods for posterior inference under the models developed
in Section 2, beginning with model M1 and adapting this to external covariates in
Section 2.3. To obtain the full probability model, we introduce latent parameters \( \theta = \{\theta_t = (\mu_t, \Sigma_t) : t = 1, \ldots, T\} \) such that the first stage in (2) is replaced with \( z_t \mid \theta_t \overset{\text{ind}}{\sim} \mathrm{N}(z_t; \theta_t) \) and \( \theta_t \mid h_t, G_{h_t} \overset{\text{ind}}{\sim} G_{h_t} \), for t = 1, ..., T. The standard approach to posterior
simulation from DP-based hierarchical models involves marginalization of the random
mixing distributions Gr in (2) over their DP priors. Conditionally on h, the vector of
latent mixing parameters breaks down into state-specific subvectors θr = {θt : ht =
r}, r = 1, ..., R, such that the distribution of each θr is built from independent Gr
distributions for the \( \theta_t \) corresponding to state r. Thus, the full posterior can be written as
\[
\Pr(h \mid Q) \prod_{r=1}^{R} \pi(\alpha_r)\pi(\psi_r)\pi(Q_r)\Pr(\theta_r \mid h, \alpha_r, \psi_r)\,\mathrm{DP}(G_r; \alpha_r^\star, G_{r0}^\star) \prod_{t=1}^{T} \mathrm{N}(z_t; \theta_t),
\]
using results from Blackwell and MacQueen (1973) and Antoniak (1974). Here, \( \Pr(\theta_r \mid h, \alpha_r, \psi_r) \) is the Pólya urn marginal prior for \( \theta_r \); \( \alpha_r^\star = \alpha_r + n_r \) (where \( n_r = |\{t : h_t = r\}| \)); and \( G_{r0}^\star(\cdot) \equiv G_{r0}^\star(\cdot \mid h, \theta_r, \alpha_r, \psi_r) = (\alpha_r + n_r)^{-1}\big( \alpha_r\, dG_0(\cdot; \psi_r) + \sum_{\{t: h_t = r\}} \delta_{\theta_t}(\cdot) \big) \).
This posterior can be sampled by extending standard MCMC techniques for DP mixtures (e.g., Neal 2000; Gelfand and Kottas 2002). However, marginalization over the \( G_r \) requires that each pair \( (\theta_t, h_t) \) be sampled jointly, conditional on the remaining parameters \( (\theta_{t'}, h_{t'}) \), for all \( t' \neq t \). This is possible, but inefficient, through use of a Metropolis-Hastings step with proposal distribution built from a marginal \( \Pr(h_t = r) \propto Q_{h_{t-1}, r}\, Q_{r, h_{t+1}} \), r = 1, ..., R, and a conditional for \( \theta_t \mid h_t = r \) given by the Pólya urn prior full conditional arising from \( \Pr(\theta_r \mid h, \alpha_r, \psi_r) \).
3.1 Blocked Gibbs with Forward-Backward Sampling
The posterior simulation approach discussed above requires updating each ht one at a
time, whereas forward-backward sampling for the entire state vector h is a substantially
more efficient method for exploring the state space (see, e.g., Scott 2002). To implement
forward-backward sampling, we need to evaluate the joint probability mass function for
states (ht−1 , ht ) conditional on the incomplete data vector {z1 , ..., zt } and relevant model
parameters, which include the random mixing distributions {G1 , ..., GR }. Therefore,
to compute state probabilities, it is necessary to obtain realizations for each Gr in
the course of the MCMC algorithm. The blocked Gibbs sampling approach for DP
mixture models (Ishwaran and James 2001) provides a natural approach wherein the
entire MCMC method is based on a finite truncation approximation of the DP, using
its stick-breaking definition (Sethuraman 1994). Based on this definition, a DP(α, G0 )
realization is almost surely a discrete distribution with a countable number of possible
values drawn i.i.d. from G0 , and corresponding weights that are built from i.i.d. β(1, α)
variables through stick-breaking. (We use β(a, b) to denote the Beta distribution with
mean a/(a + b).) As well as being the consistent choice if the truncated distributions
are used in state vector draws, blocked Gibbs can lead to very efficient sampling for the
complete posterior.
Using the DP stick-breaking representation, we replace each Gr in model M1 with
a truncation approximation. Specifically, for specified (finite) L, we work with
\[
G_r^L(\cdot) = \sum_{l=1}^{L} \omega_{l,r}\, \delta_{\tilde{\theta}_{l,r}}(\cdot),
\]
where the \( \tilde{\theta}_{l,r} = (\tilde{\mu}_{l,r}, \tilde{\Sigma}_{l,r}) \), l = 1, ..., L, are i.i.d. \( G_0(\psi_r) \), and the finite stick-breaking prior for \( \omega_r = (\omega_{1,r}, ..., \omega_{L,r}) \) (denoted by \( P_L(\omega_r \mid 1, \alpha_r) \)) is defined constructively by
\[
\zeta_1, \ldots, \zeta_{L-1} \overset{\text{iid}}{\sim} \beta(1, \alpha_r), \quad \zeta_L = 1; \quad \text{and for } l = 1, \ldots, L: \ \omega_{l,r} = \zeta_l \prod_{s=1}^{l-1} (1 - \zeta_s). \tag{3}
\]
Hence, each \( G_r^L \) is defined by the set of L location-scale parameters \( \tilde{\theta}_r = (\tilde{\theta}_{1,r}, ..., \tilde{\theta}_{L,r}) \)
and weights ωr . Guidelines to choose the truncation level L, up to any desired accuracy,
can be obtained, e.g., from Ishwaran and Zarepour (2000).
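As a minimal sketch (our own code; the function and variable names are illustrative), a prior draw of \( G_r^L \) under (3), with the normal/Wishart centering distribution of Section 2.2, can be generated as follows; note that drawing \( \Sigma^{-1} \sim W_\nu(\cdot\,; S^{-1}) \) is equivalent to drawing Σ from an inverse Wishart with scale matrix S:

    import numpy as np
    from scipy.stats import multivariate_normal, invwishart

    def draw_G_L(L, alpha, m, V, nu, S, rng=np.random.default_rng()):
        """One prior draw of (weights, atoms) for the truncation G^L of DP(alpha, G0)."""
        # stick-breaking weights, eq. (3): zeta_1,...,zeta_{L-1} ~ Beta(1, alpha), zeta_L = 1
        zeta = np.append(rng.beta(1.0, alpha, size=L - 1), 1.0)
        w = zeta * np.concatenate(([1.0], np.cumprod(1.0 - zeta[:-1])))
        # atoms theta_l = (mu_l, Sigma_l) drawn i.i.d. from G0(psi)
        Sigma = [invwishart.rvs(df=nu, scale=S, random_state=rng) for _ in range(L)]
        mu = [multivariate_normal.rvs(mean=m, cov=V, random_state=rng) for _ in range(L)]
        return w, np.array(mu), np.array(Sigma)

The weights sum to one by construction, since \( \zeta_L = 1 \) assigns all remaining stick length to the last atom.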
The first stage of model (2) is replaced with \( z_t \mid h_t, (\omega_{h_t}, \tilde{\theta}_{h_t}) \overset{\text{ind}}{\sim} \sum_{l=1}^{L} \omega_{l,h_t} \mathrm{N}(z_t; \tilde{\theta}_{l,h_t}) \), t = 1, ..., T. The limiting case of this finite mixture model (as L → ∞) is the countable DP mixture model \( f(z_t; G_{h_t}) = \int \mathrm{N}(z_t; \theta)\, dG_{h_t}(\theta) \) in (2). Again, we can introduce latent parameters \( \theta_t = (\mu_t, \Sigma_t) \) to expand the first stage specification to \( z_t \mid \theta_t \overset{\text{ind}}{\sim} \mathrm{N}(z_t; \theta_t) \) and \( \theta_t \mid h_t, (\omega_{h_t}, \tilde{\theta}_{h_t}) \overset{\text{ind}}{\sim} G^L_{h_t} \), for t = 1, ..., T. Alternatively,
since θt = θ̃l,ht with probability ωl,ht , we can work with configuration variables k =
(k1 , ..., kT ), where each kt takes values in {1, ..., L}, such that, conditionally on ht , kt = l
if and only if θt = θ̃l,ht . Hence, model M1 with the DP truncation approximation can
be expressed in the following hierarchical form
\[
\begin{aligned}
z_t \mid \tilde{\theta}_{h_t}, k_t &\overset{\text{ind}}{\sim} \mathrm{N}(z_t; \tilde{\theta}_{k_t, h_t}), \quad t = 1, ..., T \\
k_t \mid h_t, \omega_{h_t} &\overset{\text{ind}}{\sim} \sum_{l=1}^{L} \omega_{l,h_t}\, \delta_l(k_t), \quad t = 1, ..., T \\
\omega_r, \tilde{\theta}_r \mid \alpha_r, \psi_r &\overset{\text{ind}}{\sim} P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} dG_0(\tilde{\theta}_{l,r}; \psi_r), \quad r = 1, ..., R
\end{aligned} \tag{4}
\]
with \( h \mid Q \sim \Pr(h \mid Q) = \prod_{t=2}^{T} Q_{h_{t-1}, h_t} \), and the hyperpriors for α, ψ, and Q given in Section 2.2.
Denote by φ the vector comprising model parameters α, ψ, k, Q, and {(ωr , θ̃r ) :
r = 1, ..., R}. The full posterior, Pr(φ, h | D), corresponding to model (4) is now
proportional to
\[
\Pr(h \mid Q) \prod_{r=1}^{R} \left[ \pi(\alpha_r)\pi(\psi_r)\pi(Q_r)\, P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} dG_0(\tilde{\theta}_{l,r}; \psi_r) \prod_{\{t: h_t = r\}} \mathrm{N}(z_t; \tilde{\theta}_{k_t, r}) \left( \sum_{l=1}^{L} \omega_{l,r}\, \delta_l(k_t) \right) \right].
\]
Here, the key observation is that, conditionally on h, the first two stages of model (4), \( \prod_{t=1}^{T} \Pr(z_t, k_t \mid h_t, (\omega_{h_t}, \tilde{\theta}_{h_t})) = \prod_{t=1}^{T} \mathrm{N}(z_t; \tilde{\theta}_{k_t, h_t}) \left\{ \sum_{l=1}^{L} \omega_{l,h_t} \delta_l(k_t) \right\} \), can be expressed in the state-specific form, \( \prod_{r=1}^{R} \left\{ \prod_{\{t: h_t = r\}} \mathrm{N}(z_t; \tilde{\theta}_{k_t, r}) \left\{ \sum_{l=1}^{L} \omega_{l,r} \delta_l(k_t) \right\} \right\} \). To explore
the full posterior, we develop an MCMC approach that combines Gibbs sampling steps
for parameters in φ with forward-backward sampling for the state vector h. We discuss
the latter next, deferring to Appendix B the details of the Gibbs sampler for all other
parameters.
As discussed above, sampling the truncated random mixing distribution \( G_r^L \equiv (\omega_r, \tilde{\theta}_r) \) for each state r enables use of forward-backward recursive sampling for the posterior full conditional distribution, \( \Pr(h \mid \phi, D) \). Note that this conditional distribution can be written, in general, as \( \Pr(h_T \mid \phi, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid \{h_{T-t+1}, ..., h_T\}, \phi, D) \), whereas under the hidden Markov model structure it simplifies to
\[
\Pr(h \mid \phi, D) = \Pr(h_T \mid \phi, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid h_{T-t+1}, \phi, \{z_1, ..., z_{T-t+1}\}). \tag{5}
\]
Hence, the state vector can be updated as a block in each MCMC iteration by sampling
from each component in (5).
To this end, the forward-backward sampling scheme begins by recursively calculating the forward matrices \( F^{(t)} \), for t = 2, ..., T, where \( F^{(t)}_{r,s} = \Pr(h_{t-1} = r, h_t = s \mid \phi, \{z_1, ..., z_t\}) \), for r, s = 1, ..., R. Thus, \( F^{(t)} \) defines the joint distribution for \( (h_{t-1}, h_t) \) given model parameters and data up to time t. For t = 3, ..., T, \( F^{(t)} \) is obtained from \( F^{(t-1)} \) through the following recursive calculation:
\[
\begin{aligned}
F^{(t)}_{r,s} &\propto \Pr(h_{t-1} = r, h_t = s, z_t \mid \phi, \{z_1, ..., z_{t-1}\}) \\
&= \Pr(h_t = s \mid h_{t-1} = r, \phi)\, \Pr(z_t \mid h_t = s, \phi)\, \Pr(h_{t-1} = r \mid \phi, \{z_1, ..., z_{t-1}\}) \\
&= Q_{r,s} \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde{\theta}_{l,s}) \sum_{i=1}^{R} \Pr(h_{t-2} = i, h_{t-1} = r \mid \phi, \{z_1, ..., z_{t-1}\}) \\
&= Q_{r,s} \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde{\theta}_{l,s}) \sum_{i=1}^{R} F^{(t-1)}_{i,r}, \tag{6}
\end{aligned}
\]
where the proportionality constant is obtained from \( \sum_{r=1}^{R} \sum_{s=1}^{R} F^{(t)}_{r,s} = 1 \). For t = 2, a similar calculation yields \( F^{(2)}_{r,s} \propto Q_{r,s} \sum_{l=1}^{L} \omega_{l,s} \mathrm{N}(z_2; \tilde{\theta}_{l,s}) \sum_{l=1}^{L} \omega_{l,r} \mathrm{N}(z_1; \tilde{\theta}_{l,r}) \), where, again, the proportionality constant results from \( \sum_{r=1}^{R} \sum_{s=1}^{R} F^{(2)}_{r,s} = 1 \).
Next, exploiting the form in (5), the (stochastic) backward sampling step begins by
drawing \( h_T \) according to \( \Pr(h_T = r \mid \phi, D) = \sum_{i=1}^{R} \Pr(h_{T-1} = i, h_T = r \mid \phi, D) = \sum_{i=1}^{R} F^{(T)}_{i,r} \), for r = 1, ..., R. Sampling from (5) is then completed by drawing for each t = T − 1, T − 2, ..., 1 from \( \Pr(h_t = r \mid h_{t+1}, \phi, \{z_1, ..., z_{t+1}\}) \propto \Pr(h_t = r, h_{t+1} \mid \phi, \{z_1, ..., z_{t+1}\}) = F^{(t+1)}_{r, h_{t+1}} \), for r = 1, ..., R, where the proportionality constant arises from \( \sum_{r=1}^{R} F^{(t+1)}_{r, h_{t+1}} \).
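The recursion (6) and the backward draws translate directly into code. The following Python sketch (our own, not from the paper) takes a precomputed T × R matrix dens with entries \( \mathrm{dens}[t, r] = \sum_{l} \omega_{l,r} \mathrm{N}(z_t; \tilde{\theta}_{l,r}) \) and the transition matrix Q; the uniform prior over \( h_1 \) is absorbed into the normalization:

    import numpy as np

    def sample_state_vector(dens, Q, rng=np.random.default_rng()):
        """Forward-backward draw of h from Pr(h | phi, D), eq. (5)."""
        T, R = dens.shape
        F = np.zeros((T, R, R))                            # F[t-1] stores F^{(t)}
        F[1] = Q * dens[1][None, :] * dens[0][:, None]     # t = 2 case
        F[1] /= F[1].sum()
        for t in range(2, T):                              # recursion (6)
            F[t] = Q * dens[t][None, :] * F[t - 1].sum(axis=0)[:, None]
            F[t] /= F[t].sum()
        h = np.zeros(T, dtype=int)
        p = F[T - 1].sum(axis=0)                           # Pr(h_T = r | phi, D)
        h[T - 1] = rng.choice(R, p=p / p.sum())
        for t in range(T - 2, -1, -1):                     # stochastic backward step
            p = F[t + 1][:, h[t + 1]]
            h[t] = rng.choice(R, p=p / p.sum())
        return h

For model M2 the same routine applies after multiplying dens[t, r] by \( p(u_t \mid \gamma_r) \), as described in Section 3.3.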
3.2 Inference and Forecasting for Regression Relationships
The posterior samples for the truncated DP parameters, {(ωl,r , (µ̃l,r , Σ̃l,r )) : l = 1, ..., L},
for each state r = 1, ..., R can be used to develop inference for the state-specific regres-
sions. In particular, conditional on the posterior draw for the state-specific mixing distribution \( G_r^L \), the posterior realization for the conditional response density, \( f(y \mid x; G_r) \),
corresponding to state r is
\[
f(y \mid x; G_r^L) = \frac{f(x, y; G_r^L)}{f(x; G_r^L)} = \frac{\sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x, y; \tilde{\mu}_{l,r}, \tilde{\Sigma}_{l,r})}{\sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x; \tilde{\mu}^{x}_{l,r}, \tilde{\Sigma}^{xx}_{l,r})} \tag{7}
\]
for any specified value (x, y).
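A direct evaluation of (7) in Python (our own sketch, reusing the block-partition convention introduced in Section 2.2; w, mu, Sigma hold one posterior draw of the truncated DP parameters for state r):

    import numpy as np
    from scipy.stats import multivariate_normal

    def cond_density(y, x, w, mu, Sigma, dx):
        """f(y | x; G^L_r) as in (7): ratio of joint and x-marginal normal mixtures."""
        z = np.append(x, y)
        num = sum(wl * multivariate_normal.pdf(z, mean=ml, cov=Sl)
                  for wl, ml, Sl in zip(w, mu, Sigma))
        den = sum(wl * multivariate_normal.pdf(x, mean=ml[:dx], cov=Sl[:dx, :dx])
                  for wl, ml, Sl in zip(w, mu, Sigma))
        return num / den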
In addition, the structure of conditional moments for the normal mixture kernel
enables posterior sampling of the state-specific conditional mean regression functions
without having to compute the corresponding conditional density. Specifically,
\[
\mathrm{E}\left[ Y \mid x; G_r^L \right] = \frac{1}{f(x; G_r^L)} \sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x; \tilde{\mu}^{x}_{l,r}, \tilde{\Sigma}^{xx}_{l,r}) \left\{ \tilde{\mu}^{y}_{l,r} + \tilde{\Sigma}^{yx}_{l,r} (\tilde{\Sigma}^{xx}_{l,r})^{-1} (x - \tilde{\mu}^{x}_{l,r}) \right\},
\]
which, evaluated over a grid in x, yields posterior realizations of the conditional mean
regression function for each state.
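The conditional mean can be computed analogously (again our own sketch; evaluating it over a grid of x values yields one posterior realization of the regression function per MCMC draw):

    import numpy as np
    from scipy.stats import multivariate_normal

    def cond_mean(x, w, mu, Sigma, dx):
        """E[Y | x; G^L_r]: kernel conditional means, weighted by each component's
        weight times its x-marginal density at x."""
        num = den = 0.0
        for wl, ml, Sl in zip(w, mu, Sigma):
            wx = wl * multivariate_normal.pdf(x, mean=ml[:dx], cov=Sl[:dx, :dx])
            m_cond = ml[dx] + Sl[dx, :dx] @ np.linalg.solve(Sl[:dx, :dx], x - ml[:dx])
            num += wx * m_cond
            den += wx
        return num / den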
Moreover, of interest is prediction in future years (forecasting) for the joint response-covariate distribution and the corresponding implied conditional regression relationship. Illustrating with year T + 1, the full model that includes the future covariate-response vector \( (x_{T+1}, y_{T+1}) \) and corresponding regime state \( h_{T+1} \) can be expressed as
\[
\Pr((x_{T+1}, y_{T+1}), h_{T+1}, \phi, h \mid D) = \Pr(\phi, h \mid D)\, Q_{h_T, h_{T+1}} \sum_{l=1}^{L} \omega_{l, h_{T+1}}\, \mathrm{N}(x_{T+1}, y_{T+1}; \tilde{\theta}_{l, h_{T+1}}).
\]
Hence, the posterior samples for (φ, h) along with draws for the new regime state
hT +1 , driven by Q and hT , can be used to estimate the joint posterior forecast density
Pr(xT +1 , yT +1 | D). More generally, using the posterior samples for (φ, h) and hT +1 , we
obtain posterior realizations for the conditional response density in year T + 1 through
\( f(y \mid x; G^L_{h_{T+1}}) = f(x, y; G^L_{h_{T+1}})/f(x; G^L_{h_{T+1}}) \). Note that, in contrast to (7), these realizations incorporate posterior uncertainty in \( h_{T+1} \). This type of inference is illustrated
with the data example of Section 4.
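At each MCMC iteration, the forecast step is just a transition draw followed by the conditional density evaluation above; a sketch (the names params, x_new, and y_grid are hypothetical placeholders for the current posterior draw and the prediction inputs):

    # params[r] = (w, mu, Sigma) for state r at the current MCMC iteration;
    # h_T and Q are the current draws of the last state and transition matrix.
    h_next = rng.choice(R, p=Q[h_T])           # regime for year T + 1
    w, mu, Sigma = params[h_next]
    f_draw = [cond_density(y, x_new, w, mu, Sigma, dx) for y in y_grid]
    # averaging f_draw across iterations estimates the forecast conditional
    # density, with uncertainty about h_{T+1} integrated out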
3.3 Extension to External Covariates
Posterior inference under model M2, discussed in Section 2.3, can be implemented
with a straightforward extension of the MCMC algorithm of Section 3.1. The parameters γ can be sampled conditional on only u and the state vector h. Regarding the
other model parameters, only the MCMC draws that involve h need to be altered. In
particular, the starting point is again an expression analogous to (5) for the posterior
full conditional for h. Specifically, \( \Pr(h \mid \phi, \gamma, D) = \Pr(h_T \mid \phi, \gamma, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid h_{T-t+1}, \phi, \gamma, \{(z_\ell, u_\ell) : \ell = 1, ..., T-t+1\}) \). Note that now the data vector D comprises
{(zt , ut ) : t = 1, ..., T }. For t = 3, ..., T , the recursive calculation of (6) for the forward
matrices becomes
\[
F^{(t)}_{r,s} \propto Q_{r,s}\, p(u_t \mid \gamma_s) \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde{\theta}_{l,s}) \sum_{i=1}^{R} F^{(t-1)}_{i,r},
\]
with the proportionality constant obtained from \( \sum_{r=1}^{R}\sum_{s=1}^{R} F^{(t)}_{r,s} = 1 \). Moreover, \( F^{(2)}_{r,s} \propto Q_{r,s}\, p(u_2 \mid \gamma_s)\, p(u_1 \mid \gamma_r) \sum_{l=1}^{L} \omega_{l,s} \mathrm{N}(z_2; \tilde{\theta}_{l,s}) \sum_{l=1}^{L} \omega_{l,r} \mathrm{N}(z_1; \tilde{\theta}_{l,r}) \), where \( \sum_{r=1}^{R}\sum_{s=1}^{R} F^{(2)}_{r,s} = 1 \). Finally, the backward sampling step proceeds as described in Section 3.1
using probabilities from the forward matrices \( F^{(T)}, F^{(T-1)}, ..., F^{(2)} \).
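In code, this modification only rescales the observation densities before the same forward-backward routine runs (a sketch continuing the earlier one; u is the vector of external covariate values, gamma the state-specific means, and sd the standard deviation implied by the variance \( \tau^{-2} \)):

    from scipy.stats import norm

    # dens_M2[t, s] = p(u_t | gamma_s) * sum_l w_{l,s} N(z_t; theta_{l,s})
    dens_M2 = dens * norm.pdf(u[:, None], loc=gamma[None, :], scale=sd)
    h = sample_state_vector(dens_M2, Q, rng)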
4 Analysis of Stock-Recruitment Relationships Under Environmental Regime Shifts
The relationship between the number of mature individuals of a species (stock) and the
production of offspring (recruitment) is fundamental to the behavior of any ecological
system. This has special relevance in fisheries research, where the stock-recruitment
relationship applies directly to decision problems of fishery management with serious
policy implications (e.g., Quinn and Deriso 1999). A standard ecological modeling assumption holds that as stock abundance increases, successful recruitment per individual
(reproductive success) decreases. However, a wide variety of factors will influence this
reproductive relationship and there are many competing models for the influence of
biological and physical mechanisms. Munch et al. (2005) present an overview of the literature on parametric modeling for stock-recruitment functions, arguing for the utility
of standard semiparametric Gaussian process regression modeling. In the same spirit,
albeit under the more general DP mixture modeling framework developed in Section 2,
our focus is to allow flexible regression to capture the nature of recruitment dependence
upon stock without making parametric assumptions for either the stock-recruitment
function or the errors around it.
An added complexity in studying stock-recruitment relationships is introduced by
ecosystem regime switching. It has been observed that rapid shifts in the ecosystem
state can occur, during which biological relationships, such as that between stock and
recruitment, will undergo major change. This has been observed in the North Pacific in
particular (McGowan et al. 1998; Hare and Mantua 2000). Although empirical evidence
of regime shifts is well documented and there have been attempts to establish mechanisms for the effect of this switching on stock-recruitment (e.g., Jacobson et al. 2005),
the relationship between the physical effects of regime shifts and their biological manifestation is still unclear. This presents an ideal setting for Markov-dependent switching
regression models due to their ability to link observed processes that occur on different scales (in this case, biological and physical) and are correlated in an undetermined
manner.
To illustrate our Markov switching regression models, we use data on annual stock
and recruitment for Japanese sardine from years 1951 to 1990.

Figure 1: The left panel plots the data with the regime allocation from Wada and Jacobson (1998). The right panel includes draws from the bivariate normal distribution, which, under each regime, is defined by the marginal mean and covariance matrix for the location of a single DP mixture component (see Section 4 for details). In both panels, black and grey color indicate the unfavorable and favorable regime, respectively. (Axes: log egg production versus log reproductive success; data points are labeled by year.)

Wada and Jacobson (1998) use modeling of catch abundance and egg count samples to estimate R, the successful recruits of age less than one (in multiples of \(10^6\) fish). With estimated annual egg production E (in multiples of \(10^{12}\) eggs) used as a proxy for stock abundance, they
investigate the relationship between log(E) and log reproductive success, log(R/E).
Japanese sardine have been observed to switch between favorable and unfavorable feeding regime states related to the North Pacific environmental regime switching discussed
above. Based upon a predetermined regime allocation (see Figure 1), Wada and Jacobson (1998) fit a linear regression relationship for log(E) vs log(R/E) within each
regime.
We consider an analysis of the Japanese sardine data using the modeling framework
developed in Section 2, which relaxes parametric (linear) regression assumptions and
allows for simultaneous estimation of regime state allocation and regime-specific stockrecruitment relationships. As in the original analysis by Wada and Jacobson (1998), this
model formulation does not take into account temporal dependence between successive
observations from the same regime. This suits the purposes of our application, but one
can envision many settings where a structured time series model is more appropriate
than the fully nonparametric approach. Although the low dimensionality of this example
is useful for illustrative purposes, the techniques will perhaps be most powerful in the
exploration of higher dimensional datasets where such temporal structure is not assumed
(an example of implied conditional regression in higher dimensions is studied in Taddy and Kottas 2009).
We first apply model M1 in (4) to the sardine data, zt = (log(Et ), log(Rt /Et )),
available for T = 40 years from 1951 to 1990, with the underlying states ht defined by
either the unfavorable or favorable feeding regime (with values 1 or 2, respectively). A
(conservative) truncation of L = 100 was used in the stick-breaking priors. Regarding
the prior hyperparameters, we set aα = 2 and bα = 0.2 in the gamma prior for α.
The prior for ψr is specified as outlined in Appendix A such that, conditional on the
prior regime allocation taken from Wada and Jacobson (1998), am1 and am2 are set
to data means (5, 3) and (5, 5) for the unfavorable and favorable regime observations,
respectively, while \( B_{m_1} \) and \( (a_{V_1} - 3)^{-1} B_{V_1} \), with diagonal (5.3, 2.6) and off-diagonal −3.1, and \( B_{m_2} \) and \( (a_{V_2} - 3)^{-1} B_{V_2} \), with diagonal (4.5, 1.4) and off-diagonal −2.0, are each set to the observed covariance matrix for the corresponding regime. The \( B_{S_r} \), for r = 1, 2, are diagonal
matrices and are specified by setting the diagonal entries of aSr BSr equal to (7.8, 7.7),
which defines one quarter of the data range. Finally, we set ν1 = ν2 = aV1 = aV2 =
aS1 = aS2 = 2(d + 1) = 6. The prior for Q is induced by a β(3, 1.5) prior for the
probability of staying in the same state, which reflects the relative rarity of regime
shifts. The data and prior allocation are shown in Figure 1 along with bivariate normal
draws based on the marginal mean and covariance matrix for the location, µr , of a single
component of the DP mixture, for each of the two regimes. Hence, the right panel of
Figure 1 shows draws from the prior expectation of the random mixing distribution
for the µr (i.e., from state-specific normal distributions with means E[µr ] = amr and
variance \( \mathrm{var}(\mu_r) = \mathrm{var}(m_r) + \mathrm{E}[V_r] = B_{m_r} + (a_{V_r} - 3)^{-1} B_{V_r} \)). Although this does not include prior uncertainty in the \( \mu_r \) due to the DP mixture, it clearly shows that the prior specification has not overly restricted the mixture components.
As described above, the sardine feeding regime is part of a larger ecosystem state
for this region of the North Pacific. The physical variables that are linked to the
ecosystem state switching can be used as external covariates for the hidden Markov
chain. Hence, to illustrate the modeling approach of Section 2.3, we choose a physical
variable as the single external covariate, specifically, the winter average Pacific decadal
oscillation (PDO) index, which is highly correlated with biological regime switching
(Hare and Mantua 2000). The PDO index provides the first principal component of an
aggregate of North Pacific sea surface temperatures. Although not directly responsible,
sea surface temperature is believed to be a proxy for mechanisms such as current flow
that control the regime switching (MacCall 2002). Therefore, with vector u comprising
winter average PDO values from 1951 to 1990, we apply model M2 to the sardine
data working with a normal PDO distribution with state-specific mean. Hence, we
assume \( u_t \mid h_t \overset{\text{ind}}{\sim} \mathrm{N}(u_t; \gamma_{h_t}, \tau^{-2}) \), with (independent) normal priors for \( \gamma = \{\gamma_1, \gamma_2\} \)
and a gamma prior for τ 2 , in particular, γ1 ∼ N(−0.44, 0.26), γ2 ∼ N(0.73, 0.26), and
\( \tau^2 \sim \Gamma(0.5, 0.125) \). The \( \gamma_r \) prior mean values are average winter PDO for two ten-year periods that are generally accepted to fall within each ecosystem regime (Hare
and Mantua 2000); the common γr prior variance is the pooled variance for these mean
estimates, and the prior median for τ −2 is chosen to provide some overlap between prior
PDO densities for each regime. Extending the MCMC algorithm of Section 3.1 to sample
the γr and τ 2 is straightforward, since their posterior full conditionals, conditional on
u and h, are given by normal and gamma distributions, respectively. The posterior means for \( \gamma_1 \) and \( \gamma_2 \) are given by −0.65 and 0.69, with 90% posterior intervals of (−0.89, −0.40) and (0.30, 1.10), respectively, and \( \tau^{-2} \) has posterior mean 0.68 with a 90% posterior interval of (0.45, 1.00).

Figure 2: Mean posterior conditional density surface for each regime. The unfavorable regime is plotted on the left panels and the favorable on the right panels. The top row corresponds to the analysis from model M1 and the bottom row to model M2, which includes PDO as an external covariate. In each panel, the grey points represent the data, i.e., the observed values for \( (\log(E_t), \log(R_t/E_t)) \), t = 1, ..., 40. (Axes: log egg production versus log reproductive success.)
Results from the analyses under the two models are presented in Figures 2 – 4. The
regime-specific posterior mean implied conditional densities, \( \mathrm{E}[f(\log(R/E) \mid \log(E); G_r^L) \mid D] \), evaluated over a 50 × 50 grid, are shown in Figure 2. These provide point estimates of the conditional relationship between stock and recruitment for each regime. Figure 3 shows the posterior mean for the state vector h as well as posterior point and interval estimates for mean regression functions, \( \mathrm{E}[\log(R/E) \mid \log(E); G_r^L] \), for each regime. The
impact of inclusion of PDO as an external variable is evident. In the absence of such
information, the observations for years 1988 – 1990 are more likely to be allocated in
the favorable regime due to the rarity of regime shifting (i.e., due to posterior realizations of Q which put a high probability on staying in the same state). However, with
the inclusion of PDO, these years are more likely to be associated with the unfavorable
regime. Also, the posterior estimates for the regime-specific mean regression curves do
not exclude the possibility of a linear mean relationship between log egg production
and log reproductive success. Hence, it is interesting to note that the more general DP mixture switching regression modeling framework provides a certain level of support to the original assumptions of Wada and Jacobson (1998).

Figure 3: The left panels show the posterior mean regime membership by year, where 0 corresponds to the unfavorable regime. The right panels include posterior point and 90% interval estimates for the conditional mean regression function under each regime (interval estimates are denoted by dashed lines for the favorable regime, and by dotted lines for the unfavorable). Also included on the right panels are the observed values for \( (\log(E_t), \log(R_t/E_t)) \), t = 1, ..., 40, denoted by the grey points. The top row corresponds to model M1 and the bottom row to model M2, which includes PDO as an external covariate. (Axes: year versus mean state allocation on the left; log egg production versus log reproductive success on the right.)
Wada and Jacobson (1998) also provide egg production and estimated recruit numbers for the years 1991 – 1995, and winter PDO is readily available. The recruit estimates after 1990 are regarded as less accurate than field data from previous years,
and for this reason they were not included in our original analysis. However, prediction
for this estimated out-of-sample data provides a useful criterion for model comparison.
Hence, repeated prediction conditional on each existing parameter state was incorporated into the MCMC algorithm. In each successive year, a regime state is drawn
conditional on the sampled regime corresponding to the previous year, and prediction
for log reproductive success is provided by the associated conditional response density
in year 199∗, \( f(\log(R/E) \mid \log(E_{199*}); G^L_{h_{199*}}) \), where 199∗ runs from 1991 to 1995.
The regime state is then resampled conditional on actual log reproductive success (i.e.,
conditional on log(R199∗ /E199∗ ) and log(E199∗ )), and the process is repeated with this
state used as the basis for the next year's prediction.

Figure 4: Predictive inference for years 1991 – 1995 (by row, moving from 1991 at top to 1995 at bottom). The left column corresponds to model M1, and the right column to model M2 with PDO as an external covariate. Each panel plots posterior mean and 90% interval estimates (solid and dashed lines, respectively) for the one-step-ahead conditional density, corresponding to log(E) values for 1991 – 1995 of [7.58, 6.51, 6.13, 4.67, 4.93]. The grey vertical lines mark the true log reproductive success for each year reported in Wada and Jacobson (1998). (Axes: log reproductive success versus conditional probability density.)

More precisely, considering model
M1, prediction for year 1991 proceeds exactly as outlined in Section 3.2. Next, for year
1992, we sample the previous regime from \( \Pr(h_{1991} = r \mid h_{1990}, G^L_1, G^L_2, E_{1991}, R_{1991}) \propto f(\log(E_{1991}), \log(R_{1991}/E_{1991}); G^L_r)\, Q_{h_{1990}, r} \), for r = 1, 2, and use the sampled state r for prediction through \( f(\log(R/E) \mid \log(E_{1992}); G^L_r) \). Prediction for years 1993 – 1995
proceeds in an analogous fashion, and the general approach is similar for prediction
under model M2. Since this occurs at each MCMC iteration, we are averaging over
uncertainty in both \( h_{199*} \) and the \( G^L_r \). The results are shown in Figure 4, and it can be
seen that the introduction of PDO as an external covariate leads to subtle changes in
conditional predictive information. In particular, the predictions for year 1991 benefit
from additional information about the regime state in this year (and in the preceding
three years), resulting in a conditional response density for model M2 that is both
more accurate and less dispersed than the one obtained under model M1. As the
first-order Markov model is only informative for relatively short-term prediction, the distributions corresponding to both models become fairly diffuse in later years. However,
model M2 assigned consistently higher one-step-ahead mean conditional probability at
the true log reproductive success values, and the average total log probability assigned
to observations from 1991 – 1995 was −8.2 for model M2 against only −8.6 for model
M1.
The inference results reported in Figure 4 illustrate the posterior variability and
non-standard shapes of the predicted conditional response densities. The quantification
of this variability as well as the capacity of the DP mixture switching regression models
to capture non-standard features of the response distribution are important aspects of
the proposed nonparametric modeling framework.
5 Conclusion
We have presented a general framework for semiparametric hidden Markov switching
regression. While the basic switching DP mixture regression methodology provides a
powerful modeling technique in its own right, we feel that it is most practically important
when combined with further parametric modeling for the effect of external covariates
on state membership. Both modeling techniques, with or without external covariates,
have been illustrated with the analysis of stock-recruitment data.
The general approach of having informative parametric modeling linked with nonparametric models through an underlying hidden stochastic process is both theoretically
appealing and practically powerful. We believe that there is great potential for such
models, since they provide an efficient way to bridge the difference in scale between two
observed processes, and the MCMC algorithms presented in this paper can be the basis
for extended techniques in other settings.
We have focused on models for switching regression, but the methodology is applicable in more general settings involving hidden Markov model structure. In particular, since the switching occurs at the level of the joint distribution for response and
covariates, the modeling approach is directly applicable to nonparametric density es-
810
Markov Switching DPM Regression
timation through DP mixtures of multivariate normals for heterogeneous populations
where switching between subpopulations occurs as a Markov chain. Furthermore, the
modeling framework can be elaborated for problems where the multivariate normal is
not a suitable choice for the DP mixture kernel. For instance, categorical covariates
can be accommodated through mixed continuous-discrete kernels. Finally, our work in
the development of the MCMC algorithm can be extended to incorporate stick-breaking
priors other than the DP.
Appendix A: Prior Specification
Here, we discuss the approach to prior specification for the hyperparameters of model
M1 developed in Section 2.2.
Our approach is motivated by a setting where prior information is available on the
state vector h, and the λr parameters of π(Qr ) are chosen based on prior expectation
for the probabilities of moving from state r to each state in a single time step. However, this prior information pertains only to the transition probabilities between states
and does not fully identify the state components. Thus, we need to provide enough
information to facilitate identification of the mixture components and ensure that the
transition probabilities defined by Q refer to the intended states. On the other hand,
the nonparametric regression is motivated by a desire to allow flexible inference about
each regression component and we thus seek a more automatic prior specification for
each ψr .
Within the framework of our DP mixture implied conditional regression, it is possible
to have each state-specific centering distribution, \( G_0(\psi_r) \), associate the densities \( \int \mathrm{N}(z; \mu, \Sigma)\, dG_r(\mu, \Sigma) \) with specific regions of the joint response-covariate space, without
putting prior information on the shape of the conditional response density or regression
curve within each region. Since the prior parameters mr and Vr control the location of
the normal kernels, the hyperparameters amr , Bmr , aVr , and BVr can be used to express
prior belief about the state-specific joint response-covariate distributions. Specifically,
assume a prior guess for the mean and covariance matrix corresponding to the population for state r, where prior information for the covariance may only be available in the
form of a diagonal matrix. Then, we can set amr equal to the prior mean, Bmr to the
prior covariance, and choose aVr and BVr such that E[Vr ] is equal to the prior covariance
(alternatively, E[Vr−1 ] can be set equal to the inverse of the prior covariance matrix and
we have observed the method to be robust to either specification). In the absence of such
prior information, one can use a data-dependent prior specification technique. Given a
prior allocation of observations expressed as the state vector \( h^\pi = (h^\pi_1, ..., h^\pi_T) \), each set
{amr , Bmr , BVr } can be specified through the mean and covariance of the data subset
{zt : hπt = r}. In particular, amr is set to the state-specific data mean and both Bmr
and \( \mathrm{E}[V_r] = (a_{V_r} - d - 1)^{-1} B_{V_r} \) are set to the state-specific data covariance. With care
taken to ensure that it does not overly restrict the component locations, this approach
provides an automatic prior specification that combines strong state allocation beliefs
with weak information about the state-specific regression functions.
For the Sr we seek only to scale the mixture components to the data, and thus we set
all the E(Sr ) = aSr BSr equal to a diagonal matrix with each diagonal entry a quarter
of the full data range for the respective dimension. The precision parameters aVr , aSr ,
and νr , for r = 1, . . . , R, are set to values slightly larger than d + 2; in practice, we have
found 2(d+1) to work well. Working with various data sets, including the one in Section
4, we have observed results to be insensitive to reasonable changes in this specification.
In particular, experimentation with a variety of choices for the matrices BSr , indicating
prior expectation of either more or less diffuse normal kernel components, resulted in
robust posterior inference.
Specification of the hyperpriors on DP precision parameters is facilitated by the role
that each αr plays in the prior distribution for the number of unique components in
the set of nr latent mixing parameters θt = (µt , Σt ) corresponding to state r. For a
given nr (i.e., conditional on h), we can use results from Antoniak (1974) to explore
properties of this prior for different αr values. For instance, the prior expected number
of unique components in the set {θt : ht = r} is approximately αr log[(nr + αr )/αr ],
and this expression may be used to guide prior intuition about the αr .
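For example, a quick computation of this approximation (our own snippet) shows how \( \alpha_r \) controls the prior expected number of distinct mixing parameters:

    import numpy as np

    def expected_unique(alpha, n):
        """Approximate prior expected number of unique theta_t in a state with
        n observations: alpha * log((n + alpha) / alpha) (Antoniak 1974)."""
        return alpha * np.log((n + alpha) / alpha)

    # with n_r = 20: expected_unique(0.5, 20) ~ 1.9, expected_unique(2.0, 20) ~ 4.8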
Appendix B: MCMC Posterior Simulation
Here, we develop the approach to MCMC posterior simulation discussed in Section 3.
Recall that the key to the finite stick-breaking algorithm is that we are able to use
forward-backward recursive sampling of the posterior conditional distribution for h as
described in Section 3.1. Gibbs sampling details for all other parameters of model (4)
are provided below.
First, for each t = 1, ..., T, \( k_t \) has a discrete posterior full conditional distribution with values in {1, ..., L} and corresponding probabilities \( \omega_{l,h_t}\, \mathrm{N}(z_t; \tilde{\theta}_{l,h_t}) / \sum_{m=1}^{L} \omega_{m,h_t}\, \mathrm{N}(z_t; \tilde{\theta}_{m,h_t}) \), for l = 1, ..., L.
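A sketch of this update (our own code, continuing the earlier sketches; params[r] is a hypothetical container holding the current \( (\omega_r, \tilde{\mu}_r, \tilde{\Sigma}_r) \) for state r):

    # discrete full conditional for the configuration variable k_t
    w_r, mu_r, Sigma_r = params[h[t]]
    p = np.array([w_r[l] * multivariate_normal.pdf(z[t], mean=mu_r[l], cov=Sigma_r[l])
                  for l in range(L)])
    k[t] = rng.choice(L, p=p / p.sum())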
For each r = 1, ..., R, the posterior full conditional distribution for \( \omega_r \) is proportional to \( P_L(\omega_r \mid 1, \alpha_r) \prod_{\{t: h_t = r\}} \sum_{l=1}^{L} \omega_{l,r}\, \delta_l(k_t) = P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} \omega_{l,r}^{M_{l,r}} \), where \( M_{l,r} = |\{t : h_t = r, k_t = l\}| \). Note that the \( P_L(\omega_r \mid 1, \alpha_r) \) prior for \( \omega_r \), defined constructively in (3), is given by
\[
P_L(\omega_r \mid 1, \alpha_r) = \alpha_r^{L-1}\, \omega_{L,r}^{\alpha_r - 1}\, (1 - \omega_{1,r})^{-1}\, (1 - (\omega_{1,r} + \omega_{2,r}))^{-1} \cdots \Big(1 - \sum_{l=1}^{L-2} \omega_{l,r}\Big)^{-1}. \tag{B.1}
\]
Recall the generalized Dirichlet distribution \( \mathrm{GD}(p; a, b) \) (Connor and Mosimann 1969) for random vector \( p = (p_1, ..., p_L) \), supported on the L-dimensional simplex, with density proportional to \( p_1^{a_1 - 1} \cdots p_{L-1}^{a_{L-1} - 1}\, p_L^{b_{L-1} - 1}\, (1 - p_1)^{b_1 - (a_2 + b_2)} \cdots (1 - (p_1 + \cdots + p_{L-2}))^{b_{L-2} - (a_{L-1} + b_{L-1})} \), where the parameters are \( a = (a_1, ..., a_{L-1}) \) and \( b = (b_1, ..., b_{L-1}) \). Then, \( P_L(\omega_r \mid 1, \alpha_r) \equiv \mathrm{GD}(\omega_r; a, b) \) with \( a = (1, ..., 1) \) and \( b = (\alpha_r, ..., \alpha_r) \). Moreover, the \( \prod_{l=1}^{L} \omega_{l,r}^{M_{l,r}} \) form is also proportional to a \( \mathrm{GD}(\omega_r; a, b) \) distribution with \( a = (M_{1,r} + 1, ..., M_{L-1,r} + 1) \) and \( b = ((L-1) + \sum_{l=2}^{L} M_{l,r}, ..., 2 + M_{L-1,r} + M_{L,r}, 1 + M_{L,r}) \). Hence, the posterior full conditional for \( \omega_r \) can be completed to a generalized Dirichlet distribution with parameters \( a = (M_{1,r} + 1, ..., M_{L-1,r} + 1) \) and \( b = (\alpha_r + \sum_{l=2}^{L} M_{l,r}, \alpha_r + \sum_{l=3}^{L} M_{l,r}, ..., \alpha_r + M_{L,r}) \). This distribution can be sampled constructively by first drawing independent \( \zeta_l \sim \beta(1 + M_{l,r}, \alpha_r + \sum_{s=l+1}^{L} M_{s,r}) \), for l = 1, ..., L − 1, and then setting \( \omega_{1,r} = \zeta_1 \); \( \omega_{l,r} = \zeta_l \prod_{s=1}^{l-1} (1 - \zeta_s) \), l = 2, ..., L − 1; and \( \omega_{L,r} = 1 - \sum_{l=1}^{L-1} \omega_{l,r} \).
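The constructive draw translates directly into code (a sketch under our notation, with M a numpy array of the counts \( M_{l,r} \) for state r):

    import numpy as np

    def draw_omega(M, alpha, rng=np.random.default_rng()):
        """Constructive draw from the generalized Dirichlet full conditional of omega_r."""
        L = len(M)
        tail = np.cumsum(M[::-1])[::-1]                   # tail[l] = sum_{s >= l} M_s
        zeta = rng.beta(1.0 + M[:-1], alpha + tail[1:])   # Beta(1 + M_l, alpha + sum_{s > l} M_s)
        w = np.empty(L)
        w[:-1] = zeta * np.concatenate(([1.0], np.cumprod(1.0 - zeta[:-1])))
        w[-1] = 1.0 - w[:-1].sum()
        return w

Setting all counts to zero recovers the prior stick-breaking draw of (3), as expected.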
Next, for each r = 1, ..., R, the posterior full conditional distribution for \( \tilde{\theta}_r \) is proportional to \( \prod_{l=1}^{L} dG_0(\tilde{\theta}_{l,r}; \psi_r) \prod_{j=1}^{n_r^*} \prod_{\{t: h_t = r,\, k_t = k_j^*\}} \mathrm{N}(z_t; \tilde{\theta}_{k_j^*, r}) \). Here, \( n_r^* \) is the number of distinct values of the \( k_t \) that correspond to the r-th state, i.e., the number of distinct \( k_t \) for \( t \in \{t : h_t = r\} \). These distinct values are denoted by \( k_j^* \), j = 1, ..., \( n_r^* \). Now, for all \( l \notin \{k_j^* : j = 1, ..., n_r^*\} \), we can draw \( \tilde{\theta}_{l,r} \) i.i.d. \( G_0(\psi_r) \). Otherwise, the posterior full conditional for \( \tilde{\theta}_{k_j^*, r} \equiv (\tilde{\mu}^*_{j,r}, \tilde{\Sigma}^*_{j,r}) \) is proportional to
\[
\mathrm{N}(\tilde{\mu}^*_{j,r}; m_r, V_r)\, W_{\nu_r}(\tilde{\Sigma}^{*-1}_{j,r}; S_r^{-1}) \prod_{\{t: h_t = r,\, k_t = k_j^*\}} \mathrm{N}(z_t; \tilde{\mu}^*_{j,r}, \tilde{\Sigma}^*_{j,r}),
\]
and can be sampled by extending the Gibbs sampler to draw from the full conditional for \( \tilde{\mu}^*_{j,r} \) and for \( \tilde{\Sigma}^{*-1}_{j,r} \). The former is normal with covariance matrix \( T_j = (V_r^{-1} + M^*_{j,r}\, \tilde{\Sigma}^{*-1}_{j,r})^{-1} \), where \( M^*_{j,r} = |\{t : h_t = r, k_t = k_j^*\}| \), and mean vector \( T_j \big( V_r^{-1} m_r + \tilde{\Sigma}^{*-1}_{j,r} \sum_{\{t: h_t = r, k_t = k_j^*\}} z_t \big) \). The latter is \( W_{\nu_r + M^*_{j,r}}\big(\cdot\,; (S_r + \sum_{\{t: h_t = r, k_t = k_j^*\}} (z_t - \tilde{\mu}^*_{j,r})(z_t - \tilde{\mu}^*_{j,r})^{T})^{-1}\big) \).
The posterior full conditional for the hyperparameters, ψr = (mr , Vr , Sr ), can be
simplified by marginalizing the joint posterior full conditional for θ̃r and ψr over all the
\( \tilde{\theta}_{l,r} \) for \( l \notin \{k_j^* : j = 1, ..., n_r^*\} \). Thus, for each r = 1, ..., R, the full conditional for \( \psi_r \) is proportional to
\[
\mathrm{N}(m_r; a_{m_r}, B_{m_r})\, W_{a_{V_r}}(V_r^{-1}; B_{V_r}^{-1})\, W_{a_{S_r}}(S_r; B_{S_r}) \prod_{j=1}^{n_r^*} \mathrm{N}(\tilde{\mu}^*_{j,r}; m_r, V_r)\, W_{\nu_r}(\tilde{\Sigma}^{*-1}_{j,r}; S_r^{-1}).
\]
Hence, \( \psi_r \) can be updated by separate draws from the posterior full conditionals for \( m_r \), \( V_r \), and \( S_r \). The full conditional for \( m_r \) is normal with covariance matrix \( B'_{m_r} = (B_{m_r}^{-1} + n_r^* V_r^{-1})^{-1} \) and mean vector \( B'_{m_r} \big( B_{m_r}^{-1} a_{m_r} + V_r^{-1} \sum_{j=1}^{n_r^*} \tilde{\mu}^*_{j,r} \big) \). The full conditional for \( V_r^{-1} \) is \( W_{n_r^* + a_{V_r}}\big(\cdot\,; (B_{V_r} + \sum_{j=1}^{n_r^*} (\tilde{\mu}^*_{j,r} - m_r)(\tilde{\mu}^*_{j,r} - m_r)^{T})^{-1}\big) \), and the full conditional for \( S_r \) is \( W_{\nu_r n_r^* + a_{S_r}}\big(\cdot\,; (B_{S_r}^{-1} + \sum_{j=1}^{n_r^*} \tilde{\Sigma}^{*-1}_{j,r})^{-1}\big) \).
Regarding the DP precision parameters, combining the \( \Gamma(a_{\alpha_r}, b_{\alpha_r}) \) prior for \( \alpha_r \) with the relevant terms from (B.1), we obtain that, for each r = 1, ..., R, the posterior full conditional for \( \alpha_r \) is a \( \Gamma(a_{\alpha_r} + L - 1,\, b_{\alpha_r} - \log(\omega_{L,r})) \) distribution.
Finally, with the Dir(Qr ; λr ) prior on each row Qr of the transition matrix Q, the
posterior full conditional for Qr is Dir(Qr ; λr + Jr ), where Jr = (Jr,1 , ..., Jr,R ) with Jr,s
denoting the number of transitions from state r to state s, which are defined by the
currently imputed state vector h.
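In code, this update only requires the transition counts from the current h (a sketch continuing the earlier ones; lam[r] is a hypothetical array holding the prior vector \( \lambda_r \)):

    # Dirichlet full conditional for each row of the transition matrix Q
    J = np.zeros((R, R))
    for t in range(1, T):
        J[h[t - 1], h[t]] += 1          # transition counts from the imputed h
    for r in range(R):
        Q[r] = rng.dirichlet(lam[r] + J[r])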
References
Antoniak, C. (1974). "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems." Annals of Statistics, 2: 1152–1174.

Berliner, L. M. and Lu, Z.-Q. (1999). "Markov switching time series models with application to a daily runoff series." Water Resources Research, 35: 523–534.

Billio, M., Monfort, A., and Robert, C. P. (1999). "Bayesian estimation of switching ARMA models." Journal of Econometrics, 93: 229–255.

Blackwell, D. and MacQueen, J. (1973). "Ferguson distributions via Pólya urn schemes." Annals of Statistics, 1: 353–355.

Chib, S. (1996). "Calculating posterior distributions and modal estimates in Markov mixture models." Journal of Econometrics, 75: 79–97.

Connor, R. and Mosimann, J. (1969). "Concepts of independence for proportions with a generalization of the Dirichlet distribution." Journal of the American Statistical Association, 64: 194–206.

Ferguson, T. (1973). "A Bayesian analysis of some nonparametric problems." Annals of Statistics, 1: 209–230.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.

Gelfand, A. E. and Kottas, A. (2002). "A Computational Approach for Full Nonparametric Bayesian Inference under Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 11: 289–305.

Goldfeld, S. M. and Quandt, R. E. (1973). "A Markov model for switching regression." Journal of Econometrics, 1: 3–16.

Hamilton, J. D. (1989). "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle." Econometrica, 57: 357–384.

Hare, S. R. and Mantua, N. J. (2000). "Empirical evidence for North Pacific regime shifts in 1977 and 1989." Progress in Oceanography, 47: 103–145.

Hughes, J. P. and Guttorp, P. (1994). "A class of stochastic models for relating synoptic atmospheric patterns to regional hydrologic phenomena." Water Resources Research, 30: 1535–1546.

Hurn, M., Justel, A., and Robert, C. P. (2003). "Estimating mixtures of regressions." Journal of Computational and Graphical Statistics, 12: 55–79.

Ishwaran, H. and James, L. (2001). "Gibbs sampling methods for stick-breaking priors." Journal of the American Statistical Association, 96: 161–173.

Ishwaran, H. and Zarepour, M. (2000). "Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models." Biometrika, 87: 371–390.

Jacobson, L. D., Bograd, S. J., Parrish, R. H., Mendelssohn, R., and Schwing, F. B. (2005). "An ecosystem-based hypothesis for climatic effects on surplus production in California sardine and environmentally dependent surplus production models." Canadian Journal of Fisheries and Aquatic Sciences, 62: 1782–1796.

MacCall, A. D. (2002). "An hypothesis explaining biological regimes in sardine-producing Pacific boundary current systems." In Climate and fisheries: interacting paradigms, scales, and policy approaches: the IRI-IPRC Pacific Climate-Fisheries Workshop, 39–42. International Research Institute for Climate Prediction, Columbia University.

McCulloch, R. E. and Tsay, R. S. (1994). "Statistical Analysis of Economic Time Series via Markov Switching Models." Journal of Time Series Analysis, 15: 523–539.

McGowan, J. A., Cayan, D. R., and Dorman, L. M. (1998). "Climate-ocean variability and ecosystem response in the Northeast Pacific." Science, 281: 210–217.

Müller, P., Erkanli, A., and West, M. (1996). "Bayesian curve fitting using multivariate Normal mixtures." Biometrika, 83: 67–79.

Munch, S. B., Kottas, A., and Mangel, M. (2005). "Bayesian nonparametric analysis of stock-recruitment relationships." Canadian Journal of Fisheries and Aquatic Sciences, 62: 1808–1821.

Neal, R. (2000). "Markov Chain Sampling Methods for Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 9: 249–265.

Quandt, R. E. and Ramsey, J. B. (1978). "Estimating Mixtures of Normal Distributions and switching regressions (with discussion)." Journal of the American Statistical Association, 73: 730–752.

Quinn, T. J. II and Deriso, R. B. (1999). Quantitative Fish Dynamics. Oxford University Press.

Robert, C., Celeux, G., and Diebolt, J. (1993). "Bayesian estimation of hidden Markov chains: a stochastic implementation." Statistics & Probability Letters, 16: 77–83.

Scott, S. (2002). "Bayesian methods for hidden Markov models: recursive computing for the 21st century." Journal of the American Statistical Association, 97: 337–351.

Sethuraman, J. (1994). "A constructive definition of Dirichlet priors." Statistica Sinica, 4: 639–650.

Shi, J. Q., Murray-Smith, R., and Titterington, D. M. (2005). "Hierarchical Gaussian process mixtures for regression." Statistics and Computing, 15: 31–41.

Shumway, R. H. and Stoffer, D. S. (1991). "Dynamic linear models with switching." Journal of the American Statistical Association, 86: 763–769.

Taddy, M. and Kottas, A. (2009). "A Bayesian Nonparametric Approach to Inference for Quantile Regression." Journal of Business and Economic Statistics, to appear.

Wada, T. and Jacobson, L. D. (1998). "Regimes and stock-recruitment relationships in Japanese sardine, 1951-1995." Canadian Journal of Fisheries and Aquatic Sciences, 55: 2455–2463.
Acknowledgments
This research is part of the first author's Ph.D. dissertation, completed at the University of California, Santa Cruz, and was supported in part by the National Science Foundation under
Award DEB-0727543. The authors thank Alec MacCall, Steve Munch, and Marc Mangel for
helpful discussions regarding the analysis of the Pacific sardine data, and two referees for useful
comments that led to an improved presentation of the material in the paper.
Bayesian Analysis (2009)
4, Number 4, pp. 817–846
A Case for Robust Bayesian Priors with
Applications to Clinical Trials
Jairo A. Fúquene∗ , John D. Cook† and Luis R. Pericchi‡
Abstract. Bayesian analysis is frequently confused with conjugate Bayesian analysis. This is particularly the case in the analysis of clinical trial data. Even
though conjugate analysis is perceived to be simpler computationally (but see
below, Berger’s prior), the price to be paid is high: such analysis is not robust with
respect to the prior, i.e. changing the prior may affect the conclusions without
bound. Furthermore, conjugate Bayesian analysis is blind with respect to the
potential conflict between the prior and the data. Robust priors, however, have
bounded influence. The prior is discounted automatically when there are conflicts
between prior information and data. In other words, conjugate priors may lead
to a dogmatic analysis while robust priors promote self-criticism since prior and
sample information are not on equal footing. The original proposal of robust priors
was made by de Finetti in the 1960s. However, the practice has not taken hold
in important areas where the Bayesian approach is making definite advances such
as in clinical trials where conjugate priors are ubiquitous.
We show here how the Bayesian analysis for simple binary binomial data, expressed in its exponential family form, is improved by employing Cauchy priors.
This requires no undue computational cost, given the advances in computation and
analytical approximations. Moreover, we introduce in the analysis of clinical trials
a robust prior originally developed by J.O. Berger that we call Berger’s prior. We
implement specific choices of prior hyperparameters that give closed-form results
when coupled with a normal log-odds likelihood. Berger’s prior yields a robust
analysis with no added computational complication compared to the conjugate
analysis. We illustrate the results with famous textbook examples and also with
a real data set and a prior obtained from a previous trial. On the formal side,
we present a general and novel theorem, the “Polynomial Tails Comparison Theorem.” This theorem establishes the analytical behavior of any likelihood function
with tails bounded by a polynomial when used with priors with polynomial tails,
such as Cauchy or Student’s t. The advantages of the theorem are that the likelihood does not have to be a location family nor exponential family distribution and
that the conditions are easily verifiable. The binomial likelihood can be handled
as a direct corollary of the result. Next, we proceed to prove a striking result: the
intrinsic prior to test a normal mean, obtained as an objective prior for hypothesis
testing, is a limit of Berger’s robust prior. This result is useful for assessments and
for MCMC computations. We then generalize the theorem to prove that Berger’s
prior and intrinsic priors are robust with normal likelihoods. Finally, we apply the results to a large clinical trial that took place in Venezuela, using prior information based on a previous clinical trial conducted in Finland.
Our main conclusion is that introducing the existing prior information in the
form of a robust prior is more justifiable simultaneously for federal agencies, researchers, and other constituents because the prior information is coherently discarded when in conflict with the sample information.
Keywords: Berger's Prior, Clinical Trials, Exponential Family, Intrinsic Prior, Parametric Robust Priors, Polynomial Tails Comparison Theorem, Robust Priors

∗ Institute of Statistics, School of Business Administration, University of Puerto Rico, San Juan, PR, jairo.a.fuquene@uprrp.edu
† Division of Quantitative Sciences, M.D. Anderson Cancer Center, University of Texas, Houston, TX, jdcook@mdanderson.org
‡ Department of Mathematics, University of Puerto Rico, San Juan, PR, lrpericchi@uprrp.edu

© 2009 International Society for Bayesian Analysis    DOI: 10.1214/09-BA431
1 Introduction
In Bayesian statistics the selection of the family of prior distributions is crucial to the
analysis of data because the conclusions depend on this selection. However, there is
little analysis of clinical trials using non-conjugate priors. It is common to report an
analysis using different conjugate priors: clinical, skeptical, and non-informative. The
precision in these priors is important and sensitivity analyses regarding the priors are
necessary. One approach to this problem is advocated by Greenhouse and Wasserman
(1995) who compute bounds on posterior expectations over an ε-contaminated class
of prior distributions. An alternative solution is proposed in Carlin and Louis (1996),
where one re-specifies the prior and re-computes the result. These authors obtain fairly
specific results for some restricted non-parametric classes of priors. Along the same
line, another alternative is the “prior partitioning” of Carlin and Sargent (1996) which
selects a suitably flexible class of priors (a non-parametric class whose members include
a quasi-unimodal, a semi-parametric normal mixture class, and the fully parametric
normal family) and identifies the priors that lead to posterior conclusions of interest.
These are (very few) proposals about what may be called “non-parametric” robustness
to the prior. The proposals in this paper are “parametric” robust Bayesian analysis,
quite distinct from the previous proposals. Some general results on parametric Bayesian
robustness are in Dawid (1973), O’Hagan (1979), Pericchi and Sansó (1995). We believe
that the main road forward for clinical trials is on the parametric side for three reasons.
First, it is more natural to represent the information given by a previous trial in terms
of parametric priors. More generally, parametric priors are easier to assess. Second, it
is far more clear how to generalize a parametric robust analysis to hierarchical modeling
than a non-parametric class of priors. Finally, non-parametric priors do not appear to
have achieved a significant impact in practice. In Gelman et al. (2008) the authors take
a somewhat similar point of view to ours. Their arguments are very applied while ours
are more theoretical and so the papers are complementary.
Clinical trials are exceptional in at least two ways. First, there is often substantial “hard” prior data. Second, there are multiple parties overseeing the analysis: researchers, statisticians, regulatory bodies such as the FDA, data and safety monitoring
boards, journal editors, etc. In this framework there are fundamental issues such as
the following. How do you assess a prior from the prior data? How do you assess how
relevant the previous data is to the current trial? By using prior information are we
J. A. Fúquene, J. D. Cook and L. R. Pericchi
819
enhancing the analysis or biasing it? Our key message in this paper is that robust priors
are a better framework to get consensus in clinical trials for the following reason:
1. Prior information may be substantial about certain characteristics, like location and scale, but it is very weak about the tails of the prior distribution.
2. The tail size is crucial in the posterior inference when there is conflict between
prior and sample.
3. The behavior of posterior inference under robust priors is superior because when
the prior information is irrelevant for the case at hand, then the prior information
is coherently and automatically discarded by Bayes’ theorem.
Conjugate light-tailed priors do not have these features and may be called “dogmatic.”
See Berger (1985) for an authoritative discussion of these issues and our example in
6.1. Of course, if all involved had unlimited time for several sensitivity analyses, the results using light-tailed priors might be acceptable. Instead, we are suggesting that Bayes' theorem should be allowed to perform the sensitivity analyses coherently, and for that heavy-tailed priors are required. A referee has pointed out that "a researcher
who had carefully constructed a prior distribution that reflected substantial available
information almost certainly would prefer for that information to be reflected in the
posterior distribution or at least for prior/data conflict to be recognized and investigated”. Certainly, if someone has a prior that they want included in the analysis, fine.
But it need not be the only prior used. There’s no harm in repeating an analysis with
several priors, and in fact it is a recommended practice to do so. Furthermore, there are
limits to how well someone can quantify their prior uncertainty, particularly far from
the center of their estimate. It is hard to imagine that someone could say that their
prior belief follows a normal distribution rather than a Student-t with, say, six degrees of
freedom. If individuals cannot specify with fidelity the tail behavior of their subjective
priors, the tail behavior should be determined by technical criteria such as robustness.
The popular normal/normal (N/N) or beta/binomial (B/B) conjugate analysis (see for example Spiegelhalter et al. (2004)) will be exposed in this article as non-robust. Workable (parametric) alternatives are available to the practitioner. For motivation, consider the posterior mean $\mu_n$ in the N/N and B/B models (see the next section): $\mu_n = (n_0+n)^{-1}(n_0\mu + n\bar{X}_n)$. Thus the mean is a convex combination of the prior expectation, $\mu$, and the data average, $\bar{X}_n$, and so the prior has unbounded influence. For example, as the prior/data location conflict $|\mu - \bar{x}|$ grows, so does $|\mu_n - \bar{x}|$, without bound. These considerations motivate the interest in non-conjugate models for Bayesian analysis of clinical trials, and more generally motivate heavy-tailed priors. (See the theorem in the next section.)
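A two-line numerical illustration of this unbounded influence (a minimal sketch of our own; the prior and data weights $n_0 = 10$, $n = 40$ and the grid of conflicts are invented for the example):

```python
import numpy as np

# Posterior mean of the conjugate N/N model: a convex combination of prior and data means.
# As the prior/data conflict grows, the posterior mean drifts from the data without bound.
n0, n = 10, 40            # hypothetical prior and data "sample sizes"
xbar = 0.0                # observed data mean
for mu in [1, 5, 25, 125]:                     # increasing conflict |mu - xbar|
    mu_n = (n0*mu + n*xbar) / (n0 + n)
    print(f"|mu - xbar| = {abs(mu - xbar):6.1f}  ->  |mu_n - xbar| = {abs(mu_n - xbar):6.2f}")
# |mu_n - xbar| grows linearly in the conflict: 0.20, 1.00, 5.00, 25.00
```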
We may employ the following heuristic: Bayesian clinical trials are not better because
they stop sampling earlier (although they often do) but because they stop intelligently,
that is the stopping is conditional on the amount of evidence. Robust priors are not
better because they have less influence (though this is true) but because they influence
in a more intelligent way: the influence of the robust prior is a function of the potential
820
Robust Bayesian Priors for Clinical Trials
conflict between prior and sample information about the region where the parameters are
most likely to live. (For more general measures of prior-sample conflict see for example,
Evans and Moshonov (2006)). In this paper we show that the Cauchy prior is robust in
two models for clinical trials. Pericchi and Smith (1992) considered the robustness of
the Student-t prior in the Student-t/normal model. We consider as a particular case the
Cauchy/normal (C/N) model for normal log-odds. Much less known, however, is the
robust property of the Cauchy prior with the binomial likelihood and more generally
for exponential family likelihoods. To prove the robustness of the Cauchy prior when
coupled with a binomial likelihood, we prove a more general result that only requires a
bound in the tail behavior of the likelihood. This novel theorem is easy to verify and is
very general. Under these conditions, when the prior and the model are in conflict, then
the prior acts “as if” it were uniform. In other words, the prior influences the analysis
only when prior information and likelihood are in broad agreement. Otherwise Bayes’
theorem effectively switches back to a uniform prior. In this paper we rely heavily on the
fact that the binomial likelihood belongs to the exponential family (though the theorem
is not limited to exponential family likelihoods) showing the robustness of the Cauchy
prior in the Cauchy/binomial (C/B) model for binary data.
Cauchy priors do not lead to analytical closed-form results, but our next suggestion
does. In his very influential book (Berger (1985)) Berger proposes a prior (called here
“Berger’s prior”). We use Berger’s prior for clinical trials analysis, assuming a prior
mean and scale suggested by previous data or by general features of the current trial. It
turns out that this gives closed-form results when coupled with a normal log-odds likelihood. We show the robustness of Berger’s prior for the Berger-prior/normal log-odds
(BP/N) model, which makes it more attractive than both the Cauchy and conjugate
priors. We also prove here a striking result: The intrinsic prior to test a normal mean of
Berger and Pericchi (1996) which is obtained as an objective prior for hypothesis testing, is also a limit of Berger’s robust prior. This result is useful for assessments and for
MCMC computations. We then generalize the Polynomial Tails Comparison theorem
to prove that Berger’s prior and intrinsic priors are robust with normal likelihoods. We
finally apply the results to massive clinical trial that took place in Venezuela, and the
prior information is taken from a previous clinical trial in Finland.
Lastly we remark that the hierarchical model is not the solution for the lack of
robustness of conjugate analysis. Quite to the contrary, the hierarchical model should
use robust priors in the hierarchy to prevent unbounded and undesirable shrinkages.
This is being studied in work in progress by M.E. Perez, and L.R. Pericchi.
This article is organized as follows. Section 2 is devoted to the Polynomial Tails
Comparison Theorem. In Section 3 we review the prior specification and posterior
moments of the C/B model. In Section 4 we examine the robustness of the Cauchy
prior in the C/B posterior model. In the Sections 3 and 4 we show the application of
the C/B model in a clinical trial. In Section 5 we describe the robustness of the C/N
and BP/N models and prove that the intrinsic prior is a limit of Berger’s priors. In
Section 6 we prove the Generalized Polynomial Tails theorem and illustrate the results
in a real and important clinical trial published in the New England Journal of Medicine.
We make some closing remarks in Section 7.
2 The Polynomial Tails Comparison Theorem
The following theorem is decidedly useful and easy to apply when determining whether
a prior is robust with respect to a likelihood.
For $\nu > 0$, define
$$t(\lambda;\mu,\nu) = \left(1 + \frac{(\lambda-\mu)^2}{\nu}\right)^{-(\nu+1)/2}.$$
Aside from a normalization constant that would cancel out in our calculations, t(λ; µ, ν)
is the PDF of a Student-t distribution with ν degrees of freedom centered at µ.
Let $f(\lambda)$ be any likelihood function such that, as $m \to \infty$,
$$\int_{|\lambda|>m} f(\lambda)\,d\lambda = O(m^{-\nu-1-\varepsilon}). \qquad (1)$$
In the application we have in mind, $f$ is a binomial likelihood function, although the result is more general. For instance, condition (1) holds for any $\nu$ for any likelihood with exponentially decreasing tails.
Denote by $\pi^T(\lambda\mid\text{data})$ and $\pi^U(\lambda\mid\text{data})$ the posterior densities employing the Student-t and the uniform prior densities respectively. Applying Bayes' rule to both densities gives, for any parameter value $\lambda_0$, the ratio
$$\frac{\pi^U(\lambda_0\mid\text{data})}{\pi^T(\lambda_0\mid\text{data})} = \frac{\int_{-\infty}^{\infty} f(\lambda)\,t(\lambda;\mu,\nu)\,d\lambda}{t(\lambda_0;\mu,\nu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda}.$$
Theorem 2.1. For fixed $\lambda_0$,
$$\lim_{\mu\to\infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\,t(\lambda;\mu,\nu)\,d\lambda}{t(\lambda_0;\mu,\nu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda} = 1. \qquad (2)$$
Proof. We will show that
$$\lim_{\mu\to\infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\,t(\lambda;\mu,\nu)\,d\lambda - t(\lambda_0;\mu,\nu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda}{t(\lambda_0;\mu,\nu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda} = 0. \qquad (3)$$
Note that the numerator can be written as
$$\int_{-\infty}^{\infty} f(\lambda)\,\bigl(t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu)\bigr)\,d\lambda.$$
We break the region of integration in the numerator into two parts, $|\lambda| < \mu^k$ and $|\lambda| > \mu^k$, for some $0 < k < 1$ that we will pick later, and show that as $\mu \to \infty$ each integral goes to zero faster than the denominator.
First consider
$$\int_{|\lambda|<\mu^k} f(\lambda)\,\bigl(t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu)\bigr)\,d\lambda. \qquad (4)$$
For every $\lambda$, there exists a $\xi$ between $\lambda$ and $\lambda_0$ such that
$$t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu) = t'(\xi;\mu,\nu)(\lambda - \lambda_0)$$
by the mean value theorem. Since $\mu \to \infty$, we can assume $\mu > \mu^k > \lambda_0$. Then
$$|t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu)| = |t'(\xi;\mu,\nu)(\lambda-\lambda_0)| = \frac{(\nu+1)\,|\xi-\mu|\,|\lambda-\lambda_0|}{\nu\left(1+\frac{(\xi-\mu)^2}{\nu}\right)^{(\nu+3)/2}} = \frac{O(\mu^{1+k})}{\Omega(\mu^{\nu+3})} = O(\mu^{k-\nu-2}).$$
[Here we use the familiar $O$ notation and the less familiar $\Omega$ notation. Just as $f = O(\mu^n)$ means that $f$ is eventually bounded above by a constant multiple of $\mu^n$, the notation $f = \Omega(\mu^n)$ means that $f$ is eventually bounded below by a constant multiple of $\mu^n$.]
As $\mu \to \infty$, the integral (4) goes to zero as $O(\mu^{k-\nu-2})$. Since $t(\lambda_0;\mu,\nu)$ is $\Omega(\mu^{-\nu-1})$, the ratio of the integral (4) to $t(\lambda_0;\mu,\nu)$ is $O(\mu^{k-1})$. Since $k < 1$, this ratio goes to zero as $\mu \to \infty$.
Next consider the remaining integral,
$$\int_{|\lambda|>\mu^k} f(\lambda)\,\bigl(t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu)\bigr)\,d\lambda. \qquad (5)$$
The term $t(\lambda;\mu,\nu) - t(\lambda_0;\mu,\nu)$ is bounded, and we assumed
$$\int_{|\lambda|>m} f(\lambda)\,d\lambda = O(m^{-\nu-1-\varepsilon}).$$
Therefore the integral (5) is $O((\mu^k)^{-\nu-1-\varepsilon}) = O(\mu^{-k(\nu+1+\varepsilon)})$. Since $t(\lambda_0;\mu,\nu)$ is $\Omega(\mu^{-\nu-1})$, the ratio of the integral (5) to $t(\lambda_0;\mu,\nu)$ is of order $O(\mu^{\nu+1-k(\nu+1+\varepsilon)})$. This term goes to zero as $\mu \to \infty$ provided $k > (\nu+1)/(\nu+1+\varepsilon)$.
Note that in particular the theorem applies when f is the likelihood function of a
binomial model with at least one success and one failure and ν = 1, i.e. a Cauchy prior.
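The theorem is also easy to check numerically. Below is a small sketch of our own (not part of the paper's computations); the binomial values $n = 20$, $X_+ = 16$ anticipate the textbook example of Section 3:

```python
import numpy as np
from scipy.integrate import quad

# Binomial likelihood in log-odds form (at least one success and one failure).
n, Xp = 20, 16
f = lambda lam: np.exp(Xp*lam - n*np.logaddexp(0.0, lam))
# Unnormalized Student-t density centered at mu; nu = 1 gives a Cauchy.
t = lambda lam, mu, nu=1.0: (1.0 + (lam - mu)**2/nu)**(-(nu + 1)/2)

lam0 = 0.0
denom = quad(f, -30, 30)[0]                # f is negligible outside (-30, 30)
for mu in [5, 20, 80, 320]:
    num = quad(lambda l: f(l)*t(l, mu), -30, 30)[0]
    print(f"mu = {mu:4d}   ratio = {num/(t(lam0, mu)*denom):.4f}")
# The ratio tends to 1: when the prior location is far from the data,
# the Cauchy prior behaves like a uniform prior.
```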
3 The Binomial Likelihood with Conjugate and Cauchy Priors
Assume a sample of size $n$ with $X_1,\ldots,X_n \sim \text{Bernoulli}(\theta)$. The binomial likelihood in its explicit exponential family form is given by
$$f(X_+\mid\lambda) \propto \exp\{X_+\lambda - n\log(1+e^\lambda)\}, \qquad (6)$$
where $X_+ = \sum_{i=1}^n X_i \sim \text{binomial}(n,\theta)$ is the number of successes in $n$ trials. Notice that for the binomial likelihood it is enough to assume that there is at least one success and one failure, i.e. $0 < X_+ < n$ (for condition (1) of the theorem of the previous section to be fulfilled for every $\nu \ge 1$), since then the binomial has exponentially decreasing tails. The natural parameter is the log-odds, $\lambda = \log(\theta/(1-\theta))$, which is the parameter to be modeled as a Cauchy variable later, for which one can make use of the theorem. If desired, a Student-t prior with more than one degree of freedom can be used, and all results apply as well. We employ the Cauchy out of "conservatism" regarding the treatment of prior information, a point shared with Gelman et al. (2008).
First we perform a conjugate analysis, expressing the beta$(a,b)$ prior, after the transformation of the parameter $\theta$ to log-odds, as
$$\pi_B(\lambda) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\left(\frac{e^\lambda}{1+e^\lambda}\right)^{a}\left(\frac{1}{1+e^\lambda}\right)^{b}, \qquad a, b > 0. \qquad (7)$$
The cumulant generating function of the prior distribution $\pi_B(\lambda)$ is given by
$$K_\lambda(t) = -\log(\Gamma(a)\Gamma(b)) + \log(\Gamma(a+t)) + \log(\Gamma(b-t)), \qquad (8)$$
hence $E_B(\lambda) = \Psi(a) - \Psi(b)$ and $V_B(\lambda) = \Psi'(a) + \Psi'(b)$, where $\Psi(\cdot)$ is the digamma function and $\Psi'(\cdot)$ is the trigamma function, both extensively tabulated in, for example, Abramowitz and Stegun (1970). The posterior distribution of the B/B model is given by
$$\pi_B(\lambda\mid X_+) = K \times \exp\{(a+X_+)\lambda - (n+a+b)\log(1+e^\lambda)\}, \qquad (9)$$
where $K = \Gamma(n+a+b)/\bigl(\Gamma(X_++a)\,\Gamma(n-X_++b)\bigr)$. On the other hand, one proposal for robust analysis of binomial data (see also the next sections for Berger's prior as an alternative) is to use a Cauchy prior for the natural parameter $\lambda$ in order to achieve robustness with respect to the prior,
$$\pi_C(\lambda) = \frac{\beta}{\pi[\beta^2 + (\lambda-\alpha)^2]}, \qquad (10)$$
with location and scale parameters $\alpha$ and $\beta$ respectively. The posterior distribution of the C/B model is
$$\pi_C(\lambda\mid X_+) = \frac{\exp\left\{X_+\lambda - n\log(1+e^\lambda) - \log\left(\beta^2 + (\lambda-\alpha)^2\right)\right\}}{m(X_+)},$$
where $m(X_+)$ is the predictive marginal. Notice that this posterior also belongs to the exponential family. One approach to approximating $m(X_+)$ is Laplace's method, refined by Tierney and Kadane (1986) for statistical applications:
$$m(X_+) \approx \sqrt{2\pi}\,\hat\sigma\,n^{-1/2}\exp\{-n h(\hat\lambda)\},$$
where $-nh(\lambda) = \log(\pi_C(\lambda)\,f(X_+\mid\lambda))$, $\hat\lambda$ is the maximizer of $-h(\lambda)$, and $\hat\sigma = [h''(\lambda)]^{-1/2}|_{\lambda=\hat\lambda}$. The accuracy is of order $O(n^{-1})$.
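A minimal sketch of this approximation in code (our own illustration; the second derivative is taken numerically rather than analytically, and the Cauchy parameters $\alpha = -1.52$, $\beta = 0.69$ are those of Example 3.1 below):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

n, Xp = 20, 16
alpha, beta = -1.52, 0.69          # Cauchy location and scale used in Example 3.1

def log_kernel(lam):
    """log(pi_C(lam) * f(X+ | lam)), i.e. -n*h(lam) in the paper's notation."""
    loglik = Xp*lam - n*np.logaddexp(0.0, lam)
    logprior = np.log(beta/np.pi) - np.log(beta**2 + (lam - alpha)**2)
    return loglik + logprior

# Laplace: m(X+) ~ sqrt(2*pi)*sigma_hat*n^(-1/2)*exp(-n*h(lam_hat)), which in terms
# of log_kernel is sqrt(2*pi/|curvature|)*exp(log_kernel at the mode).
lam_hat = minimize_scalar(lambda l: -log_kernel(l), bounds=(-10, 10), method="bounded").x
h = 1e-4
curv = (log_kernel(lam_hat + h) - 2*log_kernel(lam_hat) + log_kernel(lam_hat - h))/h**2
m_laplace = np.sqrt(2*np.pi/(-curv))*np.exp(log_kernel(lam_hat))

m_exact = quad(lambda l: np.exp(log_kernel(l)), -30, 30)[0]   # brute-force check
print(m_laplace, m_exact)          # the two agree to roughly O(1/n)
```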
Example 3.1. A Textbook Clinical Trial Example. We apply the preceding approximation, adapting an example considered in Spiegelhalter et al. (2004). Suppose that previous experience with similar compounds has suggested that a drug has a true response rate $\theta$ between 0 and 0.4, with an expectation around 0.2. For normal distributions we know that $m \pm 2s$ includes just over 95% of the probability, so if we were assuming a normal prior we might estimate $m = 0.2$ and $s = 0.1$. However, beta distributions with reasonably high $a$ and $b$ have approximately normal shape, so we take $\theta \sim \text{beta}(a=3, b=12)$. Suppose that we test the drug on 20 additional patients and observe 16 positive responses ($X_+ = 16$). Then the likelihood of the experiment is $X_+ \sim \text{binomial}(n=20,\theta)$ and the posterior in this case is $\theta\mid X_+ \sim \text{beta}(a=19, b=16)$. Our proposal is to use a Cauchy prior, $\pi_C(\lambda)$, with the same location and scale parameters as the beta prior, in order to achieve robustness with respect to the prior. For this example the location and scale are $\Psi(3)-\Psi(12) = -1.52$ and $\sqrt{\Psi'(3)+\Psi'(12)} = 0.69$ respectively. Figures 1 and 2 display a large discrepancy between the means of the prior information and the normalized likelihood (i.e. the posterior density using a uniform prior) of the data. In the B/B model the prior and the likelihood receive equal weight. The weight of the likelihood in the C/B posterior model is higher than in the B/B model, and the C/B posterior is much closer to the normalized likelihood.
Figure 1: Beta prior, normalized binomial likelihood and B/B posterior model for Example 3.1 (density against log-odds).
Figure 2: Cauchy prior, normalized binomial likelihood and C/B posterior model for Example 3.1 (density against log-odds).
The posterior moments of the natural parameter of an exponential family are considered in Pericchi et al. (1993) and Gutierrez-Peña (1997). The cumulant generating function of the posterior, $\pi_B(\lambda\mid X_+)$, in the B/B model is
$$K_{\lambda\mid X_+}(t) = \log\left(\frac{\Gamma(X_++a+t)\,\Gamma(n-X_++b-t)}{\Gamma(X_++a)\,\Gamma(n-X_++b)}\right), \qquad (11)$$
hence
$$E_B(\lambda\mid X_+) = \Psi(X_++a) - \Psi(n-X_++b), \qquad (12)$$
$$V_B(\lambda\mid X_+) = \Psi'(X_++a) + \Psi'(n-X_++b). \qquad (13)$$
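Expressions (12) and (13) are immediate to evaluate. A two-line check of our own, using the numbers of Example 3.1 ($a = 3$, $b = 12$, $n = 20$, $X_+ = 16$), reproduces the B/B entries of Table 1 below:

```python
from scipy.special import digamma, polygamma

a, b, n, Xp = 3, 12, 20, 16
print(digamma(Xp + a) - digamma(n - Xp + b))             # (12): ~0.18
print(polygamma(1, Xp + a) + polygamma(1, n - Xp + b))   # (13): ~0.12
```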
In the C/B model we need to approximate $E_C(\lambda\mid X_+)$ and $V_C(\lambda\mid X_+)$. The posterior expectation $E_C(\lambda\mid X_+)$ involves the ratio of two integrals, and the Laplace method can be used, as
$$\tilde E(\lambda\mid X_+) = \left(\frac{\sigma^*}{\hat\sigma}\right)\exp\left\{-n[h^*(\lambda^*) - h(\hat\lambda)]\right\}, \qquad (14)$$
where $-nh^*(\lambda) = \log(\lambda\,\pi_C(\lambda)\,f(X_+\mid\lambda))$, $\lambda^*$ is the maximizer of $-h^*(\lambda)$, and $\sigma^* = [h^{*\prime\prime}(\lambda)]^{-1/2}|_{\lambda=\lambda^*}$. The error in (14) is of order $O(n^{-2})$ (see Tierney and Kadane (1986)). However, in (14) we must assume that $\lambda$ does not change sign. Tierney et al. (1989) recommend adding a large constant $c$ to $\lambda$, applying Laplace's method (14), and finally subtracting the constant. We let $\tilde E_C(\lambda\mid X_+)$ and $\tilde V_C(\lambda\mid X_+)$ denote the approximate posterior expectation and posterior variance of the C/B model:
$$\tilde E_C(\lambda\mid X_+) = \tilde E(c+\lambda\mid X_+) - c, \qquad (15)$$
$$\tilde V_C(\lambda\mid X_+) = \tilde E((c+\lambda)^2\mid X_+) - [\tilde E(c+\lambda\mid X_+)]^2. \qquad (16)$$
For both functions $h(\lambda)$ and $h^*(\lambda)$ the maximum cannot be found analytically, so we use the Newton-Raphson algorithm. Here $c$ is the value of $\lambda$ such that $\pi_C(\lambda = c\mid X_+) \le 0.5\times10^{-4}$, and the starting value in the Newton-Raphson algorithm is the maximum likelihood estimator (MLE) of the natural parameter, $\hat\lambda = \log(\bar X_n/(1-\bar X_n))$.
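A sketch of the shifted Tierney-Kadane computation (15)-(16) (our own illustration; we use a generic numerical optimizer in place of hand-coded Newton-Raphson, a fixed shift $c = 5$, and the Example 3.1 inputs):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, Xp = 20, 16
alpha, beta = -1.52, 0.69

def log_kernel(lam):             # log(pi_C(lam)*f(X+ | lam)) up to a constant
    return Xp*lam - n*np.logaddexp(0.0, lam) - np.log(beta**2 + (lam - alpha)**2)

def laplace(logg, lo=-4.0, hi=8.0):
    """Laplace approximation to the integral of exp(logg); the posterior's mass
    lies well inside (lo, hi) for this example."""
    mode = minimize_scalar(lambda l: -logg(l), bounds=(lo, hi), method="bounded").x
    h = 1e-4
    curv = (logg(mode + h) - 2*logg(mode) + logg(mode - h))/h**2
    return np.sqrt(2*np.pi/(-curv))*np.exp(logg(mode))

c = 5.0                          # keeps c + lam positive where the posterior lives
denom = laplace(log_kernel)
E1 = laplace(lambda l: np.log(c + l) + log_kernel(l))/denom     # ~E(c+lam | X+)
E2 = laplace(lambda l: 2*np.log(c + l) + log_kernel(l))/denom   # ~E((c+lam)^2 | X+)
print(E1 - c, E2 - E1**2)        # (15) and (16): roughly 1.26 and 0.33 (Table 1)
```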
Result 3.1. The posterior expectations for the C/B and B/B models satisfy the following:
1. Robust result:
$$\lim_{\alpha\to\pm\infty} E_C(\lambda\mid X_+) \approx \hat\lambda + \frac{e^{2\hat\lambda}-1}{2n e^{\hat\lambda}}. \qquad (17)$$
2. Non-robust result:
$$\lim_{E_B(\lambda)\to\pm\infty} E_B(\lambda\mid X_+) = \pm\infty. \qquad (18)$$
Proof. See the Appendix. Result 3.1 is a corollary of Theorem 2.1.
Note: the limit (17) is not equal to the MLE, but is consistent with Theorem 2.1.
4 Computations with Cauchy Priors
We use weighted rejection sampling to compute the ("exact") posterior moments in the C/B model, due to its simplicity and generality for simulating draws directly from the target density $\pi_C(\lambda\mid X_+)$ (see Smith and Gelfand (1992)). In the C/B model the envelope function is the Cauchy prior. The rejection method proceeds as follows (a sketch in code follows the steps):
1. Calculate $M = f(X_+\mid\hat\lambda)$.
2. Generate $\lambda_j \sim \pi_C(\lambda)$.
3. Generate $U_j \sim \text{uniform}(0,1)$.
4. If $M U_j\,\pi_C(\lambda_j) < f(X_+\mid\lambda_j)\,\pi_C(\lambda_j)$, accept $\lambda_j$. Otherwise reject $\lambda_j$ and go to Step 2.
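A minimal sketch of the four steps (our own implementation; the fixed seed, sample size, and the Example 3.1 inputs are assumptions; the acceptance test cancels $\pi_C(\lambda_j)$ from both sides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, Xp = 20, 16
alpha, beta = -1.52, 0.69

def loglik(lam):                       # binomial likelihood in log-odds form
    return Xp*lam - n*np.logaddexp(0.0, lam)

lam_hat = np.log((Xp/n)/(1 - Xp/n))    # MLE of the natural parameter
logM = loglik(lam_hat)                 # Step 1: M = f(X+ | lam_hat)

draws = []
while len(draws) < 10_000:
    lam = alpha + beta*np.tan(np.pi*(rng.random() - 0.5))  # Step 2: Cauchy(alpha, beta)
    u = rng.random()                                       # Step 3: U ~ uniform(0,1)
    if np.log(u) + logM < loglik(lam):                     # Step 4: accept/reject
        draws.append(lam)

draws = np.array(draws)
print(draws.mean(), draws.var())       # E_sim and V_sim: roughly 1.26 and 0.33
theta = 1/(1 + np.exp(-draws))         # back to the probability scale
print(40*theta.mean())                 # predictive mean for 40 future patients: ~31
```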
It is clear that the Cauchy density is an envelope. Because it is simple to generate Cauchy-distributed samples, the method is feasible. Using Monte Carlo methods and 10,000 random samples from $\pi_C(\lambda\mid X_+)$ we compute $E_{sim}$ and $V_{sim}$. Results available from the authors show that the agreement between the Laplace approximations and the rejection algorithm is quite good for sample sizes bigger than $n = 10$. In Figures 3 to 5 we illustrate the striking qualitative difference in posterior moments, as a function of the discrepancy between prior and sample location $|\mu - \bar x|$. Figure 3 shows that the beta prior has unbounded influence and is not robust. Figures 4 and 5 display the qualitative forms of dependence of the posterior expectation and variance on the discrepancy between the prior location and the MLE using a Cauchy prior. Here $\hat\lambda = 0$, and $a$ and $b$ take various values with their sum fixed at 50. In Figures 4 and 5, the approximations (15) and (16) are shown as functions of the discrepancy. Note that (16) is non-monotonic in the discrepancy. The posterior expectation, $\tilde E_C(\lambda\mid X_+)$, is a function of the "information discount."
Figure 3: Behavior of the posterior expectation, $E_B(\lambda\mid X_+)$, in the B/B model for $n = 10$, $\hat\lambda = 0$ and $a+b = 50$.
Figure 4: Behavior of the posterior expectation, $\tilde E_C(\lambda\mid X_+)$, in the C/B model for $n = 50$, $\hat\lambda = 0$ and $a+b = 50$.
Figure 5: Behavior of the posterior variance, $V_B(\lambda\mid X_+)$ in the B/B model and $\tilde V_C(\lambda\mid X_+)$ in the C/B model, for $n = 50$, $\hat\lambda = 0$ and $a+b = 50$.
Example 4.1. Textbook Example (Continued): Moments and predictions for binary data.

$E_B(\lambda\mid X_+)$   $V_B(\lambda\mid X_+)$   $\tilde E_C(\lambda\mid X_+)$   $\tilde V_C(\lambda\mid X_+)$   MLE
0.18                     0.12                     1.26                            0.33                            1.39

Table 1: Posterior expectation and variance for the B/B and C/B models.
In Table 1 there is a large difference between the posterior mean (0.18) for the B/B model and the MLE. On the other hand, the results of the C/B model and the MLE $\hat\lambda$ are similar. The discrepancies between the expectations of the posterior models and the MLE are approximately 3.5 and 0.23 standard errors for B/B and C/B respectively. On the scale of $\theta$, the true response rate for a set of Bernoulli trials, we know that the predictive mean of the total number of successes in $m$ future trials is $E(X_m) = m\,E(\theta\mid X_+)$. If we plan to treat 40 additional patients, the predictive mean in the B/B model is $40\times0.54 \approx 22$, and in the C/B model it is $40\times0.77 \approx 31$. The estimate resulting from the C/B model is closer to the data because the MLE of $\theta$, 0.8, is closer to 0.77 than to 0.54, the estimate resulting from the B/B model. The beta prior is more "dogmatic" than the Cauchy prior, leading to non-robust results. Bayesian analysis is not dogmatic in general, but conjugate Bayesian analysis can be. This is a major selling point of robust Bayesian methods.
5 Normal Log-Odds and Berger's Prior
An alternative to the binomial likelihood is the normal likelihood in the log-odds; see Spiegelhalter et al. (2004). Pericchi and Smith (1992) showed some aspects of the robustness of the Student-t prior for a normal location parameter and provided approximations to the posterior moments in the Student-t/normal model. The Cauchy prior, as a Student-t with one degree of freedom, can be used in this context as well. However, for normal log-odds there is a robust prior that leads to a closed-form posterior and moments, a sort of "best of both worlds." Bayesians have long come to terms with the disadvantages of procedures based on conjugate priors because of the desire for closed-form results. However, Berger (1985) proposed, for the comparison of several means, a robust prior (called "Berger's prior" in this work) that gives closed-form results for coupled normal means. Berger's prior (BP) is similar to a Cauchy prior in the tails. Our proposal is an analysis based on Berger's prior that we call the BP/N posterior model. In this work the location of Berger's prior, $\pi_{BP}^{\mu}(\lambda)$, is denoted by $\mu$. This prior has the following form:
$$\pi_{BP}(\lambda) = \int_0^1 N\!\left(\lambda\,\Big|\,\mu,\ \frac{d+b}{2\nu} - d\right)\cdot\frac{1}{2\sqrt\nu}\,d\nu. \qquad (19)$$
Here $N(\lambda\mid\mu,\tau^2)$ denotes a normal density on the parameter $\lambda$ with mean $\mu$ and variance $\tau^2$, which is well-defined whenever $b \ge d$. The hyper-parameters $d$ and $b$ have to be assessed (see the end of the section for alternative assessments). We set here $b = \beta^2$ (equal to the scale of the Cauchy) and $d = \sigma^2/n$.
Suppose that $X_1,\ldots,X_n \sim \text{normal}(\lambda,\sigma^2)$, where $\sigma^2$ is assumed known and $\lambda$ is unknown. Then Berger's prior is
$$\pi_{BP}(\lambda) = \int_0^1 K \times \exp\left\{-\frac{n}{2}\left[\frac{2\nu(\lambda-\mu)^2}{\sigma^2(1-2\nu) + n\beta^2}\right]\right\}d\nu, \qquad (20)$$
where
$$K = \frac{\sqrt n}{\sqrt{4\pi\left(\sigma^2(1-2\nu) + n\beta^2\right)}}. \qquad (21)$$
Result 5.1. Suppose that $X_1,\ldots,X_n \sim \text{normal}(\lambda,\sigma^2)$, where $\sigma^2$ is assumed known and $\lambda$ is unknown. The predictive distribution of the BP/N model is
$$m(\bar X_n) = \frac{\sqrt{\sigma^2+n\beta^2}}{\sqrt{4\pi n(\bar X_n-\mu)^2}}\left[1 - \exp\left\{-\frac{n(\bar X_n-\mu)^2}{\sigma^2+n\beta^2}\right\}\right].$$
The posterior distribution of the BP/N model is
$$\pi_{BP}(\lambda\mid\bar X_n) = \frac{\pi_{BP}(\lambda)\exp\left\{-\dfrac{n(\bar X_n-\lambda)^2}{2\sigma^2}\right\}}{\dfrac{\sigma\sqrt{\sigma^2+n\beta^2}}{\sqrt{2n(\bar X_n-\mu)^2}}\left[1-\exp\left\{-\dfrac{n(\bar X_n-\mu)^2}{\sigma^2+n\beta^2}\right\}\right]}. \qquad (22)$$
The posterior expectation of the BP/N model is
$$E_{BP}(\lambda\mid\bar X_n) = \bar X_n + \frac{2\sigma^2 n(\bar X_n-\mu)^2 - 2\sigma^2(\sigma^2+n\beta^2)(f(\bar X_n)-1)}{n(\bar X_n-\mu)(\sigma^2+n\beta^2)(f(\bar X_n)-1)}, \qquad (23)$$
and the posterior variance of the BP/N model is
$$V_{BP}(\lambda\mid\bar X_n) = \frac{\sigma^2}{n} - \frac{\sigma^4}{n^2}\left\{\frac{4n^2(\bar X_n-\mu)^2 f(\bar X_n)}{(\sigma^2+n\beta^2)^2(f(\bar X_n)-1)^2}\right\} + \frac{\sigma^4}{n^2}\left\{\frac{2(\sigma^2+n\beta^2)(f(\bar X_n)-1)\left((\sigma^2+n\beta^2)(f(\bar X_n)-1)+n(\bar X_n-\mu)^2\right)}{(\sigma^2+n\beta^2)^2(f(\bar X_n)-1)^2(\bar X_n-\mu)^2}\right\}, \qquad (24)$$
where $f(\bar X_n) = \exp\left\{\dfrac{n(\bar X_n-\mu)^2}{\sigma^2+n\beta^2}\right\}$.
Proof. See Appendix.
The posterior expectation of the BP/N model satisfies
$$\lim_{\mu\to\pm\infty} E_{BP}(\lambda\mid\bar X_n) = \bar X_n; \qquad \lim_{\mu\to\bar X_n} E_{BP}(\lambda\mid\bar X_n) = \bar X_n. \qquad (25)$$
This can be shown simply using L'Hôpital's rule on the expression (23) for the posterior expectation, and it proves the robustness of Berger's prior coupled with the normal log-odds (see also Berger (1985)). We also have the following result for a Cauchy prior (as a corollary of the theorem):
$$\lim_{\alpha\to\pm\infty} E_{CN}(\lambda\mid\bar X_n) \approx \lim_{\alpha\to\pm\infty}\ \bar X_n - \frac{2\sigma^2(\bar X_n-\alpha)}{n\left(\beta^2 + (\bar X_n-\alpha)^2\right)} = \bar X_n. \qquad (26)$$
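Because (23) and (24) are closed-form, the limits in (25) can also be checked directly. A small sketch of our own, using the settings of Figures 6 and 7 below ($n = 10$, $\beta = \sigma = 1$, $\bar X_n = 0$):

```python
import numpy as np

def bpn_moments(xbar, mu, n=10, sigma2=1.0, beta2=1.0):
    """Closed-form BP/N posterior mean and variance, following (23)-(24)."""
    s = sigma2 + n*beta2
    c = xbar - mu                              # data/prior discrepancy
    f = np.exp(n*c**2/s)                       # f(xbar), defined after (24)
    E = xbar + (2*sigma2*n*c**2 - 2*sigma2*s*(f - 1))/(n*c*s*(f - 1))
    V = (sigma2/n + 2*sigma2**2/(n**2*c**2) + 2*sigma2**2/(n*s*(f - 1))
         - 4*sigma2**2*c**2*f/(s**2*(f - 1)**2))
    return E, V

for mu in [0.5, 2.0, 5.0, 20.0]:
    E, V = bpn_moments(xbar=0.0, mu=mu)
    print(f"mu = {mu:5.1f}   E = {E:+.4f}   V = {V:.4f}")
# As mu moves away, E returns to xbar and V to sigma^2/n = 0.1:
# the prior is automatically discounted, as in (25) and Figures 6-7.
```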
5.1 The Intrinsic Prior as the Limit of Berger's Prior
It is a striking sort of synthesis result that the limit (as $n \to \infty$, for $d = \sigma^2/n$) of Berger's prior is the intrinsic prior (Berger and Pericchi (1996)). Define $\eta = \lambda - \mu$ and recall the standard intrinsic prior for a normal location parameter:
$$\varphi(\eta) = \frac{1}{2\sqrt\pi}\,\frac{1-\exp(-\eta^2)}{\eta^2}, \qquad (27)$$
and extend it to a scale family by defining
$$\varphi(\eta;\sigma) = \frac{1}{\sigma}\,\varphi\!\left(\frac{\eta}{\sigma}\right).$$
Then
$$\lim_{d\to0}\pi_{BP}(\eta; b, d) = \varphi(\eta; \sqrt b),$$
as we will prove in the next section.
5.2 Bounds for Berger's prior
In this section we develop upper and lower bounds for the density of Berger's prior. We then use these bounds to prove that the limiting case of Berger's prior is the intrinsic prior. First, define
$$w(\nu; b, d) = \frac{1}{b + d - 2\nu d}.$$
We will suppress the dependence on $b$ and $d$ unless there is a need to be explicit. With this notation the integral defining Berger's prior becomes
$$\pi_{BP}(\eta) = \frac{1}{2\sqrt\pi}\int_0^1 \exp\left(-\eta^2 w(\nu)\,\nu\right) w(\nu)^{1/2}\,d\nu.$$
Next we multiply and divide by $(w\nu)'$, where the prime indicates the derivative with respect to $\nu$. Then
$$\pi_{BP}(\eta) = \frac{1}{2\sqrt\pi}\int_0^1 \exp\left(-\eta^2 w(\nu)\,\nu\right)(w(\nu)\nu)'\,\frac{w(\nu)^{1/2}}{(w(\nu)\nu)'}\,d\nu.$$
Note that
$$\frac{w(\nu)^{1/2}}{(w(\nu)\nu)'} = \frac{(b+d-2\nu d)^{3/2}}{b+d}.$$
Therefore
k1 (b, d) ≡
(b − d)3/2
w(ν)1/2
≤
≤ (b + d)1/2 ≡ k2 (b, d).
b+d
(w(ν) ν)′
It follows that
πBP (η)
=
≤
=
1
¡
¢
w(ν)1/2
dν.
exp −η 2 w(ν) ν (w(ν)ν)′
(w(ν)ν)′
0
Z 1
¡
¢
1
√
exp −η 2 w(ν) ν (w(ν)ν)′ k2 (b, d) dν.
2 π 0
µ
µ
¶¶
η2
k2 (b, d)
√ 2 1 − exp −
b−d
2 πη
1
√
2 π
Z
(28)
(29)
(30)
by computing the integral (29). Similarly, by applying the lower bound k1 (b, d) in the
integral (29) and reversing the direction of the inequality
¶¶
µ
µ
k1 (b, d)
η2
πBP (η) ≥ √ 2 1 − exp −
.
b−d
2 πη
To summarize,
k1 (b, d) ψ(η; b, d) ≤ πB (η; b, d) ≤ k2 (b, d) ψ(η; b, d)
(31)
where
³
´
η2
1 − exp − b−d
√
ψ(η; b, d) =
.
2 πη 2
√
the
Note that as d → 0 the terms k1 (b, d) and k2 (b, d) converge to b. Therefore √
upper and lower bounds on πBP (η; b, d) converge to the intrinsic prior scaled by b.
Also, these bounds suggest that one could construct an efficient accept-reject algorithm
for sampling from Berger’s prior by using the intrinsic prior as a proposal density.
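One can also avoid accept-reject altogether by sampling the mixture representation (19) directly. A minimal sketch of our own, under the observation that the mixing density of $\nu$ on $(0,1)$ is $1/(2\sqrt\nu)$, so $\nu = U^2$ for uniform $U$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_berger(n_draws, mu=0.0, b=1.0, d=0.1):
    """Draw from Berger's prior (19) via its scale-mixture-of-normals form.
    nu has density 1/(2*sqrt(nu)) on (0,1), i.e. nu = U^2 with U ~ Uniform(0,1)."""
    assert b >= d, "the mixture variance (d+b)/(2*nu) - d requires b >= d"
    nu = rng.random(n_draws)**2
    var = (d + b)/(2*nu) - d
    return mu + np.sqrt(var)*rng.standard_normal(n_draws)

draws = sample_berger(100_000, mu=0.0, b=1.0, d=0.01)
# As d -> 0, a histogram of these draws should match the intrinsic prior
# scaled by sqrt(b), consistent with the bounds (31).
print(np.quantile(draws, [0.25, 0.5, 0.75]))
```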
Notice that the intrinsic prior was obtained by a completely unrelated method (Berger and Pericchi (1996)). It was originally obtained as the implicit prior to which the arithmetic intrinsic Bayes factor converges. The intrinsic Bayes factor is derived within an approach to objective model selection. It is pleasant that it coheres with robust Bayesian reasoning. The intrinsic prior does not yield closed-form results with the normal likelihood. The next theorem generalizes the Polynomial Tails Comparison Theorem to prove that the intrinsic prior, as well as the Cauchy prior, is robust.
6 Generalized Polynomial Tails Comparison Theorem
We begin by reviewing the symbols $O$, $\Omega$, and $\Theta$ used to denote asymptotic order, extending the notation used in Section 2. We say $f(\lambda) = O(g(\lambda))$ if there exist positive constants $M$ and $C$ such that for all $\lambda > M$,
$$f(\lambda) \le C g(\lambda).$$
Similarly, we say $f(\lambda) = \Omega(g(\lambda))$ if there exist positive constants $M$ and $C$ such that for all $\lambda > M$,
$$f(\lambda) \ge C g(\lambda).$$
One could read $O$ as "eventually bounded above by a multiple of" and $\Omega$ as "eventually bounded below by a multiple of." Finally, we say $f(\lambda) = \Theta(g(\lambda))$ if $f(\lambda) = O(g(\lambda))$ and $f(\lambda) = \Omega(g(\lambda))$.
Let $f(\lambda)$ be any bounded likelihood function such that, as $m \to \infty$,
$$\int_{|\lambda|>m} f(\lambda)\,d\lambda = O(m^{-d-\varepsilon}) \qquad (32)$$
for positive constants $d$ and $\varepsilon$. In particular, note that this condition is satisfied for the binomial likelihood in logistic form as long as there has been at least one success and at least one failure observed. The condition also applies for any likelihood function with exponentially decreasing tails.
Let $p(\lambda)$ be a continuous, symmetric distribution. (The assumption of symmetry is not essential, but the distributions we are most interested in are symmetric and the assumption simplifies the presentation.) We may extend $p$ to a location-scale family as
$$p(\lambda;\mu,\sigma) = \frac{1}{\sigma}\,p\!\left(\frac{\lambda-\mu}{\sigma}\right).$$
Assume that as $\lambda \to \infty$, $p(\lambda) = \Theta(\lambda^{-d})$ and $p'(\lambda) = O(\lambda^{-d-1})$, where $p'$ is the derivative of $p$ with respect to $\lambda$. We will show later that the Student-t family, Berger's prior, and the intrinsic prior all satisfy these two conditions.
We are now ready to state and prove the generalized polynomial tails comparison theorem. Denote by $\pi^P(\lambda\mid\text{data})$ and $\pi^U(\lambda\mid\text{data})$ the posterior densities employing the prior $p(\lambda;\mu,\sigma)$ and the uniform prior respectively. Applying Bayes' rule to both densities yields, for any parameter value $\lambda_0$, the ratio
$$\frac{\pi^U(\lambda_0\mid\text{data})}{\pi^P(\lambda_0\mid\text{data})} = \frac{\int_{-\infty}^{\infty} f(\lambda)\,p(\lambda;\mu,\sigma)\,d\lambda}{p(\lambda_0;\mu,\sigma)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda}.$$
Theorem 6.1. For fixed $\lambda_0$,
$$\lim_{\mu\to\infty}\frac{\int_{-\infty}^{\infty} f(\lambda)\,p(\lambda;\mu,\sigma)\,d\lambda}{p(\lambda_0;\mu,\sigma)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda} = 1.$$
Proof. Since our only assumptions on the prior $p(\lambda;\mu,\sigma)$ involve the asymptotic order of $p$ and its derivative, and since these assumptions are not affected by a scaling factor $\sigma$, we may assume $\sigma = 1$ and drop $\sigma$ from our notation. We will show that
$$\lim_{\mu\to\infty}\frac{\int_{-\infty}^{\infty} f(\lambda)\,p(\lambda;\mu)\,d\lambda - p(\lambda_0;\mu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda}{p(\lambda_0;\mu)\int_{-\infty}^{\infty} f(\lambda)\,d\lambda} = 0. \qquad (33)$$
Note that the numerator may be written as
$$\int_{-\infty}^{\infty} f(\lambda)\,(p(\lambda;\mu) - p(\lambda_0;\mu))\,d\lambda.$$
We break the region of integration in the numerator into two parts, $|\lambda| < \mu^k$ and $|\lambda| > \mu^k$, for some $0 < k < 1$ to be chosen later, and show that as $\mu\to\infty$ each integral goes to zero faster than the denominator. First consider
$$\int_{|\lambda|<\mu^k} f(\lambda)\,(p(\lambda;\mu) - p(\lambda_0;\mu))\,d\lambda.$$
By the fact that $p(\lambda;\mu) = p(\lambda-\mu)$ and the mean value theorem, we have
$$\left|\int_{|\lambda|<\mu^k} f(\lambda)\,(p(\lambda;\mu) - p(\lambda_0;\mu))\,d\lambda\right| \le \int_{|\lambda|<\mu^k} f(\lambda)\,|p'(\xi(\lambda))|\,|\lambda-\lambda_0|\,d\lambda, \qquad (34)$$
where $-\mu^k < \lambda, \lambda_0 < \mu^k$ and each $\xi(\lambda)$ is between $\lambda-\mu$ and $\lambda_0-\mu$. Therefore $\xi(\lambda) = O(\mu)$ and $p'(\xi(\lambda)) = O(\mu^{-d-1})$. The term $|\lambda-\lambda_0|$ is $O(\mu^k)$, and so the integral (34) is $O(\mu^{k-d-1})$. The denominator of (33) is $\Omega(\mu^{-d})$, and so the contribution of (34) to (33) is $O(\mu^{k-d-1})/\Omega(\mu^{-d}) = O(\mu^{k-1})$. Since $k < 1$, this term goes to zero as $\mu\to\infty$.
Next consider
$$\int_{|\lambda|>\mu^k} f(\lambda)\,(p(\lambda;\mu) - p(\lambda_0;\mu))\,d\lambda. \qquad (35)$$
The term $p(\lambda;\mu) - p(\lambda_0;\mu)$ is bounded, and so, by the assumption on the tails of the likelihood function $f$, the integral (35) is of order $O((\mu^k)^{-d-\varepsilon}) = O(\mu^{-k(d+\varepsilon)})$. Therefore the contribution of the integral (35) to the ratio (33) is $O(\mu^{d-k(d+\varepsilon)})$, and so this term goes to 0 as $\mu\to\infty$ provided $k > d/(d+\varepsilon)$.
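The same numerical check used for Theorem 2.1 applies here with the intrinsic prior (27) in place of the Student-t; again a sketch of our own, with the Section 3 binomial values assumed:

```python
import numpy as np
from scipy.integrate import quad

n, Xp = 20, 16
f = lambda lam: np.exp(Xp*lam - n*np.logaddexp(0.0, lam))   # binomial, log-odds form

def intrinsic(lam, mu):
    """Standard intrinsic prior (27), centered at mu; -expm1(-x) = 1 - exp(-x)."""
    e2 = (lam - mu)**2
    return -np.expm1(-e2)/(2.0*np.sqrt(np.pi)*np.maximum(e2, 1e-300))

lam0 = 0.0
denom = quad(f, -30, 30)[0]
for mu in [5, 20, 80, 320]:
    num = quad(lambda l: f(l)*intrinsic(l, mu), -30, 30)[0]
    print(f"mu = {mu:4d}   ratio = {num/(intrinsic(lam0, mu)*denom):.4f}")
# The ratio approaches 1, so the intrinsic prior is robust for this likelihood too.
```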
Next we show that the Student-t family, the intrinsic prior, and Berger's prior all satisfy the conditions of the Generalized Polynomial Tails Comparison Theorem. Clearly the tails of a Student-t distribution with $\nu$ degrees of freedom are $\Theta(\lambda^{-1-\nu})$. Also, the intrinsic prior and Berger's prior are clearly $\Theta(\lambda^{-2})$ in the tails. The derivative conditions remain to be demonstrated.
The density of a Student-t is proportional to
$$\left(1+\frac{\lambda^2}{\nu}\right)^{-(\nu+1)/2}$$
and so its derivative is proportional to
$$-(1+\nu)\,\lambda\left(1+\frac{\lambda^2}{\nu}\right)^{-(3+\nu)/2},$$
which is of order $O(\lambda^{-2-\nu})$.
For the intrinsic prior, the asymptotic order is determined by the $\lambda^{-2}$ term, the factor $1-\exp(-\lambda^2)$ being essentially 1 in the tails. Therefore the asymptotic order of the derivative in the tails is $\lambda^{-3}$.
Showing that the derivative of Berger's prior has the necessary asymptotic order is more involved. By differentiating inside the integral defining Berger's prior, we have
$$\frac{d}{d\lambda}\pi_{BP}(\lambda) = -\frac{\lambda}{\sqrt\pi}\int_0^1 \exp(-w\lambda^2\nu)\,w^{3/2}\,\nu\,d\nu.$$
Next we multiply and divide by the derivative with respect to $\nu$ of $w\nu\lambda^2$ and define
$$M = \sup_{0\le\nu\le1}\frac{1}{\sqrt\pi}\,\frac{w^{3/2}}{(w\nu)'} = \frac{1}{\sqrt{\pi(b+d)}}.$$
Then
$$|\pi_{BP}'(\lambda)| \le \frac{M}{\lambda}\int_0^1 \nu\,\exp(-w\lambda^2\nu)\,(w\lambda^2\nu)'\,d\nu.$$
Next, we integrate by parts, showing that the right-hand side above equals
$$\frac{M}{\lambda}\left(\int_0^1 \exp(-w\lambda^2\nu)\,d\nu - \exp(-w\lambda^2)\right) \le \frac{M}{\lambda}\left(\int_0^1 \exp(-w\lambda^2\nu)\,d\nu\right).$$
We can show that
$$\int_0^1 \exp(-w\lambda^2\nu)\,d\nu = O(\lambda^{-2})$$
by an argument similar to that used to establish the bounds on the tails of $\pi_{BP}(\lambda)$. Therefore $\pi_{BP}'(\lambda) = O(\lambda^{-3})$, and so $\pi_{BP}$ satisfies the requirements of the Generalized Polynomial Tails Comparison Theorem.
Figures 6 and 7 display the qualitative forms of dependence of the posterior mean and variance on the discrepancy between the prior location parameter and the observed sample mean, for $n = 10$ and $\beta^2 = \sigma^2 = 1$. The posterior expectation and variance are shown as functions of the discrepancy $|\mu - \bar X_n|$. Figure 6 shows that the posterior expectations with a Cauchy prior and with Berger's prior are very similar. In both posterior models the prior has bounded influence on the posterior expectation. On the other hand, Figure 7 shows that the variances have the same qualitative form, but the variance with the Cauchy prior is smaller when $\mu$ tends to $\bar X_n$. We argue that the variance with Berger's prior is preferable to that with the Cauchy in this example. Finally, if we consider a normal prior for this analysis then the posterior variance is constant in $|\mu - \bar X_n|$ and equal to 0.09.
Figure 6: Behavior of the posterior expectation: $E_{BP}(\lambda\mid\bar X_n)$ in the BP/N, $E_{CN}(\lambda\mid\bar X_n)$ in the C/N, and $E_{NN}(\lambda\mid\bar X_n)$ in the N/N model, for $n = 10$, $\bar X_n = 0$ and $\beta = \sigma = 1$.
Figure 7: Behavior of the posterior variance: $V_{BP}(\lambda\mid\bar X_n)$ in the BP/N, $V_{CN}(\lambda\mid\bar X_n)$ in the C/N, and $V_{NN}(\lambda\mid\bar X_n)$ in the N/N model, for $n = 10$, $\bar X_n = 0$ and $\beta = \sigma = 1$.
Example 6.1. Application of the BP/N model to Example 3.1. In this example Berger's prior has $\mu = -1.52$ and $\beta = 0.63$. We must approximate the binomial likelihood by a normal distribution. For the likelihood (6), the Fisher information is $I_n(\lambda) = n e^\lambda/(1+e^\lambda)^2$. In this example $\bar X_n \sim N\!\left(\log(0.8/(1-0.8)),\ (1+e^{1.38})^2/(20 e^{1.38})\right)$, that is, $\bar X_n \sim N(1.38, 0.31)$. The posterior mean and variance of $\lambda$ for the BP/N model are $E_{BP}(\lambda\mid\bar X_n) = 1.16$ and $V_{BP}(\lambda\mid\bar X_n) = 0.33$ respectively. These results are robust and very similar to those obtained with the Cauchy prior in the C/B model.
6.1 Application: the BP/N and C/N models in a clinical trial
In this section we show an application of the C/N and BP/N models in a clinical trial.
Example 6.2. Bayesian analysis of a trial of the Rhesus Rotavirus-Based Quadrivalent Vaccine.
Reference: Pérez-Schael et al. (1997).
Study Design: Randomized, double-blind, placebo-controlled trial.
Aim of Study: To compare the rhesus rotavirus-based quadrivalent vaccine (a new drug that is highly effective in preventing severe diarrhea in developed countries) with placebo.
Outcome measure: Over approximately 19 to 20 months, episodes of gastroenteritis among infants were evaluated at the hospital. The outcome is the ratio of the odds of response (an episode of gastroenteritis) following the new treatment to the odds of response on the conventional treatment: OR < 1 therefore favors the new treatment.
Statistical Models: Approximate normal likelihood and normal prior for the logarithm of the odds ratio. The Cauchy prior and Berger's prior have the same location parameter as the normal prior, and the Cauchy scale is the same as that of the normal prior.
Prior Distribution: Based on a published trial, Joensuu et al. (1997), which showed that in Finland the vaccine had a high success rate in preventing severe rotavirus diarrhea. In that trial the primary efficacy analysis was based on children of whom 1128 received three doses of the rhesus rotavirus-based quadrivalent vaccine and 1145 received placebo. 100 episodes of gastroenteritis were severe: 8 in vaccine recipients and 92 in placebo recipients.

Event                             Vaccine   Placebo   Total
Episode of gastroenteritis              8        92     100
Non-episode of gastroenteritis       1120      1053    2173
Total                                1128      1145    2273

Table 2: Episodes of gastroenteritis in the Vaccine and Placebo groups, Finland.
Loss function or demands: None specified.
Computation/software: Conjugate normal analysis, and the C/N and BP/N models.
Evidence from study: In this randomized, double-blind, placebo-controlled trial, 2207 infants received three oral doses of the rhesus rotavirus-based quadrivalent vaccine or placebo. The following data show the episodes of gastroenteritis in Venezuela.

Event                             Vaccine   Placebo   Total
Episode of gastroenteritis             70       135     205
Non-episode of gastroenteritis       1042       960    2002
Total                                1112      1095    2207

Table 3: Episodes of gastroenteritis in the Vaccine and Placebo groups, Venezuela.
Results: We use the normal approximation for binary data on the log-odds scale, with the approximate standard error recommended in Spiegelhalter et al. (2004) for $2\times2$ tables, following their suggestion of $\sigma = 2$ as the standard deviation of one observation in the normal likelihood and N/N posterior model. In Table 4 the prior and likelihood have standard deviations of $\sigma/\sqrt{n_0} = 0.36$ and $\sigma/\sqrt{n} = 0.15$ respectively. The posterior mean for the N/N posterior model is equal to $(n_0\mu + n\bar X_n)/(n_0+n) = -0.99$. We see that the standard errors of the C/N and BP/N models are equal to that of the likelihood. The influence of the equivalent number of observations in the posterior distribution ($n_0 + n = 31 + 178 = 209$; thus the likelihood can be thought of as carrying around $178/31 \approx 6$ times as much information as the prior) over the standard error ($\sigma/\sqrt{n_0+n}$) is very high in the N/N model. The data of the current experiment (the Venezuelan experiment) dominate the C/N and BP/N models, resulting in a posterior expectation much closer to the MLE.
                     Location                               Scale
Model    Prior   Norm. likelihood   Posterior    Prior   Norm. likelihood   Posterior
N/N      -2.45        -0.73           -0.99       0.36        0.15            0.14
C/N      -2.45        -0.73           -0.76       0.36        0.15            0.15
BP/N     -2.45        -0.73           -0.76       0.36        0.15            0.15

Table 4: Exact and approximate moments of the N/N, C/N and BP/N models on the log-odds scale.
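The entries of Table 4 can be reproduced, up to rounding, from the two $2\times2$ tables. A sketch of our own (the direct use of empirical log odds ratios and the quadrature grid are our assumptions, so second decimals differ slightly from the paper's rounded effective sample sizes):

```python
import numpy as np
from scipy.integrate import quad

def log_or_se(a, b, c, d):
    """Empirical log odds ratio and its usual standard error for a 2x2 table:
    (a, b) = events/non-events on vaccine, (c, d) = the same on placebo."""
    return np.log((a/b)/(c/d)), np.sqrt(1/a + 1/b + 1/c + 1/d)

mu0, s0 = log_or_se(8, 1120, 92, 1053)     # Finland (prior):        ~ -2.5,  ~0.37
xbar, s = log_or_se(70, 1042, 135, 960)    # Venezuela (likelihood): ~ -0.74, ~0.15

# N/N: precision-weighted average; the discrepant prior keeps its full weight.
w0, w = 1/s0**2, 1/s**2
print("N/N posterior mean:", (w0*mu0 + w*xbar)/(w0 + w))              # ~ -1.0

# C/N: Cauchy prior with the same location and scale; mean by quadrature.
kern = lambda lam: np.exp(-0.5*((lam - xbar)/s)**2)/(s0**2 + (lam - mu0)**2)
Z = quad(kern, -10, 5)[0]
print("C/N posterior mean:", quad(lambda l: l*kern(l), -10, 5)[0]/Z)  # ~ -0.76
```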
The expectations of the BP/N and C/N posterior models and the MLE are approximately equal. We can see in Table 5 that the N/N, C/N and BP/N posterior models all favor the vaccine (OR < 1). However, the risk reduction in the N/N model is 63% (the estimated odds ratio is around $e^{-0.99} = 0.37$), while in the C/N and BP/N models it is around 53% (in the normalized likelihood, 52%). The credible intervals of the C/N and BP/N posterior models stay close to the data in the trial.

Model                     OR      95% Credible Interval (OR Scale)
N/N                      0.37            [0.28; 0.49]
C/N                      0.47            [0.35; 0.63]
BP/N                     0.47            [0.35; 0.63]
Normalized likelihood    0.48            [0.36; 0.65]

Table 5: Odds ratio and credible interval for each posterior model.

Finally, the discrepancies between the expectations of the posterior models and the MLE are 1.86 standard errors for N/N and 0.2 for C/N and BP/N. This case dramatically illustrates
the danger of assuming a conjugate prior as prior information in clinical trials. Figures 8 and 9 show the posterior distributions obtained from the conjugate and non-conjugate analyses. We see that the prior distribution receives more weight in the N/N model. The C/N posterior model is very similar to the normalized likelihood. In Figure 9 the posterior distributions for the C/N and BP/N models are almost the same. The results in the N/N model are suspect because the posterior mean is far from the likelihood, given the conflict between the Finnish and the Venezuelan data. Incidentally, the researchers concluded that the Finnish and the Venezuelan responses were qualitatively different, given the different levels of exposure of the children to the virus. In short, the robust analyses give a sensible answer while the conjugate analysis myopically insists that Finland and Venezuela are quite similar with respect to children's responses. On the other hand, if the two cases were indeed similar, without a drastic conflict in responses, then the robust analyses would give answers quite similar to the conjugate analysis, with high-precision conclusions. In other words, the use of robust priors makes Bayesian responses adaptive to potential conflicts between current data and previous trials.
Figure 8: Prior (Finland), normalized likelihood (Venezuela) and posterior distributions in the Bayesian analysis of a trial of the Rhesus Rotavirus-Based Quadrivalent Vaccine, for the N/N model.
Figure 9: Prior (Finland), normalized likelihood (Venezuela) and posterior distributions in the Bayesian analysis of a trial of the Rhesus Rotavirus-Based Quadrivalent Vaccine, for the C/N and BP/N models.
7 Conclusions
The issues discussed in this paper have led us to the following conclusions:
1. The Cauchy prior in the Cauchy/binomial model is robust, but the beta prior in the conjugate beta/binomial model for inference on the log-odds is not. We can use the Cauchy/binomial model in clinical trials to make robust predictions for binary data.
2. Simulation of the moments in the Cauchy/binomial model reveals that the approximation performs well for $n \ge 10$. Furthermore, we can use rejection sampling with either large or small sample sizes for exact results.
3. Berger's prior is very useful in clinical trials for robust estimation since it gives closed-form exact results (when the normal log-odds likelihood is employed), and at the same time does not have the defects of conjugate priors. It can be argued that, besides computational convenience, it is superior to the Cauchy as a robust prior, because the posterior variance does not decrease as much as with the Cauchy when the assessed prior scales are equal or close; see Figure 7. Berger's prior seems more cautious.
4. In more complex situations, with several different centers that are modeled with a hierarchical structure, the use of robust priors may be even more important. This will be explored elsewhere.
5. The use of prior information in terms of robust (and non-conjugate) priors will be much more acceptable to both researchers and regulatory agencies, because the prior cannot dominate the likelihood when the data conflict with the prior. Remember the archetypal criticism of "Bayesian" analysis: "With Bayes, you can get the results you want, by changing your prior!" This should instead say: "With conjugate Bayes, you can get the results you want, by changing your prior!"
Appendix
1 Proofs of Result 3.1
1.1 Cauchy Prior
Proof. Invoking the Polynomial Tails Comparison Theorem, we can use the uniform prior instead of the Cauchy prior when $\alpha\to\pm\infty$ for the binomial likelihood (assuming that $0 < X_+ < n$), so the generating function for the C/B model is
$$\lim_{\alpha\to\pm\infty} E_C(e^{t\lambda}\mid X_+) = \frac{\int_{-\infty}^{\infty}\exp\{X_+\lambda - n\log(1+e^\lambda) + t\lambda\}\,d\lambda}{\int_{-\infty}^{\infty}\exp\{X_+\lambda - n\log(1+e^\lambda)\}\,d\lambda}. \qquad (36)$$
After the transformation $\lambda = \log(\theta/(1-\theta))$, (36) becomes
$$\lim_{\alpha\to\pm\infty} E_C(e^{t\lambda}\mid X_+) = \frac{\Gamma(X_++t)\,\Gamma(n-X_+-t)}{\Gamma(X_+)\,\Gamma(n-X_+)}, \qquad (37)$$
hence
$$\lim_{\alpha\to\pm\infty} E_C(\lambda\mid X_+) = \Psi(X_+) - \Psi(n-X_+). \qquad (38)$$
The approximation of the digamma function (see Abramowitz and Stegun (1970)) is
$$\Psi(z) \approx \log(z) - \frac{1}{2z} - O(z^{-2}), \qquad (39)$$
hence
$$\lim_{\alpha\to\pm\infty} E_C(\lambda\mid X_+) \approx \log\left(\frac{\bar X_n}{1-\bar X_n}\right) - \frac{1}{2X_+} + \frac{1}{2(n-X_+)} - O(X_+^{-2}) + O((n-X_+)^{-2}). \qquad (40)$$
Now we show that the limit in (36) exists. Consider the following positive real-valued functions of a real variable, defined by the equations
$$F(\lambda, t) = \frac{\exp\{(X_++t)\lambda\}}{(1+e^\lambda)^n}; \quad f(\lambda) = \frac{\exp(X_+\lambda)}{(1+e^\lambda)^n}; \quad \tau(\lambda) = \frac{1}{\beta^2+\lambda^2}, \qquad (41)$$
where $X_+, n \in \mathbb{N}$, $n \ge 2$, $X_+ \ge 1$, and $\beta$ is a positive constant. We prove that the convolutions $F*\tau$ and $f*\tau$, defined respectively by the equations
$$\int_{\mathbb{R}} F(\lambda)\,\tau(\alpha-\lambda)\,d\lambda = \int_{-\infty}^{\infty}\frac{\exp\{X_+\lambda - n\log(1+e^\lambda) + t\lambda\}}{\beta^2+(\lambda-\alpha)^2}\,d\lambda \qquad (42)$$
$$\int_{\mathbb{R}} f(\lambda)\,\tau(\alpha-\lambda)\,d\lambda = \int_{-\infty}^{\infty}\frac{\exp\{X_+\lambda - n\log(1+e^\lambda)\}}{\beta^2+(\lambda-\alpha)^2}\,d\lambda \qquad (43)$$
are finite. For $\lambda \in (-\infty,\infty)$, we have
$$|F(\lambda)\,\tau(\alpha-\lambda)| = \left|\frac{\exp\{(X_++t)\lambda\}}{(1+e^\lambda)^n\left(\beta^2+(\alpha-\lambda)^2\right)}\right| \le \frac{|\exp\{(X_++t)\lambda\}|}{|\exp(n\lambda)|}\,\beta^{-2} \le \frac{\exp\{(t-s)|\lambda|\}}{\beta^2} = g(\lambda).$$
Since $|F(\lambda)\,\tau(\alpha-\lambda)|$ is dominated by the function $g(\lambda)$, and $g$ belongs to $L^1(\mathbb{R})$ if $t-s \le 0$ (where $s = n - X_+ \ge 1$), it follows that $F*\tau < \infty$. A similar argument shows
$$|f(\lambda)\,\tau(\alpha-\lambda)| \le \frac{\exp\{-s|\lambda|\}}{\beta^2}, \qquad (44)$$
and thus $f*\tau < \infty$.
1.2 Conjugate Prior
Proof. We have $E_B(\lambda)\to\infty$ as $a\to\infty$ and $E_B(\lambda)\to-\infty$ as $b\to\infty$. The approximation of the posterior expectation for the conjugate beta/binomial model is
$$E_B(\lambda\mid X_+) \approx \log\left(\frac{n\bar X_n + a}{n(1-\bar X_n) + b}\right) - \frac{1}{2(n\bar X_n + a)} + \frac{1}{2(n(1-\bar X_n) + b)} - O((n\bar X_n + a)^{-2}) + O((n(1-\bar X_n) + b)^{-2}),$$
and $E_B(\lambda\mid X_+)\to\infty$ as $a\to\infty$ and $E_B(\lambda\mid X_+)\to-\infty$ as $b\to\infty$.
2 Proof of Result 5.1 (Berger's Prior)
Proof. We make the change of variable $\eta = \lambda - \mu$. With the normal likelihood
$$f(\bar X_n\mid\eta) = \frac{\sqrt n}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{n}{2\sigma^2}\left(\eta - (\bar X_n - \mu)\right)^2\right\}, \qquad (45)$$
it follows that the predictive density satisfies the relation
$$m(\bar X_n) = \int_0^1\int_{-\infty}^{\infty} K\times\exp\left\{-\frac n2 K_2\right\}d\eta\,d\nu, \qquad (46)$$
where
$$K_2 = \left[\frac{2\nu\eta^2}{\sigma^2(1-2\nu)+n\beta^2} + \frac{1}{\sigma^2}\left(\eta - (\bar X_n-\mu)\right)^2\right]. \qquad (47)$$
The method of completing the square tells us that
$$K_2 = \frac{\sigma^2+n\beta^2}{\sigma^2\left(\sigma^2(1-2\nu)+n\beta^2\right)}\left[\eta - \frac{(\bar X_n-\mu)\left(\sigma^2(1-2\nu)+n\beta^2\right)}{\sigma^2+n\beta^2}\right]^2 + \frac{2\nu(\bar X_n-\mu)^2}{\sigma^2+n\beta^2}. \qquad (48)$$
The generating function of the posterior distribution (22) is given by
$$E_{BP}(e^{t\eta}\mid\bar X_n) = \frac{\int_0^1\int_{-\infty}^{\infty} K\times\exp\left\{-\frac n2 K_3\right\}d\eta\,d\nu}{\int_0^1\int_{-\infty}^{\infty} K\times\exp\left\{-\frac n2 K_2\right\}d\eta\,d\nu}, \qquad (49)$$
where
$$K_3 = \left[\frac{2\nu\eta^2}{\sigma^2(1-2\nu)+n\beta^2} + \frac{1}{\sigma^2}\left(\eta-(\bar X_n-\mu)\right)^2 - \frac{2t}{n}(\eta+\mu)\right]. \qquad (50)$$
Hence, the cumulant generating function of the posterior distribution (22) is given by
$$K_{\eta\mid\bar X_n}(t) \propto \log\left[1-\exp\left\{-\frac{n}{\sigma^2+n\beta^2}\left(\bar X_n-\mu+\frac{\sigma^2 t}{n}\right)^2\right\}\right] - 2\log\left(\bar X_n-\mu+\frac{\sigma^2 t}{n}\right) + \frac{n}{2\sigma^2}\left(\bar X_n-\mu+\frac{\sigma^2 t}{n}\right)^2 + t\mu. \qquad (51)$$
References
Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions. National Bureau of Standards Applied Mathematics Series, volume 46.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, second edition.
Berger, J. O. and Pericchi, L. R. (1996). "The Intrinsic Bayes Factor for Model Selection and Prediction." JASA, 91: 112–115.
Carlin, B. P. and Louis, T. A. (1996). "Identifying Prior Distributions that Produce Specific Decisions, with Application to Monitoring Clinical Trials." In Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner, 493–503. New York: Wiley.
Carlin, B. P. and Sargent, D. J. (1996). "Robust Bayesian Approaches for clinical trial monitoring." Statistics in Medicine, 15: 1093–1106.
Dawid, A. P. (1973). "Posterior expectations for large observations." Biometrika, 60: 664–667.
Evans, M. and Moshonov, H. (2006). "Checking for Prior-Data Conflict." Bayesian Analysis, 1: 893–914.
Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). "A weakly informative default prior distribution for logistic and other regression models." Annals of Applied Statistics, 2: 1360–1383.
Greenhouse, J. B. and Wasserman, L. A. (1995). "Robust Bayesian Methods for Monitoring Clinical Trials." Statistics in Medicine, 14: 1379–1391.
Gutierrez-Peña, E. (1997). "Moments for the Canonical Parameter of an Exponential Family Under a Conjugate Distribution." Biometrika, 84: 727–732.
Joensuu, J., Koskenniemi, E., Pang, X.-L., and Vesikari, T. (1997). "Randomised placebo-controlled trial of rhesus-human reassortant rotavirus vaccine for prevention of severe rotavirus gastroenteritis." The Lancet, 350: 1205–1209.
O'Hagan, A. (1979). "On outlier rejection phenomena in Bayes inference." JRSSB, 41: 358–367.
Pérez-Schael, I., Guntiñas, M. J., Pérez, M., Pagone, V., Rojas, A. M., González, R., Cunto, W., Hoshino, Y., and Kapikian, A. Z. (1997). "Efficacy of the Rhesus Rotavirus-Based Quadrivalent Vaccine in Infants and Young Children in Venezuela." The New England Journal of Medicine, 337: 1181–1187.
Pericchi, L. R. and Sansó, B. (1995). "A note on bounded influence in Bayesian analysis." Biometrika, 82(1): 223–225.
Pericchi, L. R., Sansó, B., and Smith, A. F. M. (1993). "Posterior Cumulant Relationships in Bayesian Inference Involving the Exponential Family." JASA, 88: 1419–1426.
Pericchi, L. R. and Smith, A. F. M. (1992). "Exact and Approximate Posterior Moments for a Normal Location Parameter." JRSSB, 54: 793–804.
Smith, A. F. M. and Gelfand, A. E. (1992). "Bayesian Statistics Without Tears: A Sampling-Resampling Perspective." The American Statistician, 46: 84–88.
Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. London: Wiley.
Tierney, L. and Kadane, J. B. (1986). "Accurate Approximations for Posterior Moments and Marginal Densities." JASA, 81: 82–86.
Tierney, L., Kass, R. E., and Kadane, J. B. (1989). "Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions." JASA, 84: 710–716.
Funding
National Science Foundation (DMS-0604896 to LRP).
Acknowledgments
We thank Dr. María Eglée Pérez for helpful comments and several suggestions. JF was supported by the Institute of Statistics, College of Business Administration, University of Puerto Rico-RRP, and by the M. D. Anderson Cancer Center. LRP was on sabbatical leave from the University of Puerto Rico-RRP. Detailed comments by referees and editors were most useful in preparing the final version.
Bayesian Analysis (2009)
4, Number 4, pp. 847–850
Editor-in-Chief’s Note
Bradley P. Carlin∗
∗ Department of Biostatistics, University of Minnesota, Minneapolis, MN, http://www.biostat.umn.edu/~brad
© 2009 International Society for Bayesian Analysis. DOI: 10.1214/09-BA432
This issue of Bayesian Analysis, Volume 4 Number 4, is the twelfth and final one
for which I have the privilege of serving as editor-in-chief (EiC); my three-year term
(2007-09) is drawing to a close. It’s been an enormously gratifying and enlightening
run, so I’d like to take just a few paragraphs to say thanks, describe a bit of what
we’ve accomplished as a journal in the past year or so, and mention where the journal
is headed and what challenges and opportunities it’s likely to face there.
As I type this, it is near the end of Thanksgiving Day in the United States, and
it’s impossible not to reflect on how thankful I am for the chance to serve as EiC,
and for the many dedicated men and women who do all the real work of the journal,
completely without financial compensation and at a time when ever-increasing pressures
to further improve productivity encourage one to forgo often-thankless volunteer work
like editing and refereeing whenever possible. Simply put, I am eternally grateful to
all the editors, associate editors, referees, and production staff who make each issue
possible. It is dangerous to begin naming names in situations like this, since one is sure to miss someone important, but I do want to mention a few key persons who have been around BA since the very beginning some five years ago. These include System Managing
Editor Pantelis Vlachos, Managing Editor Herbie Lee, Deputy Editor Marina Vannucci,
and Editors Philip Dawid, David Heckerman, Michael Jordan, and Fabrizio Ruggeri.
Philip and Marina have decided to step down as part of the transition, and let me
thank them at this time for their years of outstanding service. I am hopeful that most
if not all of the others are willing to continue, along with Production Editor Angelika
van der Linde and Editors Kate Cowles, David Dunson, Antonietta Mira, and Bruno
Sansó. All do wonderful work, as do all our AEs and referees. Thanks again.
2009 has been another good year for the journal. We will again publish about 850 pages, very similar to our page counts in 2007 and 2008. We submitted three consecutive issues of the journal to Thomson Reuters as evidence that we merit inclusion in their indexing systems, and in October we received the good news that BA has been accepted into the Science Citation Index Expanded (SCIE), including the Web of Science, the ISI Alerting Service, and Current Contents/Physical, Chemical and Earth Sciences (CC/PC&ES). We have been told that Thomson Reuters will have sufficient source item and citation data to compute an impact factor for the next Journal Citation Reports (JCR), which is scheduled to be published in June 2010. Getting BA on the road to an impact factor (critically important these days, especially to our European contributors) was one of my primary goals as EiC, so I’m very pleased this got done before my term expired.
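For readers unfamiliar with the arithmetic, the standard two-year impact factor for JCR year $Y$ is just a ratio (this is the generic JCR definition, not anything specific to BA’s own citation data):
\[
\mathrm{IF}_Y \;=\; \frac{C_Y(Y-1) + C_Y(Y-2)}{N_{Y-1} + N_{Y-2}},
\]
where $C_Y(y)$ denotes citations received during year $Y$ to items the journal published in year $y$, and $N_y$ denotes the number of citable items published in year $y$. Under that definition, an impact factor appearing in the June 2010 JCR (i.e., $Y = 2009$) would be driven by 2009 citations to our 2007 and 2008 volumes.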
My other primary goal was to keep the flow of interesting discussion papers going,
and here again I’ve been very pleased by the results. I have striven to select a wide
range of papers for this very visible quarterly slot, from foundations to applications
and even an occasional thought piece (such as Andrew Gelman’s “April Fools’ Day
blog” paper in Volume 3 Number 3). The current issue’s discussion paper is on a
subject near and dear to my heart (baseball), and as usual features some state-of-the-art Bayesian methods followed by a spirited question-and-answer period in the
discussions and rejoinder. We also get a nice stream of potential discussion papers
from the Case Studies in Bayesian Statistics (Carnegie Mellon) and other ISBA-related
meetings throughout the year. Some in the profession have suggested opening the papers
on the BA website to discussion by anyone, rather than permitting only a few papers to be
discussed by a few high-profile discussants selected by the EiC. I must confess I have
been hesitant to change our current system, since I like the idea of one “special” paper
per issue, plus selecting this paper and coordinating the discussions and rejoinders is
perhaps the best part of the EiC job! But it’s a good suggestion and one with which
the next EiC and editorial board may choose to grapple.
Speaking of the new EiC, I am very pleased to announce to those who missed the
email from ISBA President Mike West that it will be none other than Herbie Lee, the
current (and founding) managing editor of BA. I am delighted the search committee offered the position to Herbie, and that he accepted! It promises an especially easy transition, since Herbie’s long tenure with the journal means he needs essentially no “training” of any kind. Herbie will no doubt want to bring on some new editors and
AEs, and I’m confident the journal will remain strong under his leadership.
Of course, as I said, there will be challenges to greet Herbie and his team. One involves the online review system, which was constructed specifically for us several years ago but is now beginning to show its age. Other online review systems on the market may offer advantages over ours in terms of flexibility and extensibility in the long run. One of these is already used by our institutional partner IMS, for whom BA is an IMS Supported Journal and with whom ISBA already has a joint membership agreement and a variety of well-attended jointly sponsored conferences (including the MCMSki series, the next of which will be held January 5-7, 2011).
Expanding this IMS partnership may be most natural.
A second issue continues to be finding a reliable revenue stream to support the
journal. We now offer on-demand printing of issues for a small fee, but BA’s free online
availability has essentially precluded any significant sales revenue. Other arrangements
may include adding advertising, or even affiliating with the Berkeley Electronic Press,
which has a fairly long history of profitably running journals like ours. I happen to
know BEP is interested in seeing this happen, but whether we should surrender our
independence in such a dramatic way for a modest revenue stream is again something
Herbie and the ISBA Board will need to ponder.
On that note, I close this editorial. Thanks again for the opportunity to serve as EiC, and for your support of the journal during my tenure. While my involvement with the journal will now shrink dramatically to free up the time I need to chair my own biostatistics group at Minnesota, I will stay on as a guest editor, helping the journal process the contributed papers from the upcoming Valencia 9/2010
ISBA World Meeting in Spain this June. I look forward to seeing many of you at that
meeting, and as always, to your submissions at http://ba.stat.cmu.edu, and your
more personal thoughts and reactions via email to brad@biostat.umn.edu.
Brad Carlin
Lincoln, Nebraska
Thanksgiving 2009