Bayesian Analysis (2009) 4, Number 4, pp. 631–652
DOI: 10.1214/09-BA424
© 2009 International Society for Bayesian Analysis

Hierarchical Bayesian Modeling of Hitting Performance in Baseball

Shane T. Jensen, Blakeley B. McShane and Abraham J. Wyner
Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA

Abstract. We have developed a sophisticated statistical model for predicting the hitting performance of Major League baseball players. The Bayesian paradigm provides a principled method for balancing past performance with crucial covariates, such as player age and position. We share information across time and across players by using mixture distributions to control shrinkage for improved accuracy. We compare the performance of our model to current sabermetric methods on a held-out season (2006), and discuss both successes and limitations.

Keywords: baseball, hidden Markov model, hierarchical Bayes

1 Introduction and Motivation

There is substantial public and private interest in the projection of future hitting performance in baseball. Major league baseball teams award large monetary contracts to top free agent hitters under the assumption that they can reasonably expect that past success will continue into the future. Of course, there is an expectation that future performance will vary, but for the most part it appears that teams are often quite foolishly seduced by a fine performance over a single season. There are many questions: How should past consistency be balanced with advancing age when projecting future hitting performance?
In young players, how many seasons of above-average performance need to be observed before we consider a player to be a truly exceptional hitter? What is the effect of a single sub-par year in an otherwise consistent career? We will attempt to answer these questions through the use of fully parametric statistical models for hitting performance.

Modeling and prediction of hitting performance is an area of very active research within the quantitatively-oriented baseball community. Popular current methods include PECOTA (Silver 2003) and MARCEL (Tango 2004). PECOTA is considered a "gold-standard" tool in the sabermetrics community and its predictions are billed by Baseball Prospectus as being "deadly accurate". It is a sophisticated commercial product managed by a team of statisticians which incorporates proprietary data, minor league histories, and detailed injury reports. Since PECOTA is proprietary, we cannot say exactly what methods it uses, though we know the general method is based on matching a player's past career performance to the careers of a set of comparable major league ballplayers. For each player, the set of comparable players is found by a nearest neighbor analysis of past players (both minor and major league) with similar performance at the same age. Once a comparison set is found, the future performance prediction for the player is based on the historical performance of those past comparable players. Factors such as park effects, league effects and physical attributes of the player are also taken into account. PECOTA also makes use of substantial manual curation, both in the matching process and to introduce new information as it becomes available. We have observed that the pre-season PECOTA predictions are adjusted on a daily basis as news (e.g., injury information, pre-season performance, etc.) is released.

In contrast, our focus is on a model-based approach to prediction of hitting performance which is fully automated and based on publicly available data. Thus, a more appropriate benchmark for our analysis is MARCEL, a publicly available prediction engine based on the same freely available dataset (Lahman 2006) as our model. MARCEL is a simple two-stage system for prediction. First, MARCEL takes a weighted average of the performance of the player over the previous three years, giving more weight to the most recent seasons. Then, it shrinks this weighted average to the overall league mean based on the number of plate appearances; thus, the more data for a given player, the less shrinkage. A sketch of this two-stage logic is given below. Over several seasons, MARCEL has performed well against more elaborate competitors (Tango 2004), but should be outperformed by our principled approach. Although it is less of a fair benchmark, we will also compare with PECOTA in order to assess how well our model does against the best available proprietary commercial product.
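To make the two-stage logic concrete, the following is a minimal sketch of a MARCEL-style predictor. The season weights (5, 4, 3) and the shrinkage constant of 1200 weighted at bats are illustrative placeholders, not Tango's published values, and the actual system includes further adjustments; see Tango (2004) for the full algorithm.

    import numpy as np

    def marcel_like_hr_rate(hr, ab, league_rate, weights=(5, 4, 3), prior_ab=1200):
        """Two-stage MARCEL-style prediction of a home run rate.

        hr, ab: home runs and at bats over the previous three seasons,
        most recent season first. The weights and prior_ab are
        illustrative, not the published MARCEL constants.
        """
        w, hr, ab = (np.asarray(x, dtype=float) for x in (weights, hr, ab))
        # Stage 1: weighted average of past rates (recent seasons count more).
        past_rate = np.sum(w * hr) / np.sum(w * ab)
        # Stage 2: shrink toward the league mean; more at bats, less shrinkage.
        n_eff = np.sum(w * ab) / np.sum(w)
        return (n_eff * past_rate + prior_ab * league_rate) / (n_eff + prior_ab)

    # A hitter with 35/30/25 HR in 550/520/500 AB, with a league rate of 3%:
    print(marcel_like_hr_rate([35, 30, 25], [550, 520, 500], 0.030))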
In Section 2, we present a Bayesian hierarchical model for the evolution of hitting performance throughout the careers of individual players. Bayesian or empirical Bayes approaches have recently been used to model individual hitting events based on various within-game covariates (Quintana et al. 2008) and for prediction of within-season performance (Brown 2008). We are addressing a different question: how can we predict the course of a particular hitter's career based on the seasons of information we have observed thus far? Our model includes several covariates that are crucial for the accurate prediction of hitting for a particular player in a given year. A player's age and home ballpark certainly have an influence on his hitting; we will include this information among the covariates in our model. We will also include player position in our model, since we believe that position is an important proxy for hitting performance (e.g., second basemen have a generally lower propensity for home runs than first basemen). Finally, our model will factor the past performance of each player into future predictions.

In Section 3, we test our predictions against a held-out data set, and compare our performance with several competing methods. A major advantage of our model-based approach is the ability to move beyond the point predictions offered by other engines to the incorporation of variability via calibrated predictive intervals. We examine our results not only in terms of the accuracy of our point predictions, but also the quality of the prediction intervals produced by our model. We also investigate several other interesting aspects of our model in Section 3 and then conclude with a brief discussion in Section 4.

2 Model and Implementation

Our data come from the publicly available Lahman Baseball Database (Lahman 2006), which contains hitting totals for each major league baseball player from 1871 to the present day, though we will fit our model using only seasons from 1990 to 2005. In total, we have 10280 player-years of information from major league baseball between 1990 and 2005 that will be used for model estimation. Within each season j, we will use the following data for each player i:

1. Home Run Total: Y_ij
2. At Bat Total: M_ij
3. Age: A_ij
4. Home Ballpark: B_ij
5. Position: R_ij

As an example, Barry Bonds in 2001 had Y_ij = 73 home runs out of M_ij = 476 at bats. We excluded pitchers from our model, leaving us with nine positions: first baseman (1B), second baseman (2B), third baseman (3B), shortstop (SS), left fielder (LF), center fielder (CF), right fielder (RF), catcher (C), and designated hitter (DH). There were 46 different home ballparks used in major league baseball between 1990 and 2005. Player ages ranged between 20 and 49, though the vast majority of player ages were between 23 and 44.

2.1 Hierarchical Model for Hitting

Our outcome of interest for a given player i in a given year (season) j is his home run total Y_ij, which we model as a Binomial variable:

    Y_{ij} \sim \text{Binomial}(M_{ij}, \theta_{ij})    (1)

where θ_ij is a player- and year-specific home run rate, and M_ij is the number of opportunities (at bats) for player i in year j. Note that by using at bats as our number of opportunities, we are excluding outcomes such as walks, sacrifice flies and hit-by-pitches. We will assume that the number of opportunities M_ij is fixed and known, so we focus our efforts on modeling each home run rate θ_ij.
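For concreteness, a single player-year record and the Binomial likelihood of model (1) can be sketched as follows. The Y and M values for the Bonds example are from the text; the age, ballpark code and position attached to it here are illustrative.

    from dataclasses import dataclass
    from scipy.stats import binom

    @dataclass
    class PlayerYear:
        """One player-season record, with fields named as in Section 2."""
        Y: int    # home run total
        M: int    # at bat total
        A: int    # age
        B: str    # home ballpark
        R: str    # position

    # Barry Bonds' 2001 season (Y and M from the text; A, B, R illustrative).
    bonds_2001 = PlayerYear(Y=73, M=476, A=36, B="SFN", R="LF")

    # Binomial log-likelihood of model (1) at a candidate rate theta_ij:
    loglik = binom.logpmf(bonds_2001.Y, bonds_2001.M, p=0.12)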
The i.i.d. assumption underlying the binomial model has already been justified for hitting totals within a single season (Brown 2008), and so seems reasonable for hitting totals across an entire season. We next model each unobserved player-year rate θ_ij as a function of the home ballpark b = B_ij, position k = R_ij and age A_ij of player i in year j:

    \log\left(\frac{\theta_{ij}}{1 - \theta_{ij}}\right) = \alpha_k + \beta_b + f_k(A_{ij})    (2)

The parameter vector α = (α_1, ..., α_9) contains the position-specific intercepts for each of the nine player positions. The function f_k(A_ij) is a smooth trajectory of A_ij that is different for each position k. We allow a flexible model for f_k(·) by using a cubic B-spline (de Boor 1978) with different spline coefficients γ estimated for each position. The age trajectory component of this model involves the estimation of 36 parameters: four B-spline coefficients per position times nine positions. We call the parameter vector β the "team effects" since these parameters are shared by all players with the same team and home ballpark. However, these coefficients β cannot be interpreted as a true "ballpark effect" since they are confounded with the effect of the team playing in that ballpark. If a particular team contains many home run hitters, then that can influence the effect of their home ballpark. Separating the effect of team versus the effect of ballpark would require examining hitting data at the game level instead of the seasonal level we are using for our current model.

There are two additional aspects of hitting performance that are not captured by the model outlined in (1)-(2). First, conditional on the covariates age, position, and ballpark, our model treats the home run rate θ_ij as independent and identically distributed across players i and years j. However, we suspect that not all hitters are created equal: we posit that there exists a sub-group of elite home run hitters within each position that share a higher mean home run rate. We can represent this belief by placing a mixture model on the intercept term α_k, dictated by a latent variable E_ij in each player-year. In other words,

    \alpha_k = \begin{cases} \alpha_{k0} & \text{if } E_{ij} = 0 \\ \alpha_{k1} & \text{if } E_{ij} = 1 \end{cases}

where we force α_k0 < α_k1 for each position k. We call the latent variable E_ij the elite status for player i in year j. Players with elite status are modeled as having the same shape to their age trajectory, but with an extra additive term (on the log-odds scale) that increases their home run rate. However, we have a different elite indicator E_ij for each player-year, which means that a particular player i can move in and out of elite status during the course of his career. Thus, the elite sub-group is maintained in the player population throughout time even though this sub-group will not contain the exact same players from year to year.

The second aspect of hitting performance that needs to be addressed is that the past performance of a particular player should contain information about his future performance. One option would be to use player-specific intercepts in the model to allow each player to have a different trajectory. However, this model choice would involve a large number of parameters, even if these player-specific intercepts were assumed to share a common prior distribution. In addition, many of these intercepts would be subject to over-fitting due to the small number of observed years of data for many players. We instead favor an approach that involves fewer parameters (to prevent over-fitting) while still allowing different histories for individual players. We accomplish this goal by building the past performance of each player into our model through a hidden Markov model on the elite status indicators E_ij for each player i.
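The pieces of (2), together with the elite mixture on the intercept, combine into a home run rate as in the following sketch; the knots, spline coefficients and parameter values in the example are hypothetical, not fitted quantities.

    import numpy as np
    from scipy.interpolate import BSpline

    def home_run_rate(age, elite, alpha_k0, alpha_k1, beta_b, knots, gamma_k):
        """theta_ij implied by (2) with the elite mixture on alpha_k.

        alpha_k0, alpha_k1: non-elite/elite intercepts for the position;
        beta_b: ballpark ("team") effect; knots, gamma_k: cubic B-spline
        describing the position's age trajectory f_k(.).
        """
        f_age = BSpline(knots, np.asarray(gamma_k, float), 3)(age)
        log_odds = (alpha_k1 if elite else alpha_k0) + beta_b + f_age
        return 1.0 / (1.0 + np.exp(-log_odds))

    # Four spline coefficients per position (as in the text) with clamped
    # cubic knots on the observed age range; all values below are made up.
    knots = np.array([20.0] * 4 + [49.0] * 4)
    print(home_run_rate(age=28, elite=True, alpha_k0=-3.6, alpha_k1=-3.0,
                        beta_b=0.1, knots=knots, gamma_k=[-0.5, 0.4, 0.2, -0.8]))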
We instead favor an approach that involves fewer parameters (to prevent over-fitting) while still allowing different histories for individual players. We accomplish this goal by building the past performance of each player into our model through a hidden Markov model on the elite status indicators Eij for each player i. Specifically, our probability model of the elite status indicator for player i in year j + 1 is allowed to depend on the S. T. Jensen, B. B. McShane and A. J. Wyner 635 Figure 1: Hidden Markov Model for Elite Status elite status indicator for player i in year j: p(Ei,j+1 = b|Eij = a, Rij = k) = νabk a, b ∈ {0, 1} (3) where Eij is the elite status indicator and Rij is the position of player i in year j. This relationship is also graphically represented in Figure 1. The Markovian assumption induces a dependence structure on the home run rates θi,j over time for each player i. Players that show elite performance up until year j are more likely to be predicted as elite at year j + 1. The transition parameters ν k = (ν00k , ν01k , ν10k , ν11k ) for each position k = 1, . . . , 9 are shared across players at their position, but can differ between positions, which allows for a different proportion of elite players in each position. We initialize each player’s Markov chain by setting Ei0 = 0 for all i, meaning that each player starts their career in non-elite status. This initialization has the desired consequence that young players must show consistently elite performance in multiple years in order to have a high probability of moving to the elite group. In order to take a fully Bayesian approach to this problem, we must specify prior distributions for all of our unknown parameters. The forty-eight different ballpark coefficients β in our model all share a common Normal distribution, βl ∼ Normal(0, τ 2 ) ∀ l = 1, . . . , 48 (4) The spline coefficients γ needed for the modeling of our age trajectories also share a common Normal distribution, γkl ∼ Normal(0, τ 2 ) ∀ k = 1, . . . , 9, l = 1, . . . , L (5) where L is the number of spline coefficients needed in the modeling of age trajectories for f (Aij , Rij ) for each position. In our latent mixture model, we also have two intercept coefficients for each position, αk = (αk0 , αk1 ), which share a truncated Normal distribution, αk ∼ MVNormal(00, τ 2I 2 ) · Ind(αk0 < αk1 ) ∀ k = 1, . . . , 9 (6) 636 Bayesian Modeling of Hitting where 0 is the 2 × 1 vector of zeros and I 2 is the 2 × 2 identity matrix. This bivariate distribution is truncated by the indicator function Ind(·) to ensure that αk0 < αk1 for each position k. We make each of the prior distributions (4)-(6) non-informative by setting the variance hyperparameter τ 2 to a very large value (10000 in this study). Finally, for the position-specific transition parameters of our elite status ν , we use flat Dirichlet prior distributions, (ν00k , ν01k ) ∼ (ν10k , ν11k ) ∼ Dirichlet(ω, ω) Dirichlet(ω, ω) ∀ k = 1, . . . , 9, ∀ k = 1, . . . , 9. (7) These prior distributions are made non-informative by setting ω to a small value (ω = 1 in this study). We also examined other values for ω and found that using different values had no influence on our posterior inference, which is to be expected considering the dominance of the data in equation (9). 
Combining these prior distributions with equations (1)-(3) gives us the full posterior distribution of our unknown parameters,

    p(\alpha, \beta, \gamma, \nu, E \mid X) \propto \prod_{i,j} p(Y_{ij} \mid M_{ij}, \theta_{ij}) \cdot p(\theta_{ij} \mid R_{ij}, A_{ij}, B_{ij}, E_{ij}, \alpha, \beta, \gamma) \cdot p(E_{ij} \mid E_{i,j-1}, \nu) \cdot p(\alpha, \beta, \gamma, \nu)    (8)

where we use X to denote our entire set of observed data Y and covariates (A, B, M, R).

2.2 MCMC Implementation

We estimate our posterior distribution (8) by a Gibbs sampling strategy (Geman and Geman 1984). We iteratively sample from the following conditional distributions of each set of parameters given the current values of the other parameters:

1. p(α | β, γ, ν, E, X) = p(α | β, γ, E, X)
2. p(β | α, γ, ν, E, X) = p(β | α, γ, E, X)
3. p(γ | β, α, ν, E, X) = p(γ | β, α, E, X)
4. p(ν | β, γ, α, E, X) = p(ν | E)
5. p(E | α, β, γ, ν, X)

where again X denotes our entire set of observed data Y and covariates (A, B, M, R).

Combined together, steps 1-3 of the Gibbs sampler represent the usual estimation of regression coefficients (α, β, γ) in a Bayesian logistic regression model. The conditional posterior distributions for these coefficients are complicated, and we employ the common strategy of using the Metropolis-Hastings algorithm to sample each coefficient (see, e.g., Gelman et al. (2003)). The proposal distribution for a particular coefficient is a Normal distribution centered at the maximum likelihood estimate of that coefficient. The variance of this Normal proposal distribution is a tuning parameter that was adaptively adjusted to provide a reasonable rejection/acceptance ratio (Gelman et al. 1996).

Step 4 of the Gibbs sampler involves standard distributions for our transition parameters ν_k = (ν_00k, ν_01k, ν_10k, ν_11k) for each position k = 1, ..., 9. The conditional posterior distributions for the transition parameters implied by (8) are

    (\nu_{00k}, \nu_{01k}) \mid E \sim \text{Dirichlet}(N_{00k} + \omega, \, N_{01k} + \omega)
    (\nu_{11k}, \nu_{10k}) \mid E \sim \text{Dirichlet}(N_{11k} + \omega, \, N_{10k} + \omega)    (9)

where N_{abk} = \sum_i \sum_{t=1}^{n_i - 1} I(E_{i,t} = a, E_{i,t+1} = b), the outer sum running over all players i in position k, and where n_i represents the number of years of observed data for player i's career.

Finally, step 5 of our Gibbs sampler involves sampling the elite status E_ij for each year j of player i, which can be done using the "Forward-summing Backward-sampling" algorithm for hidden Markov models (Chib 1996). For a particular player i, this algorithm "forward-sums" by recursively calculating

    p(E_{it} \mid X_{i,t}, \Theta) \propto p(X_{it} \mid E_{it}, \Theta) \cdot p(E_{it} \mid X_{i,t-1}, \Theta)
                                  \propto p(X_{it} \mid E_{it}, \Theta) \sum_{e=0}^{1} p(E_{it} \mid E_{i,t-1} = e, \Theta) \, p(E_{i,t-1} = e \mid X_{i,t-1}, \Theta)    (10)

for all t = 1, ..., n_i, where X_{i,t} represents the observed data for player i up until year t, X_{it} represents only the observed data for player i in year t, and Θ represents all other parameters. The algorithm then "backward-samples" by sampling the terminal elite state E_{i,n_i} from the distribution p(E_{i,n_i} | X_{i,n_i}, Θ) and then sampling E_{i,t-1} | E_{i,t} for t = n_i back to t = 1. Repeating this algorithm for each player i gives us a complete sample of the elite statuses E. A sketch of this recursion is given at the end of this subsection.

We ran multiple chains from different starting values to evaluate convergence of our Gibbs sampler. Our results are based on several chains where the first 1000 iterations were discarded as burn-in. Our chains were also thinned, taking only every eighth iteration, in order to eliminate autocorrelation.
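A minimal sketch of the forward-summing backward-sampling recursion (10) for one player's two-state chain is shown below; the array layout and variable names are ours, not the paper's.

    import numpy as np

    def ffbs_elite(lik, trans, rng):
        """Sample E_{i,1:n} for one player via Chib's (1996) algorithm.

        lik[t, e]  = p(X_{i,t+1} | E = e, Theta) over the player's years,
        trans[a,b] = nu_{abk}, the position's transition probabilities.
        The chain is initialised at the non-elite state (E_i0 = 0).
        """
        n = lik.shape[0]
        filt = np.empty((n, 2))
        pred = trans[0]                       # p(E_i1 | E_i0 = 0)
        for t in range(n):                    # forward summing
            filt[t] = lik[t] * pred
            filt[t] /= filt[t].sum()          # p(E_t | X_{i,1:t}, Theta)
            pred = filt[t] @ trans
        E = np.empty(n, dtype=int)            # backward sampling
        E[-1] = rng.choice(2, p=filt[-1])
        for t in range(n - 2, -1, -1):
            w = filt[t] * trans[:, E[t + 1]]
            E[t] = rng.choice(2, p=w / w.sum())
        return E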
2.3 Model Extension: Player-Specific Transition Parameters

In Section 2.1, we introduced a hidden Markov model that allows the past performance of each player to influence predictions for future performance. If we infer player i to have been elite in year t (E_{i,t} = 1), then this inference influences the elite status of that player in his next year, E_{i,t+1}, through the transition parameters ν_k. However, one potential limitation of these transition parameters ν_k is that they are shared globally across all players at a position: each player at position k has the same probability of transitioning from elite to non-elite and vice versa. This model assumption allows us to pool information across players for the estimation of our transition parameters in (9), but may lead to a loss of information if players are truly heterogeneous with respect to the probability of transitioning between elite and non-elite states. In order to address this possibility, we consider extending our model to allow player-specific transition parameters in our hidden Markov model.

Our proposed extension, which we call the PSHMM, has player-specific transition parameters ν_i = (ν^i_00, ν^i_01, ν^i_10, ν^i_11) for each player i, which share a common prior distribution,

    (\nu^i_{00}, \nu^i_{01}) \sim \text{Dirichlet}(\omega_{00k}, \omega_{01k})
    (\nu^i_{11}, \nu^i_{10}) \sim \text{Dirichlet}(\omega_{11k}, \omega_{10k})    (11)

where k is the position of player i. The global parameters ω_k = (ω_00k, ω_01k, ω_11k, ω_10k) are now allowed to vary, with flat prior distributions. This new hierarchical structure allows the transition probabilities ν_i to vary between players, but still imposes some shrinkage towards a common distribution controlled by the global parameters ω_k that are shared across players with position k. Under this model extension, the new conditional posterior distribution for each ν_i is

    (\nu^i_{00}, \nu^i_{01}) \mid E \sim \text{Dirichlet}(N_{i00} + \omega_{00k}, \, N_{i01} + \omega_{01k})
    (\nu^i_{11}, \nu^i_{10}) \mid E \sim \text{Dirichlet}(N_{i11} + \omega_{11k}, \, N_{i10} + \omega_{10k})    (12)

where N_{iab} = \sum_{t=1}^{n_i - 1} I(E_{i,t} = a, E_{i,t+1} = b).

To implement this extended model, we must replace step 4 in our Gibbs sampler with a step where we draw ν_i from (12) for each player i. We must also insert a new step in our Gibbs sampler where we sample the global parameters ω_k given the sampled values of all the ν_i for players at position k. This added step requires sampling (ω_00k, ω_01k) from the following conditional distribution:

    p(\omega_{00k}, \omega_{01k} \mid \nu) \propto \left[ \frac{\Gamma(\omega_{00k} + \omega_{01k})}{\Gamma(\omega_{00k})\,\Gamma(\omega_{01k})} \right]^{n_k} \times \left[ \prod_{i=1}^{n_k} \nu^i_{00} \right]^{\omega_{00k} - 1} \times \left[ \prod_{i=1}^{n_k} \nu^i_{01} \right]^{\omega_{01k} - 1}    (13)

where each product is only over players i at position k and n_k is the number of players at position k. We accomplish this sampling by using a Metropolis-Hastings step with true distribution (13) and Normal proposal distributions: ω^prop_00k ∼ N(ω̂_00k, σ²) and ω^prop_01k ∼ N(ω̂_01k, σ²). The means of these proposal distributions are

    \hat\omega_{00k} = \bar\nu_{00k}\left( \frac{\bar\nu_{00k}(1 - \bar\nu_{00k})}{s^2_{0k}} - 1 \right) \quad \text{and} \quad \hat\omega_{01k} = (1 - \bar\nu_{00k})\left( \frac{\bar\nu_{00k}(1 - \bar\nu_{00k})}{s^2_{0k}} - 1 \right)    (14)

with

    \bar\nu_{00k} = \sum_{i=1}^{n_k} \nu^i_{00} / n_k \quad \text{and} \quad s^2_{0k} = \sum_{i=1}^{n_k} (\nu^i_{00} - \bar\nu_{00k})^2 / n_k

where each sum is over all players i at position k. These estimates ω̂_00k and ω̂_01k were calculated by equating the sample mean ν̄_00k and sample variance s²_0k with the mean and variance of the Dirichlet distribution (13). Similarly, we sample (ω_11k, ω_10k) with the same procedure but with the obvious substitutions.
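The added Gibbs step can be sketched as an independence Metropolis-Hastings update whose Normal proposals are centred at the moment-matched estimates (14). The proposal standard deviation sigma is a tuning parameter, and the handling of non-positive proposals below is our own choice, not specified in the paper.

    import numpy as np
    from scipy.special import gammaln

    def log_target(w0, w1, nu00, nu01, n_k):
        """log of (13), up to an additive constant (flat prior on omega)."""
        return (n_k * (gammaln(w0 + w1) - gammaln(w0) - gammaln(w1))
                + (w0 - 1) * np.sum(np.log(nu00))
                + (w1 - 1) * np.sum(np.log(nu01)))

    def update_omega(w0, w1, nu00, nu01, sigma, rng):
        """One MH update of (omega_00k, omega_01k) given the sampled nu^i;
        nu00 and nu01 are the arrays of nu^i_00 and nu^i_01 at position k."""
        m, s2 = nu00.mean(), nu00.var()       # nu-bar_00k and s^2_0k of (14)
        c = m * (1 - m) / s2 - 1
        m0, m1 = m * c, (1 - m) * c           # proposal means (14)
        p0, p1 = rng.normal(m0, sigma), rng.normal(m1, sigma)
        if min(p0, p1) <= 0:
            return w0, w1                     # reject invalid proposals
        n_k = len(nu00)
        log_q = lambda a, b: -((a - m0) ** 2 + (b - m1) ** 2) / (2 * sigma ** 2)
        log_acc = (log_target(p0, p1, nu00, nu01, n_k) + log_q(w0, w1)
                   - log_target(w0, w1, nu00, nu01, n_k) - log_q(p0, p1))
        return (p0, p1) if np.log(rng.uniform()) < log_acc else (w0, w1)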
3 Results and Model Comparison

Our primary interest is the prediction of future hitting events Y⋆_{t+j} for years j = 1, 2, ... based on our model and observed data up to year t. We estimate the full posterior distribution (8) and then use this posterior distribution to predict home run totals Y⋆_{i,2006} for each player i in the 2006 season. The 2006 season serves as an external validation of our method, since this season is not included in our model fit. We use our predicted home run totals Y⋆_2006 for the 2006 season to compare our performance to several previous methods (Section 3.2) as well as to evaluate several internal model choices (Section 3.1). In Section 3.3, we present inference for other parameters of interest from our model, such as the position-specific age curves.

3.1 Prediction of 2006 Home Run Totals: Internal Comparisons

We can use our posterior distribution (8) based on data from MLB seasons up to 2005 to calculate the predictive distribution of the 2006 hitting rate θ_{i,2006} for each player i:

    p(\theta_{i,2006} \mid X) = \int p(\theta_{i,2006} \mid R_{i,2006}, A_{i,2006}, B_{i,2006}, E_{i,2006}, \alpha, \beta, \gamma) \, p(E_{i,2006} \mid E_i, \nu) \, p(\alpha, \beta, \gamma, \nu, E_i \mid X) \, d\alpha \, d\beta \, d\gamma \, d\nu \, dE_i    (15)

where X represents all observed data up to 2005. This integral is estimated using the sampled values from the posterior distribution p(α, β, γ, ν, E_i | X) that were generated via our Gibbs sampling strategy. We can use the posterior predictive distribution (15) of each 2006 home run rate θ_{i,2006} to calculate the distribution of the home run total Y⋆_{i,2006} for each player in the 2006 season:

    p(Y^\star_{i,2006} \mid X) = \int p(Y^\star_{i,2006} \mid M_{i,2006}, \theta_{i,2006}) \, p(\theta_{i,2006} \mid X) \, d\theta_{i,2006}    (16)

However, an issue with the prediction of home run totals is that we must also consider the number of opportunities M_{i,2006}. Since our overall focus has been on modeling home run rates θ_{i,2006}, we will use the true value of M_{i,2006} for the 2006 season in equation (16). Using the true value of each M_{i,2006} gives a fair comparison of the rate predictions θ_{i,2006} for each model choice, since it is a constant scaling factor. This is not a particularly realistic scenario in a prediction setting, since the actual number of opportunities will not be known ahead of time.

Based on the predictive distribution p(Y⋆_{i,2006} | X), we can report either a predictive mean E(Y⋆_{i,2006} | X) or a predictive interval C⋆_i such that p(Y⋆_{i,2006} ∈ C⋆_i | X) ≥ 0.80. We can examine the accuracy of our model predictions by comparing to the observed home run totals Y_{i,2006} for the 559 players in the 2006 season, which we did not include in our model fit. We use the following three comparison metrics (a computational sketch follows the list):

1. RMSE: root mean square error of the predictive means,

    \text{RMSE} = \sqrt{\frac{1}{n} \sum_i \left( E(Y^\star_{i,2006} \mid X) - Y_{i,2006} \right)^2}

2. Interval Coverage: fraction of 80% predictive intervals C⋆_i covering the observed Y_{i,2006}

3. Interval Width: average width of the 80% predictive intervals C⋆_i
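Given a matrix of posterior predictive draws, the three metrics can be computed as below. Using central quantile intervals for C⋆_i is our choice for the sketch; the paper does not specify how its intervals are constructed.

    import numpy as np

    def validation_metrics(draws, y_true, level=0.80):
        """draws: (n_samples, n_players) posterior predictive HR totals;
        y_true: observed totals. Returns (RMSE, coverage, average width)."""
        mean_pred = draws.mean(axis=0)
        lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
        rmse = np.sqrt(np.mean((mean_pred - y_true) ** 2))
        coverage = np.mean((y_true >= lo) & (y_true <= hi))
        avg_width = np.mean(hi - lo)
        return rmse, coverage, avg_width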
In Table 1, we evaluate our full model outlined in Section 2.1 relative to several simpler modeling choices. Specifically, we examine a simpler version of our model without positional information or the mixture model on the α coefficients. We see from Table 1 that our full model gives proper coverage and a substantially lower RMSE than the version of our model without positional information or the elite/non-elite mixture model. We also examine a truly simplistic strawman, which is to take last year's home run totals as the prediction for this year's home run totals (i.e., Y⋆_{i,2006} = Y_{i,2005}). Since this strawman gives only a point estimate, that comparison is made based solely on the RMSE. As expected, the relative performance of this strawman model is terrible, with a substantially higher RMSE compared to our full model. Of course, this simple strawman alternative is rather naive, and in Section 3.2 we compare our performance to more sophisticated external prediction approaches.

Table 1: Internal comparison of different model choices. Measures are calculated over 559 players from the 2006 season.

    Model                                RMSE   Coverage of 80% Intervals   Average Interval Width
    Full Model                           5.30   0.855                       9.81
    No Position or Elite Indicators      6.87   0.644                       6.56
    Strawman: Y⋆_{i,2006} = Y_{i,2005}   8.24   NA                          NA
    Player-Specific Transitions          5.45   0.871                       10.36

We also considered the extended model of Section 2.3 with player-specific transition parameters for the hidden Markov model on elite status, and the validation results for this model are also given in Table 1. Our motivation for this extension was that allowing player-specific transition parameters might reduce the interval width for players that have displayed consistent past performance. However, we see that the overall prediction accuracy was not improved by this model extension, suggesting that there is not enough additional information in the personal history of most players to noticeably improve the model predictions. Somewhat surprisingly, we also see that the widths of our 80% predictive intervals are not actually reduced in this extended model. The reason is that, even for players with long careers of data, the player-specific transition parameters ν_i fit by this extended model are not extreme enough to force all sampled elite indicators E_{i,2006} to be either 0 or 1, and so the predictive interval is still wide enough to include both possibilities.

3.2 Prediction of 2006 Home Run Totals: External Comparisons

Similarly to Section 3.1, we use held-out home run data for the 2006 season to evaluate our model predictions against the predictions from two external methods, PECOTA (Silver 2003) and MARCEL (Tango 2004), both described in Section 1. We view MARCEL as the primary competitor of our approach, as it is also a fully automated method based on publicly available data. However, out of general interest we also compare our prediction accuracy to the proprietary and manually curated PECOTA system. For a reasonable comparison set, we focus our external validation on hitters with an empirical home run rate of at least 1 home run every 40 at bats in at least one season up to 2005 (minimum of 300 at bats in that season). This restriction reduces our dataset down to 118 top home run hitters who all have predictions from the competing methods PECOTA and MARCEL. As noted above, our predicted home run totals for 2006 are based on the true number of at bats for 2006. In order to have a fair comparison to external methods such as PECOTA or MARCEL, we also scale the predictions from these methods by the true number of at bats in 2006. Our approach has the advantage of producing the full predictive distribution of future observations (summarized by our predictive intervals). However, the external methods do not produce comparable intervals, so we only compare to the other approaches in terms of prediction accuracy. We expand our set of accuracy measures to include not only the root mean square error (RMSE) but also the median absolute error (MAE).
In addition to comparing the predictions from each method using overall error rates, we also calculated "% BEST", which is, for each method, the percentage of players for which the predicted home run total Y⋆_{i,2006} is the closest to the true home run total among all methods. Each of these comparison statistics is given in Table 2. In addition to giving these validation measures for all 118 players, we also separate our comparison into young players (age ≤ 26 years in 2006) versus older players (age > 26 years in 2006). The age cut-off of 26 years was used in order to isolate the small subset of players that were just beginning their careers and for which each player had little personal history of performance. It is worth noting that only 8 of the 118 players (around 7%) in our 2006 test dataset were classified as young by this criterion, so the vast majority (110 of 118) of players are in the "older" category.

Table 2: Comparison of our model to two external methods on the 2006 predictions of 118 top home run hitters. We also provide this comparison for only young players (age ≤ 26 years) versus only older players (age > 26 years).

                   All Players            Young Players          Older Players
    Method      RMSE   MAE   % BEST    RMSE   MAE   % BEST    RMSE   MAE   % BEST
    Our Model   7.33   4.40    41%     2.62   1.93    62%     7.56   4.48    39%
    PECOTA      7.11   4.68    28%     4.62   3.44     0%     7.26   4.79    30%
    MARCEL      7.82   4.41    31%     4.15   2.17    38%     8.02   4.57    31%

We see from Table 2 that our model is extremely competitive with the external methods PECOTA and MARCEL. When examining all 118 players, our model has the smallest median absolute error and the highest "% BEST" measure, suggesting that our predictions are superior on these absolute scales. Our performance is more striking when we examine only the small subset of young players in our dataset. We have the best prediction for 62% of all young players, and for these young players both the RMSE and MAE from our method are substantially lower than either PECOTA or MARCEL. We credit this superior performance to our sophisticated hierarchical approach that builds in information via position instead of relying solely on limited past personal performance. All eight young players had played three seasons or fewer before 2006, and six of the eight had two seasons or fewer before 2006. For these players, very little past information is available about their performance, and so the model must rely heavily on position, where information is shared between players.

However, our method is not completely dominant: we have a larger root mean square error than PECOTA for older players (and overall), which suggests that our model might be making large errors on a small number of players. Further investigation shows that our model commits its largest errors for players in the designated hitter (DH) position. This is somewhat expected, since our model seems to perform best for young players and DH is a position almost always occupied by an older player. Beyond this, the model appears to be over-shrinking predictions for players in the DH role, perhaps because this player position is rather unique and does not fit our model assumptions as well as the other positions. Also, PECOTA is a manually curated system that can account for the latest information in terms of injuries and playing time adjustments, which can greatly benefit its predictions.
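The "% BEST" statistic can be computed as in the sketch below. The paper does not state how ties are credited, so the tie-handling here is our own convention.

    import numpy as np

    def pct_best(y_true, predictions):
        """predictions: dict mapping method name -> array of predicted totals.
        Returns the percentage of players for which each method is closest
        to the truth (ties credited to every tied method)."""
        errors = np.vstack([np.abs(p - np.asarray(y_true))
                            for p in predictions.values()])
        is_best = errors == errors.min(axis=0)
        return {m: 100.0 * is_best[i].mean()
                for i, m in enumerate(predictions)}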
Overall, the validation results are generally very encouraging for our approach compared to our nearest competitor, MARCEL, as well as the proprietary system PECOTA. Our performance is especially good among younger players, where a principled balance of positional information with past performance is most advantageous.

We further investigate our model dynamics among young players by examining how many years of observed performance are needed to decide that a player is an elite home run hitter. This question was posited in Section 1, and we now address it using our elite status indicators E_ij. Taking all 559 available players examined in Section 3.1, we focus our attention on the subset of players that were determined by our model to be in the elite group (P(E_ij = 1) ≥ 0.5) for at least two years in their career. For each elite home run hitter, we tabulate the number of years of observed data that were needed before they were declared elite. The distribution of the number of years needed is given in Figure 2. We see that although some players are determined to be elite based on just one year of observed data, most players (74%) need more than one year of observed performance to determine that they are elite home run hitters. In fact, almost half of players (46%) need more than two years of observed performance to determine that they are elite home run hitters.

    Figure 2: Distribution of the number of seasons of observed data needed to infer elite status (P(E_ij = 1) ≥ 0.5) among all players determined by our model to be elite during their career. Note that increasing the cut-off for elite status (e.g., P(E_ij = 1) ≥ 0.75) shifts the distribution towards a higher number of seasons needed, whereas decreasing the cut-off (e.g., P(E_ij = 1) ≥ 0.25) shifts the distribution towards a lower number of seasons needed. [Histogram of player frequency versus years in baseball, 1 to 11.]

We also investigated our model dynamics among older players by examining the balancing of past consistency with advancing age, which was also posited as a question in Section 1. Specifically, for the older players (age ≥ 35) in our dataset, we examined the differences between the 2006 home run rate predictions θ̂_{i,2006} = E(θ_{i,2006} | X) from our model and the naive prediction based entirely on the previous year, θ̃_{i,2006} = Y_{i,2005}/M_{i,2005}. Is our model contribution for a player (which we define as the difference between our model prediction θ̂_{i,2006} and the naive prediction θ̃_{i,2006}) more a function of advancing age or of the past consistency of that player? Both age and past consistency (measured as the standard deviation of past home run rates) were found to be equally good predictors of our model contribution, which suggests that both sources of information are being evenly balanced in the predictions produced by our model.

3.3 Age Trajectory Curves

In addition to validating our model in terms of prediction accuracy, we can also examine the age trajectory curves that are implied by our estimated posterior distribution (8). We will examine these curves on the scale of the home run rate θ_ij, which is a function of age A_ij, ballpark b, and elite status E_ij for player i in year j (with position k):

    \theta_{ij} = \frac{\exp\left[(1 - E_{ij})\,\alpha_{k0} + E_{ij}\,\alpha_{k1} + \beta_b + f_k(A_{ij})\right]}{1 + \exp\left[(1 - E_{ij})\,\alpha_{k0} + E_{ij}\,\alpha_{k1} + \beta_b + f_k(A_{ij})\right]}    (17)
The shape of these curves can differ by position k and ballpark b, and can also differ between elite and non-elite status as a consequence of the different additive effects α_k0 vs. α_k1. In Figure 3, we compare the age trajectories for two positions, DH and SS, for both elite player-years (E_ij = 1) and non-elite player-years (E_ij = 0) for an arbitrary ballpark. Each graph contains multiple curves (100 in each graph), each of which is the curve implied by the sampled values (α, γ) from a single iteration of our converged and thinned Gibbs sampling output. Examining the curves from multiple samples gives us an indication of the variability in each curve.

    Figure 3: Age trajectories f_k(·) for two positions and elite vs. non-elite status. The x-axis is age and the y-axis is the rate θ_ij.

We see a tremendous difference between the two positions DH and SS in terms of the magnitude and shape of their age trajectory curves. This is not surprising, since home run hitting ability is known to be quite different between designated hitters and shortstops. In fact, DH and SS were chosen specifically to illustrate the variability between positions with regards to home run hitting. For the DH position, we also see that elite vs. non-elite status shows a substantial difference in the magnitude of the home run rate, though the overall shape across age is restricted to be the same by the fact that players of both statuses share the same f_k(A_ij) in equation (17). There is less difference between elite and non-elite status for shortstops, in part due to the lower range of values for shortstops overall. Not surprisingly, the variability in the curves grows with the magnitude of the home run rate.

We also perform a comparison across all positions by examining the elite vs. non-elite intercepts (α_0, α_1) that were allowed to vary by position. We present the posterior distribution of each elite and non-elite intercept in Figure 4. For easier interpretation, the values of each α_k0 and α_k1 have been transformed into the implied home run rate θ_ij for very young (age = 23) players in our dataset. We see in Figure 4 that the variability is higher for the elite intercept in each position, and there is even more variability between positions. The ordering of the positions is not surprising: the corner outfielders and infielders have much higher home run rates than the middle infielder and center fielder positions.

    Figure 4: Distribution of the elite vs. non-elite intercepts (α_0, α_1) for each position. The distributions of each (α_0, α_1) are presented in terms of the home run rate θ_ij for very young (age = 23) players. The posterior mean is given as a black dot, and the 95% posterior interval as a black line.

For a player at a specific position, such as DH, our prediction of his home run rate for a future season is a weighted mixture of the elite and non-elite DH curves given in Figure 3. The amount of weight given to elite vs. non-elite for a given player is determined by the full posterior distribution (8) as a function of that player's past performance.
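In code, this weighted mixture arises automatically when the predictive rate is computed draw by draw: each Gibbs sample carries its own elite indicator, so averaging over draws weights the elite and non-elite curves by the player's posterior elite probability. The structure of the sampled draws below is hypothetical, for illustration only.

    import numpy as np

    def predictive_rate_draws(gibbs_draws, age, park):
        """Posterior draws of theta_{i,2006} for one player. Each draw d is
        assumed to hold the sampled intercepts, ballpark effect, age curve
        f_k(.), and the player's sampled elite indicator for 2006."""
        rates = []
        for d in gibbs_draws:
            alpha = d["alpha_k1"] if d["elite_2006"] else d["alpha_k0"]
            log_odds = alpha + d["beta"][park] + d["f_age"](age)
            rates.append(1.0 / (1.0 + np.exp(-log_odds)))
        return np.array(rates)   # mean over draws = mixture-weighted prediction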
We illustrate this characteristic of our model in more detail in Figure 5 by examining six different hypothetical scenarios for players at the 2B position. Each plot in Figure 5 gives several seasons of past performance for a single player, as well as predictions for an additional season (age 30). Predictions are given both in terms of posterior draws of the home run rate and the posterior mean of the home run rate. The elite and non-elite age trajectories for the 2B position are also given in each plot. We focus first on the left column of plots, which shows hypothetical players with consistently high (top row), average (middle row), and poor (bottom row) past home run rates. We see in each of these left-hand plots that our posterior draws (gray dots) for the next season are a mixture of posterior samples from the elite and non-elite curves, though each case has a different proportion of elite vs. non-elite, as indicated by the posterior mean of those draws (black ×).

Now, what would happen if each of these players were not so consistent? In Section 1, we asked about the effect of a single sub-par year on our model predictions. The plots in the right column show the same three hypothetical players, but with their most recent past season replaced by a season with distinctly different (and relatively poor) home run hitting performance. We see from the resulting posterior means in each case that only the average player (middle row) has his predictions substantially affected by the one season of relatively poor performance. Despite the one year of poor performance, the player in the top row of Figure 5 is still considered to be elite in the vast majority of posterior draws. Similarly, the player in the bottom row of Figure 5 is going to be considered non-elite regardless of that one year of extra poor performance. The one season of poor performance has the most influence on the player in the middle row, since the model has the most uncertainty with regard to the elite vs. non-elite status of this average player.

4 Discussion

We have presented a sophisticated Bayesian hierarchical model for home run hitting among major league baseball players. Our principled approach builds upon information about past performance, age, position, and home ballpark to estimate the underlying home run hitting ability of individual players, while sharing information across players. Our primary outcome of interest is the prediction of future home run hitting, which we evaluated on a held-out season of data (2006). When compared to the previous methods PECOTA (Silver 2003) and MARCEL (Tango 2004), we perform well in terms of prediction accuracy, especially on our "% BEST" measure, which tabulates the percentage of players for which our predictions are the closest to the truth. Our prediction accuracy completely dominates the MARCEL procedure, which represents our closest natural competitor since it is also fully automated and based on publicly available data. Our prediction accuracy is also competitive with the proprietary PECOTA system, which is especially impressive given that PECOTA is manually curated based on the latest information about injuries and playing time. Our approach does especially well among young players, where a principled balance of positional information with past performance seems most helpful. In addition, our method has the advantage of estimating the full posterior predictive distribution for each player, which provides additional information in the form of posterior intervals.
Beyond our primary goal of prediction, our model-based approach also allows us to answer interesting supplemental questions such as the ones posed in Section 1.

    Figure 5: Six different hypothetical scenarios for a player at the 2B position. Black curves indicate the elite and non-elite age trajectories for the 2B position. Black points represent several seasons of past performance for a single player. Predictions for an additional season are given as posterior draws (gray points) of the home run rate and the posterior mean of the home run rate (black ×). The left column of plots gives hypothetical players with consistently high (top row), average (middle row), and poor (bottom row) past home run rates. The right column of plots shows the same hypothetical players, but with their most recent past season replaced by a relatively poor home run hitting performance.

We have illustrated our methodology using home runs as the hitting event since they are a familiar outcome that most readers can calibrate with their own anecdotal experience. However, our approach could easily be adapted to other hitting outcomes of interest, such as on-base percentage (rate of hits or walks), which has become a popular tool for evaluating overall hitting quality. Also, although our procedure is presented in the context of predicting a single hitting event, we can extend our methodology to model multiple hitting outcomes simultaneously. In this more general case, there are several possible outcomes of an at bat (out, single, double, etc.). Our unit of observation for a given player i in a given year j is now a vector of outcome totals Y_ij, which can be modeled as a multinomial outcome:

    Y_{ij} \sim \text{Multinomial}(M_{ij}, \theta_{ij})

where M_ij is the number of opportunities (at bats) for player i in year j and θ_ij is the vector of player- and year-specific rates for each outcome. Our underlying model for the rates θ_ij as a function of position, ballpark and past performance could be extended to a vector of rates; a small illustration of the multinomial outcome model is given below. Our preliminary experience with this type of multinomial model indicates that single-event predictions (such as home runs) are not improved by considering multiple outcomes simultaneously, though one could argue that a more honest assessment of the variance in each event would result from acknowledging the possibility of multiple events from each at bat.
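Under the multinomial model, a season is a single draw of outcome totals; the outcome categories and rates below are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    outcomes = ["out", "single", "double", "triple", "home run"]
    theta_ij = np.array([0.70, 0.17, 0.05, 0.01, 0.07])   # sums to 1
    M_ij = 500                                            # at bats
    Y_ij = rng.multinomial(M_ij, theta_ij)                # vector of totals
    print(dict(zip(outcomes, Y_ij)))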
An important element of our approach was the use of mixture modeling of the player population to further refine our estimated home run rates. Sophisticated statistical models have been used previously to model the careers of baseball hitters (Berry et al. 1999), but these approaches have not employed mixtures for the modeling of the player population. Our internal model comparisons suggest that this mixture model component is crucial for the accuracy of our model, dominating even information about player position. Using a mixture of elite and non-elite players limits the shrinkage towards the population mean of consistently elite home run hitters, leading to more accurate predictions. Our fully Bayesian approach also allows us to investigate the dynamics of our elite status indicators directly, as we do in Section 3.2.

In addition to our primary goal of home run prediction, our model also estimates several secondary parameters of interest. We estimate career trajectories for both elite and non-elite players within each position. In addition to evaluating the dramatic differences between positions in terms of home run trajectories, our fully Bayesian model also has the advantage of estimating the variability in these trajectories, as can be seen in Figure 3. It is worth noting that our age trajectories do not really represent the typical major league baseball career, especially at the higher values of age. More accurately, our trajectories represent the typical career conditional on the player staying in baseball, which is one reason why we do not see a dramatic drop-off in Figure 3. Since our primary goal is prediction, the fact that our trajectories are conditional is acceptable, since one would presumably only be interested in prediction for baseball players that are still in the major leagues. However, if one were more interested in estimating unconditional trajectories, then a more sophisticated modeling of the drop-out/censoring process would be needed.

Our focus in this paper has been the modeling of home run rates θ_ij, and so we have made the assumption throughout our analysis that the number of plate appearances, or opportunities, for each player is a known quantity. This is a reasonable assumption when retrospectively estimating past performance, but when predicting future hitting performance the number of future opportunities is not known. In order to maintain a fair comparison between our method and previous approaches for prediction of future totals, we have used the future number of opportunities, which is not a reasonable strategy for real prediction. A focus of future research is to adapt our sophisticated hierarchical approach to the modeling and prediction of plate appearances M_ij in addition to our current modeling of hitting rates θ_ij.

References

Berry, S. M., Reese, S., and Larkey, P. D. (1999). "Bridging Different Eras in Sports." Journal of the American Statistical Association, 94: 661–686.

Brown, L. D. (2008). "In-Season Prediction of Batting Averages: A Field-test of Simple Empirical Bayes and Bayes Methodologies." Annals of Applied Statistics, 2: 113–152.

Chib, S. (1996). "Calculating posterior distributions and modal estimates in Markov mixture models." Journal of Econometrics, 75: 79–97.

de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2003). Bayesian Data Analysis. Boca Raton, FL: Chapman and Hall/CRC, 2nd edition.

Gelman, A., Roberts, G., and Gilks, W. (1996). "Efficient Metropolis jumping rules." In Bernardo, J., Berger, J., Dawid, A., and Smith, A. (eds.), Bayesian Statistics 5, 599–608. Oxford University Press.

Geman, S. and Geman, D. (1984). "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 6: 721–741.

Lahman, S. (2006). "Baseball Archive." Lahman's Baseball Database, Version 5.5. URL http://www.baseball1.com/

Quintana, F. A., Mueller, P., Rosner, G. L., and Munsell, M. (2008). "Semi-parametric Bayesian Inference for Multi-Season Baseball Data." Bayesian Analysis, 3: 317–338.

Silver, N. (2003). "Introducing PECOTA." Baseball Prospectus, 2003: 507–514.

Tango, T. (2004). "Marcel The Monkey Forecasting System." Tangotiger.net, March 10, 2004. URL http://www.tangotiger.net/archives/stud0346.shtml
Acknowledgments

We would like to thank Dylan Small and Larry Brown for helpful discussions.

Bayesian Analysis (2009) 4, Number 4, pp. 653–660
DOI: 10.1214/09-BA424A
© 2009 International Society for Bayesian Analysis

Comment on Article by Jensen et al.

Jim Albert (Department of Mathematics and Statistics, Bowling Green State University, Bowling Green, OH, http://www-math.bgsu.edu/~albert/) and Phil Birnbaum (Society for American Baseball Research, http://philbirnbaum.com/)

1 Introduction

Prediction of future batting performance is an important problem in baseball. Due to trades and the free agent system, there is a good deal of movement of players between teams in the "hot-stove league" (the baseball off-season), and teams acquire new players with the hope that they will achieve particular performances in the following season. The authors propose a Bayesian hierarchical modeling framework for estimating home run hitting probabilities and making predictions of future home run hitting performance. Generally, this is an attractive methodology, especially when one is collecting data from many players who have similar home run hitting abilities. By use of hierarchical modeling, the estimates of the home run probabilities shrink or adjust the observed rates towards a combined regression estimate. One attractive feature of the Bayesian approach is that it is straightforward to obtain predictions from the posterior predictive distribution, and the authors test the value of their method by comparing it with two alternative prediction systems, MARCEL and PECOTA. It is straightforward to fit these hierarchical models by MCMC algorithms, and the authors provide the details of this fitting algorithm.

Although we admire the authors' paper from a Bayesian modeling/computation perspective, it seems deficient from the application (baseball) perspective. There is substantial research on home run hitting and on the modeling of career trajectories of ballplayers, and we believe this research should be helpful in defining relevant covariates and proposing realistic models for trajectories. In the following comments, we discuss several concerns with the basic modeling framework, focus on the choice of suitable adjustments, and suggest a more flexible framework for modeling career trajectories.

2 Data

The authors use data from the Lahman database, where the counts of home runs and at bats are collected for each player for each season in the period 1990 to 2005. Although this is a rich dataset, we are puzzled that the authors did not use the more detailed play-by-play data available from the Retrosheet organization (www.retrosheet.org). This dataset is easy to access and manipulate. As will be seen shortly, this richer dataset would allow for the inclusion of suitable covariates in the adjustment of the home run rates.

3 Adjustments for Home Run Rates

In comparing baseball hitters across eras, Schell (2005) explains the importance of adjusting home run rates for the era of play, the distribution of league-wide talent, the ballpark effect, and a player's late-career decline. Adjustments for league-wide talent and the ballpark are also crucial in the modeling of a player's hitting trajectory and the prediction of future performance. There have been dramatic changes in home run hitting from 1990 to 2005. The overall major league home run rate increased by 26% between 1992 and 1993, and the rate has shown a 50% increase over this 15-year period.
Schell documents the significant impact of ballparks on the pattern of home run hitting. In the current baseball season, it appears to be much easier to hit home runs in the new Yankee Stadium in New York. The park factor for the new Yankee Stadium is currently 1.295, which means that the rate of home run hitting in Yankee home games is about 30% higher than the rate of home run hitting in Yankee away games.

One can understand changes in league-wide hitting talent by fitting a random effects model. For a given season, we observe the number of home runs and at bats (y_i, n_i) for all batters. We assume that y_i is Binomial(n_i, p_i), and then we assume the home run probabilities {p_i} follow a beta distribution with shape parameters a and b. The fitted values â and b̂ are informative about the location and shape of the distribution of home run abilities of the batters. This random effects model is fit separately for each season, obtaining estimates â_j and b̂_j for season j; a sketch of this per-season fit is given below. The top graph in Figure 1 displays the median home run ability of the players for the seasons 1990 to 2005, and the bottom graph plots the interquartile spread of the home run ability distribution against season. This figure shows dramatic changes in the location and spread of the talent for hitting home runs over this 15-year period. One way of adjusting a player's season home run rate compares his rate to the distribution of home run rates for that particular season. Specifically, one can compute a predictive standardized score as in Albert (2009) using the average and standard deviation of the predictive distribution.
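The per-season fit can be sketched by maximising the beta-binomial marginal likelihood in (a, b); the starting values below are arbitrary.

    import numpy as np
    from scipy.special import betaln
    from scipy.optimize import minimize

    def fit_season_talent(y, n):
        """Fit y_i ~ Binomial(n_i, p_i), p_i ~ Beta(a, b) for one season
        by maximum marginal likelihood; returns (a_hat, b_hat)."""
        y, n = np.asarray(y, float), np.asarray(n, float)
        def neg_loglik(log_ab):
            a, b = np.exp(log_ab)             # optimise on the log scale
            return -np.sum(betaln(y + a, n - y + b) - betaln(a, b))
        res = minimize(neg_loglik, x0=np.log([2.0, 60.0]),
                       method="Nelder-Mead")
        return tuple(np.exp(res.x))

    # Repeating this for each season j gives (a_j, b_j); the Beta(a_j, b_j)
    # median and interquartile range are the quantities plotted in Figure 1.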
The paper does include some adjustments in its regression model (2), specifically, covariates for the home ballpark and fielding position. As the authors explain, the data do not break down a player's home run totals by home and away games, and so the "home ballpark" covariate actually confounds two variables, the ballpark effect and the team hitting ability. One could define a true ballpark effect by using the Retrosheet data.

We are puzzled by the inclusion of the fielding position covariate. Although there are some tendencies, for example, first-basemen tend to hit more home runs than second-basemen, modern hitters of all non-pitching positions are proficient in hitting home runs. Why do the authors believe that fielding position is an important covariate? More importantly, why do the authors believe that players of different positions have different home run trajectories?

Another possible regression adjustment is the number of opportunities AB. There is a general positive correlation between AB and home run rate: players with more at-bats tend to hit a higher rate of home runs. Also, if a young player has a limited number of AB one season, it is more likely that he will have a small number of home runs and be sent back to the minors the following season. Also, the number of AB and the player's career trajectory provide a good prediction of the player's AB in a future season. (The authors assume that the player's 2006 AB is the same as the AB in the previous season.)

4 Elite/Non-Elite Players

The authors introduce a latent elite variable in their model with the justification "that there exists a sub-group of elite home run hitters within each position that share a higher mean home run rate". The authors do not present any evidence in the paper that home run rates cluster in two groups of non-elite players and elite players. In our exploration of these data, there appears to be a continuum of home run ability that is right-skewed with a few possible large outliers. It seems that the latent elite variable is introduced not because the data suggest two clusters, but rather to induce some dependence in the home run rates for the same player. There is a more straightforward way to model this dependence, specifically to assume that each player has a unique trajectory, where the individual player regression coefficient vectors are assumed to follow a common distribution. This comment relates to the authors' approach for modeling trajectories, which will be described next.

5 Modeling Career Trajectories

In the motivation for the career trajectories, the authors say that they "favor an approach that involves fewer parameters (to prevent over-fitting)". But they make the very restrictive assumption that players of a particular fielding position share the same career trajectory. This assumption does not reflect the variable trajectory patterns of home run hitting. To illustrate the variability in trajectories, consider the home run hitting patterns of the Hall of Fame players Mickey Mantle and Hank Aaron, both of whom played the same outfield position in the same era. Figure 2 plots standardized home run rates for both players as a function of age, where the rates have been standardized using the predictive distribution as described above. Note that Mantle peaked in his late 20's and declined quickly until retirement. In contrast, Aaron peaked in home run hitting ability much later in his career and showed a more gradual decline towards the end of his career.

It can be difficult to estimate the player trajectories individually using regression models due to the high variability of the observed rates, as shown in Figure 1. But one can obtain good smoothed estimates of the individual trajectories by use of a multilevel model. If the vector of regression coefficients for the ith player is represented by β_i, then one can assume that the {β_i} are a random sample from a common normal distribution with mean vector β and variance-covariance matrix Σ, and the hyperparameters β, Σ are assigned a vague prior at the second stage. The posterior estimates smooth the individual trajectory estimates towards a common trajectory. This multilevel model is shown to be successful in smoothing trajectories of WHIP (walks plus hits per inning pitched) rates for pitchers in Albert (2009). We have also used it for estimating trajectories of batter on-base percentages, and we would expect similar good results for estimating trajectories of home run rates. This analysis would lead to more realistic estimates of career trajectories and likely better predictions of future home run hitting. Certainly, one should make different predictions for the home run hitting of a 35-year-old Mickey Mantle and a 35-year-old Hank Aaron, since their patterns of decline were very different.
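The shrinkage mechanism of this multilevel proposal is easy to see in a small simulation. The sketch below (in Python; all numerical values are illustrative assumptions, not estimates from the baseball data) draws player-specific quadratic trajectory coefficients β_i from a common normal distribution, which is exactly the second-stage structure described above:

    import numpy as np

    rng = np.random.default_rng(0)

    # Common (population) trajectory: quadratic in age on the rate scale.
    beta_mean = np.array([-2.0, 0.30, -0.006])   # intercept, age, age^2 (illustrative)
    Sigma = np.diag([0.10, 0.01, 0.0001])        # between-player covariance (illustrative)

    ages = np.arange(20, 41)
    X = np.column_stack([np.ones_like(ages), ages - 30, (ages - 30) ** 2])

    # Player-specific coefficients: a random sample from the common distribution.
    betas = rng.multivariate_normal(beta_mean, Sigma, size=5)
    trajectories = betas @ X.T                   # one smooth trajectory per player

In a full Bayesian fit, the posterior means of the β_i are pulled toward the common trajectory in proportion to how little data a player contributes, so short-career players are smoothed heavily while long-career players such as Mantle and Aaron keep their individual shapes.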
6 A Sabermetrics Perspective

Sabermetrics is the scientific search for objective knowledge about baseball, and the search for better predictions of future performance is certainly something that sabermetricians (especially those who may be employed by major league clubs) are interested in. But they are concerned with more than just accurate predictions; they are concerned with what the projection reveals about players and changes in their performance. Bill James, in a discussion about the existence of clutch hitting in James (1984), says "How is it that a player who possesses the reflexes and the batting stroke and the knowledge and the experience to be a .260 hitter in other circumstances magically becomes a .300 hitter when the game is on the line? How does that happen? What is the process? What are the effects? Until we can answer those questions, I see little point in talking about clutch ability."

Likewise, sabermetricians are interested in the process that leads to a prediction of home run hitting. Sabermetricians are unsatisfied with mere predictions, no matter how accurate. Given an accurate prediction of future performance, they ask, "what is it about that prediction that makes it accurate? What does it tell us about the relationship of past performance to future performance?"

One attractive feature of MARCEL is that it gives us clues to what might be going on. Tango (2004) gives the full MARCEL algorithm, in which we can see the assumptions that went into the formula. We see how it weights recent performance relative to more distant performance, how much one should regress to the mean, and how one adjusts predictions for changes in league norms. These individual assumptions can be adjusted in order to minimize prediction error, and, in so doing, we would come closer to learning objective information about player hitting.

The Bayesian modeling approach presented in this paper, however, is more complex and opaque. It performs only marginally better than MARCEL, while using more information such as home team scoring and player position. It is uncertain what an experienced sabermetrician would learn from the Bayesian process, and it is uncertain whether the (marginally) improved predictions are the result of a better model, or simply the result of the additional information being used.

Further, while the Bayesian model has shown itself to be successful in predicting, certain of its assumptions are almost certain to be false. As has been noted, the classification of hitters into only two categories, elite and non-elite, is certainly false, as home-run-hitting ability appears to be a continuum; there is no evidence that the distribution of home run rates, even by position, is bimodal. The fact that the Bayesian model gives reasonable estimates cannot be taken as evidence that the assumptions are correct. For instance, a black-box model that predicts swine-flu infection rates may be valuable, but only if the assumptions that went into the model are correct will it be useful in predicting future outbreaks. If the assumptions are incorrect, the predictions based on the model may be inaccurate.

Sabermetricians would be very interested in the success of the Bayesian model in predicting home run rates for younger hitters; as Table 2 of the paper shows, the Bayesian algorithm beats MARCEL 62% of the time, and beats PECOTA 100% of the time. We note, however, that this is based only on a sample of eight players. Still, one could discover possible attributes of the prediction methodology by a case-by-case exploration. It would be useful to see the full list of players and their estimates, along with a discussion of what kinds of players, such as power hitters or high-average players, are better estimated than other types of players.
This would provide a useful comparison of the methods, and provide a direction for future research to improve the knowledge that the field of sabermetrics has compiled about the aging process. As it stands now, the Bayesian method has made sabermetrics aware that slight improvements over MARCEL are possible, but, without further exploration, we are left with little understanding of where the improvements came from, where MARCEL is weak, what assumptions need to be refined, or, indeed, how the aging process in baseball can better be explained.

Figure 1: Fitted home run talent distributions for the seasons 1990 to 2005. The top graph displays the median home run ability and the bottom graph displays the interquartile range of the talent distribution.

Figure 2: Standardized home run rates for Mickey Mantle and Hank Aaron plotted as a function of age. The lowess smooths show that the home run trajectories of the two players were significantly different.

7 Summing Up

The authors have proposed a useful hierarchical modeling framework and illustrated the potential benefits of Bayesian modeling in predicting future home run counts. But we believe the methods could be substantially improved by the proper adjustment of the home run rates, the inclusion of useful covariates, and more realistic modeling of the career trajectories. From the viewpoint of a baseball general manager, the prediction of a particular player's future performance is very important, and it seems that this prediction has to allow for the player's unique career trajectory pattern. For the problem of individual predictions, we don't believe this methodology will be very helpful, since all players of a particular fielding position are assumed to have the same trajectory and are lumped into the broad elite/non-elite classes. But we do believe that this general approach, with the changes described above, can be used to make helpful predictions of offensive performance.

References

Albert, J. (2009). "Is Roger Clemens' WHIP Trajectory Unusual?" Chance, 22: 8–22.
James, B. (1984). The 1984 Baseball Abstract. Ballantine Books.
Schell, M. (2005). Baseball's All-Time Best Sluggers: Adjusted Batting Performance from Strikeouts to Home Runs. Princeton University Press.
Tango, T. (2004). "Marcel the Monkey Forecasting System." http://www.tangotiger.net/archives/stud0346.shtml

Bayesian Analysis (2009) 4, Number 4, pp. 661–664

Comment on Article by Jensen et al.
Mark E. Glickman∗

I offer my congratulations to Jensen, McShane and Wyner (hereafter JMW) on their paper modeling home run frequencies of Major League Baseball (MLB) players. It is always refreshing to read such a clearly written, well-organized paper on a topic of interest to a broad audience and one that illustrates cutting-edge modeling and computational tools in Bayesian Statistics. It is also worth noting that the first author is becoming an accomplished researcher in quantitative aspects of baseball, most recently having developed complex statistical models for evaluating fielding (Jensen et al. 2009). The current paper adds to his growing and impressive list of work on Statistics in sports.
∗ Boston University School of Public Health, Boston, MA, mailto:mg@bu.edu
© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA424B

In the current paper, the authors develop and investigate a model for home run frequencies for MLB seasons from 1990 through 2005 based on publicly available data. The data contain player performance information aggregated by season, so examining within-season variation is not possible. Home run frequencies for a player within a season are modeled as binomial counts (out of the total number of at-bats, appropriately defined), and the probability of a home run during a season is a function of the player's position, team, and age. The authors make some interesting specific assumptions that result in a unique model. First, they posit that the effect of age on the log-odds of the probability of a home run follows a cubic B-spline relationship for a given field position. Second, they assume a latent categorization of each player in a given season as elite versus non-elite, essentially treating a player's home run frequency as a mixture of two binomial components with different probabilities. Third, the latent elite status for each player is assumed to follow a Markov process with transition probabilities that are common for all players at the given field position. The authors also investigate a generalization of their basic model in which the transition probabilities can vary by player through model components specific to players at that position. The entire model is fit via MCMC simulation from the posterior distribution, and performance of their approach is evaluated through measures that compare model predictions in 2006 to observed home run frequencies. They conclude that their basic model fares well against existing competitor approaches that are not nearly as sophisticated.

The authors deserve credit for constructing a model that is competitive with one that makes use of data obtained on a daily basis. It is also particularly impressive that their model predicts well given the paucity of covariate information. One can raise minor quibbles with the authors' approach, but many of the concerns are an artifact of the constraints on the data available to them. For example, the inability to account for within-season variation strikes me as a clear deficiency in modeling home run probabilities. Given that players are generally improving from year to year in their twenties, it is not unreasonable to speculate that some of this improvement is occurring within a season rather than between seasons. Because the data JMW use is aggregated by season, it is impossible to infer such changes. The authors also incorporate a team indicator in their model, which ostensibly is a proxy for playing half of the time in their own ballpark, though this does not account for minor artifacts such as within-season player trades. As JMW note, this team parameter may be difficult to interpret when it applies to a whole season of games. If individual game-specific data were available, then the impact of the actual ballpark could be incorporated into the model, which may have a profound effect on inferences.

My own bias is to wonder whether modeling and predicting home run frequencies is a question that baseball front office staff or other professionals really want answered.
While forecasting home run probabilities seems like an interesting theoretical question, various metrics to measure hitting rates might be of greater practical utility. The authors do mention at the conclusion of the paper their interest in pursuing such activities. I also found it curious that the expanded model involving Markov transition probabilities that varied by player produced worse predictions than the simpler model in which the transition probabilities were constrained to vary only by player position. This may suggest a model that does not sufficiently capture important features of the data, an expanded model that is too highly parameterized, or some combination of the two.

To me, the most interesting aspect of the paper is the decision to incorporate a latent indicator of elite status into the model, and the accompanying stochastic process. On the one hand, JMW are able to account for variation in home run rates and improve predictions by introducing a 2-state hidden Markov model (HMM). One clear benefit of incorporating this model component is that it allows answering questions about when certain players can be considered elite versus non-elite. On the other hand, I wonder whether a 2-state Markov model is the most appropriate and most flexible for predicting home run frequencies. The authors consider a HMM in which players at the same position share the same transition probabilities, and another in which the transition probabilities vary by player but are centered at position-specific distributions. In both cases, the size of the effect of being elite for all players at the specified position is the same. I realize that JMW are focused on keeping the model as simply parameterized as possible, but the question arises whether accuracy (especially predictive accuracy, one of the main implied goals of the paper) is being sacrificed. Given that all the parameters of the HMM are integrated out of the posterior distribution in making predictions, it is the structure of the HMM that is most crucial, and not inferences about any of the HMM parameters.

The authors' HMM assumes that players at any given time are in one of two states, once accounting for age, position and team. However, it strikes me that player effects (beyond the effect of age, position and team) more justifiably fall on a continuum. A natural way to modify JMW's model is to assume

\mathrm{logit}\,\theta_{ijkb} = \alpha_k + \beta_b + f_k(A_{ij}) + \delta_{ijk},  (1)

where \theta_{ijkb} is the home run probability for player i with home ballpark b in season j at position k; \alpha_k, \beta_b and f_k(A_{ij}) are as defined in JMW; and \delta_{ijk} is a player-specific effect following a stochastic process with a continuous state-space, such as

\delta_{ijk} \sim N(\delta_{i,j-1,k}, \psi^2),  (2)

where initial player effects may be assumed drawn from a common distribution centered at a position-specific model component,

\delta_{i1k} \sim N(\eta_k, \phi^2)  (3)

with position-specific effects \eta_k. This model assumes that, beyond the effects of ballpark, position and age, an individual player effect in a given season is drawn from a distribution centered at last season's mean, thus inducing a time-correlation particular to that player. Such an approach can represent trajectories of not only elite players, but also better-than-average players as well as worse-than-average players. Similar models for binomial data in a game/sports context have been examined by Fahrmeir and Tutz (1994) and Glickman (1999), among others, though these approaches do not include an additive spline component for age.
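To see the structure of (1)-(3) concretely, here is a minimal simulation sketch in Python (all parameter values are illustrative assumptions, not fitted quantities): a player-specific effect evolves as a Gaussian random walk across seasons and shifts the home run probability on the logit scale.

    import numpy as np

    rng = np.random.default_rng(1)

    n_seasons = 12
    eta_k, phi = -0.3, 0.4        # position-specific mean and sd for the initial effect (3)
    psi = 0.15                    # innovation sd of the random walk (2)
    alpha_k, beta_b = -3.2, 0.1   # position and ballpark effects (illustrative)

    # Initial effect (3), then the random-walk transitions (2).
    delta = np.empty(n_seasons)
    delta[0] = rng.normal(eta_k, phi)
    for j in range(1, n_seasons):
        delta[j] = rng.normal(delta[j - 1], psi)

    # Home run probability per season via (1); the age term is omitted for brevity.
    theta = 1.0 / (1.0 + np.exp(-(alpha_k + beta_b + delta)))
    hr = rng.binomial(500, theta)   # simulated home run counts in 500 at-bats per season

Because \delta_{ijk} moves in small continuous steps rather than jumping between two states, the simulated player can be slightly, moderately, or substantially better (or worse) than the position norm in any given season.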
Various changes to the assumptions in (2) and (3) could be considered, such as letting the innovation variance \psi^2 depend on player position (that is, \psi_k^2), letting the transition model be heavy-tailed, such as a t-distribution instead of a normal (which would account for occasional bursts of improvement in home run probability), or letting the prior variance \phi^2 depend on player position (that is, \phi_k^2). An advantage of a continuous state-space compared to a 2-state system is that it recognizes varying degrees of improvement and worsening over time beyond what is captured by age-specific effects. Substituting the HMM in the authors' framework with that in (2) should involve straightforward modifications to the MCMC algorithm, so the computational details ought to involve tractable calculations. Again, because the parameters of a continuous state-space model are integrated out of the posterior distribution to obtain predictive inferences, or even age-curve estimates, the richer structure compared to the 2-state HMM may result in more reliable inferences. The richer structure may also more appropriately calibrate the levels of uncertainty in predictions, which appear overly conservative as evidenced in Table 1 of their paper. Of course, one needs to fit such a model to the data to be convinced of such speculation.

Notwithstanding some of my suggestions for alternative directions the authors could take in further refining their model, I think that their approach makes an important contribution to a growing literature on sophisticated methods in analyzing sports data. Modeling the effect of age through a cubic B-spline is a nice feature of their approach, and accounting for time dependence in home run rates through a hidden Markov model is a novel addition, even though my feeling is that a continuous state-space Markov model may be more promising. I look forward to the continued success and insightful work from this productive group of researchers.

References

Fahrmeir, L. and Tutz, G. (1994). "Dynamic stochastic models for time-dependent ordered paired comparison systems." Journal of the American Statistical Association, 89: 1438–1449.
Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments." Applied Statistics, 48: 377–394.
Jensen, S. T., Shirley, K., and Wyner, A. J. (2009). "Bayesball: a Bayesian hierarchical model for evaluating fielding in major league baseball." Annals of Applied Statistics, 3: 491–520.

Bayesian Analysis (2009) 4, Number 4, pp. 665–668

Comment on Article by Jensen et al.
Fernando A. Quintana∗ and Peter Müller†

∗ Departamento de Estadística, Pontificia Universidad Católica de Chile, Chile, mailto:quintana@mat.puc.cl
† Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, TX, mailto:pmueller@mdanderson.org
© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA424C

1 Introduction

We congratulate Shane T. Jensen, Blakeley B. McShane and Abraham J. Wyner (henceforth JMW) for a very well written and interesting modeling and analysis of hitting performance for Major League Baseball players. JMW propose a hierarchical model for data extracted from the Lahman Baseball Database. They model the player/year-specific home run rate using covariate information such as the player's age, home ballpark, and position. The proposed approach successfully strikes a balance between parsimonious assumptions where detail does not matter and structure where it is important for the underlying decision problem. An interesting feature of the model is the time-dependence that is induced by assuming the existence of a hidden Markov chain that drives the transition of players between "elite" and "non-elite" conditions.
In the former case, JMW postulate that the home run rate is increased by a certain position-dependent quantity. The model is used to predict home run totals for the 2006 season, and the results are compared to some external methods (MARCEL and PECOTA). The comparison gives some mixed results, with the proposed method faring generally well compared to its competitors.

2 Some general comments

Inference for the Lahman baseball data raises a number of practical challenges. The data include records on over 2,000 players, but for many of them there is information for only a couple of years. In many cases there are several years with missing information. As usual in sports data, there is tremendous heterogeneity and imbalance among the experimental units (players). We suspect this is partly the reason why the focus is on predictions for a subset of players. However, this opens the question of whether the model actually provides a good fit for all the players. We believe an interesting challenge is to extend the modeling approach to larger subsets, and maybe all players. For such extended inference the model needs to be extended to properly reflect the increased heterogeneity across all players. We propose a possible approach below. Also, the inference focus would shift from prediction to more emphasis on an explanatory model.

Model (2) and the proposed variations have the interesting feature of incorporating in the home run rates \theta_{ij} an explicit dependence on player position k, home ballpark b, and a smooth position-specific age trajectory, expressed as a hypothesized linear combination on the logit scale. The smooth function of age seems to capture interesting nonlinear features of the evolution of home run rates over time, as seen in Figures 3 and 5. One may even venture the existence of an "optimal" age for hitting, and a natural decay in abilities with progressing age. In fact, such conclusions have been reached elsewhere, and even if not the target of this work, it is a nice feature of the analysis that the same kind of findings are uncovered.

The hidden Markov model for "elite" status is the model component that is responsible for introducing dependence across seasons for a given player. The extended model allows for player-specific transition parameters, i.e., individual trajectories for the binary elite indicator variables. Concretely, JMW assume the parameters (\nu^i_{00}, \nu^i_{11}) controlling these transitions to be a priori independent and Beta-distributed, with conditional independence across players sharing the same position k. These assumptions imply flexibility in the evolution of the {E_{ij}} elite indicators, which are well defined regardless of missing data patterns along the sequences of home runs. Looking at the results of the analysis, it is quite remarkable that a large number of players achieve elite status after only one or two major league seasons, as seen in Figure 2. Intuitively one would have expected a peak more likely around 3-5 years. JMW seem to be equally surprised at such findings, when they comment that the sum over years 2 through 11 still represents 75% of the cases considered.
Another consequence of the elite/non-elite model is that the effect on home run rates \theta_{ij} is only through a position-specific added term \alpha_k = \alpha_{k0}(1 - E_{ij}) + \alpha_{k1} E_{ij} on the logit scale. While this has the advantage of borrowing strength across players with the same position, it may not be flexible enough to capture highly heterogeneous home run profiles.

3 Extending the proposed approach

The latent elite indicator E_{ij} defines a mixture model for the observed home run totals. The use of E_{ij} is an elegant way to formalize inference about top players. The model balances parsimony with sufficient structure to achieve the desired inference. The authors correctly point out some of the remaining limitations. Perhaps the most important limitation is that the model reduces the heterogeneity of the population of all players to a mixture of only two homogeneous subpopulations. This is particularly of concern in the light of the underlying decision problem. The resulting inference only informs us about the probability of a player being in the elite group. Some evidence for more heterogeneity beyond the mixture of only two subpopulations is seen in Figure 4. The wide separation of the credible intervals suggests scope for intermediate performance groups in the model. The population of players is highly heterogeneous, but not in such a sharply bimodal fashion. It is also interesting to note in the same figure the almost preserved ordering across positions between elite and non-elite groups.

A minor extension of the model could generalize the mixture to a random partition into H subpopulations, which could help close the gap just pointed out. Each cluster could have a cluster-specific set of intercepts \alpha_{kh}, h = 0, \ldots, H-1, for the logistic regression prior (2) of player-season home run rates \theta_{ij}. As in JMW's model, the intercepts remain ordered, \alpha_{kh} \le \alpha_{k,h+1}, k = 1, \ldots, 9. This allows us to interpret the cluster labels h = 0, \ldots, H-1 as latent player performance. Formally, the model extension would replace (2) by

\mathrm{logit}(\theta_{ij}) = \alpha_{kh} + \beta_b + f_k(A_{ij}),  (1)

where \beta_b and f_k(A_{ij}) are as earlier, and h = E_{ij} is the imputed cluster membership for player i in season j. The prior for \alpha_k = (\alpha_{kh}, h = 0, \ldots, H-1) is similar to (9), now for the H-dimensional vector \alpha_k. The prior for the latent cluster membership E_{ij} remains as in (3), extended to transitions between H states. The number of transition parameters \nu_{rs} remains unchanged, with prior probability \nu_{01} for upgrades in performance level, \nu_{10} for downgrades, \nu_{00} for remaining in state E_{ij} = 0, and \nu_{11} for remaining in a performance state E > 0. As in (7), the transition probabilities are position-specific. The number of states H would itself be treated as unknown, with a geometric prior p(H) = (1-p)^{H-1} p and a hyperparameter p. The only additional step in the MCMC implementation is a transition probability to change H. We consider two transitions: the "birth" of an additional performance level by splitting an existing level h into two new levels, and the reverse "death" move. This could be implemented as a reversible jump move.

The generalized model defines a random partition of the player-years (ij) into performance clusters h = 0, \ldots, H-1. The unique features of this random partition model would be the ordering of the clusters and the dependence across j. Both features are naturally accommodated by the outlined model-based clustering.
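As a rough illustration of this extension, the sketch below (in Python, with hypothetical parameter values; the H-state transition structure is simplified to a stay/move-one-level rule rather than the full matrix of \nu_{rs} parameters) draws the number of performance levels H from the geometric prior and simulates one player's latent state sequence with ordered, state-specific intercepts:

    import numpy as np

    rng = np.random.default_rng(2)

    # Geometric prior on the number of performance levels: p(H) = (1-p)^(H-1) p.
    p = 0.5
    H = int(rng.geometric(p))          # H = 1, 2, 3, ...

    # Ordered cluster-specific intercepts for one position (illustrative values).
    alpha_k = np.sort(rng.normal(-3.0, 0.5, size=H))

    # Simplified dynamics: stay in the current level with high probability,
    # otherwise move one level up or down (clipped to the valid range).
    nu_stay = 0.8
    E = np.zeros(15, dtype=int)        # latent performance state per season
    for j in range(1, len(E)):
        if H == 1 or rng.random() < nu_stay:
            E[j] = E[j - 1]
        else:
            E[j] = int(np.clip(E[j - 1] + rng.choice([-1, 1]), 0, H - 1))

    logit_theta = alpha_k[E]           # the state-specific intercept entering (1)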
We see it as an interesting and challenging application of model-based clustering. In contrast to much of the recent Bayesian literature on clustering models, the use of standard clustering models such as the ubiquitous Polya urn would be inappropriate here. The Polya urn model does not naturally allow the desired ordering of cluster-specific parameters or the time-dependence of cluster membership indicators.

4 Final words

We realize the above proposal can be extended or modified in many different ways, the main point being the possibility of improving on the analysis and model proposed by JMW. Our aim here was not to criticize the model but to help improve it. We indeed think the hidden Markov component is a very nice feature, which, combined with a flexible extension, could motivate further analysis of the data under a more general framework.

Acknowledgments Fernando Quintana was partially funded by Fondecyt grant 1060729.

Bayesian Analysis (2009) 4, Number 4, pp. 669–674

Rejoinder
Shane T. Jensen, Blakeley B. McShane and Abraham J. Wyner

We thank each discussant for his insightful comments and suggestions for improvement. We are pleased by the positive reception of our current endeavor towards model-based prediction of hitting performance. It is our belief that academic statisticians can serve a leadership role in the transition of quantitative analysis of baseball from simple tabulations to sophisticated model-based approaches.

1 Alternative Models for Latent Variables

A clear theme of this discussion is the flexibility of the Bayesian hierarchical framework as a principled means for prediction in this application. Of course, the other side of that coin is that our model can always be improved by more sophisticated extensions. The discussants offer several excellent suggestions for improvements to our methodology. A first step in this effort is suggested by multiple discussants: extensions of the latent "elite" mixture model. These proposals are great directions for future research, and we briefly discuss the prospects of each below.

Albert & Birnbaum question our employment of a latent mixture model, citing the fact that these mixture components are not self-evident from the raw home-run rate distributions. However, they also note the presence of skewness and outliers. We argue that latent mixture models are a common strategy for addressing skewness and outliers. In fact, our original motivation for a latent mixture model was the observation that hitters with consistently high home run rates were over-shrunk in a model that did not allow for subpopulations of extreme home run performance.

Both Quintana & Müller and Glickman discuss the limitation of our mixture model to two latent states. In our original analysis, we experimented with the addition of a third latent state which was intended to capture players that showed inferior performance relative to their position. However, the estimated models that included this third state did not show any greater predictive power than the two-state model. Quintana & Müller suggest a more comprehensive amelioration of our mixture model: allowing the number of latent states to be unknown and estimated. Certainly, this proposal is the most natural extension of the current approach and would help address the concerns raised by the discussants about the imposed "elite" vs. "non-elite" framework.
The hurdle would be the implementation of this richer model, as the reversible-jump approach proposed by Quintana & Müller could be difficult in practice.

© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA424REJ

Glickman proposes a model extension that is further afield. Instead of a discrete state-space model, he proposes a latent state that evolves continuously in an autoregressive fashion. In our opinion, this continuous state-space model would perform well for players with a long and consistent history of performance. However, we are skeptical there would be enough autocorrelated signal for younger players with very little personal history. For these cases with sparser information, we believe our simpler model is better able to pool information between players.

We have a similar concern about Albert & Birnbaum's proposal to fit random effects for each player. We concede that players (at the same position) can have very different trajectories, as illustrated by their comparison of Mickey Mantle and Hank Aaron. However, although there is enough information to model players with long careers in this way, we suspect that these random effects would be too variable for players who have only played a few seasons. For such players, the enforced shrinkage of our model is beneficial. Furthermore, while the selection of Mantle and Aaron nicely illustrates the benefits of modeling trajectories individually, it also illustrates some of the pitfalls. Though Mantle and Aaron were both towering sluggers of their era, we contend that both players are unusually deviant from what is generally observed and their careers represent extreme points in the space of individual trajectories. Mantle suffered a precipitous decline due to debilitating injury while Aaron had an almost miraculously steady and lengthy career. Thus, we are not sure it is a criticism to point out that we would have failed to predict Aaron's unusual performance into his forties or Mantle's steep early decline, unaided by health information. For the purposes of prediction, discounting unusual individual career trajectories and being guided mainly by position is a sound strategy, and we remind the reader that center fielders like Mantle are more likely to experience sharp declines in production than corner outfielders like Aaron. That said, the random effects framework is a great idea, and we are currently investigating extending our model to allow more flexible trajectories within each position.

There are of course many other generalizations and improvements not raised by discussants which we will consider in future work. Most promising is the extension of the usual first-order Markov model to higher order or even variable order. This direction has the potential to more accurately model an individual player's trajectory.

2 Position and Other Potential Covariates

Beyond the latent mixture model, the discussants provide several suggestions for additional data and/or covariates that could further improve our predictions.
Specifically, Albert & Birnbaum suggest the Retrosheet database, which provides more detailed within-season information for each player. We agree that the additional detail within the Retrosheet database could improve our modeling efforts. One immediate advance, as proposed by Albert & Birnbaum, would be to divide each hitter's season into home versus away games, thus enabling the estimation of true ballpark effects. We would favor estimation of ballpark effects in this way rather than the use of external park factors, which is also proposed by Albert & Birnbaum. In our experience, external park factors are highly inconsistent from year to year and do not seem to contain much signal except in some extreme cases (e.g., Coors Field or Citizens Bank Park).

Albert & Birnbaum question the use of position as a covariate in our model, claiming that it is not immediately evident what information is being added by position. They are correct to assert that there is heterogeneity of home run talent within each position, but there is large variation in home run rates across positions, as can be seen in Figure 1. In fact, we perform an analysis of the variance of home run rates by the nine positions, seventeen years, and twenty-six ages in our dataset in Table 1.

Figure 1: Boxplots of empirical home run rates by position. Each point gives HR/AB for a given player-season for all player-seasons with 300 or more at-bats from 1990-2006.

Table 1: Analysis of Variance Table

  Source       DF     Sum Sq.   Mean Sq.   F Ratio    Prob > F
  Position      8     0.31486   0.03936    120.6345   <2e-16
  Year         16     0.05922   0.00370     11.3446   <2e-16
  Age          25     0.00670   0.00027      0.8214   0.7178
  Residuals  3801     1.24009   0.00033

Position accounts for 20% of the total variation in home run rates, far more than any other factor. These results suggest that position is a very informative covariate for home run ability.
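The decomposition in Table 1 is a standard sequential analysis of variance; the following is a minimal sketch of how it could be reproduced in Python with statsmodels (the data frame here is a synthetic stand-in for the real player-season data, and the factor ordering matters because Type I sums of squares are sequential):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Synthetic stand-in for the player-season data: one row per player-season,
    # hr_rate = HR/AB, with position, year, and age treated as categorical factors.
    rng = np.random.default_rng(0)
    n = 4000
    df = pd.DataFrame({
        "position": rng.choice(list("123456789"), size=n),  # nine fielding positions
        "year": rng.integers(1990, 2007, size=n),
        "age": rng.integers(20, 41, size=n),
    })
    df["hr_rate"] = 0.02 + 0.005 * (df["position"] == "3") + rng.normal(0, 0.018, size=n)

    model = smf.ols("hr_rate ~ C(position) + C(year) + C(age)", data=df).fit()
    anova_table = sm.stats.anova_lm(model, typ=1)   # Type I (sequential) sums of squares

    # Share of the total variation attributed to position (about 20% in Table 1):
    share = anova_table.loc["C(position)", "sum_sq"] / anova_table["sum_sq"].sum()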
In our view, position serves as a proxy variable for several player characteristics, such as body type and speed, that cannot be directly observed from the data. Scouts and managers incorporate many of these unobserved variables into their personnel decisions in terms of where to place players. By assigning a particular player to traditional power positions such as first base, managers are adding information about that player's propensity to hit home runs. We think this information is especially important for younger players who have less performance history upon which to base predictions.

Albert & Birnbaum also point out that our model does not address major shifts in hitting performance between different eras in baseball. We do not argue the point, as it was not the goal of our paper (though we note that Table 1 shows that the year factor accounts for a modest 3.6% of the variance in home run rates). Our priority is the prediction of future hitting performance, which motivated our focus on the current era. The comparison of hitting performance in different eras is also an interesting question, and has been addressed in the past with sophisticated Bayesian approaches (Berry et al. 1999). We did investigate, somewhat indirectly, the possible effects of different eras on our predictions. We fit our full model on a larger dataset consisting of all seasons from 1970 to 2005, in addition to our presented analysis based on seasons from 1990 to 2005. We saw very little difference in the predictions between these two analyses, suggesting that any large-scale changes in hitting dynamics over the past forty years do not have a major impact on future hitting predictions.

Albert & Birnbaum also suggest using at-bats as a covariate for the modeling of home run rates. This is a good suggestion and we have investigated the modeling of at-bats as a means for improving the prediction of hitting totals. However, we need to correct one statement made by Albert & Birnbaum: we do not assume that each player's 2006 at-bats are the same as the at-bats in the previous season. Rather, we scale the predictions of hitting rates from our model (and the two external methods) by the actual 2006 at-bat totals in our comparisons.

3 Focus on Prediction

Glickman suggests that home run totals may not be the most interesting outcome to people in baseball. We certainly agree that home runs are not the best measure of overall hitting performance, and we emphasize that our methodology can be adapted to any other hitting event. Home runs were chosen for illustration since we believe that most readers have a good intuition about the scale and variation of home run totals. We also have experimented with a multinomial extension of our procedure that would model each hitting outcome (i.e., singles, doubles, etc.) simultaneously, and this remains an area of future research.

More generally, Albert & Birnbaum call for greater focus on model interpretation. Despite our emphasis on prediction, there are elements of our model that are interesting in their own right. The position-specific aging curves provide an interesting contrast in the aging process between players at these different positions. Our "elite" versus "non-elite" indicators also provide a means for separating out consistently over-performing players relative to their position.

Quintana & Müller also inquire about the predictive power of our model for all players, not just the subset of players examined in our analysis. Our primary motivation was to have a set of common players for comparison with the external methods. However, we concede that the players excluded from our analysis probably represent an even tougher challenge for prediction. Albert & Birnbaum also suggest that extra insight would be gained from a case-by-case exploration and comparison of our predictions. To this end, we have made available the entire set of our predictions for the 2006 season at the following website: http://stat.wharton.upenn.edu/~stjensen/research/predictions.2006.xlsx

References

Berry, S. M., Reese, S., and Larkey, P. D. (1999). "Bridging Different Eras in Sports." Journal of the American Statistical Association, 94: 661–686.

Bayesian Analysis (2009) 4, Number 4, pp. 675–706

Bayesian Inference for Directional Conditionally Autoregressive Models
Minjung Kyung∗ and Sujit K. Ghosh†

Abstract. Counts or averages over arbitrary regions are often analyzed using conditionally autoregressive (CAR) models. The neighborhoods within CAR models are generally determined using only the inter-distances or boundaries between the sub-regions. To accommodate spatial variations that may depend on directions, a new class of models is developed using different weights given to neighbors in different directions. By accounting for such spatial anisotropy, the proposed model generalizes the usual CAR model that assigns equal weight to all directions.
Within a fully hierarchical Bayesian framework, the posterior distributions of the parameters are derived using conjugate and non-informative priors. Efficient Markov chain Monte Carlo (MCMC) sampling algorithms are provided to generate samples from the marginal posterior distribution of the parameters. Simulation studies are presented to evaluate the performance of the estimators and are used to compare results with traditional CAR models. Finally, the method is illustrated using data sets on local crime frequencies in Columbus, OH and on the elevated blood lead levels of children under the age of 72 months observed in Virginia counties for the year 2000.

Keywords: Anisotropy; Bayesian estimation; Conditionally autoregressive models; Lattice data; Spatial analysis.

1 Introduction

In many studies, counts or averages over arbitrary regions, known as lattice or areal data (Cressie 1993), are observed and spatial analysis is performed. Given a set of geographical regions, observations collected over regions nearer to each other tend to have similar characteristics, as compared to distant regions. In geography, this feature is known as Tobler's First Law (Miller 2004). From a statistical perspective, this feature is attributed to the fact that the autocorrelation between pairs of regions tends to be higher for regions near one another than for those farther apart. Thus, this spatial process observed over a lattice or a set of irregular regions is usually modeled using autoregressive models.

In general, given a set of sub-regions S_1, \ldots, S_n, we consider a generalized linear model for the aggregated responses, Y_i = Y(S_i), as

E[Y \mid Z] = g(Z) \quad \text{and} \quad Z = \mu + \eta,  (1)

where Y = (Y_1, \ldots, Y_n) = (Y(S_1), \ldots, Y(S_n)), Z = (Z_1, \ldots, Z_n) = (Z(S_1), \ldots, Z(S_n)), \mu = (\mu_1, \ldots, \mu_n) = (\mu(S_1), \ldots, \mu(S_n)) and \eta = (\eta_1, \ldots, \eta_n) = (\eta(S_1), \ldots, \eta(S_n)). Here, g(\cdot) is a suitable link function, \mu represents a vector of large-scale variations (or trends over geographical regions) and \eta denotes a vector of small-scale variations (or spatial random effects) with mean 0 and variance-covariance matrix \Sigma.

∗ Department of Statistics, University of Florida, Gainesville, FL, mailto:kyung@stat.ufl.edu
† Department of Statistics, North Carolina State University, Raleigh, NC, mailto:ghosh@stat.ncsu.edu
© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA425

Usually, the large-scale variations, the \mu_i's, are modeled as a deterministic function of some explanatory variables (e.g., latitudes, longitudes and other area-level covariates) using a parametric or semiparametric regression model (see van der Linde et al. 1995) involving a finite dimensional parameter \beta. However, a more difficult issue is to develop suitable models for the spatial random effects \eta_i's, as they are spatially correlated and model specifications are required to satisfy the positive definiteness condition of the induced covariance structure. Popular approaches to estimate such spatial covariances are based on choosing suitable parametric forms so that the n × n covariance matrix \Sigma = \Sigma(\omega) is a deterministic function of a finite dimensional parameter \omega, and then \omega is estimated from data. It is essential that any such deterministic function should lead to a positive definite matrix for any sample size n and for all allowable parameter values \omega.
For example, several geostatistical models are available for point-referenced observations assuming that the spatial process is weakly stationary and isotropic (see Cressie 1993). Several extensions to model nonstationary and anisotropic processes have also been recently developed (see Higdon 1998; Higdon et al. 1999; Fuentes and Smith 2001; Fuentes 2002, 2005; Paciorek and Schervish 2006; Hughes-Oliver et al. 2009). Once a valid model for \mu and \eta is specified, parameter estimates can be obtained using maximum likelihood methods, weighted least squares methods, or the posterior distribution of (\beta, \omega) (see Schabenberger and Gotway 2005). Once the point-referenced data are aggregated to the sub-regions (S_i's), the process representing the aggregated data is modeled using integrals of a continuous spatial process (Journel and Huijbregts 1978). In this paper, the focus is on the estimation of \omega with the model chosen for \Sigma.

In practice, there are two distinct approaches to developing models for spatial covariance based on areal data. A suitably aggregated geostatistical model directly specifies a deterministic function of the elements of the \Sigma matrix. In contrast, conditionally autoregressive models involve specifying a deterministic function of the elements of the inverse of the covariance, \Sigma^{-1}(\omega) (e.g., see Besag 1974; Besag and Kooperberg 1995). There have been several attempts to explore the possible connections between these two approaches to spatial modeling (e.g., see Griffith and Csillag 1993; Rue and Tjelmeland 2002; Hrafnkelsson and Cressie 2003). Recently, Song et al. (2008) proposed that Gaussian geostatistical models can be approximately represented by Gaussian Markov random fields (GMRFs), and vice versa, by using spectral densities. However, so far most of the GMRFs available in the literature do not specifically take into account the anisotropic nature of areal data.

In practice, statistical practitioners are accustomed to the exploration of relationships among variables, modeling these relationships with regression and classification models, testing hypotheses about regression and treatment effects, developing meaningful contrasts, and so forth (Schabenberger and Gotway 2005). For these spatial linear models, we usually assume a correlated relationship among sub-regions and study how a particular region is influenced by its "neighboring regions" (Cliff and Ord 1981). Therefore, we consider generalized linear mixed models for the areal aggregate data. In these models, the latent spatial process, the Z_i's, can be treated as a random effect, and to model it, conditionally autoregressive (CAR) models (Besag 1974, 1975; Cressie and Chan 1989) and simultaneously autoregressive (SAR) models (Ord 1975) have been used widely. Gaussian CAR models have been used as random effects within generalized mixed effects models (Breslow and Clayton 1993; Clayton and Kaldor 1987). The Gaussian CAR process has the merit that, under fairly general regularity conditions (e.g., positivity conditions), the lower-dimensional conditional Gaussian distributions uniquely determine the joint Gaussianity of the spatial CAR process; thus, the maximum likelihood (ML) and the Bayesian estimates can be easily obtained. However, one of the major limitations of the CAR model is that the neighbors are formed using some form of a distance metric and the effect of direction is completely ignored.
In recent years, there have been some attempts to use different CAR models for different parts of the region. For instance, Reich et al. (2007) presented a novel model for periodontal disease and used separate CAR models for separate jaws. White and Ghosh (2008) used a stochastic parameter within the CAR framework to determine the effects of the neighbors. Nevertheless, if the underlying spatial process is anisotropic, the magnitude of autocorrelation between the neighbors might be different in different directions. This limitation serves as our main motivation, and an extension of the regular CAR process is proposed that can capture such inherent anisotropy.

In this article, we focus on developing and exploring more flexible models for the spatial random effects \eta_i's, and the newly proposed spatial process will be termed the directional CAR (DCAR) model. In Section 2, we define the new spatial process and present statistical inferences for the parameters based on samples obtained from the posterior distribution of the parameters using suitable Markov chain Monte Carlo (MCMC) methods. In Section 3, the finite sample performance of the Bayesian estimators is explored using simulated data, and the newly proposed DCAR models are compared to the regular CAR models in terms of popular information theoretic criteria and various tests. In Section 4, the proposed method is demonstrated and compared with the regular CAR model using data sets on crime frequencies in Columbus, OH and on elevated blood lead levels of children under the age of 72 months observed in Virginia in the year 2000. Finally, in Section 5, some possible extensions of the DCAR model are discussed.

2 Directional CAR models

In this section, we develop a new model for the latent spatial process, the Z_i's, described in (1). For simpler illustration and notational simplicity, we assume that the S_i are sub-regions in a two-dimensional space, i.e., S_i \subseteq \mathbb{R}^2, \forall i. However, the proposed model and associated statistical inference presented in this article can easily be extended to higher dimensional data. First, we consider how to define a neighbor structure that depends on the directions between centroids for any pair of sub-regions. Let s_i = (s_{1i}, s_{2i}) be the centroid of the sub-region S_i, where s_{1i} corresponds to the horizontal coordinate (x-axis) and s_{2i} corresponds to the vertical coordinate (y-axis).

Figure 1: The angle (in radians) \alpha_{ij} between sub-regions S_i and S_j.

The angle (in radians) between S_i and S_j is defined as

\alpha_{ij} = \alpha(S_i, S_j) =
\begin{cases}
\left| \tan^{-1}\!\left( \dfrac{s_{2j} - s_{2i}}{s_{1j} - s_{1i}} \right) \right| & \text{if } s_{2j} - s_{2i} \ge 0 \\
-\left( \pi - \left| \tan^{-1}\!\left( \dfrac{s_{2j} - s_{2i}}{s_{1j} - s_{1i}} \right) \right| \right) & \text{if } s_{2j} - s_{2i} < 0
\end{cases}

for all j \ne i. We consider directions of neighbors from the centroid of each sub-region S_i. For example, in Figure 1, S_j is in the north-east (NE) region of S_i and hence \alpha(S_i, S_j) is in [0, \pi/2). Let N_i represent the set of indices (j's) of neighbors of the ith region S_i that are based on some form of distance metric (say, as in a regular CAR model). We can now create new sub-neighborhoods, for each i, as follows:

N_{i1} = \{ j : j \in N_i,\ 0 \le \alpha_{ij} < \pi/2 \},
N_{i2} = \{ j : j \in N_i,\ \pi/2 \le \alpha_{ij} < \pi \},
N_{i3} = \{ j : j \in N_i,\ \pi \le \alpha_{ij} < 3\pi/2 \},
N_{i4} = \{ j : j \in N_i,\ 3\pi/2 \le \alpha_{ij} < 2\pi \}.

These directional neighborhoods should be chosen carefully so that, for each i, they form a clique. Recall that a clique is any set of sites which either consists of a single site or else in which every site is a neighbor of every other site in the set (Besag 1974).
This would allow us to show the existence of the spatial process by using the Hammersley-Clifford Theorem (Besag 1974, pp. 197-198) and to derive the finite dimensional joint distribution of the process using only a set of (lower dimensional) full conditional distributions. For instance, if j \in N_{i1}, then it should be ensured that i \in N_{j3}. For the above four neighbor sets, we can combine each pair of diagonally opposite neighbor sets to form a new neighborhood. That is, we can create N^*_{i1} = N_{i1} \cup N_{i3} and N^*_{i2} = N_{i2} \cup N_{i4} for i = 1, \ldots, n. Now it is easy to check that if j \in N^*_{i1}, then i \in N^*_{j1}. Thus, we redefine two subsets of N_i as follows:

N^*_{i1} = \{ j : j \in N_i \text{ and } (0 \le \alpha_{ij} < \pi/2 \text{ or } \pi \le \alpha_{ij} < 3\pi/2) \},
N^*_{i2} = \{ j : j \in N_i \text{ and } (\pi/2 \le \alpha_{ij} < \pi \text{ or } 3\pi/2 \le \alpha_{ij} < 2\pi) \}.  (2)

Then each of N^*_{i1} and N^*_{i2} forms a clique, and it can be shown that N_i = N^*_{i1} \cup N^*_{i2}.

A centroid of the sub-region S_i might not be given or available in some situations, for example, when neighbor relationships are defined via adjacencies instead of distances between centroids. In this case, the directions of neighbors for each sub-region S_i are not clear. One suggestion in this situation is that we might define the directions of neighbors intuitively based on the direction of the adjacencies; this topic needs further study. Throughout this paper, however, we assume that we can define the directions of neighbors for each sub-region. Before fitting a DCAR model, we would need to define these directional neighborhoods, just as we need to define the CAR weights before fitting a CAR model. Note that, with the directional adjacency defined, we can easily rotate the direction category boundaries while maintaining the clique structure. For example, we can easily assign different weights to the neighbors in the north-south direction compared to those in the east-west direction.

The above scheme of creating new neighborhoods based on the inter-angles \alpha_{ij} can be extended beyond just two sub-neighborhoods so that each new sub-neighborhood forms a clique. For example, we can extend the directional cliques to 4 sub-sets of neighborhoods as

N^*_{i1} = \{ j : j \in N_i \text{ and } (0 \le \alpha_{ij} < \pi/4 \text{ or } \pi \le \alpha_{ij} < 5\pi/4) \},
N^*_{i2} = \{ j : j \in N_i \text{ and } (\pi/4 \le \alpha_{ij} < \pi/2 \text{ or } 5\pi/4 \le \alpha_{ij} < 3\pi/2) \},
N^*_{i3} = \{ j : j \in N_i \text{ and } (\pi/2 \le \alpha_{ij} < 3\pi/4 \text{ or } 3\pi/2 \le \alpha_{ij} < 7\pi/4) \},
N^*_{i4} = \{ j : j \in N_i \text{ and } (3\pi/4 \le \alpha_{ij} < \pi \text{ or } 7\pi/4 \le \alpha_{ij} < 2\pi) \}.

However, it should be noted that anisotropic specifications for the geostatistical covariance functions are quite different from the directional specification of neighborhood cliques used to define the inverse of the covariance. In this regard, the directional adjustments within the CAR framework allow the anisotropy parameters to capture the local (neighboring) directional effects, whereas the anisotropy parameters of a geostatistical model generally capture the overall global directional effects. Finally, it is possible to increase the number of sub-neighborhoods beyond 2 or 4. However, we cautiously note that if we keep increasing the number of sub-neighborhoods, the number of parameters increases whereas the amount of observations available within each sub-neighborhood decreases. Thus, we need to restrict the number of sub-neighborhoods by introducing some form of penalty term (e.g., via the prior distributions of the anisotropy parameters) and use some form of information criterion to choose the number of sub-neighborhoods. This is an important but open issue within our DCAR framework. Hence, for the rest of the article, for simplicity, we restrict our attention to the case with only two sub-neighborhoods as described in (2).
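As a sketch of the construction in (2), the following Python fragment computes the inter-angles from centroid coordinates and splits a given neighbor set into the two combined directional cliques (the centroid array and neighbor lists are hypothetical inputs; we use atan2 wrapped to [0, 2\pi), which matches the quadrant bookkeeping above):

    import numpy as np

    def inter_angle(s_i, s_j):
        # Angle alpha_ij in [0, 2*pi) from centroid s_i to centroid s_j.
        dy, dx = s_j[1] - s_i[1], s_j[0] - s_i[0]
        return np.arctan2(dy, dx) % (2 * np.pi)

    def directional_cliques(i, neighbors, centroids):
        # Split N_i into the combined sets N*_i1 and N*_i2 of equation (2).
        n_star_1, n_star_2 = [], []
        for j in neighbors[i]:
            quadrant = int(inter_angle(centroids[i], centroids[j]) // (np.pi / 2))
            if quadrant in (0, 2):      # NE and SW quadrants are diagonally opposite
                n_star_1.append(j)
            else:                       # NW and SE quadrants
                n_star_2.append(j)
        return n_star_1, n_star_2

    # Hypothetical usage: four unit-square centroids, all mutual neighbors.
    centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    neighbors = {i: [j for j in range(4) if j != i] for i in range(4)}
    print(directional_cliques(0, neighbors, centroids))

By construction, if j lands in N^*_{i1} then i lands in N^*_{j1}, so the symmetry required for the clique property holds.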
Based on the subsets of the associated neighborhoods, N^*_{i1} and N^*_{i2}, we can construct directional weight matrices W^{(1)} = ((w^{(1)}_{ij})) and W^{(2)} = ((w^{(2)}_{ij})), respectively. For instance, we define the directional proximity matrices by w^{(1)}_{ij} = 1 if j \in N^*_{i1} and w^{(2)}_{ij} = 1 if j \in N^*_{i2}. Notice that W = W^{(1)} + W^{(2)} reproduces the commonly used proximity matrix as in a regular CAR model.

In order to model the large-scale variations, we assume a canonical generalized linear model, \mu_i = x_i^T \beta, where the x_i's are vectors of predictor variables specific to the sub-region S_i and \beta = (\beta_1, \ldots, \beta_q)^T is a vector of regression coefficients. Notice that nonlinear regression functions, including smoothing splines and polynomials, can be re-written in the above canonical form (e.g., see Wahba 1977; van der Linde et al. 1995). From model (1) it follows that

E[Z] = X\beta \quad \text{and} \quad \mathrm{Var}[Z] = \Sigma(\omega),  (3)

where \omega denotes the vector of spatial autocorrelation parameters and other variance components. Notice that, along with (3), the model (1) can be used for discrete responses using a generalized linear model framework (Schabenberger and Gotway 2005, p. 353).

Now, we develop a model for \Sigma(\omega) that accounts for anisotropy. Let \delta_1 and \delta_2 denote the directional spatial effects corresponding to the N^*_{i1}'s and N^*_{i2}'s, respectively. We define the distribution of Z_i conditional on the rest of the Z_j's for j \ne i using only the first two moments:

E[Z_i \mid Z_j = z_j,\ j \ne i,\ x_i] = x_i^T \beta + \sum_{k=1}^{2} \delta_k \sum_{j=1}^{n} w^{(k)}_{ij} (z_j - x_j^T \beta),  (4)
\mathrm{Var}[Z_i \mid Z_j = z_j,\ j \ne i,\ x_i] = \frac{\sigma^2}{m_i},

where w^{(k)}_{ij} \ge 0 and w^{(k)}_{ii} = 0 for k = 1, 2, and m_i = \sum_{j=1}^{n} w_{ij}.

The joint distribution based on a given set of full conditional distributions can be derived using Brook's Lemma (Brook 1964) provided the positivity condition is satisfied (e.g., see Besag 1974; Besag and Kooperberg 1995). For the DCAR model, by construction, it follows that each of N^*_{i1} and N^*_{i2} defined in (2) forms a clique for i = 1, \ldots, n. Thus, it follows from the Hammersley-Clifford Theorem that the latent spatial process Z_i of a DCAR model exists and is a Markov Random Field (MRF). Therefore, we can derive the exact joint distribution of the DCAR process, the Z_i's, by assuming that each of the full conditional distributions is a Gaussian distribution.

2.1 Gaussian DCAR models

The Gaussian CAR model has been used widely as a suitable model for the latent spatial process Z_i. In this section, to derive the joint distribution of the Z_i's from a set of given full conditional distributions, we use Brook's Lemma. Assume that the full conditional distributions of the Z_i's are given as

Z_i \mid Z_j = z_j,\ j \ne i,\ x_i \sim N\!\left( x_i^T \beta + \sum_{k=1}^{2} \delta_k \sum_{j=1}^{n} w^{(k)}_{ij} (z_j - x_j^T \beta),\ \frac{\sigma^2}{m_i} \right),  (5)

where the w^{(k)}_{ij} for k = 1, 2 are the directional weights. It can be shown that this latent spatial DCAR process is a MRF. Thus, by Brook's Lemma and the Hammersley-Clifford Theorem, it follows that the finite dimensional joint distribution is a multivariate Gaussian distribution given by

Z \sim N_n\!\left( X\beta,\ \sigma^2 \left( I - \delta_1 W^{(1)} - \delta_2 W^{(2)} \right)^{-1} D \right),

where Z = (Z_1, \ldots, Z_n)^T and D = \mathrm{diag}(1/m_1, \ldots, 1/m_n). For simplicity, we denote the variance-covariance matrix of the DCAR process by \Sigma_Z \equiv \sigma^2 (I - \delta_1 W^{(1)} - \delta_2 W^{(2)})^{-1} D.
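The full conditional moments in (4)-(5) translate directly into code. Below is a minimal sketch (the weight matrices W1 and W2, design matrix X, coefficients beta, and current values z are hypothetical inputs, e.g., as produced by the clique construction above) that computes the conditional mean and variance of Z_i given the rest and assembles the joint covariance \Sigma_Z:

    import numpy as np

    def dcar_full_conditional(i, z, X, beta, W1, W2, delta, sigma2):
        # Conditional mean and variance of Z_i given Z_{-i}, as in (4)-(5).
        resid = z - X @ beta
        m_i = (W1 + W2)[i].sum()                 # m_i = sum_j w_ij
        mean = X[i] @ beta + delta[0] * (W1[i] @ resid) + delta[1] * (W2[i] @ resid)
        return mean, sigma2 / m_i

    def dcar_joint_covariance(W1, W2, delta, sigma2):
        # Sigma_Z = sigma^2 (I - delta_1 W1 - delta_2 W2)^{-1} D, with D = diag(1/m_i).
        n = W1.shape[0]
        m = (W1 + W2).sum(axis=1)
        D = np.diag(1.0 / m)
        A = np.eye(n) - delta[0] * W1 - delta[1] * W2
        return sigma2 * np.linalg.solve(A, D)    # solves A X = D, i.e., A^{-1} D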
For a proper Gaussian model, the variance-covariance matrix Σ_Z needs to be positive definite. First, notice that if we use the standardized directional proximity matrices W̃^(k) = ((w̃^(k)_{ij})) with w̃^(k)_{ij} = w^(k)_{ij}/m_i for k = 1, 2, it can easily be shown that Σ_Z is symmetric. Thus, the finite dimensional joint distribution is given by

\[
Z \;\sim\; N_n\!\left( X\beta,\ \sigma^2 \left( I - \delta_1 \tilde{W}^{(1)} - \delta_2 \tilde{W}^{(2)} \right)^{-1} D \right). \qquad (6)
\]

Next, we derive a sufficient condition ensuring that the variance-covariance matrix Σ_Z is non-singular, hence making it a positive definite matrix. As D is a diagonal matrix, we only require suitable conditions on the W̃^(k) and on the δ_k for k = 1, 2. The following results provide a sufficient condition:

Lemma 1. Let A = (a_{ij}) be an n × n symmetric matrix. If a_{ii} > Σ_{j≠i} |a_{ij}| for all i, then A is positive definite.

Proof: See Ortega (1987, p. 226). ✷

Lemma 2. Let A = I − Σ_{k=1}^K δ_k W̃^(k) be an n × n matrix, where Σ_{k=1}^K W̃^(k) is a symmetric matrix with non-negative entries, zero diagonal, and each row sum equal to unity. If max_{1≤k≤K} |δ_k| < 1, then the matrix A is positive definite.

Proof: Let a_{ij} denote the (i, j)th element of A. Notice that for each i = 1, 2, ..., n, we have

\[
\sum_{j \ne i} |a_{ij}| \;=\; \sum_{j \ne i} \Big| \sum_{k=1}^{K} \delta_k \tilde{w}_{ij}^{(k)} \Big|
\;\le\; \sum_{k=1}^{K} |\delta_k| \sum_{j \ne i} \tilde{w}_{ij}^{(k)}
\;<\; \sum_{k=1}^{K} \sum_{j \ne i} \tilde{w}_{ij}^{(k)} \;=\; 1 \;=\; a_{ii}.
\]

Hence it follows from Lemma 1 that A is positive definite. ✷

Notice that when δ_1 = δ_2 = ρ, DCAR(δ_1, δ_2, σ²) reduces to CAR(ρ, σ²); hence the regular CAR model is nested within the DCAR model, provided we use a prior that puts positive mass on the line δ_1 = δ_2.
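In practice, Lemma 2 is what guarantees that Σ_Z is a valid covariance matrix whenever the weights are row-standardized and max(|δ_1|, |δ_2|) < 1. The following sketch (with hypothetical 0/1 directional adjacency matrices W1 and W2) builds Σ_Z and confirms positive definiteness via a Cholesky factorization of the precision matrix; it is an illustration of the condition, not code from the paper.

```python
import numpy as np

def dcar_covariance(W1, W2, d1, d2, sigma2):
    """Sigma_Z = sigma^2 (I - d1*W1t - d2*W2t)^{-1} D with row-standardized
    directional weights; raises LinAlgError if the precision is not positive
    definite (it always succeeds when max(|d1|, |d2|) < 1, by Lemma 2)."""
    m = (W1 + W2).sum(axis=1)                  # m_i = number of neighbors of i
    W1t = W1 / m[:, None]                      # standardized weights w_ij^(1) / m_i
    W2t = W2 / m[:, None]
    A = np.eye(len(m)) - d1 * W1t - d2 * W2t   # A*(delta)
    Q = (np.diag(m) @ A) / sigma2              # precision D^{-1} A*(delta) / sigma^2
    np.linalg.cholesky((Q + Q.T) / 2.0)        # symmetrize against round-off
    return sigma2 * np.linalg.solve(A, np.diag(1.0 / m))
```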
The next step of our statistical analysis is to estimate the unknown parameters of the DCAR model based on the observed responses and the explanatory variables, which enables us to stabilize estimates within the regions using the estimated spatial correlation. In the next section, we discuss Bayesian methods for these spatial autoregressive models.

2.2 Parameter estimation using Bayesian methods

With the Gaussian DCAR model of the latent spatial process Z_i, we describe how to estimate parameters and associated measures of uncertainty using Bayesian methods. Bayesian inference about unknown parameters has been considered for statistical models whose likelihood functions are analytically intractable, because of possibly high-dimensional parameters or because the likelihood function involves high-dimensional integration (e.g., when the Y_i are discrete valued). In the Gaussian DCAR model, because the likelihood function may involve high-dimensional integration, posterior estimation is not easy to achieve analytically. In particular, the joint posterior density of δ_1 and δ_2 does not have a closed form. Also, when a generalized mixed model is used with random spatial effects following a DCAR model, analytical exploration of the posterior distribution becomes almost prohibitive. Thus, the Gaussian DCAR model leads to an intractable posterior density, and numerical methods are needed for inference about the unknown parameters.

Let θ = (β^T, σ², δ^T)^T, where β = (β_1, ..., β_p) and δ = (δ_1, δ_2). The posterior density π(θ|z) is proportional to the product of the prior distribution π(θ) of the unknown parameters and the sampling density of the data Z given θ. Therefore, by using Markov chain Monte Carlo (MCMC) methods, we can obtain samples from the path of Markov chains whose stationary density is the posterior density.

For the DCAR process, under the joint multivariate Gaussian distribution, the likelihood function is given by

\[
L(\theta \mid X, z) \;\propto\; \big| \sigma^2 A^*(\delta)^{-1} D \big|^{-1/2} \exp\!\Big\{ -\frac{1}{2\sigma^2} (z - X\beta)^T D^{-1} A^*(\delta) (z - X\beta) \Big\}, \qquad (7)
\]

where A*(δ) = I − δ_1 W̃^(1) − δ_2 W̃^(2) and D = diag(1/m_1, ..., 1/m_n). We consider a class of prior distributions that ensure that the posterior distribution is proper even when the priors are improper. A class of such prior distributions is given by

\[
\begin{aligned}
\pi(\beta \mid \sigma^2, \delta) &\equiv 1, \\
\pi(\sigma^2 \mid \delta) &\propto \Big( \frac{1}{\sigma^2} \Big)^{a+1} e^{-b/\sigma^2}, \qquad a, b > 0, \ \text{and} \\
\pi(\delta) &= \frac{1}{4}\, I[\max(|\delta_1|, |\delta_2|) < 1].
\end{aligned}
\]

As the prior distribution of β is not proper, we need to ensure that the posterior is proper. Given the prior distributions above, the joint posterior distribution of θ can be shown to have the following form:

\[
\begin{aligned}
\pi(\theta \mid X, z) &\propto L(\theta \mid X, z)\, \pi(\beta \mid \sigma^2, \delta)\, \pi(\sigma^2 \mid \delta)\, \pi(\delta) \\
&\propto (\sigma^2)^{-n/2 - a - 1} \big| A^*(\delta)^{-1} D \big|^{-1/2} \exp\!\Big\{ -\frac{1}{2\sigma^2} \big[ (z - X\beta)^T D^{-1} A^*(\delta) (z - X\beta) + 2b \big] \Big\} \\
&\quad \times I[\max(|\delta_1|, |\delta_2|) < 1]. \qquad (8)
\end{aligned}
\]

Here, if δ were known, then as in the usual regression model the posterior distribution of β given σ² and δ is Gaussian (with a somewhat complicated mean and variance), and the posterior of σ² given δ is an inverse gamma distribution. For the Gaussian DCAR model, assume that the design matrix X is of full rank and the variance-covariance matrix Σ_Z is positive definite. Then, based on the sufficient conditions given by Sun et al. (1999), one can easily deduce that the posterior distribution of θ|z is proper.

We can also consider conditionally conjugate priors for β and σ². Given the value of δ, the likelihood function of the DCAR model (7) is that of a regression model with mean Xβ and variance-covariance σ²A*(δ)^{-1}D. Thus, a conditional conjugate prior given δ can be specified in two stages according to

\[
\beta \mid \sigma^2, \delta \sim N\big( \beta_0,\ \sigma^2 A^*(\delta)^{-1} D \big), \qquad \sigma^2 \mid \delta \sim IG(a_0, b_0).
\]

Given the conditionally conjugate prior distributions and a marginal prior for δ, the joint posterior distribution of θ is given by

\[
\begin{aligned}
\pi(\theta \mid X, z) &\propto L(\theta \mid X, z)\, \pi(\beta \mid \sigma^2, \delta)\, \pi(\sigma^2 \mid \delta)\, \pi(\delta) \\
&\propto (\sigma^2)^{-(n/2 + p/2 + a_0) - 1} \times I[\max(|\delta_1|, |\delta_2|) < 1] \\
&\quad \times \exp\!\Big\{ -\frac{1}{2\sigma^2} \Big[ \tilde{s} + (\beta - \tilde{\beta})^T \big( X^T D^{-1} A^*(\delta) X + (A^*(\delta)^{-1} D)^{-1} \big) (\beta - \tilde{\beta}) \Big] \Big\}, \qquad (9)
\end{aligned}
\]

where

\[
\begin{aligned}
\tilde{\beta} &= \big( X^T D^{-1} A^*(\delta) X + (A^*(\delta)^{-1} D)^{-1} \big)^{-1} \big( (A^*(\delta)^{-1} D)^{-1} \beta_0 + X^T D^{-1} A^*(\delta) X \hat{\beta} \big), \\
\tilde{s} &= \hat{\sigma}^2 (n - p) + 2 b_0 + (\beta_0 - \tilde{\beta})^T (A^*(\delta)^{-1} D)^{-1} \beta_0 + (\hat{\beta} - \tilde{\beta})^T X^T D^{-1} A^*(\delta) X \hat{\beta}, \\
\hat{\beta} &= \big( X^T D^{-1} A^*(\delta) X \big)^{-1} X^T D^{-1} A^*(\delta) z, \\
\hat{\sigma}^2 &= \frac{(z - X\hat{\beta})^T D^{-1} A^*(\delta) (z - X\hat{\beta})}{n - p},
\end{aligned}
\]

and p is the dimension of β. Then the conditional posterior distributions of the parameters are given by

\[
\beta \mid \sigma^2, \delta, Z \sim N\big( \tilde{\beta},\ \sigma^2 \big( X^T D^{-1} A^*(\delta) X + (A^*(\delta)^{-1} D)^{-1} \big)^{-1} \big)
\quad \text{and} \quad
\sigma^2 \mid \delta, Z \sim IG\big( n/2 + p/2 + a_0,\ \tilde{s}/2 \big).
\]

However, as discussed above, there is no closed form for the posterior distribution of δ. Therefore, we need numerical methods to obtain the posterior summaries (e.g., suitable moments) of θ. In order to obtain samples from the path of the Markov chains, we need to consider starting values for each parameter. In the DCAR model of the latent spatial process, the parameter space Θ of θ can be defined as Θ = { θ : β ∈ R^p, σ² ∈ (0, ∞), max(|δ_1|, |δ_2|) < 1 }. Within Θ, we can choose several starting points and run parallel chains. After burn-in, we obtain approximate samples from the posterior density π(θ|z).
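A compact way to implement this scheme is Metropolis-within-Gibbs: β and σ² can be drawn from closed-form conditionals, while δ = (δ_1, δ_2) is updated by a random-walk Metropolis step constrained to max(|δ_1|, |δ_2|) < 1. The sketch below does this under the improper prior class above; it is one standard sampling scheme consistent with the model, not necessarily the authors' exact implementation, and the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_like(z, X, beta, sig2, d1, d2, W1t, W2t, D_inv):
    """log L(theta | X, z) from (7), up to an additive constant."""
    n = len(z)
    A = np.eye(n) - d1 * W1t - d2 * W2t
    r = z - X @ beta
    _, logdet = np.linalg.slogdet(D_inv @ A)          # log |D^{-1} A*(delta)|
    return -0.5 * n * np.log(sig2) + 0.5 * logdet - r @ D_inv @ A @ r / (2 * sig2)

def gibbs_dcar(z, X, W1t, W2t, m, iters=3000, burn=1000, step=0.05, a=0.01, b=0.01):
    n, p = X.shape
    D_inv = np.diag(m)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    sig2, d1, d2 = float(np.var(z - X @ beta)), 0.0, 0.0
    draws = []
    for t in range(iters):
        A = np.eye(n) - d1 * W1t - d2 * W2t
        Q = X.T @ D_inv @ A @ X                        # beta | sig2, delta, z is Gaussian
        mean = np.linalg.solve(Q, X.T @ D_inv @ A @ z)
        cov = sig2 * np.linalg.inv(Q)
        beta = rng.multivariate_normal(mean, (cov + cov.T) / 2.0)
        r = z - X @ beta
        sig2 = 1.0 / rng.gamma(n / 2 + a, 1.0 / (r @ D_inv @ A @ r / 2 + b))
        for k in (0, 1):                               # Metropolis step for each delta_k
            prop = [d1, d2]
            prop[k] += step * rng.standard_normal()
            if max(abs(prop[0]), abs(prop[1])) < 1.0:
                diff = (log_like(z, X, beta, sig2, prop[0], prop[1], W1t, W2t, D_inv)
                        - log_like(z, X, beta, sig2, d1, d2, W1t, W2t, D_inv))
                if np.log(rng.uniform()) < diff:
                    d1, d2 = prop
        if t >= burn:
            draws.append((beta.copy(), sig2, d1, d2))
    return draws
```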
2.3 A simulation study

In order to study the finite sample performance of the Bayesian estimators, we conduct a simulation study. We focus on the behavior of the Gaussian DCAR model of the latent spatial process Z = (Z_1, ..., Z_n) in (6). Mardia and Marshall (1984) conducted a simulation study with a 10 × 10 unit spacing lattice, based on samples generated from a normal distribution with mean zero and a spherical covariance model; the sampling distribution of the MLEs of the parameters was studied based on 300 Monte Carlo samples. Following a similar setup, for our simulation study we selected a 15 × 15 unit spacing lattice and generated N = 100 data sets, each of size n = 225, from a multivariate normal distribution with mean Xβ and variance-covariance matrix σ²A*(δ)^{-1}D, where A*(δ) = I − δ_1 W̃^(1) − δ_2 W̃^(2). The X matrix was chosen to consist of the coordinates of latitude and longitude, in addition to a column of ones to represent an intercept. The true values of the parameters were fixed at β = (1, −1, 2)^T and σ² = 2. For the above DCAR model, to study the behavior of the posterior distributions of δ_1 and δ_2, we consider four different test cases of δ:

Case 1: δ_1 = −0.95 and δ_2 = −0.97
Case 2: δ_1 = −0.30 and δ_2 = 0.95
Case 3: δ_1 = −0.95 and δ_2 = 0.97
Case 4: δ_1 = 0.95 and δ_2 = 0.93

Following Lemma 2, we restrict our choice of δ_1 and δ_2 to satisfy the sufficient condition max_{1≤k≤K} |δ_k| < 1. For values of δ_1 and/or δ_2 near the boundary, such as −0.95 and 0.93, there might be some unexpected behavior of the sampling distribution. We therefore generate data sets with two negative near-boundary weights (Case 1) and two positive near-boundary weights (Case 4) in order to explore extreme cases. In our applications we have found that such extreme values are quite common within CAR or DCAR models (see Section 4); Besag and Kooperberg (1995) also discussed similar situations. We also consider a setting with one positive near-boundary weight in one direction and one negative near-boundary weight in the other direction, an extreme case of differing weights (Case 3). Finally, we assign a somewhat mild weight to one direction and a positive boundary weight to the other in order to study the behavior of a strong positive spatial effect in one direction only (Case 2). Thus, with extreme boundary values of δ, we study the sampling distributions of the directional spatial effect parameters.

As discussed in Section 2.2, for the Bayesian estimates we consider three sets of initial values and run three parallel chains. We use a burn-in of B = 1000 for each of the three chains, followed by M = 2000 iterations. This scheme produces a sample of 6000 (correlated) values from the joint posterior distribution of the parameter vector. As Bayesian estimation involves computationally intensive MCMC methods, we studied the finite sample performance of the Bayes estimates with only N = 100 repetitions. The (coordinatewise) posterior median of the parameter vector is used as the Bayes estimate because of its robustness compared to the posterior mean, especially when the posterior distribution is skewed. Also, for each coordinate of the parameter vector, we computed a 95% equal-tail credible set (CS) using the 2.5 and 97.5 percentiles as an interval estimate. We then computed the 95% nominal coverage probability (CP) using the rule

\[
95\%\ \mathrm{CP} = \frac{1}{N} \sum_{i=1}^{N} I(\theta_0 \in 95\%\ \mathrm{CS}).
\]
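Generating data from this design requires the lattice, the directional weight matrices, and a draw from N(Xβ, σ²A*(δ)^{-1}D). The sketch below is one such generator, splitting rook neighbors on the unit lattice into east-west and north-south cliques (a concrete choice consistent with (2), since those neighbor angles are 0, π and π/2, 3π/2, respectively); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

side = 15                                            # 15 x 15 unit-spacing lattice
lat, lon = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
coords = np.column_stack([lat.ravel(), lon.ravel()]).astype(float)
n = side * side                                      # n = 225
X = np.column_stack([np.ones(n), coords])            # intercept, latitude, longitude
beta, sig2 = np.array([1.0, -1.0, 2.0]), 2.0         # true parameter values

diff = coords[:, None, :] - coords[None, :, :]
adj = np.abs(diff).sum(axis=2) == 1                  # rook neighbors at unit distance
W1 = (adj & (diff[:, :, 0] == 0)).astype(float)      # east-west neighbors
W2 = (adj & (diff[:, :, 1] == 0)).astype(float)      # north-south neighbors

def simulate(d1, d2):
    m = (W1 + W2).sum(axis=1)
    A = np.eye(n) - d1 * (W1 / m[:, None]) - d2 * (W2 / m[:, None])
    Sigma = sig2 * np.linalg.solve(A, np.diag(1.0 / m))
    return rng.multivariate_normal(X @ beta, (Sigma + Sigma.T) / 2.0)

z = simulate(-0.95, -0.97)                           # one data set from Case 1
```

Repeating `simulate` N = 100 times and running the sampler sketched earlier on each replicate yields the posterior medians, equal-tail credible sets, and the coverage proportion defined above.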
We summarize the sampling distribution of the parameters numerically in Table 1. The bias is the empirical bias of the posterior medians relative to the true value, the Monte Carlo Standard Error (MCSE) is the standard error of the posterior medians, the p-value is based on testing the average of the posterior medians against the true value, and 95% CP is the percentage of times the true value was included within the 95% CS. All of these summaries are based on 100 replications.

Case 1 (δ_1 = −0.95, δ_2 = −0.97):
  Parameter   True    bias    MCSE    P-value   95% CP
  δ_1        −0.95    0.23    0.22    0.29      0.99
  δ_2        −0.97    0.27    0.20    0.19      1.00
  σ²          2.00    0.17    0.20    0.40      0.92

Case 2 (δ_1 = −0.30, δ_2 = 0.95):
  Parameter   True    bias    MCSE    P-value   95% CP
  δ_1        −0.30    0.15    0.32    0.64      0.97
  δ_2         0.95   −0.21    0.18    0.24      0.93
  σ²          2.00    0.11    0.21    0.60      0.92

Case 3 (δ_1 = −0.95, δ_2 = 0.97):
  Parameter   True    bias    MCSE    P-value   95% CP
  δ_1        −0.95    0.20    0.15    0.18      0.89
  δ_2         0.97   −0.18    0.17    0.28      0.93
  σ²          2.00    0.26    0.21    0.21      0.84

Case 4 (δ_1 = 0.95, δ_2 = 0.93):
  Parameter   True    bias    MCSE    P-value   95% CP
  δ_1         0.95   −0.01    0.03    0.74      0.98
  δ_2         0.93    0.03    0.03    0.28      0.89
  σ²          2.00   −0.05    0.37    0.89      0.67

Table 1: Finite sample performance of posterior estimates of the parameters of DCAR models (based on 100 replications).

We observe that for all four choices of δ there are no significant biases in the Bayesian estimates with MC repetitions of size 100 (e.g., all p-values are bigger than 0.18). When the true δ_1 or δ_2 is positive, the bias of the Bayesian estimate tends to be slightly negative, except for δ_2 in Case 4. For Case 3, the nominal 95% coverage probabilities (CPs) of δ_1 and δ_2 are away from 0.95 and the MCSEs are not small. Also, from Figure 2, we observe that for the extreme situation with one positive (δ_2) and one negative (δ_1) weight, the posterior estimates seem to recover the true values with somewhat less precision. However, the distribution of δ is skewed when the true value is near the boundary, which may explain the somewhat larger MCSE values. For Case 4, the nominal 95% CPs of δ_1 and δ_2 are 0.98 and 0.89, respectively, and the biases and MCSEs are smaller than for any other case. Thus, Bayesian methods based on posterior medians tend to estimate the true values quite well even when the true values of δ_1 and δ_2 are near the positive boundary. The higher-than-nominal coverage probability of the Bayesian interval estimates based on equal-tail CSs may be due to the skewness of the sampling distribution that we have observed in our empirical studies. Alternatively, a 95% HPD interval can be obtained using the algorithm of Hyndman (1996). It was observed that the posterior distributions of δ_1 and δ_2 are skewed to the right for the negative extreme value and to the left for the positive extreme value. The bias of the posterior median of σ² tends to be slightly negative for Case 4, but for the other cases the biases are slightly positive. Again, note that these biases are not statistically significant (all four p-values are greater than 0.21). Thus, in these cases, Bayesian estimation tends to recover the true value quite well. However, for Case 4, the MCSE of the Bayesian estimate of σ² is bigger than for the other cases, and the nominal coverage is only 0.67 as compared to the targeted value of 0.95. For the estimation of the β's, the estimates had small MCSEs and did not have any significant bias, except for β_0 in Case 4. Also, we observed that the posterior distributions of the β's are fairly symmetric (results not reported due to lack of space, but available in Kyung 2006).
Figure 2: Histograms of the 100 posterior estimates of δ_1 and δ_2 based on DCAR process data with true δ_1 = −0.95 and δ_2 = 0.97 (posterior medians of M = 6000 Gibbs samples, with 100 replications).

3 Comparing the performances of DCAR and CAR models using an information criterion

In Section 2.1 we showed that the DCAR model is a generalization of the CAR model; hence the DCAR model is expected to provide a reasonable fit to a given data set, possibly at the cost of some loss of efficiency, especially when the data arise from a CAR model. So it is of interest to explore the loss (gain) in efficiency of a DCAR model over the regular CAR model when the data arise from a CAR (DCAR) model. There are several criteria (e.g., information criteria, cross-validation measures, hypothesis tests, etc.) for comparing performance across competing models. Given the popularity of the Deviance Information Criterion (DIC) originally proposed by Spiegelhalter et al. (2002), we use DIC to compare the performance of fitting DCAR and CAR models to data generated from a CAR model and then also from a DCAR model. Another advantage of using DIC is that this criterion is already available within the WinBUGS software.

To calculate DIC, first we define the deviance as

\[
D(\theta) = -2 \log L(\theta \mid X, z) + 2 \log h(z),
\]

where h(z) is a standardizing function of the data z only and remains the same across all competing models. In general it is difficult to find the normalizing function h(z) for models involving spatial random effects. However, given that we are interested in the differences of DIC between models, we may use the following definition of deviance by dropping the h(z) term:

\[
D(\theta) = -2 \log L(\theta \mid X, z),
\]

as the normalizing term cancels anyway when we take the difference between two DICs with the same h function. Based on the deviance, the effective number of parameters, denoted by p_D, is defined as

\[
p_D = E[D(\theta) \mid z] - D\big( E[\theta \mid z] \big) = \bar{D} - D(\bar{\theta}),
\]

where θ̄ = E[θ|z] is the posterior mean of θ. The DIC is then defined as

\[
DIC = D(\bar{\theta}) + 2 p_D.
\]

In theory, we select the model with the smaller DIC value. DIC and p_D are easily computed by MCMC methods. We consider two cases, based on data generated from (i) a CAR model and (ii) a DCAR model.

3.1 Results based on data generated from CAR model

With samples from a Gaussian CAR process, we fit both CAR and DCAR models. Notice that if there is no directional difference in the observed spatial data, then the estimate of δ_1 should be very similar to the estimate of δ_2. Thus we might expect very similar estimates for δ_1 and δ_2 based on a sample from a CAR process
because CAR(ρ, σ²) = DCAR(ρ, ρ, σ²). In fact, it might be a good idea to use a prior on (δ_1, δ_2) which places positive mass on the diagonal line δ_1 = δ_2, so as to capture a CAR model with positive probability. To study the performance of the model as a function of the key parameter ρ of the CAR model, we consider five different values of ρ: Case 1: ρ = −0.95; Case 2: ρ = −0.25; Case 3: ρ = 0; Case 4: ρ = 0.25; and Case 5: ρ = 0.95. For each case, we generate 100 data sets, each of sample size n = 225, from a CAR model with ρ taking one of the above five values, while the other parameters (β and σ) are fixed at their true values (see Section 2.3).

First we compare the posterior estimates of ρ when a CAR model is fitted and those of δ_1 and δ_2 when a DCAR model is fitted to data generated from one of the five CAR models. In Table 2 we compare the bias (of the posterior median), the Monte Carlo Standard Error (MCSE) of the posterior median, the p-value for testing the null hypothesis that the (MC) average of the 100 posterior medians equals the true value, and the 95% nominal coverage probability (CP) of the 95% posterior intervals constructed from the 2.5% and 97.5% percentiles of the posterior distribution of the parameters.

Case ρ = −0.95:                    Case ρ = 0.00:                    Case ρ = 0.25:
        True   bias  MCSE  P-val  CP          True   bias  MCSE  P-val  CP          True   bias  MCSE  P-val  CP
  δ_1  −0.95   0.01  0.02  0.67  1.00    δ_1   0.00  −0.02  0.30  0.95  1.00    δ_1   0.25  −0.10  0.31  0.85  0.99
  δ_2  −0.95   0.01  0.02  0.66  1.00    δ_2   0.00   0.00  0.26  1.00  1.00    δ_2   0.25  −0.09  0.26  0.72  1.00
  ρ    −0.95   0.02  0.00  0.00  1.00    ρ     0.00  −0.01  0.30  0.97  1.00    ρ     0.25   0.01  0.32  0.98  0.98

Case ρ = −0.25:                    Case ρ = 0.95:
        True   bias  MCSE  P-val  CP          True   bias  MCSE  P-val  CP
  δ_1  −0.25   0.14  0.29  0.63  1.00    δ_1   0.95  −0.15  0.11  0.17  1.00
  δ_2  −0.25   0.15  0.26  0.57  1.00    δ_2   0.95  −0.06  0.08  0.45  1.00
  ρ    −0.25   0.08  0.33  0.81  0.99    ρ     0.95  −0.06  0.04  0.14  1.00

Table 2: Performance of Bayesian estimates of δ_1, δ_2 (DCAR) and ρ (CAR) based on data generated from a CAR model.

From the results (presented under the rows ρ in Table 2) based on the posterior estimates (median and 95% equal-tail intervals) obtained by fitting a CAR model, we observe that for all cases the biases of ρ are slightly positive except in Case 5, but such empirical biases are not statistically significant (all p-values being bigger than 0.06). The nominal 95% CPs of ρ are higher than their targeted value of 0.95 for all cases. For all cases, the biases of the posterior medians of σ² tend to be slightly positive. However, again we found that these empirical biases are not significant in any case, all calculated p-values being at least 0.5 (results not reported). Finally, with regard to the performance of the posterior medians of the β's, we did not find any significant biases (all p-values being bigger than 0.32). We have not presented the details of these results (for σ² and the β's) due to lack of space, but detailed results are available online in the doctoral thesis of the first author (Kyung 2006).

DGP              FIT: CAR   FIT: DCAR   P-value
CAR(ρ = −0.95)   100%       0%          0.64
CAR(ρ = −0.25)   51%        49%         0.50
CAR(ρ = 0.00)    34%        66%         0.47
CAR(ρ = 0.25)    55%        45%         0.50
CAR(ρ = 0.95)    100%       0%          0.68

Table 3: Comparison of DIC between CAR and DCAR models with data sets from a CAR process (PCD = percentage of correct decisions).

Next we compare the results obtained by fitting a DCAR model to the same data sets generated from the five CAR models (as described above).
Performance of DCAR model under mis-specification

In Table 2 we also present the biases of the posterior medians of δ_1 and δ_2, their MCSEs, p-values (for testing δ_1 = δ_2 = ρ), and the 95% CPs of the 95% equal-tailed posterior intervals when a DCAR model is fitted to each of the same 100 data sets generated from each of the five CAR models (as described in the previous section). Although DCAR is not the true model that generated the data in these cases, we observe that the biases of the posterior medians of δ_1 and δ_2 are marginally positive for Cases 1, 2 and 3, whereas the biases are slightly negative for Cases 4 and 5. However, these biases are not statistically significant (all p-values being greater than 0.09). This indicates that even when the data arise from a CAR model, the posterior medians of the δ's can well approximate the true ρ value of the CAR model. As expected, the MCSEs of the posterior medians of the δ's are slightly larger than those of ρ, but such loss in efficiency from fitting an incorrect model is not prominent either. Finally, in terms of maintaining the nominal coverage of the posterior intervals, the results from both model fits are comparable. Thus, in summary, when we fit a DCAR model to data sets generated from a CAR model, the posterior estimates obtained from the DCAR model are approximately unbiased and there is no big loss in efficiency.

In addition to comparing the parameter estimates based on fitting both CAR and DCAR models to data generated from a CAR model, we have also used DIC (as defined earlier in this section) to compare the overall performance of these models. By a data generating process (DGP) we mean the true model that generates data for our simulation study, and we use the notation FIT to denote the model that was fitted to a simulated data set. So in this case CAR is the DGP, while FIT can be either a CAR or a DCAR model. We measure the performance of the FIT by computing the percentage of correct decisions (PCD) made by DIC in selecting one of the two models. In other words, PCD represents the percentage of times the DIC value based on fitting a CAR model is lower than that based on fitting a DCAR model to the same sets of data obtained from a CAR model. We also report the p-values from a two-sample test that compares the average values of the DICs (over 100 replications) between the CAR and DCAR models when the true data are generated from a CAR model.

DGP                      CAR(ρ = −0.95)  CAR(ρ = −0.25)  CAR(ρ = 0.00)  CAR(ρ = 0.25)  CAR(ρ = 0.95)
E[Var(DCAR)/Var(CAR)]    1.009 (0.000)   0.999 (0.003)   0.998 (0.003)  0.999 (0.004)  1.009 (0.001)

Table 4: The average ratio of posterior variances for DCAR and CAR models, Average(Var(DCAR)/Var(CAR)), based on the Gibbs sampler from data sets of the CAR process.

From Table 3, we observe that when the DGP is a CAR with ρ = −0.95 (negative boundary) or ρ = 0.95 (positive boundary), the PCD based on DIC is 100%, which means that DIC correctly identifies the CAR model every time when data are generated from a CAR model with ρ = ±0.95. However, for the other cases the PCDs are not that strongly in favor of a CAR model (when compared against a DCAR model), even though the data arise from a CAR model. Thus, when the spatial dependence in a CAR model is weak, DIC will not be able to distinguish between the CAR and DCAR models. Such a phenomenon is expected, as DCAR nests CAR when δ_1 = δ_2 = ρ; this is further evidenced by the p-values, which suggest that we cannot reject the null hypothesis that the DIC values are the same for both models.
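Given posterior draws, the DIC comparison amounts to evaluating the deviance at each draw and at the posterior mean. A minimal sketch (any model-specific log-likelihood can be plugged in; the h(z) term is dropped, as it cancels in differences):

```python
import numpy as np

def dic(draws, loglike):
    """DIC = D(theta_bar) + 2*p_D, with D(theta) = -2 log L(theta | X, z).
    `draws` is a list of parameter tuples; `loglike` maps one tuple to the
    log-likelihood. Returns (DIC, p_D)."""
    dev = np.array([-2.0 * loglike(th) for th in draws])
    d_bar = dev.mean()                                       # E[D(theta) | z]
    theta_bar = tuple(np.mean([np.asarray(c) for c in comp], axis=0)
                      for comp in zip(*draws))               # posterior mean of theta
    d_hat = -2.0 * loglike(theta_bar)                        # D(posterior mean)
    p_d = d_bar - d_hat                                      # effective number of parameters
    return d_hat + 2.0 * p_d, p_d
```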
For a measure of relative efficiency, the average ratios of posterior variances for the DCAR and CAR models based on data sets from the CAR processes are reported in Table 4. From Table 4, we observe that there is no difference between the posterior variances for the DCAR and CAR models based on the Gibbs sampler from data sets of each CAR process. Again, such a phenomenon is expected, as DCAR nests CAR when δ_1 = δ_2 = ρ.

3.2 Results based on data generated from DCAR model

In this section, our DGP (data generating process, as defined earlier) is a DCAR model, while the FIT is again either a CAR or a DCAR model. Here again we use the data sets generated from the four DCAR models of Section 2.3, but we now fit a CAR model in addition to the DCAR models fitted earlier. In this case it is of interest to find out how the posterior estimates of ρ of a CAR model behave, especially when the data arise from a DCAR model with well separated δ values (e.g., Cases 2 and 3 of Section 2.3).

Performance of CAR model under mis-specification

Based on generating 100 data sets from different DCAR models, we observed that the posterior median of ρ seems to estimate the average of the true values of δ_1 and δ_2 of the DCAR models. Therefore, we define a pseudo-true value of ρ as ρ_0 = (δ_1 + δ_2)/2 and compare the performance of the posterior median of ρ to this so-called "pseudo-true" value ρ_0. In Table 5 we list the empirical bias of the posterior median of ρ, the MCSE of these posterior medians, the p-value for testing the null hypothesis ρ = ρ_0, and the 95% nominal CP based on the 95% equal-tail posterior intervals of ρ, when a CAR model is fitted to data from the four DCAR models (as described in Section 2.3).

DGP: DCAR                        ρ_0 = (δ_1 + δ_2)/2   bias   MCSE   P-value   95% CP
δ_1 = −0.95, δ_2 = −0.97         −0.96                 0.25   0.13   0.09      1.00
δ_1 = −0.30, δ_2 = 0.95           0.33                 0.07   0.31   0.83      0.99
δ_1 = −0.95, δ_2 = 0.97           0.01                 0.03   0.38   0.94      0.98
δ_1 = 0.95, δ_2 = 0.93            0.94                 0.03   0.02   0.13      0.85

Table 5: Fitting a CAR model to data generated from the DCAR process.

It is clear from the results reported in this table that the ρ parameter of the CAR model attempts to estimate (δ_1 + δ_2)/2 of the DCAR model and will thus lead to a misleading conclusion, especially when the δ's are of opposite signs but with large absolute values (e.g., Cases 2 and 3 of Section 2.3). In other words, when there are strong spatial dependencies, possibly in orthogonal directions, the CAR model fails to capture such dependencies, as opposed to a DCAR model. On the other hand, when the DGP is a CAR model, the DCAR model still provides a very reasonable approximation to that DGP (see the results in Section 3.1). This is one of the main advantages of fitting a DCAR model over a regular CAR model.

In Table 6, we compare the performance of DIC in choosing the correct model (which is a DCAR model in this case) when we fitted both CAR and DCAR models. The numbers reported in this table have interpretations similar to those in Table 3. As expected, for Cases 1 and 4 (where δ_1 ≈ δ_2), DIC more often chooses the CAR model as the best parsimonious model even when the data arise from a DCAR model.
However, the p-values (for testing the null hypothesis of no difference in average DIC values) indicate that these DIC values are not statistically significantly different. On the other hand, when the DCAR model is sharply different from a CAR model (e.g., in Cases 2 and 3, where δ_1 δ_2 < 0), DIC correctly picks DCAR as the better model more frequently (e.g., 99% of the time in Case 3) as compared to a CAR model. Moreover, the p-values suggest that in these two cases the DIC values obtained by fitting the CAR and DCAR models are significantly different, in favor of the DCAR model, when the DGP is indeed a DCAR model.

DGP                               FIT: CAR   FIT: DCAR   P-value
DCAR(δ_1 = −0.95, δ_2 = −0.97)    91%        9%          0.59
DCAR(δ_1 = −0.30, δ_2 = 0.95)     30%        70%         0.03
DCAR(δ_1 = −0.95, δ_2 = 0.97)     1%         99%         0.01
DCAR(δ_1 = 0.95, δ_2 = 0.93)      56%        44%         0.73

Table 6: Comparison of DIC between CAR and DCAR models with data sets from the DCAR process (PCD = percentage of correct decisions).

The average ratios of posterior variances for the DCAR and CAR models based on data sets from the DCAR process are reported in Table 7 as a measure of relative efficiency.

DGP                               Average(Var(DCAR)/Var(CAR))
DCAR(δ_1 = −0.95, δ_2 = −0.97)    1.002 (0.003)
DCAR(δ_1 = −0.30, δ_2 = 0.95)     0.994 (0.007)
DCAR(δ_1 = −0.95, δ_2 = 0.97)     0.983 (0.010)
DCAR(δ_1 = 0.95, δ_2 = 0.93)      1.041 (0.247)

Table 7: The average ratio of posterior variances for DCAR and CAR models, Average(Var(DCAR)/Var(CAR)), based on the Gibbs sampler from data sets of the DCAR process.

From Table 7, we observe that for Cases 1 and 4 (where δ_1 ≈ δ_2) there is no difference between the posterior variances for the DCAR and CAR models based on the Gibbs sampler. Also, for Case 2, the posterior variances are not different for the DCAR and CAR models. However, for the extreme case (Case 3), the posterior variances of the DCAR model are smaller than those of the CAR model. Thus, when there are strong spatial dependencies, possibly in orthogonal directions, the DCAR model captures such dependencies more precisely than a CAR model.

From our extensive simulation studies we can draw the following fairly general conclusions: (i) DCAR models provide a reasonably good fit and approximately unbiased parameter estimates even when the data arise from a CAR model; (ii) CAR models cannot provide an adequate fit for data sets arising from a DCAR model, especially when there are strong spatial dependencies in opposite directions; (iii) DIC performs reasonably well in choosing a parsimonious model when CAR and DCAR models are compared.

4 Data analysis

We illustrate the fitting of DCAR and CAR models using real data sets. For each data set, we consider a linear regression model with iid errors and with correlated errors (modeled by CAR and DCAR processes). We obtain Gibbs samples of ρ, σ², β = (β_0, β_1, β_2)^T and δ = (δ_1, δ_2) under the different modeling assumptions. We consider the following models for Z_i = X_i β + ε_i, i = 1, ..., n:

Model 1. ε_i ∼ N(0, σ²): iid errors
Model 2. ε ∼ N(0, σ²(I − ρW̃)^{-1}D): CAR errors
Model 3. ε ∼ N(0, σ²(I − δ_1 W^(1) − δ_2 W^(2))^{-1}D): DCAR errors,

where Z_i = f(Y_i) and f(·) is a transformation of the response Y_i. In addition to using DIC to compare the models with CAR and DCAR error structures, we also computed a cross-validation measure, the leave-one-out mean squared predictive error (MSPE), defined as

\[
\mathrm{MSPE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{-i})^2,
\]

where ŷ_{−i} = E(Y_i | y_{−i}) is the posterior predictive mean of Y_i obtained by fitting the model to a reduced data set consisting of all (n − 1) observations, leaving out the ith observation y_i.
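Computationally, the leave-one-out MSPE just wraps a refit around each held-out observation. A sketch follows; the `fit_and_predict` callback, which should return the posterior predictive mean from refitting the chosen model (iid, CAR, or DCAR errors), is hypothetical.

```python
import numpy as np

def loo_mspe(y, fit_and_predict):
    """MSPE = (1/n) * sum_i (y_i - yhat_{-i})^2, where yhat_{-i} comes from a
    model refit to all observations except the i-th."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        pred[i] = fit_and_predict(train, i)   # E(Y_i | y_{-i}) under the model
    return float(np.mean((np.asarray(y) - pred) ** 2))
```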
4.1 Crime distribution in Columbus, Ohio

We illustrate the performance of fitting CAR and DCAR models on a real data set on the crime distribution in Columbus, Ohio, collected during the year 1980. The original data set can be found in Table 12.1 of Anselin (1988, p. 189). Using this interesting data set, Anselin (1988) illustrated the presence of separate levels of spatial dependencies by fitting two separate regression curves with simultaneous autoregressive (SAR) error models for the east and west sides of the city. The author concluded that when a SAR error model is used, there exists structural instability in terms of the regression models. In this paper, we fit our proposed models to this data set; each model is a single regression curve, but allows spatial anisotropy in the errors by modeling them as a CAR or DCAR process.

The data set consists of observations collected in 49 contiguous Planning Neighborhoods of Columbus, Ohio. Neighborhoods correspond to census tracts, or aggregates of a small number of census tracts. In this data set, the crime variable represents the total number of residential burglaries and vehicle thefts per thousand households (henceforth denoted by Y_i for the ith neighborhood). As possible predictors for the crime variable, we use the income level and housing value for each of these 49 neighborhoods. The income and housing values are measured in thousands of dollars.

Figure 3: The crime distribution of 49 neighborhoods in Columbus, OH, and the correlogram of the deviance residuals after fitting a Poisson regression model.

As part of our preliminary exploratory data analysis, we plot in Figure 3 the crime counts divided into 5 intervals based on 20% quantiles. During our initial analysis we observed that Y_4 and Y_17 have extremely small values and hence could possibly be eliminated as outliers or incorrectly recorded values (these two values were below the 2.5% percentile of the Y_i's). For the rest of the analysis, we use the remaining n = 47 neighborhoods. From the map in Figure 3 we observe that crime frequencies seem relatively higher in the NW/SE direction than in the orthogonal direction, though such differences in the crime distribution are not strikingly evident from this plot. From the estimated directional spatial correlogram in Figure 3, it appears that the spatial correlations are not very strong. However, as the distance between neighbors increases, the estimated spatial correlations differ across directions. As there might be hidden effects of differing directional spatial correlation, we assume a Gaussian DCAR spatial structure.
As our response (the crime variable) is a count variable, we assume that Y_i ∼ Poisson(λ_i) for i = 1, ..., n. Also, let x_{1i} and x_{2i} represent the housing value and the income, both in thousands of dollars, so that x_i = (1, x_{1i}, x_{2i})^T represents the intercept and predictors for neighborhood i. We consider three over-dispersed Poisson regression models using the latent variables Z_i as follows:

\[
Y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log(\lambda_i) = Z_i = x_i^T\beta + \epsilon_i, \qquad \beta = (\beta_0, \beta_1, \beta_2)^T, \quad i = 1, \ldots, n.
\]

Posterior estimates consisting of the posterior median (denoted by Est. in the table) and the posterior standard deviation (denoted by Std.Err. in the table) of the parameters under these models are displayed in Table 8.

             iid                  CAR                  DCAR
Parameter    Est.     Std.Err.    Est.     Std.Err.    Est.     Std.Err.
ρ            –        –           0.974    0.021       –        –
δ_1          –        –           –        –           0.962    0.031
δ_2          –        –           –        –           0.960    0.032
σ²           2.576    –           0.358    0.250       0.230    0.183
β_0          4.568    0.074       4.197    0.264       4.147    0.243
β_1         −0.003    0.001      −0.004    0.003      −0.004    0.003
β_2         −0.064    0.006      −0.056    0.012      −0.055    0.011
DIC          –                    335.76                336.79

Table 8: Posterior estimates based on fitting different models to the crime frequency data.

             Model 1 (iid error)   Model 2 (CAR error)   Model 3 (DCAR error)
MSPE         0.084                 0.053                 0.050

Table 9: Mean squared predictive error from the leave-one-out method (MSPE).

In Table 8, we observe that the posterior estimates of the regression coefficients (the β's) are very similar across all three models. As expected, the negative posterior medians of β_1 and β_2 indicate that crime frequencies are expected to be lower in neighborhoods with higher income levels and housing values. Next we turn our attention to the error part of the three models. First, the significantly lower posterior medians of σ² under both the CAR and DCAR models indicate that greater variability is explained by the models with spatially correlated errors (i.e., by the CAR and DCAR models in this case) than by the corresponding model with independent errors. This is further evidenced by the "deviance residual" (as defined by McCullagh and Nelder 1989) plot in Figure 4, which also suggests that these residuals are not randomly scattered around the horizontal line at the origin. Among the spatially correlated error models, the difference between the DCAR model and the CAR model is negligible. Also, from the scatterplot in Figure 4 of the predicted values from the DCAR spatial structure versus those from the CAR structure, we observe that there are many points far from the straight line of slope 1 (points close to this line would indicate predicted values similar to the original data). However, the predicted values from DCAR and CAR are not different from each other, which is also evident from their corresponding DIC values.

In addition to using DIC to compare the models with CAR and DCAR error structures, we also computed a cross-validation measure, the leave-one-out mean squared predictive error (MSPE).

Figure 4: Scatterplot of regional estimated frequencies from DCAR versus those from CAR for the log-transformed crime frequencies. The straight line has slope 1; thus, if the predicted values are similar to the original data, the points lie close to the straight line.
Here ẑ_{−i} = E(Z_i | z_{−i}) is the predictive mean of Z_i obtained by fitting the model to a reduced data set consisting of all (n − 1) observations, leaving out the ith observation. In Table 9 we present the MSPEs for the three models. Again it is evident that the spatially correlated error models perform much better than the independent error model. Among the two spatial models, DCAR performs slightly better than the CAR model in terms of having a lower MSPE. Thus, we conclude that although there is little evidence of separate directional spatial correlations, there is a strong spatial correlation among the neighborhoods.

4.2 Elevated Blood Lead Levels in Virginia

We also illustrate the fitting of the CAR and DCAR models using a second data set, estimating the rate, per thousand, of children under the age of 72 months with elevated blood lead levels observed in Virginia in the year 2000. As predictors for the rate of children with elevated blood lead levels, we consider the median housing value in $100,000 units and the number of children under 17 years of age living in poverty in 2000, per 100,000 children at risk. These observations were collected in 133 counties in Virginia in the year 2000, with coordinates being the centroids of each county. The aggregated data for each county are counts: the number of children under 6 years of age with elevated blood lead levels in county i and the number of children under 6 years of age tested.

In Schabenberger and Gotway (2005), the original data set was used to illustrate the percentage of children under the age of 6 years with elevated blood lead levels by using a Poisson-based generalized linear model (GLM) and a Poisson-based generalized linear mixed model (GLMM) in the analysis of spatial data. Schabenberger and Gotway (2005) illustrated spatial dependence by comparing predictions from a marginal spatial GLM, a conditional spatial GLMM, a marginal spatial GLM using a geostatistical variance structure, and a marginal GLM using a CAR variance structure. For the CAR variance structure, they used binary sets of neighbors sharing a common border. They mention that because of this choice of adjacency weights, the model with the CAR variance smoothes the data much more than the model with the geostatistical variance.

Instead of using a generalized linear model for the count data, we consider the Freeman-Tukey (FT) square-root transformation of the Y_i's. There are zero counts in some counties, and the FT square-root transformation is more stable than the usual square-root transformation (Freeman and Tukey 1950; Cressie and Chan 1989). For the FT square-root transformed elevated blood lead level rate, we assume a Gaussian distribution with CAR or DCAR spatial structure. For the neighbor structure, we compute distances among the centroids of the geographical units as measured in latitude and longitude. So as not to have any county reporting zero neighbors, we treat counties as neighbors whenever their centroid distance is within a 54.69 radius.

For this data set, we define

\[
Z_i = \sqrt{1000 \cdot Y_i / T_i} + \sqrt{1000 \cdot (Y_i + 1) / T_i}, \qquad i = 1, 2, \ldots, n,
\]

where Y_i is the number of children under the age of 72 months with elevated blood lead levels and T_i is the number of children under the age of 72 months who were tested in Virginia in the year 2000. Thus, Z_i is the FT square-root transformed elevated blood lead level rate of sub-area S_i.
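The transformation itself is a one-liner; the sketch below applies it to counts and totals, and remains finite when some Y_i are zero (the illustrative numbers are hypothetical, not from the Virginia data).

```python
import numpy as np

def freeman_tukey_rate(y, t):
    """FT square-root transformed rate per thousand:
    Z_i = sqrt(1000*Y_i/T_i) + sqrt(1000*(Y_i + 1)/T_i)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.sqrt(1000.0 * y / t) + np.sqrt(1000.0 * (y + 1.0) / t)

z = freeman_tukey_rate([4, 0], [850, 420])   # e.g. 4 of 850 tested, and a zero count
```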
There is significant correlation between the median housing value (in $100,000 units) and the number of children under 17 years of age living in poverty in 2000 per 100,000 children at risk. Thus, we only include the centered housing value in $100,000 units (X). We plot the FT square-root transformed elevated blood lead level rate, divided into 5 intervals based on 20% quantiles, in Figure 5. In Figure 5, spatial correlation in the northeast (NE) direction appears strong. However, from the estimated correlogram in Figure 5, we observe that the spatial correlations in the four different directions do not seem to be very different from each other, though there appear to be different amounts of correlation at 45° and 135° compared to no directional correlation. Thus, we assume a DCAR process as the hidden spatial structure.

Figure 5: The elevated blood lead level rate per thousand children under the age of 72 months observed in Virginia in the year 2000, and the correlogram of the residuals after fitting a linear model.

The posterior estimates, with standard deviations, under the iid error, CAR error, and DCAR error models are displayed in Table 10.

             iid                  CAR                   DCAR
Parameter    Est.     Std.Err.    Est.      Std.Err.    Est.      Std.Err.
ρ            –        –           0.792     0.120       –         –
δ_1          –        –           –         –           0.450     0.236
δ_2          –        –           –         –           0.896     0.105
σ²           88.52    11.15       574.1     72.170      564.0     74.2
β_0          17.46    0.822       18.78     2.822       17.42     2.315
β_1         −0.624    2.103      −3.295     3.017      −3.072     2.756
DIC          –                    940.532               938.854

Table 10: Bayesian estimates based on fitting different models to the elevated blood lead level data.

Figure 6: Predicted elevated blood lead level rates of children in Virginia: the predicted FT-transformed elevated blood lead levels from the CAR model (Model 2) and from the DCAR model (Model 3).

In Table 10, we observe that the posterior modes of the intercept (β_0) are very similar across all three models. However, the estimate of the regression coefficient of the median housing value under iid errors differs from the posterior estimates under CAR errors and DCAR errors. As expected, the negative posterior medians of β_1 indicate that the rates per thousand of children under the age of 72 months with elevated blood lead levels are expected to be lower in neighborhoods with higher housing values. The estimate of the error variance (σ²) with independent errors is significantly lower than the corresponding estimates under spatially correlated errors. However, the posterior mode of β_1 (−0.624) is not significant under iid errors, having a large standard error (2.103). As we discussed in Section 3.2, the posterior mode of ρ (0.792) appears to estimate the average of the estimates of δ_1 and δ_2 (0.450 and 0.896) from the DCAR model. There exists a positive spatial relationship for elevated blood lead levels among counties in Virginia. However, there exist different amounts of positive spatial correlation among neighbors in the northeast-southwest and the northwest-southeast directions.
The spatial correlation among neighbors in the northeast-southwest direction (δ̂_2 = 0.896) is stronger than that in the northwest-southeast direction (δ̂_1 = 0.450). Among the spatially correlated error models, DCAR explains slightly more variability than CAR, though the difference between these models is negligible, as is also evident by comparing their corresponding DIC values. This is further supported by the residual plots in Figure 6, which suggest that the residuals based on the DCAR error model do not have a trend over the study region. Also, in Figure 7, we observe that most of the predicted values from the DCAR spatial structure are bigger than those from CAR. This means that for the FT-transformed elevated blood lead levels, the DCAR model captures more variability than the CAR model in stabilizing estimates within the regions using the estimated spatial correlation.

Figure 7: Scatterplot of regional estimated rates from DCAR versus those from CAR for the FT-transformed original elevated blood lead level rates. The straight line has slope 1; thus, if the predicted values from DCAR are similar to those from CAR, the points lie close to the straight line.

To compare the models with CAR and DCAR error structures, we also computed the leave-one-out mean squared predictive error (MSPE). In Table 11 we present the MSPEs for the three models.

             Model 1   Model 2   Model 3
MSPE         88.155    66.822    64.269

Table 11: Mean squared predictive error (elevated blood lead level data).

Again it is evident that the spatially correlated error models perform much better than the independent error model. Among the two spatial models, DCAR (64.269) performs slightly better than the CAR model (66.822) in terms of having a lower MSPE, but the difference is negligible. Thus, we conclude that there are strong spatial correlations, with some evidence of differing strengths of correlation in different directions.

5 Extensions and future work

DCAR models capture directional spatial dependence in addition to distance-specific correlation; they are thus an extension of regular CAR models, which can often fail to capture strong but directionally orthogonal spatial correlations. The DCAR model is also found to be nearly as efficient as the CAR model even when data are generated from the CAR model. However, CAR models usually fail to capture the directional effects when data are generated from DCAR or other anisotropic models, particularly when the anisotropy is pronounced. The model proposed in (6) can be extended to M (M ≥ 2) directions as

\[
Z \sim N_n\!\left( X\beta,\ \sigma^2 \Big( I - \sum_{k=1}^{M} \delta_k \tilde{W}^{(k)} \Big)^{-1} D \right),
\]

where W̃^(k) denotes the matrix of weights specific to the kth directional effect. In this paper we used only M = 2 sub-neighborhoods for simpler illustration. However, we note that if we keep increasing the number of sub-neighborhoods, the number of parameters increases and the number of observations available within each sub-neighborhood decreases. Thus, we need to restrict the number of sub-neighborhoods by introducing a penalty term (or prior) and use some form of information criterion to choose the number of sub-neighborhoods. This is an important but open issue within our DCAR framework, and we leave its further exploration as part of our future research.

References

Anselin, L., 1988.
Spatial Econometrics: Methods and Models. Kluwer Academic Publishers.

Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B. 36, 192-236 (with discussion).

Besag, J., 1975. Spatial analysis of non-lattice data. The Statistician. 24, 179-195.

Besag, J. and Kooperberg, C., 1995. On conditional and intrinsic autoregression. Biometrika. 82, 733-746.

Breslow, N. E. and Clayton, D. G., 1993. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association. 88, 9-25.

Brook, D., 1964. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika. 51, 481-483.

Clayton, D. and Kaldor, J., 1987. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics. 43, 671-681.

Cliff, A. D. and Ord, J. K., 1981. Spatial Processes: Models & Applications. Pion Limited.

Cressie, N., 1993. Statistics for Spatial Data. John Wiley & Sons, Inc.

Cressie, N. and Chan, N. H., 1989. Spatial modeling of regional variables. Journal of the American Statistical Association. 84, 393-401.

Freeman, M. F. and Tukey, J. W., 1950. Transformations related to the angular and the square root. Annals of Mathematical Statistics. 21, 607-611.

Fuentes, M., 2002. Spectral methods for nonstationary spatial processes. Biometrika. 89, 197-210.

Fuentes, M., 2005. A formal test for nonstationarity of spatial stochastic processes. Journal of Multivariate Analysis. 96, 30-54.

Fuentes, M. and Smith, R., 2001. A new class of nonstationary spatial models. Technical report, North Carolina State University, Department of Statistics.

Griffith, D. A. and Csillag, F., 1993. Exploring relationships between semi-variogram and spatial autoregressive models. Papers in Regional Science. 72, 283-295.

Higdon, D., 1998. A process-convolution approach to modelling temperatures in the North Atlantic Ocean. Journal of Environmental and Ecological Statistics. 5, 173-190.

Higdon, D., Swall, J. and Kern, J., 1999. Non-stationary spatial modeling. In Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith. Oxford: Oxford University Press, 761-768.

Hrafnkelsson, B. and Cressie, N., 2003. Hierarchical modeling of count data with application to nuclear fall-out. Journal of Environmental and Ecological Statistics. 10, 179-200.

Hughes-Oliver, J. M., Heo, T. Y. and Ghosh, S. K., 2009. An autoregressive point source model for spatial processes. Environmetrics. 20, 575-594.

Hyndman, R. J., 1996. Computing and graphing highest density regions. The American Statistician. 50, 120-126.

Journel, A. G. and Huijbregts, C. J., 1978. Mining Geostatistics. London: Academic Press.

Kyung, M., 2006. Generalized Conditionally Autoregressive Models. Ph.D. Thesis, North Carolina State University, Department of Statistics.

Mardia, K. V. and Marshall, R. J., 1984. Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika. 71, 135-146.

McCullagh, P. and Nelder, J. A., 1989. Generalized Linear Models. Chapman and Hall, London.

Miller, H. J., 2004. Tobler's first law and spatial analysis. Annals of the Association of American Geographers. 94, 284-295.

Ord, K., 1975.
Estimation methods for models of spatial interaction. Journal of the American Statistical Association. 70, 120-126.

Ortega, J. M., 1987. Matrix Theory. New York: Plenum Press.

Paciorek, C. J. and Schervish, M. J., 2006. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics. 17, 483-506.

Reich, B. J., Hodges, J. S. and Carlin, B. P., 2007. Spatial analyses of periodontal data using conditionally autoregressive priors having two classes of neighbor relations. Journal of the American Statistical Association. 102, 44-55.

Rue, H. and Tjelmeland, H., 2002. Fitting Gaussian Markov random fields to Gaussian fields. Scandinavian Journal of Statistics. 29, 31-49.

Schabenberger, O. and Gotway, C. A., 2005. Statistical Methods for Spatial Data Analysis. Chapman & Hall/CRC.

Song, H. R., Fuentes, M. and Ghosh, S., 2008. A comparative study of Gaussian geostatistical models and Gaussian Markov random field models. Journal of Multivariate Analysis. 99, 1681-1697.

Spiegelhalter, D. J., Best, N. J., Carlin, B. P. and van der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B. 64, 583-639 (with discussion).

Sun, D., Tsutakawa, R. K. and Speckman, P. L., 1999. Posterior distribution of hierarchical models using CAR(1) distributions. Biometrika. 86, 341-350.

van der Linde, A., Witzko, K.-H. and Jöckel, K.-H., 1995. Spatio-temporal analysis of mortality using splines. Biometrics. 4, 1352-1360.

Wahba, G., 1977. Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis. 14, 651-667.

White, G. and Ghosh, S. K., 2008. A stochastic neighborhood conditional autoregressive model for spatial data. Computational Statistics and Data Analysis. 53, 3033-3046.

Acknowledgments

The authors are grateful to a referee, the associate editor and the editor for their careful reading of the paper and their constructive comments. We also thank Professor George Casella, Department of Statistics, University of Florida, for comments and suggestions that led to a much improved version of the paper.

Bayesian Analysis (2009) 4, Number 4, pp. 707-732

Spiked Dirichlet Process Prior for Bayesian Multiple Hypothesis Testing in Random Effects Models

Sinae Kim∗, David B. Dahl† and Marina Vannucci‡

Abstract. We propose a Bayesian method for multiple hypothesis testing in random effects models that uses Dirichlet process (DP) priors for a nonparametric treatment of the random effects distribution. We consider a general model formulation which accommodates a variety of multiple treatment conditions. A key feature of our method is the use of a product of spiked distributions, i.e., mixtures of a point-mass and continuous distributions, as the centering distribution for the DP prior. Adopting these spiked centering priors readily accommodates sharp null hypotheses and allows for the estimation of the posterior probabilities of such hypotheses. Dirichlet process mixture models naturally borrow information across objects through model-based clustering, while inference on single hypotheses averages over clustering uncertainty. We demonstrate via a simulation study that our method yields increased sensitivity in multiple hypothesis testing and produces a lower proportion of false discoveries than other competitive methods.
While our modeling framework is general, here we present an application in the context of gene expression from microarray experiments. In our application, the modeling framework allows simultaneous inference on the parameters governing differential expression and inference on the clustering of genes. We use experimental data on the transcriptional response to oxidative stress in mouse heart muscle and compare the results from our procedure with existing nonparametric Bayesian methods that provide only a ranking of the genes by their evidence for differential expression.

Keywords: Bayesian nonparametrics; differential gene expression; Dirichlet process prior; DNA microarray; mixture priors; model-based clustering; multiple hypothesis testing

∗ Department of Biostatistics, University of Michigan, Ann Arbor, MI, mailto:sinae@umich.edu
† Department of Statistics, Texas A&M University, College Station, TX, mailto:dahl@stat.tamu.edu
‡ Department of Statistics, Rice University, Houston, TX, mailto:marina@rice.edu

© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA426

1 Introduction

This paper presents a semiparametric Bayesian approach to multiple hypothesis testing in random effects models. The model formulation borrows strength across similar objects (here, genes) and provides probabilities of sharp hypotheses regarding each object.

Much of the literature in multiple hypothesis testing has been driven by DNA microarray studies, where the expression of tens of thousands of genes is measured simultaneously (Dudoit et al. 2003). Multiple testing procedures seek to ensure that the family-wise error rate (FWER) (e.g., Hochberg (1988), Hommel (1988), Westfall and Young (1993)), the false discovery rate (FDR) (e.g., Benjamini and Hochberg (1995), Storey (2002), Storey (2003), and Storey et al. (2004)), or similar quantities (e.g., Newton et al. (2004)) are below a nominal level without greatly sacrificing power. Accounts of the Bayesian perspective on multiple testing are provided by Berry and Hochberg (1999) and Scott and Berger (2006).

There is a great variety of modeling settings that accommodate multiple testing procedures. The simplest approach, extensively used in the early literature on microarray data analysis, is to apply standard statistical procedures (such as the t-test) separately and then combine the results for simultaneous inference (e.g., Dudoit et al. 2002). Westfall and Wolfinger (1997) recommended procedures that incorporate dependence. Baldi and Long (2001), Newton et al. (2001), Do et al. (2005) and others have sought prior models that share information across objects, particularly when estimating object-specific variance across samples. Yuan and Kendziorski (2006) use finite mixture models to model dependence. Classical approaches that have incorporated dependence in the analysis of gene expression data include Tibshirani and Wasserman (2006), Storey et al. (2007), and Storey (2007), who use information from related genes when testing for differential expression of individual genes. Nonparametric Bayesian approaches to multiple testing have also been explored (see, for example, Gopalan and Berry (1998), Dahl and Newton (2007), MacLehose et al. (2007), Dahl et al. (2008)).
These approaches model the uncertainty about the distribution of the parameters of interest using Dirichlet process (DP) prior models that naturally incorporate dependence by inducing clustering of similar objects. In this formulation, inference on single hypotheses is typically done by averaging over clustering uncertainty. Dahl and Newton (2007) and Dahl et al. (2008) show that this approach leads to increased power for hypothesis testing. However, these methods provide posterior distributions that are continuous and therefore cannot be used to directly test sharp hypotheses, which have zero posterior probability. Instead, decisions regarding such hypotheses are made by calculating univariate scores that are context specific. Examples include the sum of squares of the treatment effects (to test a global ANOVA-like hypothesis) and the probability that a linear combination of treatment effects exceeds a threshold.

In this paper we build on the framework of Dahl and Newton (2007) and Dahl et al. (2008) to show how the DP modeling framework can be adapted to provide meaningful posterior probabilities of sharp hypotheses by using a mixture of a point mass and a continuous distribution as the centering distribution of the DP prior on the coefficients of a random effects model. This modification retains the increased power of DP models but also readily accommodates sharp hypotheses. The resulting posterior probabilities have a very natural interpretation in a variety of uses. For example, they can be used to rank objects and define a list according to a specified expected number of false discoveries. We demonstrate via a simulation study that our method yields increased sensitivity in multiple hypothesis testing and produces a lower proportion of false discoveries than other competitive methods, including standard ANOVA procedures. In our application, the modeling framework we adopt simultaneously infers the parameters governing differential expression and clusters the objects (i.e., genes). We use experimental data on the transcriptional response to oxidative stress in mouse heart muscle and compare results from our procedure with those of existing nonparametric Bayesian methods which only provide a ranking of the genes by their evidence for differential expression.

Recently, Cai and Dunson (2007) independently proposed the use of similar spiked priors in DP priors in a Bayesian nonparametric linear mixed model where variable selection is achieved by modeling the unknown distribution of univariate regression coefficients. Similarly, MacLehose et al. (2007) used this formulation in their DP mixture model to account for highly correlated regressors in an observational study. There, the clustering induced by the Dirichlet process is on the univariate regression coefficients and strength is borrowed across covariates. Finally, Dunson et al. (2008) use a similar spiked centering distribution of univariate regression coefficients in a logistic regression. In contrast, our goal is nonparametric modeling of multivariate random effects which may equal the zero vector. That is, we do not share information across univariate covariates but rather seek to leverage similarities across genes by clustering vectors of regression coefficients associated with the genes.

The remainder of the paper is organized as follows. Section 2 describes our proposed modeling framework and the prior model. In Section 3 we discuss the MCMC algorithm for inference.
Using simulated data, we show in Section 4.1 how to make use of the posterior probabilities of hypotheses of interest to aid the interpretation of the hypothesis testing results. Section 4.2 describes the application to DNA microarrays. In both Sections 4.1 and 4.2, we compare our proposed method to LIMMA (Smyth 2004), to the SIMTAC method of Dahl et al. (2008) and to a standard ANOVA procedure. Section 5 concludes the paper.

2 Dirichlet Process Mixture Models for Multiple Testing

2.1 Random Effects Model

Suppose there are K observations on each of G objects and T∗ treatments. For each object g, with g = 1, . . . , G, we model the data vector d_g with the following K-dimensional multivariate normal distribution:

\[ d_g \mid \mu_g, \beta_g, \lambda_g \sim N_K(d_g \mid \mu_g j + X\beta_g, \; \lambda_g M), \tag{1} \]

where µ_g is an object-specific mean, j is a vector of ones, X is a K × T design matrix, β_g is a vector of T regression coefficients specific to object g, M is the inverse of a correlation matrix of the K observations from an object, and λ_g is an object-specific precision (i.e., inverse of the variance). We are interested in testing, for each of the G objects, a hypothesis of the form:

\[ H_{0,g}: \beta_{1,g} = \cdots = \beta_{T^*,g} = 0 \qquad \text{vs.} \qquad H_{a,g}: \beta_{t,g} \neq 0 \ \text{for some } t = 1, \ldots, T^*. \tag{2} \]

Object-specific intercept terms are µ_g j, so the design matrix X does not contain the usual column of ones and T is one less than the number of treatments (i.e., T = T∗ − 1). Also, d_1, . . . , d_G are assumed to be conditionally independent given all model parameters. In the example of Section 4.2, the objects are genes, with d_g being the background-adjusted and normalized expression data for gene g under T∗ treatments, G being the number of genes, and K being the number of microarrays. In that example, we have K = 12 since there are 3 replicates for each of T∗ = 4 treatments, and the X matrix is therefore:

\[ X = \begin{pmatrix} 0_3 & 0_3 & 0_3 \\ j_3 & 0_3 & 0_3 \\ 0_3 & j_3 & 0_3 \\ 0_3 & 0_3 & j_3 \end{pmatrix}, \]

where j_3 is a 3-dimensional column vector of ones and 0_3 a 3-dimensional column vector of zeroes. If other covariates are available, they can be added as extra columns of X. Note that the design matrix X and the correlation matrix M are known and common to all objects, whereas µ_g, β_g, and λ_g are unknown object-specific parameters. For experimental designs involving independent sampling (e.g., the typical time-course microarray experiment in which subjects are sacrificed rather than providing repeated measures), M is simply the identity matrix.

2.2 Prior Model

We take a nonparametric Bayesian approach to model the uncertainty about the distribution of the random effects. The modeling framework we adopt allows for simultaneous inference on the regression coefficients and on the clustering of the objects (i.e., genes). We achieve this by placing a Dirichlet process (DP) prior (Antoniak 1974) with a spiked centering distribution on the distribution function of the regression coefficient vectors β_1, . . . , β_G:

\[ \beta_1, \ldots, \beta_G \mid G_\beta \sim G_\beta, \qquad G_\beta \sim DP(\alpha_\beta, G_{0\beta}), \]

where G_β denotes a distribution function for β, DP stands for a Dirichlet process, α_β is a precision parameter, and G_{0β} is a centering distribution, i.e., E[G_β] = G_{0β}. Sampling from a DP induces ties among β_1, . . . , β_G, since there is a positive probability that β_i = β_j for every i ≠ j. Two objects i ≠ j are said to be clustered in terms of their regression coefficients if and only if β_i = β_j.
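To see this tie-inducing behavior concretely, the following minimal sketch (ours, not the authors' implementation; a plain multivariate normal stands in for the spiked centering distribution G_{0β} described below) draws β_1, . . . , β_G from DP(α_β, G_{0β}) via the Pólya urn scheme:

```python
import numpy as np

def polya_urn_draws(G, alpha, base_draw, rng):
    """Draw beta_1, ..., beta_G from a DP(alpha, G0) via the Polya urn:
    object g is tied to a uniformly chosen earlier object with probability
    g/(alpha + g), and otherwise receives a fresh draw from G0."""
    betas = [base_draw(rng)]                      # the first object draws from G0
    for g in range(1, G):
        if rng.random() < alpha / (alpha + g):
            betas.append(base_draw(rng))          # open a new cluster
        else:
            betas.append(betas[rng.integers(g)])  # tie with an earlier object
    return betas

rng = np.random.default_rng(0)
draws = polya_urn_draws(500, alpha=5.0,
                        base_draw=lambda r: r.normal(size=5), rng=rng)
print(len({d.tobytes() for d in draws}))  # number of distinct vectors (clusters), ~24
```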
The clustering of the objects encoded by the ties among the regression coefficients will simply be referred to as the "clustering of the regression coefficients," although it should be understood that it is the data themselves that are clustered. The fact that our model induces ties among the regression coefficients β_1, . . . , β_G is the means by which it borrows strength across objects for estimation.

Set partition notation is helpful throughout the paper. A set partition ξ = {S_1, . . . , S_q} of S_0 = {1, 2, . . . , G} has the following properties: each component S_i is non-empty, the intersection of two components S_i and S_j is empty, and the union of all components is S_0. A cluster S in the set partition ξ_β for the regression coefficients is a set of indices such that, for all i ≠ j ∈ S, β_i = β_j. Let β_S denote the common value of the regression coefficients corresponding to cluster S. Using this set partition notation, the regression coefficient vectors β_1, . . . , β_G can be reparametrized as a partition ξ_β and a collection of unique model parameters φ_β = (β_{S_1}, . . . , β_{S_q}). We will use the terms clustering and set partition interchangeably.

Spiked Prior Distribution on the Regression Coefficients

Similar modeling frameworks and inferential goals to the ones we describe in this paper were considered by Dahl and Newton (2007) and Dahl et al. (2008). However, their prior formulation does not naturally permit testing of sharp hypotheses, i.e., it cannot provide Pr(H_{a,g} | data) = 1 − Pr(H_{0,g} | data), with hypotheses defined as in (2), since the posterior distribution of β_{t,g} is continuous. Therefore, they must rely on univariate scores capturing evidence for these hypotheses. The prior formulation we adopt below, instead, allows us to estimate the probability of sharp null hypotheses directly from the MCMC samples. Spiked distributions are a mixture of two distributions: the "spike" refers to a point mass at zero, and the other component is a continuous distribution for the parameter when it is not zero. Such distributions have been widely used as priors in the Bayesian variable selection literature (George and McCulloch 1993; Brown et al. 1998). Here we employ these priors to perform nonparametric multiple hypothesis testing by specifying a spiked distribution as the centering distribution of the DP prior on the regression coefficient vectors β_1, . . . , β_G. Adopting a spiked centering distribution in the DP allows for positive posterior probability on β_{t,g} = 0, so that our proposed model is able to provide probabilities of sharp null hypotheses (e.g., H_{0,g}: β_{1,g} = · · · = β_{T∗,g} = 0 for g = 1, . . . , G) while simultaneously borrowing strength from objects likely to have the same values of the regression coefficients.

We also adopt a "super-sparsity" prior on the probability that β_{t,g} = 0 (defined as π_t for all g), since it is not uncommon that changes in expression for many genes will be minimal across treatments. The idea of the "super-sparsity" prior was investigated in Lucas et al. (2006). By using another layer in the prior for π_t, the probability of β_{t,g} = 0 will be shrunk toward one for genes showing no changes in expression across treatment conditions. Specifically, our model uses the following prior for the regression coefficients β_1, . . . , β_G:
\[
\begin{aligned}
\beta_1, \ldots, \beta_G \mid G_\beta &\sim G_\beta, \\
G_\beta &\sim DP(\alpha_\beta, G_{0\beta}), \\
G_{0\beta} &= \prod_{t=1}^{T} \left\{\pi_t\, \delta_0(\beta_t) + (1 - \pi_t)\, N(\beta_t \mid m_t, \tau_t)\right\}, \\
\pi_1, \ldots, \pi_T \mid \rho_1, \ldots, \rho_T &\sim \prod_{t=1}^{T} \left\{(1 - \rho_t)\,\delta_0(\pi_t) + \rho_t\, \text{Beta}(\pi_t \mid a_\pi, b_\pi)\right\}, \\
\rho_1, \ldots, \rho_T &\sim \text{Beta}(\rho \mid a_\rho, b_\rho), \\
\tau_1, \ldots, \tau_T &\sim \text{Gamma}(\tau \mid a_\tau, b_\tau).
\end{aligned}
\]

Note that a spiked formulation is used for each element of the regression coefficient vector and that π_t = p(β_{t,1} = 0) = · · · = p(β_{t,G} = 0). Typically m_t = 0, but other values may be desired. We use the parameterization of the gamma distribution in which the expected value of τ_t is a_τ b_τ. For simplicity, let π = (π_1, · · · , π_T) and τ = (τ_1, · · · , τ_T). After marginalizing over π_t for all t, G_{0β} becomes

\[
G_{0\beta} = \prod_{t=1}^{T} \left\{\rho_t r_\pi\, \delta_0(\beta_t) + (1 - \rho_t r_\pi)\, N(\beta_t \mid m_t, \tau_t)\right\}, \qquad \rho_1, \ldots, \rho_T \sim \text{Beta}(\rho \mid a_\rho, b_\rho),
\]

where r_π = a_π/(a_π + b_π). As the equation above shows, ρ_t r_π now plays the role of the probability that β_{t,g} = 0 for all g.

Prior Distribution on the Precisions

Our model accommodates heteroscedasticity while preserving parsimony by placing a DP prior on the precisions λ_1, . . . , λ_G:

\[
\lambda_1, \ldots, \lambda_G \mid G_\lambda \sim G_\lambda, \qquad G_\lambda \sim DP(\alpha_\lambda, G_{0\lambda}), \qquad G_{0\lambda} = \text{Gamma}(\lambda \mid a_\lambda, b_\lambda).
\]

Note that the clustering of the regression coefficients is separate from that of the precisions. Although this treatment of the precisions also has the effect of clustering the data, we are typically more interested in the clustering from the regression coefficients, since they capture changes across treatment conditions. We let ξ_λ denote the set partition for the precisions λ_1, . . . , λ_G and let φ_λ = (λ_{S_1}, . . . , λ_{S_q}) be the collection of unique precision values.

Prior Distribution on the Precision Parameters of the DP

Following Escobar and West (1995), we place independent Gamma priors on the precision parameters α_β and α_λ of the DP priors:

\[
\alpha_\beta \sim \text{Gamma}(\alpha_\beta \mid a_{\alpha_\beta}, b_{\alpha_\beta}), \qquad \alpha_\lambda \sim \text{Gamma}(\alpha_\lambda \mid a_{\alpha_\lambda}, b_{\alpha_\lambda}).
\]

Prior Distribution on the Means

We assume a Gaussian prior on the object-specific mean parameters µ_1, . . . , µ_G:

\[
\mu_g \sim N(\mu_g \mid m_\mu, p_\mu). \tag{3}
\]

3 Inferential Procedures

In this section, we describe how to conduct multiple hypothesis tests and clustering inference in the context of our model. We treat the object-specific means µ_1, . . . , µ_G as nuisance parameters, since they are used neither in forming clusters nor for multiple testing. Thus, we integrate the likelihood with respect to their prior distribution in (3). Simple calculations lead to the following integrated likelihood (Dahl et al. 2008):

\[
d_g \mid \beta_g, \lambda_g \sim N_K\!\left(d_g \,\Big|\, X\beta_g + E_g^{-1} f_g, \; \frac{E_g}{\lambda_g j' M j + p_\mu}\right), \tag{4}
\]

where

\[
E_g = \lambda_g(\lambda_g j' M j + p_\mu)\, M - \lambda_g^2\, M j j' M, \qquad f_g = \lambda_g m_\mu p_\mu M j. \tag{5}
\]

Inference is based on the marginal posterior distribution of the regression coefficients, i.e., p(β_1, . . . , β_G | d_1, . . . , d_G) or, equivalently, p(ξ_β, φ_β | d_1, . . . , d_G). This distribution is not available in closed form, so we use Markov chain Monte Carlo (MCMC) to sample from the full posterior distribution p(ξ_β, φ_β, ξ_λ, φ_λ, ρ, τ | d_1, . . . , d_G) and marginalize over the parameters ξ_λ, φ_λ, ρ, and τ.

3.1 MCMC Scheme

Our MCMC sampling scheme updates each of the following parameters, one at a time: ξ_β, φ_β, ξ_λ, φ_λ, ρ, and τ. Recall that β_S is the element of φ_β associated with cluster S ∈ ξ_β, with β_{St} being element t of that vector. Likewise, λ_S is the element of φ_λ associated with cluster S ∈ ξ_λ. Given starting values for these parameters, we propose the following MCMC sampling scheme. Details for the first three updates are available in the Appendix.
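As a concreteness check on the integrated likelihood (4)–(5), which the updates below use repeatedly through Q_g and A_g, the following minimal sketch (ours; the paper's computations were coded in Matlab, and all values here are arbitrary) computes E_g and f_g and verifies numerically that E_g^{-1} f_g = m_µ j, i.e., that integrating out µ_g shifts the mean by m_µ j:

```python
import numpy as np

K = 4
M = np.eye(K)                       # inverse correlation matrix (independent sampling)
j = np.ones(K)                      # vector of ones
lam, p_mu, m_mu = 1.5, 0.2, 10.0    # arbitrary precision and prior values

# Equation (5): E_g and f_g from integrating the likelihood over mu_g
jMj = j @ M @ j
E = lam * (lam * jMj + p_mu) * M - lam**2 * np.outer(M @ j, j @ M)
f = lam * m_mu * p_mu * (M @ j)

# The integrated-likelihood mean in (4) is X beta_g + E^{-1} f, and the
# precision matrix is E / (lam * j'Mj + p_mu); here E^{-1} f = m_mu * j.
print(np.allclose(np.linalg.solve(E, f), m_mu * j))  # True
```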
(1) Obtain draws of ρ = (ρ_1, . . . , ρ_T) from its full conditional distribution by the following procedure. First, sample Y_t = r_π ρ_t from its conditional distribution:

\[
p(y_t \mid \cdot) \propto p(y_t)\; y_t^{\sum_{S \in \xi_\beta} I(\beta_{St} = 0)}\, (1 - y_t)^{\sum_{S \in \xi_\beta} I(\beta_{St} \neq 0)},
\quad \text{with} \quad p(y_t) \propto y_t^{a_\rho - 1} (r_\pi - y_t)^{b_\rho - 1},
\]

which does not have a known distributional form; a grid-based inverse-cdf method is adopted for sampling y_t. Once Y_t is drawn, we obtain ρ_t as Y_t/r_π.

(2) Draw samples of τ = (τ_1, · · · , τ_T) from their full conditional distributions:

\[
\tau_t \mid \cdot \sim \text{Gamma}\!\left(a_\tau + \frac{|\zeta_t|}{2}, \;\left[\frac{1}{b_\tau} + \frac{1}{2}\sum_{S \in \zeta_t} (\beta_{St} - m_t)^2\right]^{-1}\right), \tag{6}
\]

where ζ_t = {S ∈ ξ_β | β_{St} ≠ 0} and |ζ_t| is its cardinality.

(3) Draw samples of β_S = (β_{S1}, . . . , β_{ST}) from their full conditional distributions:

\[
\beta_{St} \mid \cdot \sim \pi_{St}\, \delta_0 + (1 - \pi_{St})\, N(h_t^{-1} z_t, \; h_t), \tag{7}
\]

where

\[
h_t = \tau_t + \sum_{g \in S} x_t' Q_g x_t, \qquad z_t = m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g,
\]
\[
Q_g = (\lambda_g j' M j + p_\mu)^{-1} E_g, \qquad A_g = d_g - X_{(-t)} \beta_{S(-t)} - E_g^{-1} f_g,
\]

and the probability π_{St} is

\[
\pi_{St} = \frac{y_t}{y_t + (1 - y_t)\,\sqrt{h_t^{-1} \tau_t}\; \exp\left\{-\frac{1}{2}\tau_t m_t^2 + \frac{1}{2} h_t^{-1} z_t^2\right\}},
\]

where y_t = ρ_t r_π with r_π = a_π/(a_π + b_π), and X_{(-t)} and β_{S(-t)} denote X and β_S with element t removed, respectively. (Here N(h_t^{-1} z_t, h_t) is parameterized by its mean and precision.)

(4) Since a closed-form full conditional for λ_S is not available, update λ_S using a univariate Gaussian random walk.

(5) Update ξ_β using the Auxiliary Gibbs algorithm (Neal 2000).

(6) Update α_β from its conditional distribution:

\[
\alpha \mid \eta, k \sim
\begin{cases}
\text{Gamma}(a_\alpha + k, \; b_\alpha^*) & \text{with probability } p_\alpha, \\
\text{Gamma}(a_\alpha + k - 1, \; b_\alpha^*) & \text{with probability } 1 - p_\alpha,
\end{cases}
\]

where

\[
b_\alpha^* = \left(\frac{1}{b_\alpha} - \log(\eta)\right)^{-1} \quad \text{and} \quad p_\alpha = \frac{a_\alpha + k - 1}{a_\alpha + k - 1 + n/b_\alpha^*}.
\]

Also, η | α, k ∼ Beta(α + 1, n).

(7) Update ξ_λ using the Auxiliary Gibbs algorithm.

(8) Update α_λ using the same procedure as in (6) above.

3.2 Inference from MCMC Results

Due to our formulation of the centering distribution of the DP prior on the regression coefficients, our model can estimate the probability of sharp null hypotheses, such as H_{0,g}: β_{1,g} = · · · = β_{T∗,g} = 0 for g = 1, . . . , G. Other hypotheses may be specified, depending on the experimental goals. We estimate these probabilities by simply finding the relative frequency with which the hypotheses hold among the states of the Markov chains.

Our prior model formulation also permits inference on the clustering of the G objects. Several methods are available in the literature on DP models to estimate the cluster memberships based on posterior samples (see, for example, Medvedovic and Sivaganesan 2002; Dahl 2006; Lau and Green 2007). In the examples below we adopt the least-squares clustering estimation of Dahl (2006), which finds the clustering configuration, among those sampled by the Markov chain, that minimizes a posterior expected loss proposed by Binder (1978) with equal costs of clustering mistakes.
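For completeness, here is a minimal sketch of this least-squares clustering rule as we understand it (our code, with a toy array of sampled cluster labels standing in for actual MCMC output): average the pairwise co-clustering indicators over the sampled partitions, then return the sampled partition whose association matrix is closest in squared distance to that average.

```python
import numpy as np

def least_squares_clustering(sampled_labels):
    """sampled_labels: (n_draws, G) array of cluster labels, one row per MCMC draw.
    Returns the index of the draw minimizing ||delta(draw) - pihat||^2, where
    pihat[i, j] is the posterior co-clustering probability of objects i and j."""
    assoc = [(row[:, None] == row[None, :]).astype(float) for row in sampled_labels]
    pihat = np.mean(assoc, axis=0)
    losses = [np.sum((a - pihat) ** 2) for a in assoc]
    return int(np.argmin(losses))

# Toy example: three sampled partitions of G = 4 objects
samples = np.array([[0, 0, 1, 1],
                    [0, 0, 0, 1],
                    [0, 0, 1, 1]])
print(least_squares_clustering(samples))  # 0: the modal partition wins
```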
3.3 Hyperparameter Settings

Our recommendation for setting the hyperparameters is based on computing, for each object, the least-squares estimates of the regression coefficients, the y-intercept, and the mean squared error. We then set m_µ to be the mean of the estimated y-intercepts and p_µ to be the inverse of their variance. We also use the method of moments to set (a_τ, b_τ), which requires solving the following two equations:

\[
\begin{aligned}
a_\tau b_\tau &= \text{mean of the variances of the least-squares regression coefficients}, \\
a_\tau b_\tau^2 &= \text{sample variance of the variances of the least-squares regression coefficients}.
\end{aligned}
\]

Likewise, a_λ and b_λ are set using method of moments estimation, assuming that the inverses of the mean squared errors are random draws from a gamma distribution having mean a_λ b_λ. As for (a_π, b_π) and (a_ρ, b_ρ), a specification such that (r_π E[ρ_t])^T = ∏_{t=1}^{T} p(β_{t,g} = 0) = 0.50 is recommended if no prior information is available. We refer to these recommended hyperparameter settings as the method of moments (MOM) settings. The MOM recommendations are based on a thorough sensitivity analysis we performed on all the hyperparameters using simulated data; some results of this study are described in Section 4.1.

4 Applications

We first demonstrate the performance of the method in a simulation study and then apply it to a gene expression data analysis.

4.1 Simulation Study

Data Generation

In an effort to imitate the structure of the microarray experiment examined in the next section, we generated 30 independent datasets with 500 objects measured at two treatments and three time points, with three replicates at each of the six treatment combinations. Since the model includes an object-specific mean, we set β_{6,g} = 0, so that the treatment index t ranges from 1 to 5. We simulated data in which the regression coefficients β for each cluster are distributed as described in Table 1.

Table 1: Schematic for the simulation of the regression coefficient vectors in the first alternative scenario.

Cluster  Size  β1           β2           β3           β4           β5
1        300   0            0            0            0            0
2        50    0            0            ∼ N(0, 1/4)  0            ∼ N(0, 1/4)
3        50    0            0            0            0            ∼ N(0, 1/4)
4        25    ∼ N(0, 1/4)  ∼ N(0, 1/4)  0            0            0
5        25    0            0            ∼ N(0, 1/4)  ∼ N(0, 1/4)  0
6        25    ∼ N(0, 1/4)  ∼ N(0, 1/4)  ∼ N(0, 1/4)  ∼ N(0, 1/4)  0
7        25    ∼ N(0, 1/4)  0            ∼ N(0, 1/4)  0            ∼ N(0, 1/4)

Similarly, three pre-defined precisions λ_1 = 1.5, λ_2 = 0.2 and λ_3 = 3.0 were randomly assigned to 180, 180, and 140 of the 500 objects, respectively. Object-specific means µ_g were generated from a univariate normal distribution with mean 10 and precision 0.2. Finally, each vector d_g was sampled from a multivariate normal distribution with mean µ_g j + Xβ_g and precision matrix λ_g I, where I is an identity matrix. We repeated the procedure above to create the 30 independent datasets. Our interest lies in testing the null hypothesis H_{0,g}: β_{1,g} = · · · = β_{6,g} = 0. All the computational procedures were coded in Matlab.

Results

We applied the proposed method to the 30 simulated datasets. The model involves several hyperparameters: m_µ, p_µ, a_π, b_π, a_ρ, b_ρ, a_τ, b_τ, a_λ, b_λ, a_{α_β}, b_{α_β}, a_{α_λ}, b_{α_λ}, and m_t. We set (a_π, b_π) = (1, 0.15) and (a_ρ, b_ρ) = (1, 0.005). The prior probability of the null hypothesis (i.e., that all the regression coefficients are zero) for an object is then about 50%, namely (r_π E[ρ])^5 with r_π = a_π/(a_π + b_π) and E[ρ] = a_ρ/(a_ρ + b_ρ): the product, over the T treatment conditions, of the probabilities r_π E[ρ] that each coefficient is zero. We used the MOM recommendations of Section 3.3 to set (a_τ, b_τ) and (a_λ, b_λ); these recommendations are based on the sensitivity analysis described later in the paper.
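As an aside, the two moment equations of Section 3.3 can be solved in closed form. Writing m and v for the mean and the sample variance of the estimated coefficient variances, dividing the second equation by the first gives (our algebra):

\[
a_\tau b_\tau = m, \quad a_\tau b_\tau^2 = v \;\;\Longrightarrow\;\; b_\tau = \frac{v}{m}, \quad a_\tau = \frac{m^2}{v}.
\]

The same two-moment matching yields (a_λ, b_λ) from the inverses of the mean squared errors.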
We somewhat arbitrarily set (a_{α_β}, b_{α_β}) = (5, 1) and (a_{α_λ}, b_{α_λ}) = (1, 1), so that the prior expected numbers of clusters are about 24 and 7 for the regression coefficients and precisions, respectively. We show the robustness of the results to the choice of these parameters in a later section. For each dataset, we ran two Markov chains for 5,000 iterations from different starting clustering configurations. Trace plots of the number of clusters of β from the two starting configurations for one of the simulated datasets, as well as similar plots for λ, are shown in Figure 1; trace plots of the sampled α_β and α_λ are shown in Figure 2. They do not indicate any convergence or mixing problems, and the other datasets also produced plots indicating good mixing. For each chain, we discarded the first 3,000 iterations as burn-in and pooled the results from the two chains.

Our interest in the study is to see whether there are changes between the two groups within a time point and across time points. Specifically, we considered the null hypothesis that all regression coefficients are equal to zero: for g = 1, . . . , 500,

\[
H_{0,g}: \beta_{1,g} = \cdots = \beta_{6,g} = 0 \qquad \text{vs.} \qquad H_{a,g}: \beta_{t,g} \neq 0 \ \text{for some } t = 1, \ldots, 6.
\]

We ranked the objects by their posterior probabilities of the alternative hypotheses H_{a,g}, which equal 1 − Pr(H_{0,g} | data). A plot of the ranked posterior probability for each object is shown in Figure 3.

Figure 1: Trace plots of the number of clusters for the regression coefficients and the precisions when fitting a simulated dataset.

Figure 2: Trace plots of the sampled α_β and α_λ when fitting a simulated dataset.

Figure 3: Probability of the alternative hypothesis (i.e., 1 − Pr(H_{0,g}: β_{1,g} = · · · = β_{6,g} = 0 | data)) for each object of a simulated dataset of 500 objects.

Bayesian False Discovery Rate

Many multiple testing procedures seek to control some type of false discovery rate (FDR) at a desired value. The Bayesian FDR (Genovese and Wasserman 2003; Müller et al. 2004; Newton et al. 2004) can be obtained as

\[
\widehat{FDR}(c) = \frac{\sum_{g=1}^{G} D_g\, (1 - v_g)}{\sum_{g=1}^{G} D_g},
\]

where v_g = Pr(H_{a,g} | data) and D_g = I(v_g > c). We reject H_{0,g} if the posterior probability v_g is greater than the threshold c. The optimal threshold is the smallest value of c in the set {c : \widehat{FDR}(c) ≤ α} for a pre-specified error rate α. We averaged the Bayesian FDRs over the 30 simulated datasets; the optimal threshold, on average, is 0.7 for a Bayesian FDR of 0.05. The Bayesian FDR is compared with the true proportion of false discoveries (labeled "Realized FDR") in Figure 4. In this simulation, our Bayesian approach is slightly anti-conservative. As shown in Dudoit et al. (2008), anti-conservative behavior of FDR-controlling approaches is often observed for data with a high correlation structure and a high proportion of true null hypotheses.
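The estimator \widehat{FDR}(c) is a one-line computation given the posterior probabilities v_g. A minimal sketch (ours, with simulated stand-in probabilities rather than real model output) that also searches a grid for the smallest cut-off keeping the estimated rate below α:

```python
import numpy as np

def bayesian_fdr(v, c):
    """Estimated Bayesian FDR at cut-off c, with v_g = Pr(H_a,g | data):
    the average of (1 - v_g) over the tests rejected at v_g > c."""
    rejected = v > c
    if not rejected.any():
        return 0.0                       # no rejections: no false discoveries
    return float(np.mean(1.0 - v[rejected]))

def optimal_cutoff(v, alpha=0.05, grid=np.linspace(0.5, 0.99, 50)):
    """Smallest grid cut-off whose estimated FDR is at most alpha
    (smaller cut-offs reject more, so this maximizes discoveries)."""
    feasible = [c for c in grid if bayesian_fdr(v, c) <= alpha]
    return min(feasible) if feasible else None

v = np.random.default_rng(1).beta(0.5, 0.5, size=500)  # stand-in probabilities
print(optimal_cutoff(v))
```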
Figure 4: Proportion of false discoveries ("Realized FDR") and Bayesian FDR as a function of the cut-off, averaged over the 30 datasets.

Comparisons with Other Methods

We assessed the performance of the proposed method by comparing it with three other methods: a standard analysis of variance (ANOVA), the SIMTAC method of Dahl et al. (2008), and LIMMA (Smyth 2004). The LIMMA procedure is set in the context of a general linear model and provides, for each gene, an F-statistic to test for differential expression at one or more time points; these F-statistics were used to rank the genes. The SIMTAC method uses a modeling framework similar to the one we adopt, but it is not able to provide estimates of probabilities for H_{0,g} since its posterior density is continuous. We used the univariate score suggested by Dahl et al. (2008) which captures support for the hypothesis of interest, namely q_g = \sum_{t=1}^{T} \beta_{t,g}^2. For the ANOVA procedure, we ranked objects by their p-values associated with H_{0,g}; small p-values indicate little support for H_{0,g}.

For each of the 30 datasets and each method, we ranked the objects as described above. These lists were truncated at 1, 2, . . . , 200 samples. At each truncation, the proportions of false discoveries were computed and averaged over the 30 datasets. Results are displayed in Figure 5. Our proposed method exhibits a lower proportion of false discoveries: its performance is substantially better than that of ANOVA and LIMMA, and noticeably better than that of the SIMTAC method.

Figure 5: Average proportion of false discoveries for the four methods (our method, SIMTAC, ANOVA, LIMMA) based on the 30 simulated datasets.

Sensitivity Analysis

The model involves several hyperparameters: m_µ, p_µ, a_π, b_π, a_ρ, b_ρ, a_τ, b_τ, a_λ, b_λ, a_{α_β}, b_{α_β}, a_{α_λ}, b_{α_λ}, and m_t. In order to investigate the sensitivity of the results to the choice of these hyperparameters, we randomly selected one of the 30 simulated datasets for a sensitivity analysis. We considered ten different hyperparameter settings. In the first, called the "MOM" setting, we used all the MOM estimates of the hyperparameters together with (a_π, b_π) = (1, 0.15), (a_ρ, b_ρ) = (1, 0.005), and (a_{α_β}, b_{α_β}) = (5, 1). The other nine scenarios each change one set of parameters, with all other parameters set as in the first scenario (the prior expected cluster counts quoted in (viii) and (ix) are illustrated in the sketch after this list):

i. (a_π, b_π) = (15, 15), so that p(β_{t,g} = 0) = 0.50.
ii. (a_π, b_π) = (1, 9), so that p(β_{t,g} = 0) = 0.10.
iii. (a_ρ, b_ρ) = (1, 2), so that E[r_π ρ_t] = 0.25.
iv. (a_τ, b_τ) = (1, 0.26), giving a smaller variance than the MOM estimate.
v. (a_τ, b_τ) = (1, 0.7), giving a larger variance than the MOM estimate.
vi. (a_λ, b_λ) = (1, 0.5), giving a smaller variance than the MOM estimate.
vii. (a_λ, b_λ) = (1, 3), giving a larger variance than the MOM estimate.
viii. (a_{α_β}, b_{α_β}) = (25, 1), so that E[α_β] = 25 and the prior expected number of clusters is about 77.
ix. (a_{α_β}, b_{α_β}) = (1, 1), so that E[α_β] = 1 and the prior expected number of clusters is about 7.

We set m_t = 0. Also, the mean m_µ of the distribution of µ was set to the mean of the estimated least-squares intercepts, and the precision p_µ to the precision of the estimated intercepts. An identity matrix was used for M, since we assume independent sampling.
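As promised above, the prior expected cluster counts follow from the standard DP identity E[k | α, G] = Σ_{i=1}^{G} α/(α + i − 1). A quick check (ours) of the counts quoted in scenarios (viii) and (ix) and in the earlier (a_{α_β}, b_{α_β}) = (5, 1) setting:

```python
# Prior expected number of clusters under DP(alpha, G0) with G objects:
# E[k] = sum_{i=1}^{G} alpha / (alpha + i - 1).
def expected_clusters(alpha, G):
    return sum(alpha / (alpha + i) for i in range(G))

print(round(expected_clusters(25.0, 500)))  # about 77, as in scenario (viii)
print(round(expected_clusters(1.0, 500)))   # about 7, as in scenario (ix)
print(round(expected_clusters(5.0, 500)))   # about 24, matching E[alpha_beta] = 5
```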
We fixed α_λ = 1 throughout the sensitivity analysis; we expect the sensitivity to this parameter to be similar to that for α_β. We ran two MCMC chains with different starting values: one chain started from one cluster (for both β and λ) and the other from G clusters (for both). Each chain was run for 5,000 iterations.

We assessed the sensitivity to the hyperparameter settings in two ways. First, Figure 6 shows that the proportion of false discoveries is remarkably consistent across the ten different hyperparameter settings. Second, we identified, for each hyperparameter setting, the 50 objects most likely to be "differentially expressed", i.e., the 50 objects with the smallest probability of the hypothesis H_0. Table 2 gives the number of common objects in each pairwise intersection of these lists across the parameter settings. These results indicate a high degree of concordance among the hyperparameter scenarios. We are therefore confident in recommending, in the absence of prior information, the use of the MOM estimates for (a_τ, b_τ) and (a_λ, b_λ) and the choice of (a_π, b_π) and (a_ρ, b_ρ) such that p(β_{t,g} = 0) = 0.50. The choice of (a_{α_β}, b_{α_β}) does not make a difference in the results.

Figure 6: Proportion of false discoveries under the ten hyperparameter settings, based on one dataset.

Table 2: Among the 50 objects most likely to be differentially expressed, the number in common in each pairwise intersection of the lists identified under the ten hyperparameter settings.

         MOM  (i)  (ii)  (iii)  (iv)  (v)  (vi)  (vii)  (viii)
(i)      41
(ii)     37   42
(iii)    41   45   43
(iv)     38   45   44    45
(v)      39   45   43    45    44
(vi)     41   42   42    43    40    45
(vii)    39   45   44    45    47    44   42
(viii)   39   43   42    46    42    44   44    45
(ix)     42   46   43    46    44    45   45    44     44

4.2 Gene Expression Study

We illustrate the advantages of our method in a microarray data analysis. The dataset was used in Dahl and Newton (2007). Researchers were interested in the transcriptional response to oxidative stress in mouse heart muscle and how that response changes with age. The data were obtained from two age groups of mice, young (5 months old) and old (25 months old), which were treated with an injection of paraquat (50 mg/kg). Mice were killed at 1, 3, 5 or 7 hours after the treatment, or were killed without having received paraquat (the baseline); the mice therefore yield independent measurements rather than repeated measurements. Gene expression was measured three times at each treatment condition.

Originally, gene expression was measured on 10,043 probe sets. We randomly selected G = 1,000 genes out of the 10,043 to reduce computation time. We also used only the first two treatments, baseline and 1 hour after injection, from both age groups, since it is often of interest to see whether gene expression changes within 1 hour of injection. Old mice at baseline were designated as the reference treatment; while the analysis is not invariant to the choice of the reference treatment, we show in Section 5 that the results are robust to this choice. The data were background-adjusted and normalized using the Robust Multichip Averaging (RMA) method of Irizarry et al. (2003).

Our two main biological goals are to identify genes which are either:
1. Differentially expressed in some way across the four treatment conditions, i.e., genes having a small probability of H_{0,g}: β_{1,g} = β_{2,g} = β_{3,g} = 0; or

2. Similarly expressed at baseline between old and young mice, but differentially expressed 1 hour after injection, i.e., genes having a large probability of H_{a,g}: |β_{1,g} − β_{3,g}| = 0 and |β_{2,g} − β_{4,g}| > c, for some threshold c, such as 0.1.

Assuming that information on how many genes are differentially expressed is not available, we set a prior on π by defining (a_π, b_π) = (10, 3) and (a_ρ, b_ρ) = (100, 0.05), which implies a belief that about 50% of the genes are differentially expressed. We set (a_{α_β}, b_{α_β}) = (5, 5) and (a_{α_λ}, b_{α_λ}) = (1, 1), so that the prior expected numbers of clusters are 93 and 8 for the regression coefficients and precisions, respectively. The other parameters were estimated as recommended in the simulation study. We ran two chains starting from two different initial configurations: (i) all the genes in a single cluster and (ii) each gene in its own cluster. The MCMC sampler was run for 10,000 iterations, with the first 5,000 discarded as burn-in. Figure 7 shows trace plots of the number of clusters for both the regression coefficients and the precisions; the plots do not indicate convergence or mixing problems.

Figure 7: Trace plots of the number of clusters for the regression coefficients and the precisions when fitting the gene expression data.

The least-squares clustering method found a clustering of the regression coefficients with 14 clusters and a clustering of the precisions with 11 clusters. There were six large clusters for β, each of size greater than 50; together these clusters included 897 genes. The average gene expression for each of the six clusters is shown in Figure 8(a), where the y-axis indicates the average gene expression and the x-axis the treatments. Each cluster shows its own distinctive profile. We also found one cluster of 18 genes with all regression coefficients equal to zero (Figure 8(b)).

For hypothesis testing, we ranked genes by their posterior probabilities, identifying the genes least supportive of the null hypothesis H_{0,g}: β_{1,g} = β_{2,g} = β_{3,g} = β_{4,g} = 0. Figure 9 shows a heatmap of the fifty genes least supportive of this hypothesis. Finally, in order to identify genes conforming to the second hypothesis of interest, H_{a,g}: |β_{1,g} − β_{3,g}| = 0 and |β_{2,g} − β_{4,g}| > 0.1, we similarly identified the top fifty ranked genes. For this hypothesis, our approach clearly finds genes following the desired pattern, as shown in Figure 10.
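Because the chain places positive mass on exact zeros and exact ties, such composite-pattern probabilities are again just relative frequencies over the retained draws. A minimal sketch (ours, operating on a stand-in array of posterior samples for a single gene):

```python
import numpy as np

def prob_pattern(draws, c=0.1):
    """draws: (n_iter, 4) posterior samples of (beta1, ..., beta4) for one gene.
    Estimates Pr(|beta1 - beta3| = 0 and |beta2 - beta4| > c | data) as the
    relative frequency of the event across retained MCMC iterations."""
    same_baseline = draws[:, 0] == draws[:, 2]  # exact ties have positive probability
    diff_at_1hr = np.abs(draws[:, 1] - draws[:, 3]) > c
    return np.mean(same_baseline & diff_at_1hr)

fake = np.zeros((1000, 4))
fake[:, 3] = 0.5            # toy draws: beta1 = beta3 = 0 and |beta2 - beta4| = 0.5
print(prob_pattern(fake))   # 1.0
```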
Figure 8: Average expression profiles for (a) the six large clusters; (b) the cluster with estimated β = 0.

Figure 9: Heatmap of the 50 top-ranked genes, i.e., those least supportive of the assertion that β_1 = β_2 = β_3 = β_4 = 0.

Figure 10: (a) Average gene expression of the 50 top-ranked genes supportive of |β_1 − β_3| = 0 and |β_2 − β_4| > 0.1; (b) heatmap of those genes.

5 Discussion

We have proposed a semiparametric Bayesian method for random effects models in the context of multiple hypothesis testing. A key feature of the model is the use of a spiked centering distribution for the Dirichlet process prior. Dirichlet process mixture models naturally borrow information across similar observations through model-based clustering, gaining increased power for testing, and this centering distribution allows the model to accommodate the estimation of sharp hypotheses. We have demonstrated via a simulation study that our method yields a lower proportion of false discoveries than other competitive methods. We have also presented an application to microarray data where our method readily infers posterior probabilities of genes being differentially expressed.

One issue with our model is that the results are not necessarily invariant to the choice of the reference treatment. Consider, for example, the gene expression analysis of Section 4.2, in which we used the group (Old, Baseline) as the reference group. To investigate robustness, we reanalyzed the data using (Young, Baseline) as the reference group. We found that the rankings from the two analyses are very close to each other (Spearman's correlation = 0.9938; Figure 11).

Figure 11: Scatter plot of the rankings of the genes under the two reference treatments.

Finally, as mentioned in Section 2.1, our current model can easily accommodate covariates by placing them in the X matrix. Such covariates might include, for example, demographic variables regarding the subject or environmental conditions (e.g., temperature in the lab) that affect each array measurement. Adjusting for such covariates has the potential to increase the statistical power of the tests.

Appendix

1.1 Full Conditional for the Precisions

\[
\begin{aligned}
p(\tau \mid d, \beta, \lambda) &\propto p(\tau) \prod_{S \in \xi_\beta} p(\beta_S \mid \pi, \tau)\, p(d_S \mid \beta_S, \lambda_S, \pi, \tau) \\
&= \prod_{t=1}^{T} \Big\{ p(\tau_t) \prod_{S \in \zeta_t} p(\beta_{St} \mid \pi_t, \tau_t) \Big\} \\
&\propto \prod_{t=1}^{T} \Big\{ p(\tau_t) \prod_{S \in \zeta_t} N(\beta_{St} \mid m_t, \tau_t) \Big\} \\
&\propto \prod_{t=1}^{T} \tau_t^{a_\tau + |\zeta_t|/2 - 1} \exp\left\{-\tau_t \left[\frac{1}{b_\tau} + \frac{1}{2}\sum_{S \in \zeta_t} (\beta_{St} - m_t)^2\right]\right\}.
\end{aligned}
\]

(The factor p(d_S | β_S, λ_S, π, τ) does not depend on τ and drops out.)

1.2 Full Conditional for the Spike Probability y_t = ρ_t r_π

Note that the modified prior sets ρ_t r_π = p(β_t = 0), where r_π = a_π/(a_π + b_π); we thus need the posterior of ρ_t, i.e., ρ_t | rest ∝ p(β_t = 0 | rest). Set Y_t = r_π ρ_t. Then the density of Y_t is

\[
p(y_t) = \frac{1}{B(a_\rho, b_\rho)} \left(\frac{y_t}{r_\pi}\right)^{a_\rho - 1} \left(1 - \frac{y_t}{r_\pi}\right)^{b_\rho - 1} \frac{1}{r_\pi}
= \frac{1}{B(a_\rho, b_\rho)} \left(\frac{1}{r_\pi}\right)^{a_\rho + b_\rho - 1} y_t^{a_\rho - 1} (r_\pi - y_t)^{b_\rho - 1}.
\]
Now we draw Y_t, rather than ρ_t, from its conditional distribution: for each t,

\[
p(y_t \mid \text{rest}) \propto p(y_t)\; y_t^{\sum_{S \in \xi_\beta} I(\beta_{St} = 0)}\, (1 - y_t)^{\sum_{S \in \xi_\beta} I(\beta_{St} \neq 0)},
\]

which is not of known distributional form; we used a grid-based inverse-cdf method to sample Y_t. Once Y_t is drawn, we recover ρ_t as Y_t/r_π.

1.3 Full Conditional for the Regression Coefficients

\[
\begin{aligned}
p(\beta_{St} \mid \lambda_S, d_S, y_t, \beta_{S(-t)}) &\propto p(\beta_{St} \mid y_t) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S) \\
&= y_t\, \delta_0(\beta_{St}) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S)
+ (1 - y_t)\, N(\beta_{St} \mid m_t, \tau_t) \prod_{g \in S} p(d_g \mid \beta_{St}, \beta_{S(-t)}, \lambda_S).
\end{aligned}
\]

The first term is straightforward; consider the second. Set x_t = (X_{1t}, · · · , X_{Kt})', X_{(-t)} = (x_1, · · · , x_{t-1}, x_{t+1}, · · · , x_T), and β_{S(-t)} = (β_{S1}, · · · , β_{S(t-1)}, β_{S(t+1)}, · · · , β_{ST})'. The second term is proportional to

\[
\begin{aligned}
&\exp\left\{-\frac{1}{2}\tau_t (\beta_{St} - m_t)^2\right\} \times \exp\left\{-\frac{1}{2}\sum_{g \in S} D_g' Q_g D_g\right\},
\quad \text{where } D_g = d_g - x_t \beta_{St} - X_{(-t)}\beta_{S(-t)} - E_g^{-1} f_g, \\
&\propto \exp\left\{-\frac{1}{2}\left(\tau_t \beta_{St}^2 - 2\tau_t m_t \beta_{St}\right)\right\} \exp\left\{-\frac{1}{2}\sum_{g \in S} (x_t \beta_{St} - A_g)' Q_g (x_t \beta_{St} - A_g)\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[\beta_{St}^2 \Big(\tau_t + \sum_{g \in S} x_t' Q_g x_t\Big) - 2\beta_{St}\Big(m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g\Big)\right]\right\}.
\end{aligned}
\]

Therefore, for each t,

\[
\beta_{St} \mid \cdot =
\begin{cases}
0 & \text{with probability } \pi_{St}, \\[4pt]
\sim N\!\left(\dfrac{m_t \tau_t + \sum_{g \in S} x_t' Q_g A_g}{\tau_t + \sum_{g \in S} x_t' Q_g x_t}, \; \tau_t + \sum_{g \in S} x_t' Q_g x_t\right) & \text{with probability } 1 - \pi_{St}.
\end{cases}
\]

References

Antoniak, C. E. (1974). "Mixtures of Dirichlet Processes With Applications to Bayesian Nonparametric Problems." The Annals of Statistics, 2: 1152–1174.

Baldi, P. and Long, A. D. (2001). "A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes." Bioinformatics, 17: 509–519.

Benjamini, Y. and Hochberg, Y. (1995). "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing." Journal of the Royal Statistical Society, Series B: Methodological, 57: 289–300.

Berry, D. A. and Hochberg, Y. (1999). "Bayesian Perspectives on Multiple Comparisons." Journal of Statistical Planning and Inference, 82: 215–227.

Binder, D. A. (1978). "Bayesian Cluster Analysis." Biometrika, 65: 31–38.

Brown, P., Vannucci, M., and Fearn, T. (1998). "Multivariate Bayesian variable selection and prediction." Journal of the Royal Statistical Society, Series B, 60: 627–641.

Cai, B. and Dunson, D. (2007). "Variable selection in nonparametric random effects models." Technical report, Department of Statistical Science, Duke University.

Dahl, D. B. (2006). "Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model." In Do, K.-A., Müller, P., and Vannucci, M. (eds.), Bayesian Inference for Gene Expression and Proteomics, 201–218. Cambridge University Press.

Dahl, D. B., Mo, Q., and Vannucci, M. (2008). "Simultaneous Inference for Multiple Testing and Clustering via a Dirichlet Process Mixture Model." Statistical Modelling: An International Journal, 8: 23–39.

Dahl, D. B. and Newton, M. A. (2007). "Multiple Hypothesis Testing by Clustering Treatment Effects." Journal of the American Statistical Association, 102(478): 517–526.

Do, K.-A., Müller, P., and Tang, F. (2005). "A Bayesian mixture model for differential gene expression." Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3): 627–644.

Dudoit, S., Gilbert, H. N., and van der Laan, M. J. (2008).
"Resampling-Based Empirical Bayes Multiple Testing Procedures for Controlling Generalized Tail Probability and Expected Value Error Rates: Focus on the False Discovery Rate and Simulation Study." Biometrical Journal, 50: 716–744.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). "Multiple Hypothesis Testing in Microarray Experiments." Statistical Science, 18(1): 71–103.

Dudoit, S., Yang, Y. H., Callow, M. J., and Speed, T. P. (2002). "Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments." Statistica Sinica, 12(1): 111–139.

Dunson, D. B., Herring, A. H., and Engel, S. A. (2008). "Bayesian Selection and Clustering of Polymorphisms in Functionally-Related Genes." Journal of the American Statistical Association, in press.

Escobar, M. D. and West, M. (1995). "Bayesian Density Estimation and Inference Using Mixtures." Journal of the American Statistical Association, 90: 577–588.

Genovese, C. and Wasserman, L. (2003). "Bayesian and Frequentist Multiple Testing." In Bayesian Statistics 7, 145–161. Oxford University Press.

George, E. and McCulloch, R. (1993). "Variable selection via Gibbs sampling." Journal of the American Statistical Association, 88: 881–889.

Gopalan, R. and Berry, D. A. (1998). "Bayesian Multiple Comparisons Using Dirichlet Process Priors." Journal of the American Statistical Association, 93: 1130–1139.

Hochberg, Y. (1988). "A Sharper Bonferroni Procedure for Multiple Tests of Significance." Biometrika, 75: 800–802.

Hommel, G. (1988). "A Stagewise Rejective Multiple Test Procedure Based on a Modified Bonferroni Test." Biometrika, 75: 383–386.

Irizarry, R., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U., and Speed, T. (2003). "Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data." Biostatistics, 4: 249–264.

Lau, J. W. and Green, P. J. (2007). "Bayesian model based clustering procedures." Journal of Computational and Graphical Statistics, 16: 526–558.

Lucas, J., Carvalho, C., Wang, Q., Bild, A., Nevins, J. R., and West, M. (2006). "Sparse Statistical Modelling in Gene Expression Genomics." In Do, K.-A., Müller, P., and Vannucci, M. (eds.), Bayesian Inference for Gene Expression and Proteomics, 155–174. Cambridge University Press.

MacLehose, R. F., Dunson, D. B., Herring, A. H., and Hoppin, J. A. (2007). "Bayesian methods for highly correlated exposure data." Epidemiology, 18(2): 199–207.

Medvedovic, M. and Sivaganesan, S. (2002). "Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles." Bioinformatics, 18: 1194–1206.

Müller, P., Parmigiani, G., Robert, C., and Rousseau, J. (2004). "Optimal Sample Size for Multiple Testing: The case of Gene Expression Microarrays." Journal of the American Statistical Association, 99: 990–1001.

Neal, R. M. (2000). "Markov Chain Sampling Methods for Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 9: 249–265.

Newton, M., Kendziorski, C., Richmond, C., Blattner, F., and Tsui, K. (2001). "On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data." Journal of Computational Biology, 8: 37–52.

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). "Detecting differential gene expression with a semiparametric hierarchical mixture method." Biostatistics, 5: 155–176.

Scott, J. G. and Berger, J.
O. (2006). "An Exploration of Aspects of Bayesian Multiple Testing." Journal of Statistical Planning and Inference, 136: 2144–2162.

Smyth, G. K. (2004). "Linear models and empirical Bayes methods for assessing differential expression in microarray experiments." Statistical Applications in Genetics and Molecular Biology, 3(1): Article 3.

Storey, J. D. (2002). "A Direct Approach to False Discovery Rates." Journal of the Royal Statistical Society, Series B: Statistical Methodology, 64(3): 479–498.

— (2003). "The Positive False Discovery Rate: A Bayesian Interpretation and the q-value." The Annals of Statistics, 31(6): 2013–2035.

— (2007). "The optimal discovery procedure: A new approach to simultaneous significance testing." Journal of the Royal Statistical Society, Series B, 69: 347–368.

Storey, J. D., Dai, J. Y., and Leek, J. T. (2007). "The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments." Biostatistics, 8: 414–432.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). "Strong Control, Conservative Point Estimation and Simultaneous Conservative Consistency of False Discovery Rates: a Unified Approach." Journal of the Royal Statistical Society, Series B: Statistical Methodology, 66(1): 187–205.

Tibshirani, R. and Wasserman, L. (2006). "Correlation-sharing for Detection of Differential Gene Expression." Technical Report 839, Department of Statistics, Carnegie Mellon University.

Westfall, P. H. and Wolfinger, R. D. (1997). "Multiple Tests with Discrete Distributions." The American Statistician, 51: 3–8.

Westfall, P. H. and Young, S. S. (1993). Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. John Wiley & Sons.

Yuan, M. and Kendziorski, C. (2006). "A Unified Approach for Simultaneous Gene Clustering and Differential Expression Identification." Biometrics, 62: 1089–1098.

Acknowledgments

Marina Vannucci is supported by NIH/NHGRI grant R01HG003319 and by NSF award DMS-0600416. The authors thank the Editor, the Associate Editor and the referee for their comments and constructive suggestions to improve the paper.

Bayesian Analysis (2009) 4, Number 4, pp. 733–758

Modeling space-time data using stochastic differential equations

Jason A. Duan∗, Alan E. Gelfand† and C. F. Sirmans‡

Abstract. This paper demonstrates the use and value of stochastic differential equations for modeling space-time data in two common settings. The first consists of point-referenced or geostatistical data where observations are collected at fixed locations and times. The second considers random point pattern data where the emergence of locations and times is random. For both cases, we employ stochastic differential equations to describe a latent process within a hierarchical model for the data. The intent is to view this latent process mechanistically and endow it with appropriate simple features and interpretable parameters. A motivating problem for the second setting is to model urban development through observed locations and times of new home construction; this gives rise to a space-time point pattern. We show that a spatio-temporal Cox process whose intensity is driven by a stochastic logistic equation is a viable mechanistic model that affords meaningful interpretation for the results of statistical inference.
Other applications of stochastic logistic differential equations with space-time varying parameters include modeling population growth and product diffusion, which motivate our first, point-referenced data application. We propose a method to discretize both time and space in order to fit the model. We demonstrate inference for the geostatistical model through a simulated dataset, and we then fit the Cox process model to a real dataset taken from the greater Dallas metropolitan area.

Keywords: geostatistical data, point pattern, hierarchical model, stochastic logistic equation, Markov chain Monte Carlo, urban development

1 Introduction

The contribution of this paper is to demonstrate the use and value of stochastic differential equations (SDEs) in yielding mechanistic and physically interpretable models for space-time data. We consider two common settings: (i) real-valued and point-referenced geostatistical data, where observations are taken at non-random locations s and times t; and (ii) spatio-temporal point patterns, where the locations and times themselves are random. In either case, we assume s ∈ D ⊂ R², where D is a fixed compact region, and t ∈ (0, T], where T is specified.

∗ Department of Marketing, McCombs School of Business, University of Texas, Austin, TX, mailto:duanj@mccombs.utexas.edu
† Department of Statistical Science, Duke University, Durham, NC, mailto:alan@stat.duke.edu
‡ Department of Risk Management, Insurance, Real Estate and Business Law, Florida State University, Tallahassee, FL, mailto:cfsirmans@cob.fsu.edu

© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA427

Examples of spatio-temporal geostatistical data abound in the literature. Examples appropriate to our objectives include ecological process models such as photosynthesis, transpiration, and soil moisture; diffusion models for populations, products or technologies; and financial processes such as house prices and/or land values over time. Here, we employ a customary geostatistical modeling specification, i.e., noisy space-time data are modeled by

\[ Y(s, t) = \Lambda(s, t) + \epsilon(s, t), \tag{1} \]

where ε(s, t) is a space-time noise/error process (contributing the "nugget") and the process of interest is Λ(s, t). For us, Λ(s, t) is a realization of a space-time stochastic process generated by a stochastic differential equation.

Space-time point patterns also arise in many settings: in ecology, where we might track the evolution of the range of a species over time by observing the locations of its presences; in disease incidence, examining the pattern of cases over time; and in urban development, explained, say, through the pattern of single-family homes constructed over time. The random locations and times of these events are customarily modeled with an inhomogeneous intensity surface, denoted again by Λ(s, t). Here, the theory of point processes provides convenient tools; the most commonly used and easily interpretable model is the spatio-temporal Poisson process: for any region in the area under study and any specified time interval, the total number of observed points is a Poisson random variable with mean equal to the intensity integrated over that region and time interval.

There is a substantial literature on modeling point-referenced space-time data. The most common approach is the introduction of spatio-temporal random effects described through a Gaussian process with a suitable space-time covariance function (e.g., Brown et al. 2000; Gneiting 2002; Stein 2005).
If time is discretized, we can employ dynamic models as in Gelfand et al. (2005). If locations are on a lattice (or are projected to a lattice), we can employ Gaussian Markov random fields (Rue and Held 2005). For a general discussion of such space-time modeling, see Banerjee et al. (2004). There is much less statistical literature on space-time point patterns. However, the mathematical theory of point processes on a general carrying space is well established (Daley and Vere-Jones 1988; Karr 1991). Cressie (1993) and Møller and Waagepetersen (2004) focus primarily on two-dimensional spatial point processes. Recent developments in spatio-temporal point process modeling include Ogata (1998), with application to statistical seismology, and Brix and Møller (2001), with application to modeling weeds. Brix and Diggle (2001), in modeling a plant disease, extend the log Gaussian Cox process (Møller et al. 1998) to a space-time version by using a stochastic differential equation model. See Diggle (2005) for a comprehensive review of this literature.

In either of the settings (i) and (ii) above, we propose to work with stochastic differential equation models. That is, we intentionally specify Λ(s, t) through a stochastic differential equation rather than through a spatio-temporal process (see, e.g., Banerjee et al. 2004 and references therein). We specifically introduce a mechanistic modeling scheme in which we are directly interested in the parameters that convey physical meaning in the mechanism described by a stochastic differential equation. For example, the prototype of our study is the logistic equation

\[
\frac{\partial \Lambda(s, t)}{\partial t} = r(s, t)\, \Lambda(s, t) \left[1 - \frac{\Lambda(s, t)}{K(s)}\right],
\]

where K(s) is the "carrying capacity" (assumed time-invariant) and r(s, t) is the "growth rate". Spatially and/or temporally varying parameters, such as the growth rate and carrying capacity, can be modeled by spatio-temporal processes. In practice, the logistic equation finds various applications, e.g., population growth in ecology (Kot 2001), product and technology diffusion in economics (Mahajan and Wind 1986), and urban development (see Section 4.2).

We recognize the flexibility that comes with a "purely empirical" model, such as a realization of a stationary or nonstationary space-time Gaussian process or smoothing splines, and that such specifications can be made to fit a given dataset at least as well as a specification limited to a given class of stochastic differential equations. However, for space-time data collected from physical systems it may be preferable to view the data as generated by appropriate simple mechanisms with the necessary randomness. That is, we particularly seek to incorporate the features of the mechanistic process into the model for the space-time data, enabling an interpretation of the spatio-temporal parameter processes that is more natural and intuitive. We also demonstrate, through a simulation example, that when such differential equations generate the data (up to noise), the data can inform about the parameters in these equations, and that model performance is preferable to that of a customary model employing a random process realization.

In this regard, we must resort to a discretization of time to work with these models, i.e., we actually fit stochastic finite difference models. In other words, the continuous-time specification describes the process we seek to capture, but to fit this specification with observed data we employ a first-order (Euler) approximation.
Evidently, this raises questions regarding the effect of discretization. Theoretical discussion in terms of weak convergence is presented, for instance, in Kloeden and Platen (1992), while practical refinements through latent variables have been proposed in, e.g., Elerian et al. (2001). In any event, the Euler approximation is widely applied in the literature and it is beyond our contribution here to explore its impact. Moreover, the results of our simulation example in Section 3.3 reveal that we recover the true parameters of the stochastic PDE quite well under the discretization.

Indeed, the real issue for us is how to introduce randomness into a selected differential equation specification. Section 2 is devoted to a brief review of the available options and their associated properties. Section 3 develops the geostatistical setting and provides the aforementioned simulation illustration. An early motivating problem for our research was the modeling of new home construction in a fixed region, e.g., within a city boundary. Mathematically and conceptually, the continuous trend that drives the construction of new houses can be captured by the logistic equation, where there is a rate for the growth of the number of houses and a carrying capacity that limits the total number. When new houses can be built at any locations and times, yielding a space-time point pattern, a spatio-temporal point process governed by a version of the stochastic logistic equation becomes an appealing mechanistic model. However, we also acknowledge its limitations, suggesting possible extensions at the end of the paper. Section 4.1 details the modeling for space-time point patterns and addresses formal modeling and computational issues. Section 4.2 provides a careful analysis of the house construction dataset. Finally, in Section 5, we conclude with a summary and some future directions.

2 SDE models for spatio-temporal data

Ignoring location for the moment, a usual nonlinear (non-autonomous) differential equation subject to an initial condition takes the form

\[ d\Lambda(t) = g(\Lambda(t), t, r(t))\,dt \quad \text{and} \quad \Lambda(0) = \Lambda_0. \tag{2} \]

A natural way to add randomness is to model the random parameter r(t) with an SDE:

\[ dr(t) = a(r(t), t, \beta)\,dt + b(r(t), t)\,dB(t), \]

where B(t) is an independent-increment process on R¹. (This is a very general specification. For example, a common SDE model is dΛ(t) = f(Λ(t), t)dt + h(Λ(t), t)dB(t), where B(t) is Brownian motion over R¹ and f and h are the "drift" and "volatility", respectively. This model can be considered as model (2) with g(Λ(t), t, r(t)) = f(Λ(t), t) + r(t)h(Λ(t), t) and r(t)dt = dB(t), which implies dΛ(t) = f(Λ(t), t)dt + h(Λ(t), t)dB(t).)

Analytic solutions of SDEs are rarely available, so we usually employ a first-order Euler approximation of the form

\[
\begin{aligned}
\Lambda(t + \Delta t) &= \Lambda(t) + g(\Lambda(t), t, r(t))\,\Delta t, \\
r(t + \Delta t) &= r(t) + a(r(t), t, \beta)\,\Delta t + b(r(t), t)\,[B(t + \Delta t) - B(t)],
\end{aligned}
\]

where ∆t is the interval between time points and B(t + ∆t) − B(t) ∼ N(0, ∆t) if B(t) is a Brownian motion process. Higher-order (Runge-Kutta) approximations can be introduced, but these do not seem to be employed in the statistics literature. Rather, recent work introduces latent variables Λ(t′) between Λ(t) and Λ(t + ∆t); see, e.g., Elerian et al. (2001), Golightly and Wilkinson (2008) and Stramer and Roberts (2007).

Our prototype is the logistic equation

\[ d\Lambda(t) = r(t)\,\Lambda(t)\left[1 - \frac{\Lambda(t)}{K}\right]dt \quad \text{and} \quad \Lambda(0) = \Lambda_0. \tag{3} \]

To introduce systematic randomness into this model, we specify a mean-reverting Ornstein-Uhlenbeck process for r(t):

\[ dr(t) = -\alpha\,(r(t) - \mu_r)\,dt + \sigma_\zeta\, dB(t). \tag{4} \]

Under Brownian motion, it is known that r(t) is a stationary Gaussian process with cov(r(t), r(t′)) = (σ_ζ²/α) exp(−α|t − t′|).
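A minimal sketch (ours, with arbitrary parameter values) of the Euler scheme just described, applied to the prototype pair (3)–(4):

```python
import numpy as np

def euler_logistic_ou(lam0, r0, K, mu_r, alpha, sigma, dt, n_steps, rng):
    """Euler(-Maruyama) discretization of the logistic equation (3) whose
    growth rate r(t) follows the mean-reverting OU process (4)."""
    lam, r = np.empty(n_steps + 1), np.empty(n_steps + 1)
    lam[0], r[0] = lam0, r0
    for i in range(n_steps):
        lam[i + 1] = lam[i] + r[i] * lam[i] * (1.0 - lam[i] / K) * dt
        r[i + 1] = r[i] - alpha * (r[i] - mu_r) * dt \
                   + sigma * rng.normal(scale=np.sqrt(dt))  # B(t+dt)-B(t) ~ N(0, dt)
    return lam, r

rng = np.random.default_rng(2)
lam, r = euler_logistic_ou(lam0=1.0, r0=0.5, K=100.0, mu_r=0.5,
                           alpha=1.0, sigma=0.1, dt=0.01, n_steps=2000, rng=rng)
print(lam[-1])  # approaches the carrying capacity K
```

With these values the simulated path approaches the carrying capacity K while the realized growth rate fluctuates around µ_r.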
To extend model (3) to a spatio-temporal setting, we can model $\Lambda(s,t)$ at every location $s$ with an SDE
\[
\frac{\partial \Lambda(s,t)}{\partial t} = r(s,t)\,\Lambda(s,t)\left[1 - \frac{\Lambda(s,t)}{K(s)}\right] \tag{5}
\]
subject to the initial conditions $\Lambda(s,0) = \Lambda_0(s)$.

¹ This is a very general specification. For example, a common SDE model is $d\Lambda(t) = f(\Lambda(t), t)\,dt + h(\Lambda(t), t)\,dB(t)$, where $B(t)$ is Brownian motion over $\mathbb{R}^1$ with $f$ and $h$ the "drift" and "volatility", respectively. This model can be considered as model (2) with $g(\Lambda(t), t, r(t)) = f(\Lambda(t), t) + r(t)\,h(\Lambda(t), t)$ and $r(t)\,dt = dB(t)$, which implies $d\Lambda(t) = f(\Lambda(t), t)\,dt + h(\Lambda(t), t)\,dB(t)$.

Expression (5) is derived directly from (3) as follows. Assume model (3) for the aggregate $\Lambda(D,t) = \int_D \Lambda(s,t)\,ds$:
\[
\frac{\partial \Lambda(D,t)}{\partial t} = r(D,t)\,\Lambda(D,t)\left[1 - \frac{\Lambda(D,t)}{K(D)}\right]. \tag{6}
\]
Here $r(D,t)$ is the average growth rate of $\Lambda(D,t)$, i.e., $r(D,t) = \big(\int_D r(s,t)\,ds\big)/|D|$, where $|D|$ is the area of $D$, and $K(D)$ is the aggregate carrying capacity, i.e., $K(D) = \int_D K(s)\,ds$.

The model for $\Lambda(s,t)$ at any location $s$ can be considered as the infinitesimal limit of model (6) when $D$ is a neighborhood $\delta_s$ of $s$ whose area goes to zero. Then
\[
\lim_{|\delta_s| \to 0} \Lambda(\delta_s, t)/|\delta_s| = \lim_{|\delta_s| \to 0} \Big(\int_{\delta_s} \Lambda(s', t)\,ds'\Big)/|\delta_s| = \Lambda(s,t);
\]
\[
\lim_{|\delta_s| \to 0} K(\delta_s)/|\delta_s| = \lim_{|\delta_s| \to 0} \Big(\int_{\delta_s} K(s')\,ds'\Big)/|\delta_s| = K(s);
\]
\[
\lim_{|\delta_s| \to 0} r(\delta_s, t) = \lim_{|\delta_s| \to 0} \Big(\int_{\delta_s} r(s', t)\,ds'\Big)/|\delta_s| = r(s,t).
\]
Plugging $\delta_s$ into (6) and passing to the limit, we obtain our local model (5).

Model (5) specifies an infinite-dimensional SDE model for the random field $\Lambda(s,t)$, $s \in D$. Similar to (4), we can add randomness to (5) by extending the Ornstein-Uhlenbeck process to the infinite-dimensional case,
\[
\frac{\partial r(s,t)}{\partial t} = L\,(\mu_r(s) - r(s,t)) + \frac{\partial B(s,t)}{\partial t}, \tag{7}
\]
where $L(s)$ is a spatial linear operator given by
\[
L(s) = a(s) + \sum_{l=1}^{2} b_l(s)\,\frac{\partial}{\partial s_l} - \frac{1}{2}\sum_{l=1}^{2} c_l(s)\,\frac{\partial^2}{\partial s_l^2}, \tag{8}
\]
where $a(s)$, $b_l(s)$ and $c_l(s)$ are positive deterministic functions, with $s_1$ and $s_2$ the coordinates of location $s$, and $B(s,t)$ is a spatially correlated Brownian motion. Here, equations (5) and (7) define a spatio-temporal model with a nonstationary, non-Gaussian $\Lambda(s,t)$ and a latent stationary Gaussian $r(s,t)$. Note that a well-specified covariance $C_B$ will guarantee mean-square continuity and differentiability of $r(s,t)$, $s \in D$. Because the logistic equation is Lipschitz, $\Lambda(s,t)$ will also be mean-square continuous and differentiable.

The simplest Ornstein-Uhlenbeck process model for $r(s,t)$ sets the $b_l$ and $c_l$ to zero (see Brix and Diggle 2001); the resulting covariance is separable in space and time. For example, with the Matérn spatial covariance function, we have
\[
C_r(s - s', t - t') = \sigma^2 \exp(-a|t - t'|)\,(\phi_\zeta |s - s'|)^{\nu}\,\kappa_\nu(\phi_\zeta |s - s'|), \tag{9}
\]
where $\kappa_\nu(\cdot)$ is the modified Bessel function of the second kind. When the $b_l$ and $c_l$ are not zero, $r(s,t)$ defined by equation (7) is a blur-generated process in the sense of Brown et al. (2000), with a nonseparable, non-explicit spatio-temporal covariance function. Whittle (1963) and Jones and Zhang (1997) propose other stochastic partial differential equation models, which are shown by Brown et al. (2000) to be special cases of the Ornstein-Uhlenbeck process model above.
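As an illustration of the separable structure in (9), the sketch below (ours, with made-up parameter values) builds the space-time covariance matrix of $r$ over a grid of locations and time points as a Kronecker product of an exponential temporal correlation and a Matérn spatial correlation. We use the standard normalized closed form of the Matérn correlation for $\nu = 3/2$ rather than the unnormalized Bessel-function form written in (9); the two differ only by a constant factor absorbed into $\sigma^2$.

    import numpy as np
    from scipy.spatial.distance import cdist

    def matern_32(d, phi):
        """Matern correlation with smoothness nu = 3/2 (closed form)."""
        x = phi * d
        return (1.0 + x) * np.exp(-x)

    # Illustrative values only
    sigma2, a_temp, phi_spat = 0.08**2, 0.6, 0.7

    locs = np.random.default_rng(1).uniform(0, 10, size=(44, 2))  # 44 sites
    times = np.arange(30)                                          # 30 periods

    Sigma_s = matern_32(cdist(locs, locs), phi_spat)
    Sigma_t = np.exp(-a_temp * np.abs(times[:, None] - times[None, :]))

    # Separable covariance: sigma^2 * (temporal (x) spatial) Kronecker product
    C_r = sigma2 * np.kron(Sigma_t, Sigma_s)

This Kronecker structure is what makes the separable model computationally convenient in the simulation example of Section 3.3.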
Returning to the discussion in the Introduction: conceptually, $\Lambda(s, t_j)$ is generated by the continuous-time process defined by (5) and (7). However, the exact solution of this infinite-dimensional SDE and the transition probability of the resulting Markov process for $\Lambda(s, t_j)$ are not generally available in closed form given the error process $B(s,t)$. Hence, to handle these models, time-discretization is usually required to compute $\Lambda(s,t)$ in simulation and estimation. So, we use the Euler approximation to discretize the SDE model for $\Lambda(s, t_j)$. The Euler scheme is broadly employed in simulating SDEs because it is the simplest method that is proven to generate a stochastic process with both strong convergence of order 1/2 and weak convergence of order 1 (see Kloeden and Platen 1992 for theoretical discussion). That is, the stochastic difference equation resulting from the Euler discretization scheme generates a process that converges to the process defined by the stochastic differential equation as the length of the time steps goes to zero.

3 Geostatistical models using SDE

3.1 A discretized space-time model with white noise

We assume time is discretized to small, equally spaced intervals of length $\Delta t$, indexed as $t_j$, $j = 0, 1, \ldots, J$. The data are taken to be $Y(s_i, t_j)$; i.e., an observation at location $s$ and any $t \in (t_j, t_j + \Delta t)$ is labeled $Y(s, t_j)$. Then we assume
\[
Y(s, t_j) = \Lambda(s, t_j) + \varepsilon(s, t_j),
\]
where $\varepsilon(s, t_j)$ models either sampling or measurement error, because a researcher cannot directly observe $\Lambda(s, t_j)$. The dynamics of the discretized $\Lambda(s, t_j)$ are therefore modeled by a difference equation using Euler's approximation applied to (5):
\[
\Delta\Lambda(s, t_j) = r(s, t_{j-1})\,\Lambda(s, t_{j-1})\left[1 - \frac{\Lambda(s, t_{j-1})}{K(s)}\right]\Delta t, \tag{10}
\]
\[
\Lambda(s, t_j) \approx \Lambda(s, 0) + \sum_{l=1}^{j} \Delta\Lambda(s, t_l). \tag{11}
\]
We do not have to discretize the space-time model for $r(s,t)$ if the stationary spatio-temporal Gaussian process $\zeta(s,t)$ allows direct evaluation of its covariance function. For example, the model (7) with constant $L(s) = a_r$ has the closed-form separable covariance function given in (9), which can be used directly in modeling and can be estimated. Using this form saves one approximation step.

We still need to model the initial $\Lambda(s,0)$ and $K(s)$ if they are not known. For example, because $\Lambda(s,0)$ and $K(s)$ are positive in the logistic equation, we can model them by log-Gaussian spatial processes with regression forms for the means,
\[
\log \Lambda(s,0) = \mu_\Lambda(X_\Lambda(s), \beta_\Lambda) + \theta_\Lambda(s), \qquad \theta_\Lambda(s) \sim GP(0, C_\Lambda(s - s'; \varphi_\Lambda));
\]
\[
\log K(s) = \mu_K(X_K(s), \beta_K) + \theta_K(s), \qquad \theta_K(s) \sim GP(0, C_K(s - s'; \varphi_K)).
\]
Similarly, $\mu_r(s)$ below (7) can be modeled as $\mu_r(X_r(s), \beta_r)$. Conditioned on $\Lambda(s, t_j)$, the $Y(s, t_j)$ are mutually independent.

With data $Y(s_i, t_j)$ at locations $\{s_i, i = 1, \ldots, n\} \subset D$, we can provide a hierarchical model based on the evolution of $\Lambda(s,t)$ and the space-time parameters. We fit this model within a Bayesian framework, so completion of the model specification requires suitable priors on the hyper-parameters. For simplicity, we suppress the indices $t$ and $s$ and let our observations at time $t_j$ be $y_j = \{y_{j1}, \ldots, y_{jn}\}$ at the corresponding locations $s_1, \ldots, s_n$. Accordingly, we let $\Lambda_j$, $\Delta\Lambda_j$, $r_j$, $K$, $\mu_\Lambda(\beta_\Lambda)$, $\mu_K(\beta_K)$, $\mu_r(\beta_r)$, $\theta_\Lambda$, $\theta_K$ and $\zeta$ be the vectors of the corresponding functions and processes in our continuous model evaluated at $s_i \in \{s_1, \ldots, s_n\}$. Note that we begin with the initial observations $y_0$. The hierarchical model for $y_0, \ldots, y_J$ becomes
\[
y_j \mid \Lambda_j \sim N(\Lambda_j,\, \sigma_\varepsilon^2 I_n), \quad j = 0, \ldots, J,
\]
\[
\Delta\Lambda_j = r_{j-1}\,\Lambda_{j-1}\left[1 - \frac{\Lambda_{j-1}}{K}\right]\Delta t, \qquad \Lambda_j = \Lambda_0 + \sum_{l=1}^{j} \Delta\Lambda_l, \tag{12}
\]
\[
\log \Lambda_0 = \mu_\Lambda(\beta_\Lambda) + \theta_\Lambda, \qquad \theta_\Lambda \sim N(0, C_\Lambda(s - s'; \varphi_\Lambda)),
\]
\[
\log K = \mu_K(\beta_K) + \theta_K, \qquad \theta_K \sim N(0, C_K(s - s'; \varphi_K)),
\]
\[
r = \mu_r(\beta_r) + \zeta, \qquad \zeta \sim N(0, C_r(s - s', t - t'; \varphi_r)),
\]
\[
\beta_\Lambda, \beta_r, \beta_K, \varphi_\Lambda, \varphi_K, \varphi_r \sim \text{priors},
\]
where the $\beta_{(\cdot)}$ are the parameters in the mean surface functions and $C_\Lambda$, $C_K$ and $C_r$ are the covariance matrices.² In this model, $\Lambda_0$, $r$ and $K$ are latent variables. Note that the $\Lambda_j$'s are deterministic functions of $\Lambda_0$, $r$ and $K$. The joint likelihood for the $J+1$ conditionally independent observations and the latent variables is
\[
\prod_{j=0}^{J} \big\{ N\big(y_j \mid \Lambda_j(\Lambda_{j-1}, r_j, K),\, \sigma_\varepsilon^2 I_n\big) \big\}\; N(\log \Lambda_0 \mid \mu_\Lambda, C_\Lambda)\; N(\log K \mid \mu_K, C_K)\; N(r \mid \mu_r, C_r), \tag{13}
\]
where we let $r = \{r_0, \ldots, r_{J-1}\}$.

² We will write $\mu_\Lambda(\beta_\Lambda)$, $\mu_K(\beta_K)$, $\{\mu_r^{(j)}(\beta_r); j = 1, \ldots, J\}$, $C_\Lambda(\varphi_\Lambda)$, $C_K(\varphi_K)$ and $C_r(\varphi_r)$ as $\mu_\Lambda$, $\mu_K$, $\mu_r$, $C_\Lambda$, $C_K$ and $C_r$ when there is no ambiguity.

3.2 Bayesian inference and prediction

With regard to inference for the model in (12), there are three latent vectors: $r$, $K$ and $\Lambda_0$. The hyper-parameters in this model include the $\beta_r$, $\beta_K$ and $\beta_\Lambda$ in the parametric trend surfaces, the spatial random effects $\zeta$, $\theta_K$, $\theta_\Lambda$, and the hyper-parameters $\varphi_r$, $\varphi_K$, $\varphi_\Lambda$ in the covariance functions. The priors for the hyper-parameters are assumed to have the form
\[
\beta_r, \beta_K, \beta_\Lambda \sim \pi(\beta_r)\,\pi(\beta_K)\,\pi(\beta_\Lambda); \qquad \varphi_r, \varphi_K, \varphi_\Lambda \sim \pi(\varphi_r)\,\pi(\varphi_K)\,\pi(\varphi_\Lambda), \tag{14}
\]
where each of $\beta_r$, $\beta_K$, $\beta_\Lambda$, $\varphi_r$, $\varphi_K$, $\varphi_\Lambda$ may represent multiple parameters. For example, we have $\varphi_r = \{\alpha_r, \phi_r, \sigma_r^2, \nu\}$ in the Matérn class covariance function for the separable model (9). Exact specifications of the priors for the $\beta$'s and $\varphi$'s depend on the particular application. For example, if we take $\mu_r(s; \beta_r) = X(s)\,\beta_r$, we adopt a weak normal prior $N(0, \Sigma_\beta)$ for $\beta_r$. The parameter $\sigma_r^2$ receives the usual Inverse-Gamma prior.

Note that the $\Delta\Lambda_j$ in the likelihood (13) for the discretized model are deterministic functions of $r$, $K$ and $\Lambda_0$ defined by (10) and (11). Therefore the joint posterior is proportional to
\[
\prod_{j=0}^{J} \big\{ N\big(y_j \mid \Lambda_j(\Lambda_{j-1}, r_j, K),\, \sigma_\varepsilon^2 I_n\big) \big\}\; N(\log \Lambda_0 \mid \mu_\Lambda, C_\Lambda)\; N(\log K \mid \mu_K, C_K)\; N(r \mid \mu_r, C_r)
\]
\[
\times\ \pi(\beta_r)\,\pi(\beta_K)\,\pi(\beta_\Lambda)\,\pi(\varphi_r)\,\pi(\varphi_K)\,\pi(\varphi_\Lambda). \tag{15}
\]
We simulate the posterior distributions of the model parameters and latent variables in (15) using a Markov chain Monte Carlo algorithm. Because the intensities in the likelihood function are irregular, recursive and nonlinear functions of the model parameters and latent variables, it is very difficult to obtain the derivatives needed for an MCMC sampler with directional moves, such as the Langevin method. Instead, we use a random-walk Metropolis-Hastings algorithm for the posterior simulation. Each parameter is updated in turn in every iteration of the simulation.
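To illustrate the flavor of these random-walk Metropolis-Hastings updates (a generic sketch of ours, not the authors' code), the following updates a single positive parameter, such as $\sigma_r^2$ or $\alpha_r$, with a log-normal random-walk proposal; `log_post` stands in for the log of the unnormalized posterior (15) as a function of that parameter with everything else held fixed.

    import numpy as np

    def rw_metropolis_positive(log_post, x0, step=0.2, n_iter=5000, seed=0):
        """Random-walk Metropolis for a positive scalar parameter.

        Proposes on the log scale (a log-normal proposal centered at the
        current value), so the Jacobian term log(x_prop) - log(x_cur) must
        be included in the acceptance ratio.
        """
        rng = np.random.default_rng(seed)
        x, lp = x0, log_post(x0)
        draws = np.empty(n_iter)
        for t in range(n_iter):
            x_prop = x * np.exp(step * rng.normal())
            lp_prop = log_post(x_prop)
            # acceptance ratio with the log-normal proposal correction
            if np.log(rng.uniform()) < lp_prop - lp + np.log(x_prop) - np.log(x):
                x, lp = x_prop, lp_prop
            draws[t] = x
        return draws

    # Toy usage: sampling an Inverse-Gamma(2, 1) stand-in target
    draws = rw_metropolis_positive(lambda v: -3.0 * np.log(v) - 1.0 / v, x0=1.0)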
The prediction problem concerns (i) interpolating the past at new locations and (ii) forecasting the future at current and new locations. Indeed, we can hold out observed data at new locations or in a future time period to validate our model. For the logistic growth function, conditioning on the posterior samples of $\Lambda_0$, $K$, $r$ and $\beta_r$, $\beta_K$, $\beta_\Lambda$, $\varphi_r$, $\varphi_K$, $\varphi_\Lambda$, we can use spatio-temporal interpolation and temporal extrapolation to obtain $\Delta\Lambda_{J+\Delta J}(s)$ in period $J + \Delta J$ at any new location $s \in D$: we calculate $\mu_r(s, \beta_r)$, $\mu_K(s, \beta_K)$, $\mu_\Lambda(s, \beta_\Lambda)$, obtain $\zeta(s,t)$, $t = 1, \ldots, J + \Delta J$, $\theta_K(s)$ and $\theta_\Lambda(s)$ by spatio-temporal prediction, and then use (10) and (11) recursively. Because we can obtain a predictive sample of $\Delta\Lambda_{J+\Delta J}(s)$ from the posterior samples of the model fitting, we can make inference about any feature of interest associated with the predictive distribution of $\Delta\Lambda_{J+\Delta J}(s)$. The spatial interpolation of past observations at new locations is demonstrated in the subsection below using a simulated example. We will also demonstrate temporal prediction when we apply a Cox-process version of our model to the house construction data in Section 4.

3.3 A simulated data example

In order to see how well we can learn about the true process, we illustrate the fitting of the model in (12) with a simulated data set. In a study region $D$ of 10×10 square units, shown as the block in Figure 1, we simulate 44 locations at which spatial observations are collected over 30 periods. The observed spatio-temporal data therefore constitute a 44×30 matrix. The data are sampled using (12), where we fix the carrying capacity to be one at all locations. We may envision the data as household adoption rates for a certain durable product (e.g., air conditioners, motorcycles) in 44 cities over 30 months. A capacity of one means 100% adoption. Household adoption rates are collected by surveys with measurement error.

The initial condition $\Lambda_0$ is simulated as a log-Gaussian process with a constant mean surface $\mu_\Lambda$ and the Matérn class covariance with smoothness parameter $\nu$ set to 3/2. The spatio-temporal growth rate $r$ is simulated using a constant mean $\mu_r$ and the separable covariance function (9), where the Matérn smoothness parameter $\nu$ is also set to 3/2. This separable model induces a convenient covariance matrix as the Kronecker product of the temporal and spatial correlation matrices: $\sigma_r^2\,\Sigma_t \otimes \Sigma_s$. The values of the fixed parameters in our data simulation are presented in Table 1.

    Model Parameter   True Value   Posterior Mean   95% Equal-tail Interval
    µΛ                -4.2         -4.14            (-4.88, -3.33)
    σΛ                 1.0          0.91            (0.62, 1.46)
    φΛ                 0.7          0.77            (0.50, 1.20)
    σε                 0.05         0.049           (0.047, 0.052)
    µr                 0.24         0.24            (0.22, 0.26)
    σr                 0.08         0.088           (0.077, 0.097)
    φr                 0.7          0.78            (0.60, 1.10)
    αr                 0.6          0.64            (0.51, 0.98)

Table 1: Parameters and their posterior inference for the simulated example

We use the simulated $r$ and $\Lambda_0$ and the transition equation (10) recursively to obtain $\Delta\Lambda_j$ and $\Lambda_j$ for each of the 30 periods. The observed data $y_j$ are sampled as mutually independent given $\Lambda_j$ with random noise $\varepsilon_j$. The data at four selected locations (marked 1, 2, 3 and 4 in Figure 1) are shown as small circles in Figure 2. We leave out the data at four randomly chosen locations (shown as diamonds and marked A, B, C and D in Figure 1) for spatial prediction and out-of-sample validation of our model.

We fit the same model (12) to the data at the remaining 40 locations (hence a 40×30 spatio-temporal data set). We use very vague priors for the constant means: $\pi(\mu_\Lambda) \sim N(0, 10^8)$ and $\pi(\mu_r) \sim N(0, 10^8)$. We use natural conjugate priors for the precision parameters (inverses of the variances) of $r$ and $\Lambda_0$: $\pi(1/\sigma_r^2) \sim \mathrm{Gamma}(1,1)$ and $\pi(1/\sigma_\Lambda^2) \sim \mathrm{Gamma}(1,1)$. The positive parameter for the temporal correlation of $r$ also has a vague log-normal prior: $\pi(\alpha_r) \sim \text{log-}N(0, 10^8)$. Because the spatial range parameters $\phi_r$ and $\phi_\Lambda$ are only weakly identified (Zhang 2004), we use informative, discrete priors for them: we have chosen 20 values (from 0.1 to 2.0) and assume uniform priors over them for both $\phi_r$ and $\phi_\Lambda$.

We use the random-walk Metropolis-Hastings algorithm to simulate posterior samples of $r$ and $\Lambda_0$. We draw the entire vector of $\Lambda_0$ for all forty locations as a single block in every iteration. Because $r$ is very high-dimensional ($r$ being a 40×30 matrix), we cannot draw the entire matrix of $r$ as one block and still achieve a satisfactory acceptance rate (between 20% and 40%). Our algorithm therefore divides $r$ into 40 row blocks (location-wise) in every odd-numbered iteration and 30 column blocks (period-wise) in every even-numbered iteration. Each block is drawn in one Metropolis step.
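The alternating row/column blocking can be written compactly. The sketch below is our own schematic, not the authors' implementation: `log_post` is a placeholder for the log posterior contribution of a proposed block given the rest of the state (here replaced by a toy standard-normal target so the code runs as-is).

    import numpy as np

    def metropolis_block(block, log_post, step, rng):
        """One random-walk Metropolis move on a block of r."""
        prop = block + step * rng.normal(size=block.shape)
        if np.log(rng.uniform()) < log_post(prop) - log_post(block):
            return prop
        return block

    def update_r(r, log_post, step, iteration, rng):
        """Alternate blocking for the 40 x 30 matrix r: row (location)
        blocks on odd iterations, column (period) blocks on even ones."""
        r = r.copy()
        if iteration % 2 == 1:
            for i in range(r.shape[0]):
                r[i, :] = metropolis_block(r[i, :], log_post, step, rng)
        else:
            for j in range(r.shape[1]):
                r[:, j] = metropolis_block(r[:, j], log_post, step, rng)
        return r

    # Toy run with a pseudo-target standing in for the posterior (15)
    rng = np.random.default_rng(0)
    r = np.zeros((40, 30))
    for it in range(1, 101):
        r = update_r(r, lambda b: -0.5 * np.sum(b**2), 0.5, it, rng)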
We find the posterior samples start to converge after about 30,000 iterations. Given the sampled $r$ and $\Lambda_0$, the mean parameters $\mu_r$, $\mu_\Lambda$ and the precision parameters $1/\sigma_r^2$ and $1/\sigma_\Lambda^2$ all have conjugate priors, and therefore their posterior samples are drawn with Gibbs steps. $\phi_r$ and $\phi_\Lambda$ have discrete priors and are therefore drawn with discrete Gibbs steps as well. We also use the random-walk Metropolis-Hastings algorithm to draw $\alpha_r$. We obtain 200,000 samples from the algorithm and discard the first 100,000 as burn-in. For the posterior inference, we use 4,000 subsamples from the remaining 100,000 samples, with a thinning interval of 25. It takes about 15 hours to finish the computation using the R statistical software on an Intel Pentium 4 3.4GHz computer with 2GB of memory.

The posterior means and 95% equal-tail posterior intervals for the model parameters are presented in Table 1. Evidently we recover the true parameter values very well. Figure 2 displays the posterior mean growth curves and 95% Bayesian predictive intervals for the four locations (1, 2, 3 and 4), compared with the actual latent growth curve $\Lambda(t,s)$ and the observed data. Up to the uncertainty in the model, we approximate the actual curves very well. The fitted mean growth curves almost perfectly overlap the actual simulated growth curves. The empirical coverage rate of the Bayesian predictive bounds is 93.4%.

We use the Bayesian spatial interpolation of Section 3.2 to obtain the predictive growth curves for the four new locations (A, B, C and D). In Figure 3 we display the means of the predicted curves and 95% Bayesian predictive intervals, together with the hold-out data. We can see that the spatial prediction captures the patterns of the hold-out data very well. The predicted mean growth curves overlap the actual simulated growth curves closely, except at location D, which is rather far from all the observed locations. The empirical coverage rate of the Bayesian predictive intervals is 95.8%.

We also fit the following customary process-realization model with space-time random effects to the simulated data set:
\[
y_j = \mu + \xi_j + \varepsilon_j; \qquad \varepsilon_j \sim N(0, \sigma_\varepsilon^2 I_n), \quad j = 0, \ldots, J, \tag{16}
\]
where the random effects $\xi = [\xi_0, \ldots, \xi_J]$ come from a Gaussian process with a separable spatio-temporal correlation of the form
\[
C_\xi(t - t', s - s') = \sigma_\xi^2 \exp(-\alpha_\xi |t - t'|)\,(\phi_\xi |s - s'|)^{\nu}\,\kappa_\nu(\phi_\xi |s - s'|), \quad \nu = \tfrac{3}{2}. \tag{17}
\]
Comparison of model performance between our model in (12) and the model in (16) is conducted using spatial prediction at the 4 new locations in Figure 1. The computational cost of the model in (16) is, of course, much lower; this model can be fitted with a Gibbs sampler and requires one hour for 100,000 iterations. After we discard 20,000 as burn-in and thin the remaining samples to 4,000, we conduct the prediction at the four new sites (A, B, C and D).
In Figure 4 we display the means of the predicted curves and 95% Bayesian predictive intervals under the benchmark model, together with the hold-out data. For the four hold-out sites, the average mean square error of the model (12) is 1.75×10⁻³ versus 3.34×10⁻³ for the model (16); the average length of the 95% predictive intervals for the model (12) is 0.29 versus 0.72 for the model (16). It is evident that the prediction results under the benchmark model are substantially worse than those under our model (12): the mean growth curves are less accurate and less smooth, and the 95% predictive intervals are much wider.

4 Space-time Cox process models using SDE

4.1 The model

Here we turn to the use of an SDE to provide a Cox process model for space-time point patterns. Let $D$ again be a fixed region and let $X_T$ denote an observed space-time point pattern within $D$ over the time interval $[0, T]$. The Cox process model assumes a positive space-time intensity that is a realization of a stochastic process. Denote the stochastic intensity by $\Omega(s,t)$, $s \in D$, $t \in [0,T]$. In practice, we may only know the spatial coordinates of all the points, whereas the time coordinates are only known to lie in the interval $[0,T]$. For example, in our house construction data for Irving, TX, we only have the geo-coded locations of the newly constructed houses within a year; the exact time when the construction of a new house starts is not available. The integrated process $\Lambda(s,T) = \int_0^T \Omega(s,t)\,dt$, provided $\Omega(s,t)$ is integrable over $[0,T]$, is the intensity for this kind of point pattern.

We may also know multiple subintervals of $[0,T]$: $[t_1 = 0, t_2), \ldots, [t_{J-1}, t_J = T]$, and observe a point pattern in each subinterval. These data constitute a series of discrete-time spatio-temporal point patterns, denoted by $X_{[t_1=0,\,t_2)}, \ldots, X_{[t_{J-1},\,t_J=T]}$. The integrated process also provides stochastic intensities for these point patterns:
\[
\Delta\Lambda_j(s) = \Lambda(s, t_j) - \Lambda(s, t_{j-1}) = \int_{t_{j-1}}^{t_j} \Omega(s, \tau)\,d\tau.
\]
In this paper, we model the dynamics of these point patterns by an infinite-dimensional SDE subject to the initial condition for $\Lambda(s,t)$. Note that an equivalent infinite-dimensional SDE for $\Omega(s,t)$ can also be derived from the equation for $\Lambda(s,t)$. Even when the complete space-time data $X_T(s,t)$ are not observed, the temporally dependent $X_{[t_1=0,\,t_2)}, \ldots, X_{[t_{J-1},\,t_J=T]}$ still provide a good approximation to $X_T(s,t)$ when the time intervals are sufficiently small (Brix and Diggle 2001). Moreover, this also facilitates the use of the approximated intensity
\[
\Delta\Lambda_j(s) = \Lambda(s, t_j) - \Lambda(s, t_{j-1}) = \int_{t_{j-1}}^{t_j} \Omega(s, \tau)\,d\tau \approx \Omega(s, t_{j-1})\,(t_j - t_{j-1}).
\]
As a concrete example, we return to the house construction dataset mentioned in Section 1. Let $X_j = X_{[t_{j-1},\,t_j)} = x_j$ be the observed set of locations of new houses built in region $D$ during period $j = [t_{j-1}, t_j)$. We can apply the Cox process model to $X_j$ and assume that the stochastic intensity $\Lambda(s,t)$ follows the logistic equation model (5). We can also apply the discretized version (10) to $\Delta\Lambda_j(s)$. Let our initial point pattern be $x_0$, with intensity $\Lambda_0(s) = \int_{-\infty}^{0} \Omega(s, \tau)\,d\tau$. The hierarchical model for the space-time point patterns is simply the model (12) with the first stage of the hierarchy replaced by
\[
x_j \mid \Delta\Lambda_j \sim \text{Poisson Process}(D, \Delta\Lambda_j), \quad j = 1, \ldots, J; \qquad x_0 \mid \Lambda_0 \sim \text{Poisson Process}(D, \Lambda_0), \tag{18}
\]
where we again suppress the indices $t$ and $s$ for the periods $t_1, \ldots, t_J$.
Note that, unlike in (12), the intensity $\Delta\Lambda_j$ for $x_j$ must be positive. Therefore, we model the log growth rate; that is,
\[
\log r(s,t) = \mu_r(s; \beta_r) + \zeta(s,t), \qquad \zeta \sim GP(0, C_\zeta(s - s', t - t'; \varphi_r)). \tag{19}
\]
The $J$ spatial point patterns are conditionally independent given the space-time intensity, so the likelihood is
\[
\prod_{j=1}^{J} \left\{ \exp\Big(-\int_D \Delta\Lambda_j(s)\,ds\Big) \prod_{i=1}^{n_j} \Delta\Lambda_j(x_{ji}) \right\} \cdot \exp\Big(-\int_D \Lambda_0(s)\,ds\Big) \prod_{i=1}^{n_0} \Lambda_0(x_{0i}). \tag{20}
\]
This likelihood is more difficult to work with than the one in (13). There is a stochastic integral in (20), $\int_D \Delta\Lambda_j(s)\,ds$, which must be approximated in model fitting by a Riemann sum. To do this, we divide the geographical region $D$ into $M$ cells and assume the intensity is homogeneous within each cell. Let $\Delta\Lambda_j(m)$ and $\Lambda_0(m)$ denote this average intensity in cell $m$, and let the area of cell $m$ be $A(m)$. Then the likelihood becomes
\[
\prod_{j=1}^{J} \left[ \exp\Big(-\sum_{m=1}^{M} \Delta\Lambda_j(m)\,A(m)\Big) \prod_{m=1}^{M} \Delta\Lambda_j(m)^{n_{jm}} \right] \cdot \exp\Big(-\sum_{m=1}^{M} \Lambda_0(m)\,A(m)\Big) \prod_{m=1}^{M} \Lambda_0(m)^{n_{0m}}, \tag{21}
\]
where $n_{jm}$ is the number of points in cell $m$ in period $j$. Our parameter processes $r(s, t_j)$ and $K(s)$ are also approximated accordingly as $r_j(m)$ and $K(m)$, which are constant within each cell $m$.

4.2 Modeling house construction data for Irving, TX

Our house construction dataset consists of the geo-coded locations and years of the newly constructed residential houses in Irving, TX from 1901 to 2002. Figure 5 shows how the city grew from the early 1950's to the late 1960's. Irving started to develop in the early 1950's, and the outline of the city was already in its current shape by the late 1960's. The city became almost fully developed by the early 1970's, with far fewer new constructions after that era. Therefore, for our data analysis, we select the period from 1951 through 1969, when there was rapid urban development. We use the data from 1951–1966 to fit our model and hold out the last three years (1967, 1968 and 1969) for prediction and model validation.

As shown in the central block of Figure 6, our study region $D$ in this example is a square of 5.6×5.6 square miles with Irving, TX in the middle. This region is geographically disconnected from other major urban areas in Dallas County, which enables us to isolate Irving for analysis. We divide the region into 100 (10×10) equally spaced grid cells, shown in Figure 6. Within each cell, we model the point pattern as a homogeneous Poisson process given $\Delta\Lambda_j(m)$. The corresponding $\Lambda_0(m)$, $K(m)$ and $r_j(m)$ are collected into vectors $\Lambda_0$, $K$ and $r$, which are modeled as follows:
\[
\log \Lambda_0 = \mu_\Lambda + \theta_\Lambda, \qquad \theta_\Lambda \sim N(0, C_\Lambda);
\]
\[
\log K = \mu_K + \theta_K, \qquad \theta_K \sim N(0, C_K);
\]
\[
\log r = \mu_r + \zeta, \qquad \zeta \sim N(0, C_r);
\]
where the spatial covariance matrices $C_\Lambda$ and $C_K$ are constructed using the Matérn class covariance function with distances between the centroids of the cells. The smoothness parameter $\nu$ is set to 3/2. The variances $\sigma_\Lambda^2$, $\sigma_K^2$ and the range parameters $\phi_\Lambda$, $\phi_K$ are to be estimated. The spatio-temporal log growth rate $r$ is assumed to have a separable covariance matrix $C_r = \sigma_r^2\,\Sigma_t \otimes \Sigma_s$, where the spatial correlation $\Sigma_s$ is also constructed as a Matérn class function of the distances between cell centroids with smoothness parameter $\nu$ set to 3/2, and the temporal correlation $\Sigma_t$ is of exponential form as in (9). The variance $\sigma_r^2$ and the spatial and temporal correlation parameters $\phi_r$ and $\alpha_r$ are to be estimated.
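To make the Riemann-sum likelihood (21) above concrete, here is a short sketch of ours (variable names are our own) computing the log of the per-period terms of (21) from a matrix of cell counts and a matrix of cell intensities; the initial-pattern term involving $\Lambda_0$ has the same form and is omitted.

    import numpy as np

    def log_lik_grid(counts, intensities, areas):
        """Log of the gridded Poisson-process likelihood terms in (21).

        counts      : (J, M) array, n_{jm} points in cell m during period j
        intensities : (J, M) array, Delta-Lambda_j(m) > 0 for each cell/period
        areas       : (M,) array, cell areas A(m)
        """
        # sum_j sum_m [ n_{jm} log DeltaLambda_j(m) - DeltaLambda_j(m) A(m) ]
        return np.sum(counts * np.log(intensities) - intensities * areas)

    # Toy usage: 3 periods, 4 cells of unit area
    rng = np.random.default_rng(2)
    lam = rng.gamma(2.0, 1.0, size=(3, 4))   # positive intensities
    n = rng.poisson(lam)                      # simulated counts
    print(log_lik_grid(n, lam, np.ones(4)))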
We use very vague priors for the parameters in the mean functions: $\pi(\mu_\Lambda), \pi(\mu_K), \pi(\mu_r) \stackrel{ind}{\sim} N(0, 10^8)$. We use natural conjugate priors for the precision parameters (inverses of the variances) of $r$ and $\Lambda_0$: $\pi(1/\sigma_\Lambda^2), \pi(1/\sigma_K^2), \pi(1/\sigma_r^2) \stackrel{ind}{\sim} \mathrm{Gamma}(1,1)$. The temporal correlation parameter of $r$ also has a vague log-normal prior: $\pi(\alpha_r) \sim \text{log-}N(0, 10^8)$. Again, the spatial range parameters $\phi_\Lambda$, $\phi_K$ and $\phi_r$ are only weakly identified (Zhang 2004), so we use informative, discrete priors for them: we have chosen 40 values (from 1.1 to 5.0) and assume uniform priors over them for $\phi_\Lambda$, $\phi_K$ and $\phi_r$.

We use the same random-walk Metropolis-Hastings algorithm as in the simulation example to simulate posterior samples, with the same tuning of acceptance rates. As a production run, we obtain 200,000 samples from the algorithm and discard the first 100,000 as burn-in. For the posterior inference, we use 4,000 subsamples from the remaining 100,000 samples, with a thinning interval of 25. The posterior means and 95% equal-tail posterior intervals for the model parameters are presented in Table 2.

    Model Parameter   Posterior Mean   95% Equal-tail Interval
    µΛ                 2.78            (2.15, 3.40)
    σΛ                 1.77            (1.49, 2.11)
    φΛ                 3.03            (2.70, 3.20)
    µr                -2.76            (-3.24, -2.29)
    σr                 2.48            (2.32, 2.68)
    φr                 4.09            (3.70, 4.30)
    αr                 0.52            (0.43, 0.62)
    µK                 6.49            (5.93, 7.01)
    σK                 1.17            (1.02, 1.44)
    φK                 1.91            (1.60, 2.20)

Table 2: Posterior inference for the house construction data

Figure 7 shows the posterior mean growth curves and 95% Bayesian predictive intervals for the intensity in the four blocks (marked 1, 2, 3 and 4) in Figure 6. Comparing with the observed numbers of houses in the four blocks from 1951 to 1966, we can see that the estimated curves fit the data very well.³ In Figure 8 we display the posterior mean intensity surface for 1966 and the predictive mean intensity surfaces for 1967, 1968 and 1969, overlaid with the actual point patterns of the new homes constructed in those four years. Figure 8 shows that our model forecasts the major areas of high intensity, and hence high growth, very well. For example, in the upper left corner the intensity continues rising from 1966 to 1968 and starts to wane in 1969; correspondingly, increasing numbers of houses are built from 1966 to 1968 and far fewer in 1969. In the lower left part of the plots, near the bottom, areas of high intensity gradually shift south, and the house construction pattern confirms this trend as well.

³ The growth curves for the house construction data are much smoother than those in our simulated data example in Section 3.3. Although our fitted mean growth curves seem to match the data almost perfectly, we do not believe we have overfit, because our hold-out prediction results are very accurate as well.

5 Discussion

We have illustrated the use of stochastic differential equations to model both geostatistical and point pattern space-time data. Our examples demonstrate that the proposed hierarchical modeling can accommodate the complicated model structure and achieve good estimation and prediction. The major challenges in fitting our proposed models are: (i) the evaluation of a likelihood that involves discretization of the SDE in time and stochastic integrals, and (ii) a likelihood that does not allow an easy formulation of an efficient Metropolis-Hastings algorithm. In dealing with the first challenge, we utilize the Euler approximation and the space discretization method in Benes et al. (2002). Though the simulation results are encouraging, further investigation of these approximations or
alternatives would be of interest. For the second, we apply the random-walk Metropolis algorithm to the posterior simulation, which is liable to create large autocorrelation in the sampling chain. The nonlinear and recursive structure of our likelihood makes most current Metropolis methods inapplicable, encouraging future research on a more efficient Metropolis-Hastings algorithm for this class of problems.

Our application to the house construction data is really only a first attempt to incorporate a structured growth model into a spatio-temporal point process to afford insight into the mechanism of urban development. However, if it is plausible to assume that the damping effect of growth is controlled by the carrying capacity of a logistic model, then it is not unreasonable to assume the growth rate is mean-reverting. Of course, we can envision several ways to make the model more realistic, and these suggest directions for future work. We might have additional information at the locations enabling a so-called marked point process; for instance, we might assign each house to a group according to its size, and fitting the resultant multivariate Cox process could clarify the intensity of development. We could also have useful covariate information on zoning or the introduction of roads, which could be incorporated into the modeling of the rates. We can expect "holes" in the region - parks, lakes, etc. - where no construction can occur; for locations in these regions, we should impose zero growth. Finally, it may be that growth triggers more growth, so that so-called self-exciting process specifications might be worth exploring.

References

Banerjee, S., Carlin, B., and Gelfand, A. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC.

Benes, V., Bodlak, K., Møller, J., and Waagepetersen, R. (2002). "Bayesian Analysis of Log Gaussian Cox Processes for Disease Mapping." Technical Report R-02-2001, Department of Mathematical Sciences, Aalborg University.

Brix, A. and Diggle, P. (2001). "Spatiotemporal prediction for log-Gaussian Cox processes." Journal of the Royal Statistical Society: Series B, 63: 823–841.

Brix, A. and Møller, J. (2001). "Space-time multi type log Gaussian Cox Processes with a View to Modelling Weeds." Scandinavian Journal of Statistics, 28: 471–488.

Brown, P., Karesen, K., Roberts, G., and Tonellato, S. (2000). "Blur-generated non-separable space-time models." Journal of the Royal Statistical Society: Series B, 62: 847–860.

Cressie, N. (1993). Statistics for Spatial Data. New York: Wiley, 2nd edition.

Daley, D. and Vere-Jones, D. (1988). Introduction to the Theory of Point Processes. New York: Springer Verlag.

Diggle, P. (2005). "Spatio-temporal point processes: methods and applications." Dept. of Biostatistics Working Paper 78, Johns Hopkins University.

Elerian, O., Chib, S., and Shephard, N. (2001). "Likelihood inference for discretely observed non-linear diffusions." Econometrica, 69: 959–993.

Gelfand, A. E., Banerjee, S., and Gamerman, D. (2005). "Spatial process modelling for univariate and multivariate dynamic spatial data." Environmetrics, 16: 465–479.

Gneiting, T. (2002). "Nonseparable, stationary covariance functions for space-time data." Journal of the American Statistical Association, 97: 590–600.

Golightly, A. and Wilkinson, D. (2008).
"Bayesian inference for nonlinear multivariate diffusion models observed with error." Computational Statistics and Data Analysis, 52: 1674–1693.

Jones, R. and Zhang, Y. (1997). "Models for continuous stationary space-time processes." In Gregoire, G., Brillinger, D., Diggle, P., Russek-Cohen, E., Warren, W., and Wolfinger, R. (eds.), Modelling Longitudinal and Spatially Correlated Data, 289–298. New York: Springer.

Karr, A. (1991). Point Processes and Their Statistical Inference. New York: Marcel Dekker, 2nd edition.

Kloeden, P. and Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer.

Kot, M. (2001). Elements of Mathematical Ecology. Cambridge University Press.

Mahajan, V. and Wind, Y. (1986). Innovation Diffusion Models of New Product Acceptance. Harper Business.

Møller, J., Syversveen, A., and Waagepetersen, R. (1998). "Log Gaussian Cox processes." Scandinavian Journal of Statistics, 25: 451–482.

Møller, J. and Waagepetersen, R. (2004). Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC Press.

Ogata, Y. (1998). "Space-time point-process models for earthquake occurrences." Annals of the Institute of Statistical Mathematics, 50: 379–402.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC.

Stein, M. L. (2005). "Space-time covariance functions." Journal of the American Statistical Association, 100: 310–321.

Stramer, O. and Roberts, G. (2007). "On Bayesian analysis of nonlinear continuous-time autoregression models." Journal of Time Series Analysis, 28: 744–762.

Whittle, P. (1963). "Stochastic processes in several dimensions." Bulletin of the International Statistical Institute, 40: 974–994.

Zhang, H. (2004). "Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics." Journal of the American Statistical Association, 99: 250–261.

Acknowledgments

The authors thank Thomas Thibodeau for providing the Dallas house construction data.

Figure 1: Locations for the simulated data example in Section 3.3.

Figure 2: Observed space-time geostatistical data at 4 locations, actual (dashed line) and fitted mean growth curves (solid line), and 95% predictive intervals (dotted line) by our model (12) for the simulated data example.

Figure 3: Hold-out space-time geostatistical data at 4 locations, actual (dashed line) and predicted mean growth curves (solid line) and 95% predictive intervals (dotted line) by our model (12) for the simulated data example.
Figure 4: Hold-out space-time geostatistical data at 4 locations, actual (dashed line) and predicted mean growth curves (solid line) and 95% predictive intervals (dotted line) by the benchmark model (16) for the simulated data example.

Figure 5: Growth of residential houses in Irving, TX.

Figure 6: The gridded study region encompassing Irving, TX.

Figure 7: Mean growth curves (solid line) and their corresponding 95% predictive intervals (dotted lines) for the intensity for the four blocks marked in Figure 6.

Figure 8: Posterior and predictive mean intensity surfaces for the years 1966, 1967, 1968 and 1969.

Bayesian Analysis (2009) 4, Number 4, pp. 759–762

Inconsistent Bayesian Estimation

Ronald Christensen*

* Department of Mathematics and Statistics, University of New Mexico, mailto:fletcher@stat.unm.edu

Abstract. A simple example is presented using standard continuous distributions with a real valued parameter in which the posterior mean is inconsistent on a dense subset of the real line.

Keywords: Dirichlet process, posterior mean

1 Introduction

There has been extensive work on inconsistent Bayesian estimation. Early work was done by Halpern (1974), Stone (1976), and Meeden and Ghosh (1981). An important paper was Diaconis and Freedman (1986a), henceforth referred to as DFa, with extensive references and discussion by Barron; Berger; Clayton; Dawid; Doksum and Lo; Doss; Hartigan; Hjort; Krasker and Pratt; LeCam; and Lindley. Follow-up work includes Diaconis and Freedman (1986b, 1990, 1993), Datta (1991), Berliner and MacEachern (1993), and Rukhin (1994). DFa require consistency for every parameter value. They also point out that if their definition of consistency holds, then the posterior mean is consistent ("minor technical details apart").

The purpose of this note is to provide a particularly simple example of an inconsistent Bayes estimate and to draw some conclusions from that example. In particular, the example has a posterior mean that is inconsistent on a dense subset of the real line. Consider $y_1, \ldots, y_n$ a random sample from a density $f(y|\theta)$. The distribution is Cauchy with median $\theta$ when $\theta$ is a rational number and Normal with mean $\theta$ and variance 1 when $\theta$ is irrational. In other words,
\[
f(y|\theta) = \begin{cases} \text{Cauchy}(\theta) & \theta \text{ rational,} \\ N(\theta, 1) & \theta \text{ irrational.} \end{cases}
\]
For the prior density, we take $g(\theta)$ to be absolutely continuous; for the sake of simplicity, take it to be $N(\mu_0, 1)$. We will show that the posterior distribution of $\theta$ given the data is the same as if the entire conditional distribution of $y$ were $N(\theta, 1)$. In other words, the posterior distribution is
\[
\theta \mid y_1, \ldots, y_n \sim N\left(\frac{\mu_0 + n\bar{y}}{n+1},\, \frac{1}{n+1}\right).
\]
The standard Bayes estimate is the posterior mean, $(\mu_0 + n\bar{y})/(n+1)$, which behaves asymptotically like $\bar{y}$. If the true value of $\theta$ is an irrational number, the true sampling distribution is normal and the Bayes estimate is consistent. However, if the true value of $\theta$ is a rational number, the true sampling distribution is Cauchy$(\theta)$, for which it is well known that $\bar{y}$ is an inconsistent estimate of $\theta$. Thus we have the Bayes estimate inconsistent on a dense set, but a set of prior probability zero.
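A quick numerical illustration of this inconsistency (our own sketch, not part of the note): for a rational $\theta$ the data are Cauchy, so the sample mean, and hence the posterior mean $(\mu_0 + n\bar{y})/(n+1)$, does not settle down as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, mu0 = 0.5, 0.0   # a rational true value, so the data are Cauchy(theta)

    for n in [10**2, 10**4, 10**6]:
        y = theta + rng.standard_cauchy(n)
        post_mean = (mu0 + n * y.mean()) / (n + 1)
        print(n, post_mean)   # wanders instead of converging to 0.5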
As the editor has pointed out, except in neighborhoods of $\theta = 0$, the example works just as well with the Cauchy$(\theta)$ replaced by a $N(-\theta, 1)$; then the posterior mean is consistent, but for the wrong value of $\theta$.

Obviously, the key to this example is that, by virtually any concept of proximity for distributions, the conditional distributions $f(y|\theta)$ are discontinuous on a dense set of $\theta$'s. Not only is the mean function $E(y|\theta)$ discontinuous everywhere in $\theta$, but if $F(y|\theta)$ is the cdf of $f(y|\theta)$, measures such as the Kolmogorov-Smirnov distance $\sup_y |F(y|\theta) - F(y|\theta')|$ are never uniformly small in any neighborhood of $\theta$'s. An interesting aspect of DFa is that, while generally it is possible to get discrete distributions arbitrarily close to continuous ones, DFa illustrate that you cannot always get Ferguson distributions close enough to a continuous target.

It seems quite clear from the calculus behind this example that the proper concern for Bayesians is whether their procedures are consistent with prior probability one. Doob's theorem (see DFa's Corollary A.2) establishes precisely this result. Moreover, there seems to be little remedy for Bayesian inconsistency if one has postulated a prior distribution for which all interesting parameters have collective prior probability zero. We have done that here. Who ever reports numerical values to clients that are not rational numbers? This also seems to be the argument of DFa: Dirichlet priors put zero prior probability on continuous distributions, and therefore the inconsistency of Dirichlet priors with respect to continuous distributions in some applications is a problem. Others might argue that the distribution of any observable phenomenon must be discrete and that continuous models are merely useful approximations, in which case the issue being called into question for Dirichlet processes is the usefulness of continuous approximations. Nothing in the Bayesian machinery will ensure conditional consistency everywhere. That requires assumptions on the conditional distributions over and above the Bayesian paradigm. However, such assumptions may well be valid considerations when developing models for data.

2 Technical Details

Let $Y = (y_1, \ldots, y_n)'$ and consider the probability $\Pr[\theta \in A \text{ and } Y \in B]$ for arbitrary Borel sets $A$ and $B$. Let $1_{[A \times B]}(\theta, Y)$ be the indicator function of the set $A \times B$. The conditional probability $\Pr[\theta \in A \mid Y = w]$ can be defined as a $Y$-measurable function such that for any set $B$
\[
\int_B \Pr[\theta \in A \mid Y = w]\,dP(\theta, Y) = \int 1_{[A \times B]}(\theta, Y)\,dP(\theta, Y), \tag{1}
\]
see Rao (1973, p. 91) or Berry and Christensen (1979).

First of all, the joint distribution of $(\theta, Y)$ exists. The joint density of $(\theta, Y)$ is $h(\theta, Y) \equiv f(Y|\theta)\,g(\theta)$. This is clearly dominated by taking $g(\theta)$ the same and replacing $f(Y|\theta)$ with a finite multiple of a Cauchy$(\theta)$ density. Since the integral exists, we can apply Fubini's theorem. Let $f^*(y|\theta)$ be the density for a $N(\theta, 1)$ distribution. We show that
\[
\Pr[\theta \in A \mid Y = w] = \int_A f(\theta|Y)\,d\theta,
\]
where
\[
f(\theta|Y) = \frac{f^*(Y|\theta)\,g(\theta)}{\int f^*(Y|\theta)\,g(\theta)\,d\theta}.
\]
Thus, this version of the posterior probability behaves as if there were no Cauchy components to the sampling distribution at all. The claims of the previous section follow immediately from this result.

To see the validity of the result, observe that
\[
\int 1_{[A \times B]}(\theta, Y)\,dP(\theta, Y) = \int_A \int_B f(Y|\theta)\,g(\theta)\,dY\,d\theta.
\]
However, $f(Y|\theta)$ and $f^*(Y|\theta)$ are equal almost everywhere, so $\int_B f(Y|\theta)\,dY = \int_B f^*(Y|\theta)\,dY$ almost everywhere and
\[
\int_A \int_B f(Y|\theta)\,g(\theta)\,dY\,d\theta = \int_A \int_B f^*(Y|\theta)\,g(\theta)\,dY\,d\theta.
\]
The distribution associated with $f^*(Y|\theta)$ is perfectly well behaved, so Bayes theorem can be applied to give
\[
\int_A \int_B f^*(Y|\theta)\,g(\theta)\,dY\,d\theta = \int_B \int_A f(\theta|Y)\,f^*(Y)\,d\theta\,dY.
\]
It follows that equation (1) holds.

References

Berliner, L. M. and MacEachern, S. N. (1993). "Examples of inconsistent Bayes procedures based on observations on dynamical systems." Statistics and Probability Letters, 17: 355–360.

Berry, D. A. and Christensen, R. (1979). "Empirical Bayes estimation of a binomial parameter via mixtures of Dirichlet processes." Annals of Statistics, 7: 558–568.

Datta, S. (1991). "On the consistency of posterior mixtures and its applications." Annals of Statistics, 19: 338–353.

Diaconis, P. and Freedman, D. (1986a). "On the consistency of Bayes estimates." Annals of Statistics, 14: 1–26.

— (1986b). "On inconsistent Bayes estimates of location." Annals of Statistics, 14: 68–87.

— (1990). "On the uniform consistency of Bayes estimates for multinomial probabilities." Annals of Statistics, 18: 1317–1327.

— (1993). "Nonparametric binary regression: A Bayesian approach." Annals of Statistics, 21: 2108–2137.

Halpern, E. F. (1974). "Posterior consistency for coefficient estimation and model selection in the general linear hypothesis." Annals of Statistics, 2: 703–712.

Meeden, G. and Ghosh, M. (1981). "Admissibility in finite problems." Annals of Statistics, 9: 846–852.

Rao, C. R. (1973). Linear Statistical Inference and its Applications. New York: John Wiley and Sons, second edition.

Rukhin, A. L. (1994). "Recursive testing of multiple hypotheses: Consistency and efficiency of the Bayes rule." Annals of Statistics, 22: 616–633.

Stone, M. (1976). "Strong inconsistency from uniform priors." Journal of the American Statistical Association, 71: 114–116.

Bayesian Analysis (2009) 4, Number 4, pp. 763–792

Sample Size Calculation for Finding Unseen Species

Hongmei Zhang* and Hal Stern†

* Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, mailto:hzhang@sc.edu
† Department of Statistics, University of California, Irvine, CA, mailto:sternh@uci.edu

Abstract. Estimation of the number of species extant in a geographic region has been discussed in the statistical literature for more than sixty years. The focus of this work is on the use of pilot data to design future studies in this context. A Dirichlet-multinomial probability model for species frequency data is used to obtain a posterior distribution on the number of species and to learn about the distribution of species frequencies. A geometric distribution is proposed as the prior distribution for the number of species. Simulations demonstrate that this prior distribution can handle a wide range of species frequency distributions, including the problematic case with many rare species and a few exceptionally abundant species. Monte Carlo methods are used along with the Dirichlet-multinomial model to perform sample size calculations from pilot data, e.g., to determine the number of additional samples required to collect a certain proportion of all the species with a pre-specified coverage probability.
Simulations and real data applications are discussed.

Keywords: generalized multinomial model, Bayesian hierarchical model, Markov chain Monte Carlo (MCMC), Dirichlet distribution, geometric distribution

1 Introduction

The "species problem" is a term used to refer to studies in which objects are sampled and categorized, with interest in the number of categories represented. Research related to the species problem dates back to the 1940's. Corbet (1942) proposed that a mathematical relation exists between the number of sampled individuals and the total number of observed species in a random sample of insects or other animals. Fisher et al. (1943) developed an expression for the relationship using a negative binomial model. Their proposed relationship works well over the whole range of observed abundance and gives a very good fit in practical situations.

The focus of most statistical research on the species problem has been to estimate the number of unseen species. Bunge and Fitzpatrick (1993) give a review of numerous statistical methods to estimate the number of unseen species. Some notable references are mentioned briefly here. Good and Toulmin (1956) address the estimation of the expected number of unseen species based on a Poisson model. Efron and Thisted (1976) use two different empirical Bayes approaches, both based on a similar Poisson model, to estimate the number of unseen words in Shakespeare's vocabulary. Pitman (1996) proposes species sampling models utilizing a Dirichlet random measure; the negative binomial model proposed by Fisher et al. (1943) is also discussed there. Boender and Rinnooy Kan (1987) suggest a Bayesian analysis of a multinomial model that can be used to estimate the number of species. Their model is the starting point for the work presented in this paper.

As seen from the references above, previous applications of the species problem have included animal ecology, where individual animals are sampled and categorized into species (Fisher, Corbet, and Williams 1943), and word usage, where individual words are sampled and each word defines its own category (Efron and Thisted 1976). More recent studies extend the species problem to applications in bioinformatics (Morris, Baggerly, and Coombes 2003), where the sampled items might be DNA fragments and each sequenced DNA segment represents a unique sequence. Our work is motivated by a bioinformatics problem of this type. Some other studies focus on drawing inferences for abundant species or rare species, e.g., Cao et al. (2001) and Zhang (2007). For consistency with the earlier literature, we use the familiar terminology of animals and species.

In this paper, we use the model of Boender and Rinnooy Kan (1987), a generalized multinomial model, as our starting point. The major contribution of this paper is to address sample size calculation for future data collection based on a pilot study. The goal is to determine the sample size needed to achieve a specified degree of population coverage. Non-parametric Bayesian methods have been developed for a related problem, inferring from a given data set the probability of discovering new species, e.g., Tiwari and Tripathi (1989) and Lijoi et al. (2007).
In these studies the total number of species is either assumed to be known or assumed to be infinite. The method proposed in this paper, by contrast, is a two-phase design, with the first phase used to infer the number of species and the second phase to estimate the required sample size. The sample size required to achieve a specified degree of population coverage is obtained by Monte Carlo simulation. The first step is a fully Bayesian approach to drawing inferences about the parameters of a generalized Dirichlet-multinomial model for species frequency data. The posterior distribution of the model parameters is then used in our Monte Carlo simulation method for sample size determination.

For parametric Bayesian analysis of species frequency data, selecting an appropriate prior distribution for the number of species in the population is very important (see, for example, Zhang and Stern (2005)). The prior distributions proposed in previous studies (Zhang and Stern (2005); Boender and Rinnooy Kan (1987)) perform poorly in situations in which a population has many rare species (each with a very small number of representatives) and a few abundant species. In this case, as indicated by Sethuraman (1994) and discussed in Zhang and Stern (2005), the proportions of the species in the population are crowded at the vertices of a multi-dimensional simplex, so that most proportions are close to zero. For this type of population, inferences for the number of species in the population are often unrealistic.

In this paper, we propose to use a geometric distribution as the prior distribution for the number of species. Geometric distributions have been used in many studies, but we have not seen any applications to the species problem. The geometric prior distribution can be used to reflect prior beliefs about the minimum number of species in the population and about the range within which the number of species is believed to lie. The flexibility provided in this manner seems to allow the geometric prior distribution to adapt well to different species frequency distributions.

The rest of the paper is organized as follows. In Section 2, we review the hierarchical Bayesian model for species data, describe our choice of prior distributions, and state the conditions required to guarantee a proper posterior distribution for our model. Section 3 focuses on posterior inferences for the model's parameters; issues related to the implementation of MCMC are also discussed. In Section 4, we develop a Monte Carlo simulation approach for designing future data collection. Section 5 provides results for a simulated data set where the proposed approach works reasonably well; sensitivity of results to the choice of prior distribution is also discussed. We apply our method to a bioinformatics data set in Section 6. Finally, we summarize our results in Section 7.

2 A Dirichlet-multinomial model

2.1 The likelihood function

Let $y_i$ denote the number of observed animals of species $i$ in a sample of size $N$. Suppose $s_o$ is the number of different species observed and $S$ is the number of species in the population. Then $y = \{y_1, y_2, \ldots, y_{s_o}\}$ is one way to represent the observed sample. An alternative description for data of this type, based on frequency counts, has often been used in the literature. Let $x_o \le N$ be the maximum frequency over all observed species and $n_x$ be the number of species captured $x$ times, $x = 1, 2, \ldots, x_o$. Then $n = (n_1, n_2, \ldots, n_{x_o})$ is another way to represent the sample, with
\[
N = \sum_{x=1}^{x_o} x\,n_x = \sum_{i=1}^{s_o} y_i.
\]
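The two data representations are easy to connect in code; the following sketch of ours converts species counts $y$ into frequency counts $n$ and checks the identity $N = \sum_x x\,n_x = \sum_i y_i$.

    from collections import Counter

    y = [5, 3, 3, 1, 1, 1, 1]          # counts for s_o = 7 observed species
    freq = Counter(y)                   # n_x: number of species captured x times

    N = sum(y)
    assert N == sum(x * n_x for x, n_x in freq.items())  # N = sum x*n_x = sum y_i
    print(dict(freq))                   # {5: 1, 3: 2, 1: 4}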
Here we motivate and describe the generalized multinomial probability model for $y$ of Boender and Rinnooy Kan (1987). To start, we introduce the notation $y_{\text{complete}}$ for the $S$-dimensional vector of species counts. The basic sampling model for the counts $y_{\text{complete}}$ is multinomial with probability $\theta_i$ for species $i$ to be captured, $i = 1, \ldots, S$. There are several possible interpretations for the $\theta_i$'s. If we assume all animals are equally likely to be caught, then $\theta_i$ represents the relative proportion of species $i$ in the animal population. If not, then $\theta_i$ combines the likelihood of being caught and the abundance. If the number of species $S$ is known, the population size of animals is large, and each species has a reasonably large number of representatives in the population, then a plausible model for $y_{\text{complete}}$ is the multinomial distribution with parameters $N$ and $\theta = \{\theta_1, \ldots, \theta_S\}$, i.e., $y_{\text{complete}} \mid \theta, S \sim \mathrm{Mult}(N, \theta)$, where $y_{\text{complete}} = \{y_1, y_2, \ldots, y_S\}$.

When $S$ is not known, however, we do not know the dimension of $y_{\text{complete}}$. Then it makes sense to consider the observed data $y = \{y_1, y_2, \ldots, y_{s_o}\}$, which only give counts for the $s_o$ species that have been observed. There is a complication in that $y$ provides counts but does not indicate which elements of $\theta$ correspond to the observed species. Though subtle, this point is important in that it invalidates the usual multinomial likelihood. Since the correspondence between the $y_i$'s and the $\theta_i$'s cannot be determined, the data $y$ follow a generalized multinomial distribution in which we sum over all possible choices for the $s_o$ observed species. Let $W(s_o)$ denote all subsets $\{i_1, \ldots, i_{s_o}\}$ of $s_o$ distinct species labels from $\{1, \ldots, S\}$; then the conditional distribution of $y$ given $\theta$ and $S$ can be expressed as
\[
\Pr(y \mid \theta, S) = \frac{N!}{y_1! \cdots y_{s_o}!\, \prod_{x=1}^{N} n_x!} \sum_{\{i_1, \cdots, i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}}, \tag{1}
\]
which is the same result as the one given by Boender and Rinnooy Kan (1987). The above model assumes an infinite population size, which can produce limitations in practice. A hypergeometric formulation, which recognizes the finiteness of populations, seems more appropriate in this context. However, due to the computational inefficiency of hypergeometric models, we use model (1) to describe the distribution of the counts $y$, implicitly assuming that population sizes are large enough to validate the multinomial assumptions.

2.2 Prior distribution for θ

We model $\theta$ given $S$ as a random draw from a symmetric Dirichlet distribution with parameter $\alpha$, which is a conjugate prior distribution for the multinomial distribution. We write
\[
\theta \mid S, \alpha \sim \mathrm{Dirichlet}(1_S\,\alpha) \tag{2}
\]
with
\[
p(\theta \mid S, \alpha) = \frac{\Gamma(S\alpha)}{\Gamma(\alpha)^S} \prod_{i=1}^{S} \theta_i^{\alpha - 1},
\]
where $1_S$ is a vector of length $S$ with all entries equal to one, and $\theta = \{\theta_1, \ldots, \theta_S\}$ with $\sum_{i=1}^{S} \theta_i = 1$. For a symmetric Dirichlet distribution, $E(\theta_i) = 1/S$, so the prior distribution of $\theta$ assumes all species are a priori equally likely (in expectation) to be captured. The prior variance for each $\theta_i$ is $\mathrm{Var}(\theta_i) = (1/S)(1 - 1/S)(1/(S\alpha + 1))$. The variance of $\theta_i$ becomes smaller as $\alpha$ grows and tends to 0 as $\alpha$ approaches infinity. In the limiting case of infinite $\alpha$, $\theta_i = 1/S$ and animals from each species are equally likely to be captured. Smaller values of $\alpha$ correspond to greater variation among the elements of $\theta$.

Small $\alpha$ can yield many small elements in the vector $\theta$, which corresponds to the case in which the population has many rare species. The reason is that, as $\alpha$ gets smaller, the vector of $\theta_i$'s generated from $\mathrm{Dirichlet}(1_S\,\alpha)$ becomes more concentrated on the vertices of the $S$-dimensional simplex containing vectors $\theta$ that sum to one (Sethuraman (1994); Zhang and Stern (2005)). For instance, with $S = 2$ the Dirichlet distribution reduces to a Beta distribution; when $\alpha$ is small, the density function is U-shaped, with density concentrated near 0 and 1 for both proportions. We obtain further insight by considering the distribution of $\theta$ in three dimensions. Figure 1 shows the distribution of $\theta$ in three dimensions for $\alpha = 1, 0.01, 0.001$. When $\alpha$ is larger (e.g., $\alpha = 1$), the probability values are distributed evenly on the simplex. As $\alpha$ gets smaller, the $\theta_i$'s tend to move toward the vertices of the simplex, which have coordinate values 1 or 0, implying that more small elements of $\theta$ will be generated from the Dirichlet distribution.

Figure 1: Distribution of θ for different α's: (a) α = 1, (b) α = 0.01, (c) α = 0.001.
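The concentration near the simplex vertices for small $\alpha$ is easy to see by simulation; this short sketch of ours draws from symmetric Dirichlet distributions and reports how many coordinates are essentially zero.

    import numpy as np

    rng = np.random.default_rng(0)
    S = 50  # dimension of theta

    for alpha in [1.0, 0.01, 0.001]:
        theta = rng.dirichlet(np.full(S, alpha), size=1000)
        near_zero = np.mean(theta < 1e-6)   # fraction of tiny proportions
        print(alpha, round(near_zero, 3))   # grows as alpha shrinks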
One might consider this prior a bit unrealistic in that we likely know that some species are a priori more likely to be observed than others. The reason for choosing a symmetric Dirichlet distribution is that we do not know $S$, so we have no information to distinguish any of the $\theta_i$'s, $i = 1, \ldots, S$, from any of the others. In this case, the prior distribution for $\theta$ has to be exchangeable. A possible solution that can address known heterogeneity yet retain exchangeability is to consider a mixture of two symmetric Dirichlet distributions corresponding to two subpopulations, abundant species and scarce species. This approach is used by Morris et al. (2003) in the case when $S$ is known.

2.3 Prior distribution for S and α

We apply a fully Bayesian approach to analyzing species frequency data and conclude the model specification by giving prior distributions for $S$ and $\alpha$. We specify independent prior distributions for $S$ and $\alpha$; the two parameters become dependent in their joint posterior distribution, as shown below in Section 2.4. For $S$ we would like to use a relatively flat prior distribution without specifying a strict upper bound on the number of species. We find it useful to have the prior probability mass be a decreasing function of $S$, so that there is a slight preference for smaller numbers of species (this is discussed further in Section 5.5). A prior distribution for $S$ with these characteristics is the geometric distribution with probability mass function
\[
\Pr(S) = f\,(1 - f)^{S - S_{\min}}, \quad S \ge S_{\min}, \tag{3}
\]
where $S_{\min}$ is a specified minimum number of species and $f$ is the geometric probability parameter. Because of Theorem 1 (below), we generally take $S_{\min} = 2$, the smallest number of observed species that yields a proper posterior distribution for our model. One interpretation of the parameter $f$ is as the prior probability that there are exactly $S_{\min}$ species, but this would not ordinarily be a quantity that scientists are able to specify. Instead, we propose to obtain a suitable value of $f$ by specifying a plausible maximum value $S_{\max}$ for the number of unique species and a measure of prior certainty that $S$ lies between $S_{\min}$ and $S_{\max}$. The value of $S_{\max}$ will usually be suggested by scientific collaborators, as in our application. Under the geometric distribution,
\[
\Pr(S_{\min} \le S \le S_{\max}) = 1 - (1 - f)^{S_{\max} - S_{\min}},
\]
which can be inverted to find $f$ for specified values of $S_{\min}$, $S_{\max}$ and the prior certainty. For instance, if we would like to express high confidence, say probability .999, that $S$ is between $S_{\min} = 2$ and $S_{\max} = 10000$, then we find $f = .0007$. On the other hand, if we are less confident, say 95% certain that the true number of species is in this interval, then $f = .0003$. Note that, despite the name we have assigned, we do not assume that $S_{\max}$ is an actual upper bound; $S_{\max}$ is a device used for elicitation of the geometric probability parameter $f$. Alternative prior distributions for $S$ (and $\alpha$) are described below. The sensitivity of posterior inferences to different choices of $f$ and to different prior distributions is considered in Section 5.
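Inverting the interval-probability formula above for $f$ is a one-liner; this sketch of ours reproduces the two worked values quoted in the text.

    def geometric_f(s_min, s_max, certainty):
        """Solve 1 - (1 - f)**(s_max - s_min) = certainty for f."""
        return 1.0 - (1.0 - certainty) ** (1.0 / (s_max - s_min))

    print(round(geometric_f(2, 10000, 0.999), 4))  # ~0.0007
    print(round(geometric_f(2, 10000, 0.95), 4))   # ~0.0003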
For instance, if we would like to express high confidence, say probability .999, that S is between S_min = 2 and S_max = 10000, then we find f = .0007. On the other hand, if we are less confident, say 95% certain that the true number of species is in this interval, then f = .0003. Note that despite the name we have assigned, we do not assume that S_max is an actual upper bound; S_max is a device used for elicitation of the geometric probability parameter f. Alternative prior distributions for S (and α) are described below. The sensitivity of posterior inferences to different choices of f and to different prior distributions is considered in Section 5.

The parameter α is important in characterizing the distribution of frequencies. As discussed in an earlier section, large α values lead to uniform distributions, while small α leads to a skewed distribution with a few popular species and many rare species. As we do not have much information about α, we follow an approach that appears to provide a relatively noninformative hyperprior distribution. We note that the prior standard deviation for each element of θ is roughly proportional to $1/\sqrt{\alpha}$. By setting a noninformative prior distribution on this quantity, $p(1/\sqrt{\alpha}) \propto 1$, and doing a change of variable, we obtain a hyperprior distribution for α, i.e.

$$p(\alpha) = \alpha^{-3/2}, \quad \alpha > 0. \qquad (4)$$

This is not a proper hyperprior distribution, but the following theorem indicates that the posterior distribution is proper under fairly weak conditions.

Theorem 1: For the model defined by (1) through (4), the posterior distribution p(S, α | y) is proper if at least two species are captured, i.e. s_o ≥ 2.

Proof: The proof is included in the Appendix.

Naturally other prior distributions are possible and several have appeared in the literature. For example, Zhang and Stern (2005) use a noninformative prior for S, a discrete uniform distribution on an interval of plausible values, together with the same prior distribution for α that we use here. However, as discussed by Zhang and Stern (2005), this set of prior distributions can provide misleading posterior inferences when the data are consistent with a small value of α. Another set of prior distributions is given by Boender and Rinnooy Kan (1987), where independent proper prior distributions on S and α are proposed:

$$\Pr(S) \propto \begin{cases} 1, & S < S_{cut} \\ \dfrac{1}{(S - S_{cut} + 1)^2}, & S \ge S_{cut}, \end{cases} \qquad (5)$$

where S_cut is a positive number to be set, and

$$p(\alpha) = \begin{cases} 1/2, & \alpha \le 1 \\ \tfrac{1}{2}\, \alpha^{-2}, & \alpha > 1, \end{cases} \qquad (6)$$

which was earlier proposed by Good (1965). When using this set of prior distributions, as indicated by Boender and Rinnooy Kan (1987) and also by our later simulation results, the posterior inferences can be very sensitive to the choice of S_cut, especially when the data suggest small values of α. We comment further on the sensitivity of results to the prior distributions P(S, α) in the data analyses of Section 5.

2.4 The posterior distribution

The joint posterior distribution of θ, S, and α for the probability model specified in (1) through (4) is, up to a normalizing constant,

$$p(\theta, S, \alpha \mid y) \propto \Pr(y \mid \theta, S)\, p(\theta \mid S, \alpha)\, \Pr(S)\, p(\alpha)$$
$$\propto \left[ \frac{N!}{y_1! \cdots y_{s_o}! \prod_{x=1}^{N} n_x!} \sum_{\{i_1, \ldots, i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}} \right] \frac{\Gamma(S\alpha)}{\Gamma(\alpha)^S} \prod_{i=1}^{S} \theta_i^{\alpha - 1}\; (1 - f)^{S - S_{min}}\; \alpha^{-3/2},$$

where S ∈ {max(s_o, S_min), max(s_o, S_min) + 1, ...} and α > 0.
It should be noted that the posterior distribution is defined over both continuous (for α and θ) and discrete (for S) sample spaces. The joint posterior distribution can be factored as

$$p(\theta, S, \alpha \mid y) = p(\theta \mid y, S, \alpha)\, p(S, \alpha \mid y), \qquad (7)$$

where p(θ | y, S, α) is the conditional posterior distribution of θ given S and α,

$$p(\theta \mid y, S, \alpha) \propto \left[ \sum_{\{i_1, \ldots, i_{s_o}\} \in W(s_o)} \theta_{i_1}^{y_1} \cdots \theta_{i_{s_o}}^{y_{s_o}} \right] \prod_{i=1}^{S} \theta_i^{\alpha - 1} = \sum_{\{i_1, \ldots, i_{s_o}\} \in W(s_o)} \left[ \theta_{i_1}^{y_1 + \alpha - 1} \cdots \theta_{i_{s_o}}^{y_{s_o} + \alpha - 1} \prod_{\substack{j = 1 \\ j \notin \{i_1, \ldots, i_{s_o}\}}}^{S} \theta_j^{\alpha - 1} \right]. \qquad (8)$$

Note that the conditional posterior distribution of θ is proportional to the sum of S!/(S − s_o)! Dirichlet densities. Also note that every Dirichlet distribution in the summation is identical up to permutation of the species indices. The other factor in (7), p(S, α | y), is

$$p(S, \alpha \mid y) \propto \frac{S!}{(S - s_o)!} \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)} \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{(\Gamma(\alpha))^{s_o}}\; (1 - f)^S\; \alpha^{-3/2}, \qquad (9)$$

for S ∈ {max(s_o, S_min), max(s_o, S_min) + 1, ...} and α > 0. This can be obtained in either of two ways: as the quotient p(θ, S, α | y)/p(θ | S, α, y), or by integrating θ out of the joint distribution p(y, θ | S, α) and working with the reduced likelihood p(y | S, α).

3 Posterior inferences

3.1 Posterior inferences for S and α

The posterior distribution of S and α as given by (9) is difficult to study analytically. Instead we use MCMC, specifically a Gibbs sampling algorithm with Metropolis-Hastings steps for each parameter, to generate draws from the joint posterior distribution. In applications we run multiple chains from dispersed starting values. Convergence of the sampled sequences is evaluated using the methods developed by Gelman and Rubin (1992a,b) and described, for example, by Gelman et al. (2003). The conditional posterior distribution of S given y and α, and the conditional posterior distribution of α given y and S, are, up to normalizing constants,

$$\Pr(S \mid y, \alpha) \propto \frac{S!}{(S - s_o)!} \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)} (1 - f)^S, \quad S \in \{\max(s_o, S_{min}), \max(s_o, S_{min}) + 1, \ldots\},$$
$$p(\alpha \mid y, S) \propto \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)} \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{(\Gamma(\alpha))^{s_o}}\; \alpha^{-3/2}, \quad \alpha > 0, \qquad (10)$$

respectively. For the Metropolis-Hastings steps for these parameters we use jumping or transition distributions that are essentially random walks. Specifically, the jumping function at iteration t for S is a discrete uniform distribution centered at the (t−1)th sampled point, and the jumping function for α is a log-normal distribution with location parameter equal to the logarithm of the (t−1)th draw. The jumping distributions are discussed more fully in the Appendix.

3.2 Posterior inference for θ

In this paper, posterior inference for θ is not of interest, but as it may be relevant for other applications we discuss it briefly. The conditional posterior distribution p(θ | S, α, y) given by (8) is a mixture of S!/(S − s_o)! Dirichlet distributions, one for each choice of the s_o observed species from among the S total species. The component Dirichlet distributions are identical up to permutation of the category indices. Because of this feature of the mixture, each θ_i actually has the same marginal posterior distribution, which makes interpretation of θ_i difficult. We can, however, consider posterior inference for a θ corresponding to a particular value of y_i > 0. For example, if we define θ_{y_i} as the θ corresponding to an observed species with frequency y_i, then p(θ_{y_i} | y, S, α) is Beta(y_i + α, N − y_i + (S − 1)α). The marginal posterior distribution, p(θ_{y_i} | y), is obtained by averaging this beta distribution over the posterior distribution of S and α obtained in Section 3.1.
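As a concrete, if simplified, illustration of the sampling scheme of Section 3.1, the sketch below implements a Metropolis-within-Gibbs sampler for (S, α) directly from the logarithm of (9). The tuning constants, starting values, and the symmetric proposal for S (the Appendix version truncates the proposal near s_o) are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_post(S, alpha, y, f, s_min=2):
    """Unnormalized log posterior log p(S, alpha | y) from equation (9)."""
    s_o, N = len(y), sum(y)
    if S < max(s_o, s_min) or alpha <= 0:
        return -np.inf
    return (gammaln(S + 1) - gammaln(S - s_o + 1)
            + gammaln(S * alpha) - gammaln(N + S * alpha)
            + np.sum(gammaln(np.asarray(y) + alpha)) - s_o * gammaln(alpha)
            + S * np.log(1 - f) - 1.5 * np.log(alpha))

def sampler(y, f, n_iter=4000, S0=1000, alpha0=1.0, b_frac=0.1, V=0.3, seed=1):
    """Metropolis-within-Gibbs for (S, alpha); assumed tuning values."""
    rng = np.random.default_rng(seed)
    S, alpha, draws = S0, alpha0, []
    for _ in range(n_iter):
        # discrete random walk for S (width proportional to S, as in the Appendix;
        # proposals below the support are rejected via the -inf log posterior)
        S_prop = S + int(rng.integers(-int(b_frac * S), int(b_frac * S) + 1))
        if np.log(rng.uniform()) < log_post(S_prop, alpha, y, f) - log_post(S, alpha, y, f):
            S = S_prop
        # log-normal random walk for alpha; the Jacobian adds log(a_prop/alpha)
        a_prop = alpha * np.exp(V * rng.normal())
        log_r = (log_post(S, a_prop, y, f) - log_post(S, alpha, y, f)
                 + np.log(a_prop) - np.log(alpha))
        if np.log(rng.uniform()) < log_r:
            alpha = a_prop
        draws.append((S, alpha))
    return draws
```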
4 Planning for future data collection

The previous section describes an approach to obtaining posterior inferences for S, α, and θ. This section considers an additional inference question. Suppose it is possible to collect additional data beyond the initial N observations. Then one might be interested in questions related to the design of future data collection efforts, such as, “What is the probability of observing at least 90% of all species if the current data are augmented by an additional M animals?”, or the closely related question, “How large an additional sample is required in order to observe at least 90% of all species with a specified confidence level?”. This section addresses these types of questions.

4.1 A relevant probability calculation

Let p denote the proportion of species we want to capture (e.g. p = 0.9). Then the probability of capturing at least pS species, conditional on the N observed animals and M additional animals, denoted π(M), can be written as

$$\pi(M) = \Pr((s_o + S_{new}) \ge pS \mid M, y), \qquad (11)$$

where S_new is the number of previously unseen species observed in the M additional samples. Let y* denote the additional data from the M additional observations. The probability (11) can be expressed as an integral over the unknown parameters S, α, θ, and the yet-to-be-observed data y*,

$$\pi(M) = \int_{y^*} \int_{\theta} \int_{\alpha} \int_{S} I(s_o + S_{new} \ge pS)\; p(S, \alpha, \theta, y^* \mid M, y)\; dS\; d\alpha\; d\theta\; dy^*. \qquad (12)$$

Here I is an indicator function which is easily determined given the counts y and y* and the value of S. To describe a Monte Carlo approach to evaluating this integral, we first observe that p(S, α, θ, y* | M, y) = p(y* | θ, S, M) p(θ | y, S, α) p(S, α | y), where p(y* | θ, S, M) is a multinomial density function, p(θ | y, S, α) is a mixture of Dirichlet distribution functions, and p(S, α | y) is given above in (9). Given this factorization, the integration (summation in the case of S) in (12) can be carried out by first obtaining posterior draws of S and α and then applying the specified conditional distributions for θ and y*. As shown below in Section 4.2, sampling θ from a mixture of Dirichlet distributions is no more difficult than sampling θ from a single Dirichlet distribution. We do not expect the high dimension of θ to cause any problems in the numerical integration.

Carrying out the integration for a variety of values of M identifies a π(M) curve and allows us to identify the smallest sample size for which π(M) exceeds a given target. This approach provides a point estimate for the needed sample size but does not provide a great deal of information about the uncertainty in such an estimate. Instead, we find it useful to examine the function π(M) for a variety of (S, α) values, i.e.

$$\pi(M \mid S, \alpha) = \Pr((s_o + S_{new}) \ge pS \mid M, y, S, \alpha) = \int_{y^*} \int_{\theta} I(s_o + S_{new} \ge pS)\; p(\theta, y^* \mid M, y, S, \alpha)\; d\theta\; dy^*. \qquad (13)$$

Examining π(M) in this way allows us to use the posterior distribution of (S, α) to convey uncertainty about our estimate of M. The function π(M | S, α) is a complicated function of S and α, and an analytical form of its posterior distribution is not available. Instead, we use a Monte Carlo approach to estimate the posterior distribution of π(M | S, α). Specifically, for each posterior draw of S and α, we estimate the quantity π(M | S, α) by averaging over θ and y*.
The posterior distribution of π(M | S, α) is then obtained by repeating the Monte Carlo evaluation for the available draws of S and α.

4.2 Monte Carlo simulation procedure

The Monte Carlo approach to evaluating π(M | S, α) in (13) is made explicit by applying the identity p(θ, y* | M, y, S, α) = p(y* | θ, S, M) p(θ | y, S, α), where we have assumed that y* is conditionally independent of α and y given M, θ, and S. The assumption of conditional independence is based on the consideration that θ and S fully define the probability vector for multinomial sampling of y*. The algorithm for computing π(M | S, α) for a given (S, α) pair is then given by the following steps. For t = 1, ..., T:

1) generate θ^(t) from p(θ | y, S, α) (a mixture of Dirichlet distributions);

2) generate y*^(t) from a multinomial distribution with parameters M and θ^(t);

3) define I_t = 1 if (s_o + S_new) ≥ pS, and I_t = 0 otherwise.

Estimate π(M | S, α) with $\frac{1}{T} \sum_{t=1}^{T} I_t$, and repeat steps 1 to 3 for as many different values of M as desired. For each given pair (S, α), the result can be viewed as a curve giving the probability of covering a proportion p of the species as a function of M. If k posterior draws of (S, α) are available, then there are k such curves in total.

The Monte Carlo algorithm is conceptually straightforward, but a number of implementation details are noteworthy. First, recall that the posterior distribution of θ given y, S, α is a mixture of Dirichlet distributions. All of the Dirichlet distributions in the mixture are identical up to permutation of the indices (i_1, i_2, ..., i_S). Sampling from the mixture distribution requires that one pick a set of labels from W(s_o) to correspond to the observed species and then simulate from the relevant component of the mixture distribution. The subsequent steps in the algorithm would then be done conditional on this choice of labels. In practice, because we are not interested in a specific θ_i or y_i, it is equally valid to arbitrarily assign labels to the observed species and proceed. A second noteworthy detail concerns efficiency. Steps 1 and 2 of the algorithm use only a single draw of y* for each θ. It is natural to ask whether the algorithm might be improved by selecting multiple y* vectors for each θ, perhaps thereby estimating a separate curve for each θ. Our simulation results suggest, however, that variation among the curves corresponding to different θ's for fixed S and α is relatively small, and consequently the algorithm described above works best. Lastly, we note that step 3 of the Monte Carlo simulation procedure requires determining the number of new species by counting the number of positive y_i*'s whose θ_i's correspond to species with y_i = 0 (or, equivalently, to Dirichlet parameter α). In practice it is possible to save a considerable amount of computing time by embedding the iteration over the sample size M within the above loop (instead of running the loop separately for each M).
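In code, steps 1 to 3 amount to a few lines. The sketch below (Python/NumPy, our own rendering) uses the arbitrary labeling noted above, so that the first s_o Dirichlet components correspond to the observed species and the remainder to unseen species.

```python
import numpy as np

def pi_M_given_S_alpha(M, y, S, alpha, p=0.9, T=1500, rng=None):
    """Estimate pi(M | S, alpha) by the three-step procedure of Section 4.2.

    Labels are assigned arbitrarily: the first s_o components of theta carry
    the observed counts; the remaining S - s_o components are unseen species.
    """
    rng = rng or np.random.default_rng()
    y = np.asarray(y)
    s_o = len(y)
    # Dirichlet parameters for one (arbitrarily labeled) mixture component of (8)
    dir_par = np.concatenate([y + alpha, np.full(S - s_o, alpha)])
    hits = 0
    for _ in range(T):
        theta = rng.dirichlet(dir_par)              # step 1
        y_star = rng.multinomial(M, theta)          # step 2
        s_new = np.count_nonzero(y_star[s_o:])      # new species among unseen labels
        hits += (s_o + s_new) >= p * S              # step 3
    return hits / T
```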
4.3 The probability of species coverage and the required sample size

The Monte Carlo algorithm yields a collection of curves, each showing π(M | S, α) as a function of M. For any M these curves yield a posterior distribution of the quantity π(M) = Pr((s_o + S_new) ≥ pS | M, y). Posterior summaries, e.g., point estimates or posterior intervals, can be constructed from this estimated posterior distribution.

Another practical question is how to find the minimum sample size required to observe at least a proportion p of all species with a specified probability q. We denote this value by M_q; it too can be viewed as a function of S and α. The posterior distribution of M_q is determined easily using the Monte Carlo approach. For each (S, α) pair we have an estimated curve showing π(M) vs. M. For each such curve we identify the smallest value of M such that π(M | S, α) ≥ q. The collection of identified sample sizes provides an estimated posterior distribution for M_q.

5 Simulations

To demonstrate our method we begin by simulating a single data set with N = 2000 observations for which S is known. In a later section, we consider the effect of increasing the sample size.

5.1 Data

The data are N = 2000 observations simulated from a multinomial distribution with S = 2000 species in the population and θ a random sample from a Dirichlet distribution with α = 1. The distribution of θ is then uniform over all vectors with Σ_{i=1}^{S} θ_i = 1. Table 1 and Figure 2 describe the data via the number of species that appear exactly x times, x = 1, 2, ..., x_o. In this sample, the largest frequency is x_o = 11 and the number of observed species is s_o = 965.

Table 1: Species frequencies

x          1    2    3    4   5   6   7  8  9  10  11
# species  451  268  116  61  35  16  5  7  2  3   1

[Figure 2: The distribution of frequencies for the simulated data (number of species vs. frequency).]

5.2 Posterior inference for S

We used the approach described in Section 3 to find the posterior distribution of S and α given the simulated data. We assume the plausible upper limit on S is S_max = 10000, with prior belief about 0.999 that S_min ≤ S ≤ S_max. This gives the value of f in our geometric prior distribution as f = 0.0007. In Section 5.5 we evaluate the effect of choosing different values of f (i.e. different geometric prior distributions). The posterior inferences in this section are based on 4000 draws from the posterior distribution after an MCMC burn-in period of 4000 iterations.

Figure 3 shows a contour plot of the joint posterior distribution of S and α. The distribution has a single mode around S = 1800 and α = 1.0. Figures 4 and 5 are histograms of the posterior distributions of S and α. The posterior mean of S is 1844. A 95% central posterior interval for S is (1559, 2301); the true value, S = 2000, is contained in the interval. Note that the method of Efron and Thisted (1976), based on the Poisson-Gamma model, yields a similar estimate of S, namely Ŝ = 1639 with standard error 226. A 95% central posterior interval for α is (0.64, 1.69), which includes the true value α = 1. The posterior mean of α is 1.07.

5.3 Sample size calculation for future sampling

As discussed in Section 4, we can estimate the probability of observing a proportion p of the total number of species given an additional M animals, and the sample size required to ensure that future sampling covers a specified proportion of the species with a given probability of coverage. As a first step we estimate π(M | S, α) for M = 2000 to 30000 in steps of size 20 for a number of (S, α) pairs. Each point of the π(M | S, α) vs. M curve is based on T = 1500 Monte Carlo evaluations, in order to keep the Monte Carlo simulation error of a given probability below 0.015. Figure 6 plots π(M | S, α) for 100 draws of (S, α) from the posterior distribution p(S, α | y) (including more curves makes the figure more difficult to read).
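Assuming the sampler and pi_M_given_S_alpha sketches from the preceding sections are in scope (an assumption on our part), the family of curves behind Figure 6 could be assembled along the following hypothetical lines, here with the Table 1 counts and a coarser grid of M values.

```python
import numpy as np

# The observed counts y are rebuilt from Table 1: 451 species seen once, etc.
rng = np.random.default_rng(2)
y = np.repeat(np.arange(1, 12), [451, 268, 116, 61, 35, 16, 5, 7, 2, 3, 1])
post = sampler(y, f=0.0007, n_iter=8000)[4000:]   # keep the second half as post-burn-in
idx = rng.choice(len(post), size=100, replace=False)
M_grid = np.arange(2000, 30001, 500)              # coarser than the paper's step of 20
curves = np.array([[pi_M_given_S_alpha(M, y, S, a, rng=rng) for M in M_grid]
                   for (S, a) in (post[i] for i in idx)])  # one curve per (S, alpha)
```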
Each curve in the figure shows the relationship between the probability of seeing 90% of the species and the additional sample size M for a single posterior draw of S and α. From the figure, we can see that the curves are spread out, especially for larger coverage probabilities. This implies large uncertainty about the probability of seeing 90% of the species with a given M, and also large uncertainty about the minimum sample size required to see at least 90% of the species at a specified confidence level π. This reflects the uncertainty about the parameter α, which has a very substantial impact on the species frequency distribution. Posterior draws with large α values will tend to have smaller S values, and hence greater likelihood of observing 90% of the species with M additional animals; these draws correspond to the curves on the left in Figure 6. A small α suggests the true S is larger, so we are less likely to observe 90% of the species with M additional animals.

[Figure 3: Contour plot of the joint posterior of S and α for the data simulated with S = 2000, α = 1, and N = 2000 (f = 0.0007).]

Probability of observing at least pS species with M additional animals

We next use the curves in Figure 6 to draw posterior inference for the probability of observing at least 0.9S species with M additional observations. Table 2 gives the posterior median of π(M) and a 90% central posterior interval for π(M) for a range of M values. These inferences are based on k = 100 (S, α) pairs chosen randomly among the 4000 posterior samples.

Table 2: Estimates of π(M) for different M values

M                     5000     8000       10000      12000
π̂(M)                 0.52     1          1          1
90% emp. post. int.   (0, 1)   (0.02, 1)  (0.48, 1)  (0.75, 1)

[Figure 4: Histogram of the posterior distribution of S.]

Figure 7 shows the posterior medians and pointwise 90% central posterior intervals graphically for M ranging from 400 to 20000. Posterior intervals for a given value of M tend to be wide when M is relatively small (e.g. M = 5000), but the length of the intervals decreases quickly as M increases. This reflects the form of Figure 6: for a given M, most curves have probability of attaining the target number of species near zero (if that value of M is relatively small compared to the value of S on which the curve is based) or near one (if the value of M is relatively large compared to the relevant value of S).

[Figure 5: Histogram of the posterior distribution of α.]

Required sample size to capture at least pS species with coverage probability 0.9

Recall that for the simulated data the initial sample of 2000 animals captured approximately half the species. How many additional animals have to be collected if we want to see at least 0.9S species with probability q = 0.9? Once again the Monte Carlo simulations in Figure 6 can be used to answer the question. Drawing a horizontal line at coverage probability q = 0.9, the intersection points of the horizontal line with the curves give estimates of M_q, one from each curve. The distribution of these values provides the desired posterior inference. The posterior median is 5330, and an empirical 90% central posterior interval is (2980, 13080).
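This extraction is easy to automate. The sketch below (ours) reads each curve to find the smallest M achieving coverage probability q, yielding a sample from the posterior distribution of M_q.

```python
import numpy as np

def required_sample_sizes(curves, M_grid, q=0.9):
    """For each curve pi(M | S, alpha), return the smallest M with pi(M) >= q
    (np.inf if the target coverage is never reached on the grid)."""
    out = []
    for curve in curves:
        idx = np.nonzero(np.asarray(curve) >= q)[0]
        out.append(M_grid[idx[0]] if idx.size else np.inf)
    return np.array(out)

# e.g. with the `curves` array sketched above:
# Mq = required_sample_sizes(curves, M_grid, q=0.9)
# print(np.median(Mq), np.percentile(Mq, [5, 95]))  # median and 90% interval
```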
Table 3 gives the posterior median of the needed sample size together with 90% central posterior intervals for various values of the target proportion of species (p) and the desired coverage probability (q). As the proportion of the species to be captured (p) increases, the number of additional samples required increases quickly, which is natural because the more common species have undoubtedly been observed already. Further, as indicated in the table, for each p the sample size required to achieve different coverage probabilities (q) changes slowly. This is once again due to the pattern observed in Figure 6, in which all curves display a similar trend: there is a steep increase in the probability of coverage q near a threshold value of M (though this threshold varies depending on S and α).

[Figure 6: The relationship between the probability π(M | S, α) and the additional sample size M (N = 2000).]

Table 3: Size of additional sample required to obtain a fraction p of the total species with probability q (90% central posterior intervals in parentheses)

Fraction of      Probability of covering specified fraction (q)
species (p)      0.5                 0.7                 0.9
0.7              810 (290, 1570)     830 (310, 1630)     870 (330, 1700)
0.8              1940 (1060, 3540)   2000 (1100, 3660)   2090 (1120, 3840)
0.9              5010 (3020, 10960)  5190 (3140, 11760)  5330 (2980, 13080)

[Figure 7: Trace of the estimated coverage probabilities. The middle line connects the posterior median values of π(M | S, α), and the two dotted lines beside it are the 90% pointwise central posterior intervals.]

5.4 Effect of sample size

Having demonstrated the approach for a sample of size 2000 from our hypothetical population in Sections 5.1 through 5.3, we now demonstrate the impact of increasing the sample size; one would expect the inferences to become more precise. We simulate a data set of size N = 10000 from the same population as before, i.e. α = 1, S = 2000. For this sample, the highest frequency is x_o = 45, and the number of observed species is s_o = 1663, which is more than 80% of the total number of species. In this example, the value of f is again selected as f = 0.0007. The posterior mean for S is 2030, with 95% central posterior interval (1948, 2129). The posterior mean for α is 0.93, and a 95% central posterior interval for α is (0.80, 1.08). Both intervals are much narrower than in the case with N = 2000.

Figure 8 shows the posterior distribution of π(M | S, α) with p = 0.9 (i.e. capturing 90% of all species) for 100 (S, α) pairs and a number of M values. There is much less variation than is present in Figure 6.

[Figure 8: The relationship between the probability π(M | S, α) and the additional sample size M (N = 10000).]

The posterior median of M_q for q = 0.9 is 1880, and a 90% central posterior interval is (1240, 2700). The required sample size is smaller because a larger number of species were observed in the pilot sample. In addition, the posterior interval is much narrower than that based on N = 2000 observations.

5.5 Effect of prior distribution

As in any Bayesian analysis, it is critical to consider the impact of the choice of prior distribution on the inferences obtained. That is especially true here for the prior distribution of S and α.
This section compares our prior distribution with others in the literature and addresses some practical issues associated with its use. In Section 2.3, two other choices of prior distributions for S and α were mentioned: the proposal of Boender and Rinnooy Kan (1987) to use proper prior distributions for S and α, and that of Zhang and Stern (2005), where a uniform distribution with an upper bound is assigned to S and a vague prior distribution is given to α (the same prior distribution for α as is used here). We applied the Dirichlet-generalized multinomial model with these prior distributions to the same simulated data discussed in Section 5.1. The posterior inferences for S and α are listed in Table 4, which indicates that the results from the two alternative prior distributions are consistent with the results using the geometric prior distribution for S. The findings regarding species coverage and sample size are also similar across the different methods.

We next discuss the effect of different choices of f on the posterior inferences for the simulated data. As noted earlier, different values of f imply different degrees of confidence with respect to the suggested range of S, with larger values of f corresponding to higher prior confidence that S is between S_min and the plausible S_max. Table 4 lists various choices of f for S_min = 2 and S_max = 10000, the corresponding probability of S ∈ (S_min, S_max), and the posterior inferences for S and α that result from each choice of f. The results suggest that as long as f is not too big (i.e. our prior belief in S_max is not too extreme), the posterior inferences for the parameters are consistent across different values of f and all agree reasonably well with the true values. We also observe that the larger the value of f, the more strongly our prior information favors small values of S. This is reflected in the inferences: as f increases the prior distribution puts more probability mass on smaller values of S, and thus the posterior mean decreases.

The preceding discussion concerns the simulated data of Section 5.1, where the population essentially does not have any rare species; the three different prior distribution choices give similar results in this case. However, as noted in Section 2.3, the methods of Boender and Rinnooy Kan (1987) and Zhang and Stern (2005) both have difficulty inferring the number of species if the sample is consistent with small α values in the population. We use simulated data to demonstrate and compare the three methods in this context. A random sample is drawn from a population with a large number of rare or infrequent species: the same scenario as in Section 5.1 is applied, with a data set of N = 2000 observations drawn from a population with α = 0.01 and S = 2000. In this data set, the number of observed species is s_o = 94 and the highest frequency is x_o = 155, which implies the population has some very abundant species but many more rare or hard-to-capture species (Zhang and Stern (2005)). The value of f is selected as before, i.e., f = .0007, corresponding to high confidence (.999) that the true number of species is between S_min = 2 and the suggested S_max = 10000.
The posterior inferences from the three methods are listed in Table 5; posterior means are given along with 95% central posterior intervals in parentheses. The results in Table 5 demonstrate the poor performance obtained with the prior distributions of Boender and Rinnooy Kan (1987) and Zhang and Stern (2005). For the new prior distribution, with f = 0.0007, the posterior inferences are consistent with the true values. We notice that the posterior inferences seem more sensitive to the choice of f in this context than in the high-α case. We also find that the posterior interval for S under the geometric prior distribution is wide, which implies large uncertainty about the value of S due to the large number of rare species in the population. The inferences could be improved if more information were available to help construct an informative prior distribution for S.

Table 4: Posterior inferences on S and α from different methods (95% posterior intervals in parentheses)

Method                                        Ŝ                  α̂
Boender and Rinnooy Kan (1987), Scut = 500    1833 (1597, 2160)  1.08 (0.72, 1.55)
Boender and Rinnooy Kan (1987), Scut = 1000   1817 (1573, 2122)  1.10 (0.74, 1.62)
Zhang and Stern (2005)                        1866 (1576, 2337)  1.09 (0.64, 1.75)

Geometric prior:
P(Smin ≤ S ≤ Smax)  f        Ŝ                  α̂
0.63                0.0001   1870 (1556, 2320)  1.03 (0.61, 1.69)
0.90                0.00023  1865 (1574, 2335)  1.04 (0.61, 1.69)
0.999               0.0007   1844 (1559, 2301)  1.07 (0.64, 1.69)
0.9999              0.001    1842 (1537, 2252)  1.07 (0.64, 1.70)
≈1                  0.002    1803 (1537, 2252)  1.12 (0.70, 1.32)
≈1                  0.003    1782 (1530, 2153)  1.16 (0.72, 1.81)
≈1                  0.006    1721 (1501, 2022)  1.26 (0.82, 1.93)
≈1                  0.01     1660 (1467, 1909)  1.39 (0.93, 2.09)

Table 5: Posterior inferences on S and α from different methods when the population α is small (α = 0.01)

Method                                        Ŝ                      α̂
Boender and Rinnooy Kan (1987), Scut = 50     297 (129, 1021)        0.12 (0.021, 0.31)
Boender and Rinnooy Kan (1987), Scut = 500    318 (151, 490)         0.087 (0.045, 0.23)
Boender and Rinnooy Kan (1987), Scut = 5000   1506 (174, 4611)       0.023 (0.0044, 0.18)
Zhang and Stern (2005)                        5159 (451, 9822)       0.0054 (0.0019, 0.052)
Geometric prior, f = 0.0002                   2548 (247, 9100)       0.013 (0.0022, 0.11)
Geometric prior, f = 0.0005                   2307 (270, 7323)       0.014 (0.0027, 0.096)
Geometric prior, f = 0.0007                   2180 (267, 6734)       0.014 (0.0030, 0.098)
Geometric prior, f = 0.001                    1650 (255, 5280)       0.017 (0.0038, 0.11)

6 Application to sequence data

6.1 Description of the data

This work was motivated by a bioinformatics problem arising during a genome sequencing project. Details of the technological approach are not particularly crucial here; for one thing, the approach is no longer used by the company. A key issue that came up during the project was the desire to identify the unique elements in a set of DNA fragments. The unique elements could be easily determined by sequencing all of the fragments, but this is not necessarily cost effective if there is a lot of duplication. One strategy under consideration proposed sequencing a small sample of fragments and recording the frequency with which each unique sequence was found. Framed in this way, the problem is directly analogous to our species problem. The hope is that, based on the small sample, it will be possible to determine how large a sequencing effort to mount.

A prototype data set was provided with sample size N = 1677 and s_o = 644, in which there were 440 species each observed once and 1 species observed 76 times. Figure 9 shows the pattern of the data in terms of frequencies.
The figure shows a very sharply decreasing pattern in the distribution of frequencies, different from that of our simulated data in Section 5.1 and more like the small-α case discussed at the end of Section 5.5. A few “species” occur with high frequencies, and a very high proportion of the observed species occur only once. This is the type of data that typically indicates a small value of α, which can cause difficulties for the generalized multinomial model.

[Figure 9: The distribution of the frequencies for the DNA sequence data (number of species vs. frequency).]

6.2 Applying the model

We apply the method proposed in the previous sections to the DNA segments data. Our collaborator suggested the maximum value S_max = 10000. With the value of f selected as f = 0.0007, as discussed earlier, our prior confidence that S lies between 2 and 10000 is 0.999. Figure 10 is a contour plot of the posterior distribution of S and α, which clearly shows one mode around S = 12000. The posterior mean for S is 12111, and a 95% central posterior interval for S is (7246, 19637). The posterior mean for α is 0.033 and its 95% central posterior interval is (0.020, 0.056). Note that although the prior confidence of S in the interval (2, 10000) is 0.999, the posterior distribution is concentrated above 10000. This suggests that the data provide overwhelming evidence of a large number of rare species. As seen in the next section, this inference results in our determination that extremely large sample sizes are required to collect even a small fraction of the total number of species.

[Figure 10: Contour plot of the posterior distribution of S and α for the DNA sequence data.]

6.3 Sample size calculation for future sampling

We use the Monte Carlo simulation approach discussed in Section 4.2 to carry out the sample size calculation. The posterior inference for S suggests a large number of distinct DNA sequences in the population, and the posterior inference for α implies that the population has a large number of rare species. We thus expect that a large sample is needed even for modest species coverage. Table 6 lists the estimated sample sizes needed to see 10% or 15% of all the distinct DNA sequences, for different probabilities of coverage. Patterns similar to those in the simulations are observed: the change in the required sample sizes across different coverage probabilities (q) is small for a given target fraction (p), while the required sample sizes and the uncertainty both increase quickly with a small increase in the target fraction of species (p). Due to the large number of rare species in the population, the inferences obtained here are of limited value commercially; an extremely large sample size is required to see a substantial fraction of the species.

Table 6: Additional sample sizes needed to collect 10% or 15% of all distinct DNA sequences (posterior intervals in parentheses)

Fraction of      Probability of covering specified fraction (q)
species (p)      0.5                 0.9
0.10             1900 (400, 6425)    2200 (450, 7488)
0.15             7000 (2600, 23500)  7500 (2800, 27000)

7 Summary

A multinomial-Dirichlet model is proposed for the analysis of data in which individual objects belong to different categories. The prior distribution for the number of categories is selected to be a geometric distribution, with probability parameter set to reflect our confidence that the number of categories lies in a predetermined range.
The multinomial-Dirichlet model with this prior distribution seems to work well over a range of scenarios. A new Monte Carlo simulation algorithm is introduced for determining the minimum size of an additional sample required to capture a certain proportion of the categories in the population with a specified coverage probability. Simulation results show that sample size calculation in this way is feasible. An application to a DNA segments data set demonstrates the applicability of the proposed method, but also suggests continued difficulty with the problematic case of many rare species. Future study is needed to extend the model to address situations where the distribution of species is not well approximated by our model, e.g. where the relative proportion of rare species is high.

Appendix

A.1 Proof of Theorem 1

The joint posterior distribution of S and α, as derived in Section 2, is

$$p(S, \alpha \mid y) \propto \frac{S!}{(S - s_o)!} \frac{\Gamma(S\alpha)}{\Gamma(N + S\alpha)} \frac{\Gamma(y_1 + \alpha) \cdots \Gamma(y_{s_o} + \alpha)}{(\Gamma(\alpha))^{s_o}}\; (1-f)^S\; \alpha^{-3/2}, \qquad (14)$$

for S ≥ s_o and 0 < α < ∞. We find the conditions required to ensure that

$$\sum_{S = s_o}^{\infty} \int_0^{\infty} p(S, \alpha \mid y)\, d\alpha < \infty$$

by obtaining an upper bound on the integral over α for each S. For each S, choose ε > 0 such that Sε < 1. We then consider the integral over the two intervals (0, ε) and (ε, ∞).

On the interval (0, ε): Recall that the gamma function satisfies Γ(1 + z) = zΓ(z). This and other properties of the gamma function yield the following results.

1. Γ(α) = Γ(1 + α)/α (α > 0).
2. If y_i ≥ 1 and α < ε < 1, then Γ(y_i + α) < max(Γ(y_i + 1), 1).
3. Defining γ_min > 0 as the minimum value of the gamma function on the interval (1, 2), we have Γ(1 + α) ≥ γ_min for 0 ≤ α < 1.
4. Γ(Sα) = Γ(1 + Sα)/(Sα) < 1/(Sα), since Sα < Sε < 1.
5. If s_o ≥ 2, then we must have N ≥ 2, so that Γ(N + Sα) > Γ(N).

Applying these equalities and inequalities gives

$$\sum_{S = s_o}^{\infty} \int_0^{\epsilon} p(S, \alpha \mid y)\, d\alpha \;<\; \sum_{S = s_o}^{\infty} \frac{S!}{(S - s_o)!} (1 - f)^S \int_0^{\epsilon} \frac{\prod_{i=1}^{s_o} \max(\Gamma(y_i + 1), 1)}{\gamma_{min}^{s_o}} \frac{\alpha^{s_o}}{S\alpha\, \Gamma(N)}\, \alpha^{-3/2}\, d\alpha \;=\; \sum_{S = s_o}^{\infty} \frac{(S - 1)!}{(S - s_o)!} (1 - f)^S\, C_y \int_0^{\epsilon} \alpha^{s_o - 5/2}\, d\alpha,$$

where $C_y = \left[\prod_{i=1}^{s_o} \max(\Gamma(y_i + 1), 1)\right] / \left[\gamma_{min}^{s_o}\, \Gamma(N)\right]$ is a constant depending only on y. For s_o ≥ 2 the integral near zero is finite, and thus so is the sum, since the prior distribution of S is proper.

On the interval (ε, ∞): Repeated application of the recurrence Γ(1 + z) = zΓ(z) yields

$$p(S, \alpha \mid y) \propto \frac{S!}{(S - s_o)!} (1 - f)^S\, \alpha^{-3/2}\, \frac{\prod_{i=1}^{s_o} \prod_{j=1}^{y_i} (y_i + \alpha - j)}{\prod_{j=1}^{N} (S\alpha + N - j)} = \frac{S! (1 - f)^S\, \alpha^{-3/2}}{(S - s_o)!\; S^N}\, \frac{\prod_{i=1}^{s_o} \prod_{j=1}^{y_i} \left(1 + \frac{y_i - j}{\alpha}\right)}{\prod_{j=1}^{N} \left(1 + \frac{N - j}{S\alpha}\right)} < \frac{S! (1 - f)^S\, \alpha^{-3/2}}{(S - s_o)!\; S^N} \prod_{i=1}^{s_o} \prod_{j=1}^{y_i} \left(1 + \frac{y_i - j}{\epsilon}\right).$$

The final product is a constant with respect to S and α, and the remaining terms yield a finite integral over α and a finite sum over S. Combining the information from the two intervals, we conclude that the posterior distribution is proper if s_o ≥ 2, i.e., if at least two categories are observed.

A.2 Jumping functions for S and α

Metropolis-Hastings jumping function for S. The jumping function for S is a symmetric discrete uniform distribution centered at S^(t−1) (an asymmetric distribution is used if S^(t−1) is near the limit of its range) with width parameter b^(t−1). Let S^(*) be the proposed value of S when jumping from S^(t−1). The jumping distribution can be written as

$$S^{(*)} \mid S^{(t-1)} \sim \begin{cases} \mathrm{DUNIF}(S^{(t-1)} - b^{(t-1)},\; S^{(t-1)} + b^{(t-1)}), & S^{(t-1)} \ge s_o + b^{(t-1)} \\ \mathrm{DUNIF}(s_o,\; S^{(t-1)} + b^{(t-1)}), & S^{(t-1)} < s_o + b^{(t-1)}, \end{cases}$$

where the second line handles the case in which the current draw is near the boundary of the parameter space. The width parameter b^(t−1) is selected to be proportional to the current value S^(t−1).

Metropolis-Hastings jumping function for α. We use a normal jumping distribution on the logarithm of α. Define φ = log(α), let φ^(t−1) denote the current sampled point, and let φ^(*) be the candidate point generated from the jumping distribution. The jumping distribution for φ is φ^(*) | φ^(t−1) ~ N(φ^(t−1), V²), where the standard deviation V is chosen to make the jumping function efficient. In practice, V is selected based on a pilot sample to achieve an acceptance rate near 0.44, the optimal rate suggested by Gelman et al. (2003).
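In code, the two jumping functions might look as follows (our rendering; the tuning constants b and V are set by the user, with b proportional to the current S as noted above).

```python
import numpy as np

def propose_S(S_curr, s_o, b, rng):
    """Discrete uniform random walk for S, truncated below at s_o."""
    lo = max(s_o, S_curr - b)
    S_prop = int(rng.integers(lo, S_curr + b + 1))
    # Near the boundary the proposal is asymmetric, so return the forward and
    # reverse proposal probabilities for the Metropolis-Hastings ratio:
    # accept with probability min(1, target_ratio * rev / fwd).
    fwd = 1.0 / (S_curr + b - lo + 1)
    rev = 1.0 / (S_prop + b - max(s_o, S_prop - b) + 1)
    return S_prop, fwd, rev

def propose_alpha(alpha_curr, V, rng):
    """Log-normal random walk: a normal step on phi = log(alpha); remember the
    Jacobian factor alpha_prop / alpha_curr in the acceptance ratio."""
    return alpha_curr * np.exp(V * rng.normal())
```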
References

Boender, C. G. E. and Rinnooy Kan, A. H. G. (1987). “A multinomial Bayesian approach to the estimation of population and vocabulary size.” Biometrika, 74(4): 849–856.

Bunge, J. and Fitzpatrick, M. (1993). “Estimating the number of species: a review.” Journal of the American Statistical Association, 88: 364–373.

Cao, Y., Larsen, D. P., and Thorne, R. S.-J. (2001). “Rare species in multivariate analysis for bioassessment: some considerations.” Journal of the North American Benthological Society, 21: 144–153.

Corbet, A. S. (1942). “The distribution of butterflies in the Malay Peninsula.” Proceedings of the Royal Entomological Society of London (A), 16: 101–116.

Efron, B. and Thisted, R. (1976). “Estimating the number of unseen species: How many words did Shakespeare know?” Biometrika, 63: 435–448.

Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943). “The relation between the number of species and the number of individuals in a random sample of an animal population.” Journal of Animal Ecology, 12: 42–58.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman & Hall/CRC.

Gelman, A. and Rubin, D. B. (1992a). “Inference from iterative simulation using multiple sequences.” Statistical Science, 7: 457–511.

Gelman, A. and Rubin, D. B. (1992b). “A single series from the Gibbs sampler provides a false sense of security.” In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M. (eds.), Bayesian Statistics 4. Proceedings of the Fourth Valencia International Meeting, 625–631. Clarendon Press [Oxford University Press].

Good, I. J. (1965). The Estimation of Probabilities; An Essay on Modern Bayesian Methods. Cambridge, Mass.: MIT Press.

Good, I. J. and Toulmin, G. H. (1956). “The number of new species, and the increase in population coverage, when a sample is increased.” Biometrika, 43: 45–63.

Lijoi, A., Mena, R. H., and Prünster, I. (2007). “Bayesian nonparametric estimation of the probability of discovering new species.” Biometrika, 94: 769–786.

Morris, J. S., Baggerly, K. A., and Coombes, K. R. (2003). “Bayesian shrinkage estimation of the relative abundance of mRNA transcripts using SAGE.” Biometrics, 59(3): 476–486.

Pitman, J. (1996). “Some developments of the Blackwell-MacQueen urn scheme.” In Ferguson, T. S., Shapley, L. S., and MacQueen, J. B. (eds.), Statistics, Probability and Game Theory (IMS Lecture Notes Monograph Series, Vol. 30), 245–267. Institute of Mathematical Statistics.

Sethuraman, J. (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 4: 639–650.

Tiwari, R. C. and Tripathi, R. C. (1989).
“Nonparametric Bayes estimation of the probability of discovering a new species.” Communications in Statistics: Theory and Methods, 18: 877–895.

Zhang, H. (2007). “Inferences on the number of unseen species and the number of abundant/rare species.” Journal of Applied Statistics, 34(6): 725–740.

Zhang, H. and Stern, H. (2005). “Investigation of a generalized multinomial model for species data.” Journal of Statistical Computing and Simulation, 75: 347–362.

Bayesian Analysis (2009) 4, Number 4, pp. 793–816

Markov Switching Dirichlet Process Mixture Regression

Matthew A. Taddy∗ and Athanasios Kottas†

Abstract. Markov switching models can be used to study heterogeneous populations that are observed over time. This paper explores modeling the group characteristics nonparametrically, under both homogeneous and nonhomogeneous Markov switching for group probabilities. The model formulation involves a finite mixture of conditionally independent Dirichlet process mixtures, with a Markov chain defining the mixing distribution. The proposed methodology focuses on settings where the number of subpopulations is small and can be assumed to be known, and flexible modeling is required for group regressions. We develop Dirichlet process mixture prior probability models for the joint distribution of individual group responses and covariates. The implied conditional distribution of the response given the covariates is then used for inference. The modeling framework allows for both non-linearities in the resulting regression functions and non-standard shapes in the response distributions. We design a simulation-based model fitting method for full posterior inference. Furthermore, we propose a general approach for inclusion of external covariates dependent on the Markov chain but conditionally independent from the response. The methodology is applied to a problem from fisheries research involving analysis of stock-recruitment data under shifts in the ecosystem state.

Keywords: Dirichlet process prior; hidden Markov model; Markov chain Monte Carlo; multivariate normal mixture; stock-recruitment relationship.

1 Introduction

The focus of this work is to develop a flexible approach to nonparametric switching regression which combines Dirichlet process (DP) mixture nonparametric regression with a hidden Markov model. Switching regression, a modeling framework for data drawn from a number of unobserved states (or regimes) in which each state defines a different relationship between response and covariates, was originally developed in the context of econometrics (Goldfeld and Quandt 1973; Quandt and Ramsey 1978) and has primarily been approached through likelihood-based estimation. A hidden Markov mixture model in this context holds that the state vector constitutes a Markov chain, and thus introduces an underlying dependence into the data. In such models, the regression functions corresponding to individual population regimes are typically linear with additive error, and may or may not include an explicit time-series component (e.g., Hamilton 1989; McCulloch and Tsay 1994).

∗ University of Chicago, Booth School of Business, Chicago, IL, mailto:matt.taddy@chicagobooth.edu
† Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA, mailto:thanos@ams.ucsc.edu

© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA430
The work presented here has a different focus: flexible nonparametric inference within regimes, guided by an informative parametric hidden Markov model for regime state switching. Such approaches reveal a baseline inference: the posterior distribution for individual regression functions when informed by little more than the state switching model. The proposed posterior simulation algorithms will also serve as a useful framework for more general inference about mixtures of conditionally independent nonparametric processes.

Bayesian nonparametrics, and DP mixtures in particular, provide highly flexible models for inference. Indeed, the practical implication of this flexibility is that, for inference based on small to moderate sample sizes, a certain amount of prior information must be provided to avoid a uselessly diffuse posterior. The DP hyperparameters provide the natural mechanism for introducing prior information. However, it is also possible to constrain inference by embedding the nonparametric component within a larger model. The typical semiparametric extension to linear regression, nonparametric modeling for the additive error distribution, is a familiar example of this approach. One can afford to be very noninformative about the error distribution only because linearity of the mean imposes a substantial constraint on model flexibility.

This paper explores one such class of semiparametric inference settings: nonparametric density or regression estimation for heterogeneous populations, using a DP mixture framework, nested within an informative parametric model for group membership, using an either homogeneous or nonhomogeneous hidden Markov switching model. Although this framework applies generally to nonparametric density estimation, our particular focus is Markov switching nonparametric regression, specified in detail in Section 2, including model elaboration for the inclusion of external covariates. Following this, Section 3 describes efficient forward-backward posterior simulation methodology for dependent mixtures of nonparametric mixture models, along with details for full posterior inference. In Section 4, the methods are illustrated with an application from fisheries research involving analysis of stock-recruitment data under shifts in the ecosystem state, which can be characterized as regimes that are either favorable or unfavorable for reproduction. Here, the Markov switching nonparametric regression framework enables simultaneous inference for the regime-specific biological stock-recruitment relationship and for the probability of regime switching. Moreover, the DP mixture regression approach relaxes parametric regression assumptions for the stock-recruitment relationships, and yields inference that can capture non-standard response density shapes. These are important features of the proposed model, since they can improve predictive inference for years beyond the end of the observed time series, a key inferential objective for fishery management. Finally, Section 5 concludes with a summary and discussion of possible extensions.

2 Markov Switching Nonparametric Regression

In Section 2.1, we introduce the two building blocks upon which our modeling approach is based: Markov switching mixtures of DP mixtures, and fully nonparametric implied conditional regression. Section 2.2 presents the hidden Markov DP mixture model, and Section 2.3 extends the model to include external variables that are correlated with the underlying Markov chain, but conditionally independent of the joint covariate-response distribution.
2.1 Mixtures of Conditionally Independent Dirichlet Process Mixtures

The generic nonparametric DP mixture model is written as $f(z; G) = \int k(z; \theta)\, dG(\theta)$ for the density of z, with a parametric kernel density, k(z; θ), and a random mixing distribution G that is assigned a DP prior (Ferguson 1973; Antoniak 1974). In particular, G ~ DP(α, G_0), where α is the precision parameter and G_0 is the centering distribution.

More specifically, the starting point for our approach is Bayesian nonparametric implied conditional regression, wherein DP mixtures are used to model the joint distribution of response and covariates, from which full inference is obtained for the desired conditional distribution of response given covariates. Both the response distribution and, implicitly, the regression function are modeled nonparametrically, thus providing a flexible framework for the general regression problem. In particular, working with (real-valued) continuous variables, DP mixtures of multivariate normal densities can be used to model the joint density of the covariates, X, and the response, Y (as in, e.g., Müller et al. 1996). Hence, the normal DP mixture regression model can be described as follows:

$$f(z; G) = \int \mathrm{N}(z; \mu, \Sigma)\, dG(\mu, \Sigma), \qquad G \mid \alpha, \psi \sim \mathrm{DP}(\alpha, G_0), \qquad (1)$$

where z = (x, y), and G_0 can be built from independent normal and inverse Wishart components for μ and Σ, respectively. Inference for the implied conditional response distribution under our Markov switching regression model is discussed in Section 3, following the development in Taddy and Kottas (2009), where full inference about f(y | x; G) was required to estimate quantile regression functions.
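To make the implied conditional regression idea concrete: whenever G is replaced by a discrete (e.g. truncated) realization, f(x, y; G) is a finite normal mixture, and f(y | x; G) follows in closed form by conditioning each component. A sketch of this standard computation (ours, not the authors' code; the last coordinate of each component is taken to be the response y):

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_mixture(x, weights, mus, Sigmas):
    """Given a finite normal-mixture approximation of the joint f(x, y; G),
    return the weights, means, and variances of the implied mixture f(y | x)."""
    cond_w, cond_m, cond_v = [], [], []
    for w, mu, Sig in zip(weights, mus, Sigmas):
        mx, my = mu[:-1], mu[-1]
        Sxx, Sxy, Syy = Sig[:-1, :-1], Sig[:-1, -1], Sig[-1, -1]
        # component weight is reweighted by the marginal density of x
        cond_w.append(w * multivariate_normal.pdf(x, mean=mx, cov=Sxx))
        slope = np.linalg.solve(Sxx, Sxy)
        cond_m.append(my + slope @ (np.asarray(x) - mx))   # conditional mean
        cond_v.append(Syy - Sxy @ slope)                   # conditional variance
    cond_w = np.array(cond_w) / np.sum(cond_w)
    return cond_w, np.array(cond_m), np.array(cond_v)

# E[Y | X = x; G] is then sum(cond_w * cond_m), and f(y | x; G) is the
# corresponding one-dimensional normal mixture.
```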
Following the early work of Goldfeld and Quandt (1973) and Quandt and Ramsey (1978), the more recent literature includes, for instance, approaches for switching dynamic linear models (Shumway and Stoffer 1991) and switching ARMA models (Billio et al. 1999). Moreover, Hurn et al. (2003) describe a Bayesian decision theoretic approach to estimation for mixtures of linear regressions, whereas the approach of Shi et al. (2005) offers a departure from the linear regression assumption through a mixture of Gaussian process regressions. Since, in our context, the Gr are modeled nonparametrically, this leads to inference that is driven primarily by state membership and, in particular, the Markov transition probabilities. Taking this approach further, the proposed nonparametric switching regression methodology will be most effective when state membership probabilities are informed by external covariates. Hughes and Guttorp (1994) and Berliner and Lu (1999) have proposed nonhomogeneous hidden Markov models where each observation’s state probability vector pi is regressed onto a set of external covariates, ui . In Section 2.3, we obtain a similar model by assuming that the external ui are randomly distributed according to a state dependent density function, phi (ui ). Conditioning on ui then implies a nonhomogeneous hidden Markov model for h. Hence, our methodological framework involves a known (small) number of states where prior information is available on the properties of the underlying state Markov chain, but there is a need for nonparametric modeling within each subpopulation. The assumption that the number of mixture states is known fits within the general premise of an informative state estimation coupled with flexible nonparametric modeling for regression estimation. Thus, while the methodology is not generally suitable for settings with little information about state membership, it offers a practical solution to switching regression problems that lack prior information about the shape of the individual regression functions and/or the form of the corresponding conditional response densities. 2.2 Model Specification for Hidden Markov Nonparametric Switching Regression Mixtures of regressions are used to study multiple populations each of which involves a different conditional relationship between response and covariates. The generic mixtures of regressions setting holds that the response Y given covariates X has been drawn from a member of a heterogeneous set of R conditional distributions defined by the densities f1 (y | x), . . . , fR (y | x), and hence that Pr(y | x) = p1 f1 (y | x) + . . . + PR pR fR (y | x), where r=1 pr = 1. We propose a departure from this standard form, wherein the response and covariates are jointly distributed according to one of the densities f1 (x, y), . . . , fR (x, y) – i.e., now Pr(x, y) = p1 f1 (x, y) + . . . + pR fR (x, y) – and 797 M. A. Taddy and A. T. Kottas PR therefore Pr(y | x) = ρ1 f1 (x, y) + . . . + ρR fR (x, y), where ρr = pr / ℓ=1 pℓ fℓ (x). Thus, the approach is particularly appropriate whenever mixture component probabilities for a given x and y should be dependent upon the joint distribution for response and covariates, even though primary interest is in the regression relationship for response given covariates. Specifically, we develop the extension of DP mixture implied conditional regression to the context of time dependent switching regression. The data consist of covariate vectors xt = (x1t , . . . 
Specifically, we develop the extension of DP mixture implied conditional regression to the context of time-dependent switching regression. The data consist of covariate vectors $x_t = (x_{1t}, \ldots, x_{d_x t})$ and corresponding responses y_t, observed at times t = 1, ..., T, where d_x is the dimension of the covariate space. The data from each time point are associated with a hidden state variable, h_t ∈ {1, ..., R}, such that, given h_t, the response-covariate joint distribution is defined by a state-specific density $f_{h_t}(x_t, y_t)$. We begin by describing density estimation in the d = d_x + 1 dimensional setting, with data D = {z_t = (x_t, y_t) : t = 1, ..., T}. Now, however, the successive observations z_t are correlated through dependence in the state membership h = (h_1, ..., h_T), which constitutes a stationary Markov chain defined by an R × R transition matrix Q. Although we consider only first-order dependence in the Markov chain, the model and posterior simulation methods can be extended to handle higher order Markov chains. The first-order hidden Markov location-scale normal DP mixture model (referred to as model M1) can then be expressed as follows:

$$z_t \mid h_t, G_{h_t} \stackrel{ind}{\sim} f_{h_t}(z_t) \equiv f(z_t; G_{h_t}) = \int \mathrm{N}(z_t; \mu, \Sigma)\, dG_{h_t}(\mu, \Sigma), \quad t = 1, \ldots, T$$
$$G_r \mid \alpha_r, \psi_r \stackrel{ind}{\sim} \mathrm{DP}(\alpha_r, G_0(\psi_r)), \quad r = 1, \ldots, R \qquad (2)$$
$$h \mid Q \sim \Pr(h \mid Q) = \prod_{t=2}^{T} Q_{h_{t-1}, h_t},$$

where we denote the r-th row of Q by Q_r = (Q_{r,1}, ..., Q_{r,R}), with Q_{r,s} = Pr(h_t = s | h_{t−1} = r), for r, s = 1, ..., R (and t = 2, ..., T). Moreover, the DP centering distributions are $G_0(\mu, \Sigma; \psi_r) = \mathrm{N}(\mu; m_r, V_r)\, \mathrm{W}_{\nu_r}(\Sigma^{-1}; S_r^{-1})$, with ψ_r = (m_r, V_r, S_r). Here, W_v(·; M) denotes the Wishart distribution with v degrees of freedom and expectation vM.

Applying the regression approach discussed in Section 2.1, the joint response-covariate density specification in (2) yields our proposed hidden Markov switching regression model. In particular, for state r, the prior model for the marginal density of X can be written as $f(x; G_r) = \int \mathrm{N}(x; \mu^x, \Sigma^{xx})\, dG_r(\mu, \Sigma)$, after the mean vector and covariance matrix of the normal kernel have been partitioned: μ comprises the (d_x × 1) vector μ^x and the scalar μ^y, and Σ is a square block matrix with diagonal elements given by the (d_x × d_x) covariance matrix Σ^{xx} and the scalar variance Σ^{y}, and above- and below-diagonal vectors Σ^{xy} and Σ^{yx}, respectively.

We assume that, in the prior, each state is equally likely for h_1. For r = 1, ..., R, we place hyperpriors on ψ_r and α_r such that $\pi(\psi_r) = \mathrm{N}(m_r; a_{m_r}, B_{m_r})\, \mathrm{W}_{a_{V_r}}(V_r^{-1}; B_{V_r}^{-1})\, \mathrm{W}_{a_{S_r}}(S_r; B_{S_r})$, and $\pi(\alpha_r) = \Gamma(\alpha_r; a_{\alpha_r}, b_{\alpha_r})$. The prior for Q is built from independent Dirichlet distributions, $\pi(Q_r) = \mathrm{Dir}(Q_r; \lambda_r)$, where Dir(Q_r; λ_r), with λ_r = (λ_{r,1}, ..., λ_{r,R}), denotes the Dirichlet distribution such that $E[Q_{r,s}] = \lambda_{r,s} / \sum_{i=1}^{R} \lambda_{r,i}$. In practice, the hyperparameters for the α_r, ψ_r, and Q need to be carefully chosen; our approach to prior specification is detailed in Appendix A.
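Before turning to the extension with external covariates, it may help to see model M1 generatively. The sketch below (ours) forward-simulates h from Q and then z_t from a state-specific finite normal mixture, anticipating the truncation of each G_r introduced in Section 3.1; state_mixtures is a hypothetical per-state list of (weights, means, covariances).

```python
import numpy as np

def simulate_M1(T, Q, state_mixtures, rng=None):
    """Forward-simulate from the hidden Markov DP mixture model (2), given a
    transition matrix Q and, per state, a finite mixture (weights, mus, Sigmas)."""
    rng = rng or np.random.default_rng()
    R = Q.shape[0]
    h = np.empty(T, dtype=int)
    h[0] = rng.integers(R)                      # each state equally likely for h_1
    z = []
    for t in range(T):
        if t > 0:
            h[t] = rng.choice(R, p=Q[h[t - 1]])
        w, mus, Sigs = state_mixtures[h[t]]
        l = rng.choice(len(w), p=w)             # pick a mixture component
        z.append(rng.multivariate_normal(mus[l], Sigs[l]))
    return h, np.array(z)
```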
2.3 Extension to Semiparametric Modeling with External Covariates

In the spirit of allowing the switching probabilities to drive the nonparametric regression, we extend here model M1 to include additional information about the state vector in the form of an external covariate, $U$, with values $u = \{u_1, \ldots, u_T\}$. (Although we present the methodology for a single covariate, the work can be readily extended to the setting with multiple external covariates.) The modeling extension involves a nonhomogeneous Markov mixture where the hidden state provides a link between the joint covariate-response random variable and the external covariate.

The standard nonhomogeneous hidden Markov model holds that the transition probabilities are dependent upon the external covariates, such that $\Pr(h_t \mid h_1, \ldots, h_{t-1}, u) = \Pr(h_t \mid h_{t-1}, u_t)$. Berliner and Lu (1999) present a Bayesian parametric approach to nonhomogeneous hidden Markov models in which $\Pr(h_t \mid h_{t-1}, u_t)$ is estimated through probit regression. Also related is the likelihood analysis of Hughes and Guttorp (1994), wherein a heuristic argument, using Bayes' theorem, is proposed to justify the model $\Pr(h_t \mid h_{t-1}, u_t) \propto \Pr(h_t \mid h_{t-1})\, L(h_t; u_t)$, where the likelihood $L(h_t; u_t)$ in their example is normal with state-dependent mean.

Treating each $u_t$ as the realization of a random variable yields a natural modeling framework in the context of our approach. Hence, we obtain a semiparametric extension of model M1 (referred to as model M2) by adding a further stage, $u_t \mid h_t \overset{ind}{\sim} p(u_t \mid \gamma_{h_t})$, to the model, along with hyperpriors for $\gamma = \{\gamma_r : r = 1, \ldots, R\}$, the state-specific parameters of the distribution for the external covariate. Moreover, we assume that $u$ is conditionally independent of $\{z_1, \ldots, z_T\}$ given $h$. Thus, for model M2, the first stage in (2) is replaced with
$$z_t, u_t \mid h_t, G_{h_t}, \gamma \overset{ind}{\sim} p(u_t \mid \gamma_{h_t})\, f(z_t; G_{h_t}), \quad t = 1, \ldots, T.$$
Clearly, the formulation of model M2 implies that the hidden Markov chain is nonhomogeneous conditional on $u$. However, unconditionally in the prior, it is more accurate to say that $\{z_1, \ldots, z_T\}$ and $u$ are dependent upon a shared homogeneous Markov chain, and that they are conditionally independent given $h$. In Section 4, we illustrate with a Gaussian form for $p(u_t \mid \gamma_{h_t})$. More general examples, with multiple external covariates, could incorporate dependence relationships, or even model some subset of the vector of external covariates as a function of the others.
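A small illustration of how the external covariate reweights the transitions under model M2: conditional on $u_t$, the effective transition probability is $\Pr(h_t = s \mid h_{t-1} = r, u_t) \propto Q_{r,s}\, p(u_t \mid \gamma_s)$ (this is the factor that appears in the forward recursion of Section 3.3). The Gaussian form and all numbers below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # homogeneous transition matrix (illustrative)
gamma = np.array([-0.5, 0.7])       # state-specific means of p(u | gamma_r)
tau2 = 1.0                          # precision of the covariate model

def transition_given_u(r, u):
    """Effective nonhomogeneous transition row: Q[r, s] * N(u; gamma_s, 1/tau2)."""
    w = Q[r] * norm.pdf(u, loc=gamma, scale=np.sqrt(1.0 / tau2))
    return w / w.sum()

print(transition_given_u(0, u=-1.0))  # a low u favors the state with gamma = -0.5
print(transition_given_u(0, u=+1.0))  # a high u pulls probability toward state 2
```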
3 Efficient Posterior Simulation

Here, we present MCMC methods for posterior inference under the models developed in Section 2, beginning with model M1 and adapting this to external covariates in Section 2.3. To obtain the full probability model, we introduce latent parameters $\theta = \{\theta_t = (\mu_t, \Sigma_t) : t = 1, \ldots, T\}$ such that the first stage in (2) is replaced with $z_t \mid \theta_t \overset{ind}{\sim} \mathrm{N}(z_t; \theta_t)$ and $\theta_t \mid h_t, G_{h_t} \overset{ind}{\sim} G_{h_t}$, for $t = 1, \ldots, T$.

The standard approach to posterior simulation from DP-based hierarchical models involves marginalization of the random mixing distributions $G_r$ in (2) over their DP priors. Conditionally on $h$, the vector of latent mixing parameters breaks down into state-specific subvectors $\theta^r = \{\theta_t : h_t = r\}$, $r = 1, \ldots, R$, such that the distribution of each $\theta^r$ is built from independent $G_r$ distributions for the $\theta_t$ corresponding to state $r$. Thus, the full posterior can be written as
$$\Pr(h \mid Q) \prod_{r=1}^{R} \pi(\alpha_r)\, \pi(\psi_r)\, \pi(Q_r)\, \Pr(\theta^r \mid h, \alpha_r, \psi_r)\, \mathrm{DP}(G_r; \alpha_r^{\star}, G_{r0}^{\star}) \prod_{t=1}^{T} \mathrm{N}(z_t; \theta_t),$$
using results from Blackwell and MacQueen (1973) and Antoniak (1974). Here, $\Pr(\theta^r \mid h, \alpha_r, \psi_r)$ is the Pólya urn marginal prior for $\theta^r$; $\alpha_r^{\star} = \alpha_r + n_r$ (where $n_r = |\{t : h_t = r\}|$); and $G_{r0}^{\star}(\cdot) \equiv G_{r0}^{\star}(\cdot \mid h, \theta^r, \alpha_r, \psi_r) = (\alpha_r + n_r)^{-1}\left[\alpha_r\, dG_0(\cdot; \psi_r) + \sum_{\{t : h_t = r\}} \delta_{\theta_t}(\cdot)\right]$. This posterior can be sampled by extending standard MCMC techniques for DP mixtures (e.g., Neal 2000; Gelfand and Kottas 2002). However, marginalization over the $G_r$ requires that each pair $(\theta_t, h_t)$ must be sampled jointly, conditional on the remaining parameters $(\theta_{t'}, h_{t'})$, for all $t' \neq t$. This is possible, but inefficient, through use of a Metropolis-Hastings step with proposal distribution built from a marginal $\Pr(h_t = r) \propto Q_{h_{t-1},r}\, Q_{r,h_{t+1}}$, $r = 1, \ldots, R$, and a conditional for $\theta_t \mid h_t = r$ given by the Pólya urn prior full conditional arising from $\Pr(\theta^r \mid h, \alpha_r, \psi_r)$.

3.1 Blocked Gibbs with Forward-Backward Sampling

The posterior simulation approach discussed above requires updating each $h_t$ one at a time, whereas forward-backward sampling for the entire state vector $h$ is a substantially more efficient method for exploring the state space (see, e.g., Scott 2002). To implement forward-backward sampling, we need to evaluate the joint probability mass function for states $(h_{t-1}, h_t)$ conditional on the incomplete data vector $\{z_1, \ldots, z_t\}$ and relevant model parameters, which include the random mixing distributions $\{G_1, \ldots, G_R\}$. Therefore, to compute state probabilities, it is necessary to obtain realizations for each $G_r$ in the course of the MCMC algorithm. The blocked Gibbs sampling approach for DP mixture models (Ishwaran and James 2001) provides a natural approach wherein the entire MCMC method is based on a finite truncation approximation of the DP, using its stick-breaking definition (Sethuraman 1994). Based on this definition, a $\mathrm{DP}(\alpha, G_0)$ realization is almost surely a discrete distribution with a countable number of possible values drawn i.i.d. from $G_0$, and corresponding weights that are built from i.i.d. $\beta(1, \alpha)$ variables through stick-breaking. (We use $\beta(a, b)$ to denote the Beta distribution with mean $a/(a + b)$.) As well as being the consistent choice if the truncated distributions are used in state vector draws, blocked Gibbs can lead to very efficient sampling for the complete posterior.

Using the DP stick-breaking representation, we replace each $G_r$ in model M1 with a truncation approximation. Specifically, for specified (finite) $L$, we work with
$$G_r^L(\cdot) = \sum_{l=1}^{L} \omega_{l,r}\, \delta_{\tilde\theta_{l,r}}(\cdot),$$
where the $\tilde\theta_{l,r} = (\tilde\mu_{l,r}, \tilde\Sigma_{l,r})$, $l = 1, \ldots, L$, are i.i.d. $G_0(\psi_r)$, and the finite stick-breaking prior for $\omega_r = (\omega_{1,r}, \ldots, \omega_{L,r})$ (denoted by $P_L(\omega_r \mid 1, \alpha_r)$) is defined constructively by
$$\zeta_1, \ldots, \zeta_{L-1} \overset{iid}{\sim} \beta(1, \alpha_r), \quad \zeta_L = 1; \quad \text{and for } l = 1, \ldots, L: \quad \omega_{l,r} = \zeta_l \prod_{s=1}^{l-1} (1 - \zeta_s). \qquad (3)$$
Hence, each $G_r^L$ is defined by the set of $L$ location-scale parameters $\tilde\theta_r = (\tilde\theta_{1,r}, \ldots, \tilde\theta_{L,r})$ and weights $\omega_r$. Guidelines to choose the truncation level $L$, up to any desired accuracy, can be obtained, e.g., from Ishwaran and Zarepour (2000).

The first stage of model (2) is replaced with $z_t \mid h_t, (\omega_{h_t}, \tilde\theta_{h_t}) \overset{ind}{\sim} \sum_{l=1}^{L} \omega_{l,h_t}\, \mathrm{N}(z_t; \tilde\theta_{l,h_t})$, $t = 1, \ldots, T$. The limiting case of this finite mixture model (as $L \to \infty$) is the countable DP mixture model $f(z_t; G_{h_t}) = \int \mathrm{N}(z_t; \theta)\, dG_{h_t}(\theta)$ in (2). Again, we can introduce latent parameters $\theta_t = (\mu_t, \Sigma_t)$ to expand the first stage specification to $z_t \mid \theta_t \overset{ind}{\sim} \mathrm{N}(z_t; \theta_t)$ and $\theta_t \mid h_t, (\omega_{h_t}, \tilde\theta_{h_t}) \overset{ind}{\sim} G_{h_t}^L$, for $t = 1, \ldots, T$. Alternatively, since $\theta_t = \tilde\theta_{l,h_t}$ with probability $\omega_{l,h_t}$, we can work with configuration variables $k = (k_1, \ldots, k_T)$, where each $k_t$ takes values in $\{1, \ldots, L\}$, such that, conditionally on $h_t$, $k_t = l$ if and only if $\theta_t = \tilde\theta_{l,h_t}$. Hence, model M1 with the DP truncation approximation can be expressed in the following hierarchical form:
$$z_t \mid \tilde\theta_{h_t}, k_t \overset{ind}{\sim} \mathrm{N}(z_t; \tilde\theta_{k_t,h_t}), \quad t = 1, \ldots, T$$
$$k_t \mid h_t, \omega_{h_t} \overset{ind}{\sim} \sum_{l=1}^{L} \omega_{l,h_t}\, \delta_l(k_t), \quad t = 1, \ldots, T \qquad (4)$$
$$\omega_r, \tilde\theta_r \mid \alpha_r, \psi_r \overset{ind}{\sim} P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} dG_0(\tilde\theta_{l,r}; \psi_r), \quad r = 1, \ldots, R,$$
with $h \mid Q \sim \Pr(h \mid Q) = \prod_{t=2}^{T} Q_{h_{t-1},h_t}$, and the hyperpriors for $\alpha$, $\psi$, and $Q$ given in Section 2.2.
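For later use, a small helper that evaluates the truncated mixture density $f(z; G_r^L) = \sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(z; \tilde\mu_{l,r}, \tilde\Sigma_{l,r})$, which appears both in the forward recursion of Section 3.1 and in the implied conditional regression of Section 3.2. The argument names and array shapes are illustrative conventions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(z, w_r, mu_r, Sigma_r):
    """f(z; G_r^L) = sum_l w_r[l] * N(z; mu_r[l], Sigma_r[l]) for one state r.

    w_r: (L,) stick-breaking weights; mu_r: (L, d) atom means;
    Sigma_r: (L, d, d) atom covariance matrices.
    """
    return sum(w * multivariate_normal.pdf(z, mean=m, cov=S)
               for w, m, S in zip(w_r, mu_r, Sigma_r))
```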
Denote by $\phi$ the vector comprising model parameters $\alpha$, $\psi$, $k$, $Q$, and $\{(\omega_r, \tilde\theta_r) : r = 1, \ldots, R\}$. The full posterior, $\Pr(\phi, h \mid D)$, corresponding to model (4) is now proportional to
$$\Pr(h \mid Q) \prod_{r=1}^{R} \left[ \pi(\alpha_r)\, \pi(\psi_r)\, \pi(Q_r)\, P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} dG_0(\tilde\theta_{l,r}; \psi_r) \prod_{\{t: h_t = r\}} \mathrm{N}(z_t; \tilde\theta_{k_t,r}) \left( \sum_{l=1}^{L} \omega_{l,r}\, \delta_l(k_t) \right) \right].$$
Here, the key observation is that, conditionally on $h$, the first two stages of model (4),
$$\prod_{t=1}^{T} \Pr(z_t, k_t \mid h_t, (\omega_{h_t}, \tilde\theta_{h_t})) = \prod_{t=1}^{T} \mathrm{N}(z_t; \tilde\theta_{k_t,h_t}) \left\{ \sum_{l=1}^{L} \omega_{l,h_t}\, \delta_l(k_t) \right\},$$
can be expressed in the state-specific form $\prod_{r=1}^{R} \left[ \prod_{\{t: h_t = r\}} \mathrm{N}(z_t; \tilde\theta_{k_t,r}) \left\{ \sum_{l=1}^{L} \omega_{l,r}\, \delta_l(k_t) \right\} \right]$. To explore the full posterior, we develop an MCMC approach that combines Gibbs sampling steps for the parameters in $\phi$ with forward-backward sampling for the state vector $h$. We discuss the latter next, deferring to Appendix B the details of the Gibbs sampler for all other parameters.

As discussed above, sampling the truncated random mixing distribution $G_r^L \equiv (\omega_r, \tilde\theta_r)$ for each state $r$ enables use of forward-backward recursive sampling from the posterior full conditional distribution, $\Pr(h \mid \phi, D)$. Note that this conditional distribution can be written, in general, as $\Pr(h_T \mid \phi, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid \{h_{T-t+1}, \ldots, h_T\}, \phi, D)$, whereas under the hidden Markov model structure it simplifies to
$$\Pr(h \mid \phi, D) = \Pr(h_T \mid \phi, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid h_{T-t+1}, \phi, \{z_1, \ldots, z_{T-t+1}\}). \qquad (5)$$
Hence, the state vector can be updated as a block in each MCMC iteration by sampling from each component in (5).

To this end, the forward-backward sampling scheme begins by recursively calculating the forward matrices $F^{(t)}$, for $t = 2, \ldots, T$, where $F^{(t)}_{r,s} = \Pr(h_{t-1} = r, h_t = s \mid \phi, \{z_1, \ldots, z_t\})$, for $r, s = 1, \ldots, R$. Thus, $F^{(t)}$ defines the joint distribution for $(h_{t-1}, h_t)$ given model parameters and data up to time $t$. For $t = 3, \ldots, T$, $F^{(t)}$ is obtained from $F^{(t-1)}$ through the following recursive calculation:
$$F^{(t)}_{r,s} \propto \Pr(h_{t-1} = r, h_t = s, z_t \mid \phi, \{z_1, \ldots, z_{t-1}\}) = \Pr(h_t = s \mid h_{t-1} = r, \phi)\, \Pr(z_t \mid h_t = s, \phi)\, \Pr(h_{t-1} = r \mid \phi, \{z_1, \ldots, z_{t-1}\})$$
$$= Q_{r,s} \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde\theta_{l,s}) \sum_{i=1}^{R} \Pr(h_{t-2} = i, h_{t-1} = r \mid \phi, \{z_1, \ldots, z_{t-1}\}) = Q_{r,s} \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde\theta_{l,s}) \sum_{i=1}^{R} F^{(t-1)}_{i,r}, \qquad (6)$$
where the proportionality constant is obtained from $\sum_{r=1}^{R} \sum_{s=1}^{R} F^{(t)}_{r,s} = 1$. For $t = 2$, a similar calculation yields $F^{(2)}_{r,s} \propto Q_{r,s} \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_2; \tilde\theta_{l,s}) \sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(z_1; \tilde\theta_{l,r})$, where, again, the proportionality constant results from $\sum_{r=1}^{R} \sum_{s=1}^{R} F^{(2)}_{r,s} = 1$.

Next, exploiting the form in (5), the (stochastic) backward sampling step begins by drawing $h_T$ according to $\Pr(h_T = r \mid \phi, D) = \sum_{i=1}^{R} \Pr(h_{T-1} = i, h_T = r \mid \phi, D) = \sum_{i=1}^{R} F^{(T)}_{i,r}$, for $r = 1, \ldots, R$. Sampling from (5) is then completed by drawing, for each $t = T-1, T-2, \ldots, 1$, from $\Pr(h_t = r \mid h_{t+1}, \phi, \{z_1, \ldots, z_{t+1}\}) \propto \Pr(h_t = r, h_{t+1} \mid \phi, \{z_1, \ldots, z_{t+1}\}) = F^{(t+1)}_{r,h_{t+1}}$, for $r = 1, \ldots, R$, where the proportionality constant arises from $\sum_{r=1}^{R} F^{(t+1)}_{r,h_{t+1}}$.
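A minimal sketch of the forward-backward block update for $h$ described above. It reuses the mixture_density helper from the previous snippet and assumes numpy as np and a Generator rng; the argument conventions are illustrative:

```python
def forward_backward_sample(z, Q, w, mu, Sigma, rng):
    """One draw of the state vector h via the recursions (5)-(6).

    z: (T, d) data; Q: (R, R) transition matrix; w, mu, Sigma: per-state
    truncated DP parameters, indexed as w[r], mu[r], Sigma[r].
    """
    T, R = len(z), len(Q)
    like = np.array([[mixture_density(z[t], w[r], mu[r], Sigma[r])
                      for r in range(R)] for t in range(T)])   # like[t, r] = f(z_t; G_r^L)
    F = np.empty((T, R, R))          # F[t, r, s] ~ Pr(h_t = r, h_{t+1} = s | z_{1:t+1})
    F[1] = like[0][:, None] * Q * like[1][None, :]              # h_1 uniform a priori
    F[1] /= F[1].sum()
    for t in range(2, T):            # forward pass, normalizing each F^{(t)}
        F[t] = F[t - 1].sum(axis=0)[:, None] * Q * like[t][None, :]
        F[t] /= F[t].sum()
    h = np.empty(T, dtype=int)       # backward (stochastic) pass
    h[T - 1] = rng.choice(R, p=F[T - 1].sum(axis=0))
    for t in range(T - 2, -1, -1):
        p = F[t + 1][:, h[t + 1]]
        h[t] = rng.choice(R, p=p / p.sum())
    return h
```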
3.2 Inference and Forecasting for Regression Relationships

The posterior samples for the truncated DP parameters, $\{(\omega_{l,r}, (\tilde\mu_{l,r}, \tilde\Sigma_{l,r})) : l = 1, \ldots, L\}$, for each state $r = 1, \ldots, R$, can be used to develop inference for the state-specific regressions. In particular, conditional on the posterior draw for the state-specific mixing distribution, $G_r^L$, the posterior realization for the conditional response density, $f(y \mid x; G_r)$, corresponding to state $r$ is
$$f(y \mid x; G_r^L) = \frac{f(x, y; G_r^L)}{f(x; G_r^L)} = \frac{\sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x, y; \tilde\mu_{l,r}, \tilde\Sigma_{l,r})}{\sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x; \tilde\mu^x_{l,r}, \tilde\Sigma^{xx}_{l,r})} \qquad (7)$$
for any specified value $(x, y)$. In addition, the structure of conditional moments for the normal mixture kernel enables posterior sampling of the state-specific conditional mean regression functions without having to compute the corresponding conditional density. Specifically,
$$E\left[Y \mid x; G_r^L\right] = \frac{1}{f(x; G_r^L)} \sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(x; \tilde\mu^x_{l,r}, \tilde\Sigma^{xx}_{l,r}) \left[ \tilde\mu^y_{l,r} + \tilde\Sigma^{yx}_{l,r} (\tilde\Sigma^{xx}_{l,r})^{-1} (x - \tilde\mu^x_{l,r}) \right],$$
which, evaluated over a grid in $x$, yields posterior realizations of the conditional mean regression function for each state.

Moreover, of interest is prediction in future years (forecasting) for the joint response-covariate distribution and the corresponding implied conditional regression relationship. Illustrating with year $T + 1$, the full model that includes the future covariate-response vector $(x_{T+1}, y_{T+1})$ and corresponding regime state $h_{T+1}$ can be expressed as
$$\Pr((x_{T+1}, y_{T+1}), h_{T+1}, \phi, h \mid D) = \Pr(\phi, h \mid D)\, Q_{h_T, h_{T+1}} \sum_{l=1}^{L} \omega_{l,h_{T+1}}\, \mathrm{N}(x_{T+1}, y_{T+1}; \tilde\theta_{l,h_{T+1}}).$$
Hence, the posterior samples for $(\phi, h)$, along with draws for the new regime state $h_{T+1}$, driven by $Q$ and $h_T$, can be used to estimate the joint posterior forecast density $\Pr(x_{T+1}, y_{T+1} \mid D)$. More generally, using the posterior samples for $(\phi, h)$ and $h_{T+1}$, we obtain posterior realizations for the conditional response density in year $T + 1$ through $f(y \mid x; G^L_{h_{T+1}}) = f(x, y; G^L_{h_{T+1}}) / f(x; G^L_{h_{T+1}})$. Note that, in contrast to (7), these realizations incorporate posterior uncertainty in $h_{T+1}$. This type of inference is illustrated with the data example of Section 4.
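A direct transcription of the conditional mean formula above for the single-covariate case ($d_x = 1$); a sketch reusing the conventions of the earlier snippets (scipy.stats.norm is assumed imported, and w_r, mu_r, Sigma_r are the truncated DP parameters for one state):

```python
def conditional_mean(x, w_r, mu_r, Sigma_r):
    """E[Y | x; G_r^L] for d_x = 1, where mu_r[l] = (mu_x, mu_y) and
    Sigma_r[l] = [[S_xx, S_xy], [S_yx, S_y]]."""
    num, den = 0.0, 0.0
    for w, m, S in zip(w_r, mu_r, Sigma_r):
        marg = w * norm.pdf(x, loc=m[0], scale=np.sqrt(S[0, 0]))  # w * N(x; mu_x, S_xx)
        num += marg * (m[1] + S[1, 0] / S[0, 0] * (x - m[0]))     # kernel's E[Y | x]
        den += marg
    return num / den   # evaluated over a grid in x, this traces the regression curve
```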
3.3 Extension to External Covariates

Posterior inference under model M2, discussed in Section 2.3, can be implemented with a straightforward extension of the MCMC algorithm of Section 3.1. The parameters $\gamma$ can be sampled conditional on only $u$ and the state vector $h$. Regarding the other model parameters, only the MCMC draws that involve $h$ need to be altered. In particular, the starting point is again an expression analogous to (5) for the posterior full conditional for $h$. Specifically, $\Pr(h \mid \phi, \gamma, D) = \Pr(h_T \mid \phi, \gamma, D) \prod_{t=1}^{T-1} \Pr(h_{T-t} \mid h_{T-t+1}, \phi, \gamma, \{(z_\ell, u_\ell) : \ell = 1, \ldots, T-t+1\})$. Note that now the data vector $D$ comprises $\{(z_t, u_t) : t = 1, \ldots, T\}$. For $t = 3, \ldots, T$, the recursive calculation of (6) for the forward matrices becomes
$$F^{(t)}_{r,s} \propto Q_{r,s}\, p(u_t \mid \gamma_s) \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_t; \tilde\theta_{l,s}) \sum_{i=1}^{R} F^{(t-1)}_{i,r},$$
with the proportionality constant obtained from $\sum_{r=1}^{R} \sum_{s=1}^{R} F^{(t)}_{r,s} = 1$. Moreover, $F^{(2)}_{r,s} \propto Q_{r,s}\, p(u_2 \mid \gamma_s)\, p(u_1 \mid \gamma_r) \sum_{l=1}^{L} \omega_{l,s}\, \mathrm{N}(z_2; \tilde\theta_{l,s}) \sum_{l=1}^{L} \omega_{l,r}\, \mathrm{N}(z_1; \tilde\theta_{l,r})$, where $\sum_{r=1}^{R} \sum_{s=1}^{R} F^{(2)}_{r,s} = 1$. Finally, the backward sampling step proceeds as described in Section 3.1, using probabilities from the forward matrices $F^{(T)}, F^{(T-1)}, \ldots, F^{(2)}$.

4 Analysis of Stock-Recruitment Relationships Under Environmental Regime Shifts

The relationship between the number of mature individuals of a species (stock) and the production of offspring (recruitment) is fundamental to the behavior of any ecological system. This has special relevance in fisheries research, where the stock-recruitment relationship applies directly to decision problems of fishery management with serious policy implications (e.g., Quinn and Deriso 1999). A standard ecological modeling assumption holds that as stock abundance increases, successful recruitment per individual (reproductive success) decreases. However, a wide variety of factors will influence this reproductive relationship, and there are many competing models for the influence of biological and physical mechanisms. Munch et al. (2005) present an overview of the literature on parametric modeling for stock-recruitment functions, arguing for the utility of standard semiparametric Gaussian process regression modeling. In the same spirit, albeit under the more general DP mixture modeling framework developed in Section 2, our focus is to allow flexible regression to capture the nature of recruitment dependence upon stock without making parametric assumptions for either the stock-recruitment function or the errors around it.

An added complexity in studying stock-recruitment relationships is introduced by ecosystem regime switching. It has been observed that rapid shifts in the ecosystem state can occur, during which biological relationships, such as that between stock and recruitment, will undergo major change. This has been observed in the North Pacific in particular (McGowan et al. 1998; Hare and Mantua 2000). Although empirical evidence of regime shifts is well documented and there have been attempts to establish mechanisms for the effect of this switching on stock-recruitment (e.g., Jacobson et al. 2005), the relationship between the physical effects of regime shifts and their biological manifestation is still unclear. This presents an ideal setting for Markov-dependent switching regression models due to their ability to link observed processes that occur on different scales (in this case, biological and physical) and are correlated in an undetermined manner.

To illustrate our Markov switching regression models, we use data on annual stock and recruitment for Japanese sardine from years 1951 to 1990.

[Figure 1 near here: scatterplots of log egg production (horizontal axis) versus log reproductive success (vertical axis), with points labeled by year.]

Figure 1: The left panel plots the data with the regime allocation from Wada and Jacobson (1998). The right panel includes draws from the bivariate normal distribution, which, under each regime, is defined by the marginal mean and covariance matrix for the location of a single DP mixture component (see Section 4 for details). In both panels, black and grey color indicate the unfavorable and favorable regime, respectively.

Wada and Jacobson (1998) use modeling of catch abundance and egg count samples to estimate $R$, the successful recruits of age less than one (in multiples of $10^6$ fish). With estimated annual egg production $E$ (in multiples of $10^{12}$ eggs) used as a proxy for stock abundance, they investigate the relationship between $\log(E)$ and log reproductive success, $\log(R/E)$. Japanese sardine have been observed to switch between favorable and unfavorable feeding regime states related to the North Pacific environmental regime switching discussed above. Based upon a predetermined regime allocation (see Figure 1), Wada and Jacobson (1998) fit a linear regression relationship for $\log(E)$ vs $\log(R/E)$ within each regime.
We consider an analysis of the Japanese sardine data using the modeling framework developed in Section 2, which relaxes parametric (linear) regression assumptions and allows for simultaneous estimation of regime state allocation and regime-specific stock-recruitment relationships. As in the original analysis by Wada and Jacobson (1998), this model formulation does not take into account temporal dependence between successive observations from the same regime. This suits the purposes of our application, but one can envision many settings where a structured time series model is more appropriate than the fully nonparametric approach. Although the low dimensionality of this example is useful for illustrative purposes, the techniques will perhaps be most powerful in the exploration of higher dimensional datasets where such temporal structure is not assumed (an example of implied conditional regression in higher dimensions is studied in Taddy and Kottas 2009).

We first apply model M1 in (4) to the sardine data, $z_t = (\log(E_t), \log(R_t/E_t))$, available for $T = 40$ years from 1951 to 1990, with the underlying states $h_t$ defined by either the unfavorable or favorable feeding regime (with values 1 or 2, respectively). A (conservative) truncation of $L = 100$ was used in the stick-breaking priors. Regarding the prior hyperparameters, we set $a_\alpha = 2$ and $b_\alpha = 0.2$ in the gamma prior for $\alpha$. The prior for $\psi_r$ is specified as outlined in Appendix A such that, conditional on the prior regime allocation taken from Wada and Jacobson (1998), $a_{m_1}$ and $a_{m_2}$ are set to the data means, (5, 3) and (5, 5) for the unfavorable and favorable regime observations, respectively, while $B_{m_1}$ and $(a_{V_1} - 3)^{-1} B_{V_1}$, with diagonal (5.3, 2.6) and off-diagonal $-3.1$, and $B_{m_2}$ and $(a_{V_2} - 3)^{-1} B_{V_2}$, with diagonal (4.5, 1.4) and off-diagonal $-2.0$, are set to the observed covariance matrix for each regime. The $B_{S_r}$, for $r = 1, 2$, are diagonal matrices and are specified by setting the diagonal entries of $a_{S_r} B_{S_r}$ equal to (7.8, 7.7), which defines one quarter of the data range. Finally, we set $\nu_1 = \nu_2 = a_{V_1} = a_{V_2} = a_{S_1} = a_{S_2} = 2(d + 1) = 6$. The prior for $Q$ is induced by a $\beta(3, 1.5)$ prior for the probability of staying in the same state, which reflects the relative rarity of regime shifts.

The data and prior allocation are shown in Figure 1, along with bivariate normal draws based on the marginal mean and covariance matrix for the location, $\mu_r$, of a single component of the DP mixture, for each of the two regimes. Hence, the right panel of Figure 1 shows draws from the prior expectation of the random mixing distribution for the $\mu_r$ (i.e., from state-specific normal distributions with means $E[\mu_r] = a_{m_r}$ and variance $\mathrm{var}(\mu_r) = \mathrm{var}(m_r) + E[V_r] = B_{m_r} + (a_{V_r} - 3)^{-1} B_{V_r}$). Although this does not include prior uncertainty in the $\mu_r$ due to the DP mixture, it clearly shows that the prior specification has not overly restricted the mixture components.

As described above, the sardine feeding regime is part of a larger ecosystem state for this region of the North Pacific. The physical variables that are linked to the ecosystem state switching can be used as external covariates for the hidden Markov chain. Hence, to illustrate the modeling approach of Section 2.3, we choose a physical variable as the single external covariate, specifically, the winter average Pacific decadal oscillation (PDO) index, which is highly correlated with biological regime switching (Hare and Mantua 2000).
The PDO index provides the first principal component of an aggregate of North Pacific sea surface temperatures. Although not directly responsible, sea surface temperature is believed to be a proxy for mechanisms such as current flow that control the regime switching (MacCall 2002). Therefore, with vector $u$ comprising winter average PDO values from 1951 to 1990, we apply model M2 to the sardine data, working with a normal PDO distribution with state-specific mean. Hence, we assume $u_t \mid h_t \overset{ind}{\sim} \mathrm{N}(u_t; \gamma_{h_t}, \tau^{-2})$, with (independent) normal priors for $\gamma = \{\gamma_1, \gamma_2\}$ and a gamma prior for $\tau^2$; in particular, $\gamma_1 \sim \mathrm{N}(-0.44, 0.26)$, $\gamma_2 \sim \mathrm{N}(0.73, 0.26)$, and $\tau^2 \sim \Gamma(0.5, 0.125)$. The $\gamma_r$ prior mean values are average winter PDO for two ten-year periods that are generally accepted to fall within each ecosystem regime (Hare and Mantua 2000); the common $\gamma_r$ prior variance is the pooled variance for these mean estimates, and the prior median for $\tau^{-2}$ is chosen to provide some overlap between the prior PDO densities for each regime. Extending the MCMC algorithm of Section 3.1 to sample the $\gamma_r$ and $\tau^2$ is straightforward, since their posterior full conditionals, conditional on $u$ and $h$, are given by normal and gamma distributions, respectively. The posterior means for $\gamma_1$ and $\gamma_2$ are given by $-0.65$ and $0.69$, with 90% posterior intervals of $(-0.89, -0.40)$ and $(0.30, 1.10)$, respectively, and $\tau^{-2}$ has posterior mean 0.68 with a 90% posterior interval of $(0.45, 1.00)$.
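A sketch of the conjugate Gibbs updates for $\gamma_r$ and $\tau^2$ noted above, assuming (as an illustrative convention, not necessarily the authors' parameterization) a shape-rate gamma prior for $\tau^2$ and numpy as np with a Generator rng:

```python
def update_gamma_tau(u, h, tau2, prior_mean, prior_var, a_tau, b_tau, rng, R=2):
    """Gibbs updates for state-specific means gamma_r and precision tau2,
    under u_t | h_t ~ N(gamma_{h_t}, 1/tau2) with normal/gamma priors."""
    gamma = np.empty(R)
    for r in range(R):
        u_r = u[h == r]
        prec = 1.0 / prior_var[r] + tau2 * len(u_r)        # posterior precision
        mean = (prior_mean[r] / prior_var[r] + tau2 * u_r.sum()) / prec
        gamma[r] = rng.normal(mean, np.sqrt(1.0 / prec))   # normal full conditional
    resid2 = ((u - gamma[h]) ** 2).sum()
    tau2 = rng.gamma(a_tau + 0.5 * len(u), 1.0 / (b_tau + 0.5 * resid2))
    return gamma, tau2                                     # gamma full conditional
```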
Results from the analyses under the two models are presented in Figures 2–4. The regime-specific posterior mean implied conditional densities, $E[f(\log(R/E) \mid \log(E); G_r^L) \mid D]$, evaluated over a $50 \times 50$ grid, are shown in Figure 2.

[Figure 2 near here: density surfaces over log egg production (horizontal axis) and log reproductive success (vertical axis).]

Figure 2: Mean posterior conditional density surface for each regime. The unfavorable regime is plotted on the left panels and the favorable on the right panels. The top row corresponds to the analysis from model M1 and the bottom row to model M2, which includes PDO as an external covariate. In each panel, the grey points represent the data, i.e., the observed values for $(\log(E_t), \log(R_t/E_t))$, $t = 1, \ldots, 40$.

Figure 3 shows the posterior mean for the state vector $h$ as well as posterior point and interval estimates for the mean regression functions, $E[\log(R/E) \mid \log(E); G_r^L]$, for each regime. The impact of the inclusion of PDO as an external variable is evident. In the absence of such information, the observations for years 1988–1990 are more likely to be allocated to the favorable regime due to the rarity of regime shifting (i.e., due to posterior realizations of $Q$ which put a high probability on staying in the same state). However, with the inclusion of PDO, these years are more probably associated with the unfavorable regime. Also, the posterior estimates for the regime-specific mean regression curves do not exclude the possibility of a linear mean relationship between log egg production and log reproductive success. Hence, it is interesting to note that the more general DP mixture switching regression modeling framework provides a certain level of support to the original assumptions of Wada and Jacobson (1998).

[Figure 3 near here: left panels show mean state allocation by year; right panels show log egg production versus log reproductive success.]

Figure 3: The left panels show the posterior mean regime membership by year, where 0 corresponds to the unfavorable regime. The right panels include posterior point and 90% interval estimates for the conditional mean regression function under each regime (interval estimates are denoted by dashed lines for the favorable regime, and by dotted lines for the unfavorable). Also included on the right panels are the observed values for $(\log(E_t), \log(R_t/E_t))$, $t = 1, \ldots, 40$, denoted by the grey points. The top row corresponds to model M1 and the bottom row to model M2, which includes PDO as an external covariate.

Wada and Jacobson (1998) also provide egg production and estimated recruit numbers for the years 1991–1995, and winter PDO is readily available. The recruit estimates after 1990 are regarded as less accurate than field data from previous years, and for this reason they were not included in our original analysis. However, prediction for this estimated out-of-sample data provides a useful criterion for model comparison. Hence, repeated prediction conditional on each existing parameter state was incorporated into the MCMC algorithm. In each successive year, a regime state is drawn conditional on the sampled regime corresponding to the previous year, and prediction for log reproductive success is provided by the associated conditional response density in year $199{*}$, $f(\log(R/E) \mid \log(E_{199*}); G^L_{h_{199*}})$, where $199{*}$ runs from year 1991 to 1995. The regime state is then resampled conditional on actual log reproductive success (i.e., conditional on $\log(R_{199*}/E_{199*})$ and $\log(E_{199*})$), and the process is repeated with this state used as the basis for the next year's prediction.

[Figure 4 near here: conditional probability density (vertical axis) over log reproductive success (horizontal axis), one row per year.]

Figure 4: Predictive inference for years 1991–1995 (by row, moving from 1991 at top to 1995 at bottom). The left column corresponds to model M1, and the right column to model M2 with PDO as an external covariate. Each panel plots posterior mean and 90% interval estimates (solid and dashed lines, respectively) for the one-step-ahead conditional density, corresponding to $\log(E)$ values for 1991–1995 of [7.58, 6.51, 6.13, 4.67, 4.93]. The grey vertical lines mark the true log reproductive success for each year reported in Wada and Jacobson (1998).

More precisely, considering model M1, prediction for year 1991 proceeds exactly as outlined in Section 3.2. Next, for year 1992, we sample the previous regime from $\Pr(h_{1991} = r \mid h_{1990}, G^L_1, G^L_2, E_{1991}, R_{1991}) \propto f(\log(E_{1991}), \log(R_{1991}/E_{1991}); G^L_r)\, Q_{h_{1990},r}$, for $r = 1, 2$, and use the sampled state $r$ for prediction through $f(\log(R/E) \mid \log(E_{1992}); G^L_r)$. Prediction for years 1993–1995 proceeds in an analogous fashion, and the general approach is similar for prediction under model M2. Since this occurs at each MCMC iteration, we are averaging over uncertainty in both $h_{199*}$ and the $G^L_r$. The results are shown in Figure 4, and it can be seen that the introduction of PDO as an external covariate leads to subtle changes in conditional predictive information.
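A sketch of one MCMC iteration of the sequential prediction scheme just described, reusing the mixture_density helper from Section 3.1; the function and argument names are illustrative:

```python
def one_step_ahead(h_prev, z_prev, Q, w, mu, Sigma, x_new, y_grid, rng):
    """One posterior draw of the one-step-ahead conditional density f(y | x_new):
    resample last year's regime given its observed z_prev = (x, y), advance the
    hidden chain, then condition on x_new (normalizing numerically on y_grid)."""
    R = len(Q)
    p = np.array([mixture_density(z_prev, w[r], mu[r], Sigma[r]) * Q[h_prev, r]
                  for r in range(R)])
    h_last = rng.choice(R, p=p / p.sum())            # resampled previous regime
    h_new = rng.choice(R, p=Q[h_last])               # regime for the new year
    joint = np.array([mixture_density(np.array([x_new, y]),
                                      w[h_new], mu[h_new], Sigma[h_new])
                      for y in y_grid])
    return h_new, joint / np.trapz(joint, y_grid)    # f(y | x_new; G^L_{h_new})
```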
In particular, the predictions for year 1991 benefit from additional information about the regime state in this year (and in the preceding three years), resulting in a conditional response density for model M2 that is both more accurate and less dispersed than the one obtained under model M1. As the first-order Markov model is only informative for relatively short-term prediction, the distributions corresponding to both models become fairly diffuse in later years. However, model M2 assigned consistently higher one-step-ahead mean conditional probability at the true log reproductive success values, and the average total log probability assigned to observations from 1991–1995 was $-8.2$ for model M2, against $-8.6$ for model M1. The inference results reported in Figure 4 illustrate the posterior variability and non-standard shapes of the predicted conditional response densities. The quantification of this variability, as well as the capacity of the DP mixture switching regression models to capture non-standard features of the response distribution, are important aspects of the proposed nonparametric modeling framework.

5 Conclusion

We have presented a general framework for semiparametric hidden Markov switching regression. While the basic switching DP mixture regression methodology provides a powerful modeling technique in its own right, we feel that it is most practically important when combined with further parametric modeling for the effect of external covariates on state membership. Both modeling techniques, with or without external covariates, have been illustrated with the analysis of stock-recruitment data.

The general approach of having informative parametric modeling linked with nonparametric models through an underlying hidden stochastic process is both theoretically appealing and practically powerful. We believe that there is great potential for such models, since they provide an efficient way to bridge the difference in scale between two observed processes, and the MCMC algorithms presented in this paper can be the basis for extended techniques in other settings. We have focused on models for switching regression, but the methodology is applicable in more general settings involving hidden Markov model structure. In particular, since the switching occurs at the level of the joint distribution for response and covariates, the modeling approach is directly applicable to nonparametric density estimation through DP mixtures of multivariate normals for heterogeneous populations where switching between subpopulations occurs as a Markov chain. Furthermore, the modeling framework can be elaborated for problems where the multivariate normal is not a suitable choice for the DP mixture kernel. For instance, categorical covariates can be accommodated through mixed continuous-discrete kernels. Finally, our work in the development of the MCMC algorithm can be extended to incorporate stick-breaking priors other than the DP.

Appendix A: Prior Specification

Here, we discuss the approach to prior specification for the hyperparameters of model M1 developed in Section 2.2. Our approach is motivated by a setting where prior information is available on the state vector $h$, and the $\lambda_r$ parameters of $\pi(Q_r)$ are chosen based on prior expectation for the probabilities of moving from state $r$ to each state in a single time step. However, this prior information pertains only to the transition probabilities between states and does not fully identify the state components.
Thus, we need to provide enough information to facilitate identification of the mixture components and ensure that the transition probabilities defined by $Q$ refer to the intended states. On the other hand, the nonparametric regression is motivated by a desire to allow flexible inference about each regression component, and we thus seek a more automatic prior specification for each $\psi_r$.

Within the framework of our DP mixture implied conditional regression, it is possible to have each state-specific centering distribution, $G_0(\psi_r)$, associate the densities $\int \mathrm{N}(z; \mu, \Sigma)\, dG_r(\mu, \Sigma)$ with specific regions of the joint response-covariate space, without putting prior information on the shape of the conditional response density or regression curve within each region. Since the prior parameters $m_r$ and $V_r$ control the location of the normal kernels, the hyperparameters $a_{m_r}$, $B_{m_r}$, $a_{V_r}$, and $B_{V_r}$ can be used to express prior belief about the state-specific joint response-covariate distributions. Specifically, assume a prior guess for the mean and covariance matrix corresponding to the population for state $r$, where prior information for the covariance may only be available in the form of a diagonal matrix. Then, we can set $a_{m_r}$ equal to the prior mean, $B_{m_r}$ to the prior covariance, and choose $a_{V_r}$ and $B_{V_r}$ such that $E[V_r]$ is equal to the prior covariance (alternatively, $E[V_r^{-1}]$ can be set equal to the inverse of the prior covariance matrix, and we have observed the method to be robust to either specification).

In the absence of such prior information, one can use a data-dependent prior specification technique. Given a prior allocation of observations expressed as the state vector $h^\pi = (h^\pi_1, \ldots, h^\pi_T)$, each set $\{a_{m_r}, B_{m_r}, B_{V_r}\}$ can be specified through the mean and covariance of the data subset $\{z_t : h^\pi_t = r\}$. In particular, $a_{m_r}$ is set to the state-specific data mean, and both $B_{m_r}$ and $E[V_r] = (a_{V_r} - d - 1)^{-1} B_{V_r}$ are set to the state-specific data covariance. With care taken to ensure that it does not overly restrict the component locations, this approach provides an automatic prior specification that combines strong state allocation beliefs with weak information about the state-specific regression functions.

For the $S_r$ we seek only to scale the mixture components to the data, and thus we set all the $E(S_r) = a_{S_r} B_{S_r}$ equal to a diagonal matrix with each diagonal entry a quarter of the full data range for the respective dimension. The precision parameters $a_{V_r}$, $a_{S_r}$, and $\nu_r$, for $r = 1, \ldots, R$, are set to values slightly larger than $d + 2$; in practice, we have found $2(d + 1)$ to work well. Working with various data sets, including the one in Section 4, we have observed results to be insensitive to reasonable changes in this specification. In particular, experimentation with a variety of choices for the matrices $B_{S_r}$, indicating prior expectation of either more or less diffuse normal kernel components, resulted in robust posterior inference.
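A sketch of this data-dependent specification; the function name and dictionary layout are illustrative, with $a_{V_r} = a_{S_r} = 2(d+1)$ as recommended above:

```python
def default_hyperparameters(z, h_pi, a_V=6, a_S=6):
    """Specify {a_m_r, B_m_r, B_V_r, B_S_r} from a prior allocation h_pi
    (here a_V = a_S = 2(d+1) with d = 2), following Appendix A."""
    d = z.shape[1]
    data_range = z.max(axis=0) - z.min(axis=0)
    psi = {}
    for r in np.unique(h_pi):
        z_r = z[h_pi == r]
        cov_r = np.cov(z_r, rowvar=False)
        psi[r] = dict(a_m=z_r.mean(axis=0),              # prior mean of m_r
                      B_m=cov_r,                         # prior covariance of m_r
                      B_V=(a_V - d - 1) * cov_r,         # so that E[V_r] = cov_r
                      B_S=np.diag(data_range / 4) / a_S) # so that E[S_r] = a_S B_S
    return psi
```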
Specification of the hyperpriors on the DP precision parameters is facilitated by the role that each $\alpha_r$ plays in the prior distribution for the number of unique components among the $n_r$ latent mixing parameters $\theta_t = (\mu_t, \Sigma_t)$ corresponding to state $r$. For a given $n_r$ (i.e., conditional on $h$), we can use results from Antoniak (1974) to explore properties of this prior for different $\alpha_r$ values. For instance, the prior expected number of unique components in the set $\{\theta_t : h_t = r\}$ is approximately $\alpha_r \log[(n_r + \alpha_r)/\alpha_r]$, and this expression may be used to guide prior intuition about the $\alpha_r$.

Appendix B: MCMC Posterior Simulation

Here, we develop the approach to MCMC posterior simulation discussed in Section 3. Recall that the key to the finite stick-breaking algorithm is that we are able to use forward-backward recursive sampling of the posterior conditional distribution for $h$, as described in Section 3.1. Gibbs sampling details for all other parameters of model (4) are provided below.

First, for each $t = 1, \ldots, T$, $k_t$ has a discrete posterior full conditional distribution with values in $\{1, \ldots, L\}$ and corresponding probabilities $\omega_{l,h_t}\, \mathrm{N}(z_t; \tilde\theta_{l,h_t}) \big/ \sum_{m=1}^{L} \omega_{m,h_t}\, \mathrm{N}(z_t; \tilde\theta_{m,h_t})$, for $l = 1, \ldots, L$.

For each $r = 1, \ldots, R$, the posterior full conditional distribution for $\omega_r$ is proportional to $P_L(\omega_r \mid 1, \alpha_r) \prod_{\{t: h_t = r\}} \left\{\sum_{l=1}^{L} \omega_{l,r}\, \delta_l(k_t)\right\} = P_L(\omega_r \mid 1, \alpha_r) \prod_{l=1}^{L} \omega_{l,r}^{M_{l,r}}$, where $M_{l,r} = |\{t : h_t = r, k_t = l\}|$. Note that the $P_L(\omega_r \mid 1, \alpha_r)$ prior for $\omega_r$, defined constructively in (3), is given by
$$P_L(\omega_r \mid 1, \alpha_r) = \alpha_r^{L-1}\, \omega_{L,r}^{\alpha_r - 1}\, (1 - \omega_{1,r})^{-1}\, (1 - (\omega_{1,r} + \omega_{2,r}))^{-1} \cdots \left(1 - \sum_{l=1}^{L-2} \omega_{l,r}\right)^{-1}. \qquad (B.1)$$
Recall the generalized Dirichlet distribution $\mathrm{GD}(p; a, b)$ (Connor and Mosimann 1969) for a random vector $p = (p_1, \ldots, p_L)$, supported on the $L$-dimensional simplex, with density proportional to
$$p_1^{a_1 - 1} \cdots p_{L-1}^{a_{L-1} - 1}\, p_L^{b_{L-1} - 1}\, (1 - p_1)^{b_1 - (a_2 + b_2)} \cdots (1 - (p_1 + \cdots + p_{L-2}))^{b_{L-2} - (a_{L-1} + b_{L-1})},$$
where the parameters are $a = (a_1, \ldots, a_{L-1})$ and $b = (b_1, \ldots, b_{L-1})$. Then, $P_L(\omega_r \mid 1, \alpha_r) \equiv \mathrm{GD}(\omega_r; a, b)$ with $a = (1, \ldots, 1)$ and $b = (\alpha_r, \ldots, \alpha_r)$. Moreover, the $\prod_{l=1}^{L} \omega_{l,r}^{M_{l,r}}$ form is also proportional to a $\mathrm{GD}(\omega_r; a, b)$ distribution with $a = (M_{1,r} + 1, \ldots, M_{L-1,r} + 1)$ and $b = ((L - 1) + \sum_{l=2}^{L} M_{l,r}, \ldots, 2 + M_{L-1,r} + M_{L,r}, 1 + M_{L,r})$. Hence, the posterior full conditional for $\omega_r$ can be completed to a generalized Dirichlet distribution with parameters $a = (M_{1,r} + 1, \ldots, M_{L-1,r} + 1)$ and $b = (\alpha_r + \sum_{l=2}^{L} M_{l,r}, \alpha_r + \sum_{l=3}^{L} M_{l,r}, \ldots, \alpha_r + M_{L,r})$. This distribution can be sampled constructively by first drawing independent $\zeta_l \sim \beta(1 + M_{l,r}, \alpha_r + \sum_{s=l+1}^{L} M_{s,r})$, for $l = 1, \ldots, L-1$, and then setting $\omega_{1,r} = \zeta_1$; $\omega_{l,r} = \zeta_l \prod_{s=1}^{l-1}(1 - \zeta_s)$, $l = 2, \ldots, L-1$; and $\omega_{L,r} = 1 - \sum_{l=1}^{L-1} \omega_{l,r}$.
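A sketch of this constructive draw from the generalized Dirichlet full conditional (M_r is the vector of counts $M_{l,r}$ for one state; numpy as np and a Generator rng are assumed):

```python
def update_weights(M_r, alpha_r, rng):
    """Draw omega_r constructively: zeta_l ~ Beta(1 + M[l], alpha_r + sum_{s>l} M[s])."""
    L = len(M_r)
    tail = np.concatenate((np.cumsum(M_r[::-1])[::-1][1:], [0.0]))  # sum_{s>l} M[s]
    zeta = rng.beta(1.0 + M_r[:L - 1], alpha_r + tail[:L - 1])
    w = np.empty(L)
    w[0] = zeta[0]
    w[1:L - 1] = zeta[1:] * np.cumprod(1.0 - zeta)[:-1]   # zeta_l * prod_{s<l}(1 - zeta_s)
    w[L - 1] = 1.0 - w[:L - 1].sum()
    return w
```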
Next, for each $r = 1, \ldots, R$, the posterior full conditional distribution for $\tilde\theta_r$ is proportional to $\prod_{l=1}^{L} dG_0(\tilde\theta_{l,r}; \psi_r) \prod_{j=1}^{n_r^*} \prod_{\{t: h_t = r, k_t = k_j^*\}} \mathrm{N}(z_t; \tilde\theta_{k_j^*,r})$. Here, $n_r^*$ is the number of distinct values of the $k_t$ that correspond to the $r$-th state, i.e., the number of distinct $k_t$ for $t \in \{t : h_t = r\}$. These distinct values are denoted by $k_j^*$, $j = 1, \ldots, n_r^*$. Now, for all $l \notin \{k_j^* : j = 1, \ldots, n_r^*\}$, we can draw $\tilde\theta_{l,r}$ i.i.d. from $G_0(\psi_r)$. Otherwise, the posterior full conditional for $\tilde\theta_{k_j^*,r} \equiv (\tilde\mu^*_{j,r}, \tilde\Sigma^*_{j,r})$ is proportional to
$$\mathrm{N}(\tilde\mu^*_{j,r}; m_r, V_r)\, \mathrm{W}_{\nu_r}(\tilde\Sigma^{*-1}_{j,r}; S_r^{-1}) \prod_{\{t: h_t = r, k_t = k_j^*\}} \mathrm{N}(z_t; \tilde\mu^*_{j,r}, \tilde\Sigma^*_{j,r}),$$
and can be sampled by extending the Gibbs sampler to draw from the full conditional for $\tilde\mu^*_{j,r}$ and for $\tilde\Sigma^{*-1}_{j,r}$. The former is normal with covariance matrix $T_j = (V_r^{-1} + M_{j,r}\, \tilde\Sigma^{*-1}_{j,r})^{-1}$, where $M_{j,r} = |\{t : h_t = r, k_t = k_j^*\}|$, and mean vector $T_j (V_r^{-1} m_r + \tilde\Sigma^{*-1}_{j,r} \sum_{\{t: h_t = r, k_t = k_j^*\}} z_t)$. The latter is $\mathrm{W}_{\nu_r + M_{j,r}}(\cdot; (S_r + \sum_{\{t: h_t = r, k_t = k_j^*\}} (z_t - \tilde\mu^*_{j,r})(z_t - \tilde\mu^*_{j,r})^T)^{-1})$.

The posterior full conditional for the hyperparameters, $\psi_r = (m_r, V_r, S_r)$, can be simplified by marginalizing the joint posterior full conditional for $\tilde\theta_r$ and $\psi_r$ over all the $\tilde\theta_{l,r}$ for $l \notin \{k_j^* : j = 1, \ldots, n_r^*\}$. Thus, for each $r = 1, \ldots, R$, the full conditional for $\psi_r$ is proportional to
$$\mathrm{N}(m_r; a_{m_r}, B_{m_r})\, \mathrm{W}_{a_{V_r}}(V_r^{-1}; B_{V_r}^{-1})\, \mathrm{W}_{a_{S_r}}(S_r; B_{S_r}) \prod_{j=1}^{n_r^*} \mathrm{N}(\tilde\mu^*_{j,r}; m_r, V_r)\, \mathrm{W}_{\nu_r}(\tilde\Sigma^{*-1}_{j,r}; S_r^{-1}).$$
Hence, $\psi_r$ can be updated by separate draws from the posterior full conditionals for $m_r$, $V_r$, and $S_r$. The full conditional for $m_r$ is normal with covariance matrix $B'_{m_r} = (B_{m_r}^{-1} + n_r^* V_r^{-1})^{-1}$ and mean vector $B'_{m_r}(B_{m_r}^{-1} a_{m_r} + V_r^{-1} \sum_{j=1}^{n_r^*} \tilde\mu^*_{j,r})$. The full conditional for $V_r^{-1}$ is $\mathrm{W}_{n_r^* + a_{V_r}}(\cdot; (B_{V_r} + \sum_{j=1}^{n_r^*} (\tilde\mu^*_{j,r} - m_r)(\tilde\mu^*_{j,r} - m_r)^T)^{-1})$, and the full conditional for $S_r$ is $\mathrm{W}_{\nu_r n_r^* + a_{S_r}}(\cdot; (B_{S_r}^{-1} + \sum_{j=1}^{n_r^*} \tilde\Sigma^{*-1}_{j,r})^{-1})$.

Regarding the DP precision parameters, combining the $\Gamma(a_{\alpha_r}, b_{\alpha_r})$ prior for $\alpha_r$ with the relevant terms from (B.1), we obtain that, for each $r = 1, \ldots, R$, the posterior full conditional for $\alpha_r$ is a $\Gamma(a_{\alpha_r} + L - 1,\, b_{\alpha_r} - \log(\omega_{L,r}))$ distribution. Finally, with the $\mathrm{Dir}(Q_r; \lambda_r)$ prior on each row $Q_r$ of the transition matrix $Q$, the posterior full conditional for $Q_r$ is $\mathrm{Dir}(Q_r; \lambda_r + J_r)$, where $J_r = (J_{r,1}, \ldots, J_{r,R})$, with $J_{r,s}$ denoting the number of transitions from state $r$ to state $s$ defined by the currently imputed state vector $h$.

References

Antoniak, C. (1974). "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems." Annals of Statistics, 2: 1152–1174.

Berliner, L. M. and Lu, Z.-Q. (1999). "Markov switching time series models with application to a daily runoff series." Water Resources Research, 35: 523–534.

Billio, M., Monfort, A., and Robert, C. P. (1999). "Bayesian estimation of switching ARMA models." Journal of Econometrics, 93: 229–255.

Blackwell, D. and MacQueen, J. (1973). "Ferguson distributions via Pólya urn schemes." Annals of Statistics, 1: 353–355.

Chib, S. (1996). "Calculating posterior distributions and modal estimates in Markov mixture models." Journal of Econometrics, 75: 79–97.

Connor, R. and Mosimann, J. (1969). "Concepts of independence for proportions with a generalization of the Dirichlet distribution." Journal of the American Statistical Association, 64: 194–206.

Ferguson, T. (1973). "A Bayesian analysis of some nonparametric problems." Annals of Statistics, 1: 209–230.

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.

Gelfand, A. E. and Kottas, A. (2002). "A Computational Approach for Full Nonparametric Bayesian Inference under Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 11: 289–305.

Goldfeld, S. M. and Quandt, R. E. (1973). "A Markov model for switching regression." Journal of Econometrics, 1: 3–16.

Hamilton, J. D. (1989). "A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle." Econometrica, 57: 357–384.

Hare, S. R. and Mantua, N. J. (2000). "Empirical evidence for North Pacific regime shifts in 1977 and 1989." Progress in Oceanography, 47: 103–145.

Hughes, J. P. and Guttorp, P. (1994). "A class of stochastic models for relating synoptic atmospheric patterns to regional hydrologic phenomena." Water Resources Research, 30: 1535–1546.

Hurn, M., Justel, A., and Robert, C. P. (2003). "Estimating mixtures of regressions." Journal of Computational and Graphical Statistics, 12: 55–79.
Ishwaran, H. and James, L. (2001). "Gibbs sampling methods for stick-breaking priors." Journal of the American Statistical Association, 96: 161–173.

Ishwaran, H. and Zarepour, M. (2000). "Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models." Biometrika, 87: 371–390.

Jacobson, L. D., Bograd, S. J., Parrish, R. H., Mendelssohn, R., and Schwing, F. B. (2005). "An ecosystem-based hypothesis for climatic effects on surplus production in California sardine and environmentally dependent surplus production models." Canadian Journal of Fisheries and Aquatic Sciences, 62: 1782–1796.

MacCall, A. D. (2002). "An hypothesis explaining biological regimes in sardine-producing Pacific boundary current systems." In Climate and fisheries: interacting paradigms, scales, and policy approaches: the IRI-IPRC Pacific Climate-Fisheries Workshop, 39–42. International Research Institute for Climate Prediction, Columbia University.

McCulloch, R. E. and Tsay, R. S. (1994). "Statistical Analysis of Economic Time Series via Markov Switching Models." Journal of Time Series Analysis, 15: 523–539.

McGowan, J. A., Cayan, D. R., and Dorman, L. M. (1998). "Climate-ocean variability and ecosystem response in the Northeast Pacific." Science, 281: 210–217.

Müller, P., Erkanli, A., and West, M. (1996). "Bayesian curve fitting using multivariate Normal mixtures." Biometrika, 83: 67–79.

Munch, S. B., Kottas, A., and Mangel, M. (2005). "Bayesian nonparametric analysis of stock-recruitment relationships." Canadian Journal of Fisheries and Aquatic Sciences, 62: 1808–1821.

Neal, R. (2000). "Markov Chain Sampling Methods for Dirichlet Process Mixture Models." Journal of Computational and Graphical Statistics, 9: 249–265.

Quandt, R. E. and Ramsey, J. B. (1978). "Estimating Mixtures of Normal Distributions and Switching Regressions (with discussion)." Journal of the American Statistical Association, 73: 730–752.

Quinn, T. J. I. and Deriso, R. B. (1999). Quantitative Fish Dynamics. Oxford University Press.

Robert, C., Celeux, G., and Diebolt, J. (1993). "Bayesian estimation of hidden Markov chains: a stochastic implementation." Statistics & Probability Letters, 16: 77–83.

Scott, S. (2002). "Bayesian methods for hidden Markov models: recursive computing for the 21st century." Journal of the American Statistical Association, 97: 337–351.

Sethuraman, J. (1994). "A constructive definition of Dirichlet priors." Statistica Sinica, 4: 639–650.

Shi, J. Q., Murray-Smith, R., and Titterington, D. M. (2005). "Hierarchical Gaussian process mixtures for regression." Statistics and Computing, 15: 31–41.

Shumway, R. H. and Stoffer, D. S. (1991). "Dynamic linear models with switching." Journal of the American Statistical Association, 86: 763–769.

Taddy, M. and Kottas, A. (2009). "A Bayesian Nonparametric Approach to Inference for Quantile Regression." Journal of Business and Economic Statistics, to appear.

Wada, T. and Jacobson, L. D. (1998). "Regimes and stock-recruitment relationships in Japanese sardine, 1951–1995." Canadian Journal of Fisheries and Aquatic Sciences, 55: 2455–2463.

Acknowledgments

This research is part of the first author's Ph.D. dissertation, completed at the University of California, Santa Cruz, and was supported in part by the National Science Foundation under Award DEB-0727543. The authors thank Alec MacCall, Steve Munch, and Marc Mangel for helpful discussions regarding the analysis of the Pacific sardine data, and two referees for useful comments that led to an improved presentation of the material in the paper.
Bayesian Analysis (2009) 4, Number 4, pp. 817–846

A Case for Robust Bayesian Priors with Applications to Clinical Trials

Jairo A. Fúquene∗, John D. Cook† and Luis R. Pericchi‡

∗Institute of Statistics, School of Business Administration, University of Puerto Rico, San Juan, PR, mailto:jairo.a.fuquene@uprrp.edu
†Division of Quantitative Sciences, M.D. Anderson Cancer Center, University of Texas, Houston, TX, mailto:jdcook@mdanderson.org
‡Department of Mathematics, University of Puerto Rico, San Juan, PR, mailto:lrpericchi@uprrp.edu
© 2009 International Society for Bayesian Analysis DOI:10.1214/09-BA431

Abstract. Bayesian analysis is frequently confused with conjugate Bayesian analysis. This is particularly the case in the analysis of clinical trial data. Even though conjugate analysis is perceived to be simpler computationally (but see below, Berger's prior), the price to be paid is high: such analysis is not robust with respect to the prior, i.e. changing the prior may affect the conclusions without bound. Furthermore, conjugate Bayesian analysis is blind with respect to the potential conflict between the prior and the data. Robust priors, however, have bounded influence. The prior is discounted automatically when there are conflicts between prior information and data. In other words, conjugate priors may lead to a dogmatic analysis, while robust priors promote self-criticism since prior and sample information are not on equal footing. The original proposal of robust priors was made by de Finetti in the 1960's. However, the practice has not taken hold in important areas where the Bayesian approach is making definite advances, such as in clinical trials, where conjugate priors are ubiquitous. We show here how the Bayesian analysis for simple binary binomial data, expressed in its exponential family form, is improved by employing Cauchy priors. This requires no undue computational cost, given the advances in computation and analytical approximations. Moreover, we introduce in the analysis of clinical trials a robust prior originally developed by J. O. Berger that we call Berger's prior. We implement specific choices of prior hyperparameters that give closed-form results when coupled with a normal log-odds likelihood. Berger's prior yields a robust analysis with no added computational complication compared to the conjugate analysis. We illustrate the results with famous textbook examples and also with a real data set and a prior obtained from a previous trial. On the formal side, we present a general and novel theorem, the "Polynomial Tails Comparison Theorem." This theorem establishes the analytical behavior of any likelihood function with tails bounded by a polynomial when used with priors with polynomial tails, such as Cauchy or Student's t. The advantages of the theorem are that the likelihood does not have to be a location family nor exponential family distribution and that the conditions are easily verifiable. The binomial likelihood can be handled as a direct corollary of the result. Next, we proceed to prove a striking result: the intrinsic prior to test a normal mean, obtained as an objective prior for hypothesis testing, is a limit of Berger's robust prior. This result is useful for assessments and for MCMC computations. We then generalize the theorem to prove that Berger's prior and intrinsic priors are robust with normal likelihoods.
Finally, we apply the results to a large clinical trial that took place in Venezuela, using prior information based on a previous clinical trial conducted in Finland. Our main conclusion is that introducing the existing prior information in the form of a robust prior is more justifiable simultaneously for federal agencies, researchers, and other constituents, because the prior information is coherently discarded when in conflict with the sample information.

Keywords: Berger's Prior, Clinical Trials, Exponential Family, Intrinsic Prior, Parametric Robust Priors, Polynomial Tails Comparison Theorem, Robust Priors

1 Introduction

In Bayesian statistics the selection of the family of prior distributions is crucial to the analysis of data because the conclusions depend on this selection. However, there is little analysis of clinical trials using non-conjugate priors. It is common to report an analysis using different conjugate priors: clinical, skeptical, and non-informative. The precision in these priors is important, and sensitivity analyses regarding the priors are necessary. One approach to this problem is advocated by Greenhouse and Wasserman (1995), who compute bounds on posterior expectations over an ε-contaminated class of prior distributions. An alternative solution is proposed in Carlin and Louis (1996), where one re-specifies the prior and re-computes the result. These authors obtain fairly specific results for some restricted non-parametric classes of priors. Along the same line, another alternative is the "prior partitioning" of Carlin and Sargent (1996), which selects a suitably flexible class of priors (a non-parametric class whose members include a quasi-unimodal, a semi-parametric normal mixture class, and the fully parametric normal family) and identifies the priors that lead to posterior conclusions of interest. These are (very few) proposals about what may be called "non-parametric" robustness to the prior.

The proposals in this paper concern "parametric" robust Bayesian analysis, quite distinct from the previous proposals. Some general results on parametric Bayesian robustness are in Dawid (1973), O'Hagan (1979), and Pericchi and Sansó (1995). We believe that the main road forward for clinical trials is on the parametric side, for three reasons. First, it is more natural to represent the information given by a previous trial in terms of parametric priors. More generally, parametric priors are easier to assess. Second, it is far more clear how to generalize a parametric robust analysis to hierarchical modeling than a non-parametric class of priors. Finally, non-parametric priors do not appear to have achieved a significant impact in practice. In Gelman et al. (2008) the authors take a somewhat similar point of view to ours. Their arguments are very applied while ours are more theoretical, and so the papers are complementary.

Clinical trials are exceptional in at least two ways. First, there is often substantial "hard" prior data. Second, there are multiple parties overseeing the analysis: researchers, statisticians, regulatory bodies such as the FDA, data and safety monitoring boards, journal editors, etc. In this framework there are fundamental issues such as the following. How do you assess a prior from the prior data? How do you assess how relevant the previous data is to the current trial? By using prior information, are we enhancing the analysis or biasing it?
Our key message in this paper is that robust priors are a better framework for achieving consensus in clinical trials, for the following reasons:

1. Prior information may be substantial about certain characteristics, like location and scale, but it is very weak about the tails of prior distributions.

2. The tail size is crucial in the posterior inference when there is conflict between prior and sample.

3. The behavior of posterior inference under robust priors is superior because, when the prior information is irrelevant for the case at hand, it is coherently and automatically discarded by Bayes' theorem.

Conjugate light-tailed priors do not have these features and may be called "dogmatic." See Berger (1985) for an authoritative discussion of these issues and our example in Section 6.1. Of course, if all involved had unlimited time for several sensitivity analyses, the results using light-tailed priors might be acceptable. Instead, we are suggesting that Bayes' theorem should be allowed to perform the sensitivity analyses coherently, and for that heavy-tailed priors are required.

A referee has pointed out that "a researcher who had carefully constructed a prior distribution that reflected substantial available information almost certainly would prefer for that information to be reflected in the posterior distribution or at least for prior/data conflict to be recognized and investigated". Certainly, if someone has a prior that they want included in the analysis, fine. But it need not be the only prior used. There is no harm in repeating an analysis with several priors, and in fact it is a recommended practice to do so. Furthermore, there are limits to how well someone can quantify their prior uncertainty, particularly far from the center of their estimate. It is hard to imagine that someone could say that their prior belief follows a normal distribution rather than a Student-t with, say, six degrees of freedom. If individuals cannot specify with fidelity the tail behavior of their subjective priors, the tail behavior should be determined by technical criteria such as robustness.

The popular normal/normal (N/N) and beta/binomial (B/B) conjugate analyses (see, for example, Spiegelhalter et al. (2004)) will be exposed in this article as non-robust. Workable (parametric) alternatives are available to the practitioner. For motivation, consider the posterior mean $\mu_n$ in the N/N and B/B models (see the next section):
$$\mu_n = (n_0 + n)^{-1}(n_0 \mu + n \bar{X}_n).$$
Thus the mean is a convex combination of the prior expectation, $\mu$, and the data average, $\bar{X}_n$, and thus the prior has unbounded influence. For example, as the location prior/data conflict $|\mu - \bar{x}|$ grows, so does $|\mu_n - \bar{x}|$, and without bound. These considerations motivate the interest in non-conjugate models for the Bayesian analysis of clinical trials, and more generally motivate heavy-tailed priors. (See the theorem in the next section.)
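A short numeric check of this unbounded influence, with illustrative values for $n_0$, $n$, and $\bar{X}_n$:

```python
n0, n, xbar = 10, 40, 0.0     # prior "sample size", data size, data mean (illustrative)
for mu in [1, 5, 25, 125]:    # growing prior/data conflict |mu - xbar|
    mu_n = (n0 * mu + n * xbar) / (n0 + n)
    print(f"mu = {mu:5.1f}  ->  posterior mean = {mu_n:6.2f}")
# The conjugate posterior mean drifts with mu without bound: its bias is always
# n0/(n0 + n) * (mu - xbar), a fixed fraction of the conflict, however large.
```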
We may employ the following heuristic: Bayesian clinical trials are not better because they stop sampling earlier (although they often do) but because they stop intelligently; that is, the stopping is conditional on the amount of evidence. Robust priors are not better because they have less influence (though this is true) but because they influence in a more intelligent way: the influence of the robust prior is a function of the potential conflict between prior and sample information about the region where the parameters are most likely to live. (For more general measures of prior-sample conflict see, for example, Evans and Moshonov (2006).)

In this paper we show that the Cauchy prior is robust in two models for clinical trials. Pericchi and Smith (1992) considered the robustness of the Student-t prior in the Student-t/normal model. We consider as a particular case the Cauchy/normal (C/N) model for normal log-odds. Much less known, however, is the robustness of the Cauchy prior with the binomial likelihood and, more generally, with exponential family likelihoods. To prove the robustness of the Cauchy prior when coupled with a binomial likelihood, we prove a more general result that only requires a bound on the tail behavior of the likelihood. This novel theorem is easy to verify and is very general. Under these conditions, when the prior and the model are in conflict, the prior acts "as if" it were uniform. In other words, the prior influences the analysis only when prior information and likelihood are in broad agreement. Otherwise, Bayes' theorem effectively switches back to a uniform prior. In this paper we rely heavily on the fact that the binomial likelihood belongs to the exponential family (though the theorem is not limited to exponential family likelihoods), showing the robustness of the Cauchy prior in the Cauchy/binomial (C/B) model for binary data.

Cauchy priors do not lead to analytical closed-form results, but our next suggestion does. In his very influential book, Berger (1985) proposes a prior (called here "Berger's prior"). We use Berger's prior for clinical trials analysis, assuming a prior mean and scale suggested by previous data or by general features of the current trial. It turns out that this gives closed-form results when coupled with a normal log-odds likelihood. We show the robustness of Berger's prior for the Berger-prior/normal log-odds (BP/N) model, which makes it more attractive than both the Cauchy and conjugate priors. We also prove here a striking result: the intrinsic prior to test a normal mean of Berger and Pericchi (1996), which is obtained as an objective prior for hypothesis testing, is also a limit of Berger's robust prior. This result is useful for assessments and for MCMC computations. We then generalize the Polynomial Tails Comparison Theorem to prove that Berger's prior and intrinsic priors are robust with normal likelihoods. We finally apply the results to a massive clinical trial that took place in Venezuela, with prior information taken from a previous clinical trial in Finland.

Lastly, we remark that the hierarchical model is not the solution for the lack of robustness of conjugate analysis. Quite to the contrary, the hierarchical model should use robust priors in the hierarchy to prevent unbounded and undesirable shrinkages. This is being studied in work in progress by M. E. Perez and L. R. Pericchi.

This article is organized as follows.
Section 2 is devoted to the Polynomial Tails Comparison Theorem. In Section 3 we review the prior specification and posterior moments of the C/B model. In Section 4 we examine the robustness of the Cauchy prior in the C/B posterior model. In Sections 3 and 4 we show the application of the C/B model in a clinical trial. In Section 5 we describe the robustness of the C/N and BP/N models and prove that the intrinsic prior is a limit of Berger's priors. In Section 6 we prove the Generalized Polynomial Tails theorem and illustrate the results in a real and important clinical trial published in the New England Journal of Medicine. We make some closing remarks in Section 7.

2 The Polynomial Tails Comparison Theorem

The following theorem is decidedly useful and easy to apply when determining whether a prior is robust with respect to a likelihood. For ν > 0, define
\[
t(\lambda; \mu, \nu) = \left(1 + \frac{(\lambda - \mu)^2}{\nu}\right)^{-(\nu+1)/2}.
\]
Aside from a normalization constant that would cancel out in our calculations, t(λ; µ, ν) is the PDF of a Student-t distribution with ν degrees of freedom centered at µ. Let f(λ) be any likelihood function such that as m → ∞
\[
\int_{|\lambda| > m} f(\lambda)\, d\lambda = O(m^{-\nu-1-\varepsilon}). \tag{1}
\]
In the application we have in mind, f is a binomial likelihood function, although the result is more general. For instance, (1) holds for every ν for any likelihood with exponentially decreasing tails.

Denote by π^T(λ | data) and π^U(λ | data) the posterior densities employing the Student-t and the uniform prior densities respectively. Applying Bayes' rule to both densities gives, for any parameter value λ0, the ratio
\[
\frac{\pi^U(\lambda_0 \mid \text{data})}{\pi^T(\lambda_0 \mid \text{data})}
= \frac{\int_{-\infty}^{\infty} f(\lambda)\, t(\lambda; \mu, \nu)\, d\lambda}{t(\lambda_0; \mu, \nu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda}. \tag{2}
\]

Theorem 2.1. For fixed λ0,
\[
\lim_{\mu \to \infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\, t(\lambda; \mu, \nu)\, d\lambda}{t(\lambda_0; \mu, \nu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda} = 1.
\]

Proof. We will show that
\[
\lim_{\mu \to \infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\, t(\lambda; \mu, \nu)\, d\lambda \;-\; t(\lambda_0; \mu, \nu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda}{t(\lambda_0; \mu, \nu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda} = 0. \tag{3}
\]
Note that the numerator can be written as
\[
\int_{-\infty}^{\infty} f(\lambda)\,\left(t(\lambda; \mu, \nu) - t(\lambda_0; \mu, \nu)\right) d\lambda.
\]
We break the region of integration in the numerator into two parts, |λ| < µ^k and |λ| > µ^k, for some 0 < k < 1 that we will pick later, and show that as µ → ∞ each integral goes to zero faster than the denominator.

First consider
\[
\int_{|\lambda| < \mu^k} f(\lambda)\,\left(t(\lambda; \mu, \nu) - t(\lambda_0; \mu, \nu)\right) d\lambda. \tag{4}
\]
For every λ, there exists a ξ between λ and λ0 such that t(λ; µ, ν) − t(λ0; µ, ν) = t′(ξ; µ, ν)(λ − λ0) by the mean value theorem. Since µ → ∞, we can assume µ > µ^k > λ0. Then
\[
|t(\lambda; \mu, \nu) - t(\lambda_0; \mu, \nu)|
= |t'(\xi; \mu, \nu)\,(\lambda - \lambda_0)|
= \frac{(\nu + 1)\,|\xi - \mu|}{\nu \left(1 + \frac{(\xi - \mu)^2}{\nu}\right)^{(\nu+3)/2}}\, |\lambda - \lambda_0|
= \frac{O(\mu^{1+k})}{\Omega(\mu^{\nu+3})}
= O(\mu^{k-\nu-2}).
\]
[Here we use the familiar O notation and the less familiar Ω notation. Just as f = O(µ^n) means that f is eventually bounded above by a constant multiple of µ^n, the notation f = Ω(µ^n) means that f is eventually bounded below by a constant multiple of µ^n.]

As µ → ∞, the integral (4) goes to zero as O(µ^{k−ν−2}). Since t(λ0; µ, ν) is Ω(µ^{−ν−1}), the ratio of the integral (4) to t(λ0; µ, ν) is O(µ^{k−1}). Since k < 1, this ratio goes to zero as µ → ∞.

Next consider the remaining integral,
\[
\int_{|\lambda| > \mu^k} f(\lambda)\,\left(t(\lambda; \mu, \nu) - t(\lambda_0; \mu, \nu)\right) d\lambda. \tag{5}
\]
The term t(λ; µ, ν) − t(λ0; µ, ν) is bounded, and we assumed
\[
\int_{|\lambda| > m} f(\lambda)\, d\lambda = O(m^{-\nu-1-\varepsilon}).
\]
Therefore the integral (5) is O((µ^k)^{−ν−1−ε}) = O(µ^{−k(ν+1+ε)}). Since t(λ0; µ, ν) is Ω(µ^{−ν−1}), the ratio of the integral (5) to t(λ0; µ, ν) is of order O(µ^{(ν+1)−k(ν+1+ε)}). This term goes to zero as µ → ∞ provided k > (ν + 1)/(ν + 1 + ε); since (ν + 1)/(ν + 1 + ε) < 1, such a k in (0, 1) exists.

Note that in particular the theorem applies when f is the likelihood function of a binomial model with at least one success and one failure and ν = 1, i.e. a Cauchy prior.
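The conclusion of Theorem 2.1 is easy to check numerically. The sketch below (our own illustration; the specific data values are ours) evaluates the ratio (2) for a binomial likelihood and a Cauchy-shaped prior at increasing prior locations µ:

```python
# Numerical illustration of Theorem 2.1: the ratio (2) tends to 1 as the
# prior location mu moves away from the data, for a binomial likelihood
# in the log-odds and a Student-t kernel with nu = 1 (Cauchy shape).
import numpy as np
from scipy.integrate import quad

n, x_plus = 20, 16                       # 16 successes in 20 trials
lam_hat = np.log(x_plus / (n - x_plus))  # MLE of the log-odds
lam0 = lam_hat                           # fixed point at which posteriors are compared
log_c = x_plus * lam_hat - n * np.log1p(np.exp(lam_hat))  # scaling constant

def lik(lam):
    """Binomial likelihood kernel in the log-odds, scaled to 1 at the MLE."""
    return np.exp(x_plus * lam - n * np.log1p(np.exp(lam)) - log_c)

def t_kernel(lam, mu, nu=1.0):
    """Unnormalized Student-t kernel t(lambda; mu, nu); nu = 1 is Cauchy."""
    return (1.0 + (lam - mu) ** 2 / nu) ** (-(nu + 1) / 2)

denom_f = quad(lik, -50, 50)[0]
for mu in [2, 10, 50, 250]:
    num = quad(lambda l: lik(l) * t_kernel(l, mu), -50, 50)[0]
    print(f"mu = {mu:4.0f}   ratio (2) = {num / (t_kernel(lam0, mu) * denom_f):.4f}")
# The printed ratio approaches 1 as mu grows: under strong prior/data
# conflict the Cauchy prior behaves like a uniform prior.
```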
3 The Binomial Likelihood with Conjugate and Cauchy Priors

Assume a sample of size n with X1, ..., Xn ∼ Bernoulli(θ). The binomial likelihood in its explicit exponential family form is given by
\[
f(X_+ \mid \lambda) \propto \exp\left\{X_+ \lambda - n \log(1 + e^{\lambda})\right\}, \tag{6}
\]
where X+ = Σ_{i=1}^n Xi ∼ binomial(n, θ) is the number of successes in n trials. Notice that for the binomial likelihood it is enough to assume that there is at least one success and one failure, i.e. 0 < X+ < n, for assumption (1) of the theorem of the previous section to be fulfilled for every ν ≥ 1, since then the binomial has exponentially decreasing tails. The natural parameter is the log-odds, λ = log(θ/(1 − θ)), which is the parameter to be modeled as a Cauchy variable later, for which one can make use of the theorem. If desired, a Student-t prior with more than one degree of freedom can be used, and all results apply as well. We employ the Cauchy for good use of "conservatism" regarding the treatment of prior information, a point shared with Gelman et al. (2008).

First we perform a conjugate analysis, expressing the beta(a, b) prior, after the transformation of the parameter θ to log-odds, as
\[
\pi_B(\lambda) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \left(\frac{e^{\lambda}}{1 + e^{\lambda}}\right)^{a} \left(\frac{1}{1 + e^{\lambda}}\right)^{b}, \qquad a, b > 0. \tag{7}
\]
The cumulant generating function of the prior distribution πB(λ) is given by
\[
K_{\lambda}(t) = -\log(\Gamma(a)\Gamma(b)) + \log(\Gamma(a + t)) + \log(\Gamma(b - t)), \tag{8}
\]
hence E_B(λ) = Ψ(a) − Ψ(b) and V_B(λ) = Ψ′(a) + Ψ′(b), where Ψ(·) is the digamma function and Ψ′(·) is the trigamma function, both extensively tabulated in, for example, Abramowitz and Stegun (1970). The posterior distribution of the B/B model is given by
\[
\pi_B(\lambda \mid X_+) = K \times \exp\left\{(a + X_+)\lambda - (n + a + b)\log(1 + e^{\lambda})\right\}, \tag{9}
\]
where K = Γ(n + a + b) / (Γ(X+ + a) Γ(n − X+ + b)).

On the other hand, one proposal for robust analysis of binomial data (see also the next sections for Berger's prior as an alternative) is to use a Cauchy prior for the natural parameter λ in order to achieve robustness with respect to the prior,
\[
\pi_C(\lambda) = \frac{\beta}{\pi\left[\beta^2 + (\lambda - \alpha)^2\right]}, \tag{10}
\]
with location and scale parameters α and β respectively. The posterior distribution of the C/B model is
\[
\pi_C(\lambda \mid X_+) = \frac{\exp\left\{X_+\lambda - n\log(1 + e^{\lambda}) - \log\left(\beta^2 + (\lambda - \alpha)^2\right)\right\}}{m(X_+)},
\]
where m(X+) is the predictive marginal. Notice that this posterior also belongs to the exponential family. One approach to the approximation of m(X+) is Laplace's method, refined by Tierney and Kadane (1986) for statistical applications:
\[
m(X_+) \approx \sqrt{2\pi}\, \hat{\sigma}\, n^{-1/2} \exp\{-n\, h(\hat{\lambda})\},
\]
where −n h(λ) = log(πC(λ) f(X+ | λ)), λ̂ is the maximum of −h(λ), and σ̂ = [h″(λ)]^{−1/2} evaluated at λ = λ̂. The accuracy is of order O(n^{−1}).
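A short sketch of this approximation (ours; the variable names are illustrative, and the second derivative is taken by finite differences rather than analytically) compares the Tierney-Kadane value of m(X+) against direct numerical integration:

```python
# Laplace (Tierney-Kadane) approximation to the C/B predictive marginal
# m(X+) = integral of pi_C(lambda) * f(X+ | lambda) d(lambda),
# compared against brute-force quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

n, x_plus = 20, 16
alpha, beta = -1.52, 0.69     # Cauchy location/scale from Example 3.1 below

def log_joint(lam):
    """log of pi_C(lambda) * f(X+ | lambda), i.e., -n h(lambda)."""
    log_lik = x_plus * lam - n * np.log1p(np.exp(lam))
    log_prior = np.log(beta / np.pi) - np.log(beta**2 + (lam - alpha)**2)
    return log_lik + log_prior

# lambda_hat maximizes -h; sigma_hat uses the curvature h'' at the mode.
opt = minimize_scalar(lambda l: -log_joint(l), bounds=(-10, 10), method="bounded")
lam_hat = opt.x
eps = 1e-4
second = (log_joint(lam_hat + eps) - 2 * log_joint(lam_hat)
          + log_joint(lam_hat - eps)) / eps**2        # d^2/dlam^2 of (-n h)
sigma_hat = 1.0 / np.sqrt(-second / n)                # [h'']^{-1/2} at the mode

m_laplace = np.sqrt(2 * np.pi) * sigma_hat * n**-0.5 * np.exp(log_joint(lam_hat))
m_exact = quad(lambda l: np.exp(log_joint(l)), -30, 30)[0]
print(f"Laplace: {m_laplace:.3g}   numerical: {m_exact:.3g}")
```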
Example 3.1. A Textbook Clinical Trial Example. We apply the preceding approximation, adapting an example considered in Spiegelhalter et al. (2004). Suppose that previous experience with similar compounds has suggested that a drug has a true response rate θ between 0 and 0.4, with an expectation around 0.2. For normal distributions we know that m ± 2s includes just over 95% of the probability, so if we were assuming a normal prior we might estimate m = 0.2 and s = 0.1. However, beta distributions with reasonably high a and b have an approximately normal shape, so we take θ ∼ beta(a = 3, b = 12). Suppose that we test the drug on 20 additional patients and observe 16 positive responses (X+ = 16). Then the likelihood of the experiment is X+ ∼ binomial(n = 20, θ) and the posterior in this case is θ | X+ ∼ beta(a = 19, b = 16). Our proposal is to use a Cauchy prior, πC(λ), in order to achieve robustness with respect to the prior, with the same location and scale parameters as the beta prior. For this example the location and scale are Ψ(3) − Ψ(12) = −1.52 and √(Ψ′(3) + Ψ′(12)) = 0.69 respectively. Figures 1 and 2 display a large discrepancy between the means of the prior information and the normalized likelihood (i.e. the posterior density using a uniform prior) of the data. In the B/B model the prior and the likelihood receive equal weight. The weight of the likelihood in the C/B posterior model is higher than in the B/B model, and the C/B posterior is much closer in form to the normalized likelihood.

Figure 1: Beta prior, normalized binomial likelihood and B/B posterior model for Example 3.1 (densities on the log-odds scale).

Figure 2: Cauchy prior, normalized binomial likelihood and C/B posterior model for Example 3.1 (densities on the log-odds scale).

The posterior moments of the natural parameter of an exponential family are considered in Pericchi et al. (1993) and Gutierrez-Peña (1997). The cumulant generating function of the posterior πB(λ | X+) in the B/B model is
\[
K_{\lambda \mid X_+}(t) = \log\left(\frac{\Gamma(X_+ + a + t)\,\Gamma(n - X_+ + b - t)}{\Gamma(X_+ + a)\,\Gamma(n - X_+ + b)}\right), \tag{11}
\]
hence
\[
E_B(\lambda \mid X_+) = \Psi(X_+ + a) - \Psi(n - X_+ + b), \tag{12}
\]
\[
V_B(\lambda \mid X_+) = \Psi'(X_+ + a) + \Psi'(n - X_+ + b). \tag{13}
\]
In the C/B model we need to approximate E_C(λ | X+) and V_C(λ | X+). The posterior expectation E_C(λ | X+) involves the ratio of two integrals, and the Laplace method can be used:
\[
\tilde{E}(\lambda \mid X_+) = \left(\frac{\sigma^*}{\hat{\sigma}}\right) \exp\left\{-n\left[h^*(\lambda^*) - h(\hat{\lambda})\right]\right\}, \tag{14}
\]
where −n h*(λ) = log(λ πC(λ) f(X+ | λ)), λ* is the maximum of −h*(λ), and σ* = [h*″(λ)]^{−1/2} evaluated at λ = λ*. The error in (14) is of order O(n^{−2}) (see Tierney and Kadane (1986)). However, in (14) we must assume that λ does not change sign. Tierney et al. (1989) recommend adding a large constant c to λ, applying Laplace's method (14), and finally subtracting the constant. We let Ẽ_C(λ | X+) and Ṽ_C(λ | X+) denote the approximate posterior expectation and variance of the C/B model:
\[
\tilde{E}_C(\lambda \mid X_+) = \tilde{E}(c + \lambda \mid X_+) - c, \tag{15}
\]
\[
\tilde{V}_C(\lambda \mid X_+) = \tilde{E}\left((c + \lambda)^2 \mid X_+\right) - \left[\tilde{E}(c + \lambda \mid X_+)\right]^2. \tag{16}
\]
For both functions h(λ) and h*(λ) the maximum cannot be found analytically, so we use the Newton-Raphson algorithm. Here c is the value of λ such that πC(λ = c | X+) ≤ 0.5 × 10^{−4}, and the starting value for Newton-Raphson is the maximum likelihood estimator (MLE) of the natural parameter, λ̂ = log(X̄n/(1 − X̄n)).

Result 3.1. The posterior expectations for the C/B and B/B models satisfy, respectively:

1. Robust result:
\[
\lim_{\alpha \to \pm\infty} E_C(\lambda \mid X_+) \approx \hat{\lambda} + \frac{e^{2\hat{\lambda}} - 1}{2 n e^{\hat{\lambda}}}. \tag{17}
\]

2. Non-robust result:
\[
E_B(\lambda \mid X_+) \to \pm\infty \quad \text{as} \quad E_B(\lambda) \to \pm\infty. \tag{18}
\]

Proof. See the Appendix. Result 3.1 is a corollary of Theorem 2.1.

Note: the limit (17) is not equal to the MLE, but it is consistent with Theorem 2.1.
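To see these formulas in action, the following sketch (ours; the C/B moments are computed by direct quadrature rather than by the Laplace approximations (15)-(16)) evaluates the B/B moments (12)-(13) and the C/B moments for this example; the values should be close to those reported in Table 1 below.

```python
# B/B posterior moments via (12)-(13) and "exact" C/B moments via
# numerical integration of the unnormalized C/B posterior.
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma, polygamma

n, x_plus, a, b = 20, 16, 3, 12
alpha, beta = -1.52, 0.69

# Beta/binomial: closed form on the log-odds scale.
e_bb = digamma(x_plus + a) - digamma(n - x_plus + b)
v_bb = polygamma(1, x_plus + a) + polygamma(1, n - x_plus + b)

# Cauchy/binomial: integrate the unnormalized posterior kernel.
def kernel(lam):
    return np.exp(x_plus * lam - n * np.log1p(np.exp(lam))) \
           / (beta**2 + (lam - alpha)**2)

m0 = quad(kernel, -30, 30)[0]
e_cb = quad(lambda l: l * kernel(l), -30, 30)[0] / m0
v_cb = quad(lambda l: l**2 * kernel(l), -30, 30)[0] / m0 - e_cb**2

mle = np.log((x_plus / n) / (1 - x_plus / n))
print(f"B/B:  E = {e_bb:.2f}, V = {v_bb:.2f}")   # approx 0.18, 0.12
print(f"C/B:  E = {e_cb:.2f}, V = {v_cb:.2f}")   # close to the likelihood
print(f"MLE lambda-hat = {mle:.2f}")             # 1.39
```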
4 Computations with Cauchy Priors

We use weighted rejection sampling to compute the ("exact") posterior moments in the C/B model, owing to its simplicity and generality for simulating draws directly from the target density πC(λ | X+) (see Smith and Gelfand (1992)). In the C/B model the envelope function is the Cauchy prior. The rejection method proceeds as follows:

1. Calculate M = f(X+ | λ̂).

2. Generate λj ∼ πC(λ).

3. Generate Uj ∼ uniform(0, 1).

4. If M Uj πC(λj) < f(X+ | λj) πC(λj), accept λj. Otherwise reject λj and go to Step 2.

It is clear that the Cauchy density is an envelope, since the likelihood is maximized at λ̂ and hence f(X+ | λ) ≤ M for all λ. Because it is simple to generate Cauchy distributed samples, the method is feasible. Using Monte Carlo methods and 10,000 random samples from πC(λ | X+) we compute Esim and Vsim.
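A minimal implementation of this rejection sampler (our sketch, using the Example 3.1 numbers; the acceptance test in Step 4 simplifies to M Uj < f(X+ | λj), which we apply on the log scale) could look as follows:

```python
# Weighted rejection sampling from the C/B posterior with the Cauchy
# prior as envelope: accept lambda_j when M * U_j < f(X+ | lambda_j),
# where M = f(X+ | lambda_hat) bounds the likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, x_plus = 20, 16
alpha, beta = -1.52, 0.69

def log_lik(lam):
    return x_plus * lam - n * np.log1p(np.exp(lam))

lam_hat = np.log((x_plus / n) / (1 - x_plus / n))  # MLE maximizes the likelihood
log_M = log_lik(lam_hat)                           # Step 1: envelope constant

samples = []
while len(samples) < 10_000:
    lam_j = alpha + beta * rng.standard_cauchy()   # Step 2: draw from the prior
    u_j = rng.uniform()                            # Step 3
    if np.log(u_j) + log_M < log_lik(lam_j):       # Step 4 (log scale)
        samples.append(lam_j)

samples = np.array(samples)
print(f"E_sim = {samples.mean():.2f}, V_sim = {samples.var():.2f}")
```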
Results available from the authors show that the agreement between the Laplace approximations and the rejection algorithm is quite good for sample sizes larger than n = 10.

In Figures 3 to 5 we illustrate the striking qualitative difference in posterior moments as a function of the discrepancy between prior and sample location |µ − x̄|. Figure 3 shows that the beta prior has an unbounded influence and is not robust. Figures 4 and 5 display the qualitative forms of dependence of the posterior expectation and variance on the discrepancy between the prior location and the MLE using a Cauchy prior. Here λ̂ = 0, and a and b take various values with their sum fixed at 50. In Figures 4 and 5, the approximations (15) and (16) are shown as functions of the discrepancy. Note that (16) is non-monotonic in the discrepancy. The posterior expectation, Ẽ_C(λ | X+), is a function of the "information discount."

Figure 3: Behavior of the posterior expectation, E_B(λ | X+), in the B/B model for values n = 10, λ̂ = 0 and a + b = 50.

Figure 4: Behavior of the posterior expectation, Ẽ_C(λ | X+), in the C/B model for values n = 50, λ̂ = 0 and a + b = 50.

Figure 5: Behavior of the posterior variance, V_B(λ | X+) in the B/B model and Ṽ_C(λ | X+) in the C/B model, for values n = 50, λ̂ = 0 and a + b = 50.

Example 4.1. Textbook Example (Continued): Moments and predictions for binary data.

    E_B(λ|X+)   V_B(λ|X+)   Ẽ_C(λ|X+)   Ṽ_C(λ|X+)   MLE
      0.18        0.12        1.26        0.33       1.39

Table 1: Posterior expectation and variance for the B/B and C/B models.

The estimate resulting from the C/B model is closer to the data: the MLE of θ, 0.8, is closer to 0.77 (the estimate from the C/B model) than to 0.54 (the estimate from the B/B model). In Table 1 there is a large difference between the posterior mean of the B/B model (0.18) and the MLE. On the other hand, the results of the C/B model and the MLE λ̂ are similar. The discrepancies between the expectations of the posterior models and the MLE are approximately 3.5 and 0.23 standard errors for B/B and C/B respectively. On the scale of θ, the true response rate for a set of Bernoulli trials, the predictive mean of the total number of successes in m future trials is E(Xm) = m E(θ | X+). If we plan to treat 40 additional patients, the predictive mean in the B/B model is 40 × 0.54 ≈ 22, while in the C/B model it is 40 × 0.77 ≈ 31. The beta prior is more "dogmatic" than the Cauchy prior, leading to non-robust results. Bayesian analysis is not dogmatic in general, but conjugate Bayesian analysis can be. This is a major selling point of robust Bayesian methods.

5 Normal Log-Odds and Berger's Prior

An alternative to the binomial likelihood is the normal likelihood in the log-odds; see Spiegelhalter et al. (2004). Pericchi and Smith (1992) showed some aspects of the robustness of the Student-t prior for a normal location parameter and provided approximations to the posterior moments in the Student-t/normal model. The Cauchy prior, as a Student-t with one degree of freedom, can be used in this context as well. However, for normal log-odds there is a robust prior that leads to a closed-form posterior and moments, a sort of "best of both worlds." Bayesians have long come to terms with the disadvantages of procedures based on conjugate priors because of the desire for closed-form results. However, Berger (1985) proposed, for the comparison of several means, a robust prior (called "Berger's prior" in this work) that gives closed-form results for coupled normal means. Berger's prior (BP) is similar to a Cauchy prior in the tails. Our proposal is an analysis based on Berger's prior, which we call the BP/N posterior model. In this work the location of Berger's prior, πBP(λ), is denoted by µ. This prior has the form
\[
\pi_{BP}(\lambda) = \int_0^1 N\!\left(\lambda \;\Big|\; \mu,\; \frac{d + b}{2\nu} - d\right) \cdot \frac{1}{2\sqrt{\nu}}\, d\nu. \tag{19}
\]
Here N(λ | µ, τ²) denotes a normal density for the parameter λ with mean µ and variance τ², which is well-defined whenever b ≥ d. The hyper-parameters d and b have to be assessed (see the end of the section for alternative assessments). We set here b = β² (equal to the scale of the Cauchy) and d = σ²/n. Suppose that X1, ..., Xn ∼ normal(λ, σ²), where σ² is assumed known and λ is unknown. Then Berger's prior is
\[
\pi_{BP}(\lambda) = \int_0^1 K \times \exp\left\{-\frac{n}{2} \cdot \frac{2\nu(\lambda - \mu)^2}{\sigma^2(1 - 2\nu) + n\beta^2}\right\} d\nu, \tag{20}
\]
where
\[
K = \frac{\sqrt{n}}{\sqrt{4\pi\left(\sigma^2(1 - 2\nu) + n\beta^2\right)}}. \tag{21}
\]

Result 5.1. Suppose that X1, ..., Xn ∼ normal(λ, σ²) where σ² is assumed known and λ is unknown. The predictive distribution of the BP/N model is
\[
m(\bar{X}_n) = \frac{\sqrt{\sigma^2 + n\beta^2}}{\sqrt{4\pi n (\bar{X}_n - \mu)^2}} \left[1 - \exp\left\{-\frac{n(\bar{X}_n - \mu)^2}{\sigma^2 + n\beta^2}\right\}\right].
\]
The posterior distribution of the BP/N model is
\[
\pi_{BP}(\lambda \mid \bar{X}_n) = \frac{\pi_{BP}(\lambda)\, \exp\left\{-\dfrac{n(\bar{X}_n - \lambda)^2}{2\sigma^2}\right\}}{\dfrac{\sigma\sqrt{\sigma^2 + n\beta^2}}{\sqrt{2n(\bar{X}_n - \mu)^2}} \left[1 - \exp\left\{-\dfrac{n(\bar{X}_n - \mu)^2}{\sigma^2 + n\beta^2}\right\}\right]}. \tag{22}
\]
The posterior expectation of the BP/N model is
\[
E_{BP}(\lambda \mid \bar{X}_n) = \bar{X}_n + \frac{2\sigma^2 n (\bar{X}_n - \mu)^2 - 2\sigma^2 (\sigma^2 + n\beta^2)\left(f(\bar{X}_n) - 1\right)}{n(\bar{X}_n - \mu)(\sigma^2 + n\beta^2)\left(f(\bar{X}_n) - 1\right)}, \tag{23}
\]
and the posterior variance of the BP/N model is
\[
V_{BP}(\lambda \mid \bar{X}_n) = \frac{\sigma^2}{n}
- \frac{\sigma^4}{n^2}\left\{\frac{4 n^2 (\bar{X}_n - \mu)^2 f(\bar{X}_n)}{(\sigma^2 + n\beta^2)^2 \left(f(\bar{X}_n) - 1\right)^2}\right\}
+ \frac{\sigma^4}{n^2}\left\{\frac{2(\sigma^2 + n\beta^2)\left(f(\bar{X}_n) - 1\right)\left((\sigma^2 + n\beta^2)(f(\bar{X}_n) - 1) - n\right)}{(\sigma^2 + n\beta^2)^2 \left(f(\bar{X}_n) - 1\right)^2 (\bar{X}_n - \mu)^2}\right\}, \tag{24}
\]
where f(X̄n) = exp{n(X̄n − µ)²/(σ² + nβ²)}.

Proof. See the Appendix.

The posterior expectation of the BP/N model satisfies
\[
\lim_{\mu \to \pm\infty} E_{BP}(\lambda \mid \bar{X}_n) = \bar{X}_n; \qquad \lim_{\mu \to \bar{X}_n} E_{BP}(\lambda \mid \bar{X}_n) = \bar{X}_n. \tag{25}
\]
This can be shown simply by applying L'Hôpital's rule to the expression (23) for the posterior expectation, and it proves the robustness of Berger's prior coupled with the normal log-odds likelihood (see also Berger (1985)).
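Since (23)-(24) are closed-form, they are straightforward to code. The sketch below (ours; the settings are illustrative and avoid µ values so extreme that f(X̄n) overflows) evaluates them and checks the limits (25) numerically:

```python
# Closed-form BP/N posterior mean (23) and variance (24), plus a check
# of the limits (25): E_BP returns to xbar both as mu -> xbar and as
# the prior location moves far away.
import numpy as np

def bp_posterior_moments(xbar, mu, n, sigma2, beta2):
    s = sigma2 + n * beta2
    f = np.exp(n * (xbar - mu) ** 2 / s)
    mean = xbar + (2 * sigma2 * n * (xbar - mu) ** 2
                   - 2 * sigma2 * s * (f - 1)) / (n * (xbar - mu) * s * (f - 1))
    var = (sigma2 / n
           - (sigma2**2 / n**2) * 4 * n**2 * (xbar - mu)**2 * f
             / (s**2 * (f - 1)**2)
           + (sigma2**2 / n**2) * 2 * s * (f - 1) * (s * (f - 1) - n)
             / (s**2 * (f - 1)**2 * (xbar - mu)**2))
    return mean, var

xbar, n, sigma2, beta2 = 0.0, 10, 1.0, 1.0
for mu in [0.1, 1.0, 4.0, 8.0]:
    mean, var = bp_posterior_moments(xbar, mu, n, sigma2, beta2)
    print(f"mu = {mu:4.1f}:  E_BP = {mean:8.4f},  V_BP = {var:.4f}")
# The posterior mean rises with moderate conflict and then falls back
# toward xbar for large |mu - xbar|: bounded, robust influence.
```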
We also have the following result for a Cauchy prior (as a corollary of the theorem):
\[
\lim_{\alpha \to \pm\infty} E_{CN}(\lambda \mid \bar{X}_n) \approx \lim_{\alpha \to \pm\infty} \left[\bar{X}_n - \frac{2\sigma^2 (\bar{X}_n - \alpha)}{n\left(\beta^2 + (\bar{X}_n - \alpha)^2\right)}\right] = \bar{X}_n. \tag{26}
\]

5.1 The Intrinsic Prior as the Limit of Berger's Prior

It is a striking sort of synthesis result that the limit (as n → ∞, for d = σ²/n) of Berger's prior is the intrinsic prior (Berger and Pericchi (1996)). Define η = λ − µ and recall the standard intrinsic prior for a normal location parameter:
\[
\varphi(\eta) = \frac{1}{2\sqrt{\pi}} \cdot \frac{1 - \exp(-\eta^2)}{\eta^2}, \tag{27}
\]
and extend it to a scale family by defining
\[
\varphi(\eta; \sigma) = \frac{1}{\sigma}\, \varphi\!\left(\frac{\eta}{\sigma}\right).
\]
Then
\[
\lim_{d \to 0} \pi_{BP}(\eta; b, d) = \varphi(\eta; \sqrt{b}),
\]
as we will prove in the next section.

5.2 Bounds for Berger's Prior

In this section we develop upper and lower bounds for the density of Berger's prior. We then use these bounds to prove that the limiting case of Berger's prior is the intrinsic prior. First, define
\[
w(\nu; b, d) = \frac{1}{b + d - 2\nu d}.
\]
We will suppress the dependence on b and d unless there is a need to be explicit. With this notation the integral defining Berger's prior becomes
\[
\pi_{BP}(\eta) = \frac{1}{2\sqrt{\pi}} \int_0^1 \exp\left(-\eta^2\, w(\nu)\, \nu\right) w(\nu)^{1/2}\, d\nu.
\]
Next we multiply and divide by (wν)′, where the prime indicates the derivative with respect to ν. Then
\[
\pi_{BP}(\eta) = \frac{1}{2\sqrt{\pi}} \int_0^1 \frac{w(\nu)^{1/2}}{(w(\nu)\nu)'}\, \exp\left(-\eta^2\, w(\nu)\, \nu\right) (w(\nu)\nu)'\, d\nu. \tag{28}
\]
Note that
\[
\frac{w(\nu)^{1/2}}{(w(\nu)\nu)'} = \frac{(b + d - 2\nu d)^{3/2}}{b + d}.
\]
Therefore
\[
k_1(b, d) \equiv \frac{(b - d)^{3/2}}{b + d} \;\le\; \frac{w(\nu)^{1/2}}{(w(\nu)\nu)'} \;\le\; (b + d)^{1/2} \equiv k_2(b, d).
\]
It follows that
\[
\pi_{BP}(\eta) \le \frac{k_2(b, d)}{2\sqrt{\pi}} \int_0^1 \exp\left(-\eta^2\, w(\nu)\, \nu\right) (w(\nu)\nu)'\, d\nu \tag{29}
\]
\[
= \frac{k_2(b, d)}{2\sqrt{\pi}\, \eta^2} \left(1 - \exp\left(-\frac{\eta^2}{b - d}\right)\right), \tag{30}
\]
by computing the integral in (29). Similarly, by applying the lower bound k1(b, d) in the integral (29) and reversing the direction of the inequality,
\[
\pi_{BP}(\eta) \ge \frac{k_1(b, d)}{2\sqrt{\pi}\, \eta^2} \left(1 - \exp\left(-\frac{\eta^2}{b - d}\right)\right).
\]
To summarize,
\[
k_1(b, d)\, \psi(\eta; b, d) \;\le\; \pi_{BP}(\eta; b, d) \;\le\; k_2(b, d)\, \psi(\eta; b, d), \tag{31}
\]
where
\[
\psi(\eta; b, d) = \frac{1 - \exp\left(-\dfrac{\eta^2}{b - d}\right)}{2\sqrt{\pi}\, \eta^2}.
\]
Note that as d → 0 the terms k1(b, d) and k2(b, d) converge to √b. Therefore the upper and lower bounds on πBP(η; b, d) converge to the intrinsic prior scaled by √b. Also, these bounds suggest that one could construct an efficient accept-reject algorithm for sampling from Berger's prior by using the intrinsic prior as a proposal density.

Notice that the intrinsic prior was obtained by a completely unrelated method (Berger and Pericchi (1996)). It was originally obtained as the implicit prior to which the arithmetic intrinsic Bayes factor converges. The intrinsic Bayes factor is derived within an approach to objective model selection. It is pleasant that it coheres with robust Bayesian reasoning. The intrinsic prior does not yield closed-form results with the normal likelihood. The next theorem generalizes the Polynomial Tails Comparison Theorem to prove that the intrinsic prior, as well as the Cauchy prior, is robust.
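The sandwich bound (31) is easy to verify numerically. The sketch below (our own check, with illustrative hyper-parameter values) compares a quadrature evaluation of πBP(η) with k1ψ and k2ψ:

```python
# Numerical check of the sandwich bound (31):
# k1 * psi(eta) <= pi_BP(eta) <= k2 * psi(eta).
import numpy as np
from scipy.integrate import quad

b, d = 1.0, 0.1   # illustrative hyper-parameters with b >= d

def pi_bp(eta):
    """Berger's prior density at eta, by quadrature over nu in (0, 1)."""
    w = lambda nu: 1.0 / (b + d - 2 * nu * d)
    integrand = lambda nu: np.exp(-eta**2 * w(nu) * nu) * np.sqrt(w(nu))
    return quad(integrand, 0, 1)[0] / (2 * np.sqrt(np.pi))

def psi(eta):
    return (1 - np.exp(-eta**2 / (b - d))) / (2 * np.sqrt(np.pi) * eta**2)

k1 = (b - d) ** 1.5 / (b + d)
k2 = np.sqrt(b + d)
for eta in [0.5, 1.0, 2.0, 5.0]:
    lo, mid, hi = k1 * psi(eta), pi_bp(eta), k2 * psi(eta)
    print(f"eta = {eta}:  {lo:.4f} <= {mid:.4f} <= {hi:.4f}")
```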
6 Generalized Polynomial Tails Comparison Theorem

We begin by reviewing the symbols O, Ω, and Θ used to denote asymptotic order, extending the notation used in Section 2. We say f(λ) = O(g(λ)) if there exist positive constants M and C such that for all λ > M, f(λ) ≤ C g(λ). Similarly, we say f(λ) = Ω(g(λ)) if there exist positive constants M and C such that for all λ > M, f(λ) ≥ C g(λ). One could read O as "eventually bounded above by a multiple of" and Ω as "eventually bounded below by a multiple of." Finally, we say f(λ) = Θ(g(λ)) if f(λ) = O(g(λ)) and f(λ) = Ω(g(λ)).

Let f(λ) be any bounded likelihood function such that as m → ∞
\[
\int_{|\lambda| > m} f(\lambda)\, d\lambda = O(m^{-d-\varepsilon}) \tag{32}
\]
for positive constants d and ε. In particular, note that this condition is satisfied by the binomial likelihood in logistic form as long as at least one success and at least one failure have been observed. The condition also holds for any likelihood function with exponentially decreasing tails.

Let p(λ) be a continuous, symmetric distribution. (The assumption of symmetry is not essential, but the distributions we are most interested in are symmetric, and the assumption simplifies the presentation.) We may extend p to a location-scale family as
\[
p(\lambda; \mu, \sigma) = \frac{1}{\sigma}\, p\!\left(\frac{\lambda - \mu}{\sigma}\right).
\]
Assume that as λ → ∞, p(λ) = Θ(λ^{−d}) and p′(λ; µ) = O(λ^{−d−1}), where p′ is the derivative of p with respect to λ. We will show later that the Student-t family, Berger's prior, and the intrinsic prior all satisfy these two conditions.

We are now ready to state and prove the generalized polynomial tails comparison theorem. Denote by π^P(λ | data) and π^U(λ | data) the posterior densities employing the prior p(λ; µ, σ) and the uniform prior respectively. Applying Bayes' rule to both densities yields, for any parameter value λ0, the ratio
\[
\frac{\pi^U(\lambda_0 \mid \text{data})}{\pi^P(\lambda_0 \mid \text{data})}
= \frac{\int_{-\infty}^{\infty} f(\lambda)\, p(\lambda; \mu, \sigma)\, d\lambda}{p(\lambda_0; \mu, \sigma) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda}.
\]

Theorem 6.1. For fixed λ0,
\[
\lim_{\mu \to \infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\, p(\lambda; \mu, \sigma)\, d\lambda}{p(\lambda_0; \mu, \sigma) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda} = 1.
\]

Proof. Since our only assumptions on the prior p(λ; µ, σ) involve the asymptotic order of p and its derivative, and since these assumptions are not affected by a scaling factor σ, we may assume σ = 1 and drop σ from our notation. We will show that
\[
\lim_{\mu \to \infty} \frac{\int_{-\infty}^{\infty} f(\lambda)\, p(\lambda; \mu)\, d\lambda \;-\; p(\lambda_0; \mu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda}{p(\lambda_0; \mu) \int_{-\infty}^{\infty} f(\lambda)\, d\lambda} = 0. \tag{33}
\]
Note that the numerator may be written as
\[
\int_{-\infty}^{\infty} f(\lambda)\,\left(p(\lambda; \mu) - p(\lambda_0; \mu)\right) d\lambda.
\]
We break the region of integration in the numerator into two parts, |λ| < µ^k and |λ| > µ^k, for some 0 < k < 1 to be chosen later, and show that as µ → ∞ each integral goes to zero faster than the denominator.

First consider
\[
\int_{|\lambda| < \mu^k} f(\lambda)\,\left(p(\lambda; \mu) - p(\lambda_0; \mu)\right) d\lambda.
\]
By the fact that p(λ; µ) = p(λ − µ) and the mean value theorem we have
\[
\left|\int_{|\lambda| < \mu^k} f(\lambda)\,\left(p(\lambda; \mu) - p(\lambda_0; \mu)\right) d\lambda\right|
\le \int_{|\lambda| < \mu^k} f(\lambda)\, \left|p'(\xi(\lambda))\right| \left|\lambda - \lambda_0\right| d\lambda, \tag{34}
\]
where −µ^k < λ, λ0 < µ^k and each ξ(λ) is between λ − µ and λ0 − µ. Therefore ξ(λ) = O(µ) and p′(ξ(λ)) = O(µ^{−d−1}). The term |λ − λ0| is O(µ^k), and so the integral (34) is O(µ^{k−d−1}). The denominator of (33) is Ω(µ^{−d}), and so the contribution of (34) to (33) is O(µ^{k−d−1})/Ω(µ^{−d}) = O(µ^{k−1}). Since k < 1, this term goes to zero as µ → ∞.

Next consider
\[
\int_{|\lambda| > \mu^k} f(\lambda)\,\left(p(\lambda; \mu) - p(\lambda_0; \mu)\right) d\lambda. \tag{35}
\]
The term (p(λ; µ) − p(λ0; µ)) is bounded, and so by the assumption on the tails of the likelihood function f, the integral (35) is of order O((µ^k)^{−d−ε}) = O(µ^{−k(d+ε)}). Therefore the contribution of the integral (35) to the ratio (33) is O(µ^{d−k(d+ε)}), and so this term goes to 0 as µ → ∞ provided k > d/(d + ε); since d/(d + ε) < 1, such a k in (0, 1) exists.
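As with Theorem 2.1, the conclusion can be checked numerically. The sketch below (ours; data values illustrative) uses the intrinsic prior (27) as p and a binomial likelihood:

```python
# Numerical illustration of Theorem 6.1 using the intrinsic prior (27)
# as p(lambda) and a binomial likelihood in the log-odds.
import numpy as np
from scipy.integrate import quad

n, x_plus = 20, 16
lam_hat = np.log(x_plus / (n - x_plus))
lam0 = 1.0  # arbitrary fixed evaluation point
log_c = x_plus * lam_hat - n * np.log1p(np.exp(lam_hat))

def lik(lam):
    """Binomial likelihood kernel, scaled to 1 at the MLE for stable quadrature."""
    return np.exp(x_plus * lam - n * np.log1p(np.exp(lam)) - log_c)

def intrinsic(lam, mu):
    eta = lam - mu
    if abs(eta) < 1e-8:                      # removable singularity at eta = 0
        return 1.0 / (2 * np.sqrt(np.pi))
    return (1 - np.exp(-eta**2)) / (2 * np.sqrt(np.pi) * eta**2)

denom_f = quad(lik, -50, 50)[0]
for mu in [2, 10, 50, 250]:
    num = quad(lambda l: lik(l) * intrinsic(l, mu), -50, 50)[0]
    print(f"mu = {mu:4.0f}   ratio = {num / (intrinsic(lam0, mu) * denom_f):.4f}")
# The ratio tends to 1: under strong prior/data conflict the intrinsic
# prior behaves like a uniform prior, as the theorem asserts.
```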
Next we show that the Student-t family, the intrinsic prior, and Berger's prior all satisfy the conditions of the Generalized Polynomial Tails Comparison Theorem. Clearly the tails of a Student-t distribution with ν degrees of freedom are Θ(λ^{−1−ν}). Also, the intrinsic prior and Berger's prior are clearly Θ(λ^{−2}) in the tails. The derivative conditions remain to be demonstrated.

The density of a Student-t is proportional to
\[
\left(1 + \frac{\lambda^2}{\nu}\right)^{-(\nu+1)/2}
\]
and so its derivative is proportional to
\[
-\frac{(1 + \nu)\lambda}{\nu} \left(1 + \frac{\lambda^2}{\nu}\right)^{-(3+\nu)/2},
\]
which is of order O(λ^{−2−ν}). For the intrinsic prior, the asymptotic order is determined by the λ^{−2} term, the factor 1 − exp(−λ²) being essentially 1 in the tails. Therefore the asymptotic order of the derivative of the tails is λ^{−3}.

Showing that the derivative of Berger's prior has the necessary asymptotic order is more involved. By differentiating inside the integral defining Berger's prior, we have
\[
\frac{d}{d\lambda}\, \pi_{BP}(\lambda) = -\frac{\lambda}{\sqrt{\pi}} \int_0^1 \exp\left(-w\lambda^2\nu\right) w^{3/2}\, \nu\, d\nu.
\]
Next we multiply and divide by the derivative with respect to ν of wνλ² and define
\[
M = \sup_{0 \le \nu \le 1} \frac{1}{\sqrt{\pi}} \cdot \frac{w^{3/2}}{(w\nu)'} = \frac{1}{\sqrt{\pi(b + d)}}.
\]
Then
\[
|\pi_{BP}'(\lambda)| \le \frac{M}{\lambda} \int_0^1 \nu \exp\left(-w\lambda^2\nu\right) \left(w\lambda^2\nu\right)' d\nu.
\]
Next, we integrate by parts, showing that the right hand side above equals
\[
\frac{M}{\lambda}\left(\int_0^1 \exp\left(-w\lambda^2\nu\right) d\nu - \exp\left(-w(1)\lambda^2\right)\right)
\le \frac{M}{\lambda}\left(\int_0^1 \exp\left(-w\lambda^2\nu\right) d\nu\right).
\]
We can show that
\[
\int_0^1 \exp\left(-w\lambda^2\nu\right) d\nu = O(\lambda^{-2})
\]
by an argument similar to that used to establish the bounds on the tails of πBP(λ). Therefore π′BP(λ) = O(λ^{−3}), and so πBP satisfies the requirements of the Generalized Polynomial Tails Comparison Theorem.

Figures 6 and 7 display the qualitative forms of dependence of the posterior mean and variance on the discrepancy between the prior location parameter and the observed sample mean, for n = 10 and β² = σ² = 1. The posterior expectation and variance are shown as functions of the discrepancy |µ − X̄n|. Figure 6 shows that the posterior expectations with a Cauchy prior and with Berger's prior are very similar. In both posterior models the posterior expectation has a bounded influence. On the other hand, Figure 7 shows that the variances have the same qualitative form, but the variance with the Cauchy prior is smaller when µ tends to X̄n. We argue that the variance with Berger's prior is preferable to that with the Cauchy in this example. Finally, if we consider a normal prior for this analysis, then the posterior variance is constant in |µ − X̄n| and equal to 0.09.

Figure 6: Behavior of the posterior expectation: E_BP(λ|X̄n) in the BP/N, E_CN(λ|X̄n) in the C/N and E_NN(λ|X̄n) in the N/N model, for values n = 10, X̄n = 0 and β = σ = 1.

Figure 7: Behavior of the posterior variance: V_BP(λ|X̄n) in the BP/N, V_CN(λ|X̄n) in the C/N and V_NN(λ|X̄n) in the N/N model, for values n = 10, X̄n = 0 and β = σ = 1.
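The qualitative behavior in Figures 6 and 7 can be reproduced with a few lines of code. In the sketch below (ours), the C/N mean is computed by quadrature rather than by the Pericchi-Smith approximations, the BP/N mean uses the closed form (23), and the N/N mean uses a unit prior weight for illustration:

```python
# Posterior means for the N/N, C/N and BP/N models as the prior/data
# discrepancy grows (settings of Figures 6-7: n = 10, beta = sigma = 1).
import numpy as np
from scipy.integrate import quad

n, sigma2, beta2, xbar = 10, 1.0, 1.0, 0.0

def e_nn(mu, n0=1.0):
    # Normal prior with unit prior weight: unbounded, linear influence.
    return (n0 * mu + n * xbar) / (n0 + n)

def e_cn(mu):
    kern = lambda l: np.exp(-n * (xbar - l)**2 / (2 * sigma2)) \
                     / (beta2 + (l - mu)**2)
    m0 = quad(kern, -20, 20)[0]
    return quad(lambda l: l * kern(l), -20, 20)[0] / m0

def e_bp(mu):
    s = sigma2 + n * beta2
    f = np.exp(n * (xbar - mu)**2 / s)
    return xbar + (2 * sigma2 * n * (xbar - mu)**2
                   - 2 * sigma2 * s * (f - 1)) / (n * (xbar - mu) * s * (f - 1))

for mu in [0.5, 1.0, 2.0, 5.0]:
    print(f"mu = {mu:4.1f}:  N/N {e_nn(mu):6.3f}   "
          f"C/N {e_cn(mu):6.3f}   BP/N {e_bp(mu):6.3f}")
# N/N drifts linearly with mu; C/N and BP/N rise and then fall back
# toward xbar, i.e., bounded influence.
```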
Example 6.1. Application of the BP/N model to Example 3.1. In this example Berger's prior has µ = −1.52 and β = 0.63. We must approximate the binomial likelihood by a normal distribution. For the likelihood (6), the Fisher information is In(λ) = n e^λ/(1 + e^λ)². In this example X̄n ∼ N(log(0.8/(1 − 0.8)), (1 + e^{1.38})²/(20 e^{1.38})), that is, X̄n ∼ N(1.38, 0.31). The posterior mean and variance of λ for the BP/N model are E_BP(λ|X̄n) = 1.16 and V_BP(λ|X̄n) = 0.33 respectively. These results are robust and very similar to those obtained with the Cauchy prior in the C/B model.

6.1 Application: the BP/N and C/N Models in a Clinical Trial

In this section we show an application of the C/N and BP/N models in a clinical trial.

Example 6.2. Bayesian analysis of a trial of the Rhesus Rotavirus-Based Quadrivalent Vaccine.

Reference: Pérez-Schael et al. (1997).

Study Design: Randomized, double-blind, placebo-controlled trial.

Aim of Study: To compare the rhesus rotavirus-based quadrivalent vaccine (a new drug that is highly effective in preventing severe diarrhea in developed countries) with placebo.

Outcome measure: Over approximately 19 to 20 months, episodes of gastroenteritis were evaluated at the hospital among infants. The outcome is the ratio of the odds of response (episode of gastroenteritis) following the new treatment to the odds of response on the conventional treatment: OR < 1 therefore favors the new treatment.

Statistical Models: Approximate normal likelihood and normal prior for the logarithm of the odds ratio. The Cauchy prior and Berger's prior use the same location parameter as the normal prior, and the Cauchy prior uses the same scale as the normal prior.

Prior Distribution: Based on a published trial, Joensuu et al. (1997), which showed that in Finland the vaccine had a high success rate in preventing severe rotavirus diarrhea. In this trial the primary efficacy analysis was based on children of whom 1128 received three doses of the rhesus rotavirus-based quadrivalent vaccine and 1145 received placebo. 100 episodes of gastroenteritis were severe: 8 among vaccine recipients and 92 among placebo recipients.

                                    Vaccine   Placebo   Total
    Episode of gastroenteritis            8        92     100
    Non-episode of gastroenteritis     1120      1053    2173
    Total                              1128      1145    2273

Table 2: Episodes of gastroenteritis in the Vaccine and Placebo groups, Finland.

Loss function or demands: None specified.

Computation/software: Conjugate normal analysis and the C/N and BP/N models.

Evidence from study: In this randomized, double-blind, placebo-controlled trial, 2207 infants received three oral doses of the rhesus rotavirus-based quadrivalent vaccine or placebo. The following data show the episodes of gastroenteritis in Venezuela.

                                    Vaccine   Placebo   Total
    Episode of gastroenteritis           70       135     205
    Non-episode of gastroenteritis     1042       960    2002
    Total                              1112      1095    2207

Table 3: Episodes of gastroenteritis in the Vaccine and Placebo groups, Venezuela.

Results: We use the normal approximation for binary data on the log-odds scale with the approximate standard error recommended in Spiegelhalter et al. (2004) for 2 × 2 tables, following their suggestion of a standard deviation of σ = 2 for the normal likelihood. In Table 4 the prior and the likelihood have standard deviations of σ/√n0 = 0.36 and σ/√n = 0.15 respectively. The posterior mean of the N/N model is (n0µ + nX̄n)/(n0 + n) = −0.99. We see that the standard errors of the C/N and BP/N posterior models are equal to that of the likelihood. The influence of the equivalent number of observations on the posterior standard error (σ/√(n0 + n)) is very high in the N/N model: n0 + n = 31 + 178 = 209, so the likelihood can be thought of as carrying around 178/31 ≈ 6 times as much information as the prior.
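The Table 4 inputs can be recovered from the two 2 × 2 tables. The sketch below (ours) computes the empirical log-odds ratios and Spiegelhalter-style effective sample sizes, adding 0.5 to each cell; the continuity correction is our assumption, chosen because it reproduces the reported locations −2.45 and −0.73.

```python
# Log-odds ratios and effective sample sizes for the vaccine trial.
# Prior: Finland (Table 2); likelihood: Venezuela (Table 3).
import numpy as np

def log_or_and_n(a, b, c, d, sigma=2.0, cc=0.5):
    """Log odds ratio, its SE, and the effective sample size n = sigma^2/SE^2."""
    a, b, c, d = a + cc, b + cc, c + cc, d + cc
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or, se, sigma**2 / se**2

# Cells: (vaccine events, placebo events, vaccine non-events, placebo non-events)
finland = log_or_and_n(8, 92, 1120, 1053)      # approx (-2.45, 0.36, 31)
venezuela = log_or_and_n(70, 135, 1042, 960)   # approx (-0.73, 0.15, ~170)
for name, (mu, se, n_eff) in [("Finland", finland), ("Venezuela", venezuela)]:
    print(f"{name:10s} log-OR = {mu:6.2f}, SE = {se:.2f}, n_eff = {n_eff:.0f}")
```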
The data of the current experiment (the Venezuelan trial) dominate the C/N and BP/N models, resulting in a posterior expectation much closer to the MLE.

                      Location                              Scale
            Prior   Norm. lik.  Posterior     Prior   Norm. lik.  Posterior
    N/N     -2.45     -0.73       -0.99        0.36      0.15       0.14
    C/N     -2.45     -0.73       -0.76        0.36      0.15       0.15
    BP/N    -2.45     -0.73       -0.76        0.36      0.15       0.15

Table 4: Exact and approximate moments of the N/N, C/N and BP/N models on the log-odds scale.

The expectations of the BP/N and C/N posterior models and the MLE are approximately equal. We can see in Table 5 that the N/N, C/N and BP/N posterior models are all in favor of the vaccine (OR < 1). However, the risk reduction in the N/N model is 63% (the estimated odds ratio is around e^{−0.99} = 0.37), while in the C/N and BP/N models it is around 53% (in the normalized likelihood, 52%). The credible intervals of the C/N and BP/N posterior models are closely tied to the data in the trial.

                              OR     95% Credible Interval (OR Scale)
    N/N                      0.37          [0.28; 0.49]
    C/N                      0.47          [0.35; 0.63]
    BP/N                     0.47          [0.35; 0.63]
    Normalized likelihood    0.48          [0.36; 0.65]

Table 5: Odds ratio and credible interval for each posterior model.

Finally, the discrepancies between the expectations of the posterior models and the MLE are 1.86 standard errors for N/N and 0.2 for C/N and BP/N. This case dramatically illustrates the danger of assuming a conjugate prior in clinical trials. Figures 8 and 9 show the posterior distributions obtained in the conjugate and non-conjugate analyses. We see that the prior distribution receives more weight in the N/N model. The C/N posterior model is very similar to the normalized likelihood. In Figure 9 the posterior distributions for the C/N and BP/N models are almost the same. The results in the N/N model are suspect because the posterior mean is far from the likelihood, given the conflict between the Finnish and the Venezuelan data. Incidentally, the researchers concluded that the Finnish and the Venezuelan responses were qualitatively different, given the different levels of exposure of the children to the virus. In short, the robust analyses give a sensible answer, while the conjugate analysis myopically insists that Finland and Venezuela are quite similar with respect to children's responses. On the other hand, if the two cases were indeed similar, without a drastic conflict in responses, then the robust analyses would give answers quite similar to the conjugate analysis, with high-precision conclusions. In other words, the use of robust priors makes Bayesian answers adaptive to potential conflicts between current data and previous trials.

Figure 8: Prior (Finland), normalized likelihood (Venezuela) and posterior distribution on the log-odds scale in the Bayesian analysis of the Rhesus Rotavirus-Based Quadrivalent Vaccine trial, for the N/N model.

Figure 9: Berger and Cauchy priors (Finland), normalized likelihood (Venezuela) and posterior distributions on the log-odds scale in the Bayesian analysis of the Rhesus Rotavirus-Based Quadrivalent Vaccine trial, for the C/N and BP/N models.
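The odds ratios and intervals in Table 5 follow from the Table 4 moments by exponentiating normal-approximation intervals on the log-odds scale. A quick sketch of this step (ours, assuming mean ± 1.96 sd intervals):

```python
# Odds ratios and 95% intervals (Table 5) from the log-odds moments
# (Table 4), assuming approximately normal posteriors.
import math

models = {                    # (posterior mean, posterior sd) on log-odds scale
    "N/N":  (-0.99, 0.14),
    "C/N":  (-0.76, 0.15),
    "BP/N": (-0.76, 0.15),
    "Normalized likelihood": (-0.73, 0.15),
}
for name, (m, s) in models.items():
    lo, hi = math.exp(m - 1.96 * s), math.exp(m + 1.96 * s)
    print(f"{name:22s} OR = {math.exp(m):.2f}  95% CI = [{lo:.2f}; {hi:.2f}]")
```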
7 Conclusions

The issues discussed in this paper have led us to the following conclusions.

1) The Cauchy prior in the Cauchy/binomial model is robust, but the beta prior in the conjugate beta/binomial model for inference on the log-odds is not. We can use the Cauchy/binomial model in clinical trials to make robust predictions for binary data.

2) Simulation of the moments in the Cauchy/binomial model reveals that the approximation performs well over the range n ≥ 10. Furthermore, we can use rejection sampling with either large or small sample sizes for exact results.

3) Berger's prior is very useful in clinical trials for robust estimation, since it gives closed-form exact results (when the normal log-odds likelihood is employed) and at the same time does not have the defects of conjugate priors. It can be argued that, besides computational convenience, it is superior to the Cauchy as a robust prior, because the posterior variance does not decrease as much as with the Cauchy when the assessed prior scales are equal or close; see Figure 7. Berger's prior seems more cautious.

4) In more complex situations, with several different centers modeled with a hierarchical structure, the use of robust priors may be even more important. This will be explored elsewhere.

5) The use of prior information in terms of robust (and non-conjugate) priors will be much more acceptable to both researchers and regulatory agencies, because the prior cannot dominate the likelihood when the data conflict with the prior. Remember the archetypal criticism of "Bayesian" analysis: "With Bayes, you can get the results you want, by changing your prior!" This should instead say: "With conjugate Bayes, you can get the results you want, by changing your prior!"

Appendix

1 Proofs of Result 3.1

1.1 Cauchy Prior

Proof. Invoking the Polynomial Tails Comparison Theorem, we can use the uniform prior instead of the Cauchy prior as α → ±∞ for the binomial likelihood (assuming that 0 < X+ < n), so the moment generating function for the C/B model is
\[
\lim_{\alpha \to \pm\infty} E_C(e^{t\lambda} \mid X_+) = \frac{\int_{-\infty}^{\infty} \exp\left\{X_+\lambda - n\log(1 + e^{\lambda}) + t\lambda\right\} d\lambda}{\int_{-\infty}^{\infty} \exp\left\{X_+\lambda - n\log(1 + e^{\lambda})\right\} d\lambda}. \tag{36}
\]
After the transformation λ = log(θ/(1 − θ)), (36) becomes
\[
\lim_{\alpha \to \pm\infty} E_C(e^{t\lambda} \mid X_+) = \frac{\Gamma(X_+ + t)\,\Gamma(n - X_+ - t)}{\Gamma(X_+)\,\Gamma(n - X_+)}, \tag{37}
\]
hence
\[
\lim_{\alpha \to \pm\infty} E_C(\lambda \mid X_+) = \Psi(X_+) - \Psi(n - X_+). \tag{38}
\]
The approximation of the digamma function (see Abramowitz and Stegun (1970)) is
\[
\Psi(z) \approx \log(z) - \frac{1}{2z} - O(z^{-2}), \tag{39}
\]
hence
\[
\lim_{\alpha \to \pm\infty} E_C(\lambda \mid X_+) \approx \log\left(\frac{\bar{X}_n}{1 - \bar{X}_n}\right) - \frac{1}{2X_+} + \frac{1}{2(n - X_+)} - O(X_+^{-2}) + O((n - X_+)^{-2}). \tag{40}
\]
Now we show that the limit in (36) exists. Consider the following positive real-valued functions of a real variable, defined by the equations
\[
F(\lambda, t) = \frac{\exp\{(X_+ + t)\lambda\}}{(1 + e^{\lambda})^n}; \qquad f(\lambda) = \frac{\exp(X_+\lambda)}{(1 + e^{\lambda})^n}; \qquad \tau(\lambda) = \frac{1}{\beta^2 + \lambda^2}, \tag{41}
\]
where X+, n ∈ N, n ≥ 2, X+ ≥ 1, and β is a positive constant. We prove that the convolutions F ∗ τ and f ∗ τ, defined respectively by the equations
\[
\int_{\mathbb{R}} F(\lambda)\, \tau(\alpha - \lambda)\, d\lambda = \int_{-\infty}^{\infty} \frac{\exp\{X_+\lambda - n\log(1 + e^{\lambda}) + t\lambda\}}{\beta^2 + (\lambda - \alpha)^2}\, d\lambda, \tag{42}
\]
\[
\int_{\mathbb{R}} f(\lambda)\, \tau(\alpha - \lambda)\, d\lambda = \int_{-\infty}^{\infty} \frac{\exp\{X_+\lambda - n\log(1 + e^{\lambda})\}}{\beta^2 + (\lambda - \alpha)^2}\, d\lambda, \tag{43}
\]
are finite. For λ ∈ (−∞, ∞) we have
\[
|F(\lambda)\, \tau(\alpha - \lambda)| = \left|\frac{\exp\{(X_+ + t)\lambda\}}{(1 + e^{\lambda})^n \left(\beta^2 + (\alpha - \lambda)^2\right)}\right|
\le \beta^{-2}\, \frac{\left|\exp\{(X_+ + t)\lambda\}\right|}{\left|\exp(n\lambda)\right|}
\le \frac{\exp\{(t - s)|\lambda|\}}{\beta^2} = g(\lambda),
\]
so |F(λ) τ(α − λ)| is dominated by the function g(λ), which belongs to L¹(R) if t − s < 0 (where s = n − X+ ≥ 1). Therefore F ∗ τ < ∞. A similar argument shows
\[
|f(\lambda)\, \tau(\alpha - \lambda)| \le \frac{\exp\{-s|\lambda|\}}{\beta^2}, \tag{44}
\]
and thus f ∗ τ < ∞.
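As a sanity check on (38) (our own computation, with the Example 3.1 numbers), the exact C/B posterior mean under an increasingly distant Cauchy location can be compared with Ψ(X+) − Ψ(n − X+):

```python
# Check of (38): as the Cauchy location alpha moves away, the C/B
# posterior mean approaches digamma(X+) - digamma(n - X+).
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma

n, x_plus, beta = 20, 16, 0.69
lam_hat = np.log(x_plus / (n - x_plus))
log_c = x_plus * lam_hat - n * np.log1p(np.exp(lam_hat))

def post_mean(alpha):
    """C/B posterior mean by quadrature, with the kernel scaled to be O(1)."""
    scale = beta**2 + (lam_hat - alpha) ** 2
    kern = lambda l: np.exp(x_plus * l - n * np.log1p(np.exp(l)) - log_c) \
                     * scale / (beta**2 + (l - alpha) ** 2)
    return quad(lambda l: l * kern(l), -40, 40)[0] / quad(kern, -40, 40)[0]

target = digamma(x_plus) - digamma(n - x_plus)
for alpha in [0, -10, -100, -1000]:
    print(f"alpha = {alpha:6.0f}:  E_C = {post_mean(alpha):.4f}")
print(f"limit Psi(16) - Psi(4)   = {target:.4f}")
```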
1.2 Conjugate Prior

Proof. We have E_B(λ) → ∞ as a → ∞ and E_B(λ) → −∞ as b → ∞. The approximation of the posterior expectation for the conjugate beta/binomial model is
\[
E_B(\lambda \mid X_+) \approx \log\left(\frac{n\bar{X}_n + a}{n(1 - \bar{X}_n) + b}\right) - \frac{1}{2(n\bar{X}_n + a)} + \frac{1}{2(n(1 - \bar{X}_n) + b)} - O\left((n\bar{X}_n + a)^{-2}\right) + O\left((n(1 - \bar{X}_n) + b)^{-2}\right),
\]
and E_B(λ | X+) → ∞ as a → ∞ and E_B(λ | X+) → −∞ as b → ∞.

2 Proof of Result 5.1 (Berger's Prior)

Proof. We make the change of variable η = λ − µ. With the normal likelihood
\[
f(\bar{X}_n \mid \eta) = \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{n}{2\sigma^2}\left(\eta - (\bar{X}_n - \mu)\right)^2\right\}, \tag{45}
\]
it follows that the predictive density satisfies the relation
\[
m(\bar{X}_n) = \int_0^1 \int_{-\infty}^{\infty} K \times \exp\left\{-\frac{n}{2}\, K_2\right\} d\eta\, d\nu, \tag{46}
\]
where
\[
K_2 = \frac{2\nu\eta^2}{\sigma^2(1 - 2\nu) + n\beta^2} + \frac{1}{\sigma^2}\left(\eta - (\bar{X}_n - \mu)\right)^2. \tag{47}
\]
The method of completing the square tells us that
\[
K_2 = \frac{\sigma^2 + n\beta^2}{\sigma^2\left(\sigma^2(1 - 2\nu) + n\beta^2\right)}\left[\eta - \frac{(\bar{X}_n - \mu)\left(\sigma^2(1 - 2\nu) + n\beta^2\right)}{\sigma^2 + n\beta^2}\right]^2 + \frac{2\nu(\bar{X}_n - \mu)^2}{\sigma^2 + n\beta^2}. \tag{48}
\]
The moment generating function of the posterior distribution (22) is given by
\[
E_{BP}(e^{t\eta} \mid \bar{X}_n) = \frac{\int_0^1 \int_{-\infty}^{\infty} K \times \exp\left\{-\frac{n}{2}\, K_3\right\} d\eta\, d\nu}{\int_0^1 \int_{-\infty}^{\infty} K \times \exp\left\{-\frac{n}{2}\, K_2\right\} d\eta\, d\nu}, \tag{49}
\]
where
\[
K_3 = \frac{2\nu\eta^2}{\sigma^2(1 - 2\nu) + n\beta^2} + \frac{1}{\sigma^2}\left(\eta - (\bar{X}_n - \mu)\right)^2 - \frac{2t}{n}(\eta + \mu). \tag{50}
\]
Hence the cumulant generating function of the posterior distribution (22) is given by
\[
K_{\eta \mid \bar{X}_n}(t) \propto \log\left[1 - \exp\left\{-\frac{n}{\sigma^2 + n\beta^2}\left(\bar{X}_n - \mu + \frac{t}{n}\right)^2\right\}\right] - 2\log\left(\bar{X}_n - \mu + \frac{t}{n}\right) + \frac{n}{2}\left(\bar{X}_n - \mu + \frac{t}{n}\right)^2 + t\mu. \tag{51}
\]

References

Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions. Applied Mathematics Series, volume 46. National Bureau of Standards.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, second edition.

Berger, J. O. and Pericchi, L. R. (1996). "The Intrinsic Bayes Factor for Model Selection and Prediction." JASA, 91: 112–115.

Carlin, B. P. and Louis, T. A. (1996). "Identifying Prior Distributions that Produce Specific Decisions, with Application to Monitoring Clinical Trials." In Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner, 493–503. New York: Wiley.

Carlin, B. P. and Sargent, D. J. (1996). "Robust Bayesian Approaches for Clinical Trial Monitoring." Statistics in Medicine, 15: 1093–1106.

Dawid, A. P. (1973). "Posterior expectations for large observations." Biometrika, 60: 664–667.

Evans, M. and Moshonov, H. (2006). "Checking for Prior-Data Conflict." Bayesian Analysis, 1: 893–914.

Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). "A weakly informative default prior distribution for logistic and other regression models." Annals of Applied Statistics, 2: 1360–1383.

Greenhouse, J. B. and Wasserman, L. A. (1995). "Robust Bayesian Methods for Monitoring Clinical Trials." Statistics in Medicine, 14: 1379–1391.

Gutierrez-Peña, E. (1997). "Moments for the Canonical Parameter of an Exponential Family Under a Conjugate Distribution." Biometrika, 84: 727–732.

Joensuu, J., Koskenniemi, E., Pang, X.-L., and Vesikari, T. (1997). "Randomised placebo-controlled trial of rhesus-human reassortant rotavirus vaccine for prevention of severe rotavirus gastroenteritis." The Lancet, 350: 1205–1209.

O'Hagan, A. (1979). "On outlier rejection phenomena in Bayes inference." JRSSB, 41: 358–367.

Pérez-Schael, I., Guntiñas, M. J., Pérez, M., Pagone, V., Rojas, A. M., González, R., Cunto, W., Hoshino, Y., and Kapikian, A. Z. (1997).
"Efficacy of the Rhesus Rotavirus-Based Quadrivalent Vaccine in Infants and Young Children in Venezuela." The New England Journal of Medicine, 337: 1181–1187.

Pericchi, L. R. and Sansó, B. (1995). "A note on bounded influence in Bayesian analysis." Biometrika, 82(1): 223–225.

Pericchi, L. R., Sansó, B., and Smith, A. F. M. (1993). "Posterior Cumulant Relationships in Bayesian Inference Involving the Exponential Family." JASA, 88: 1419–1426.

Pericchi, L. R. and Smith, A. F. M. (1992). "Exact and Approximate Posterior Moments for a Normal Location Parameter." JRSSB, 54: 793–804.

Smith, A. F. M. and Gelfand, A. E. (1992). "Bayesian Statistics Without Tears: A Sampling-Resampling Perspective." The American Statistician, 46: 84–88.

Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004). Bayesian Approaches to Clinical Trials and Health-Care Evaluation. London: Wiley.

Tierney, L. and Kadane, J. B. (1986). "Accurate Approximations for Posterior Moments and Marginal Densities." JASA, 81: 82–86.

Tierney, L., Kass, R. E., and Kadane, J. B. (1989). "Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions." JASA, 84: 710–716.

Funding

National Science Foundation (DMS-0604896 to LRP).

Acknowledgments

We thank Dr. María Eglée Pérez for helpful comments and several suggestions. JF was supported by the Institute of Statistics, College of Business Administration, University of Puerto Rico-RRP, and by the M. D. Anderson Cancer Center. LRP was on sabbatical leave from the University of Puerto Rico-RRP. Detailed comments by referees and editors were most useful in preparing the final version.

Bayesian Analysis (2009) 4, Number 4, pp. 847–850

Editor-in-Chief's Note

Bradley P. Carlin∗

∗ Department of Biostatistics, University of Minnesota, Minneapolis, MN, http://www.biostat.umn.edu/~brad
© 2009 International Society for Bayesian Analysis    DOI:10.1214/09-BA432

This issue of Bayesian Analysis, Volume 4 Number 4, is the twelfth and final one for which I have the privilege of serving as editor-in-chief (EiC); my three-year term (2007-09) is drawing to a close. It's been an enormously gratifying and enlightening run, so I'd like to take just a few paragraphs to say thanks, describe a bit of what we've accomplished as a journal in the past year or so, and mention where the journal is headed and what challenges and opportunities it's likely to face there.

As I type this, it is near the end of Thanksgiving Day in the United States, and it's impossible not to reflect on how thankful I am for the chance to serve as EiC, and for the many dedicated men and women who do all the real work of the journal, completely without financial compensation and at a time when ever-increasing pressures to further improve productivity encourage one to forego often-thankless volunteer work like editing and refereeing whenever possible. Simply put, I am eternally grateful to all the editors, associate editors, referees, and production staff who make each issue possible. It is dangerous to begin naming names in situations like this since one is sure to miss someone important, but I do want to mention a few key persons who have been around BA since the very beginning some 5 years ago. These include System Managing Editor Pantelis Vlachos, Managing Editor Herbie Lee, Deputy Editor Marina Vannucci, and Editors Philip Dawid, David Heckerman, Michael Jordan, and Fabrizio Ruggeri. Philip and Marina have decided to step down as part of the transition, and let me thank them at this time for their years of outstanding service.
I am hopeful that most if not all of the others are willing to continue, along with Production Editor Angelika van der Linde and Editors Kate Cowles, David Dunson, Antonietta Mira, and Bruno Sansó. All do wonderful work, as do all our AEs and referees. Thanks again.

2009 has been another good year for the journal. We will again publish about 850 pages, very similar to our page counts in 2007 and 2008. We submitted three consecutive issues of the journal to Thomson-Reuters as evidence that we merit inclusion in their indexing systems, and in October we received the good news that BA has been accepted into the Science Citation Index-Expanded (SCIE), including the Web of Science, the ISI Alerting Service, and Current Contents/Physical, Chemical and Earth Sciences (CC/PC&ES). We have been told that Thomson-Reuters will have sufficient source item and citation data to compute an impact factor for the next Journal Citation Reports (JCR), which is scheduled to be published in June 2010. Getting BA on the road to an impact factor (critically important these days, especially to our European contributors) was one of my primary goals as EiC, so I'm very pleased this got done before my term expired.

My other primary goal was to keep the flow of interesting discussion papers going, and here again I've been very pleased by the results. I have striven to select a wide range of papers for this very visible quarterly slot, from foundations to applications and even an occasional thought piece (such as Andrew Gelman's "April Fools' Day blog" paper in Volume 3 Number 3). The current issue's discussion paper is on a subject near and dear to my heart (baseball), and as usual features some state of the art Bayesian methods followed by a spirited question-and-answer period in the discussions and rejoinder. We also get a nice stream of potential discussion papers from the Case Studies in Bayesian Statistics (Carnegie Mellon) and other ISBA-related meetings throughout the year. Some in the profession have suggested opening the papers on the BA website to discussion by anyone, rather than permit only a few papers to be discussed by a few high-profile discussants selected by the EiC. I must confess I have been hesitant to change our current system, since I like the idea of one "special" paper per issue, plus selecting this paper and coordinating the discussions and rejoinders is perhaps the best part of the EiC job! But it's a good suggestion and one with which the next EiC and editorial board may choose to grapple.

Speaking of the new EiC, I am very pleased to announce to those who missed the email from ISBA President Mike West that it will be none other than Herbie Lee, the current (and founding) managing editor of BA. I am very pleased the search committee offered the position to Herbie, and that he accepted! It means an especially easy transition period, since Herbie's long tenure with the journal means he needs essentially no "training" of any kind. Herbie will no doubt want to bring on some new editors and AEs, and I'm confident the journal will remain strong under his leadership.

Of course, as I said, there will be challenges to greet Herbie and his team. One involves the online review system, which was constructed specifically for us several years ago, but is now beginning to show its age somewhat.
Other online review system products are out there that may offer advantages over ours in terms of flexibility and extensibility in the long run. One of these is already used by our institutional partner IMS, for whom BA is already an IMS Supported Journal and with whom ISBA already has a joint membership agreement and a variety of well-attended jointly sponsored conferences (including the MCMSki series, the next of which will be January 5-7, 2011). Expanding this IMS partnership may be most natural. A second issue continues to be finding a reliable revenue stream to support the journal. We now offer on-demand printing of issues for a small fee, but BA's free online availability has essentially precluded any significant sales revenue. Other arrangements may include adding advertising, or even affiliating with the Berkeley Electronic Press, which has a fairly long history of profitably running journals like ours. I happen to know BEP is interested in seeing this happen, but whether we should surrender our independence in such a dramatic way for a modest revenue stream is again something Herbie and the ISBA Board will need to ponder.

On that note, I close this editorial. Thanks again for the opportunity to serve as EiC, and for your support of the journal during my tenure. While my involvement with the journal will now shrink dramatically in order for me to free up the time necessary to chair my own biostatistics group at Minnesota, I will stay involved as a guest editor, helping the journal process the contributed papers from the upcoming Valencia 9/2010 ISBA World Meeting in Spain this June. I look forward to seeing many of you at that meeting, and as always, to your submissions at http://ba.stat.cmu.edu, and your more personal thoughts and reactions via email to mailto:brad@biostat.umn.edu.

Brad Carlin
Lincoln, Nebraska
Thanksgiving 2009