Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Detecting Outbreaks Using a Latent Field: Part I - Spatial Modeling

Cosmin Safta
Sandia National Laboratories
Livermore, CA
csafta@sandia.gov
&Jaideep Ray
Sandia National Laboratories
Livermore, CA
jairay@sandia.gov
\ANDWyatt Bridgman
Sandia National Laboratories
Livermore, CA
whbridg@sandia.gov
Abstract

In this paper, we develop a method to estimate the infection-rate of a disease, over a region, as a field that varies in space and time. To do so, we use time-series of case-counts of symptomatic patients as observed in the areal units that comprise the region. We also extend an epidemiological model, initially developed to represent the temporal dynamics in a single areal unit, to encompass multiple areal units. This is done using a (parameterized) Gaussian random field, whose structure is modeled using the dynamics in the case-counts, and which serves as a spatial prior, in the estimation process. The estimation is performed using an adaptive Markov chain Monte Carlo method, using COVID-19 case-count data collected from three adjacent counties in New Mexico, USA. We find that we can estimate both the temporal and spatial variation of the infection with sufficient accuracy to be useful in forecasting. Further, the ability to “borrow” information from neighboring areal units allows us to regularize the estimation in areal units with high variance (“poor quality”) data. The ability to forecast allows us to check whether the estimated infection-rate can be used to detect a change in the epidemiological dynamics e.g., the arrival of a new wave of infection, such as the fall wave of 2020 which arrived in New Mexico in mid-September 2020. We fashion a simple anomaly detector, conditioned on the estimated infection-rate and find that it performs better than a conventional surveillance algorithm that uses case-counts (and not the infection-rate) to detect the arrival of the same wave.

Keywords Gaussian random fields, Markov chain Monte Carlo, disease infection-rate, anomaly detection

1 Introduction

The infection-rate of a disease, especially a (human-to-human) communicable one, is perhaps the most concise distillation of the epidemiological dynamics of an outbreak. It waxes and wanes as a population’s mixing patterns change with the seasons or when a new variant arrives. It varies in space, modulated by risk factors viz., socioeconomic conditions, population density and demographic profile. It could potentially be a very informative quantity to monitor as part of disease surveillance, but is rarely ever done. This is because the infection-rate of an outbreak cannot be directly observed; instead, it has to be estimated, most commonly using a time-series of case-counts of patients (i.e., infected people who have tested positive). Depending on the quality of case-count data, which could have large reporting errors and display a considerable amount of variability if obtained from a small population where case-counts are low, the estimation of the infection-rate can be a difficult task.

Regardless of these difficulties, there have been many studies that estimate the infection-rate, particularly for the COVID-19 pandemic [1, 2, 3]. Our own work [4, 5, 6] parameterized a temporally-varying infection-rate and convolved it with the incubation period of COVID-19 to construct a disease model; when fitted to COVID-19 case-count data using Bayesian inference, it yielded parameters of the infection-rate model. This model could be used to provide 2-week-ahead forecasts of the behavior of the outbreak; when the observed data disagreed with the forecasts consistently, it indicated a change in epidemiological dynamics (e.g., the effect of lockdowns in California [6] or the start of the fall wave of COVID-19 in New Mexico [4]). All these studies aggregate case-counts over large populations (usually above 250,000) to reduce the variability in the observed case-counts and thus ease the estimation problem for the infection-rate. However, this aggregation can be problematic if performed over a large, sparsely populated region (e.g., the state of New Mexico, USA). The infection-rate estimated is necessarily an average over the regional population and may bear little resemblance to the local population if the population displays large spatial heterogeneity; this is certainly the case with New Mexico due to the presence of urban areas as well as remote, sparsely-populated desert counties. Since public health measures are often decided at the county-level, these regionally-averaged estimates of infection-rate are only used as a rough guide by public health professionals.

In this paper, we develop a method to estimate the infection-rate as a spatiotemporal field, described over areal units that comprise a region. Each areal unit supplies a time-series of case-counts for the estimation of the infection-rate field. For the purposes of this paper, we will use the COVID-19 outbreak in New Mexico (NM) and its counties as the test case, using data collected between June 1, 2020 and September 15, 2020; after September 15, the case-counts in NM steadily rose into the winter, an event we will refer to colloquially as the “Fall 2020” wave. Our approach is based on two key hypotheses. Our first premise is that the parameterized model for the time-varying infection-rate, as developed by Safta et al. [6], can be used to model the temporal evolution of the outbreak in each areal unit. This will lead to an inverse/estimation problem that will scale with the number of areal units and could quickly become intractable. Our second premise is that the spatial correlations in the epidemiological dynamics, as observed in the case-count data, can be fashioned into a random field model to regularize the high-dimensional field inversion and render it tractable. As part of this investigation, our method will be exposed to observational data of variable “quality”, from relatively low-variability observations from populous counties, such as Bernalillo, to high-variability low case-count data from smaller counties around it.

The development of the this method will require us to address the following research questions:

  • How does one fashion a random field model, from observational data of case-counts, to regularize the estimation problem for the infection-rate field?

  • How does one include the random field model into the estimation of the infection-rate field? Does its inclusion improve the quality of the estimated infection-rate vis-à-vis an estimation performed using data from a areal unit independently? In particular, for counties/areal units with poor quality data, does the inclusion of the random field model (i.e., incorporate the ability to “borrow” information from neighbors) improve the estimation of the infection-rate?

  • Can we use the estimated infection-rate to detect the arrival of the Fall 2020 wave in the counties of NM? How does it compare to a conventional outbreak-detector (specifically Höhle and Paul, 2008 [7])? In addition, in the absence of the Fall 2020 wave, does the use of the infection-rate lead to a false positive?

We will address the questions using data from three adjoining NM counties viz. Bernalillo, Santa Fe and Valencia. The inverse problem is sufficiently low-dimensional to be solved exactly using an adaptive Markov chain Monte Carlo (AMCMC; see Haario et al. [8]). A companion paper (see Ray et al. [9] for the technical report version) extends the method to all 33 counties (areal units) of NM, using mean-field Variational Inference to solve the inverse problem for the infection-rate approximately, as the problem becomes too high-dimensional for AMCMC.

The main contribution of the paper is in illustrating the use of random field models in inverse problems to yield local epidemiological information, using the spatial correlation extant in epidemiological dynamics (caused by population mixing) to compensate for high-variability in the case-count time-series observational data. A second contribution of the paper is to demonstrate that the information so obtained (in the form of a local infection-rate) contains actionable public health information; we will do so by detecting the arrival of the Fall 2020 wave. Note that we do not attempt to make a proper outbreak detector in this paper; that is left to future work. Also note that the use of random field models in disease mapping is well-established [10, 11]; however, these methods seek to only smooth observed case-count data rather than estimate the underlying infection-rate.

The paper is structured as follows. In § 2 we review existing literature on infection-rate estimation, the empirical construction and parameterization of random field models, especially in disease mapping, and how outbreak-detectors function. In § 4, we parameterize a Gaussian random field (GRF) model to represent spatial correlations in epidemiological dynamics and formulate a general inverse problem for the infection-rate. In § 5, we present the results of the infection-rate estimation, jointly for the three counties, and compare them with the results obtained from independent estimation. We also discuss how the estimated infection-rate performs in detecting the Fall 2020 wave, compared to conventional techniques (§ 6). We conclude in § 7.

2 Literature Review

Covariates and spatial autocorrelation in COVID-19 dynamics: Huang et al.  [12] analyzed the spatial relationship between the main environmental and meteorological factors and COVID-19 cases in Hubei province of China using a geographically weighted regression (GWR) model. Results suggest that the impacts of environmental and meteorological factors on the development of COVID-19 were not significant, something we also found in NM (see § 3). Their findings indicate that measures such as social distancing and isolation played the primary role in controlling the development of the COVID-19 epidemic. Geng et al.  [13] analyzed spatio-temporal patterns of COVID-19 infections at scales spanning from county to continental. They found that spatial evolution of COVID-19 cases in the United States followed multifractal scaling. A rapid increase in the spatial correlation was identified early in the outbreak (March to April 2020) followed by an increase at a slower rate until approaching the spatial correlation of human population. For this study, the multiphase COVID-19 epidemics were modeled by a kernel-modulated susceptible–infectious–recovered (SIR) algorithm. Schuler et al.  [14] employed a compartmental model for all 412 districts of Germany coupled with non-pharmaceutical intervention (NPI) models. They identify disease spread dynamics that corresponds to different spatial correlation levels, obtained via variogram estimation, between adjacent districts. McMahon et al.  [15] analyzed the spatial correlations of new active cases in the USA at the county level and showed that various stages of the epidemic are distinguished by significant differences in the correlation length. Their results indicate that the correlation length may be large even during periods when the number of cases declines and that correlations between urban centers were more significant than between rural areas. Rendana et al.  [16] analyzed the spatial distribution of COVID-19 cases, epidemic infection-rate, spatial pattern during the first and second waves in the South Sumatra Province of Indonesia. The study found little to no correlation between different regions. Air temperature, wind speed, and precipitation have contributed to the high epidemic infection-rate in the second wave. Indika et al.  [17] inspect the daily count data related to the total cases of COVID-19 in 93 counties in the state of Virginia using a Bayesian conditional autoregressive (CAR) modeling framework. The authors find that Moran statistic values at specific time points are impacted by, and linked to, the executive orders at the state level. In summary, there is some evidence that modeling of COVID-19 over small areal units might need to accommodate spatial auto-correlation, and might also require the inclusion of other covariates.

Random fields and disease maps: There is little literature on the use of a random field to estimate the infection-rate of a disease. However, the estimation of a latent field called relative risk r(𝒙)𝑟𝒙r(\bm{x})italic_r ( bold_italic_x ) is central to disease mapping. [18, 19] A disease map is a 2D plot of the risk of contracting a disease, computed from case-counts collected over areal units e.g., counties, that comprise a region e.g., a province. First, one obtains an “expected” value eisubscript𝑒𝑖e_{i}italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for the observed case counts yi(obs)subscriptsuperscript𝑦𝑜𝑏𝑠𝑖y^{(obs)}_{i}italic_y start_POSTSUPERSCRIPT ( italic_o italic_b italic_s ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for areal unit i𝑖iitalic_i, usually from a region-wide average of disease incidences and demographics. It is then locally adjusted (in space) using the relative risk field to bring is closer to observations i.e., yiobsPoisson(riei)similar-tosubscriptsuperscript𝑦𝑜𝑏𝑠𝑖Poissonsubscript𝑟𝑖subscript𝑒𝑖y^{obs}_{i}\sim{\rm Poisson}(r_{i}e_{i})italic_y start_POSTSUPERSCRIPT italic_o italic_b italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ roman_Poisson ( italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). The risk risubscript𝑟𝑖r_{i}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is then modeled as log(ri)=𝒛i𝜷+ϕisubscript𝑟𝑖subscript𝒛𝑖𝜷subscriptitalic-ϕ𝑖\log(r_{i})=\boldsymbol{z}_{i}\cdot\boldsymbol{\beta}+\phi_{i}roman_log ( italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ bold_italic_β + italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where 𝒛isubscript𝒛𝑖\boldsymbol{z}_{i}bold_italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are co-variate risk factors for areal unit i𝑖iitalic_i, 𝜷𝜷\boldsymbol{\beta}bold_italic_β are regression weights and ϕisubscriptitalic-ϕ𝑖\phi_{i}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT captures auto-correlated random effects in space using a random field model. The simplest random field model is iCAR (intrinsic Conditional AutoRegressive [18]), a specific type of Gaussian Markov Random Field (GMRF). Thus

ϕ={ϕi}𝒩(0,{τ2Q}1), Q=diag(W𝟏)W,formulae-sequenceitalic-ϕsubscriptitalic-ϕ𝑖similar-to𝒩0superscriptsuperscript𝜏2𝑄1 𝑄diag𝑊1𝑊\phi=\{\phi_{i}\}\sim{\mathcal{N}}\left(0,\{\tau^{2}Q\}^{-1}\right),\mbox{% \hskip 28.45274pt}Q={\rm diag}(W\boldsymbol{1})-W,italic_ϕ = { italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ∼ caligraphic_N ( 0 , { italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_Q } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) , italic_Q = roman_diag ( italic_W bold_1 ) - italic_W ,

where W𝑊Witalic_W is the adjacency matrix of the areal units (i.e., wij=1subscript𝑤𝑖𝑗1w_{ij}=1italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 if areal units i𝑖iitalic_i and j𝑗jitalic_j share a boundary). The object of estimation from data is τ2superscript𝜏2\tau^{2}italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. The precision matrix Q𝑄Qitalic_Q tends to be sparse. This formulation leads to an improper jont distribution for ϕitalic-ϕ\phiitalic_ϕ. The Besag-York-Mollie (BYM) model [20] overcomes this issue by extending iCAR as ϕ=ϕ1+ϕ2,ϕ1𝒩(0,{τ2Q}1)formulae-sequenceitalic-ϕsuperscriptitalic-ϕ1superscriptitalic-ϕ2similar-tosuperscriptitalic-ϕ1𝒩0superscriptsuperscript𝜏2𝑄1\phi=\phi^{1}+\phi^{2},\phi^{1}\sim{\mathcal{N}}(0,\{\tau^{2}Q\}^{-1})italic_ϕ = italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT + italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_ϕ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ∼ caligraphic_N ( 0 , { italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_Q } start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) and ϕ2𝒩(0,σ2I)similar-tosuperscriptitalic-ϕ2𝒩0superscript𝜎2𝐼\phi^{2}\sim{\mathcal{N}}(0,\sigma^{2}I)italic_ϕ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ). We will use a variation of BYM in our work. The objects of estimation from case-count data are (τ2,σ2)superscript𝜏2superscript𝜎2(\tau^{2},\sigma^{2})( italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). A second variation, called pCAR (proper CAR [21, 22]), modifies the precision matrix Q=diag(W𝟏)ρW𝑄diag𝑊1𝜌𝑊Q={\rm diag}(W\boldsymbol{1})-\rho Witalic_Q = roman_diag ( italic_W bold_1 ) - italic_ρ italic_W, where the objects of estimation are (τ2,ρ,σ2)superscript𝜏2𝜌superscript𝜎2(\tau^{2},\rho,\sigma^{2})( italic_τ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_ρ , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). The idea of a random field being used to smooth areal units in feature-space (as opposed to geometrical space) has also been developed using GMRF [23]. Such a method is useful for diseases like alcohol abuse where similarity of socioeconomic and health factors in areal units, rather than the geometric distance between them, are more relevant for smoothing. The difference lies in how Q𝑄Qitalic_Q is modeled using a similarity S𝑆Sitalic_S matrix [24].

Outbreak detectors: Outbreak detection functions primarily as anomaly detection in space and time [25]. The case-count at time t𝑡titalic_t, ytsubscript𝑦𝑡y_{t}italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, is often modeled as a normal random variate yt𝒩(μt,σt2)similar-tosubscript𝑦𝑡𝒩subscript𝜇𝑡superscriptsubscript𝜎𝑡2y_{t}\sim{\mathcal{N}}(\mu_{t},\sigma_{t}^{2})italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ); an alarm is raised if ytμt>κσtsubscript𝑦𝑡subscript𝜇𝑡𝜅subscript𝜎𝑡y_{t}-\mu_{t}>\kappa\sigma_{t}italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > italic_κ italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where κ𝜅\kappaitalic_κ is a threshold value adjusted to trade-off specificity and sensitivity of the detection. This approach can be considered as an expansion of Shewhart charts [26] and is sometimes referred to as “statistical process control” (SPC) methods. Methods differ on how (μt,σt)subscript𝜇𝑡subscript𝜎𝑡(\mu_{t},\sigma_{t})( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) are computed. Serfling [27] fitted historical data of case-counts from influenza outbreaks with a linear trend and trigonometric functions (to account for their seasonality) to obtain estimates (and forecasts of) (μt,σt)subscript𝜇𝑡subscript𝜎𝑡(\mu_{t},\sigma_{t})( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). A zero-mean Gaussian was assumed as a model for the fitting errors. The method is widely used and over time the linear and periodic components have been adapted for local conditions and specific diseases [28]. For outbreaks with low counts, this approach has been modified to use Poisson error models, where the log-mean is modeled as a function of time, much like Serfling’s method [29, 30]. Farrington’s widely used method [31] parallels Serfling’s approach, with linear and periodic trends, but the quasi-Poisson model accommodates the over-dispersion observed in epidemiological surveillance data as var(yt)=ϕμtvarsubscript𝑦𝑡italic-ϕsubscript𝜇𝑡{\rm var}(y_{t})=\phi\mu_{t}roman_var ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_ϕ italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where ϕitalic-ϕ\phiitalic_ϕ is estimated from the data. (μt,σt)subscript𝜇𝑡subscript𝜎𝑡(\mu_{t},\sigma_{t})( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) have also been modeled and forecast using time-series model [32] such as AutoRegressive Integrated Moving Average (ARIMA) but the surveillance time-series has to be first rendered stationary by subtracting out any trends and seasonality (which incurs errors). A comparison of ARIMA and SPC methods for detecting outbreaks showed that ARIMA methods were unremarkable in their ability to model surveillance data [33], due to non-stationarity and sparsity. Outbreaks detection can also be modeled as state-transition events and thus based on Hidden Markov Models [34] and Markov switching models [35, 36, 37]. Outbreak detection can also be formulated as a two-component model consisting of an endemic phase (modeled using a Poisson distribution) and an epidemic one (modeled using an autoregressive parameter). Both components are fitted to the data in a time-window around t𝑡titalic_t and a likelihood ratio test is used to evaluate which model fits better [38, 7]. This can be used to detect when an epidemic starts. We will use such a model [7] as a baseline in § 6.

Perhaps the investigations that are closest to ours, in modeling philosophy, are those by Lawson and collaborators[39, 40, 41]. Fundamentally, our approach consists of “stitching together” models meant for individual areal units[6, 4] via CAR models (specifically, the BYM model). Lawson and co-workers model case-counts directly, whereas we use a parametric model of a temporally-variable (and, in this paper, also spatially-variable) infection-rate field that is related to the case-counts via the incubation period distribution. The use of the incubation-period model (see § 4) makes our model computationally more expensive than the ones used by Lawson and collaborators. Case-counts, in Lawson’s formulation, are modeled using a Susceptible-Infected-Removed compartmental formalism with a one-lagged-in-time auto-correlation and a BYM CAR model to couple with adjoining areal units; the clearest description of the model is in Lawson and Song, 2010[39], which was applied to four counties in South Carolina. The same model was adapted to COVID-19 data from all counties of South Carolina[42] and the UK[43]. In an allied work, Lawson investigates, and selects between, various formulations of their basic model, as applied to COVID-19 data, with 1-day-ahead forecasting accuracy in mind; he finds no clear benefits between using a space-time versus a purely temporal model[40]. The group has also investigated, much like us, whether departures from forecasts could be used to detect anomalies within the context of epidemiological surveillance[44, 41].They devised metrics such as the Surveillance Kullback-Liebler[45] (SKL) and Surveillance Conditional Predictive ordinate[46] (SCPO) to monitor and detect outlier epidemiological behavior. Lawson and Kim[44] found that one needed to include a leading indicator/syndrome of epidemiological activity e.g., absenteeism, as a modeling covariate to detect epidemiological changes in a timely manner. A more methodologically-oriented paper[41] investigated whether Poisson or Negative Binomial (NB) distributions should be use to link the observed case-counts to the modeled values in a likelihood function. They found that the NB distribution provided better goodness-of-fits (perhaps because the two-parameter distribution is more flexible than Poisson) but for small datasets, Poisson provided more predictive forecasts. To summarize, one can use cases-counts directly for (spatio-temporal) model-based syndromic surveillance and there is some uncertainty over whether one should use Poisson or NB distributions to capture the stochasticity in the observation. However, the possibility of using a latent variable that might be better behaved, e.g., infection-rate, has not been investigated.

3 Exploratory Data Analysis

In this section we perform an exploratory data analysis on the COVID-19 data from New Mexico (NM), in order to design the spatial problem.

3.1 The COVID-19 Dataset

The COVID-19 dataset covers the duration from 2020-01-22 to 2022-05-13, and consists of daily (new) case-counts of COVID-19 from each of the 33 counties of NM; the data is available online. [47, 48] The 73 covariates (i.e., risk factors) of COVID-19 span demographics, socioeconomic information (income, business and home ownership etc.) and infrastructure. These were obtained from another group in Sandia National Laboratories and is described in their publication [49]; we provide a summary below. Demographic data on age distribution, gender, racial orgins, housing, family units and living arrangements, education, health etc. were obtained from US Census Bureau’s QuickFacts for New Mexico [50], representing 5-year estimates between 2014-2018 and the 2013-2017 American Community Survey estimates. Geographical information e.g., area of counties,population densities etc. were also obtained from the Census dataset. Infrastructure represents the resources needed by a county to operate, such as number of COVID testing sites, nursing homes and K-12 schools. [51, 52] Geospatial data was also extracted from University of New Mexico Earth Data Analysis Center which develops the Resource Geographic Information System [53]. In total, data was compiled from 40 sources, manually down-selected to 73 features and adjusted (when needed) to each county’s population.

3.2 Data Analysis

Let Yt={yt,1,yt,2,yt,R}subscript𝑌𝑡subscript𝑦𝑡1subscript𝑦𝑡2subscript𝑦𝑡𝑅Y_{t}=\{y_{t,1},y_{t,2},\ldots y_{t,R}\}italic_Y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_y start_POSTSUBSCRIPT italic_t , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_t , 2 end_POSTSUBSCRIPT , … italic_y start_POSTSUBSCRIPT italic_t , italic_R end_POSTSUBSCRIPT } be the vector of case-counts reported on day t𝑡titalic_t in each of the R𝑅Ritalic_R areal units (i.e., counties of NM). Let Yt={yt,1/p1,yt,2/p2,yt,R/pR}subscriptsuperscript𝑌𝑡subscriptsuperscript𝑦𝑡1subscript𝑝1subscriptsuperscript𝑦𝑡2subscript𝑝2subscriptsuperscript𝑦𝑡𝑅subscript𝑝𝑅Y^{\ast}_{t}=\{y^{\ast}_{t,1}/p_{1},y^{\ast}_{t,2}/p_{2},\ldots y^{\ast}_{t,R}% /p_{R}\}italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , 1 end_POSTSUBSCRIPT / italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , 2 end_POSTSUBSCRIPT / italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_R end_POSTSUBSCRIPT / italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT } be the vector of normalized cumulative case-counts over the duration (t90,t]𝑡90𝑡\left(t-90,t\right]( italic_t - 90 , italic_t ] i.e., yt,rsubscriptsuperscript𝑦𝑡𝑟y^{\ast}_{t,r}italic_y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT is the cumulative number of case-counts over the 90-day period (t90,t]𝑡90𝑡\left(t-90,t\right]( italic_t - 90 , italic_t ] for areal unit r𝑟ritalic_r and prsubscript𝑝𝑟p_{r}italic_p start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT is the areal unit’s population. The 90-day window is adopted to average out the effect of reporting errors, as well as to reduce the effect of low case-counts in some of the very sparsely populated desert counties of NM. We assume that the case-counts can be modeled as a linear function of risk factors i.e., Ytv0,t+[𝐙]𝐯tsubscriptsuperscript𝑌𝑡subscript𝑣0𝑡delimited-[]𝐙subscript𝐯𝑡Y^{\ast}_{t}\approx{v}_{0,t}+\left[{\mathbf{Z}}\right]{\mathbf{v}}_{t}italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≈ italic_v start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Z ] bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT where the kthsuperscript𝑘𝑡k^{th}italic_k start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT column of 𝐙𝐙{\mathbf{Z}}bold_Z contains the value of the kthsuperscript𝑘𝑡k^{th}italic_k start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT risk factor for all R𝑅Ritalic_R areal units and 𝐯t={vk,t},k=1Kformulae-sequencesubscript𝐯𝑡subscript𝑣𝑘𝑡𝑘1𝐾{\mathbf{v}}_{t}=\{v_{k,t}\},k=1\ldots Kbold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_v start_POSTSUBSCRIPT italic_k , italic_t end_POSTSUBSCRIPT } , italic_k = 1 … italic_K are their relative weights in time-window t𝑡titalic_t. The risk factors 𝐙𝐙{\mathbf{Z}}bold_Z are constant in time but vary between areal units. In disease mapping terms, the model 𝐯0,t+[𝐙]𝐯tsubscript𝐯0𝑡delimited-[]𝐙subscript𝐯𝑡{\mathbf{v}}_{0,t}+\left[{\mathbf{Z}}\right]{\mathbf{v}}_{t}bold_v start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Z ] bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT provides the expected value of Ytsubscriptsuperscript𝑌𝑡Y^{\ast}_{t}italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and any deviations would be deemed “random”, to be modeled statistically.

Some of the risk factors are very correlated and thus carry little independent information, and consequently we simplify the model via sparse Principal Component Analysis [54] (PCA) to a set of principal component [𝚽]={ϕk}delimited-[]𝚽subscriptitalic-ϕ𝑘\left[{\mathbf{\Phi}}\right]=\{\phi_{k}\}[ bold_Φ ] = { italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } to remove unnecessary risk factors i.e., Ytv0,t+[𝐙]𝐯tw0,t+[𝚽]𝐰tsubscriptsuperscript𝑌𝑡subscript𝑣0𝑡delimited-[]𝐙subscript𝐯𝑡subscript𝑤0𝑡delimited-[]𝚽subscript𝐰𝑡Y^{\ast}_{t}\approx{v}_{0,t}+\left[{\mathbf{Z}}\right]{\mathbf{v}}_{t}\approx{% w}_{0,t}+\left[{\mathbf{\Phi}}\right]{\mathbf{w}}_{t}italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≈ italic_v start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Z ] bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≈ italic_w start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Φ ] bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Note that the principal components ϕksubscriptitalic-ϕ𝑘\phi_{k}italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT from sparse PCA do not form an orthogonal basis set. We see from the scree plot in Fig. 9 (in the Appendix) that K=10𝐾10K=10italic_K = 10 is sufficient to explain 95% of the variation in Ytsubscriptsuperscript𝑌𝑡Y^{\ast}_{t}italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. Further, sparse PCA constructs ϕksubscriptitalic-ϕ𝑘\phi_{k}italic_ϕ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT using the most important risk factors. The main components of the sparse PCA modes are percent elderly, affluence, medical institutions per capita, size of population, percent native American and percent male.

Refer to caption Refer to caption

Refer to caption Refer to caption

Figure 1: Top left: Evolution of coefficients wk,tsubscript𝑤𝑘𝑡w_{k,t}italic_w start_POSTSUBSCRIPT italic_k , italic_t end_POSTSUBSCRIPT over time as the risk-factor model is fitted to cumulative case-counts yt,rsubscript𝑦𝑡𝑟y_{t,r}italic_y start_POSTSUBSCRIPT italic_t , italic_r end_POSTSUBSCRIPT normalized by county populations. Results are plotted for the intercept and four principal components (PC). Only the intercept survives and is far larger that the weights associated with the principal components. Top right: Plot of the prediction error from a 7-fold cross-validation performed with the risk-factor model and LASSO, on case-count data accumulated over the entire two-and-a-half-year duration (and normalized by county populations). The figures on the upper horizontal axis denotes the number of principal components retained in the fitted model. λminsubscript𝜆𝑚𝑖𝑛\lambda_{min}italic_λ start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT and λ1sesubscript𝜆1𝑠𝑒\lambda_{1se}italic_λ start_POSTSUBSCRIPT 1 italic_s italic_e end_POSTSUBSCRIPT are clearly marked. Bottom left: Distribution of coefficients, corresponding to penalties λminsubscript𝜆𝑚𝑖𝑛\lambda_{min}italic_λ start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT and λ1sesubscript𝜆1𝑠𝑒\lambda_{1se}italic_λ start_POSTSUBSCRIPT 1 italic_s italic_e end_POSTSUBSCRIPT; the intercept dominates. Bottom right: The residuals from the risk-factors model i.e., the component not explained by the risk-factors model. The spatial correlations are clear.

We fit a regression model Yt=w0,t+[𝚽]𝐰t+η,η={ηr},r=1R,ηr𝒩(0,σ2)formulae-sequencesubscriptsuperscript𝑌𝑡subscript𝑤0𝑡delimited-[]𝚽subscript𝐰𝑡𝜂formulae-sequence𝜂subscript𝜂𝑟formulae-sequence𝑟1𝑅similar-tosubscript𝜂𝑟𝒩0superscript𝜎2Y^{\ast}_{t}={w}_{0,t}+\left[{\mathbf{\Phi}}\right]{\mathbf{w}}_{t}+{\mathbf{% \eta}},{\mathbf{\eta}}=\{\eta_{r}\},r=1\ldots R,\eta_{r}\sim{\mathcal{N}}(0,% \sigma^{2})italic_Y start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_w start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Φ ] bold_w start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_η , italic_η = { italic_η start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT } , italic_r = 1 … italic_R , italic_η start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) and simplify it with backward-forward stepwise elimination for each time window. New time-windows are obtained by advancing the previous one by 30 days. Fig. 1 (top left) plots the variation of the absolute values of the coefficients 𝐰𝐰{\mathbf{w}}bold_w over time. We see that the intercept w0subscript𝑤0w_{0}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT dominates and persists over the entire duration, whereas the others are present only episodically, suggesting that the model might be fitting to noise. To investigate whether the risk factors play any part in the regression model, we take the cumulative sum of the case-counts over the entire duration of the dataset Ytsubscriptsuperscript𝑌absent𝑡Y^{\ast\ast}_{t}italic_Y start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and fit Yt=u0,t+[𝚽]𝐮t+ϵsubscriptsuperscript𝑌absent𝑡subscript𝑢0𝑡delimited-[]𝚽subscript𝐮𝑡italic-ϵY^{\ast\ast}_{t}={u}_{0,t}+\left[{\mathbf{\Phi}}\right]{\mathbf{u}}_{t}+\epsilonitalic_Y start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_u start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + [ bold_Φ ] bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_ϵ via LASSO. Fig. 1 (top right) shows the MSE as a function of the sparsity penalty λ𝜆\lambdaitalic_λ in LASSO; the digits along the upper horizontal axis plots the PCA modes retained as log(λ)𝜆\log(\lambda)roman_log ( italic_λ ) is increased. The “error bars” show the variation in MSE as we undergo 7-fold cross-validation. We use the value of λ1sesubscript𝜆1𝑠𝑒\lambda_{1se}italic_λ start_POSTSUBSCRIPT 1 italic_s italic_e end_POSTSUBSCRIPT in our regression model (the second vertical dotted line in Fig. 1 (top right), where the mean MSE corresponds to 1 standard deviation away from the minimum MSE observed for λminsubscript𝜆𝑚𝑖𝑛\lambda_{min}italic_λ start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT). The coefficients 𝐮𝐮{\mathbf{u}}bold_u obtained from these two values of λ𝜆\lambdaitalic_λ are plotted in Fig. 1 (bottom left). It is clear that the intercept w0subscript𝑤0w_{0}italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT dominates i.e., the case-counts for COVID-19 are not very dependent on [𝚽]delimited-[]𝚽\left[{\mathbf{\Phi}}\right][ bold_Φ ] and Ytu0,t+ϵsubscriptsuperscript𝑌absent𝑡subscript𝑢0𝑡italic-ϵY^{\ast\ast}_{t}\approx{u}_{0,t}+{\mathbf{\epsilon}}italic_Y start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≈ italic_u start_POSTSUBSCRIPT 0 , italic_t end_POSTSUBSCRIPT + italic_ϵ. The implication is that over the time-period of interest, the spatial patterns observed in Ytsubscriptsuperscript𝑌absent𝑡Y^{\ast\ast}_{t}italic_Y start_POSTSUPERSCRIPT ∗ ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT were not explained by the spatially-variable risk factors. Fig. 1 (bottom right) plots the zlimit-from𝑧z-italic_z -score of ϵitalic-ϵ{\mathbf{\epsilon}}italic_ϵ and the spatial correlation of the epidemiological dynamics not modeled by risk factors is clear. There is a “blue” diagonal of NM counties running Northeast to Southwest, where as the Northwest and Southeast corners are yellow. In between are “magenta” counties. Note that much of the blue diagonal is along the Rio Grande valley, and the population density falls as we travel away from it, into the desert. Clearly, a neighborhood matrix W𝑊Witalic_W for a GMRF model could be made from this data, and we address this next. Note that this spatial variation is not explained by risk factors, but perhaps is due to mixing of populations in the counties.

Moran’s Ilimit-from𝐼I-italic_I -statistic test [55] is used to detect spatial autocorrelation in a variable defined over areal units. It requires an adjacency matrix W𝑊Witalic_W between areal units as input. We consider three different definitions of W𝑊Witalic_W viz. “binary” where wij=1subscript𝑤𝑖𝑗1w_{ij}=1italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 when areal units i𝑖iitalic_i and j𝑗jitalic_j share a border (i.e., they are immediate neighbors), “binary-modified“ where wijsubscript𝑤𝑖𝑗w_{ij}italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is weighed by the reciprocal of the distance between adjacent counties’ county seat and “row-standardised“ where wijsubscript𝑤𝑖𝑗w_{ij}italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT is weighed by the number of neighbors that areal unit i𝑖iitalic_i has. Moran’s Ilimit-from𝐼I-italic_I -statistic is computed with the ϵitalic-ϵ{\mathbf{\epsilon}}italic_ϵ that is provided to the test (“observed Ilimit-from𝐼I-italic_I -statistic”) versus the null case where the elements of ϵitalic-ϵ{\mathbf{\epsilon}}italic_ϵ are IID. The figure of merit is the standard deviate of the observed Ilimit-from𝐼I-italic_I -statistic. The standard deviate of the ϵitalic-ϵ{\mathbf{\epsilon}}italic_ϵ shown in Fig. 1 (bottom right) is in Table 1, top row; clearly it is far from being IID random. Thereafter, we perform the same Moran’s Ilimit-from𝐼I-italic_I -statistic test for the 90-day windows (Fig. 1 (top left)) and tabulate the mean and standard deviation of the the Ilimit-from𝐼I-italic_I -statistic in Table 1, bottom row; again, the Ilimit-from𝐼I-italic_I -statistic indicates significant spatial auto-correlation. We see that the “binary” and “row-standardised” versions of the adjacency matrix give similar results and they are both far superior to the “binary-modified“ form of W𝑊Witalic_W. The computation was repeated with an adjacency matrix with a 2-hop neighborhood (where the immediate neighbors of an areal unit, and their immediate neighbors, were included in the adjacency matrix) and the Ilimit-from𝐼I-italic_I -statistic was indistinguishable from random ϵitalic-ϵ{\mathbf{\epsilon}}italic_ϵ. Henceforth, we will adopt the row-standardised form of W𝑊Witalic_W as our spatial prior as we estimate the infection-rate field over multiple areal units, as it provides the largest standard deviate of Moran’s Ilimit-from𝐼I-italic_I -statistic.

Table 1: Standard deviate of the Ilimit-from𝐼I-italic_I -statistic of the observed data with different adjacency matrices. In the second row, we tabulate the mean standard deviate over all windows; the number in parenthesis is the standard deviation.
Test case Binary W𝑊Witalic_W Binary-modified W𝑊Witalic_W Row-standardised W𝑊Witalic_W
Cumulative cases for the full dataset 3.44 2.76 3.57
90-day windows 2.5 (1.1) 2.08 (0.8) 2.7 (1.35)

4 Formulation

Next, we propose an epidemiological model to forecast infection rates across adjacent geographical regions. The model is an extension of previous work by Safta et al. [6] and Blonigan et al. [4] for epidemic forecasts over a single region to multiple regions. In this section we will briefly describe the single region model and then present statistical approaches to estimate the model parameters over adjacent geographical regions.

4.1 Epidemiological Model

The epidemiological model combines an infection-rate model and an incubation rate model. In a given areal unit r𝑟ritalic_r, the infection rate is assumed to follow a Gamma distribution (in time) with a probability density function (pdf) given by

finf(t;kr,θr)=θrkrtkr1exp(t/θr)/Γ(kr).subscript𝑓𝑖𝑛𝑓𝑡subscript𝑘𝑟subscript𝜃𝑟superscriptsubscript𝜃𝑟subscript𝑘𝑟superscript𝑡subscript𝑘𝑟1𝑡subscript𝜃𝑟Γsubscript𝑘𝑟f_{inf}(t;k_{r},\theta_{r})=\theta_{r}^{-k_{r}}t^{k_{r}-1}\exp(-t/\theta_{r})% \big{/}\Gamma(k_{r}).italic_f start_POSTSUBSCRIPT italic_i italic_n italic_f end_POSTSUBSCRIPT ( italic_t ; italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) = italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_t start_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT roman_exp ( - italic_t / italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) / roman_Γ ( italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) . (1)

The infection-rate in Eq. (1) is controlled by two parameters, krsubscript𝑘𝑟k_{r}italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT (shape) and θrsubscript𝜃𝑟\theta_{r}italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT (scale), and is sufficiently flexible to capture a range of outbreaks. The third parameter, t0,rsubscript𝑡0𝑟t_{0,r}italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT, represents the start of the outbreak and will be inferred jointly with the infection rate parameters. For incubation we employ a model calibrated against early COVID-19 data [56]. This model follows a lognormal distribution with a cumulative distribution function (CDF) given by

Finc(t;μ,σ)=12erfc(logtμσ2)subscript𝐹𝑖𝑛𝑐𝑡𝜇𝜎12erfc𝑡𝜇𝜎2F_{inc}(t;\mu,\sigma)=\frac{1}{2}\mathrm{erfc}\left(-\frac{\log t-\mu}{\sigma% \sqrt{2}}\right)italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t ; italic_μ , italic_σ ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG roman_erfc ( - divide start_ARG roman_log italic_t - italic_μ end_ARG start_ARG italic_σ square-root start_ARG 2 end_ARG end_ARG ) (2)

Note that μ𝜇\muitalic_μ and σ𝜎\sigmaitalic_σ are not constants, but are random variables themselves. The mean μ𝜇\muitalic_μ is approximated as a Student’s tlimit-from𝑡t-italic_t -distribution and σ𝜎\sigmaitalic_σ is assumed to have a chi-square distribution. These choices result in 95% confidence intervals of [1.48,1.76]1.481.76\left[1.48,1.76\right][ 1.48 , 1.76 ] and [0.320,0.515]0.3200.515\left[0.320,0.515\right][ 0.320 , 0.515 ] for μ𝜇\muitalic_μ and σ𝜎\sigmaitalic_σ, respectively, as described in Safta et al.  [6]. We will refer to this model as the stochastic incubation model.

The cumulative number of people that have turned symptomatic between time t0,rsubscript𝑡0𝑟t_{0,r}italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT (the start of the current epidemic wave) and time tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is computed as a convolution between the infection rate and the CDF of the incubation model

Ni,r=Nrt0,rtifinf(τt0;kr,θr)Finc(tiτ;μ,σ)𝑑τ,subscript𝑁𝑖𝑟subscript𝑁𝑟superscriptsubscriptsubscript𝑡0𝑟subscript𝑡𝑖subscript𝑓𝑖𝑛𝑓𝜏subscript𝑡0subscript𝑘𝑟subscript𝜃𝑟subscript𝐹𝑖𝑛𝑐subscript𝑡𝑖𝜏𝜇𝜎differential-d𝜏N_{i,r}=N_{r}\int_{t_{0,r}}^{t_{i}}f_{inf}(\tau-t_{0};k_{r},\theta_{r})F_{inc}% (t_{i}-\tau;\mu,\sigma)d\tau,italic_N start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∫ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i italic_n italic_f end_POSTSUBSCRIPT ( italic_τ - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) italic_d italic_τ , (3)

where Nrsubscript𝑁𝑟N_{r}italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT is the total number of people that will get infected (and counted) during the entire epidemic wave in areal unit r𝑟ritalic_r. This model assumes that a person shows symptoms once the virus incubation has completed. Furthermore, once symptoms are evident, it is also assumed that individuals have prompt access to medical services or otherwise self-report the COVID-19 infection, getting counted without delay. These assumptions will be relaxed in future versions of this effort where the model above will be endowed with latent variables that account for uncertainties due to reporting delays and unreported positive counts.

The number of people that turn symptomatic over the time interval [ti1,ti]subscript𝑡𝑖1subscript𝑡𝑖[t_{i-1},t_{i}][ italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ], in areal unit r𝑟ritalic_r, is estimated as

ni,r=Ni,rNi1,rsubscript𝑛𝑖𝑟subscript𝑁𝑖𝑟subscript𝑁𝑖1𝑟\displaystyle n_{i,r}=N_{i,r}-N_{i-1,r}italic_n start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT italic_i , italic_r end_POSTSUBSCRIPT - italic_N start_POSTSUBSCRIPT italic_i - 1 , italic_r end_POSTSUBSCRIPT =\displaystyle== Nrt0,rtifinf(τt0,r;kr,θr)(Finc(tiτ;μ,σ)Finc(ti1τ;μ,σ))𝑑τsubscript𝑁𝑟superscriptsubscriptsubscript𝑡0𝑟subscript𝑡𝑖subscript𝑓𝑖𝑛𝑓𝜏subscript𝑡0𝑟subscript𝑘𝑟subscript𝜃𝑟subscript𝐹𝑖𝑛𝑐subscript𝑡𝑖𝜏𝜇𝜎subscript𝐹𝑖𝑛𝑐subscript𝑡𝑖1𝜏𝜇𝜎differential-d𝜏\displaystyle N_{r}\int_{t_{0,r}}^{t_{i}}f_{inf}(\tau-t_{0,r};k_{r},\theta_{r}% )\left(F_{inc}(t_{i}-\tau;\mu,\sigma)-F_{inc}(t_{i-1}-\tau;\mu,\sigma)\right)d\,\tauitalic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∫ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i italic_n italic_f end_POSTSUBSCRIPT ( italic_τ - italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT ; italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) ( italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) - italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) ) italic_d italic_τ (4)
\displaystyle\approx Nr(titi1)t0,rtifinf(τt0;kr,θr)finc(tiτ;μ,σ)𝑑τsubscript𝑁𝑟subscript𝑡𝑖subscript𝑡𝑖1superscriptsubscriptsubscript𝑡0𝑟subscript𝑡𝑖subscript𝑓𝑖𝑛𝑓𝜏subscript𝑡0subscript𝑘𝑟subscript𝜃𝑟subscript𝑓𝑖𝑛𝑐subscript𝑡𝑖𝜏𝜇𝜎differential-d𝜏\displaystyle N_{r}(t_{i}-t_{i-1})\int_{t_{0,r}}^{t_{i}}f_{inf}(\tau-t_{0};k_{% r},\theta_{r})f_{inc}(t_{i}-\tau;\mu,\sigma)d\tauitalic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) ∫ start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_i italic_n italic_f end_POSTSUBSCRIPT ( italic_τ - italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ; italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) italic_f start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) italic_d italic_τ (5)

where fincsubscript𝑓𝑖𝑛𝑐f_{inc}italic_f start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT is the pdf of the incubation model. In transitioning from Eq. (4) to Eq. (5) we made use of the approximation

finc(tiτ;μ,σ)Finc(tiτ;μ,σ)Finc(ti1τ;μ,σ)titi1subscript𝑓𝑖𝑛𝑐subscript𝑡𝑖𝜏𝜇𝜎subscript𝐹𝑖𝑛𝑐subscript𝑡𝑖𝜏𝜇𝜎subscript𝐹𝑖𝑛𝑐subscript𝑡𝑖1𝜏𝜇𝜎subscript𝑡𝑖subscript𝑡𝑖1f_{inc}(t_{i}-\tau;\mu,\sigma)\approx\frac{F_{inc}(t_{i}-\tau;\mu,\sigma)-F_{% inc}(t_{i-1}-\tau;\mu,\sigma)}{t_{i}-t_{i-1}}italic_f start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) ≈ divide start_ARG italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) - italic_F start_POSTSUBSCRIPT italic_i italic_n italic_c end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT - italic_τ ; italic_μ , italic_σ ) end_ARG start_ARG italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT end_ARG

which amounts to approximating the incubation model PDF with a histogram with bin of size (titi1)subscript𝑡𝑖subscript𝑡𝑖1(t_{i}-t_{i-1})( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_t start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ). Thus the four parameters that describe the epidemiological dynamics in an areal unit r𝑟ritalic_r are γr={kr,θr,t0,r,Nr}subscript𝛾𝑟subscript𝑘𝑟subscript𝜃𝑟subscript𝑡0𝑟subscript𝑁𝑟\gamma_{r}=\{k_{r},\theta_{r},t_{0,r},N_{r}\}italic_γ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT = { italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT } and γ={γr}𝛾subscript𝛾𝑟\mathbf{\gamma}=\{\gamma_{r}\}italic_γ = { italic_γ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT } is the accumulation of parameters over all R𝑅Ritalic_R areal units. We will refer to them colloquially as the “epidemiological” parameters. In this paper we focus on outbreak detection and for this purpose a model that follows a single wave, as above, is sufficient for the task. Given the assumptions above, these outbreak forecasts represent a lower bound on the actual number of people that are infected with COVID-19. A fraction of the population infected with a novel disease might also exhibit minor or no symptoms at all and might not seek medical advice, further contributing to lowering the predicted counts compared to the actual size of the epidemic.

4.2 Model Calibration

Given data in the form of time-series of daily counts, labeled generically as 𝒀𝒀\bm{Y}bold_italic_Y, as shown in §3.2, and the model predictions 𝒏𝒏\bm{n}bold_italic_n for the number of new symptomatic counts daily, presented in §4.1, we will employ a Bayesian framework to calibrate the epidemiological model parameters. The discrepancy between the data and the model is written as

𝒀=𝒏(𝒑)+ϵ(𝒑)𝒀𝒏𝒑italic-ϵ𝒑{\bm{Y}}={\bm{n}}(\bm{p})+\epsilon(\bm{p})bold_italic_Y = bold_italic_n ( bold_italic_p ) + italic_ϵ ( bold_italic_p ) (6)

where 𝒑𝒑{\bm{p}}bold_italic_p are the parameters that describe both the epidemiological models and the statistical discrepancy ϵitalic-ϵ\epsilonitalic_ϵ between the data and the epidemiological model. These parameters will be detailed in the following sub-sections. The probabilistic error model encapsulates both errors in the observations, e.g. availability of testing capabilities and test accuracy, as well as errors due to empirical modeling choices.

The multivariate distribution for the vector of parameters 𝒑𝒑\bm{p}bold_italic_p can be estimated in a Bayesian framework as

P(𝒑|𝒀)P(𝒀|𝒑)P(𝒑)proportional-to𝑃conditional𝒑𝒀𝑃conditional𝒀𝒑𝑃𝒑P(\bm{p}|{\bm{Y}})\propto P({\bm{Y}}|\bm{p})P(\bm{p})italic_P ( bold_italic_p | bold_italic_Y ) ∝ italic_P ( bold_italic_Y | bold_italic_p ) italic_P ( bold_italic_p ) (7)

where P(𝒑|𝒀)𝑃conditional𝒑𝒀P(\bm{p}|{\bm{Y}})italic_P ( bold_italic_p | bold_italic_Y ) is the posterior distribution we are seeking after observing the data 𝒀𝒀{\bm{Y}}bold_italic_Y, P(𝒀|𝒑)𝑃conditional𝒀𝒑P({\bm{Y}}|\bm{p})italic_P ( bold_italic_Y | bold_italic_p ) is the likelihood of observing the data 𝒀𝒀{\bm{Y}}bold_italic_Y given a specific choice for parameters 𝒑𝒑\bm{p}bold_italic_p, and P(𝒑)𝑃𝒑P(\bm{p})italic_P ( bold_italic_p ) contains the prior information about the models parameters. The subsections below provide a detailed description about the setup of the likelihood and prior distributions.

4.2.1 Likelihood Construction with Spatial Correlations

We now derive a likelihood expression 𝒟subscript𝒟\mathcal{L}_{\mathcal{D}}caligraphic_L start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT which accounts for the discrepancies between the number of people reported symptomatic daily and the number of new cases predicted by the model, via Eq. (5). We denote the reported daily count Yi(o)={yi,1,yi,2,,yi,R}superscriptsubscript𝑌𝑖𝑜subscript𝑦𝑖1subscript𝑦𝑖2subscript𝑦𝑖𝑅Y_{i}^{(o)}=\{y_{i,1},y_{i,2},\ldots,y_{i,R}\}italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT = { italic_y start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT , … , italic_y start_POSTSUBSCRIPT italic_i , italic_R end_POSTSUBSCRIPT } for day i𝑖iitalic_i, and the daily predicted count Yi(p)={ni,1,ni,2,,ni,R}=(ti;γ)superscriptsubscript𝑌𝑖𝑝subscript𝑛𝑖1subscript𝑛𝑖2subscript𝑛𝑖𝑅subscript𝑡𝑖𝛾Y_{i}^{(p)}=\{n_{i,1},n_{i,2},\ldots,n_{i,R}\}=\mathcal{M}(t_{i};\mathbf{% \gamma})italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT = { italic_n start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT , … , italic_n start_POSTSUBSCRIPT italic_i , italic_R end_POSTSUBSCRIPT } = caligraphic_M ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_γ ), where (ti;γ)subscript𝑡𝑖𝛾\mathcal{M}(t_{i};\mathbf{\gamma})caligraphic_M ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_γ ) is the epidemiological model described in Eq. 5, with γ𝛾\mathbf{\gamma}italic_γ constituting the epidemiological parameters over R𝑅Ritalic_R regions, some of which might be adjacent. γ𝛾\mathbf{\gamma}italic_γ are the parameters that will be jointly inferred given the available data.

For a given data i𝑖iitalic_i, we state

Yi(o)=Yi(p)+εi=(ti;γ)+εi,εi𝒩(0,Σi),formulae-sequencesuperscriptsubscript𝑌𝑖𝑜superscriptsubscript𝑌𝑖𝑝subscript𝜀𝑖subscript𝑡𝑖𝛾subscript𝜀𝑖similar-tosubscript𝜀𝑖𝒩0subscriptΣ𝑖Y_{i}^{(o)}=Y_{i}^{(p)}+\mathbf{\varepsilon}_{i}=\mathcal{M}(t_{i};\mathbf{% \gamma})+\mathbf{\varepsilon}_{i},\mathbf{\varepsilon}_{i}\sim{\mathcal{N}}% \left(0,\Sigma_{i}\right),italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT = italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT + italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = caligraphic_M ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; italic_γ ) + italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( 0 , roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (8)

i.e., we assume that the data – model mismatch is a multivariate Gaussian distribution with a block covariance matrix. We will assume that the discrepancies are independent over the temporal axis and correlated in space, i.e.

𝒟=i=1Nd1(2π)Nr/2det(Σi1/2)exp(12(Yi(o)Yi(p))Σi1(Yi(o)Yi(p))T)subscript𝒟superscriptsubscriptproduct𝑖1subscript𝑁𝑑1superscript2𝜋subscript𝑁𝑟2detsuperscriptsubscriptΣ𝑖1212subscriptsuperscript𝑌𝑜𝑖subscriptsuperscript𝑌𝑝𝑖superscriptsubscriptΣ𝑖1superscriptsubscriptsuperscript𝑌𝑜𝑖subscriptsuperscript𝑌𝑝𝑖𝑇\mathcal{L}_{\mathcal{D}}=\prod_{i=1}^{N_{d}}\frac{1}{(2\pi)^{N_{r}/2}\mathrm{% det}(\Sigma_{i}^{1/2})}\exp\left(-\frac{1}{2}(Y^{(o)}_{i}-Y^{(p)}_{i})\Sigma_{% i}^{-1}(Y^{(o)}_{i}-Y^{(p)}_{i})^{T}\right)caligraphic_L start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG ( 2 italic_π ) start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT / 2 end_POSTSUPERSCRIPT roman_det ( roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 / 2 end_POSTSUPERSCRIPT ) end_ARG roman_exp ( - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_Y start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_Y start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) (9)

Here ΣisubscriptΣ𝑖\Sigma_{i}roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the block in the large covariance matrix (that spans over Ndsubscript𝑁𝑑N_{d}italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT days of observations) that corresponds to the predictions for day i𝑖iitalic_i. Per the BYM model, we will model the discrepancy Yi(o)Yi(p)=εisubscriptsuperscript𝑌𝑜𝑖subscriptsuperscript𝑌𝑝𝑖subscript𝜀𝑖Y^{(o)}_{i}-Y^{(p)}_{i}=\mathbf{\varepsilon}_{i}italic_Y start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with two components i.e., εi=εi,1+εi,2subscript𝜀𝑖subscript𝜀𝑖1subscript𝜀𝑖2\mathbf{\varepsilon}_{i}=\mathbf{\varepsilon}_{i,1}+\mathbf{\varepsilon}_{i,2}italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ε start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT + italic_ε start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT. Per Fig. 1 (bottom right), εi,1subscript𝜀𝑖1\mathbf{\varepsilon}_{i,1}italic_ε start_POSTSUBSCRIPT italic_i , 1 end_POSTSUBSCRIPT will be modeled with a pCAR to capture spatial auto-correlation. In contrast εi,2subscript𝜀𝑖2\mathbf{\varepsilon}_{i,2}italic_ε start_POSTSUBSCRIPT italic_i , 2 end_POSTSUBSCRIPT models random, temporally independent, reporting errors and any model shortcomings. Consequently the Yi(o)Yi(p)=εisubscriptsuperscript𝑌𝑜𝑖subscriptsuperscript𝑌𝑝𝑖subscript𝜀𝑖Y^{(o)}_{i}-Y^{(p)}_{i}=\mathbf{\varepsilon}_{i}italic_Y start_POSTSUPERSCRIPT ( italic_o ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT discrepancy is modeled as the product of two independent, zero-mean multivariate Gaussian components [57], with a resulting in a joint covariance matrix given by

Σi=P1+diag(σa+σmYi(p))2,subscriptΣ𝑖superscript𝑃1diagsuperscriptsubscript𝜎𝑎subscript𝜎𝑚subscriptsuperscript𝑌𝑝𝑖2\Sigma_{i}=P^{-1}+\mathrm{diag}\left(\sigma_{a}+\sigma_{m}Y^{(p)}_{i}\right)^{% 2},roman_Σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT + roman_diag ( italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (10)

where P𝑃Pitalic_P is the precision matrix associated with the Gaussian Markov Random Field (GMRF) model assumed to account for the spatial correlations between adjacent regions (a proper Conditional Auto-Regressive (pCAR) model[19]). We will refer to the parameters σ={σa,σm}𝜎subscript𝜎𝑎subscript𝜎𝑚\mathbf{\sigma}=\{\sigma_{a},\sigma_{m}\}italic_σ = { italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } as the “error model” (or ErrM). The precision matrix P𝑃Pitalic_P is defined as

P=1τϕ2(diag{g1,g2,,gNr}λϕW)𝑃1superscriptsubscript𝜏italic-ϕ2diagsubscript𝑔1subscript𝑔2subscript𝑔subscript𝑁𝑟subscript𝜆italic-ϕ𝑊P=\frac{1}{\tau_{\phi}^{2}}\left(\mathrm{diag}\{g_{1},g_{2},\ldots,g_{N_{r}}\}% -\lambda_{\phi}W\right)italic_P = divide start_ARG 1 end_ARG start_ARG italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( roman_diag { italic_g start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_g start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_g start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_POSTSUBSCRIPT } - italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT italic_W ) (11)

Here, gjsubscript𝑔𝑗g_{j}italic_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is the number of regions adjacent to region j𝑗jitalic_j, and W𝑊Witalic_W is a matrix that encodes the relative topology of the regions considered in the joint inference, with entries defined as

wjj=0andwjk={1if regions j and k are adjacent,0otherwise.subscript𝑤𝑗𝑗0andsubscript𝑤𝑗𝑘cases1if regions j and k are adjacent,0otherwise.w_{jj}=0\,\textrm{and}\,w_{jk}=\begin{cases}1&\textrm{if regions {\it j} and {% \it k} are adjacent,}\\ 0&\textrm{otherwise.}\end{cases}italic_w start_POSTSUBSCRIPT italic_j italic_j end_POSTSUBSCRIPT = 0 and italic_w start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT = { start_ROW start_CELL 1 end_CELL start_CELL if regions italic_j and italic_k are adjacent, end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise. end_CELL end_ROW (12)

Thus P𝑃Pitalic_P defines a pCAR spatial model with row-standardisation and is a function of the “spatial coefficients” (or SpC) ψ={τϕ2,λϕ}𝜓superscriptsubscript𝜏italic-ϕ2subscript𝜆italic-ϕ\mathbf{\psi}=\{\tau_{\phi}^{2},\lambda_{\phi}\}italic_ψ = { italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT }, which will also have to be estimated from the data. The inclusion of ψ𝜓\mathbf{\psi}italic_ψ implies that the epidemiological parameters γ𝛾\mathbf{\gamma}italic_γ will display spatial correlation. The magnitude of the correlation is unknown a priori, and will be estimated from the case-count data.

To summarize, the accuracy of the spatiotemporal model for epidemiological dynamics is controlled by the parameters 𝐩={γ,σ,ψ}𝐩𝛾𝜎𝜓\mathbf{p}=\{\mathbf{\gamma},\mathbf{\sigma},\mathbf{\psi}\}bold_p = { italic_γ , italic_σ , italic_ψ }, which will be the object of inference from data from R𝑅Ritalic_R NM counties. The dimensionality of the inverse problem scales with R𝑅Ritalic_R and is limited by the scalability of the inversion method. We will use R=3𝑅3R=3italic_R = 3 and consider inferences using the following setups:

  • independent inferences (i.e., R=1𝑅1R=1italic_R = 1), county by county, for the counties of Bernalillo, Santa Fe, and Valencia.

  • two adjacent counties (i.e., R=2𝑅2R=2italic_R = 2), i.e. Bernalillo & Santa Fe and Bernalillo & Valencia. For these cases the covariance matrix P1superscript𝑃1P^{-1}italic_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT corresponding to the GMRF model is given by

    P1=τϕ21λϕ2[1λϕλϕ1]superscript𝑃1superscriptsubscript𝜏italic-ϕ21superscriptsubscript𝜆italic-ϕ2matrix1subscript𝜆italic-ϕsubscript𝜆italic-ϕ1P^{-1}=\frac{\tau_{\phi}^{2}}{1-\lambda_{\phi}^{2}}\begin{bmatrix}1&\lambda_{% \phi}\\ \lambda_{\phi}&1\\ \end{bmatrix}italic_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT = divide start_ARG italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] (13)
  • three counties (i.e., R=3𝑅3R=3italic_R = 3), Bernalillo, Santa Fe, and Valencia, jointly. Bernalillo is adjacent to the other two counties but Santa Fe and Valencia do not share a border. The GMRF covariance matrix P1superscript𝑃1P^{-1}italic_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT is given by

    P1=τϕ22(1λϕ2)[1λϕλϕλϕ2λϕ2λϕ2λϕλϕ22λϕ2]superscript𝑃1superscriptsubscript𝜏italic-ϕ221superscriptsubscript𝜆italic-ϕ2matrix1subscript𝜆italic-ϕsubscript𝜆italic-ϕsubscript𝜆italic-ϕ2superscriptsubscript𝜆italic-ϕ2superscriptsubscript𝜆italic-ϕ2subscript𝜆italic-ϕsuperscriptsubscript𝜆italic-ϕ22superscriptsubscript𝜆italic-ϕ2P^{-1}=\frac{\tau_{\phi}^{2}}{2\left(1-\lambda_{\phi}^{2}\right)}\begin{% bmatrix}1&\lambda_{\phi}&\lambda_{\phi}\\ \lambda_{\phi}&2-\lambda_{\phi}^{2}&\lambda_{\phi}^{2}\\ \lambda_{\phi}&\lambda_{\phi}^{2}&2-\lambda_{\phi}^{2}\\ \end{bmatrix}italic_P start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT = divide start_ARG italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 ( 1 - italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL 2 - italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 2 - italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] (14)

4.2.2 Prior Distributions

We employ uninformative priors for the shape and scale parameters, krsubscript𝑘𝑟k_{r}italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and θrsubscript𝜃𝑟\theta_{r}italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, of the infection rate models, in Eq. (1). We also employ an uninformative prior for the total count of infected people during the pandemic Nrsubscript𝑁𝑟N_{r}italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT. From our previous work [6, 4] we observed that the convolution model in Eqs. (3)-(5) exhibit sharp transitions when the inferred start time t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is not well constrained by the data, e.g. in situations where the daily counts are noisy in the low single digits. For this purpose for t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT we selected a Gaussian distribution with a wide enough standard deviation, e.g. 10 days, to allow the data to easily overcome this prior when the number of counts increases beyond the low single digits count.

Further, to ensure the discrepancy model parameters, σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, are automatically positive, we work with their natural logarithm in the Bayesian framework. Consequently, the equivalent uninformative prior for the logarithm of standard deviations, logσasubscript𝜎𝑎\log\sigma_{a}roman_log italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and logσmsubscript𝜎𝑚\log\sigma_{m}roman_log italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, is the uniform distributions. For both these parameters, we bound the natural logarithms’ values to [30,10]3010[-30,10][ - 30 , 10 ], a range sufficiently wide to account for the discrepancies between model predictions and observations, while preventing numerical underflow/overflow errors during MCMC sampling.

For the parameters controlling the pCAR model, we employ a Gamma distribution with shape 10101010 and scale 2222, Γ(10,2)Γ102\Gamma(10,2)roman_Γ ( 10 , 2 ), for τϕsubscript𝜏italic-ϕ\tau_{\phi}italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT and a uniform distribution U(0,0.9)𝑈00.9U(0,0.9)italic_U ( 0 , 0.9 ) for λϕsubscript𝜆italic-ϕ\lambda_{\phi}italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT following Shand et al.  [49]

4.2.3 Sampling the Posterior Distribution

As in our previous work on epidemiological models [6, 4], we employ a Markov Chain Monte Carlo (MCMC) algorithm is used to sample from the posterior density p(𝒑|𝒀)𝑝conditional𝒑𝒀p(\bm{p}|\bm{Y})italic_p ( bold_italic_p | bold_italic_Y ), specifically the adaptive Metropolis (AMCMC) algorithm [8]. To accommodate the stochastic incubation model (Eq. (2)), we employ an unbiased estimate of the likelihood presented in Eq. (9). For each MCMC step we select a random set of (μ,σ)𝜇𝜎(\mu,\sigma)( italic_μ , italic_σ ) for the incubation model according to their prescribed distributions, then run the epidemiological model to generate 𝒀(p)superscript𝒀𝑝\bm{Y}^{(p)}bold_italic_Y start_POSTSUPERSCRIPT ( italic_p ) end_POSTSUPERSCRIPT and estimate the likelihood. This approach is similar to the pseudo-marginal MCMC algorithm [58] guaranteeing that the resulting samples correspond to the unbiased posterior distribution model. We use the Effective Sample Size (ESS) [59] estimate to gauge the number of samples sufficient to describe the posterior distribution given the data available. For the results presented in this paper, we found that 1111 to 2222 million MCMC samples were needed to extract 5555K-10101010K effective samples required to estimate summary statistics and marginal distributions for the epidemiological models’ parameters.

4.2.4 Diagnostics

The sampling process described in § 4.2.3 yields O(106)Osuperscript106{\rm O(10^{6})}roman_O ( 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT ) samples of 𝐩={t0,r,kr,Nr,θr,τϕ2,λϕ}𝐩subscript𝑡0𝑟subscript𝑘𝑟subscript𝑁𝑟subscript𝜃𝑟superscriptsubscript𝜏italic-ϕ2subscript𝜆italic-ϕ\mathbf{p}=\{t_{0,r},k_{r},N_{r},\theta_{r},\tau_{\phi}^{2},\lambda_{\phi}\}bold_p = { italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT } from the posterior probability density function (PDF) and the question arises regarding how we assess the accuracy/predictive skill of the PDF. Primarily, we will use posterior predictive tests, whereby we will select 100 samples from the posterior PDFs and use Eq. 8 to predict case-counts. These forecasts will be limited to 14 days, beyond which, as described in our previous papers[6], the model is not expected to be predictive. Fundamentally, observations up to time t𝑡titalic_t contains information about epidemiological dynamics up to time tΔ𝑡Δt-\Deltaitalic_t - roman_Δ, ΔΔ\Deltaroman_Δ being a measure of the incubation period; after that, an increasing fraction of the infected people have yet to show symptoms and appear in the case-counts. Using the mean incubation period plus twice the standard deviation as an estimate for ΔΔ\Deltaroman_Δ (Eq. 2), we get Δexp(1.76+2×0.515)=16.3Δ1.7620.51516.3\Delta\approx\exp(1.76+2\times 0.515)=16.3roman_Δ ≈ roman_exp ( 1.76 + 2 × 0.515 ) = 16.3 days, and so we curtail forecasting at a 2-week horizon. These forecasts are compared with the observed case-counts, and in case of a mismatch, the epidemiological dynamics are assumed to have changed after time t𝑡titalic_t. Apart from forecasting, the correlation structure in the 𝐩𝐩\mathbf{p}bold_p samples can be informative. For each of the areal units of interest, we plot 2D marginal plots (in the Appendix) and, in § 5.2, perform grouped statistical dependence analysis to uncover how parameters for each areal unit vary with those from other areal units, or with global parameters such as {λϕ,τϕ2.σa,σm}formulae-sequencesubscript𝜆italic-ϕsuperscriptsubscript𝜏italic-ϕ2subscript𝜎𝑎subscript𝜎𝑚\{\lambda_{\phi},\tau_{\phi}^{2}.\sigma_{a},\sigma_{m}\}{ italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }.

5 Results

Our use of AMCMC[8] (which is not very scalable when coupled with a moderately computationally expensive model) limits us to 10-15 dimensional posterior distributions. For this reason, we limit our study to three regions, R=3𝑅3R=3italic_R = 3, for a total of 16 parameters, i.e. 4444 parameters for each region, and 4444 parameters to describe the error model and correlations between regions. We selected three NM counties, Bernalillo, Santa Fe and Valencia, shown in Fig. 2, as this allows to understand whether the adjacency between counties plays a role in the model calibration. Bernalillo is sandwiched between the other two counties and thus shares boundaries with the other two; while Santa Fe and Valencia do not share boundaries.

Refer to caption
Figure 2: The geographical extent of three adjacent New Mexico counties considered in this paper: Bernalillo (in green), Santa Fe (in orange) and Valencia (in blue).

5.1 Markov Chain Monte Carlo results

In this section we will discuss summaries given samples from the posterior distributions sampled via MCMC. We first compare posterior results obtained for 1-, 2-, and 3-region statistical inference runs and the examine their impact on quality of model predictions vs the available observations.

Bernalillo Santa Fe Valencia

Refer to caption Refer to caption Refer to caption

Refer to caption Refer to caption Refer to caption

Refer to caption Refer to caption Refer to caption

Refer to caption Refer to caption Refer to caption

Figure 3: 1-D marginal posterior distributions to Bernalillo (left column), Santa Fe (middle column), and Valencia (right column). Top row: PDFs for t0,rsubscript𝑡0𝑟t_{0,r}italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT. Second row: PDFs for Nrsubscript𝑁𝑟N_{r}italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT. Third row: PDFs for k𝑘kitalic_k. Bottom row: PDFs for θrsubscript𝜃𝑟\theta_{r}italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT.

In Fig. 3 we plot the 1D marginalized posterior PDFs of the epidemiological parameters i.e. (t0,r,kr,Nr,θr)subscript𝑡0𝑟subscript𝑘𝑟subscript𝑁𝑟subscript𝜃𝑟(t_{0,r},k_{r},N_{r},\theta_{r})( italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) for all three counties. The 2D marginals are in the Appendix in Figs. 11, Fig. 13 and Fig. 12. The 1D PDFs were computed using data from all three counties jointly (denoted “3r” in the legend), jointly using data from 2 counties at a time (denoted as “2r”) and independently (denoted as “1r” inversions). We see that joint estimation does not noticeably sharpen the PDFs for any of the objects of interest (OOI), but does shift the PDFs for Santa Fe. This robustness to population size is because the likelihood for the inverse problem is constructed with normalized counts, implying that the larger case-counts observed in Bernalillo (about 6 times larger than Santa Fe or Valencia) do not bias the results against the smaller counties. We note that the PDFs for Valencia do not change much in the three estimations. In Fig. 4, top row, we plot the parameters of the GMRF (τϕ2,λϕ)superscriptsubscript𝜏italic-ϕ2subscript𝜆italic-ϕ(\tau_{\phi}^{2},\lambda_{\phi})( italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ). It is clear that these spatial parameters can be estimated from the 2r2𝑟2r2 italic_r and 3r3𝑟3r3 italic_r inversions, with λϕsubscript𝜆italic-ϕ\lambda_{\phi}italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT becoming easier to estimate with specificity as we add more regions, at the expense of log(τϕ2)superscriptsubscript𝜏italic-ϕ2\log(\tau_{\phi}^{2})roman_log ( italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ).

Refer to caption Refer to caption

Refer to caption Refer to caption

Figure 4: Marginal posterior distributions GMRF parameters (τϕ2,λϕ)superscriptsubscript𝜏italic-ϕ2subscript𝜆italic-ϕ(\tau_{\phi}^{2},\lambda_{\phi})( italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ) (top row) and noise parameters (σa,σm)subscript𝜎𝑎subscript𝜎𝑚(\sigma_{a},\sigma_{m})( italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) (bottom row), estimated via 2r2𝑟2r2 italic_r and 3r3𝑟3r3 italic_r joint estimations with data for Santa Fe.

In Fig. 4, bottom row, we plot the noise parameters (σa,σm)subscript𝜎𝑎subscript𝜎𝑚(\sigma_{a},\sigma_{m})( italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ), for Santa Fe, obtained from the same set of inversions. We see that the noise parameters are small and can be estimated, though it becomes progressively more difficult to estimate σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT with much specificity with joint estimation, while σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT becomes easier. This is because σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT estimates the magnitude of the epidemiological processes unexplained by our model and the genesis of these processes is likely to be different in the three counties, leading to the difficulty in estimation. This can be explained using Eq. 10. Here τϕ2superscriptsubscript𝜏italic-ϕ2\tau_{\phi}^{2}italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT appear additively, and the uncertainties in one could be exchanged for the other, as can be seen in Fig. 4 top left and bottom right.

Bernalillo Santa Fe Valencia

Refer to caption Refer to caption Refer to caption

Refer to caption Refer to caption Refer to caption

Figure 5: Comparison of posterior predictive distribution results obtained via joint inference (using the GMRF model) for Bernalillo (left), Santa Fe (middle), and Valencia (right) shown on top row with equivalent results from independent inferences for each county separately, on the bottom row; data up to September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT, 2020 is used and case-count data was smoothed with a 7-day running average. The red line is the median prediction, the shaded teal region is the inter-quartile range and the dashed lines are 5thsuperscript5th{\rm 5^{th}}5 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and 95thsuperscript95th{\rm 95^{th}}95 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT percentiles.

In Fig. 5, we plot the fit of the model to data till September 15, 2020 (the arrival of the Fall 2020 wave) and the two-week forecasts done after that. These predictions are performed by randomly sampling 100 (t0,r,kr,Nr,θr,τϕ2,λϕ)subscript𝑡0𝑟subscript𝑘𝑟subscript𝑁𝑟subscript𝜃𝑟superscriptsubscript𝜏italic-ϕ2subscript𝜆italic-ϕ(t_{0,r},k_{r},N_{r},\theta_{r},\tau_{\phi}^{2},\lambda_{\phi})( italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ) from the posterior distribution (Fig. 3) and running the model forward from the start of our calibration period to the end of September 2020 (note that the calibration data stops at September 15, 2020, and the rest is a forecast). The data for the two-week period is also plotted and it not supposed to agree with the forecast, as the calibrated model does not contain information about the Fall 2020 wave. We see quite clearly that the uncertainies in forecast (the dashed blue line denoting the 5thsuperscript5th{\rm 5^{th}}5 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and 95thsuperscript95th{\rm 95^{th}}95 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT percentiles are tighter for the 3-region joint inversion (top row) for all three counties. This tightness implies that it becomes easier for us to detect the discrepancy between the forecast and the data, the marker for the arrival of the Fall 2020 wave. This is particularly true for Santa Fe. The agreement between the predictions (up to September 15, 2020) and the reported case-counts are quantified using the (Continuous Ranked Probability Score [60]) and tabulated in Table 2. We see that the most accurate forecasts do arise from independent estimations, but the 3r3𝑟3r3 italic_r inversions are close behind.

Bernalillo Santa Fe Valencia

Refer to caption Refer to caption Refer to caption

Refer to caption Refer to caption Refer to caption

Figure 6: Comparison of reconstructed infection-rate profiles that underly the predictions in Fig. 5. The top row contains results obtained via joint inference (using the GMRF model) for Bernalillo (left), Santa Fe (middle), and Valencia (right). Results from independent inferences for each county separately, are shown in the bottom row. The calibration data spans up to September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT, 2020 and the case-count data was smoothed with a 7-day running average. The red line is the median prediction, the shaded teal region is the inter-quartile range and the dashed lines are 5thsuperscript5th{\rm 5^{th}}5 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and 95thsuperscript95th{\rm 95^{th}}95 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT percentiles.

In Fig. 6, we plot the corresponding infection rates for all three counties. Differences in the estimated infection rates, 3r3𝑟3r3 italic_r joint estimation (top row) versus independent (bottom row), are difficult to discern. This is because the infection rate is only affected by (t0,r,kr,Nr,θr)subscript𝑡0𝑟subscript𝑘𝑟subscript𝑁𝑟subscript𝜃𝑟(t_{0,r},k_{r},N_{r},\theta_{r})( italic_t start_POSTSUBSCRIPT 0 , italic_r end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT , italic_θ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ) and, as is clear from Fig. 3, there is not much difference in their posterior PDFs. Instead, it is the noise and spatial parameters whose estimates differ as we add more regions to the joint estimation (see Fig. 4).

3 counties 2 counties (B & SF) 2 counties (B & V) 1 county
Bernalillo 11.30 12.34 11.75 10.20
Santa Fe 2.65 2.87 - 2.48
Valencia 1.76 - 1.82 1.61
Table 2: Average CRPS values computed based on the discrepancy between the posterior predictive values corresponding to several model inference settings and the case counts recorded up to Sept. 15, 2020. The best forecasts arise when parameters are estimated for each county independently (last column), but the 3-county joint inversion is close behind (second column).

5.2 Statistical dependence analysis

In this section we use distance correlation [61] to ascertain the degree of dependence in the posterior distributions for individual parameters and between collections of parameters, e.g. parameters that define the model for individual counties. Distance correlation values, denoted dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT, reveal the relationships between model parameters inside each region and between regions when the parameters are inferred jointly. This information can be used to aid in model construction and gauge the degree of which the parameters controlling the dynamics of the epidemics are connected across region boundaries and therefore can benefit a joint inference approach.

Numerically, we estimate the distance correlation using the algorithm presented in definition 3 in Székely et al. [61]. This algorithm employs samples generated by the MCMC exploration of the joint posterior distribution of the model parameters and estimates the degree of dependency between individual parameters conditioned on the count data available. We also employ this approach to estimate the degree of dependence between parameter subsets, grouped by regions.

Table 3 shows dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPTvalues for the Bernalillo (left table) and Santa Fe (right) table. The entries in this table can be viewed as quantitative assessments of the shapes observed for the 2D marginal PDFs presented in the right frames of Figs. 11 and 12 included in the Appendix. For both counties we observe strong dependencies between k𝑘kitalic_k and θ𝜃\thetaitalic_θ, the shape and scale parameters of the Gamma distribution used to model the infection rate, and t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. These strong dependencies, explained by the corresponding narrow 2D marginal PDFs (in Figs. 11 and  12 in the Appendix) are induced by the strong constraints imposed by the available case-count data and the infection rate dynamics. The error model parameters, σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, exhibit little dependency among themselves and with other model parameters for Bernalillo county which is driven by larger case-counts values. However, for Santa Fe, which exhibits lower case-counts and hence changes in case count values are more relevant, the model discrepancy parameters show non-negligible dependencies with respect to each other and other model parameters. Similar trends are also observed for Valencia county (results not shown) for which the observed case counts are comparable in magnitude to Santa Fe.

Table 4 shows dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPTvalues computed with MCMC samples corresponding to a joint inversion for the three counties simultaneously. The sections in this table were colored to highlight the different types of parameter dependencies. The dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPTvalues corresponding to Bernalillo and Santa Fe counties (colored in orange) are similar to the corresponding values when the epidemiological models are calibrated region by region. This is due to infection rate models being defined on a per region basis and hence it is expected to observe that similar trends for the corresponding parameters affected by regional case counts. Given the large discrepancy between the magnitude of the case counts in adjacent regions, the additive component σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT of the error model is now less impactful compared to the multiplicative component. The spatial correlation model parameters and the multiplicative error model component show non-negligible dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPT(with joint PDFs displaying negative correlations - results not shown). We also show, in Table 5, the corresponding dcorsubscript𝑑cord_{\textrm{cor}}italic_d start_POSTSUBSCRIPT cor end_POSTSUBSCRIPTvalues between model parameters grouped by model components, i.e. by region, then spatial correlation and error models, respectively. These results are essentially summaries of the corresponding values aggregated in similarly colored regions in Table 4.

t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT N𝑁Nitalic_N k𝑘kitalic_k θ𝜃\thetaitalic_θ σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT
N𝑁Nitalic_N 0
k𝑘kitalic_k 0.9 0.1
θ𝜃\thetaitalic_θ 0.8 0.3 0.9
σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT 0 0.1 0 0
σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT 0 0 0 0 0.1
t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT N𝑁Nitalic_N k𝑘kitalic_k θ𝜃\thetaitalic_θ σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT
N𝑁Nitalic_N 0
k𝑘kitalic_k 0.9 0.4
θ𝜃\thetaitalic_θ 0.6 0.6 0.9
σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT 0.4 0.3 0.2 0
σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT 0.3 0.2 0.2 0 0.7
Table 3: Distance correlation values between parameters corresponding to the Bernalillo county (left) and Santa Fe county (right) using samples resulted from model calibrations using data for one county at a time.
Bernalillo Santa Fe Valencia SpC ErrM
t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT N𝑁Nitalic_N k𝑘kitalic_k θ𝜃\thetaitalic_θ t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT N𝑁Nitalic_N k𝑘kitalic_k θ𝜃\thetaitalic_θ t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT N𝑁Nitalic_N k𝑘kitalic_k θ𝜃\thetaitalic_θ τΦ2superscriptsubscript𝜏Φ2\tau_{\Phi}^{2}italic_τ start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT λΦsubscript𝜆Φ\lambda_{\Phi}italic_λ start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT
Bernalillo t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
N𝑁Nitalic_N 0.1
k𝑘kitalic_k 0.9 0.2
θ𝜃\thetaitalic_θ 0.8 0.3 0.9
Santa Fe t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT 0.2 0 0.2 0.2
N𝑁Nitalic_N 0 0.3 0.1 0.2 0.3
k𝑘kitalic_k 0.2 0.1 0.2 0.2 0.9 0.5
θ𝜃\thetaitalic_θ 0.1 0.1 0.2 0.2 0.8 0.6 0.9
Valencia t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT 0.2 0 0.2 0.2 0.2 0.1 0.2 0.2
N𝑁Nitalic_N 0 0.3 0.1 0.1 0.1 0.3 0.1 0.1 0.1
k𝑘kitalic_k 0.2 0 0.2 0.2 0.2 0 0.2 0.2 1.0 0.1
θ𝜃\thetaitalic_θ 0.2 0.1 0.2 0.2 0.1 0.2 0.1 0.2 0.9 0.2 1.0
SpC τΦ2superscriptsubscript𝜏Φ2\tau_{\Phi}^{2}italic_τ start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 0 0.1 0.1 0.2 0 0.1 0.1 0.1 0.1 0 0.1 0
λΦsubscript𝜆Φ\lambda_{\Phi}italic_λ start_POSTSUBSCRIPT roman_Φ end_POSTSUBSCRIPT 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0 0.1 0.1 0.9
ErrM σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT 0.1 0 0 0 0.1 0.1 0.1 0.1 0 0 0 0 0 0
σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.5 0.4 0.1
Table 4: Distance correlation values between model parameters corresponding to three adjacent counties, the spatial correlation model (SpC), and to the error model (ErrM). The light orange color corresponds to dependencies between model parameters corresponding to the same region, blue to values between pairs of parameters in difference regions, green denotes dcor values between the SpC and the region parameters and light red to dcor values that pertain between ErrM and the regional model parameters.
Bernalillo Santa Fe Valencia SpC
Santa Fe 0.2
Valencia 0.2 0.2
SpC 0.1 0.1 0.1
ErrM 0.1 0.1 0.05 0.05
Table 5: Distance correlation values between groups of parameters corresponding to three adjacent counties, the spatial correlation model (SpC), and the parameters of the error model (ErrM). The color scheme is similar to the one presented in Table 4.

6 Discussion

The results in § 5 show that we can estimate the infection-rate with a sufficient degree of accuracy so as to be able to provide short-term (2-week-ahead) forecasts of the evolution. Given that the inversion is, in effect, a smoothing operation (i.e., the observations inform infection processes that happened in the past), any discrepancy between forecasts and observations could be caused by a sudden change in the infection-rate. Thus it may be feasible to detect the arrival of a new wave of infection using the (latent) infection-rates estimated in Fig. 6.

The state of NM experienced three waves of COVID-19 infections in 2020; the state-wide totals of case-counts are shown in Fig. 7. The second wave, that was felt between June 1stsuperscript1st{\rm 1^{st}}1 start_POSTSUPERSCRIPT roman_st end_POSTSUPERSCRIPT and September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT, provides us with ample data to infer an infection-rate, and forecast the outbreak till the end of September. As is clear from Fig. 7, these forecasts will deviate from the data due to the arrival of the third wave (henceforth the “Fall 2020" wave). Our aim is to use the estimated infection-rate to detect the Fall 2020 wave, and compare our performance versus a conventional method. We will also conduct such a test using data collected till August 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT (before the Fall 2020 wave) and check whether our infection-rate method detects a (false) positive.

Refer to caption

Figure 7: The waves of COVID-19 infection in New Mexico in 2020. The Fall 2020 wave started around September 15, and is marked with a solid vertical line. The dashed line is August 15, where we will also check our detector.

We sample the posterior distribution for 𝐩𝐩\mathbf{p}bold_p (plotted in Fig. 3) and produce a fantail of predictions of the evolution of the outbreak; the 99thsuperscript99𝑡99^{th}99 start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT percentile prediction is treated as the “outlier boundary” (similar to SCPO[46]) and any day with a case-count above the boundary is deemed an “outlier". We treat three consecutive days of outliers as an “alarm" indicating an anomaly in the behavior of the data with respect to the infection-rate estimated before. This is plotted for Bernalillo, Santa Fe and Valencia counties in Fig. 8 (left column). The green line denotes September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT. Beyond this date, we see a number of days where the case-counts lie above the red “outlier boundary”; these are circled in red. Some days also have their case-count data encased inside a box; these are the third of a 3-day sequence of outlier days (and thus an “alarm” day). We see that in all three counties, we could detect the arrival of the Fall 2020 wave successfully. We repeated the infection-rate estimation using data from June 1stsuperscript1st{\rm 1^{st}}1 start_POSTSUPERSCRIPT roman_st end_POSTSUPERSCRIPT to August 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and performed a similar check for “alarm" days between August 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and 31stsuperscript31st{\rm 31^{st}}31 start_POSTSUPERSCRIPT roman_st end_POSTSUPERSCRIPT; these are plotted in Fig. 10 (in the Appendix). While we do detect many “outlier days", we do not see any “alarm days". Thus monitoring the infection-rate allows us to detect the Fall 2020 wave when it is present; further, it does not lead to a false positive in the absence of a new wave of infection.

Next we compare the performance of the detection method using the infection-rate against a conventional detector [7], which we call “GLR-Poisson” (for Generalized Likelihood Ratio - Poisson). This detector uses the raw case-counts to fit a time-series model (complete with prediction uncertainty bounds) and thus detect “outlier days”. The detector has two formulations, one based on the negative binomial (NB) distribution and another based on Poisson. We use the implementation in the R Statistical Software[62] (R version 4.3.2 (2023-10-31)) package surveillance[63]. The case-count on any day is modeled as ytNB(μt,α)similar-tosubscript𝑦𝑡NBsubscript𝜇𝑡𝛼y_{t}\sim{\textit{NB}}(\mu_{t},\alpha)italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ NB ( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_α ), (or ytPois(μt)similar-tosubscript𝑦𝑡Poissubscript𝜇𝑡y_{t}\sim{\textit{Pois}}(\mu_{t})italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ Pois ( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )) where μtsubscript𝜇𝑡\mu_{t}italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the mean and α𝛼\alphaitalic_α is the dispersion of a NB distribution. The mean is modeled as log(μt)=β0+β1t+s=1Sβ2ssin(ωst)+β2s+1cos(ωst)subscript𝜇𝑡subscript𝛽0subscript𝛽1𝑡superscriptsubscript𝑠1𝑆subscript𝛽2𝑠𝜔𝑠𝑡subscript𝛽2𝑠1𝜔𝑠𝑡\log(\mu_{t})=\beta_{0}+\beta_{1}t+\sum_{s=1}^{S}\beta_{2s}\sin(\omega st)+% \beta_{2s+1}\cos(\omega st)roman_log ( italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_t + ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT 2 italic_s end_POSTSUBSCRIPT roman_sin ( italic_ω italic_s italic_t ) + italic_β start_POSTSUBSCRIPT 2 italic_s + 1 end_POSTSUBSCRIPT roman_cos ( italic_ω italic_s italic_t ), where ω=2π/365𝜔2𝜋365\omega=2\pi/365italic_ω = 2 italic_π / 365; in essence, this is a seasonal log-linear model with parameters β𝛽\mathbf{\beta}italic_β. We set S=1𝑆1S=1italic_S = 1, since there is clearly only one mode in Fig. 7. We fit a model log(μ0)superscript𝜇0\log(\mu^{0})roman_log ( italic_μ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) using data from June 1stsuperscript1st{\rm 1^{st}}1 start_POSTSUPERSCRIPT roman_st end_POSTSUPERSCRIPT to September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT (corresponding to β0superscript𝛽0\mathbf{\beta}^{0}italic_β start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT), before the arrival of the Fall 2020 wave, and test whether a new model (for log(μ1)superscript𝜇1\log(\mu^{1})roman_log ( italic_μ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT )) (corresponding to β1superscript𝛽1\mathbf{\beta}^{1}italic_β start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT), fitted solely to a moving window in the post-September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT data, explains it appreciably better than the original log(μ0)superscript𝜇0\log(\mu^{0})roman_log ( italic_μ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) model. Indexing the days after September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT as l=1L=15𝑙1𝐿15l=1\ldots L=15italic_l = 1 … italic_L = 15, we compute the set of days lsuperscript𝑙l^{\ast}italic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT where

max1lLsupβ[t=lLlog(fβ1(yt)fβ0(yt))]>cγ,subscript1𝑙𝐿subscriptsupremum𝛽delimited-[]superscriptsubscript𝑡𝑙𝐿subscript𝑓superscript𝛽1subscript𝑦𝑡subscript𝑓superscript𝛽0subscript𝑦𝑡subscript𝑐𝛾\max_{1\leq l\leq L}\sup_{\mathbf{\beta}}\left[\sum_{t=l}^{L}\log\left(\frac{f% _{\mathbf{\beta}^{1}}(y_{t})}{f_{\mathbf{\beta}^{0}}(y_{t})}\right)\right]>c_{% \gamma},roman_max start_POSTSUBSCRIPT 1 ≤ italic_l ≤ italic_L end_POSTSUBSCRIPT roman_sup start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT roman_log ( divide start_ARG italic_f start_POSTSUBSCRIPT italic_β start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_f start_POSTSUBSCRIPT italic_β start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG ) ] > italic_c start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , (15)

where fβ(yt)subscript𝑓𝛽subscript𝑦𝑡f_{\mathbf{\beta}}(y_{t})italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ( italic_y start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) is the negative binomial distribution and cγ=3subscript𝑐𝛾3c_{\gamma}=3italic_c start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT = 3. In essence, in the 15-day period between September 16thsuperscript16th{\rm 16^{th}}16 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT and 30thsuperscript30th{\rm 30^{th}}30 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT, we search for a window where the original log(μ0)superscript𝜇0\log(\mu^{0})roman_log ( italic_μ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) model explains the data poorly. Note that this model does require much historical data to calibrate β0superscript𝛽0\mathbf{\beta}^{0}italic_β start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT (for example, to determine the seasonal nature of the outbreaks), something that is rarely available for novel diseases such as COVID-19. Using the distribution (negative binomial or Poisson), it is also possible to predict the case-count that would have caused an “outlier day”. Per Kim et al. [41], the NB tends to give better fits whereas Poisson is preferable for small datasets, and so we test both formulations. The results with the NB distribution are clearly inferior and are in our technical report[9]. The results with the Poisson distribution are plotted in Fig. 8 (right column), with the “outlier boundary” in red. For Bernalillo, in the post-September 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT period, we see many outliers and a few alarm days, implying that the Fall 2020 wave was detected. The detector does not show any alarms for Valencia or Santa Fe, thus completely missing the Fall 2020 wave. We repeat this analysis for data between June 1stsuperscript1st{\rm 1^{st}}1 start_POSTSUPERSCRIPT roman_st end_POSTSUPERSCRIPT and August 15thsuperscript15th{\rm 15^{th}}15 start_POSTSUPERSCRIPT roman_th end_POSTSUPERSCRIPT (see Fig. 10 in the Appendix). Here the detector identifies outliers and alarms in the data for Bernalillo and Santa Fe, thus “detecting” the Fall 2020 wave a full month before its arrival; clearly, this is a false positive. In contrast, the detector behaves correctly for Valencia. The reason for the poor performance of the GLR-Poisson detector is likely due to the peculiarities of our COVID-19 data (no long historical record and low case-counts from sparsely populated areal units), which runs afoul of many assumptions embedded in conventional disease detectors.

Note that the ability to detect the Fall 2020 wave correctly does not imply that we have fashioned an infection-rate-based disease detector (e.g., we have not attempted to compute a Receiver Operating Characteristic curve); rather, it shows that the infection-rate of an outbreak of a novel disease has the information content that could be exploited within a disease detector. The smoothing effect of our estimation process (which reduces the effect of noise in the observations) and the use of epidemiological information i.e., the incubation period distribution, compensates for the lack of long time-series data that conventional detectors rely on for information content, thus making our method particularly suited for novel outbreaks. For endemic diseases with long time-series and high-quality data, our method would possibly be unnecessarily complex.

Refer to caption Refer to caption

Refer to caption Refer to caption

Refer to caption Refer to caption

Figure 8: Comparison of the infection-rate detector (left) compared with the GLR-Poisson detector (right), using data from 2020-06-01 to 2020-09-15. The symbols are the observed case-counts. The Fall 2020 wave is believed to have started around September 15. The red line beyond September 15 is the outlier boundary; a day with a case-count above the dashed line is an “outlier” and is circled. A data point with a square box around it denotes the the last of a sequence of three consecutive alarmed days. Top row: Performance for Bernalillo county. Middle row: Results for Santa Fe county. Bottom row: Results for Valencia county. In all cases we see that the GLRNB detector misses the Fall 2020 wave.

7 Conclusion

In this paper, we explore whether it is possible to use the (latent) infection-rate of a disease as a monitoring variable in disease surveillance. This is because the infection-rate, which is governed by mixing patterns and spreading characteristics of the pathogen in question, does not vary erratically from day-to-day; in contrast, observed case-counts, the monitoring variable for all conventional disease surveillance algorithms is contaminated by reporting errors. The difficulty, of course, lies in being able to estimate the infection-rate from the case-counts, which can have high variance if they are small numbers.

To this end, we developed a method to estimate an infection-rate (spatiotemporal) field defined over multiple areal units, conditional on case-count time-series, of various fidelities, gathered from the areal units. The aim of estimating a field, rather than a time-varying infection-rate inside an areal unit, was driven by our desire to encode spatial patterns of epidemiological dynamics into the infection-rate field, allowing us to “borrow” information from neighboring areal units and compensate for poor quality observations. The method was demonstrated on COVID-19 data from 3 counties of New Mexico - Bernalillo, Santa Fe and Valencia. Our method uses COVID-19 data and exogenous covariates to uncover the spatial patterns in epidemiological dynamics and encode them as a Gaussian Markov Random Field (GMRF) model. We extend our original method for estimating the infection-rate in one areal unit[6] to multiple units, and use the GMRF to impose a degree of smoothing. Joint inversions for disease parameters showed that the PDFs and posterior predictive simulations for Santa Fe (which had low case-count data) were sharper compared to inversions performed for one areal unit.

The estimated infection-rate field, estimated using data from June 1, 2020 to September 15, 2020, was used to forecast the evolution of the outbreak for two weeks ahead. The Fall 2020 wave of COVID-19 arrived on September 15 and the forecasts are expected to be erroneous i.e., our forecast acts as a detector of the new wave of infection. Our model’s performance was compared with that of a conventional surveillance algorithm that, like all other surveillance algorithms, relies on a long historical training database and which was not not available for COVID-19 because of its novelty. Our method successfully detected the arrival within the two-week period whereas the conventional detector failed. In addition, we tested the method with data till August 15th, 2020, one month before the arrival of the Fall 2020 wave. Our method failed to detect a wave; the conventional detector detected a non-existent one for Bernalillo. The aberrant behavior of the conventional detector is easily explained by the insufficiency of training data, but this is likely to be the case for any novel disease. Thus our premise that the infection-rate could be used as a monitoring variable in surveillance algorithms seems to be a promising one and does not suffer from the need for a lot of data to function well. This robustness makes it particularly well-suited for novel diseases.

Our method suffers from two shortcomings. Our first shortcoming is that while our formulation is generalizable to many areal units, it has been demonstrated on just three areal units. This is due to the lack of scalability of MCMC. We have adapted our method to use approximate, but scalable, mean-field Variational Inference and scaled it to all 33 counties in NM; this is documented in a technical report[9] and is the subject of our next paper. The second shortcoming is the use of Gaussian models throughout this paper, even though the low case-count data for some counties, e.g., Santa Fe, would have suggested a negative binomial distribution. This, however, would have requires us to develop a random field model using negative binomials, and is left to future work.

Author contributions

Cosmin Safta formulated the problem, wrote the software to solve it, generated the figures and wrote the paper. Jaideep Ray formulated the spatial inverse problem, wrote the software to perform the detection of the Fall 2020 wave and wrote the sections of paper describing it. Wyatt Bridgman helped with problem formulation and writing of the paper.

Acknowledgments

This paper (SAND2024-07653O) describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This article has been authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan https://www.energy.gov/downloads/doe-public-access-plan.

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

Refer to caption
Figure 9: Variation explained by the principal components obtained via sparse PCA of the 79 risk factors used to model population-normalized case-counts in the counties of New Mexico. We see that 12 principal components can cover 95% of the variations observed in the risk-factors.

Refer to caption Refer to caption

Refer to caption Refer to caption

Refer to caption Refer to caption

Figure 10: Comparison of the infection-rate detector (left) compared with the GLR-Poisson detector (right), using data from 2020-06-01 to 2020-08-15. August 15 is a month before the arrival of the Fall 2020 wave. The symbols are the observed case-counts. The red line beyond August 15 is the outlier boundary; a day with a case-count above the dashed line is an “outlier” and is circled. Top row: Performance for Bernalillo county. Middle row: Results for Santa Fe county. Bottom row: Results for Valencia county. The absence of any days with a box around it implies that no alarms were raised, which is the correct behavior.

Figs. 11-13 show 1D and 2D marginal posterior distributions for the three counties tackled in this study. These results indicate a strong correlation between the inferred start of the epidemic, t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, and the parameters of the infection model k𝑘kitalic_k and θ𝜃\thetaitalic_θ for each of these regions. When calibrating model for individual regions, the discrepancy between the model and the available observations results in an error model with both the additive σasubscript𝜎𝑎\sigma_{a}italic_σ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT and multiplicative σmsubscript𝜎𝑚\sigma_{m}italic_σ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT components informed by the data for the counties with smaller populations, Santa Fe and Valencia. For Bernalillo only just the additive error component is sufficient to model the discrepancy. When performing the statistical inference with all three counties, the multiplicative component takes over as that error model component is less sensitive to phase shifts of the epidemic waves compared to the additive component.

Refer to caption Refer to caption

Figure 11: One and two-dimensional marginal posterior distributions for the Bernalillo county model parameters; left: 3 region joint inversion, right: Bernalillo county only.

Refer to caption Refer to caption

Figure 12: One and two-dimensional marginal posterior distributions for the Santa Fe county model parameters; left: 3 region joint inversion, right: Santa Fe county only.

Refer to caption Refer to caption

Figure 13: One and two-dimensional marginal posterior distributions for the Valencia county model parameters; left: 3 region joint inversion, right: Valencia county only.

References

  • [1] M. L. Daza-Torres, M. A. Capistrán, A. Capella, and J. A. Christen, “Bayesian sequential data assimilation for covid-19 forecasting,” Epidemics, vol. 39, p. 100564, 2022.
  • [2] Z. Wang, X. Zhang, G. Teichert, M. Carrasco-Teja, and K. Garikipati, “System inference for the spatio-temporal evolution of infectious diseases: Michigan in the time of covid-19,” Computational Mechanics, vol. 66, pp. 1153–1176, 2020.
  • [3] P. Chen, K. Wu, and O. Ghattas, “Bayesian inference of heterogeneous epidemic models: Application to covid-19 spread accounting for long-term care facilities,” Computer Methods in Applied Mechanics and Engineering, vol. 385, p. 114020, 2021.
  • [4] P. Blonigan, J. Ray, and C. Safta, “Forecasting multi-wave epidemics through bayesian inference,” Archives of Computational Methods in Engineering, vol. 28, no. 6, pp. 4169–4183, 2021.
  • [5] Y. Lin, J. Neumann, E. Miller, R. Posner, A. Mallela, C. Safta, J. Ray, G. Thakur, S. Chinthavali, and W. Hlavacek, “Daily forecasting of regional epidemics of coronavirus disease with bayesian uncertainty quantification, united states.,” Emerging Infectious Diseases, vol. 27, no. 3, pp. 767–778, 2021.
  • [6] C. Safta, J. Ray, and K. Sargsyan, “Characterization of partially observed epidemics through bayesian inference: Application to covid-19,” Computational Mechanics, vol. 66, no. 5, pp. 1109–1129, 2020.
  • [7] M. Höhle and M. Paul, “Count data regression charts for the monitoring of surveillance time series,” Computational Statistics and Data Analysis, vol. 52, no. 9, pp. 4357–4368, 2008.
  • [8] H. Haario, E. Saksman, and J. Tamminen, “An adaptive metropolis algorithm,” Bernoulli, pp. 223–242, 2001.
  • [9] J. Ray, C. Safta, W. Bridgman, M. Horii, and A. Gould, “A spatially regularized detector for emergent/re-emergent disease outbreaks,” Tech. Rep. SAND2023-09749R, Sandia National Laboratories, PO Box 5800, Albuquerque, NM 87185, September 2023. https://www.sandia.gov/app/uploads/sites/203/2023/09/SAND2023-09749R.pdf.
  • [10] N. Best, S. Richardson, and A. Thomson, “A comparison of bayesian spatial models for disease mapping,” Statistical methods in medical research, vol. 14, no. 1, pp. 35–59, 2005.
  • [11] L. Waller and B. Carlin, “Disease mapping,” in Handbook of Spatial Statistics (A. E. Gelfand, P. J. Diggle, M. Fuentes, and P. Guttorp, eds.), ch. 14, Chapman & Hall / CRC Press, 2010.
  • [12] X. Huang, H. Zhou, X. Yang, W. Zhou, J. Huang, and Y. Yuan, “Spatial characteristics of coronavirus disease 2019 and their possible relationship with environmental and meteorological factors in hubei province, china,” GeoHealth, vol. 5, no. 6, p. e2020GH000358, 2021.
  • [13] X. Geng, G. G. Katul, F. Gerges, E. Bou-Zeid, H. Nassif, and M. C. Boufadel, “A kernel-modulated sir model for covid-19 contagiousspread from county to continent,” Proceedings of the National Academy of Sciences, vol. 118, no. 21, p. e2023321118, 2021.
  • [14] L. Schuler, J. M. Calabrese, and S. Attinger, “Data driven high resolution modeling and spatial analyses of the covid-19 pandemic in germany,” PLOS ONE, vol. 16, pp. 1–14, 08 2021.
  • [15] T. McMahon, A. Chan, S. Havlin, and L. K. Gallos, “Spatial correlations in geographical spreading of COVID-19 in the United States,” Scientific Reports, vol. 12, no. 1, p. 699, 2022.
  • [16] M. Rendana, W. M. R. Idris, and S. Abdul Rahim, “Spatial distribution of covid-19 cases, epidemic spread rate, spatial pattern, and its correlation with meteorological factors during the first to the second waves,” Journal of Infection and Public Health, vol. 14, no. 10, pp. 1340–1348, 2021. Special Issue on COVID-19 – Vaccine, Variants and New Waves.
  • [17] S. H. S. Indika, N. Diawara, H. A. Jeng, B. D. Giles, and D. S. K. Gamage, “Modeling the spread of covid-19 in spatio-temporal context,” Mathematical Biosciences and Engineering, vol. 20, no. 6, pp. 10552–10569, 2023.
  • [18] A. Lawson and D. Lee, “Chapter 16 - bayesian disease mapping for public health,” in Disease Modelling and Public Health, Part A (A. S. Srinivasa Rao, S. Pyne, and C. Rao, eds.), vol. 36 of Handbook of Statistics, pp. 443–481, Elsevier, 2017.
  • [19] Y. C. MacNab, “Bayesian disease mapping: Past, present, and future,” Spatial Statistics, vol. 50, pp. 100593, 28, 2022.
  • [20] J. Besag, J. York, and A. Mollié, “Bayesian image restoration, with two applications in spatial statistics,” Annals of the institute of statistical mathematics, vol. 43, pp. 1–20, 1991.
  • [21] H. Stern and N. A. Cressie, “Inference for extremes in disease mapping,” in Disease Mapping and Risk Assessment for Public Health (A. Lawson, ed.), pp. 63–84, Chichester: Wiley, 1999.
  • [22] N. Cressie, Statistics for spatial data. John Wiley & Sons, 2015. Revised Edition.
  • [23] H. Baptista, J. M. Mendes, Y. C. MacNab, M. Xavier, and J. C. de Almeida, “A gaussian random field model for similarity-based smoothing in bayesian disease mapping,” Statistical Methods in medical Research, vol. 25, no. 4, pp. 1166–1184, 2016.
  • [24] N. G. Best, R. A. Arnold, A. Thomas, L. A. Waller, and E. M. Conlon, “Bayesian models for spatially correlated disease and exposure data,” Bayesian statistics, vol. 6, pp. 131–156, 1999.
  • [25] S. Unkel, C. P. Farrington, P. H. Garthwaite, C. Robertson, and N. Andrews, “Statistical methods for the prospective detection of infectious disease outbreaks: a review,” Journal of the Royal Statistical Society Series A: Statistics in Society, vol. 175, no. 1, pp. 49–82, 2012.
  • [26] W. A. Shewhart, “Economic quality control of manufactured product 1,” Bell System Technical Journal, vol. 9, no. 2, pp. 364–389, 1930.
  • [27] R. E. Serfling, “Methods for current statistical analysis of excess pneumonia-influenza deaths,” Public health reports, vol. 78, no. 6, p. 494, 1963.
  • [28] C. Pelat, P.-Y. Boëlle, B. J. Cowling, F. Carrat, A. Flahault, S. Ansart, and A.-J. Valleron, “Online detection and quantification of epidemics,” BMC Medical Informatics and Decision Making, vol. 7, no. 1, pp. 1–9, 2007.
  • [29] R. Parker, “Analysis of surveillance data with poisson regression: a case study,” Statistics in Medicine, vol. 8, no. 3, pp. 285–294, 1989.
  • [30] M. L. Jackson, A. Baer, I. Painter, and J. Duchin, “A simulation study comparing aberration detection algorithms for syndromic surveillance,” BMC medical informatics and decision making, vol. 7, no. 1, pp. 1–11, 2007.
  • [31] C. Farrington, N. J. Andrews, A. Beale, and M. Catchpole, “A statistical algorithm for the early detection of outbreaks of infectious disease,” Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 159, no. 3, pp. 547–563, 1996.
  • [32] B. Y. Reis and K. D. Mandl, “Time series modeling for syndromic surveillance,” BMC medical informatics and decision making, vol. 3, no. 1, pp. 1–11, 2003.
  • [33] G. D. Williamson and G. Weatherby Hudson, “A monitoring system for detecting aberrations in public health surveillance reports,” Statistics in medicine, vol. 18, no. 23, pp. 3283–3298, 1999.
  • [34] Y. Le Strat and F. Carrat, “Monitoring epidemiologic surveillance data using hidden markov models,” Statistics in medicine, vol. 18, no. 24, pp. 3463–3478, 1999.
  • [35] M. A. Martínez-Beneito, D. Conesa, A. López-Quílez, and A. López-Maside, “Bayesian markov switching models for the early detection of influenza epidemics,” Statistics in medicine, vol. 27, no. 22, pp. 4455–4468, 2008.
  • [36] D. Conesa, R. Amorós, A. López-Quılez, and M. A. Martınez-Beneito, “Mean-variability hidden markov models for the detection of influenza outbreaks,” in 25th International Workshop on Statistical Modelling, (Amsterdam, The Netherlands), Statistical Modelling Society, 2010.
  • [37] H.-M. Lu, D. Zeng, and H. Chen, “Markov switching models for outbreak detection,” in Infectious Disease Informatics and Biosurveillance: Research, Systems and Case Studies (C. Castillo-Chavez, H. Chen, W. B. Lober, M. Thurmond, and D. Zeng, eds.), pp. 111–144, Springer US, 2011.
  • [38] L. Held, M. Hofmann, M. Höhle, and V. Schmid, “A two-component model for counts of infectious diseases,” Biostatistics, vol. 7, no. 3, pp. 422–437, 2006.
  • [39] A. B. Lawson and H.-R. Song, “Bayesian hierarchical modeling of the dynamics of spatio-temporal influenza season outbreaks,” Spatial and spatio-temporal epidemiology, vol. 1, no. 2-3, pp. 187–195, 2010.
  • [40] A. B. Lawson, “Evaluation of predictive capability of bayesian spatio-temporal models for covid-19 spread,” BMC Medical Research Methodology, vol. 23, no. 1, p. 182, 2023.
  • [41] J. Kim, A. B. Lawson, B. Neelon, J. E. Korte, J. M. Eberth, and G. Chowell, “Evaluation of bayesian spatiotemporal infectious disease models for prospective surveillance analysis,” BMC Medical Research Methodology, vol. 23, no. 1, p. 171, 2023.
  • [42] A. B. Lawson and J. Kim, “Space-time covid-19 bayesian sir modeling in south carolina,” PLoS One, vol. 16, no. 3, p. e0242777, 2021.
  • [43] B. Sartorius, A. Lawson, and R. Pullan, “Modelling and predicting the spatio-temporal spread of covid-19, associated deaths and impact of key risk factors in england,” Scientific reports, vol. 11, no. 1, p. 5378, 2021.
  • [44] A. B. Lawson and J. Kim, “Issues in bayesian prospective surveillance of spatial health data,” Spatial and Spatio-temporal Epidemiology, vol. 41, p. 100431, 2022.
  • [45] C. Rotejanaprasert, A. Lawson, S. Bolick-Aldrich, and D. Hurley, “Spatial bayesian surveillance for small area case event data,” Statistical methods in medical research, vol. 25, no. 4, pp. 1101–1117, 2016.
  • [46] A. Corberán-Vallet and A. B. Lawson, “Conditional predictive inference for online surveillance of spatial disease incidence,” Statistics in medicine, vol. 30, no. 26, pp. 3095–3116, 2011.
  • [47] “Coronavirus (Covid-19) Data in the United States.” https://github.com/nytimes/covid-19-data. Accessed: 2023-01-01.
  • [48] “COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.” https://github.com/CSSEGISandData/COVID-19. Accessed: 2023-01-01.
  • [49] L. Shand, A. Foss, A. Zhang, J. D. Tucker, and G. Huerta, “A statistical model for the spread of sars-cov-2 in new mexico,” Tech. Rep. SAND2020-10080, Sandia National Laboratories, PO Box 5800, Albuquerque, NM 87185, September 2020.
  • [50] “United States Census Bureau, quickfacts new mexico.” https://www.census.gov/quickfacts/NM, January 2024. Accessed January 2024.
  • [51] “New Mexico Primary Care Association, find a health center.” https://www.nmpca.org/find-a-health-center, 2024. Accessed January 2024.
  • [52] “New Mexico Department of Health, covid-19 screening and test sites.” https://cvprovider.nmhealth.org/directory.html. Accessed September 2020.
  • [53] “Earth Data Analysis Center, university of new mexico, resource geographic information system.” https://rgis.unm.edu. Accessed September 2020.
  • [54] N. B. Erichson, P. Zheng, K. Manohar, S. L. Brunton, J. N. Kutz, and A. Y. Aravkin, “Sparse principal component analysis via variable projection,” SIAM Journal on Applied Mathematics, vol. 80, no. 2, pp. 977–1002, 2020.
  • [55] R. S. Bivand and D. W. Wong, “Comparing implementations of global and local indicators of spatial association,” Test, vol. 27, no. 3, pp. 716–748, 2018.
  • [56] S. A. Lauer, K. H. Grantz, Q. Bi, F. K. Jones, Q. Zheng, H. R. Meredith, A. S. Azman, N. G. Reich, and J. Lessler, “The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application,” Annals of Internal Medicine, 2020.
  • [57] N. Cressie and G. Johannesson, “Fixed rank kriging for very large spatial data sets,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 209–226, 2008.
  • [58] C. Andrieu and G. O. Roberts, “The pseudo-marginal approach for efficient Monte Carlo computations,” Ann. Statist., vol. 37, no. 2, pp. 697–725, 2009.
  • [59] R. Kass, B. Carlin, A. Gelman, and R. Neal, “Markov chain monte carlo in practice: A roundtable discussion,” The American Statistician, vol. 52, no. 2, pp. 93–100, 1998.
  • [60] T. Gneiting and M. Katzfuss, “Probabilistic forecasting,” Annual Review of Statistics and Its Application, vol. 1, pp. 125–151, 2014.
  • [61] G. Székely, M. Rizzo, and N. Bakirov, “Measuring and testing dependence by correlation of distances,” Annals of Statistics, vol. 35, pp. 2769–2794, 2007.
  • [62] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2023.
  • [63] M. Höhle, “surveillance: An R package for the monitoring of infectious diseases,” Computational Statistics, vol. 22, no. 4, pp. 571–582, 2007.