
Open access

Toward Accurate Spatiotemporal COVID-19 Risk Scores Using High-Resolution Real-World Mobility Data

Authors:

Yan Liu

ACM Transactions on Spatial Algorithms and Systems (TSAS), Volume 8, Issue 2

Article No.: 10, Pages 1 - 30

https://doi.org/10.1145/3481044

Published: 18 January 2022


Abstract

As countries look toward re-opening of economic activities amidst the ongoing COVID-19 pandemic, ensuring public health has been challenging. While contact tracing only aims to track past activities of infected users, one path to safe reopening is to develop reliable spatiotemporal risk scores to indicate the propensity of the disease. Existing works which aim at developing risk scores either rely on compartmental model-based reproduction numbers (which assume uniform population mixing) or develop coarse-grain spatial scores based on reproduction number (R0) and macro-level density-based mobility statistics. Instead, in this article, we develop a Hawkes process-based technique to assign relatively fine-grain spatial and temporal risk scores by leveraging high-resolution mobility data based on cell-phone originated location signals. While COVID-19 risk scores also depend on a number of factors specific to an individual, including demography and existing medical conditions, the primary mode of disease transmission is via physical proximity and contact. Therefore, we focus on developing risk scores based on location density and mobility behaviour. We demonstrate the efficacy of the developed risk scores via simulation based on real-world mobility data. Our results show that fine-grain spatiotemporal risk scores based on high-resolution mobility data can provide useful insights and facilitate safe re-opening.

1 Introduction

As the Coronavirus Disease (COVID-19) becomes a long-term challenge in our day-to-day lives, planning for it and learning effective ways to navigate the disease have become critical. In the earlier phases of the disease, when there were relatively few cases, contact tracing (identifying people who may have come into contact with an infected individual) served as an effective tool to mitigate disease spread [30]. However, as the number of cases reaches record levels and a general sense of pandemic fatigue sets in, it has become important to develop practical tools for navigating the disease safely to resume normal activities [20, 50, 51].

COVID-19 transmits through contacts between individuals. It spreads in a population due to infected people moving and co-locating with people susceptible to the disease [40]. For such susceptible individuals, infection risk boils down to going to locations with a high probability of co-locating with infected individuals [23]. Thus, one approach toward allowing safe resumption of normal activities is assigning risk scores for different regions to show the danger in each area. This can be used both for policymaking at the government level as well as individual decision making (e.g., to avoid high-risk areas). To this end, the risk scores provided need to be (1) spatiotemporally fine-grained, (2) reliable, and (3) accurately evaluated.

Fine-Grained Risk Scores Using Mobility Patterns. Coarse-grain risk scores and reproduction number (R0) estimates at the county or state level are not readily usable for public policy decisions at finer spatial and temporal scales [9, 28]. Analogous to traffic congestion prediction for transportation applications (e.g., car navigation), where high-resolution information such as the average speed at a specific freeway segment at a particular time is critical to avoid potentially congested routes, risk scores need to be local to an area to influence decision making [34]. In other words, coarse-grain risk scores are akin to reporting the average speed of cars for an entire city, which provides no useful information for either individual or policy-making plans. Such information becomes increasingly useful with finer granularity, both spatially and temporally. This improvement in spatiotemporal resolution has also led to recent advances in traffic prediction [39].

Motivated by this insight, we focus on developing finer-grain spatiotemporal risk scores that leverage real-world mobility patterns from one area in a city to another. Recent works, which primarily focus on disease forecasting, consider coarse-grain mobility densities (at county level) [12] or densities at certain points of interest within a city [10], potentially due to lack of access to fine-grained data. In this work, we leverage actual mobility, as opposed to relying on the popularity of an area (density), using Origin-Destination (OD) mobility graphs [5].

In contrast to traffic flow forecasting, where the aim is to predict future traffic flow from the current flow, predicting risk scores has the additional challenge of forecasting infections from the current spatial density and mobility behavior. That is, rather than predicting future mobility as is done for traffic forecasting, the model needs to predict the future chance of infection, which depends on both mobility behavior and rate of infection. To this end, we leverage a Hawkes process-based modelling procedure to predict future infections based on mobility patterns. Here, the Hawkes process-based modelling, which is extensively used to model event-based infection spread, provides a flexible way to incorporate the mobility data while allowing for explicit modelling of disease transmission [12]. In addition to the mobility-based model which leverages high-resolution location densities, we also develop a variant of our model which utilizes the mobility of infections from one region to another, i.e., infection mobility, to aid infection and risk predictions. We show that the use of high-resolution mobility patterns along with infection mobility leads to improved infection and risk prediction.

Reliable Risk Scores. Popular disease prediction models employ compartmental modelling based on the classical Susceptible-Infected-Removed (SIR) model [32] to explicitly model disease transmission. Although popular in practice due to their simplicity, these models rely on homogeneous population mixing (i.e., each person has an equal probability of coming into contact with any other individual) to learn the model parameters, and thus lead to coarse-grain R0 estimates [6, 29, 43, 45, 55]. Nevertheless, these models are useful when finer-grain data is not available, and recent works leverage time-varying models to develop dynamic R0 for relatively finer scales, still assuming homogeneous mixing [34].

One way to relax the homogeneous mixing assumption is to employ self-excitation-based Hawkes point process models [19, 41], which are mathematically related to the compartmental models [35, 46]. Specifically, these works show that the Hawkes process can be viewed as a special case of stochastic SIR models when the recoveries are unobserved (a more realistic scenario). As a result, Hawkes process-based models have become popular for modeling the epidemic spread of COVID-19 [6, 12, 36, 53] and other outbreaks such as Ebola [31, 48], where they have also been found to yield better predictions with minimal assumptions [31]. To this end, a recent work [12] leverages the flexibility of Hawkes process models to incorporate demographic and mobility density indices for COVID-19 prediction. In summary, existing models either (a) do not incorporate high-resolution mobility data [12], or (b) use compartmental models which assume homogeneous mixing [34]. In this work, we show that incorporating high-resolution mobility data along with Hawkes process-based modeling leads to more reliable fine-grained risk scores.

Accurately Evaluating the Risk Scores. Another challenge in accurately modeling the spread of the disease is the lack of access to reliable fine-grain infection statistics. The number of infections reported in the real world is inaccurate, as not everyone who gets infected is tested for the virus [49]. To address this, existing works either (a) ignore this issue and use the reported number of infections as ground truth [6, 12, 34, 43], or (b) try to circumvent it by using the number of deaths to study the spread of the disease [12]. In the former case, judging the accuracy of the model becomes difficult as it is not tested on the ground truth. In the latter case, the model cannot capture how infections occur; it only models the number of deaths. Furthermore, fine-grained statistics on infection locations are not publicly available. That is, at best we can know the number of infections per county from public sources, which makes it difficult to assign risk scores to different locations (e.g., a zipcode or shopping center) within it.

To summarize, there are three main limitations of existing methods: (a) they assume homogeneous mixing of the population, i.e., they do not incorporate mobility information, and/or (b) they assign coarse-grain scores (at county level at best), and (c) they are difficult to evaluate. As a result, there is a need for a technique to assign COVID-19 risk scores that is (a) informative (fine-grain and time-varying), (b) reliable (considers inhomogeneous mixing), and (c) accurately evaluated.

1.1 Our Approach

To address these challenges, we develop a Hawkes process-based spatiotemporal risk measure, dubbed LocationRisk@T, which utilizes the high-resolution mobility patterns in a city available via cell-phone location signals [2] (a location signal is a record containing the location of the device at a particular point in time). We specifically show that such fine-grained mobility information can be used to assign risk scores that closely track future infections in a region. We corroborate the efficacy of LocationRisk@T by showing its ability to track infections on simulated disease spread on real-world mobility for the months of December 2019, January and March 2020.

In particular, to evaluate LocationRisk@T, we use an agent-based simulation to compute the ground-truth number of infections. Our simulation, called SpreadSim, uses location signals from a large number of cell-phones across the US. SpreadSim simulates how the disease spreads across the population by utilizing these real-world high-resolution mobility patterns. Since SpreadSim utilizes real-world data on people’s movement, it can generate infection patterns for the population that are closer to the real world. Furthermore, since the infections are generated by the simulation, we have access to the ground-truth number of infections, which allows for accurate evaluation of LocationRisk@T.

Overall, our disease transmission simulation uses real-world mobility patterns and co-locations to emulate the spread of disease in the real world. Our point-process-based prediction model then leverages the mobility patterns to forecast future infections; the resulting intensity function of the Hawkes process-based model yields the risk score. Our main contributions are as follows.

• SpreadSim: A Disease Spread Simulation Using Real-World Location Signals. We build a co-location-based disease spread model using real-world location signals. SpreadSim emulates the disease transmission process in the real world to generate infection patterns, and can be of independent interest for the analysis of disease spread and intervention policies.

• LocationRisk@T: High-Resolution Spatiotemporal Risk Scores. As opposed to previous works which utilize coarse-grain location density indices, we first develop a mobility-aware Hawkes process-based model, LocationRisk@T \(_{Mob}\) , which leverages the high-resolution mobility patterns between different regions of a city to predict infections and assign spatiotemporal risk scores. Subsequently, we develop LocationRisk@T \(_{Mob^+}\) , which also accounts for the movement of the infected population to improve the performance of the model.

• Analyze the Disease Spread Patterns. We leverage SpreadSim and LocationRisk@T to analyze the differences in disease spread under different conditions of real-world mobility rates, namely, before and after the March 2020 lockdown. Our results on cities across the United States demonstrate that high-resolution mobility data can be used as a reliable public health tool to assess the potential risk associated with parts of a city over time.

One possible application of the LocationRisk@T risk score is to reduce foot traffic to high-risk areas during the times when the predicted risk score is high. Furthermore, since the LocationRisk@T risk score leverages region-specific aggregate mobility patterns, it preserves the privacy of device owners. In other words, although SpreadSim utilizes individual-level co-locations for evaluation purposes, our LocationRisk@T risk prediction model can be used to assign risk scores while keeping the user data private.

1.2 Overview and Organization

Our approach can be summarized as follows.

First, our agent-based simulation, SpreadSim, uses real-world location signals to generate infection patterns. Specifically, SpreadSim takes as input location trajectories of a number of individuals, as well as parameters that control how the disease spreads between the individuals to simulate how the infection progresses in the population over time. Finally, it outputs a list of infected individuals during the simulation together with the time and location they became infected. The details of SpreadSim are discussed in Section 2.

Second, LocationRisk@T takes as input the infection statistics (aggregate data, not individuals’ locations or co-locations) that were generated from the simulation. It uses the statistics for the first n days of the simulation, for a parameter n, to learn a model that is able to accurately predict the number of infections for different locations, as well as provide a risk score for each location. Infection statistics from day n until the end of the simulation are used to evaluate the learned model. The details of LocationRisk@T are discussed in Section 3.

The rest of the article is organized as follows. In Section 4, we present the results of the analysis over different months for different cities across the United States. We also discuss related works in the context of the proposed technique in Section 5, and conclude our discussion in Section 6.

2 SpreadSim

Our agent-based simulation, SpreadSim, uses real-world mobility data and parameters from the existing literature on how COVID-19 spreads in a population to generate realistic infection patterns in the population. SpreadSim consists of a set of agents, collectively referred to as the population. Some agents are initially infected. Each agent moves and co-locates with other agents based on real-world fine-grain mobility data provided by Veraset [2]. As the agents move based on the location signal data (described in Section 2.1), they get infected according to their co-locations and spread the disease following a variation of the SIR model, using incubation and generation periods of COVID-19 reported in the literature. The output of SpreadSim is the information on which agents get infected, when, and where. Thus, SpreadSim comprises three components: (1) the location of each agent at each point in time is determined by a predefined mobility pattern; (2) the spread of the disease amongst the agents is determined by a transmission model; and (3) who is initially infected is determined by the initialization conditions. We next discuss each of the components and the relevant implementation details.

2.1 Mobility Pattern

The mobility pattern determines where each agent is at each point in time during the simulation. We use real-world location signals provided by Veraset [2], a data-as-a-service company that provides anonymized population movement data collected through GPS signals of cell-phones across the US. We were provided access to this dataset for the months of December 2019, January 2020, and March 2020 (from the 5th to the 26th). For a single day in December, there are 2,630,669,304 location signals across the US. Each location signal corresponds to a device_id, and there are 28,264,106 device_ids across the US on that day. We assume each device_id corresponds to a unique individual. Figure 1 shows the number of daily location signals recorded in December 2019 in the area of Manhattan, New York. Furthermore, Figure 2 shows the distribution of location signals across individuals in Manhattan in December 2019. A point \((x, y)\) in Figure 2 means that x percent of the individuals have at least y location signals in December in Manhattan. Using this real-world location data, we are able to create infection patterns for different cities and at different periods of time. This allows us to study hypotheses that concern the spread of the disease for different cities and at different times.

Fig. 1.

Fig. 2.

Detailed statistics about the subset of the Veraset data we used for our experiments are discussed in Section 4. Here, we discuss our general approach of using real-world location signals for SpreadSim, independently of the actual dataset used. We consider a dataset such that each record in the dataset consists of an anonymized user_id, latitude, longitude, and timestamp and is a location signal of the user with id user_id at the specified time and location. We consider each individual to be an agent in the simulation and for each agent, we have access to their latitude and longitude for various timestamps. Sorting this by time, we obtain a trajectory of location signals of an agent over time, denoted by the sequence \(\langle c_1, c_2,\ldots , c_k\rangle\) . For two consecutive location signals, \(c_i\) and \(c_{i+1}\) of an agent at times \(t_{i}\) and \(t_{i+1}\) , we assume the agent is at the location specified by \(c_i\) from time \(t_i\) to \(t_{i+1}\) . Using this piece-wise constant approximation, together with our location data, we have access to the location of every agent at every point in time during the simulation. We note that, although more sophisticated interpolations are possible, this falls beyond the scope of this article, since our focus is on providing risk scores for different locations.
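The piece-wise constant approximation described above can be sketched as a lookup over a time-sorted trajectory. This is only an illustrative sketch: the function name `location_at`, the tuple layout `(timestamp, latitude, longitude)`, and the handling of times outside the recorded range (unknown before the first signal, last known location after the final signal) are assumptions and are not taken from the SpreadSim implementation.

```python
from bisect import bisect_right

def location_at(trajectory, t):
    """Return the (lat, lon) of an agent at time t.

    trajectory: list of (timestamp, lat, lon) tuples sorted by timestamp.
    The agent is assumed to stay at signal c_i from t_i until t_{i+1}.
    """
    times = [rec[0] for rec in trajectory]
    # Index of the last signal whose timestamp is <= t.
    i = bisect_right(times, t) - 1
    if i < 0:
        return None  # assumption: location unknown before the first signal
    _, lat, lon = trajectory[i]
    # Assumption: after the final signal the agent stays at its last location.
    return (lat, lon)

# Hypothetical signals for one agent, given out of order and then sorted by time.
signals = [(30, 40.75, -73.99), (0, 40.71, -74.00), (90, 40.78, -73.97)]
trajectory = sorted(signals)
print(location_at(trajectory, 45))
```

Rebuilding the list of timestamps on every call keeps the sketch short; a real implementation would presumably store it once per agent.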

2.2 Transmission Model

Each agent belongs to one of the following four compartments: Susceptible (S), Infected and Not Spreading (INS), Infected and Spreading (IS), and Isolated or Recovered (R). An agent is said to be infected if they are in either the INS or the IS compartment. Intuitively, a susceptible agent can contract the disease, an INS agent is infected but cannot spread the disease, an IS agent is infected and can spread the disease, and an Isolated or Recovered agent can neither spread the disease nor contract it anymore. Figures 3 and 4 illustrate the transmission model.

Fig. 3.

Fig. 4.

2.2.1 Compartments.

All agents are initially Susceptible, with the exception of the agents that are infected initially as discussed in Section 2.3.1. We call the compartment an agent is in the agent’s status. Assume an agent gets infected at time t based on the dynamics discussed in Section 2.2.2. Thus, at time t, the agent changes its status from S to INS. Consequently, at time \(t+t_{IS}\) , the agent becomes Infected and Spreading. Furthermore, at time \(t+t_{R}\) , the agent becomes Isolated or Recovered, where \(t_{IS}\sim \mathcal {N}(\mu _{IS}, \sigma _{IS})\) and \(t_{R}\sim \mathcal {N}(\mu _{R}, \sigma _{R})\) , and \(\mu _{IS},~ \sigma _{IS}, \mu _{R}\) , and \(\sigma _{R}\) are the parameters of the model. These parameters can be set based on existing research on COVID-19 [22, 26, 38]. For instance, [26] mentions that “1% of transmission would occur before 5 days and 9% of transmission would occur before 3 days prior to symptom onset”, while [38] finds that the incubation period (time from infection until the onset of symptoms) has a mean of 5.2 days. Using this, we can find the parameters of the model that best match the empirical evidence. We discuss the specific parameter setting used in our experiments in Section 4. We note that \(\mathcal {N}(\cdot)\) is a truncated normal distribution, such that if the sampled \(t_{IS}\) or \(t_{R}\) is less than zero, it is discarded and a new sample is obtained. Furthermore, if \(t_{R} \lt t_{IS}\) , the agent never spreads the disease and moves directly from INS to the R compartment (the probability of either of these happening is very low based on the parameter setting chosen).
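The status schedule of a newly infected agent can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the numeric parameter values in the usage line are placeholders chosen for illustration, not the settings used in the experiments (those are given in Section 4).

```python
import random

def sample_truncated_normal(mu, sigma, rng):
    """Rejection sampling: discard negative draws and sample again."""
    while True:
        x = rng.gauss(mu, sigma)
        if x >= 0:
            return x

def schedule_status_changes(t_infected, mu_is, sigma_is, mu_r, sigma_r, rng):
    """Return the time-ordered list of (time, status) changes for one agent."""
    t_is = sample_truncated_normal(mu_is, sigma_is, rng)
    t_r = sample_truncated_normal(mu_r, sigma_r, rng)
    events = [(t_infected, "INS")]
    # If t_R < t_IS the agent never spreads and goes directly from INS to R.
    if t_r >= t_is:
        events.append((t_infected + t_is, "IS"))
    events.append((t_infected + t_r, "R"))
    return events

# Placeholder parameters (in days), for illustration only.
rng = random.Random(0)
print(schedule_status_changes(5.0, 3.0, 1.0, 10.0, 2.0, rng))
```

Passing an explicit `random.Random` instance keeps a simulation run reproducible, which is convenient when comparing different parameter settings.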

2.2.2 Transmission Dynamics.

Only IS agents infect other agents, and only S agents get infected. Simply put, if an IS agent is within distance \(d_{max}\) of an S agent for a duration of at least \(t_{min}\) then, with probability p, the IS agent infects the S agent. \(d_{max}\) , \(t_{min}\) , and p are the parameters of the transmission model. The model is designed to mimic how the disease spreads in the physical world, following the works of [7, 14, 52], where prolonged close-range contact is the main source of transmission of the disease. Specifically, consider an IS agent, u, and an S agent, v. Suppose that at some time t during the simulation the distance between u and v becomes at most \(d_{max}\) (i.e., the distance between u and v was larger than \(d_{max}\) right before time t) and stays at most \(d_{max}\) until at least time \(t+t_{min}\) . Then, a number is drawn uniformly at random from the range \([0, 1]\) . If the number is at most p, then u infects v. When v becomes infected, as explained in Section 2.2.1, its status changes from S to INS, and eventually to IS and R. Note that the random number is generated once for the co-location, irrespective of the duration of the co-location (as long as the duration is at least \(t_{min}\) ). Furthermore, if multiple co-locations happen at the same time, e.g., between u and \(v_1\) as well as between u and \(v_2\) , two separate random numbers are generated, one to decide if u infects \(v_1\) and another one to decide if u infects \(v_2\) .
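The co-location rule above can be sketched for a single pair of agents. The sketch assumes the pairwise distance has already been computed as a time-sorted list of `(time, distance)` samples that hold until the next sample (matching the piece-wise constant trajectories), and that `t_end` marks the end of the simulated window; these inputs and the function names are hypothetical and not part of the published implementation.

```python
import random

def colocation_runs(samples, d_max, t_end):
    """Maximal (start, end) intervals during which distance <= d_max.

    samples: time-sorted (time, distance) pairs; each distance is assumed
    to hold until the next sample.
    """
    runs, start = [], None
    for t, d in samples:
        if d <= d_max and start is None:
            start = t                    # the pair comes within d_max
        elif d > d_max and start is not None:
            runs.append((start, t))      # the pair separates again
            start = None
    if start is not None:
        runs.append((start, t_end))      # still co-located at the end
    return runs

def try_infect(samples, d_max, t_min, p, t_end, rng):
    """Return the co-location interval that caused an infection, or None."""
    for start, end in colocation_runs(samples, d_max, t_end):
        if end - start >= t_min:
            # One uniform draw per qualifying co-location, whatever its length.
            if rng.random() <= p:
                return (start, end)
    return None

samples = [(0, 10.0), (5, 1.0), (8, 1.5), (20, 10.0), (30, 1.0), (32, 9.0)]
print(colocation_runs(samples, d_max=2.0, t_end=40))
```

When one IS agent is co-located with several S agents, calling `try_infect` once per pair yields the separate random draws described above.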

2.3 Initialization and Implementation

2.3.1 Initialization.

For a parameter \(n_{init}\) , we infect \(n_{init}\) number of agents at the beginning of the simulation uniformly at random. The initial infections are treated as if the agent was infected by another agent. That is, an initially infected agent has initially the INS status and becomes IS after time \(t_{IS}\) and recovered after \(t_R\) , where \(t_{IS}\) and \(t_R\) are sampled from the normal distributions discussed before.

2.3.2 Implementation.

Our implementation of SpreadSim, publicly available at [4], first sorts all the location signals by time and uses a grid to index the location of the agents. The grid supports the operation \(findWithinDmax(x, y, T)\) , where given latitude and longitude values x and y, the grid returns all the individuals in the T compartment that are within \(d_{max}\) of \((x, y)\) , where T can be either S, INS, IS, or R. The operation checks all the cells in the grid that overlap the circle centered at \((x, y)\) with radius \(d_{max}\) , and for any individual with status T in those cells, checks if their distance to \((x, y)\) is in fact at most \(d_{max}\) (to avoid false positives).
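A grid index supporting a \(findWithinDmax\)-style query can be sketched as below. This is not the published implementation at [4]: the class and method names are hypothetical, coordinates are assumed to be planar (e.g., projected metres) rather than raw latitude and longitude, and the cell side is assumed to equal \(d_{max}\) so that the query circle overlaps at most the 3 x 3 block of cells around the query point.

```python
import math
from collections import defaultdict

class GridIndex:
    def __init__(self, d_max):
        self.d_max = d_max
        self.cells = defaultdict(dict)  # (ix, iy) -> {agent_id: (x, y, status)}
        self.where = {}                 # agent_id -> cell currently holding it

    def _cell(self, x, y):
        return (math.floor(x / self.d_max), math.floor(y / self.d_max))

    def update(self, agent_id, x, y, status):
        """Move an agent to (x, y), removing it from its previous cell."""
        old = self.where.get(agent_id)
        if old is not None:
            del self.cells[old][agent_id]
        cell = self._cell(x, y)
        self.cells[cell][agent_id] = (x, y, status)
        self.where[agent_id] = cell

    def find_within_dmax(self, x, y, status):
        """Agents with the given status at distance <= d_max from (x, y)."""
        cx, cy = self._cell(x, y)
        hits = []
        for ix in range(cx - 1, cx + 2):
            for iy in range(cy - 1, cy + 2):
                for agent_id, (ax, ay, st) in self.cells.get((ix, iy), {}).items():
                    # The exact distance check removes false positives.
                    if st == status and math.hypot(ax - x, ay - y) <= self.d_max:
                        hits.append(agent_id)
        return sorted(hits)

grid = GridIndex(d_max=2.0)
grid.update("a", 0.0, 0.0, "IS")
grid.update("b", 1.5, 0.0, "S")
grid.update("c", 1.9, 1.9, "S")   # in a scanned cell but farther than d_max
grid.update("d", -1.0, -1.0, "S")
print(grid.find_within_dmax(0.0, 0.0, "S"))
```

Agent "c" shares a scanned cell with the query point but lies outside the circle, which is exactly the false positive the final distance check is meant to remove.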

During the simulation, location signals are processed one by one in order of time. Every location signal corresponds to an agent, u, moving from an old location to a new location \((x, y)\) . If u is susceptible, we call \(findWithinDmax(x, y, IS)\) on the grid index to see if there are any IS agents within \(d_{max}\) of the new location. If u is IS, we check if there are any S agents within its \(d_{max}\) , by calling \(findWithinDmax(x, y, S)\) on the grid index. We call it a co-location if two agents are within \(d_{max}\) distance of each other. In either of the two cases above, for any co-location between an S agent and an IS agent, we check if it lasts for at least \(t_{min}\) by traversing their trajectories. If it does, then, with probability p, we infect the S agent. The process can be optimized further by building a multidimensional index on both time and location.

3 Learning Spatiotemporal Risk Scores with LocationRisk@T

We now describe our Hawkes process model, LocationRisk@T, which leverages location data along with mobility patterns in a particular region to assign spatiotemporal risk scores; the implementation is publicly available at [3]. For a given city, we first form clusters \(c\in \mathcal {C}\) using the k-means clustering algorithm on the location signals in the city. The number of such clusters can be chosen according to the desired spatial resolution.¹

3.1 Mobility-Aware Modelling

LocationRisk@T leverages the OD matrix to represent the flow between two clusters c and \(c^{\prime }\) in a city. OD matrices are a popular way to encode spatial traffic flow information between two nodes in a transportation graph [5]. Specifically, let \(\mathcal {G}^t(V,E)\) denote a directed graph, where the edge weight \(w_{i\rightarrow j}(t)\) encodes the traffic volume from cluster i to j between time \((t-1)\) and t. The OD matrix \(W (t) \in \mathbb {R}^{|\mathcal {C}| \times |\mathcal {C}|}\) represents the traffic flows between different clusters \(c \in \mathcal {C}\) in a city at time t. Here, the diagonal elements \(w_{i\rightarrow i}(t)\) denote the traffic generated in a cluster i. We use this traffic flow information to inform LocationRisk@T.
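One way to build such an OD matrix for a single time step is to count, for every agent, the cluster it occupied at time \(t-1\) and the cluster it occupies at time t. The sketch below is an assumption about the bookkeeping rather than the authors' procedure: the function name and the dictionary inputs are hypothetical, agents observed at only one of the two times are skipped, and agents that stay in the same cluster are counted on the diagonal.

```python
def od_matrix(prev_cluster, curr_cluster, n_clusters):
    """Count flows between clusters from time t-1 to time t.

    prev_cluster, curr_cluster: dicts mapping agent id -> cluster index.
    Returns W with W[i][j] = number of agents moving from cluster i to j.
    """
    W = [[0] * n_clusters for _ in range(n_clusters)]
    for agent, i in prev_cluster.items():
        j = curr_cluster.get(agent)
        if j is not None:          # skip agents not observed at time t
            W[i][j] += 1
    return W

# Hypothetical cluster assignments for five devices.
prev_cluster = {"a": 0, "b": 0, "c": 1, "d": 2}
curr_cluster = {"a": 1, "b": 0, "c": 1, "e": 2}
print(od_matrix(prev_cluster, curr_cluster, 3))
```

Because only counts per cluster pair are retained, the resulting matrix is aggregate data and carries no individual trajectory.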

We form the mobility feature vector at time t, \(\boldsymbol {m}_{\mathcal {G}_c}^t \in \mathbb {R}^{f}\) , elements of which encode the different mobility features derived from the OD matrix W. Here, we develop two Hawkes process-based models (a) LocationRisk@T \(_{Mob}\) , and (b) LocationRisk@T \(_{Mob^+}\) , depending upon the traffic flow features. For LocationRisk@T \(_{Mob}\) we set

\begin{align} ({LocationRisk@T_{Mob} Features}) \boldsymbol {m}_{c}^t = \begin{bmatrix} w_{c\rightarrow c}(t), ~w_{to-c}(t), ~w_{from-c}(t) \end{bmatrix}^\top , \end{align}

(1)

where

\[\begin{eqnarray*} w_{to-c}(t) := \sum \limits _{c^{\prime }\in \mathcal {C}\backslash c} w_{c^{\prime }\rightarrow c}(t) \hspace{28.45274pt}\text{and} \hspace{28.45274pt} w_{from-c}(t) := \sum \limits _{c^{\prime }\in \mathcal {C}\backslash c} w_{c\rightarrow c^{\prime }}(t) , \end{eqnarray*}\]

and \(``\backslash \hbox{''}\) denotes the set difference operator. In effect, LocationRisk@T \(_{Mob}\) considers the net traffic to, from, and within a cluster, while being agnostic to the infections in each cluster.
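The three LocationRisk@T \(_{Mob}\) features of Equation (1) can be read directly off the OD matrix. The following is a minimal sketch in which the function name and the list-of-lists representation of \(W(t)\) are assumptions made for illustration.

```python
def mob_features(W, c):
    """Return [w_{c->c}, w_{to-c}, w_{from-c}] for cluster c.

    W: square OD matrix as a list of lists, W[i][j] = traffic from i to j.
    """
    n = len(W)
    w_self = W[c][c]
    w_to = sum(W[k][c] for k in range(n) if k != c)    # inflow from other clusters
    w_from = sum(W[c][k] for k in range(n) if k != c)  # outflow to other clusters
    return [w_self, w_to, w_from]

# Hypothetical 3-cluster OD matrix for one time step.
W = [[5, 2, 1],
     [3, 7, 4],
     [0, 6, 9]]
print(mob_features(W, 0))
```

For cluster 0 in this example, the inflow is the sum of column 0 without the diagonal entry and the outflow is the sum of row 0 without the diagonal entry.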

To make the model infection-aware, we design LocationRisk@T \(_{Mob^+}\) which has additional mobility-based features to account for the infections at the origin cluster. Let \(I_c(t)\) denote the infections in cluster c at time t, and further let \(I_{-c}(t)\) denote the total infections (over all clusters) except those in cluster c, i.e.,

\[\begin{eqnarray*} I_{-c}(t) = \sum \limits _{c^{\prime }\in \mathcal {C} \backslash c} I_{c^{\prime }}(t). \end{eqnarray*}\]

We formulate the infection mobility to a cluster c as

\[\begin{eqnarray*} Im_{to-c}(t) := \dfrac{w_{to-c}(t)}{\sum \limits _{c^{\prime }\in \mathcal {C}\backslash c} \left(w_{to-c^{\prime }}(t) + w_{from-c^{\prime }}(t) \right)} \cdot I_{-c}(t) . \end{eqnarray*}\]

Here, \(Im_{to-c}(t)\) captures the fraction of infections travelling to a cluster c relative to the total mobility (both to and from). We use both “to” and “from” mobility in the denominator, since (a) mobility from any cluster impacts how many infections enter other clusters, and (b) the mobility to a cluster also takes away from the ones that may enter other clusters. Similarly, we also form \(Im_{of-c}(t)\) to weigh the total number of infections in a cluster c by the ratio of self-mobility and traffic from other clusters.

\[\begin{eqnarray*} Im_{of-c}(t) := \dfrac{w_{c\rightarrow c}(t)}{ w_{to-c}(t)} \cdot I_{c}(t). \end{eqnarray*}\]

With these features, we form the feature set for LocationRisk@T \(_{Mob^+}\) as follows.

\begin{align} ({LocationRisk@T_{Mob^+}Features}) \boldsymbol {m}_{c}^t = \begin{bmatrix} w_{c\rightarrow c}(t), ~w_{to-c}(t), ~w_{from-c}(t), ~Im_{to-c}(t), ~Im_{of-c}(t) \end{bmatrix}^\top \end{align}

(2)
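The infection-mobility features in Equation (2) can be sketched in the same style. The function names and the list inputs are hypothetical, and returning zero when a denominator is zero is an assumption of this sketch, since the text does not say how that case is handled.

```python
def flows(W, c):
    """Return (w_{to-c}, w_{from-c}) for cluster c from OD matrix W."""
    n = len(W)
    w_to = sum(W[k][c] for k in range(n) if k != c)
    w_from = sum(W[c][k] for k in range(n) if k != c)
    return w_to, w_from

def mob_plus_features(W, infections, c):
    """Return [w_cc, w_to, w_from, Im_to, Im_of] for cluster c.

    infections: list with infections[k] = I_k(t) for each cluster k.
    """
    n = len(W)
    w_to, w_from = flows(W, c)
    # Denominator: total to- and from-mobility of all other clusters.
    denom = sum(sum(flows(W, k)) for k in range(n) if k != c)
    i_other = sum(infections[k] for k in range(n) if k != c)  # I_{-c}(t)
    # Assumption: a zero denominator yields a zero feature.
    im_to = w_to / denom * i_other if denom > 0 else 0.0
    im_of = W[c][c] / w_to * infections[c] if w_to > 0 else 0.0
    return [W[c][c], w_to, w_from, im_to, im_of]

# Hypothetical OD matrix and per-cluster infection counts.
W = [[5, 2, 1],
     [3, 7, 4],
     [0, 6, 9]]
infections = [10, 20, 30]
print(mob_plus_features(W, infections, 0))
```

Only per-cluster infection counts and aggregate flows enter these features, so no individual-level co-location data is needed at prediction time.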

3.2 Incorporating Mobility Features Into Hawkes Process

Now, given the daily infections at each cluster \(c\in \mathcal {C}\) , i.e., a realization of a point process \(N_c(t)\) on \([0, T]\) for \(T\lt \infty\) , at timestamps \(\mathcal {T}_c = \lbrace t_1^c, t_2^c, \dots t_n^c\rbrace\) , we model the rate of new cases at time t [12], \(\lambda _c(t)\) associated with a cluster c by incorporating the corresponding traffic mobility features \(\boldsymbol {m}_c^{t}\) developed in Section 3.1 as

\begin{align} \lambda _c(t) = \mu _c + \sum _{t\gt t_j, t_j \in \mathcal {T}} R_c^{t_j}\left(\boldsymbol {m}_c^{t_j - \Delta }, \theta \right)~wbl(t-t_j), \end{align}

(3)

where \(\mu _c\) is the time-invariant background rate, which captures the inherent proclivity of a cluster to produce infections. We use the probability density function (pdf) of the Weibull distribution, \(wbl(\cdot)\) , with shape \(\alpha\) and scale \(\beta\) to weigh the inter-event time, which specifies the influence of past events. The time-delay parameter \(\Delta\) is used to account for the delay between the mobility and the infections. Therefore, the second component in the definition of \(\lambda _c(t)\) represents the self-excitations.
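Evaluating the intensity in Equation (3) for one cluster can be sketched as follows. The sketch assumes that the reproduction number attached to each past event, \(R_c^{t_j}\) , has already been computed and is passed in as a list; the function names and the numeric values in the usage line are illustrative only.

```python
import math

def weibull_pdf(x, shape, scale):
    """Weibull pdf with shape alpha and scale beta; zero for x <= 0."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def intensity(t, mu_c, event_times, repro_numbers, shape, scale):
    """lambda_c(t) = mu_c + sum over past events t_j < t of R_j * wbl(t - t_j)."""
    total = mu_c
    for t_j, r_j in zip(event_times, repro_numbers):
        if t_j < t:  # only events strictly before t excite the process
            total += r_j * weibull_pdf(t - t_j, shape, scale)
    return total

# Three past infections with hypothetical reproduction numbers.
print(intensity(3.0, 0.1, [1.0, 2.0, 5.0], [2.0, 1.0, 4.0], shape=1.0, scale=2.0))
```

With shape equal to 1 the Weibull kernel reduces to an exponential decay, which makes small cases easy to check by hand.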

The mobility features are used to model the time-varying cluster-dependent R0 \(R_c^{t}(\boldsymbol {m}_c^{t-\Delta }, \theta)\) at a time step t. We assume that \(R^t_c(m^{t-\Delta }_c,\theta)\) is the mean parameter of a Poisson random variable, interpreted as the average number of secondary infections caused by a primary infection. Thus, the notation \(R^t_c(\cdot)\) denotes that the R0 is a function of \(m^{t-\Delta }_c\) and \(\theta\) , where \(m^{t-\Delta }_c\) denotes the vector of mobility features, and \(\theta\) are the learnable parameters. In particular, we model \(R_c^{t}(\cdot)\) as follows, which leverages the traffic-flow based mobility features for each cluster as

\begin{align} R_c^{t}\left(\boldsymbol {m}_c^{t-\Delta }, \theta \right)= \exp \left(\theta ^\top \boldsymbol {m}_c^{t - \Delta }\right), \end{align}

(4)

where \(\top\) denotes the transpose operator. Here, \(\theta\) is a vector of the same size as \(\boldsymbol {m}_c^t\) , where each element of \(\theta\) is the weight of the corresponding mobility index in \(\boldsymbol {m}_c^t\) , learned via Poisson regression. Poisson regression is a Generalized Linear Model (GLM), a popular choice for modeling count data, and is estimated via Maximum-Likelihood-based approaches² [8, 15]. In the context of GLMs, Equation (4) gives the relationship between the expectation of the response variable and the linear predictor (via the link function); see also [12] and [8].
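The log-linear form of Equation (4) and a plain Poisson regression fit can be sketched as below. This is a simplified illustration, not the paper's estimator: the full model learns \(\theta\) inside an EM procedure, whereas the sketch fits an unweighted Poisson regression by fixed-step gradient ascent. The function names, learning rate, and number of steps are assumptions, and real mobility features would likely need rescaling before such a fit behaves well.

```python
import math

def reproduction_number(theta, m):
    """R = exp(theta^T m) for a mobility feature vector m."""
    return math.exp(sum(th * x for th, x in zip(theta, m)))

def poisson_gradient(theta, X, y):
    """Gradient of the Poisson log-likelihood: sum_i (y_i - exp(theta^T x_i)) x_i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        resid = y_i - reproduction_number(theta, x_i)
        for k, x_ik in enumerate(x_i):
            grad[k] += resid * x_ik
    return grad

def fit_poisson_regression(X, y, lr=0.1, steps=500):
    """Maximum-likelihood fit by plain gradient ascent (illustrative only)."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = poisson_gradient(theta, X, y)
        theta = [th + lr * gk for th, gk in zip(theta, g)]
    return theta

# With a single constant feature, the fitted mean equals the average count.
theta_hat = fit_poisson_regression([[1.0], [1.0]], [2, 4])
print(theta_hat, reproduction_number(theta_hat, [1.0]))
```

The exponential guarantees a positive reproduction number for any \(\theta\) , which is one reason the log link is the usual choice for count data.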

The mobility and cluster-dependent R0 \(R_c^{t}(\boldsymbol {m}_c^{t-\Delta }, \theta)\) can be viewed as the average number of secondary infections caused by a primary infection (in the Granger sense [17, 24, 53]). In addition, the first term \(\mu _c\) helps in modelling the effect of factors not captured by the mobility-dependent second term.

Overall, LocationRisk@T incorporates the actual high-resolution mobility patterns within a city, which enables us to learn informative spatiotemporal risk scores (developed in Section 3.4). Compared to related mobility-based methods (such as [12]), which rely only on mobility density, LocationRisk@T leverages fine-grained flow information. Specifically, our contributions over [12] are threefold: (a) we propose a mobility-aware Hawkes process model amenable to leveraging high-resolution location signals, (b) we accomplish this by introducing a graph structure over the clusters to account for the inter-connections between different clusters, and (c) we develop various mobility indices, including infection mobility, to develop spatiotemporal risk scores. The main task now is to infer the model parameters \(\theta\) of LocationRisk@T using the mobility features and the infections. To this end, we leverage Expectation-Maximization (EM) to learn the model parameters, which is a popular choice for inferring parameters of the Hawkes process model and is standard in the literature; see [6, 12, 43, 53] and references therein. We now describe the EM-based inference procedure to learn \(\theta\) .

3.3 Expectation Maximization-Based Inference Procedure

We adopt an EM-based approach to infer the parameters \(\theta\) . Our algorithm (outlined in Algorithm 1) provides an iterative way to evaluate the maximum-likelihood estimates of LocationRisk@T; see also [12]. We begin by introducing latent variables Y to model the unobserved variables of our model. To this end, we first write the likelihood \(\mathcal {L}(\Theta ; X)\) for the Hawkes process given data \(X = (\lbrace t_j^c\rbrace _{j=1, c=1}^{|\mathcal {T}_c|, |\mathcal {C}|}, \lbrace \boldsymbol {m}_c^t\rbrace _{c=1, t=1}^{|\mathcal {C}|, |\mathcal {T}|})\) , where the \(t_j^c\) are the timestamps of the infections in a cluster c, and the parameters \(\Theta = (\lbrace \mu _c\rbrace _{c=1}^{|\mathcal {C}|}, \theta)\) , as

\begin{align} \mathcal {L}(\Theta ; X) = \prod _{c=1}^{|\mathcal {C}|}\left[\prod _{i=1}^{n} \lambda _c(t_i)\right] \exp \left(-\int _0^T \lambda _c(t)\,dt\right). \end{align}

(5)
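As a worked step (a sketch, reading the exponential factor in Equation (5) as applied once per cluster, which is the standard point-process form), taking logarithms turns the product into a sum of log-intensities at the observed infections minus the integrated intensity:

\begin{align} \log \mathcal {L}(\Theta ; X) = \sum _{c=1}^{|\mathcal {C}|} \left[\sum _{i=1}^{n} \log \lambda _c(t_i) - \int _0^T \lambda _c(t)\,dt\right]. \end{align}

Because \(\lambda _c(t_i)\) is itself a sum of a background term and triggering terms, each \(\log \lambda _c(t_i)\) is the logarithm of a sum, which is awkward to maximize directly. This is the usual motivation for the latent branching variables in EM for Hawkes processes.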

For the EM procedure, we introduce latent variables \(Y^{c}_{ij}\) to indicate that event i is an offspring of event j in cluster c, and \(Y_{ii}^{c}\) to denote that event i was generated by the background process (in c), to formulate the complete data log-likelihood as

\begin{align} \log (\mathcal {L}(\Theta ; X, Y)) = \sum _{c=1}^{|\mathcal {C}|} \left[\sum _{i=1}^{n} \left(Y^{c}_{ii} \log (\mu _c) + \sum _{t_i\gt t_j, t_j \in \mathcal {T}} Y^{c}_{ij} \log \left(R_c^{t_j}\!\left(\boldsymbol {m}_c^{t_j - \Delta }, \theta \right)~wbl(t_i-t_j|\alpha ,\beta)\right)\right) - \int _0^T \lambda _c(t) dt\right]. \end{align}

(6)

3.3.1 Expectation Step.

Since both \(\log (\mathcal {L}(\Theta ; X, Y))\) and Y are random variables, at the kth iteration in the E-step, we evaluate the expectation function \(Q(\Theta , \Theta ^{k-1})\) as

\begin{align} Q(\Theta , \Theta ^{k-1}) &= \boldsymbol {E}_Y[\log (\mathcal {L}(\Theta ; X, Y))|X,\Theta ^{k-1}], \\ &= \int \log (\mathcal {L}(\Theta ; X, Y)) f(Y|X, \Theta ^{k-1})dY , \end{align}

(7)

where \(\Theta ^{k-1}\) are the estimated parameters at the \((k-1)\) th iteration, and \(f(Y|X, \Theta ^{k-1})\) is the conditional distribution of Y given X and \(\Theta ^{k-1}\) . Specifically, as in [12], we estimate the probability \(p_c(i,j)\) as follows:

\[\begin{eqnarray*} p_c(i,j) := \boldsymbol {E}_Y[Y^{c}_{ij}|X,\Theta ^{k-1}] = \dfrac{R_c^{t_j}\left(\boldsymbol {m}_c^{t_j - \Delta }, \theta \right)~wbl(t_i-t_j|\alpha ,\beta)}{\lambda _c(t_i)} , \end{eqnarray*}\]

and \(p_c(i,i)\) as

\[\begin{eqnarray*} p_c(i,i):= \boldsymbol {E}_Y[Y^{c}_{ii}|X,\Theta ^{k-1}] = \dfrac{\mu _c}{\lambda _c(t_i)}. \end{eqnarray*}\]
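The two E-step probabilities can be sketched in a few lines of Python for a single cluster. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the intensity is assumed to be \(\lambda _c(t_i) = \mu _c + \sum _{t_j \lt t_i} R_c^{t_j}\, wbl(t_i - t_j|\alpha ,\beta)\) (the form implied by the requirement that the probabilities for event i sum to one), and \(\alpha\) and \(\beta\) are taken to be the Weibull shape and scale, respectively.

```python
import math


def weibull_pdf(t, alpha, beta):
    """Weibull density with shape alpha and scale beta; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return (alpha / beta) * (t / beta) ** (alpha - 1) * math.exp(-((t / beta) ** alpha))


def e_step(times, repro, mu, alpha, beta):
    """Branching probabilities for one cluster.

    times : sorted infection timestamps t_1 <= ... <= t_n
    repro : repro[j] is the reproduction number R_c^{t_j} evaluated for event j
    mu    : background rate mu_c
    Returns p with p[i][j] (j < i) the probability that event i was triggered
    by event j, and p[i][i] the probability that it is a background event.
    """
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Contribution of every earlier event j to the intensity at t_i.
        triggers = [repro[j] * weibull_pdf(times[i] - times[j], alpha, beta)
                    for j in range(i)]
        lam = mu + sum(triggers)  # assumed intensity lambda_c(t_i)
        for j, g in enumerate(triggers):
            p[i][j] = g / lam
        p[i][i] = mu / lam
    return p
```

Each row of the returned table sums to one, so every infection is attributed either to the background rate or to exactly one earlier infection in expectation.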

3.3.2 Maximization Step.

Based on the probabilities \(p_c(i,j)\) and \(p_c(i,i)\) , we estimate the parameters by maximizing \(Q(\Theta , \Theta ^{k-1})\) in Equation (7) with respect to each parameter.³ With this, we arrive at the following update steps for each parameter [12]. We learn the parameters \(\theta\) via Poisson regression with the mean parameter modeled as in Equation (4). Specifically, this step uses Maximum-Likelihood to estimate the parameters \(\theta\) as follows; see [8, 15] for the derivation of Poisson regression.

\[\begin{eqnarray*} \hat{\theta } := \underset{\theta }{\arg \max } \sum _{c=1}^{|\mathcal {C}|} \left(\sum _{j=1}^{n} P_c(j) \theta ^\top \boldsymbol {m}_c^{t_j - \Delta } - \exp \left(\theta ^\top \boldsymbol {m}_c^{t_j - \Delta }\right)\right), \end{eqnarray*}\]

where \(P_c(j) = \sum _{i=j+1}^{n} p_c(i,j)\) , and the following closed form update for the background rate parameters:

\[\begin{eqnarray*} \hat{\mu _c} := \underset{\mu _c}{\arg \max } \sum _{i=1}^{n} p_c(i,i) \log (\mu _c) - \int _0^T \mu _c dt = \sum _{i=1}^{n} \tfrac{p_c(i,i)}{T}. \end{eqnarray*}\]
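The two M-step updates can be sketched as follows for a single cluster. The source reports using MATLAB's fitglm for the Poisson regression; the sketch below instead uses plain Newton iterations on the same objective as a stand-in, with hypothetical function names and without the safeguards (step control, convergence checks) a production solver would have.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def fit_theta(features, weights, iters=50):
    """Maximize sum_j [P_j * theta.m_j - exp(theta.m_j)] by Newton's method.

    features : list of mobility vectors m_c^{t_j - Delta}, one per event j
    weights  : list of P_c(j), the expected number of offspring of event j
    """
    d = len(features[0])
    theta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        neg_hess = [[0.0] * d for _ in range(d)]  # negative Hessian of the objective
        for m, pj in zip(features, weights):
            rate = math.exp(sum(t * x for t, x in zip(theta, m)))
            for a in range(d):
                grad[a] += (pj - rate) * m[a]
                for b in range(d):
                    neg_hess[a][b] += rate * m[a] * m[b]
        step = solve(neg_hess, grad)
        theta = [t + s for t, s in zip(theta, step)]
    return theta


def update_mu(p_diag, horizon):
    """Closed-form background rate: sum_i p_c(i, i) / T."""
    return sum(p_diag) / horizon
```

The objective is concave in \(\theta\) , so Newton steps converge quickly when the features are standardized. For the full model, the sums would additionally run over all clusters c, as in the expression for \(\hat{\theta }\) .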

3.4 Characterizing Risk

The dynamic reproduction number (commonly referred to as \(R0\) ) has been used to assign risk scores to communities [34]. Indeed, as explored by [12], the R0 depends on the mobility density over time. However, since the R0 is highly sensitive to parameter choices and models [16, 37], LocationRisk@T leverages both the dynamic R0 \(R_c^t\) and the background rate \(\mu _c\) associated with a cluster c to assign risk scores. Specifically, we propose a spatiotemporal risk score based on the intensity function \(\lambda _c(t)\) of LocationRisk@T. Let \(\Lambda \in \mathbb {R}^{|\mathcal {C}| \times T}\) be a matrix where \(\Lambda (c,t) = \lambda _c(t)\) . Then, we define our risk score \(\rho \in [0,1]\) for a cluster c at time t as

\begin{align} \text{Risk Score}~\rho _c(t) = \dfrac{\lambda _c(t) - \underset{c^{\prime } \in \mathcal {C}, t^{\prime }\in T}{\min }~ \Lambda (c^{\prime }, t^{\prime })}{\underset{ c^{\prime } \in \mathcal {C}, t^{\prime }\in T}{\max }~ \Lambda (c^{\prime }, t^{\prime }) - \underset{c^{\prime } \in \mathcal {C}, t^{\prime }\in T}{\min }~ \Lambda (c^{\prime }, t^{\prime })}. \end{align}

(8)
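Equation (8) is a min-max scaling of the intensity matrix over all clusters and time steps. A minimal Python sketch follows; the function name is hypothetical, and the handling of the degenerate case where all intensities are equal is an assumption, since Equation (8) does not cover it.

```python
def risk_scores(intensity):
    """Min-max scale a clusters-by-time table of intensities into [0, 1].

    intensity[c][t] holds lambda_c(t); the minimum and maximum are taken
    jointly over all clusters and all time steps, as in Equation (8).
    """
    flat = [v for row in intensity for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: with no variation at all, report zero risk everywhere.
        return [[0.0 for _ in row] for row in intensity]
    return [[(v - lo) / (hi - lo) for v in row] for row in intensity]


scores = risk_scores([[2.0, 4.0], [6.0, 10.0]])
```

Because the scaling is joint rather than per cluster, a score of 1 marks the single riskiest cluster-time pair in the period, and all other scores are relative to it.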

This essentially scales the intensity function of the disease relative to the intensities in other clusters over time, and alleviates the issues associated with calculating accurate R0. Note that in this work, we define the risk of a cluster as being proportional to the number of infected people in it, irrespective of its population. This choice is motivated by the fact that the number of infected people in an area, independent of the population, is currently being used to manage the lockdown policies. Nevertheless, the prediction task can be modified appropriately for other variants of the risk scores if desired. Furthermore, since our risk scores do not directly depend on the raw features, our technique is flexible and can incorporate such variations.

4 Experimental Results

We now evaluate and compare the performance of our proposed method with competing techniques based on the accuracy of their infection and risk prediction over different months and cities in the United States (US).

4.1 Data

The statistics of our dataset are summarised in Tables 1 and 2. We obtained the data from Veraset [2], a data-as-a-service company that provides anonymized population movement data collected through GPS signals of cell-phones across the US. The obtained dataset consists of location signals across the US for the time-periods December 1, 2019 to January 31, 2020, as well as March 5, 2020 to March 26, 2020. For each of the cities or counties considered, we first use a rectangular range that roughly covers the area of the city or county to select the location signals that fall within the area. Each record in the dataset consists of anonymized_device_id, latitude, longitude, timestamp and horizontal accuracy. We assume each anonymized_device_id corresponds to a unique individual. We discard any location signal with horizontal accuracy worse than 25 meters. Furthermore, we filter out individuals with fewer than 100 location signals for every one-month period considered. Tables 1 and 2 report the statistics after this pre-processing step.

Table 1.

City	December	January	March
San Francisco	116 \(\,\times \,10^6\)	96 \(\,\times \,10^6\)	61 \(\,\times \,10^6\)
Miami	75 \(\,\times \,10^6\)	105 \(\,\times \,10^6\)	64 \(\,\times \,10^6\)
Chicago	135 \(\,\times \,10^6\)	175 \(\,\times \,10^6\)	107 \(\,\times \,10^6\)
Houston	135 \(\,\times \,10^6\)	182 \(\,\times \,10^6\)	105 \(\,\times \,10^6\)

Table 1. Total No. Location Signals per Month

Table 2.

City	December	January	March
San Francisco	62 \(\,\times \,10^3\)	72 \(\,\times \,10^3\)	55 \(\,\times \,10^3\)
Miami	42 \(\,\times \,10^3\)	55 \(\,\times \,10^3\)	46 \(\,\times \,10^3\)
Chicago	76 \(\,\times \,10^3\)	106 \(\,\times \,10^3\)	90 \(\,\times \,10^3\)
Houston	69 \(\,\times \,10^3\)	97 \(\,\times \,10^3\)	78 \(\,\times \,10^3\)

Table 2. No. Agents per Month
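The pre-processing described in Section 4.1 can be sketched as a three-stage filter over one month of records. This is a sketch under stated assumptions: the dictionary keys (in particular horizontal_accuracy), the bounding-box layout, and the function name are hypothetical, and the per-device count is assumed to be taken after the accuracy filter, which the text does not specify.

```python
from collections import Counter


def preprocess(records, box, max_accuracy_m=25.0, min_signals=100):
    """Filter one month of location signals for one city.

    records : list of dicts with keys anonymized_device_id, latitude,
              longitude, timestamp, horizontal_accuracy (metres)
    box     : (south, west, north, east) rectangle roughly covering the city
    """
    south, west, north, east = box
    # 1. Keep signals inside the rectangle.
    inside = [r for r in records
              if south <= r["latitude"] <= north and west <= r["longitude"] <= east]
    # 2. Drop signals whose horizontal accuracy is worse than the threshold.
    accurate = [r for r in inside if r["horizontal_accuracy"] <= max_accuracy_m]
    # 3. Drop devices with too few remaining signals in the month.
    counts = Counter(r["anonymized_device_id"] for r in accurate)
    return [r for r in accurate if counts[r["anonymized_device_id"]] >= min_signals]
```

Each surviving anonymized_device_id is then treated as one individual, as stated above.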

4.2 Experimental Setup

We run SpreadSim [4] described in Section 2 for the areas around the cities of San Francisco (and the bay area), Miami (Miami-Dade county), Chicago (Cook county) and Houston (Harris county) over the available days in December 2019, January 2020, and March 2020. The results for more cities across the US can be found in Appendix A. For each city and month, SpreadSim starts on day 1 and ends on the last day of the month, generating a disease spread pattern based on the activities of the agents. For each experiment, we use the last 5 days as our test set to evaluate LocationRisk@T’s infection and risk prediction performance.

4.2.1 SpreadSim Parameters.

For SpreadSim, the parameter setting is shown in Table 3. We set \(\mu _{IS}\) to 5 similar to the mean incubation period reported in [38], and \(\sigma _{IS}=1\) since [26] observed that most infections occur within 3 days of symptom onset. The work in [26] reports “Infectiousness was estimated to decline quickly within 7 days”, so we set \(\mu _{R}\) to \(\mu _{IS}+7\) . We use a larger value for \(\sigma _{R}\) to account for the fact that some people may self-isolate after the onset of the symptoms. We use a value for \(d_{max}\) larger than the common 2 meter (m) recommendation for social distancing as the discussion in [7] shows how infections can happen through co-locations with larger separation between individuals if the co-location lasts for a long enough duration. Thus, we also set \(t_{min}\) to 1 hour, larger than the 15-minute minimum threshold mentioned by [52]. Moreover, because we consider co-locations that are at least 1 hour, we set \(p=1\) , as the probability of infection becomes higher when co-locations last for long periods.

Table 3.

Parameter	Description	Value
\(d_{max}\)	Maximum distance for a co-location	\(\sim 11\,\) m
\(t_{min}\)	Minimum duration for infection	1 h
p	Probability of infection	1
\(n_{init}\)	Number of agents initially infected	1,000
\(\mu _{IS}\)	Average number of days for an agent to become IS from exposure	5
\(\sigma _{IS}\)	Std. deviation of number of days for an agent to become IS from exposure	1
\(\mu _{R}\)	Average number of days for an agent to become R from exposure	12
\(\sigma _{R}\)	Std. deviation of number of days for an agent to become R from exposure	2.4

Table 3. SpreadSim Parameter Setting
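For reference, the settings in Table 3 can be collected into a small configuration object. The class, field names, and helper below are hypothetical, and the co-location rule (inclusive thresholds on distance and duration) is an assumption based on the parameter descriptions rather than on SpreadSim's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpreadSimParams:
    d_max_m: float = 11.0       # maximum co-location distance, approximate (~11 m)
    t_min_hours: float = 1.0    # minimum co-location duration for infection
    p_infection: float = 1.0    # probability of infection for a qualifying co-location
    n_init: int = 1000          # number of agents initially infected
    mu_is_days: float = 5.0     # mean days from exposure to IS
    sigma_is_days: float = 1.0  # std. deviation of days from exposure to IS
    mu_r_days: float = 12.0     # mean days from exposure to R (mu_IS + 7)
    sigma_r_days: float = 2.4   # std. deviation of days from exposure to R


def is_qualifying_colocation(distance_m, duration_hours, params):
    """Assumed rule: close enough and long enough to allow transmission."""
    return distance_m <= params.d_max_m and duration_hours >= params.t_min_hours


params = SpreadSimParams()
```

With p set to 1, every qualifying co-location with an infectious agent leads to an infection, which is why the distance and duration thresholds carry all of the selectivity in this setup.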

Furthermore, note that the accuracy of our location data is about 25 m, which is larger than the value we used for \(d_{max}\) . Thus, the inaccuracy in our location data can affect who gets infected in the simulation. However, this does not impact the quality of the infection patterns generated, because the 25 m accuracy still ensures that an agent in the same area as the infected agent will get infected. For instance, if an infected person is in a shopping mall, the other infected agents will still be in the shopping mall.

Finally, we note that although we have made every effort to design SpreadSim such that it follows the transmission dynamics of COVID-19, existing inaccuracies in the transmission model do not affect the observations made in this article. This is because, firstly, SpreadSim is used to generate the ground-truth. The models predicting risk-scores are both trained and evaluated against this ground-truth data. Secondly, SpreadSim is used consistently across all the baselines and our proposed models, and thus our evaluation is fair.

4.2.2 Baselines.

We analyze the performance of LocationRisk@T \(_{Mob}\) and LocationRisk@T \(_{Mob^+}\) – the two variants which utilize the fine-grain mobility-flow information (as described in Section 3.1) along with Hawkes \(_{Den}\) , which relies only on the mobility density in the clusters and is the state-of-the-art approach for spatiotemporal modeling of COVID-19 from mobility data [12]. Specifically, Hawkes \(_{Den}\) uses the location signals at the clusters, with the mobility density feature defined as

\begin{align} (\text{Hawkes}_{Den}~\text{Features})\quad \boldsymbol {m}_{c}^t = \begin{bmatrix} w_{c\rightarrow c}(t) \end{bmatrix}. \end{align}

(9)

4.2.3 Metrics.

We evaluate the techniques on two main criteria: (a) infection prediction, to judge the model fit, and (b) risk prediction, to analyze whether the proposed risk metric in Equation (8) is a reliable indicator of the potential risk associated with a region over time. For infection prediction, we use the mean relative absolute error (relative-MAE) between the predicted infection trajectory and the true infections (R-MAE(I)) as the performance metric. This relative measure accounts for the scale differences (in the number of infections) across different clusters. In addition, we also report the standard deviation of the predicted infection trajectory ( \(\sigma\) (I)) corresponding to our relative-MAE metric. For risk prediction, we report the mean absolute error (MAE) between the predicted risk and the infections (scaled between 0 and 1) on the test set (MAE( \(\rho _{\mathrm{test}}\) )) and overall (MAE( \(\rho _{\mathrm{all}}\) )).
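The two error measures can be sketched as below. The exact formulas are not spelled out in the text, so this is one plausible reading: relative-MAE is taken as the mean of the absolute error divided by the true value, with a small floor to avoid division by zero (an assumption), and the infections are min-max scaled before being compared with the risk scores. All function names are hypothetical.

```python
def relative_mae(predicted, actual, eps=1e-9):
    """Mean of |predicted - actual| / |actual| over a trajectory."""
    errs = [abs(p - a) / max(abs(a), eps) for p, a in zip(predicted, actual)]
    return sum(errs) / len(errs)


def mae(predicted, actual):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def scale_unit(values):
    """Min-max scale a sequence to [0, 1] (all zeros if it is constant)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Made-up numbers: predicted vs. true infections, and risk vs. scaled infections.
r_mae = relative_mae([110.0, 45.0], [100.0, 50.0])
risk_err = mae([0.1, 0.4, 1.0], scale_unit([10, 20, 30]))
```

Dividing by the true count is what makes the infection error comparable between a cluster with hundreds of infections and one with a handful.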

4.2.4 Pre-Processing Steps.

We use the OD matrix \(W(t)\) as a means to gauge the traffic-flow characteristics on day t. The OD matrix is a \(|\mathcal {C}|\times |\mathcal {C}|\) matrix where \(|\mathcal {C}|\) is the number of clusters. In our experiments, we set \(|\mathcal {C}|= 15\) . The entry \(w_{i\rightarrow j}(t)\) on the ith row and the jth column of \(W(t)\) is calculated as the total count of pairs of consecutive location signals, \(\mathscr{c}\) and \(\mathscr{c}^{\prime }\) , over all individuals, such that \(\mathscr{c}\) is in cluster i and \(\mathscr{c}^{\prime }\) is in cluster j; see also Section 3.1.
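The OD-matrix construction can be sketched as follows, assuming each location signal has already been assigned to a cluster index. The input layout and function name are hypothetical.

```python
from collections import defaultdict


def od_matrix(signals, num_clusters):
    """Count consecutive-signal transitions between clusters for one day.

    signals : iterable of (device_id, timestamp, cluster_index) tuples
    Returns w with w[i][j] = number of consecutive signal pairs, over all
    devices, whose first signal is in cluster i and second in cluster j.
    """
    by_device = defaultdict(list)
    for device, ts, cluster in signals:
        by_device[device].append((ts, cluster))
    w = [[0] * num_clusters for _ in range(num_clusters)]
    for visits in by_device.values():
        visits.sort()  # order each device's signals by timestamp
        for (_, src), (_, dst) in zip(visits, visits[1:]):
            w[src][dst] += 1
    return w


day = [("a", 1, 0), ("a", 2, 0), ("a", 3, 1), ("b", 5, 1), ("b", 1, 2)]
w = od_matrix(day, 3)
```

The diagonal entries \(w_{c\rightarrow c}(t)\) count signals that stay within a cluster, which is the density feature used by the Hawkes \(_{Den}\) baseline in Equation (9); the off-diagonal entries carry the flow information.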

Next, we use a 6-day moving median filter to smooth the infection and mobility traces. Empirically, we find that such smoothing keeps a cluster with a sudden rise in cases from dominating the learned model. All mobility features are standardized, meaning that we subtract the mean and divide by the standard deviation.
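The smoothing and standardization steps can be sketched with Python's statistics module. Two details are assumptions, since the text does not fix them: the window is taken as trailing (the current day and the preceding days, shortened at the start of the series), and the population standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, median, pstdev


def moving_median(series, window=6):
    """Trailing moving median; early entries use the samples seen so far."""
    return [median(series[max(0, k - window + 1): k + 1])
            for k in range(len(series))]


def standardize(series):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(series)
    sd = pstdev(series)
    if sd == 0:
        return [0.0 for _ in series]  # constant feature: nothing to scale
    return [(v - mu) / sd for v in series]


# A one-day spike (100) is suppressed once the window fills up.
smoothed = moving_median([1, 100, 2, 3, 4, 5, 6], window=3)
z = standardize([1.0, 2.0, 3.0])
```

A median, unlike a mean, ignores a single extreme day inside the window, which is consistent with the stated goal of limiting the influence of sudden rises in cases.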

4.2.5 Parameter Choices.

The main parameters of the model are the Weibull parameters ( \(\alpha\) , \(\beta\) ) and the time delay parameter \(\Delta\) . For each city and month, we choose a set of parameters ( \(\alpha\) , \(\beta\) , \(\Delta\) ), which yields the best relative-MAE on infection prediction. For a fair comparison, all techniques are provided with the same set of parameters for a city and month.

4.3 Results

We show the infection prediction performance measured in terms of R-MAE and the risk prediction performance in terms of the MAE between the evaluated risk and scaled infections for the top-5 clusters (in terms of location signals) of the metropolitan areas of San Francisco, Miami, Chicago and Houston in Tables 4, 5, 6, and 7, respectively. We provide additional results for the metro areas of Los Angeles, New York, Seattle, and Salt Lake County in Appendix A. For each of these cities, LocationRisk@T \(_{Mob^{+}}\) yields the best infection prediction performance across all months considered. Furthermore, our models LocationRisk@T \(_{Mob}\) and LocationRisk@T \(_{Mob^{+}}\) are also more robust ( \(\rho _{\mathrm{test}}\) performance) and yield superior performance overall for risk prediction as compared to the density-only baseline of Hawkes \(_{Den}\) . Note that, like any learning model, the prediction capabilities are sensitive to the number of available infections. As a result, the prediction accuracy for clusters with a small number of infections is not high. Arguably, the risk scores for clusters with a large number of infections matter more, and for low-infection levels contact tracing may be a more effective tool.

Table 4.

Table 5.

Table 6.

Table 7.

In addition, in Figures 5, 6, 7, and 8 panels (a–c) (i–iii), we provide visualizations of the spatiotemporal risk predicted by LocationRisk@T \(_{Mob^+}\) corresponding to Tables 4, 5, 6, and 7, respectively, for days 10, 15, and 20 of the months of Dec. 2019, Jan. 2020, and Mar. 2020. Furthermore, in panel (d) (i–iii) in each of these figures, we compare the corresponding average mobility density, infections and the predicted risk across all clusters (in a city) for these months. Note that here the risk scores are scaled to \([0,1]\) for each month, where a darker color represents a higher risk score relative to the risk in that month. From Figures 5, 6, 7, and 8 we note an interesting trend: the similar mobility patterns of Dec. 2019 and Jan. 2020 (except for scale) lead to similar disease spread patterns, and ultimately similar risk scores. The month of Mar. 2020, however, differs from Dec. 2019 and Jan. 2020 in each of these cities. Recall that the stay-at-home order was implemented in the middle of Mar. 2020 [1], hence our March dataset includes mobility patterns before, during and after the lock-down. Consequently, we observe higher mobility density during the beginning of the month (which results in higher infections early on). However, as the mobility drops (panel (d-i)), there is a corresponding drop in the infections (panel (d-ii)) and in the risk scores by LocationRisk@T \(_{Mob^+}\) (panel (d-iii)). Here, LocationRisk@T \(_{Mob^+}\) ’s risk scores track the infections, emerging as a reliable spatiotemporal risk metric.

Fig. 5.

Fig. 6.

Fig. 7.

Fig. 8.

Lastly, another important observation concerns the high-risk areas for each city. Our results using LocationRisk@T \(_{Mob^+}\) corroborate our intuition that popular destinations in a city are riskier. For instance, for the metro area of San Francisco, the actual downtown area turns out to be high-risk. This trend can be observed in other cities as well. Here, the superior performance of LocationRisk@T \(_{Mob^+}\) over Hawkes \(_{Den}\) and LocationRisk@T \(_{Mob}\) for both infection and risk prediction can be attributed to modeling the infection mobility in addition to the location signal density and mobility. Incorporating the infection mobility, as opposed to just relying on the popularity of an area as in the case of Hawkes \(_{Den}\) , allows LocationRisk@T \(_{Mob^+}\) to improve infection prediction performance by being cognizant of the past infections in different areas in a city. As a result, LocationRisk@T \(_{Mob^+}\) underscores the importance of bringing together mobility patterns and an infection spread prediction model to assign high-resolution risk scores.

5 Related Work

5.1 Disease Prediction and Mobility Indicators

Most popular disease prediction models employ compartmental models since they allow explicit modeling of the transmission characteristics. These mainly include variants of the classical SIR model [32], such as the S-Exposed-IR (SEIR) model and its variants, which primarily aim to add additional latent states to the SIR model [42]. Popular in practice due to their simplicity, these models rely on homogeneous population mixing to learn the model parameters, and have been used to predict R0 at coarser granularity for counties, states and entire countries [6]. Although useful to communicate the disease characteristics at early stages, coarse-grain risk scores and R0 estimates at the county or state level are not readily usable for public policy decisions at finer spatial and temporal scales [9, 28]. On the other hand, Hawkes process-based point process disease spread models have also emerged as an alternative way to model COVID-19 spread [12]. Even though mathematical similarities between compartmental models and Hawkes process-based models exist, Hawkes process-based modeling does not involve the complex parameter estimation, model identifiability and mis-specification issues that arise with the SEIR model [18, 27, 36, 44, 47].

Moving away from homogeneous mixing models, therefore, involves analyzing the specific mobility patterns and using these covariates in disease spread prediction. To this end, the study in [12] leverages the flexibility of the Hawkes process models to incorporate the demographic and mobility indices for COVID-19 prediction. Their work shows that the dynamic R0 correlates with the time-delayed mobility density at the county-level across the United States, where R0 is viewed as a proxy for the risk associated with a region. Although such coarse-grain risk scores are useful in policy decisions for a country or a state, these may not be informative enough for city-level planning. Separately, [34] use dynamic (time-varying) R0 to assign risk scores to communities using a compartmental model (which assumes homogeneous mixing), but do not consider mobility features. In contrast, LocationRisk@T leverages the Poisson regression-based R0 modelling proposed by [12] to incorporate the mobility patterns provided by OD matrices, and infection mobility covariates, to develop the spatiotemporal risk scores, as discussed in Sections 3 and 4.

5.2 Agent-Based Models and Simulations

Various agent-based simulations are used to model the spread of a disease in a population [11, 20, 21, 25, 33]. They generate synthetic contacts to simulate human contacts using contact matrices, where a pre-defined probability of contact between individuals in different groups of the society is used to decide whether there is contact between individuals at any point in time. Furthermore, their goal is to study different intervention policies. SpreadSim differs from the existing work in two aspects. First, we use real-world location signals to model human mobility. This allows us to create realistic infection patterns in a population that change over time based on mobility and differ across populations. The existing simulations are incapable of capturing this because they generate contacts between individuals synthetically. Second, we use SpreadSim only as a means of generating realistic infection information to allow us to evaluate LocationRisk@T.

6 Conclusions and Future Work

In this work, we demonstrated that time-varying location-based risk scores can be a valuable public health tool to facilitate the safe reopening of normal activities. The existing risk scores (based on R0 learned using compartmental models) either do not provide the information at spatial and time resolutions to be useful, or rely on uniform mixing of population, which is unrealistic in practice.

We developed LocationRisk@T, a Hawkes process-based model for infection and risk forecasting, where we incorporate actual mobility patterns along with the mobility of the infected population from different regions of a city to assign spatiotemporal risk scores at relatively finer temporal and spatial scales. Subsequently, we demonstrated the applicability of the model using SpreadSim, which simulates the disease spread over actual mobility data from the months of Dec. 2019, Jan. 2020, and Mar. 2020 across cities in the United States. Our risk scores emerge as a reliable metric while tracking the infections in a city. One limitation of our approach is that even though we rely on real-world co-locations, the disease spread mechanism is based on simulation, which is agnostic to any real physical barriers or other factors which may influence disease spread. Furthermore, to focus on the problem of developing risk scores, we have assumed that SpreadSim has access to the mobility patterns of the full population. However, in practice, we will only have access to the mobility patterns of a subset of the population, in which case methods such as [54] should be used together with SpreadSim to estimate the infection statistics for the whole population from the mobility patterns of the subset.

We plan to extend our work in three directions. First, we intend to develop individual-level user-specific risk scores by combining user trajectory prediction models with our spatiotemporal risk prediction model. Next, we plan to address the privacy issues associated with assigning user-specific risk scores. Finally, incorporating demographics and electronic medical records (EMR) data, considering spatiotemporal Hawkes process models [53] and deep learning based models for long-term forecasting capability, also constitute our future work.

Footnotes

¹ Provided each cluster contains an adequate number of location signals and infections for reliable learning.

² A number of off-the-shelf algorithms can be used; in this exposition, we use MATLAB’s fitglm.

³ The Weibull shape and scale parameters \(\alpha\) and \(\beta\) can also be learned by adding additional EM-based update steps when adequate data samples are available, i.e., \(|\mathcal {T}|\) is large enough; see [12]. Since we run our simulation over relatively shorter time scales, in our formulation, we set \(\alpha\) and \(\beta\) based on grid-search.

A Additional Experimental Results

We ran our experiments, with the same setup described in Section 4, for the areas around four more cities, namely Los Angeles, Seattle (King county), New York (Manhattan) and Salt Lake county. The statistics of these datasets can be found in Tables 8 and 9. The results are presented in Tables 10, 11, 12, and 13 and Figures 9, 10, 11, and 12.

Fig. 9.

Fig. 10.

Fig. 11.

Fig. 12.

Table 8.

City	December	January	March
Los Angeles	331 \(\,\times \,10^6\)	282 \(\,\times \,10^6\)	171 \(\,\times \,10^6\)
Seattle	53 \(\,\times \,10^6\)	66 \(\,\times \,10^6\)	38 \(\,\times \,10^6\)
New York	41 \(\,\times \,10^6\)	36 \(\,\times \,10^6\)	21 \(\,\times \,10^6\)
Salt Lake	31 \(\,\times \,10^6\)	39 \(\,\times \,10^6\)	29 \(\,\times \,10^6\)

Table 8. Total No. Location Signals per Month

Table 9.

City	December	January	March
Los Angeles	159 \(\,\times \,10^3\)	171 \(\,\times \,10^3\)	131 \(\,\times \,10^3\)
Seattle	29 \(\,\times \,10^3\)	38 \(\,\times \,10^3\)	32 \(\,\times \,10^3\)
New York	40 \(\,\times \,10^3\)	41 \(\,\times \,10^3\)	32 \(\,\times \,10^3\)
Salt Lake	16 \(\,\times \,10^3\)	22 \(\,\times \,10^3\)	23 \(\,\times \,10^3\)

Table 9. No. Agents per Month

Table 10.

Table 11.

Table 12.

Table 13.

An interesting observation for New York (Manhattan) and Salt Lake County, shown in Figure 11 and Table 12, and Figure 12 and Table 13, respectively, is that increased mobility in Mar. 2020, leading to a large number of infections, decreases the model’s capacity to predict later on when the infections drop. This is because the Hawkes process model tends to attribute the infections to the background rate \(\mu _c\) rather than the mobility-dependent \(R_c^t\) . This leads to poor infection prediction performance (for all methods) since the model anticipates infections to happen at the relatively large background rate. In practice, this behavior can be avoided by using longer traces, as in [12]. Nevertheless, our risk scores (Equation (8)), based on the intensity function \(\lambda _c(t)\) (Equation (3)), are still faithful to the infections since they take into account both the background rate and the mobility-dependence. This in fact highlights the advantages and reliability of the proposed risk score metric. In addition, our risk score for Manhattan (New York) also illustrates its applicability for fine-grained spatiotemporal risk assignment.

B Experiments On Open-source Dataset

We performed further analysis on the Gowalla dataset [13], an open-source user check-in dataset. We used the data for a 60-day period, starting from 2009-09-26 (we observed very few check-ins and individuals for days before this period). There are a total of 1,148 individuals and 37,174 check-ins for that period. The SpreadSim parameters are set as before, with the exception that the number of initial infections is now set to one percent of the population size.

We select the check-ins in the San Francisco (SF) Bay Area over this 60-day period for the risk score analysis in this exposition. Note that most of the Gowalla check-ins are concentrated around the cities of SF, San Mateo and Alameda. The results of the analysis and the visualizations are presented in Table 14 and Figure 13, respectively.

Fig. 13.

Table 14.

Infection and Risk Prediction, \((\alpha , \beta , \Delta) = (10, 1, 10)\)
Model	R-MAE(I)	\(\sigma\) (I)	MAE( \(\rho _{\mathrm{test}}\) )	MAE( \(\rho _{\mathrm{all}}\) )
Hawkes \(_{Den}\)	0.3774*	0.3394*	0.1249*	0.0624*
LocationRisk@T \(_{Mob}\)	0.3750*	0.3508*	0.1342*	0.0589*
LocationRisk@T \(_{Mob^+}\)	0.3690	0.3159	0.1748	0.0457

Table 14. Predicting (5-day) Infections and Risk for San Francisco–Bay Area, CA Using the Gowalla Dataset

The table shows the error in predicted infections (I), the corresponding standard deviation, risk ( \(\rho\) ) for the test set, and over 60 days for the 5 clusters. \(^*\) The model is essentially a constant model based on the \(\chi ^2\) statistic. Best performance for each metric (across different methods) is highlighted in bold.

Acknowledgements

We would also like to acknowledge Veraset for providing us with high fidelity location signals and the USC Machine Learning Center (MASCLE). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

References

[1]

2020. Stay-at-home order. Retrieved November 10, 2020 from https://covid19.ca.gov/stay-home-except-for-essential-needs
