1. Introduction
On 31 December 2019, 27 cases of pneumonia were reported in Wuhan City, Hubei Province, China. The cause was identified on 7 January 2020 and was subsequently termed SARS-CoV-2 (the virus) and COVID-19 (the disease) by the World Health Organization (WHO) [1]. The disease subsequently spread to the point that the WHO officially declared it a pandemic on 11 March 2020 [
2]. As of 31 May 2022, a cumulative total of 526,558,033 cases and 6,287,117 deaths attributed to the disease had occurred [
3]. The sheer scale of the disease's spread prompted research efforts to combat it across the globe. Proper modeling of the disease has been crucial not only for evaluating the behavior of the virus, but also for informing the policy decisions made in response to it [
4,
5]. An extensive evaluation of social-distancing measures and other nonpharmaceutical interventions found these approaches to be effective, supporting their adoption on numerous occasions throughout the pandemic [
6,
7,
8]. More recently, the role of vaccination has also been considered in the relevant models [
9].
One of the greatest issues in studying the disease has been the difficulty in estimating its spread. Even simply getting an accurate estimate of the current number of infected individuals has been extremely difficult [
10]. Accurate counts of the infected population are crucial: they guide policy decisions ranging from the distribution of testing resources, to the allocation of treatment materials, to the severity of lockdown procedures. Limited testing supplies in the early stages of the pandemic and the frequency of asymptomatic cases have caused major information-gathering difficulties [
11]. The accurate estimation of certain parameters used in modeling the disease, notably the basic reproduction number and the case confirmation rate, is also useful in guiding policy. The parameters' true values, however, depend on these unknown infected cases. There are methods, however, to infer these unknown quantities from limited available data; see, e.g., [
12] for a recent discussion. We focus here specifically on the use of compartmental epidemiological models in conjunction with the widely applicable and highly successful technology of so-called physics-informed neural networks (PINNs) (see [
13] for a recent review thereof) to estimate these quantities.
The classic SIR ODE model is, arguably, the most well-known compartmental model. It separates the population into susceptible, infected, and removed (recovered) groups, with the model’s origins tracing back to the seminal work of [
14]. Over the nearly one hundred years that have followed, a wide variety of variants of the model have been developed to fit the particular features of different diseases [
15]. In the case of COVID-19, several different modifications have been considered to more accurately describe its development. The SEAHIR model (involving susceptible, exposed, asymptomatic, hospitalized, infected, and recovered populations) incorporates the effect of social isolation measures [
16]. Meanwhile, the SIRSi (susceptible, infected, recovered, sick) approach models temporary immunity that wanes over time [
17]. There have been numerous other proposals, involving different numbers of components (and other features such as, e.g., age stratification [
18] and spatial distribution [
19]). More complex models, however, run the risk of their unknown quantities being difficult, or in some cases even impossible, to estimate [
20]. If we want to estimate these unknowns, we need a model that reflects the practical realities of COVID-19, the effects of isolation [
6], and the prevalence of asymptomatic cases [
10,
11], while still allowing estimations.
We adopt the SICRD model implemented in [
21]. The susceptible, infected, and recovered groups are similar to the standard SIR setting discussed above. We now also consider two additional compartments, the confirmed and death cases, which we assume to be the only directly known variables. We adopt this model as it incorporates the noteworthy effects of testing and quarantining. Crucially, it also splits the infected population into unknown and known compartments, and includes data on fatalities. The latter is an extremely important piece of accessible information, as we note in Section 2.3.2. Indeed, without it, it would be impossible to estimate the number of infected individuals, or most of the important parameters, from confirmed case counts alone. The relevance of the inclusion of the death data (as well as, potentially and more recently, the data on hospitalizations) is a feature that some of the present authors have argued for in various earlier works [
12,
18]. The SEAHIR model has similar advantages and much of the analysis below could be adapted to that model as well (an avenue that we do not pursue herein).
We propose a sophisticated neural network approach to estimate the unknown cases and parameters, i.e., a methodology that has proven extremely successful in a variety of applications within scientific computing [
13,
22]. A key advantage of a neural network over traditional regression techniques is its improved flexibility. Regression methods such as those in [
23] can only fit the data to an exactly specified functional form. A neural network has many more adjustable parameters, allowing for a more flexible modeling of a given problem. Indeed, in this vein, there have been some quite promising recent results in the usage of neural networks in disease modeling. The work of [
24] effectively forecasts COVID-19 spread using a combination of an ensemble neural network with a fuzzy logic system. A graph neural network leveraging human mobility information has improved the forecasting of COVID-19 in Germany [
25]. Meanwhile, the spread of influenza in the U.S. has been modeled using a recurrent neural network to better reflect the time-sequential nature of the viral spread [
26]. The usage of a neural network is also quite natural for us, as our model is particularly well suited for an adaptation of the “Physics Informed Neural Network” (PINN) approach [
27].
The principle of PINNs was developed for the estimation of unknown quantities in a system adhering to a physical law, generally a nonlinear PDE (or lattice dynamical equation [
28]) that the system is assumed to obey [
27]. The effectiveness of this approach has been substantial across a wide variety of disciplines, notably having excellent compatibility with other techniques such as the previously mentioned graph and recurrent neural networks [
22]. We note there have been several other attempts to apply the PINN concept to studying COVID-19. The work of [
29] considered the SIR model. They suitably modified its dynamical equations through a nonlinear transformation, aiming to leverage neural networks to identify the (assumed-to-be linear) time-dependence of the infection rate, and subsequently performed short-term predictions. This has some interesting parallels to our proposals and analysis below, although we seek to avoid some of the relevant assumptions on the reliability of the total case counts or on the linear time-dependence of the infection rate. Meanwhile, the work of [
30] attempted a parameter inference similar to ours and considered the more nuanced susceptible–exposed–infected–recovered–susceptible model. This attempt, however, considered all variables to be known. Yet another effort in this vein of leveraging data-driven techniques in epidemiological problems can be found in the interesting work of [
31]. In this paper, the authors sought to discover the associated dynamic models directly from data in the case of childhood diseases such as chickenpox, rubella, and measles, with partial success.
The novelty of our work comes from the substantial difficulty of having unknown variables in our system and the modifications of the PINN technique we make to overcome this. The standard PINN directly uses the governing equations to define its loss term, with no further reformulation specific to a given model. We found that in our case, with several unknown variables, the standard PINN consistently converged to incorrect values for all parameters and variables. We address this by introducing a new approach to the training loss. We derive a new set of terms for our loss function in
Section 2.5. The new loss terms are structured to reduce the number of unknowns that must be simultaneously considered in a single term. This avoids errors due to too many degrees of freedom in the network’s attempted estimation. The result is a network with rapid and accurate convergence. We achieve very promising results, even with a simple feed-forward network architecture. This suggests the relevance of further research into other architectures, such as recurrent networks to improve performance, or an application to a graph neural network to incorporate human mobility [
19,
32]. The exact form of our new loss is specific to the SICRD model, but the principle of deriving new equations for the purpose of improved training stability is far broader. The concept could be applied to the training loss of any machine learning system, so long as there exist governing equations for the considered data. Any such analysis could see improvements in the network's convergence, stability, and efficiency with appropriately derived loss terms. In summary, we propose a neural network approach to avoid the more substantial constraints of traditional regression methodologies; among neural network methods, we adopt the widely used PINN framework; and, finally, we provide a modification that improves its training loss and whose principle can be generalized to other systems within this class (of relevance to epidemiological applications).
We begin in
Section 2.1 and
Section 2.2 by explicitly defining the SICRD model, as well as explaining the general concepts of the model and the sources of our data. In
Section 2.3, we consider the question of whether the unknown variables and parameters may be estimated from the partially known information. We formally address this by showing our model is identifiable, meaning the unknown parameters and unknown variables can be uniquely determined by the known variables. We thus verify we are in a regime where the estimation goal is possible. It is also shown that several variations of our model are not identifiable, which justifies our particular choice over several considered alternatives. In
Section 2.4, we address the issue of potential noise in our collected data with the use of a wavelet transform, which separates the signal into key features and the noisy component caused by smaller scale fluctuations. This filtering also smooths the data, which improves our ability to apply an ODE model to it. With the model defined and the data appropriately processed,
Section 2.5 explicitly presents our novel loss function for the network along with a corresponding explanation for its usage.
We then demonstrate the neural network’s effectiveness by running estimations on the ideal case, with simulated data that explicitly obey our model in
Section 3.1. We briefly review the results of denoising via the wavelet transform in
Section 3.2. Then, in
Section 3.3 we demonstrate the network’s ability to estimate the unknown infected population and several unknown parameters given real data reported in U.S. states. In
Section 3.3.1, we also demonstrate that, due to the high efficiency of the network's calculations, it may be used to perform estimates with a model featuring time-varying parameters, as opposed to simply assuming constant ones, with no particular restrictions on the form of the time dependence; this extends the work of other attempts to allow time-varying parameters; see, e.g., [
21] or [
29]. Our results suggest that when the parameters are allowed to vary, the variation is substantial enough that, over any longer time frame, constant-parameter assumptions would be unrealistic.
In
Section 3.3.2, we propose a ranking of states according to how well they are conducting testing, by calculating the ratio of the (estimated) infected population to the confirmed population. The intention is to give a more precise estimate of how effectively testing is being conducted in a region, rather than simply looking at the per capita number of infections. This can highlight population centers where infections are low, because the disease has only recently entered them, but testing is low even relative to the number of infections. If a region with poor testing can be recognized early, the problem can be addressed before a major outbreak occurs. Finally, in
Section 4, we briefly review our results and propose several possible directions of future research. We discuss some of the limitations of the model and network, as well as several methods to expand the work to address those limitations.
2. Materials and Methods
2.1. The SICRD Model
In modeling the development of COVID-19, we focused on U.S. states as our population centers due to the relative ease of access to epidemiological statistics. In each state, we used a modification of the SIR model obtained by introducing two new population compartments, the death cases and the confirmed cases [21]. This gives the SICRD model, a compartmental model in which the population is divided into susceptible (S), infectious (I), confirmed (C), recovered (R), and dead (D) individuals; see also
Figure 1. Note that each compartment refers to the current count and not the total cumulative number of cases.
This model reflects the real-life situation in which an infectious person may recover without receiving a formal diagnosis, as well as accounting for the effects of testing for the disease and of quarantining. The model is defined explicitly by the following ODE system:

$$\begin{aligned}
\frac{dS}{dt} &= -\frac{R_0}{T_I}\,\frac{S I}{N}, \\
\frac{dI}{dt} &= \frac{R_0}{T_I}\,\frac{S I}{N} - \frac{I}{T_I}, \\
\frac{dC}{dt} &= \alpha\,\frac{I}{T_I} - \frac{C}{T_C}, \\
\frac{dR}{dt} &= (1-\alpha-\mu)\,\frac{I}{T_I} + (1-\mu)\,\frac{C}{T_C}, \\
\frac{dD}{dt} &= \mu\,\frac{I}{T_I} + \mu\,\frac{C}{T_C}.
\end{aligned} \tag{1}$$

Here, $N$ is the total population size. $R_0$ is the basic reproductive number, which refers to the average number of cases directly infected by one infectious case in a completely susceptible population. $T_I$ is the average number of days from first becoming infectious to confirmation, recovery, or death. $T_C$ is the average number of days from being confirmed as infected to recovery or death. $\alpha$ is the proportion of confirmed cases among all cases transferred from the unconfirmed infectious state. Finally, $\mu$ is the fatality rate. The deaths (D) and recoveries (R) stem from either the infectious compartment (I) or the confirmed infection one (C). We also note that we assume individuals who have been confirmed to be infectious sufficiently quarantine and do not infect any further individuals.
Our goal is to estimate the unknown infected population $I$, along with the unknown parameters $R_0$, $\alpha$, and $\mu$. We wish to perform this estimation using only our accessible data, namely the count of confirmed cases $C$, the deaths $D$, and the more directly estimable time scale parameters $T_I$ and $T_C$. The reasons for this precise formulation of the SICRD model are discussed further in
Section 2.3.
We note that other, much more complex compartmental epidemiological models with more compartments such as those presented in [
18] were also considered; here, inspired by the latter work, we present a reduced version of the model for reasons also relating to identifiability, as discussed below. Furthermore, this more straightforward model was chosen to test the capabilities of our neural network approach without added complications, while the testing of more detailed models is left to be considered in future work.
SICRD with Time-Dependent Parameters
The assumption that the parameters in Equation (1) are constant (i.e., time-independent) is a highly restrictive and unrealistic one for long-enough time scales. In a real-world scenario, many of these parameters change as time passes, due to factors such as mutations of the virus and shifting government policy; the latter is well known to modify social interactions in the case of nonpharmacological interventions and hence to affect factors such as $R_0$ [6,18]. In this case, we update the model in (1) to replace $R_0$ and $\alpha$ with $R_0(t)$ and $\alpha(t)$, while $\mu$, $T_I$, and $T_C$ are left fixed, as they are relatively well-known, stable parameters. In the case of mutations, it is not unreasonable to expect variations of these parameters too, but for the cases under consideration, we expect such variations to be small and, at any rate, secondary to the above effects. This gives the alternate model, namely system (1) with $R_0$ and $\alpha$ replaced by the time-dependent $R_0(t)$ and $\alpha(t)$.
To perform an estimation of the vector of time-dependent parameters, we first estimate a constant value for the parameters over a time interval of length $\Delta t$. We then perform the estimation with a simple "rolling window" approach, where we take the constant parameter estimation over the interval $[t, t + \Delta t]$ to approximate the parameter values at time $t$. This method allows an estimation of the parameter values that does not presume any particular form for their time dependence. This expands on the interesting, very recent work of [
29]. The primary potential issue is the costliness of performing so many estimations, but we demonstrate in
Section 3.3 that our network performs efficiently enough to make this scheme feasible.
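A schematic of this rolling window procedure is sketched below; `fit_constant_params` is a hypothetical stand-in for the constant-parameter network estimation of Section 2.5, and the window convention is an illustrative choice on our part.

```python
# Sketch of the "rolling window" estimation of time-varying parameters.
# `fit_constant_params` is passed in as a hypothetical stand-in for the
# constant-parameter estimation described in Section 2.5.
def rolling_window_estimates(C, D, window, fit_constant_params, step=1):
    """Estimate constant parameters on each window of `window` consecutive days."""
    estimates = []
    for start in range(0, len(C) - window + 1, step):
        # Constant-parameter fit on data restricted to [start, start + window)
        params = fit_constant_params(C[start:start + window],
                                     D[start:start + window])
        # Attribute the constant estimate to this window's starting time
        estimates.append((start, params))
    return estimates
```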
2.2. Data Set
All nonsimulated data on COVID-19’s development used in this paper were pulled from the tool developed in [
33]. The tool aggregates data from a variety of data sources including the WHO, each state’s individual department of health, and the CDC. The data used included reports on case counts, active cases, and deaths within each individual U.S. state. The usage of these data is expanded upon in
Section 3.2,
Section 3.3.1 and
Section 3.3.2.
2.3. Identifiability
2.3.1. Identifiability Definitions
In order to reasonably use Equation (
1), we need to verify the identifiability of the model. Identifiability is the ability to uniquely identify the model parameters from the known variables. We know from the results of [
18,
20,
34] that many compartmental epidemiological models have potential issues with identifiability. To be concrete, we recall the precise definition of the term.
Consider an n-dimensional ODE system. We let $p \in P$ be the vector of constant parameters for the ODE system, where $P$ is our allowable parameter space. We note that the initial values of the variables in the system are also considered parameters. We let $m$ be all the assumed known (measured) variables of the system and $h$ be all the assumed unknown (hidden) variables of the ODE system, with $x = (m, h)$. Then, the ODE system can be represented by [20]:

$$\dot{m} = f_1(m, h, p), \qquad \dot{h} = f_2(m, h, p).$$
Note that the same ODE system may be treated with different separations into $m$ and $h$ depending on which variables are assumed known. Given such a system and a choice of known variables, we say that it is structurally globally identifiable if

$$m(t; p) = m(t; \hat{p}) \ \text{for all } t \quad \Longrightarrow \quad p = \hat{p}.$$

The interpretation is that the measured variables uniquely determine the values of the constant parameters for the ODE system. In such a case, it is not possible for two distinct choices of parameters to give precisely the same values for the measured variables. Note, though, from a practical perspective, that this does not disallow the possibility of $p$ and $\hat{p}$ with very different values resulting in $m(t; p) \approx m(t; \hat{p})$. Whether a narrow range for the parameter values can be determined from an uncertain value for $m$ is the question of whether the system is practically identifiable or not. Basic global identifiability needs to be verified first, though, as the practical identifiability of a system depends on the particular values chosen for the parameters and the range of the variables [
20]. For an interesting example where the range of variables may play a crucial role in the practical identifiability, see, e.g., the very recent work of [
35].
It is also possible for a system to fail to be globally identifiable but still be "locally identifiable", meaning that no unique choice of parameters may be determined from $m$, but that there are finitely many choices of $p$ that can be made. We find in Section 2.3.2 that no cases of local but not global identifiability arise for any parameters in our considered systems.
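To make these definitions concrete, consider the following standard toy illustration (ours, not drawn from the models below):

$$\dot{x} = -(a+b)\,x, \quad m = x \qquad \text{versus} \qquad \dot{x} = -a^{2}x, \quad m = x.$$

The first system is not globally (or even locally) identifiable: only the sum $a+b$ is determined by $m$, so infinitely many pairs $(a, b)$ yield identical measurements. The second is locally but not globally identifiable, since $a$ and $-a$ produce the same output.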
For Equation (1), we always assume that $C$ and $D$ are known and thus $m = (C, D)$. $I$ is, by definition, not directly known, while $S$ and $R$ both require knowledge of $I$ to be known, and thus $h = (S, I, R)$. (Reported data on recoveries are available, but this information covers only recoveries from confirmed cases and thus is not actually the same quantity as $R$ in our model.) We want to guarantee that our model at least satisfies structural global identifiability, meaning it is possible to estimate the parameters from $C$ and $D$ alone.
2.3.2. Numerical Results on Identifiability
The precise calculation of structural identifiability for a given ODE system is infeasible to perform by hand for all but the simplest of systems. Many software tools exist to perform this type of calculation, such as the differential algebra techniques of [
34]. We elected to use the SIAN (Structural Identifiability ANalyser) package to test the identifiability of our model and several slight variations of the model [
36]. The code runs a Monte Carlo algorithm to verify both local and global identifiability for an ODE system to within a high degree of certainty (>99%). The following models were tested using the code.
The first model is the system of equations given in (1) without modifications. The second model matches (1) for $S$, $I$, and $C$, with the following equations for $R$ and $D$:

$$\frac{dR}{dt} = (1-\alpha)\,\frac{I}{T_I} + (1-\mu)\,\frac{C}{T_C}, \qquad \frac{dD}{dt} = \mu\,\frac{C}{T_C}. \tag{5}$$
It is relevant to note that in (1), we allow deaths to occur from the unknown infected cases $I$, but we assume that the rate is the same as that for $C$, and we allow these deaths to be known. This is intended to reflect circumstances where a person is diagnosed posthumously, or extremely close to death. Meanwhile, in (5), we assume each fatality to occur in a case where the individual is first diagnosed. While (1) makes some stronger assumptions, it will soon be apparent that these or similar assumptions are necessary for identifiability; we need them for the parameters of the model to be estimable from known data.
The third case is a slight modification of (1), now allowing $C$ and $I$ to have two distinct death rates, referred to as $\mu_C$ and $\mu_I$, respectively. Explicitly, the changed equations are

$$\frac{dR}{dt} = (1-\alpha-\mu_I)\,\frac{I}{T_I} + (1-\mu_C)\,\frac{C}{T_C}, \qquad \frac{dD}{dt} = \mu_I\,\frac{I}{T_I} + \mu_C\,\frac{C}{T_C}. \tag{6}$$
Finally, we consider the case of (6) but with the death information recorded separately, assigning $D_C$ to $C$ and $D_I$ to $I$. Here, we consider both to be known, as the case where only $D_C$ is known was found to have only the time scale parameters $T_I$ and $T_C$ identifiable, and the case where only $D_I$ is known is unreasonable, as confirmed-case-induced deaths are naturally expected to be known.
Each of the models, with varying assumptions on the assumed known parameters, was run through the SIAN code. In each case, there were no instances of parameters that were locally, but not globally, identifiable. We also note that in cases where $C$ was the only known variable, $R_0$, $\alpha$, and $\mu$ were all not identifiable.
In each case, the initial value $R(0)$ was not identifiable, due to the fact that the current value of $R$ does not affect the rate at which any compartment changes. It can be estimated, though, if each of the other variables' initial values is estimated and the total population is known. We see from Table 1 that, under even the worst cases, $T_I$ and $T_C$ may be estimated, which aligns with the common assumption that they can be more straightforwardly inferred from known incidence data. To estimate any other parameter requires either knowledge of the death rate for $I$, or for $I$ and $C$ to be assumed to have the same death rate. In the cases where the identifiability of parameters failed, not even local identifiability held, meaning that an infinite number of parameter combinations could give rise to exactly the same solution for the known variables. Thus, in order to numerically estimate the parameters, we moved forward with model (1), as it had the best conditions for parameter estimation while still resting on reasonable underlying assumptions. A similar approach would also work for models with hospitalization data; there, hospitalizations take on a role similar to that of the deaths in providing indirect information on the unknown case numbers.
As an additional remark, the properties of model (5) are worse than those of (1) because, for the former, the time derivative of $D$ essentially provides information on $C$ (which is known) and hence does not assist in identifying $I$. The latter identification is, apparently, possible within (1). In the case of (6), $D_I$ is needed to allow for identifiability (once again, to allow for the detection of $I$), while $D_C$, connected with the confirmed cases, does not suffice. It may also not be reasonable to know $D_I$ and $D_C$ separately, as generally only total death counts appear in the reported information. The combination of these factors, together with the knowledge that even with both death counts the identifiability only holds in this model when $\mu_I$ is known (which is unreasonable if it is assumed distinct from $\mu_C$), supports the conclusion that model (6) is not as useful practically.
While having two distinct values of $\mu$ would be more realistic, we see that it creates fundamental issues with the identifiability of the model. We thus made the assumption of a single $\mu$ value here, in order to use it as a test case for our network. There may be potential methods to resolve these issues while retaining identifiability, but in the present work, our intention was to keep the assumptions of the model relatively simple while testing the novel loss function introduced in
Section 2.5. The application of the network to a more complex model is left to future work.
2.4. Data Processing—Denoising of Data Using a Wavelet Transform
Before feeding our data to the neural network we constructed, we first processed the data using a wavelet transform to separate the noise of the signal from its primary features. Our usage of this process was motivated by many other works where the denoising of an input signal for a neural network using a wavelet transform produced notable improvements [
37,
38,
39]. We implemented code from the PyWavelets package to perform our wavelet analysis and denoising on our signal [
40]. We considered the following framework for our data: in our model, each piece of measured signal data can be mathematically expressed as

$$x(t) = f(t) + \varepsilon(t),$$

where $f(t)$ is the true data that would have been obtained under ideal measuring conditions and $\varepsilon(t)$ comprises the adverse effects of the local environment or faulty activity for the feature, referred to as "noise". To avoid dealing with the noise in the data set, we applied a wavelet transform to denoise the data.
Wavelet theory provides a mathematical tool for hierarchically decomposing signals and, hence, constitutes an elegant technique for representing signals at multiple levels of detail [
41]. In general, the wavelet transform is generated by the choice of a single "mother" wavelet $\psi$. In our implementation, we used "Symlets 5" as our choice of mother wavelet (details can be found in the code documentation of [40]). Wavelets at different locations and spatial scales are formed by translating and scaling the mother wavelet. The translation and dilation of the mother wavelet are written with the operator $\psi_{u,v}$ acting as follows [42]:

$$\psi_{u,v}(t) = \frac{1}{\sqrt{u}}\,\psi\!\left(\frac{t-v}{u}\right), \qquad u > 0, \; v \in \mathbb{R}.$$
Essentially, we can treat the "daughter" wavelets above, generated by rescaling and shifting the "mother" wavelet, as analogous to the sinusoidal functions of the Fourier transform. The advantage is that the rescaling and shifting allow the wavelets to capture more local behavior. This is performed by taking wavelets with small scales $u$ and shifting them across the considered time range using varying values of $v$. A potentially very robust analysis of signals may be conducted in this way, though for our purposes the goal was to use the wavelets to decompose the signal into high- and low-frequency components, and then discard the especially high-frequency components as noise.
Now, for a given signal $f$, the wavelet transform of $f$, $Wf(u,v)$, at scale $u$ and location $v$, is given by the inner product:

$$Wf(u,v) = \langle f, \psi_{u,v} \rangle = \int_{-\infty}^{\infty} f(t)\,\frac{1}{\sqrt{u}}\,\psi^{*}\!\left(\frac{t-v}{u}\right) dt.$$
Here, the star represents the complex conjugate. This is known as the continuous wavelet transform, or CWT. To ensure that the inverse CWT is well defined, we need the following inequality:

$$C_{\psi} = \int_{0}^{\infty} \frac{|\Psi(\omega)|^{2}}{\omega}\, d\omega < \infty.$$

Here, $\Psi$ is simply the Fourier transform of $\psi$. This inequality is referred to as the admissibility condition [42]. One interpretation of this is that the choice of a mother wavelet must have no zero-frequency component, i.e., no nonzero constant component. The finiteness of this integral guarantees that the result of the CWT always has a finite $L^{2}$ norm. Generally, once this integral is verified to be finite, the mother wavelet is rescaled by it, so that the inverse CWT may simply be defined as

$$f(t) = \int_{0}^{\infty}\!\int_{-\infty}^{\infty} Wf(u,v)\,\psi_{u,v}(t)\, dv\,\frac{du}{u^{2}}.$$
This method of constructing the wavelet transform proceeds by producing the wavelets directly in the signal domain, through scaling and translation. When the signal frequency is higher, the wavelet base with a higher time domain resolution and a lower frequency resolution is used for the analysis. Conversely, when the signal frequency is lower, a lower time domain resolution and higher frequency resolution are used. This adaptive resolution analysis performance of the wavelet transform can effectively distinguish the local characteristics of the signal and the high-frequency noise, and accordingly perform the noise filtering of the signal.
The general steps of the wavelet transform threshold filtering method are as follows:
- (1)
Perform the multiscale wavelet decomposition on noisy time series signals; this process can be continued until the “noisy” or detailed component of the signal is sufficiently low in variance (see
Section 3.2);
- (2)
Determine a reasonable cutoff threshold and eliminate the high-frequency coefficients at each scale after decomposition;
- (3)
Perform a wavelet signal reconstruction from the wavelet coefficients after the zeroing process to obtain the filtered denoised signal.
The original signal $f$ can be decomposed as $f = a_1 + d_1$, with $a_1$ being a lower-frequency approximation and $d_1$ a high-frequency signal. Essentially, we are decomposing the original signal into a sum of wavelet terms of the form $\psi_{u,v}$, with gradually decreasing values of $u$ and suitable values of $v$. After reaching a cutoff for $u$, we reassemble the function using the approximation via lower-frequency and larger-scale wavelets, giving $a_1$, and the remaining portion of the function is represented as $d_1$. The process may be repeated, treating $a_1$ as the new "original" signal, to obtain the decomposition $f = a_2 + d_2 + d_1$, and so on for several iterations; a minimal sketch of the whole filtering pipeline follows.
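The sketch below illustrates steps (1)-(3) above using the PyWavelets package [40] with the "sym5" mother wavelet; the decomposition level and the (universal) threshold rule are illustrative choices on our part, not necessarily those used to produce the results of Section 3.2.

```python
# Sketch of wavelet threshold filtering with PyWavelets ("sym5" mother wavelet).
# The decomposition level and threshold rule are illustrative assumptions.
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="sym5", level=3):
    # (1) Multiscale decomposition: approximation a_L plus details d_L, ..., d_1
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # (2) Threshold the detail (high-frequency) coefficients at each scale;
    # here, a universal threshold estimated from the finest detail level
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(d, value=thresh, mode="soft") for d in coeffs[1:]]
    # (3) Reconstruct the filtered signal from the modified coefficients
    return pywt.waverec(coeffs, wavelet)
```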
In addition to removing noise, the filtering also smooths the data. This smoothing does not substantially change the actual numerical values or trends, but it does help achieve a better fit for the model. As an ODE model, the SICRD model assumes each of the variables to be at least continuously differentiable; thus, smoothing the inherently nondifferentiable accessible data helps the network perform its analysis. This decomposition allows us to obtain a denoised signal while still retaining information about the exact nature of the filtered noise. With a method to appropriately process a noisy signal and prepare it as a training set, we can now define the network intended to analyze our data.
2.5. Setup of the Neural Network
The ability of neural networks to act as universal approximators is well known [
43] but the ability to perform accurate estimations in reasonable time frames is a central element of their increasing appeal. A variety of network structures have been developed to suit the features of particular problems. For example, recurrent neural networks, such as the long short-term memory (LSTM) network, have been used to improve pattern recognition in data that have historical dependencies [
44].
For our approach, we implemented a new "modified" Physics-Informed Neural Network (PINN) to learn the values of the parameters and the unknown variable $I$ in our model (1). The PINN approach involves changing the loss function of the network, rather than any particular change in the network architecture itself, and is in fact compatible with a wide variety of possible architectures [
13,
22]. The concept of the PINN is to incorporate some physical law that must be obeyed by the system, and to introduce an extra term into the loss function which becomes smaller the closer the network output adheres to the law [
27]. The approach was developed to estimate parameters and do forecasting in cases where data are only available at relatively sparse time intervals. The method is also robust to issues of overfitting, making it particularly useful for studies in limited-information regimes such as disease modeling [
27]. While the method has been applied to more complex network architectures [
22], we considered only the fully connected feed-forward network in our work as we tested our “proof of concept” modification.
Initially, this concept was introduced in the context of nonlinear partial differential equations, though our case of a nonlinear ordinary differential equation still falls into the applicable category (and other studies have indeed used PINNs in this context [
28,
The common approach is to simply take the difference of the two sides of each differential equation and introduce it as an extra term to minimize in the loss function.
This is sufficient for many applications but we found it unable to generate acceptable results in our case. This was largely due to the entirely unknown variables
S,
I, and
R in our model. We also only had a single time series from the system to use in estimating the parameters, whereas in many attempts to estimate system parameters using PINN methods, an entire ensemble of starting conditions and subsequent trajectories is used. For example, the DeepXDE deep learning method was used to attempt to estimate the parameters of a Lorenz system but needed to take a large collection of different starting trajectories to achieve good estimates [
46]. Many methods of augmenting the PINN with other approaches exist [
22], but generally the core structure of the loss function, the defining feature of the PINN, is not adjusted to improve training. Once a choice of modeling equation is made, it is directly used for the loss terms. The work of [
29] is an appealing recent example, reformulating the original two-dimensional ODE system as an equivalent one-dimensional second-order ODE. Their goal was to obtain a more straightforward expression for the infection rate parameter, though the inherent effect of the approach on the network training was not considered.
To address our case, we developed a new method, changing the structure of the loss function to that in Equation (
14). The new terms still constituted the "physical law" for our system, but were of a form for which the network's gradient descent was far more stable. This novel approach, in our view, nontrivially extends PINN methods, and hence we refer to it as a "modified" PINN system. This reveals the importance of tailoring the loss function not only to each system, but to the specific circumstances of accessible variables and parameters for a given problem. The tests on artificially generated data in Section 3.1 were extremely promising, with accurate estimations of the unknown parameters $R_0$, $\alpha$, and $\mu$. This was performed with only a single time series of $C$ and $D$ being used for the estimation. The network architecture was also the simple fully connected feed-forward network, yet it still performed well. This attests to the efficiency of this suitably chosen loss function.
Before we precisely define our modified approach, we first review the standard PINN implementation. Generally, although many forms of PINN systems have been used, the loss function is basically of the same form [
13,
22]. To give a precise definition for the ODE case, consider an n-dimensional ODE system defined by

$$\frac{dX}{dt} = f(X, p),$$

where $f$ is some function of all our variables $X$, and the given system parameters are $p$. The loss term $MSE_f$ may then be defined by

$$MSE_f = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| \frac{d\hat{X}}{dt}(t_i) - f\big(\hat{X}(t_i), \hat{p}\big) \right|^2,$$

where $\hat{X}$ and $\hat{p}$ are the estimated values of the variables and parameters output from the neural network, and $MSE_f$ is the mean square error over the $N_t$ sample times up until time $t$. This loss essentially measures how well the output of the neural network obeys the physical laws that govern the system. Note that the estimated values of the system parameters appear in $f$ and thus may be learned using this loss term.
The PINN loss is then combined with the standard loss function, with $u(t_i)$ being the given test data:

$$MSE_u = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| \hat{X}(t_i) - u(t_i) \right|^2.$$

Thus, we have the total loss:

$$MSE = MSE_u + MSE_f.$$
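In code, this standard loss can be sketched as follows; this is a schematic PyTorch rendering under our notation, and `net`, `f`, and the data handling are illustrative assumptions rather than the precise implementation of [45].

```python
# Schematic PyTorch version of the standard PINN loss for an ODE system.
# `net` maps time t to the estimated state X_hat; `f(X, p)` is the ODE
# right-hand side; `p_hat` holds trainable parameter estimates.
import torch

def standard_pinn_loss(net, f, p_hat, t, u_data, known_idx):
    # t: shape (N_t, 1); u_data: shape (N_t, len(known_idx)) known-variable data
    t = t.clone().requires_grad_(True)
    X_hat = net(t)                                        # estimated state, shape (N_t, n)
    # dX_hat/dt for each component via automatic differentiation
    grads = [torch.autograd.grad(X_hat[:, j].sum(), t, create_graph=True)[0]
             for j in range(X_hat.shape[1])]
    dX = torch.cat(grads, dim=1)                          # shape (N_t, n)
    mse_f = ((dX - f(X_hat, p_hat)) ** 2).mean()          # physics residual, cf. MSE_f
    mse_u = ((X_hat[:, known_idx] - u_data) ** 2).mean()  # data misfit, cf. MSE_u
    return mse_u + mse_f
```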
Our initial testing essentially implemented the above scheme using the “Disease Informed Neural Network” code as a base [
45]. Our version of $MSE_u$ could only include $C$ and $D$, due to these being (assumed to be) the only known variables, but otherwise the loss function was the same. It was found that the network consistently arrived at incorrect values for all variables and parameters. The existence of unknown variables created several complications: the network had a fundamental gap in its training data, and it was trying to estimate the unknown variables $S$, $I$, and $R$ in addition to estimating the unknown parameters. The presence of so much unknown information created substantial problems in the network's ability to minimize the loss for any particular term of $MSE_f$ without creating significant losses in another element of the equations. The standard PINN loss terms also varied quite substantially in magnitude. The size of $\frac{dD}{dt}$ was much smaller than that of the other derivative terms, and thus might impact the network's learning less substantially, despite how crucial the death data were (as seen in our analysis in Section 2.3).
Motivated by these deficiencies, we developed a modified interpretation of the PINN concept in the following way. If our variables for an ODE system are given by the vector $X$, we choose functions $f_j(X, p)$ (where we recall that $p$ is the vector of parameters for the system), each of which should be 0 if the time series $X(t)$ precisely obeys the ODE system. Concretely, we consider the SICRD model and derive the functions $f_j$ of Equation (14) from Equation (1).
Notice that these expressions involve suitable modifications of the dynamical equations through (linear) combinations thereof. We can then use these functions to define the term:

$$MSE_f = \sum_{j} \frac{1}{N_t} \sum_{i=1}^{N_t} \left| f_j\big(\hat{X}(t_i), \hat{p}\big) \right|^2.$$

Then, the total loss function is given by $MSE = MSE_u + MSE_f$, where $MSE_u$ is the data misfit term defined as before, now restricted to the known variables $C$ and $D$. Some issues with the performance of the code were found when trying to use ratios or log scales to equalize the terms in the MSE loss, but ultimately the network performed well enough with the standard form; alternative approaches to such equalization could still be considered, though. With these new loss functions, the network converged quickly and accurately when tested on artificially generated data (results in
Section 3.1).
Part of the motivation for this choice is that it excludes from consideration the recovered population, for which there are always unidentifiable features (such as the initial value $R(0)$), and focuses on the rest of the populations, including an effective rewrite of the conservation law associated with the total population. We also note that in our new equations, we tried to reduce how frequently multiple unknown quantities appear together. To illustrate this point, the first of these functions involves only known quantities except for the fatality rate $\mu$, if we assume the time scales $T_I$ and $T_C$ to be known. This allows that equation to be used to learn one parameter. If $\mu$ is in principle known from the first function, then all quantities except $I$ are known in the second, and thus $I$ can be well estimated. Then, all quantities except $S$ are in principle known in the third, letting $S(0)$ and $S$ themselves be estimated. Finally, the last function can then be used to estimate the basic reproduction number $R_0$.
Practically, this is not precisely what is happening, as each term is being minimized simultaneously by the network. This is just meant to illustrate in principle how this structure reduces the interference of multiple competing unknown quantities (a detail known to cause issues in the training of some networks [
47]). We also note that all of the equations are of similar order, whereas in the standard PINN formulation, the quantity $\frac{dD}{dt}$ is substantially smaller than any other derivative appearing in the system. In that light, the present restructuring of the equations systematically builds up the optimization of the system parameters.
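To convey the flavor of such recombinations (the following are our own illustrative consequences of system (1), not necessarily the actual terms of Equation (14)), note that adding the $S$ and $I$ equations eliminates both the nonlinear infection term and $R_0$, while summing all five equations recovers the conservation of the total population:

$$\frac{d}{dt}(S+I) = -\frac{I}{T_I}, \qquad \frac{d}{dt}(S+I+C+R+D) = 0.$$

Each such recombination contains fewer simultaneous unknowns than the raw equations from which it is derived.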
This approach is in no way unique to our particular choice of system either. It shows that the concept of the PINN can be made broader than the standard choice of how to incorporate information from the original set of differential equations a system is assumed to obey. Various different ODE or PDE systems could have alternatively derived formulations that can then be used to train a neural network. The most crucial point of our example is that it shows the existence of a case where the standard PINN approach is insufficient, but the modified approach is extremely effective.
As far as the structure of the network itself is concerned, we note that it is a simple fully connected neural network, constructed with some minor modifications of the base code of [
45]. The hidden dimension and learning rate initially presented in [
45] created some issues, due to the original work assuming that all of the population compartments were known. This assumption allowed the original code to still converge reasonably well with a much lower learning rate, as well as a lower hidden dimension. In our implementation, we used 6 layers for our fully connected neural network, with hidden dimension 64. The nonlinear activation function in each layer was chosen to be the usual ReLU function, defined as $\mathrm{ReLU}(x) = x$ if $x \geq 0$, and $0$ otherwise. The learning rate was accordingly taken to be larger, while the optimizer used was Adam [48].
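For concreteness, a minimal PyTorch sketch consistent with this description follows; the input and output dimensions (time in, the five estimated compartments out) and the optimizer settings are illustrative assumptions on our part.

```python
# Sketch of the fully connected architecture described above: 6 linear layers,
# hidden dimension 64, ReLU activations, Adam optimizer. Input/output sizes
# (t in, five estimated compartments out) are illustrative assumptions.
import torch
import torch.nn as nn

def make_net(in_dim=1, hidden=64, out_dim=5, n_layers=6):
    layers, dim = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))       # no activation on the output layer
    return nn.Sequential(*layers)

net = make_net()
optimizer = torch.optim.Adam(net.parameters())   # learning rate left at an assumed default
```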
4. Discussion
In the present work, we chose a modified version of the SIR model for the time evolution of COVID-19, motivated by the features of the disease and the nature of available data. The model allowed the distinguishing of confirmed and unconfirmed cases, as well as a recording of the number of fatalities due to the disease. Upon verification of the identifiability of the unknown quantities within the model (based on the practically available data), we were able to use our modified PINN approach to analyze available data, with a special loss function tailored to the lack of available information.
The most striking result was the substantial improvement that our modification made, as it was not obvious that a reformulation of the governing equations should have such an impact on the network's learning. In principle, the equations of (14) and (1) contain the same information. The exact nature of this dependence has likely gone unnoticed because the basic implementation of the PINN is sufficient for most applications. This kind of circumstance, though, performing inference when some variables are unknown, is not a rare one. The PINN was originally developed to be effective for inference with sparse data [
27], but our work demonstrates that, when appropriately modified, it extends effectively even to incomplete data.
There are, of course, many further refinements that could be made to our methods. The SICRD model could always be further separated into more distinct population compartments relevant to the spread of the disease. Compartments for exposed (though not yet infectious) individuals, explicitly asymptomatic individuals, hospitalized individuals, and more could be included, and we discussed some principles on the basis of which such considerations could be explored. Additional topics that are quite relevant are the consideration of aspects of age stratification [
15,
18,
55], as well as the incorporation of the role of vaccines and immunity waning (for later time frames than the ones considered herein, during which a vaccine was available) [
9,
56,
57,
58]. We could also consider cases where confirmed individuals may still infect others, albeit at a reduced rate.
Our current framework, however, was intended for studying the development of the disease on relatively short time frames in the early pandemic stages, when the effect of immunity loss is mitigated, as individuals would most likely still retain immunity. Within this time frame, we could also disregard vaccinations, though, as stated earlier, vaccination effects could be incorporated if later times were considered. Instances of confirmed individuals spreading infection were assumed to be an exception, as most individuals followed appropriate quarantining procedures upon a positive test result.
The crucial role of the death data (or hospitalization data in an SEAHIR model [
16]) for estimation does raise the important issue of the validity of such data. The existence of underreporting, delays, and misattribution (namely, the important concern about “death from COVID” vs. “death with COVID” [
59]) can create potential issues that warrant further examination. Nevertheless, we consider it to be the most relevant starting point for the study conducted herein and certainly a far more reliable one than the total case count, as has been explained also earlier in connection to studies from the CDC; see [
18] for a relevant discussion.
Indeed, it is also relevant to point out that, as demonstrated in
Section 2.3, the unknown infected population is impossible to estimate from recorded case counts alone. Hospitalization case information has also been incorporated into other methods of disease modeling with effective results, so there is an existing precedent [
16]. Such issues exist in models of waterborne illness, where the presence of difficult-to-track pathogens in water creates similar complications to our asymptomatic cases [
34]. Further work can and should incorporate additional data to improve the model's robustness. What the present work provides is a mathematical basis and a computational implementation such that, even with a more complicated model, the PINN approach will still be reasonably applicable. Before our modification, it was not clear that a PINN could feasibly work, even with our simpler model, in the presence of multiple unknown variables.
Ultimately the simplifications made for the model, as well as for the architecture of the neural network, were taken so that the evaluation of the novel loss function could be performed with minimal complications. Future work could naturally focus on expanding both the model’s complexity as mentioned above as well as incorporating more nuanced network structures. The work of [
25] on graph neural networks gives a very promising direction for future work in both directions, with a natural extension of the model to incorporate human mobility, paired with the natural choice of a graph neural network architecture. Indeed, the study of such metapopulation models of wide appeal within the modeling of COVID-19 [
4,
21], in conjunction with some of the technical approaches and methodologies presented herein, constitutes a promising direction for future study.
The approach could also be suitably adapted to some form of recurrent neural network (RNN) as well. RNNs have already been used in disease modeling [
26] and the combination of the LSTM, a specific form of RNN, with the PINN concept has been attempted in other works with notable success [
22]. A limitation of our current work is the inability to estimate more of the unknown parameters simultaneously. The modified PINN still struggles to perform well if it must estimate the time scale parameters $T_I$ and $T_C$ in addition to $R_0$, $\alpha$, and $\mu$. There are simply too many simultaneous unknowns for the network to perform the estimation efficiently. Indeed, this issue of numerous local minima that perform equally well has been encountered in various other studies; see, e.g., [
18]. An RNN approach could be efficient enough to allow this broader parameter estimation.
A particularly relevant direction, in terms of how widely it may expand our results, would be a rigorously proven criterion for alternate PINN loss terms. We gave an informed explanation of why the choice of Equation (
14) led to better performance, but without a formalized proof. A natural future goal could be to formally prove that an alternate set of loss terms with fewer unknowns generally leads to faster or more accurate convergence. Some analysis of a network's difficulty in balancing multiple competing objectives does exist [
47], so it would be reasonable to expect a form of generalized criteria to hold. If such a proof could be made, it would allow our approach to be applied extremely broadly, with a general recipe for how to adapt it to a particular problem. It should also be added that the very recently proposed idea of incorporating causality into the loss function of PINNs [
60] may also be quite relevant to consider in the present context. In addition, one could extend the study using graph neural networks [
61].