Mitigating distribution shift in machine learning-augmented hybrid simulation

\newsiamthm
claimClaim \newsiamremarkremarkRemark \newsiamremarkhypothesisHypothesis

\headersMitigating distribution shift in MLHSZhao and Li






Mitigating distribution shift in machine learning-augmented hybrid simulation



Jiaxi Zhao
Department of Mathematics, National University of Singapore ().
jiaxi.zhao@u.nus.edu

  
Qianxiao Li
Department of Mathematics &
Institute for Functional Intelligent Materials, National University of Singapore ().
qianxiao@nus.edu.sg




Abstract
We study the problem of distribution shift generally arising in machine-learning augmented hybrid simulation,
where parts of simulation algorithms are
replaced by data-driven surrogates.
We first establish a mathematical
framework to understand the structure of machine-learning augmented hybrid simulation problems,
and the cause and effect of the associated distribution shift.
We show correlations between distribution shift and simulation error both
numerically and theoretically. Then, we propose a simple methodology based
on tangent-space regularized estimator to control the distribution shift,
thereby improving the long-term accuracy of the simulation results.
In the linear dynamics case, we provide a thorough theoretical analysis to
quantify the effectiveness of the proposed method. Moreover, we conduct
several numerical experiments, including simulating a partially known
reaction-diffusion equation and solving Navier-Stokes equations using the
projection method with a data-driven pressure solver. In all cases, we
observe marked improvements in simulation accuracy under the proposed
method, especially for systems with high degrees of distribution shift, such
as those with relatively strong non-linear reaction mechanisms,
or flows at large Reynolds numbers.


keywords: 
Machine learning,
Distribution Shift,
Regularization,
Error Analysis,
Fluid dynamics


{MSCcodes}
68T99, 65M15, 37M05





1 Introduction


Many scientific computational applications, such as computational fluid dynamics (CFD)
and molecular dynamics (MD) can be
viewed as dynamical system modeling and simulation problems, which is tackled by
rigorous numerical tools with theoretical guarantee [45, 26]. However, in many cases a part of the simulation workflow, such
as the Reynolds stresses in Reynolds-averaged Navier-Stokes equation (RANS) [1] and the exchange-correlation energies
in density function theory used to compute force-fields that drives
MD simulations [27], depends on models that are
either expensive to compute, or even unknown in practice.
One thus often resort to a hybrid simulation method,
where the known, resolved components of the dynamics are computed exactly,
while the unresolved components are replaced by
approximate, but computationally tractable models. For example, solving Navier-Stokes equations using projection method involves two steps. In the first step, all the terms except for the gradient of the pressure is used to evolve the velocity. This step is computationally cheap and thus understood as the resolved part. Next, the pressure is solved from a Poisson equation and then used to correct the velocity. Most of the computational cost is contained in solving this Poisson equation and we thereby viewed this as the unresolved part. We call such scientific computing problems with both resolved and unresolved
parts “hybrid simulation problems”. Similar problems are surveyed in [51] under the name of “Hybrid physics-DL models”.


As machine learning becomes increasingly powerful in areas like computer vision
and natural language processing, practitioners begin to use data-driven modules
to model the unresolved part to carry out the simulation.
For example, in [49]
the author replaced the numerical Poisson solver
of the unresolved part by a convolutional neural network trained using a novel
unsupervised learning framework. Then, this data-driven model is coupled with
resolved model and provide fast and realistic simulation results in 2D and 3D.
Similar ideas are used else where, e.g. RANS [28]. We
hereafter refer to this simulation procedure by machine-learning augmented
hybrid simulation (MLHS). This structure represents a great part of the scientific machine learing research [54, 46, 25] and will be the focus of this paper.
In MLHS, a common problem appears: while the data-drive model performs well on the training data, the performance quickly deteriorates when iteratively applying it in simulation, driving the dynamics to some regime which is not observed from the
training source. Empirical evidences have been observed in various applications,
such as CFD [46, 35, 9, 54], molecular
dynamics [56, 55, 25], and iterative numerical solver [2] but these problems are not
well-studied algorithmically and theoretically in the literature.


This issue has strong connections with the so-called distribution shift (DS) in
computer science applications,
especially in reinforcement learning and computer vision.
Researchers use DS to refer to problems where training and
testing distribution differ significantly. An example is an image classifier
trained with images taken in the daytime that is tested under night conditions [23]. Therefore, accuracy on the training regime itself cannot
guarantee the performance during inference. To resolve this,
researchers have systematically
developed several methods, e.g. domain adaptation [11],
ensemble learning [8], and
meta-learning [50]. However, there are key differences between the distribution shift phenomena in MLHS and that in traditional applications between the DS phenomena in MLHS and computer vision.
While distribution shift in computer vision lacks a theoretical model to describe [23], that in MLHS comes from dynamical systems for which we have abundant knowledge on the resolved parts of the models. For example, let us consider using projection method [10] to solve the incompressible Navier-Stokes equation.
Suppose we replace the Poisson pressure solver by a data-driven model based on neural
network.
Then, the dynamics of this data-driven hybrid numerical solver will
largely depend on the properties of the resolved numerical part of the
projection method, which is well-studied [10].
By analyzing the stability properties of this
dynamics, one can quantify which family of distribution shifts may arise and
resolve them according to the information of the dynamics. Therefore, one can
pose the following question: Can we use the information of resolved parts to design robust learning algorithm for unresolved parts to improve hybrid simulation?


In this paper, we first develop a mathematical framework to understand the
origin of distribution shift in MLHS and how it may lead to simulation performance deterioration. We emphasize
the difference between this instability issue in data-driven scientific computing
and distribution shift in the machine learning literature, such as reinforcement
learning and computer vision. Then, we propose an algorithm to improve the
simulation accuracy by mitigating distribution shift.
We assume by manifold hypothesis [34] that the correct
trajectories
lie on a low-dimensional manifold of the high dimensional space, e.g. the fluid
configuration of Navier-Stokes equation lies on the solution manifold of the
high-dimensional grid space. The key idea is to combine the physical information
of the resolved part of the model, i.e. Navier-Stokes equation and this manifold
structure learned from the data to form a regularization term for the
data-driven model. Intuitively speaking, this regularizer stablizes the dynamics by preventing it from moving further away the data manifold.
Such movements may result in configurations that are either
non-physical, or corresponds to different initial/boundary conditions.
Therefore, preventing such movements reduces the severity of
distribution shift, and the
data-driven model will stay relatively accurate during the simulation, which promotes high simulation fidelity in a long time interval.
In implementation, we first use an autoencoder (AE) to parameterize the underlying
data manifold. After this preprocessing, the AE is combined with the resolved dynamics and plugged into the loss
function of the model as a regularization, which prevents the simulation from
moving to unseen regimes. One then back-propagates this modified loss function
to optimize the data-driven model.
The algorithm is tested on several
representative numerical examples to demonstrate its effectiveness.
We show that the proposed approach can
improve simulation accuracy under different extents of distribution shift.
Indeed, the improvements become more significant in scenarios where the
fidelity of the simulation is highly sensitive to errors introduced in the data-driven
surrogates, such as fluid simulations at high Reynolds numbers and
reaction-diffusion equations with relatively strong non-linear reaction.
Specifically, in high Reynolds number fluid simulations, naive
data-driven surrogate models may quickly cause error blow-ups,
but our method can maintain simulation accuracy over large time intervals.


The paper is structured as follows. In Section 2, we establish a precise
framework to identify the origins of distribution shift in MLHS. In Section 3, we introduce our regularized learning algorithm motivated
from the analysis of the previously identified form of
distribution shift and theoretically understand its performance in the linear setting.
We also discuss connections with the literature on control and system
identification. In Section 4, we validate our algorithm on several
practical numerical cases, including simulating a reaction-diffusion equation and the incompressible Navier-Stokes equation.



2 Mathematical formulation of MLHS and the problem of distribution shift


In this section, we provide the mathematical formulation of MLHS and then
identify the problem of distribution shift. In the first subsection we give a
general treatment, and we provide concrete examples in Section 2.2.



2.1 The resolved and unresolved components in hybrid simulations


Throughout this paper, we consider the dynamics




 $\displaystyle\partial_{t}\mathbf{u}$ 
 $\displaystyle=\mathcal{L}(\mathbf{u},\mathbf{y},t),\quad\mathbf{u}\in\mathbb{R% }^{m},\mathbf{y}\in\mathbb{R}^{n},\mathcal{L}:\mathbb{R}^{m}\times\mathbb{R}^{% n}\times\mathbb{R}_{+}\rightarrow\mathbb{R}^{m},$ 

(1)



 $\displaystyle\mathbf{y}$ 
 $\displaystyle=\phi(\mathbf{u},t),\quad\phi:\mathbb{R}^{m}\times\mathbb{R}_{+}% \rightarrow\mathbb{R}^{n},$ 




where  $\mathbf{u}$  is the resolved state variable, e.g. fluid velocity field
or chemical concentration field.
The vector  $\mathbf{y}$  is the unresolved variable,
which drives the dynamics for  $\mathbf{u}$  but is either expensive to compute
or cannot be directly observed.
Throughout the paper, we will omit writing the explicit time-dependence
by using  $\mathbf{u}$  ( $\mathbf{y}$ ) to denote  $\mathbf{u}(t)$  ( $\mathbf{y}(t)$ )
whenever there is no ambiguity.
We also restrict the discussion to the finite dimension case,
but most of the analysis here will be
applicable to the more general case of  $\mathbf{u},\mathbf{y}$  belonging
to an infinite dimensional Hilbert space.
We adopt the following assumptions associated with Eq. 1


• 

1. Resolved component:  $\mathcal{L}$  is known, possibly non-linear.



• 

2. Unresolved component:  $\phi$  is either unknown, partially-known or expensive to evaluate.



Given such a system, the goal is to first obtain information on the unresolved
model and then integrate it with resolved part to simulate the whole dynamics.
This hybrid structure, which we call hybrid simulation problems
is general enough to include most of the
settings in hybrid simulation, e.g. when simulating the trajectories of
molecular dynamics
which satisfy some stochastic differential equations or
the time-evolution of fluid velocity fields
following the Navier-Stokes equation.
This paper focuses on MLHS,
a particular variation of such hybrid simulation problem
where the unresolved component is tackled by a data-driven method.
This point will be made more precise in Section 2.3.
To simplify the analysis, we adopt the forward Euler time discretization for simulation




 $\displaystyle\widehat{\mathbf{u}}_{k+1}$ 
 $\displaystyle=\widehat{\mathbf{u}}_{k}+\Delta t\mathcal{L}(\widehat{\mathbf{u}% }_{k},\widehat{\mathbf{y}}_{k}),\qquad\widehat{\mathbf{y}}_{k}=\phi(\widehat{% \mathbf{u}}_{k}),$ 

(2)



while other consistent discretizations may be analyzed in the same spirit. Here we drop the dependence of  $\mathcal{L},\phi$  on  $t$  to consider autonomous system while our later discussion and algorithm are also applicable to non-autonomous system.




2.2 Examples of hybrid simulation problems


Before moving on to distribution shift, we first provide some examples in MLHS of our interest, which are also closely related to our numerical experiments. We will present these examples according to the structure Eq. 1.


The first example is solving the 2D incompressible Navier-Stokes equation using the projection method [17, 10],
a variant of which is as follows:




 $\displaystyle\frac{\partial\mathbf{u}}{\partial t}+(\mathbf{u}\cdot\nabla)% \mathbf{u}-\nu\Delta\mathbf{u}=\nabla p,\quad\nabla\cdot\mathbf{u}=0,\quad T% \in[0,1],$ 

(3)



where  $\mathbf{u}=(u(x,y,t),v(x,y,t))^{T}\in\mathbb{R}^{2}$  is velocity and  $p$ 
pressure. Fix a regular grid size for discretization. The projection
method can be written as follows




 $\displaystyle\mathbf{u}_{k+1}$ 
 $\displaystyle=\mathbf{u}_{k}+\Delta t\mathcal{L}(\mathbf{u}_{k},p_{k})$ 

(4)




 $\displaystyle=\mathbf{u}_{k}+\Delta t(\nu\Delta\mathbf{u}_{k}-(\mathbf{u}_{k}% \cdot\nabla)\mathbf{u}_{k}-\nabla p_{k}),$ 




 $\displaystyle p_{k}$ 
 $\displaystyle=\phi(\mathbf{u}_{k})=\Delta^{-1}(\nabla\cdot\left(\nu\Delta% \mathbf{u}_{k}-(\mathbf{u}_{k}\cdot\nabla)\mathbf{u}_{k}\right)),$ 




We write  $\Delta^{-1}$  as the inverse operator for Poisson equation formally.
The resolved part is performed by first stepping the
convection and diffusion term then using pressure to correct the step, and the
unresolved part is a Poisson equation which related the unsolved state  $p_{k}$  with the
velocity  $\mathbf{u}_{k}$ .
Here,  $p_{k}$  represents the unresolved variable  $\mathbf{y}_{k}$  in our
formulation Eq. 1.


In this method, the most expensive step is the pressure computation,
which requires repeated solutions of similar large-scale linear equations.
State of the art solvers such as multigrid [4]
and conjugate gradient [43] become prohibitively expensive
when the problem size is large.
Recently, a promising direction is to replace the Poisson solver by
data-driven surrogate models. Ref. [49]
replaces the pressure calculation step by a convolutional neural network
trained with unsupervised learning by requiring the
updated velocity to have zero divergence.
Similarly, machine learning models such
as tensor-based neural networks are used in [28] to replace
classical turbulence modeling for unclosed Reynolds stresses tensor. This model
is then plugged into the Reynolds-averaged Navier-Stokes simulation to predict
the flow separation. More recently, [37] combines the idea of
reduced-order modeling with graph convolutional neural network to encode the
reduced manifold and enable fast evaluations of parametrized PDEs.


The second example is based on simulation on two grids of different sizes.
Let us continue with the
projection method setting Eq. 4.
Denote the time-evolution operator on
grid size  $n$  as  $f_{n}:\mathbf{u}_{k}^{n}\rightarrow\mathbf{u}_{k+1}^{n}$ , which becomes very
expensive to compute when  $n$  is large.
Therefore, one may wish to replace
the fine-grid solver with a coarse-grid one and add some correction.
This again reduces the computational burden for high-fidelity simulations.
This structure shows up in various scientific computing situations, such as multigrid
solvers [4], Reynolds-averaged Navier-Stokes
(RANS) [38, 1], and large eddy simulation
(LES) [59].
Suppose we have a fine grid with size  $2n\times 2n$  and a coarse grid with size
 $n\times n$ , we use superscript
 $n,2n$  to denote the field defined on these two
grids, i.e.  $\mathbf{u}_{k}^{n},\mathbf{u}_{k}^{2n}$ .
Moreover, we fix an interpolation and a restriction operator between fields on these two grids:




 $\displaystyle I_{n}^{2n}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{2n\times 2% n},\quad R_{2n}^{n}:\mathbb{R}^{2n\times 2n}\rightarrow\mathbb{R}^{n\times n}.$ 

(5)



Now, we can state the second hybrid simulation problem as follows:




 $\left\{\begin{aligned} \mathbf{u}_{k+1}^{2n}&=I_{n}^{2n}\circ f_{n}(R_{2n}^{n}% (\mathbf{u}_{k}^{2n})))+\mathbf{y}_{k}^{2n},\\ \mathbf{y}_{k}^{2n}&=\phi(\mathbf{u}_{k}^{2n}).\end{aligned}\right.$ 

(6)


In detail, the resolved component contains a solver of the evolution equation
over the coarse grid which marches forward by one time step, with field given by
a restriction of the field on the fine grid. Then, the next step field
configuration is interpolated to the fine grid, and the unresolved component
serves to correct the deviation of the field variables between
simulating on grids of different sizes.
The unresolved variables  $\mathbf{y}_{k}^{2n}$  can be calculated if we
have accurate simulation results  $\mathbf{u}_{k}^{2n}$  on the fine grid by




 $\mathbf{y}_{k}^{2n}=\mathbf{u}_{k+1}^{2n}-I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(% \mathbf{u}_{k}^{2n}))).$ 

(7)


In the literature, [21] treats the smoothing algorithm in
multigrid solver as unresolved components and other procedures such as
restriction and prolongation as resolved components. This fits into our framework
of MLHS.
They introduce a supervised loss function based on multigrid convergence theory to learn the optimal smoother and improve the convergence rate over anisotropic rotated Laplacian problems and variable coefficient diffusion problems. In [32], the authors applies neural network as non-linear regression to fit
between some key fluid features and sub-grid stresses. Going beyond this, [35] adds neural emulation to offline learning to prevent the trajectories from deviating away the ground truth. This data-driven model is plugged into hybrid simulation to test its ability of preserving coherent structure and scaling laws.




2.3 Learning the unresolved model and distribution shift


We assume that the resolved part  $\mathcal{L}$  is known,
or we at least have access to a gray box that can perform
its evaluation and compute its gradients.
While the unresolved part is unknown, we can
access data tuples  $\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}$  which are obtained either by
physical experiments or accurate but expensive numerical simulations.
The goal is hence to construct an approximate mapping
 $(\mathbf{u}_{k},t_{k})\mapsto\mathbf{y}_{k}$  from this dataset.
Viewing this as a supervised
learning task, there are many existing methods based on empirical risk
minimization [19, 20, 24]. One
of the simplest method is based on minimizing the  $L_{2}$  loss function. Given a
parameterized model  $\phi_{\theta}$ , either a classical one with unknown
parameters or a surrogate one such as a neural network, we learn from data the
optimal value of the parameter  $\theta$ , i.e.  $\widehat{\theta}=\arg\min_{\theta}\mathbb{E}_{(\mathbf{u},\mathbf{y})}\left\|% \mathbf{y}-\phi_{\theta}(\mathbf{u},t)\right\|^{2}$ . However, we
emphasize that the data assumption in learning of dynamical processes differs significantly with the
traditional i.i.d. assumption in statistical learning [20].
As the data is composed by several simulation trajectories, those belonging to the
same trajectory will have large correlation.
Therefore, it is reasonable to expect that
simple empirical risk minimization
(which many existing MLHS works employ [46, 42])
may not be the optimal method.
We will demonstrate this by using the least squares estimator
as a baseline comparison to our proposed approach Section 4.


After determining the parameters of the unresolved model,
one can perform hybrid simulation
to obtain new trajectories, i.e.




 $\displaystyle\widehat{\mathbf{u}}_{k+1}=\widehat{\mathbf{u}}_{k}+\Delta t% \mathcal{L}(\widehat{\mathbf{u}}_{k},\widehat{\mathbf{y}}_{k},t_{k}),\quad% \widehat{\mathbf{y}}_{k}=\phi_{\widehat{\theta}}(\widehat{\mathbf{u}}_{k},t_{k% }).$ 

(8)



Notice we add hat superscript to all the quantities related to this simulated dynamics to distinguish them from the ground truth  $(\mathbf{u}_{k},\mathbf{y}_{k})$ . We measure the performance of the simulator by the error along the whole trajectories, i.e.




 $\min\sum_{k=1}^{n}\left\|\mathbf{u}_{k}-\widehat{\mathbf{u}}_{k}\right\|^{2},$ 

(9)


where  $\mathbf{u}_{k}$ s are the true trajectories and  $\widehat{\mathbf{u}}_{k}$ s are the simulated one with the same initial condition.


We begin by illustrating the issue of distribution shift by analyzing of the trajectory error. For simplicity of presentation, we assume that the model hypothesis space
 $\{\phi_{\theta}:\theta\in\Theta\}$  is bias-free,
meaning that there exists  $\theta^{*}\in\Theta$ 
such that  $\phi=\phi_{\theta^{*}}$ . In the general case where the hypothesis space is universal but not closed,
this equality would be replaced by an approximate one,
but the argument follows analogously.
Comparing Eq. 1 and Eq. 8, we have





 $\displaystyle\ \widehat{\mathbf{u}}_{k+1}-\mathbf{u}_{k+1}$ 

(10)



 $\displaystyle=$ 
 $\displaystyle\ \widehat{\mathbf{u}}_{k}-\mathbf{u}_{k}+\Delta t\left(\mathcal{% L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}_{k}))-\mathcal{L}(\mathbf% {u}_{k},\phi_{\theta_{*}}(\mathbf{u}_{k}))\right)$ 





 $\displaystyle+\Delta t\left(\mathcal{L}(\widehat{\mathbf{u}}_{k},\phi_{% \widehat{\theta}}(\widehat{\mathbf{u}}_{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{% \widehat{\theta}}(\mathbf{u}_{k}))\right).$ 




The error introduced by moving one step forward is thus decomposed into two
parts: the error associated with the estimator  $\widehat{\theta}:\mathcal{L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}% _{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{\theta_{*}}(\mathbf{u}_{k}))$ , and
the other error  $\mathcal{L}(\widehat{\mathbf{u}}_{k},\phi_{\widehat{\theta}}(\widehat{\mathbf{% u}}_{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}_{k}))$ . The former resembles the error appearing in
classical statistical inference problem under the i.i.d. setting since  $\mathbf{u}_{k}$ 
follows an identical distribution of the training dataset. However, the latter is different in the sense that  $\widehat{\mathbf{u}}_{k}$  and  $\mathbf{u}_{k}$  may come from distributions of their respective driving dynamics, which are different since  $\theta_{*}\neq\widehat{\theta}$ . This may be due to noisy labels, few data points, or having an under-determined system. This difference between MLHS and statistical learning is also emphasized in [51].
This latter error is rather akin to the issue of stability
in numerical solution for partial differential equations:
although the error in each step is relatively small,
it can accumulate exponentially as time step iterates,
driving  $\widehat{\mathbf{u}}_{k}$ 
to completely different regime where we have not observed in the data  $\mathbf{u}_{k},k=1,\cdots,N$ . As a result,
the data-driven model  $\widehat{\phi}$  can no longer be trusted to be accurate,
since it is trained on data from a different distribution. This may then lead to a vicious cycle, where further deterioration of trajectory error
occurs, leading to increased discrepancy between the distributions. Consequently, one can understand the distribution
shift as a large discrepancy between the true distribution of  $\mathbf{u}_{k}$  and simulated distribution  $\widehat{\mathbf{u}}_{k}$ . This problems occur commonly in MLHS.
For example, [28, 29, 53, 39] all observe unphysical solutions when substituting Reynolds stresses models calculated from direct numerical simulation database into RANS.
Ref. [53] attributes this to the ill-conditioning of the
RANS problems and propose a method to treat Reynolds stresses implicitly,
while [30] uses coupling modes perspective to numerically study the
stability and convergence of RANS and [18] derives the
propagation error of RANS simulation error mathematically.
However, a general theoretical framework to understand
and combat distribution shift in MLHS is lacking.
The current paper aims to make progress in this direction.


Let us illustrate the problem of distribution shift using numerical examples.
We solve the 2D FitzHugh-Nagumo reaction-diffusion equation
over a periodic domain  $[0,6.4]\times[0,6.4]$ ,
whose dynamics is given by




 $\displaystyle\frac{\partial\mathbf{u}}{\partial t}$ 
 $\displaystyle=D\Delta\mathbf{u}+\phi(\mathbf{u}),\quad T\in[0,20],$ 

(11)



 $\displaystyle\phi(\mathbf{u})$ 
 $\displaystyle=\phi(u,v)=\begin{pmatrix}u-u^{3}-v+\alpha\\ \beta(u-v)\end{pmatrix}.$ 




where  $\mathbf{u}=(u(x,y,t),v(x,y,t))^{T}\in\mathbb{R}^{2}$  are two interactive
components,  $D$  is the diffusion matrix, and  $\mathbf{R}(\mathbf{u})$  is source
term for the reaction.
We fix the parameters to  $\alpha=0.01,\beta=1.0,D=\begin{pmatrix}0.05&0\\ 0&0.1\end{pmatrix}$ , where the system is known to form Turing patterns [33].
Assuming that we do not know the
exact form of the reaction term  $\phi(\mathbf{u})$ ,
the resolved part is the diffusion term and the unresolved part is the
non-linear reaction term.
The ground truth data are
calculated using the semi-implicit Crank-Nicolson scheme with full explicit
discretization for the non-linear term.
As we do not know the reaction term is pointwise in space,
we use a convolutional neural network-based model to learn it
based on ordinary least squares (OLS).
Specifically, we solve the
least squares problem  $\min_{\theta}\mathbb{E}_{\mathbf{u}}\left\|\phi_{\theta}(\mathbf{u})-\phi(% \mathbf{u})\right\|^{2}$  using Adam [22]. Moreover, we conduct an
ablation study where the unresolved model consists of the exact form of the
non-linear term as in Eq. 11, plus a random Gaussian noise whose variance
is set equal to the optimization error of the OLS estimator, which is  $10^{-4}$ 
in the following experiments.
By comparing this and the simulation
based on the OLS model, we can conclude whether or not the trajectory error is caused
mainly by the estimation error of the unresolved model, or by the shift
in distribution of the input data fed to the unresolved model.
More details on
the data generation and optimization procedure is described in the supplementary materials LABEL:numeric-setup.
We simulate this PDE with a test initial condition
sampled from the same distribution as training trajectories and compare the
results with the ground truth. We present the
comparison in Fig. 1.



Figure 1: The first three rows display the snapshots of
ground truth, simulated fields, and ablation study simulation
at time steps: 0, 100, 200, 300, 500. (d) plots the relative error and distribution shift between two flow configurations at corresponding time. The error is defined as  $\frac{\left\|\widehat{\mathbf{u}}_{t}-\mathbf{u}_{t}\right\|_{2}}{\left\|% \mathbf{u}_{t}\right\|_{2}}$  at each time step  $t$  instead of the whole trajectory as in Eq. 9.
The plotted distribution shift
roughly measures the average discrepancy between the distributions of  $\mathbf{u}_{k}$  and
 $\widehat{\mathbf{u}}_{k}$  at time step  $k$ ,
and is calculated using an autoencoder explained in Section 3.2. Although the OLS estimator and ablation-study estimator share the same error magnitude for unresolved model, their long time trajectory performance form sharp contrast, indicating the optimization error of the unresolved model could not explain the failure in long time trajectory prediction.


The first three rows display the snapshots of
ground truth, simulated fields, and ablation study simulation
at time steps: 0, 100, 200, 300, 500. (d) plots the relative error and distribution shift between two flow configurations at corresponding time. The error is defined as  $\frac{\left\|\widehat{\mathbf{u}}_{t}-\mathbf{u}_{t}\right\|_{2}}{\left\|% \mathbf{u}_{t}\right\|_{2}}$  at each time step  $t$  instead of the whole trajectory as in Eq. 9.
The plotted distribution shift
roughly measures the average discrepancy between the distributions of  $\mathbf{u}_{k}$  and
 $\widehat{\mathbf{u}}_{k}$  at time step  $k$ ,
and is calculated using an autoencoder explained in Section 3.2.
As can be observed, it does not take a very long
time for the distribution shift issue to be severe enough using ordinary least squares estimator. Comparing the OLS error and distribution shift with the those of the
ablation study, we conclude that
the estimation error is not the main cause of the trajectory error,
since the ablation-study trajectory has the same magnitude of estimation
error, but has much smaller trajectory error and distribution shift.
Moreover, in both cases the relative error of the trajectory has the same trend as the distribution shift, indicating correlation.


Indeed, this phenomenon is ubiquitous in modern
MLHS [51, 56, 46].
As a second example, we solve the Navier-Stokes equation Eq. 3 using
projection method Eq. 4 where pressure solver is replaced by a learned convolutional neural network
predictor. The predictor is optimized over data pair of fluid velocity and pressure obtained from high-fidelity simulation to attain an error of  $10^{-5}$ . Due to the complexity of projection solver and grid issue, we do not conduct ablation study in this case.



Figure 2: 
Configurations of the Navier-Stokes equation simulation with Reynolds number  $200$  at time step: 0, 60, 120, 180, 240: (a) ground truth (b) simulated fluid field using OLS estimator.
We only show velocity field over part of the domain. The dynamics blow up in the last configuration, which does not show up in the case of reaction-diffusion equation. This
can be explained by the fact that the Navier-Stokes equation is more unstable than the reaction-diffusion equation. In (c), we illustrate the error
along the trajectories, the distribution shift.



The distribution shift in Fig. 2 turns out
to be more severe when we simulate the incompressible Navier-Stokes equation. After  $160$  time steps, the fluid configuration
becomes rough and un-physical. Here, we again observe that the DS and error increase
together, suggesting correlation.
In the next section, we will make this connection precise,
and furthermore develop principled methods to control
the distribution shift, thereby improving the prediction fidelity.



3 Theoretical analysis and algorithm


In this section, we first put the distribution shift issue into a theoretical framework. Then, we introduce and analyze our algorithm in this section. We will provide rigorous
study of the linear case and use this to motivate the algorithm.



3.1 Distribution shift in linear dynamics and motivation of tangent-space regularized estimator


As there exists few tools to tackle the general non-linear dynamics
Eq. 1, we switch to the simpler case of
linear dynamics, which also capture some key features in general situations. We consider the
following hybrid simulation problem




 $\displaystyle\partial_{t}\mathbf{u}$ 
 $\displaystyle=A\mathbf{u}+B\mathbf{y},\quad\mathbf{u}\in\mathbb{R}^{m},\mathbf% {y}\in\mathbb{R}^{n},A\in\mathbb{R}^{m\times m},B\in\mathbb{R}^{m\times n}$ 

(12)



 $\displaystyle\mathbf{y}$ 
 $\displaystyle=C^{*}\mathbf{u},\quad C^{*}\in\mathbb{R}^{n\times m}.$ 




Such structures appear in many scenarios, e.g. finite difference solution
of discretized linear PDEs and linear control problems.
Alternatively, one can think
of this linear system as the linearization of the
non-linear dynamics Eq. 1 about some steady-state.


We assume that the state variables  $\mathbf{u}_{i}$ s in training data come from several
trajectories and  $\mathbf{y}_{i}$ s may be subject to some random
measurement error, i.e.  $\mathbf{y}_{i}=C^{*}\mathbf{u}_{i}+\epsilon_{i}$ ,
or in matrix form  $Y=C^{*}\mathbf{U}+\epsilon$ .
The key
point is that these trajectories may belong to a low-dimensional
subspace in the high-dimensional state space.
This is the case for many scientific computing problems, such as
computational fluid dynamics [38], and is
also known more generally as the manifold hypothesis [34].
We use  $V\subset\mathbb{R}^{m}$  to denote the subspace that
contains all the training data  $\mathbf{u}_{i}$ . In this setting, the appearance of this
low-dimensional structure may be caused by two different situations. In the
first situation, the initial value of  $\mathbf{u}$  is supported on the subspace
spanned by several eigenvectors of the evolution operator  $e^{(A+BC^{*})}$ , i.e.




 $\displaystyle\mathbf{u}_{0}\in\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{% 2},\cdots,\mathbf{v}_{l}\},\quad e^{(A+BC^{*})}\mathbf{v}_{1}=\lambda_{1}% \mathbf{v}_{1},\cdots,e^{(A+BC^{*})}\mathbf{v}_{l}=\lambda_{l}\mathbf{v}_{l},$ 

(13)


Then, all the training data  $\mathbf{u}_{i}$  belong to this subspace, i.e.  $V=\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l}\}$ . The second situation
is that the dynamics is degenerate in the sense that the
evolution matrix  $A$  is not of full rank.
This will not
appear in the differential formulation as infinitesimal transformations
in the form  $e^{(A+BC^{*})\Delta t}$  are always
non-degenerate. However, for a discrete dynamics




 $\mathbf{u}_{k+1}=A\mathbf{u}_{k}+B\mathbf{y}_{k},\quad\mathbf{y}_{k}=C^{*}% \mathbf{u}_{k},$ 

(14)


if the matrix  $A+BC^{*}$  is degenerate, then all the data  $\mathbf{u}_{k}$  will belong to
the range of this operator  $A+BC^{*}$ , which is a proper subspace.
Most cases of scientific computing applications belong
to the first class.
Due to the degeneracy in the training data,
the least squares solution is not unique.
Hence, various empirical regularizations (e.g.  $\ell^{2}$  regularization)
can be introduced to obtain an estimator with desirable properties.


Now, suppose one obtains an estimator  $\widehat{C}$ , one can calculate
several error metrics.
The first is the test error




 $l_{OLS}(\widehat{C})=\mathbb{E}\left\|(\widehat{C}-C^{*})\mathbf{u}\right\|^{2},$ 

(15)


where  $\mathbf{u}$  is sampled from the low dimensional subspace  $V$  and follows the
same distribution as the training data.
We refer to this as statistical estimation error.
However, we are actually interested in the error
of the simulated dynamics




 $\displaystyle\partial_{t}\widehat{\mathbf{u}}=A\widehat{\mathbf{u}}+B\widehat{% \mathbf{y}},\quad\widehat{\mathbf{y}}=\widehat{C}\widehat{\mathbf{u}},$ 

(16)



compared to the ground truth  $\mathbf{u}$  with the same initial condition.
The time evolution of their difference is given by




 $\displaystyle\partial_{t}(\widehat{\mathbf{u}}-\mathbf{u})=$ 
 $\displaystyle\ (A+B\widehat{C})\widehat{\mathbf{u}}-(A+BC^{*})\mathbf{u}$ 

(17)



 $\displaystyle=$ 
 $\displaystyle\ A(\widehat{\mathbf{u}}-\mathbf{u})+B(\widehat{C}\widehat{% \mathbf{u}}-\widehat{C}P_{V}\widehat{\mathbf{u}}+\widehat{C}P_{V}\widehat{% \mathbf{u}}-\widehat{C}\mathbf{u}+\widehat{C}\mathbf{u}-C^{*}\mathbf{u})$ 




 $\displaystyle=$ 
 $\displaystyle\ (A+B\widehat{C}P_{V})(\widehat{\mathbf{u}}-\mathbf{u})+B(% \widehat{C}-C^{*})\mathbf{u}+B\widehat{C}(\widehat{\mathbf{u}}-P_{V}\widehat{% \mathbf{u}}),$ 




where  $P_{V}$  is the orthogonal projection onto the data subspace  $V$  and we have
 $P_{V}\mathbf{u}=\mathbf{u}$  since  $\mathbf{u}$  belongs to the data subspace. As derived in Eq. 17, the overall error of the trajectory during evolution
can be decomposed into three parts. In the following we make a detailed
analysis of each term.


The first term can be viewed as an amplitude factor of the error propagation and
can be easily bounded by




 $\left\|(A+B\widehat{C}P_{V})(\widehat{\mathbf{u}}-\mathbf{u})\right\|_{2}\leq% \left\|A+B\widehat{C}P_{V}\right\|_{2}\left\|\widehat{\mathbf{u}}-\mathbf{u}% \right\|_{2}.$ 

(18)


The operator  $A+B\widehat{C}P_{V}$  can be viewed as a posterior estimation of the error propagation and a stability constant of the simulated dynamics.


The second term is simply the statistical estimation error of the estimator
 $\widehat{C}$  defined in Eq. 15. Given that the initial
condition  $\mathbf{u}_{0}$  of the training and test trajectories are sampled from the
same distribution, this term is well-bounded in classical statistical learning
theory [20], provided the number of data tuples is
sufficiently large.
In particular, this error has nothing to do with distribution shift.


Now we move to the last part  $B\widehat{C}(\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}})$ ,
which can be more easily understood from a geometric viewpoint.
The last factor  $(\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}})$  measures how far the simulated trajectory is away from the data subspace
 $V$ . In other words,
this term measures how far the simulated trajectory’s state distribution “shifts”
from the true distribution.
This is exactly the term we want to obtain to
control the distribution shift in simulation.
Meanwhile, this term forms sharp
contrast with previous two terms in the sense that it is not automatically
bounded in most algorithm for estimating  $\widehat{C}$ .
We will illustrate this point
further by deriving a error bound of the simulated dynamics, which shows that it depends sensitively on this
distribution shift term. Thus, we can improve prediction accuracy if we combat such distribution shifts.


Now, taking the expectation of the inner product between  $\widehat{\mathbf{u}}-\mathbf{u}$  and Eq. 17 and applying Gronwall’s lemma one obtains the following bound on the error propagation.





 $\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}(T)-\mathbf{u}(T)\right\|_{2% }^{2}\leq$ 

(19)




 $\displaystyle\frac{\left\|B\right\|_{2}^{2}\mathbb{E}\left\|(\widehat{C}-C^{*}% )\mathbf{u}\right\|^{2}+\left\|B\widehat{C}\right\|_{2}^{2}\sup_{t\leq T}\left% \|\widehat{\mathbf{u}}(t)-P_{V}\widehat{\mathbf{u}}(t)\right\|_{2}^{2}}{\left% \|A+B\widehat{C}P_{V}\right\|_{2}^{2}+3}(e^{(\left\|A+B\widehat{C}P_{V}\right% \|_{2}^{2}+3)T}-1).$ 




An interpretation for this result is that the trajectory error can be divided
into two parts:  $\left\|B\right\|_{2}^{2}\mathbb{E}\left\|(\widehat{C}-C^{*})\mathbf{u}\right\|% ^{2}$  and  $\left\|B\widehat{C}\right\|_{2}^{2}\sup_{t}\left\|\widehat{\mathbf{u}}(t)-P_{V% }\widehat{\mathbf{u}}(t)\right\|_{2}^{2}$ . The first error is the statistical estimation error of unresolved component,
which can be reduced by better training algorithms, better models, or more data etc.
This is the familiar term one faced in supervised learning under
common i.i.d. setting. The second error relates to the distribution
shift since it is measured by the difference between the sampled trajectories and
underlying data manifold, i.e.  $\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|=\operatorname{% dist}(\widehat{\mathbf{u}},V)$ . Moreover, this shift is caused by the interaction
between the estimator of unresolved model and the resolved model dynamics. Specifically, the estimator makes some error in prediction and this error is propagated via the
resolved model dynamics and then fed into the next round of hybrid simulation.
This distribution shift in MLHS is far different from the one in computer
science such as covariate shift and label shift. In computer vision, the cause
of distribution shift may be extrinsic [23], i.e. the pictures in
training set are all taken at daytime while those for testing are all taken at
night.
Such shifts are hard to model, so their resolution
tends to depend on data augmentation and related techniques,
e.g. the Dagger algorithm [41].
Instead of changing the optimization
problem, the data source is modified. This shares much similarity with
adversarial training, which also add more data to the training set to make the
prediction robust under adversarial attack, a typical distribution shift in the
area of computer vision [15].
In contrast, in our setting the distribution shift is intrinsically
driven by the hybrid simulation structure of which we have partial knowledge.
Hence, we can quantify this distribution shift and design specific
algorithm to mitigate them.


Returning to the linear problem,
in order to guarantee that the error  $\widehat{\mathbf{u}}-\mathbf{u}$  is bounded
over the simulated trajectory, a natural choice is to use a regularization for
 $\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|_{2}$ . The most naive choice would be
 $\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|_{2}$  itself, but a problem is that  $\widehat{\mathbf{u}}$  is calculated via hybrid simulation until time  $t$ , which corresponds to a rather complicated computational graph.
This makes subsequent gradient-based optimization computational expensive. Hence, we
take the advantage of an
one-step predictor, i.e. we set




 $\widehat{\mathbf{u}}_{k+1}=\mathbf{u}_{k}+\Delta t(A\mathbf{u}_{k}+B\widehat{C% }\mathbf{u}_{k}).$ 

(20)


The  $\widehat{\mathbf{u}}_{k+1}$  is calculated based on a single step of the simulated
dynamics from the ground truth solution  $\mathbf{u}_{k}$ . Assume  $\mathbf{u}_{k}\in V$ , one has




 $\left\|\widehat{\mathbf{u}}_{k+1}-P_{V}\widehat{\mathbf{u}}_{k+1}\right\|_{2}=% \Delta t\left\|P_{V^{\perp}}(A\mathbf{u}_{k}+B\widehat{C}\mathbf{u}_{k})\right% \|_{2},$ 

(21)


and we may penalize the right hand side to promote trajectory alignment with the data manifold. Wrapping up all the ingredients, we state the loss function in the linear case as




 $\min_{C}l(C):=\min_{C}\mathbb{E}_{(\mathbf{u},\mathbf{y})}\left(\left\|\mathbf% {y}-C\mathbf{u}\right\|_{2}^{2}+\lambda\left\|P_{V^{\perp}}(A+BC)\mathbf{u}% \right\|_{2}^{2}\right),$ 

(22)


and here  $\lambda$  is a penalty strength parameter which can be
chosen during training.
We briefly discuss in the supplementary materials LABEL:lambda how to choose  $\lambda$ .


In above discussion, we have assumed that the underlying data subspace  $V$  is known a
priori or we can calculate  $P_{V^{\perp}}$  directly. However, in most cases the
only information we have is a set of resolved state variables  $\mathbf{u}_{i}$ s. Then,
we can use standard dimension reduction methods such as principle component
analysis [20] to obtain a subspace  $V$ , which can be thought
as an approximation of the data manifold. The loss objective Eq. 22 can
thus be calculated.




3.2 Algorithm in the general form


The discussion above applies to linear dynamics where the
underlying data manifold is also linear.
In this subsection we generalize it to non-linear cases and
state the general form of our algorithm.


Let us return to our original formulation of the task Eq. 1,
where we have a dataset  $\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}$ .
As before, we assume that the resolved variable lies on a low-dimensional manifold  $\mathcal{M}$  in
the configuration space and the estimator  $\phi_{\theta}(\mathbf{u},t)$  achieves high accuracy along this whole manifold while we have no guarantee outside the
manifold. The key difference between the general case and the linear case is that this
manifold may not be linear and we do not have an orthogonal decomposition.


To resolve this, we leverage an autoencoder to represent the
data manifold structure, i.e. we learn two neural networks
 $E:\mathbb{R}^{m}\rightarrow\mathbb{R}^{l}$ 
(encoder) and  $D:\mathbb{R}^{l}\rightarrow\mathbb{R}^{m}$  (decoder)
to represent the manifold.
The encoder maps the state variable  $\mathbf{u}$  into a low dimensional
space and decoder maps them back to original space. We optimize  $E,D$ 
simultaneously to minimize the reconstruction error




 $\min_{E,D}\frac{1}{N}\sum_{i=1}^{N}\left\|\mathbf{u}_{i}-D(E(\mathbf{u}_{i}))% \right\|_{2}^{2}.$ 

(23)


Here, we only use the resolved states  $\mathbf{u}_{i}$  from the dataset.
After training the autoencoder, we obtain a function




 $F:\mathbb{R}^{m}\rightarrow\mathbb{R}_{\geq 0},\quad F(\mathbf{u})=\left\|% \mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2},$ 

(24)


which implicitly parametrizes the data manifold via  $F(\mathbf{u})=0,\forall\mathbf{u}\in\mathcal{M}$ .
Notice that the linear version of autoencoder is exactly the principle component
analysis we mentioned to identify linear subspace  $V$ .
Such techniques to identify data manifolds have been used
in improving robustness against adversarial
attacks [5, 6].
Besides parametrizing the data manifold,
 $F$  can be used as an indicator of
distribution shift, i.e. for any state variable  $\mathbf{u}$ ,
we use  $F(\mathbf{u})=\left\|\mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2}$ 
to measure its distribution shift w.r.t. the
training data. This is used in all the figures in this paper to quantify
distribution shift. Strictly speaking, training only ensures that the learned
manifold is contained in the zero level-set of  $F$ .
Thus, the gradient of  $F$  belongs to the normal space
of the manifold  $\mathcal{M}$  at the point  $\mathbf{u}$ , i.e.
 $\nabla F\in\mathcal{M}^{\perp}$ . Minimizing the inner product with  $\nabla F$  now
means partially minimizing the orthogonal components, which is the cause of the distribution shift in our formulation.
This strictly generalizes the linear case Eq. 22, where the data subspace  $V$  is given by  $\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l}\}$ . Augmenting this basis to obtain an orthonormal basis
 $\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l},\mathbf{v}_{l+1},\cdots,% \mathbf{v}_{m}\}$  for the whole
space, a choice of the function  $F$  is given by




 $F(\mathbf{u})=\frac{1}{2}\sum_{i=l+1}^{m}(\mathbf{u}^{T}\mathbf{v}_{i})^{2},$ 

(25)


and the gradient of  $F$  is




 $\nabla F(\mathbf{u})=\left(\sum_{i=l+1}^{m}\mathbf{v}_{i}\mathbf{v}_{i}^{T}% \right)\mathbf{u},$ 

(26)


which is just the orthogonal projection onto  $V^{\perp}=\operatorname{span}\{\mathbf{v}_{l+1},\mathbf{v}_{l+2},\cdots,% \mathbf{v}_{m}\}.$ 
We can state the regularized loss function of our algorithm in general case:




 $\min_{\theta}l(\theta):=\min_{\phi_{\theta}}\frac{1}{N}\sum_{i=1}^{N}\left(% \left\|\mathbf{y}_{i}-\phi_{\theta}(\mathbf{u}_{i})\right\|_{2}^{2}+\lambda% \left(\left(\nabla F(\mathbf{u}_{i})\right)^{T}\mathcal{L}(\mathbf{u}_{i},\phi% _{\theta}(\mathbf{u}_{i}),t_{i})\right)^{2}\right).$ 

(27)


The first term above is the least squares term and the second term measures how far the estimated tangent vector  $\partial_{t}\mathbf{u}_{i}=\mathcal{L}(\mathbf{u}_{i},\phi_{\theta}(\mathbf{u}% _{i}),t_{i})$  deviates from the tangent space of  $\mathcal{M}$  at  $\mathbf{u}_{i}$ . In the second term,  $F(\mathbf{u})=\left\|\mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2}$  and  $D,E$  are optimized via
Eq. 23 and frozen during the training of  $\phi_{\theta}$ . Since the loss function is regularized by a term which approximately measures the deviation of the tangent vector  $\mathcal{L}(\mathbf{u}_{i},\phi_{\theta}(\mathbf{u}_{i}),t_{i})$  from the tangent space of the data manifold, we name our algorithm “MLHS with tangent-space regularized estimator”. The overall algorithm can be summarized in Algorithm 1.


Algorithm 1  MLHS with tangent-space regularized estimator


0:   $\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}$ , resolved model, penalty strength  $\lambda$ .



1:  Learn a parameterized model which encode the structure of training data, i.e.  $F_{\eta}(\mathbf{u})\geq 0,F_{\eta}(\mathbf{u}_{k})=0.$  



2:  Freeze the parameters of this learned model.



3:  Introduce another surrogate model  $\phi_{\theta}(\mathbf{u})$ .



4:  for  $k=1,2,\cdots,N$  do





5:     Predict the control variable  $\widehat{\mathbf{y}}_{k}=\phi_{\theta}(\mathbf{u}_{k})$ .



6:     Calculate the state variable after one-step iteration, i.e.  $\widehat{\mathbf{u}}_{k+1}=\mathbf{u}_{k}+\Delta t\mathcal{L}(\mathbf{u}_{k},% \widehat{\mathbf{y}}_{k},t_{k})$ .



7:     Form the loss  $l(\theta)=\mathbb{E}\left[\left\|\mathbf{y}_{k}-\widehat{\mathbf{y}}_{k}\right% \|_{2}^{2}+\lambda\left(\left(\nabla F(\mathbf{u}_{k})\right)^{T}\mathcal{L}(% \mathbf{u}_{k},\widehat{\mathbf{y}}_{k}),t_{k})\right)^{2}\right]$ .



8:     Backpropogate to update  $\theta$ .



9:  end for


9:  tangent-space regularized estimator:  $\phi_{\theta}$ .





Comparing to literature of distributionally
robust optimization (DRO)
[40] and numerical analysis of dynamical systems
[48, 47], our work takes advantages from both.
First, the tangent-space regularized estimator belongs to the family of DRO in
the sense that it minimizes the data-driven module of the unresolved part over
some perturbations of the data distribution.
However, unlike the original DRO which
considers all possible perturbations of the data distribution
in its neighborhood under some metric, e.g. Wasserstein metric  [3],
our regularization focus
on those perturbations caused by the resolved components of the dynamics.
These facilitate us to take advantage of the partial
knowledge of the dynamics.
As a result, the regularizer we proposed is more problem-specific
Moreover, from the perspective of DRO, our method takes into
consideration the behavior of estimator under certain
perturbation normal to the
data manifold, which is much more tractable than guarding against
arbitrary perturbations in distribution space. In Section 4, we will implement other benchmarks with general regularizations and compare with our algorithm to illustrate the benefit of combining resolved model information into regularization.


Furthermore, we can understand our method through the lens of the stability
in numerical analysis of differential equations.
The classical result that the least squares estimator converges
to the ground truth in the limit of large datasets
can be understood as a consistency statement.
Here, we show that this is insufficient, and in order to
ensure convergence of the simulated trajectories
one needs to also promote stability,
and our proposed regularizer serves precisely this role.
This regularization approach should be contrasted with recent works [54, 7, 25, 16] on
learning dynamical models from data which build stability
by specifying the model architecture.




3.3 Interpretation and analysis of the algorithm


Let us now quantify in the linear case, the gains of our algorithm over
the ordinary least squares estimator.
We first calculates the exact formula for two estimators.
Due to the low rank structure, OLS algorithm allows
infinitely many solutions.
Throughout this section, we make the common choice
of the minimum 2-norm solution.



Proposition 3.1.


Consider the linear dynamics Eq. 12, the training data of the variable
 $\mathbf{u}$  is arranged into a data matrix  $\mathbf{U}\in\mathbb{R}^{m\times N}$ .
Moreover, we assume the column space of  $\mathbf{U}$  is contained in a subspace
 $V\subset\mathbb{R}^{m}$  with associated projection operator  $P_{V}$ .
Then, the OLS estimator
and tangent-space regularized estimator for the unresolved component  $Y=C^{*}\mathbf{U}+\epsilon$  have the following form:




 $\displaystyle\widehat{C}_{OLS}=$ 
 $\displaystyle\ C^{*}P_{V}+\epsilon\mathbf{U}^{\dagger},$ 

(28)



 $\displaystyle\widehat{C}_{TR}=$ 
 $\displaystyle\ (\mathbf{I}+\lambda B^{T}P_{V^{\perp}}B)^{-1}(C^{*}P_{V}+% \epsilon\mathbf{U}^{\dagger}-\lambda B^{T}P_{V^{\perp}}AP_{V}).$ 








Specifically, in the noiseless scenarios,  $\widehat{C}_{OLS}$  recovers  $C^{*}$  on the subspace  $V$  but vanishes on its orthogonal complement. In order to interpret our estimator, let us consider a simpler case with  $B=\mathbf{I},A=\mathbf{0}$ . This corresponds to the dynamics




 $\partial_{t}\mathbf{u}=C^{*}\mathbf{u},$ 

(29)


and the regularized estimator is given by




 $\widehat{C}_{TR}=(\mathbf{I}+\lambda P_{V^{\perp}})^{-1}(C^{*}P_{V}+\epsilon% \mathbf{U}^{\dagger})=(\mathbf{I}+\lambda P_{V^{\perp}})^{-1}\widehat{C}_{OLS}.$ 

(30)


Therefore, our estimator performs a weighted least squares regression which penalizes
the direction perpendicular to the data subspace  $V$ , as controlled by  $\lambda$ .
As the error along this normal direction is the main cause of the distribution shift,
it is sensible to reduce it more than the error along the tangent direction,
which cause deviations inside the manifold.
This shows that the tangent-space regularized estimator does in fact mitigate distribution shift.


Next, we move on to analyze the accuracy gain of the whole
simulation algorithm by proving a bound for the error
of the simulation trajectory.
Denote the statistical error of tangent-space regularized estimator by




 $l_{TR}(\widehat{C})=\mathbb{E}\left(\left\|(C^{*}-\widehat{C})\mathbf{u}\right% \|_{2}^{2}+\lambda\left\|P_{V^{\perp}}(A+B\widehat{C})\mathbf{u}\right\|_{2}^{% 2}\right)$ 

(31)


We define the following function which helps simplify the notation




 $Q(r,T)=\frac{e^{rT}-1}{r},\quad r,T>0.$ 

(32)





Theorem 3.2.


Under the same setting as Proposition 3.1, assume both the OLS estimator
and TR estimator  $\widehat{C}_{OLS},\widehat{C}_{TR}$  are optimized to
have error bounded by  $\delta$  so that  $l_{OLS}(\widehat{C}_{OLS}),l_{TR}(\widehat{C}_{TR})<\delta$  in Eq. 15 and Eq. 31. Suppose the
true trajectory is simulated from an initial condition  $\mathbf{u}(0)\in V$  which
follows the same distribution of the training data. Then, the errors of OLS
and our algorithm are bounded respectively by




 $\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}_{TR}(T)-\mathbf{u}(T)\right% \|^{2}\leq$ 
 $\displaystyle\delta Q(\left\|A+B\widehat{C}_{TR}P_{V}\right\|^{2}+3,T)$ 

(33)




 $\displaystyle\times\left(\left\|B\right\|^{2}+\frac{Q(\left\|P_{V^{\perp}}(A+B% \widehat{C}_{TR})\right\|^{2}+2,T)}{\lambda}\right),$ 




 $\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}_{OLS}(T)-\mathbf{u}(T)% \right\|^{2}$ 
 $\displaystyle\leq\delta Q(\left\|A+B\widehat{C}_{OLS}P_{V}\right\|^{2}+\left\|% B\widehat{C}_{OLS}\right\|^{2}+3,T)\left\|B\right\|^{2}.$ 








The proof idea is illustrated in deriving Eq. 19 and we
leave step-by-step derivations in the supplementary materials Section 3.
To understand this
result, let us focus on the exponential part of two error bounds as this will
dominate the error when the simulation time  $t$  is large. The exponent of our
algorithm is bounded by  $(\left\|A+B\widehat{C}_{TR}P_{V}\right\|^{2}+3)t$  provided
that the penalty parameter  $\lambda$  is large enough. This is related to the
intrinsic stability constant associated with the dynamics, underlying manifold. If we assume that two terms
 $\left\|A+B\widehat{C}_{OLS}P_{V}\right\|,\left\|A+B\widehat{C}_{TR}P_{V}\right\|$  are close to each other, this suggests that our algorithm may have a
slower error accumulation rate for large  $t$ . As the training data lies on a
subspace  $V$  of the whole state space, we do not have any bound on the behavior
of  $\widehat{C}_{OLS}$  outside this subspace. For example, the orthogonal complement
 $V^{\perp}$  may contain an eigenspace of  $B\widehat{C}_{OLS}$  with a large
eigenvalue. If this is the case,  $\left\|B\widehat{C}_{OLS}\right\|$  may be
significantly larger than  $\left\|A+B\widehat{C}_{TR}P_{V}\right\|$ , and the improvement of our algorithm may be more promising. In supplementary material LABEL:fig:linear-cmp, we provide a toy example which illustrates the effectiveness of our method comparing to OLS.



4 Numerical experiments


In this section, we present numerical experiments which illustrate the
effect of the tangent-space regularized algorithm presented in Algorithm 1.
We use the experiments to illustrate two phenomena.
The first demonstrates the effectiveness of our algorithm, for which
we test on the reaction-diffusion equation and the incompressible
Navier-Stokes equation, and compare with benchmark algorithms using least
squares, as well as simple regularizations that do not account
for the structure of the resolved dynamics or the data manifold.
For the second goal, we show that our method brings more significant
improvements in problems with severe distribution shifts.
This is demonstrated on the same family of PDEs, i.e. reaction-diffusion equation with
varying diffusion coefficient and Navier-Stokes equation with different Reynolds numbers.
These varying parameters can be understood - as we show numerically - as
quantitative indicators of the severity of distribution shift. The implementation of our method and experiment reproduction is found in
the repository [57].



4.1 Tangent-space regularized algorithm improves simulation accuracy


We first show that the proposed algorithm can be advantageous on several
prototypical machine learning augmented scientific computing applications.


In the first experiment, we use the FitzHugh-Nagumo reaction-diffusion Eq. 11 equation as the test case of our algorithm. The resolved-unresolved structure is the coarse-fine grid correction structure discussed in Section 2.2. This helps us obtain high-fidelity simulation results by combining a low-fidelity numerical solver with a fast neural network surrogate model for correction. Here we choose  $\alpha=0.01,\beta=1.0,D=\begin{pmatrix}\gamma&0\\ 0&2\gamma\end{pmatrix},\gamma\in\{0.05,0.10,0.15,0.20,0.25\}$  in the simulation which guarantee the existence of Turing patterns. The initial condition  $\mathbf{u}_{0}$  is generated by i.i.d. sampling from a normal
distribution:




 $u(x,y,0),v(x,y,0)\sim\mathcal{N}(0,1).$ 

(34)


This system is known to have a dynamical
spatial patterns where initial smaller
“clusters” gradually transforms and merges to form bigger cluster according to the balance between diffusion and reaction effects [33]. In literature, the importance of the sampling procedure is also emphasized and various adaptive sampling method based on error and particle methods are studied in [12, 52, 58].


Now, we explain how the training data is prepared. Recall that in
Eq. 6 we discussed the hybrid simulation structure for
a correction procedure under a pair of coarse and fine mesh:




 $\left\{\begin{aligned} \mathbf{u}_{k+1}^{2n}&=\mathcal{L}(\mathbf{u}_{k}^{2n},% \mathbf{y}_{k}^{2n})=I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(\mathbf{u}_{k}^{2n})))+% \mathbf{y}_{k}^{2n},\\ \mathbf{y}_{k}^{2n}&=\phi(\mathbf{u}_{k}^{2n}),\end{aligned}\right.$ 

(35)


where the mapping  $\phi$  between the fine grid state  $\mathbf{u}_{k}^{2n}$  to the correction term  $\mathbf{y}_{k}^{2n}$  is the unresolved model we desire to learn. Here  $I_{n}^{2n},R_{2n}^{n}$  are fixed restriction and interpolation operators
between two meshes, i.e.





 $\displaystyle R_{2n}^{n}:\mathbb{R}^{2n\times 2n}\rightarrow\mathbb{R}^{n% \times n},\quad\mathbf{u}_{ij}^{n}=\frac{\mathbf{u}_{2i,2j}^{2n}+\mathbf{u}_{2% i,2j+1}^{2n}+\mathbf{u}_{2i+1,2j}^{2n}+\mathbf{u}_{2i+1,2j+1}^{2n}}{4},$ 

(36)




 $\displaystyle I_{2n}^{n}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{2n\times 2% n},\quad\mathbf{u}_{2i,2j}^{2n}=\mathbf{u}_{2i,2j+1}^{2n}=\mathbf{u}_{2i+1,2j}% ^{2n}=\mathbf{u}_{2i+1,2j+1}^{2n}=\mathbf{u}_{ij}^{n}.$ 




To obtain training data for  $\mathbf{u}_{n}^{t},\mathbf{u}_{2n}^{t}$ , we
simulate Eq. 11 (with the same initial condition)
on two meshes of sizes  $64\times 64$  and  $128\times 128$ 
respectively,
with time step  $\Delta t=0.01,N_{\text{step}}=T/\Delta t=100$ ,
using a second order Crank-Nicolson scheme. It remains to calculate the labels for the training data,
i.e.  $\mathbf{y}_{k}^{2n}=\phi(\mathbf{u}_{k}^{2n})$ .
These are obtained by using the first half of the Eq. 36,
i.e.





 $\mathbf{y}_{n}^{t}=\mathbf{u}_{2n}^{t+1}-I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(% \mathbf{u}_{2n}^{t}))),$ 

(37)


Finally, we pair them
together to form the training data  $\left\{(\mathbf{u}_{n}^{1},\mathbf{y}_{n}^{1}),\cdots,(\mathbf{u}_{n}^{N},% \mathbf{y}_{n}^{N})\right\}$ .
The data here is different from the one introduced in Section 2.3 since the
system is autonomous, so we do not include time indices.
To summarize, the training data
contains two patches of field component patterns
of size  $100\times 128\times 128$  corresponding to
 $\mathbf{u},\mathbf{v}$  respectively and two patches  $\mathbf{y}_{u},\mathbf{y}_{v}$  of the same shape and generated
according to Eq. 36.


The baseline algorithms we implemented to compare with ours is  $\phi_{\text{OLS}}$ ,
which minimizes the mean-squared error on the training data, i.e.




 $\phi_{\text{OLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{0}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2},$ 

(38)


where we use  $\rho_{0}$  to denote the distribution of the field variable generated
by certain simulation condition Eq. 34. The corresponding numerical results are labelled by “OLS”.



Figure 3: 
Simulation of the reaction-diffusion equation and configurations at time step: 0, 100, 200, 300, 500, (a) ground truth; (b) simulated fluid field using OLS estimator; (c) simulated fluid field using TR estimator. (d) is the comparison of the ordinary least squares and regularized one on reaction-diffusion equation: the solid line represents the averaged error along the trajectory and the dash line represented the averaged distribution shift calculated using Eq. 24. The shadow part is the standard deviation in ten random experiments with different training and testing data. Both the error and the distribution shift increase at a lower speed in TR simulation than in OLS simulation. This suggests that the better performance of TR algorithm in reaction-diffusion equation is partially caused by improving the distribution shift issue.


We summarize the comparison results in Fig. 3.
Here, solid lines represent the errors along the trajectories
and dashed lines represent the distribution shift.
The shaded part is the standard deviation over ten random experiments
with different split of the training and testing data.
As can be observed, compared to our method the OLS baseline incur
greater errors and distribution shifts.
Moreover, this difference increases with time.
In contrast, consistent with our analysis in the simplified setting,
our algorithm mitigates both the distribution shift and
the trajectory error.


Next, we consider the Navier-Stokes equation Eq. 3 as the test case of our algorithm Algorithm 1. In this case, the resolved model is an explicit scheme to evolve the velocity field while unresolved model is calculated via a fast CNN surrogate, which can accelerate the simulation for large grid system. The simulation domain is rectangular with aspect ratio  $1/4$ . On the upper and
lower boundary we impose no-slip boundary condition on the velocity and on the
inlet, i.e.  $x=0$  we specify the velocity field  $\mathbf{u}$ . Moreover, at the outlet, i.e.  $x=n$  we pose zero-gradient
conditions on both the horizontal velocity and pressure, i.e.  $\frac{\partial p}{\partial n}=\frac{\partial u}{\partial n}=0$ . The inflow boundary is defined according to following
functional forms:




 $\mathbf{u}(0,y,t)=\begin{pmatrix}u(0,y,t)\\ v(0,y,y)\end{pmatrix}=\begin{pmatrix}\exp\{-50(y-y_{0})^{2}\}\\ \sin t\cdot\exp\{-50(y-y_{0})^{2}\}\end{pmatrix}$ 

(39)


The training configuration is calculated using the projection method
[10] with staggered grid, i.e. the pressure is placed at
the center of each cell and horizontal (vertical) velocity is placed at
vertical (horizontal) edge of each cell.
The jet location  $y_{0}$  is selected from  $0.3$  to  $0.7$  uniformly.


Again, we compare the performance of our algorithm and the OLS method. As
illustrated in Fig. 4, naively using OLS to estimate the unresolved models
will generally lead to error blow-up, whereas using our regularized
estimator we can obtain a reasonable simulation along the whole time period.
Moreover, the trend of the trajectory error fits that of the distribution
shift. In particular, the error and distribution shift blow-ups occur simultaneously.
Our algorithm again decreases the trajectory error by mitigating the problem of distribution shift.



Figure 4: 
Simulation of the Navier-Stokes equation and configurations at time step: 0, 60, 120, 180, 240, (a)
ground truth; (b) simulated fluid velocity field using OLS estimator;
(c) simulated fluid velocity field using TR estimator;
(d) the comparison of the OLS estimator and the regularized one.
The solid lines represent the trajectory error and
the dashed lines represent distribution shift
calculated using Eq. 24.
We observe that the OLS estimator leads to error (and DS)
blow-up at around 270 time steps, while our method remains
stable.
Again, the trends of the error and the distribution shift
are highly correlated, and our method that controls
distribution shift leads to error control,
consistent with our analyses.



The machine-learning augmented NS simulations appear to have
a greater degree of distribution shift,
and our method has a relatively stronger benefit in this case.
This suggests that our method is most beneficial
in problems where the intrinsic distribution shift is severe.
In the next subsection, we will study this
more systematically by testing our method on
the reaction-diffusion equations with varying
diffusion coefficient  $\gamma$  and
Navier-Stokes equations with varying Reynolds
numbers.




4.2 Performance under distribution shift of different magnitudes


We now demonstrate concretely that our method brings
more significant improvements when the intrinsic distribution
shift is more severe.
Intuitively speaking, since our algorithm penalizes
the deviation of the simulated trajectories to the data manifold while OLS
totally ignores this, it would be reasonable to expect that our
algorithm will outperform OLS more in the problems where such deviations are large. A sanity check showing different distribution shift extents under different parameters is provided in the supplementary materials LABEL:fig:NSRe-ds.


Here, we introduce two more baselines for ablation studies. Comparing with the OLS
algorithm, these algorithms both add some regularization to the loss function
while their regularizations are general and not dynamics-specific compared to
our regularization. Consequently, any improved performance in our algorithm over
these regularized algorithms suggests that targeted regularization, which
takes into account the structure of the resolved part, effectively balances accuracy and stability.


The first baseline regularizes the original least squares objective by a term which quantifies some complexity the network model, e.g.




 $\phi_{\text{mOLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{0}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2}+\lambda\left\|\theta% \right\|_{2}^{2}.$ 

(40)


Here the naive  $L^{2}$  regularized algorithm is implemented by adjusting the weight decay [31] parameter of the Adam solver during the training process. The corresponding numerical results are labelled by “mOLS” (“m” for modified).



The second baseline adds noise to the training data during training, which can be formulated as




 $\phi_{\text{aOLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{\xi}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2}.$ 

(41)


Notice that the training data follows the distribution  $\rho_{\epsilon}=\rho_{0}*\xi$  where  $*$  denotes the convolution operator and  $\xi$  is some random noise which we
take to be Gaussian random variable with an adjustable variance. In
implementation, we generate the noise and add to the input during each epoch so
the inputs are not the same across epochs. Moreover, this noise  $\xi$  should
be distinguished from the observational noise  $\epsilon$  we introduced in
Section 3. This method is used in several previous work
[46, 36] to improve
stability and prevent the error from accumulating too fast. As this technique is
similar to adversarial attack and data augmentation [44, 14, 13], we name this estimator and algorithm
as “aOLS” (“a” stands for adversarial). We slightly abuse the terminology here by naming this random perturbation as some kind of adversarial attack.



Table 1: Performance on reaction-diffusion equation over 10 experiments.




 $\gamma=0.05$ 
 $\gamma=0.10$ 
 $\gamma=0.15$ 
 $\gamma=0.20$ 
 $\gamma=0.25$ 





grid size:  $64\times 64$ 




OLS
8.40e+01
1.04e+01
8.99e+00
2.22e+00
6.78e-01


mOLS
8.90e+00
9.22e+00
1.41e+00
2.26e+00
9.98e-01


aOLS
3.57e+01
4.00e+01
2.40e+00
1.79e+00
1.06e+00


TR
1.23e-01
1.06e-01
1.38e-01
1.85e-01
2.15e-01


Diff
99.9%
99.0%
98.5%
91.7%
68.3%





grid size:  $128\times 128$ 




OLS
4.10e+02
8.18e+01
5.11e+01
4.88e+00
2.25e+00


mOLS
2.40e+01
3.20e+01
2.20e+00
1.07e+01
1.25e+00


aOLS
1.46e+02
2.58e+02
8.69e+00
3.30e+00
1.79e+00


TR
2.79e-01
1.66e-01
2.42e-01
1.81e-01
1.91e-01


Diff
99.9%
99.8%
99.5%
96.3%
91.5%





In Table 1,
we test the algorithms under different diffusion coefficient  $\gamma=0.05$ ,
 $0.10,0.15,0.20,0.25$ . We calculate the relative trajectory error at time  $T=1000$ , i.e.  $\frac{\left\|\mathbf{u}_{k}-\widehat{\mathbf{u}}_{k}\right\|}{\left\|\mathbf{u% }_{k}\right\|}$  over ten random split of the training and test data. The last row calculates
the performance difference between our method and the OLS baseline.
We observe that under almost all the settings our method out-performs
OLS, mOLS, aOLS baselines, and comparing horizontally,
we see that the relative improvement
decreases with  $\gamma$ , as diffusion term mitigate the distribution shift.


We now consider the Navier-Stokes equation.
Since the baseline method quickly leads to error blow up,
we use another comparison criterion.
We define a stopping time
 $t_{K}=\arg\max_{t}\frac{\left\|\mathbf{u}_{t}-\widehat{\mathbf{u}}_{t}\right\|% }{\left\|\mathbf{u}_{t}\right\|}\leq K$  where  $K$  is a
error threshold to be determined during the experiments.
We calculate the first
time the trajectory error reaches a threshold under different
Reynolds numbers and different mesh sizes.
The results are shown in Table 2.
Under almost all the scenarios, our method outperforms all the baseline algorithms by a larger  $t_{K}$ . This validates the effectiveness of our algorithm. Comparing Table 2 horizontally, one finds that the improvement of our
method is also increasing with respect to the Reynolds number.
As the flow field generally becomes more complex (and sensitive)
with increasing Reynolds number,
our method is expected to bring bigger improvements.
This is consistent with our experiments.


Table 2: Comparison of stopping time  $t_{K}$  on Navier-Stokes equation over 10 experiments.




 $Re=100$ 
 $Re=200$ 
 $Re=300$ 
 $Re=400$ 
 $Re=500$ 





grid size:  $128\times 32$ 






OLS
 $435\pm 15$ 
 $301\pm 20$ 
 $240\pm 12$ 
 $177\pm 17$ 
 $120\pm 7$ 


mOLS
 $420\pm 12$ 
 $333\pm 14$ 
 $254\pm 12$ 
 $204\pm 9$ 
 $210\pm 7$ 


aOLS
 $477\pm 25$ 
 $392\pm 31$ 
 $450\pm 23$ 
 $239\pm 25$ 
 $321\pm 20$ 


TR
 $\mathbf{506\pm 18}$ 
 $\mathbf{497\pm 28}$ 
 $\mathbf{482\pm 29}$ 
 $\mathbf{442\pm 21}$ 
 $\mathbf{452\pm 13}$ 





grid size:  $256\times 64$ 




OLS
 $602\pm 38$ 
 $571\pm 35$ 
 $355\pm 25$ 
 $301\pm 11$ 
 $148\pm 22$ 


mOLS
 $623\pm 18$ 
 $531\pm 26$ 
 $489\pm 20$ 
 $417\pm 10$ 
 $298\pm 9$ 


aOLS
 $\mathbf{651\pm 48}$ 
 $601\pm 42$ 
 $\mathbf{595\pm 39}$ 
 $378\pm 31$ 
 $398\pm 32$ 


TR
 $636\pm 35$ 
 $\mathbf{602\pm 28}$ 
 $585\pm 23$ 
 $\mathbf{422\pm 35}$ 
 $\mathbf{402\pm 28}$ 






5 Conclusion


In this paper, we establish a theoretical framework for machine-learning augmented
hybrid simulation problems, where a data-driven surrogate is used to accelerate
traditional simulation methods.
We identify the cause and effect of distribution
shift, i.e. the empirically observed phenomenon that
the simulated dynamics may be driven away from the support of the training data
due to systematic errors introduced by the data-driven surrogate,
magnified by the resolved components of the dynamics.
Based on this, we propose a tangent-space
regularized algorithm for training the surrogate for the unresolved part,
which incorporates the resolved model information to
control deviations from the true data manifold.
In the case of linear dynamics, we show our algorithm is
provably better than the traditional training method based on ordinary least
squares. Then, we validate our algorithm in numerical experiments, including
Turing instabilities in reaction-diffusion and fluid flow. In both cases,
our method outperforms baselines by better mitigating distribution shift,
thus reducing trajectory error.


Acknowledgements


The research work presented is supported by the National Research Foundation, Singapore,
under the NRF fellowship (project No. NRF-NRFF13-2021-0005).


References



[1]

G. Alfonsi, Reynolds-averaged navier–stokes equations for
turbulence modeling, Applied Mechanics Reviews, 62 (2009).




[2]

S. Arisaka and Q. Li, Principled acceleration of iterative numerical
methods using machine learning, in Proceedings of the 40th International
Conference on Machine Learning, ICML’23, JMLR.org, 2023.




[3]

J. Blanchet, K. Murthy, and N. Si, Confidence regions in wasserstein
distributionally robust estimation, Biometrika, 109 (2022), pp. 295–315.




[4]

W. L. Briggs, V. E. Henson, and S. F. McCormick, A multigrid
tutorial, SIAM, 2000.




[5]

Z. Chen, Q. Li, and Z. Zhang, Towards robust neural networks via
close-loop control, arXiv preprint arXiv:2102.01862, (2021).




[6]

Z. Chen, Q. Li, and Z. Zhang, Self-healing robust neural networks
via closed-loop control, Journal of Machine Learning Research, 23 (2022),
pp. 1–54.




[7]

M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho,
Lagrangian neural networks, arXiv preprint arXiv:2003.04630, (2020).




[8]

X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, A survey on ensemble
learning, Frontiers of Computer Science, 14 (2020), pp. 241–258.




[9]

K. Duraisamy, G. Iaccarino, and H. Xiao, Turbulence modeling in the
age of data, Annual review of fluid mechanics, 51 (2019), pp. 357–377.




[10]

W. E and J.-G. Liu, Projection method i: convergence and numerical
boundary layers, SIAM journal on numerical analysis, (1995),
pp. 1017–1057.




[11]

A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, A brief
review of domain adaptation, Advances in Data Science and Information
Engineering: Proceedings from ICDATA 2020 and IKE 2020, (2021),
pp. 877–894.




[12]

Z. Gao, L. Yan, and T. Zhou, Failure-informed adaptive sampling for
pinns, SIAM Journal on Scientific Computing, 45 (2023), pp. A1971–A1994,
https://doi.org/10.1137/22M1527763,
https://doi.org/10.1137/22M1527763,
https://arxiv.org/abs/https://doi.org/10.1137/22M1527763.




[13]

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT
Press, 2016.


http://www.deeplearningbook.org.




[14]

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in
Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling,
C. Cortes, N. Lawrence, and K. Weinberger, eds., vol. 27, Curran Associates,
Inc., 2014,
https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.




[15]

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, Generative adversarial
networks, Communications of the ACM, 63 (2020), pp. 139–144.




[16]

S. Greydanus, M. Dzamba, and J. Yosinski, Hamiltonian neural
networks, Advances in neural information processing systems, 32 (2019).




[17]

J.-L. Guermond, P. Minev, and J. Shen, An overview of projection
methods for incompressible flows, Computer methods in applied mechanics and
engineering, 195 (2006), pp. 6011–6045.




[18]

X. Guo, Z. Xia, and S. Chen, Computing mean fields with known
reynolds stresses at steady state, Theoretical and Applied Mechanics
Letters, 11 (2021), p. 100244.




[19]

T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, R. Tibshirani, and
J. Friedman, Overview of supervised learning, The elements of
statistical learning: Data mining, inference, and prediction, (2009),
pp. 9–41.




[20]

T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The
elements of statistical learning: data mining, inference, and prediction,
vol. 2, Springer, 2009.




[21]

R. Huang, R. Li, and Y. Xi, Learning optimal multigrid smoothers via
neural networks, SIAM Journal on Scientific Computing, (2022),
pp. S199–S225.




[22]

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization,
arXiv preprint arXiv:1412.6980, (2014).




[23]

P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani,
W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al., Wilds: A benchmark
of in-the-wild distribution shifts, in International Conference on Machine
Learning, PMLR, 2021, pp. 5637–5664.




[24]

Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, nature, 521
(2015), pp. 436–444.




[25]

B. Lin, Q. Li, and W. Ren, A data driven method for computing
quasipotentials, arXiv preprint arXiv:2012.09111, (2020).




[26]

H.-W. Lin, E. R. Nocera, F. Olness, K. Orginos, J. Rojo, A. Accardi,
C. Alexandrou, A. Bacchetta, G. Bozzi, J.-W. Chen, et al., Parton
distributions and lattice qcd calculations: a community white paper,
Progress in Particle and Nuclear Physics, 100 (2018), pp. 107–160.




[27]

L. Lin, J. Lu, and L. Ying, Numerical methods for kohn–sham density
functional theory, Acta Numerica, 28 (2019), pp. 405–539.




[28]

J. Ling, A. Kurzawski, and J. Templeton, Reynolds averaged
turbulence modelling using deep neural networks with embedded invariance,
Journal of Fluid Mechanics, 807 (2016), pp. 155–166.




[29]

J. Ling and J. Templeton, Evaluation of machine learning algorithms
for prediction of regions of high reynolds averaged navier stokes
uncertainty, Physics of Fluids, 27 (2015), p. 085103.




[30]

Y. Liu, W. Zhang, and Z. Xia, Analysis on numerical stability and
convergence of rans turbulence models from the perspective of coupling
modes, arXiv preprint arXiv:2110.08459, (2021).




[31]

I. Loshchilov and F. Hutter, Decoupled weight decay regularization,
arXiv preprint arXiv:1711.05101, (2017).




[32]

R. Maulik, O. San, A. Rasheed, and P. Vedula, Subgrid modelling for
two-dimensional turbulence using neural networks, Journal of Fluid
Mechanics, 858 (2019), pp. 122–144.




[33]

J. D. Murray and J. D. Murray, Mathematical Biology: II: Spatial
Models and Biomedical Applications, vol. 3, Springer, 2003.




[34]

H. Narayanan and S. Mitter, Sample complexity of testing the
manifold hypothesis, Advances in neural information processing systems, 23
(2010).




[35]

C. Pedersen, L. Zanna, J. Bruna, and P. Perezhogin, Reliable
coarse-grained turbulent simulations through combined offline learning and
neural emulation, 2023, https://arxiv.org/abs/2307.13144.




[36]

T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia, Learning mesh-based simulation with graph networks, CoRR, abs/2010.03409
(2020), https://arxiv.org/abs/2010.03409,
https://arxiv.org/abs/2010.03409.




[37]

F. Pichi, B. Moya, and J. S. Hesthaven, A graph convolutional
autoencoder approach to model order reduction for parametrized pdes, arXiv
preprint arXiv:2305.08573, (2023).




[38]

S. B. Pope and S. B. Pope, Turbulent flows, Cambridge university
press, 2000.




[39]

S. V. Poroseva, J. D. Colmenares F, and S. M. Murman, On the
accuracy of rans simulations with dns data, Physics of Fluids, 28 (2016),
p. 115102.




[40]

H. Rahimian and S. Mehrotra, Distributionally robust optimization: A
review, arXiv preprint arXiv:1908.05659, (2019).




[41]

A. Ramdas, J. Chen, M. J. Wainwright, and M. I. Jordan, Dagger: A
sequential algorithm for fdr control on dags, arXiv preprint
arXiv:1709.10250, (2017).




[42]

P. Ren, C. Rao, Y. Liu, Z. Ma, Q. Wang, J.-X. Wang, and H. Sun, Physics-informed deep super-resolution for spatiotemporal data, arXiv
preprint arXiv:2208.01462, (2022).




[43]

J. R. Shewchuk, An introduction to the conjugate gradient method
without the agonizing pain, 1994.




[44]

C. Shorten and T. M. Khoshgoftaar, A survey on image data
augmentation for deep learning, Journal of Big Data, 6 (2019), p. 60,
https://doi.org/10.1186/s40537-019-0197-0,
https://doi.org/10.1186/s40537-019-0197-0.




[45]

J. P. Slotnick, A. Khodadoust, J. Alonso, D. Darmofal, W. Gropp, E. Lurie,
and D. J. Mavriplis, Cfd vision 2030 study: a path to revolutionary
computational aerosciences, 2014.




[46]

K. Stachenfeld, D. B. Fielding, D. Kochkov, M. Cranmer, T. Pfaff,
J. Godwin, C. Cui, S. Ho, P. Battaglia, and A. Sanchez-Gonzalez, Learned coarse models for efficient turbulence simulation, 2022,
https://arxiv.org/abs/2112.15275.




[47]

A. Stuart and A. R. Humphries, Dynamical systems and numerical
analysis, vol. 2, Cambridge University Press, 1998.




[48]

A. M. Stuart, Numerical analysis of dynamical systems, Acta
numerica, 3 (1994), pp. 467–572.




[49]

J. Tompson, K. Schlachter, P. Sprechmann, and K. Perlin, Accelerating eulerian fluid simulation with convolutional networks, in
International Conference on Machine Learning, PMLR, 2017, pp. 3424–3433.




[50]

J. Vanschoren, Meta-learning: A survey, arXiv preprint
arXiv:1810.03548, (2018).




[51]

R. Wang and R. Yu, Physics-guided deep learning for dynamical
systems: A survey, arXiv preprint arXiv:2107.01272, (2021).




[52]

Y. Wen, E. Vanden-Eijnden, and B. Peherstorfer, Coupling parameter
and particle dynamics for adaptive sampling in neural galerkin schemes,
arXiv preprint arXiv:2306.15630, (2023).




[53]

J. Wu, H. Xiao, R. Sun, and Q. Wang, Reynolds-averaged
navier–stokes equations with explicit data-driven reynolds stress closure
can be ill-conditioned, Journal of Fluid Mechanics, 869 (2019),
pp. 553–586.




[54]

H. Yu, X. Tian, W. E, and Q. Li, Onsagernet: Learning stable and
interpretable dynamics using a generalized onsager principle, Physical
Review Fluids, 6 (2021), p. 114402.




[55]

L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan, Deep potential
molecular dynamics: a scalable model with the accuracy of quantum mechanics,
Physical review letters, 120 (2018), p. 143001.




[56]

L. Zhang, D.-Y. Lin, H. Wang, R. Car, and E. Weinan, Active learning
of uniformly accurate interatomic potentials for materials simulation,
Physical Review Materials, 3 (2019), p. 023804.




[57]

J. Zhao and Q. Li, Code repository for mitigating distribution shift
in machine-learning augmented hybrid simulation,
https://github.com/jiaxi98/TR.




[58]

Z. Zhao and Q. Li, Adaptive sampling methods for learning dynamical
systems, in Mathematical and Scientific Machine Learning, PMLR, 2022,
pp. 335–350.




[59]

Y. Zhiyin, Large-eddy simulation: Past, present and the future,
Chinese journal of Aeronautics, 28 (2015), pp. 11–24.
	$\displaystyle\partial_{t}\mathbf{u}$	$\displaystyle=\mathcal{L}(\mathbf{u},\mathbf{y},t),\quad\mathbf{u}\in\mathbb{R% }^{m},\mathbf{y}\in\mathbb{R}^{n},\mathcal{L}:\mathbb{R}^{m}\times\mathbb{R}^{% n}\times\mathbb{R}_{+}\rightarrow\mathbb{R}^{m},$		(1)
	$\displaystyle\mathbf{y}$	$\displaystyle=\phi(\mathbf{u},t),\quad\phi:\mathbb{R}^{m}\times\mathbb{R}_{+}% \rightarrow\mathbb{R}^{n},$		(1)
$\displaystyle\mathbf{u}_{k+1}$	$\displaystyle=\mathbf{u}_{k}+\Delta t\mathcal{L}(\mathbf{u}_{k},p_{k})$	(4)
	$\displaystyle=\mathbf{u}_{k}+\Delta t(\nu\Delta\mathbf{u}_{k}-(\mathbf{u}_{k}% \cdot\nabla)\mathbf{u}_{k}-\nabla p_{k}),$
$\displaystyle p_{k}$	$\displaystyle=\phi(\mathbf{u}_{k})=\Delta^{-1}(\nabla\cdot\left(\nu\Delta% \mathbf{u}_{k}-(\mathbf{u}_{k}\cdot\nabla)\mathbf{u}_{k}\right)),$
	$\displaystyle\ \widehat{\mathbf{u}}_{k+1}-\mathbf{u}_{k+1}$	(10)
$\displaystyle=$	$\displaystyle\ \widehat{\mathbf{u}}_{k}-\mathbf{u}_{k}+\Delta t\left(\mathcal{% L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}_{k}))-\mathcal{L}(\mathbf% {u}_{k},\phi_{\theta_{*}}(\mathbf{u}_{k}))\right)$
	$\displaystyle+\Delta t\left(\mathcal{L}(\widehat{\mathbf{u}}_{k},\phi_{% \widehat{\theta}}(\widehat{\mathbf{u}}_{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{% \widehat{\theta}}(\mathbf{u}_{k}))\right).$
	$\displaystyle\frac{\partial\mathbf{u}}{\partial t}$	$\displaystyle=D\Delta\mathbf{u}+\phi(\mathbf{u}),\quad T\in[0,20],$		(11)
	$\displaystyle\phi(\mathbf{u})$	$\displaystyle=\phi(u,v)=\begin{pmatrix}u-u^{3}-v+\alpha\\ \beta(u-v)\end{pmatrix}.$		(11)
	$\displaystyle\partial_{t}\mathbf{u}$	$\displaystyle=A\mathbf{u}+B\mathbf{y},\quad\mathbf{u}\in\mathbb{R}^{m},\mathbf% {y}\in\mathbb{R}^{n},A\in\mathbb{R}^{m\times m},B\in\mathbb{R}^{m\times n}$		(12)
	$\displaystyle\mathbf{y}$	$\displaystyle=C^{}\mathbf{u},\quad C^{}\in\mathbb{R}^{n\times m}.$		(12)
$\displaystyle\partial_{t}(\widehat{\mathbf{u}}-\mathbf{u})=$	$\displaystyle\ (A+B\widehat{C})\widehat{\mathbf{u}}-(A+BC^{*})\mathbf{u}$	(17)
$\displaystyle=$	$\displaystyle\ A(\widehat{\mathbf{u}}-\mathbf{u})+B(\widehat{C}\widehat{% \mathbf{u}}-\widehat{C}P_{V}\widehat{\mathbf{u}}+\widehat{C}P_{V}\widehat{% \mathbf{u}}-\widehat{C}\mathbf{u}+\widehat{C}\mathbf{u}-C^{*}\mathbf{u})$
$\displaystyle=$	$\displaystyle\ (A+B\widehat{C}P_{V})(\widehat{\mathbf{u}}-\mathbf{u})+B(% \widehat{C}-C^{*})\mathbf{u}+B\widehat{C}(\widehat{\mathbf{u}}-P_{V}\widehat{% \mathbf{u}}),$
		$\displaystyle\mathbb{E}\left\\|\widehat{\mathbf{u}}(T)-\mathbf{u}(T)\right\\|_{2% }^{2}\leq$		(19)
		$\displaystyle\frac{\left\\|B\right\\|_{2}^{2}\mathbb{E}\left\\|(\widehat{C}-C^{*}% )\mathbf{u}\right\\|^{2}+\left\\|B\widehat{C}\right\\|_{2}^{2}\sup_{t\leq T}\left% \\|\widehat{\mathbf{u}}(t)-P_{V}\widehat{\mathbf{u}}(t)\right\\|_{2}^{2}}{\left% \\|A+B\widehat{C}P_{V}\right\\|_{2}^{2}+3}(e^{(\left\\|A+B\widehat{C}P_{V}\right% \\|_{2}^{2}+3)T}-1).$		(19)
	$\displaystyle\widehat{C}_{OLS}=$	$\displaystyle\ C^{*}P_{V}+\epsilon\mathbf{U}^{\dagger},$		(28)
	$\displaystyle\widehat{C}_{TR}=$	$\displaystyle\ (\mathbf{I}+\lambda B^{T}P_{V^{\perp}}B)^{-1}(C^{*}P_{V}+% \epsilon\mathbf{U}^{\dagger}-\lambda B^{T}P_{V^{\perp}}AP_{V}).$		(28)
$\displaystyle\mathbb{E}\left\\|\widehat{\mathbf{u}}_{TR}(T)-\mathbf{u}(T)\right% \\|^{2}\leq$	$\displaystyle\delta Q(\left\\|A+B\widehat{C}_{TR}P_{V}\right\\|^{2}+3,T)$	(33)
	$\displaystyle\times\left(\left\\|B\right\\|^{2}+\frac{Q(\left\\|P_{V^{\perp}}(A+B% \widehat{C}_{TR})\right\\|^{2}+2,T)}{\lambda}\right),$
$\displaystyle\mathbb{E}\left\\|\widehat{\mathbf{u}}_{OLS}(T)-\mathbf{u}(T)% \right\\|^{2}$	$\displaystyle\leq\delta Q(\left\\|A+B\widehat{C}_{OLS}P_{V}\right\\|^{2}+\left\\|% B\widehat{C}_{OLS}\right\\|^{2}+3,T)\left\\|B\right\\|^{2}.$
		$\displaystyle R_{2n}^{n}:\mathbb{R}^{2n\times 2n}\rightarrow\mathbb{R}^{n% \times n},\quad\mathbf{u}_{ij}^{n}=\frac{\mathbf{u}_{2i,2j}^{2n}+\mathbf{u}_{2% i,2j+1}^{2n}+\mathbf{u}_{2i+1,2j}^{2n}+\mathbf{u}_{2i+1,2j+1}^{2n}}{4},$		(36)
		$\displaystyle I_{2n}^{n}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{2n\times 2% n},\quad\mathbf{u}_{2i,2j}^{2n}=\mathbf{u}_{2i,2j+1}^{2n}=\mathbf{u}_{2i+1,2j}% ^{2n}=\mathbf{u}_{2i+1,2j+1}^{2n}=\mathbf{u}_{ij}^{n}.$		(36)
	$\gamma=0.05$	$\gamma=0.10$	$\gamma=0.15$	$\gamma=0.20$	$\gamma=0.25$
			grid size: $64\times 64$
OLS	8.40e+01	1.04e+01	8.99e+00	2.22e+00	6.78e-01
mOLS	8.90e+00	9.22e+00	1.41e+00	2.26e+00	9.98e-01
aOLS	3.57e+01	4.00e+01	2.40e+00	1.79e+00	1.06e+00
TR	1.23e-01	1.06e-01	1.38e-01	1.85e-01	2.15e-01
Diff	99.9%	99.0%	98.5%	91.7%	68.3%
			grid size: $128\times 128$
OLS	4.10e+02	8.18e+01	5.11e+01	4.88e+00	2.25e+00
mOLS	2.40e+01	3.20e+01	2.20e+00	1.07e+01	1.25e+00
aOLS	1.46e+02	2.58e+02	8.69e+00	3.30e+00	1.79e+00
TR	2.79e-01	1.66e-01	2.42e-01	1.81e-01	1.91e-01
Diff	99.9%	99.8%	99.5%	96.3%	91.5%
	$Re=100$	$Re=200$	$Re=300$	$Re=400$	$Re=500$
			grid size: $128\times 32$
OLS	$435\pm 15$	$301\pm 20$	$240\pm 12$	$177\pm 17$	$120\pm 7$
mOLS	$420\pm 12$	$333\pm 14$	$254\pm 12$	$204\pm 9$	$210\pm 7$
aOLS	$477\pm 25$	$392\pm 31$	$450\pm 23$	$239\pm 25$	$321\pm 20$
TR	$\mathbf{506\pm 18}$	$\mathbf{497\pm 28}$	$\mathbf{482\pm 29}$	$\mathbf{442\pm 21}$	$\mathbf{452\pm 13}$
			grid size: $256\times 64$
OLS	$602\pm 38$	$571\pm 35$	$355\pm 25$	$301\pm 11$	$148\pm 22$
mOLS	$623\pm 18$	$531\pm 26$	$489\pm 20$	$417\pm 10$	$298\pm 9$
aOLS	$\mathbf{651\pm 48}$	$601\pm 42$	$\mathbf{595\pm 39}$	$378\pm 31$	$398\pm 32$
TR	$636\pm 35$	$\mathbf{602\pm 28}$	$585\pm 23$	$\mathbf{422\pm 35}$	$\mathbf{402\pm 28}$