Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: bold-extra

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2401.09259v1 [math.NA] 17 Jan 2024
\newsiamthm

claimClaim \newsiamremarkremarkRemark \newsiamremarkhypothesisHypothesis


\headersMitigating distribution shift in MLHSZhao and Li






Mitigating distribution shift in machine learning-augmented hybrid simulation

Jiaxi Zhao Department of Mathematics, National University of Singapore (). jiaxi.zhao@u.nus.edu    Qianxiao Li Department of Mathematics & Institute for Functional Intelligent Materials, National University of Singapore (). qianxiao@nus.edu.sg
Abstract

We study the problem of distribution shift generally arising in machine-learning augmented hybrid simulation, where parts of simulation algorithms are replaced by data-driven surrogates. We first establish a mathematical framework to understand the structure of machine-learning augmented hybrid simulation problems, and the cause and effect of the associated distribution shift. We show correlations between distribution shift and simulation error both numerically and theoretically. Then, we propose a simple methodology based on tangent-space regularized estimator to control the distribution shift, thereby improving the long-term accuracy of the simulation results. In the linear dynamics case, we provide a thorough theoretical analysis to quantify the effectiveness of the proposed method. Moreover, we conduct several numerical experiments, including simulating a partially known reaction-diffusion equation and solving Navier-Stokes equations using the projection method with a data-driven pressure solver. In all cases, we observe marked improvements in simulation accuracy under the proposed method, especially for systems with high degrees of distribution shift, such as those with relatively strong non-linear reaction mechanisms, or flows at large Reynolds numbers.

keywords:
Machine learning, Distribution Shift, Regularization, Error Analysis, Fluid dynamics

{MSCcodes}
68T99, 65M15, 37M05



1 Introduction

Many scientific computational applications, such as computational fluid dynamics (CFD) and molecular dynamics (MD) can be viewed as dynamical system modeling and simulation problems, which is tackled by rigorous numerical tools with theoretical guarantee [45, 26]. However, in many cases a part of the simulation workflow, such as the Reynolds stresses in Reynolds-averaged Navier-Stokes equation (RANS) [1] and the exchange-correlation energies in density function theory used to compute force-fields that drives MD simulations [27], depends on models that are either expensive to compute, or even unknown in practice. One thus often resort to a hybrid simulation method, where the known, resolved components of the dynamics are computed exactly, while the unresolved components are replaced by approximate, but computationally tractable models. For example, solving Navier-Stokes equations using projection method involves two steps. In the first step, all the terms except for the gradient of the pressure is used to evolve the velocity. This step is computationally cheap and thus understood as the resolved part. Next, the pressure is solved from a Poisson equation and then used to correct the velocity. Most of the computational cost is contained in solving this Poisson equation and we thereby viewed this as the unresolved part. We call such scientific computing problems with both resolved and unresolved parts “hybrid simulation problems”. Similar problems are surveyed in [51] under the name of “Hybrid physics-DL models”.

As machine learning becomes increasingly powerful in areas like computer vision and natural language processing, practitioners begin to use data-driven modules to model the unresolved part to carry out the simulation. For example, in [49] the author replaced the numerical Poisson solver of the unresolved part by a convolutional neural network trained using a novel unsupervised learning framework. Then, this data-driven model is coupled with resolved model and provide fast and realistic simulation results in 2D and 3D. Similar ideas are used else where, e.g. RANS [28]. We hereafter refer to this simulation procedure by machine-learning augmented hybrid simulation (MLHS). This structure represents a great part of the scientific machine learing research [54, 46, 25] and will be the focus of this paper. In MLHS, a common problem appears: while the data-drive model performs well on the training data, the performance quickly deteriorates when iteratively applying it in simulation, driving the dynamics to some regime which is not observed from the training source. Empirical evidences have been observed in various applications, such as CFD [46, 35, 9, 54], molecular dynamics [56, 55, 25], and iterative numerical solver [2] but these problems are not well-studied algorithmically and theoretically in the literature.

This issue has strong connections with the so-called distribution shift (DS) in computer science applications, especially in reinforcement learning and computer vision. Researchers use DS to refer to problems where training and testing distribution differ significantly. An example is an image classifier trained with images taken in the daytime that is tested under night conditions [23]. Therefore, accuracy on the training regime itself cannot guarantee the performance during inference. To resolve this, researchers have systematically developed several methods, e.g. domain adaptation [11], ensemble learning [8], and meta-learning [50]. However, there are key differences between the distribution shift phenomena in MLHS and that in traditional applications between the DS phenomena in MLHS and computer vision. While distribution shift in computer vision lacks a theoretical model to describe [23], that in MLHS comes from dynamical systems for which we have abundant knowledge on the resolved parts of the models. For example, let us consider using projection method [10] to solve the incompressible Navier-Stokes equation. Suppose we replace the Poisson pressure solver by a data-driven model based on neural network. Then, the dynamics of this data-driven hybrid numerical solver will largely depend on the properties of the resolved numerical part of the projection method, which is well-studied [10]. By analyzing the stability properties of this dynamics, one can quantify which family of distribution shifts may arise and resolve them according to the information of the dynamics. Therefore, one can pose the following question: Can we use the information of resolved parts to design robust learning algorithm for unresolved parts to improve hybrid simulation?

In this paper, we first develop a mathematical framework to understand the origin of distribution shift in MLHS and how it may lead to simulation performance deterioration. We emphasize the difference between this instability issue in data-driven scientific computing and distribution shift in the machine learning literature, such as reinforcement learning and computer vision. Then, we propose an algorithm to improve the simulation accuracy by mitigating distribution shift. We assume by manifold hypothesis [34] that the correct trajectories lie on a low-dimensional manifold of the high dimensional space, e.g. the fluid configuration of Navier-Stokes equation lies on the solution manifold of the high-dimensional grid space. The key idea is to combine the physical information of the resolved part of the model, i.e. Navier-Stokes equation and this manifold structure learned from the data to form a regularization term for the data-driven model. Intuitively speaking, this regularizer stablizes the dynamics by preventing it from moving further away the data manifold. Such movements may result in configurations that are either non-physical, or corresponds to different initial/boundary conditions. Therefore, preventing such movements reduces the severity of distribution shift, and the data-driven model will stay relatively accurate during the simulation, which promotes high simulation fidelity in a long time interval. In implementation, we first use an autoencoder (AE) to parameterize the underlying data manifold. After this preprocessing, the AE is combined with the resolved dynamics and plugged into the loss function of the model as a regularization, which prevents the simulation from moving to unseen regimes. One then back-propagates this modified loss function to optimize the data-driven model. The algorithm is tested on several representative numerical examples to demonstrate its effectiveness. We show that the proposed approach can improve simulation accuracy under different extents of distribution shift. Indeed, the improvements become more significant in scenarios where the fidelity of the simulation is highly sensitive to errors introduced in the data-driven surrogates, such as fluid simulations at high Reynolds numbers and reaction-diffusion equations with relatively strong non-linear reaction. Specifically, in high Reynolds number fluid simulations, naive data-driven surrogate models may quickly cause error blow-ups, but our method can maintain simulation accuracy over large time intervals.

The paper is structured as follows. In Section 2, we establish a precise framework to identify the origins of distribution shift in MLHS. In Section 3, we introduce our regularized learning algorithm motivated from the analysis of the previously identified form of distribution shift and theoretically understand its performance in the linear setting. We also discuss connections with the literature on control and system identification. In Section 4, we validate our algorithm on several practical numerical cases, including simulating a reaction-diffusion equation and the incompressible Navier-Stokes equation.

2 Mathematical formulation of MLHS and the problem of distribution shift

In this section, we provide the mathematical formulation of MLHS and then identify the problem of distribution shift. In the first subsection we give a general treatment, and we provide concrete examples in Section 2.2.

2.1 The resolved and unresolved components in hybrid simulations

Throughout this paper, we consider the dynamics

t𝐮subscript𝑡𝐮\displaystyle\partial_{t}\mathbf{u}∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_u =(𝐮,𝐲,t),𝐮m,𝐲n,:m×n×+m,:formulae-sequenceabsent𝐮𝐲𝑡formulae-sequence𝐮superscript𝑚𝐲superscript𝑛superscript𝑚superscript𝑛subscriptsuperscript𝑚\displaystyle=\mathcal{L}(\mathbf{u},\mathbf{y},t),\quad\mathbf{u}\in\mathbb{R% }^{m},\mathbf{y}\in\mathbb{R}^{n},\mathcal{L}:\mathbb{R}^{m}\times\mathbb{R}^{% n}\times\mathbb{R}_{+}\rightarrow\mathbb{R}^{m},= caligraphic_L ( bold_u , bold_y , italic_t ) , bold_u ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , bold_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , caligraphic_L : blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , (1)
𝐲𝐲\displaystyle\mathbf{y}bold_y =ϕ(𝐮,t),ϕ:m×+n,:absentitalic-ϕ𝐮𝑡italic-ϕsuperscript𝑚subscriptsuperscript𝑛\displaystyle=\phi(\mathbf{u},t),\quad\phi:\mathbb{R}^{m}\times\mathbb{R}_{+}% \rightarrow\mathbb{R}^{n},= italic_ϕ ( bold_u , italic_t ) , italic_ϕ : blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ,

where 𝐮𝐮\mathbf{u}bold_u is the resolved state variable, e.g. fluid velocity field or chemical concentration field. The vector 𝐲𝐲\mathbf{y}bold_y is the unresolved variable, which drives the dynamics for 𝐮𝐮\mathbf{u}bold_u but is either expensive to compute or cannot be directly observed. Throughout the paper, we will omit writing the explicit time-dependence by using 𝐮𝐮\mathbf{u}bold_u (𝐲𝐲\mathbf{y}bold_y) to denote 𝐮(t)𝐮𝑡\mathbf{u}(t)bold_u ( italic_t ) (𝐲(t)𝐲𝑡\mathbf{y}(t)bold_y ( italic_t )) whenever there is no ambiguity. We also restrict the discussion to the finite dimension case, but most of the analysis here will be applicable to the more general case of 𝐮,𝐲𝐮𝐲\mathbf{u},\mathbf{y}bold_u , bold_y belonging to an infinite dimensional Hilbert space. We adopt the following assumptions associated with Eq. 1

  • 1. Resolved component: \mathcal{L}caligraphic_L is known, possibly non-linear.

  • 2. Unresolved component: ϕitalic-ϕ\phiitalic_ϕ is either unknown, partially-known or expensive to evaluate.

Given such a system, the goal is to first obtain information on the unresolved model and then integrate it with resolved part to simulate the whole dynamics. This hybrid structure, which we call hybrid simulation problems is general enough to include most of the settings in hybrid simulation, e.g. when simulating the trajectories of molecular dynamics which satisfy some stochastic differential equations or the time-evolution of fluid velocity fields following the Navier-Stokes equation. This paper focuses on MLHS, a particular variation of such hybrid simulation problem where the unresolved component is tackled by a data-driven method. This point will be made more precise in Section 2.3. To simplify the analysis, we adopt the forward Euler time discretization for simulation

𝐮^k+1subscript^𝐮𝑘1\displaystyle\widehat{\mathbf{u}}_{k+1}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT =𝐮^k+Δt(𝐮^k,𝐲^k),𝐲^k=ϕ(𝐮^k),formulae-sequenceabsentsubscript^𝐮𝑘Δ𝑡subscript^𝐮𝑘subscript^𝐲𝑘subscript^𝐲𝑘italic-ϕsubscript^𝐮𝑘\displaystyle=\widehat{\mathbf{u}}_{k}+\Delta t\mathcal{L}(\widehat{\mathbf{u}% }_{k},\widehat{\mathbf{y}}_{k}),\qquad\widehat{\mathbf{y}}_{k}=\phi(\widehat{% \mathbf{u}}_{k}),= over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t caligraphic_L ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_ϕ ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , (2)

while other consistent discretizations may be analyzed in the same spirit. Here we drop the dependence of ,ϕitalic-ϕ\mathcal{L},\phicaligraphic_L , italic_ϕ on t𝑡titalic_t to consider autonomous system while our later discussion and algorithm are also applicable to non-autonomous system.

2.2 Examples of hybrid simulation problems

Before moving on to distribution shift, we first provide some examples in MLHS of our interest, which are also closely related to our numerical experiments. We will present these examples according to the structure Eq. 1.

The first example is solving the 2D incompressible Navier-Stokes equation using the projection method [17, 10], a variant of which is as follows:

𝐮t+(𝐮)𝐮νΔ𝐮=p,𝐮=0,T[0,1],formulae-sequence𝐮𝑡𝐮𝐮𝜈Δ𝐮𝑝formulae-sequence𝐮0𝑇01\displaystyle\frac{\partial\mathbf{u}}{\partial t}+(\mathbf{u}\cdot\nabla)% \mathbf{u}-\nu\Delta\mathbf{u}=\nabla p,\quad\nabla\cdot\mathbf{u}=0,\quad T% \in[0,1],divide start_ARG ∂ bold_u end_ARG start_ARG ∂ italic_t end_ARG + ( bold_u ⋅ ∇ ) bold_u - italic_ν roman_Δ bold_u = ∇ italic_p , ∇ ⋅ bold_u = 0 , italic_T ∈ [ 0 , 1 ] , (3)

where 𝐮=(u(x,y,t),v(x,y,t))T2𝐮superscript𝑢𝑥𝑦𝑡𝑣𝑥𝑦𝑡𝑇superscript2\mathbf{u}=(u(x,y,t),v(x,y,t))^{T}\in\mathbb{R}^{2}bold_u = ( italic_u ( italic_x , italic_y , italic_t ) , italic_v ( italic_x , italic_y , italic_t ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is velocity and p𝑝pitalic_p pressure. Fix a regular grid size for discretization. The projection method can be written as follows

𝐮k+1subscript𝐮𝑘1\displaystyle\mathbf{u}_{k+1}bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT =𝐮k+Δt(𝐮k,pk)absentsubscript𝐮𝑘Δ𝑡subscript𝐮𝑘subscript𝑝𝑘\displaystyle=\mathbf{u}_{k}+\Delta t\mathcal{L}(\mathbf{u}_{k},p_{k})= bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) (4)
=𝐮k+Δt(νΔ𝐮k(𝐮k)𝐮kpk),absentsubscript𝐮𝑘Δ𝑡𝜈Δsubscript𝐮𝑘subscript𝐮𝑘subscript𝐮𝑘subscript𝑝𝑘\displaystyle=\mathbf{u}_{k}+\Delta t(\nu\Delta\mathbf{u}_{k}-(\mathbf{u}_{k}% \cdot\nabla)\mathbf{u}_{k}-\nabla p_{k}),= bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_ν roman_Δ bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⋅ ∇ ) bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - ∇ italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ,
pksubscript𝑝𝑘\displaystyle p_{k}italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =ϕ(𝐮k)=Δ1((νΔ𝐮k(𝐮k)𝐮k)),absentitalic-ϕsubscript𝐮𝑘superscriptΔ1𝜈Δsubscript𝐮𝑘subscript𝐮𝑘subscript𝐮𝑘\displaystyle=\phi(\mathbf{u}_{k})=\Delta^{-1}(\nabla\cdot\left(\nu\Delta% \mathbf{u}_{k}-(\mathbf{u}_{k}\cdot\nabla)\mathbf{u}_{k}\right)),= italic_ϕ ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = roman_Δ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( ∇ ⋅ ( italic_ν roman_Δ bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⋅ ∇ ) bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) ,

We write Δ1superscriptΔ1\Delta^{-1}roman_Δ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT as the inverse operator for Poisson equation formally. The resolved part is performed by first stepping the convection and diffusion term then using pressure to correct the step, and the unresolved part is a Poisson equation which related the unsolved state pksubscript𝑝𝑘p_{k}italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT with the velocity 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Here, pksubscript𝑝𝑘p_{k}italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT represents the unresolved variable 𝐲ksubscript𝐲𝑘\mathbf{y}_{k}bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT in our formulation Eq. 1.

In this method, the most expensive step is the pressure computation, which requires repeated solutions of similar large-scale linear equations. State of the art solvers such as multigrid [4] and conjugate gradient [43] become prohibitively expensive when the problem size is large. Recently, a promising direction is to replace the Poisson solver by data-driven surrogate models. Ref. [49] replaces the pressure calculation step by a convolutional neural network trained with unsupervised learning by requiring the updated velocity to have zero divergence. Similarly, machine learning models such as tensor-based neural networks are used in [28] to replace classical turbulence modeling for unclosed Reynolds stresses tensor. This model is then plugged into the Reynolds-averaged Navier-Stokes simulation to predict the flow separation. More recently, [37] combines the idea of reduced-order modeling with graph convolutional neural network to encode the reduced manifold and enable fast evaluations of parametrized PDEs.

The second example is based on simulation on two grids of different sizes. Let us continue with the projection method setting Eq. 4. Denote the time-evolution operator on grid size n𝑛nitalic_n as fn:𝐮kn𝐮k+1n:subscript𝑓𝑛superscriptsubscript𝐮𝑘𝑛superscriptsubscript𝐮𝑘1𝑛f_{n}:\mathbf{u}_{k}^{n}\rightarrow\mathbf{u}_{k+1}^{n}italic_f start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT : bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, which becomes very expensive to compute when n𝑛nitalic_n is large. Therefore, one may wish to replace the fine-grid solver with a coarse-grid one and add some correction. This again reduces the computational burden for high-fidelity simulations. This structure shows up in various scientific computing situations, such as multigrid solvers [4], Reynolds-averaged Navier-Stokes (RANS) [38, 1], and large eddy simulation (LES) [59]. Suppose we have a fine grid with size 2n×2n2𝑛2𝑛2n\times 2n2 italic_n × 2 italic_n and a coarse grid with size n×n𝑛𝑛n\times nitalic_n × italic_n, we use superscript n,2n𝑛2𝑛n,2nitalic_n , 2 italic_n to denote the field defined on these two grids, i.e. 𝐮kn,𝐮k2nsuperscriptsubscript𝐮𝑘𝑛superscriptsubscript𝐮𝑘2𝑛\mathbf{u}_{k}^{n},\mathbf{u}_{k}^{2n}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT. Moreover, we fix an interpolation and a restriction operator between fields on these two grids:

In2n:n×n2n×2n,R2nn:2n×2nn×n.:superscriptsubscript𝐼𝑛2𝑛superscript𝑛𝑛superscript2𝑛2𝑛superscriptsubscript𝑅2𝑛𝑛:superscript2𝑛2𝑛superscript𝑛𝑛\displaystyle I_{n}^{2n}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{2n\times 2% n},\quad R_{2n}^{n}:\mathbb{R}^{2n\times 2n}\rightarrow\mathbb{R}^{n\times n}.italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT 2 italic_n × 2 italic_n end_POSTSUPERSCRIPT , italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT : blackboard_R start_POSTSUPERSCRIPT 2 italic_n × 2 italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT . (5)

Now, we can state the second hybrid simulation problem as follows:

{𝐮k+12n=In2nfn(R2nn(𝐮k2n)))+𝐲k2n,𝐲k2n=ϕ(𝐮k2n).\left\{\begin{aligned} \mathbf{u}_{k+1}^{2n}&=I_{n}^{2n}\circ f_{n}(R_{2n}^{n}% (\mathbf{u}_{k}^{2n})))+\mathbf{y}_{k}^{2n},\\ \mathbf{y}_{k}^{2n}&=\phi(\mathbf{u}_{k}^{2n}).\end{aligned}\right.{ start_ROW start_CELL bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT end_CELL start_CELL = italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) ) ) + bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT end_CELL start_CELL = italic_ϕ ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) . end_CELL end_ROW (6)

In detail, the resolved component contains a solver of the evolution equation over the coarse grid which marches forward by one time step, with field given by a restriction of the field on the fine grid. Then, the next step field configuration is interpolated to the fine grid, and the unresolved component serves to correct the deviation of the field variables between simulating on grids of different sizes. The unresolved variables 𝐲k2nsuperscriptsubscript𝐲𝑘2𝑛\mathbf{y}_{k}^{2n}bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT can be calculated if we have accurate simulation results 𝐮k2nsuperscriptsubscript𝐮𝑘2𝑛\mathbf{u}_{k}^{2n}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT on the fine grid by

𝐲k2n=𝐮k+12nIn2nfn(R2nn(𝐮k2n))).\mathbf{y}_{k}^{2n}=\mathbf{u}_{k+1}^{2n}-I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(% \mathbf{u}_{k}^{2n}))).bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT - italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) ) ) . (7)

In the literature, [21] treats the smoothing algorithm in multigrid solver as unresolved components and other procedures such as restriction and prolongation as resolved components. This fits into our framework of MLHS. They introduce a supervised loss function based on multigrid convergence theory to learn the optimal smoother and improve the convergence rate over anisotropic rotated Laplacian problems and variable coefficient diffusion problems. In [32], the authors applies neural network as non-linear regression to fit between some key fluid features and sub-grid stresses. Going beyond this, [35] adds neural emulation to offline learning to prevent the trajectories from deviating away the ground truth. This data-driven model is plugged into hybrid simulation to test its ability of preserving coherent structure and scaling laws.

2.3 Learning the unresolved model and distribution shift

We assume that the resolved part \mathcal{L}caligraphic_L is known, or we at least have access to a gray box that can perform its evaluation and compute its gradients. While the unresolved part is unknown, we can access data tuples {(𝐮1,𝐲1,t1),(𝐮2,𝐲2,t2),,(𝐮N,𝐲N,tN)}subscript𝐮1subscript𝐲1subscript𝑡1subscript𝐮2subscript𝐲2subscript𝑡2subscript𝐮𝑁subscript𝐲𝑁subscript𝑡𝑁\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}{ ( bold_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ( bold_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , ⋯ , ( bold_u start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) } which are obtained either by physical experiments or accurate but expensive numerical simulations. The goal is hence to construct an approximate mapping (𝐮k,tk)𝐲kmaps-tosubscript𝐮𝑘subscript𝑡𝑘subscript𝐲𝑘(\mathbf{u}_{k},t_{k})\mapsto\mathbf{y}_{k}( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ↦ bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT from this dataset. Viewing this as a supervised learning task, there are many existing methods based on empirical risk minimization [19, 20, 24]. One of the simplest method is based on minimizing the L2subscript𝐿2L_{2}italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss function. Given a parameterized model ϕθsubscriptitalic-ϕ𝜃\phi_{\theta}italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, either a classical one with unknown parameters or a surrogate one such as a neural network, we learn from data the optimal value of the parameter θ𝜃\thetaitalic_θ, i.e. θ^=argminθ𝔼(𝐮,𝐲)𝐲ϕθ(𝐮,t)2^𝜃subscript𝜃subscript𝔼𝐮𝐲superscriptnorm𝐲subscriptitalic-ϕ𝜃𝐮𝑡2\widehat{\theta}=\arg\min_{\theta}\mathbb{E}_{(\mathbf{u},\mathbf{y})}\left\|% \mathbf{y}-\phi_{\theta}(\mathbf{u},t)\right\|^{2}over^ start_ARG italic_θ end_ARG = roman_arg roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( bold_u , bold_y ) end_POSTSUBSCRIPT ∥ bold_y - italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u , italic_t ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. However, we emphasize that the data assumption in learning of dynamical processes differs significantly with the traditional i.i.d. assumption in statistical learning [20]. As the data is composed by several simulation trajectories, those belonging to the same trajectory will have large correlation. Therefore, it is reasonable to expect that simple empirical risk minimization (which many existing MLHS works employ [46, 42]) may not be the optimal method. We will demonstrate this by using the least squares estimator as a baseline comparison to our proposed approach Section 4.

After determining the parameters of the unresolved model, one can perform hybrid simulation to obtain new trajectories, i.e.

𝐮^k+1=𝐮^k+Δt(𝐮^k,𝐲^k,tk),𝐲^k=ϕθ^(𝐮^k,tk).formulae-sequencesubscript^𝐮𝑘1subscript^𝐮𝑘Δ𝑡subscript^𝐮𝑘subscript^𝐲𝑘subscript𝑡𝑘subscript^𝐲𝑘subscriptitalic-ϕ^𝜃subscript^𝐮𝑘subscript𝑡𝑘\displaystyle\widehat{\mathbf{u}}_{k+1}=\widehat{\mathbf{u}}_{k}+\Delta t% \mathcal{L}(\widehat{\mathbf{u}}_{k},\widehat{\mathbf{y}}_{k},t_{k}),\quad% \widehat{\mathbf{y}}_{k}=\phi_{\widehat{\theta}}(\widehat{\mathbf{u}}_{k},t_{k% }).over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t caligraphic_L ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (8)

Notice we add hat superscript to all the quantities related to this simulated dynamics to distinguish them from the ground truth (𝐮k,𝐲k)subscript𝐮𝑘subscript𝐲𝑘(\mathbf{u}_{k},\mathbf{y}_{k})( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). We measure the performance of the simulator by the error along the whole trajectories, i.e.

mink=1n𝐮k𝐮^k2,superscriptsubscript𝑘1𝑛superscriptnormsubscript𝐮𝑘subscript^𝐮𝑘2\min\sum_{k=1}^{n}\left\|\mathbf{u}_{k}-\widehat{\mathbf{u}}_{k}\right\|^{2},roman_min ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∥ bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (9)

where 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPTs are the true trajectories and 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPTs are the simulated one with the same initial condition.

We begin by illustrating the issue of distribution shift by analyzing of the trajectory error. For simplicity of presentation, we assume that the model hypothesis space {ϕθ:θΘ}conditional-setsubscriptitalic-ϕ𝜃𝜃Θ\{\phi_{\theta}:\theta\in\Theta\}{ italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT : italic_θ ∈ roman_Θ } is bias-free, meaning that there exists θ*Θsuperscript𝜃Θ\theta^{*}\in\Thetaitalic_θ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ∈ roman_Θ such that ϕ=ϕθ*italic-ϕsubscriptitalic-ϕsuperscript𝜃\phi=\phi_{\theta^{*}}italic_ϕ = italic_ϕ start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUBSCRIPT. In the general case where the hypothesis space is universal but not closed, this equality would be replaced by an approximate one, but the argument follows analogously. Comparing Eq. 1 and Eq. 8, we have

𝐮^k+1𝐮k+1subscript^𝐮𝑘1subscript𝐮𝑘1\displaystyle\ \widehat{\mathbf{u}}_{k+1}-\mathbf{u}_{k+1}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT - bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT (10)
=\displaystyle== 𝐮^k𝐮k+Δt((𝐮k,ϕθ^(𝐮k))(𝐮k,ϕθ*(𝐮k)))subscript^𝐮𝑘subscript𝐮𝑘Δ𝑡subscript𝐮𝑘subscriptitalic-ϕ^𝜃subscript𝐮𝑘subscript𝐮𝑘subscriptitalic-ϕsubscript𝜃subscript𝐮𝑘\displaystyle\ \widehat{\mathbf{u}}_{k}-\mathbf{u}_{k}+\Delta t\left(\mathcal{% L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}_{k}))-\mathcal{L}(\mathbf% {u}_{k},\phi_{\theta_{*}}(\mathbf{u}_{k}))\right)over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t ( caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) - caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) )
+Δt((𝐮^k,ϕθ^(𝐮^k))(𝐮k,ϕθ^(𝐮k))).Δ𝑡subscript^𝐮𝑘subscriptitalic-ϕ^𝜃subscript^𝐮𝑘subscript𝐮𝑘subscriptitalic-ϕ^𝜃subscript𝐮𝑘\displaystyle+\Delta t\left(\mathcal{L}(\widehat{\mathbf{u}}_{k},\phi_{% \widehat{\theta}}(\widehat{\mathbf{u}}_{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{% \widehat{\theta}}(\mathbf{u}_{k}))\right).+ roman_Δ italic_t ( caligraphic_L ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) - caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) ) .

The error introduced by moving one step forward is thus decomposed into two parts: the error associated with the estimator θ^:(𝐮k,ϕθ^(𝐮k))(𝐮k,ϕθ*(𝐮k)):^𝜃subscript𝐮𝑘subscriptitalic-ϕ^𝜃subscript𝐮𝑘subscript𝐮𝑘subscriptitalic-ϕsubscript𝜃subscript𝐮𝑘\widehat{\theta}:\mathcal{L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}% _{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{\theta_{*}}(\mathbf{u}_{k}))over^ start_ARG italic_θ end_ARG : caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) - caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ), and the other error (𝐮^k,ϕθ^(𝐮^k))(𝐮k,ϕθ^(𝐮k))subscript^𝐮𝑘subscriptitalic-ϕ^𝜃subscript^𝐮𝑘subscript𝐮𝑘subscriptitalic-ϕ^𝜃subscript𝐮𝑘\mathcal{L}(\widehat{\mathbf{u}}_{k},\phi_{\widehat{\theta}}(\widehat{\mathbf{% u}}_{k}))-\mathcal{L}(\mathbf{u}_{k},\phi_{\widehat{\theta}}(\mathbf{u}_{k}))caligraphic_L ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) - caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ). The former resembles the error appearing in classical statistical inference problem under the i.i.d. setting since 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT follows an identical distribution of the training dataset. However, the latter is different in the sense that 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT may come from distributions of their respective driving dynamics, which are different since θ*θ^subscript𝜃^𝜃\theta_{*}\neq\widehat{\theta}italic_θ start_POSTSUBSCRIPT * end_POSTSUBSCRIPT ≠ over^ start_ARG italic_θ end_ARG. This may be due to noisy labels, few data points, or having an under-determined system. This difference between MLHS and statistical learning is also emphasized in [51]. This latter error is rather akin to the issue of stability in numerical solution for partial differential equations: although the error in each step is relatively small, it can accumulate exponentially as time step iterates, driving 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to completely different regime where we have not observed in the data 𝐮k,k=1,,Nformulae-sequencesubscript𝐮𝑘𝑘1𝑁\mathbf{u}_{k},k=1,\cdots,Nbold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_k = 1 , ⋯ , italic_N. As a result, the data-driven model ϕ^^italic-ϕ\widehat{\phi}over^ start_ARG italic_ϕ end_ARG can no longer be trusted to be accurate, since it is trained on data from a different distribution. This may then lead to a vicious cycle, where further deterioration of trajectory error occurs, leading to increased discrepancy between the distributions. Consequently, one can understand the distribution shift as a large discrepancy between the true distribution of 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and simulated distribution 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. This problems occur commonly in MLHS. For example, [28, 29, 53, 39] all observe unphysical solutions when substituting Reynolds stresses models calculated from direct numerical simulation database into RANS. Ref. [53] attributes this to the ill-conditioning of the RANS problems and propose a method to treat Reynolds stresses implicitly, while [30] uses coupling modes perspective to numerically study the stability and convergence of RANS and [18] derives the propagation error of RANS simulation error mathematically. However, a general theoretical framework to understand and combat distribution shift in MLHS is lacking. The current paper aims to make progress in this direction.

Let us illustrate the problem of distribution shift using numerical examples. We solve the 2D FitzHugh-Nagumo reaction-diffusion equation over a periodic domain [0,6.4]×[0,6.4]06.406.4[0,6.4]\times[0,6.4][ 0 , 6.4 ] × [ 0 , 6.4 ], whose dynamics is given by

𝐮t𝐮𝑡\displaystyle\frac{\partial\mathbf{u}}{\partial t}divide start_ARG ∂ bold_u end_ARG start_ARG ∂ italic_t end_ARG =DΔ𝐮+ϕ(𝐮),T[0,20],formulae-sequenceabsent𝐷Δ𝐮italic-ϕ𝐮𝑇020\displaystyle=D\Delta\mathbf{u}+\phi(\mathbf{u}),\quad T\in[0,20],= italic_D roman_Δ bold_u + italic_ϕ ( bold_u ) , italic_T ∈ [ 0 , 20 ] , (11)
ϕ(𝐮)italic-ϕ𝐮\displaystyle\phi(\mathbf{u})italic_ϕ ( bold_u ) =ϕ(u,v)=(uu3v+αβ(uv)).absentitalic-ϕ𝑢𝑣matrix𝑢superscript𝑢3𝑣𝛼𝛽𝑢𝑣\displaystyle=\phi(u,v)=\begin{pmatrix}u-u^{3}-v+\alpha\\ \beta(u-v)\end{pmatrix}.= italic_ϕ ( italic_u , italic_v ) = ( start_ARG start_ROW start_CELL italic_u - italic_u start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT - italic_v + italic_α end_CELL end_ROW start_ROW start_CELL italic_β ( italic_u - italic_v ) end_CELL end_ROW end_ARG ) .

where 𝐮=(u(x,y,t),v(x,y,t))T2𝐮superscript𝑢𝑥𝑦𝑡𝑣𝑥𝑦𝑡𝑇superscript2\mathbf{u}=(u(x,y,t),v(x,y,t))^{T}\in\mathbb{R}^{2}bold_u = ( italic_u ( italic_x , italic_y , italic_t ) , italic_v ( italic_x , italic_y , italic_t ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT are two interactive components, D𝐷Ditalic_D is the diffusion matrix, and 𝐑(𝐮)𝐑𝐮\mathbf{R}(\mathbf{u})bold_R ( bold_u ) is source term for the reaction. We fix the parameters to α=0.01,β=1.0,D=(0.05000.1)formulae-sequence𝛼0.01formulae-sequence𝛽1.0𝐷matrix0.05000.1\alpha=0.01,\beta=1.0,D=\begin{pmatrix}0.05&0\\ 0&0.1\end{pmatrix}italic_α = 0.01 , italic_β = 1.0 , italic_D = ( start_ARG start_ROW start_CELL 0.05 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0.1 end_CELL end_ROW end_ARG ), where the system is known to form Turing patterns [33]. Assuming that we do not know the exact form of the reaction term ϕ(𝐮)italic-ϕ𝐮\phi(\mathbf{u})italic_ϕ ( bold_u ), the resolved part is the diffusion term and the unresolved part is the non-linear reaction term. The ground truth data are calculated using the semi-implicit Crank-Nicolson scheme with full explicit discretization for the non-linear term. As we do not know the reaction term is pointwise in space, we use a convolutional neural network-based model to learn it based on ordinary least squares (OLS). Specifically, we solve the least squares problem minθ𝔼𝐮ϕθ(𝐮)ϕ(𝐮)2subscript𝜃subscript𝔼𝐮superscriptnormsubscriptitalic-ϕ𝜃𝐮italic-ϕ𝐮2\min_{\theta}\mathbb{E}_{\mathbf{u}}\left\|\phi_{\theta}(\mathbf{u})-\phi(% \mathbf{u})\right\|^{2}roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT bold_u end_POSTSUBSCRIPT ∥ italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u ) - italic_ϕ ( bold_u ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT using Adam [22]. Moreover, we conduct an ablation study where the unresolved model consists of the exact form of the non-linear term as in Eq. 11, plus a random Gaussian noise whose variance is set equal to the optimization error of the OLS estimator, which is 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT in the following experiments. By comparing this and the simulation based on the OLS model, we can conclude whether or not the trajectory error is caused mainly by the estimation error of the unresolved model, or by the shift in distribution of the input data fed to the unresolved model. More details on the data generation and optimization procedure is described in the supplementary materials LABEL:numeric-setup. We simulate this PDE with a test initial condition sampled from the same distribution as training trajectories and compare the results with the ground truth. We present the comparison in Fig. 1.

Refer to caption

Figure 1: The first three rows display the snapshots of ground truth, simulated fields, and ablation study simulation at time steps: 0, 100, 200, 300, 500. (d) plots the relative error and distribution shift between two flow configurations at corresponding time. The error is defined as 𝐮^t𝐮t2𝐮t2subscriptnormsubscript^𝐮𝑡subscript𝐮𝑡2subscriptnormsubscript𝐮𝑡2\frac{\left\|\widehat{\mathbf{u}}_{t}-\mathbf{u}_{t}\right\|_{2}}{\left\|% \mathbf{u}_{t}\right\|_{2}}divide start_ARG ∥ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG at each time step t𝑡titalic_t instead of the whole trajectory as in Eq. 9. The plotted distribution shift roughly measures the average discrepancy between the distributions of 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT at time step k𝑘kitalic_k, and is calculated using an autoencoder explained in Section 3.2. Although the OLS estimator and ablation-study estimator share the same error magnitude for unresolved model, their long time trajectory performance form sharp contrast, indicating the optimization error of the unresolved model could not explain the failure in long time trajectory prediction.

The first three rows display the snapshots of ground truth, simulated fields, and ablation study simulation at time steps: 0, 100, 200, 300, 500. (d) plots the relative error and distribution shift between two flow configurations at corresponding time. The error is defined as 𝐮^t𝐮t2𝐮t2subscriptnormsubscript^𝐮𝑡subscript𝐮𝑡2subscriptnormsubscript𝐮𝑡2\frac{\left\|\widehat{\mathbf{u}}_{t}-\mathbf{u}_{t}\right\|_{2}}{\left\|% \mathbf{u}_{t}\right\|_{2}}divide start_ARG ∥ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG at each time step t𝑡titalic_t instead of the whole trajectory as in Eq. 9. The plotted distribution shift roughly measures the average discrepancy between the distributions of 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝐮^ksubscript^𝐮𝑘\widehat{\mathbf{u}}_{k}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT at time step k𝑘kitalic_k, and is calculated using an autoencoder explained in Section 3.2. As can be observed, it does not take a very long time for the distribution shift issue to be severe enough using ordinary least squares estimator. Comparing the OLS error and distribution shift with the those of the ablation study, we conclude that the estimation error is not the main cause of the trajectory error, since the ablation-study trajectory has the same magnitude of estimation error, but has much smaller trajectory error and distribution shift. Moreover, in both cases the relative error of the trajectory has the same trend as the distribution shift, indicating correlation.

Indeed, this phenomenon is ubiquitous in modern MLHS [51, 56, 46]. As a second example, we solve the Navier-Stokes equation Eq. 3 using projection method Eq. 4 where pressure solver is replaced by a learned convolutional neural network predictor. The predictor is optimized over data pair of fluid velocity and pressure obtained from high-fidelity simulation to attain an error of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT. Due to the complexity of projection solver and grid issue, we do not conduct ablation study in this case.

Refer to caption

Figure 2: Configurations of the Navier-Stokes equation simulation with Reynolds number 200200200200 at time step: 0, 60, 120, 180, 240: (a) ground truth (b) simulated fluid field using OLS estimator. We only show velocity field over part of the domain. The dynamics blow up in the last configuration, which does not show up in the case of reaction-diffusion equation. This can be explained by the fact that the Navier-Stokes equation is more unstable than the reaction-diffusion equation. In (c), we illustrate the error along the trajectories, the distribution shift.

The distribution shift in Fig. 2 turns out to be more severe when we simulate the incompressible Navier-Stokes equation. After 160160160160 time steps, the fluid configuration becomes rough and un-physical. Here, we again observe that the DS and error increase together, suggesting correlation. In the next section, we will make this connection precise, and furthermore develop principled methods to control the distribution shift, thereby improving the prediction fidelity.

3 Theoretical analysis and algorithm

In this section, we first put the distribution shift issue into a theoretical framework. Then, we introduce and analyze our algorithm in this section. We will provide rigorous study of the linear case and use this to motivate the algorithm.

3.1 Distribution shift in linear dynamics and motivation of tangent-space regularized estimator

As there exists few tools to tackle the general non-linear dynamics Eq. 1, we switch to the simpler case of linear dynamics, which also capture some key features in general situations. We consider the following hybrid simulation problem

t𝐮subscript𝑡𝐮\displaystyle\partial_{t}\mathbf{u}∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_u =A𝐮+B𝐲,𝐮m,𝐲n,Am×m,Bm×nformulae-sequenceabsent𝐴𝐮𝐵𝐲formulae-sequence𝐮superscript𝑚formulae-sequence𝐲superscript𝑛formulae-sequence𝐴superscript𝑚𝑚𝐵superscript𝑚𝑛\displaystyle=A\mathbf{u}+B\mathbf{y},\quad\mathbf{u}\in\mathbb{R}^{m},\mathbf% {y}\in\mathbb{R}^{n},A\in\mathbb{R}^{m\times m},B\in\mathbb{R}^{m\times n}= italic_A bold_u + italic_B bold_y , bold_u ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT , bold_y ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_m end_POSTSUPERSCRIPT , italic_B ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_n end_POSTSUPERSCRIPT (12)
𝐲𝐲\displaystyle\mathbf{y}bold_y =C*𝐮,C*n×m.formulae-sequenceabsentsuperscript𝐶𝐮superscript𝐶superscript𝑛𝑚\displaystyle=C^{*}\mathbf{u},\quad C^{*}\in\mathbb{R}^{n\times m}.= italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_u , italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT .

Such structures appear in many scenarios, e.g. finite difference solution of discretized linear PDEs and linear control problems. Alternatively, one can think of this linear system as the linearization of the non-linear dynamics Eq. 1 about some steady-state.

We assume that the state variables 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs in training data come from several trajectories and 𝐲isubscript𝐲𝑖\mathbf{y}_{i}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs may be subject to some random measurement error, i.e. 𝐲i=C*𝐮i+ϵisubscript𝐲𝑖superscript𝐶subscript𝐮𝑖subscriptitalic-ϵ𝑖\mathbf{y}_{i}=C^{*}\mathbf{u}_{i}+\epsilon_{i}bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, or in matrix form Y=C*𝐔+ϵ𝑌superscript𝐶𝐔italic-ϵY=C^{*}\mathbf{U}+\epsilonitalic_Y = italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_U + italic_ϵ. The key point is that these trajectories may belong to a low-dimensional subspace in the high-dimensional state space. This is the case for many scientific computing problems, such as computational fluid dynamics [38], and is also known more generally as the manifold hypothesis [34]. We use Vm𝑉superscript𝑚V\subset\mathbb{R}^{m}italic_V ⊂ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT to denote the subspace that contains all the training data 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In this setting, the appearance of this low-dimensional structure may be caused by two different situations. In the first situation, the initial value of 𝐮𝐮\mathbf{u}bold_u is supported on the subspace spanned by several eigenvectors of the evolution operator e(A+BC*)superscript𝑒𝐴𝐵superscript𝐶e^{(A+BC^{*})}italic_e start_POSTSUPERSCRIPT ( italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT, i.e.

𝐮0span{𝐯1,𝐯2,,𝐯l},e(A+BC*)𝐯1=λ1𝐯1,,e(A+BC*)𝐯l=λl𝐯l,formulae-sequencesubscript𝐮0spansubscript𝐯1subscript𝐯2subscript𝐯𝑙formulae-sequencesuperscript𝑒𝐴𝐵superscript𝐶subscript𝐯1subscript𝜆1subscript𝐯1superscript𝑒𝐴𝐵superscript𝐶subscript𝐯𝑙subscript𝜆𝑙subscript𝐯𝑙\displaystyle\mathbf{u}_{0}\in\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{% 2},\cdots,\mathbf{v}_{l}\},\quad e^{(A+BC^{*})}\mathbf{v}_{1}=\lambda_{1}% \mathbf{v}_{1},\cdots,e^{(A+BC^{*})}\mathbf{v}_{l}=\lambda_{l}\mathbf{v}_{l},bold_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ roman_span { bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT } , italic_e start_POSTSUPERSCRIPT ( italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , ⋯ , italic_e start_POSTSUPERSCRIPT ( italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , (13)

Then, all the training data 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT belong to this subspace, i.e. V=span{𝐯1,𝐯2,,𝐯l}𝑉spansubscript𝐯1subscript𝐯2subscript𝐯𝑙V=\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l}\}italic_V = roman_span { bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT }. The second situation is that the dynamics is degenerate in the sense that the evolution matrix A𝐴Aitalic_A is not of full rank. This will not appear in the differential formulation as infinitesimal transformations in the form e(A+BC*)Δtsuperscript𝑒𝐴𝐵superscript𝐶Δ𝑡e^{(A+BC^{*})\Delta t}italic_e start_POSTSUPERSCRIPT ( italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) roman_Δ italic_t end_POSTSUPERSCRIPT are always non-degenerate. However, for a discrete dynamics

𝐮k+1=A𝐮k+B𝐲k,𝐲k=C*𝐮k,formulae-sequencesubscript𝐮𝑘1𝐴subscript𝐮𝑘𝐵subscript𝐲𝑘subscript𝐲𝑘superscript𝐶subscript𝐮𝑘\mathbf{u}_{k+1}=A\mathbf{u}_{k}+B\mathbf{y}_{k},\quad\mathbf{y}_{k}=C^{*}% \mathbf{u}_{k},bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = italic_A bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_B bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , (14)

if the matrix A+BC*𝐴𝐵superscript𝐶A+BC^{*}italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is degenerate, then all the data 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT will belong to the range of this operator A+BC*𝐴𝐵superscript𝐶A+BC^{*}italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, which is a proper subspace. Most cases of scientific computing applications belong to the first class. Due to the degeneracy in the training data, the least squares solution is not unique. Hence, various empirical regularizations (e.g. 2superscript2\ell^{2}roman_ℓ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT regularization) can be introduced to obtain an estimator with desirable properties.

Now, suppose one obtains an estimator C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG, one can calculate several error metrics. The first is the test error

lOLS(C^)=𝔼(C^C*)𝐮2,subscript𝑙𝑂𝐿𝑆^𝐶𝔼superscriptnorm^𝐶superscript𝐶𝐮2l_{OLS}(\widehat{C})=\mathbb{E}\left\|(\widehat{C}-C^{*})\mathbf{u}\right\|^{2},italic_l start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ( over^ start_ARG italic_C end_ARG ) = blackboard_E ∥ ( over^ start_ARG italic_C end_ARG - italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) bold_u ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (15)

where 𝐮𝐮\mathbf{u}bold_u is sampled from the low dimensional subspace V𝑉Vitalic_V and follows the same distribution as the training data. We refer to this as statistical estimation error. However, we are actually interested in the error of the simulated dynamics

t𝐮^=A𝐮^+B𝐲^,𝐲^=C^𝐮^,formulae-sequencesubscript𝑡^𝐮𝐴^𝐮𝐵^𝐲^𝐲^𝐶^𝐮\displaystyle\partial_{t}\widehat{\mathbf{u}}=A\widehat{\mathbf{u}}+B\widehat{% \mathbf{y}},\quad\widehat{\mathbf{y}}=\widehat{C}\widehat{\mathbf{u}},∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG = italic_A over^ start_ARG bold_u end_ARG + italic_B over^ start_ARG bold_y end_ARG , over^ start_ARG bold_y end_ARG = over^ start_ARG italic_C end_ARG over^ start_ARG bold_u end_ARG , (16)

compared to the ground truth 𝐮𝐮\mathbf{u}bold_u with the same initial condition. The time evolution of their difference is given by

t(𝐮^𝐮)=subscript𝑡^𝐮𝐮absent\displaystyle\partial_{t}(\widehat{\mathbf{u}}-\mathbf{u})=∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( over^ start_ARG bold_u end_ARG - bold_u ) = (A+BC^)𝐮^(A+BC*)𝐮𝐴𝐵^𝐶^𝐮𝐴𝐵superscript𝐶𝐮\displaystyle\ (A+B\widehat{C})\widehat{\mathbf{u}}-(A+BC^{*})\mathbf{u}( italic_A + italic_B over^ start_ARG italic_C end_ARG ) over^ start_ARG bold_u end_ARG - ( italic_A + italic_B italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) bold_u (17)
=\displaystyle== A(𝐮^𝐮)+B(C^𝐮^C^PV𝐮^+C^PV𝐮^C^𝐮+C^𝐮C*𝐮)𝐴^𝐮𝐮𝐵^𝐶^𝐮^𝐶subscript𝑃𝑉^𝐮^𝐶subscript𝑃𝑉^𝐮^𝐶𝐮^𝐶𝐮superscript𝐶𝐮\displaystyle\ A(\widehat{\mathbf{u}}-\mathbf{u})+B(\widehat{C}\widehat{% \mathbf{u}}-\widehat{C}P_{V}\widehat{\mathbf{u}}+\widehat{C}P_{V}\widehat{% \mathbf{u}}-\widehat{C}\mathbf{u}+\widehat{C}\mathbf{u}-C^{*}\mathbf{u})italic_A ( over^ start_ARG bold_u end_ARG - bold_u ) + italic_B ( over^ start_ARG italic_C end_ARG over^ start_ARG bold_u end_ARG - over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG + over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG - over^ start_ARG italic_C end_ARG bold_u + over^ start_ARG italic_C end_ARG bold_u - italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_u )
=\displaystyle== (A+BC^PV)(𝐮^𝐮)+B(C^C*)𝐮+BC^(𝐮^PV𝐮^),𝐴𝐵^𝐶subscript𝑃𝑉^𝐮𝐮𝐵^𝐶superscript𝐶𝐮𝐵^𝐶^𝐮subscript𝑃𝑉^𝐮\displaystyle\ (A+B\widehat{C}P_{V})(\widehat{\mathbf{u}}-\mathbf{u})+B(% \widehat{C}-C^{*})\mathbf{u}+B\widehat{C}(\widehat{\mathbf{u}}-P_{V}\widehat{% \mathbf{u}}),( italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ) ( over^ start_ARG bold_u end_ARG - bold_u ) + italic_B ( over^ start_ARG italic_C end_ARG - italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) bold_u + italic_B over^ start_ARG italic_C end_ARG ( over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ) ,

where PVsubscript𝑃𝑉P_{V}italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT is the orthogonal projection onto the data subspace V𝑉Vitalic_V and we have PV𝐮=𝐮subscript𝑃𝑉𝐮𝐮P_{V}\mathbf{u}=\mathbf{u}italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT bold_u = bold_u since 𝐮𝐮\mathbf{u}bold_u belongs to the data subspace. As derived in Eq. 17, the overall error of the trajectory during evolution can be decomposed into three parts. In the following we make a detailed analysis of each term.

The first term can be viewed as an amplitude factor of the error propagation and can be easily bounded by

(A+BC^PV)(𝐮^𝐮)2A+BC^PV2𝐮^𝐮2.subscriptnorm𝐴𝐵^𝐶subscript𝑃𝑉^𝐮𝐮2subscriptnorm𝐴𝐵^𝐶subscript𝑃𝑉2subscriptnorm^𝐮𝐮2\left\|(A+B\widehat{C}P_{V})(\widehat{\mathbf{u}}-\mathbf{u})\right\|_{2}\leq% \left\|A+B\widehat{C}P_{V}\right\|_{2}\left\|\widehat{\mathbf{u}}-\mathbf{u}% \right\|_{2}.∥ ( italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ) ( over^ start_ARG bold_u end_ARG - bold_u ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≤ ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ over^ start_ARG bold_u end_ARG - bold_u ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . (18)

The operator A+BC^PV𝐴𝐵^𝐶subscript𝑃𝑉A+B\widehat{C}P_{V}italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT can be viewed as a posterior estimation of the error propagation and a stability constant of the simulated dynamics.

The second term is simply the statistical estimation error of the estimator C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG defined in Eq. 15. Given that the initial condition 𝐮0subscript𝐮0\mathbf{u}_{0}bold_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT of the training and test trajectories are sampled from the same distribution, this term is well-bounded in classical statistical learning theory [20], provided the number of data tuples is sufficiently large. In particular, this error has nothing to do with distribution shift.

Now we move to the last part BC^(𝐮^PV𝐮^)𝐵^𝐶^𝐮subscript𝑃𝑉^𝐮B\widehat{C}(\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}})italic_B over^ start_ARG italic_C end_ARG ( over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ), which can be more easily understood from a geometric viewpoint. The last factor (𝐮^PV𝐮^)^𝐮subscript𝑃𝑉^𝐮(\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}})( over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ) measures how far the simulated trajectory is away from the data subspace V𝑉Vitalic_V. In other words, this term measures how far the simulated trajectory’s state distribution “shifts” from the true distribution. This is exactly the term we want to obtain to control the distribution shift in simulation. Meanwhile, this term forms sharp contrast with previous two terms in the sense that it is not automatically bounded in most algorithm for estimating C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG. We will illustrate this point further by deriving a error bound of the simulated dynamics, which shows that it depends sensitively on this distribution shift term. Thus, we can improve prediction accuracy if we combat such distribution shifts.

Now, taking the expectation of the inner product between 𝐮^𝐮^𝐮𝐮\widehat{\mathbf{u}}-\mathbf{u}over^ start_ARG bold_u end_ARG - bold_u and Eq. 17 and applying Gronwall’s lemma one obtains the following bound on the error propagation.

𝔼𝐮^(T)𝐮(T)22𝔼superscriptsubscriptnorm^𝐮𝑇𝐮𝑇22absent\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}(T)-\mathbf{u}(T)\right\|_{2% }^{2}\leqblackboard_E ∥ over^ start_ARG bold_u end_ARG ( italic_T ) - bold_u ( italic_T ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ (19)
B22𝔼(C^C*)𝐮2+BC^22suptT𝐮^(t)PV𝐮^(t)22A+BC^PV22+3(e(A+BC^PV22+3)T1).superscriptsubscriptnorm𝐵22𝔼superscriptnorm^𝐶superscript𝐶𝐮2superscriptsubscriptnorm𝐵^𝐶22subscriptsupremum𝑡𝑇superscriptsubscriptnorm^𝐮𝑡subscript𝑃𝑉^𝐮𝑡22superscriptsubscriptnorm𝐴𝐵^𝐶subscript𝑃𝑉223superscript𝑒superscriptsubscriptnorm𝐴𝐵^𝐶subscript𝑃𝑉223𝑇1\displaystyle\frac{\left\|B\right\|_{2}^{2}\mathbb{E}\left\|(\widehat{C}-C^{*}% )\mathbf{u}\right\|^{2}+\left\|B\widehat{C}\right\|_{2}^{2}\sup_{t\leq T}\left% \|\widehat{\mathbf{u}}(t)-P_{V}\widehat{\mathbf{u}}(t)\right\|_{2}^{2}}{\left% \|A+B\widehat{C}P_{V}\right\|_{2}^{2}+3}(e^{(\left\|A+B\widehat{C}P_{V}\right% \|_{2}^{2}+3)T}-1).divide start_ARG ∥ italic_B ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ ( over^ start_ARG italic_C end_ARG - italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) bold_u ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_B over^ start_ARG italic_C end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT italic_t ≤ italic_T end_POSTSUBSCRIPT ∥ over^ start_ARG bold_u end_ARG ( italic_t ) - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ( italic_t ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 3 end_ARG ( italic_e start_POSTSUPERSCRIPT ( ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 3 ) italic_T end_POSTSUPERSCRIPT - 1 ) .

An interpretation for this result is that the trajectory error can be divided into two parts: B22𝔼(C^C*)𝐮2superscriptsubscriptnorm𝐵22𝔼superscriptnorm^𝐶superscript𝐶𝐮2\left\|B\right\|_{2}^{2}\mathbb{E}\left\|(\widehat{C}-C^{*})\mathbf{u}\right\|% ^{2}∥ italic_B ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E ∥ ( over^ start_ARG italic_C end_ARG - italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ) bold_u ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and BC^22supt𝐮^(t)PV𝐮^(t)22superscriptsubscriptnorm𝐵^𝐶22subscriptsupremum𝑡superscriptsubscriptnorm^𝐮𝑡subscript𝑃𝑉^𝐮𝑡22\left\|B\widehat{C}\right\|_{2}^{2}\sup_{t}\left\|\widehat{\mathbf{u}}(t)-P_{V% }\widehat{\mathbf{u}}(t)\right\|_{2}^{2}∥ italic_B over^ start_ARG italic_C end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_sup start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ over^ start_ARG bold_u end_ARG ( italic_t ) - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ( italic_t ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. The first error is the statistical estimation error of unresolved component, which can be reduced by better training algorithms, better models, or more data etc. This is the familiar term one faced in supervised learning under common i.i.d. setting. The second error relates to the distribution shift since it is measured by the difference between the sampled trajectories and underlying data manifold, i.e. 𝐮^PV𝐮^=dist(𝐮^,V)norm^𝐮subscript𝑃𝑉^𝐮dist^𝐮𝑉\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|=\operatorname{% dist}(\widehat{\mathbf{u}},V)∥ over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ∥ = roman_dist ( over^ start_ARG bold_u end_ARG , italic_V ). Moreover, this shift is caused by the interaction between the estimator of unresolved model and the resolved model dynamics. Specifically, the estimator makes some error in prediction and this error is propagated via the resolved model dynamics and then fed into the next round of hybrid simulation. This distribution shift in MLHS is far different from the one in computer science such as covariate shift and label shift. In computer vision, the cause of distribution shift may be extrinsic [23], i.e. the pictures in training set are all taken at daytime while those for testing are all taken at night. Such shifts are hard to model, so their resolution tends to depend on data augmentation and related techniques, e.g. the Dagger algorithm [41]. Instead of changing the optimization problem, the data source is modified. This shares much similarity with adversarial training, which also add more data to the training set to make the prediction robust under adversarial attack, a typical distribution shift in the area of computer vision [15]. In contrast, in our setting the distribution shift is intrinsically driven by the hybrid simulation structure of which we have partial knowledge. Hence, we can quantify this distribution shift and design specific algorithm to mitigate them.

Returning to the linear problem, in order to guarantee that the error 𝐮^𝐮^𝐮𝐮\widehat{\mathbf{u}}-\mathbf{u}over^ start_ARG bold_u end_ARG - bold_u is bounded over the simulated trajectory, a natural choice is to use a regularization for 𝐮^PV𝐮^2subscriptnorm^𝐮subscript𝑃𝑉^𝐮2\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|_{2}∥ over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. The most naive choice would be 𝐮^PV𝐮^2subscriptnorm^𝐮subscript𝑃𝑉^𝐮2\left\|\widehat{\mathbf{u}}-P_{V}\widehat{\mathbf{u}}\right\|_{2}∥ over^ start_ARG bold_u end_ARG - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT itself, but a problem is that 𝐮^^𝐮\widehat{\mathbf{u}}over^ start_ARG bold_u end_ARG is calculated via hybrid simulation until time t𝑡titalic_t, which corresponds to a rather complicated computational graph. This makes subsequent gradient-based optimization computational expensive. Hence, we take the advantage of an one-step predictor, i.e. we set

𝐮^k+1=𝐮k+Δt(A𝐮k+BC^𝐮k).subscript^𝐮𝑘1subscript𝐮𝑘Δ𝑡𝐴subscript𝐮𝑘𝐵^𝐶subscript𝐮𝑘\widehat{\mathbf{u}}_{k+1}=\mathbf{u}_{k}+\Delta t(A\mathbf{u}_{k}+B\widehat{C% }\mathbf{u}_{k}).over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t ( italic_A bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_B over^ start_ARG italic_C end_ARG bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (20)

The 𝐮^k+1subscript^𝐮𝑘1\widehat{\mathbf{u}}_{k+1}over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT is calculated based on a single step of the simulated dynamics from the ground truth solution 𝐮ksubscript𝐮𝑘\mathbf{u}_{k}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Assume 𝐮kVsubscript𝐮𝑘𝑉\mathbf{u}_{k}\in Vbold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_V, one has

𝐮^k+1PV𝐮^k+12=ΔtPV(A𝐮k+BC^𝐮k)2,subscriptnormsubscript^𝐮𝑘1subscript𝑃𝑉subscript^𝐮𝑘12Δ𝑡subscriptnormsubscript𝑃superscript𝑉perpendicular-to𝐴subscript𝐮𝑘𝐵^𝐶subscript𝐮𝑘2\left\|\widehat{\mathbf{u}}_{k+1}-P_{V}\widehat{\mathbf{u}}_{k+1}\right\|_{2}=% \Delta t\left\|P_{V^{\perp}}(A\mathbf{u}_{k}+B\widehat{C}\mathbf{u}_{k})\right% \|_{2},∥ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT - italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = roman_Δ italic_t ∥ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_A bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_B over^ start_ARG italic_C end_ARG bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (21)

and we may penalize the right hand side to promote trajectory alignment with the data manifold. Wrapping up all the ingredients, we state the loss function in the linear case as

minCl(C):=minC𝔼(𝐮,𝐲)(𝐲C𝐮22+λPV(A+BC)𝐮22),assignsubscript𝐶𝑙𝐶subscript𝐶subscript𝔼𝐮𝐲superscriptsubscriptnorm𝐲𝐶𝐮22𝜆superscriptsubscriptnormsubscript𝑃superscript𝑉perpendicular-to𝐴𝐵𝐶𝐮22\min_{C}l(C):=\min_{C}\mathbb{E}_{(\mathbf{u},\mathbf{y})}\left(\left\|\mathbf% {y}-C\mathbf{u}\right\|_{2}^{2}+\lambda\left\|P_{V^{\perp}}(A+BC)\mathbf{u}% \right\|_{2}^{2}\right),roman_min start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_l ( italic_C ) := roman_min start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT ( bold_u , bold_y ) end_POSTSUBSCRIPT ( ∥ bold_y - italic_C bold_u ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_A + italic_B italic_C ) bold_u ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) , (22)

and here λ𝜆\lambdaitalic_λ is a penalty strength parameter which can be chosen during training. We briefly discuss in the supplementary materials LABEL:lambda how to choose λ𝜆\lambdaitalic_λ.

In above discussion, we have assumed that the underlying data subspace V𝑉Vitalic_V is known a priori or we can calculate PVsubscript𝑃superscript𝑉perpendicular-toP_{V^{\perp}}italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT directly. However, in most cases the only information we have is a set of resolved state variables 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs. Then, we can use standard dimension reduction methods such as principle component analysis [20] to obtain a subspace V𝑉Vitalic_V, which can be thought as an approximation of the data manifold. The loss objective Eq. 22 can thus be calculated.

3.2 Algorithm in the general form

The discussion above applies to linear dynamics where the underlying data manifold is also linear. In this subsection we generalize it to non-linear cases and state the general form of our algorithm.

Let us return to our original formulation of the task Eq. 1, where we have a dataset {(𝐮1,𝐲1,t1),(𝐮2,𝐲2,t2),,(𝐮N,𝐲N,tN)}subscript𝐮1subscript𝐲1subscript𝑡1subscript𝐮2subscript𝐲2subscript𝑡2subscript𝐮𝑁subscript𝐲𝑁subscript𝑡𝑁\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}{ ( bold_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ( bold_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , ⋯ , ( bold_u start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) }. As before, we assume that the resolved variable lies on a low-dimensional manifold \mathcal{M}caligraphic_M in the configuration space and the estimator ϕθ(𝐮,t)subscriptitalic-ϕ𝜃𝐮𝑡\phi_{\theta}(\mathbf{u},t)italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u , italic_t ) achieves high accuracy along this whole manifold while we have no guarantee outside the manifold. The key difference between the general case and the linear case is that this manifold may not be linear and we do not have an orthogonal decomposition.

To resolve this, we leverage an autoencoder to represent the data manifold structure, i.e. we learn two neural networks E:ml:𝐸superscript𝑚superscript𝑙E:\mathbb{R}^{m}\rightarrow\mathbb{R}^{l}italic_E : blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT (encoder) and D:lm:𝐷superscript𝑙superscript𝑚D:\mathbb{R}^{l}\rightarrow\mathbb{R}^{m}italic_D : blackboard_R start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT (decoder) to represent the manifold. The encoder maps the state variable 𝐮𝐮\mathbf{u}bold_u into a low dimensional space and decoder maps them back to original space. We optimize E,D𝐸𝐷E,Ditalic_E , italic_D simultaneously to minimize the reconstruction error

minE,D1Ni=1N𝐮iD(E(𝐮i))22.subscript𝐸𝐷1𝑁superscriptsubscript𝑖1𝑁superscriptsubscriptnormsubscript𝐮𝑖𝐷𝐸subscript𝐮𝑖22\min_{E,D}\frac{1}{N}\sum_{i=1}^{N}\left\|\mathbf{u}_{i}-D(E(\mathbf{u}_{i}))% \right\|_{2}^{2}.roman_min start_POSTSUBSCRIPT italic_E , italic_D end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∥ bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_D ( italic_E ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (23)

Here, we only use the resolved states 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT from the dataset. After training the autoencoder, we obtain a function

F:m0,F(𝐮)=𝐮D(E(𝐮))22,:𝐹formulae-sequencesuperscript𝑚subscriptabsent0𝐹𝐮superscriptsubscriptnorm𝐮𝐷𝐸𝐮22F:\mathbb{R}^{m}\rightarrow\mathbb{R}_{\geq 0},\quad F(\mathbf{u})=\left\|% \mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2},italic_F : blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT → blackboard_R start_POSTSUBSCRIPT ≥ 0 end_POSTSUBSCRIPT , italic_F ( bold_u ) = ∥ bold_u - italic_D ( italic_E ( bold_u ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (24)

which implicitly parametrizes the data manifold via F(𝐮)=0,𝐮formulae-sequence𝐹𝐮0for-all𝐮F(\mathbf{u})=0,\forall\mathbf{u}\in\mathcal{M}italic_F ( bold_u ) = 0 , ∀ bold_u ∈ caligraphic_M. Notice that the linear version of autoencoder is exactly the principle component analysis we mentioned to identify linear subspace V𝑉Vitalic_V. Such techniques to identify data manifolds have been used in improving robustness against adversarial attacks [5, 6]. Besides parametrizing the data manifold, F𝐹Fitalic_F can be used as an indicator of distribution shift, i.e. for any state variable 𝐮𝐮\mathbf{u}bold_u, we use F(𝐮)=𝐮D(E(𝐮))22𝐹𝐮superscriptsubscriptnorm𝐮𝐷𝐸𝐮22F(\mathbf{u})=\left\|\mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2}italic_F ( bold_u ) = ∥ bold_u - italic_D ( italic_E ( bold_u ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT to measure its distribution shift w.r.t. the training data. This is used in all the figures in this paper to quantify distribution shift. Strictly speaking, training only ensures that the learned manifold is contained in the zero level-set of F𝐹Fitalic_F. Thus, the gradient of F𝐹Fitalic_F belongs to the normal space of the manifold \mathcal{M}caligraphic_M at the point 𝐮𝐮\mathbf{u}bold_u, i.e. F𝐹superscriptperpendicular-to\nabla F\in\mathcal{M}^{\perp}∇ italic_F ∈ caligraphic_M start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT. Minimizing the inner product with F𝐹\nabla F∇ italic_F now means partially minimizing the orthogonal components, which is the cause of the distribution shift in our formulation. This strictly generalizes the linear case Eq. 22, where the data subspace V𝑉Vitalic_V is given by span{𝐯1,𝐯2,,𝐯l}spansubscript𝐯1subscript𝐯2subscript𝐯𝑙\operatorname{span}\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l}\}roman_span { bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT }. Augmenting this basis to obtain an orthonormal basis {𝐯1,𝐯2,,𝐯l,𝐯l+1,,𝐯m}subscript𝐯1subscript𝐯2subscript𝐯𝑙subscript𝐯𝑙1subscript𝐯𝑚\{\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{l},\mathbf{v}_{l+1},\cdots,% \mathbf{v}_{m}\}{ bold_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } for the whole space, a choice of the function F𝐹Fitalic_F is given by

F(𝐮)=12i=l+1m(𝐮T𝐯i)2,𝐹𝐮12superscriptsubscript𝑖𝑙1𝑚superscriptsuperscript𝐮𝑇subscript𝐯𝑖2F(\mathbf{u})=\frac{1}{2}\sum_{i=l+1}^{m}(\mathbf{u}^{T}\mathbf{v}_{i})^{2},italic_F ( bold_u ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i = italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ( bold_u start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (25)

and the gradient of F𝐹Fitalic_F is

F(𝐮)=(i=l+1m𝐯i𝐯iT)𝐮,𝐹𝐮superscriptsubscript𝑖𝑙1𝑚subscript𝐯𝑖superscriptsubscript𝐯𝑖𝑇𝐮\nabla F(\mathbf{u})=\left(\sum_{i=l+1}^{m}\mathbf{v}_{i}\mathbf{v}_{i}^{T}% \right)\mathbf{u},∇ italic_F ( bold_u ) = ( ∑ start_POSTSUBSCRIPT italic_i = italic_l + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) bold_u , (26)

which is just the orthogonal projection onto V=span{𝐯l+1,𝐯l+2,,𝐯m}.superscript𝑉perpendicular-tospansubscript𝐯𝑙1subscript𝐯𝑙2subscript𝐯𝑚V^{\perp}=\operatorname{span}\{\mathbf{v}_{l+1},\mathbf{v}_{l+2},\cdots,% \mathbf{v}_{m}\}.italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT = roman_span { bold_v start_POSTSUBSCRIPT italic_l + 1 end_POSTSUBSCRIPT , bold_v start_POSTSUBSCRIPT italic_l + 2 end_POSTSUBSCRIPT , ⋯ , bold_v start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } . We can state the regularized loss function of our algorithm in general case:

minθl(θ):=minϕθ1Ni=1N(𝐲iϕθ(𝐮i)22+λ((F(𝐮i))T(𝐮i,ϕθ(𝐮i),ti))2).assignsubscript𝜃𝑙𝜃subscriptsubscriptitalic-ϕ𝜃1𝑁superscriptsubscript𝑖1𝑁superscriptsubscriptnormsubscript𝐲𝑖subscriptitalic-ϕ𝜃subscript𝐮𝑖22𝜆superscriptsuperscript𝐹subscript𝐮𝑖𝑇subscript𝐮𝑖subscriptitalic-ϕ𝜃subscript𝐮𝑖subscript𝑡𝑖2\min_{\theta}l(\theta):=\min_{\phi_{\theta}}\frac{1}{N}\sum_{i=1}^{N}\left(% \left\|\mathbf{y}_{i}-\phi_{\theta}(\mathbf{u}_{i})\right\|_{2}^{2}+\lambda% \left(\left(\nabla F(\mathbf{u}_{i})\right)^{T}\mathcal{L}(\mathbf{u}_{i},\phi% _{\theta}(\mathbf{u}_{i}),t_{i})\right)^{2}\right).roman_min start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_l ( italic_θ ) := roman_min start_POSTSUBSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( ∥ bold_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ( ( ∇ italic_F ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) . (27)

The first term above is the least squares term and the second term measures how far the estimated tangent vector t𝐮i=(𝐮i,ϕθ(𝐮i),ti)subscript𝑡subscript𝐮𝑖subscript𝐮𝑖subscriptitalic-ϕ𝜃subscript𝐮𝑖subscript𝑡𝑖\partial_{t}\mathbf{u}_{i}=\mathcal{L}(\mathbf{u}_{i},\phi_{\theta}(\mathbf{u}% _{i}),t_{i})∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) deviates from the tangent space of \mathcal{M}caligraphic_M at 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In the second term, F(𝐮)=𝐮D(E(𝐮))22𝐹𝐮superscriptsubscriptnorm𝐮𝐷𝐸𝐮22F(\mathbf{u})=\left\|\mathbf{u}-D(E(\mathbf{u}))\right\|_{2}^{2}italic_F ( bold_u ) = ∥ bold_u - italic_D ( italic_E ( bold_u ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and D,E𝐷𝐸D,Eitalic_D , italic_E are optimized via Eq. 23 and frozen during the training of ϕθsubscriptitalic-ϕ𝜃\phi_{\theta}italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT. Since the loss function is regularized by a term which approximately measures the deviation of the tangent vector (𝐮i,ϕθ(𝐮i),ti)subscript𝐮𝑖subscriptitalic-ϕ𝜃subscript𝐮𝑖subscript𝑡𝑖\mathcal{L}(\mathbf{u}_{i},\phi_{\theta}(\mathbf{u}_{i}),t_{i})caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) from the tangent space of the data manifold, we name our algorithm “MLHS with tangent-space regularized estimator”. The overall algorithm can be summarized in Algorithm 1.

Algorithm 1 MLHS with tangent-space regularized estimator
0:  {(𝐮1,𝐲1,t1),(𝐮2,𝐲2,t2),,(𝐮N,𝐲N,tN)}subscript𝐮1subscript𝐲1subscript𝑡1subscript𝐮2subscript𝐲2subscript𝑡2subscript𝐮𝑁subscript𝐲𝑁subscript𝑡𝑁\left\{(\mathbf{u}_{1},\mathbf{y}_{1},t_{1}),(\mathbf{u}_{2},\mathbf{y}_{2},t_% {2}),\cdots,(\mathbf{u}_{N},\mathbf{y}_{N},t_{N})\right\}{ ( bold_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ( bold_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) , ⋯ , ( bold_u start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) }, resolved model, penalty strength λ𝜆\lambdaitalic_λ.
1:  Learn a parameterized model which encode the structure of training data, i.e. Fη(𝐮)0,Fη(𝐮k)=0.formulae-sequencesubscript𝐹𝜂𝐮0subscript𝐹𝜂subscript𝐮𝑘0F_{\eta}(\mathbf{u})\geq 0,F_{\eta}(\mathbf{u}_{k})=0.italic_F start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_u ) ≥ 0 , italic_F start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = 0 .
2:  Freeze the parameters of this learned model.
3:  Introduce another surrogate model ϕθ(𝐮)subscriptitalic-ϕ𝜃𝐮\phi_{\theta}(\mathbf{u})italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u ).
4:  for k=1,2,,N𝑘12𝑁k=1,2,\cdots,Nitalic_k = 1 , 2 , ⋯ , italic_N do
5:     Predict the control variable 𝐲^k=ϕθ(𝐮k)subscript^𝐲𝑘subscriptitalic-ϕ𝜃subscript𝐮𝑘\widehat{\mathbf{y}}_{k}=\phi_{\theta}(\mathbf{u}_{k})over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ).
6:     Calculate the state variable after one-step iteration, i.e. 𝐮^k+1=𝐮k+Δt(𝐮k,𝐲^k,tk)subscript^𝐮𝑘1subscript𝐮𝑘Δ𝑡subscript𝐮𝑘subscript^𝐲𝑘subscript𝑡𝑘\widehat{\mathbf{u}}_{k+1}=\mathbf{u}_{k}+\Delta t\mathcal{L}(\mathbf{u}_{k},% \widehat{\mathbf{y}}_{k},t_{k})over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT = bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + roman_Δ italic_t caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ).
7:     Form the loss l(θ)=𝔼[𝐲k𝐲^k22+λ((F(𝐮k))T(𝐮k,𝐲^k),tk))2]l(\theta)=\mathbb{E}\left[\left\|\mathbf{y}_{k}-\widehat{\mathbf{y}}_{k}\right% \|_{2}^{2}+\lambda\left(\left(\nabla F(\mathbf{u}_{k})\right)^{T}\mathcal{L}(% \mathbf{u}_{k},\widehat{\mathbf{y}}_{k}),t_{k})\right)^{2}\right]italic_l ( italic_θ ) = blackboard_E [ ∥ bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ( ( ∇ italic_F ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_y end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ].
8:     Backpropogate to update θ𝜃\thetaitalic_θ.
9:  end for
9:  tangent-space regularized estimator: ϕθsubscriptitalic-ϕ𝜃\phi_{\theta}italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT.

Comparing to literature of distributionally robust optimization (DRO) [40] and numerical analysis of dynamical systems [48, 47], our work takes advantages from both. First, the tangent-space regularized estimator belongs to the family of DRO in the sense that it minimizes the data-driven module of the unresolved part over some perturbations of the data distribution. However, unlike the original DRO which considers all possible perturbations of the data distribution in its neighborhood under some metric, e.g. Wasserstein metric  [3], our regularization focus on those perturbations caused by the resolved components of the dynamics. These facilitate us to take advantage of the partial knowledge of the dynamics. As a result, the regularizer we proposed is more problem-specific Moreover, from the perspective of DRO, our method takes into consideration the behavior of estimator under certain perturbation normal to the data manifold, which is much more tractable than guarding against arbitrary perturbations in distribution space. In Section 4, we will implement other benchmarks with general regularizations and compare with our algorithm to illustrate the benefit of combining resolved model information into regularization.

Furthermore, we can understand our method through the lens of the stability in numerical analysis of differential equations. The classical result that the least squares estimator converges to the ground truth in the limit of large datasets can be understood as a consistency statement. Here, we show that this is insufficient, and in order to ensure convergence of the simulated trajectories one needs to also promote stability, and our proposed regularizer serves precisely this role. This regularization approach should be contrasted with recent works [54, 7, 25, 16] on learning dynamical models from data which build stability by specifying the model architecture.

3.3 Interpretation and analysis of the algorithm

Let us now quantify in the linear case, the gains of our algorithm over the ordinary least squares estimator. We first calculates the exact formula for two estimators. Due to the low rank structure, OLS algorithm allows infinitely many solutions. Throughout this section, we make the common choice of the minimum 2-norm solution.

Proposition 3.1.

Consider the linear dynamics Eq. 12, the training data of the variable 𝐮𝐮\mathbf{u}bold_u is arranged into a data matrix 𝐔m×N𝐔superscript𝑚𝑁\mathbf{U}\in\mathbb{R}^{m\times N}bold_U ∈ blackboard_R start_POSTSUPERSCRIPT italic_m × italic_N end_POSTSUPERSCRIPT. Moreover, we assume the column space of 𝐔𝐔\mathbf{U}bold_U is contained in a subspace Vm𝑉superscript𝑚V\subset\mathbb{R}^{m}italic_V ⊂ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT with associated projection operator PVsubscript𝑃𝑉P_{V}italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT. Then, the OLS estimator and tangent-space regularized estimator for the unresolved component Y=C*𝐔+ϵ𝑌superscript𝐶𝐔italic-ϵY=C^{*}\mathbf{U}+\epsilonitalic_Y = italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_U + italic_ϵ have the following form:

C^OLS=subscript^𝐶𝑂𝐿𝑆absent\displaystyle\widehat{C}_{OLS}=over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT = C*PV+ϵ𝐔,superscript𝐶subscript𝑃𝑉italic-ϵsuperscript𝐔\displaystyle\ C^{*}P_{V}+\epsilon\mathbf{U}^{\dagger},italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT + italic_ϵ bold_U start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT , (28)
C^TR=subscript^𝐶𝑇𝑅absent\displaystyle\widehat{C}_{TR}=over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT = (𝐈+λBTPVB)1(C*PV+ϵ𝐔λBTPVAPV).superscript𝐈𝜆superscript𝐵𝑇subscript𝑃superscript𝑉perpendicular-to𝐵1superscript𝐶subscript𝑃𝑉italic-ϵsuperscript𝐔𝜆superscript𝐵𝑇subscript𝑃superscript𝑉perpendicular-to𝐴subscript𝑃𝑉\displaystyle\ (\mathbf{I}+\lambda B^{T}P_{V^{\perp}}B)^{-1}(C^{*}P_{V}+% \epsilon\mathbf{U}^{\dagger}-\lambda B^{T}P_{V^{\perp}}AP_{V}).( bold_I + italic_λ italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_B ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT + italic_ϵ bold_U start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT - italic_λ italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_A italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ) .

Specifically, in the noiseless scenarios, C^OLSsubscript^𝐶𝑂𝐿𝑆\widehat{C}_{OLS}over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT recovers C*superscript𝐶C^{*}italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT on the subspace V𝑉Vitalic_V but vanishes on its orthogonal complement. In order to interpret our estimator, let us consider a simpler case with B=𝐈,A=𝟎formulae-sequence𝐵𝐈𝐴0B=\mathbf{I},A=\mathbf{0}italic_B = bold_I , italic_A = bold_0. This corresponds to the dynamics

t𝐮=C*𝐮,subscript𝑡𝐮superscript𝐶𝐮\partial_{t}\mathbf{u}=C^{*}\mathbf{u},∂ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT bold_u = italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT bold_u , (29)

and the regularized estimator is given by

C^TR=(𝐈+λPV)1(C*PV+ϵ𝐔)=(𝐈+λPV)1C^OLS.subscript^𝐶𝑇𝑅superscript𝐈𝜆subscript𝑃superscript𝑉perpendicular-to1superscript𝐶subscript𝑃𝑉italic-ϵsuperscript𝐔superscript𝐈𝜆subscript𝑃superscript𝑉perpendicular-to1subscript^𝐶𝑂𝐿𝑆\widehat{C}_{TR}=(\mathbf{I}+\lambda P_{V^{\perp}})^{-1}(C^{*}P_{V}+\epsilon% \mathbf{U}^{\dagger})=(\mathbf{I}+\lambda P_{V^{\perp}})^{-1}\widehat{C}_{OLS}.over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT = ( bold_I + italic_λ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT + italic_ϵ bold_U start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT ) = ( bold_I + italic_λ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT . (30)

Therefore, our estimator performs a weighted least squares regression which penalizes the direction perpendicular to the data subspace V𝑉Vitalic_V, as controlled by λ𝜆\lambdaitalic_λ. As the error along this normal direction is the main cause of the distribution shift, it is sensible to reduce it more than the error along the tangent direction, which cause deviations inside the manifold. This shows that the tangent-space regularized estimator does in fact mitigate distribution shift.

Next, we move on to analyze the accuracy gain of the whole simulation algorithm by proving a bound for the error of the simulation trajectory. Denote the statistical error of tangent-space regularized estimator by

lTR(C^)=𝔼((C*C^)𝐮22+λPV(A+BC^)𝐮22)subscript𝑙𝑇𝑅^𝐶𝔼superscriptsubscriptnormsuperscript𝐶^𝐶𝐮22𝜆superscriptsubscriptnormsubscript𝑃superscript𝑉perpendicular-to𝐴𝐵^𝐶𝐮22l_{TR}(\widehat{C})=\mathbb{E}\left(\left\|(C^{*}-\widehat{C})\mathbf{u}\right% \|_{2}^{2}+\lambda\left\|P_{V^{\perp}}(A+B\widehat{C})\mathbf{u}\right\|_{2}^{% 2}\right)italic_l start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT ( over^ start_ARG italic_C end_ARG ) = blackboard_E ( ∥ ( italic_C start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT - over^ start_ARG italic_C end_ARG ) bold_u ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_A + italic_B over^ start_ARG italic_C end_ARG ) bold_u ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) (31)

We define the following function which helps simplify the notation

Q(r,T)=erT1r,r,T>0.formulae-sequence𝑄𝑟𝑇superscript𝑒𝑟𝑇1𝑟𝑟𝑇0Q(r,T)=\frac{e^{rT}-1}{r},\quad r,T>0.italic_Q ( italic_r , italic_T ) = divide start_ARG italic_e start_POSTSUPERSCRIPT italic_r italic_T end_POSTSUPERSCRIPT - 1 end_ARG start_ARG italic_r end_ARG , italic_r , italic_T > 0 . (32)
Theorem 3.2.

Under the same setting as Proposition 3.1, assume both the OLS estimator and TR estimator C^OLS,C^TRsubscriptnormal-^𝐶𝑂𝐿𝑆subscriptnormal-^𝐶𝑇𝑅\widehat{C}_{OLS},\widehat{C}_{TR}over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT , over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT are optimized to have error bounded by δ𝛿\deltaitalic_δ so that lOLS(C^OLS),lTR(C^TR)<δsubscript𝑙𝑂𝐿𝑆subscriptnormal-^𝐶𝑂𝐿𝑆subscript𝑙𝑇𝑅subscriptnormal-^𝐶𝑇𝑅𝛿l_{OLS}(\widehat{C}_{OLS}),l_{TR}(\widehat{C}_{TR})<\deltaitalic_l start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ( over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ) , italic_l start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT ( over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT ) < italic_δ in Eq. 15 and Eq. 31. Suppose the true trajectory is simulated from an initial condition 𝐮(0)V𝐮0𝑉\mathbf{u}(0)\in Vbold_u ( 0 ) ∈ italic_V which follows the same distribution of the training data. Then, the errors of OLS and our algorithm are bounded respectively by

𝔼𝐮^TR(T)𝐮(T)2𝔼superscriptnormsubscript^𝐮𝑇𝑅𝑇𝐮𝑇2absent\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}_{TR}(T)-\mathbf{u}(T)\right% \|^{2}\leqblackboard_E ∥ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT ( italic_T ) - bold_u ( italic_T ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ δQ(A+BC^TRPV2+3,T)𝛿𝑄superscriptnorm𝐴𝐵subscript^𝐶𝑇𝑅subscript𝑃𝑉23𝑇\displaystyle\delta Q(\left\|A+B\widehat{C}_{TR}P_{V}\right\|^{2}+3,T)italic_δ italic_Q ( ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 3 , italic_T ) (33)
×(B2+Q(PV(A+BC^TR)2+2,T)λ),absentsuperscriptnorm𝐵2𝑄superscriptnormsubscript𝑃superscript𝑉perpendicular-to𝐴𝐵subscript^𝐶𝑇𝑅22𝑇𝜆\displaystyle\times\left(\left\|B\right\|^{2}+\frac{Q(\left\|P_{V^{\perp}}(A+B% \widehat{C}_{TR})\right\|^{2}+2,T)}{\lambda}\right),× ( ∥ italic_B ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG italic_Q ( ∥ italic_P start_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 2 , italic_T ) end_ARG start_ARG italic_λ end_ARG ) ,
𝔼𝐮^OLS(T)𝐮(T)2𝔼superscriptnormsubscript^𝐮𝑂𝐿𝑆𝑇𝐮𝑇2\displaystyle\mathbb{E}\left\|\widehat{\mathbf{u}}_{OLS}(T)-\mathbf{u}(T)% \right\|^{2}blackboard_E ∥ over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ( italic_T ) - bold_u ( italic_T ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT δQ(A+BC^OLSPV2+BC^OLS2+3,T)B2.absent𝛿𝑄superscriptnorm𝐴𝐵subscript^𝐶𝑂𝐿𝑆subscript𝑃𝑉2superscriptnorm𝐵subscript^𝐶𝑂𝐿𝑆23𝑇superscriptnorm𝐵2\displaystyle\leq\delta Q(\left\|A+B\widehat{C}_{OLS}P_{V}\right\|^{2}+\left\|% B\widehat{C}_{OLS}\right\|^{2}+3,T)\left\|B\right\|^{2}.≤ italic_δ italic_Q ( ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 3 , italic_T ) ∥ italic_B ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT .

The proof idea is illustrated in deriving Eq. 19 and we leave step-by-step derivations in the supplementary materials Section 3. To understand this result, let us focus on the exponential part of two error bounds as this will dominate the error when the simulation time t𝑡titalic_t is large. The exponent of our algorithm is bounded by (A+BC^TRPV2+3)tsuperscriptnorm𝐴𝐵subscript^𝐶𝑇𝑅subscript𝑃𝑉23𝑡(\left\|A+B\widehat{C}_{TR}P_{V}\right\|^{2}+3)t( ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 3 ) italic_t provided that the penalty parameter λ𝜆\lambdaitalic_λ is large enough. This is related to the intrinsic stability constant associated with the dynamics, underlying manifold. If we assume that two terms A+BC^OLSPV,A+BC^TRPVnorm𝐴𝐵subscript^𝐶𝑂𝐿𝑆subscript𝑃𝑉norm𝐴𝐵subscript^𝐶𝑇𝑅subscript𝑃𝑉\left\|A+B\widehat{C}_{OLS}P_{V}\right\|,\left\|A+B\widehat{C}_{TR}P_{V}\right\|∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ , ∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥ are close to each other, this suggests that our algorithm may have a slower error accumulation rate for large t𝑡titalic_t. As the training data lies on a subspace V𝑉Vitalic_V of the whole state space, we do not have any bound on the behavior of C^OLSsubscript^𝐶𝑂𝐿𝑆\widehat{C}_{OLS}over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT outside this subspace. For example, the orthogonal complement Vsuperscript𝑉perpendicular-toV^{\perp}italic_V start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT may contain an eigenspace of BC^OLS𝐵subscript^𝐶𝑂𝐿𝑆B\widehat{C}_{OLS}italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT with a large eigenvalue. If this is the case, BC^OLSnorm𝐵subscript^𝐶𝑂𝐿𝑆\left\|B\widehat{C}_{OLS}\right\|∥ italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_O italic_L italic_S end_POSTSUBSCRIPT ∥ may be significantly larger than A+BC^TRPVnorm𝐴𝐵subscript^𝐶𝑇𝑅subscript𝑃𝑉\left\|A+B\widehat{C}_{TR}P_{V}\right\|∥ italic_A + italic_B over^ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_T italic_R end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ∥, and the improvement of our algorithm may be more promising. In supplementary material LABEL:fig:linear-cmp, we provide a toy example which illustrates the effectiveness of our method comparing to OLS.

4 Numerical experiments

In this section, we present numerical experiments which illustrate the effect of the tangent-space regularized algorithm presented in Algorithm 1. We use the experiments to illustrate two phenomena. The first demonstrates the effectiveness of our algorithm, for which we test on the reaction-diffusion equation and the incompressible Navier-Stokes equation, and compare with benchmark algorithms using least squares, as well as simple regularizations that do not account for the structure of the resolved dynamics or the data manifold. For the second goal, we show that our method brings more significant improvements in problems with severe distribution shifts. This is demonstrated on the same family of PDEs, i.e. reaction-diffusion equation with varying diffusion coefficient and Navier-Stokes equation with different Reynolds numbers. These varying parameters can be understood - as we show numerically - as quantitative indicators of the severity of distribution shift. The implementation of our method and experiment reproduction is found in the repository [57].

4.1 Tangent-space regularized algorithm improves simulation accuracy

We first show that the proposed algorithm can be advantageous on several prototypical machine learning augmented scientific computing applications.

In the first experiment, we use the FitzHugh-Nagumo reaction-diffusion Eq. 11 equation as the test case of our algorithm. The resolved-unresolved structure is the coarse-fine grid correction structure discussed in Section 2.2. This helps us obtain high-fidelity simulation results by combining a low-fidelity numerical solver with a fast neural network surrogate model for correction. Here we choose α=0.01,β=1.0,D=(γ002γ),γ{0.05,0.10,0.15,0.20,0.25}formulae-sequence𝛼0.01formulae-sequence𝛽1.0formulae-sequence𝐷matrix𝛾002𝛾𝛾0.050.100.150.200.25\alpha=0.01,\beta=1.0,D=\begin{pmatrix}\gamma&0\\ 0&2\gamma\end{pmatrix},\gamma\in\{0.05,0.10,0.15,0.20,0.25\}italic_α = 0.01 , italic_β = 1.0 , italic_D = ( start_ARG start_ROW start_CELL italic_γ end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 2 italic_γ end_CELL end_ROW end_ARG ) , italic_γ ∈ { 0.05 , 0.10 , 0.15 , 0.20 , 0.25 } in the simulation which guarantee the existence of Turing patterns. The initial condition 𝐮0subscript𝐮0\mathbf{u}_{0}bold_u start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is generated by i.i.d. sampling from a normal distribution:

u(x,y,0),v(x,y,0)𝒩(0,1).similar-to𝑢𝑥𝑦0𝑣𝑥𝑦0𝒩01u(x,y,0),v(x,y,0)\sim\mathcal{N}(0,1).italic_u ( italic_x , italic_y , 0 ) , italic_v ( italic_x , italic_y , 0 ) ∼ caligraphic_N ( 0 , 1 ) . (34)

This system is known to have a dynamical spatial patterns where initial smaller “clusters” gradually transforms and merges to form bigger cluster according to the balance between diffusion and reaction effects [33]. In literature, the importance of the sampling procedure is also emphasized and various adaptive sampling method based on error and particle methods are studied in [12, 52, 58].

Now, we explain how the training data is prepared. Recall that in Eq. 6 we discussed the hybrid simulation structure for a correction procedure under a pair of coarse and fine mesh:

{𝐮k+12n=(𝐮k2n,𝐲k2n)=In2nfn(R2nn(𝐮k2n)))+𝐲k2n,𝐲k2n=ϕ(𝐮k2n),\left\{\begin{aligned} \mathbf{u}_{k+1}^{2n}&=\mathcal{L}(\mathbf{u}_{k}^{2n},% \mathbf{y}_{k}^{2n})=I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(\mathbf{u}_{k}^{2n})))+% \mathbf{y}_{k}^{2n},\\ \mathbf{y}_{k}^{2n}&=\phi(\mathbf{u}_{k}^{2n}),\end{aligned}\right.{ start_ROW start_CELL bold_u start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT end_CELL start_CELL = caligraphic_L ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) = italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) ) ) + bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT end_CELL start_CELL = italic_ϕ ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ) , end_CELL end_ROW (35)

where the mapping ϕitalic-ϕ\phiitalic_ϕ between the fine grid state 𝐮k2nsuperscriptsubscript𝐮𝑘2𝑛\mathbf{u}_{k}^{2n}bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT to the correction term 𝐲k2nsuperscriptsubscript𝐲𝑘2𝑛\mathbf{y}_{k}^{2n}bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT is the unresolved model we desire to learn. Here In2n,R2nnsuperscriptsubscript𝐼𝑛2𝑛superscriptsubscript𝑅2𝑛𝑛I_{n}^{2n},R_{2n}^{n}italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT , italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT are fixed restriction and interpolation operators between two meshes, i.e.

R2nn:2n×2nn×n,𝐮ijn=𝐮2i,2j2n+𝐮2i,2j+12n+𝐮2i+1,2j2n+𝐮2i+1,2j+12n4,:superscriptsubscript𝑅2𝑛𝑛formulae-sequencesuperscript2𝑛2𝑛superscript𝑛𝑛superscriptsubscript𝐮𝑖𝑗𝑛superscriptsubscript𝐮2𝑖2𝑗2𝑛superscriptsubscript𝐮2𝑖2𝑗12𝑛superscriptsubscript𝐮2𝑖12𝑗2𝑛superscriptsubscript𝐮2𝑖12𝑗12𝑛4\displaystyle R_{2n}^{n}:\mathbb{R}^{2n\times 2n}\rightarrow\mathbb{R}^{n% \times n},\quad\mathbf{u}_{ij}^{n}=\frac{\mathbf{u}_{2i,2j}^{2n}+\mathbf{u}_{2% i,2j+1}^{2n}+\mathbf{u}_{2i+1,2j}^{2n}+\mathbf{u}_{2i+1,2j+1}^{2n}}{4},italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT : blackboard_R start_POSTSUPERSCRIPT 2 italic_n × 2 italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT , bold_u start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = divide start_ARG bold_u start_POSTSUBSCRIPT 2 italic_i , 2 italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT + bold_u start_POSTSUBSCRIPT 2 italic_i , 2 italic_j + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT + bold_u start_POSTSUBSCRIPT 2 italic_i + 1 , 2 italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT + bold_u start_POSTSUBSCRIPT 2 italic_i + 1 , 2 italic_j + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT end_ARG start_ARG 4 end_ARG , (36)
I2nn:n×n2n×2n,𝐮2i,2j2n=𝐮2i,2j+12n=𝐮2i+1,2j2n=𝐮2i+1,2j+12n=𝐮ijn.:superscriptsubscript𝐼2𝑛𝑛formulae-sequencesuperscript𝑛𝑛superscript2𝑛2𝑛superscriptsubscript𝐮2𝑖2𝑗2𝑛superscriptsubscript𝐮2𝑖2𝑗12𝑛superscriptsubscript𝐮2𝑖12𝑗2𝑛superscriptsubscript𝐮2𝑖12𝑗12𝑛superscriptsubscript𝐮𝑖𝑗𝑛\displaystyle I_{2n}^{n}:\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{2n\times 2% n},\quad\mathbf{u}_{2i,2j}^{2n}=\mathbf{u}_{2i,2j+1}^{2n}=\mathbf{u}_{2i+1,2j}% ^{2n}=\mathbf{u}_{2i+1,2j+1}^{2n}=\mathbf{u}_{ij}^{n}.italic_I start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT 2 italic_n × 2 italic_n end_POSTSUPERSCRIPT , bold_u start_POSTSUBSCRIPT 2 italic_i , 2 italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT 2 italic_i , 2 italic_j + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT 2 italic_i + 1 , 2 italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT 2 italic_i + 1 , 2 italic_j + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT .

To obtain training data for 𝐮nt,𝐮2ntsuperscriptsubscript𝐮𝑛𝑡superscriptsubscript𝐮2𝑛𝑡\mathbf{u}_{n}^{t},\mathbf{u}_{2n}^{t}bold_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , bold_u start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, we simulate Eq. 11 (with the same initial condition) on two meshes of sizes 64×64646464\times 6464 × 64 and 128×128128128128\times 128128 × 128 respectively, with time step Δt=0.01,Nstep=T/Δt=100formulae-sequenceΔ𝑡0.01subscript𝑁step𝑇Δ𝑡100\Delta t=0.01,N_{\text{step}}=T/\Delta t=100roman_Δ italic_t = 0.01 , italic_N start_POSTSUBSCRIPT step end_POSTSUBSCRIPT = italic_T / roman_Δ italic_t = 100, using a second order Crank-Nicolson scheme. It remains to calculate the labels for the training data, i.e. 𝐲k2n=ϕ(𝐮k2n)superscriptsubscript𝐲𝑘2𝑛italic-ϕsuperscriptsubscript𝐮𝑘2𝑛\mathbf{y}_{k}^{2n}=\phi(\mathbf{u}_{k}^{2n})bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT = italic_ϕ ( bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ). These are obtained by using the first half of the Eq. 36, i.e.

𝐲nt=𝐮2nt+1In2nfn(R2nn(𝐮2nt))),\mathbf{y}_{n}^{t}=\mathbf{u}_{2n}^{t+1}-I_{n}^{2n}\circ f_{n}(R_{2n}^{n}(% \mathbf{u}_{2n}^{t}))),bold_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = bold_u start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t + 1 end_POSTSUPERSCRIPT - italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n end_POSTSUPERSCRIPT ∘ italic_f start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( bold_u start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ) ) , (37)

Finally, we pair them together to form the training data {(𝐮n1,𝐲n1),,(𝐮nN,𝐲nN)}superscriptsubscript𝐮𝑛1superscriptsubscript𝐲𝑛1superscriptsubscript𝐮𝑛𝑁superscriptsubscript𝐲𝑛𝑁\left\{(\mathbf{u}_{n}^{1},\mathbf{y}_{n}^{1}),\cdots,(\mathbf{u}_{n}^{N},% \mathbf{y}_{n}^{N})\right\}{ ( bold_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) , ⋯ , ( bold_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT , bold_y start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ) }. The data here is different from the one introduced in Section 2.3 since the system is autonomous, so we do not include time indices. To summarize, the training data contains two patches of field component patterns of size 100×128×128100128128100\times 128\times 128100 × 128 × 128 corresponding to 𝐮,𝐯𝐮𝐯\mathbf{u},\mathbf{v}bold_u , bold_v respectively and two patches 𝐲u,𝐲vsubscript𝐲𝑢subscript𝐲𝑣\mathbf{y}_{u},\mathbf{y}_{v}bold_y start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , bold_y start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT of the same shape and generated according to Eq. 36.

The baseline algorithms we implemented to compare with ours is ϕOLSsubscriptitalic-ϕOLS\phi_{\text{OLS}}italic_ϕ start_POSTSUBSCRIPT OLS end_POSTSUBSCRIPT, which minimizes the mean-squared error on the training data, i.e.

ϕOLS=argmin𝔼𝐮ρ0ϕθ(𝐮)ϕ(𝐮)22,subscriptitalic-ϕOLSsubscript𝔼similar-to𝐮subscript𝜌0superscriptsubscriptnormsubscriptitalic-ϕ𝜃𝐮italic-ϕ𝐮22\phi_{\text{OLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{0}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2},italic_ϕ start_POSTSUBSCRIPT OLS end_POSTSUBSCRIPT = roman_arg roman_min blackboard_E start_POSTSUBSCRIPT bold_u ∼ italic_ρ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u ) - italic_ϕ ( bold_u ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (38)

where we use ρ0subscript𝜌0\rho_{0}italic_ρ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to denote the distribution of the field variable generated by certain simulation condition Eq. 34. The corresponding numerical results are labelled by “OLS”.

Refer to caption

Figure 3: Simulation of the reaction-diffusion equation and configurations at time step: 0, 100, 200, 300, 500, (a) ground truth; (b) simulated fluid field using OLS estimator; (c) simulated fluid field using TR estimator. (d) is the comparison of the ordinary least squares and regularized one on reaction-diffusion equation: the solid line represents the averaged error along the trajectory and the dash line represented the averaged distribution shift calculated using Eq. 24. The shadow part is the standard deviation in ten random experiments with different training and testing data. Both the error and the distribution shift increase at a lower speed in TR simulation than in OLS simulation. This suggests that the better performance of TR algorithm in reaction-diffusion equation is partially caused by improving the distribution shift issue.

We summarize the comparison results in Fig. 3. Here, solid lines represent the errors along the trajectories and dashed lines represent the distribution shift. The shaded part is the standard deviation over ten random experiments with different split of the training and testing data. As can be observed, compared to our method the OLS baseline incur greater errors and distribution shifts. Moreover, this difference increases with time. In contrast, consistent with our analysis in the simplified setting, our algorithm mitigates both the distribution shift and the trajectory error.

Next, we consider the Navier-Stokes equation Eq. 3 as the test case of our algorithm Algorithm 1. In this case, the resolved model is an explicit scheme to evolve the velocity field while unresolved model is calculated via a fast CNN surrogate, which can accelerate the simulation for large grid system. The simulation domain is rectangular with aspect ratio 1/4141/41 / 4. On the upper and lower boundary we impose no-slip boundary condition on the velocity and on the inlet, i.e. x=0𝑥0x=0italic_x = 0 we specify the velocity field 𝐮𝐮\mathbf{u}bold_u. Moreover, at the outlet, i.e. x=n𝑥𝑛x=nitalic_x = italic_n we pose zero-gradient conditions on both the horizontal velocity and pressure, i.e. pn=un=0𝑝𝑛𝑢𝑛0\frac{\partial p}{\partial n}=\frac{\partial u}{\partial n}=0divide start_ARG ∂ italic_p end_ARG start_ARG ∂ italic_n end_ARG = divide start_ARG ∂ italic_u end_ARG start_ARG ∂ italic_n end_ARG = 0. The inflow boundary is defined according to following functional forms:

𝐮(0,y,t)=(u(0,y,t)v(0,y,y))=(exp{50(yy0)2}sintexp{50(yy0)2})𝐮0𝑦𝑡matrix𝑢0𝑦𝑡𝑣0𝑦𝑦matrix50superscript𝑦subscript𝑦02𝑡50superscript𝑦subscript𝑦02\mathbf{u}(0,y,t)=\begin{pmatrix}u(0,y,t)\\ v(0,y,y)\end{pmatrix}=\begin{pmatrix}\exp\{-50(y-y_{0})^{2}\}\\ \sin t\cdot\exp\{-50(y-y_{0})^{2}\}\end{pmatrix}bold_u ( 0 , italic_y , italic_t ) = ( start_ARG start_ROW start_CELL italic_u ( 0 , italic_y , italic_t ) end_CELL end_ROW start_ROW start_CELL italic_v ( 0 , italic_y , italic_y ) end_CELL end_ROW end_ARG ) = ( start_ARG start_ROW start_CELL roman_exp { - 50 ( italic_y - italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } end_CELL end_ROW start_ROW start_CELL roman_sin italic_t ⋅ roman_exp { - 50 ( italic_y - italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT } end_CELL end_ROW end_ARG ) (39)

The training configuration is calculated using the projection method [10] with staggered grid, i.e. the pressure is placed at the center of each cell and horizontal (vertical) velocity is placed at vertical (horizontal) edge of each cell. The jet location y0subscript𝑦0y_{0}italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is selected from 0.30.30.30.3 to 0.70.70.70.7 uniformly.

Again, we compare the performance of our algorithm and the OLS method. As illustrated in Fig. 4, naively using OLS to estimate the unresolved models will generally lead to error blow-up, whereas using our regularized estimator we can obtain a reasonable simulation along the whole time period. Moreover, the trend of the trajectory error fits that of the distribution shift. In particular, the error and distribution shift blow-ups occur simultaneously. Our algorithm again decreases the trajectory error by mitigating the problem of distribution shift.

Refer to caption

Figure 4: Simulation of the Navier-Stokes equation and configurations at time step: 0, 60, 120, 180, 240, (a) ground truth; (b) simulated fluid velocity field using OLS estimator; (c) simulated fluid velocity field using TR estimator; (d) the comparison of the OLS estimator and the regularized one. The solid lines represent the trajectory error and the dashed lines represent distribution shift calculated using Eq. 24. We observe that the OLS estimator leads to error (and DS) blow-up at around 270 time steps, while our method remains stable. Again, the trends of the error and the distribution shift are highly correlated, and our method that controls distribution shift leads to error control, consistent with our analyses.

The machine-learning augmented NS simulations appear to have a greater degree of distribution shift, and our method has a relatively stronger benefit in this case. This suggests that our method is most beneficial in problems where the intrinsic distribution shift is severe. In the next subsection, we will study this more systematically by testing our method on the reaction-diffusion equations with varying diffusion coefficient γ𝛾\gammaitalic_γ and Navier-Stokes equations with varying Reynolds numbers.

4.2 Performance under distribution shift of different magnitudes

We now demonstrate concretely that our method brings more significant improvements when the intrinsic distribution shift is more severe. Intuitively speaking, since our algorithm penalizes the deviation of the simulated trajectories to the data manifold while OLS totally ignores this, it would be reasonable to expect that our algorithm will outperform OLS more in the problems where such deviations are large. A sanity check showing different distribution shift extents under different parameters is provided in the supplementary materials LABEL:fig:NSRe-ds.

Here, we introduce two more baselines for ablation studies. Comparing with the OLS algorithm, these algorithms both add some regularization to the loss function while their regularizations are general and not dynamics-specific compared to our regularization. Consequently, any improved performance in our algorithm over these regularized algorithms suggests that targeted regularization, which takes into account the structure of the resolved part, effectively balances accuracy and stability.

The first baseline regularizes the original least squares objective by a term which quantifies some complexity the network model, e.g.

ϕmOLS=argmin𝔼𝐮ρ0ϕθ(𝐮)ϕ(𝐮)22+λθ22.subscriptitalic-ϕmOLSsubscript𝔼similar-to𝐮subscript𝜌0superscriptsubscriptnormsubscriptitalic-ϕ𝜃𝐮italic-ϕ𝐮22𝜆superscriptsubscriptnorm𝜃22\phi_{\text{mOLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{0}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2}+\lambda\left\|\theta% \right\|_{2}^{2}.italic_ϕ start_POSTSUBSCRIPT mOLS end_POSTSUBSCRIPT = roman_arg roman_min blackboard_E start_POSTSUBSCRIPT bold_u ∼ italic_ρ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u ) - italic_ϕ ( bold_u ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_λ ∥ italic_θ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (40)

Here the naive L2superscript𝐿2L^{2}italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT regularized algorithm is implemented by adjusting the weight decay [31] parameter of the Adam solver during the training process. The corresponding numerical results are labelled by “mOLS” (“m” for modified).

The second baseline adds noise to the training data during training, which can be formulated as

ϕaOLS=argmin𝔼𝐮ρξϕθ(𝐮)ϕ(𝐮)22.subscriptitalic-ϕaOLSsubscript𝔼similar-to𝐮subscript𝜌𝜉superscriptsubscriptnormsubscriptitalic-ϕ𝜃𝐮italic-ϕ𝐮22\phi_{\text{aOLS}}=\arg\min\mathbb{E}_{\mathbf{u}\sim\rho_{\xi}}\left\|\phi_{% \theta}(\mathbf{u})-\phi(\mathbf{u})\right\|_{2}^{2}.italic_ϕ start_POSTSUBSCRIPT aOLS end_POSTSUBSCRIPT = roman_arg roman_min blackboard_E start_POSTSUBSCRIPT bold_u ∼ italic_ρ start_POSTSUBSCRIPT italic_ξ end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ italic_ϕ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_u ) - italic_ϕ ( bold_u ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (41)

Notice that the training data follows the distribution ρϵ=ρ0*ξsubscript𝜌italic-ϵsubscript𝜌0𝜉\rho_{\epsilon}=\rho_{0}*\xiitalic_ρ start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT = italic_ρ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT * italic_ξ where *** denotes the convolution operator and ξ𝜉\xiitalic_ξ is some random noise which we take to be Gaussian random variable with an adjustable variance. In implementation, we generate the noise and add to the input during each epoch so the inputs are not the same across epochs. Moreover, this noise ξ𝜉\xiitalic_ξ should be distinguished from the observational noise ϵitalic-ϵ\epsilonitalic_ϵ we introduced in Section 3. This method is used in several previous work [46, 36] to improve stability and prevent the error from accumulating too fast. As this technique is similar to adversarial attack and data augmentation [44, 14, 13], we name this estimator and algorithm as “aOLS” (“a” stands for adversarial). We slightly abuse the terminology here by naming this random perturbation as some kind of adversarial attack.

Table 1: Performance on reaction-diffusion equation over 10 experiments.
γ=0.05𝛾0.05\gamma=0.05italic_γ = 0.05 γ=0.10𝛾0.10\gamma=0.10italic_γ = 0.10 γ=0.15𝛾0.15\gamma=0.15italic_γ = 0.15 γ=0.20𝛾0.20\gamma=0.20italic_γ = 0.20 γ=0.25𝛾0.25\gamma=0.25italic_γ = 0.25
grid size: 64×64646464\times 6464 × 64
OLS 8.40e+01 1.04e+01 8.99e+00 2.22e+00 6.78e-01
mOLS 8.90e+00 9.22e+00 1.41e+00 2.26e+00 9.98e-01
aOLS 3.57e+01 4.00e+01 2.40e+00 1.79e+00 1.06e+00
TR 1.23e-01 1.06e-01 1.38e-01 1.85e-01 2.15e-01
Diff 99.9% 99.0% 98.5% 91.7% 68.3%
grid size: 128×128128128128\times 128128 × 128
OLS 4.10e+02 8.18e+01 5.11e+01 4.88e+00 2.25e+00
mOLS 2.40e+01 3.20e+01 2.20e+00 1.07e+01 1.25e+00
aOLS 1.46e+02 2.58e+02 8.69e+00 3.30e+00 1.79e+00
TR 2.79e-01 1.66e-01 2.42e-01 1.81e-01 1.91e-01
Diff 99.9% 99.8% 99.5% 96.3% 91.5%

In Table 1, we test the algorithms under different diffusion coefficient γ=0.05𝛾0.05\gamma=0.05italic_γ = 0.05, 0.10,0.15,0.20,0.250.100.150.200.250.10,0.15,0.20,0.250.10 , 0.15 , 0.20 , 0.25. We calculate the relative trajectory error at time T=1000𝑇1000T=1000italic_T = 1000, i.e. 𝐮k𝐮^k𝐮knormsubscript𝐮𝑘subscript^𝐮𝑘normsubscript𝐮𝑘\frac{\left\|\mathbf{u}_{k}-\widehat{\mathbf{u}}_{k}\right\|}{\left\|\mathbf{u% }_{k}\right\|}divide start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ end_ARG start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ end_ARG over ten random split of the training and test data. The last row calculates the performance difference between our method and the OLS baseline. We observe that under almost all the settings our method out-performs OLS, mOLS, aOLS baselines, and comparing horizontally, we see that the relative improvement decreases with γ𝛾\gammaitalic_γ, as diffusion term mitigate the distribution shift.

We now consider the Navier-Stokes equation. Since the baseline method quickly leads to error blow up, we use another comparison criterion. We define a stopping time tK=argmaxt𝐮t𝐮^t𝐮tKsubscript𝑡𝐾subscript𝑡normsubscript𝐮𝑡subscript^𝐮𝑡normsubscript𝐮𝑡𝐾t_{K}=\arg\max_{t}\frac{\left\|\mathbf{u}_{t}-\widehat{\mathbf{u}}_{t}\right\|% }{\left\|\mathbf{u}_{t}\right\|}\leq Kitalic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT divide start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - over^ start_ARG bold_u end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ end_ARG start_ARG ∥ bold_u start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ end_ARG ≤ italic_K where K𝐾Kitalic_K is a error threshold to be determined during the experiments. We calculate the first time the trajectory error reaches a threshold under different Reynolds numbers and different mesh sizes. The results are shown in Table 2. Under almost all the scenarios, our method outperforms all the baseline algorithms by a larger tKsubscript𝑡𝐾t_{K}italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT. This validates the effectiveness of our algorithm. Comparing Table 2 horizontally, one finds that the improvement of our method is also increasing with respect to the Reynolds number. As the flow field generally becomes more complex (and sensitive) with increasing Reynolds number, our method is expected to bring bigger improvements. This is consistent with our experiments.

Table 2: Comparison of stopping time tKsubscript𝑡𝐾t_{K}italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT on Navier-Stokes equation over 10 experiments.
Re=100𝑅𝑒100Re=100italic_R italic_e = 100 Re=200𝑅𝑒200Re=200italic_R italic_e = 200 Re=300𝑅𝑒300Re=300italic_R italic_e = 300 Re=400𝑅𝑒400Re=400italic_R italic_e = 400 Re=500𝑅𝑒500Re=500italic_R italic_e = 500
grid size: 128×3212832128\times 32128 × 32
OLS 435±15plus-or-minus43515435\pm 15435 ± 15 301±20plus-or-minus30120301\pm 20301 ± 20 240±12plus-or-minus24012240\pm 12240 ± 12 177±17plus-or-minus17717177\pm 17177 ± 17 120±7plus-or-minus1207120\pm 7120 ± 7
mOLS 420±12plus-or-minus42012420\pm 12420 ± 12 333±14plus-or-minus33314333\pm 14333 ± 14 254±12plus-or-minus25412254\pm 12254 ± 12 204±9plus-or-minus2049204\pm 9204 ± 9 210±7plus-or-minus2107210\pm 7210 ± 7
aOLS 477±25plus-or-minus47725477\pm 25477 ± 25 392±31plus-or-minus39231392\pm 31392 ± 31 450±23plus-or-minus45023450\pm 23450 ± 23 239±25plus-or-minus23925239\pm 25239 ± 25 321±20plus-or-minus32120321\pm 20321 ± 20
TR 𝟓𝟎𝟔±𝟏𝟖plus-or-minus50618\mathbf{506\pm 18}bold_506 ± bold_18 𝟒𝟗𝟕±𝟐𝟖plus-or-minus49728\mathbf{497\pm 28}bold_497 ± bold_28 𝟒𝟖𝟐±𝟐𝟗plus-or-minus48229\mathbf{482\pm 29}bold_482 ± bold_29 𝟒𝟒𝟐±𝟐𝟏plus-or-minus44221\mathbf{442\pm 21}bold_442 ± bold_21 𝟒𝟓𝟐±𝟏𝟑plus-or-minus45213\mathbf{452\pm 13}bold_452 ± bold_13
grid size: 256×6425664256\times 64256 × 64
OLS 602±38plus-or-minus60238602\pm 38602 ± 38 571±35plus-or-minus57135571\pm 35571 ± 35 355±25plus-or-minus35525355\pm 25355 ± 25 301±11plus-or-minus30111301\pm 11301 ± 11 148±22plus-or-minus14822148\pm 22148 ± 22
mOLS 623±18plus-or-minus62318623\pm 18623 ± 18 531±26plus-or-minus53126531\pm 26531 ± 26 489±20plus-or-minus48920489\pm 20489 ± 20 417±10plus-or-minus41710417\pm 10417 ± 10 298±9plus-or-minus2989298\pm 9298 ± 9
aOLS 𝟔𝟓𝟏±𝟒𝟖plus-or-minus65148\mathbf{651\pm 48}bold_651 ± bold_48 601±42plus-or-minus60142601\pm 42601 ± 42 𝟓𝟗𝟓±𝟑𝟗plus-or-minus59539\mathbf{595\pm 39}bold_595 ± bold_39 378±31plus-or-minus37831378\pm 31378 ± 31 398±32plus-or-minus39832398\pm 32398 ± 32
TR 636±35plus-or-minus63635636\pm 35636 ± 35 𝟔𝟎𝟐±𝟐𝟖plus-or-minus60228\mathbf{602\pm 28}bold_602 ± bold_28 585±23plus-or-minus58523585\pm 23585 ± 23 𝟒𝟐𝟐±𝟑𝟓plus-or-minus42235\mathbf{422\pm 35}bold_422 ± bold_35 𝟒𝟎𝟐±𝟐𝟖plus-or-minus40228\mathbf{402\pm 28}bold_402 ± bold_28

5 Conclusion

In this paper, we establish a theoretical framework for machine-learning augmented hybrid simulation problems, where a data-driven surrogate is used to accelerate traditional simulation methods. We identify the cause and effect of distribution shift, i.e. the empirically observed phenomenon that the simulated dynamics may be driven away from the support of the training data due to systematic errors introduced by the data-driven surrogate, magnified by the resolved components of the dynamics. Based on this, we propose a tangent-space regularized algorithm for training the surrogate for the unresolved part, which incorporates the resolved model information to control deviations from the true data manifold. In the case of linear dynamics, we show our algorithm is provably better than the traditional training method based on ordinary least squares. Then, we validate our algorithm in numerical experiments, including Turing instabilities in reaction-diffusion and fluid flow. In both cases, our method outperforms baselines by better mitigating distribution shift, thus reducing trajectory error.

Acknowledgements

The research work presented is supported by the National Research Foundation, Singapore, under the NRF fellowship (project No. NRF-NRFF13-2021-0005).

References

  • [1] G. Alfonsi, Reynolds-averaged navier–stokes equations for turbulence modeling, Applied Mechanics Reviews, 62 (2009).
  • [2] S. Arisaka and Q. Li, Principled acceleration of iterative numerical methods using machine learning, in Proceedings of the 40th International Conference on Machine Learning, ICML’23, JMLR.org, 2023.
  • [3] J. Blanchet, K. Murthy, and N. Si, Confidence regions in wasserstein distributionally robust estimation, Biometrika, 109 (2022), pp. 295–315.
  • [4] W. L. Briggs, V. E. Henson, and S. F. McCormick, A multigrid tutorial, SIAM, 2000.
  • [5] Z. Chen, Q. Li, and Z. Zhang, Towards robust neural networks via close-loop control, arXiv preprint arXiv:2102.01862, (2021).
  • [6] Z. Chen, Q. Li, and Z. Zhang, Self-healing robust neural networks via closed-loop control, Journal of Machine Learning Research, 23 (2022), pp. 1–54.
  • [7] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho, Lagrangian neural networks, arXiv preprint arXiv:2003.04630, (2020).
  • [8] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma, A survey on ensemble learning, Frontiers of Computer Science, 14 (2020), pp. 241–258.
  • [9] K. Duraisamy, G. Iaccarino, and H. Xiao, Turbulence modeling in the age of data, Annual review of fluid mechanics, 51 (2019), pp. 357–377.
  • [10] W. E and J.-G. Liu, Projection method i: convergence and numerical boundary layers, SIAM journal on numerical analysis, (1995), pp. 1017–1057.
  • [11] A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, A brief review of domain adaptation, Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020, (2021), pp. 877–894.
  • [12] Z. Gao, L. Yan, and T. Zhou, Failure-informed adaptive sampling for pinns, SIAM Journal on Scientific Computing, 45 (2023), pp. A1971–A1994, https://doi.org/10.1137/22M1527763, https://doi.org/10.1137/22M1527763, https://arxiv.org/abs/https://doi.org/10.1137/22M1527763.
  • [13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. http://www.deeplearningbook.org.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds., vol. 27, Curran Associates, Inc., 2014, https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial networks, Communications of the ACM, 63 (2020), pp. 139–144.
  • [16] S. Greydanus, M. Dzamba, and J. Yosinski, Hamiltonian neural networks, Advances in neural information processing systems, 32 (2019).
  • [17] J.-L. Guermond, P. Minev, and J. Shen, An overview of projection methods for incompressible flows, Computer methods in applied mechanics and engineering, 195 (2006), pp. 6011–6045.
  • [18] X. Guo, Z. Xia, and S. Chen, Computing mean fields with known reynolds stresses at steady state, Theoretical and Applied Mechanics Letters, 11 (2021), p. 100244.
  • [19] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, R. Tibshirani, and J. Friedman, Overview of supervised learning, The elements of statistical learning: Data mining, inference, and prediction, (2009), pp. 9–41.
  • [20] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The elements of statistical learning: data mining, inference, and prediction, vol. 2, Springer, 2009.
  • [21] R. Huang, R. Li, and Y. Xi, Learning optimal multigrid smoothers via neural networks, SIAM Journal on Scientific Computing, (2022), pp. S199–S225.
  • [22] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [23] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, et al., Wilds: A benchmark of in-the-wild distribution shifts, in International Conference on Machine Learning, PMLR, 2021, pp. 5637–5664.
  • [24] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, nature, 521 (2015), pp. 436–444.
  • [25] B. Lin, Q. Li, and W. Ren, A data driven method for computing quasipotentials, arXiv preprint arXiv:2012.09111, (2020).
  • [26] H.-W. Lin, E. R. Nocera, F. Olness, K. Orginos, J. Rojo, A. Accardi, C. Alexandrou, A. Bacchetta, G. Bozzi, J.-W. Chen, et al., Parton distributions and lattice qcd calculations: a community white paper, Progress in Particle and Nuclear Physics, 100 (2018), pp. 107–160.
  • [27] L. Lin, J. Lu, and L. Ying, Numerical methods for kohn–sham density functional theory, Acta Numerica, 28 (2019), pp. 405–539.
  • [28] J. Ling, A. Kurzawski, and J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance, Journal of Fluid Mechanics, 807 (2016), pp. 155–166.
  • [29] J. Ling and J. Templeton, Evaluation of machine learning algorithms for prediction of regions of high reynolds averaged navier stokes uncertainty, Physics of Fluids, 27 (2015), p. 085103.
  • [30] Y. Liu, W. Zhang, and Z. Xia, Analysis on numerical stability and convergence of rans turbulence models from the perspective of coupling modes, arXiv preprint arXiv:2110.08459, (2021).
  • [31] I. Loshchilov and F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, (2017).
  • [32] R. Maulik, O. San, A. Rasheed, and P. Vedula, Subgrid modelling for two-dimensional turbulence using neural networks, Journal of Fluid Mechanics, 858 (2019), pp. 122–144.
  • [33] J. D. Murray and J. D. Murray, Mathematical Biology: II: Spatial Models and Biomedical Applications, vol. 3, Springer, 2003.
  • [34] H. Narayanan and S. Mitter, Sample complexity of testing the manifold hypothesis, Advances in neural information processing systems, 23 (2010).
  • [35] C. Pedersen, L. Zanna, J. Bruna, and P. Perezhogin, Reliable coarse-grained turbulent simulations through combined offline learning and neural emulation, 2023, https://arxiv.org/abs/2307.13144.
  • [36] T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia, Learning mesh-based simulation with graph networks, CoRR, abs/2010.03409 (2020), https://arxiv.org/abs/2010.03409, https://arxiv.org/abs/2010.03409.
  • [37] F. Pichi, B. Moya, and J. S. Hesthaven, A graph convolutional autoencoder approach to model order reduction for parametrized pdes, arXiv preprint arXiv:2305.08573, (2023).
  • [38] S. B. Pope and S. B. Pope, Turbulent flows, Cambridge university press, 2000.
  • [39] S. V. Poroseva, J. D. Colmenares F, and S. M. Murman, On the accuracy of rans simulations with dns data, Physics of Fluids, 28 (2016), p. 115102.
  • [40] H. Rahimian and S. Mehrotra, Distributionally robust optimization: A review, arXiv preprint arXiv:1908.05659, (2019).
  • [41] A. Ramdas, J. Chen, M. J. Wainwright, and M. I. Jordan, Dagger: A sequential algorithm for fdr control on dags, arXiv preprint arXiv:1709.10250, (2017).
  • [42] P. Ren, C. Rao, Y. Liu, Z. Ma, Q. Wang, J.-X. Wang, and H. Sun, Physics-informed deep super-resolution for spatiotemporal data, arXiv preprint arXiv:2208.01462, (2022).
  • [43] J. R. Shewchuk, An introduction to the conjugate gradient method without the agonizing pain, 1994.
  • [44] C. Shorten and T. M. Khoshgoftaar, A survey on image data augmentation for deep learning, Journal of Big Data, 6 (2019), p. 60, https://doi.org/10.1186/s40537-019-0197-0, https://doi.org/10.1186/s40537-019-0197-0.
  • [45] J. P. Slotnick, A. Khodadoust, J. Alonso, D. Darmofal, W. Gropp, E. Lurie, and D. J. Mavriplis, Cfd vision 2030 study: a path to revolutionary computational aerosciences, 2014.
  • [46] K. Stachenfeld, D. B. Fielding, D. Kochkov, M. Cranmer, T. Pfaff, J. Godwin, C. Cui, S. Ho, P. Battaglia, and A. Sanchez-Gonzalez, Learned coarse models for efficient turbulence simulation, 2022, https://arxiv.org/abs/2112.15275.
  • [47] A. Stuart and A. R. Humphries, Dynamical systems and numerical analysis, vol. 2, Cambridge University Press, 1998.
  • [48] A. M. Stuart, Numerical analysis of dynamical systems, Acta numerica, 3 (1994), pp. 467–572.
  • [49] J. Tompson, K. Schlachter, P. Sprechmann, and K. Perlin, Accelerating eulerian fluid simulation with convolutional networks, in International Conference on Machine Learning, PMLR, 2017, pp. 3424–3433.
  • [50] J. Vanschoren, Meta-learning: A survey, arXiv preprint arXiv:1810.03548, (2018).
  • [51] R. Wang and R. Yu, Physics-guided deep learning for dynamical systems: A survey, arXiv preprint arXiv:2107.01272, (2021).
  • [52] Y. Wen, E. Vanden-Eijnden, and B. Peherstorfer, Coupling parameter and particle dynamics for adaptive sampling in neural galerkin schemes, arXiv preprint arXiv:2306.15630, (2023).
  • [53] J. Wu, H. Xiao, R. Sun, and Q. Wang, Reynolds-averaged navier–stokes equations with explicit data-driven reynolds stress closure can be ill-conditioned, Journal of Fluid Mechanics, 869 (2019), pp. 553–586.
  • [54] H. Yu, X. Tian, W. E, and Q. Li, Onsagernet: Learning stable and interpretable dynamics using a generalized onsager principle, Physical Review Fluids, 6 (2021), p. 114402.
  • [55] L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan, Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics, Physical review letters, 120 (2018), p. 143001.
  • [56] L. Zhang, D.-Y. Lin, H. Wang, R. Car, and E. Weinan, Active learning of uniformly accurate interatomic potentials for materials simulation, Physical Review Materials, 3 (2019), p. 023804.
  • [57] J. Zhao and Q. Li, Code repository for mitigating distribution shift in machine-learning augmented hybrid simulation, https://github.com/jiaxi98/TR.
  • [58] Z. Zhao and Q. Li, Adaptive sampling methods for learning dynamical systems, in Mathematical and Scientific Machine Learning, PMLR, 2022, pp. 335–350.
  • [59] Y. Zhiyin, Large-eddy simulation: Past, present and the future, Chinese journal of Aeronautics, 28 (2015), pp. 11–24.