Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

On instabilities in neural network-based physics simulators

Daniel Floryan dfloryan@uh.edu Department of Mechanical Engineering, University of Houston, Houston, TX 77204, USA
(June 18, 2024)
Abstract

When neural networks are trained from data to simulate the dynamics of physical systems, they encounter a persistent challenge: the long-time dynamics they produce are often unphysical or unstable. We analyze the origin of such instabilities when learning linear dynamical systems, focusing on the training dynamics. We make several analytical findings which empirical observations suggest extend to nonlinear dynamical systems. First, the rate of convergence of the training dynamics is uneven and depends on the distribution of energy in the data. As a special case, the dynamics in directions where the data have no energy cannot be learned. Second, in the unlearnable directions, the dynamics produced by the neural network depend on the weight initialization, and common weight initialization schemes can produce unstable dynamics. Third, injecting synthetic noise into the data during training adds damping to the training dynamics and can stabilize the learned simulator, though doing so undesirably biases the learned dynamics. For each contributor to instability, we suggest mitigative strategies. We also highlight important differences between learning discrete-time and continuous-time dynamics, and discuss extensions to nonlinear systems.

I Introduction

The ability to accurately and efficiently predict the behavior of a dynamical system is fundamental to science and engineering. As data have become increasingly available, machine learning has been leveraged to learn models that can accurately and efficiently simulate the dynamics of physical systems, both in cases when the true dynamics are unknown, or when they are known but are expensive to compute [18, 29, 15, 11, 20, 16, 23, 28, 3, 5]. Weather forecasting offers a prime example of this new approach, where decades of high-quality data [9] have been used to develop deep learning-based forecasters with accuracy comparable to traditional numerical weather prediction while offering significant speedups [19, 1, 13].

While accurate at short times, neural network-based physics simulators can produce unphysical or unstable predictions at long times [29, 20, 16, 24, 12, 2]. Although certain stabilization tricks have been identified, it is unclear when they may be employed and why they succeed. Here, we identify probable culprits for the origin of these instabilities by analyzing the training dynamics of neural networks tasked with emulating a dynamical system. Along the way, we propose mitigative strategies and rationalize stabilization tricks found in the literature.

To render the problem analytically tractable, we consider linear dynamical systems, which can be perfectly emulated by a linear single-layer neural network. Our analysis considers a prototypical neural network training scheme, consisting of the mean squared error loss function, Glorot weight initialization [7], and gradient descent to update the weights of the neural network. We analyze discrete- and continuous-time dynamical systems in turn, highlighting important differences that arise between these two model classes. Our analysis focuses on the effects of energy distribution, weight initialization, and noise on the training dynamics.

II Discrete-time dynamical systems

Consider the linear discrete-time dynamical system

xi+1=Axi,xin,An×n.formulae-sequencesubscript𝑥𝑖1𝐴subscript𝑥𝑖formulae-sequencesubscript𝑥𝑖superscript𝑛𝐴superscript𝑛𝑛x_{i+1}=Ax_{i},\qquad x_{i}\in\mathbb{R}^{n},A\in\mathbb{R}^{n\times n}.italic_x start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT = italic_A italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT . (1)

Suppose that A𝐴Aitalic_A is not known explicitly, and we want to learn an accurate approximation A^n×n^𝐴superscript𝑛𝑛\hat{A}\in\mathbb{R}^{n\times n}over^ start_ARG italic_A end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT from a dataset of pairs of snapshots, {(xi,xi#)}i=1msuperscriptsubscriptsubscript𝑥𝑖superscriptsubscript𝑥𝑖#𝑖1𝑚\{(x_{i},x_{i}^{\#})\}_{i=1}^{m}{ ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, that are generated by the dynamical system, so that

xi#=Axi,i=1,,m.formulae-sequencesuperscriptsubscript𝑥𝑖#𝐴subscript𝑥𝑖𝑖1𝑚x_{i}^{\#}=Ax_{i},\qquad i=1,\ldots,m.italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT = italic_A italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_m . (2)

Assembling the data into the data matrices X=[x1xm]𝑋delimited-[]subscript𝑥1subscript𝑥𝑚X=[x_{1}\;\cdots\;x_{m}]italic_X = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ] and X#=[x1#xm#]superscript𝑋#delimited-[]superscriptsubscript𝑥1#superscriptsubscript𝑥𝑚#X^{\#}=[x_{1}^{\#}\;\cdots\;x_{m}^{\#}]italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT = [ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ⋯ italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ], which are related by X#=AXsuperscript𝑋#𝐴𝑋X^{\#}=AXitalic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT = italic_A italic_X, we seek the A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG that minimizes the mean squared error loss function L𝐿Litalic_L,

minA^12mnX#A^XF2.subscript^𝐴12𝑚𝑛superscriptsubscriptnormsuperscript𝑋#^𝐴𝑋𝐹2\min_{\hat{A}}\quad\frac{1}{2mn}\|X^{\#}-\hat{A}X\|_{F}^{2}.roman_min start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 italic_m italic_n end_ARG ∥ italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - over^ start_ARG italic_A end_ARG italic_X ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (3)

Here, F\|\cdot\|_{F}∥ ⋅ ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT is the Frobenius norm, the factor of 1212\frac{1}{2}divide start_ARG 1 end_ARG start_ARG 2 end_ARG is for convenience, and the factor of 1mn1𝑚𝑛\frac{1}{mn}divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG mirrors how this loss function is typically implemented in machine learning software packages. The minimizer may not be unique. The least-squares/minimum-norm solution is A^=X#X+^𝐴superscript𝑋#superscript𝑋\hat{A}=X^{\#}X^{+}over^ start_ARG italic_A end_ARG = italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT, where X+superscript𝑋X^{+}italic_X start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT is the pseudoinverse of X𝑋Xitalic_X. However, what is of interest here is what the commonly used gradient descent algorithm produces.

The gradient of L𝐿Litalic_L with respect to A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is

A^L=1mn(AA^)XXT.subscript^𝐴𝐿1𝑚𝑛𝐴^𝐴𝑋superscript𝑋𝑇\nabla_{\hat{A}}L=-\frac{1}{mn}(A-\hat{A})XX^{T}.∇ start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT italic_L = - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG ( italic_A - over^ start_ARG italic_A end_ARG ) italic_X italic_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT . (4)

In the limit of infinitesimal learning rate, gradient descent leads to continuous-time gradient flow dynamics

ddτA^=A^L,dd𝜏^𝐴subscript^𝐴𝐿\frac{\textrm{d}}{\textrm{d}\tau}\hat{A}=-\nabla_{\hat{A}}L,divide start_ARG d end_ARG start_ARG d italic_τ end_ARG over^ start_ARG italic_A end_ARG = - ∇ start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT italic_L , (5)

with τ𝜏\tauitalic_τ being a pseudo-time variable. The solution to this matrix ordinary differential equation is

A^(τ)=A+[A^(0)A]exp(1mnXXTτ),^𝐴𝜏𝐴delimited-[]^𝐴0𝐴1𝑚𝑛𝑋superscript𝑋𝑇𝜏\hat{A}(\tau)=A+[\hat{A}(0)-A]\exp\left(-\frac{1}{mn}XX^{T}\tau\right),over^ start_ARG italic_A end_ARG ( italic_τ ) = italic_A + [ over^ start_ARG italic_A end_ARG ( 0 ) - italic_A ] roman_exp ( - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG italic_X italic_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) , (6)

where A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) is the initialization of our linear single-layer neural network.

A coordinate change provides insight. Let X=UΣVT𝑋𝑈Σsuperscript𝑉𝑇X=U\Sigma V^{T}italic_X = italic_U roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT be the full singular value decomposition of X𝑋Xitalic_X, and A~=UTAU~𝐴superscript𝑈𝑇𝐴𝑈\tilde{A}=U^{T}AUover~ start_ARG italic_A end_ARG = italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_A italic_U the dynamics matrix in the basis of the left singular vectors. Then the training dynamics give

A~^(τ)=A~+[A~^(0)A~]exp(1mnΣΣTτ).^~𝐴𝜏~𝐴delimited-[]^~𝐴0~𝐴1𝑚𝑛ΣsuperscriptΣ𝑇𝜏\hat{\tilde{A}}(\tau)=\tilde{A}+[\hat{\tilde{A}}(0)-\tilde{A}]\exp\left(-\frac% {1}{mn}\Sigma\Sigma^{T}\tau\right).over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( italic_τ ) = over~ start_ARG italic_A end_ARG + [ over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( 0 ) - over~ start_ARG italic_A end_ARG ] roman_exp ( - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG roman_Σ roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) . (7)

Three points of interest arise. First, columns of A~^^~𝐴\hat{\tilde{A}}over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG corresponding to non-zero singular values converge to the corresponding columns of A~~𝐴\tilde{A}over~ start_ARG italic_A end_ARG—that is, the true dynamics—while those corresponding to zero singular values remain equal to their initial values. In other words, the dynamics are not learnable in directions in which the data have no energy. Second, the rate of convergence depends on how energy is distributed in the data. In directions in which the data have low energy, convergence to the true dynamics is slower. It is, therefore, more difficult to learn the dynamics of low-energy modes. The convergence rates can be made uniform by normalizing the data so that they have equal energy in all directions. Third, the initialization of the neural network will impact the learned dynamics in the unlearnable directions, or in all directions if gradient descent is stopped short of convergence (which is typical).

To illuminate the impact of the initialization, we suppose the data are in a set that is invariant under the dynamics of our system. This is generally true for the large class of physical systems with dissipation, for which the long-time dynamics approach an invariant manifold [10, 6, 27, 4]. In our case, the true dynamics take the form

A~=[A~11A~120A~22],~𝐴matrixsubscript~𝐴11subscript~𝐴120subscript~𝐴22\tilde{A}=\begin{bmatrix}\tilde{A}_{11}&\tilde{A}_{12}\\ 0&\tilde{A}_{22}\end{bmatrix},over~ start_ARG italic_A end_ARG = [ start_ARG start_ROW start_CELL over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL start_CELL over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] , (8)

where the upper-left submatrix gives the dynamics in the invariant subspace containing the data, and the other columns correspond to the unlearnable directions. As τ𝜏\tau\rightarrow\inftyitalic_τ → ∞, the learned dynamics will be

A~^=[A~11A~^12(0)0A~^22(0)].^~𝐴matrixsubscript~𝐴11subscript^~𝐴1200subscript^~𝐴220\hat{\tilde{A}}=\begin{bmatrix}\tilde{A}_{11}&\hat{\tilde{A}}_{12}(0)\\ 0&\hat{\tilde{A}}_{22}(0)\end{bmatrix}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = [ start_ARG start_ROW start_CELL over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL start_CELL over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT ( 0 ) end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT ( 0 ) end_CELL end_ROW end_ARG ] . (9)

The eigenvalues of the learned dynamics are the union of the eigenvalues of A~11subscript~𝐴11\tilde{A}_{11}over~ start_ARG italic_A end_ARG start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT and the eigenvalues of A~^22(0)subscript^~𝐴220\hat{\tilde{A}}_{22}(0)over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT ( 0 ). If X𝑋Xitalic_X has rank r𝑟ritalic_r and A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) is initialized using the typical Glorot normal or uniform initializers, then as n𝑛n\rightarrow\inftyitalic_n → ∞, the distribution of the eigenvalues of A~^22(0)subscript^~𝐴220\hat{\tilde{A}}_{22}(0)over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT ( 0 ) converges almost surely to the uniform distribution on the disk of radius nrn𝑛𝑟𝑛\sqrt{\frac{n-r}{n}}square-root start_ARG divide start_ARG italic_n - italic_r end_ARG start_ARG italic_n end_ARG end_ARG centered at the origin [26]. For finite n𝑛nitalic_n, however, the eigenvalues can lie outside the unit circle, as shown in Figure 1, leaving the potential for unstable dynamics.

Refer to caption
Figure 1: Histogram of eigenvalues of an n×n𝑛𝑛n\times nitalic_n × italic_n matrix whose entries are generated using the Glorot normal initializer. 105/nsuperscript105𝑛10^{5}/n10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT / italic_n realizations of the random matrix were used. The unit circle is drawn with a dashed cyan line. ϕitalic-ϕ\phiitalic_ϕ gives the fraction of eigenvalues outside of the unit circle. The Glorot uniform initializer produces nearly identical histograms.

This point bears emphasizing. In physical systems with dissipation, the state of the system will quickly approach an invariant set. When we collect measurements of the state, those measurements will generally not include information outside of the invariant set due to its stability. Without such information, it is impossible to learn the dynamics outside of the invariant set; the learned dynamics in that region depend on the initialization. Ironically, the stability of the true dynamics can lead to unstable learned dynamics since there is no information available to erase the memory of the potentially unstable initialization.

Two remedies are clear. One is to restrict the learned model to remain in the invariant set. For linear systems, this amounts to projecting the system onto the subspace with non-zero singular values. The second remedy is to ensure that the initialization gives stable dynamics, which requires developing new weight initialization schemes. Gershgorin’s circle theorem provides one way to create a mostly random matrix whose eigenvalues are guaranteed to lie inside the unit circle, thereby giving stable dynamics. For example, by drawing the entries of A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) from the uniform distribution on 1n1[1,1]1𝑛111\frac{1}{n-1}[-1,1]divide start_ARG 1 end_ARG start_ARG italic_n - 1 end_ARG [ - 1 , 1 ] and setting the diagonal entries equal to zero, the eigenvalues of A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) are bounded by the unit circle. This bound is sharp (e.g., if all off-diagonal entries are equal to 1n11𝑛1\frac{1}{n-1}divide start_ARG 1 end_ARG start_ARG italic_n - 1 end_ARG then 1 is an eigenvalue) but almost surely conservative, as demonstrated in Figure 2. The distribution of eigenvalues can be expanded (but still bounded by the unit circle) by re-normalizing each row (or column) of A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) so that the row (column) sums of the absolute values of the entries in each row (column) are equal to 1.

Refer to caption
Figure 2: Histogram of eigenvalues of an n×n𝑛𝑛n\times nitalic_n × italic_n matrix whose entries are generated using the initializer based on Gershgorin’s circle theorem. 105/nsuperscript105𝑛10^{5}/n10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT / italic_n realizations of the random matrix were used. The unit circle is drawn with a dashed black line.

II.1 Noisy measurements

Curiously, it has been observed that injecting synthetic noise into the data during training can help stabilize learned physics simulators, though it is unclear why [29, 30, 22, 20, 24, 25].

Suppose our measurements are noisy, so that the data matrices are X+N𝑋𝑁X+Nitalic_X + italic_N instead of X𝑋Xitalic_X, and X#+N#superscript𝑋#superscript𝑁#X^{\#}+N^{\#}italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT + italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT instead of X#superscript𝑋#X^{\#}italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT. N𝑁Nitalic_N and N#superscript𝑁#N^{\#}italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT are random matrices of noise, with all entries assumed to be independent random variables with zero mean and variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, and independent of the noise-free data. It is likely that many columns of N#superscript𝑁#N^{\#}italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT are the same as columns of N𝑁Nitalic_N, but shifted over one column in location, since data are often gathered from long trajectories that provide many data. With noisy data, the gradient of the loss function with respect to A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is

A^L=1mn[(AA^)(X+N)+N#AN](X+N)T.subscript^𝐴𝐿1𝑚𝑛delimited-[]𝐴^𝐴𝑋𝑁superscript𝑁#𝐴𝑁superscript𝑋𝑁𝑇\nabla_{\hat{A}}L=-\frac{1}{mn}[(A-\hat{A})(X+N)+N^{\#}-AN](X+N)^{T}.∇ start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT italic_L = - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG [ ( italic_A - over^ start_ARG italic_A end_ARG ) ( italic_X + italic_N ) + italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - italic_A italic_N ] ( italic_X + italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT . (10)

In the basis of the left singular vectors of the noise-free data X𝑋Xitalic_X, the training dynamics give

A~^(τ)=A~+[A~^(0)A~]exp(1mn(ΣVT+UTN)(ΣVT+UTN)Tτ)+1mn(UTN#A~UTN)(VΣT+NTU)(1mn(ΣVT+UTN)(ΣVT+UTN)T)1×[exp(1mn(ΣVT+UTN)(ΣVT+UTN)Tτ)I],^~𝐴𝜏~𝐴delimited-[]^~𝐴0~𝐴1𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇𝜏1𝑚𝑛superscript𝑈𝑇superscript𝑁#~𝐴superscript𝑈𝑇𝑁𝑉superscriptΣ𝑇superscript𝑁𝑇𝑈superscript1𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇1delimited-[]1𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇𝜏𝐼\hat{\tilde{A}}(\tau)=\tilde{A}+[\hat{\tilde{A}}(0)-\tilde{A}]\exp\left(-\frac% {1}{mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T}\tau\right)\\ +\frac{1}{mn}(U^{T}N^{\#}-\tilde{A}U^{T}N)(V\Sigma^{T}+N^{T}U)\left(-\frac{1}{% mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T}\right)^{-1}\\ \times\left[\exp\left(-\frac{1}{mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^% {T}\tau\right)-I\right],start_ROW start_CELL over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( italic_τ ) = over~ start_ARG italic_A end_ARG + [ over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( 0 ) - over~ start_ARG italic_A end_ARG ] roman_exp ( - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) end_CELL end_ROW start_ROW start_CELL + divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG ( italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - over~ start_ARG italic_A end_ARG italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( italic_V roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_U ) ( - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL × [ roman_exp ( - divide start_ARG 1 end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) - italic_I ] , end_CELL end_ROW (11)

where we have assumed that (ΣVT+UTN)(ΣVT+UTN)TΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T}( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is invertible, which is generally true for mn𝑚𝑛m\geq nitalic_m ≥ italic_n. In the limit τ𝜏\tau\rightarrow\inftyitalic_τ → ∞,

A~^=A~+(UTN#A~UTN)(VΣT+NTU)((ΣVT+UTN)(ΣVT+UTN)T)1.^~𝐴~𝐴superscript𝑈𝑇superscript𝑁#~𝐴superscript𝑈𝑇𝑁𝑉superscriptΣ𝑇superscript𝑁𝑇𝑈superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇1\hat{\tilde{A}}=\tilde{A}+(U^{T}N^{\#}-\tilde{A}U^{T}N)(V\Sigma^{T}+N^{T}U)% \left((\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T}\right)^{-1}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG + ( italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - over~ start_ARG italic_A end_ARG italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( italic_V roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_U ) ( ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT . (12)

The expected value is

A~^=A~[Imσ2(ΣΣT+mσ2I)1],^~𝐴~𝐴delimited-[]𝐼𝑚superscript𝜎2superscriptΣsuperscriptΣ𝑇𝑚superscript𝜎2𝐼1\hat{\tilde{A}}=\tilde{A}[I-m\sigma^{2}(\Sigma\Sigma^{T}+m\sigma^{2}I)^{-1}],over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG [ italic_I - italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( roman_Σ roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ] , (13)

where we have used the fact that 𝔼[UTNNTU]=𝔼[NNT]=mσ2I𝔼delimited-[]superscript𝑈𝑇𝑁superscript𝑁𝑇𝑈𝔼delimited-[]𝑁superscript𝑁𝑇𝑚superscript𝜎2𝐼\mathbb{E}[U^{T}NN^{T}U]=\mathbb{E}[NN^{T}]=m\sigma^{2}Iblackboard_E [ italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_U ] = blackboard_E [ italic_N italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] = italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I. Rewriting the above expression provides clarity:

A~^=A~[σ12σ12+mσ2σr2σr2+mσ20].^~𝐴~𝐴matrixsuperscriptsubscript𝜎12superscriptsubscript𝜎12𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionsuperscriptsubscript𝜎𝑟2superscriptsubscript𝜎𝑟2𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression0missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression\hat{\tilde{A}}=\tilde{A}\begin{bmatrix}\frac{\sigma_{1}^{2}}{\sigma_{1}^{2}+m% \sigma^{2}}&&&&\\ &\ddots&&&\\ &&\frac{\sigma_{r}^{2}}{\sigma_{r}^{2}+m\sigma^{2}}&&\\ &&&0&\\ &&&&\ddots\end{bmatrix}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG [ start_ARG start_ROW start_CELL divide start_ARG italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ⋱ end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL divide start_ARG italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL ⋱ end_CELL end_ROW end_ARG ] . (14)

We see that when measurement noise is present, A~^^~𝐴\hat{\tilde{A}}over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG is a biased version of A~~𝐴\tilde{A}over~ start_ARG italic_A end_ARG. In particular, columns of A~^^~𝐴\hat{\tilde{A}}over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG corresponding to non-zero singular values converge to the corresponding columns of A~~𝐴\tilde{A}over~ start_ARG italic_A end_ARG, but biased by a multiplicative factor σi2/(σi2+mσ2)1superscriptsubscript𝜎𝑖2superscriptsubscript𝜎𝑖2𝑚superscript𝜎21\sigma_{i}^{2}/(\sigma_{i}^{2}+m\sigma^{2})\leq 1italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / ( italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ≤ 1, where σisubscript𝜎𝑖\sigma_{i}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT singular value of X𝑋Xitalic_X. This bias factor can be written as SNRi/(1+SNRi)𝑆𝑁subscript𝑅𝑖1𝑆𝑁subscript𝑅𝑖SNR_{i}/(1+SNR_{i})italic_S italic_N italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / ( 1 + italic_S italic_N italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), where SNRi=σi2/mσ2𝑆𝑁subscript𝑅𝑖superscriptsubscript𝜎𝑖2𝑚superscript𝜎2SNR_{i}=\sigma_{i}^{2}/m\sigma^{2}italic_S italic_N italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT is the signal-to-noise ratio in the direction of the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT singular vector. In contrast, without measurement noise, these columns of A~^^~𝐴\hat{\tilde{A}}over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG converge to the true columns of A~~𝐴\tilde{A}over~ start_ARG italic_A end_ARG. The columns corresponding to zero singular values converge to zero, while they stayed equal to their initial guess when there was no measurement noise. The measurement noise adds damping to the training dynamics, erasing the memory of the initialization in the unlearnable directions and replacing it with dynamics that are strongly stable. This is a highly desirable effect since, as previously explained, in physical systems with dissipation, the stabilizing effect of dissipation is what makes those directions unlearnable.

How fast is the memory of the initialization erased by the noise? The corresponding eigenvalues of the gradient flow dynamics are, in expectation, equal to σ2/nsuperscript𝜎2𝑛-\sigma^{2}/n- italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / italic_n. The rate of convergence depends on the strength of the noise and the dimension of the system, with weak noise and a large dimension of the system leading to slow convergence.

There is a tradeoff between the desirable and undesirable effects of noise: noise stabilizes a learned physics simulator, with stronger noise stabilizing the learned system more quickly, but stronger noise also leads to greater bias. This is illustrated in Figure 3 for a three-dimensional system, along with the effect of energy distribution. It may be possible to obtain stability while avoiding bias by selectively applying noise only in the unlearnable/zero-energy directions.

Refer to caption
Figure 3: Training dynamics of a three-dimensional system. The real parts of the eigenvalues are shown. The high-energy mode converges more quickly (blue; in time τenergy1subscript𝜏energy1\tau_{\text{energy1}}italic_τ start_POSTSUBSCRIPT energy1 end_POSTSUBSCRIPT) than the low-energy mode (red; in time τenergy2subscript𝜏energy2\tau_{\text{energy2}}italic_τ start_POSTSUBSCRIPT energy2 end_POSTSUBSCRIPT). Noise stabilizes the unlearnable and otherwise unstable mode (green), but biases the learnable dynamics.

III Continuous-time dynamical systems

Important differences arise in continuous-time dynamics. Consider the linear continuous-time dynamical system

ddtx=Ax,x(t)n,An×n.formulae-sequencedd𝑡𝑥𝐴𝑥formulae-sequence𝑥𝑡superscript𝑛𝐴superscript𝑛𝑛\frac{\textrm{d}}{\textrm{d}t}x=Ax,\qquad x(t)\in\mathbb{R}^{n},A\in\mathbb{R}% ^{n\times n}.divide start_ARG d end_ARG start_ARG d italic_t end_ARG italic_x = italic_A italic_x , italic_x ( italic_t ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_A ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT . (15)

As in the discrete-time case, suppose that A𝐴Aitalic_A is not known explicitly, and we want to learn an accurate approximation A^n×n^𝐴superscript𝑛𝑛\hat{A}\in\mathbb{R}^{n\times n}over^ start_ARG italic_A end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_n end_POSTSUPERSCRIPT from a dataset of pairs of snapshots separated by a time ΔtΔ𝑡\Delta troman_Δ italic_t, {(x(ti),x(ti+Δt))}i=1m={(xi,xi#)}i=1msuperscriptsubscript𝑥subscript𝑡𝑖𝑥subscript𝑡𝑖Δ𝑡𝑖1𝑚superscriptsubscriptsubscript𝑥𝑖superscriptsubscript𝑥𝑖#𝑖1𝑚\{(x(t_{i}),x(t_{i}+\Delta t))\}_{i=1}^{m}=\{(x_{i},x_{i}^{\#})\}_{i=1}^{m}{ ( italic_x ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_x ( italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + roman_Δ italic_t ) ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = { ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT, that are generated by the dynamical system, so that

xi#=eAΔtxi,i=1,,m.formulae-sequencesuperscriptsubscript𝑥𝑖#superscript𝑒𝐴Δ𝑡subscript𝑥𝑖𝑖1𝑚x_{i}^{\#}=e^{A\Delta t}x_{i},\qquad i=1,\ldots,m.italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT = italic_e start_POSTSUPERSCRIPT italic_A roman_Δ italic_t end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_m . (16)

Assembling the data into data matrices as before, we seek the A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG that minimizes the mean squared error L𝐿Litalic_L,

minA^12mnX#eA^ΔtXF2.subscript^𝐴12𝑚𝑛superscriptsubscriptnormsuperscript𝑋#superscript𝑒^𝐴Δ𝑡𝑋𝐹2\min_{\hat{A}}\quad\frac{1}{2mn}\|X^{\#}-e^{\hat{A}\Delta t}X\|_{F}^{2}.roman_min start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG 2 italic_m italic_n end_ARG ∥ italic_X start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - italic_e start_POSTSUPERSCRIPT over^ start_ARG italic_A end_ARG roman_Δ italic_t end_POSTSUPERSCRIPT italic_X ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (17)

Due to the presence of the matrix exponential, there is no simple expression for the gradient of L𝐿Litalic_L with respect to A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG, and the training dynamics are nonlinear in A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG. To gain insight, we expand the matrix exponential to 𝒪(Δt)𝒪Δ𝑡\mathcal{O}(\Delta t)caligraphic_O ( roman_Δ italic_t )—equivalent to solving the dynamics using the forward Euler method, making the learning problem equivalent to learning a residual network (scaled by ΔtΔ𝑡\Delta troman_Δ italic_t) [8]. Under this approximation, the gradient of the loss function is

A^L=Δt2mn(AA^)XXT,subscript^𝐴𝐿Δsuperscript𝑡2𝑚𝑛𝐴^𝐴𝑋superscript𝑋𝑇\nabla_{\hat{A}}L=-\frac{\Delta t^{2}}{mn}(A-\hat{A})XX^{T},∇ start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT italic_L = - divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG ( italic_A - over^ start_ARG italic_A end_ARG ) italic_X italic_X start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT , (18)

equal to that in the discrete-time setting but scaled by a factor Δt2Δsuperscript𝑡2\Delta t^{2}roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. The conclusions made in the discrete-time setting therefore extend to the continuous-time setting, but for the following exceptions.

First, the factor of Δt2Δsuperscript𝑡2\Delta t^{2}roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT in the gradient changes the rate of convergence of the training dynamics. Small ΔtΔ𝑡\Delta troman_Δ italic_t could lead to very slow convergence.

Second, for continuous-time dynamics, an instability arises when the real part of any eigenvalue of A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is positive. When A^(0)^𝐴0\hat{A}(0)over^ start_ARG italic_A end_ARG ( 0 ) is created using the Glorot initializer, each of its eigenvalues has a 50% of having a positive real part. The continuous-time problem is therefore more susceptible to instability. Gershgorin’s circle theorem can again be used to create a stable mostly random matrix by having the Gershgorin disks lie in the left half of the complex plane (or within the stable region of the numerical integrator begin used; for example, if using a residual network, the Gershgorin disks should be made to lie within the stability boundary of the forward Euler method).

Finally, the effects of measurement noise differ in the details. With noisy data, the gradient is

A^L=Δt2mn(AA^)(X+N)(X+N)TΔtmn(N#N)(X+N)T+Δt2mnAN(X+N)T.subscript^𝐴𝐿Δsuperscript𝑡2𝑚𝑛𝐴^𝐴𝑋𝑁superscript𝑋𝑁𝑇Δ𝑡𝑚𝑛superscript𝑁#𝑁superscript𝑋𝑁𝑇Δsuperscript𝑡2𝑚𝑛𝐴𝑁superscript𝑋𝑁𝑇\nabla_{\hat{A}}L=-\frac{\Delta t^{2}}{mn}(A-\hat{A})(X+N)(X+N)^{T}-\frac{% \Delta t}{mn}(N^{\#}-N)(X+N)^{T}+\frac{\Delta t^{2}}{mn}AN(X+N)^{T}.∇ start_POSTSUBSCRIPT over^ start_ARG italic_A end_ARG end_POSTSUBSCRIPT italic_L = - divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG ( italic_A - over^ start_ARG italic_A end_ARG ) ( italic_X + italic_N ) ( italic_X + italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - divide start_ARG roman_Δ italic_t end_ARG start_ARG italic_m italic_n end_ARG ( italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - italic_N ) ( italic_X + italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG italic_A italic_N ( italic_X + italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT . (19)

In the basis of the left singular vectors of the noise-free data X𝑋Xitalic_X, the training dynamics give

A~^(τ)=A~+[A~^(0)A~]exp(Δt2mn(ΣVT+UTN)(ΣVT+UTN)Tτ)+Δtmn[UTN#(I+A~Δt)UTN](VΣT+NTU)×(Δt2mn(ΣVT+UTN)(ΣVT+UTN)T)1×[exp(Δt2mn(ΣVT+UTN)(ΣVT+UTN)Tτ)I].^~𝐴𝜏~𝐴delimited-[]^~𝐴0~𝐴Δsuperscript𝑡2𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇𝜏Δ𝑡𝑚𝑛delimited-[]superscript𝑈𝑇superscript𝑁#𝐼~𝐴Δ𝑡superscript𝑈𝑇𝑁𝑉superscriptΣ𝑇superscript𝑁𝑇𝑈superscriptΔsuperscript𝑡2𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇1delimited-[]Δsuperscript𝑡2𝑚𝑛Σsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇𝜏𝐼\hat{\tilde{A}}(\tau)=\tilde{A}+[\hat{\tilde{A}}(0)-\tilde{A}]\exp\left(-\frac% {\Delta t^{2}}{mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T}\tau\right)\\ +\frac{\Delta t}{mn}[U^{T}N^{\#}-(I+\tilde{A}\Delta t)U^{T}N](V\Sigma^{T}+N^{T% }U)\\ \times\left(-\frac{\Delta t^{2}}{mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)% ^{T}\right)^{-1}\\ \times\left[\exp\left(-\frac{\Delta t^{2}}{mn}(\Sigma V^{T}+U^{T}N)(\Sigma V^{% T}+U^{T}N)^{T}\tau\right)-I\right].start_ROW start_CELL over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( italic_τ ) = over~ start_ARG italic_A end_ARG + [ over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG ( 0 ) - over~ start_ARG italic_A end_ARG ] roman_exp ( - divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) end_CELL end_ROW start_ROW start_CELL + divide start_ARG roman_Δ italic_t end_ARG start_ARG italic_m italic_n end_ARG [ italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - ( italic_I + over~ start_ARG italic_A end_ARG roman_Δ italic_t ) italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ] ( italic_V roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_U ) end_CELL end_ROW start_ROW start_CELL × ( - divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL × [ roman_exp ( - divide start_ARG roman_Δ italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_m italic_n end_ARG ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_τ ) - italic_I ] . end_CELL end_ROW (20)

In the limit τ𝜏\tau\rightarrow\inftyitalic_τ → ∞,

A~^=A~+1Δt[UTN#(I+A~Δt)UTN](VΣT+NTU)((ΣVT+UTN)(ΣVT+UTN)T)1.^~𝐴~𝐴1Δ𝑡delimited-[]superscript𝑈𝑇superscript𝑁#𝐼~𝐴Δ𝑡superscript𝑈𝑇𝑁𝑉superscriptΣ𝑇superscript𝑁𝑇𝑈superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁superscriptΣsuperscript𝑉𝑇superscript𝑈𝑇𝑁𝑇1\hat{\tilde{A}}=\tilde{A}+\frac{1}{\Delta t}[U^{T}N^{\#}-(I+\tilde{A}\Delta t)% U^{T}N](V\Sigma^{T}+N^{T}U)\left((\Sigma V^{T}+U^{T}N)(\Sigma V^{T}+U^{T}N)^{T% }\right)^{-1}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG + divide start_ARG 1 end_ARG start_ARG roman_Δ italic_t end_ARG [ italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N start_POSTSUPERSCRIPT # end_POSTSUPERSCRIPT - ( italic_I + over~ start_ARG italic_A end_ARG roman_Δ italic_t ) italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ] ( italic_V roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_N start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_U ) ( ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) ( roman_Σ italic_V start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_U start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_N ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT . (21)

The expected value is

A~^=A~[Imσ2(ΣΣT+mσ2I)1]mσ2Δt(ΣΣT+mσ2I)1.^~𝐴~𝐴delimited-[]𝐼𝑚superscript𝜎2superscriptΣsuperscriptΣ𝑇𝑚superscript𝜎2𝐼1𝑚superscript𝜎2Δ𝑡superscriptΣsuperscriptΣ𝑇𝑚superscript𝜎2𝐼1\hat{\tilde{A}}=\tilde{A}[I-m\sigma^{2}(\Sigma\Sigma^{T}+m\sigma^{2}I)^{-1}]-% \frac{m\sigma^{2}}{\Delta t}(\Sigma\Sigma^{T}+m\sigma^{2}I)^{-1}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG [ italic_I - italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( roman_Σ roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ] - divide start_ARG italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG roman_Δ italic_t end_ARG ( roman_Σ roman_Σ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT . (22)

Rewriting the above expression provides clarity:

A~^=A~[σ12σ12+mσ2σr2σr2+mσ20]1Δt[mσ2σ12+mσ2mσ2σr2+mσ21].^~𝐴~𝐴matrixsuperscriptsubscript𝜎12superscriptsubscript𝜎12𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionsuperscriptsubscript𝜎𝑟2superscriptsubscript𝜎𝑟2𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression0missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression1Δ𝑡matrix𝑚superscript𝜎2superscriptsubscript𝜎12𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression𝑚superscript𝜎2superscriptsubscript𝜎𝑟2𝑚superscript𝜎2missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression1missing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpressionmissing-subexpression\hat{\tilde{A}}=\tilde{A}\begin{bmatrix}\frac{\sigma_{1}^{2}}{\sigma_{1}^{2}+m% \sigma^{2}}&&&&\\ &\ddots&&&\\ &&\frac{\sigma_{r}^{2}}{\sigma_{r}^{2}+m\sigma^{2}}&&\\ &&&0&\\ &&&&\ddots\end{bmatrix}-\frac{1}{\Delta t}\begin{bmatrix}\frac{m\sigma^{2}}{% \sigma_{1}^{2}+m\sigma^{2}}&&&&\\ &\ddots&&&\\ &&\frac{m\sigma^{2}}{\sigma_{r}^{2}+m\sigma^{2}}&&\\ &&&1&\\ &&&&\ddots\end{bmatrix}.over^ start_ARG over~ start_ARG italic_A end_ARG end_ARG = over~ start_ARG italic_A end_ARG [ start_ARG start_ROW start_CELL divide start_ARG italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ⋱ end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL divide start_ARG italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL ⋱ end_CELL end_ROW end_ARG ] - divide start_ARG 1 end_ARG start_ARG roman_Δ italic_t end_ARG [ start_ARG start_ROW start_CELL divide start_ARG italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ⋱ end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL divide start_ARG italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL start_CELL ⋱ end_CELL end_ROW end_ARG ] . (23)

As in the discrete-time setting, noise creates a multiplicative bias factor. Additionally, there is an additive bias that can be substantial for small ΔtΔ𝑡\Delta troman_Δ italic_t. The columns corresponding to zero singular values converge to columns whose only non-zero entries are along the diagonal and are equal to 1/Δt1Δ𝑡-1/\Delta t- 1 / roman_Δ italic_t. Measurement noise again erases the memory of the initialization in the unlearnable directions and replaces it with stable dynamics, with smaller ΔtΔ𝑡\Delta troman_Δ italic_t leading to more stable dynamics. In addition to the tradeoffs noted in the discrete-time setting, in the continuous-time setting ΔtΔ𝑡\Delta troman_Δ italic_t also has a tradeoff: smaller ΔtΔ𝑡\Delta troman_Δ italic_t creates more stable dynamics but greater bias.

IV Discussion

Despite the importance of the long-term stability of learned physics simulators, theoretical insight into this issue is conspicuously missing. Our findings for linear dynamical systems constitute an important step towards a general theory, buoyed by nonlinear analogs with strong empirical support.

In the linear case, we showed that it is more difficult to learn low-energy dynamics due to the associated slower rates of convergence in the training dynamics. This seems to hold for nonlinear systems as well, for which it has been empirically observed that it is difficult to learn the dynamics of high wavenumbers in physical systems, which typically have low energy [2, 17]. Although the difficulty to learn the dynamics of high wavenumbers has previously been attributed to the spectral bias of neural networks [2, 32, 21], recent work on multi-stage neural networks supports that non-uniform distribution of energy and spectral bias both contribute to slow convergence [31]. The non-uniformity of energy could be addressed by first transforming the variables to a space in which the data are distributed isotropically, then learning the dynamics of the transformed variables.

We then showed that dynamics off of the data subspace cannot be learned, and what is learned in the complement of the data subspace depends on the weight initialization. This obviously holds for nonlinear systems, where the dynamics off of the data submanifold cannot be learned. What is less clear is whether a non-trivial weight initialization scheme can be designed so that the default dynamics are stable, as we have done here for linear dynamical systems. In the linear case, an alternative is to project the system onto the data subspace. In the nonlinear case, manifold learning methods can be used to project the system onto the data submanifold; doing so has, indeed, been found to stabilize the learned dynamics [16, 3, 5]. Another alternative is to add global damping to the system [29, 16, 14], though it is unclear how strong it should be.

Finally, we noted the empirical success of noise injection as a stabilizer when learning nonlinear dynamical systems, and showed why it works when learning linear dynamical systems. Adding noise to the data adds damping to the training dynamics. Damping is a generic mechanism that is likely to extend to the training dynamics of nonlinear systems. Furthermore, we showed that there is a tradeoff between the stabilization and bias that noise creates, and this tradeoff can be seen in empirical results for nonlinear systems [22]. The agreement with empirical results for nonlinear systems suggests that we have identified the correct mechanisms.

References

  • [1] K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.
  • [2] A. Chattopadhyay and P. Hassanzadeh. Long-term instabilities of deep learning-based digital twins of the climate system: The cause and a solution. arXiv preprint arXiv:2304.07029, 2023.
  • [3] B. Chen, K. Huang, S. Raghupathi, I. Chandratreya, Q. Du, and H. Lipson. Automated discovery of fundamental variables hidden in experimental data. Nature Computational Science, 2(7):433–442, 2022.
  • [4] C. R. Doering and J. D. Gibbon. Applied Analysis of the Navier-Stokes Equations. Number 12 in Cambridge Texts in Applied Mathematics. Cambridge University Press, 1995.
  • [5] D. Floryan and M. D. Graham. Data-driven discovery of intrinsic dynamics. Nature Machine Intelligence, 4(12):1113–1120, 2022.
  • [6] C. Foias, G. R. Sell, and R. Temam. Inertial manifolds for nonlinear evolutionary equations. Journal of Differential Equations, 73(2):309–353, 1988.
  • [7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR, 2010.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [9] H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers, A. Simmons, C. Soci, S. Abdalla, X. Abellan, G. Balsamo, P. Bechtold, G. Biavati, J. Bidlot, M. Bonavita, G. De Chiara, P. Dahlgren, D. Dee, M. Diamantakis, R. Dragani, J. Flemming, R. Forbes, M. Fuentes, A. Geer, A. Haimberger, S. Healy, R. J. Hogan, E. Hólm, M. Janisková, S. Keeley, P. Laloyaux, P. Lopez, C. Lupu, G. Radnoti, P. de Rosnay, I. Rozum, F. Vamborg, S. Villaume, and J.-N. Thépaut. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020.
  • [10] E. Hopf. A mathematical example displaying features of turbulence. Communications on Pure and Applied Mathematics, 1(4):303–322, 1948.
  • [11] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
  • [12] R. Keisler. Forecasting global weather with graph neural networks. arXiv preprint arXiv:2202.07575, 2022.
  • [13] R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023.
  • [14] A. J. Linot, J. W. Burby, Q. Tang, P. Balaprakash, M. D. Graham, and R. Maulik. Stabilized neural ordinary differential equations for long-time forecasting of dynamical systems. Journal of Computational Physics, 474:111838, 2023.
  • [15] A. J. Linot and M. D. Graham. Deep learning to discover and predict dynamics on an inertial manifold. Physical Review E, 101(6):062209, 2020.
  • [16] A. J. Linot and M. D. Graham. Data-driven reduced-order modeling of spatiotemporal chaos with neural ordinary differential equations. Chaos: An Interdisciplinary Journal of Nonlinear Science, 32(7), 2022.
  • [17] P. Lippe, B. Veeling, P. Perdikaris, R. Turner, and J. Brandstetter. PDE-refiner: Achieving accurate long rollouts with neural PDE solvers. Advances in Neural Information Processing Systems, 36, 2024.
  • [18] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120(2):024102, 2018.
  • [19] J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh, K. Kashinath, and A. Anandkumar. Fourcastnet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214, 2022.
  • [20] T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations, 2021.
  • [21] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310. PMLR, 2019.
  • [22] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
  • [23] K. Srinivasan, N. Coble, J. Hamlin, T. Antonsen, E. Ott, and M. Girvan. Parallel machine learning for forecasting the dynamics of complex networks. Physical Review Letters, 128(16):164101, 2022.
  • [24] K. Stachenfeld, D. B. Fielding, D. Kochkov, M. Cranmer, T. Pfaff, J. Godwin, C. Cui, S. Ho, P. Battaglia, and A. Sanchez-Gonzalez. Learned simulators for turbulence. In International Conference on Learning Representations, 2022.
  • [25] J. Su, J. Kempe, D. Fielding, N. Tsilivis, M. Cranmer, and S. Ho. Adversarial noise injection for learned turbulence simulations. In NeurIPS 2022 Workshop on Machine Learning and the Physical Sciences, 2022.
  • [26] T. Tao, V. Vu, and M. Krishnapur. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5):2023–2065, 2010.
  • [27] R. Temam and X. M. Wang. Estimates on the lowest dimension of inertial manifolds for the Kuramoto-Sivashinsky equation in the general case. Differential and Integral Equations, 7(3-4):1095–1108, 1994.
  • [28] P. R. Vlachas, G. Arampatzis, C. Uhler, and P. Koumoutsakos. Multiscale simulations of complex systems by learning their effective dynamics. Nature Machine Intelligence, 4(4):359–366, 2022.
  • [29] P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos. Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844, 2018.
  • [30] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and P. Koumoutsakos. Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics. Neural Networks, 126:191–217, 2020.
  • [31] Y. Wang and C.-Y. Lai. Multi-stage neural networks: Function approximator of machine precision. Journal of Computational Physics, page 112865, 2024.
  • [32] Z.-Q. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523, 2019.