On instabilities in neural network-based physics simulators
Daniel Floryan
dfloryan@uh.edu
Department of Mechanical Engineering, University of Houston, Houston, TX 77204, USA
(June 18, 2024)
Abstract
When neural networks are trained from data to simulate the dynamics of physical systems, they encounter a persistent challenge: the long-time dynamics they produce are often unphysical or unstable. We analyze the origin of such instabilities when learning linear dynamical systems, focusing on the training dynamics. We make several analytical findings which empirical observations suggest extend to nonlinear dynamical systems. First, the rate of convergence of the training dynamics is uneven and depends on the distribution of energy in the data. As a special case, the dynamics in directions where the data have no energy cannot be learned. Second, in the unlearnable directions, the dynamics produced by the neural network depend on the weight initialization, and common weight initialization schemes can produce unstable dynamics. Third, injecting synthetic noise into the data during training adds damping to the training dynamics and can stabilize the learned simulator, though doing so undesirably biases the learned dynamics. For each contributor to instability, we suggest mitigative strategies. We also highlight important differences between learning discrete-time and continuous-time dynamics, and discuss extensions to nonlinear systems.
I Introduction
The ability to accurately and efficiently predict the behavior of a dynamical system is fundamental to science and engineering.
As data have become increasingly available, machine learning has been leveraged to learn models that can accurately and efficiently simulate the dynamics of physical systems, both in cases when the true dynamics are unknown, or when they are known but are expensive to compute [18, 29, 15, 11, 20, 16, 23, 28, 3, 5]. Weather forecasting offers a prime example of this new approach, where decades of high-quality data [9] have been used to develop deep learning-based forecasters with accuracy comparable to traditional numerical weather prediction while offering significant speedups [19, 1, 13].
While accurate at short times, neural network-based physics simulators can produce unphysical or unstable predictions at long times [29, 20, 16, 24, 12, 2]. Although certain stabilization tricks have been identified, it is unclear when they may be employed and why they succeed. Here, we identify probable culprits for the origin of these instabilities by analyzing the training dynamics of neural networks tasked with emulating a dynamical system. Along the way, we propose mitigative strategies and rationalize stabilization tricks found in the literature.
To render the problem analytically tractable, we consider linear dynamical systems, which can be perfectly emulated by a linear single-layer neural network. Our analysis considers a prototypical neural network training scheme, consisting of the mean squared error loss function, Glorot weight initialization [7], and gradient descent to update the weights of the neural network. We analyze discrete- and continuous-time dynamical systems in turn, highlighting important differences that arise between these two model classes. Our analysis focuses on the effects of energy distribution, weight initialization, and noise on the training dynamics.
II Discrete-time dynamical systems
Consider the linear discrete-time dynamical system
(1)
Suppose that is not known explicitly, and we want to learn an accurate approximation from a dataset of pairs of snapshots, , that are generated by the dynamical system, so that
(2)
Assembling the data into the data matrices and , which are related by , we seek the that minimizes the mean squared error loss function ,
(3)
Here, is the Frobenius norm, the factor of is for convenience, and the factor of mirrors how this loss function is typically implemented in machine learning software packages. The minimizer may not be unique. The least-squares/minimum-norm solution is , where is the pseudoinverse of . However, what is of interest here is what the commonly used gradient descent algorithm produces.
The gradient of with respect to is
(4)
In the limit of infinitesimal learning rate, gradient descent leads to continuous-time gradient flow dynamics
(5)
with being a pseudo-time variable. The solution to this matrix ordinary differential equation is
(6)
where is the initialization of our linear single-layer neural network.
A coordinate change provides insight. Let be the full singular value decomposition of , and the dynamics matrix in the basis of the left singular vectors. Then the training dynamics give
(7)
Three points of interest arise. First, columns of corresponding to non-zero singular values converge to the corresponding columns of —that is, the true dynamics—while those corresponding to zero singular values remain equal to their initial values. In other words, the dynamics are not learnable in directions in which the data have no energy. Second, the rate of convergence depends on how energy is distributed in the data. In directions in which the data have low energy, convergence to the true dynamics is slower. It is, therefore, more difficult to learn the dynamics of low-energy modes. The convergence rates can be made uniform by normalizing the data so that they have equal energy in all directions. Third, the initialization of the neural network will impact the learned dynamics in the unlearnable directions, or in all directions if gradient descent is stopped short of convergence (which is typical).
To illuminate the impact of the initialization, we suppose the data are in a set that is invariant under the dynamics of our system. This is generally true for the large class of physical systems with dissipation, for which the long-time dynamics approach an invariant manifold [10, 6, 27, 4]. In our case, the true dynamics take the form
(8)
where the upper-left submatrix gives the dynamics in the invariant subspace containing the data, and the other columns correspond to the unlearnable directions. As , the learned dynamics will be
(9)
The eigenvalues of the learned dynamics are the union of the eigenvalues of and the eigenvalues of . If has rank and is initialized using the typical Glorot normal or uniform initializers, then as , the distribution of the eigenvalues of converges almost surely to the uniform distribution on the disk of radius centered at the origin [26]. For finite , however, the eigenvalues can lie outside the unit circle, as shown in Figure 1, leaving the potential for unstable dynamics.
This point bears emphasizing. In physical systems with dissipation, the state of the system will quickly approach an invariant set. When we collect measurements of the state, those measurements will generally not include information outside of the invariant set due to its stability. Without such information, it is impossible to learn the dynamics outside of the invariant set; the learned dynamics in that region depend on the initialization. Ironically, the stability of the true dynamics can lead to unstable learned dynamics since there is no information available to erase the memory of the potentially unstable initialization.
Two remedies are clear. One is to restrict the learned model to remain in the invariant set. For linear systems, this amounts to projecting the system onto the subspace with non-zero singular values. The second remedy is to ensure that the initialization gives stable dynamics, which requires developing new weight initialization schemes. Gershgorin’s circle theorem provides one way to create a mostly random matrix whose eigenvalues are guaranteed to lie inside the unit circle, thereby giving stable dynamics. For example, by drawing the entries of from the uniform distribution on and setting the diagonal entries equal to zero, the eigenvalues of are bounded by the unit circle. This bound is sharp (e.g., if all off-diagonal entries are equal to then 1 is an eigenvalue) but almost surely conservative, as demonstrated in Figure 2. The distribution of eigenvalues can be expanded (but still bounded by the unit circle) by re-normalizing each row (or column) of so that the row (column) sums of the absolute values of the entries in each row (column) are equal to 1.
II.1 Noisy measurements
Curiously, it has been observed that injecting synthetic noise into the data during training can help stabilize learned physics simulators, though it is unclear why [29, 30, 22, 20, 24, 25].
Suppose our measurements are noisy, so that the data matrices are instead of , and instead of . and are random matrices of noise, with all entries assumed to be independent random variables with zero mean and variance , and independent of the noise-free data. It is likely that many columns of are the same as columns of , but shifted over one column in location, since data are often gathered from long trajectories that provide many data. With noisy data, the gradient of the loss function with respect to is
(10)
In the basis of the left singular vectors of the noise-free data , the training dynamics give
(11)
where we have assumed that is invertible, which is generally true for . In the limit ,
(12)
The expected value is
(13)
where we have used the fact that . Rewriting the above expression provides clarity:
(14)
We see that when measurement noise is present, is a biased version of . In particular, columns of corresponding to non-zero singular values converge to the corresponding columns of , but biased by a multiplicative factor , where is the singular value of . This bias factor can be written as , where is the signal-to-noise ratio in the direction of the singular vector. In contrast, without measurement noise, these columns of converge to the true columns of . The columns corresponding to zero singular values converge to zero, while they stayed equal to their initial guess when there was no measurement noise. The measurement noise adds damping to the training dynamics, erasing the memory of the initialization in the unlearnable directions and replacing it with dynamics that are strongly stable. This is a highly desirable effect since, as previously explained, in physical systems with dissipation, the stabilizing effect of dissipation is what makes those directions unlearnable.
How fast is the memory of the initialization erased by the noise? The corresponding eigenvalues of the gradient flow dynamics are, in expectation, equal to . The rate of convergence depends on the strength of the noise and the dimension of the system, with weak noise and a large dimension of the system leading to slow convergence.
There is a tradeoff between the desirable and undesirable effects of noise: noise stabilizes a learned physics simulator, with stronger noise stabilizing the learned system more quickly, but stronger noise also leads to greater bias. This is illustrated in Figure 3 for a three-dimensional system, along with the effect of energy distribution. It may be possible to obtain stability while avoiding bias by selectively applying noise only in the unlearnable/zero-energy directions.
III Continuous-time dynamical systems
Important differences arise in continuous-time dynamics. Consider the linear continuous-time dynamical system
(15)
As in the discrete-time case, suppose that is not known explicitly, and we want to learn an accurate approximation from a dataset of pairs of snapshots separated by a time , , that are generated by the dynamical system, so that
(16)
Assembling the data into data matrices as before, we seek the that minimizes the mean squared error ,
(17)
Due to the presence of the matrix exponential, there is no simple expression for the gradient of with respect to , and the training dynamics are nonlinear in . To gain insight, we expand the matrix exponential to —equivalent to solving the dynamics using the forward Euler method, making the learning problem equivalent to learning a residual network (scaled by ) [8]. Under this approximation, the gradient of the loss function is
(18)
equal to that in the discrete-time setting but scaled by a factor . The conclusions made in the discrete-time setting therefore extend to the continuous-time setting, but for the following exceptions.
First, the factor of in the gradient changes the rate of convergence of the training dynamics. Small could lead to very slow convergence.
Second, for continuous-time dynamics, an instability arises when the real part of any eigenvalue of is positive. When is created using the Glorot initializer, each of its eigenvalues has a 50% of having a positive real part. The continuous-time problem is therefore more susceptible to instability. Gershgorin’s circle theorem can again be used to create a stable mostly random matrix by having the Gershgorin disks lie in the left half of the complex plane (or within the stable region of the numerical integrator begin used; for example, if using a residual network, the Gershgorin disks should be made to lie within the stability boundary of the forward Euler method).
Finally, the effects of measurement noise differ in the details. With noisy data, the gradient is
(19)
In the basis of the left singular vectors of the noise-free data , the training dynamics give
(20)
In the limit ,
(21)
The expected value is
(22)
Rewriting the above expression provides clarity:
(23)
As in the discrete-time setting, noise creates a multiplicative bias factor. Additionally, there is an additive bias that can be substantial for small . The columns corresponding to zero singular values converge to columns whose only non-zero entries are along the diagonal and are equal to . Measurement noise again erases the memory of the initialization in the unlearnable directions and replaces it with stable dynamics, with smaller leading to more stable dynamics. In addition to the tradeoffs noted in the discrete-time setting, in the continuous-time setting also has a tradeoff: smaller creates more stable dynamics but greater bias.
IV Discussion
Despite the importance of the long-term stability of learned physics simulators, theoretical insight into this issue is conspicuously missing. Our findings for linear dynamical systems constitute an important step towards a general theory, buoyed by nonlinear analogs with strong empirical support.
In the linear case, we showed that it is more difficult to learn low-energy dynamics due to the associated slower rates of convergence in the training dynamics. This seems to hold for nonlinear systems as well, for which it has been empirically observed that it is difficult to learn the dynamics of high wavenumbers in physical systems, which typically have low energy [2, 17]. Although the difficulty to learn the dynamics of high wavenumbers has previously been attributed to the spectral bias of neural networks [2, 32, 21], recent work on multi-stage neural networks supports that non-uniform distribution of energy and spectral bias both contribute to slow convergence [31]. The non-uniformity of energy could be addressed by first transforming the variables to a space in which the data are distributed isotropically, then learning the dynamics of the transformed variables.
We then showed that dynamics off of the data subspace cannot be learned, and what is learned in the complement of the data subspace depends on the weight initialization. This obviously holds for nonlinear systems, where the dynamics off of the data submanifold cannot be learned. What is less clear is whether a non-trivial weight initialization scheme can be designed so that the default dynamics are stable, as we have done here for linear dynamical systems. In the linear case, an alternative is to project the system onto the data subspace. In the nonlinear case, manifold learning methods can be used to project the system onto the data submanifold; doing so has, indeed, been found to stabilize the learned dynamics [16, 3, 5]. Another alternative is to add global damping to the system [29, 16, 14], though it is unclear how strong it should be.
Finally, we noted the empirical success of noise injection as a stabilizer when learning nonlinear dynamical systems, and showed why it works when learning linear dynamical systems. Adding noise to the data adds damping to the training dynamics. Damping is a generic mechanism that is likely to extend to the training dynamics of nonlinear systems. Furthermore, we showed that there is a tradeoff between the stabilization and bias that noise creates, and this tradeoff can be seen in empirical results for nonlinear systems [22]. The agreement with empirical results for nonlinear systems suggests that we have identified the correct mechanisms.
References
[1]
K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian.
Accurate medium-range global weather forecasting with 3D neural
networks.
Nature, 619(7970):533–538, 2023.
[2]
A. Chattopadhyay and P. Hassanzadeh.
Long-term instabilities of deep learning-based digital twins of the
climate system: The cause and a solution.
arXiv preprint arXiv:2304.07029, 2023.
[3]
B. Chen, K. Huang, S. Raghupathi, I. Chandratreya, Q. Du, and H. Lipson.
Automated discovery of fundamental variables hidden in experimental
data.
Nature Computational Science, 2(7):433–442, 2022.
[4]
C. R. Doering and J. D. Gibbon.
Applied Analysis of the Navier-Stokes Equations.
Number 12 in Cambridge Texts in Applied Mathematics. Cambridge
University Press, 1995.
[5]
D. Floryan and M. D. Graham.
Data-driven discovery of intrinsic dynamics.
Nature Machine Intelligence, 4(12):1113–1120, 2022.
[6]
C. Foias, G. R. Sell, and R. Temam.
Inertial manifolds for nonlinear evolutionary equations.
Journal of Differential Equations, 73(2):309–353, 1988.
[7]
X. Glorot and Y. Bengio.
Understanding the difficulty of training deep feedforward neural
networks.
In Y. W. Teh and M. Titterington, editors, Proceedings of the
Thirteenth International Conference on Artificial Intelligence and
Statistics, volume 9 of Proceedings of Machine Learning Research,
pages 249–256. PMLR, 2010.
[8]
K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 770–778, 2016.
[9]
H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi,
J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers,
A. Simmons, C. Soci, S. Abdalla, X. Abellan, G. Balsamo, P. Bechtold,
G. Biavati, J. Bidlot, M. Bonavita, G. De Chiara, P. Dahlgren, D. Dee,
M. Diamantakis, R. Dragani, J. Flemming, R. Forbes, M. Fuentes, A. Geer,
A. Haimberger, S. Healy, R. J. Hogan, E. Hólm, M. Janisková, S. Keeley,
P. Laloyaux, P. Lopez, C. Lupu, G. Radnoti, P. de Rosnay, I. Rozum,
F. Vamborg, S. Villaume, and J.-N. Thépaut.
The ERA5 global reanalysis.
Quarterly Journal of the Royal Meteorological Society,
146(730):1999–2049, 2020.
[10]
E. Hopf.
A mathematical example displaying features of turbulence.
Communications on Pure and Applied Mathematics, 1(4):303–322,
1948.
[11]
G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and
L. Yang.
Physics-informed machine learning.
Nature Reviews Physics, 3(6):422–440, 2021.
[12]
R. Keisler.
Forecasting global weather with graph neural networks.
arXiv preprint arXiv:2202.07575, 2022.
[13]
R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet,
S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland,
O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia.
Learning skillful medium-range global weather forecasting.
Science, 382(6677):1416–1421, 2023.
[14]
A. J. Linot, J. W. Burby, Q. Tang, P. Balaprakash, M. D. Graham, and R. Maulik.
Stabilized neural ordinary differential equations for long-time
forecasting of dynamical systems.
Journal of Computational Physics, 474:111838, 2023.
[15]
A. J. Linot and M. D. Graham.
Deep learning to discover and predict dynamics on an inertial
manifold.
Physical Review E, 101(6):062209, 2020.
[16]
A. J. Linot and M. D. Graham.
Data-driven reduced-order modeling of spatiotemporal chaos with
neural ordinary differential equations.
Chaos: An Interdisciplinary Journal of Nonlinear Science,
32(7), 2022.
[17]
P. Lippe, B. Veeling, P. Perdikaris, R. Turner, and J. Brandstetter.
PDE-refiner: Achieving accurate long rollouts with neural PDE
solvers.
Advances in Neural Information Processing Systems, 36, 2024.
[18]
J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott.
Model-free prediction of large spatiotemporally chaotic systems from
data: A reservoir computing approach.
Physical Review Letters, 120(2):024102, 2018.
[19]
J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay,
M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli, P. Hassanzadeh,
K. Kashinath, and A. Anandkumar.
Fourcastnet: A global data-driven high-resolution weather model using
adaptive Fourier neural operators.
arXiv preprint arXiv:2202.11214, 2022.
[20]
T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia.
Learning mesh-based simulation with graph networks.
In International Conference on Learning Representations, 2021.
[21]
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio,
and A. Courville.
On the spectral bias of neural networks.
In International Conference on Machine Learning, pages
5301–5310. PMLR, 2019.
[22]
A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and
P. Battaglia.
Learning to simulate complex physics with graph networks.
In International Conference on Machine Learning, pages
8459–8468. PMLR, 2020.
[23]
K. Srinivasan, N. Coble, J. Hamlin, T. Antonsen, E. Ott, and M. Girvan.
Parallel machine learning for forecasting the dynamics of complex
networks.
Physical Review Letters, 128(16):164101, 2022.
[24]
K. Stachenfeld, D. B. Fielding, D. Kochkov, M. Cranmer, T. Pfaff, J. Godwin,
C. Cui, S. Ho, P. Battaglia, and A. Sanchez-Gonzalez.
Learned simulators for turbulence.
In International Conference on Learning Representations, 2022.
[25]
J. Su, J. Kempe, D. Fielding, N. Tsilivis, M. Cranmer, and S. Ho.
Adversarial noise injection for learned turbulence simulations.
In NeurIPS 2022 Workshop on Machine Learning and the Physical
Sciences, 2022.
[26]
T. Tao, V. Vu, and M. Krishnapur.
Random matrices: Universality of ESDs and the circular law.
The Annals of Probability, 38(5):2023–2065, 2010.
[27]
R. Temam and X. M. Wang.
Estimates on the lowest dimension of inertial manifolds for the
Kuramoto-Sivashinsky equation in the general case.
Differential and Integral Equations, 7(3-4):1095–1108, 1994.
[28]
P. R. Vlachas, G. Arampatzis, C. Uhler, and P. Koumoutsakos.
Multiscale simulations of complex systems by learning their effective
dynamics.
Nature Machine Intelligence, 4(4):359–366, 2022.
[29]
P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, and P. Koumoutsakos.
Data-driven forecasting of high-dimensional chaotic systems with long
short-term memory networks.
Proceedings of the Royal Society A: Mathematical, Physical and
Engineering Sciences, 474(2213):20170844, 2018.
[30]
P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, E. Ott, and
P. Koumoutsakos.
Backpropagation algorithms and reservoir computing in recurrent
neural networks for the forecasting of complex spatiotemporal dynamics.
Neural Networks, 126:191–217, 2020.
[31]
Y. Wang and C.-Y. Lai.
Multi-stage neural networks: Function approximator of machine
precision.
Journal of Computational Physics, page 112865, 2024.
[32]
Z.-Q. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma.
Frequency principle: Fourier analysis sheds light on deep neural
networks.
arXiv preprint arXiv:1901.06523, 2019.