Bayesian identification of nonseparable Hamiltonians with multiplicative noise using deep learning and reduced-order modeling

Nicholas Galioto ngalioto@umich.edu Harsh Sharma hasharma@ucsd.edu Boris Kramer bmkramer@ucsd.edu Alex Arkady Gorodetsky goroda@umich.edu

Abstract

This paper presents a structure-preserving Bayesian approach for learning nonseparable Hamiltonian systems using stochastic dynamic models allowing for statistically-dependent, vector-valued additive and multiplicative measurement noise. The approach is comprised of three main facets. First, we derive a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model that is needed to evaluate the likelihood within the Bayesian posterior. Second, we develop a novel algorithm for cost-effective application of Bayesian system identification to high-dimensional systems. Third, we demonstrate how structure-preserving methods can be incorporated into the proposed framework, using nonseparable Hamiltonians as an illustrative system class. We assess the method’s performance based on the forecasting accuracy of a model estimated from single-trajectory data. We compare the Bayesian method to a state-of-the-art machine learning method on a canonical nonseparable Hamiltonian model and a chaotic double pendulum model with small, noisy training datasets. The results show that using the Bayesian posterior as a training objective can yield upwards of 724 times improvement in Hamiltonian mean squared error using training data with up to 10% multiplicative noise compared to a standard training objective. Lastly, we demonstrate the utility of the novel algorithm for parameter estimation of a 64-dimensional model of the spatially-discretized nonlinear Schrödinger equation with data corrupted by up to 20% multiplicative noise.

Keywords: Bayesian system identification, physics-informed machine learning, multiplicative noise, high-dimensional systems, nonseparable Hamiltonian systems, deep learning

Affiliations:

[1] Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI 48109, USA

[2] Department of Mechanical and Aerospace Engineering, University of California San Diego, San Diego, CA 92161, USA

Highlights

A Gaussian filter is derived for general additive and multiplicative noise models

An algorithm is introduced for Bayesian estimation of high-dimensional Hamiltonians

Probabilistic modeling is shown to improve a deep system ID method on noisy data

Bayesian parameter estimation is successfully performed on a 64-dimensional system

1 Introduction

System identification (ID) plays a key role in many engineering and scientific frameworks such as model predictive control, system forecasting, and dynamical analysis. Creating a system ID algorithm includes careful selection of a class of candidate models and of an objective function to optimize. The success of system ID strongly depends on how efficiently this pair of model class and objective utilizes available information to guide estimation. Data from the system are most commonly used as sources of information, but prior knowledge on the system physics can also be considered within the estimation procedure.

Incorporating physical knowledge into system ID has been demonstrated through a variety of methodologies [1]. In one direction, physically-inconsistent models can be penalized through the addition of physics-based terms in the objective function. This approach is adopted by the widely-used physics-informed neural networks [2, 3] and has also been applied to various applications such as improving molecular dynamics simulation [4] and lake temperature modeling [5]. In another direction, physics are explicitly encoded into the model parameterization. This approach has led to various neural network architectures for learning conservative systems based on Hamiltonian [6, 7, 8] and Lagrangian [9, 10, 11, 12] mechanics. Other examples of this approach include preservation of symmetry groups in convolutional neural networks [13] and enforcement of boundary conditions in boundary value problems [14]. Compared to traditional machine learning approaches, these methods have all shown significant improvements in estimation accuracy.

In addition to incorporating physical knowledge through models, data must be utilized to encourage consistency with the real world. Careful design of the learning objective is crucial to ensure that the information in the data is extracted effectively. The most common objectives take the form of a summation of vector norms of the differences between the data and estimated outputs, but alternative forms can be derived through probabilistic modeling. Notably, accounting for model, measurement, and parameter uncertainties within a Bayesian framework in [15] led to an objective (the negative log posterior) that has been shown to yield more accurate estimates than many widely-used objectives [16]. To further improve information extraction capabilities, this method can be combined with structure-preserving parameterization techniques [15, 17]. However, evaluation of this posterior relies on probabilistic filtering and is computationally challenging. Two challenges arise: (i) filtering can be costly if the noise models are non-additive, and (ii) filtering tends to scale poorly with the state and measurement dimensions.

The challenge of non-additive noise arises when the sensor noise is dependent on the signal. The most common form of this is multiplicative noise. As an example, a distance-measuring sensor contains noise that increases roughly linearly with its distance from the target [18]. To address multiplicative noise, a Kalman filter was first derived in 1971 by [19]. Since then, a number of other filters have been developed for various multiplicative noise models such as Gaussian mixtures [20, 21], non-stationary noise [22, 4, 20], and deterministic uncertainties [23, 24]. Each of these filters, however, makes the assumption that the noise from each sensor is independent and identically distributed (i.i.d.). If sensors measure along the same axis, e.g., distance and velocity sensors or redundant sensors, their measurements, and therefore noises, will be correlated and not independent. And for sensors that have been manufactured differently, it is unlikely that their noise models will be identically distributed. Therefore, a more general filtering algorithm for multiplicative noises is needed.

The high-dimensional problem setting poses challenges for not only the Bayesian approach, but also many other system ID methods. Most nonlinear system ID algorithms require thousands of evaluations of a forward model (or data points) for training, which can quickly become prohibitive for large-scale dynamical models. This issue is often addressed through reduced-order modeling in which a low-dimensional approximation, known as a reduced-order model (ROM), of the high-dimensional dynamics is estimated. To identify the ROM, many approaches begin by estimating a low-dimensional subspace for the reduced-order state vector followed by system ID in the reduced-order space. In the time domain, common approaches for the subspace identification task include linear methods such as the proper orthogonal decomposition [25] and the reduced basis method [26, 27], and nonlinear methods such as autoencoders [28, 29] and polynomial manifolds [30, 31]. These methods, however, are trained using full-field simulation data from full-order models (FOMs) that realistically are often incorrect, partially unknown, or uncertain. As a result, the accuracy of the ROMs is limited by the accuracy of the FOMs. Improving past this limit requires using experimental data collected directly from the system of interest. Since these data are often noisy, training with them introduces additional error into the subspace approximation. For a system ID method to handle this added error, careful modeling of uncertainty will be key.

In this work, we introduce methodologies to address the challenges of Bayesian system ID for multiplicative noise and high-dimensional systems. Then we demonstrate how these methodologies can be combined with structure-preserving methods, using nonseparable Hamiltonian systems as an example class of systems. Lastly, we apply the methodologies to estimate a dynamics model from single-trajectory data and evaluate the model’s quality based on its forecasting accuracy. We choose nonseparable Hamiltonian systems for several reasons. Hamiltonian systems can demonstrate complex nonlinear behavior while possessing an underlying highly-structured geometry. These systems possess interesting physical properties that are important to preserve including conservation of the Hamiltonian, reversibility, and symplecticity. Nonseparable Hamiltonians, specifically, are of interest because they arise in diverse fields such as multibody dynamics and control in robotics [32], the Kozai-Lidov mechanism in astrophysics [33], particle accelerators in physics [34], 3D vortex dynamics in fluid mechanics [35], and the nonlinear Schrödinger equation in quantum mechanics [36].

In a previous work [17], we also considered Bayesian system ID of nonseparable Hamiltonian systems with multiplicative measurement noise. There are three major distinctions between that work and the present one. The first is that in the past work, we considered only a polynomial Hamiltonian with a polynomial model parameterization. The current work uses a more expressive deep neural network parameterization and considers a non-polynomial Hamiltonian example. Second, we previously used an additive noise model and trained on data with multiplicative noise to demonstrate robustness to model misspecification, but here we adapt the model and algorithm toward multiplicative noise. The third and most significant difference in this work is that we develop an original algorithm for efficient estimation of high-dimensional nonseparable Hamiltonian systems. These novel additions to the structure-preserving Bayesian learning framework of [17] are stated more specifically as follows:

1. derivation of a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model and analysis of the added computational complexity compared to a filter for only additive noise in Section 4.1,

2. creation of a novel learning algorithm (Algorithm 2) for estimation of high-dimensional nonseparable Hamiltonians in a reduced-dimensional space in Section 5,

3. numerical experimentation showing that the proposed likelihood-based objective outperforms the original objective of a state-of-the-art machine learning method when training with sparse data with multiplicative uniform noise with respect to both state and Hamiltonian error in Sections 6.3 and 6.4. These results include upwards of 724 times improvement in Hamiltonian mean squared error when training with data corrupted by up to 10% multiplicative noise,

4. demonstration of the effectiveness of Algorithm 2 for parameter estimation on the 64-dimensional spatially-discretized nonlinear Schrödinger equation within a reduced 8-dimensional space using data with 20% multiplicative noise in Section 6.5.

The rest of this paper is structured as follows. Section 2 provides notation and the probabilistic formulation of the system ID problem. Section 3 gives a background discussion on algorithmic design choices for the Bayesian method. Section 4 introduces a filter for a general additive and multiplicative noise model and reviews structure-preserving techniques for nonseparable Hamiltonian systems. Section 5 presents a novel algorithm for efficient and structure-preserving estimation of high-dimensional Hamiltonians. Section 6 applies the proposed Bayesian algorithm to two low-dimensional (one polynomial and one non-polynomial) nonseparable Hamiltonian systems. Then, the novel algorithm for estimating high-dimensional Hamiltonians is applied to a spatial discretization of the nonlinear Schrödinger equation. Finally, Section 7 provides concluding remarks and future research directions.

2 Problem statement

In this section, we define notation and formulate the system ID problem.

2.1 Notation

The space of real numbers is denoted by $\mathbb{R}$ and the set of positive integers by $\mathbb{Z}_{+}$. Vectors are written in lowercase, non-italic, bold font, e.g., $\mathbf{x}$, and matrices in uppercase, non-italic, bold font, e.g., $\mathbf{A}$. The transpose is denoted by the superscript $^{\top}$. If a vector varies in space and/or time, it is spatially indexed as $\mathbf{x}^{s}$ and/or temporally indexed as $\mathbf{x}_{t}$ for discrete space and time $s,t\in\mathbb{Z}_{+}$.

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability triple where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$ -algebra, and $\mathbb{P}$ is a probability measure. Random variables are denoted in lowercase, italic, bold font, e.g., $\bm{z}$ , and their realizations are denoted as their non-italic counterparts, e.g., $\mathbf{z}$ . We assume that for a continuous random variable $\bm{z}$ , the probability measure $\mathbb{P}({\rm d}\bm{z})$ admits a probability density function (pdf) $\pi(\mathbf{z})$ . A $d$ -dimensional uniform distribution with lower and upper bounds $\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}$ is denoted as $\mathcal{U}[\mathbf{a},\mathbf{b}]$ . A normal distribution with mean $\mathbf{m}$ and covariance $\mathbf{C}$ is denoted as $\mathcal{N}(\mathbf{m},\mathbf{C})$ . If $\bm{z}\sim\mathcal{N}(\mathbf{0},\mathbf{C})$ , then $\lvert\bm{z}\rvert$ follows a half-normal distribution denoted as $\text{half-}\mathcal{N}(\mathbf{0},\mathbf{C})$ .

The symbol $\odot$ represents element-wise multiplication. The $\odot$ operation is defined when the dimensions of the operands match or when the operands are a matrix and a vector with length equal to the number of matrix columns. In the latter case, $\odot$ multiplies the $i$th column of the matrix by the $i$th element of the vector. This operator has the useful property that $\bm{z}\odot\mathbf{a}\sim\mathcal{N}\left(\mathbf{m}\odot\mathbf{a},\mathbf{C}\odot(\mathbf{a}\mathbf{a}^{\top})\right)$ for the Gaussian random vector $\bm{z}\sim\mathcal{N}(\mathbf{m},\mathbf{C})$ and a constant vector $\mathbf{a}$ of the same length.
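The scaling property above can be checked numerically. The following is a minimal NumPy sketch; the values of `m`, `C`, `a`, and `M` are hypothetical and chosen only for illustration. It relies on the fact that the element-wise product of a vector with $\mathbf{a}$ is the linear map $\mathrm{diag}(\mathbf{a})$, so the covariance must equal $\mathrm{diag}(\mathbf{a})\,\mathbf{C}\,\mathrm{diag}(\mathbf{a})^{\top}$.

```python
import numpy as np

# Hypothetical example values (not taken from the paper).
m = np.array([1.0, -2.0, 0.5])
B = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])
C = B @ B.T                       # symmetric positive definite covariance
a = np.array([0.5, 2.0, -1.0])

# Matching-dimension case: plain element-wise products.
mean_scaled = m * a
cov_scaled = C * np.outer(a, a)   # C element-wise times a a^T

# The element-wise product with a is the linear map diag(a),
# so the covariance of the scaled vector must be D C D^T.
D = np.diag(a)
cov_linear = D @ C @ D.T

# Matrix-vector case: column i of M is multiplied by a[i].
M = np.arange(6.0).reshape(2, 3)
M_scaled = M * a                  # NumPy broadcasts a across the rows
```

NumPy broadcasting happens to match the matrix-vector convention defined above, since a length-3 vector is aligned with the last (column) axis of a 2-by-3 array.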

2.2 Probabilistic problem formulation

In this section, we describe a probabilistic problem formulation for Bayesian system ID for arbitrary noise models. We model the states $\bm{x}_{k}(\omega)\in\mathbb{R}^{n}$ and the outputs $\bm{y}_{k}(\omega)\in\mathbb{R}^{m}$ for $\omega\in\Omega$ as discrete-time stochastic processes indexed by $k\in\mathbb{Z}_{+}$ . The dynamics are modeled as a hidden Markov model (HMM)


\begin{align}
\bm{x}_{k+1} &= \mathcal{T}(\bm{x}_{k},\bm{\theta},\omega), \tag{1a}\\
\bm{y}_{k} &= \mathcal{M}(\bm{x}_{k},\bm{\theta},\omega), \tag{1b}
\end{align}

where $\mathcal{T}:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\times\Omega\mapsto\mathbb{R}^{n}$ is the state-transition function and $\mathcal{M}:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\times\Omega\mapsto\mathbb{R}^{m}$ the measurement function. These functions are parameterized by the random variable $\bm{\theta}(\omega)\in\mathbb{R}^{\ell}$, whose realizations we denote with $\bm{\uptheta}$, and both operators are functions of $\omega\in\Omega$ to represent that their outputs are random variables. The system is therefore characterized by the sequences of transitional pdfs $\pi(\mathbf{x}_{k+1}|\mathbf{x}_{k},\bm{\uptheta})$ and conditional output pdfs $\pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})$ induced by $\mathcal{T}$ and $\mathcal{M}$, respectively.

We seek to represent the posterior distribution $\pi(\bm{\uptheta}|\mathcal{Y}_{N})$ characterizing the uncertainty in system parameters $\bm{\theta}$ given a collection of measurements $\mathcal{Y}_{N}\coloneqq(\mathbf{y}_{1},\ldots,\mathbf{y}_{N})$. Bayes’ rule expresses the posterior in a computable form via the likelihood $\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})\coloneqq\pi(\mathcal{Y}_{N}|\bm{\uptheta})$, the prior $\pi(\bm{\uptheta})$, and a normalizing constant $\pi(\mathcal{Y}_{N})$ known as the evidence according to

\begin{equation}
\pi(\bm{\uptheta}|\mathcal{Y}_{N})=\frac{\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})\pi(\bm{\uptheta})}{\pi(\mathcal{Y}_{N})}. \tag{2}
\end{equation}

The HMM, however, has the additional collection of uncertain variables $\bm{x}_{1},\ldots,\bm{x}_{N}$ about which we do not intend to make inferences. These uncertain states are marginalized out of the inference problem within the likelihood as $\int\pi(\mathcal{Y}_{N},\mathbf{x}_{1},\ldots,\mathbf{x}_{N}|\bm{\uptheta}){\rm d}\mathbf{x}_{1},\ldots,{\rm d}\mathbf{x}_{N}$. At first, this marginalization appears to require a costly $nN$-dimensional integration. However, the recursive structure of the HMM can be exploited to break the high-dimensional integral into $N$ integrals of the more manageable dimension $n$ by the decomposition

\begin{equation}
\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})=\pi(\mathbf{y}_{1}|\bm{\uptheta})\prod_{k=2}^{N}\pi(\mathbf{y}_{k}|\bm{\uptheta},\mathcal{Y}_{k-1}). \tag{3}
\end{equation}

Each term in this product can be efficiently computed using recursion, as shown in Algorithm 1 from [37, Th. 12.3].

Algorithm 1 Recursive marginal likelihood evaluation [37]

Input: $\pi(\mathbf{x}_{1}|\bm{\uptheta})$, $\mathcal{Y}_{N}$
Output: $\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})$
1: Initialize $\pi(\mathbf{x}_{1}|\mathcal{Y}_{0},\bm{\uptheta})\coloneqq\pi(\mathbf{x}_{1}|\bm{\uptheta})$ and $\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{0})\coloneqq 1$
2: for $k=1,\ldots,N$ do
3:     Marginalize: $\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})\leftarrow\int\pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta}){\rm d}\mathbf{x}_{k}$ and $\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{k})\leftarrow\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{k-1})\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$
4:     if $k<N$ then
5:         Update: $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})\leftarrow\frac{\pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})}{\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})}\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$
6:         Predict: $\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})\leftarrow\int\pi(\mathbf{x}_{k+1}|\mathbf{x}_{k},\bm{\uptheta})\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta}){\rm d}\mathbf{x}_{k}$
7:     end if
8: end for
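To make the recursion concrete, the sketch below applies Algorithm 1 to a scalar linear-Gaussian model, for which every integral has a closed form (the Kalman filter). The model $x_{k+1}=ax_{k}+\xi_{k}$, $y_{k}=cx_{k}+\eta_{k}$ and all function and parameter names are illustrative assumptions, not the models used in this paper. The recursion accumulates the log-likelihood rather than the likelihood itself, an implementation choice made here for numerical stability. A second function evaluates $\pi(\mathcal{Y}_{N}|\bm{\uptheta})$ directly as a single $N$-dimensional Gaussian so that the two results can be compared.

```python
import numpy as np

def log_marginal_likelihood(y, a, c, q, r, m1, P1):
    """Algorithm 1 for x_{k+1} = a x_k + xi_k, y_k = c x_k + eta_k (all scalar)."""
    m, P = m1, P1            # mean and variance of pi(x_k | Y_{k-1}, theta)
    log_like = 0.0           # log of L(theta; Y_0) = 1
    N = len(y)
    for k in range(N):
        # Marginalize: pi(y_k | Y_{k-1}, theta) = N(mu, S)
        mu, S = c * m, c * P * c + r
        log_like += -0.5 * (np.log(2.0 * np.pi * S) + (y[k] - mu) ** 2 / S)
        if k < N - 1:
            # Update: moments of pi(x_k | Y_k, theta)
            K = P * c / S
            m_post, P_post = m + K * (y[k] - mu), P - K * S * K
            # Predict: moments of pi(x_{k+1} | Y_k, theta)
            m, P = a * m_post, a * P_post * a + q
    return log_like

def log_joint_gaussian(y, a, c, q, r, m1, P1):
    """Reference value: pi(Y_N | theta) evaluated as one N-dimensional Gaussian."""
    N = len(y)
    mean_x = m1 * a ** np.arange(N)
    var_x = np.empty(N)
    var_x[0] = P1
    for k in range(1, N):
        var_x[k] = a * var_x[k - 1] * a + q
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    cov_x = a ** np.abs(i - j) * var_x[np.minimum(i, j)]   # Cov(x_i, x_j)
    cov_y = c * cov_x * c + r * np.eye(N)
    resid = np.asarray(y) - c * mean_x
    _, logdet = np.linalg.slogdet(cov_y)
    quad = resid @ np.linalg.solve(cov_y, resid)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)
```

The recursive version touches only scalar quantities at each step, whereas the direct version factorizes an $N\times N$ covariance, which illustrates the cost saving that the decomposition in Eq. (3) is meant to provide.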

3 Algorithmic design choices

While Algorithm 1 is general enough to be applicable to any system that can be modeled in the general HMM form of Eq. (1), this flexibility requires design choices addressing two primary challenges: (i) the computational expense of filtering and (ii) how to encode prior knowledge into the parameterizations of $\mathcal{T}$ and $\mathcal{M}$ .

The first challenge of computational expense arises because the integrals in Algorithm 1 do not, in general, admit closed form solutions. The user’s choice in evaluation method will determine the accuracy of the marginal likelihood evaluation and overall computational complexity. The design choices to address the second challenge include selecting the coordinate system, system dimensions, and model fidelities within $\mathcal{T}$ and $\mathcal{M}$ . These choices often involve a tradeoff between accuracy and computational expense. For the best accuracy and generalizability of the learned model, prior information on the system should inform the model parameterization as much as possible. We now describe these choices and our contributions in more detail.

3.1 Sources of computational expense

Integral evaluation methods within filtering can be divided into two classes: those that estimate the exact integral and those that estimate an approximation of the integral. Estimation of the exact integral is typically performed using sequential Monte Carlo algorithms such as particle filtering [38]. The efficiency of this approach, however, is strongly dependent on the ability to draw uncorrelated samples from the appropriate distributions, which is, in general, nontrivial. When the efficiency of Monte Carlo sampling is prohibitive, approximations are used instead. The most common class of approximation methods for these integrals is Gaussian filtering. These methods approximate the marginal pdf $\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{k})$ as Gaussian, which requires tracking only the first two moments of all other pdfs in Algorithm 1. Tracking the mean and covariance of the prediction pdf $\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})$ and output pdf $\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$ can be achieved through linearization using Taylor series expansion or through Gaussian integration. These techniques are possible because the prediction and output pdfs are defined in terms of available functions, in this case $\mathcal{T}$ and $\mathcal{M}$. There is no such function, however, for the update pdf $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})$. The mean and covariance of this pdf are instead approximated with the Kalman update, which delivers the linear minimum mean squared error (MMSE) estimate of these quantities.
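The two moment-tracking options mentioned above can be compared on a scalar example. The sketch below is illustrative only: it uses Gauss-Hermite quadrature as one possible form of Gaussian integration (an assumption; the text does not specify a rule), and the function names `gaussian_moments` and `linearized_moments` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_moments(f, m, s, order=10):
    """Mean and variance of f(x) for scalar x ~ N(m, s^2) by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(order)          # weight function exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)    # normalize to the standard normal pdf
    fx = f(m + s * nodes)
    mean = np.sum(weights * fx)
    var = np.sum(weights * (fx - mean) ** 2)
    return mean, var

def linearized_moments(f, df, m, s):
    """First-order Taylor approximation of the same two moments."""
    return f(m), (df(m) * s) ** 2

# Example: f(x) = x^2 with x ~ N(1, 0.5^2).
quad_mean, quad_var = gaussian_moments(lambda x: x ** 2, 1.0, 0.5)
lin_mean, lin_var = linearized_moments(lambda x: x ** 2, lambda x: 2.0 * x, 1.0, 0.5)
```

For this quadratic function the quadrature reproduces the exact moments $m^{2}+s^{2}$ and $4m^{2}s^{2}+2s^{4}$, while linearization drops the terms that depend on the curvature of $f$.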

3.1.1 Process and measurement model forms

If more is known about the forms of $\mathcal{T}$ and $\mathcal{M}$ , linearization and Gaussian integration can sometimes be replaced by closed form solutions. To this end, it is useful to separate the dynamics and output functions into deterministic and stochastic components. Let $\Psi:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{n}$ and $h:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{m}$ represent deterministic dynamics and measurement functions, respectively. The stochastic components are usually further divided according to how they enter into the model. The two main approaches for modeling these components are as additive or multiplicative noise. If both of these types of noise are present, the model is written as


\begin{align}
\mathcal{T}(\bm{x}_{k},\bm{\theta},\omega) &= \Psi(\bm{x}_{k},\bm{\theta})\odot\bm{w}_{k}(\omega)+\bm{\xi}_{k}(\omega), \tag{4a}\\
\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega) &= h(\bm{x}_{k},\bm{\theta})\odot\bm{v}_{k}(\omega)+\bm{\eta}_{k}(\omega), \tag{4b}
\end{align}

where $\bm{\xi}_{k}(\omega),\bm{w}_{k}(\omega)\in\mathbb{R}^{n}$ and $\bm{\eta}_{k}(\omega),\bm{v}_{k}(\omega)\in\mathbb{R}^{m}$ are discrete-time stochastic processes.
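As a rough illustration of Eq. (4), the sketch below draws one trajectory with uniform multiplicative noise and Gaussian additive noise. The function name `simulate_hmm`, the single shared noise level for the process and measurement equations, and the rotation dynamics in the example are all assumptions made for brevity, not the settings used in the experiments of this paper.

```python
import numpy as np

def simulate_hmm(Psi, h, x1, N, rng, mult_level, add_std):
    """Draw one trajectory from Eq. (4).

    Multiplicative noise is uniform on [1 - mult_level, 1 + mult_level];
    additive noise is zero-mean Gaussian with standard deviation add_std.
    """
    x = np.asarray(x1, dtype=float)
    xs, ys = [], []
    for _ in range(N):
        hx = h(x)
        v = rng.uniform(1.0 - mult_level, 1.0 + mult_level, size=hx.shape)
        eta = add_std * rng.standard_normal(hx.shape)
        xs.append(x)
        ys.append(hx * v + eta)                 # Eq. (4b)
        px = Psi(x)
        w = rng.uniform(1.0 - mult_level, 1.0 + mult_level, size=px.shape)
        xi = add_std * rng.standard_normal(px.shape)
        x = px * w + xi                         # Eq. (4a)
    return np.array(xs), np.array(ys)

# Example: planar rotation observed directly with 10% multiplicative noise.
angle = 0.1
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
rng = np.random.default_rng(0)
xs, ys = simulate_hmm(lambda x: R @ x, lambda x: x, [1.0, 0.5], 50, rng,
                      mult_level=0.1, add_std=0.0)
```

With the additive noise switched off, each measurement deviates from the true state by at most 10% of that state's magnitude, which is the signal-dependent behavior that distinguishes multiplicative from additive noise.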

The most common noise model is additive noise. One benefit of an additive noise model is that the first two moments of the noise terms can be estimated separately from those of the dynamics and observation functions, reducing complexity. Then the total mean and covariance for either $\mathcal{T}$ or $\mathcal{M}$ is simply the sum of these separate estimates. For linear systems, this allows for the means and covariances of $\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})$ and $\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$ to be evaluated analytically with the Kalman filter when the first two moments of the noise terms are known. Moreover, if the system is linear and the noise terms are Gaussian, then the prediction, output, and update pdfs in Algorithm 1 are all Gaussian, and the Kalman filter is exact.

Although additive Gaussian noise is a suitable choice for many problems, recent works on learning Hamiltonians [39, 40] have begun considering multiplicative uniform noise. As with any other noise model, Gaussian filtering can be used to approximate Algorithm 1 for multiplicative uniform noise, but it is preferable to compute as much of each integral in closed form as possible for the sake of complexity. By using knowledge of the noise model, the general Gaussian filtering procedure can be adapted to replace a portion of the approximations with an exact and computationally efficient evaluation. In Section 4.1, we introduce such a filtering procedure for multiplicative noise and analyze its computational expense.

3.1.2 Dimensionality

Although using Gaussian filtering for integral evaluation is significantly more efficient than Monte Carlo sampling, it tends to not scale well with the state and measurement dimensions. For example, the Kalman filter has a computational complexity of $\mathcal{O}(N(n^{3}+m^{3}))$ , which can make optimization and sampling schemes infeasible for even moderate dimensions. Since in most real-world systems, $n\gg m$ , the state dimension tends to be the limiting dimension. Therefore, one solution is to use reduced-order modeling to reduce the state dimension to $r\ll n$ and perform estimation in this $r$ -dimensional subspace. In Section 5, we present a method for learning high-dimensional systems efficiently using Algorithm 1 and reduced-order modeling. If this approach is used, it is critical that the ROM still preserve the geometric structure of the FOM. Structure-preservation can be achieved by encoding prior physics knowledge into the parameterization of the dynamics model $\mathcal{T}$ . We discuss such an approach in the following section.

3.2 Incorporation of prior knowledge

Here we consider the choice of model parameterization with a focus on embedding geometric structure into the model. The benefits of physics-informed parameterization are that it leads to models that generalize better beyond the training data and reduces the number of data required for training. There are two main ways to enforce the system physics within the dynamics model $\mathcal{T}$ . First, the parameterization of $\mathcal{T}$ should be designed to only admit models whose dynamics possess the same physical structure of the system. Second, the model dynamics should be evaluated in such a way that the resulting flow preserves the structure of the dynamics. Since this work considers specialization of methods to nonseparable Hamiltonian systems, we briefly review aspects of the Hamiltonian structure that will inform the proposed parameterization method.

Finite-dimensional canonical Hamiltonian systems are defined by a scalar function $H$ , known as the Hamiltonian, of the canonical position $\mathbf{q}$ and momentum $\mathbf{p}$ , both in $\mathbb{R}^{d}$ . For these systems, the state is defined as $\mathbf{x}=\begin{bmatrix}\mathbf{q}^{\top}&\mathbf{p}^{\top}\end{bmatrix}^{\top}$ such that $n=2d$ . The governing equations of the system, known as Hamilton’s equations, are derived from this function

\begin{equation}
\dot{\mathbf{q}}=\frac{\partial H(\mathbf{q},\mathbf{p})}{\partial\mathbf{p}},\quad\quad\dot{\mathbf{p}}=-\frac{\partial H(\mathbf{q},\mathbf{p})}{\partial\mathbf{q}}. \tag{5}
\end{equation}

Many Hamiltonian systems of interest to engineers and scientists (e.g., [32, 34, 35, 33, 36]) are not additively separable with respect to functions of the position and momentum. Such Hamiltonian systems are said to be nonseparable. Unlike the separable Hamiltonian systems which can be written as $H(\mathbf{q},\mathbf{p})=T(\mathbf{p})+U(\mathbf{q})$ with a kinetic energy function $T(\mathbf{p})$ and a potential energy function $U(\mathbf{q})$ , Eq. (5) cannot be further simplified for nonseparable Hamiltonians. A distinctive feature of Hamilton’s equations is that they possess physically meaningful geometric properties that can be described in the form of symplecticity, invariants of motion, and energy conservation. We discuss how these properties can be embedded in the estimation procedure in Section 4.2.
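A small numerical example may help fix ideas. The sketch below evaluates Hamilton's equations, Eq. (5), for the one-degree-of-freedom Hamiltonian $H(q,p)=\tfrac{1}{2}(q^{2}+1)(p^{2}+1)$, which cannot be split into $T(p)+U(q)$. This particular Hamiltonian and the finite-difference gradients are illustrative assumptions; the section does not prescribe either.

```python
import numpy as np

def hamiltonian(q, p):
    """Illustrative nonseparable Hamiltonian (an assumption for this example)."""
    return 0.5 * (q ** 2 + 1.0) * (p ** 2 + 1.0)

def hamiltons_equations(q, p, eps=1e-6):
    """Right-hand side of Eq. (5) using central finite differences of H."""
    dH_dq = (hamiltonian(q + eps, p) - hamiltonian(q - eps, p)) / (2.0 * eps)
    dH_dp = (hamiltonian(q, p + eps) - hamiltonian(q, p - eps)) / (2.0 * eps)
    return dH_dp, -dH_dq          # (q_dot, p_dot)

q0, p0 = 0.7, -0.4
q_dot, p_dot = hamiltons_equations(q0, p0)

# Analytical right-hand side for comparison.
q_dot_exact = (q0 ** 2 + 1.0) * p0
p_dot_exact = -q0 * (p0 ** 2 + 1.0)
```

Because $\dot{q}$ depends on $q$ and $\dot{p}$ depends on $p$, splitting schemes designed for separable systems do not apply directly, which is one reason nonseparable systems need dedicated treatment.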

4 Methodology

In this section, we present solutions to the problems of computational expense and prior knowledge incorporation. In Section 4.1, we derive a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model, and we analyze its computational expense relative to a Gaussian filter for the more widely-used additive noise model. Then in Section 4.2, we add the additional capability of physics-informed estimation to the algorithm by embedding geometric structure within the dynamics propagator $\Psi$ , and we tailor the approach specifically toward nonseparable Hamiltonian systems.

4.1 Filtering with multiplicative noise

In this section, we extend the filter for multiplicative scalar noise from [19] to models with statistically-dependent, vector-valued noise. Consider the HMM with additive and multiplicative noise in Eq. (4). Let the vector-valued multiplicative noise terms $\bm{w}_{k}$ and $\bm{v}_{k}$ each be i.i.d. with means $\bar{\mathbf{w}}$ and $\bar{\mathbf{v}}$ and covariances $\mathbf{\Sigma}_{\odot}$ and $\mathbf{\Gamma}_{\odot}$ , respectively. Similarly, let the additive noise terms $\bm{\xi}_{k}$ and $\bm{\eta}_{k}$ also be i.i.d. with zero means and covariances $\mathbf{\Sigma}_{+}$ and $\mathbf{\Gamma}_{+}$ , respectively. Since we are interested in Gaussian filtering, the higher order moments of these noise terms are not needed.

Recall from Section 3.1 that the goal of Gaussian filtering within Algorithm 1 is to compute (approximations of) the means and covariances of the distributions $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$, $\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$, and $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})$. Many of these evaluations require statistics of a function output that can be computed in closed form if the function is linear or approximated with linearization or Gaussian integration if the function is nonlinear. To represent these function outputs, we denote $\Psi(\bm{x}_{k},\bm{\theta})$ as $\bm{\psi}_{k}$ and $h(\bm{x}_{k},\bm{\theta})$ as $\bm{h}_{k}$. The following equations outline the filtering procedure with line numbers denoting where each group of equations is evaluated within Algorithm 1. The mean $\mathbf{m}_{k}$ and covariance $\mathbf{C}_{k}$ of $\mathcal{T}(\bm{x}_{k-1},\bm{\theta},\omega)$ with respect to $\pi(\mathbf{x}_{k-1}|\mathcal{Y}_{k-1},\bm{\uptheta})$ (line 6):

\begin{align}
\mathbf{m}_{k} &= \mathbb{E}\left[\bm{\psi}_{k}\right]\odot\bar{\mathbf{w}}, \tag{6}\\
\mathbf{C}_{k} &= \mathbb{E}\left[\bm{\psi}_{k}\bm{\psi}_{k}^{\top}\right]\odot\mathbf{\Sigma}_{\odot}+\mathbb{V}\text{ar}\left[\bm{\psi}_{k}\right]\odot(\bar{\mathbf{w}}\bar{\mathbf{w}}^{\top})+\mathbf{\Sigma}_{+}. \tag{7}
\end{align}

The mean $\bm{\upmu}_{k}$ and covariance $\mathbf{S}_{k}$ of $\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega)$ with respect to $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})$ and the covariance $\mathbf{U}_{k}$ between $\mathcal{T}(\bm{x}_{k-1},\bm{\theta},\omega)$ and $\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega)$ with respect to $\pi(\mathbf{x}_{k},\mathbf{x}_{k-1}|\mathcal{Y}_{k-1},\bm{\uptheta})$ (line 3):

\begin{align}
\bm{\upmu}_{k} &= \mathbb{E}\left[\bm{h}_{k}\right]\odot\bar{\mathbf{v}}, \tag{8}\\
\mathbf{S}_{k} &= \mathbb{E}\left[\bm{h}_{k}\bm{h}_{k}^{\top}\right]\odot\mathbf{\Gamma}_{\odot}+\mathbb{V}\text{ar}\left[\bm{h}_{k}\right]\odot(\bar{\mathbf{v}}\bar{\mathbf{v}}^{\top})+\mathbf{\Gamma}_{+}, \tag{9}\\
\mathbf{U}_{k} &= \mathbb{C}\text{ov}\left[\bm{\psi}_{k},\bm{h}_{k}\right]\odot\bar{\mathbf{v}}. \tag{10}
\end{align}

The mean $\mathbf{m}_{k}^{+}$ and covariance $\mathbf{C}_{k}^{+}$ of $\bm{x}_{k}$ with respect to $\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})$ (line 5):

\begin{align}
\mathbf{K}_{k} &= \mathbf{U}_{k}\mathbf{S}_{k}^{-1}, \tag{11}\\
\mathbf{m}_{k}^{+} &\approx \mathbf{m}_{k}+\mathbf{K}_{k}(\mathbf{y}_{k}-\bm{\upmu}_{k}), \tag{12}\\
\mathbf{C}_{k}^{+} &\approx \mathbf{C}_{k}-\mathbf{K}_{k}\mathbf{U}_{k}^{\top}. \tag{13}
\end{align}

These last three equations represent the linear MMSE estimator with gain matrix $\mathbf{K}_{k}\in\mathbb{R}^{n\times m}$ . Therefore, $\mathbf{m}_{k}^{+}$ and $\mathbf{C}_{k}^{+}$ are only approximations unless $\bm{x}_{k}$ and $\bm{y}_{k}$ are jointly Gaussian. Also notice that $\mathbb{E}\left[\bm{\psi}_{k}\bm{\psi}_{k}^{\top}\right]$ and $\mathbb{E}\left[\bm{h}_{k}\bm{h}_{k}^{\top}\right]$ are required when multiplicative noise is present. Again, these could be computed with linearization or Gaussian integration, but for computational efficiency, we assume that the function outputs are Gaussian. This assumption allows Eq. (7) to be computed in the following form:

\mathbf{C}_{k}=\mathbb{V}\text{ar}\left[\bm{\psi}_{k}\right]\odot(\mathbf{\Sigma}_{\odot}+\bar{\mathbf{w}}\bar{\mathbf{w}}^{\top})+(\mathbb{E}\left[\bm{\psi}_{k}\right]\mathbb{E}\left[\bm{\psi}_{k}\right]^{\top})\odot\mathbf{\Sigma}_{\odot}+\mathbf{\Sigma}_{+}.

(14)

A similar form can be used to compute Eq. (9).
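To make the measurement update concrete, the following minimal sketch evaluates Eqs. (8)–(13) for a scalar state ($n=m=1$) with the identity observation function $h(x)=x$, using the Gaussian form of Eq. (9) analogous to Eq. (14). The function and argument names are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def scalar_measurement_update(m, C, y, v_bar, gamma_mul, gamma_add):
    """Scalar (n = m = 1) sketch of Eqs. (8)-(13) with h(x) = x.

    m, C      : predicted mean and variance of x_k
    y         : observed datum y_k
    v_bar     : mean of the multiplicative noise
    gamma_mul : variance of the multiplicative noise (Gamma_odot)
    gamma_add : variance of the additive noise (Gamma_+)
    """
    mean_h, var_h = m, C                      # moments of h(x) = x
    mu = mean_h * v_bar                       # Eq. (8)
    # Eq. (9), using E[h^2] = Var[h] + E[h]^2
    S = (var_h + mean_h**2) * gamma_mul + var_h * v_bar**2 + gamma_add
    U = C * v_bar                             # Eq. (10): Cov[x, h] * v_bar
    K = U / S                                 # Eq. (11)
    m_post = m + K * (y - mu)                 # Eq. (12)
    C_post = C - K * U                        # Eq. (13)
    return m_post, C_post
```

With `gamma_mul = 0` and `v_bar = 1`, the update reduces to the standard scalar Kalman update, which gives a simple sanity check.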

In a past work [15], we analyzed the computational complexity of the filtering procedure for additive noise models and found it to be on the order $\mathcal{O}(N(n^{3}+m^{3}))$ . Here, we analyze the increase in computational complexity of this filtering procedure when multiplicative noise is present in addition to additive noise. Because we assumed that the noise is stationary, the outer products $\bar{\mathbf{w}}\bar{\mathbf{w}}^{\top}$ and $\bar{\mathbf{v}}\bar{\mathbf{v}}^{\top}$ need only be evaluated once. All other operations scale linearly with the number of data $N$ . The added computational cost is summarized in Table 1. The order of the added expense is $\mathcal{O}(N(n^{2}+m^{2}))$ , so the additional computation required when multiplicative noise is added to the model does not affect the order of complexity of the overall algorithm.

Table 1: The increase in computational complexity of Gaussian filtering with both additive and multiplicative noise in the HMM (4) compared to Gaussian filtering with only additive noise present.

Model component	Equation	Added flops
Dynamics	Eq. (6)	$Nn$
Dynamics	Eq. (7)	$5Nn^{2}+n^{2}$
Observations	Eq. (8)	$Nm$
Observations	Eq. (9)	$5Nm^{2}+m^{2}$
Observations	Eq. (10)	$Nnm$
Total:		$N(5n^{2}+5m^{2}+nm+n+m)+m^{2}+n^{2}$
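As a quick arithmetic check of Table 1, the sketch below sums the per-equation added flops and compares the result with the closed-form total. The function name and the example sizes are illustrative assumptions.

```python
def added_flops(N, n, m):
    """Added flops from Table 1 for N data, state dimension n, observation dimension m."""
    per_equation = {
        "Eq. (6)": N * n,
        "Eq. (7)": 5 * N * n**2 + n**2,
        "Eq. (8)": N * m,
        "Eq. (9)": 5 * N * m**2 + m**2,
        "Eq. (10)": N * n * m,
    }
    return per_equation, sum(per_equation.values())

# Example sizes chosen only for illustration.
per_eq, total = added_flops(N=1000, n=4, m=2)
closed_form = 1000 * (5 * 4**2 + 5 * 2**2 + 4 * 2 + 4 + 2) + 2**2 + 4**2
```

The dominant terms are quadratic in $n$ and $m$, which is below the cubic cost of the additive-noise filter.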

4.2 Embedding symplectic structure

Next we describe a parameterization strategy for embedding symplectic structure within the learning process that is shown graphically in Fig. 1. The main idea behind this strategy is that rather than parameterizing the time derivatives or the propagator $\Psi$ directly, we parameterize the Hamiltonian $H(\mathbf{q},\mathbf{p},\bm{\theta})$ . From this Hamiltonian, the time derivatives are derived from Hamilton’s equations and then integrated with a symplectic integrator. This process ensures that the estimated model will be Hamiltonian and also that its flow will be symplectic. This approach has shown utility in various other works [41, 17, 7, 42].

Figure 1: Schematic showing how Hamiltonian structure is preserved when evaluating the likelihood.

From the parameterized Hamiltonian, the continuous-time dynamics are given by Hamilton’s equations (5). However, the HMM (4) that we are using is formulated in discrete time. To resolve this discrepancy, we require a propagator $\Psi$ that can map the state forward in time using the estimated time derivatives without violating the physical properties mentioned in Section 3.2. For this purpose, we utilize a recently-developed explicit symplectic integrator for nonseparable Hamiltonian systems [43] that we refer to as “Tao’s integrator” throughout the paper. We briefly describe the integration method here.

Consider an arbitrary Hamiltonian $H(\mathbf{q},\mathbf{p})$ and fictitious position and momentum vectors $\bar{\mathbf{q}}$ and $\bar{\mathbf{p}}$ corresponding to $\mathbf{q}$ and $\mathbf{p}$ . Next, define the augmented Hamiltonian

\bar{H}(\mathbf{q},\mathbf{p},\bar{\mathbf{q}},\bar{\mathbf{p}}):=\underbrace{H(\mathbf{q},\bar{\mathbf{p}})}_{H_{a}}+\underbrace{H(\bar{\mathbf{q}},\mathbf{p})}_{H_{b}}+\lambda\cdot\underbrace{\left(\|\mathbf{q}-\bar{\mathbf{q}}\|_{2}^{2}/2+\|\mathbf{p}-\bar{\mathbf{p}}\|_{2}^{2}/2\right)}_{H_{c}},

(15)

where $H_{a}:=H(\mathbf{q},\bar{\mathbf{p}})$ and $H_{b}:=H(\bar{\mathbf{q}},\mathbf{p})$ correspond to two copies of the original nonseparable Hamiltonian system with mixed-up positions and momenta; $H_{c}$ is an artificial constraint; and $\lambda$ is a constant that controls the binding of the two copies. Unlike the original Hamiltonian $H$ , the extended Hamiltonian $\bar{H}$ is amenable to explicit symplectic integration.

From $\bar{H}$ , a propagator $\Psi$ is derived using a second-order explicit symplectic method based on Strang splitting

\Psi:=\psi^{\Delta t/2}_{H_{a}}\circ\psi^{\Delta t/2}_{H_{b}}\circ\psi^{\Delta t}_{\lambda H_{c}}\circ\psi^{\Delta t/2}_{H_{b}}\circ\psi^{\Delta t/2}_{H_{a}},

(16)

where $\psi^{\Delta t}_{H_{a}},\psi^{\Delta t}_{H_{b}},$ and $\psi^{\Delta t}_{\lambda H_{c}}$ are the time-$\Delta t$ flows of $H_{a},H_{b},$ and $\lambda H_{c}$, respectively. Each of these flows can be evaluated exactly and explicitly and is itself symplectic, so their composition is an explicit and symplectic propagator as desired.
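The composition in Eq. (16) can be sketched in a few lines for a one-degree-of-freedom system. The example below uses the Hamiltonian $H(q,p)=\frac{1}{2}(q^{2}+1)(p^{2}+1)$ from Section 6.3 as a test case. The binding constant `lam = 20.0`, the step size, and all function names are assumptions made for illustration, not values or code taken from this paper.

```python
import math

def grad_H(q, p):
    """Partial derivatives (dH/dq, dH/dp) of H = (q^2 + 1)(p^2 + 1) / 2."""
    return q * (p**2 + 1.0), p * (q**2 + 1.0)

def flow_a(state, dt):
    """Exact flow of H_a = H(q, p_bar): q and p_bar stay fixed."""
    q, p, qb, pb = state
    dHdq, dHdp = grad_H(q, pb)
    return (q, p - dt * dHdq, qb + dt * dHdp, pb)

def flow_b(state, dt):
    """Exact flow of H_b = H(q_bar, p): q_bar and p stay fixed."""
    q, p, qb, pb = state
    dHdq, dHdp = grad_H(qb, p)
    return (q + dt * dHdp, p, qb, pb - dt * dHdq)

def flow_c(state, dt, lam):
    """Exact flow of lam * H_c: sums are conserved, differences rotate."""
    q, p, qb, pb = state
    c, s = math.cos(2.0 * lam * dt), math.sin(2.0 * lam * dt)
    dq, dp = q - qb, p - pb
    rq, rp = c * dq + s * dp, -s * dq + c * dp
    sq, sp = q + qb, p + pb
    return (0.5 * (sq + rq), 0.5 * (sp + rp), 0.5 * (sq - rq), 0.5 * (sp - rp))

def tao_step(state, dt, lam):
    """One step of the symmetric (Strang) composition in Eq. (16)."""
    state = flow_a(state, 0.5 * dt)
    state = flow_b(state, 0.5 * dt)
    state = flow_c(state, dt, lam)
    state = flow_b(state, 0.5 * dt)
    return flow_a(state, 0.5 * dt)

def hamiltonian(q, p):
    return 0.5 * (q**2 + 1.0) * (p**2 + 1.0)

# The fictitious copy (q_bar, p_bar) is initialized at the physical state.
state0 = (0.0, -3.0, 0.0, -3.0)
state = state0
for _ in range(1000):
    state = tao_step(state, dt=1e-3, lam=20.0)
```

Because the composition is palindromic, a step with `-dt` undoes a step with `dt` up to round-off, which is a convenient check on an implementation.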

5 Structure-preserving dimension reduction for high-dimensional systems

In this section, we discuss the approach of probabilistically learning low-dimensional dynamics of high-dimensional systems, including how to preserve Hamiltonian structure throughout the dimension-reduction process. This approach is divided into three steps. The first step in Section 5.1 is defining a structure-preserving propagator in the reduced-dimensional state space, the second step in Section 5.2 is deriving a low-dimensional observation function that determines a reduced-dimensional observation space, and the final step in Section 5.3 is deriving and evaluating a likelihood in this low-dimensional space. This revised process for high-dimensional Hamiltonians is graphically illustrated in Fig. 2.

Figure 2: Schematic showing how Hamiltonian structure is preserved in reduced dimensions when evaluating the likelihood. In this setting, the parameter vector is partitioned into quadratic $\theta_{\text{quad}}$ and nonlinear components $\theta_{\text{nl}}$, where $\theta_{\text{quad}}$ is determined by $\theta_{\text{nl}}$ through the H-OpInf algorithm.

5.1 Reduced-dimension dynamics

Reducing the dimension of a dynamical system hinges upon a key hypothesis: that the system of interest is a high-dimensional realization of underlying low-dimensional dynamics. Under this assumption, it is theoretically possible to estimate dynamics with a considerably lower state dimension than the ambient dimension without any loss of accuracy. Estimating the hypothesized low-dimensional dynamics involves two primary components. The first is identifying a low-dimensional subspace on which a significant portion of the state evolution is contained. The other component is estimating a dynamical model whose state evolves within this low-dimensional space. Here we first describe linear dimension reduction for dynamical systems and then expound on alterations that can be made to preserve symplectic structure.

5.1.1 Linear dimension reduction

In this work, we consider linear projections. Let $\mathbf{V}\in\mathbb{R}^{n\times r}$ and $\mathbf{V}_{\perp}\in\mathbb{R}^{n\times(n-r)}$ be matrices with orthonormal columns whose ranges are the complementary subspaces $\mathcal{S},\mathcal{S}^{\perp}\subseteq\mathbb{R}^{n}$, with $r\ll n$. Denoting the components of the state $\mathbf{x}_{k}$ that lie in $\mathcal{S}$ and $\mathcal{S}^{\perp}$ as $\mathbf{x}_{k,\mathcal{S}}$ and $\mathbf{x}_{k,\mathcal{S}^{\perp}}$, the state can be written as

\mathbf{x}_{k}=\mathbf{x}_{k,\mathcal{S}}+\mathbf{x}_{k,\mathcal{S}^{\perp}}=\mathbf{V}\mathbf{V}^{\top}\mathbf{x}_{k}+\mathbf{V}_{\perp}\mathbf{V}_{\perp}^{\top}\mathbf{x}_{k}=\mathbf{V}\tilde{\mathbf{x}}_{k}+\mathbf{V}_{\perp}\tilde{\mathbf{x}}_{k,\perp},

(17)

where $\tilde{\mathbf{x}}_{k}\in\mathbb{R}^{r}$ and $\tilde{\mathbf{x}}_{k,\perp}\in\mathbb{R}^{n-r}$ are the low-dimensional representations of $\mathbf{x}_{k,\mathcal{S}}$ and $\mathbf{x}_{k,\mathcal{S}^{\perp}}$.
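The decomposition in Eq. (17) can be illustrated on a small example. The sketch below uses a hand-picked basis with orthonormal columns ($n=4$, $r=2$) and plain Python lists; the matrix, the state, and the helper names are assumptions chosen only to keep the arithmetic easy to follow.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

s = 1.0 / math.sqrt(2.0)
V = [[s, 0.0], [s, 0.0], [0.0, s], [0.0, s]]    # n = 4, r = 2, orthonormal columns
x = [1.0, 3.0, 2.0, 0.0]

x_tilde = matvec(transpose(V), x)                # reduced coordinates V^T x
x_S = matvec(V, x_tilde)                         # component of x in S
x_perp = [xi - si for xi, si in zip(x, x_S)]     # component in the complement
residual = matvec(transpose(V), x_perp)          # should vanish: V^T x_perp = 0
```

Here `x_S` is `[2, 2, 1, 1]` up to round-off and `x_perp` is `[-1, 1, 1, -1]`; the reduced model only tracks `x_tilde`.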

Once the subspace has been identified, the next step is to learn low-dimensional dynamics of the form

\tilde{\mathcal{T}}(\tilde{\bm{x}}_{k},\bm{\theta},\omega)=\tilde{\Psi}(\tilde{\bm{x}}_{k},\bm{\theta})+\tilde{\bm{\xi}}_{k}(\omega),

(18)

where $\tilde{\Psi}:\mathbb{R}^{r}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{r}$ is the low-dimensional dynamics propagator and $\tilde{\bm{\xi}}_{k}(\omega)\in\mathbb{R}^{r}$ represents the model uncertainty of $\tilde{\Psi}$ . Since the form of uncertainty in the model is unknown, we select the additive model for simplicity. We use $\tilde{\bm{x}}_{k}$ as the uncertain state vector and therefore model it as a random vector. The state $\tilde{\mathbf{x}}_{k,\perp}$ is omitted from this model as a result of the earlier hypothesis that the dynamics are largely confined to a low-dimensional manifold, in this case $\mathcal{S}$ . With these dynamics, predictions of the high-dimensional state can be made using $\bm{x}_{k}=\mathbf{V}\tilde{\Psi}^{k-1}(\tilde{\bm{x}}_{1},\bm{\theta})$ , where $\tilde{\Psi}^{k-1}$ denotes $k-1$ compositions of the low-dimensional propagator.

5.1.2 Hamiltonian case

Now we seek to extend this methodology to learn low-dimensional Hamiltonian models arising from discretizations of infinite-dimensional Hamiltonian systems. For this work, we focus on infinite-dimensional canonical Hamiltonian systems with Hamiltonian functionals of the form

\mathcal{H}[q,p]=\int\left(H_{\text{quad}}(q,p,q_{z},p_{z},\cdots)+H_{\text{nl}}(q,p)\right)\ \text{d}z,

(19)

where $z$ is the spatial variable and $q_{z}=\frac{\partial q}{\partial z}$ is the partial derivative of $q$ with respect to $z$ . The function $H_{\text{quad}}(q,p,q_{z},p_{z},\cdots)$ contains quadratic terms, and $H_{\text{nl}}(q,p)$ contains spatially-local nonlinear terms.

To preserve the symplectic structure of the Hamiltonian system, we require that the symplecticity of the Hamiltonian flow is preserved in the reduced-dimensional space. This can be achieved by using the cotangent lift algorithm to find a projection matrix $\mathbf{V}$ that is symplectic. We briefly describe the cotangent lift algorithm here; more details can be found in Appendix A. Let $\mathbf{x}_{1},\dots,\mathbf{x}_{N}$ be a collection of full-field data collected from the system at times $t_{1},\dots,t_{N}$. We define the snapshot matrices

\mathbf{Q}=\begin{bmatrix}\mathbf{q}_{1}\cdots\mathbf{q}_{N}\end{bmatrix}\in\mathbb{R}^{d\times N},\quad\mathbf{P}=\begin{bmatrix}\mathbf{p}_{1}\cdots\mathbf{p}_{N}\end{bmatrix}\in\mathbb{R}^{d\times N}.

(20)

Then we compute a linear symplectic basis

\mathbf{V}=\begin{bmatrix}\mathbf{\Phi}&\mathbf{0}\\ \mathbf{0}&\mathbf{\Phi}\end{bmatrix}\in\mathbb{R}^{2d\times 2r},

(21)

where $\mathbf{\Phi}\in\mathbb{R}^{d\times r}$ is based on the leading $r$ left singular vectors of the extended snapshot matrix $\mathbf{X}_{e}=[\mathbf{Q},\mathbf{P}]\in\mathbb{R}^{d\times 2N}$.
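A short calculation shows why the block-diagonal basis in Eq. (21) is symplectic. It assumes the canonical structure matrix $\mathbf{J}_{2d}=\begin{bmatrix}\mathbf{0}&\mathbf{I}_{d}\\ -\mathbf{I}_{d}&\mathbf{0}\end{bmatrix}$ (a symbol introduced here only for this check) and that the columns of $\mathbf{\Phi}$ are orthonormal, $\mathbf{\Phi}^{\top}\mathbf{\Phi}=\mathbf{I}_{r}$, as holds for singular vectors:

```latex
\mathbf{V}^{\top}\mathbf{J}_{2d}\mathbf{V}
=\begin{bmatrix}\mathbf{\Phi}^{\top}&\mathbf{0}\\ \mathbf{0}&\mathbf{\Phi}^{\top}\end{bmatrix}
\begin{bmatrix}\mathbf{0}&\mathbf{I}_{d}\\ -\mathbf{I}_{d}&\mathbf{0}\end{bmatrix}
\begin{bmatrix}\mathbf{\Phi}&\mathbf{0}\\ \mathbf{0}&\mathbf{\Phi}\end{bmatrix}
=\begin{bmatrix}\mathbf{0}&\mathbf{\Phi}^{\top}\mathbf{\Phi}\\ -\mathbf{\Phi}^{\top}\mathbf{\Phi}&\mathbf{0}\end{bmatrix}
=\mathbf{J}_{2r}.
```

Using the same $\mathbf{\Phi}$ for both positions and momenta is what makes the off-diagonal blocks reduce to identities.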

Next, we propose a parameterization of $\tilde{\Psi}$ that preserves the low-dimensional Hamiltonian structure. We define the low-dimensional position and momentum $\tilde{\mathbf{q}},\tilde{\mathbf{p}}\in\mathbb{R}^{r}$ as

\tilde{\mathbf{q}}=\mathbf{\Phi}^{\top}\mathbf{q},\qquad\tilde{\mathbf{p}}=% \mathbf{\Phi}^{\top}\mathbf{p}.

(22)

Based on the Hamiltonian form of Eq. (19), we use the parameterization

\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta})=\frac{1}{2}\tilde{\mathbf{q}}^{\top}\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\theta}_{\text{quad}})\tilde{\mathbf{q}}+\frac{1}{2}\tilde{\mathbf{p}}^{\top}\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\theta}_{\text{quad}})\tilde{\mathbf{p}}+\tilde{H}_{\text{nl}}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}),

(23)

where $\tilde{H}_{\text{nl}}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}})\coloneqq H_{\text{nl}}(\mathbf{\Phi}\tilde{\mathbf{q}},\mathbf{\Phi}\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}})$ defines the nonlinear terms, and $\tilde{\mathbf{D}}_{\mathbf{q}}$ and $\tilde{\mathbf{D}}_{\mathbf{p}}$ are symmetric matrices defining the quadratic terms. We have partitioned the parameter vector $\bm{\theta}=\begin{bmatrix}\bm{\theta}_{\text{quad}}^{\top}&\bm{\theta}_{\text{nl}}^{\top}\end{bmatrix}^{\top}$ into components pertaining to the quadratic and nonlinear terms of the Hamiltonian to distinguish between them. Following Hamilton’s equations (5), the governing equations for this low-dimensional Hamiltonian system are

\dot{\tilde{\mathbf{q}}}(\bm{\theta})=\frac{\partial\tilde{H}}{\partial\tilde{\mathbf{p}}}=\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\theta}_{\text{quad}})\tilde{\mathbf{p}}+\mathbf{\Phi}^{\top}\nabla_{\mathbf{p}}H_{\text{nl}}(\mathbf{\Phi}\tilde{\mathbf{q}},\mathbf{\Phi}\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}),\qquad\dot{\tilde{\mathbf{p}}}(\bm{\theta})=-\frac{\partial\tilde{H}}{\partial\tilde{\mathbf{q}}}=-\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\theta}_{\text{quad}})\tilde{\mathbf{q}}-\mathbf{\Phi}^{\top}\nabla_{\mathbf{q}}H_{\text{nl}}(\mathbf{\Phi}\tilde{\mathbf{q}},\mathbf{\Phi}\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}).

(24)

Numerically integrating these equations with Tao’s integrator as in Section 4.2 yields a structure-preserving propagator $\tilde{\Psi}$ . With the Hamiltonian operator inference method H-OpInf [44], $\bm{\theta}_{\text{quad}}$ can be determined in a straightforward fashion, leaving only $\bm{\theta}_{\text{nl}}$ to be estimated. More details on this approach will be given in Section 5.3.

5.2 Reduced-dimension observations

Now that the state dimension has been reduced, we require an observation function $\tilde{h}$ defined over this reduced-dimensional state. Furthermore, if the observations themselves are high-dimensional, the high computational complexity of Gaussian filtering may require reduction of the observations in addition to the states. We describe this approach in the setting of identity observation operators for simplicity of presentation.

To formulate the measurement function in terms of $\tilde{\bm{x}}_{k}$ , we first decompose the full-dimensional measurements in terms of $\bm{x}_{\mathcal{S}}$ and $\bm{x}_{\mathcal{S}^{\perp}}$ . Assuming that the measurement function $h$ in the observation model (4b) is the identity, we have

\bm{y}_{k}=\bm{x}_{k}\odot\bm{v}_{k}+\bm{\eta}_{k}=\left(\bm{x}_{k,\mathcal{S}}+\bm{x}_{k,\mathcal{S}^{\perp}}\right)\odot\bm{v}_{k}+\bm{\eta}_{k}=\left(\mathbf{V}\tilde{\bm{x}}_{k}+\bm{x}_{k,\mathcal{S}^{\perp}}\right)\odot\bm{v}_{k}+\bm{\eta}_{k}.

(25)

The next step is to transform $\mathbf{y}$ into a low-dimensional form. A natural choice for this step is to project $\mathbf{y}$ onto the subspace spanned by the low-dimensional dynamics to produce the low-dimensional measurements $\tilde{\mathbf{y}}=\mathbf{V}^{\top}\mathbf{y}$ . This induces the low-dimensional measurement model

\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)=\mathbf{V}^{\top}\bm{y}_{k}=\mathbf{V}^{\top}\Bigl{(}\left(\mathbf{V}\tilde{\bm{x}}_{k}+\bm{x}_{k,\mathcal{S}^{\perp}}\right)\odot\bm{v}_{k}\Bigr{)}+\mathbf{V}^{\top}\bm{\eta}_{k},

(26)

which produces a collection of low-dimensional data $\tilde{\mathcal{Y}}_{N}=\{\mathbf{V}^{\top}\mathbf{y}_{k}|k=1,\ldots,N\}$ . This gives rise to a modified posterior distribution $\pi(\bm{\uptheta}|\tilde{\mathcal{Y}}_{N})$ . In general, this distribution is only an approximation of the original distribution $\pi(\bm{\uptheta}|\mathcal{Y}_{N})$ , but it can be computed much more efficiently than the original, as we will show in Section 5.3.

In the case of only additive noise, the additive form of the measurement model is preserved

\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)=\tilde{\bm{x}}_{k}+\mathbf{V}^{\top}(\bm{x}_{k,\mathcal{S}^{\perp}}+\bm{\eta}_{k}),

(27)

where

\mathbf{V}^{\top}(\bm{x}_{k,\mathcal{S}^{\perp}}+\bm{\eta}_{k})\sim\mathcal{N}\Big{(}\mathbf{V}^{\top}\mathbb{E}\left[\bm{x}_{k,\mathcal{S}^{\perp}}\right],\mathbf{V}^{\top}(\mathbb{V}\text{ar}\left[\bm{x}_{k,\mathcal{S}^{\perp}}\right]+\mathbf{\Gamma}_{+})\mathbf{V}\Big{)}

(28)

can be treated as a non-stationary additive noise term with unknown mean and covariance. This form allows for Gaussian filtering through simple application of the Kalman filter. In the general form of Eq. (26), however, the noise can no longer be modeled by an additive and multiplicative model, and Eqs. (8) and (9) for evaluating the mean and covariance of $\mathcal{M}$ cannot be applied to $\tilde{\mathcal{M}}$. Instead, the first two moments of $\tilde{\mathcal{M}}$ are given as


$\displaystyle\mathbb{E}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},\omega)\right]$	$\displaystyle=\mathbf{V}^{\top}\Big{(}(\mathbf{V}\mathbb{E}\left[\tilde{\bm{x}}\right]+\bm{x}_{\mathcal{S}^{\perp}})\odot\bar{\mathbf{v}}\Big{)},$	(29a)
$\displaystyle\mathbb{V}\text{ar}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},\omega)\right]$	$\displaystyle=\mathbf{V}^{\top}\Big{(}\mathbb{V}\text{ar}\left[(\mathbf{V}\tilde{\bm{x}}+\bm{x}_{\mathcal{S}^{\perp}})\odot\bm{v}\right]+\mathbf{\Gamma}_{+}\Big{)}\mathbf{V}.$	(29b)

Due to the presence of the unknown $\bm{x}_{\mathcal{S}^{\perp}}$ term, these moments are not computable. However, $\bm{x}_{\mathcal{S}^{\perp}}$ can be assumed to be small due to the initial hypothesis that $\bm{x}$ is mostly contained in $\mathcal{S}$ . By neglecting $\bm{x}_{\mathcal{S}^{\perp}}$ , Eq. (29) becomes


$\displaystyle\mathbb{E}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},\omega)\right]$	$\displaystyle\approx\mathbb{E}\left[\tilde{\bm{x}}\right]\odot\bar{\mathbf{v}},$	(30a)
$\displaystyle\mathbb{V}\text{ar}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)\right]$	$\displaystyle\approx\mathbb{V}\text{ar}\left[\tilde{\bm{x}}_{k}\right]+\mathbf{\Gamma}(\bm{\theta}),$	(30b)

where the uncertainty due to $\bm{v}$ in (29b) has been absorbed into an estimated stationary additive term $\mathbf{\Gamma}(\bm{\theta})$ in (30b). In this simplified form, both moments can be computed straightforwardly with a standard filtering procedure.

When $\bar{\mathbf{v}}=\mathbf{1}$ , the approximation (30) is equivalent to assuming the low-dimensional measurement model

\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)\approx\tilde{h}(\tilde{\bm{x}}_{k})+\tilde{\bm{\eta}}_{k},\quad\tilde{\bm{\eta}}_{k}\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma}(\bm{\theta})),

(31)

where $\tilde{h}(\tilde{\bm{x}}_{k})=\tilde{\bm{x}}_{k}$ . This model will be shown to yield acceptable accuracy on an example problem in Section 6.5.

5.3 Reduced-dimension likelihood evaluation

Lastly, we require a method for evaluating the likelihood of the low-dimensional Hamiltonian (23). With the symmetry constraints on $\tilde{\mathbf{D}}_{\mathbf{q}}$ and $\tilde{\mathbf{D}}_{\mathbf{p}}$, each matrix has $r(r+1)/2$ free entries, so the dimension of $\bm{\theta}_{\text{quad}}(\omega)$ is $r(r+1)$. This dimension grows quadratically with $r$, and even for $r\ll n$, this scaling can be cumbersome for nonlinear optimization methods. Fortunately, H-OpInf [44] provides a method for estimating $\bm{\theta}_{\text{quad}}$ with linear optimization when $H_{\text{nl}}$ is available. This procedure is described in more detail in Appendix B. By treating $\bm{\theta}_{\text{quad}}$ as a deterministic function of $\bm{\theta}_{\text{nl}}$ defined through the H-OpInf procedure, the posterior can be simplified as follows

\pi(\bm{\uptheta}|\tilde{\mathcal{Y}}_{N})=\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})\pi(\bm{\uptheta}_{\text{quad}}|\bm{\uptheta}_{\text{nl}},\tilde{\mathcal{Y}}_{N})=\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})\delta_{\bm{\uptheta}_{\text{nl}}}(\bm{\uptheta}_{\text{quad}}),

(32)

where $\delta_{\bm{\uptheta}_{\text{nl}}}(\bm{\uptheta}_{\text{quad}})$ takes the value 1 if $\bm{\uptheta}_{\text{quad}}$ is computed from $\bm{\uptheta}_{\text{nl}}$ with H-OpInf and 0 otherwise. Therefore, we only need a method for evaluating the marginal posterior $\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})$. This pdf can be evaluated by applying Algorithm 1 to the HMM from (1) with dynamics $\tilde{\mathcal{T}}$ from (18) and measurements $\tilde{\mathcal{M}}$ from (26). The full procedure for this evaluation is given in Algorithm 2 and illustrated in Fig. 2.

Algorithm 2 Evaluating the reduced-dimensional likelihood of a high-dimensional Hamiltonian system

1: Input: Full-field data $\mathcal{Y}_{N}$
2: Input: Reduced dimension $r$
3: Input: Parameter vector $\bm{\uptheta}_{\text{nl}}$
4: Output: Evaluation of $\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})$
5: Pre-processing
6: Assemble data into snapshot matrices $\mathbf{Q}=\begin{bmatrix}|&|&|&|\\ \mathbf{q}_{1}&\mathbf{q}_{2}&\cdots&\mathbf{q}_{N}\\ |&|&|&|\end{bmatrix},\quad\mathbf{P}=\begin{bmatrix}|&|&|&|\\ \mathbf{p}_{1}&\mathbf{p}_{2}&\cdots&\mathbf{p}_{N}\\ |&|&|&|\end{bmatrix}$
7: Compute projection matrix with Eq. (21): $\mathbf{V}\leftarrow\texttt{cotangent\_lift}(\mathbf{Q},\mathbf{P},r)$
8: Define $\tilde{\mathcal{Y}}_{N}=\{\mathbf{V}^{\top}\mathbf{y}_{k}|k=1,\ldots,N\}$
9: Evaluation
10: Solve for $\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\uptheta}_{\text{quad}})$ and $\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\uptheta}_{\text{quad}})$ using $\bm{\uptheta}_{\text{nl}}$ with H-OpInf from Appendix B
11: Define low-dimensional propagator $\tilde{\Psi}$ using Tao’s integrator (16) and symplectic time derivatives (24)
12: Define low-dimensional observations $\tilde{h}(\tilde{\bm{x}})=\tilde{\bm{x}}$
13: Evaluate $\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})$ using Algorithm 1 with dynamics $\tilde{\mathcal{T}}$ (18) and observation model $\tilde{\mathcal{M}}(\tilde{\mathbf{x}})$ (31)

6 Numerical experiments

In this section, we study the numerical performance of the proposed Bayesian system ID approach for three Hamiltonian models with increasing levels of complexity. In Section 6.1, we describe an existing method whose model parameterization we implement within the proposed Bayesian approach for comparison. Then in Section 6.2, we give an overview of training considerations that are used throughout each numerical experiment. The final three sections contain the numerical examples. Each experiment and its distinct contribution beyond the authors’ previous work [17] is outlined as follows. In Section 6.3 we consider a polynomial nonseparable Hamiltonian system from [43] to demonstrate the advantages of the proposed Bayesian approach compared to a state-of-the-art neural network-based method when the data are sparse and/or noisy. In Section 6.4 we consider the double pendulum system to show the effectiveness of the proposed approach for non-polynomial, nonseparable Hamiltonian models in the chaotic regime. Finally, in Section 6.5 we consider high-dimensional data with multiplicative noise from the nonlinear Schrödinger equation and employ the novel Algorithm 2 for learning a high-dimensional Hamiltonian model.

6.1 Parameterization using deep neural networks

The Bayesian approach presented in this paper is flexible enough that it can be paired with any arbitrary parameterization of the dynamics propagator $\Psi$ . Here, we choose a recently-developed neural network architecture for learning nonseparable Hamiltonian systems known as the nonseparable symplectic neural network (NSSNN) [40] as our parameterization. This approach parameterizes the Hamiltonian using a neural network, evaluates Hamilton’s equations using auto-differentiation, and integrates the equations with Tao’s integrator, see Section 4.2. In Sections 6.3 and 6.4, we parameterize $\Psi$ using the NSSNN and compare the performance of this neural network when it is trained with the negative log posterior as its objective against when it is trained with the objective recommended in its original paper [40]. The goal of these experiments is to investigate the effect of each objective on the quality of the estimated model when the training set consists of sparse and noisy data.

Here we briefly describe the NSSNN architecture and training procedure from [40]. The neural network parameterizing the Hamiltonian is composed of six linear layers with sigmoid activation functions following all but the last layer. The training data is a set of input-output pairs. The inputs $(\mathbf{q}_{1}^{(j)},\mathbf{p}_{1}^{(j)})$ are a collection of measurements at times $t_{1}^{(j)}$ , and the outputs $(\mathbf{q}^{(j)},\mathbf{p}^{(j)})$ are measurements at times $t^{(j)}=t_{1}^{(j)}+T$ for $j=1,\ldots N$ . For each $j$ , the input and output are collected from the same trajectory and are separated by a time length of $T$ . Since the integrator returns both the physical $\mathbf{q}$ , $\mathbf{p}$ and fictitious $\bar{\mathbf{q}}$ , $\bar{\mathbf{p}}$ position and momentum, as described in Section 4.2, the loss function contains a term corresponding to each of these variables. An $L_{1}$ loss is used since it was empirically shown by [40] to yield better results:

\mathcal{J}(\bm{\uptheta})=\frac{1}{N}\sum_{j=1}^{N}\left(\left\lVert\mathbf{q}^{(j)}-\hat{\mathbf{q}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left\lVert\mathbf{p}^{(j)}-\hat{\mathbf{p}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left\lVert\mathbf{q}^{(j)}-\hat{\bar{\mathbf{q}}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left\lVert\mathbf{p}^{(j)}-\hat{\bar{\mathbf{p}}}^{(j)}(\bm{\uptheta})\right\rVert_{1}\right),

(33)

where the hat $\hat{}$ denotes an estimated value. In the NSSNN training procedure, the weights of each layer are initialized with Xavier initialization, and the Adam optimizer [45] is used to minimize the loss. The optimizer’s hyperparameters (the learning rate and beta values $\beta_{1}$ , $\beta_{2}$ ) are problem-dependent and will be reported within each numerical experiment’s subsection. Aside from the loss function, we follow the same procedure for training the NSSNN with the negative log posterior. The NSSNN and training procedure are implemented in PyTorch [46] for its auto-differentiation capabilities.
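For reference, the loss in Eq. (33) can be written out in plain Python as follows. This is only a sketch of the formula: the function names are hypothetical, the data are stored as lists of per-sample vectors, and an actual training loop would use PyTorch tensors so that gradients are available.

```python
def l1_norm(a, b):
    """L1 norm of the difference between two equal-length vectors."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def nssnn_l1_loss(q_data, p_data, q_hat, p_hat, q_bar_hat, p_bar_hat):
    """Eq. (33): mean over the N input-output pairs of four L1 terms.

    q_data, p_data       : measured outputs at times t^(j)
    q_hat, p_hat         : predicted physical position and momentum
    q_bar_hat, p_bar_hat : predicted fictitious position and momentum
    """
    N = len(q_data)
    total = 0.0
    for j in range(N):
        total += l1_norm(q_data[j], q_hat[j]) + l1_norm(p_data[j], p_hat[j])
        total += l1_norm(q_data[j], q_bar_hat[j]) + l1_norm(p_data[j], p_bar_hat[j])
    return total / N
```

Both the physical and the fictitious predictions are compared against the same measured $(\mathbf{q}^{(j)},\mathbf{p}^{(j)})$, which penalizes drift between the two copies.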

6.2 Training considerations

In each of these experiments, we consider only single-trajectory training data. To generate the data, we run a numerical solver with a fine timestep $\Delta t_{f}$ to achieve high numerical accuracy. The training data are then subsampled from this solution using an interval $\Delta t_{t}>\Delta t_{f}$ . During training, the models use the larger timestep $\Delta t_{t}$ for prediction, which helps in improving the computational efficiency at the expense of larger numerical error. Once training is complete, the estimated models are tested by simulating with the fine timestep $\Delta t_{f}$ . Using a smaller timestep during testing ensures that the learned models have captured the continuous dynamics of the system and not only a discrete mapping at a single timestep [47].

For each of these experiments, we place an improper uniform prior, i.e., constant probability density, on the dynamics’ model parameters for the sake of direct comparison with the non-Bayesian method. We do, however, place weakly informative half-normal priors on the process noise variance parameters to enforce positivity and yield better convergence. The output model and the measurement noise variance are assumed to be known and are therefore fixed unless stated otherwise.

6.3 Tao’s example

The first system that we consider is a one-degree-of-freedom system used in [43] with the Hamiltonian

H(q,p)=\frac{1}{2}(q^{2}+1)(p^{2}+1).

(34)

For this experiment, we seek to assess the performances of the negative log posterior and the $L_{1}$ objective for training the NSSNN as the training data become fewer and noisier.

6.3.1 Data generation and training

We generate a collection of datasets of varying noise level and size for training. To generate the training data, we use Tao’s integrator with initial condition $\mathbf{x}_{1}=\begin{bmatrix}0&-3\end{bmatrix}^{\top}$ and timesteps $\Delta t_{f}=10^{-3}$ and $\Delta t_{t}=10^{-2}$. The number of data points $N$ in each dataset takes the values $N=300,400,\ldots,1000$, corresponding to training intervals of length $N\Delta t_{t}$. The period for this system is approximately 3.26, so at this $\Delta t_{t}$, the dataset timespans range from slightly under a single period to just over three periods. Then, we corrupt the datasets with multiplicative noise $\bm{v}_{k}\sim\mathcal{U}[1-a,1+a]$ for $a=0.00,0.01,\ldots,0.10$. The mean and variance of this noise are $\bar{\mathbf{v}}=\mathbf{1}$ and $\mathbf{\Gamma}_{\odot}=\frac{a^{2}}{3}\mathbf{I}_{2d}$. The final collection of datasets includes every $(a,N)$ combination for a total of 88 datasets.
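The noise model above is easy to reproduce. The following sketch draws the multiplicative noise, checks empirically that its variance is close to $a^{2}/3$, and counts the $(a,N)$ combinations. The helper name, the seed, and the sample size are assumptions for illustration, and the components are drawn independently, consistent with the diagonal $\mathbf{\Gamma}_{\odot}$.

```python
import random

def corrupt(states, a, rng):
    """Multiply every state component by independent noise v ~ U[1 - a, 1 + a]."""
    return [[x * rng.uniform(1.0 - a, 1.0 + a) for x in state] for state in states]

rng = random.Random(0)                     # fixed seed, for reproducibility only
a = 0.10
samples = [rng.uniform(1.0 - a, 1.0 + a) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / (len(samples) - 1)

noise_levels = [round(0.01 * i, 2) for i in range(11)]   # a = 0.00, ..., 0.10
sizes = list(range(300, 1001, 100))                      # N = 300, ..., 1000
n_datasets = len(noise_levels) * len(sizes)
```

The empirical `mean` and `var` should be close to 1 and $a^{2}/3\approx 3.3\times 10^{-3}$, and `n_datasets` recovers the 88 datasets.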

To train the NSSNN, we use an initial learning rate of 0.05 and beta values $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . Each model is trained for 400 epochs with the learning rate multiplied by 0.8 every 20 epochs. Following the approach presented in the original NSSNN paper [40], a batch size of 512 is used with the $L_{1}$ objective whenever applicable. For the negative log posterior, the process noise variance is parameterized as $\mathbf{\Sigma}_{+}=\bm{\theta}_{\mathbf{\Sigma}_{+}}\mathbf{I}_{2d}$ . The scalar optimization parameter $\theta_{\mathbf{\Sigma}_{+}}$ is initialized at a value of $10^{-3}$ , and a learning rate of $0.5\theta_{\mathbf{\Sigma}_{+}}$ and beta values $\beta_{1},\beta_{2}=0.1$ are used for optimization. To encourage small values, the prior $\text{half-}\mathcal{N}(0,10^{-12})$ is placed on the random variable $\bm{\theta}_{\mathbf{\Sigma}_{+}}$ . The positivity constraint is enforced by setting the value of $\theta_{\mathbf{\Sigma}_{+}}$ to be 0.9 times its previous value whenever the optimizer places $\theta_{\mathbf{\Sigma}_{+}}$ below zero. When $a=0$ , we set $\mathbf{\Gamma}_{\odot}=10^{-16}\times\mathbf{I}_{2d}$ for positive definiteness.
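The learning-rate schedule and the positivity rule described above amount to two small functions. The sketch below assumes that the decay is applied as a step function of the epoch index (as in a standard step scheduler) and uses hypothetical function names; it is not taken from the authors' code.

```python
def learning_rate(epoch, lr0=0.05, factor=0.8, every=20):
    """Step decay: the rate is multiplied by `factor` once every `every` epochs."""
    return lr0 * factor ** (epoch // every)

def project_positive(theta_new, theta_prev, shrink=0.9):
    """If the optimizer proposes a negative variance, use 0.9 times the previous value."""
    return shrink * theta_prev if theta_new < 0.0 else theta_new

final_lr = learning_rate(399)   # rate during the last of the 400 epochs
```

Under this assumption the rate decays from 0.05 to roughly $7.2\times 10^{-4}$ by the end of training.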

6.3.2 Time-domain prediction

Once training is completed on $N$ data points, each model is simulated starting at $\mathbf{x}_{1}$ for twice the training timespan until $t_{2N}$ using timestep $\Delta t_{f}$ . The accuracy of the model is then assessed by computing the mean squared error (MSE) defined as

\frac{1}{2N}\sum_{k=1}^{2N}\left[(\hat{q}_{k}-q_{k})^{2}+(\hat{p}_{k}-p_{k})^{2}\right].

(35)

The results of the training can vary due to the randomness in the initialization of the NSSNN parameters, in the mini-batching procedure, and in the measurement noise. Sometimes, the optimizer can even get stuck at poor model estimates that yield very high MSE. To mitigate the effect of these outliers, we train 20 models at each $(a,N)$ pair and then compare the two methods using the median and minimum MSEs. Each of these 20 rounds of training has different realizations of initial parameters, mini-batches, and measurement noise. We use these particular statistics to quantify two distinct characteristics of the training objectives. The minimum MSE represents the best accuracy that an objective could potentially achieve either through fortuitous initialization and batching or through a more sophisticated optimization methodology. The median MSE, on the other hand, represents the minimum accuracy a user can expect to achieve roughly 50% of the time using the standard Adam optimizer with a reasonable amount of hyperparameter tuning. Together, these two statistics give a more complete picture of each objective’s performance.
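The evaluation protocol above can be summarized in a few lines. This sketch computes the MSE of Eq. (35) for one trajectory and then the median and minimum over repeated training runs; the function names are hypothetical.

```python
from statistics import median

def trajectory_mse(q_hat, p_hat, q_true, p_true):
    """Eq. (35): mean over the evaluation points of the squared state error."""
    n_points = len(q_true)
    return sum((qh - q) ** 2 + (ph - p) ** 2
               for qh, ph, q, p in zip(q_hat, p_hat, q_true, p_true)) / n_points

def summarize(mse_values):
    """Median and minimum MSE over repeated training runs at one (a, N) pair."""
    return median(mse_values), min(mse_values)
```

The median is insensitive to a few runs with very high MSE, which is why it is preferred over the mean here.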

The $\log_{10}$ of the median and minimum MSEs are presented as heatmaps in Figs. 3(a) and 3(b), respectively. These heatmaps show the effects of varying the number of training data and the level of noise in the data on estimation performance. In this and all future figures, the label $\pi(\bm{\uptheta}|\mathcal{Y}_{N})$ denotes the MAP estimate. We observe that using the negative log posterior to train the NSSNN results in lower median and minimum MSE at nearly every $(a,N)$ pair. The primary exception is the minimum MSE in the $0\%$ noise column in which the posterior yields lower MSE at only roughly half of the $N$ values. As noise is added, however, the MSE of models trained with the $L_{1}$ loss increases much more steeply compared to the MSE of models trained with the posterior. These results demonstrate that the posterior is much more robust to noise than the $L_{1}$ loss. Additionally, the lower median MSE of the posterior in the low noise case shown in Fig. 3(a) reflects the fact that the posterior can more reliably find low MSE model estimates, especially when a smaller number of data are available.

To visualize how the model behavior varies with the noise and number of data, we plot the estimated trajectories of models corresponding to the four corners of the heatmaps. The estimated output of the median and minimum MSE models are shown in Figs. 4 and 5, respectively. The vertical pink line corresponds to the time $t_{2N}$ at which the MSE evaluation ends. Visually, the estimates produced with the negative log posterior can accurately reconstruct the phase manifold in all eight of the figures. The most noticeable inaccuracy produced by the posterior is the oscillation frequency of its median MSE estimates when the noise is high. However, the minimum MSE estimates do not have this issue.

The $L_{1}$ objective leads to poor results in two main areas: identifying good models with high noise data and reliable optimization with a low number of data. In the high noise case, both the median and minimum MSE estimates produced by the $L_{1}$ objective construct qualitatively poor estimates of the phase manifold. In the low number of data case, the minimum MSE estimate matches the true output closely when there is no noise. In this same case, however, the median MSE estimate produces a flat line in the time domain. This substantial difference in the median and minimum MSE estimates suggests that although the $L_{1}$ objective is able to produce good estimates with a small number of data, optimization is challenging and can require multiple attempts.

6.3.3 Phase-space prediction

In Fig. 4(b), we noted that the posterior estimate appears to qualitatively match the true output closely in phase space but produces large errors in the time domain. One way to quantitatively assess how closely the estimates match the system behavior in phase space is by looking at the MSE of the Hamiltonian, defined as

\frac{1}{2N}\sum_{k=1}^{2N}\Big(H(\hat{q}_{k},\hat{p}_{k})-H(q_{1},p_{1})\Big)^{2},

(36)

where $H$ is the Hamiltonian function (34). This metric measures the closeness of an estimate’s energy level to the true system energy. Since some of the median MSE estimates, such as those in Fig. 4(c), remain close to the initial point, we only assess the minimum MSE estimates with this metric. The $\log_{10}$ values of the mean squared Hamiltonian deviation for the minimum MSE estimates are shown as heatmaps in Fig. 6. This figure shows a clear trend of the Hamiltonian deviation growing both as noise increases and as the number of data decreases in both objectives’ estimates. The estimates produced by the $L_{1}$ objective outperform those produced by the posterior at almost all $N$ values when the data are noiseless, the notable exception being when the number of data is smallest.

Although Fig. 6 shows that the $L_{1}$ estimate has a lower Hamiltonian MSE when $a=0$ than the MAP estimate, Fig. 3 showed that the MAP estimate yielded generally lower MSE in the time domain for median performance and comparable MSE in the best-case (minimum MSE) performance. Moreover, the phase-space plots of Fig. 5 also indicate improved performance of the MAP estimate across noise cases. Looking at the magnitudes of the Hamiltonian MSE values in Fig. 6, the difference in errors between the MAP and $L_{1}$ estimates at $a=0$ is relatively small, approximately $10^{-3}$ . When $a>0$ , however, we observe a sharp increase in the Hamiltonian MSE from the $L_{1}$ estimates. In contrast, the MSE values of the MAP estimates increase gradually as $a$ increases, demonstrating greater robustness to noise.

These slight differences that we observe between the time-domain MSE and Hamiltonian MSE heatmaps arise as a result of the fundamental differences in the two error metrics. While the trajectory-based metric (35) computes the pointwise squared Euclidean distance between two flows parameterized by time, the energy-based metric (36) computes the distance between a flow and a specific energy value. Because the energy-based metric compares the flow to a time-invariant quantity, it cannot by itself properly assess the quality of the dynamics, nor does it gauge or reflect the proper phase-space behavior and level sets. This can lead to problems if, for example, the estimated model possesses a stationary point that coincides with the Hamiltonian level surface. If this stationary point is the initial condition, the flow would yield zero MSE in the energy-based metric while yielding high MSE in the trajectory-based metric. As mentioned earlier, this was the case for some of the median MSE estimates. Although this metric alone is not sufficient to gauge the accuracy of dynamics, it can be valuable when used in conjunction with other metrics for its simplicity and physically-meaningful interpretation. For example, if the energy of a system is known to be constant, the energy-based metric quantifies the physical plausibility of a given model by measuring the average squared change in energy of its flow.
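The stationary-point failure mode described above can be illustrated with a minimal sketch. It assumes a stand-in harmonic-oscillator Hamiltonian rather than the Hamiltonian (34), and a simple mean squared Euclidean distance as the trajectory-based metric; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def hamiltonian(q, p):
    # Stand-in Hamiltonian (harmonic oscillator); NOT the paper's Eq. (34).
    return 0.5 * (q**2 + p**2)

def hamiltonian_mse(q_hat, p_hat, q1, p1):
    # Energy-based metric in the form of Eq. (36): mean squared deviation
    # of the estimated energy from the energy at the initial point.
    return np.mean((hamiltonian(q_hat, p_hat) - hamiltonian(q1, p1)) ** 2)

def trajectory_mse(q_hat, p_hat, q_true, p_true):
    # Assumed trajectory-based metric: mean squared Euclidean distance
    # between estimated and true states at matching times.
    return np.mean((q_hat - q_true) ** 2 + (p_hat - p_true) ** 2)

t = np.linspace(0.0, 10.0, 1001)
q_true, p_true = np.cos(t), -np.sin(t)   # true flow started at (q, p) = (1, 0)

# An estimate that never leaves the initial condition (a stationary point).
q_stuck, p_stuck = np.ones_like(t), np.zeros_like(t)

energy_err = hamiltonian_mse(q_stuck, p_stuck, q_true[0], p_true[0])
traj_err = trajectory_mse(q_stuck, p_stuck, q_true, p_true)
```

In this toy setting the stuck estimate stays on the correct energy level, so the energy-based error vanishes while the trajectory-based error remains large, which is why the two metrics are best read together.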

6.3.4 Conclusions

We conclude that if the data are noiseless, the $L_{1}$ objective is the better option for this example because it gives time-domain MSE comparable to the negative log posterior at a lower computational cost. When the number of data is low, however, the negative log posterior is potentially more cost-efficient since it appears to require fewer training restarts than the $L_{1}$ objective based on the median results. For data that are noisy, the negative log posterior is the clear choice. Although the cost is higher, the negative log posterior produces significantly more accurate estimates when the data have as little as 1% noise.

6.4 Double pendulum

The next system that we consider is the double pendulum, which has the Hamiltonian

\small H(\mathbf{q},\mathbf{p})=\frac{m_{2}l_{2}^{2}p_{\phi}^{2}+(m_{1}+m_{2})l_{1}^{2}p_{\varphi}^{2}-2m_{2}l_{1}l_{2}p_{\phi}p_{\varphi}\cos(q_{\phi}-q_{\varphi})}{2l_{1}l_{2}m_{2}\left(m_{1}+m_{2}\sin^{2}(q_{\phi}-q_{\varphi})\right)}-(m_{1}+m_{2})gl_{1}\cos(q_{\phi})-m_{2}gl_{2}\cos(q_{\varphi}),

(37)

where $\mathbf{q}=\begin{bmatrix}q_{\phi}&q_{\varphi}\end{bmatrix}^{\top}$ , $\mathbf{p}=\begin{bmatrix}p_{\phi}&p_{\varphi}\end{bmatrix}^{\top}$ , and $m_{1}$ and $m_{2}$ are two masses connected by rigid rods of lengths $l_{1}$ and $l_{2}$ . This Hamiltonian is more complex than the previous example, and the behavior displayed by the double pendulum is chaotic in certain regions of the phase space. These two aspects make the system challenging to learn and allow us to test the limits of the NSSNN and the two objective functions.
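For reference, Eq. (37) can be transcribed directly into code. The following is a minimal sketch; the function name, the default parameter values, and the state ordering $\mathbf{x}=[\mathbf{q}^{\top}\ \mathbf{p}^{\top}]^{\top}$ are assumptions made for illustration.

```python
import numpy as np

def double_pendulum_hamiltonian(q, p, m1=1.0, m2=1.0, l1=1.0, l2=1.0, g=9.81):
    """Evaluate the double pendulum Hamiltonian as written in Eq. (37)."""
    q_phi, q_varphi = q
    p_phi, p_varphi = p
    delta = q_phi - q_varphi
    # Kinetic part: numerator and denominator of the fraction in Eq. (37).
    num = (m2 * l2**2 * p_phi**2
           + (m1 + m2) * l1**2 * p_varphi**2
           - 2.0 * m2 * l1 * l2 * p_phi * p_varphi * np.cos(delta))
    den = 2.0 * l1 * l2 * m2 * (m1 + m2 * np.sin(delta) ** 2)
    # Potential part: the two gravitational terms.
    potential = (-(m1 + m2) * g * l1 * np.cos(q_phi)
                 - m2 * g * l2 * np.cos(q_varphi))
    return num / den + potential

# Assumed ordering x = [q_phi, q_varphi, p_phi, p_varphi].
x1 = np.array([1.0, 0.0, 0.0, 0.0])
H1 = double_pendulum_hamiltonian(x1[:2], x1[2:])
```

The coupling through $\cos(q_{\phi}-q_{\varphi})$ in the kinetic term is what makes this Hamiltonian nonseparable: it cannot be written as a sum of a function of $\mathbf{q}$ and a function of $\mathbf{p}$.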

6.4.1 Data generation and training

For this experiment, we set $m_{1}=m_{2}=l_{1}=l_{2}=1$ and $g=9.81$ , and we use an initial condition of $\mathbf{x}_{1}=\begin{bmatrix}1&0&0&0\end{bmatrix}^{\top}$ , placing the dynamics in the chaotic regime. Then, we collect $N=2000$ measurements of the full state using timesteps $\Delta t_{f}=10^{-3}$ and $\Delta t_{t}=10^{-2}$ and corrupt these data with multiplicative noise $\bm{v}_{k}\sim\mathcal{U}[0.99,1.01]$ . The training procedure uses 1,000 epochs with an initial learning rate of 0.05 that is multiplied by 0.8 every 50 epochs. To learn the process noise covariance, we use the parameterization $\mathbf{\Sigma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Sigma}_{+}})$ . The learning rate for the optimization parameter $\bm{\uptheta}_{\mathbf{\Sigma}_{+}}$ equals the parameter value and is also multiplied by 0.8 every 50 epochs. Any remaining aspects of the training follow the procedure described in the previous example.
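The noise corruption and the learning-rate schedule described above can be sketched as follows. This assumes that the multiplicative noise acts element-wise on the state and uses placeholder clean data; the helper names are hypothetical.

```python
import numpy as np

def add_multiplicative_noise(x, a, rng):
    # Assumed element-wise model: y = x * v with v ~ U[1 - a, 1 + a].
    v = rng.uniform(1.0 - a, 1.0 + a, size=x.shape)
    return x * v

def step_decay(lr0, epoch, factor=0.8, every=50):
    # Learning rate multiplied by `factor` once every `every` epochs.
    return lr0 * factor ** (epoch // every)

rng = np.random.default_rng(0)
# Placeholder "clean" data: N = 2000 samples of a 4-dimensional state.
x_clean = 2.0 + np.sin(np.linspace(0.0, 20.0, 2000))[:, None] * np.ones((1, 4))
y_noisy = add_multiplicative_noise(x_clean, a=0.01, rng=rng)

# Schedule for 1,000 epochs starting from 0.05.
lrs = [step_decay(0.05, epoch) for epoch in range(1000)]
```

With these settings the learning rate is reduced 19 times over the 1,000 epochs, ending near $0.05\times 0.8^{19}\approx 7.2\times 10^{-4}$.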

6.4.2 Time-domain prediction

First, we examine the time-domain estimates of the learned models, plotted in Fig. 7. Since the double pendulum is a chaotic system, prediction errors grow exponentially fast, making long-term prediction difficult. As a result, the time-domain MSE (35) does not tend to give a reliable quantification of an estimate’s accuracy. For chaotic systems, it is often more important to (i) know when the estimate is no longer reliable and (ii) capture the qualitative behavior of the system.

To address the first concern, uncertainty quantification can be used. The typical approach to quantifying uncertainty in neural network models is by approximating the posterior distribution over the network parameters [48], but computational approximation of this distribution is often costly. Fortunately, the proposed Bayesian framework offers a measure of uncertainty that circumvents this challenging approximation. By including the process noise during evaluation of the model, a stochastic simulation is produced whose realizations can be used to construct a probability distribution of the estimated output. Samples from this stochastic simulation are shown alongside the deterministic simulation, referred to as the “nominal MAP estimate,” in Fig. 8. We observe that the spread of samples begins to grow as the MAP estimate begins to deviate from the truth. This suggests that the process noise can give a reliable estimate of how the uncertainty in an estimated model changes over time. Notably, the commonly-used deterministic model approaches have no such resource for assessing the reliability of their estimates.
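The idea of propagating process noise through the model to obtain an ensemble of outputs can be sketched on a toy problem. The example below is not the learned NSSNN: it assumes a harmonic oscillator advanced with symplectic Euler and additive Gaussian process noise with a hypothetical diagonal covariance, purely to show how the sample spread can be computed and how it evolves.

```python
import numpy as np

def step(x, dt):
    # Symplectic Euler step for the stand-in oscillator H = (q^2 + p^2) / 2.
    q, p = x[:, 0], x[:, 1]
    p_new = p - dt * q
    q_new = q + dt * p_new
    return np.column_stack([q_new, p_new])

def simulate(x1, n_steps, dt, noise_var=None, n_samples=1, seed=0):
    # Returns an array of shape (n_steps, n_samples, 2).
    rng = np.random.default_rng(seed)
    x = np.tile(np.asarray(x1, dtype=float), (n_samples, 1))
    traj = [x]
    for _ in range(n_steps - 1):
        x = step(x, dt)
        if noise_var is not None:
            # Additive Gaussian process noise with diagonal covariance.
            x = x + rng.normal(size=x.shape) * np.sqrt(noise_var)
        traj.append(x)
    return np.stack(traj)

# Deterministic ("nominal") simulation and a stochastic ensemble.
nominal = simulate([1.0, 0.0], 500, 0.01)[:, 0, :]
samples = simulate([1.0, 0.0], 500, 0.01,
                   noise_var=np.array([1e-4, 1e-4]), n_samples=200)

# Per-time-step standard deviation across the ensemble.
spread = samples.std(axis=1)
```

Plotting `spread` against time alongside `nominal` gives a band analogous in spirit to the one described above; in the Bayesian framework the noise covariance would be the estimated $\mathbf{\Sigma}_{+}$ rather than a hand-picked value.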

6.4.3 Phase-space prediction

To address the second concern of assessing qualitative behavior, we compare the accuracy of the two estimates in phase space. Specifically, we quantify this accuracy by evaluating the absolute Hamiltonian error $\left\lvert H(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-H(\mathbf{q}_{1},\mathbf{p}_{1})\right\rvert$ . Fig. 9(a) shows the estimates in phase space. The color of the line denotes the value of the absolute Hamiltonian error at that point. The absolute Hamiltonian error is also plotted over time for reference in Fig. 9(b) with a pink line separating the training and testing time periods. We see that both qualitatively and quantitatively, the MAP estimate is much closer to the true Hamiltonian. Fig. 9(b) shows that the error of the $L_{1}$ estimate is sometimes lower than that of the MAP estimate, but it occasionally spikes to magnitudes that are several times larger than the MAP estimate error. Such behavior is typically indicative of overfitting. Additionally, the spikes become more frequent as time goes on, reflecting the poor potential for long-term forecasting that was also observed in the previous example in Section 6.3.3. In contrast, the absolute Hamiltonian error of the MAP estimate is lower on average and does not display sudden spikes. We hypothesize that this preferable behavior can be attributed to the inherent regularization in the marginal likelihood that penalizes large output covariances [16].

6.4.4 Conclusions

On this example, the negative log posterior was better able to capture the underlying Hamiltonian manifold of the chaotic double pendulum in terms of both Hamiltonian error and visual inspection in phase space compared to the $L_{1}$ objective. Additionally, the Bayesian system ID framework was shown to provide a reliable method of uncertainty quantification of the output forecasts without the expense of quantifying uncertainty in the neural network parameters.

6.5 Nonlinear Schrödinger equation

The nonlinear Schrödinger equation (NLSE) is a partial differential equation (PDE) that is used for modeling nonlinear waves in plasma physics [49], nonlinear optics [50], quantum mechanics [51], and oceanography [52]. This equation exhibits a rich variety of dynamical phenomena due to the combined action of dispersion and nonlinearity on a narrow-banded field envelope. The one-dimensional parametric NLSE considered here is a particularly suitable model for optical experiments realized with single-mode fibers, see [53] for more details.

We consider the parametric nonlinear Schrödinger equation with a cubic nonlinearity

i\frac{\partial\psi}{\partial t}+\frac{\partial^{2}\psi}{\partial z^{2}}+\gamma|\psi|^{2}\psi=0,

(38)

where $\psi:=\psi(z,t)$ is a complex-valued function and $i=\sqrt{-1}$ is the imaginary unit. This parametric nonlinear PDE can be recast as a canonical Hamiltonian PDE by writing the complex-valued wave function $\psi(z,t)=p(z,t)+iq(z,t)$ in terms of its real and imaginary parts as

\frac{\partial p}{\partial t}=-\frac{\partial^{2}q}{\partial z^{2}}-\gamma\left(q^{2}+p^{2}\right)q,\qquad\frac{\partial q}{\partial t}=\frac{\partial^{2}p}{\partial z^{2}}+\gamma\left(q^{2}+p^{2}\right)p,

with the space-time continuous Hamiltonian

\mathcal{H}(q,p)=\frac{1}{2}\int\left[p_{z}^{2}+q_{z}^{2}-\frac{\gamma}{2}\left(p^{2}+q^{2}\right)^{2}\right]{\rm d}z,

(39)

where the parameter $\gamma$ determines the influence of the non-quadratic terms. In addition to Hamiltonian conservation, this system also conserves mass invariant $\mathcal{I}_{1}$ and momentum invariant $\mathcal{I}_{2}$ defined as

\mathcal{I}_{1}(q,p)=\int(p^{2}+q^{2}){\rm d}z,\qquad\mathcal{I}_{2}(q,p)=\int(p_{z}q-q_{z}p){\rm d}z.

(40)

6.5.1 Data generation and training

For the learning problem, we consider a gray-box setting where we assume knowledge about the form of the Hamiltonian (39) at the PDE level, but the parameter $\gamma$ is uncertain. To differentiate from the true parameter $\gamma$ , we denote the uncertain parameter as $\bm{\theta}_{\gamma}$ . We then estimate the MAP value of $\bm{\theta}_{\gamma}$ using Algorithm 2 to yield a FOM based on the estimated parameter. To assess the quality of this model, we study the relative error of the estimated states and the model’s ability to conserve the system Hamiltonian, mass, and momentum.

For this study, we first generate high-dimensional data by numerically solving (38) with true parameter $\gamma=2$ . To discretize the PDE in space, we use a spatial domain $z\in[-L/2,L/2]$ , where $L=2\pi\sqrt{2}$ , with periodic boundary conditions and initial conditions of $p(z,0)=0.5(1+0.01\cos(2\pi z/L))$ and $q(z,0)=0$ . The spatial discretization uses $d=64$ equally spaced grid points for a total state dimension of $2d=128$ . We approximate a solution to this discretized PDE using Tao’s integrator with a fine timestep of $\Delta t_{f}=10^{-3}$ . Then we collect data over 20s at a timestep of $\Delta t_{t}=5\times 10^{-3}$ for a total $N=4000$ data. After generating the data, we add 20% multiplicative uniform noise $\bm{v}_{k}\sim\mathcal{U}[0.8,1.2]$ . As a means of visualizing how noisy these data are, we consider the system’s wave function $\psi(z,t)$ . Among other reasons, the wave function is notable because its square modulus $\lvert\psi(z,t)\rvert^{2}$ can be interpreted as the probability density of the system. This quantity is plotted using the clean solution in Fig. 10(a) and the noisy data in Fig. 10(b).

Next, we project the data onto a low-dimensional subspace so that we can estimate $\bm{\theta}_{\gamma}$ in a reduced setting. For this projection step, we compute a symplectic basis $\mathbf{V}$ of the form (21) with reduced dimension $r=8$ using the cotangent lift algorithm described in Section 5.1.2. As discussed in Section 5.2, the multiplicative noise form is not preserved under this dimension-reducing transformation. Therefore, we use the approximate $\tilde{\mathcal{M}}$ given by Eq. (31) as the observation model for this experiment.

To train the model, we seek to estimate the $\bm{\theta}_{\gamma}$ parameter and the diagonal elements of $\mathbf{\Sigma}_{+}$ and $\mathbf{\Gamma}_{+}$ for a total of 33 parameters. Since PyTorch does not have a Lyapunov solver needed for H-OpInf, we use the solver from the scipy package, which breaks the computational graph required for auto-differentiation with respect to $\theta_{\gamma}$ . As an alternative method of differentiation, we approximate $\partial\pi(\bm{\uptheta}|\mathcal{Y}_{N})/\partial\theta_{\gamma}$ using forward finite difference with a step size of $10^{-6}$ . Then, we optimize $\theta_{\gamma}$ using gradient descent with a step size of $10^{-4}$ . We parameterize the covariance matrices as $\mathbf{\Sigma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Sigma}_{+}})$ and $\mathbf{\Gamma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Gamma}_{+}})$ . The variance parameters are optimized using the same procedure as before with a learning rate of 0.5 that is multiplied by 0.8 every 10 epochs. The priors over $\bm{\theta}_{\mathbf{\Sigma}_{+}}$ and $\bm{\theta}_{\mathbf{\Gamma}_{+}}$ are $\text{half-}\mathcal{N}(0,10^{-12})$ and $\text{half-}\mathcal{N}(0,10^{-9})$ , respectively. We initialize optimization variables $\theta_{\gamma}$ at 0, $\bm{\uptheta}_{\mathbf{\Sigma}_{+}}$ at $10^{-4}$ , and $\bm{\uptheta}_{\mathbf{\Gamma}_{+}}$ at $10^{-3}$ . For this experiment, we use 50 epochs.
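The finite-difference workaround for the broken computational graph can be sketched generically. The example below assumes a black-box scalar objective to be minimized (standing in for the negative log posterior as a function of $\theta_{\gamma}$) and uses a toy quadratic with a larger step size than in the experiment so that it converges within 50 iterations; all names are hypothetical.

```python
def forward_difference(f, theta, h=1e-6):
    # Forward finite-difference approximation of df/dtheta.
    return (f(theta + h) - f(theta)) / h

def gradient_descent(f, theta0, step_size, n_epochs, h=1e-6):
    # Plain gradient descent driven by the finite-difference gradient.
    theta = theta0
    for _ in range(n_epochs):
        theta -= step_size * forward_difference(f, theta, h)
    return theta

def toy_objective(theta):
    # Toy stand-in for the negative log posterior; its minimizer is theta = 2.
    return (theta - 2.0) ** 2

theta_hat = gradient_descent(toy_objective, theta0=0.0,
                             step_size=0.1, n_epochs=50)
```

Each gradient evaluation costs two objective evaluations, which is affordable here because only the single parameter $\theta_{\gamma}$ is differentiated this way.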

The optimization result is used to initialize MCMC. The sampling algorithm that we use is a delayed rejection adaptive Metropolis [54] within Gibbs procedure, where the parameter groups $\bm{\theta}_{\gamma}$ , $\bm{\theta}_{\mathbf{\Sigma}_{+}}$ , and $\bm{\theta}_{\mathbf{\Gamma}_{+}}$ are sampled sequentially. To ensure convergence, we draw $2\times 10^{4}$ samples, and discard the first $10^{4}$ as burn-in. Fig. 11(a) shows the marginal posterior distribution of $\bm{\theta}_{\gamma}$ with the mean value of 1.955 indicated by the dark blue line. The average squared error of these samples with respect to the true $\gamma$ value is $2.185\times 10^{-3}$ . We also plot the Markov chain of $\bm{\theta}_{\gamma}$ in Fig. 11(b) to show that the chain is well-mixed.

6.5.2 Relative state error

We next use these samples to simulate the system over time. To help decorrelate the samples, we only use every 10th sample for a total 1,000 samples. For a given initial condition in the high-dimensional setting, we simulate the learned FOMs until 40s to assess the model performance outside the training period. The performance metric that we consider is the relative state error defined as

\frac{\lVert\mathbf{X}_{e}-\hat{\mathbf{X}}_{e}\rVert_{F}^{2}}{\lVert\mathbf{X}_{e}\rVert_{F}^{2}},

(41)

where $\mathbf{X}_{e}$ and $\hat{\mathbf{X}}_{e}$ are the true and estimated extended snapshot matrices, respectively. The relative state errors over the training and testing periods are shown in Fig. 12. The dark blue line denotes the relative state error yielded by the $\bm{\theta}_{\gamma}$ sample average. Almost all of the samples of the relative state error over the training period are below 10%. The error of the testing period is larger due to the fact that errors tend to grow over time in dynamical models. These plots show that we are able to make good predictions over the short-term with errors that grow gradually as the simulation time increases.
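The burn-in and thinning of the chain and the metric (41) can be sketched as follows. The chain and the snapshot matrices here are synthetic placeholders rather than results from the experiment, and the helper names are hypothetical.

```python
import numpy as np

def burn_and_thin(chain, n_burn, stride):
    # Discard the first n_burn samples, then keep every `stride`-th sample.
    return chain[n_burn::stride]

def relative_state_error(X_true, X_est):
    # Eq. (41): squared Frobenius norm of the error, relative to the truth.
    num = np.linalg.norm(X_true - X_est, "fro") ** 2
    den = np.linalg.norm(X_true, "fro") ** 2
    return num / den

rng = np.random.default_rng(0)

# Synthetic stand-in for a chain of 2 x 10^4 samples.
chain = rng.normal(1.955, 0.01, size=20_000)
kept = burn_and_thin(chain, n_burn=10_000, stride=10)

# Synthetic snapshot matrices: the "estimate" is off by a uniform 10%.
X_true = rng.normal(size=(64, 200))
X_est = 1.1 * X_true
err = relative_state_error(X_true, X_est)
```

Because the metric uses squared norms, a uniform 10% deviation in the states corresponds to a relative state error of 0.01.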

6.5.3 Conservation of invariants

We also assess the structure-preserving nature of this model by computing the absolute errors in Hamiltonian, mass, and momentum. The spatially-discretized forms of these invariants from Eqs. (39) and (40) are defined as

H(\mathbf{q},\mathbf{p})=\frac{1}{2}\sum_{i=1}^{d}\left[p_{z_{i}}^{2}+q_{z_{i}}^{2}-\frac{\gamma}{2}\left(p_{i}^{2}+q_{i}^{2}\right)^{2}\right]\Delta z,

(42)

I_{1}(\mathbf{q},\mathbf{p})=\sum_{i=1}^{d}\left[p_{i}^{2}+q_{i}^{2}\right]\Delta z,

(43)

I_{2}(\mathbf{q},\mathbf{p})=\sum_{i=1}^{d}\left[p_{z_{i}}q_{i}-q_{z_{i}}p_{i}\right]\Delta z,

(44)

where $\Delta z=L/d$ , $p_{z_{i}}=(p_{i+1}-p_{i})/\Delta z$ , and $q_{z_{i}}$ is defined similarly. The posterior of the absolute errors of the Hamiltonian $\lvert H(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-H(\mathbf{q}_{1},\mathbf{p}_{1})\rvert$ , the mass $\lvert I_{1}(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-I_{1}(\mathbf{q}_{1},\mathbf{p}_{1})\rvert$ , and the momentum $\lvert I_{2}(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-I_{2}(\mathbf{q}_{1},\mathbf{p}_{1})\rvert$ are plotted in Fig. 13. The FOM based on the estimated parameter value exhibits bounded error behavior $100\%$ outside the training data regime for all three conserved quantities, demonstrating that this method is capable of preserving underlying physics.
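The discrete invariants (42)-(44) can be evaluated with a few lines of array code. This sketch assumes that the forward difference wraps around periodically at the last grid point, consistent with the periodic boundary conditions, and that the grid excludes the right endpoint; the function names are hypothetical.

```python
import numpy as np

def forward_diff(u, dz):
    # (u_{i+1} - u_i) / dz with periodic wrap-around (assumed).
    return (np.roll(u, -1) - u) / dz

def invariants(q, p, gamma, dz):
    # Discrete Hamiltonian (42), mass (43), and momentum (44).
    qz, pz = forward_diff(q, dz), forward_diff(p, dz)
    H = 0.5 * np.sum(pz**2 + qz**2 - 0.5 * gamma * (p**2 + q**2) ** 2) * dz
    I1 = np.sum(p**2 + q**2) * dz
    I2 = np.sum(pz * q - qz * p) * dz
    return H, I1, I2

# Grid and initial condition from the data-generation setup.
d, L, gamma = 64, 2.0 * np.pi * np.sqrt(2.0), 2.0
dz = L / d
z = -L / 2 + dz * np.arange(d)
p0 = 0.5 * (1.0 + 0.01 * np.cos(2.0 * np.pi * z / L))
q0 = np.zeros(d)

H0, I1_0, I2_0 = invariants(q0, p0, gamma, dz)
```

Evaluating `invariants` along a simulated trajectory and subtracting `H0`, `I1_0`, and `I2_0` gives the absolute errors of the three conserved quantities.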

6.5.4 Varying the reduced dimension

Lastly, we study the parameter estimation accuracy of the proposed reduced-dimensional learning approach for different reduced dimensions. Since we have randomness in the measurement noise and the Adam optimizer, we draw 20 realizations of data and train on each for every reduced dimension value from $r=2$ to $r=8$ . The optimization uses the same procedure described earlier in this section. Box and whisker plots of the squared parameter errors are shown in Fig. 14. The error is naturally quite high for $r=2$ but decreases and levels off at $r=3$ , which indicates that the first three modes based on the SVD of the extended snapshot matrix are relatively clean. For $r>3$ , we observe modes with low signal-to-noise ratios, which lead to a marginal increase in the estimation error. The shapes of these modes for a single noise realization are shown in Fig. 15.

6.5.5 Conclusions

This example showed that Algorithm 2 was effective at estimating a parameter from a 64-dimensional system in a reduced 8-dimensional subspace with 20% multiplicative measurement noise. Since the computational cost of the likelihood evaluation scales cubically with state dimension $n$ , reducing the state dimension by a factor of eight results in evaluation that is roughly 512 times cheaper compared to evaluation with the full state dimension. Furthermore, it was shown that the reduced dimension could be as small as $r=3$ while still achieving parameter estimation accuracy comparable to that of $r=8$ , suggesting potential savings of up to 9,709 times. Additionally, the posterior FOM was shown to share important physical structure with the underlying system, as shown by its conservation of mass, momentum, and Hamiltonian over time.
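The quoted savings follow directly from the cubic scaling. A minimal check, assuming the cost is exactly proportional to $n^{3}$ and ignoring lower-order terms and constants:

```python
def likelihood_cost_ratio(n_full, n_reduced, exponent=3):
    # Ratio of costs under an assumed O(n^exponent) scaling.
    return (n_full / n_reduced) ** exponent

ratio_r8 = likelihood_cost_ratio(64, 8)   # reduction from 64 to 8
ratio_r3 = likelihood_cost_ratio(64, 3)   # reduction from 64 to 3
```

That is, $(64/8)^{3}=512$ and $(64/3)^{3}\approx 9709$.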

7 Conclusions

In this work, we presented a structure-preserving Bayesian framework for learning deep neural network parameterizations of nonseparable Hamiltonian systems from noisy data. This framework uses Gaussian filtering to compute a likelihood based on a stochastic dynamics model that allows for the effects of model and measurement uncertainty to be accounted for differently. Unlike past works which only considered additive Gaussian noise models, we showed that the algorithm can be tailored to other noise models and provided a filter for multiplicative noise as an example. The numerical experiments for low-dimensional systems demonstrated that the proposed Bayesian framework is data-efficient and robust to noise in the data, whereas the standard machine learning approach breaks down in the presence of noisy data. Moreover, the Bayesian framework outperformed the NSSNN, a state-of-the-art machine learning approach, when training on data with multiplicative uniform noise, demonstrating that the Gaussian filtering approach is not overly restrictive for non-Gaussian measurement noise.

We also proposed a novel algorithm for the identification of high-dimensional Hamiltonians that allows for cost-efficient parameter estimation through filtering in a low-dimensional symplectic subspace. Using prior knowledge about the underlying physics, this algorithm was effective in parameter estimation of a nonlinear Schrödinger equation within $2.185\times 10^{-3}$ mean squared error, even with data corrupted by 20% multiplicative uniform noise. The full-order models based on this parameter estimate provided accurate and stable predictions, while also preserving the system Hamiltonian and other invariants of motion. Future work on this topic will explore estimation using partial observations, including improving low-dimensional projections derived from partially unknown/uncertain full-order models.

CRediT authorship contribution statement

Nicholas Galioto: Conceptualization, Methodology, Investigation, Software, Formal analysis, Writing - original draft, Visualization. Harsh Sharma: Conceptualization, Methodology, Software, Writing - review & editing. Boris Kramer: Conceptualization, Validation, Resources, Writing - review & editing, Funding acquisition. Alex Arkady Gorodetsky: Conceptualization, Methodology, Resources, Writing - review & editing, Funding acquisition, Supervision.

Declaration of competing interest

Alex Gorodetsky reports a relationship with Geminus AI that includes: employment and equity or stocks. Boris Kramer reports a relationship with ASML Holding US that includes: consulting or advisory. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available upon request.

Acknowledgments

A.A. Gorodetsky and N. Galioto were funded by the AFOSR Computational Mathematics Program (P.M. Fariba Fahroo) under award FA9550-19-1-0013. B. Kramer and H. Sharma were in part financially supported by the Ministry of Trade, Industry and Energy (MOTIE) and the Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&D program (No. P0019804, Digital twin based intelligent unmanned facility inspection solutions) and the Applied and Computational Analysis Program of the Office of Naval Research under award N000142212624.

Appendix A Cotangent lift algorithm

In projection-based model reduction, the semi-discrete model is projected onto a low-dimensional subspace. The key idea in structure-preserving model reduction is to preserve the underlying geometric structure during the projection. Since the FOM is a Hamiltonian system with underlying symplectic structure, the projection step is treated as the symplectic inverse of a symplectic lift from the low-dimensional subspace to the state space, see [55]. A symplectic lift is defined by $\mathbf{y}=\mathbf{V}\tilde{\mathbf{y}}$ where $\mathbf{V}\in\mathbb{R}^{2d\times 2r}$ is a symplectic matrix, i.e., a matrix that satisfies

\mathbf{V}^{\top}\mathbf{J}_{2d}\mathbf{V}=\mathbf{J}_{2r}.

(45)

The symplectic inverse $\mathbf{V}^{+}\in\mathbb{R}^{2r\times 2d}$ of a symplectic matrix $\mathbf{V}$ is defined by

\mathbf{V}^{+}=\mathbf{J}_{2r}^{\top}\mathbf{V}^{\top}\mathbf{J}_{2d},

(46)

and the symplectic projection can be written as $\tilde{\mathbf{x}}=\mathbf{V}^{+}\mathbf{x}\in\mathbb{R}^{2r}$ . Proper symplectic decomposition (PSD) [55] is a method to find a symplectic projection matrix $\mathbf{V}$ that simultaneously minimizes the projection error in a least-squares sense, i.e.,

\min_{\begin{subarray}{c}\mathbf{V}\\ \text{s.t.}\ \mathbf{V}^{\top}\mathbf{J}_{2d}\mathbf{V}=\mathbf{J}_{2r}\end{subarray}}\|\textbf{X}-\mathbf{V}\mathbf{V}^{+}\textbf{X}\|_{F},

(47)

where $\textbf{X}:=[\mathbf{x}(t_{1}),\cdots,\mathbf{x}(t_{K})]\in\mathbb{R}^{2d\times K}$ is the snapshot data matrix, and $\|\cdot\|_{F}$ is the Frobenius norm. Since solving Eq. (47) to obtain the symplectic basis matrix $\mathbf{V}$ is computationally expensive, the authors in [55] outlined three efficient algorithms for finding approximately optimal solutions for the symplectic matrix $\mathbf{V}$ . These algorithms (cotangent lift, complex SVD, and a nonlinear programming approach) search for a near-optimal solution over different subsets of $Sp(2r,\mathbb{R}^{2d})$ , the set of all $2d\times 2r$ symplectic matrices. The cotangent lift algorithm computes the SVD of the extended snapshot matrix $\textbf{X}_{e}=[\mathbf{q}(t_{1}),\cdots,\mathbf{q}(t_{K}),\mathbf{p}(t_{1}),\cdots,\mathbf{p}(t_{K})]\in\mathbb{R}^{d\times 2K}$ to obtain a POD basis matrix $\mathbf{\Phi}\in\mathbb{R}^{d\times r}$ and then constructs the symplectic basis matrix $\mathbf{V}=\begin{bmatrix}\mathbf{\Phi}&\mathbf{0}\\ \mathbf{0}&\mathbf{\Phi}\end{bmatrix}\in\mathbb{R}^{2d\times 2r}$ with $\mathbf{V}^{+}=\mathbf{V}^{\top}$ . Compared to the complex SVD and the nonlinear programming approach, the cotangent lift algorithm is more easily implemented in the offline stage as it only requires the SVD of the extended snapshot matrix $\textbf{X}_{e}$ . Moreover, the block-diagonal nature of $\mathbf{V}$ ensures that the interpretability of $\mathbf{q}$ and $\mathbf{p}$ is retained in the reduced setting.
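The cotangent lift construction can be sketched in a few lines. The snapshot data below are random placeholders, the function names are hypothetical, and the canonical form $\mathbf{J}_{2n}=\begin{bmatrix}\mathbf{0}&\mathbf{I}_{n}\\ -\mathbf{I}_{n}&\mathbf{0}\end{bmatrix}$ is assumed.

```python
import numpy as np

def cotangent_lift(Q, P, r):
    # SVD of the extended snapshot matrix X_e = [Q, P] (d x 2K).
    X_e = np.hstack([Q, P])
    U, _, _ = np.linalg.svd(X_e, full_matrices=False)
    Phi = U[:, :r]                        # POD basis, d x r
    Z = np.zeros_like(Phi)
    V = np.block([[Phi, Z], [Z, Phi]])    # symplectic basis, 2d x 2r
    return Phi, V

def canonical_J(n):
    # Assumed canonical symplectic matrix J_{2n} = [[0, I], [-I, 0]].
    I, Z = np.eye(n), np.zeros((n, n))
    return np.block([[Z, I], [-I, Z]])

rng = np.random.default_rng(0)
d, K, r = 20, 50, 4
Q = rng.normal(size=(d, K))   # placeholder position snapshots
P = rng.normal(size=(d, K))   # placeholder momentum snapshots

Phi, V = cotangent_lift(Q, P, r)
J2d, J2r = canonical_J(d), canonical_J(r)
V_plus = J2r.T @ V.T @ J2d    # symplectic inverse, Eq. (46)
```

Because $\mathbf{\Phi}$ has orthonormal columns, $\mathbf{V}$ satisfies the symplecticity condition (45) and its symplectic inverse coincides with its transpose.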

Appendix B Hamiltonian operator inference

In this section, we describe how to estimate the terms $\tilde{\mathbf{D}}_{\mathbf{q}}$ and $\tilde{\mathbf{D}}_{\mathbf{p}}$ in the quadratic portion of the Hamiltonian (23) using the approach of H-OpInf [44]. First, assume we have an initial estimate of $\bm{\uptheta}_{\text{nl}}$ defining the nonlinear terms $H_{\text{nl}}$ . Then, define the nonlinear forcings $\mathbf{f}_{\mathbf{q}}$ and $\mathbf{f}_{\mathbf{p}}$ as

\mathbf{f}_{\mathbf{q}}(\mathbf{x},\bm{\uptheta}_{\text{nl}})=\begin{bmatrix}\frac{\partial H_{\text{nl}}}{\partial p^{1}}(q^{1},p^{1},\bm{\uptheta}_{\text{nl}})&\cdots&\frac{\partial H_{\text{nl}}}{\partial p^{d}}(q^{d},p^{d},\bm{\uptheta}_{\text{nl}})\end{bmatrix}^{\top}\in\mathbb{R}^{d},

(48)

\mathbf{f}_{\mathbf{p}}(\mathbf{x},\bm{\uptheta}_{\text{nl}})=\begin{bmatrix}\frac{\partial H_{\text{nl}}}{\partial q^{1}}(q^{1},p^{1},\bm{\uptheta}_{\text{nl}})&\cdots&\frac{\partial H_{\text{nl}}}{\partial q^{d}}(q^{d},p^{d},\bm{\uptheta}_{\text{nl}})\end{bmatrix}^{\top}\in\mathbb{R}^{d}.

(49)

Utilizing the explicit forms of $\mathbf{f}_{\mathbf{q}}$ and $\mathbf{f}_{\mathbf{p}}$ , form the nonlinear forcing snapshot matrices

\mathbf{F}_{\mathbf{q}}(\mathbf{Q},\mathbf{P},\bm{\uptheta}_{\text{nl}})=\begin{bmatrix}\mathbf{f}_{\mathbf{q}}(\mathbf{x}_{1},\bm{\uptheta}_{\text{nl}})&\cdots&\mathbf{f}_{\mathbf{q}}(\mathbf{x}_{N},\bm{\uptheta}_{\text{nl}})\end{bmatrix},\qquad\mathbf{F}_{\mathbf{p}}(\mathbf{Q},\mathbf{P},\bm{\uptheta}_{\text{nl}})=\begin{bmatrix}\mathbf{f}_{\mathbf{p}}(\mathbf{x}_{1},\bm{\uptheta}_{\text{nl}})&\cdots&\mathbf{f}_{\mathbf{p}}(\mathbf{x}_{N},\bm{\uptheta}_{\text{nl}})\end{bmatrix}.

(50)

Next, project the trajectory (20) and nonlinear (50) snapshot matrices onto the symplectic subspace using the projection matrix (21) found with the cotangent lift [55]

\tilde{\mathbf{Q}}={\mathbf{\Phi}}^{\top}\mathbf{Q}\in\mathbb{R}^{r\times N},\qquad\tilde{\mathbf{P}}={\mathbf{\Phi}}^{\top}\mathbf{P}\in\mathbb{R}^{r\times N},\qquad\tilde{\mathbf{F}}_{\mathbf{q}}={\mathbf{\Phi}}^{\top}\mathbf{F}_{\mathbf{q}}\in\mathbb{R}^{r\times N},\qquad\tilde{\mathbf{F}}_{\mathbf{p}}={\mathbf{\Phi}}^{\top}\mathbf{F}_{\mathbf{p}}\in\mathbb{R}^{r\times N}.

(51)

Additionally, compute the reduced time-derivative data $\dot{\tilde{\mathbf{q}}}$ and $\dot{\tilde{\mathbf{p}}}$ from the reduced state trajectory data $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{P}}$ using a finite difference scheme. Then, organize the time-derivatives into snapshot matrices

\dot{\tilde{\mathbf{Q}}}=\begin{bmatrix}\dot{\tilde{\mathbf{q}}}_{1}&\cdots&\dot{\tilde{\mathbf{q}}}_{N}\end{bmatrix}\in\mathbb{R}^{r\times N},\quad\dot{\tilde{\mathbf{P}}}=\begin{bmatrix}\dot{\tilde{\mathbf{p}}}_{1}&\cdots&\dot{\tilde{\mathbf{p}}}_{N}\end{bmatrix}\in\mathbb{R}^{r\times N}.

Lastly, infer the reduced operators $\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\uptheta}_{\text{quad}})$ and $\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\uptheta}_{\text{quad}})$ by solving the following constrained operator inference problem

\min_{\begin{subarray}{c}\tilde{\mathbf{D}}_{\mathbf{q}}=\tilde{\mathbf{D}}_{\mathbf{q}}^{\top},\\ \tilde{\mathbf{D}}_{\mathbf{p}}=\tilde{\mathbf{D}}_{\mathbf{p}}^{\top}\end{subarray}}\bigg\lVert\begin{bmatrix}\dot{\tilde{\mathbf{Q}}}-\tilde{\mathbf{F}}_{\mathbf{q}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl}})\\ \dot{\tilde{\mathbf{P}}}+\tilde{\mathbf{F}}_{\mathbf{p}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl}})\end{bmatrix}-\begin{bmatrix}\mathbf{0}&\tilde{\mathbf{D}}_{\mathbf{p}}\\ -\tilde{\mathbf{D}}_{\mathbf{q}}&\mathbf{0}\end{bmatrix}\begin{bmatrix}\tilde{\mathbf{Q}}\\ \tilde{\mathbf{P}}\end{bmatrix}\bigg\rVert_{F},

(52)

where $\lVert\cdot\rVert_{F}$ denotes the Frobenius norm. The symmetry constraints on the reduced operators $\tilde{\mathbf{D}}_{\mathbf{q}}$ and $\tilde{\mathbf{D}}_{\mathbf{p}}$ ensure that the ROMs learned via H-OpInf are Hamiltonian. By introducing the terms $\tilde{\mathbf{R}}_{\mathbf{p}}=\dot{\tilde{\mathbf{Q}}}-\tilde{\mathbf{F}}_{\mathbf{q}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl}})$ and $\tilde{\mathbf{R}}_{\mathbf{q}}=-\dot{\tilde{\mathbf{P}}}-\tilde{\mathbf{F}}_{\mathbf{p}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl}})$, the inference problem (52) can be reformulated as the Lyapunov equations

(\tilde{\mathbf{Q}}\tilde{\mathbf{Q}}^{\top})\tilde{\mathbf{D}}_{\mathbf{q}}+\tilde{\mathbf{D}}_{\mathbf{q}}(\tilde{\mathbf{Q}}\tilde{\mathbf{Q}}^{\top})=\tilde{\mathbf{Q}}\tilde{\mathbf{R}}_{\mathbf{q}}^{\top}+\tilde{\mathbf{R}}_{\mathbf{q}}\tilde{\mathbf{Q}}^{\top},\qquad(\tilde{\mathbf{P}}\tilde{\mathbf{P}}^{\top})\tilde{\mathbf{D}}_{\mathbf{p}}+\tilde{\mathbf{D}}_{\mathbf{p}}(\tilde{\mathbf{P}}\tilde{\mathbf{P}}^{\top})=\tilde{\mathbf{P}}\tilde{\mathbf{R}}_{\mathbf{p}}^{\top}+\tilde{\mathbf{R}}_{\mathbf{p}}\tilde{\mathbf{P}}^{\top},

(53)

which can be solved with off-the-shelf Lyapunov solvers.
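One possible off-the-shelf route is SciPy's `solve_continuous_lyapunov(a, q)`, which solves $\mathbf{A}\mathbf{X}+\mathbf{X}\mathbf{A}^{H}=\mathbf{Q}$. The sketch below maps the two equations in (53) onto that call. The choice of SciPy, the function name `solve_hopinf_lyapunov`, and the synthetic data are assumptions for illustration; the source does not state which solver was used.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def solve_hopinf_lyapunov(Q_r, P_r, R_q, R_p):
    """Hypothetical helper: solve the two Lyapunov equations in Eq. (53).

    R_q and R_p are the residual matrices defined in the text
    (R_p = Qdot - F_q, R_q = -Pdot - F_p), each of shape r x N.
    """
    D_q = solve_continuous_lyapunov(Q_r @ Q_r.T, Q_r @ R_q.T + R_q @ Q_r.T)
    D_p = solve_continuous_lyapunov(P_r @ P_r.T, P_r @ R_p.T + R_p @ P_r.T)
    return D_q, D_p

# Synthetic data generated from known symmetric operators (illustrative only).
rng = np.random.default_rng(1)
r, N = 3, 50
Q_r = rng.standard_normal((r, N))
P_r = rng.standard_normal((r, N))
A = rng.standard_normal((r, r))
B = rng.standard_normal((r, r))
D_q_true = A + A.T
D_p_true = B + B.T
R_q = D_q_true @ Q_r  # exact residuals, no noise
R_p = D_p_true @ P_r

D_q, D_p = solve_hopinf_lyapunov(Q_r, P_r, R_q, R_p)
```

When $\tilde{\mathbf{Q}}\tilde{\mathbf{Q}}^{\top}$ and $\tilde{\mathbf{P}}\tilde{\mathbf{P}}^{\top}$ are positive definite, each Lyapunov equation has a unique solution, and because the right-hand sides are symmetric that solution is symmetric, so the constraints in (52) are satisfied without a separate projection step.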

The constrained operator inference problem in Eq. (52) enforces structure preservation at the reduced level to learn nonintrusive Hamiltonian ROMs that conserve the reduced Hamiltonian function $\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})$ (23). The authors in [44] have shown that the reduced Hamiltonian function $\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})$ can be interpreted as a perturbation of the FOM Hamiltonian $H$, i.e., $\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})=H({\mathbf{\Phi}}\tilde{\mathbf{q}},{\mathbf{\Phi}}\tilde{\mathbf{p}})+\Delta H(\tilde{\mathbf{q}},\tilde{\mathbf{p}})$. Consequently, the H-OpInf ROM yields approximate FOM trajectories, i.e., $\mathbf{q}_{\text{approx}}={\mathbf{\Phi}}\tilde{\mathbf{q}}$ and $\mathbf{p}_{\text{approx}}={\mathbf{\Phi}}\tilde{\mathbf{p}}$, that track the FOM solution trajectories accurately while also conserving a perturbed FOM Hamiltonian, which yields a bounded FOM energy error.

References

[1] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
[2] M. Raissi, P. Perdikaris, G. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707. doi:10.1016/j.jcp.2018.10.045.
[3] K. Linka, A. Schäfer, X. Meng, Z. Zou, G. E. Karniadakis, E. Kuhl, Bayesian physics informed neural networks for real-world nonlinear dynamical systems, Computer Methods in Applied Mechanics and Engineering 402 (2022) 115346.
[4] L. Zhang, X.-D. Zhang, An optimal filtering algorithm for systems with multiplicative/additive noises, IEEE Signal Processing Letters 14 (7) (2007) 469–472.
[5] A. Daw, R. Q. Thomas, C. C. Carey, J. S. Read, A. P. Appling, A. Karpatne, Physics-guided architecture (PGA) of neural networks for quantifying uncertainty in lake temperature modeling, in: Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), 2020, pp. 532–540. doi:10.1137/1.9781611976236.60.
[6] S. Greydanus, M. Dzamba, J. Yosinski, Hamiltonian neural networks, Advances in Neural Information Processing Systems 32 (2019).
[7] Z. Chen, J. Zhang, M. Arjovsky, L. Bottou, Symplectic recurrent neural networks, in: International Conference on Learning Representations, 2020.
[8] P. Jin, Z. Zhang, A. Zhu, Y. Tang, G. E. Karniadakis, SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems, Neural Networks 132 (2020) 166–179.
[9] S. Saemundsson, A. Terenin, K. Hofmann, M. Deisenroth, Variational integrator networks for physically structured embeddings, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 3078–3087.
[10] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, S. Ho, Lagrangian neural networks, in: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2019.
[11] M. Lutter, C. Ritter, J. Peters, Deep Lagrangian networks: Using physics as model prior for deep learning, in: International Conference on Learning Representations, 2019.
[12] M. A. Roehrl, T. A. Runkler, V. Brandtstetter, M. Tokic, S. Obermayer, Modeling system dynamics with physics-informed neural networks based on Lagrangian mechanics, IFAC-PapersOnLine 53 (2) (2020) 9195–9200, 21st IFAC World Congress. doi:10.1016/j.ifacol.2020.12.2182.
[13] S. Mallat, Understanding deep convolutional networks, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065) (2016) 20150203.
[14] H. Sheng, C. Yang, PFNN: A penalty-free neural network method for solving a class of second-order boundary-value problems on complex geometries, Journal of Computational Physics 428 (2021) 110085. doi:10.1016/j.jcp.2020.110085.
[15] N. Galioto, A. A. Gorodetsky, Bayesian system ID: Optimal management of parameter, model, and measurement uncertainty, Nonlinear Dynamics 102 (1) (2020) 241–267.
[16] N. Galioto, A. A. Gorodetsky, Likelihood-based generalization of Markov parameter estimation and multiple shooting objectives in system identification, Physica D: Nonlinear Phenomena 462 (2024) 134146.
[17] H. Sharma, N. Galioto, A. A. Gorodetsky, B. Kramer, Bayesian identification of nonseparable Hamiltonian systems using stochastic dynamic models, in: 2022 IEEE 61st Conference on Decision and Control (CDC), IEEE, 2022, pp. 6742–6749.
[18] T. M. Hamill, J. S. Whitaker, C. Snyder, Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter, Monthly Weather Review 129 (11) (2001) 2776–2790.
[19] P. Rajasekaran, N. Satyanarayana, M. Srinath, Optimum linear estimation of stochastic signals in the presence of multiplicative noise, IEEE Transactions on Aerospace and Electronic Systems AES-7 (3) (1971) 462–468.
[20] W. Liu, Optimal filtering for discrete-time linear systems with time-correlated multiplicative measurement noises, IEEE Transactions on Automatic Control 61 (7) (2015) 1972–1978.
[21] F. Wang, V. Balakrishnan, Robust Kalman filters for linear time-varying systems with stochastic parametric uncertainties, IEEE Transactions on Signal Processing 50 (4) (2002) 803–813.
[22] B. Chow, W. Birkemeier, A new recursive filter for systems with multiplicative noise, IEEE Transactions on Information Theory 36 (6) (1990) 1430–1435. doi:10.1109/18.59939.
[23] F. Yang, Z. Wang, Y. Hung, Robust Kalman filtering for discrete time-varying uncertain systems with multiplicative noises, IEEE Transactions on Automatic Control 47 (7) (2002) 1179–1183. doi:10.1109/TAC.2002.800668.
[24] X. Kai, L. Liangdong, L. Yiwu, Robust extended Kalman filtering for nonlinear systems with multiplicative noises, Optimal Control Applications and Methods 32 (1) (2011) 47–63.
[25] G. Berkooz, P. Holmes, J. L. Lumley, The proper orthogonal decomposition in the analysis of turbulent flows, Annual Review of Fluid Mechanics 25 (1) (1993) 539–575.
[26] G. Rozza, D. B. P. Huynh, A. T. Patera, Reduced basis approximation and a posteriori error estimation for affinely parametrized elliptic coercive partial differential equations: Application to transport and continuum mechanics, Archives of Computational Methods in Engineering 15 (3) (2008) 229–275.
[27] M. Guo, J. S. Hesthaven, Data-driven reduced order modeling for time-dependent problems, Computer Methods in Applied Mechanics and Engineering 345 (2019) 75–99.
[28] Y. Kim, Y. Choi, D. Widemann, T. Zohdi, A fast and accurate physics-informed neural network reduced order model with shallow masked autoencoder, Journal of Computational Physics 451 (2022) 110841. doi:10.1016/j.jcp.2021.110841.
[29] K. Lee, K. T. Carlberg, Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders, Journal of Computational Physics 404 (2020) 108973. doi:10.1016/j.jcp.2019.108973.
[30] H. Sharma, H. Mu, P. Buchfink, R. Geelen, S. Glas, B. Kramer, Symplectic model reduction of Hamiltonian systems using data-driven quadratic manifolds, Computer Methods in Applied Mechanics and Engineering 417 (2023) 116402. doi:10.1016/j.cma.2023.116402.
[31] R. Geelen, S. Wright, K. Willcox, Operator inference for non-intrusive model reduction with quadratic manifolds, Computer Methods in Applied Mechanics and Engineering 403 (2023) 115717.
[32] D. Serra, F. Ruggiero, A. Donaire, L. R. Buonocore, V. Lippiello, B. Siciliano, Control of nonprehensile planar rolling manipulation: A passivity-based approach, IEEE Transactions on Robotics 35 (2) (2019) 317–329.
[33] G. Li, S. Naoz, M. Holman, A. Loeb, Chaos in the test particle eccentric Kozai–Lidov mechanism, The Astrophysical Journal 791 (2) (2014) 86.
[34] E. Forest, Geometric integration for particle accelerators, Journal of Physics A: Mathematical and General 39 (19) (2006) 5321.
[35] R. Salmon, Hamiltonian fluid mechanics, Annual Review of Fluid Mechanics 20 (1) (1988) 225–256.
[36] J. Colliander, M. Keel, G. Staffilani, H. Takaoka, T. Tao, Transfer of energy to high frequencies in the cubic defocusing nonlinear Schrödinger equation, Inventiones Mathematicae 181 (1) (2010) 39–113.
[37] S. Särkkä, Bayesian filtering and smoothing, Cambridge University Press, 2013.
[38] C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society Series B: Statistical Methodology 72 (3) (2010) 269–342.
[39] K. Wu, T. Qin, D. Xiu, Structure-preserving method for reconstructing unknown Hamiltonian systems from trajectory data, SIAM Journal on Scientific Computing 42 (6) (2020) A3704–A3729.
[40] S. Xiong, Y. Tong, X. He, S. Yang, C. Yang, B. Zhu, Nonseparable symplectic neural networks, in: International Conference on Learning Representations, 2021.
[41] N. Galioto, A. A. Gorodetsky, Bayesian identification of Hamiltonian dynamics from symplectic data, in: 2020 59th IEEE Conference on Decision and Control (CDC), IEEE, 2020, pp. 1190–1195.
[42] M. David, F. Méhats, Symplectic learning for Hamiltonian neural networks, Journal of Computational Physics 494 (2023) 112495. doi:10.1016/j.jcp.2023.112495.
[43] M. Tao, Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance, Physical Review E 94 (4) (2016) 043303.
[44] H. Sharma, Z. Wang, B. Kramer, Hamiltonian operator inference: Physics-preserving learning of reduced-order models for canonical Hamiltonian systems, Physica D: Nonlinear Phenomena 431 (2022) 133122.
[45] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2015.
[46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
[47] A. S. Krishnapriyan, A. F. Queiruga, N. B. Erichson, M. W. Mahoney, Learning continuous models for continuous physics, Communications Physics 6 (1) (2023) 319.
[48] A. Olivier, M. D. Shields, L. Graham-Brady, Bayesian neural networks for uncertainty quantification in data-driven materials modeling, Computer Methods in Applied Mechanics and Engineering 386 (2021) 114079.
[49] Z.-Z. Lan, B.-L. Guo, Nonlinear waves behaviors for a coupled generalized nonlinear Schrödinger–Boussinesq system in a homogeneous magnetized plasma, Nonlinear Dynamics 100 (2020) 3771–3784.
[50] Z. Yan, Generalized method and its application in the higher-order nonlinear Schrödinger equation in nonlinear optical fibres, Chaos, Solitons & Fractals 16 (5) (2003) 759–766.
[51] V. N. Serkin, A. Hasegawa, Exactly integrable nonlinear Schrödinger equation models with varying dispersion, nonlinearity and gain: Application for soliton dispersion, IEEE Journal of Selected Topics in Quantum Electronics 8 (3) (2002) 418–431.
[52] N. Akhmediev, A. Ankiewicz, J. M. Soto-Crespo, Rogue waves and rational solutions of the nonlinear Schrödinger equation, Physical Review E 80 (2) (2009) 026601.
[53] F. Copie, S. Randoux, P. Suret, The physics of the one-dimensional nonlinear Schrödinger equation in fiber optics: Rogue waves, modulation instability and self-focusing phenomena, Reviews in Physics 5 (2020) 100037.
[54] H. Haario, M. Laine, A. Mira, E. Saksman, DRAM: Efficient adaptive MCMC, Statistics and Computing 16 (2006) 339–354.
[55] L. Peng, K. Mohseni, Symplectic model reduction of Hamiltonian systems, SIAM Journal on Scientific Computing 38 (1) (2016) A1–A27.