Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Bayesian identification of nonseparable Hamiltonians with multiplicative noise using deep learning and reduced-order modeling

Nicholas Galioto ngalioto@umich.edu Harsh Sharma hasharma@ucsd.edu Boris Kramer bmkramer@ucsd.edu Alex Arkady Gorodetsky goroda@umich.edu
Abstract

This paper presents a structure-preserving Bayesian approach for learning nonseparable Hamiltonian systems using stochastic dynamic models allowing for statistically-dependent, vector-valued additive and multiplicative measurement noise. The approach is comprised of three main facets. First, we derive a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model that is needed to evaluate the likelihood within the Bayesian posterior. Second, we develop a novel algorithm for cost-effective application of Bayesian system identification to high-dimensional systems. Third, we demonstrate how structure-preserving methods can be incorporated into the proposed framework, using nonseparable Hamiltonians as an illustrative system class. We assess the method’s performance based on the forecasting accuracy of a model estimated from single-trajectory data. We compare the Bayesian method to a state-of-the-art machine learning method on a canonical nonseparable Hamiltonian model and a chaotic double pendulum model with small, noisy training datasets. The results show that using the Bayesian posterior as a training objective can yield upwards of 724 times improvement in Hamiltonian mean squared error using training data with up to 10% multiplicative noise compared to a standard training objective. Lastly, we demonstrate the utility of the novel algorithm for parameter estimation of a 64-dimensional model of the spatially-discretized nonlinear Schrödinger equation with data corrupted by up to 20% multiplicative noise.

keywords:
Bayesian system identification , physics-informed machine learning , multiplicative noise , high-dimensional systems , nonseparable Hamiltonian systems, deep learning
\affiliation

[1]organization=Department of Aerospace Engineering, University of Michigan, city=Ann Arbor, postcode=MI 48109, country=USA

\affiliation

[2]organization=Department of Mechanical and Aerospace Engineering, University of California San Diego, city=San Diego, postcode=CA 92161, country=USA

{highlights}

A Gaussian filter is derived for general additive and multiplicative noise models

An algorithm is introduced for Bayesian estimation of high-dimensional Hamiltonians

Probabilistic modeling is shown to improve a deep system ID method on noisy data

Bayesian parameter estimation is successfully performed on a 64-dimensional system

1 Introduction

System identification (ID) plays a key role in many engineering and scientific frameworks such as model predictive control, system forecasting, and dynamical analysis. Creating a system ID algorithm includes careful selection of a class of candidate models and of an objective function to optimize. The success of system ID strongly depends on how efficiently this pair of model class and objective utilizes available information to guide estimation. Data from the system are most commonly used as sources of information, but prior knowledge on the system physics can also be considered within the estimation procedure.

Incorporating physical knowledge into system ID has been demonstrated through a variety of methodologies [1]. In one direction, physically-inconsistent models can be penalized through the addition of physics-based terms in the objective function. This approach is adopted by the widely-used physics-informed neural networks [2, 3] and has also been applied to various applications such as improving molecular dynamics simulation [4] and lake temperature modeling [5]. In another direction, physics are explicitly encoded into the model parameterization. This approach has led to various neural network architectures for learning conservative systems based on Hamiltonian [6, 7, 8] and Lagrangian [9, 10, 11, 12] mechanics. Other examples of this approach include preservation of symmetry groups in convolutional neural networks [13] and enforcement of boundary conditions in boundary value problems [14]. Compared to traditional machine learning approaches, these methods have all shown significant improvements in estimation accuracy.

In addition to incorporating physical knowledge through models, data must be utilized to encourage consistency with the real world. Proper design of a learning objective is crucial to ensure information in the data is being extracted properly. The most common objectives take the form of a summation of vector norms of the differences between the data and estimated outputs, but alternative forms can be derived through probabilistic modeling. Notably, the modeling of model, measurement, and parameter uncertainties within a Bayesian framework in [15] led to an objective (the negative log posterior) that has been shown to yield more accurate estimates over many widely-used objectives [16]. To further improve information extraction capabilities, this method can be combined with structure-preserving parameterization techniques [15, 17]. However, evaluation of this posterior relies on probabilistic filtering and is computationally challenging. Two challenges arise: (i) filtering can be costly if the noise models are non-additive, and (ii) filtering tends to not scale well with the state and measurement dimensions.

The challenge of non-additive noise arises when the sensor noise is dependent on the signal. The most common form of this is multiplicative noise. As an example, a distance-measuring sensor contains noise that increases roughly linearly with its distance from the target [18]. To address multiplicative noise, a Kalman filter was first derived in 1971 by [19]. Since then, a number of other filters have been developed for various multiplicative noise models such as Gaussian mixtures [20, 21], non-stationary noise [22, 4, 20], and deterministic uncertainties [23, 24]. Each of these filters, however, makes the assumption that the noise from each sensor is independent and identically distributed (i.i.d.). If sensors measure along the same axis, e.g., distance and velocity sensors or redundant sensors, their measurements, and therefore noises, will be correlated and not independent. And for sensors that have been manufactured differently, it is unlikely that their noise models will be identically distributed. Therefore, a more general filtering algorithm for multiplicative noises is needed.

The high-dimensional problem setting poses challenges for not only the Bayesian approach, but also many other system ID methods. Most nonlinear system ID algorithms require thousands of evaluations of a forward model (or data points) for training, which can quickly become prohibitive for large-scale dynamical models. This issue is often addressed through reduced-order modeling in which a low-dimensional approximation, known as a reduced-order model (ROM), of the high-dimensional dynamics is estimated. To identify the ROM, many approaches begin by estimating a low-dimensional subspace for the reduced-order state vector followed by system ID in the reduced-order space. In the time domain, common approaches for the subspace identification task include linear methods such as the proper orthogonal decomposition [25] and the reduced basis method [26, 27], and nonlinear methods such as autoencoders [28, 29] and polynomial manifolds [30, 31]. These methods, however, are trained using full-field simulation data from full-order models (FOMs) that realistically are often incorrect, partially unknown, or uncertain. As a result, the accuracy of the ROMs is limited by the accuracy of the FOMs. To improve past this limit requires using experimental data collected directly from the system of interest. Since these data are often noisy, training with them introduces additional error into the subspace approximation. For a system ID method to handle this added error, careful modeling of uncertainty will be key.

In this work, we introduce methodologies to address the challenges of Bayesian system ID for multiplicative noise and high-dimensional systems. Then we demonstrate how these methodologies can be combined with structure-preserving methods, using nonseparable Hamiltonian systems as an example class of systems. Lastly, we apply the methodologies to estimate a dynamics model from single-trajectory data and evaluate the model’s quality based on its forecasting accuracy. We choose nonseparable Hamiltonian systems for several reasons. Hamiltonian systems can demonstrate complex nonlinear behavior while possessing an underlying highly-structured geometry. These systems possess interesting physical properties that are important to preserve including conservation of the Hamiltonian, reversibility, and symplecticity. Nonseparable Hamiltonians, specifically, are of interest because they arise in diverse fields such as multibody dynamics and control in robotics [32], the Kozai-Lidov mechanism in astrophysics [33], particle accelerators in physics [34], 3D vortex dynamics in fluid mechanics [35], and the nonlinear Schrödinger equation in quantum mechanics [36].

In a previous work [17], we also considered Bayesian system ID of nonseparable Hamiltonian systems with multiplicative measurement noise. There are three major distinctions between that work and the present one. The first is that in the past work, we considered only a polynomial Hamiltonian with a polynomial model parameterization. The current work uses a more expressive deep neural network parameterization and considers a non-polynomial Hamiltonian example. Second, we previously used an additive noise model and trained on data with multiplicative noise to demonstrate robustness to model misspecification, but here we adapt the model and algorithm toward multiplicative noise. The third and most significant difference in this work is that we develop an original algorithm for efficient estimation of high-dimensional nonseparable Hamiltonian systems. These novel additions to the structure-preserving Bayesian learning framework of [17] are stated more specifically as follows:

  • 1.

    derivation of a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model and analysis of the added computational complexity compared to a filter for only additive noise in Section 4.1,

  • 2.

    creation of a novel learning algorithm (Algorithm 2) for estimation of high-dimensional nonseparable Hamiltonians in a reduced-dimensional space in Section 5,

  • 3.

    numerical experimentation showing that the proposed likelihood-based objective outperforms the original objective of a state-of-the-art machine learning method when training with sparse data with multiplicative uniform noise with respect to both state and Hamiltonian error in Sections 6.3 and 6.4. These results include upwards of 724 times improvement in Hamiltonian mean squared error when training with data corrupted by up to 10% multiplicative noise,

  • 4.

    demonstration of the effectiveness of Algorithm 2 for parameter estimation on the 64-dimensional spatially-discretized nonlinear Schrödinger equation within a reduced 8-dimensional space using data with 20% multiplicative noise in Section 6.5.

The rest of this paper is structured as follows. Section 2 provides notation and the probabilistic formulation of the system ID problem. Section 3 gives a background discussion on algorithmic design choices for the Bayesian method. Section 4 introduces a filter for a general additive and multiplicative noise model and reviews structure-preserving techniques for nonseparable Hamiltonian systems. Section 5 presents a novel algorithm for efficient and structure-preserving estimation of high-dimensional Hamiltonians. Section 6 applies the proposed Bayesian algorithm to two low-dimensional (one polynomial and one non-polynomial) nonseparable Hamiltonian systems. Then, the novel algorithm for estimating high-dimensional Hamiltonians is applied to a spatial discretization of the nonlinear Schrödinger equation. Finally, Section 7 provides concluding remarks and future research directions.

2 Problem statement

In this section, we define notation and formulate the system ID problem.

2.1 Notation

The space of real numbers is denoted by \mathbb{R}blackboard_R and the set of positive integers by +subscript\mathbb{Z}_{+}blackboard_Z start_POSTSUBSCRIPT + end_POSTSUBSCRIPT. Vectors are written in lowercase, non-italic, bold font, e.g., 𝐱𝐱\mathbf{x}bold_x, and matrices in uppercase, non-italic, bold font, e.g., 𝐀𝐀\mathbf{A}bold_A. The transpose is denoted by the symbol. If a vector varies in space and/or time, it is spatially indexed as 𝐱ssuperscript𝐱𝑠\mathbf{x}^{s}bold_x start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT and/or temporally indexed as 𝐱tsubscript𝐱𝑡\mathbf{x}_{t}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT for discrete space and time s,t+𝑠𝑡subscripts,t\in\mathbb{Z}_{+}italic_s , italic_t ∈ blackboard_Z start_POSTSUBSCRIPT + end_POSTSUBSCRIPT.

Let (Ω,,)Ω(\Omega,\mathcal{F},\mathbb{P})( roman_Ω , caligraphic_F , blackboard_P ) be a probability triple where ΩΩ\Omegaroman_Ω is a sample space, \mathcal{F}caligraphic_F is a σ𝜎\sigmaitalic_σ-algebra, and \mathbb{P}blackboard_P is a probability measure. Random variables are denoted in lowercase, italic, bold font, e.g., 𝒛𝒛\bm{z}bold_italic_z, and their realizations are denoted as their non-italic counterparts, e.g., 𝐳𝐳\mathbf{z}bold_z. We assume that for a continuous random variable 𝒛𝒛\bm{z}bold_italic_z, the probability measure (d𝒛)d𝒛\mathbb{P}({\rm d}\bm{z})blackboard_P ( roman_d bold_italic_z ) admits a probability density function (pdf) π(𝐳)𝜋𝐳\pi(\mathbf{z})italic_π ( bold_z ). A d𝑑ditalic_d-dimensional uniform distribution with lower and upper bounds 𝐚,𝐛d𝐚𝐛superscript𝑑\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}bold_a , bold_b ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is denoted as 𝒰[𝐚,𝐛]𝒰𝐚𝐛\mathcal{U}[\mathbf{a},\mathbf{b}]caligraphic_U [ bold_a , bold_b ]. A normal distribution with mean 𝐦𝐦\mathbf{m}bold_m and covariance 𝐂𝐂\mathbf{C}bold_C is denoted as 𝒩(𝐦,𝐂)𝒩𝐦𝐂\mathcal{N}(\mathbf{m},\mathbf{C})caligraphic_N ( bold_m , bold_C ). If 𝒛𝒩(𝟎,𝐂)similar-to𝒛𝒩0𝐂\bm{z}\sim\mathcal{N}(\mathbf{0},\mathbf{C})bold_italic_z ∼ caligraphic_N ( bold_0 , bold_C ), then |𝒛|𝒛\lvert\bm{z}\rvert| bold_italic_z | follows a half-normal distribution denoted as half-𝒩(𝟎,𝐂)half-𝒩0𝐂\text{half-}\mathcal{N}(\mathbf{0},\mathbf{C})half- caligraphic_N ( bold_0 , bold_C ).

The symbol direct-product\odot represents element-wise multiplication. The direct-product\odot operation is defined when the dimensions of the operands match or when the operands are a matrix and a vector with length equal to the number of matrix columns. In the latter case, direct-product\odot multiplies the i𝑖iitalic_ith column of the matrix by the i𝑖iitalic_ith element of the vector. This operator has the useful property that 𝒛𝐚𝒩(𝐦𝐚,𝐂(𝐚𝐚))similar-todirect-product𝒛𝐚𝒩direct-product𝐦𝐚direct-product𝐂superscript𝐚𝐚top\bm{z}\odot\mathbf{a}\sim\mathcal{N}\left(\mathbf{m}\odot\mathbf{a},\mathbf{C}% \odot(\mathbf{a}\mathbf{a}^{\top})\right)bold_italic_z ⊙ bold_a ∼ caligraphic_N ( bold_m ⊙ bold_a , bold_C ⊙ ( bold_aa start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) ) for the Gaussian random vector 𝒛𝒩(𝐦,𝐂)similar-to𝒛𝒩𝐦𝐂\bm{z}\sim\mathcal{N}(\mathbf{m},\mathbf{C})bold_italic_z ∼ caligraphic_N ( bold_m , bold_C ) and a constant vector 𝐚𝐚\mathbf{a}bold_a of the same length.

2.2 Probabilistic problem formulation

In this section, we describe a probabilistic problem formulation for Bayesian system ID for arbitrary noise models. We model the states 𝒙k(ω)nsubscript𝒙𝑘𝜔superscript𝑛\bm{x}_{k}(\omega)\in\mathbb{R}^{n}bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and the outputs 𝒚k(ω)msubscript𝒚𝑘𝜔superscript𝑚\bm{y}_{k}(\omega)\in\mathbb{R}^{m}bold_italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT for ωΩ𝜔Ω\omega\in\Omegaitalic_ω ∈ roman_Ω as discrete-time stochastic processes indexed by k+𝑘subscriptk\in\mathbb{Z}_{+}italic_k ∈ blackboard_Z start_POSTSUBSCRIPT + end_POSTSUBSCRIPT. The dynamics are modeled as a hidden Markov model (HMM)

𝒙k+1subscript𝒙𝑘1\displaystyle\bm{x}_{k+1}bold_italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT =𝒯(𝒙k,𝜽,ω),absent𝒯subscript𝒙𝑘𝜽𝜔\displaystyle=\mathcal{T}(\bm{x}_{k},\bm{\theta},\omega),= caligraphic_T ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) , (1a)
𝒚ksubscript𝒚𝑘\displaystyle\bm{y}_{k}bold_italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =(𝒙k,𝜽,ω),absentsubscript𝒙𝑘𝜽𝜔\displaystyle=\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega),= caligraphic_M ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) , (1b)

where 𝒯:n××Ωn:𝒯maps-tosuperscript𝑛superscriptΩsuperscript𝑛\mathcal{T}:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\times\Omega\mapsto\mathbb{R}% ^{n}caligraphic_T : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT × roman_Ω ↦ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the state-transition function and :n××Ωm:maps-tosuperscript𝑛superscriptΩsuperscript𝑚\mathcal{M}:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\times\Omega\mapsto\mathbb{R}% ^{m}caligraphic_M : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT × roman_Ω ↦ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT the measurement function. These functions are parameterized by the random variable 𝜽(ω)𝜽𝜔superscript\bm{\theta}(\omega)\in\mathbb{R}^{\ell}bold_italic_θ ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT, whose realizations we denote with 𝛉𝛉\bm{\uptheta}bold_θ, and both operators are functions of ωΩ𝜔Ω\omega\in\Omegaitalic_ω ∈ roman_Ω to represent that their outputs are random variables. The system is therefore characterized by the sequences of transitional pdfs π(𝐱k+1|𝐱k,𝛉)𝜋conditionalsubscript𝐱𝑘1subscript𝐱𝑘𝛉\pi(\mathbf{x}_{k+1}|\mathbf{x}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) and conditional output pdfs π(𝐲k|𝐱k,𝛉)𝜋conditionalsubscript𝐲𝑘subscript𝐱𝑘𝛉\pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) induced by 𝒯𝒯\mathcal{T}caligraphic_T and \mathcal{M}caligraphic_M, respectively.

We seek to represent the posterior distribution π(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) characterizing the uncertainty in system parameters 𝜽𝜽\bm{\theta}bold_italic_θ given a collection of measurements 𝒴N(𝐲1,,𝐲N)subscript𝒴𝑁subscript𝐲1subscript𝐲𝑁\mathcal{Y}_{N}\coloneqq(\mathbf{y}_{1},\ldots,\mathbf{y}_{N})caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ≔ ( bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ). Bayes’ rule expresses the posterior in a computable form via the likelihood (𝛉;𝒴N)π(𝒴N|𝛉)𝛉subscript𝒴𝑁𝜋conditionalsubscript𝒴𝑁𝛉\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})\coloneqq\pi(\mathcal{Y}_{N}|\bm{% \uptheta})caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) ≔ italic_π ( caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | bold_θ ), the prior π(𝛉)𝜋𝛉\pi(\bm{\uptheta})italic_π ( bold_θ ), and a normalizing constant π(𝒴N)𝜋subscript𝒴𝑁\pi(\mathcal{Y}_{N})italic_π ( caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) known as the evidence according to

π(𝛉|𝒴N)=(𝛉;𝒴N)π(𝛉)π(𝒴N).𝜋conditional𝛉subscript𝒴𝑁𝛉subscript𝒴𝑁𝜋𝛉𝜋subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})=\frac{\mathcal{L}(\bm{\uptheta};\mathcal{Y}% _{N})\pi(\bm{\uptheta})}{\pi(\mathcal{Y}_{N})}.italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) = divide start_ARG caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) italic_π ( bold_θ ) end_ARG start_ARG italic_π ( caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) end_ARG . (2)

The HMM, however, has the additional collection of uncertain variables 𝒙1,,𝒙Nsubscript𝒙1subscript𝒙𝑁\bm{x}_{1},\ldots,\bm{x}_{N}bold_italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT about which we do not intend to make inferences. These uncertain states are marginalized out of the inference problem within the likelihood as π(𝒴N,𝐱1,,𝐱N|𝛉)d𝐱1,,d𝐱N𝜋subscript𝒴𝑁subscript𝐱1conditionalsubscript𝐱𝑁𝛉differential-dsubscript𝐱1dsubscript𝐱𝑁\int\pi(\mathcal{Y}_{N},\mathbf{x}_{1},\ldots,\mathbf{x}_{N}|\bm{\uptheta}){% \rm d}\mathbf{x}_{1},\ldots,{\rm d}\mathbf{x}_{N}∫ italic_π ( caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT | bold_θ ) roman_d bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , roman_d bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT. At first, this marginalization appears to require a costly nN𝑛𝑁nNitalic_n italic_N-dimensional integration. However, the recursive structure of the HMM can be exploited to break the high-dimensional integral into N𝑁Nitalic_N integrals of the more manageable dimension n𝑛nitalic_n by the decomposition

(𝛉;𝒴N)=π(𝐲1|𝛉)k=2Nπ(𝐲k|𝛉,𝒴k1).𝛉subscript𝒴𝑁𝜋conditionalsubscript𝐲1𝛉superscriptsubscriptproduct𝑘2𝑁𝜋conditionalsubscript𝐲𝑘𝛉subscript𝒴𝑘1\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})=\pi(\mathbf{y}_{1}|\bm{\uptheta})% \prod_{k=2}^{N}\pi(\mathbf{y}_{k}|\bm{\uptheta},\mathcal{Y}_{k-1}).caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) = italic_π ( bold_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_θ ) ∏ start_POSTSUBSCRIPT italic_k = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | bold_θ , caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ) . (3)

Each term in this product can be efficiently computed using recursion, as shown in Algorithm 1 from [37, Th. 12.3].

Algorithm 1 Recursive marginal likelihood evaluation [37]
1:π(𝐱1|𝛉)𝜋conditionalsubscript𝐱1𝛉\pi(\mathbf{x}_{1}|\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_θ ), 𝒴Nsubscript𝒴𝑁\mathcal{Y}_{N}caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT
2:(𝛉;𝒴N)𝛉subscript𝒴𝑁\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{N})caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT )
3:Initialize π(𝐱1|𝒴0,𝛉)π(𝐱1|𝛉)𝜋conditionalsubscript𝐱1subscript𝒴0𝛉𝜋conditionalsubscript𝐱1𝛉\pi(\mathbf{x}_{1}|\mathcal{Y}_{0},\bm{\uptheta})\coloneqq\pi(\mathbf{x}_{1}|% \bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , bold_θ ) ≔ italic_π ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_θ ) and (𝛉;𝒴0)1𝛉subscript𝒴01\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{0})\coloneqq 1caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ≔ 1
4:for k=1,N𝑘1𝑁k=1,\ldots Nitalic_k = 1 , … italic_N do
5:     Marginalize: π(𝐲k|𝒴k1,𝛉)π(𝐲k|𝐱k,𝛉)π(𝐱k|𝒴k1,𝛉)d𝐱k𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉𝜋conditionalsubscript𝐲𝑘subscript𝐱𝑘𝛉𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘1𝛉differential-dsubscript𝐱𝑘\displaystyle\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})\leftarrow\int% \pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})\pi(\mathbf{x}_{k}|\mathcal{Y}% _{k-1},\bm{\uptheta}){\rm d}\mathbf{x}_{k}italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) ← ∫ italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) roman_d bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
6:                             (𝛉;𝒴k)(𝛉;𝒴k1)π(𝐲k|𝒴k1,𝛉)𝛉subscript𝒴𝑘𝛉subscript𝒴𝑘1𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉\displaystyle\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{k})\leftarrow\mathcal{L}(% \bm{\uptheta};\mathcal{Y}_{k-1})\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{% \uptheta})caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ← caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ) italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ )
7:     if k<N𝑘𝑁k<Nitalic_k < italic_N then
8:         Update: π(𝐱k|𝒴k,𝛉)π(𝐲k|𝐱k,𝛉)π(𝐲k|𝒴k1,𝛉)π(𝐱k|𝒴k1,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘𝛉𝜋conditionalsubscript𝐲𝑘subscript𝐱𝑘𝛉𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘1𝛉\displaystyle\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})\leftarrow\frac{% \pi(\mathbf{y}_{k}|\mathbf{x}_{k},\bm{\uptheta})}{\pi(\mathbf{y}_{k}|\mathcal{% Y}_{k-1},\bm{\uptheta})}\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) ← divide start_ARG italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) end_ARG start_ARG italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) end_ARG italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ )
9:         Predict: π(𝐱k+1|𝒴k,𝛉)π(𝐱k+1|𝐱k,𝛉)π(𝐱k|𝒴k,𝛉)d𝐱k𝜋conditionalsubscript𝐱𝑘1subscript𝒴𝑘𝛉𝜋conditionalsubscript𝐱𝑘1subscript𝐱𝑘𝛉𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘𝛉differential-dsubscript𝐱𝑘\displaystyle\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})\leftarrow\int% \pi(\mathbf{x}_{k+1}|\mathbf{x}_{k},\bm{\uptheta})\pi(\mathbf{x}_{k}|\mathcal{% Y}_{k},\bm{\uptheta}){\rm d}\mathbf{x}_{k}italic_π ( bold_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) ← ∫ italic_π ( bold_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) roman_d bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
10:     end if
11:end for

3 Algorithmic design choices

While Algorithm 1 is general enough to be applicable to any system that can be modeled in the general HMM form of Eq. (1), this flexibility requires design choices addressing two primary challenges: (i) the computational expense of filtering and (ii) how to encode prior knowledge into the parameterizations of 𝒯𝒯\mathcal{T}caligraphic_T and \mathcal{M}caligraphic_M.

The first challenge of computational expense arises because the integrals in Algorithm 1 do not, in general, admit closed form solutions. The user’s choice in evaluation method will determine the accuracy of the marginal likelihood evaluation and overall computational complexity. The design choices to address the second challenge include selecting the coordinate system, system dimensions, and model fidelities within 𝒯𝒯\mathcal{T}caligraphic_T and \mathcal{M}caligraphic_M. These choices often involve a tradeoff between accuracy and computational expense. For the best accuracy and generalizability of the learned model, prior information on the system should inform the model parameterization as much as possible. We now describe these choices and our contributions in more detail.

3.1 Sources of computational expense

Integral evaluation methods within filtering can be divided into two classes: those that estimate the exact integral and those that estimate an approximation of the integral. Estimation of the exact integral is typically performed using sequential Monte Carlo algorithms such as particle filtering [38]. The efficiency of this approach, however, is strongly dependent on the ability to draw uncorrelated samples from the appropriate distributions, which is, in general, nontrivial. When the efficiency of Monte Carlo sampling is prohibitive, approximations are used instead. The most common class of approximation methods for these integrals is Gaussian filtering. These methods approximate the marginal pdf (𝛉;𝒴k)𝛉subscript𝒴𝑘\mathcal{L}(\bm{\uptheta};\mathcal{Y}_{k})caligraphic_L ( bold_θ ; caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) as Gaussian, which requires tracking only the first two moments of all other pdfs in Algorithm 1. Tracking the mean and covariance of the prediction pdf π(𝐱k+1|𝒴k,𝛉)𝜋conditionalsubscript𝐱𝑘1subscript𝒴𝑘𝛉\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) and output pdf π(𝐲k|𝒴k1,𝛉)𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) can be achieved through linearization using Taylor series expansion or through Gaussian integration. These techniques are possible because the prediction and output pdfs are defined in terms of available functions, in this case 𝒯𝒯\mathcal{T}caligraphic_T and \mathcal{M}caligraphic_M. There is no such function, however, for the update pdf π(𝐱k|𝒴k,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘𝛉\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ). The mean and covariance of this pdf are instead approximated with the Kalman update, which delivers the minimum mean squared error (MMSE) estimate of these quantities.

3.1.1 Process and measurement model forms

If more is known about the forms of 𝒯𝒯\mathcal{T}caligraphic_T and \mathcal{M}caligraphic_M, linearization and Gaussian integration can sometimes be replaced by closed form solutions. To this end, it is useful to separate the dynamics and output functions into deterministic and stochastic components. Let Ψ:n×n:Ψmaps-tosuperscript𝑛superscriptsuperscript𝑛\Psi:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{n}roman_Ψ : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and h:n×m:maps-tosuperscript𝑛superscriptsuperscript𝑚h:\mathbb{R}^{n}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{m}italic_h : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT represent deterministic dynamics and measurement functions, respectively. The stochastic components are usually further divided according to how they enter into the model. The two main approaches for modeling these components are as additive or multiplicative noise. If both of these types of noise are present, the model is written as

𝒯(𝒙k,𝜽,ω)𝒯subscript𝒙𝑘𝜽𝜔\displaystyle\mathcal{T}(\bm{x}_{k},\bm{\theta},\omega)caligraphic_T ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) =Ψ(𝒙k,𝜽)𝒘k(ω)+𝝃k(ω),absentdirect-productΨsubscript𝒙𝑘𝜽subscript𝒘𝑘𝜔subscript𝝃𝑘𝜔\displaystyle=\Psi(\bm{x}_{k},\bm{\theta})\odot\bm{w}_{k}(\omega)+\bm{\xi}_{k}% (\omega),= roman_Ψ ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) ⊙ bold_italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) + bold_italic_ξ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) , (4a)
(𝒙k,𝜽,ω)subscript𝒙𝑘𝜽𝜔\displaystyle\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega)caligraphic_M ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) =h(𝒙k,𝜽)𝒗k(ω)+𝜼k(ω),absentdirect-productsubscript𝒙𝑘𝜽subscript𝒗𝑘𝜔subscript𝜼𝑘𝜔\displaystyle=h(\bm{x}_{k},\bm{\theta})\odot\bm{v}_{k}(\omega)+\bm{\eta}_{k}(% \omega),= italic_h ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) ⊙ bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) , (4b)

where 𝝃k(ω),𝒘k(ω)nsubscript𝝃𝑘𝜔subscript𝒘𝑘𝜔superscript𝑛\bm{\xi}_{k}(\omega),\bm{w}_{k}(\omega)\in\mathbb{R}^{n}bold_italic_ξ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) , bold_italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and 𝜼k(ω),𝒗k(ω)msubscript𝜼𝑘𝜔subscript𝒗𝑘𝜔superscript𝑚\bm{\eta}_{k}(\omega),\bm{v}_{k}(\omega)\in\mathbb{R}^{m}bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) , bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT are discrete-time stochastic processes.

The most common noise model is additive noise. One benefit of an additive noise model is that the first two moments of the noise terms can be estimated separately from those of the dynamics and observation functions, reducing complexity. Then the total mean and covariance for either 𝒯𝒯\mathcal{T}caligraphic_T or \mathcal{M}caligraphic_M is simply the sum of these separate estimates. For linear systems, this allows for the means and covariances of π(𝐱k+1|𝒴k,𝛉)𝜋conditionalsubscript𝐱𝑘1subscript𝒴𝑘𝛉\pi(\mathbf{x}_{k+1}|\mathcal{Y}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) and π(𝐲k|𝒴k1,𝛉)𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) to be evaluated analytically with the Kalman filter when the first two moments of the noise terms are known. Moreover, if the system is linear and the noise terms are Gaussian, then the prediction, output, and update pdfs in Algorithm 1 are all Gaussian, and the Kalman filter is exact.

Although additive Gaussian noise is a suitable choice for many problems, recent works on learning Hamiltonians [39, 40] have begun considering multiplicative uniform noise. As with any other noise model, Gaussian filtering can be used to approximate Algorithm 1 for multiplicative uniform noise, but it is preferable to compute as much of each integral in closed form as possible for the sake of complexity. By using knowledge of the noise model, the general Gaussian filtering procedure can be adapted to replace a portion of the approximations with an exact and computationally efficient evaluation. In Section 4.1, we introduce such a filtering procedure for multiplicative noise and analyze its computational expense.

3.1.2 Dimensionality

Although using Gaussian filtering for integral evaluation is significantly more efficient than Monte Carlo sampling, it tends to not scale well with the state and measurement dimensions. For example, the Kalman filter has a computational complexity of 𝒪(N(n3+m3))𝒪𝑁superscript𝑛3superscript𝑚3\mathcal{O}(N(n^{3}+m^{3}))caligraphic_O ( italic_N ( italic_n start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT + italic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) ), which can make optimization and sampling schemes infeasible for even moderate dimensions. Since in most real-world systems, nmmuch-greater-than𝑛𝑚n\gg mitalic_n ≫ italic_m, the state dimension tends to be the limiting dimension. Therefore, one solution is to use reduced-order modeling to reduce the state dimension to rnmuch-less-than𝑟𝑛r\ll nitalic_r ≪ italic_n and perform estimation in this r𝑟ritalic_r-dimensional subspace. In Section 5, we present a method for learning high-dimensional systems efficiently using Algorithm 1 and reduced-order modeling. If this approach is used, it is critical that the ROM still preserve the geometric structure of the FOM. Structure-preservation can be achieved by encoding prior physics knowledge into the parameterization of the dynamics model 𝒯𝒯\mathcal{T}caligraphic_T. We discuss such an approach in the following section.

3.2 Incorporation of prior knowledge

Here we consider the choice of model parameterization with a focus on embedding geometric structure into the model. The benefits of physics-informed parameterization are that it leads to models that generalize better beyond the training data and reduces the number of data required for training. There are two main ways to enforce the system physics within the dynamics model 𝒯𝒯\mathcal{T}caligraphic_T. First, the parameterization of 𝒯𝒯\mathcal{T}caligraphic_T should be designed to only admit models whose dynamics possess the same physical structure of the system. Second, the model dynamics should be evaluated in such a way that the resulting flow preserves the structure of the dynamics. Since this work considers specialization of methods to nonseparable Hamiltonian systems, we briefly review aspects of the Hamiltonian structure that will inform the proposed parameterization method.

Finite-dimensional canonical Hamiltonian systems are defined by a scalar function H𝐻Hitalic_H, known as the Hamiltonian, of the canonical position 𝐪𝐪\mathbf{q}bold_q and momentum 𝐩𝐩\mathbf{p}bold_p, both in dsuperscript𝑑\mathbb{R}^{d}blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. For these systems, the state is defined as 𝐱=[𝐪𝐩]𝐱superscriptmatrixsuperscript𝐪topsuperscript𝐩toptop\mathbf{x}=\begin{bmatrix}\mathbf{q}^{\top}&\mathbf{p}^{\top}\end{bmatrix}^{\top}bold_x = [ start_ARG start_ROW start_CELL bold_q start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL bold_p start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT such that n=2d𝑛2𝑑n=2ditalic_n = 2 italic_d. The governing equations of the system, known as Hamilton’s equations, are derived from this function

𝐪˙=H(𝐪,𝐩)𝐩,𝐩˙=H(𝐪,𝐩)𝐪.formulae-sequence˙𝐪𝐻𝐪𝐩𝐩˙𝐩𝐻𝐪𝐩𝐪\dot{\mathbf{q}}=\frac{\partial H(\mathbf{q},\mathbf{p})}{\partial\mathbf{p}},% \quad\quad\dot{\mathbf{p}}=-\frac{\partial H(\mathbf{q},\mathbf{p})}{\partial% \mathbf{q}}.over˙ start_ARG bold_q end_ARG = divide start_ARG ∂ italic_H ( bold_q , bold_p ) end_ARG start_ARG ∂ bold_p end_ARG , over˙ start_ARG bold_p end_ARG = - divide start_ARG ∂ italic_H ( bold_q , bold_p ) end_ARG start_ARG ∂ bold_q end_ARG . (5)

Many Hamiltonian systems of interest to engineers and scientists (e.g., [32, 34, 35, 33, 36]) are not additively separable with respect to functions of the position and momentum. Such Hamiltonian systems are said to be nonseparable. Unlike the separable Hamiltonian systems which can be written as H(𝐪,𝐩)=T(𝐩)+U(𝐪)𝐻𝐪𝐩𝑇𝐩𝑈𝐪H(\mathbf{q},\mathbf{p})=T(\mathbf{p})+U(\mathbf{q})italic_H ( bold_q , bold_p ) = italic_T ( bold_p ) + italic_U ( bold_q ) with a kinetic energy function T(𝐩)𝑇𝐩T(\mathbf{p})italic_T ( bold_p ) and a potential energy function U(𝐪)𝑈𝐪U(\mathbf{q})italic_U ( bold_q ), Eq. (5) cannot be further simplified for nonseparable Hamiltonians. A distinctive feature of Hamilton’s equations is that they possess physically meaningful geometric properties that can be described in the form of symplecticity, invariants of motion, and energy conservation. We discuss how these properties can be embedded in the estimation procedure in Section 4.2.

4 Methodology

In this section, we present solutions to the problems of computational expense and prior knowledge incorporation. In Section 4.1, we derive a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model, and we analyze its computational expense relative to a Gaussian filter for the more widely-used additive noise model. Then in Section 4.2, we add the additional capability of physics-informed estimation to the algorithm by embedding geometric structure within the dynamics propagator ΨΨ\Psiroman_Ψ, and we tailor the approach specifically toward nonseparable Hamiltonian systems.

4.1 Filtering with multiplicative noise

In this section, we extend the filter for multiplicative scalar noise from [19] to models with statistically-dependent, vector-valued noise. Consider the HMM with additive and multiplicative noise in Eq. (4). Let the vector-valued multiplicative noise terms 𝒘ksubscript𝒘𝑘\bm{w}_{k}bold_italic_w start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝒗ksubscript𝒗𝑘\bm{v}_{k}bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT each be i.i.d. with means 𝐰¯¯𝐰\bar{\mathbf{w}}over¯ start_ARG bold_w end_ARG and 𝐯¯¯𝐯\bar{\mathbf{v}}over¯ start_ARG bold_v end_ARG and covariances 𝚺subscript𝚺direct-product\mathbf{\Sigma}_{\odot}bold_Σ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT and 𝚪subscript𝚪direct-product\mathbf{\Gamma}_{\odot}bold_Γ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT, respectively. Similarly, let the additive noise terms 𝝃ksubscript𝝃𝑘\bm{\xi}_{k}bold_italic_ξ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝜼ksubscript𝜼𝑘\bm{\eta}_{k}bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT also be i.i.d. with zero means and covariances 𝚺+subscript𝚺\mathbf{\Sigma}_{+}bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT and 𝚪+subscript𝚪\mathbf{\Gamma}_{+}bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT, respectively. Since we are interested in Gaussian filtering, the higher order moments of these noise terms are not needed.

Recall from Section 3.1 that the goal of Gaussian filtering within Algorithm 1 is to compute (approximations of) the means and covariances of the distributions π(𝐱k|𝒴k1,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘1𝛉\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ), π(𝐲k|𝒴k1,𝛉)𝜋conditionalsubscript𝐲𝑘subscript𝒴𝑘1𝛉\pi(\mathbf{y}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ), and π(𝐱k|𝒴k,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘𝛉\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ). Many of these evaluations require statistics of a function output that can be computed in closed form if the function is linear or approximated with Gaussian approximation if the function is nonlinear. To represent these function outputs, we denote Ψ(𝒙k,𝜽)Ψsubscript𝒙𝑘𝜽\Psi(\bm{x}_{k},\bm{\theta})roman_Ψ ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) as 𝝍ksubscript𝝍𝑘\bm{\psi}_{k}bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and h(𝐲k,𝜽)subscript𝐲𝑘𝜽h(\mathbf{y}_{k},\bm{\theta})italic_h ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) as 𝒉ksubscript𝒉𝑘\bm{h}_{k}bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. The following equations outline the filtering procedure with line numbers denoting where each group of equations are evaluated within Algorithm 1. The mean 𝐦ksubscript𝐦𝑘\mathbf{m}_{k}bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and covariance 𝐂ksubscript𝐂𝑘\mathbf{C}_{k}bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT of 𝒯(𝒙k1,𝜽,ω)𝒯subscript𝒙𝑘1𝜽𝜔\mathcal{T}(\bm{x}_{k-1},\bm{\theta},\omega)caligraphic_T ( bold_italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) with respect to π(𝐱k1|𝒴k1,𝛉)𝜋conditionalsubscript𝐱𝑘1subscript𝒴𝑘1𝛉\pi(\mathbf{x}_{k-1}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) (line 6):

𝐦ksubscript𝐦𝑘\displaystyle\mathbf{m}_{k}bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝔼[𝝍k]𝐰¯,absentdirect-product𝔼delimited-[]subscript𝝍𝑘¯𝐰\displaystyle=\mathbb{E}\left[\bm{\psi}_{k}\right]\odot\bar{\mathbf{w}},= blackboard_E [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ over¯ start_ARG bold_w end_ARG , (6)
𝐂ksubscript𝐂𝑘\displaystyle\mathbf{C}_{k}bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝔼[𝝍k𝝍k]𝚺+𝕍ar[𝝍k](𝐰¯𝐰¯)+𝚺+.absentdirect-product𝔼delimited-[]subscript𝝍𝑘superscriptsubscript𝝍𝑘topsubscript𝚺direct-productdirect-product𝕍ardelimited-[]subscript𝝍𝑘¯𝐰superscript¯𝐰topsubscript𝚺\displaystyle=\mathbb{E}\left[\bm{\psi}_{k}\bm{\psi}_{k}^{\top}\right]\odot% \mathbf{\Sigma}_{\odot}+\mathbb{V}\text{ar}\left[\bm{\psi}_{k}\right]\odot(% \bar{\mathbf{w}}\bar{\mathbf{w}}^{\top})+\mathbf{\Sigma}_{+}.= blackboard_E [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ⊙ bold_Σ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT + blackboard_V ar [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ ( over¯ start_ARG bold_w end_ARG over¯ start_ARG bold_w end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) + bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT . (7)

The mean 𝛍ksubscript𝛍𝑘\bm{\upmu}_{k}bold_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and covariance 𝐒ksubscript𝐒𝑘\mathbf{S}_{k}bold_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT of (𝒙k,𝜽,ω)subscript𝒙𝑘𝜽𝜔\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega)caligraphic_M ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) with respect to π(𝐱k|𝒴k1,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘1𝛉\pi(\mathbf{x}_{k}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) and the covariance 𝐔ksubscript𝐔𝑘\mathbf{U}_{k}bold_U start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT between 𝒯(𝒙k1,𝜽,ω)𝒯subscript𝒙𝑘1𝜽𝜔\mathcal{T}(\bm{x}_{k-1},\bm{\theta},\omega)caligraphic_T ( bold_italic_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) and (𝒙k,𝜽,ω)subscript𝒙𝑘𝜽𝜔\mathcal{M}(\bm{x}_{k},\bm{\theta},\omega)caligraphic_M ( bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) with respect to π(𝐱k,𝐱k1|𝒴k1,𝛉)𝜋subscript𝐱𝑘conditionalsubscript𝐱𝑘1subscript𝒴𝑘1𝛉\pi(\mathbf{x}_{k},\mathbf{x}_{k-1}|\mathcal{Y}_{k-1},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_θ ) (line 3):

𝛍ksubscript𝛍𝑘\displaystyle\bm{\upmu}_{k}bold_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝔼[𝒉k]𝐯¯,absentdirect-product𝔼delimited-[]subscript𝒉𝑘¯𝐯\displaystyle=\mathbb{E}\left[\bm{h}_{k}\right]\odot\bar{\mathbf{v}},= blackboard_E [ bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ over¯ start_ARG bold_v end_ARG , (8)
𝐒ksubscript𝐒𝑘\displaystyle\mathbf{S}_{k}bold_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝔼[𝒉k𝒉k]𝚪+𝕍ar[𝒉k](𝐯¯𝐯¯)+𝚪+,absentdirect-product𝔼delimited-[]subscript𝒉𝑘superscriptsubscript𝒉𝑘topsubscript𝚪direct-productdirect-product𝕍ardelimited-[]subscript𝒉𝑘¯𝐯superscript¯𝐯topsubscript𝚪\displaystyle=\mathbb{E}\left[\bm{h}_{k}\bm{h}_{k}^{\top}\right]\odot\mathbf{% \Gamma}_{\odot}+\mathbb{V}\text{ar}\left[\bm{h}_{k}\right]\odot(\bar{\mathbf{v% }}\bar{\mathbf{v}}^{\top})+\mathbf{\Gamma}_{+},= blackboard_E [ bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] ⊙ bold_Γ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT + blackboard_V ar [ bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ ( over¯ start_ARG bold_v end_ARG over¯ start_ARG bold_v end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) + bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT , (9)
𝐔ksubscript𝐔𝑘\displaystyle\mathbf{U}_{k}bold_U start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =ov[𝝍k,𝒉k]𝐯¯k.absentdirect-productovsubscript𝝍𝑘subscript𝒉𝑘subscript¯𝐯𝑘\displaystyle=\mathbb{C}\text{ov}\left[\bm{\psi}_{k},\bm{h}_{k}\right]\odot% \bar{\mathbf{v}}_{k}.= blackboard_C ov [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ over¯ start_ARG bold_v end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT . (10)

The mean 𝐦k+superscriptsubscript𝐦𝑘\mathbf{m}_{k}^{+}bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT and covariance 𝐂k+superscriptsubscript𝐂𝑘\mathbf{C}_{k}^{+}bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT of 𝒙ksubscript𝒙𝑘\bm{x}_{k}bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT with respect to π(𝐱k|𝒴k,𝛉)𝜋conditionalsubscript𝐱𝑘subscript𝒴𝑘𝛉\pi(\mathbf{x}_{k}|\mathcal{Y}_{k},\bm{\uptheta})italic_π ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | caligraphic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_θ ) (line 5):

𝐊ksubscript𝐊𝑘\displaystyle\mathbf{K}_{k}bold_K start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT =𝐔k𝐒k1,absentsubscript𝐔𝑘superscriptsubscript𝐒𝑘1\displaystyle=\mathbf{U}_{k}\mathbf{S}_{k}^{-1},= bold_U start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , (11)
𝐦k+superscriptsubscript𝐦𝑘\displaystyle\mathbf{m}_{k}^{+}bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT 𝐦k+𝐊k(𝐲k𝛍k),absentsubscript𝐦𝑘subscript𝐊𝑘subscript𝐲𝑘subscript𝛍𝑘\displaystyle\approx\mathbf{m}_{k}+\mathbf{K}_{k}(\mathbf{y}_{k}-\bm{\upmu}_{k% }),≈ bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_K start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , (12)
𝐂k+superscriptsubscript𝐂𝑘\displaystyle\mathbf{C}_{k}^{+}bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT 𝐂k𝐊k𝐔k.absentsubscript𝐂𝑘subscript𝐊𝑘superscriptsubscript𝐔𝑘top\displaystyle\approx\mathbf{C}_{k}-\mathbf{K}_{k}\mathbf{U}_{k}^{\top}.≈ bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - bold_K start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_U start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT . (13)

These last three equations represent the linear MMSE estimator with gain matrix 𝐊kn×msubscript𝐊𝑘superscript𝑛𝑚\mathbf{K}_{k}\in\mathbb{R}^{n\times m}bold_K start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT. Therefore, 𝐦k+superscriptsubscript𝐦𝑘\mathbf{m}_{k}^{+}bold_m start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT and 𝐂k+superscriptsubscript𝐂𝑘\mathbf{C}_{k}^{+}bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT are only approximations unless 𝒙ksubscript𝒙𝑘\bm{x}_{k}bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝒚ksubscript𝒚𝑘\bm{y}_{k}bold_italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT are jointly Gaussian. Also notice that 𝔼[𝝍k𝝍k]𝔼delimited-[]subscript𝝍𝑘superscriptsubscript𝝍𝑘top\mathbb{E}\left[\bm{\psi}_{k}\bm{\psi}_{k}^{\top}\right]blackboard_E [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] and 𝔼[𝒉k𝒉k]𝔼delimited-[]subscript𝒉𝑘superscriptsubscript𝒉𝑘top\mathbb{E}\left[\bm{h}_{k}\bm{h}_{k}^{\top}\right]blackboard_E [ bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT bold_italic_h start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ] are required when multiplicative noise is present. Again, these could be computed with linearization or Gaussian integration, but for computational efficiency, we assume that the function outputs are Gaussian. This assumption allows Eq. (7) to be computed in the following form:

𝐂k=𝕍ar[𝝍k](𝚺+𝐰¯𝐰¯)+(𝔼[𝝍k]𝔼[𝝍k])𝚺+𝚺+.subscript𝐂𝑘direct-product𝕍ardelimited-[]subscript𝝍𝑘subscript𝚺direct-product¯𝐰superscript¯𝐰topdirect-product𝔼delimited-[]subscript𝝍𝑘𝔼superscriptdelimited-[]subscript𝝍𝑘topsubscript𝚺direct-productsubscript𝚺\mathbf{C}_{k}=\mathbb{V}\text{ar}\left[\bm{\psi}_{k}\right]\odot(\mathbf{% \Sigma}_{\odot}+\bar{\mathbf{w}}\bar{\mathbf{w}}^{\top})+(\mathbb{E}\left[\bm{% \psi}_{k}\right]\mathbb{E}\left[\bm{\psi}_{k}\right]^{\top})\odot\mathbf{% \Sigma}_{\odot}+\mathbf{\Sigma}_{+}.bold_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = blackboard_V ar [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] ⊙ ( bold_Σ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT + over¯ start_ARG bold_w end_ARG over¯ start_ARG bold_w end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) + ( blackboard_E [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] blackboard_E [ bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) ⊙ bold_Σ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT + bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT . (14)

A similar form can be used to compute Eq. (9).

In a past work [15], we analyzed the computational complexity of the filtering procedure for additive noise models and found it to be on the order 𝒪(N(n3+m3))𝒪𝑁superscript𝑛3superscript𝑚3\mathcal{O}(N(n^{3}+m^{3}))caligraphic_O ( italic_N ( italic_n start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT + italic_m start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) ). Here, we analyze the increase in computational complexity of this filtering procedure when multiplicative noise is present in addition to additive noise. Because we assumed that the noise is stationary, the outer products 𝐰¯𝐰¯¯𝐰superscript¯𝐰top\bar{\mathbf{w}}\bar{\mathbf{w}}^{\top}over¯ start_ARG bold_w end_ARG over¯ start_ARG bold_w end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT and 𝐯¯𝐯¯¯𝐯superscript¯𝐯top\bar{\mathbf{v}}\bar{\mathbf{v}}^{\top}over¯ start_ARG bold_v end_ARG over¯ start_ARG bold_v end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT need only be evaluated once. All other operations scale linearly with the number of data N𝑁Nitalic_N. The added computational cost is summarized in Table 1. The order of the added expense is 𝒪(N(n2+m2))𝒪𝑁superscript𝑛2superscript𝑚2\mathcal{O}(N(n^{2}+m^{2}))caligraphic_O ( italic_N ( italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ), so the additional computation required when multiplicative noise is added to the model does not affect the order of complexity of the overall algorithm.

Table 1: The increase in computational complexity of Gaussian filtering with both additive and multiplicative noise in the HMM (4) compared to Gaussian filtering with only additive noise present.
Equation Added flops
Dynamics Eq. (6) Nn𝑁𝑛Nnitalic_N italic_n
Eq. (7) 5Nn2+n25𝑁superscript𝑛2superscript𝑛25Nn^{2}+n^{2}5 italic_N italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Observations Eq. (8) Nm𝑁𝑚Nmitalic_N italic_m
Eq. (9) 5Nm2+m25𝑁superscript𝑚2superscript𝑚25Nm^{2}+m^{2}5 italic_N italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
Eq. (10) Nnm𝑁𝑛𝑚Nnmitalic_N italic_n italic_m
Total:
N(5n2+5m2+nmN(5n^{2}+5m^{2}+nmitalic_N ( 5 italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 5 italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_n italic_m
+n+m)+m2+n2+n+m)+m^{2}+n^{2}+ italic_n + italic_m ) + italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

4.2 Embedding symplectic structure

Next we describe a parameterization strategy for embedding symplectic structure within the learning process that is shown graphically in Fig. 1. The main idea behind this strategy is that rather than parameterizing the time derivatives or the propagator ΨΨ\Psiroman_Ψ directly, we parameterize the Hamiltonian H(𝐪,𝐩,𝜽)𝐻𝐪𝐩𝜽H(\mathbf{q},\mathbf{p},\bm{\theta})italic_H ( bold_q , bold_p , bold_italic_θ ). From this Hamiltonian, the time derivatives are derived from Hamilton’s equations and then integrated with a symplectic integrator. This process ensures that the estimated model will be Hamiltonian and also that its flow will be symplectic. This approach has shown utility in various other works [41, 17, 7, 42].

Parameter value 𝛉𝛉\bm{\uptheta}bold_θ Parameterized Hamiltonian H(𝐪,𝐩,𝛉)𝐻𝐪𝐩𝛉H(\mathbf{q},\mathbf{p},\bm{\uptheta})italic_H ( bold_q , bold_p , bold_θ ) Hamilton’s equations (5) 𝐪˙(𝛉),𝐩˙(𝛉)˙𝐪𝛉˙𝐩𝛉\dot{\mathbf{q}}(\bm{\uptheta}),\dot{\mathbf{p}}(\bm{\uptheta})over˙ start_ARG bold_q end_ARG ( bold_θ ) , over˙ start_ARG bold_p end_ARG ( bold_θ ) Symplectic integrator (16) Structure-preserving propagator Ψ(𝐪,𝐩,𝛉)Ψ𝐪𝐩𝛉\Psi(\mathbf{q},\mathbf{p},\bm{\uptheta})roman_Ψ ( bold_q , bold_p , bold_θ ) Likelihood evaluation π(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT )
Figure 1: Schematic showing how Hamiltonian structure is preserved when evaluating the likelihood.

From the parameterized Hamiltonian, the continuous-time dynamics are given by Hamilton’s equations (5). However, the HMM (4) that we are using is formulated in discrete time. To resolve this discrepancy, we require a propagator ΨΨ\Psiroman_Ψ that can map the state forward in time using the estimated time derivatives without violating the physical properties mentioned in Section 3.2. For this purpose, we utilize a recently-developed explicit symplectic integrator for nonseparable Hamiltonian systems [43] that we refer to as “Tao’s integrator” throughout the paper. We briefly describe the integration method here.

Consider an arbitrary Hamiltonian H(𝐪,𝐩)𝐻𝐪𝐩H(\mathbf{q},\mathbf{p})italic_H ( bold_q , bold_p ) and fictitious position and momentum vectors 𝐪¯¯𝐪\bar{\mathbf{q}}over¯ start_ARG bold_q end_ARG and 𝐩¯¯𝐩\bar{\mathbf{p}}over¯ start_ARG bold_p end_ARG corresponding to 𝐪𝐪\mathbf{q}bold_q and 𝐩𝐩\mathbf{p}bold_p. Next, define the augmented Hamiltonian

H¯(𝐪,𝐩,𝐪¯,𝐩¯):=H(𝐪,𝐩¯)Ha+H(𝐪¯,𝐩)Hb+λ(𝐪𝐪¯22/2+𝐩𝐩¯22/2)Hc,assign¯𝐻𝐪𝐩¯𝐪¯𝐩subscript𝐻𝐪¯𝐩subscript𝐻𝑎subscript𝐻¯𝐪𝐩subscript𝐻𝑏𝜆subscriptsuperscriptsubscriptnorm𝐪¯𝐪222superscriptsubscriptnorm𝐩¯𝐩222subscript𝐻𝑐\bar{H}(\mathbf{q},\mathbf{p},\bar{\mathbf{q}},\bar{\mathbf{p}}):=\underbrace{% H(\mathbf{q},\bar{\mathbf{p}})}_{H_{a}}+\underbrace{H(\bar{\mathbf{q}},\mathbf% {p})}_{H_{b}}+\lambda\cdot\underbrace{\left(\|\mathbf{q}-\bar{\mathbf{q}}\|_{2% }^{2}/2+\|\mathbf{p}-\bar{\mathbf{p}}\|_{2}^{2}/2\right)}_{H_{c}},over¯ start_ARG italic_H end_ARG ( bold_q , bold_p , over¯ start_ARG bold_q end_ARG , over¯ start_ARG bold_p end_ARG ) := under⏟ start_ARG italic_H ( bold_q , over¯ start_ARG bold_p end_ARG ) end_ARG start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT + under⏟ start_ARG italic_H ( over¯ start_ARG bold_q end_ARG , bold_p ) end_ARG start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_λ ⋅ under⏟ start_ARG ( ∥ bold_q - over¯ start_ARG bold_q end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 + ∥ bold_p - over¯ start_ARG bold_p end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 ) end_ARG start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (15)

where Ha:=H(𝐪,𝐩¯)assignsubscript𝐻𝑎𝐻𝐪¯𝐩H_{a}:=H(\mathbf{q},\bar{\mathbf{p}})italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT := italic_H ( bold_q , over¯ start_ARG bold_p end_ARG ) and Hb:=H(𝐪¯,𝐩)assignsubscript𝐻𝑏𝐻¯𝐪𝐩H_{b}:=H(\bar{\mathbf{q}},\mathbf{p})italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT := italic_H ( over¯ start_ARG bold_q end_ARG , bold_p ) correspond to two copies of the original nonseparable Hamiltonian system with mixed-up positions and momenta; Hcsubscript𝐻𝑐H_{c}italic_H start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is an artificial constraint; and λ𝜆\lambdaitalic_λ is a constant that controls the binding of the two copies. Unlike the original Hamiltonian H𝐻Hitalic_H, the extended Hamiltonian H¯¯𝐻\bar{H}over¯ start_ARG italic_H end_ARG is amenable to explicit symplectic integration.

From H¯¯𝐻\bar{H}over¯ start_ARG italic_H end_ARG, a propagator ΨΨ\Psiroman_Ψ is derived using a second-order explicit symplectic method based on Strang splitting

Ψ:=ψHaΔt/2ψHbΔt/2ψλHcΔtψHbΔt/2ψHaΔt/2,assignΨsubscriptsuperscript𝜓Δ𝑡2subscript𝐻𝑎subscriptsuperscript𝜓Δ𝑡2subscript𝐻𝑏subscriptsuperscript𝜓Δ𝑡𝜆subscript𝐻𝑐subscriptsuperscript𝜓Δ𝑡2subscript𝐻𝑏subscriptsuperscript𝜓Δ𝑡2subscript𝐻𝑎\Psi:=\psi^{\Delta t/2}_{H_{a}}\circ\psi^{\Delta t/2}_{H_{b}}\circ\psi^{\Delta t% }_{\lambda H_{c}}\circ\psi^{\Delta t/2}_{H_{b}}\circ\psi^{\Delta t/2}_{H_{a}},roman_Ψ := italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t / 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∘ italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t / 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∘ italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_λ italic_H start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∘ italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t / 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∘ italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t / 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (16)

where ψHaΔt,ψHbΔt,subscriptsuperscript𝜓Δ𝑡subscript𝐻𝑎subscriptsuperscript𝜓Δ𝑡subscript𝐻𝑏\psi^{\Delta t}_{H_{a}},\psi^{\Delta t}_{H_{b}},italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT end_POSTSUBSCRIPT , and ψλHcΔtsubscriptsuperscript𝜓Δ𝑡𝜆subscript𝐻𝑐\psi^{\Delta t}_{\lambda H_{c}}italic_ψ start_POSTSUPERSCRIPT roman_Δ italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_λ italic_H start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT are the time-ΔtΔ𝑡\Delta troman_Δ italic_t flow of Ha,Hb,subscript𝐻𝑎subscript𝐻𝑏H_{a},H_{b},italic_H start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT , and λHc𝜆subscript𝐻𝑐\lambda H_{c}italic_λ italic_H start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. Each flow is evaluated with explicit symplectic Euler substeps, so the result of the composition is an explicit and symplectic propagator as desired.

5 Structure-preserving dimension reduction for high-dimensional systems

In this section, we discuss the approach of probabilistically learning low-dimensional dynamics of high-dimensional systems, including how to preserve Hamiltonian structure throughout the dimension-reduction process. This approach is divided into three steps. The first step in Section 5.1 is defining a structure-preserving propagator in the reduced-dimensional state space, the second step in Section 5.2 is deriving a low-dimensional observation function that determines a reduced-dimensional observation space, and the final step in Section 5.3 is deriving and evaluating a likelihood in this low-dimensional space. This revised process for high-dimensional Hamiltonians is graphically illustrated in Fig. 2.

Parameter value 𝛉nlsubscript𝛉nl\bm{\uptheta}_{\text{nl}}bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPTH-OpInf [44] yields 𝛉quadsubscript𝛉quad\bm{\uptheta}_{\text{quad}}bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT Parameterized low-dimensional Hamiltonian (23) H~(𝐪~,𝐩~,𝛉)~𝐻~𝐪~𝐩𝛉\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\uptheta})over~ start_ARG italic_H end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_θ ) Low-dimensional Hamilton’s equations (24) 𝐪~˙(𝛉),𝐩~˙(𝛉)˙~𝐪𝛉˙~𝐩𝛉\dot{\tilde{\mathbf{q}}}(\bm{\uptheta}),\dot{\tilde{\mathbf{p}}}(\bm{\uptheta})over˙ start_ARG over~ start_ARG bold_q end_ARG end_ARG ( bold_θ ) , over˙ start_ARG over~ start_ARG bold_p end_ARG end_ARG ( bold_θ ) Symplectic integrator (16) Structure-preserving low-dimensional propagator Ψ~(𝐪~,𝐩~,𝛉)~Ψ~𝐪~𝐩𝛉\tilde{\Psi}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\uptheta})over~ start_ARG roman_Ψ end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_θ ) Likelihood evaluation π(𝛉nl|𝒴~N)𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) Low-dimensional observation model h~(𝐪~,𝐩~,𝛉)~~𝐪~𝐩𝛉\tilde{h}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\uptheta})over~ start_ARG italic_h end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_θ )
Figure 2: Schematic showing how Hamiltonian structure is preserved in reduced dimensions when evaluating the likelihood. In this setting, the parameter vector is partitioned into quadratic θquadsubscript𝜃quad\theta_{\text{quad}}italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT and nonlinear components θnlsubscript𝜃nl\theta_{\text{nl}}italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT, where θquadsubscript𝜃quad\theta_{\text{quad}}italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT is determined by θnlsubscript𝜃nl\theta_{\text{nl}}italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT through the H-OpInf algorithm.

5.1 Reduced-dimension dynamics

Reducing the dimension of a dynamical system hinges upon a key hypothesis: that the system of interest is a high-dimensional realization of underlying low-dimensional dynamics. Under this assumption, it is theoretically possible to estimate dynamics with a considerably lower state dimension than the ambient dimension without any loss of accuracy. Estimating the hypothesized low-dimensional dynamics involves two primary components. The first is identifying a low-dimensional subspace on which a significant portion of the state evolution is contained. The other component is estimating a dynamical model whose state evolves within this low-dimensional space. Here we first describe linear dimension reduction for dynamical systems and then expound on alterations that can be made to preserve symplectic structure.

5.1.1 Linear dimension reduction

In this work, we consider linear projections. Let 𝐕n×r𝐕superscript𝑛𝑟\mathbf{V}\in\mathbb{R}^{n\times r}bold_V ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_r end_POSTSUPERSCRIPT and 𝐕n×(nr)subscript𝐕perpendicular-tosuperscript𝑛𝑛𝑟\mathbf{V}_{\perp}\in\mathbb{R}^{n\times(n-r)}bold_V start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × ( italic_n - italic_r ) end_POSTSUPERSCRIPT be projection matrices that project onto complementary subspaces 𝒮,𝒮n𝒮subscript𝒮perpendicular-tosuperscript𝑛\mathcal{S},\mathcal{S}_{\perp}\subseteq\mathbb{R}^{n}caligraphic_S , caligraphic_S start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT ⊆ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with rnmuch-less-than𝑟𝑛r\ll nitalic_r ≪ italic_n. Denoting the components of the state 𝐱ksubscript𝐱𝑘\mathbf{x}_{k}bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT that lie in 𝒮𝒮\mathcal{S}caligraphic_S and 𝒮subscript𝒮perpendicular-to\mathcal{S}_{\perp}caligraphic_S start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT as 𝐱𝒮subscript𝐱𝒮\mathbf{x}_{\mathcal{S}}bold_x start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT and 𝐱𝒮subscript𝐱superscript𝒮perpendicular-to\mathbf{x}_{\mathcal{S}^{\perp}}bold_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, the state can be written as

𝐱k=𝐱k,𝒮+𝐱k,𝒮=𝐕𝐕𝐱k+𝐕𝐕𝐱k=𝐕𝐱~k+𝐕𝐱~k,,subscript𝐱𝑘subscript𝐱𝑘𝒮subscript𝐱𝑘superscript𝒮perpendicular-tosuperscript𝐕𝐕topsubscript𝐱𝑘subscript𝐕perpendicular-tosuperscriptsubscript𝐕perpendicular-totopsubscript𝐱𝑘𝐕subscript~𝐱𝑘subscript𝐕perpendicular-tosubscript~𝐱𝑘perpendicular-to\mathbf{x}_{k}=\mathbf{x}_{k,\mathcal{S}}+\mathbf{x}_{k,\mathcal{S}^{\perp}}=% \mathbf{V}\mathbf{V}^{\top}\mathbf{x}_{k}+\mathbf{V}_{\perp}\mathbf{V}_{\perp}% ^{\top}\mathbf{x}_{k}=\mathbf{V}\tilde{\mathbf{x}}_{k}+\mathbf{V}_{\perp}% \tilde{\mathbf{x}}_{k,\perp},bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_x start_POSTSUBSCRIPT italic_k , caligraphic_S end_POSTSUBSCRIPT + bold_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT = bold_VV start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_V start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT bold_V start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_V over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_V start_POSTSUBSCRIPT ⟂ end_POSTSUBSCRIPT over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k , ⟂ end_POSTSUBSCRIPT , (17)

where 𝐱~krsubscript~𝐱𝑘superscript𝑟\tilde{\mathbf{x}}_{k}\in\mathbb{R}^{r}over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT and 𝐱~k,nrsubscript~𝐱𝑘perpendicular-tosuperscript𝑛𝑟\tilde{\mathbf{x}}_{k,\perp}\in\mathbb{R}^{n-r}over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k , ⟂ end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_n - italic_r end_POSTSUPERSCRIPT are the low-dimensional representations of 𝐱𝒮subscript𝐱𝒮\mathbf{x}_{\mathcal{S}}bold_x start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT and 𝐱𝒮subscript𝐱superscript𝒮perpendicular-to\mathbf{x}_{\mathcal{S}^{\perp}}bold_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT.

Once the subspace has been identified, the next step is to learn low-dimensional dynamics of the form

𝒯~(𝒙~k,𝜽,ω)=Ψ~(𝒙~k,𝜽)+𝝃~k(ω),~𝒯subscript~𝒙𝑘𝜽𝜔~Ψsubscript~𝒙𝑘𝜽subscript~𝝃𝑘𝜔\tilde{\mathcal{T}}(\tilde{\bm{x}}_{k},\bm{\theta},\omega)=\tilde{\Psi}(\tilde% {\bm{x}}_{k},\bm{\theta})+\tilde{\bm{\xi}}_{k}(\omega),over~ start_ARG caligraphic_T end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ , italic_ω ) = over~ start_ARG roman_Ψ end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) + over~ start_ARG bold_italic_ξ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) , (18)

where Ψ~:r×r:~Ψmaps-tosuperscript𝑟superscriptsuperscript𝑟\tilde{\Psi}:\mathbb{R}^{r}\times\mathbb{R}^{\ell}\mapsto\mathbb{R}^{r}over~ start_ARG roman_Ψ end_ARG : blackboard_R start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT is the low-dimensional dynamics propagator and 𝝃~k(ω)rsubscript~𝝃𝑘𝜔superscript𝑟\tilde{\bm{\xi}}_{k}(\omega)\in\mathbb{R}^{r}over~ start_ARG bold_italic_ξ end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_ω ) ∈ blackboard_R start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT represents the model uncertainty of Ψ~~Ψ\tilde{\Psi}over~ start_ARG roman_Ψ end_ARG. Since the form of uncertainty in the model is unknown, we select the additive model for simplicity. We use 𝒙~ksubscript~𝒙𝑘\tilde{\bm{x}}_{k}over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as the uncertain state vector and therefore model it as a random vector. The state 𝐱~k,subscript~𝐱𝑘perpendicular-to\tilde{\mathbf{x}}_{k,\perp}over~ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k , ⟂ end_POSTSUBSCRIPT is omitted from this model as a result of the earlier hypothesis that the dynamics are largely confined to a low-dimensional manifold, in this case 𝒮𝒮\mathcal{S}caligraphic_S. With these dynamics, predictions of the high-dimensional state can be made using 𝒙k=𝐕Ψ~k1(𝒙~1,𝜽)subscript𝒙𝑘𝐕superscript~Ψ𝑘1subscript~𝒙1𝜽\bm{x}_{k}=\mathbf{V}\tilde{\Psi}^{k-1}(\tilde{\bm{x}}_{1},\bm{\theta})bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_V over~ start_ARG roman_Ψ end_ARG start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_θ ), where Ψ~k1superscript~Ψ𝑘1\tilde{\Psi}^{k-1}over~ start_ARG roman_Ψ end_ARG start_POSTSUPERSCRIPT italic_k - 1 end_POSTSUPERSCRIPT denotes k1𝑘1k-1italic_k - 1 compositions of the low-dimensional propagator.

5.1.2 Hamiltonian case

Now we seek to extend this methodology to learn low-dimensional Hamiltonian models arising from discretizations of infinite-dimensional Hamiltonian systems. For this work, we focus on infinite-dimensional canonical Hamiltonian systems with Hamiltonian functionals of the form

[q,p]=(Hquad(q,p,qz,pz,)+Hnl(q,p))dz,𝑞𝑝subscript𝐻quad𝑞𝑝subscript𝑞𝑧subscript𝑝𝑧subscript𝐻nl𝑞𝑝d𝑧\mathcal{H}[q,p]=\int\left(H_{\text{quad}}(q,p,q_{z},p_{z},\cdots)+H_{\text{nl% }}(q,p)\right)\ \text{d}z,caligraphic_H [ italic_q , italic_p ] = ∫ ( italic_H start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ( italic_q , italic_p , italic_q start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , ⋯ ) + italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ( italic_q , italic_p ) ) d italic_z , (19)

where z𝑧zitalic_z is the spatial variable and qz=qzsubscript𝑞𝑧𝑞𝑧q_{z}=\frac{\partial q}{\partial z}italic_q start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT = divide start_ARG ∂ italic_q end_ARG start_ARG ∂ italic_z end_ARG is the partial derivative of q𝑞qitalic_q with respect to z𝑧zitalic_z. The function Hquad(q,p,qz,pz,)subscript𝐻quad𝑞𝑝subscript𝑞𝑧subscript𝑝𝑧H_{\text{quad}}(q,p,q_{z},p_{z},\cdots)italic_H start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ( italic_q , italic_p , italic_q start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , ⋯ ) contains quadratic terms, and Hnl(q,p)subscript𝐻nl𝑞𝑝H_{\text{nl}}(q,p)italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ( italic_q , italic_p ) contains spatially-local nonlinear terms.

To preserve the symplectic structure of the Hamiltonian system, we require that the symplecticity of the Hamiltonian flow is preserved in the reduced-dimensional space. This can be achieved by using the cotangent lift algorithm to find a projection matrix 𝐕𝐕\mathbf{V}bold_V that is symplectic. We briefly describe the cotangent lift algorithm here, but more details can be found in A. Let 𝐱1,,𝐱Nsubscript𝐱1subscript𝐱𝑁\mathbf{x}_{1},\dots,\mathbf{x}_{N}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT be a collection of full-field data collected from the system at times t1,,tNsubscript𝑡1subscript𝑡𝑁t_{1},\dots,t_{N}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT. We define the snapshot matrices

𝐐=[𝐪1𝐪N]d×N,𝐏=[𝐩1𝐩N]d×N.formulae-sequence𝐐matrixsubscript𝐪1subscript𝐪𝑁superscript𝑑𝑁𝐏matrixsubscript𝐩1subscript𝐩𝑁superscript𝑑𝑁\mathbf{Q}=\begin{bmatrix}\mathbf{q}_{1}\cdots\mathbf{q}_{N}\end{bmatrix}\in% \mathbb{R}^{d\times N},\quad\mathbf{P}=\begin{bmatrix}\mathbf{p}_{1}\cdots% \mathbf{p}_{N}\end{bmatrix}\in\mathbb{R}^{d\times N}.bold_Q = [ start_ARG start_ROW start_CELL bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ bold_q start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_N end_POSTSUPERSCRIPT , bold_P = [ start_ARG start_ROW start_CELL bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ bold_p start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_N end_POSTSUPERSCRIPT . (20)

Then we compute a linear symplectic basis

V=[𝚽𝟎𝟎𝚽]2d×2r,Vmatrix𝚽00𝚽superscript2𝑑2𝑟\textbf{V}=\begin{bmatrix}\mathbf{\mathbf{\Phi}}&\mathbf{0}\\ \mathbf{0}&\mathbf{\mathbf{\Phi}}\end{bmatrix}\in\mathbb{R}^{2d\times 2r},V = [ start_ARG start_ROW start_CELL bold_Φ end_CELL start_CELL bold_0 end_CELL end_ROW start_ROW start_CELL bold_0 end_CELL start_CELL bold_Φ end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_d × 2 italic_r end_POSTSUPERSCRIPT , (21)

where 𝚽d×r𝚽superscript𝑑𝑟\mathbf{\mathbf{\Phi}}\in\mathbb{R}^{d\times r}bold_Φ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_r end_POSTSUPERSCRIPT is based on the leading r𝑟ritalic_r left singular vectors of the extended snapshot matrix Xe=[𝐐,𝐏]subscriptX𝑒𝐐𝐏\textbf{X}_{e}=[\mathbf{Q},\mathbf{P}]X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = [ bold_Q , bold_P ].

Next, we propose a parameterization of Ψ~~Ψ\tilde{\Psi}over~ start_ARG roman_Ψ end_ARG that preserves the low-dimensional Hamiltonian structure. We define the low-dimensional position and momentum 𝐪~,𝐩~r~𝐪~𝐩superscript𝑟\tilde{\mathbf{q}},\tilde{\mathbf{p}}\in\mathbb{R}^{r}over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_r end_POSTSUPERSCRIPT as

𝐪~=𝚽𝐪,𝐩~=𝚽𝐩.formulae-sequence~𝐪superscript𝚽top𝐪~𝐩superscript𝚽top𝐩\tilde{\mathbf{q}}=\mathbf{\Phi}^{\top}\mathbf{q},\qquad\tilde{\mathbf{p}}=% \mathbf{\Phi}^{\top}\mathbf{p}.over~ start_ARG bold_q end_ARG = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_q , over~ start_ARG bold_p end_ARG = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_p . (22)

Based on the Hamiltonian form of Eq. (19), we use the parameterization

H~(𝐪~,𝐩~,𝜽)=12𝐪~𝐃~𝐪(𝜽quad)𝐪~+12𝐩~𝐃~𝐩(𝜽quad)𝐩~+H~nl(𝐪~,𝐩~,𝜽nl),~𝐻~𝐪~𝐩𝜽12superscript~𝐪topsubscript~𝐃𝐪subscript𝜽quad~𝐪12superscript~𝐩topsubscript~𝐃𝐩subscript𝜽quad~𝐩subscript~𝐻nl~𝐪~𝐩subscript𝜽nl\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta})=\frac{1}{2}\tilde% {\mathbf{q}}^{\top}\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\theta}_{\text{quad}})% \tilde{\mathbf{q}}+\frac{1}{2}\tilde{\mathbf{p}}^{\top}\tilde{\mathbf{D}}_{% \mathbf{p}}(\bm{\theta}_{\text{quad}})\tilde{\mathbf{p}}+\tilde{H}_{\text{nl}}% (\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}),over~ start_ARG italic_H end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_italic_θ ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG over~ start_ARG bold_q end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) over~ start_ARG bold_q end_ARG + divide start_ARG 1 end_ARG start_ARG 2 end_ARG over~ start_ARG bold_p end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) over~ start_ARG bold_p end_ARG + over~ start_ARG italic_H end_ARG start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) , (23)

where H~nl(𝐪~,𝐩~,𝜽nl)Hnl(𝚽𝐪~,𝚽𝐩~,𝜽nl)subscript~𝐻nl~𝐪~𝐩subscript𝜽nlsubscript𝐻nl𝚽~𝐪𝚽~𝐩subscript𝜽nl\tilde{H}_{\text{nl}}(\tilde{\mathbf{q}},\tilde{\mathbf{p}},\bm{\theta}_{\text% {nl}})\coloneqq H_{\text{nl}}(\mathbf{\Phi}\tilde{\mathbf{q}},\mathbf{\Phi}% \tilde{\mathbf{p}},\bm{\theta}_{\text{nl}})over~ start_ARG italic_H end_ARG start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG , bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) ≔ italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ( bold_Φ over~ start_ARG bold_q end_ARG , bold_Φ over~ start_ARG bold_p end_ARG , bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) defines the nonlinear terms, and 𝐃~𝐪subscript~𝐃𝐪\tilde{\mathbf{D}}_{\mathbf{q}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐃~𝐩subscript~𝐃𝐩\tilde{\mathbf{D}}_{\mathbf{p}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT are symmetric matrices defining the quadratic terms. We have partitioned the parameter vector 𝜽=[𝜽quad𝜽nl]𝜽superscriptmatrixsuperscriptsubscript𝜽quadtopsuperscriptsubscript𝜽nltoptop\bm{\theta}=\begin{bmatrix}\bm{\theta}_{\text{quad}}^{\top}&\bm{\theta}_{\text% {nl}}^{\top}\end{bmatrix}^{\top}bold_italic_θ = [ start_ARG start_ROW start_CELL bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL start_CELL bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT into components pertaining to the quadratic and nonlinear terms of the Hamiltonian to distinguish between them. Following Hamilton’s equations (5), the governing equations for this low-dimensional Hamiltonian system are

𝐪~˙(𝜽)=H~𝐩~=𝐃~𝐩(𝜽quad)𝐩~+𝚽𝐟𝐪(𝚽𝐪~,𝚽𝐩~,𝜽nl),𝐩~˙(𝜽)=H~𝐪~=𝐃~𝐪(𝜽quad)𝐪~𝚽𝐟𝐪(𝚽𝐪~,𝚽𝐩~,𝜽nl).formulae-sequence˙~𝐪𝜽~𝐻~𝐩subscript~𝐃𝐩subscript𝜽quad~𝐩superscript𝚽topsubscript𝐟𝐪𝚽~𝐪𝚽~𝐩subscript𝜽nl˙~𝐩𝜽~𝐻~𝐪subscript~𝐃𝐪subscript𝜽quad~𝐪superscript𝚽topsubscript𝐟𝐪𝚽~𝐪𝚽~𝐩subscript𝜽nl\dot{\tilde{\mathbf{q}}}(\bm{\theta})=\frac{\partial\tilde{H}}{\partial\tilde{% \mathbf{p}}}=\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\theta}_{\text{quad}})\tilde{% \mathbf{p}}+\mathbf{\Phi}^{\top}\mathbf{f}_{\mathbf{q}}(\mathbf{\Phi}\tilde{% \mathbf{q}},\mathbf{\Phi}\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}),\qquad% \dot{\tilde{\mathbf{p}}}(\bm{\theta})=-\frac{\partial\tilde{H}}{\partial\tilde% {\mathbf{q}}}=-\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\theta}_{\text{quad}})% \tilde{\mathbf{q}}-\mathbf{\Phi}^{\top}\mathbf{f}_{\mathbf{q}}(\mathbf{\Phi}% \tilde{\mathbf{q}},\mathbf{\Phi}\tilde{\mathbf{p}},\bm{\theta}_{\text{nl}}).over˙ start_ARG over~ start_ARG bold_q end_ARG end_ARG ( bold_italic_θ ) = divide start_ARG ∂ over~ start_ARG italic_H end_ARG end_ARG start_ARG ∂ over~ start_ARG bold_p end_ARG end_ARG = over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) over~ start_ARG bold_p end_ARG + bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_Φ over~ start_ARG bold_q end_ARG , bold_Φ over~ start_ARG bold_p end_ARG , bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) , over˙ start_ARG over~ start_ARG bold_p end_ARG end_ARG ( bold_italic_θ ) = - divide start_ARG ∂ over~ start_ARG italic_H end_ARG end_ARG start_ARG ∂ over~ start_ARG bold_q end_ARG end_ARG = - over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) over~ start_ARG bold_q end_ARG - bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_Φ over~ start_ARG bold_q end_ARG , bold_Φ over~ start_ARG bold_p end_ARG , bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) . (24)

Numerically integrating these equations with Tao’s integrator as in Section 4.2 yields a structure-preserving propagator Ψ~~Ψ\tilde{\Psi}over~ start_ARG roman_Ψ end_ARG. With the Hamiltonian operator inference method H-OpInf [44], 𝜽quadsubscript𝜽quad\bm{\theta}_{\text{quad}}bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT can be determined in a straightforward fashion, leaving only 𝜽nlsubscript𝜽nl\bm{\theta}_{\text{nl}}bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT to be estimated. More details on this approach will be given in Section 5.3.

5.2 Reduced-dimension observations

Now that the state dimension has been reduced, we require an observation function h~~\tilde{h}over~ start_ARG italic_h end_ARG defined over this reduced-dimensional state. Furthermore, if the observations themselves are high-dimensional, the high computational complexity of Gaussian filtering may require reduction of the observations in addition to the states. We describe this approach in the setting of identity observation operators for simplicity of presentation.

To formulate the measurement function in terms of 𝒙~ksubscript~𝒙𝑘\tilde{\bm{x}}_{k}over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, we first decompose the full-dimensional measurements in terms of 𝒙𝒮subscript𝒙𝒮\bm{x}_{\mathcal{S}}bold_italic_x start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT and 𝒙𝒮subscript𝒙superscript𝒮perpendicular-to\bm{x}_{\mathcal{S}^{\perp}}bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT. Assuming that the measurement function hhitalic_h in the observation model (4b) is the identity, we have

𝒚k=𝒙k𝒗k+𝜼k=(𝒙k,𝒮+𝒙k,𝒮)𝒗k+𝜼k=(𝐕𝒙~k+𝒙k,𝒮)𝒗k+𝜼k.subscript𝒚𝑘direct-productsubscript𝒙𝑘subscript𝒗𝑘subscript𝜼𝑘direct-productsubscript𝒙𝑘𝒮subscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝒗𝑘subscript𝜼𝑘direct-product𝐕subscript~𝒙𝑘subscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝒗𝑘subscript𝜼𝑘\bm{y}_{k}=\bm{x}_{k}\odot\bm{v}_{k}+\bm{\eta}_{k}=\left(\bm{x}_{k,\mathcal{S}% }+\bm{x}_{k,\mathcal{S}^{\perp}}\right)\odot\bm{v}_{k}+\bm{\eta}_{k}=\left(% \mathbf{V}\tilde{\bm{x}}_{k}+\bm{x}_{k,\mathcal{S}^{\perp}}\right)\odot\bm{v}_% {k}+\bm{\eta}_{k}.bold_italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ⊙ bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ( bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S end_POSTSUBSCRIPT + bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ⊙ bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ( bold_V over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ⊙ bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT . (25)

The next step is to transform 𝐲𝐲\mathbf{y}bold_y into a low-dimensional form. A natural choice for this step is to project 𝐲𝐲\mathbf{y}bold_y onto the subspace spanned by the low-dimensional dynamics to produce the low-dimensional measurements 𝐲~=𝐕𝐲~𝐲superscript𝐕top𝐲\tilde{\mathbf{y}}=\mathbf{V}^{\top}\mathbf{y}over~ start_ARG bold_y end_ARG = bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_y. This induces the low-dimensional measurement model

~(𝒙~k,ω)=𝐕𝒚k=𝐕((𝐕𝒙~k+𝒙k,𝒮)𝒗k)+𝐕𝜼k,~subscript~𝒙𝑘𝜔superscript𝐕topsubscript𝒚𝑘superscript𝐕topdirect-product𝐕subscript~𝒙𝑘subscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝒗𝑘superscript𝐕topsubscript𝜼𝑘\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)=\mathbf{V}^{\top}\bm{y}_{k}=% \mathbf{V}^{\top}\Bigl{(}\left(\mathbf{V}\tilde{\bm{x}}_{k}+\bm{x}_{k,\mathcal% {S}^{\perp}}\right)\odot\bm{v}_{k}\Bigr{)}+\mathbf{V}^{\top}\bm{\eta}_{k},over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ω ) = bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ( bold_V over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ⊙ bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , (26)

which produces a collection of low-dimensional data 𝒴~N={𝐕𝐲k|k=1,,N}subscript~𝒴𝑁conditional-setsuperscript𝐕topsubscript𝐲𝑘𝑘1𝑁\tilde{\mathcal{Y}}_{N}=\{\mathbf{V}^{\top}\mathbf{y}_{k}|k=1,\ldots,N\}over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = { bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_k = 1 , … , italic_N }. This gives rise to a modified posterior distribution π(𝛉|𝒴~N)𝜋conditional𝛉subscript~𝒴𝑁\pi(\bm{\uptheta}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ). In general, this distribution is only an approximation of the original distribution π(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ), but it can be computed much more efficiently than the original, as we will show in Section 5.3.

In the case of only additive noise, the additive form of the measurement model is preserved

~(𝒙~k,ω)=𝒙~k+𝐕(𝒙k,𝒮+𝜼k),~subscript~𝒙𝑘𝜔subscript~𝒙𝑘superscript𝐕topsubscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝜼𝑘\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)=\tilde{\bm{x}}_{k}+\mathbf{V}^{% \top}(\bm{x}_{k,\mathcal{S}^{\perp}}+\bm{\eta}_{k}),over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ω ) = over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) , (27)

where

𝐕(𝒙k,𝒮+𝜼k)𝒩(𝐕𝔼[𝒙k,𝒮],𝐕(𝕍ar[𝒙k,𝒮]+𝚪+)𝐕)similar-tosuperscript𝐕topsubscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝜼𝑘𝒩superscript𝐕top𝔼delimited-[]subscript𝒙𝑘superscript𝒮perpendicular-tosuperscript𝐕top𝕍ardelimited-[]subscript𝒙𝑘superscript𝒮perpendicular-tosubscript𝚪𝐕\mathbf{V}^{\top}(\bm{x}_{k,\mathcal{S}^{\perp}}+\bm{\eta}_{k})\sim\mathcal{N}% \Big{(}\mathbf{V}^{\top}\mathbb{E}\left[\bm{x}_{k,\mathcal{S}^{\perp}}\right],% \mathbf{V}^{\top}(\mathbb{V}\text{ar}\left[\bm{x}_{k,\mathcal{S}^{\perp}}% \right]+\mathbf{\Gamma}_{+})\mathbf{V}\Big{)}bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + bold_italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∼ caligraphic_N ( bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT blackboard_E [ bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ] , bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( blackboard_V ar [ bold_italic_x start_POSTSUBSCRIPT italic_k , caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ] + bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ) bold_V ) (28)

can be treated as a non-stationary additive noise term with unknown mean and covariance. This form allows for Gaussian filtering through simple application of the Kalman filter. In the general form of Eq. (26), however, the noise can no longer be modeled by an additive and multiplicative model, and Eqs. (8) and (9) for evaluating the mean and covariance of \mathcal{M}caligraphic_M can not be applied to ~~\tilde{\mathcal{M}}over~ start_ARG caligraphic_M end_ARG. Instead, the first two moments of ~~\tilde{\mathcal{M}}over~ start_ARG caligraphic_M end_ARG are given as

𝔼[~(𝒙~,ω)]𝔼delimited-[]~~𝒙𝜔\displaystyle\mathbb{E}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},\omega)\right]blackboard_E [ over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG , italic_ω ) ] =𝐕((𝐕𝔼[𝒙~]+𝒙𝒮)𝐯¯),absentsuperscript𝐕topdirect-product𝐕𝔼delimited-[]~𝒙subscript𝒙superscript𝒮perpendicular-to¯𝐯\displaystyle=\mathbf{V}^{\top}\Big{(}(\mathbf{V}\mathbb{E}\left[\tilde{\bm{x}% }\right]+\bm{x}_{\mathcal{S}^{\perp}})\odot\bar{\mathbf{v}}\Big{)},= bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( ( bold_V blackboard_E [ over~ start_ARG bold_italic_x end_ARG ] + bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ⊙ over¯ start_ARG bold_v end_ARG ) , (29a)
𝕍ar[~(𝒙~,ω)]𝕍ardelimited-[]~~𝒙𝜔\displaystyle\mathbb{V}\text{ar}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},% \omega)\right]blackboard_V ar [ over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG , italic_ω ) ] =𝐕(𝕍ar[(𝐕𝒙~+𝒙𝒮)𝒗]+𝚪+)𝐕.absentsuperscript𝐕top𝕍ardelimited-[]direct-product𝐕~𝒙subscript𝒙superscript𝒮perpendicular-to𝒗subscript𝚪𝐕\displaystyle=\mathbf{V}^{\top}\Big{(}\mathbb{V}\text{ar}\left[(\mathbf{V}% \tilde{\bm{x}}+\bm{x}_{\mathcal{S}^{\perp}})\odot\bm{v}\right]+\mathbf{\Gamma}% _{+}\Big{)}\mathbf{V}.= bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ( blackboard_V ar [ ( bold_V over~ start_ARG bold_italic_x end_ARG + bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) ⊙ bold_italic_v ] + bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT ) bold_V . (29b)

Due to the presence of the unknown 𝒙𝒮subscript𝒙superscript𝒮perpendicular-to\bm{x}_{\mathcal{S}^{\perp}}bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT term, these moments are not computable. However, 𝒙𝒮subscript𝒙superscript𝒮perpendicular-to\bm{x}_{\mathcal{S}^{\perp}}bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT can be assumed to be small due to the initial hypothesis that 𝒙𝒙\bm{x}bold_italic_x is mostly contained in 𝒮𝒮\mathcal{S}caligraphic_S. By neglecting 𝒙𝒮subscript𝒙superscript𝒮perpendicular-to\bm{x}_{\mathcal{S}^{\perp}}bold_italic_x start_POSTSUBSCRIPT caligraphic_S start_POSTSUPERSCRIPT ⟂ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT, Eq. (29) becomes

𝔼[~(𝒙~,ω)]𝔼delimited-[]~~𝒙𝜔\displaystyle\mathbb{E}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}},\omega)\right]blackboard_E [ over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG , italic_ω ) ] =𝔼[𝒙~]𝐯¯,absentdirect-product𝔼delimited-[]~𝒙¯𝐯\displaystyle=\mathbb{E}\left[\tilde{\bm{x}}\right]\odot\bar{\mathbf{v}},= blackboard_E [ over~ start_ARG bold_italic_x end_ARG ] ⊙ over¯ start_ARG bold_v end_ARG , (30a)
𝕍ar[~(𝒙~k,ω)]𝕍ardelimited-[]~subscript~𝒙𝑘𝜔\displaystyle\mathbb{V}\text{ar}\left[\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},% \omega)\right]blackboard_V ar [ over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ω ) ] 𝕍ar[𝒙~k]+𝚪(𝜽),absent𝕍ardelimited-[]subscript~𝒙𝑘𝚪𝜽\displaystyle\approx\mathbb{V}\text{ar}\left[\tilde{\bm{x}}_{k}\right]+\mathbf% {\Gamma}(\bm{\theta}),≈ blackboard_V ar [ over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ] + bold_Γ ( bold_italic_θ ) , (30b)

where the uncertainty due to 𝒗𝒗\bm{v}bold_italic_v in (29b) has been absorbed into an estimated stationary additive term 𝚪(𝜽)𝚪𝜽\mathbf{\Gamma}(\bm{\theta})bold_Γ ( bold_italic_θ ) in (30b). In this simplified form, both moments can be computed straightforwardly with a standard filtering procedure.

When 𝐯¯=𝟏¯𝐯1\bar{\mathbf{v}}=\mathbf{1}over¯ start_ARG bold_v end_ARG = bold_1, the approximation (30) is equivalent to assuming the low-dimensional measurement model

~(𝒙~k,ω)h~(𝒙~k)+𝜼~k,𝜼~k𝒩(𝟎,𝚪(𝜽)),formulae-sequence~subscript~𝒙𝑘𝜔~subscript~𝒙𝑘subscript~𝜼𝑘similar-tosubscript~𝜼𝑘𝒩0𝚪𝜽\tilde{\mathcal{M}}(\tilde{\bm{x}}_{k},\omega)\approx\tilde{h}(\tilde{\bm{x}}_% {k})+\tilde{\bm{\eta}}_{k},\quad\tilde{\bm{\eta}}_{k}\sim\mathcal{N}(\mathbf{0% },\mathbf{\Gamma}(\bm{\theta})),over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_ω ) ≈ over~ start_ARG italic_h end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + over~ start_ARG bold_italic_η end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over~ start_ARG bold_italic_η end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_Γ ( bold_italic_θ ) ) , (31)

where h~(𝒙~k)=𝒙~k~subscript~𝒙𝑘subscript~𝒙𝑘\tilde{h}(\tilde{\bm{x}}_{k})=\tilde{\bm{x}}_{k}over~ start_ARG italic_h end_ARG ( over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = over~ start_ARG bold_italic_x end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. This model will be shown to yield acceptable accuracy on an example problem in Section 6.5.

5.3 Reduced-dimension likelihood evaluation

Lastly, we require a method for evaluating the likelihood of the low-dimensional Hamiltonian (23). With the symmetry constraints on 𝐃~𝐪subscript~𝐃𝐪\tilde{\mathbf{D}}_{\mathbf{q}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐃~𝐩subscript~𝐃𝐩\tilde{\mathbf{D}}_{\mathbf{p}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT, the dimension of 𝜽quad(ω)subscript𝜽quad𝜔\bm{\theta}_{\text{quad}}(\omega)bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ( italic_ω ) is r(r+1)𝑟𝑟1r(r+1)italic_r ( italic_r + 1 ). This dimension grows polynomially with r𝑟ritalic_r, and even for rnmuch-less-than𝑟𝑛r\ll nitalic_r ≪ italic_n, this scaling can be cumbersome for nonlinear optimization methods. Fortunately, H-OpInf [44] provides a method for estimating 𝜽quadsubscript𝜽quad\bm{\theta}_{\text{quad}}bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT with linear optimization when Hnlsubscript𝐻nlH_{\text{nl}}italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT is available. This procedure is described in more detail in B. By treating 𝜽quadsubscript𝜽quad\bm{\theta}_{\text{quad}}bold_italic_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT as a deterministic function of 𝜽nlsubscript𝜽nl\bm{\theta}_{\text{nl}}bold_italic_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT defined through the H-OpInf procedure, the posterior can be simplified as follows

π(𝛉|𝒴~N)=π(𝛉nl|𝒴~N)π(𝛉quad|𝛉nl,𝒴~N)=π(𝛉nl|𝒴~N)δ𝛉nl(𝛉quad),𝜋conditional𝛉subscript~𝒴𝑁𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁𝜋conditionalsubscript𝛉quadsubscript𝛉nlsubscript~𝒴𝑁𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁subscript𝛿subscript𝛉nlsubscript𝛉quad\pi(\bm{\uptheta}|\tilde{\mathcal{Y}}_{N})=\pi(\bm{\uptheta}_{\text{nl}}|% \tilde{\mathcal{Y}}_{N})\pi(\bm{\uptheta}_{\text{quad}}|\bm{\uptheta}_{\text{% nl}},\tilde{\mathcal{Y}}_{N})=\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}% }_{N})\delta_{\bm{\uptheta}_{\text{nl}}}(\bm{\uptheta}_{\text{quad}}),italic_π ( bold_θ | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) = italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) italic_π ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT | bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT , over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) = italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) italic_δ start_POSTSUBSCRIPT bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) , (32)

where δ𝛉nl(𝛉quad)subscript𝛿subscript𝛉nlsubscript𝛉quad\delta_{\bm{\uptheta}_{\text{nl}}}(\bm{\uptheta}_{\text{quad}})italic_δ start_POSTSUBSCRIPT bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) takes the value 1 if 𝛉quadsubscript𝛉quad\bm{\uptheta}_{\text{quad}}bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT is computed from 𝛉nlsubscript𝛉nl\bm{\uptheta}_{\text{nl}}bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT with H-OpInf and 0 otherwise. Therefore, we only need a method for evaluating the likelihood π(𝛉nl|𝒴~N)𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ). The pdf π(𝛉nl|𝒴~N)𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) can be evaluated by applying Algorithm 1 to the HMM from (1) with dynamics 𝒯~~𝒯\tilde{\mathcal{T}}over~ start_ARG caligraphic_T end_ARG from (18) and measurements ~~\tilde{\mathcal{M}}over~ start_ARG caligraphic_M end_ARG from (26). The full procedure for this evaluation is given in Algorithm 2 and illustrated in Fig. 2.

Algorithm 2 Evaluating the reduced-dimensional likelihood of a high-dimensional Hamiltonian system
1:Full-field data 𝒴Nsubscript𝒴𝑁\mathcal{Y}_{N}caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT
2:      Reduced dimension r𝑟ritalic_r
3:      Parameter vector 𝛉nlsubscript𝛉nl\bm{\uptheta}_{\text{nl}}bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT
4:Evaluation of π(𝛉nl|𝒴~N)𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT )
5:Pre-processing
6:Assemble data into snapshot matrices
𝐐=[||||𝐪1𝐪2𝐪N||||],𝐏=[||||𝐩1𝐩2𝐩N||||]formulae-sequence𝐐matrix||||subscript𝐪1subscript𝐪2subscript𝐪𝑁||||𝐏matrix||||subscript𝐩1subscript𝐩2subscript𝐩𝑁||||\mathbf{Q}=\begin{bmatrix}|&|&|&|\\ \mathbf{q}_{1}&\mathbf{q}_{2}&\cdots&\mathbf{q}_{N}\\ |&|&|&|\end{bmatrix},\quad\mathbf{P}=\begin{bmatrix}|&|&|&|\\ \mathbf{p}_{1}&\mathbf{p}_{2}&\cdots&\mathbf{p}_{N}\\ |&|&|&|\end{bmatrix}bold_Q = [ start_ARG start_ROW start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL end_ROW start_ROW start_CELL bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL bold_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL bold_q start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL end_ROW end_ARG ] , bold_P = [ start_ARG start_ROW start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL end_ROW start_ROW start_CELL bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL start_CELL bold_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL bold_p start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL start_CELL | end_CELL end_ROW end_ARG ]
7:Compute projection matrix with Eq. (21)
𝐕cotangent_lift(𝐐,𝐏,r)𝐕cotangent_lift𝐐𝐏𝑟\mathbf{V}\leftarrow\texttt{cotangent\_lift}(\mathbf{Q},\mathbf{P},r)bold_V ← cotangent_lift ( bold_Q , bold_P , italic_r )
8:Define 𝒴~N={𝐕𝐲k|k=1,,N}subscript~𝒴𝑁conditional-setsuperscript𝐕topsubscript𝐲𝑘𝑘1𝑁\tilde{\mathcal{Y}}_{N}=\{\mathbf{V}^{\top}\mathbf{y}_{k}|k=1,\ldots,N\}over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT = { bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_k = 1 , … , italic_N }
9:Evaluation
10:Solve for 𝐃~𝐪(𝛉quad)subscript~𝐃𝐪subscript𝛉quad\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\uptheta}_{\text{quad}})over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) and 𝐃~𝐩(𝛉quad)subscript~𝐃𝐩subscript𝛉quad\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\uptheta}_{\text{quad}})over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) using 𝛉nlsubscript𝛉nl\bm{\uptheta}_{\text{nl}}bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT with H-OpInf from B
11:Define low-dimensional propagator Ψ~~Ψ\tilde{\Psi}over~ start_ARG roman_Ψ end_ARG using Tao’s integrator (16) and symplectic time derivatives (24)
12:Define low-dimensional observations h~(𝒙~)=𝒙~~~𝒙~𝒙\tilde{h}(\tilde{\bm{x}})=\tilde{\bm{x}}over~ start_ARG italic_h end_ARG ( over~ start_ARG bold_italic_x end_ARG ) = over~ start_ARG bold_italic_x end_ARG
13:Evaluate π(𝛉nl|𝒴~N)𝜋conditionalsubscript𝛉nlsubscript~𝒴𝑁\pi(\bm{\uptheta}_{\text{nl}}|\tilde{\mathcal{Y}}_{N})italic_π ( bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT | over~ start_ARG caligraphic_Y end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) using Algorithm 1 with dynamics 𝒯~~𝒯\tilde{\mathcal{T}}over~ start_ARG caligraphic_T end_ARG (18) and observation model ~(𝐱~)~~𝐱\tilde{\mathcal{M}}(\tilde{\mathbf{x}})over~ start_ARG caligraphic_M end_ARG ( over~ start_ARG bold_x end_ARG ) (31)

6 Numerical experiments

In this section, we study the numerical performance of the proposed Bayesian system ID approach for three Hamiltonian models with increasing levels of complexity. In Section 6.1, we describe an existing method whose model parameterization we implement within the proposed Bayesian approach for comparison. Then in Section 6.2, we give an overview of training considerations that are used throughout each numerical experiment. The final three sections contain the numerical examples. Each experiment and its distinct contribution beyond the authors’ previous work [17] is outlined as follows. In Section 6.3 we consider a polynomial nonseparable Hamiltonian system from [43] to demonstrate the advantages of the proposed Bayesian approach compared to a state-of-the-art neural network-based method when the data are sparse and/or noisy. In Section 6.4 we consider the double pendulum system to show the effectiveness of the proposed approach for non-polynomial, nonseparable Hamiltonian models in the chaotic regime. Finally, in Section 6.5 we consider high-dimensional data with multiplicative noise from the nonlinear Schrödinger equation and employ the novel Algorithm 2 for learning a high-dimensional Hamiltonian model.

6.1 Parameterization using deep neural networks

The Bayesian approach presented in this paper is flexible enough that it can be paired with any arbitrary parameterization of the dynamics propagator ΨΨ\Psiroman_Ψ. Here, we choose a recently-developed neural network architecture for learning nonseparable Hamiltonian systems known as the nonseparable symplectic neural network (NSSNN) [40] as our parameterization. This approach parameterizes the Hamiltonian using a neural network, evaluates Hamilton’s equations using auto-differentiation, and integrates the equations with Tao’s integrator, see Section 4.2. In Sections 6.3 and 6.4, we parameterize ΨΨ\Psiroman_Ψ using the NSSNN and compare the performance of this neural network when it is trained with the negative log posterior as its objective against when it is trained with the objective recommended in its original paper [40]. The goal of these experiments is to investigate the effect of each objective on the quality of the estimated model when the training set consists of sparse and noisy data.

Here we briefly describe the NSSNN architecture and training procedure from [40]. The neural network parameterizing the Hamiltonian is composed of six linear layers with sigmoid activation functions following all but the last layer. The training data is a set of input-output pairs. The inputs (𝐪1(j),𝐩1(j))superscriptsubscript𝐪1𝑗superscriptsubscript𝐩1𝑗(\mathbf{q}_{1}^{(j)},\mathbf{p}_{1}^{(j)})( bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT , bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) are a collection of measurements at times t1(j)superscriptsubscript𝑡1𝑗t_{1}^{(j)}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT, and the outputs (𝐪(j),𝐩(j))superscript𝐪𝑗superscript𝐩𝑗(\mathbf{q}^{(j)},\mathbf{p}^{(j)})( bold_q start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT , bold_p start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) are measurements at times t(j)=t1(j)+Tsuperscript𝑡𝑗superscriptsubscript𝑡1𝑗𝑇t^{(j)}=t_{1}^{(j)}+Titalic_t start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT = italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT + italic_T for j=1,N𝑗1𝑁j=1,\ldots Nitalic_j = 1 , … italic_N. For each j𝑗jitalic_j, the input and output are collected from the same trajectory and are separated by a time length of T𝑇Titalic_T. Since the integrator returns both the physical 𝐪𝐪\mathbf{q}bold_q, 𝐩𝐩\mathbf{p}bold_p and fictitious 𝐪¯¯𝐪\bar{\mathbf{q}}over¯ start_ARG bold_q end_ARG, 𝐩¯¯𝐩\bar{\mathbf{p}}over¯ start_ARG bold_p end_ARG position and momentum, as described in Section 4.2, the loss function contains a term corresponding to each of these variables. An L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss is used since it was empirically shown by [40] to yield better results:

𝒥(𝛉)=1Nj=1N𝐪(j)𝐪^(j)(𝛉)1+𝐩(j)𝐩^(j)(𝛉)1+𝐪(j)𝐪¯^(j)(𝛉)1+𝐩(j)𝐩¯^(j)(𝛉)1,𝒥𝛉1𝑁superscriptsubscript𝑗1𝑁subscriptdelimited-∥∥superscript𝐪𝑗superscript^𝐪𝑗𝛉1subscriptdelimited-∥∥superscript𝐩𝑗superscript^𝐩𝑗𝛉1subscriptdelimited-∥∥superscript𝐪𝑗superscript^¯𝐪𝑗𝛉1subscriptdelimited-∥∥superscript𝐩𝑗superscript^¯𝐩𝑗𝛉1\mathcal{J}(\bm{\uptheta})=\frac{1}{N}\sum_{j=1}^{N}\left\lVert\mathbf{q}^{(j)% }-\hat{\mathbf{q}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left\lVert\mathbf{p}^% {(j)}-\hat{\mathbf{p}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left\lVert\mathbf% {q}^{(j)}-\hat{\bar{\mathbf{q}}}^{(j)}(\bm{\uptheta})\right\rVert_{1}+\left% \lVert\mathbf{p}^{(j)}-\hat{\bar{\mathbf{p}}}^{(j)}(\bm{\uptheta})\right\rVert% _{1},caligraphic_J ( bold_θ ) = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∥ bold_q start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT - over^ start_ARG bold_q end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ( bold_θ ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∥ bold_p start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT - over^ start_ARG bold_p end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ( bold_θ ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∥ bold_q start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT - over^ start_ARG over¯ start_ARG bold_q end_ARG end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ( bold_θ ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ∥ bold_p start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT - over^ start_ARG over¯ start_ARG bold_p end_ARG end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ( bold_θ ) ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , (33)

where the hat ^^absent\hat{}over^ start_ARG end_ARG denotes an estimated value. In the NSSNN training procedure, the weights of each layer are initialized with Xavier initialization, and the Adam optimizer [45] is used to minimize the loss. The optimizer’s hyperparameters (the learning rate and beta values β1subscript𝛽1\beta_{1}italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, β2subscript𝛽2\beta_{2}italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) are problem-dependent and will be reported within each numerical experiment’s subsection. Aside from the loss function, we follow the same procedure for training the NSSNN with the negative log posterior. The NSSNN and training procedure are implemented in PyTorch [46] for its auto-differentiation capabilities.

6.2 Training considerations

In each of these experiments, we consider only single-trajectory training data. To generate the data, we run a numerical solver with a fine timestep ΔtfΔsubscript𝑡𝑓\Delta t_{f}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT to achieve high numerical accuracy. The training data are then subsampled from this solution using an interval Δtt>ΔtfΔsubscript𝑡𝑡Δsubscript𝑡𝑓\Delta t_{t}>\Delta t_{f}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT > roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. During training, the models use the larger timestep ΔttΔsubscript𝑡𝑡\Delta t_{t}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT for prediction, which helps in improving the computational efficiency at the expense of larger numerical error. Once training is complete, the estimated models are tested by simulating with the fine timestep ΔtfΔsubscript𝑡𝑓\Delta t_{f}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. Using a smaller timestep during testing ensures that the learned models have captured the continuous dynamics of the system and not only a discrete mapping at a single timestep [47].

For each of these experiments, we place an improper uniform prior, i.e., constant probability density, on the dynamics’ model parameters for the sake of direct comparison with the non-Bayesian method. We do, however, place weakly informative half-normal priors on the process noise variance parameters to enforce positivity and yield better convergence. The output model and the measurement noise variance are assumed to be known and are therefore fixed unless stated otherwise.

6.3 Tao’s example

The first system that we consider is a one-degree-of-freedom system used in [43] with the Hamiltonian

H(q,p)=12(q2+1)(p2+1).𝐻𝑞𝑝12superscript𝑞21superscript𝑝21H(q,p)=\frac{1}{2}(q^{2}+1)(p^{2}+1).italic_H ( italic_q , italic_p ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( italic_q start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 1 ) ( italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 1 ) . (34)

For this experiment, we seek to assess the performances of the negative log posterior and the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective for training the NSSNN as the training data become fewer and noisier.

6.3.1 Data generation and training

We generate a collection of datasets of varying noise level and size for training. To generate the training data, we use Tao’s integrator with initial condition 𝐱1=[03]subscript𝐱1superscriptmatrix03top\mathbf{x}_{1}=\begin{bmatrix}0&-3\end{bmatrix}^{\top}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL - 3 end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT and timesteps Δtf=103Δsubscript𝑡𝑓superscript103\Delta t_{f}=10^{-3}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT and Δtt=102Δsubscript𝑡𝑡superscript102\Delta t_{t}=10^{-2}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT. The number of data points N𝑁Nitalic_N in each dataset takes the values N=300,400,,1000𝑁3004001000N=300,400,\ldots,1000italic_N = 300 , 400 , … , 1000 corresponding to training intervals of length NΔtt𝑁Δsubscript𝑡𝑡N\Delta t_{t}italic_N roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The period for this system is approximately 3.26, so at this ΔttΔsubscript𝑡𝑡\Delta t_{t}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, the dataset timespans range from slightly under a single period to just over three periods. Then, we corrupt the datasets with multiplicative noise 𝒗k𝒰[1a,1+a]similar-tosubscript𝒗𝑘𝒰1𝑎1𝑎\bm{v}_{k}\sim\mathcal{U}[1-a,1+a]bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∼ caligraphic_U [ 1 - italic_a , 1 + italic_a ] for a=0.00,0.01,,0.10𝑎0.000.010.10a=0.00,0.01,\ldots,0.10italic_a = 0.00 , 0.01 , … , 0.10. The mean and variance of this noise are 𝐯¯=𝟏¯𝐯1\bar{\mathbf{v}}=\mathbf{1}over¯ start_ARG bold_v end_ARG = bold_1 and 𝚪=a23𝐈2dsubscript𝚪direct-productsuperscript𝑎23subscript𝐈2𝑑\mathbf{\Gamma}_{\odot}=\frac{a^{2}}{3}\mathbf{I}_{2d}bold_Γ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT = divide start_ARG italic_a start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 3 end_ARG bold_I start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT. The final collection of datasets includes every (a,N)𝑎𝑁(a,N)( italic_a , italic_N ) combination for a total 88 datasets.

To train the NSSNN, we use an initial learning rate of 0.05 and beta values β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9 and β2=0.999subscript𝛽20.999\beta_{2}=0.999italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.999. Each model is trained for 400 epochs with the learning rate multiplied by 0.8 every 20 epochs. Following the approach presented in the original NSSNN paper [40], a batch size of 512 is used with the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective whenever applicable. For the negative log posterior, the process noise variance is parameterized as 𝚺+=𝜽𝚺+𝐈2dsubscript𝚺subscript𝜽subscript𝚺subscript𝐈2𝑑\mathbf{\Sigma}_{+}=\bm{\theta}_{\mathbf{\Sigma}_{+}}\mathbf{I}_{2d}bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_I start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT. The scalar optimization parameter θ𝚺+subscript𝜃subscript𝚺\theta_{\mathbf{\Sigma}_{+}}italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT is initialized at a value of 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT, and a learning rate of 0.5θ𝚺+0.5subscript𝜃subscript𝚺0.5\theta_{\mathbf{\Sigma}_{+}}0.5 italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT and beta values β1,β2=0.1subscript𝛽1subscript𝛽20.1\beta_{1},\beta_{2}=0.1italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.1 are used for optimization. To encourage small values, the prior half-𝒩(0,1012)half-𝒩0superscript1012\text{half-}\mathcal{N}(0,10^{-12})half- caligraphic_N ( 0 , 10 start_POSTSUPERSCRIPT - 12 end_POSTSUPERSCRIPT ) is placed on the random variable 𝜽𝚺+subscript𝜽subscript𝚺\bm{\theta}_{\mathbf{\Sigma}_{+}}bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT. The positivity constraint is enforced by setting the value of θ𝚺+subscript𝜃subscript𝚺\theta_{\mathbf{\Sigma}_{+}}italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT to be 0.9 times its previous value whenever the optimizer places θ𝚺+subscript𝜃subscript𝚺\theta_{\mathbf{\Sigma}_{+}}italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT below zero. When a=0𝑎0a=0italic_a = 0, we set 𝚪=1016×𝐈2dsubscript𝚪direct-productsuperscript1016subscript𝐈2𝑑\mathbf{\Gamma}_{\odot}=10^{-16}\times\mathbf{I}_{2d}bold_Γ start_POSTSUBSCRIPT ⊙ end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 16 end_POSTSUPERSCRIPT × bold_I start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT for positive definiteness.

6.3.2 Time-domain prediction

Once training is completed on N𝑁Nitalic_N data points, each model is simulated starting at 𝐱1subscript𝐱1\mathbf{x}_{1}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT for twice the training timespan until t2Nsubscript𝑡2𝑁t_{2N}italic_t start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT using timestep ΔtfΔsubscript𝑡𝑓\Delta t_{f}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. The accuracy of the model is then assessed by computing the mean squared error (MSE) defined as

12Nk=12N(q^kqk)2+(p^kpk)2.12𝑁superscriptsubscript𝑘12𝑁superscriptsubscript^𝑞𝑘subscript𝑞𝑘2superscriptsubscript^𝑝𝑘subscript𝑝𝑘2\frac{1}{2N}\sum_{k=1}^{2N}(\hat{q}_{k}-q_{k})^{2}+(\hat{p}_{k}-p_{k})^{2}.divide start_ARG 1 end_ARG start_ARG 2 italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_N end_POSTSUPERSCRIPT ( over^ start_ARG italic_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (35)

The results of the training can vary due to the randomness in the initialization of the NSSNN parameters, in the mini-batching procedure, and in the measurement noise. Sometimes, the optimizer can even get stuck at poor model estimates that yield very high MSE. To mitigate the effect of these outliers, we train 20 models at each (a,N)𝑎𝑁(a,N)( italic_a , italic_N ) pair and then compare the two methods using the median and minimum MSEs. Each of these 20 rounds of training has different realizations of initial parameters, mini-batches, and measurement noise. We use these particular statistics to quantify two distinct characteristics of the training objectives. The minimum MSE represents the best accuracy that an objective could potentially achieve either through fortuitous initialization and batching or through a more sophisticated optimization methodology. The median MSE, on the other hand, represents the minimum accuracy a user can expect to achieve roughly 50% of the time using the standard Adam optimizer with a reasonable amount of hyperparameter tuning. Together, these two statistics give a more complete picture of each objective’s performance.

The log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT of the median and minimum MSEs are presented as heatmaps in Figs. 3(a) and 3(b), respectively. These heatmaps show the effects of varying the number of training data and the level of noise in the data on estimation performance. In this and all future figures, the label π(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) denotes the MAP estimate. We observe that using the negative log posterior to train the NSSNN results in lower median and minimum MSE at nearly every (a,N)𝑎𝑁(a,N)( italic_a , italic_N ) pair. The primary exception is the minimum MSE in the 0%percent00\%0 % noise column in which the posterior yields lower MSE at only roughly half of the N𝑁Nitalic_N values. As noise is added, however, the MSE of models trained with the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss increases much more steeply compared to the MSE of models trained with the posterior. These results demonstrate that the posterior is much more robust to noise than the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss. Additionally, the lower median MSE of the posterior in the low noise case shown in Fig. 3(b) reflects the fact that the posterior can more reliably find low MSE model estimates, especially when a smaller number of data are available.

Refer to caption
Refer to caption
Refer to caption
(a) Median
Refer to caption
Refer to caption
Refer to caption
(b) Minimum
Figure 3: Tao’s example: log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT MSE (35) of models trained using logπ(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁-\log\pi(\bm{\uptheta}|\mathcal{Y}_{N})- roman_log italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) and the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT norm (33) as objective functions. The label π(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁\pi(\bm{\uptheta}|\mathcal{Y}_{N})italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) represents the MAP estimate, and ‘Difference’ represents the log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT MSE of the MAP estimate minus the log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT MSE of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimate. The median MSEs of the MAP estimates are lower than those of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates on all datasets, and the minimum MSEs of the MAP estimates are lower than those of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates on all datasets with noise.

To visualize how the model behavior varies with the noise and number of data, we plot the estimated trajectories of models corresponding to the four corners of the heatmaps. The estimated output of the median and minimum MSE models are shown in Figs. 4 and 5, respectively. The vertical pink line corresponds to the time t2Nsubscript𝑡2𝑁t_{2N}italic_t start_POSTSUBSCRIPT 2 italic_N end_POSTSUBSCRIPT at which the MSE evaluation ends. Visually, the estimates produced with the negative log posterior can accurately reconstruct the phase manifold in all eight of the figures. The most noticeable inaccuracy produced by the posterior is the oscillation frequency of its median MSE estimates when the noise is high. However, the minimum MSE estimates do not have this issue.

The L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective leads to poor results in two main areas: identifying good models with high noise data and reliable optimization with a low number of data. In the high noise case, both the median and minimum MSE estimates produced by the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective construct qualitatively poor estimates of the phase manifold. In the low number of data case, the minimum MSE estimate matches the true output closely when there is no noise. In this same case, however, the median MSE estimate produces a flat line in the time domain. This substantial difference in the median and minimum MSE estimates suggests that although the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective is able to produce good estimates with a small number of data, optimization is challenging and can require multiple attempts.

Refer to caption
(a) a=0%𝑎percent0a=0\%italic_a = 0 %, N=1000𝑁1000N=1000italic_N = 1000
Refer to caption
(b) a=10%𝑎percent10a=10\%italic_a = 10 %, N=1000𝑁1000N=1000italic_N = 1000
Refer to caption
(c) a=0%𝑎percent0a=0\%italic_a = 0 %, N=300𝑁300N=300italic_N = 300
Refer to caption
(d) a=10%𝑎percent10a=10\%italic_a = 10 %, N=300𝑁300N=300italic_N = 300
Figure 4: Tao’s example: Estimated trajectories from the median MSE models. The MAP estimate closely matches the truth when the data are noiseless and degrades gracefully as noise increases. The L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates reflect that the optimizer gets caught in local minima when the training dataset is small.
Refer to caption
(a) a=0%𝑎percent0a=0\%italic_a = 0 %, N=1000𝑁1000N=1000italic_N = 1000
Refer to caption
(b) a=10%𝑎percent10a=10\%italic_a = 10 %, N=1000𝑁1000N=1000italic_N = 1000
Refer to caption
(c) a=0%𝑎percent0a=0\%italic_a = 0 %, N=300𝑁300N=300italic_N = 300
Refer to caption
(d) a=10%𝑎percent10a=10\%italic_a = 10 %, N=300𝑁300N=300italic_N = 300
Figure 5: Tao’s example: Estimated trajectories from the minimum MSE models. The MAP estimates match the truth even on noisy datasets, whereas the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates only match the truth on the noiseless datasets.

6.3.3 Phase-space prediction

In Fig. 4(b), we noted that the posterior estimate appears to qualitatively match the true output closely in phase space but produces large errors in the time domain. One way to quantitatively assess how closely the estimates match the system behavior in phase space is by looking at the MSE of the Hamiltonian, defined as

12Nk=12N(H(q^k,p^k)H(q1,p1))2,12𝑁superscriptsubscript𝑘12𝑁superscript𝐻subscript^𝑞𝑘subscript^𝑝𝑘𝐻subscript𝑞1subscript𝑝12\frac{1}{2N}\sum_{k=1}^{2N}\Big{(}H(\hat{q}_{k},\hat{p}_{k})-H(q_{1},p_{1})% \Big{)}^{2},divide start_ARG 1 end_ARG start_ARG 2 italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_N end_POSTSUPERSCRIPT ( italic_H ( over^ start_ARG italic_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_H ( italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (36)

where H𝐻Hitalic_H is the Hamiltonian function (34). This metric measures the closeness of an estimate’s energy level to the true system energy. Since some of the median MSE estimates, such as those in Fig. 4(c), remain close to the initial point, we only asses the minimum MSE estimates with this metric. The log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT values of the mean squared Hamiltonian deviation for the minimum MSE estimates are shown as heatmaps in Fig. 6. This figure shows a clear trend of the Hamiltonian deviation growing both as noise increases and as the number of data decreases in both objectives’ estimates. The estimates produced by the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective outperform those produced by the posterior at almost all N𝑁Nitalic_N values when the data are noiseless — the notable exception being when the number of data is smallest.

Although Fig. 6 shows that the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimate has a lower Hamiltonian MSE when a=0𝑎0a=0italic_a = 0 than the MAP estimate, Fig. 3 showed that the MAP estimate yielded generally lower MSE in the time domain for median performance and comparable MSE in the worst-case performance. Moreover the phase-space plots of Fig. 5 also indicate improved performance of the MAP estimate across noise cases. Looking at the magnitudes of the Hamiltonian MSE values in Fig. 6, the difference in errors between the MAP and L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates at a=0𝑎0a=0italic_a = 0 is relatively small, approximately 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. When a>0𝑎0a>0italic_a > 0, however, we observe a sharp increase in the Hamiltonian MSE from the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates. In contrast, the MSE values of the MAP estimates increase gradually as a𝑎aitalic_a increases, demonstrating greater robustness to noise.

These slight differences that we observe between the time-domain MSE and Hamiltonian MSE heatmaps arise as a result of the fundamental differences in the two error metrics. While the trajectory-based metric (35) computes the pointwise squared Euclidean distance between two flows parameterized by time, the energy-based metric (36) computes the distance between a flow and a specific energy value. Because the energy-based metric compares the flow to a time-invariant quantity, it cannot by itself properly assess the quality of the dynamics, nor does it gauge or reflect the proper phase-space behavior and level sets. This can lead to problems if, for example, the estimated model possesses a stationary point that coincides with the Hamiltonian level surface. If this stationary point is the initial condition, the flow would yield zero MSE in the energy-based metric while yielding high MSE in the trajectory-based metric. As mentioned earlier, this was the case for some of the median MSE estimates. Although this metric alone is not sufficient to gauge the accuracy of dynamics, it can be valuable when used in conjunction with other metrics for its simplicity and physically-meaningful interpretation. For example, if the energy of a system is known to be constant, the energy-based metric quantifies the physical plausibility of a given model by measuring the average squared change in energy of its flow.

Refer to caption
Refer to caption
Refer to caption
Figure 6: Tao’s example: log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT Hamiltonian MSE (36) of the minimum MSE models learned by the logπ(𝛉|𝒴N)𝜋conditional𝛉subscript𝒴𝑁-\log\pi(\bm{\uptheta}|\mathcal{Y}_{N})- roman_log italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) and L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT norm objectives. ‘Difference’ represents the log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT Hamiltonian MSE of the MAP estimate minus the log10subscript10\log_{10}roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT Hamiltonian MSE of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimate. The MAP estimates yield lower Hamiltonian MSE than the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates on all datasets with noise.

6.3.4 Conclusions

We conclude that if the data are noiseless, the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective is the better option for this example because it gives time-domain MSE comparable to the negative log posterior at a lower computational cost. When the number of data is low, however, the negative log posterior is potentially more cost-efficient since it appears to require fewer training restarts than the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective based on the median results. For data that are noisy, the negative log posterior is the clear choice. Although the cost is higher, the negative log posterior produces significantly more accurate estimates when the data have as little as 1% noise.

6.4 Double pendulum

The next system that we consider is the double pendulum, which has the Hamiltonian

H(𝐪,𝐩)=m2l22pϕ2+(m1+m2)l12pφ22m2l1l2pϕpφcos(qϕqφ)2l1l2m2(m1+m2sin2(qϕqφ))(m1+m2)gl1cos(qϕ)m2gl2cos(qφ),𝐻𝐪𝐩subscript𝑚2superscriptsubscript𝑙22superscriptsubscript𝑝italic-ϕ2subscript𝑚1subscript𝑚2superscriptsubscript𝑙12superscriptsubscript𝑝𝜑22subscript𝑚2subscript𝑙1subscript𝑙2subscript𝑝italic-ϕsubscript𝑝𝜑subscript𝑞italic-ϕsubscript𝑞𝜑2subscript𝑙1subscript𝑙2subscript𝑚2subscript𝑚1subscript𝑚2superscript2subscript𝑞italic-ϕsubscript𝑞𝜑subscript𝑚1subscript𝑚2𝑔subscript𝑙1subscript𝑞italic-ϕsubscript𝑚2𝑔subscript𝑙2subscript𝑞𝜑\small H(\mathbf{q},\mathbf{p})=\frac{m_{2}l_{2}^{2}p_{\phi}^{2}+(m_{1}+m_{2})% l_{1}^{2}p_{\varphi}^{2}-2m_{2}l_{1}l_{2}p_{\phi}p_{\varphi}\cos(q_{\phi}-q_{% \varphi})}{2l_{1}l_{2}m_{2}\left(m_{1}+m_{2}\sin^{2}(q_{\phi}-q_{\varphi})% \right)}-(m_{1}+m_{2})gl_{1}\cos(q_{\phi})-m_{2}gl_{2}\cos(q_{\varphi}),italic_H ( bold_q , bold_p ) = divide start_ARG italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT roman_cos ( italic_q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT ) end_ARG start_ARG 2 italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_sin start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT ) ) end_ARG - ( italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_g italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT roman_cos ( italic_q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ) - italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_g italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT roman_cos ( italic_q start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT ) , (37)

where 𝐪=[qϕqφ]𝐪superscriptmatrixsubscript𝑞italic-ϕsubscript𝑞𝜑top\mathbf{q}=\begin{bmatrix}q_{\phi}&q_{\varphi}\end{bmatrix}^{\top}bold_q = [ start_ARG start_ROW start_CELL italic_q start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL italic_q start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, 𝐩=[pϕpφ]𝐩superscriptmatrixsubscript𝑝italic-ϕsubscript𝑝𝜑top\mathbf{p}=\begin{bmatrix}p_{\phi}&p_{\varphi}\end{bmatrix}^{\top}bold_p = [ start_ARG start_ROW start_CELL italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT end_CELL start_CELL italic_p start_POSTSUBSCRIPT italic_φ end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, and m1subscript𝑚1m_{1}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and m2subscript𝑚2m_{2}italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are two masses connected by rigid rods of lengths l1subscript𝑙1l_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and l2subscript𝑙2l_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. This Hamiltonian is more complex than the previous example, and the behavior displayed by the double pendulum is chaotic in certain regions of the phase space. These two aspects make the system challenging to learn and allow us to test the limits of the NSSNN and the two objective functions.

6.4.1 Data generation and training

For this experiment, we set m1=m2=l1=l2=1subscript𝑚1subscript𝑚2subscript𝑙1subscript𝑙21m_{1}=m_{2}=l_{1}=l_{2}=1italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 and g=9.81𝑔9.81g=9.81italic_g = 9.81, and we use an initial condition of 𝐱1=[1000]subscript𝐱1superscriptmatrix1000top\mathbf{x}_{1}=\begin{bmatrix}1&0&0&0\end{bmatrix}^{\top}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, placing the dynamics in the chaotic regime. Then, we collect N=2000𝑁2000N=2000italic_N = 2000 measurements of the full state using timesteps Δtf=103Δsubscript𝑡𝑓superscript103\Delta t_{f}=10^{-3}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT and Δtt=102Δsubscript𝑡𝑡superscript102\Delta t_{t}=10^{-2}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT and corrupt these data with multiplicative noise 𝒗k𝒰[0.99,1.01]similar-tosubscript𝒗𝑘𝒰0.991.01\bm{v}_{k}\sim\mathcal{U}[0.99,1.01]bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∼ caligraphic_U [ 0.99 , 1.01 ]. The training procedure uses 1,000 epochs with an initial learning rate of 0.05 that is multiplied by 0.8 every 50 epochs. To learn the process noise covariance, we use the parameterization 𝚺+=diag(𝜽𝚺+)subscript𝚺diagsubscript𝜽subscript𝚺\mathbf{\Sigma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Sigma}_{+}})bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = diag ( bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). The learning rate for the optimization parameter 𝛉𝚺+subscript𝛉subscript𝚺\bm{\uptheta}_{\mathbf{\Sigma}_{+}}bold_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT equals the parameter value and is also multiplied by 0.8 every 50 epochs. Any remaining aspects of the training follow the procedure described in the previous example.

6.4.2 Time-domain prediction

First, we examine the time-domain estimates of the learned models, plotted in Fig. 7. Since the double pendulum is a chaotic system, prediction errors grow exponentially fast, making long-term prediction difficult. As a result, the time-domain MSE (35) does not tend to give a reliable quantification of an estimate’s accuracy. For chaotic systems, it is often more important to (i) know when the estimate is no longer reliable and (ii) capture the qualitative behavior of the system.

To address the first concern, uncertainty quantification can be used. The typical approach to quantifying uncertainty in neural network models is by approximating the posterior distribution over the network parameters [48], but computational approximation of this distribution is often costly. Fortunately, the proposed Bayesian offers a measure of uncertainty that circumvents this challenging approximation. By including the process noise during evaluation of the model, a stochastic simulation is produced whose realizations can be used to construct a probability distribution of the estimated output. Samples from this stochastic simulation are shown alongside the deterministic simulation, referred to as the “nominal MAP estimate,” in Fig. 8. We observe that the spread of samples begins to grow as the MAP estimate begins to deviate from the truth. This suggests that the process noise can give a reliable estimate of how the uncertainty in an estimated model changes over time. Notably, the commonly-used deterministic model approaches have no such resource for assessing the reliability of their estimates.

Refer to caption
Figure 7: Double pendulum: Comparison of the MAP and L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates. Due to the chaotic nature of the system, neither estimate reliably predicts the output for more than five seconds.
Refer to caption
Figure 8: Double pendulum: Realizations of the MAP dynamics with the nominal MAP estimate, truth, and data overlaid. The stochastic model approach is able to signal the uncertainty of its estimate through realizations of the estimated process noise.

6.4.3 Phase-space prediction

To address the second concern of assessing qualitative behavior, we compare the accuracy of the two estimates in phase space. Specifically, we quantify this accuracy by evaluating the absolute Hamiltonian error |H(𝐪^k,𝐩^k)H(𝐪1,𝐩1)|𝐻subscript^𝐪𝑘subscript^𝐩𝑘𝐻subscript𝐪1subscript𝐩1\left\lvert H(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-H(\mathbf{q}_{1},% \mathbf{p}_{1})\right\rvert| italic_H ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_H ( bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) |. Fig. 9(a) shows the estimates in phase space. The color of the line denotes the value of the absolute Hamiltonian error at that point. The absolute Hamiltonian error is also plotted over time for reference in Fig. 9(b) with a pink line separating the training and testing time periods. We see that both qualitatively and quantitatively, the MAP estimate is much closer to the true Hamiltonian. Fig. 9(b) shows that the error of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimate is sometimes lower than that of the MAP estimate, but it occasionally spikes to magnitudes that are several times larger than the MAP estimate error. Such behavior is typically indicative of overfitting. Additionally, the spikes become more frequent as time goes on, reflecting the poor potential for long-term forecasting that was also observed in the previous example in Section 6.3.3. In contrast, the absolute Hamiltonian error of the MAP estimate is lower on average and does not display sudden spikes. We hypothesize that this preferable behavior can be attributed to the inherent regularization in the marginal likelihood that penalizes large output covariances [16].

6.4.4 Conclusions

On this example, the negative log posterior was better able to capture the underlying Hamiltonian manifold of the chaotic double pendulum in terms of both Hamiltonian error and visual inspection in phase space compared to the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT objective. Additionally, the Bayesian system ID framework was shown to provide a reliable method of uncertainty quantification of the output forecasts without the expense of quantifying uncertainty in the neural network parameters.

Refer to caption
(a) MAP and L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimates in phase space, where the color denotes the absolute Hamiltonian error. The MAP estimate more closely resembles the shape of the true manifolds.
Refer to caption
(b) Absolute Hamiltonian error over time. The Hamiltonian error of the L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT estimate exhibits greater variance than that of the MAP estimate, indicating overfitting.
Figure 9: Double pendulum: Absolute Hamiltonian error in phase (9(a)) and time (9(b)) domains. The MAP estimate shows lower Hamiltonian error on average.

6.5 Nonlinear Schrödinger equation

The nonlinear Schrödinger equation (NLSE) is a partial differential equation (PDE) that is used for modeling nonlinear waves in plasma physics [49], nonlinear optics [50], quantum mechanics [51], and oceanography [52]. This equation exhibits a rich variety of dynamical phenomena due to the combined action of dispersion and nonlinearity on a narrow-banded field envelope. The one-dimensional parametric NLSE considered here is a particularly suitable model for optical experiments realized with single-mode fibers, see [53] for more details.

We consider the parametric nonlinear Schrödinger equation with a cubic nonlinearity

iψt+2ψz2+γ|ψ|2ψ=0,𝑖𝜓𝑡superscript2𝜓superscript𝑧2𝛾superscript𝜓2𝜓0i\frac{\partial\psi}{\partial t}+\frac{\partial^{2}\psi}{\partial z^{2}}+% \gamma|\psi|^{2}\psi=0,italic_i divide start_ARG ∂ italic_ψ end_ARG start_ARG ∂ italic_t end_ARG + divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_ψ end_ARG start_ARG ∂ italic_z start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + italic_γ | italic_ψ | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_ψ = 0 , (38)

where ψ:=ψ(z,t)assign𝜓𝜓𝑧𝑡\psi:=\psi(z,t)italic_ψ := italic_ψ ( italic_z , italic_t ) is a complex-valued function and i=1𝑖1i=\sqrt{-1}italic_i = square-root start_ARG - 1 end_ARG is the imaginary unit. This parametric nonlinear PDE can be recast as a canonical Hamiltonian PDE by writing the complex-valued wave function ψ(z,t)=p(z,t)+iq(z,t)𝜓𝑧𝑡𝑝𝑧𝑡𝑖𝑞𝑧𝑡\psi(z,t)=p(z,t)+iq(z,t)italic_ψ ( italic_z , italic_t ) = italic_p ( italic_z , italic_t ) + italic_i italic_q ( italic_z , italic_t ) in terms of its real and imaginary parts as

pt=2qz2γ(q2+p2)p,qt=2pz2+γ(q2+p2)q,formulae-sequence𝑝𝑡superscript2𝑞superscript𝑧2𝛾superscript𝑞2superscript𝑝2𝑝𝑞𝑡superscript2𝑝superscript𝑧2𝛾superscript𝑞2superscript𝑝2𝑞\frac{\partial p}{\partial t}=-\frac{\partial^{2}q}{\partial z^{2}}-\gamma% \left(q^{2}+p^{2}\right)p,\qquad\frac{\partial q}{\partial t}=\frac{\partial^{% 2}p}{\partial z^{2}}+\gamma\left(q^{2}+p^{2}\right)q,divide start_ARG ∂ italic_p end_ARG start_ARG ∂ italic_t end_ARG = - divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_q end_ARG start_ARG ∂ italic_z start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG - italic_γ ( italic_q start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_p , divide start_ARG ∂ italic_q end_ARG start_ARG ∂ italic_t end_ARG = divide start_ARG ∂ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_p end_ARG start_ARG ∂ italic_z start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + italic_γ ( italic_q start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) italic_q ,

with the space-time continuous Hamiltonian

(q,p)=12[pz2+qz2γ2(p2+q2)2]dz,𝑞𝑝12delimited-[]superscriptsubscript𝑝𝑧2superscriptsubscript𝑞𝑧2𝛾2superscriptsuperscript𝑝2superscript𝑞22differential-d𝑧\mathcal{H}(q,p)=\frac{1}{2}\int\left[p_{z}^{2}+q_{z}^{2}-\frac{\gamma}{2}% \left(p^{2}+q^{2}\right)^{2}\right]{\rm d}z,caligraphic_H ( italic_q , italic_p ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∫ [ italic_p start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG italic_γ end_ARG start_ARG 2 end_ARG ( italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] roman_d italic_z , (39)

where the parameter γ𝛾\gammaitalic_γ determines the influence of the non-quadratic terms. In addition to Hamiltonian conservation, this system also conserves mass invariant 1subscript1\mathcal{I}_{1}caligraphic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and momentum invariant 2subscript2\mathcal{I}_{2}caligraphic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT defined as

1(q,p)=(p2+q2)dz,2(q,p)=(pzqqzp)dz.formulae-sequencesubscript1𝑞𝑝superscript𝑝2superscript𝑞2differential-d𝑧subscript2𝑞𝑝subscript𝑝𝑧𝑞subscript𝑞𝑧𝑝differential-d𝑧\mathcal{I}_{1}(q,p)=\int(p^{2}+q^{2}){\rm d}z,\qquad\mathcal{I}_{2}(q,p)=\int% (p_{z}q-q_{z}p){\rm d}z.caligraphic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_q , italic_p ) = ∫ ( italic_p start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) roman_d italic_z , caligraphic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_q , italic_p ) = ∫ ( italic_p start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT italic_q - italic_q start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT italic_p ) roman_d italic_z . (40)

6.5.1 Data generation and training

For the learning problem, we consider a gray-box setting where we assume knowledge about the form of the Hamiltonian (39) at the PDE level, but the parameter γ𝛾\gammaitalic_γ is uncertain. To differentiate from the true parameter γ𝛾\gammaitalic_γ, we denote the uncertain parameter as 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT. We then estimate the MAP value of 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT using Algorithm 2 to yield a FOM based on the estimated parameter. To assess the quality of this model, we study the relative error of the estimated states and the model’s ability to conserve the system Hamiltonian, mass, and momentum.

For this study, we first generate high-dimensional data by numerically solving (38) with true parameter γ=2𝛾2\gamma=2italic_γ = 2. To discretize the PDE in space, we use a spatial domain z[L/2,L/2]𝑧𝐿2𝐿2z\in[-L/2,L/2]italic_z ∈ [ - italic_L / 2 , italic_L / 2 ], where L=2π2𝐿2𝜋2L=2\pi\sqrt{2}italic_L = 2 italic_π square-root start_ARG 2 end_ARG, with periodic boundary conditions and initial conditions of p(z,0)=0.5(1+0.01cos(2πz/L))𝑝𝑧00.510.012𝜋𝑧𝐿p(z,0)=0.5(1+0.01\cos(2\pi z/L))italic_p ( italic_z , 0 ) = 0.5 ( 1 + 0.01 roman_cos ( 2 italic_π italic_z / italic_L ) ) and q(z,0)=0𝑞𝑧00q(z,0)=0italic_q ( italic_z , 0 ) = 0. The spatial discretization uses d=64𝑑64d=64italic_d = 64 equally spaced grid points for a total state dimension of 2d=1282𝑑1282d=1282 italic_d = 128. We approximate a solution to this discretized PDE using Tao’s integrator with a fine timestep of Δtf=103Δsubscript𝑡𝑓superscript103\Delta t_{f}=10^{-3}roman_Δ italic_t start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. Then we collect data over 20s at a timestep of Δtt=5×103Δsubscript𝑡𝑡5superscript103\Delta t_{t}=5\times 10^{-3}roman_Δ italic_t start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 5 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT for a total N=4000𝑁4000N=4000italic_N = 4000 data. After generating the data, we add 20% multiplicative uniform noise 𝒗k𝒰[0.8,1.2]similar-tosubscript𝒗𝑘𝒰0.81.2\bm{v}_{k}\sim\mathcal{U}[0.8,1.2]bold_italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∼ caligraphic_U [ 0.8 , 1.2 ]. As a means of visualizing how noisy these data are, we consider the system’s wave function ψ(z,t)𝜓𝑧𝑡\psi(z,t)italic_ψ ( italic_z , italic_t ). Among other reasons, the wave function is notable because its square modulus |ψ(z,t)|2superscript𝜓𝑧𝑡2\lvert\psi(z,t)\rvert^{2}| italic_ψ ( italic_z , italic_t ) | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT can be interpreted as the probability density of the system. This quantity is plotted using the clean solution in Fig. 10(a) and the noisy data in Fig. 10(b).

Refer to caption
(a) Truth
Refer to caption
(b) Data
Figure 10: NLSE: Data visualization using |ψ(x,t)|2superscript𝜓𝑥𝑡2\lvert\psi(x,t)\rvert^{2}| italic_ψ ( italic_x , italic_t ) | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

Next, we project the data onto a low-dimensional subspace so that we can estimate 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT in a reduced setting. For this projection step, we compute a symplectic basis V of the form (21) with reduced dimension r=8𝑟8r=8italic_r = 8 using the cotangent lift algorithm described in Section 5.1.2. As discussed in Section 5.2, the multiplicative noise form is not preserved under this dimension-reducing transformation. Therefore, we use the approximate ~~\tilde{\mathcal{M}}over~ start_ARG caligraphic_M end_ARG given by Eq. (31) as the observation model for this experiment.

To train the model, we seek to estimate the 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT parameter and the diagonal elements of 𝚺+subscript𝚺\mathbf{\Sigma}_{+}bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT and 𝚪+subscript𝚪\mathbf{\Gamma}_{+}bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT for a total of 33 parameters. Since PyTorch does not have a Lyapunov solver needed for H-OpInf, we use the solver from the scipy package, which breaks the computational graph required for auto-differentiation with respect to θγsubscript𝜃𝛾\theta_{\gamma}italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT. As an alternative method of differentiation, we approximate π(𝛉|𝒴N)/θγ𝜋conditional𝛉subscript𝒴𝑁subscript𝜃𝛾\partial\pi(\bm{\uptheta}|\mathcal{Y}_{N})/\partial\theta_{\gamma}∂ italic_π ( bold_θ | caligraphic_Y start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) / ∂ italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT using forward finite difference with a step size of 106superscript10610^{-6}10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT. Then, we optimize θγsubscript𝜃𝛾\theta_{\gamma}italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT using gradient descent with a step size of 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT. We parameterize the covariance matrices as 𝚺+=diag(𝜽𝚺+)subscript𝚺diagsubscript𝜽subscript𝚺\mathbf{\Sigma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Sigma}_{+}})bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = diag ( bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) and 𝚪+=diag(𝜽𝚪+)subscript𝚪diagsubscript𝜽subscript𝚪\mathbf{\Gamma}_{+}=\text{diag}(\bm{\theta}_{\mathbf{\Gamma}_{+}})bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT = diag ( bold_italic_θ start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). The variance parameters are optimized using the same procedure as before with a learning rate of 0.5 that is multiplied by 0.8 every 10 epochs. The priors over 𝜽𝚺+subscript𝜽subscript𝚺\bm{\theta}_{\mathbf{\Sigma}_{+}}bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT and 𝜽𝚪+subscript𝜽subscript𝚪\bm{\theta}_{\mathbf{\Gamma}_{+}}bold_italic_θ start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT are half-𝒩(0,1012)half-𝒩0superscript1012\text{half-}\mathcal{N}(0,10^{-12})half- caligraphic_N ( 0 , 10 start_POSTSUPERSCRIPT - 12 end_POSTSUPERSCRIPT ) and half-𝒩(0,109)half-𝒩0superscript109\text{half-}\mathcal{N}(0,10^{-9})half- caligraphic_N ( 0 , 10 start_POSTSUPERSCRIPT - 9 end_POSTSUPERSCRIPT ), respectively. We initialize optimization variables θγsubscript𝜃𝛾\theta_{\gamma}italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT at 0, 𝛉𝚺+subscript𝛉subscript𝚺\bm{\uptheta}_{\mathbf{\Sigma}_{+}}bold_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT at 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT, and 𝛉𝚪+subscript𝛉subscript𝚪\bm{\uptheta}_{\mathbf{\Gamma}_{+}}bold_θ start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT at 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. For this experiment, we use 50 epochs.

The optimization result is used to initialize MCMC. The sampling algorithm that we use is a delayed rejection adaptive Metropolis [54] within Gibbs procedure, where the parameter groups 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT, 𝜽𝚺+subscript𝜽subscript𝚺\bm{\theta}_{\mathbf{\Sigma}_{+}}bold_italic_θ start_POSTSUBSCRIPT bold_Σ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and 𝜽𝚪+subscript𝜽subscript𝚪\bm{\theta}_{\mathbf{\Gamma}_{+}}bold_italic_θ start_POSTSUBSCRIPT bold_Γ start_POSTSUBSCRIPT + end_POSTSUBSCRIPT end_POSTSUBSCRIPT are sampled sequentially. To ensure convergence, we draw 2×1042superscript1042\times 10^{4}2 × 10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT samples, and discard the first 104superscript10410^{4}10 start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT as burn-in. Fig. 11(a) shows the marginal posterior distribution of 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT with the mean value of 1.955 indicated by the dark blue line. The average squared error of these samples with respect to the true γ𝛾\gammaitalic_γ value is 2.185×1032.185superscript1032.185\times 10^{-3}2.185 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. We also plot the Markov chain of 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT in Fig. 11(b) to show that the chain is well-mixed.

Refer to caption
(a) Histogram of samples
Refer to caption
(b) MCMC chain
Figure 11: NLSE: MCMC samples of the θγsubscript𝜃𝛾\theta_{\gamma}italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT parameter. The dark blue line in Fig. 11(a) denotes the parameter posterior mean. The samples are unimodal and the chain appears well-mixed.

6.5.2 Relative state error

We next use these samples to simulate the system over time. To help decorrelate the samples, we only use every 10th sample for a total 1,000 samples. For a given initial condition in the high-dimensional setting, we simulate the learned FOMs until 40s to assess the model performance outside the training period. The performance metric that we consider is the relative state error defined as

𝐗e𝐗^eF2𝐗eF2,superscriptsubscriptdelimited-∥∥subscript𝐗𝑒subscript^𝐗𝑒𝐹2superscriptsubscriptdelimited-∥∥subscript𝐗𝑒𝐹2\frac{\lVert\mathbf{X}_{e}-\hat{\mathbf{X}}_{e}\rVert_{F}^{2}}{\lVert\mathbf{X% }_{e}\rVert_{F}^{2}},divide start_ARG ∥ bold_X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT - over^ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ∥ bold_X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , (41)

where 𝐗esubscript𝐗𝑒\mathbf{X}_{e}bold_X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT and 𝐗^esubscript^𝐗𝑒\hat{\mathbf{X}}_{e}over^ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT are the true and estimated extended snapshot matrices, respectively. The relative state errors over the training and testing periods are shown in Fig. 12. The dark blue line denotes the relative state error yielded by the 𝜽γsubscript𝜽𝛾\bm{\theta}_{\gamma}bold_italic_θ start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT sample average. Almost all of the samples of the relative state error over the training period are below 10%. The error of the testing period is larger due to the fact that errors tend to grow over time in dynamical models. These plots show that we are able to make good predictions over the short-term with errors that grow gradually as the simulation time increases.

Refer to caption
(a) Training period (FOM)
Refer to caption
(b) Testing period (FOM)
Figure 12: NLSE: Relative state error (41) over training and testing periods. The dark blue line denotes the state error of the parameter posterior mean. Most of the samples exhibit less than 10% state error over the training period. Naturally, the testing period error is higher due to the accumulation of errors over time.

6.5.3 Conservation of invariants

We also assess the structure-preserving nature of this model by computing the absolute errors in Hamiltonian, mass, and momentum. The spatially-discretized forms of these invariants from Eqs. (39) and (40) are defined as

H(𝐪,𝐩)𝐻𝐪𝐩\displaystyle H(\mathbf{q},\mathbf{p})italic_H ( bold_q , bold_p ) =12i=1d[pzi2+qzi2γ2(pi2+qi2)2]Δz,absent12superscriptsubscript𝑖1𝑑delimited-[]superscriptsubscript𝑝subscript𝑧𝑖2superscriptsubscript𝑞subscript𝑧𝑖2𝛾2superscriptsuperscriptsubscript𝑝𝑖2superscriptsubscript𝑞𝑖22Δ𝑧\displaystyle=\frac{1}{2}\sum_{i=1}^{d}\left[p_{z_{i}}^{2}+q_{z_{i}}^{2}-\frac% {\gamma}{2}\left(p_{i}^{2}+q_{i}^{2}\right)^{2}\right]\Delta z,= divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT [ italic_p start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG italic_γ end_ARG start_ARG 2 end_ARG ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] roman_Δ italic_z , (42)
I1(𝐪,𝐩)subscript𝐼1𝐪𝐩\displaystyle I_{1}(\mathbf{q},\mathbf{p})italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_q , bold_p ) =i=1d[pi2+qi2]Δz,absentsuperscriptsubscript𝑖1𝑑delimited-[]superscriptsubscript𝑝𝑖2superscriptsubscript𝑞𝑖2Δ𝑧\displaystyle=\sum_{i=1}^{d}\left[p_{i}^{2}+q_{i}^{2}\right]\Delta z,= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT [ italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] roman_Δ italic_z , (43)
I2(𝐪,𝐩)subscript𝐼2𝐪𝐩\displaystyle I_{2}(\mathbf{q},\mathbf{p})italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_q , bold_p ) =i=1d[pziqiqzipi]Δz,absentsuperscriptsubscript𝑖1𝑑delimited-[]subscript𝑝subscript𝑧𝑖subscript𝑞𝑖subscript𝑞subscript𝑧𝑖subscript𝑝𝑖Δ𝑧\displaystyle=\sum_{i=1}^{d}\left[p_{z_{i}}q_{i}-q_{z_{i}}p_{i}\right]\Delta z,= ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT [ italic_p start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_q start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] roman_Δ italic_z , (44)

where Δz=L/dΔ𝑧𝐿𝑑\Delta z=L/droman_Δ italic_z = italic_L / italic_d, pzi=(pi+1pi)/Δzsubscript𝑝subscript𝑧𝑖subscript𝑝𝑖1subscript𝑝𝑖Δ𝑧p_{z_{i}}=(p_{i+1}-p_{i})/\Delta zitalic_p start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ( italic_p start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / roman_Δ italic_z, and qzisubscript𝑞subscript𝑧𝑖q_{z_{i}}italic_q start_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT is defined similarly. The posterior of the absolute errors of the Hamiltonian |H(𝐪^k,𝐩^k)H(𝐪1,𝐩1)|𝐻subscript^𝐪𝑘subscript^𝐩𝑘𝐻subscript𝐪1subscript𝐩1\lvert H(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-H(\mathbf{q}_{1},\mathbf{p% }_{1})\rvert| italic_H ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_H ( bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) |, the mass |I1(𝐪^k,𝐩^k)I1(𝐪1,𝐩1)|subscript𝐼1subscript^𝐪𝑘subscript^𝐩𝑘subscript𝐼1subscript𝐪1subscript𝐩1\lvert I_{1}(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-I_{1}(\mathbf{q}_{1},% \mathbf{p}_{1})\rvert| italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) |, and the momentum |I2(𝐪^k,𝐩^k)I2(𝐪1,𝐩1)|subscript𝐼2subscript^𝐪𝑘subscript^𝐩𝑘subscript𝐼2subscript𝐪1subscript𝐩1\lvert I_{2}(\hat{\mathbf{q}}_{k},\hat{\mathbf{p}}_{k})-I_{2}(\mathbf{q}_{1},% \mathbf{p}_{1})\rvert| italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , over^ start_ARG bold_p end_ARG start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_I start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( bold_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) | are plotted in Fig. 13. The FOM based on the estimated parameter value exhibits bounded error behavior 100%percent100100\%100 % outside the training data regime for all three conserved quantities, demonstrating that this method is capable of preserving underlying physics.

Refer to caption
Figure 13: NLSE: Absolute error of the three conserved quantities: Hamiltonian, mass, and momentum. The dark blue line denotes the result yielded by the parameter posterior mean, and the vertical pink line denotes the end of the training period. The sampled FOMs conserve each of the three quantities, demonstrating the structure-preserving capabilities of the proposed algorithm.

6.5.4 Varying the reduced dimension

Lastly, we study the parameter estimation accuracy of the proposed reduced-dimensional learning approach for different reduced dimensions. Since we have randomness in the measurement noise and the Adam optimizer, we draw 20 realizations of data and train on each for every reduced dimension value from r=2𝑟2r=2italic_r = 2 to r=8𝑟8r=8italic_r = 8. The optimization uses the same procedure described earlier in this section. Box and whisker plots of the squared parameter errors are shown in Fig. 14. The error is naturally quite high for r=2𝑟2r=2italic_r = 2 but decreases and levels off at r=3𝑟3r=3italic_r = 3, which indicates that the first three modes based on the SVD of the extended snapshot matrix are relatively clean. For r>3𝑟3r>3italic_r > 3, we observe modes with low values of signal-to-noise ratios which leads to a marginal increase in the estimation error. The shapes of these modes for a single noise realization are shown in Fig. 15.

Refer to caption
Figure 14: NLSE: Squared error of the parameter MAP θ^γsubscript^𝜃𝛾\hat{\theta}_{\gamma}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT at various r𝑟ritalic_r over 20 realizations of measurement noise. The circles represent outliers. The errors level off at r=3𝑟3r=3italic_r = 3, showing that only three dimensions are needed to achieve the best parameter estimation accuracy with the proposed algorithm on this example.
Refer to caption
Figure 15: NLSE: Mode shapes obtained via SVD of the extended snapshot data. The modes become much noisier at r>3𝑟3r>3italic_r > 3, which explains the slight increase in parameter estimation error after r=3𝑟3r=3italic_r = 3 in Fig. 14.

6.5.5 Conclusions

This example showed that Algorithm 2 was effective at estimating a parameter from a 64-dimensional system in a reduced 8-dimensional subspace with 20% multiplicative measurement noise. Since the computational cost of the likelihood evaluation scales cubically with state dimension n𝑛nitalic_n, reducing the state dimension by a factor of eight results in evaluation that is roughly 512 times cheaper compared to evaluation with the full state dimension. Furthermore, it was shown that the reduced dimension could be as small as r=3𝑟3r=3italic_r = 3 while still achieving comparable parameter estimation accuracy as r=8𝑟8r=8italic_r = 8, suggesting potential savings of up to 9,709 times. Additionally, the posterior FOM was shown to share important physical structure to the underlying system, as shown by its conservation of mass, momentum, and Hamiltonian over time.

7 Conclusions

In this work, we presented a structure-preserving Bayesian framework for learning deep neural network parameterizations of nonseparable Hamiltonian systems from noisy data. This framework uses Gaussian filtering to compute a likelihood based on a stochastic dynamics model that allows for the effects of model and measurement uncertainty to be accounted for differently. Unlike past works which only considered additive Gaussian noise models, we showed that the algorithm can be tailored to other noise models and provided a filter for multiplicative noise as an example. The numerical experiments for low-dimensional systems demonstrated that the proposed Bayesian framework is data-efficient and robust to noise in the data, whereas the standard machine learning approach breaks down in the presence of noisy data. Moreover, the Bayesian framework outperformed the NSSNN, a state-of-the-art machine learning approach, when training on data with multiplicative uniform noise, demonstrating that the Gaussian filtering approach is not overly restrictive for non-Gaussian measurement noise.

We also proposed a novel algorithm for the identification of high-dimensional Hamiltonians that allows for cost-efficient parameter estimation through filtering in a low-dimensional symplectic subspace. Using prior knowledge about the underlying physics, this algorithm was effective in parameter estimation of a nonlinear Schrödinger equation within 2.185×1032.185superscript1032.185\times 10^{-3}2.185 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT mean squared error, even with data corrupted by 20% multiplicative uniform noise. The full-order models based on this parameter estimate provided accurate and stable predictions, while also preserving the system Hamiltonian and other invariants of motion. Future work on this topic will explore estimation using partial observations, including improving low-dimensional projections derived from partially unknown/uncertain full-order models.

CRediT authorship contribution statement

Nicholas Galioto: Conceptualization, Methodology, Investigation, Software, Formal analysis, Writing - original draft, Visualization. Harsh Sharma: Conceptualization, Methodology, Software, Writing - review & editing. Boris Kramer: Conceptualization, Validation, Resources, Writing: Review & Editing, Funding Acquisition. Alex Arkady Gorodetsky: Conceptualization, Methodology, Resources, Writing: Review & Editing, Funding Acquisition, Supervision

Declaration of competing interest

Alex Gorodetsky reports a relationship with Geminus AI that includes: employment and equity or stocks. Boris Kramer reports a relationship with ASML Holding US that includes: consulting or advisory. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available upon request.

Acknowledgments

A.A. Gorodetsky and N. Galioto were funded by the AFOSR Computational Mathematics Program (P.M. Fariba Fahroo) under award FA9550-19-1-0013. B. Kramer and H. Sharma were in part financially supported by the Ministry of Trade, Industry and Energy (MOTIE) and the Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&D program (No. P0019804, Digital twin based intelligent unmanned facility inspection solutions) and the Applied and Computational Analysis Program of the Office of Naval Research under award N000142212624.

Appendix A Cotangent lift algorithm

In projection-based model reduction, the semi-discrete model is projected onto a low-dimensional subspace. The key idea in structure-preserving model reduction is to preserve the underlying geometric structure during the projection. Since the FOM is a Hamiltonian system with underlying symplectic structure, the projection step is treated as the symplectic inverse of a symplectic lift from the low-dimensional subspace to the state space, see [55]. A symplectic lift is defined by 𝐲=𝐕𝐲~𝐲𝐕~𝐲\mathbf{y}=\mathbf{V}\tilde{\mathbf{y}}bold_y = bold_V over~ start_ARG bold_y end_ARG where 𝐕2d×2r𝐕superscript2𝑑2𝑟\mathbf{V}\in\mathbb{R}^{2d\times 2r}bold_V ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_d × 2 italic_r end_POSTSUPERSCRIPT is a symplectic matrix, i.e., a matrix that satisfies

𝐕𝐉2d𝐕=𝐉2r.superscript𝐕topsubscript𝐉2𝑑𝐕subscript𝐉2𝑟\mathbf{V}^{\top}\mathbf{J}_{2d}\mathbf{V}=\mathbf{J}_{2r}.bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_J start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT bold_V = bold_J start_POSTSUBSCRIPT 2 italic_r end_POSTSUBSCRIPT . (45)

The symplectic inverse 𝐕+2r×2dsuperscript𝐕superscript2𝑟2𝑑\mathbf{V}^{+}\in\mathbb{R}^{2r\times 2d}bold_V start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_r × 2 italic_d end_POSTSUPERSCRIPT of a symplectic matrix 𝐕𝐕\mathbf{V}bold_V is defined by

𝐕+=𝐉2r𝐕𝐉2d,superscript𝐕superscriptsubscript𝐉2𝑟topsuperscript𝐕topsubscript𝐉2𝑑\mathbf{V}^{+}=\mathbf{J}_{2r}^{\top}\mathbf{V}^{\top}\mathbf{J}_{2d},bold_V start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = bold_J start_POSTSUBSCRIPT 2 italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_J start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT , (46)

and the symplectic projection can be written as 𝐱~=𝐕+𝐱2r~𝐱superscript𝐕𝐱superscript2𝑟\tilde{\mathbf{x}}=\mathbf{V}^{+}\mathbf{x}\in\mathbb{R}^{2r}over~ start_ARG bold_x end_ARG = bold_V start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT bold_x ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_r end_POSTSUPERSCRIPT. Proper symplectic decomposition (PSD) [55] is a method to find a symplectic projection matrix 𝐕𝐕\mathbf{V}bold_V that simultaneously minimizes the projection error in a least-squares sense, i.e.,

min𝐕s.t.𝐕𝐉2d𝐕=𝐉2rX𝐕𝐕+XF,subscript𝐕s.t.superscript𝐕topsubscript𝐉2𝑑𝐕subscript𝐉2𝑟subscriptnormXsuperscript𝐕𝐕X𝐹\min_{\begin{subarray}{c}\mathbf{V}\\ \text{s.t.}\ \mathbf{V}^{\top}\mathbf{J}_{2d}\mathbf{V}=\mathbf{J}_{2r}\end{% subarray}}\|\textbf{X}-\mathbf{V}\mathbf{V}^{+}\textbf{X}\|_{F},roman_min start_POSTSUBSCRIPT start_ARG start_ROW start_CELL bold_V end_CELL end_ROW start_ROW start_CELL s.t. bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_J start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT bold_V = bold_J start_POSTSUBSCRIPT 2 italic_r end_POSTSUBSCRIPT end_CELL end_ROW end_ARG end_POSTSUBSCRIPT ∥ X - bold_VV start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT X ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT , (47)

where X:=[𝐱(t1),,𝐱(tK)]2d×KassignX𝐱subscript𝑡1𝐱subscript𝑡𝐾superscript2𝑑𝐾\textbf{X}:=[\mathbf{x}(t_{1}),\cdots,\mathbf{x}(t_{K})]\in\mathbb{R}^{2d% \times K}X := [ bold_x ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ⋯ , bold_x ( italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) ] ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_d × italic_K end_POSTSUPERSCRIPT is the snapshot data matrix, and F\|\cdot\|_{F}∥ ⋅ ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT is the Frobenius norm. Since solving Eq. (47) to obtain the symplectic basis matrix 𝐕𝐕\mathbf{V}bold_V is computationally expensive, the authors in [55] outlined three efficient algorithms for finding approximated optimal solution for the symplectic matrix 𝐕𝐕\mathbf{V}bold_V. These algorithms (cotangent lift, complex SVD, and a nonlinear programming approach) search for a near-optimal solution over different subsets of Sp(2r,2d)𝑆𝑝2𝑟superscript2𝑑Sp(2r,\mathbb{R}^{2d})italic_S italic_p ( 2 italic_r , blackboard_R start_POSTSUPERSCRIPT 2 italic_d end_POSTSUPERSCRIPT ), the set of all 2d×2r2𝑑2𝑟2d\times 2r2 italic_d × 2 italic_r symplectic matrices. The cotangent lift algorithm computes the SVD of the extended snapshot matrix Xe=[𝐪(t1),,𝐪(tK),𝐩(t1),,𝐩(tK)]d×2KsubscriptX𝑒𝐪subscript𝑡1𝐪subscript𝑡𝐾𝐩subscript𝑡1𝐩subscript𝑡𝐾superscript𝑑2𝐾\textbf{X}_{e}=[\mathbf{q}(t_{1}),\cdots,\mathbf{q}(t_{K}),\mathbf{p}(t_{1}),% \cdots,\mathbf{p}(t_{K})]\in\mathbb{R}^{d\times 2K}X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT = [ bold_q ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ⋯ , bold_q ( italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) , bold_p ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , ⋯ , bold_p ( italic_t start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × 2 italic_K end_POSTSUPERSCRIPT to obtain a POD basis matrix 𝚽d×r𝚽superscript𝑑𝑟\mathbf{\Phi}\in\mathbb{R}^{d\times r}bold_Φ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_r end_POSTSUPERSCRIPT and then constructs the symplectic basis matrix 𝐕=[𝚽𝟎𝟎𝚽]2d×2r𝐕matrix𝚽00𝚽superscript2𝑑2𝑟\mathbf{V}=\begin{bmatrix}\mathbf{\Phi}&\mathbf{0}\\ \mathbf{0}&\mathbf{\Phi}\end{bmatrix}\in\mathbb{R}^{2d\times 2r}bold_V = [ start_ARG start_ROW start_CELL bold_Φ end_CELL start_CELL bold_0 end_CELL end_ROW start_ROW start_CELL bold_0 end_CELL start_CELL bold_Φ end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT 2 italic_d × 2 italic_r end_POSTSUPERSCRIPT with 𝐕+=𝐕superscript𝐕superscript𝐕top\mathbf{V}^{+}=\mathbf{V}^{\top}bold_V start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT = bold_V start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. Compared to the complex SVD and the nonlinear programming approach, the cotangent lift algorithm is more easily implemented in the offline stage as it only requires the SVD of the extended snapshot matrix XesubscriptX𝑒\textbf{X}_{e}X start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT. Moreover, the diagonal nature of 𝐕𝐕\mathbf{V}bold_V ensures that the interpretability of 𝐪𝐪\mathbf{q}bold_q and 𝐩𝐩\mathbf{p}bold_p is retained in the reduced setting.

Appendix B Hamiltonian operator inference

In this section, we describe how to estimate the terms 𝐃~𝐪subscript~𝐃𝐪\tilde{\mathbf{D}}_{\mathbf{q}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐃~𝐩subscript~𝐃𝐩\tilde{\mathbf{D}}_{\mathbf{p}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT in the quadratic portion of the Hamiltonian (23) using the approach of H-OpInf [44]. First, assume we have an initial estimate of 𝛉nlsubscript𝛉nl\bm{\uptheta}_{\text{nl}}bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT defining the nonlinear terms Hnlsubscript𝐻nlH_{\text{nl}}italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT. Then, define the nonlinear forcings 𝐟𝐪subscript𝐟𝐪\mathbf{f}_{\mathbf{q}}bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐟𝐩subscript𝐟𝐩\mathbf{f}_{\mathbf{p}}bold_f start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT as

𝐟𝐪(𝐱,𝛉nl)subscript𝐟𝐪𝐱subscript𝛉nl\displaystyle\mathbf{f}_{\mathbf{q}}(\mathbf{x},\bm{\uptheta}_{\text{nl}})bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_x , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) =[Hnlp1(q1,p1,𝛉nl)Hnlpd(qd,pd,𝛉nl)]d,absentsuperscriptmatrixsubscript𝐻nlsuperscript𝑝1superscript𝑞1superscript𝑝1subscript𝛉nlsubscript𝐻nlsuperscript𝑝𝑑superscript𝑞𝑑superscript𝑝𝑑subscript𝛉nltopsuperscript𝑑\displaystyle=\begin{bmatrix}\frac{\partial H_{\text{nl}}}{\partial p^{1}}(q^{% 1},p^{1},\bm{\uptheta}_{\text{nl}})\cdots\frac{\partial H_{\text{nl}}}{% \partial p^{d}}(q^{d},p^{d},\bm{\uptheta}_{\text{nl}})\end{bmatrix}^{\top}\in% \mathbb{R}^{d},= [ start_ARG start_ROW start_CELL divide start_ARG ∂ italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_p start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_ARG ( italic_q start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_p start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) ⋯ divide start_ARG ∂ italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_p start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG ( italic_q start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , italic_p start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , (48)
𝐟𝐩(𝐱,𝛉nl)subscript𝐟𝐩𝐱subscript𝛉nl\displaystyle\mathbf{f}_{\mathbf{p}}(\mathbf{x},\bm{\uptheta}_{\text{nl}})bold_f start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_x , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) =[Hnlq1(q1,p1,𝛉nl)Hnlqd(qd,pd,𝛉nl)]d.absentsuperscriptmatrixsubscript𝐻nlsuperscript𝑞1superscript𝑞1superscript𝑝1subscript𝛉nlsubscript𝐻nlsuperscript𝑞𝑑superscript𝑞𝑑superscript𝑝𝑑subscript𝛉nltopsuperscript𝑑\displaystyle=\begin{bmatrix}\frac{\partial H_{\text{nl}}}{\partial q^{1}}(q^{% 1},p^{1},\bm{\uptheta}_{\text{nl}})\cdots\frac{\partial H_{\text{nl}}}{% \partial q^{d}}(q^{d},p^{d},\bm{\uptheta}_{\text{nl}})\end{bmatrix}^{\top}\in% \mathbb{R}^{d}.= [ start_ARG start_ROW start_CELL divide start_ARG ∂ italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_q start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_ARG ( italic_q start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , italic_p start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) ⋯ divide start_ARG ∂ italic_H start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_q start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT end_ARG ( italic_q start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , italic_p start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT . (49)

Utilizing the explicit forms of 𝐟𝐪subscript𝐟𝐪\mathbf{f}_{\mathbf{q}}bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐟𝐩subscript𝐟𝐩\mathbf{f}_{\mathbf{p}}bold_f start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT, form the nonlinear forcing snapshot matrices

𝐅𝐪(𝐐,𝐏,𝛉nl)=[𝐟𝐪(𝐱1,𝛉nl)𝐟𝐪(𝐱N,𝛉nl)],𝐅𝐩(𝐐,𝐏,𝛉nl)=[𝐟𝐩(𝐱1,𝛉nl)𝐟𝐩(𝐱N,𝛉nl)].formulae-sequencesubscript𝐅𝐪𝐐𝐏subscript𝛉nlmatrixsubscript𝐟𝐪subscript𝐱1subscript𝛉nlsubscript𝐟𝐪subscript𝐱𝑁subscript𝛉nlsubscript𝐅𝐩𝐐𝐏subscript𝛉nlmatrixsubscript𝐟𝐩subscript𝐱1subscript𝛉nlsubscript𝐟𝐩subscript𝐱𝑁subscript𝛉nl\mathbf{F}_{\mathbf{q}}(\mathbf{Q},\mathbf{P},\bm{\uptheta}_{\text{nl}})=% \begin{bmatrix}\mathbf{f}_{\mathbf{q}}(\mathbf{x}_{1},\bm{\uptheta}_{\text{nl}% })\cdots\mathbf{f}_{\mathbf{q}}(\mathbf{x}_{N},\bm{\uptheta}_{\text{nl}})\end{% bmatrix},\qquad\mathbf{F}_{\mathbf{p}}(\mathbf{Q},\mathbf{P},\bm{\uptheta}_{% \text{nl}})=\begin{bmatrix}\mathbf{f}_{\mathbf{p}}(\mathbf{x}_{1},\bm{\uptheta% }_{\text{nl}})\cdots\mathbf{f}_{\mathbf{p}}(\mathbf{x}_{N},\bm{\uptheta}_{% \text{nl}})\end{bmatrix}.bold_F start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_Q , bold_P , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) = [ start_ARG start_ROW start_CELL bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) ⋯ bold_f start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] , bold_F start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_Q , bold_P , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) = [ start_ARG start_ROW start_CELL bold_f start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) ⋯ bold_f start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] . (50)

Next, project the trajectory (20) and nonlinear (50) snapshot matrices onto the symplectic subspace using the projection matrix (21) found with the cotangent lift [55]

𝐐~=𝚽𝐐r×N,𝐏~=𝚽𝐏r×N,𝐅~𝐪=𝚽𝐅𝐪r×N,𝐅~𝐩=𝚽𝐅𝐩r×N.formulae-sequence~𝐐superscript𝚽top𝐐superscript𝑟𝑁~𝐏superscript𝚽top𝐏superscript𝑟𝑁subscript~𝐅𝐪superscript𝚽topsubscript𝐅𝐪superscript𝑟𝑁subscript~𝐅𝐩superscript𝚽topsubscript𝐅𝐩superscript𝑟𝑁\tilde{\mathbf{Q}}={\mathbf{\Phi}}^{\top}\mathbf{Q}\in\mathbb{R}^{r\times N},% \qquad\tilde{\mathbf{P}}={\mathbf{\Phi}}^{\top}\mathbf{P}\in\mathbb{R}^{r% \times N},\qquad\tilde{\mathbf{F}}_{\mathbf{q}}={\mathbf{\Phi}}^{\top}\mathbf{% F}_{\mathbf{q}}\in\mathbb{R}^{r\times N},\qquad\tilde{\mathbf{F}}_{\mathbf{p}}% ={\mathbf{\Phi}}^{\top}\mathbf{F}_{\mathbf{p}}\in\mathbb{R}^{r\times N}.over~ start_ARG bold_Q end_ARG = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT , over~ start_ARG bold_P end_ARG = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_P ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT , over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT , over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT = bold_Φ start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_F start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT . (51)

Additionally, compute the reduced time-derivative data 𝐪~˙˙~𝐪\dot{\tilde{\mathbf{q}}}over˙ start_ARG over~ start_ARG bold_q end_ARG end_ARG and 𝐩~˙˙~𝐩\dot{\tilde{\mathbf{p}}}over˙ start_ARG over~ start_ARG bold_p end_ARG end_ARG from the reduced state trajectory data 𝐐~~𝐐\tilde{\mathbf{Q}}over~ start_ARG bold_Q end_ARG and 𝐏~~𝐏\tilde{\mathbf{P}}over~ start_ARG bold_P end_ARG using a finite difference scheme. Then, organize the time-derivatives into snapshot matrices

𝐐~˙=[𝐪~˙1𝐪~˙N]r×N,𝐏~˙=[𝐩~˙1𝐩~˙N]r×N.formulae-sequence˙~𝐐matrixsubscript˙~𝐪1subscript˙~𝐪𝑁superscript𝑟𝑁˙~𝐏matrixsubscript˙~𝐩1subscript˙~𝐩𝑁superscript𝑟𝑁\dot{\tilde{\mathbf{Q}}}=\begin{bmatrix}\dot{\tilde{\mathbf{q}}}_{1}\cdots\dot% {\tilde{\mathbf{q}}}_{N}\end{bmatrix}\in\mathbb{R}^{r\times N},\quad\dot{% \tilde{\mathbf{P}}}=\begin{bmatrix}\dot{\tilde{\mathbf{p}}}_{1}\cdots\dot{% \tilde{\mathbf{p}}}_{N}\end{bmatrix}\in\mathbb{R}^{r\times N}.over˙ start_ARG over~ start_ARG bold_Q end_ARG end_ARG = [ start_ARG start_ROW start_CELL over˙ start_ARG over~ start_ARG bold_q end_ARG end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ over˙ start_ARG over~ start_ARG bold_q end_ARG end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT , over˙ start_ARG over~ start_ARG bold_P end_ARG end_ARG = [ start_ARG start_ROW start_CELL over˙ start_ARG over~ start_ARG bold_p end_ARG end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ over˙ start_ARG over~ start_ARG bold_p end_ARG end_ARG start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ∈ blackboard_R start_POSTSUPERSCRIPT italic_r × italic_N end_POSTSUPERSCRIPT .

Lastly, infer the reduced operators 𝐃~𝐪(𝛉quad)subscript~𝐃𝐪subscript𝛉quad\tilde{\mathbf{D}}_{\mathbf{q}}(\bm{\uptheta}_{\text{quad}})over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) and 𝐃~𝐩(𝛉quad)subscript~𝐃𝐩subscript𝛉quad\tilde{\mathbf{D}}_{\mathbf{p}}(\bm{\uptheta}_{\text{quad}})over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( bold_θ start_POSTSUBSCRIPT quad end_POSTSUBSCRIPT ) by solving the following constrained operator inference problem

min𝐃~𝐪=𝐃~𝐪,𝐃~𝐩=𝐃~𝐩[𝐐~˙𝐅~𝐪(𝐐~,𝐏~,𝛉nl)𝐏~˙+𝐅~𝐩(𝐐~,𝐏~,𝛉nl)][𝟎𝐃~𝐩𝐃~𝐪𝟎][𝐐~𝐏~]F,\min_{\begin{subarray}{c}\tilde{\mathbf{D}}_{\mathbf{q}}=\tilde{\mathbf{D}}_{% \mathbf{q}}^{\top},\\ \tilde{\mathbf{D}}_{\mathbf{p}}=\tilde{\mathbf{D}}_{\mathbf{p}}^{\top}\end{% subarray}}\bigg{\lVert}\begin{bmatrix}\dot{\tilde{\mathbf{Q}}}-\tilde{\mathbf{% F}}_{\mathbf{q}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl% }})\\ \dot{\tilde{\mathbf{P}}}+\tilde{\mathbf{F}}_{\mathbf{p}}(\tilde{\mathbf{Q}},% \tilde{\mathbf{P}},\bm{\uptheta}_{\text{nl}})\end{bmatrix}-\begin{bmatrix}% \mathbf{0}&\tilde{\mathbf{D}}_{\mathbf{p}}\\ -\tilde{\mathbf{D}}_{\mathbf{q}}&\mathbf{0}\end{bmatrix}\begin{bmatrix}\tilde{% \mathbf{Q}}\\ \tilde{\mathbf{P}}\end{bmatrix}\bigg{\rVert}_{F},roman_min start_POSTSUBSCRIPT start_ARG start_ROW start_CELL over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT = over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , end_CELL end_ROW start_ROW start_CELL over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT = over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG end_POSTSUBSCRIPT ∥ [ start_ARG start_ROW start_CELL over˙ start_ARG over~ start_ARG bold_Q end_ARG end_ARG - over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( over~ start_ARG bold_Q end_ARG , over~ start_ARG bold_P end_ARG , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL over˙ start_ARG over~ start_ARG bold_P end_ARG end_ARG + over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( over~ start_ARG bold_Q end_ARG , over~ start_ARG bold_P end_ARG , bold_θ start_POSTSUBSCRIPT nl end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] - [ start_ARG start_ROW start_CELL bold_0 end_CELL start_CELL over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL - over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT end_CELL start_CELL bold_0 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL over~ start_ARG bold_Q end_ARG end_CELL end_ROW start_ROW start_CELL over~ start_ARG bold_P end_ARG end_CELL end_ROW end_ARG ] ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT , (52)

where Fsubscriptdelimited-∥∥𝐹\lVert\cdot\rVert_{F}∥ ⋅ ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT denotes the Frobenius norm. The symmetric constraints on the reduced operators 𝐃~𝐪subscript~𝐃𝐪\tilde{\mathbf{D}}_{\mathbf{q}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT and 𝐃~𝐩subscript~𝐃𝐩\tilde{\mathbf{D}}_{\mathbf{p}}over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ensure that the ROMs learned via H-OpInf are Hamiltonian. By introducing the terms 𝐑~𝐩=𝐐~˙𝐅~𝐪(𝐐~,𝐏~)subscript~𝐑𝐩˙~𝐐subscript~𝐅𝐪~𝐐~𝐏\tilde{\mathbf{R}}_{\mathbf{p}}=\dot{\tilde{\mathbf{Q}}}-\tilde{\mathbf{F}}_{% \mathbf{q}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}})over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT = over˙ start_ARG over~ start_ARG bold_Q end_ARG end_ARG - over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( over~ start_ARG bold_Q end_ARG , over~ start_ARG bold_P end_ARG ) and 𝐑~𝐪=𝐏~˙𝐅~𝐩(𝐐~,𝐏~)subscript~𝐑𝐪˙~𝐏subscript~𝐅𝐩~𝐐~𝐏\tilde{\mathbf{R}}_{\mathbf{q}}=-\dot{\tilde{\mathbf{P}}}-\tilde{\mathbf{F}}_{% \mathbf{p}}(\tilde{\mathbf{Q}},\tilde{\mathbf{P}})over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT = - over˙ start_ARG over~ start_ARG bold_P end_ARG end_ARG - over~ start_ARG bold_F end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( over~ start_ARG bold_Q end_ARG , over~ start_ARG bold_P end_ARG ), the above inference problem (52) can be reformulated as the Lyapunov equations

(𝐐~𝐐~)𝐃~𝐪+𝐃~𝐪(𝐐~𝐐~)=𝐐~𝐑~𝐪+𝐑~𝐪𝐐~,(𝐏~𝐏~)𝐃~𝐩+𝐃~𝐩(𝐏~𝐏~)=𝐏~𝐑~𝐩+𝐑~𝐩𝐏~,formulae-sequence~𝐐superscript~𝐐topsubscript~𝐃𝐪subscript~𝐃𝐪~𝐐superscript~𝐐top~𝐐superscriptsubscript~𝐑𝐪topsubscript~𝐑𝐪superscript~𝐐top~𝐏superscript~𝐏topsubscript~𝐃𝐩subscript~𝐃𝐩~𝐏superscript~𝐏top~𝐏superscriptsubscript~𝐑𝐩topsubscript~𝐑𝐩superscript~𝐏top(\tilde{\mathbf{Q}}\tilde{\mathbf{Q}}^{\top})\tilde{\mathbf{D}}_{\mathbf{q}}+% \tilde{\mathbf{D}}_{\mathbf{q}}(\tilde{\mathbf{Q}}\tilde{\mathbf{Q}}^{\top})=% \tilde{\mathbf{Q}}\tilde{\mathbf{R}}_{\mathbf{q}}^{\top}+\tilde{\mathbf{R}}_{% \mathbf{q}}\tilde{\mathbf{Q}}^{\top},\qquad(\tilde{\mathbf{P}}\tilde{\mathbf{P% }}^{\top})\tilde{\mathbf{D}}_{\mathbf{p}}+\tilde{\mathbf{D}}_{\mathbf{p}}(% \tilde{\mathbf{P}}\tilde{\mathbf{P}}^{\top})=\tilde{\mathbf{P}}\tilde{\mathbf{% R}}_{\mathbf{p}}^{\top}+\tilde{\mathbf{R}}_{\mathbf{p}}\tilde{\mathbf{P}}^{% \top},( over~ start_ARG bold_Q end_ARG over~ start_ARG bold_Q end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT + over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT ( over~ start_ARG bold_Q end_ARG over~ start_ARG bold_Q end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) = over~ start_ARG bold_Q end_ARG over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT + over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_q end_POSTSUBSCRIPT over~ start_ARG bold_Q end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , ( over~ start_ARG bold_P end_ARG over~ start_ARG bold_P end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT + over~ start_ARG bold_D end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ( over~ start_ARG bold_P end_ARG over~ start_ARG bold_P end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ) = over~ start_ARG bold_P end_ARG over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT + over~ start_ARG bold_R end_ARG start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT over~ start_ARG bold_P end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT , (53)

which can be solved with off-the-shelf Lyapunov solvers.

The constrained operator inference problem in Eq. (52) enforces structure preservation at the reduced level to learn nonintrusive Hamiltonian ROMs that conserve the reduced Hamiltonian function H~(𝐪~,𝐩~)~𝐻~𝐪~𝐩\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})over~ start_ARG italic_H end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG ) (23). The authors in [44] have shown that the reduced Hamiltonian function H~(𝐪~,𝐩~)~𝐻~𝐪~𝐩\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})over~ start_ARG italic_H end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG ) can be interpreted as a perturbation of the FOM Hamiltonian H𝐻Hitalic_H, i.e. H~(𝐪~,𝐩~)=H(Φ𝐪~,Φ𝐪~)+ΔH(𝐪~,𝐩~)~𝐻~𝐪~𝐩𝐻Φ~𝐪Φ~𝐪Δ𝐻~𝐪~𝐩\tilde{H}(\tilde{\mathbf{q}},\tilde{\mathbf{p}})=H(\Phi\tilde{\mathbf{q}},\Phi% \tilde{\mathbf{q}})+\Delta H(\tilde{\mathbf{q}},\tilde{\mathbf{p}})over~ start_ARG italic_H end_ARG ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG ) = italic_H ( roman_Φ over~ start_ARG bold_q end_ARG , roman_Φ over~ start_ARG bold_q end_ARG ) + roman_Δ italic_H ( over~ start_ARG bold_q end_ARG , over~ start_ARG bold_p end_ARG ). Consequently, the H-OpInf ROM yields approximate FOM trajectories, i.e. 𝐪approx=Φ𝐪~subscript𝐪approxΦ~𝐪\mathbf{q}_{\text{approx}}=\Phi\tilde{\mathbf{q}}bold_q start_POSTSUBSCRIPT approx end_POSTSUBSCRIPT = roman_Φ over~ start_ARG bold_q end_ARG and 𝐩approx=Φ𝐩~subscript𝐩approxΦ~𝐩\mathbf{p}_{\text{approx}}=\Phi\tilde{\mathbf{p}}bold_p start_POSTSUBSCRIPT approx end_POSTSUBSCRIPT = roman_Φ over~ start_ARG bold_p end_ARG, that track the FOM solution trajectories accurately while also conserving a perturbed FOM Hamiltonian which yields bounded FOM energy error.

References

  • [1] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning, Nature Reviews Physics 3 (6) (2021) 422–440.
  • [2] M. Raissi, P. Perdikaris, G. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics 378 (2019) 686–707. doi:https://doi.org/10.1016/j.jcp.2018.10.045.
  • [3] K. Linka, A. Schäfer, X. Meng, Z. Zou, G. E. Karniadakis, E. Kuhl, Bayesian physics informed neural networks for real-world nonlinear dynamical systems, Computer Methods in Applied Mechanics and Engineering 402 (2022) 115346.
  • [4] L. Zhang, X.-D. Zhang, An optimal filtering algorithm for systems with multiplicative/additive noises, IEEE Signal Processing Letters 14 (7) (2007) 469–472.
  • [5] A. Daw, R. Q. Thomas, C. C. Carey, J. S. Read, A. P. Appling, A. Karpatne, Physics-guided architecture (PGA) of neural networks for quantifying uncertainty in lake temperature modeling, in: Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), 2020, pp. 532–540. doi:10.1137/1.9781611976236.60.
  • [6] S. Greydanus, M. Dzamba, J. Yosinski, Hamiltonian neural networks, Advances in Neural Information Processing Systems 32 (2019).
  • [7] Z. Chen, J. Zhang, M. Arjovsky, L. Bottou, Symplectic recurrent neural networks, in: International Conference on Learning Representations, 2020.
  • [8] P. Jin, Z. Zhang, A. Zhu, Y. Tang, G. E. Karniadakis, SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems, Neural Networks 132 (2020) 166–179.
  • [9] S. Saemundsson, A. Terenin, K. Hofmann, M. Deisenroth, Variational integrator networks for physically structured embeddings, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 3078–3087.
  • [10] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, S. Ho, Lagrangian neural networks, in: ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2019.
  • [11] M. Lutter, C. Ritter, J. Peters, Deep Lagrangian networks: Using physics as model prior for deep learning, in: International Conference on Learning Representations, 2019.
  • [12] M. A. Roehrl, T. A. Runkler, V. Brandtstetter, M. Tokic, S. Obermayer, Modeling system dynamics with physics-informed neural networks based on Lagrangian mechanics, IFAC-PapersOnLine 53 (2) (2020) 9195–9200, 21st IFAC World Congress. doi:https://doi.org/10.1016/j.ifacol.2020.12.2182.
  • [13] S. Mallat, Understanding deep convolutional networks, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 (2065) (2016) 20150203.
  • [14] H. Sheng, C. Yang, PFNN: A penalty-free neural network method for solving a class of second-order boundary-value problems on complex geometries, Journal of Computational Physics 428 (2021) 110085. doi:https://doi.org/10.1016/j.jcp.2020.110085.
  • [15] N. Galioto, A. A. Gorodetsky, Bayesian system ID: Optimal management of parameter, model, and measurement uncertainty, Nonlinear Dynamics 102 (1) (2020) 241–267.
  • [16] N. Galioto, A. A. Gorodetsky, Likelihood-based generalization of markov parameter estimation and multiple shooting objectives in system identification, Physica D: Nonlinear Phenomena 462 (2024) 134146.
  • [17] H. Sharma, N. Galioto, A. A. Gorodetsky, B. Kramer, Bayesian identification of nonseparable Hamiltonian systems using stochastic dynamic models, in: 2022 IEEE 61st Conference on Decision and Control (CDC), IEEE, 2022, pp. 6742–6749.
  • [18] T. M. Hamill, J. S. Whitaker, C. Snyder, Distance-dependent filtering of background error covariance estimates in an ensemble Kalman filter, Monthly Weather Review 129 (11) (2001) 2776–2790.
  • [19] P. Rajasekaran, N. Satyanarayana, M. Srinath, Optimum linear estimation of stochastic signals in the presence of multiplicative noise, IEEE Transactions on Aerospace and Electronic Systems AES-7 (3) (1971) 462–468.
  • [20] W. Liu, Optimal filtering for discrete-time linear systems with time-correlated multiplicative measurement noises, IEEE Transactions on Automatic Control 61 (7) (2015) 1972–1978.
  • [21] F. Wang, V. Balakrishnan, Robust Kalman filters for linear time-varying systems with stochastic parametric uncertainties, IEEE Transactions on Signal Processing 50 (4) (2002) 803–813.
  • [22] B. Chow, W. Birkemeier, A new recursive filter for systems with multiplicative noise, IEEE Transactions on Information Theory 36 (6) (1990) 1430–1435. doi:10.1109/18.59939.
  • [23] F. Yang, Z. Wang, Y. Hung, Robust Kalman filtering for discrete time-varying uncertain systems with multiplicative noises, IEEE Transactions on Automatic Control 47 (7) (2002) 1179–1183. doi:10.1109/TAC.2002.800668.
  • [24] X. Kai, L. Liangdong, L. Yiwu, Robust extended Kalman filtering for nonlinear systems with multiplicative noises, Optimal Control Applications and Methods 32 (1) (2011) 47–63.
  • [25] G. Berkooz, P. Holmes, J. L. Lumley, The proper orthogonal decomposition in the analysis of turbulent flows, Annual review of fluid mechanics 25 (1) (1993) 539–575.
  • [26] G. Rozza, D. B. P. Huynh, A. T. Patera, Reduced basis approximation and a posteriori error estimation for affinely parametrized elliptic coercive partial differential equations: Application to transport and continuum mechanics, Archives of Computational Methods in Engineering 15 (3) (2008) 229–275.
  • [27] M. Guo, J. S. Hesthaven, Data-driven reduced order modeling for time-dependent problems, Computer methods in applied mechanics and engineering 345 (2019) 75–99.
  • [28] Y. Kim, Y. Choi, D. Widemann, T. Zohdi, A fast and accurate physics-informed neural network reduced order model with shallow masked autoencoder, Journal of Computational Physics 451 (2022) 110841. doi:https://doi.org/10.1016/j.jcp.2021.110841.
  • [29] K. Lee, K. T. Carlberg, Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders, Journal of Computational Physics 404 (2020) 108973. doi:https://doi.org/10.1016/j.jcp.2019.108973.
  • [30] H. Sharma, H. Mu, P. Buchfink, R. Geelen, S. Glas, B. Kramer, Symplectic model reduction of Hamiltonian systems using data-driven quadratic manifolds, Computer Methods in Applied Mechanics and Engineering 417 (2023) 116402. doi:https://doi.org/10.1016/j.cma.2023.116402.
  • [31] R. Geelen, S. Wright, K. Willcox, Operator inference for non-intrusive model reduction with quadratic manifolds, Computer Methods in Applied Mechanics and Engineering 403 (2023) 115717.
  • [32] D. Serra, F. Ruggiero, A. Donaire, L. R. Buonocore, V. Lippiello, B. Siciliano, Control of nonprehensile planar rolling manipulation: A passivity-based approach, IEEE Transactions on Robotics 35 (2) (2019) 317–329.
  • [33] G. Li, S. Naoz, M. Holman, A. Loeb, Chaos in the test particle eccentric Kozai–Lidov mechanism, The Astrophysical Journal 791 (2) (2014) 86.
  • [34] E. Forest, Geometric integration for particle accelerators, Journal of Physics A: Mathematical and General 39 (19) (2006) 5321.
  • [35] R. Salmon, Hamiltonian fluid mechanics, Annual Review of Fluid Mechanics 20 (1) (1988) 225–256.
  • [36] J. Colliander, M. Keel, G. Staffilani, H. Takaoka, T. Tao, Transfer of energy to high frequencies in the cubic defocusing nonlinear Schrödinger equation, Inventiones Mathematicae 181 (1) (2010) 39–113.
  • [37] S. Särkkä, Bayesian filtering and smoothing, Cambridge University Press, 2013.
  • [38] C. Andrieu, A. Doucet, R. Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society Series B: Statistical Methodology 72 (3) (2010) 269–342.
  • [39] K. Wu, T. Qin, D. Xiu, Structure-preserving method for reconstructing unknown Hamiltonian systems from trajectory data, SIAM Journal on Scientific Computing 42 (6) (2020) A3704–A3729.
  • [40] S. Xiong, Y. Tong, X. He, S. Yang, C. Yang, B. Zhu, Nonseparable symplectic neural networks, in: International Conference on Learning Representations, 2021.
  • [41] N. Galioto, A. A. Gorodetsky, Bayesian identification of Hamiltonian dynamics from symplectic data, in: 2020 59th IEEE Conference on Decision and Control (CDC), IEEE, 2020, pp. 1190–1195.
  • [42] M. David, F. Méhats, Symplectic learning for Hamiltonian neural networks, Journal of Computational Physics 494 (2023) 112495. doi:https://doi.org/10.1016/j.jcp.2023.112495.
  • [43] M. Tao, Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance, Physical Review E 94 (4) (2016) 043303.
  • [44] H. Sharma, Z. Wang, B. Kramer, Hamiltonian operator inference: Physics-preserving learning of reduced-order models for canonical Hamiltonian systems, Physica D: Nonlinear Phenomena 431 (2022) 133122.
  • [45] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, 2015.
  • [46] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
  • [47] A. S. Krishnapriyan, A. F. Queiruga, N. B. Erichson, M. W. Mahoney, Learning continuous models for continuous physics, Communications Physics 6 (1) (2023) 319.
  • [48] A. Olivier, M. D. Shields, L. Graham-Brady, Bayesian neural networks for uncertainty quantification in data-driven materials modeling, Computer methods in applied mechanics and engineering 386 (2021) 114079.
  • [49] Z.-Z. Lan, B.-L. Guo, Nonlinear waves behaviors for a coupled generalized nonlinear Schrödinger–Boussinesq system in a homogeneous magnetized plasma, Nonlinear Dynamics 100 (2020) 3771–3784.
  • [50] Z. Yan, Generalized method and its application in the higher-order nonlinear Schrödinger equation in nonlinear optical fibres, Chaos, Solitons & Fractals 16 (5) (2003) 759–766.
  • [51] V. N. Serkin, A. Hasegawa, Exactly integrable nonlinear Schrodinger equation models with varying dispersion, nonlinearity and gain: Application for soliton dispersion, IEEE Journal of selected topics in Quantum Electronics 8 (3) (2002) 418–431.
  • [52] N. Akhmediev, A. Ankiewicz, J. M. Soto-Crespo, Rogue waves and rational solutions of the nonlinear Schrödinger equation, Physical Review E 80 (2) (2009) 026601.
  • [53] F. Copie, S. Randoux, P. Suret, The physics of the one-dimensional nonlinear Schrödinger equation in fiber optics: Rogue waves, modulation instability and self-focusing phenomena, Reviews in Physics 5 (2020) 100037.
  • [54] H. Haario, M. Laine, A. Mira, E. Saksman, DRAM: Efficient adaptive MCMC, Statistics and computing 16 (2006) 339–354.
  • [55] L. Peng, K. Mohseni, Symplectic model reduction of Hamiltonian systems, SIAM Journal on Scientific Computing 38 (1) (2016) A1–A27.