Numerical analysis on Neural network projected schemes for approximating one dimensional Wasserstein Gradient flows

Xinzhe Zuo zxz@math.ucla.edu Department of Mathematics, University of California, Los Angeles, CA, 90095. , Jiaxi Zhao jiaxi.zhao@u.nus.edu Department of Mathematics, National University of Singapore. , Shu Liu shuliu@math.ucla.edu Department of Mathematics, University of California, Los Angeles, CA, 90095. , Stanley Osher sjo@math.ucla.edu Department of Mathematics, University of California, Los Angeles, CA, 90095. and Wuchen Li wuchen@mailbox.sc.edu Department of Mathematics, University of South Carolina, Columbia, SC, 29208.

Abstract.

We provide a numerical analysis and computation of neural network projected schemes for approximating one dimensional Wasserstein gradient flows. We approximate the Lagrangian mapping functions of gradient flows by the class of two-layer neural network functions with ReLU (rectified linear unit) activation functions. The numerical scheme is based on a projected gradient method, namely the Wasserstein natural gradient, where the projection is constructed from the $L^{2}$ mapping spaces onto the neural network parameterized mapping space. We establish theoretical guarantees for the performance of the neural projected dynamics. We derive a closed-form update for the scheme with well-posedness and explicit consistency guarantee for a particular choice of network structure. General truncation error analysis is also established on the basis of the projective nature of the dynamics. Numerical examples, including gradient drift Fokker-Planck equations, porous medium equations, and Keller-Segel models, verify the accuracy and effectiveness of the proposed neural projected algorithm.

Key words and phrases:

Optimal transport; Information Geometry; Natural gradient; Neural network functions; Convergence analysis.

Xinzhe Zuo and Jiaxi Zhao contributed equally. Jiaxi Zhao, Xinzhe Zuo, and Shu Liu are partially supported by AFOSR YIP award No. FA9550-23-1-008; Xinzhe Zuo, Shu Liu, and Stanley Osher are partially funded by AFOSR MURI FA9550-18-502 and ONR N00014-20-1-2787; Wuchen Li’s work is partially supported by AFOSR YIP award No. FA9550-23-1-008, NSF DMS-2245097, and NSF RTG: 2038080.

1. Introduction

Simulating gradient flows of free energies is a central problem in the computational physics of complex systems [8] and data science [1, 2]. In physics, gradient flows often arise from first-order principles, such as the Onsager principle [32]. The Onsager gradient flows are widely used in phase fields, chemistry, and biology modeling. In recent years, a particular type of Onsager gradient flow, known as Wasserstein gradient flow, has been widely studied in optimal transport communities [3, 33, 37]. It studies an infinite-dimensional pseudo-Riemannian metric in the probability distribution space known as the density manifold. The gradient flow in the Wasserstein space naturally captures the free energy dissipation properties. Depending on the choices of free energies, the Wasserstein gradient flow contains a vast class of differential equations, such as gradient drift Fokker-Planck equations, porous medium equations, and Keller-Segel models. These models are widely used in population dynamics and sampling-related optimization problems.

In recent years, machine learning has brought a class of new methods in computational physics, where free energies are identified with the loss functions [11, 30]. Meanwhile, computing Wasserstein gradient flows of loss functions in terms of samples also finds various applications, such as generative artificial intelligence [4] and transport map-based sampling methods [35]. In these applications, one often relies on Lagrangian mapping functions to describe the Wasserstein gradient flows and on deep neural networks to approximate the mapping functions, owing to the high expressivity and adaptivity of their compositional structure. While empirical successes of this framework have been observed in various applications [35, 4], very few theoretical results exist to explain the underlying mechanism.

Moreover, projected dynamics in neural network space are widely used to approximate Wasserstein gradient flows [14, 25]. These dynamics restrict the space of probabilities onto a finite-dimensional subspace parameterized by neural network mapping functions. For this reason, we call it the neural projected gradient dynamics. This approach originates from the natural gradient method in information geometry [1] and extends the framework set by [20]. Some basic questions about its accuracy and efficiency remain: Even in one-dimensional space, how well do the neural projected dynamics approximate the Wasserstein gradient flow? What is the accuracy of the neural network approximation in Lagrangian mapping functions?

In this paper, we study the numerical analysis and computation of neural network projected schemes for one-dimensional Wasserstein gradient flows. The main results are sketched below. By formulating gradient flows in Lagrangian coordinates, the proposed numerical scheme takes the form of a ‘preconditioned’ gradient descent, where the preconditioner is the metric tensor of the statistical manifold of the parameter space. Theoretically, we first derive the analytic solution for the inverse neural mapping metric in theorem 2, based on a special class of ReLU networks. We use the analytic form of the projected gradient flow formula to prove the consistency of the numerical scheme. Then, we prove in theorem 3 that the numerical schemes derived from the neural projected dynamics are of first- or second-order consistency for general Wasserstein gradient directions. These include the cases of the heat flow and the Fokker-Planck equation. Furthermore, viewing our neural network model as a moving mesh method, we show in proposition 7 that the mesh will not degenerate during the simulation.

In numerics, the advantages of the proposed method are twofold. First, using a two-layer neural network as our basis function, the proposed method can be regarded as a ‘moving-mesh’ method in Lagrangian coordinates, which demonstrates very promising performance even when the number of parameters of the neural network is very limited. In particular, our numerical examples can achieve an accuracy of $10^{-3}$ with fewer than 100 neurons. Second, using the Wasserstein gradient flow formulation, the proposed method is very easy to implement since it can make use of the automatic differentiation feature of popular machine learning libraries such as PyTorch.

Nowadays, the computation of Wasserstein gradient flows (WGFs) has attracted great interest from researchers in various communities such as mathematics, physics, statistics, and machine learning. Classical numerical methods [9] have been introduced to directly evaluate the probability density function. Recently, algorithms that approximate the Lagrangian mapping functions associated with WGFs have been invented. We refer the readers to [8] and references therein for related discussions. These treatments automatically preserve non-negativity and total mass. Together with the fast-developing deep learning techniques, they have inspired a series of studies on scalable, sampling-friendly computational methods for WGFs in higher-dimensional spaces [25, 27, 13, 16, 18]. Recently, deep learning-based algorithms for computing the Lagrangian coordinates of the Wasserstein Hamiltonian flows, or more generally mean field control problems, have also been introduced in [38, 28, 34].

Our treatment of projecting the WGFs onto the parameter space is also known as the natural gradient method, which was first introduced in [1] (w.r.t. the Fisher-Rao metric) and [10] (w.r.t. the Wasserstein metric). Here the projected matrix is often called the information matrix, namely the Fisher information matrix or the Wasserstein information matrix, depending on the metric used in probability space. This method has recently found applications in large-scale optimization problems [29]. In recent research [12, 7, 14], the authors aim to calculate general evolution equations by directly leveraging the neural network representation of the time-dependent solution. They transfer the evolution of the equation from the function space to the parameter space of the neural network to obtain a finite-dimensional ordinary differential equation, which can be readily integrated via Runge-Kutta solvers. Numerical properties of the ReLU neural network families have been investigated in [15].

Compared to previous studies, we study the numerical analysis of neural network projected dynamics for approximating WGFs. In one-dimensional space, we provide the error analysis for the neural projected dynamics with a two-layer neural network. We numerically verify the proposed error analysis. In particular, we formulate a class of explicit schemes from the neural network projected dynamics. This study continues the study of the Wasserstein information matrix on neural network models; see related discussions in [22, 21, 25].

The paper is organized as follows. In Section 2, we briefly review the formulation of Wasserstein gradient flows of free energies in both Eulerian and Lagrangian coordinates. We formulate the projected Wasserstein gradient flows over neural network models in Section 3. In Section 4, we conduct the numerical analysis of the proposed neural projected dynamics in two-layer neural network functions. In Section 5, we verify the accuracy of the proposed algorithm with numerical examples in Fokker-Planck equations, porous medium equations, and Keller-Segel models.

2. Review of Wasserstein gradient flows and Lagrangian coordinates

In this section, we prepare the theoretical foundations of Wasserstein gradient flows with a focus on Lagrangian description (diffeomorphism mapping functions) and the associated microscopic particle dynamics. See details in [3, 37].

2.1. Wasserstein gradient flows

Suppose $\Omega$ is a domain in the Euclidean space $\mathbb{R}^{d}$ . Denote the probability space

\mathcal{P}(\Omega)=\left\{p(\cdot)\in C^{\infty}:~\int_{\Omega}p(x)dx=1,\quad p(\cdot)\geq 0\right\}.

Given an energy functional $\mathcal{F}(\cdot):\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ , we consider the following evolution equation associated with $\mathcal{F}(\cdot)$ ,

\partial_{t}p(t,x)=\nabla_{x}\cdot\Big(p(t,x)\nabla_{x}\frac{\delta}{\delta p}\mathcal{F}(p)\Big),\quad p(0,\cdot)=p_{0},

(1)

with Neumann boundary condition $p(t,x)\nabla_{x}\frac{\delta}{\delta p}\mathcal{F}(p)\cdot\bm{n}=0$ , where $\bm{n}$ is the outward-pointing normal vector on the boundary $\partial\Omega$ . Here $\frac{\delta}{\delta p}$ is the $L^{2}$ first variation operator w.r.t. the density variable $p$ . The mass of $p(t,\cdot)$ is conserved and always equals $1$ . An important fact about (1) is that this equation can be treated as the gradient flow of $\mathcal{F}$ on $\mathcal{P}(\Omega)$ . To be more specific, by endowing the probability space $\mathcal{P}(\Omega)$ with the $L^{2}$ Wasserstein metric $g_{W}$ , we can view $(\mathcal{P}(\Omega),g_{W})$ as a Riemannian manifold, and (1) is the gradient flow on this manifold with respect to $g_{W}$ .

Let us briefly review several facts. We first define the metric $g_{W}$ at arbitrary $p\in\mathcal{P}(\Omega)$ . Tangent vectors at $p$ are identified via the continuity equation, whose driving vector field belongs to the closure of all gradient fields $\nabla_{x}\psi:\Omega\rightarrow\mathbb{R}^{d}$ with $\psi\in C^{\infty}(\Omega)$ in the $L^{2}(p)$ -norm. Consider two smooth curves $\{p_{i}(t,\cdot)\}_{t\in(-\epsilon,\epsilon)}$ ( $i=1,2$ ) on $\mathcal{P}(\Omega)$ passing through $p$ at $t=0$ . Suppose the probability evolution $p_{i}(t,\cdot)$ is driven by the gradient field $\nabla_{x}\psi_{i}(\cdot)$ at $t=0$ , i.e., $\psi_{i}(\cdot)$ solves

\partial_{t}p_{i}(0,x)+\nabla_{x}\cdot(p_{i}(0,x)\nabla_{x}\psi_{i}(x))=0,\quad i=1,2.

We define the $L^{2}$ Wasserstein metric $g_{W}(\cdot,\cdot)$ at $p$ as a symmetric, positive-definite bilinear form,

g_{W}(\partial_{t}p_{1}(0,\cdot),\partial_{t}p_{2}(0,\cdot))=\int_{\Omega}\nabla_{x}\psi_{1}(x)\cdot\nabla_{x}\psi_{2}(x)p(x)~dx.

Recall the definition of the gradient of a smooth function $f$ on a Riemannian manifold $(M,g)$ as

g(\mathrm{grad}f(x),\dot{x}(0))=\frac{d}{dt}f(x(t))\Big|_{t=0},

for any smooth curve $\{x(t)\}_{t\in(-\epsilon,\epsilon)}$ passing through $x$ at $t=0$ . Switching back to our case, for the functional $\mathcal{F}$ defined on $(\mathcal{P}(\Omega),g_{W})$ , we define the gradient of $\mathcal{F}$ w.r.t. the Wasserstein metric $g_{W}$ at $p$ as

g_{W}(\textrm{grad}_{W}\mathcal{F}(p),\partial_{t}p(0,\cdot))=\frac{d}{dt}\mathcal{F}(p(t,\cdot))\Bigg|_{t=0}.

Here $\{p(t,\cdot)\}_{t\in(-\epsilon,\epsilon)}$ is an arbitrary curve on $\mathcal{P}(\Omega)$ with $p(0,\cdot)=p(\cdot)$ . Suppose $p(t,\cdot)$ is guided by the gradient field $\nabla_{x}\psi$ at time $t=0$ . Then the right-hand side can be computed as

\begin{split}\frac{d}{dt}\mathcal{F}(p(t,\cdot))\Big|_{t=0}=&\int_{\Omega}\frac{\delta\mathcal{F}(p(0,\cdot))}{\delta p}(x)\partial_{t}p(0,x)~dx\\ =&\int_{\Omega}\frac{\delta\mathcal{F}(p)}{\delta p}(x)(-\nabla_{x}\cdot(p(x)\nabla_{x}\psi(x)))~dx\\ =&\int_{\Omega}\nabla_{x}\frac{\delta\mathcal{F}(p)}{\delta p}(x)\cdot\nabla_{x}\psi(x)p(x)~dx.\end{split}

The last equality follows from integration by parts, with the boundary term vanishing under the no-flux boundary condition.

Recalling the definition of the metric $g_{W}$ , it is not difficult to verify that the gradient field associated with $\mathrm{grad}_{W}\mathcal{F}(p)$ is $\nabla_{x}\frac{\delta}{\delta p}\mathcal{F}(p)$ . Thus,

\mathrm{grad}_{W}\mathcal{F}(p)=-\nabla_{x}\cdot\Big(p(x)\nabla_{x}\frac{\delta\mathcal{F}(p)}{\delta p}(x)\Big),

and the Wasserstein gradient flow $\partial_{t}p=-\mathrm{grad}_{W}\mathcal{F}(p)$ can be formulated as equation (1).

We provide several examples of WGFs. In these examples, we assume $\Omega=\mathbb{R}^{d}$ .

•

(Fokker-Planck equation) Consider

\mathcal{F}(p)=\int_{\Omega}V(x)p(x)dx+\gamma\int_{\Omega}p(x)\log p(x)dx.

Then the Wasserstein gradient of $\mathcal{F}$ equals

\begin{split}\mathrm{grad}_{W}\mathcal{F}(p)=&-\nabla_{x}\cdot(p(x)\nabla_{x}(V(x)+\gamma(\log p(x)+1)))\\ =&-\nabla_{x}\cdot(p(x)\nabla_{x}V(x))-\gamma\Delta_{x}p(x).\end{split}

The corresponding WGF is the Fokker-Planck equation

\partial_{t}p(t,x)=\nabla_{x}\cdot(p(t,x)\nabla_{x}V(x))+\gamma\Delta_{x}p(t,x).

(2)

•

(Porous medium equation) Consider

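\mathcal{F}(p)=\frac{1}{m-1}\int_{\Omega}p(x)^{m}~dx.

Since $\frac{\delta}{\delta p}\mathcal{F}(p)=\frac{m}{m-1}p^{m-1}$ , one computes

\mathrm{grad}_{W}\mathcal{F}(p)=-\nabla_{x}\cdot\Big(p(x)\nabla_{x}\Big(\frac{m}{m-1}p(x)^{m-1}\Big)\Big)=-\nabla_{x}\cdot(\nabla_{x}(p(x)^{m}))=-\Delta_{x}p(x)^{m}.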

Thus, the corresponding WGF yields the porous medium equation

\partial_{t}p(t,x)=\Delta_{x}p(t,x)^{m}.

(3)

•

(Keller-Segel equation) Another well-known WGF is obtained by choosing $\mathcal{F}$ as the sum of the internal energy and the interaction energy

\mathcal{F}(p)=\int_{\Omega}U(p(x))~dx+\frac{1}{2}\iint_{\Omega\times\Omega}W(|x-y|)p(x)p(y)~dxdy,

where $U$ is a certain smooth function defined on $\mathbb{R}_{+}$ , and $W(\cdot)\in C(\mathbb{R}_{+};\mathbb{R})$ is a kernel function.

We calculate

\mathrm{grad}_{W}\mathcal{F}(p)=-\nabla_{x}\cdot(p(x)\nabla_{x}(U^{\prime}(p(x))+W*p(x))),

where we denote the convolution $W*p(x)=\int_{\Omega}W(|x-y|)p(y)~dy$ . The WGF associated with this functional is the Keller-Segel equation

\partial_{t}p(t,x)=\nabla_{x}\cdot(p(t,x)\nabla_{x}U^{\prime}(p(t,x)))+\nabla_{x}\cdot(p(t,x)\nabla_{x}(W*p(t,x))).

(4)

2.2. Lagrangian coordinates & Particle dynamics

Consider a mapping function $T\colon Z\rightarrow\Omega$ . Here $Z$ is an input space and $\Omega\subset\mathbb{R}^{d}$ is the domain on which the WGF is defined. To simplify our discussion, we assume $Z=\Omega$ . Let us further assume $T\in C^{\infty}(Z,\Omega)$ , and that the Jacobian matrix $D_{z}T(z)$ is non-singular for all $z\in Z$ , i.e., $\mathrm{det}(D_{z}T(z))\neq 0$ on $Z$ . This guarantees that $T$ is locally injective (and injective when $d=1$ and $Z$ is an interval); we take $T$ to be injective throughout. Given a smooth reference probability density $p_{\mathrm{r}}\in\mathcal{P}(Z)$ , we denote the pushforward of $p_{r}$ by $T$ as

p=T_{\#}p_{r},

where $T_{\#}:\mathcal{P}(Z)\rightarrow\mathcal{P}(\Omega)$ is the pushforward operator defined as

\int_{\Omega}f(x)T_{\#}p_{r}(x)~dx=\int_{Z}f(T(z))p_{r}(z)~dz,\quad\textrm{for all }f\textrm{ such that }f\circ T\in L^{1}(p_{r}).

The density function of $p$ satisfies

p(T(z))\mathrm{det}(D_{z}T(z))=p_{\mathrm{r}}(z)\quad\forall~z\in Z,\quad\textrm{i.e.,}~~p(x)=\frac{p_{r}}{\mathrm{det}(D_{z}T)}\circ T^{-1}(x)\quad\forall~x\in\Omega.

(5)
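As an illustration of formula (5) in one dimension, the following minimal sketch pushes a standard Gaussian reference density through an affine map and compares the result with the known closed form. The map $T(z)=az+b$ , the values of `a` and `b`, and the Gaussian reference are assumptions chosen for this example, not choices made in the text.

```python
import math

def gaussian_pdf(x, mean=0.0, std=1.0):
    """Density of the normal distribution N(mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical affine Lagrangian map T(z) = a*z + b with a > 0 (illustration only).
a, b = 2.0, 1.0

def T_inv(x):
    return (x - b) / a

def dT(z):
    return a  # in 1D the Jacobian determinant is the derivative T'(z)

def pushforward_density(x):
    """Evaluate p(x) = p_r(T^{-1}(x)) / T'(T^{-1}(x)), i.e. formula (5) with d = 1."""
    z = T_inv(x)
    return gaussian_pdf(z) / dT(z)

# The pushforward of N(0, 1) under z -> a*z + b is N(b, a^2).
xs = [-3.0, 0.0, 1.0, 2.5]
errors = [abs(pushforward_density(x) - gaussian_pdf(x, mean=b, std=a)) for x in xs]
```

The list `errors` collects the pointwise discrepancy between formula (5) and the closed-form density, which should vanish up to rounding.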

Such pushforward map $T$ used for constructing probability distribution $p$ is usually called the Lagrangian coordinate. We now imitate the derivation of the WGF to help formulate its counterpart under the Lagrangian coordinate.

We denote by $\mathcal{O}$ the space of smooth, $L^{2}(p_{r})$ -integrable pushforward maps with non-zero Jacobian determinant, i.e.,

\mathcal{O}=\left\{T\in C^{\infty}(Z,\Omega)~:~\mathrm{det}(D_{z}T)\neq 0,~~\int_{Z}|T(z)|^{2}p_{r}(z)~dz<\infty\right\}.

Then the pushforward operation $\#:\mathcal{O}\rightarrow\mathcal{P}(\Omega)$ introduces a submersion from the space of pushforward maps (diffeomorphisms) to the space of probability densities.

In order to derive the Wasserstein gradient flows (WGFs) on the space $\mathcal{O}$ of pushforward maps instead of the probability space $\mathcal{P}(\Omega)$ , we first build a metric $\langle\cdot,\cdot\rangle$ on $\mathcal{O}$ that corresponds to the Wasserstein metric $g_{W}$ . As illustrated in [33], $g_{W}$ is obtained by pulling back the $L^{2}(p_{r})$ norm on $\mathcal{O}$ via submersion $\#$ . Thus, a way of choosing the metric is

\langle\mathbf{u}_{1},\mathbf{u}_{2}\rangle=\int_{Z}\mathbf{u}_{1}(z)\cdot\mathbf{u}_{2}(z)p_{r}(z)~dz,\quad\forall~\mathbf{u}_{1},\mathbf{u}_{2}\in L^{2}(p_{r})~\bigcap~C^{\infty}(Z,\Omega).

Now for any smooth functional $\mathcal{F}:\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ , the composition $\mathcal{F}^{\#}\triangleq\mathcal{F}\circ\#:\mathcal{O}\rightarrow\mathbb{R}$ defines its corresponding functional on $\mathcal{O}$ . Following arguments similar to those in Section 2.1, we compute the gradient of $\mathcal{F}^{\#}$ with respect to the metric $\langle\cdot,\cdot\rangle$ as

\mathrm{grad}_{\langle\cdot,\cdot\rangle}\mathcal{F}^{\#}(T)=\frac{1}{p_{\mathrm{r}}(\cdot)}\frac{\delta\mathcal{F}^{\#}(T)}{\delta T}(\cdot).

Here, $\frac{\delta}{\delta T}$ is the $L^{2}(m)$ ( $m$ denotes the Lebesgue measure) first variation w.r.t. the pushforward map $T$ .

Thus, the gradient flow of $\mathcal{F}^{\#}$ on $\mathcal{O}$ is formulated as

\partial_{t}T(t,\cdot)=-\mathrm{grad}_{\langle\cdot,\cdot\rangle}\mathcal{F}^{\#}(T(t,\cdot))=-\frac{1}{p_{\mathrm{r}}(\cdot)}\frac{\delta\mathcal{F}^{\#}(T(t,\cdot))}{\delta T}(\cdot).

The variation $\frac{\delta}{\delta T}$ is calculated as

\frac{\delta\mathcal{F}^{\#}(T)}{\delta T}(z)=\left(\nabla_{x}\frac{\delta\mathcal{F}(T_{\#}p_{r})}{\delta p}\right)\circ T(z)p_{r}(z).

The above equation can also be written as

\partial_{t}T(t,z)=-\left(\nabla_{x}\frac{\delta\mathcal{F}(T(t,\cdot)_{\#}p_{r})}{\delta p}\right)\circ T(t,z).

(6)

If we denote $p(t,\cdot)=T(t,\cdot)_{\#}p_{r}$ , one can verify that $p(t,\cdot)$ exactly solves equation (1) for WGF with $p_{0}=T(0,\cdot)_{\#}p_{r}$ , which justifies the equivalence between the gradient flow (6) in Lagrangian coordinates (i.e., the map $T(t,\cdot)$ ) and the WGF (1) expressed by using Eulerian coordinate (i.e., the density function $p(t,\cdot)$ ).

Such gradient flow (6) on the space of diffeomorphisms also forms a microscopic picture of particle dynamics of the WGF (1). For any random reference sample $z\sim p_{r}$ , by setting $\mathbf{x}_{t}=T(t,z)$ , it is not hard to verify that $\mathbf{x}_{t}$ evolves w.r.t. the dynamic

\frac{d\mathbf{x}_{t}}{dt}=-\left(\nabla_{x}\frac{\delta}{\delta p}\mathcal{F}(p_{t})\right)(\mathbf{x}_{t}),\quad\mathbf{x}_{0}=T(0,z),\quad z\sim p_{r}.

(7)

Here we denote $p_{t}=T(t,\cdot)_{\#}p_{r}$ . $p_{t}$ can be equivalently treated as the probability density of the random particle $\mathbf{x}_{t}$ . In this dynamic, the movement of a single agent $\mathbf{x}_{t}$ is determined by the instant population density $p_{t}$ evaluated at $\mathbf{x}_{t}$ . Such an approach offers a microscopic and deterministic interpretation of various diffusive processes possessing WGF structures.
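To make the particle picture (7) concrete, here is a minimal sketch for the Fokker-Planck equation (2) in one dimension. It assumes a quadratic potential $V(x)=x^{2}/2$ and a Gaussian initial density, so that $p_{t}=N(0,s(t)^{2})$ stays Gaussian and $T(t,z)=s(t)z$ for a standard Gaussian reference; the values of `gamma` and `s0` and the forward-Euler time stepping are illustrative assumptions. In this setting $s(t)^{2}=\gamma+(s_{0}^{2}-\gamma)e^{-2t}$ , which the code uses as a reference value.

```python
import math

gamma = 1.0   # diffusion coefficient (assumed value)
s0 = 2.0      # standard deviation of p_0 = N(0, s0^2) (assumed value)

def velocity(x, s):
    """Right-hand side of (7) for V(x) = x^2/2 and p_t = N(0, s^2):
    -V'(x) - gamma * d/dx log p_t(x) = -x + gamma * x / s^2."""
    return -x + gamma * x / s ** 2

def simulate(t_end, n_steps):
    """Forward-Euler integration of the particle x_t = T(t, z) with z = 1.
    Because T(t, z) = s(t) * z, the position of this particle equals s(t)."""
    dt = t_end / n_steps
    s = s0
    for _ in range(n_steps):
        s += dt * velocity(s, s)
    return s

t_end = 1.0
s_num = simulate(t_end, 10000)
s_exact = math.sqrt(gamma + (s0 ** 2 - gamma) * math.exp(-2.0 * t_end))
```

The single tracked particle suffices here only because the Gaussian family is closed under this flow; for general densities, $p_{t}$ must be estimated from the whole particle population.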

The aforementioned examples of WGF can be formulated as the gradient flows under Lagrangian coordinates (6) as well as the particle dynamics (7). We summarize this in the following Table 1. We assume $T(0,\cdot)_{\#}p_{r}=p_{0}$ as the initial condition for (6), and $\mathbf{x}_{0}\sim p_{0}$ as the initial distribution of the random particle $\mathbf{x}_{t}$ in (7). We denote $p_{t}=T(t,\cdot)_{\#}p_{r}$ in equation (6). Accordingly, we denote $p_{t}$ as the probability density of the stochastic particle $\mathbf{x}_{t}$ in the dynamic (7).

\begin{tabular}{lll}
WGF & Gradient flow in Lagrangian coordinates & Particle dynamic \\
Fokker-Planck (2) & $\partial_{t}T(t,z)=-\nabla_{x}(V+\gamma\log p_{t})\circ T(t,z)$ & $\frac{d\mathbf{x}_{t}}{dt}=-\nabla_{x}V(\mathbf{x}_{t})-\gamma\nabla_{x}\log p_{t}(\mathbf{x}_{t})$ \\
Porous medium (3) & $\partial_{t}T(t,z)=-m\,p_{t}(T(t,z))^{m-2}\nabla_{x}p_{t}\circ T(t,z)$ & $\frac{d\mathbf{x}_{t}}{dt}=-m\,p_{t}(\mathbf{x}_{t})^{m-2}\nabla_{x}p_{t}(\mathbf{x}_{t})$ \\
Keller-Segel (4) & $\partial_{t}T(t,z)=-\nabla_{x}(U^{\prime}(p_{t})+W*p_{t})\circ T(t,z)$ & $\frac{d\mathbf{x}_{t}}{dt}=-\nabla_{x}U^{\prime}(p_{t}(\mathbf{x}_{t}))-\nabla_{x}W*p_{t}(\mathbf{x}_{t})$ \\
\end{tabular}

Table 1. Gradient flows under Lagrangian coordinates and particle dynamics associated with the WGFs. The porous medium row uses $\nabla_{x}\big(\frac{m}{m-1}p_{t}^{m-1}\big)=m\,p_{t}^{m-2}\nabla_{x}p_{t}$ .

3. Neural projected Wasserstein gradient flows and their algorithms

As discussed in Section 2, instead of the direct evaluation of the density function of the Wasserstein gradient flow, it suffices to compute the time-dependent Lagrangian mapping $T(t,\cdot)$ . In this research, we approximate $T(t,\cdot)$ via neural networks parametrized by time-dependent parameter $\{\theta_{t}\}$ . The evolution of $\theta_{t}$ is obtained by projecting the gradient flow (6) onto the parameter space $\Theta$ . In this section, we briefly review the basic definitions of neural network mapping functions. We next study a metric space for neural mapping functions and formulate several neural mapping dynamics for $\{\theta_{t}\}$ .

3.1. Neural network activation functions

We first provide the definition of a neural network mapping function. Consider a mapping function

f\colon Z\times\Theta\rightarrow\Omega,

where $Z\subset\mathbb{R}^{l}$ is the latent space, $\Omega\subset\mathbb{R}^{d}$ is the sample space and $\Theta\subset\mathbb{R}^{D}$ is the parameter space. In this paper, we consider the following network structure

f(\theta,z)=\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma\Big{(}z-b_{i}\Big{)},

where $\theta=(a_{i},b_{i})\in\mathbb{R}^{D}$ , $D=(l+1)N$ . Here $N$ is the number of hidden units (neurons). $a_{i}\in\mathbb{R}$ is the weight of unit $i$ . $b_{i}\in\mathbb{R}^{l}$ is an offset (location variable). $\sigma\colon\mathbb{R}\rightarrow\mathbb{R}$ is an activation function, which satisfies $\sigma(0)=0$ , $1\in\partial\sigma(0)$ . From now on, we assume that $f$ is invertible, monotone, and is continuous w.r.t. both $z$ and $\theta$ variables.
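A minimal sketch of this network structure for scalar inputs ( $l=d=1$ ) with the ReLU activation is given below. The parameter values are hypothetical; the positivity of the weights $a_{i}$ is an assumption made here as a simple sufficient condition for monotonicity, not a requirement stated in the text.

```python
def relu(x):
    return max(x, 0.0)

def f(theta, z):
    """Two-layer network f(theta, z) = (1/N) * sum_i a_i * relu(z - b_i), scalar z."""
    a, b = theta
    return sum(ai * relu(z - bi) for ai, bi in zip(a, b)) / len(a)

def dz_f(theta, z):
    """Derivative in z away from the kinks: (1/N) * sum_i a_i * 1[z > b_i]."""
    a, b = theta
    return sum(ai for ai, bi in zip(a, b) if z > bi) / len(a)

# Hypothetical parameters: positive weights a_i and sorted offsets b_i.
theta = ([1.0, 2.0, 0.5], [-1.0, 0.0, 1.0])

# Evaluate on a grid lying to the right of the smallest offset b_1 = -1.
zs = [-0.5 + 0.25 * k for k in range(9)]
values = [f(theta, z) for z in zs]
```

With positive weights, `values` is strictly increasing on this grid, since every $z$ there activates at least one unit and the slope `dz_f` is a sum of positive terms.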

For example, let $N=d=1$ and $b_{1}=0$ . The resulting two-layer neural network is $f(\theta,z)=a_{1}\sigma(z)$ .

[Figure: diagram of the two-layer neural network.]

The following neural network mapping functions have been widely used.

Example 1 (Linear).

Denote $\sigma(x)=x$ . Consider

f(\theta,z)=\theta z,\quad\theta\in\mathbb{R}_{+}.

Example 2 (ReLU).

Denote $\sigma(x)=\max\{x,0\}$ . Consider

f(\theta,z)=\theta\max\{z,0\},\quad\theta\in\mathbb{R}_{+}.

Example 3 (Sigmoid).

Denote $\sigma(x)=\frac{1}{1+e^{-2x}}$ . Consider

f(\theta,z)=\frac{\theta}{1+e^{-2z}},\quad\theta\in\mathbb{R}_{+}.

In Section 4 (theoretical results) and Section 5 (numerical examples), we focus mainly on the case where $l=d=1$ and $D=2N$ , with $\sigma(\cdot)$ the ReLU activation function.

3.2. Neural mapping models and energies

In this subsection, we consider the following probability density functions generated by neural network mapping functions. We call them the neural mapping models.

Definition 1 (Neural mapping models).

Let us fix an input reference probability density $p_{\mathrm{r}}\in\mathcal{P}(Z)=\Big\{p(z)\in C^{\infty}(Z)\colon\int_{Z}p(z)dz=1,~p(z)\geq 0\Big\}$ . Denote a probability density generated by a neural network mapping function by the pushforward operator:

p={f_{\theta}}_{\#}p_{\mathrm{r}}\in\mathcal{P}(\Omega).

In other words, $p$ satisfies the following Monge-Ampère equation:

p(f(\theta,z))\mathrm{det}(D_{z}f(\theta,z))=p_{\mathrm{r}}(z)\,,

(8)

where $D_{z}f(\theta,z)$ is the Jacobian of the mapping function $f(\theta,z)$ w.r.t. variable $z$ .
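In one dimension, equation (8) can be solved for $p$ explicitly, $p(x)=p_{\mathrm{r}}(z)/\partial_{z}f(\theta,z)$ with $z=f(\theta,\cdot)^{-1}(x)$ . The sketch below evaluates this for a small ReLU network. The uniform reference density on $[0,1]$ , the parameter values, and the bisection-based inverse are assumptions of the example.

```python
def relu(x):
    return max(x, 0.0)

# Hypothetical parameters (N = 2): f(z) = (a_1*relu(z - b_1) + a_2*relu(z - b_2)) / 2.
a = [1.0, 2.0]
b = [0.0, 0.5]

def f(z):
    return sum(ai * relu(z - bi) for ai, bi in zip(a, b)) / len(a)

def dz_f(z):
    # Derivative away from the kinks z = b_i.
    return sum(ai for ai, bi in zip(a, b) if z > bi) / len(a)

def p_ref(z):
    # Reference density: uniform on Z = [0, 1] (an assumption of this sketch).
    return 1.0 if 0.0 <= z <= 1.0 else 0.0

def f_inverse(x, lo=0.0, hi=1.0, iters=60):
    # Bisection is valid because f is increasing on [0, 1].
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def density(x):
    """Solve (8) for p when d = 1: p(x) = p_r(z) / d_z f(z) with z = f^{-1}(x)."""
    z = f_inverse(x)
    return p_ref(z) / dz_f(z)

# f maps [0, 0.5] onto [0, 0.25] with slope 1/2 and [0.5, 1] onto [0.25, 1] with slope 3/2,
# so the pushforward density is piecewise constant.
p_left, p_right = density(0.1), density(0.6)

# Midpoint-rule check of the total mass over the image [0, 1].
n = 1000
mass = sum(density((k + 0.5) / n) for k in range(n)) / n
```

Because the network is piecewise linear, the resulting density is piecewise constant with jumps at the images of the offsets $b_{i}$ , which is one way to see the connection with mesh-based discretizations.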

Definition 2 (Neural mapping energies).

Given an energy functional $\mathcal{F}\colon\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ , we can construct a neural mapping energy $F\colon\Theta\rightarrow\mathbb{R}$ by

F(\theta)=\mathcal{F}({f_{\theta}}_{\#}p_{\mathrm{r}}).

Many applications in machine learning and scientific computing can be cast into the following optimization problem

\min_{\theta\in\Theta}F(\theta).

Here, $F$ often measures the closeness between the neural mapping model and the target or data density distribution. Several concrete examples of neural mapping energies $F$ are given below. For simplicity of presentation, we often write the integration w.r.t. density $p_{\mathrm{r}}$ over domain $Z$ as the expectation operator $\mathbb{E}_{z\sim p_{\mathrm{r}}}$ . Later in Section 3.5, we provide several examples of the energy functional $\mathcal{F}$ including the potential, the interaction (e.g., maximum mean discrepancy) and the internal (information entropy/divergence) functionals. They are commonly used in machine learning and optimal transport communities; see details in [3, Section 9].

To summarize, the neural mapping energies are functionals $\mathcal{F}$ written in terms of the mapping functions $f(\theta,z)$ . This allows us to perform optimization on the finite dimensional space $\Theta$ instead of the infinite dimensional space $\mathcal{P}(\Omega)$ .
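For instance, for a potential energy $\mathcal{F}(p)=\int_{\Omega}V(x)p(x)dx$ , the definition of the pushforward gives $F(\theta)=\mathbb{E}_{z\sim p_{\mathrm{r}}}[V(f(\theta,z))]$ , which only requires evaluating the map on reference points. The following sketch approximates this expectation by a midpoint rule. The quadratic potential, the linear model of Example 1, and the uniform reference density on $[0,1]$ are assumptions made for illustration.

```python
def V(x):
    return 0.5 * x * x          # assumed quadratic potential

def f(theta, z):
    return theta * z            # linear model of Example 1

def expectation_ref(g, n=1000):
    """Midpoint-rule approximation of E_{z ~ p_r}[g(z)] for p_r uniform on [0, 1]."""
    return sum(g((k + 0.5) / n) for k in range(n)) / n

def F(theta):
    """Neural mapping energy F(theta) = E_z[V(f(theta, z))]."""
    return expectation_ref(lambda z: V(f(theta, z)))

theta = 1.5
F_num = F(theta)
F_exact = theta ** 2 / 6.0      # since E[z^2] = 1/3 for the uniform reference
```

In practice the quadrature would typically be replaced by an average over samples of $p_{\mathrm{r}}$ ; the midpoint rule is used here only to keep the example deterministic.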

3.3. Neural mapping metric space

We next consider a mapping space parameterized by a neural mapping function $f(\theta,\cdot)$ . We can measure the difference between two neural mapping functions by the $L^{2}$ distance thanks to the following definition.

Definition 3 (Neural mapping distance).

Define a distance function $\mathrm{Dist}_{\mathrm{W}}\colon\Theta\times\Theta\rightarrow\mathbb{R}$ as

\begin{split}\mathrm{Dist}_{\mathrm{W}}({f_{\theta^{0}}}_{\#}p_{\mathrm{r}},{f_{\theta^{1}}}_{\#}p_{\mathrm{r}})^{2}=&\int_{Z}\|f(\theta^{0},z)-f(\theta^{1},z)\|^{2}p_{\mathrm{r}}(z)dz\\ =&\sum_{m=1}^{d}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\|f_{m}(\theta^{0},z)-f_{m}(\theta^{1},z)\|^{2}\Big],\end{split}

where $\theta^{0}$ , $\theta^{1}\in\Theta$ are two sets of neural network parameters and $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^{d}$ .

In the above definition, $\mathrm{Dist}_{\mathrm{W}}$ represents a distance function for two given neural mapping functions $f(\theta^{0},\cdot)$ and $f(\theta^{1},\cdot)$ . In fact, the $L^{2}$ distance between neural mapping functions induces a metric on neural network parameters. Similar Riemannian geometry for feed-forward neural networks is also studied in [31].

We next consider the Taylor expansion of the distance function. Let $\Delta\theta\in\mathbb{R}^{D}$ ,

\begin{split}&\mathrm{Dist}_{\mathrm{W}}({f_{\theta+\Delta\theta}}_{\#}p_{\mathrm{r}},{f_{\theta}}_{\#}p_{\mathrm{r}})^{2}\\ =&\sum_{m=1}^{d}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\|f_{m}(\theta+\Delta\theta,z)-f_{m}(\theta,z)\|^{2}\Big]\\ =&\sum_{m=1}^{d}\sum_{i=1}^{D}\sum_{j=1}^{D}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\partial_{\theta_{i}}f_{m}(\theta,z)\partial_{\theta_{j}}f_{m}(\theta,z)\Big]\Delta\theta_{i}\Delta\theta_{j}+o(\|\Delta\theta\|^{2})\\ =&\Delta\theta^{\mathsf{T}}G_{\mathrm{W}}(\theta)\Delta\theta+o(\|\Delta\theta\|^{2}).\end{split}

Here $G_{\mathrm{W}}$ is a Gram-type matrix function. We summarize its definition below.

Definition 4 (Neural mapping metric).

Define a matrix function $G_{\mathrm{W}}\colon\Theta\rightarrow\mathbb{R}^{D\times D}$ . Denote $G_{\mathrm{W}}(\theta)=(G_{\mathrm{W}}(\theta)_{ij})_{1\leq i,j\leq D}$ , such that

G_{\mathrm{W}}(\theta)_{ij}=\sum_{m=1}^{d}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\partial_{\theta_{i}}f_{m}(\theta,z)\partial_{\theta_{j}}f_{m}(\theta,z)\Big].

We also write

G_{\mathrm{W}}(\theta)=\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}f(\theta,z)\nabla_{\theta}f(\theta,z)^{\mathsf{T}}\Big],

where we denote $\nabla_{\theta}f(\theta,z)=(\partial_{\theta_{i}}f_{m}(\theta,z))_{1\leq i\leq D,1\leq m\leq d}\in\mathbb{R}^{D\times d}$ .

From now on, we call $(\Theta,G_{\mathrm{W}})$ the neural mapping metric space. Here we always assume that $G_{\mathrm{W}}(\theta)$ is a positive definite matrix in $\mathbb{R}^{D\times D}$ .
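The relation between the neural mapping distance and the quadratic form $\Delta\theta^{\mathsf{T}}G_{\mathrm{W}}(\theta)\Delta\theta$ can be checked numerically. The sketch below does so for a two-neuron ReLU network in one dimension. The uniform reference density on $[0,1]$ , the parameter values, the perturbation, and the midpoint quadrature are all assumptions of this example; derivatives with respect to $b_{i}$ are taken away from the kinks $z=b_{i}$ .

```python
def relu(x):
    return max(x, 0.0)

def f(a, b, z):
    return sum(ai * relu(z - bi) for ai, bi in zip(a, b)) / len(a)

def grad_theta_f(a, b, z):
    """Gradient of f w.r.t. theta = (a_1..a_N, b_1..b_N), for z away from the kinks:
    d f / d a_i = relu(z - b_i) / N,   d f / d b_i = -a_i * 1[z > b_i] / N."""
    n = len(a)
    da = [relu(z - bi) / n for bi in b]
    db = [(-ai / n if z > bi else 0.0) for ai, bi in zip(a, b)]
    return da + db

def metric(a, b, n_quad=4000):
    """Midpoint-rule approximation of G_W = E_z[grad f grad f^T], p_r uniform on [0, 1]."""
    dim = 2 * len(a)
    G = [[0.0] * dim for _ in range(dim)]
    for k in range(n_quad):
        g = grad_theta_f(a, b, (k + 0.5) / n_quad)
        for i in range(dim):
            for j in range(dim):
                G[i][j] += g[i] * g[j] / n_quad
    return G

def dist_sq(a0, b0, a1, b1, n_quad=4000):
    """Squared L^2(p_r) distance between the two mapping functions."""
    zs = ((k + 0.5) / n_quad for k in range(n_quad))
    return sum((f(a0, b0, z) - f(a1, b1, z)) ** 2 for z in zs) / n_quad

# Hypothetical parameters and a small perturbation.
a, b = [1.0, 2.0], [0.1, 0.5]
d_theta = [1e-3, -2e-3, 5e-4, 1e-3]   # (delta a_1, delta a_2, delta b_1, delta b_2)

G = metric(a, b)
quad_form = sum(d_theta[i] * G[i][j] * d_theta[j] for i in range(4) for j in range(4))

a_new = [a[0] + d_theta[0], a[1] + d_theta[1]]
b_new = [b[0] + d_theta[2], b[1] + d_theta[3]]
direct = dist_sq(a_new, b_new, a, b)
```

The two numbers `quad_form` and `direct` agree to leading order in the size of the perturbation; the remaining gap comes from the higher-order terms of the expansion and from the thin regions near the kinks where the ReLU is not differentiable.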

3.4. Neural mapping dynamics

In this subsection, we derive analogues of Wasserstein gradient flows in the neural mapping metric space $(\Theta,G_{\mathrm{W}})$ . We then use them to define the neural mapping dynamics and compare them with their counterparts in the $L^{2}$ mapping metric space and the $L^{2}$ Wasserstein metric probability space. From now on, we assume that $f$ is smooth w.r.t. parameter $\theta$ . This is not true for the ReLU activation function, which will be studied in detail in later sections.

The next proposition provides gradient operators of a function $F\in C^{2}(\Theta;\mathbb{R})$ in the neural mapping metric space $(\Theta,G_{\mathrm{W}})$ .

Proposition 1 (Neural mapping gradient operators).

The gradient operator of $F$ in $(\Theta,G_{\mathrm{W}})$ , $\mathrm{grad}_{\mathrm{W}}F(\theta)=(\mathrm{grad}_{\mathrm{W}}F(\theta)_{k})_{k=1}^{D}$ , is given by

\mathrm{grad}_{\mathrm{W}}F(\theta)_{k}=\sum_{i=1}^{D}G^{-1}_{\mathrm{W}}(\theta)_{ki}\partial_{\theta_{i}}F(\theta).

Proof.

We briefly derive the gradient operator of $F$ in $(\Theta,G_{\mathrm{W}})$ below. Suppose $\theta(t)=\theta_{t}$ is a smooth curve passing through the point $\theta(0)=\theta$ . Consider a Taylor expansion of $F(\theta_{t})$ at $t=0$ by

\begin{split}F(\theta_{t})=&F(\theta)+t\cdot\frac{d}{dt}F(\theta_{t})|_{t=0}+o(t)\\ =&F(\theta)+t\cdot(G_{\mathrm{W}}(\theta)\cdot\mathrm{grad}_{\mathrm{W}}F(\theta),\dot{\theta})+o(t),\end{split}

(9)

where we denote $\frac{d}{dt}\theta_{t}|_{t=0}=\dot{\theta}$ . Comparing linear terms of $t$ in (9), we have

\begin{split}(G_{\mathrm{W}}(\theta)\cdot\mathrm{grad}_{\mathrm{W}}F(\theta),\dot{\theta})=&\frac{d}{dt}F(\theta_{t})|_{t=0}\\ =&(\nabla_{\theta}F(\theta),\dot{\theta}),\end{split}

for any $\dot{\theta}\in T_{\theta}\Theta=\mathbb{R}^{D}$ . Thus

\mathrm{grad}_{\mathrm{W}}F(\theta)=G^{-1}_{\mathrm{W}}(\theta)\nabla_{\theta}F(\theta).

∎
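The effect of the preconditioner $G_{\mathrm{W}}^{-1}$ is easiest to see in a scalar example. The sketch below assumes the linear model $f(\theta,z)=\theta z$ , a quadratic potential energy with $V(x)=x^{2}/2$ , and a uniform reference density on $[0,1]$ ; the finite-difference Euclidean gradient is likewise a choice made for this illustration. Under these assumptions $\partial_{\theta}F(\theta)=\theta\,\mathbb{E}[z^{2}]$ and $G_{\mathrm{W}}=\mathbb{E}[z^{2}]$ , so $\mathrm{grad}_{\mathrm{W}}F(\theta)=\theta$ .

```python
def expectation_ref(g, n=1000):
    """Midpoint rule for E_{z ~ p_r}[g(z)], with p_r uniform on [0, 1] (an assumption)."""
    return sum(g((k + 0.5) / n) for k in range(n)) / n

def F(theta):
    # Neural mapping energy for V(x) = x^2 / 2 and f(theta, z) = theta * z.
    return expectation_ref(lambda z: 0.5 * (theta * z) ** 2)

def euclidean_grad(theta, h=1e-6):
    # Central finite difference for dF/dtheta.
    return (F(theta + h) - F(theta - h)) / (2.0 * h)

def metric(theta):
    # G_W(theta) = E_z[(d f / d theta)^2] = E_z[z^2], a 1x1 matrix here.
    return expectation_ref(lambda z: z * z)

def natural_grad(theta):
    # grad_W F(theta) = G_W(theta)^{-1} * dF/dtheta.
    return euclidean_grad(theta) / metric(theta)

theta = 1.5
euc = euclidean_grad(theta)    # close to theta / 3
nat = natural_grad(theta)      # close to theta
```

In this example the natural gradient direction $-\theta$ reproduces the Lagrangian flow $\partial_{t}T=-V^{\prime}(T)=-T$ restricted to linear maps, whereas the Euclidean gradient is rescaled by the second moment of the reference density.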

We are ready to present the neural mapping gradient flow, which will be used for our first-order algorithm in neural mapping optimization problems.

Proposition 2 (Neural mapping gradient flows).

Consider an energy functional $\mathcal{F}\colon\mathcal{P}(\Omega)\rightarrow\mathbb{R}$ . Then the gradient flow of function $F(\theta)=\mathcal{F}({f_{\theta}}_{\#}p_{\mathrm{r}})$ in $(\Theta,G_{\mathrm{W}})$ is given by

\frac{d\theta}{dt}=-\mathrm{grad}_{\mathrm{W}}F(\theta).

(10)

In particular,

\begin{split}\frac{d\theta_{i}}{dt}=&-\sum_{j=1}^{D}\sum_{m=1}^{d}\Big(\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}f(\theta,z)\nabla_{\theta}f(\theta,z)^{\mathsf{T}}\Big]\Big)_{ij}^{-1}\cdot\\ &\qquad\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big[\nabla_{x_{m}}\frac{\delta}{\delta p}\mathcal{F}(p)(f(\theta,\tilde{z}))\cdot\partial_{\theta_{j}}f_{m}(\theta,\tilde{z})\Big],\end{split}

where $\frac{\delta}{\delta p(x)}$ is the $L^{2}$ –first variation w.r.t. variable $p(x)$ , $x=f(\theta,z)$ .

Proof.

As the neural mapping metric is given in definition 4, it suffices to calculate the formula for the Euclidean gradient $\partial_{\theta_{j}}F(\theta)$ as follows:

	$\displaystyle\partial_{\theta_{j}}F(\theta)=$	$\displaystyle\ \int_{\Omega}\partial_{\theta_{j}}\rho_{\theta}(x)\frac{\delta}% {\delta p}\mathcal{F}(\rho_{\theta})(x)dx$
	$\displaystyle=$	$\displaystyle\ \int_{\Omega}-\nabla_{x}\cdot\left[\rho_{\theta}(x)\partial_{\theta_{j}}f(\theta,f(\theta,\cdot)^{-1}(x))\right]\frac{\delta}{\delta p}\mathcal{F}(\rho_{\theta})(x)dx$
	$\displaystyle=$	$\displaystyle\ \int_{\Omega}\partial_{\theta_{j}}f(\theta,f(\theta,\cdot)^{-1}% (x))\cdot\nabla_{x}\left(\frac{\delta}{\delta p}\mathcal{F}(\rho_{\theta})% \right)(x)\rho_{\theta}(x)dx$
	$\displaystyle=$	$\displaystyle\ \mathbb{E}_{z\sim p_{\mathrm{r}}}\Big{[}\partial_{\theta_{j}}f(% \theta,z)\cdot\nabla_{x}\left(\frac{\delta}{\delta p}\mathcal{F}(p)\right)(f(% \theta,z))\Big{]}\,.$

Here we denote $\rho_{\theta}=f_{\theta\#}p_{\mathrm{r}}$ . ∎

3.5. Neural projected Wasserstein flows

The dynamics in parameter space can be formulated in terms of mappings and probability densities. For simplicity of discussion, we demonstrate that the neural mapping gradient flow is a projected Wasserstein gradient flow. Here the projection is from the full mapping space into a neural parameterized mapping space. Concretely, we present the following reformulations of equation (10), which are in terms of mapping functions and probability density functions. The proof is based on the gradient flow equation in proposition 2 and the application of the chain rule.

Proposition 3 (Neural projected Wasserstein gradient flows).

In terms of the mapping functions $f(\theta,z)=(f_{m}(\theta,z))_{m=1}^{d}$ , the dynamics (10) lead to

\begin{split}\frac{\partial}{\partial t}f_{m}(\theta(t),z)=&-\sum_{i=1}^{D}% \sum_{j=1}^{D}\sum_{n=1}^{d}\partial_{\theta_{i}}f_{m}(\theta,z)\Big{(}\mathbb% {E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}\nabla_{\theta}f(\theta,\tilde{z})% \nabla_{\theta}f(\theta,\tilde{z})^{\mathsf{T}}\Big{]}\Big{)}_{ij}^{-1}\cdot\\ &\hskip 71.13188pt\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}\nabla_{x_{n% }}\frac{\delta}{\delta p(x)}\mathcal{F}(p)(f(\theta,\tilde{z}))\cdot\partial_{% \theta_{j}}f_{n}(\theta,\tilde{z})\Big{]}.\end{split}

We present several examples of neural mapping Wasserstein gradient flows from proposition 2.

Example 4 (Neural projected linear transport equation).

Consider a linear energy given by

\mathcal{F}(p)=\int_{\Omega}V(x)p(x)dx.

In this case, the neural projected gradient flow satisfies

\frac{d\theta}{dt}=-G_{\mathrm{W}}^{-1}(\theta)\cdot\mathbb{E}_{\tilde{z}\sim p% _{\mathrm{r}}}\Big{[}\nabla_{\theta}V(f(\theta,\tilde{z}))\Big{]}.

(11)

In detail,

\frac{d\theta_{i}}{dt}=-\sum_{j=1}^{D}\Big{(}\mathbb{E}_{z\sim p_{\mathrm{r}}}% \Big{[}\nabla_{\theta}f(\theta,z)\nabla_{\theta}f(\theta,z)^{\mathsf{T}}\Big{]% }\Big{)}_{ij}^{-1}\cdot\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}\nabla_% {x}V(f(\theta,\tilde{z}))\cdot\partial_{\theta_{j}}f(\theta,\tilde{z})\Big{]}.

Example 5 (Neural projected interaction transport equation).

Consider an interaction energy given by

\mathcal{F}(p)=\frac{1}{2}\int_{\Omega}\int_{\Omega}W(x_{1},x_{2})p(x_{1})p(x_% {2})dx_{1}dx_{2}.

In this case, the neural mapping gradient flow satisfies

\frac{d\theta}{dt}=-\frac{1}{2}G_{\mathrm{W}}^{-1}(\theta)\cdot\mathbb{E}_{(z_% {1},z_{2})\sim p_{\mathrm{r}}\times p_{\mathrm{r}}}\Big{[}\nabla_{\theta}W(f(% \theta,z_{1}),f(\theta,z_{2}))\Big{]}.

(12)

In detail,

\begin{split}\frac{d\theta_{i}}{dt}=&-\sum_{j=1}^{D}\Big{(}\mathbb{E}_{z\sim p% _{\mathrm{r}}}\Big{[}\nabla_{\theta}f(\theta,z)\nabla_{\theta}f(\theta,z)^{% \mathsf{T}}\Big{]}\Big{)}_{ij}^{-1}\cdot\\ &\hskip 34.14322pt\mathbb{E}_{(z_{1},z_{2})\sim p_{\mathrm{r}}\times p_{% \mathrm{r}}}\Big{[}\nabla_{x_{1}}W(f(\theta,z_{1}),f(\theta,z_{2}))\cdot% \partial_{\theta_{j}}f(\theta,z_{1})\Big{]}.\end{split}

Example 6 (Neural projected negative entropy).

Consider an internal energy functional of the form

\mathcal{F}(p)=\int_{\Omega}U(p(x))dx.

In this case, the neural mapping gradient flow satisfies

\frac{d\theta}{dt}=-G_{\mathrm{W}}^{-1}(\theta)\cdot\mathbb{E}_{z\sim p_{% \mathrm{r}}}\Big{[}\nabla_{\theta}\hat{U}(\frac{p_{\mathrm{r}}(z)}{\mathrm{det% }(D_{z}f(\theta,z))})\Big{]}\,,

(13)

where $\hat{U}(p)=U(p)/p$ . This is because:

\begin{split}\mathcal{F}({f_{\theta}}_{\#}p_{\mathrm{r}})=&\int_{\Omega}U(p(f(% \theta,z)))df(\theta,z)\\ =&\int_{Z}U(\frac{p_{\mathrm{r}}(z)}{\mathrm{det}(D_{z}f(\theta,z))})\frac{% \mathrm{det}(D_{z}f(\theta,z))}{p_{\mathrm{r}}(z)}p_{\mathrm{r}}(z)dz\\ =&\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big{[}\hat{U}(\frac{p_{\mathrm{r}}(z)}{% \mathrm{det}(D_{z}f(\theta,z))})\Big{]}\,.\end{split}

The choice $U(p)=p\log(p)$ and $\hat{U}(p)=\log(p)$ corresponds to the negative entropy, which belongs to the family of internal energies. In detail,

\begin{split}\frac{d\theta_{i}}{dt}=&-\sum_{j=1}^{D}\Big{(}\mathbb{E}_{z\sim p% _{\mathrm{r}}}\Big{[}\nabla_{\theta}f(\theta,z)\nabla_{\theta}f(\theta,z)^{% \mathsf{T}}\Big{]}\Big{)}_{ij}^{-1}\cdot\\ &\hskip 34.14322pt\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big{[}-\mathrm{tr}\Big{(}D% _{z}f(\theta,z)^{-1}\colon\partial_{\theta_{j}}D_{z}f(\theta,z)\Big{)}\hat{U}^% {\prime}(\frac{p_{\mathrm{r}}(z)}{\mathrm{det}(D_{z}f(\theta,z))})\frac{p_{% \mathrm{r}}(z)}{\mathrm{det}(D_{z}f(\theta,z))}\Big{]}.\end{split}

Here we denote $\mathrm{tr}(A\colon B)=\mathrm{tr}(AB)$ , for matrices $A$ , $B\in\mathbb{R}^{d\times d}$ .

The above examples are projected Wasserstein gradient flows in neural mapping metric space. In particular, Examples 4, 5, 6 correspond to the following classical PDEs, respectively.

$\displaystyle\partial_{t}p(t,x)=$	$\displaystyle\nabla_{x}\cdot\Big{(}p(t,x)\nabla_{x}V(x)\Big{)},$	(14)
$\displaystyle\partial_{t}p(t,x)=$	$\displaystyle\nabla_{x}\cdot\Big{(}p(t,x)\int_{\Omega}\nabla_{x}W(x,y)p(t,y)dy% \Big{)},$	(15)
$\displaystyle\partial_{t}p(t,x)=$	$\displaystyle\nabla_{x}\cdot\Big{(}p(t,x)\nabla_{x}U^{\prime}(p(t,x))\Big{)}.\,$	(16)

The above dynamics include potential transport, interaction transport, and porous medium equations. The Fokker-Planck equation is a combination of the above first and third equations.
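As a minimal sanity check of Example 6, consider a hypothetical one-dimensional affine map $f(\theta,z)=\theta_{1}+\theta_{2}z$ with a standard Gaussian reference measure. This map is an illustrative assumption and not the ReLU family analyzed later. In this case $G_{\mathrm{W}}(\theta)$ is the identity and $F(\theta)=\mathrm{const}-\log\theta_{2}$ for $U(p)=p\log(p)$ , so (13) reduces to $\dot{\theta}_{2}=1/\theta_{2}$ . The variance $\theta_{2}^{2}$ then grows at rate $2$ , as it does for a Gaussian solution of the heat equation obtained from (16). The sketch below integrates this ODE with forward Euler.

```python
def entropy_flow_affine(theta2_0, h, steps):
    """Forward Euler for d(theta2)/dt = 1/theta2 (theta1 stays constant).

    Assumes the hypothetical affine map f = theta1 + theta2 * z with a
    standard Gaussian reference measure, for which the metric is the identity.
    """
    theta2 = theta2_0
    for _ in range(steps):
        # The Euclidean gradient of F(theta) = const - log(theta2) is -1/theta2,
        # so the natural gradient descent step adds h / theta2.
        theta2 += h / theta2
    return theta2


# Integrate up to time t = 1 starting from unit variance.
theta2_final = entropy_flow_affine(theta2_0=1.0, h=1e-3, steps=1000)
variance_final = theta2_final ** 2  # expected to be close to 1 + 2 * t
```

Under these assumptions the projected flow is exact, because Gaussian densities are preserved by the heat equation; the only error in the sketch comes from the time discretization.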

3.6. Algorithm

In this section, we discuss the implementations of gradient flows projected onto the parameter space. We apply the forward Euler discretization of the natural gradient flow (10). Let $h>0$ be the step size. Then the update is given by

\theta^{k+1}=\theta^{k}-h\Big{(}\tilde{G}_{\mathrm{W}}(\theta^{k})\Big{)}^{-1}% \nabla_{\theta}\tilde{F}(\theta^{k})\,,

(17)

where $\tilde{G}_{\mathrm{W}}(\theta)=(\tilde{G}_{\mathrm{W}}(\theta)_{ij})_{1\leq i,j\leq D}\in\mathbb{R}^{D\times D}$ and $\nabla_{\theta}\tilde{F}(\theta)$ are empirical estimates of the matrix $G_{\mathrm{W}}(\theta)$ and the gradient $\nabla_{\theta}F(\theta)=\{\partial_{\theta_{j}}F(\theta)\}_{j=1}^{D}$ , respectively. In detail, if $(z_{l})_{l=1}^{M}\sim p_{\mathrm{r}}$ , where $M$ is the number of empirical samples, then

\tilde{G}_{\mathrm{W}}(\theta)_{ij}=\frac{1}{M}\sum_{l=1}^{M}\sum_{m=1}^{d}\partial_{\theta_{i}}f_{m}(\theta,z_{l})\partial_{\theta_{j}}f_{m}(\theta,z_{l})\,.

In practice, the condition number of $\tilde{G}_{\mathrm{W}}(\theta)$ could be very large and it is more stable to use instead the pseudoinverse of $\tilde{G}_{\mathrm{W}}(\theta)$ in (17). Therefore, the update is

\theta^{k+1}=\theta^{k}-h\tilde{G}_{\mathrm{W}}(\theta^{k})^{\dagger}\nabla_{\theta}\tilde{F}(\theta^{k})\,.

When the reference measure is a one-dimensional standard Gaussian distribution, $G_{\mathrm{W}}(\theta)$ can be explicitly computed for our choice of neural network. In this case, we have

\theta^{k+1}=\theta^{k}-hG_{\mathrm{W}}(\theta^{k})^{\dagger}\nabla_{\theta}\tilde{F}(\theta^{k})\,.

We summarize the above explicit update formulas in Algorithm 1.

Input: Initial parameters $\theta\in\mathbb{R}^{D}$ ; stepsize $h>0$ , total number of steps $L$ , samples $\{z_{i}\}_{i=1}^{M}\sim p_{\mathrm{r}}$ for estimating $\tilde{G}_{\mathrm{W}}(\theta)$ and $\nabla_{\theta}\tilde{F}(\theta)$

for $k=1,2,\ldots,L$

    $\theta^{k+1}=\theta^{k}-h\tilde{G}_{\mathrm{W}}(\theta^{k})^{\dagger}\nabla_{\theta}\tilde{F}(\theta^{k})$ ; (when $G_{\mathrm{W}}(\theta)$ is unknown)

    $\theta^{k+1}=\theta^{k}-hG_{\mathrm{W}}(\theta^{k})^{\dagger}\nabla_{\theta}\tilde{F}(\theta^{k})$ ; (when $G_{\mathrm{W}}(\theta)$ is known)

end for

Algorithm 1 Projected Wasserstein gradient flows
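The update (17) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the implementation used in the experiments: it takes the hypothetical affine map $f(\theta,z)=\theta_{1}+\theta_{2}z$ and the potential $V(x)=x^{2}/2$ , estimates $\tilde{G}_{\mathrm{W}}$ and $\nabla_{\theta}\tilde{F}$ from the same samples, and inverts the $2\times 2$ matrix directly instead of using a pseudoinverse (which is adequate here since the matrix is well conditioned). The function name `natural_gradient_step` is illustrative.

```python
import random


def natural_gradient_step(theta, samples, h):
    """One forward Euler step of the projected flow for the hypothetical map
    f(theta, z) = theta[0] + theta[1] * z and potential V(x) = x**2 / 2."""
    M = len(samples)
    # Empirical metric: average of grad_theta f * grad_theta f^T, with grad_theta f = (1, z).
    g11 = 1.0
    g12 = sum(samples) / M
    g22 = sum(z * z for z in samples) / M
    # Empirical Euclidean gradient: average of V'(f) * grad_theta f, with V'(x) = x.
    f_vals = [theta[0] + theta[1] * z for z in samples]
    r1 = sum(f_vals) / M
    r2 = sum(f * z for f, z in zip(f_vals, samples)) / M
    # Solve the 2x2 system G d = r by Cramer's rule (assumes G is well conditioned).
    det = g11 * g22 - g12 * g12
    d1 = (g22 * r1 - g12 * r2) / det
    d2 = (g11 * r2 - g12 * r1) / det
    return [theta[0] - h * d1, theta[1] - h * d2]


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]
theta = [1.5, 2.0]
h, num_steps = 0.1, 20
for _ in range(num_steps):
    theta = natural_gradient_step(theta, samples, h)
```

For this particular choice the empirical gradient equals $\tilde{G}_{\mathrm{W}}\theta$ , so each step contracts $\theta$ by the factor $1-h$ regardless of the samples, which mirrors the continuous dynamics $\dot{\theta}=-\theta$ obtained from (11).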

4. Numerical analysis on neural network projected gradient flows

In this section, we establish theoretical guarantees for the performance of the neural projected dynamics. We start by deriving an analytic formula for the inverse of the neural mapping metric of a special ReLU family in section 4.1. Based on the closed-form projected dynamics equations, we establish the truncation error analysis for the projected dynamics in section 4.2. The analysis of the truncation error for general dynamics is presented in section 4.3.

4.1. Analytic formula for the inverse of neural mapping metric

In this section, we consider the following special case of the ReLU model in 1D. We first rewrite the neural network mapping function into the following form:

f(\theta,z)=\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma(z-b_{i}),\quad\sigma(z)=\left% \{\begin{aligned} &0,\quad z<0,\\ &z,\quad z\geq 0.\end{aligned}\right.

(18)

In particular, we combine $a_{i},b_{i}$ into one parameter in the 1D case. Under this reparameterization, the $a_{i}$ represent the slopes of the ReLU components and the $b_{i}$ are the intercepts. As a final assumption on this ReLU network mapping function, which facilitates the analytic formula for the neural mapping metric, we require all slope parameters to stay non-negative, i.e. $a_{i}\geq 0$ . Although this assumption is introduced artificially to obtain analytic formulas, it is natural in the sense that non-negative slope parameters induce a monotone mapping function, and solutions of the Monge problem in 1D are known to be monotone. In fig. 1, we plot a typical ReLU mapping function.

Figure 1. ReLU network mapping function considered in this section. The figure plots a typical monotone map parameterized by the ReLU network where the parameter $a_{i}$ is required to be positive.

We start with the analytic formula for the neural mapping metric, assuming the reference measure is given by $p_{\mathrm{r}}(\cdot)$ with associated cumulative distribution function $\mathfrak{F}_{0}(\cdot)$ .

Proposition 4 (Neural mapping metric of two-layer ReLU network).

The neural mapping metric of the two-layer ReLU network with reference measure $p_{\mathrm{r}}(\cdot)$ is given as

		$\displaystyle G_{\mathrm{W}}=\frac{1}{N^{2}}\begin{pmatrix}G_{\mathrm{W}}^{bb}&G_{\mathrm{W}}^{ba}\\ \left(G_{\mathrm{W}}^{ba}\right)^{T}&G_{\mathrm{W}}^{aa}\end{pmatrix},$		(19)
		$\displaystyle G_{\mathrm{W}}^{bb}=\begin{pmatrix}a_{1}^{2}\left(1-\mathfrak{F}% _{0}\left(b_{1}\right)\right)&a_{1}a_{2}\left(1-\mathfrak{F}_{0}\left(b_{2}% \right)\right)&\cdots&a_{1}a_{N}\left(1-\mathfrak{F}_{0}\left(b_{N}\right)% \right)\\ a_{1}a_{2}\left(1-\mathfrak{F}_{0}\left(b_{2}\right)\right)&a_{2}^{2}\left(1-% \mathfrak{F}_{0}\left(b_{2}\right)\right)&\cdots&a_{2}a_{N}\left(1-\mathfrak{F% }_{0}\left(b_{N}\right)\right)\\ \vdots&\vdots&\ddots&\vdots\\ a_{1}a_{N-1}\left(1-\mathfrak{F}_{0}\left(b_{N-1}\right)\right)&a_{2}a_{N-1}% \left(1-\mathfrak{F}_{0}\left(b_{N-1}\right)\right)&\cdots&a_{N}a_{N-1}\left(1% -\mathfrak{F}_{0}\left(b_{N}\right)\right)\\ \\ a_{1}a_{N}\left(1-\mathfrak{F}_{0}\left(b_{N}\right)\right)&a_{2}a_{N}\left(1-% \mathfrak{F}_{0}\left(b_{N}\right)\right)&\cdots&a_{N}^{2}\left(1-\mathfrak{F}% _{0}\left(b_{N}\right)\right)\end{pmatrix},$

		$\displaystyle G_{\mathrm{W}}^{ba}=-\begin{pmatrix}a_{1}\int_{b_{1}}^{\infty}% \left(z-b_{1}\right)p_{\mathrm{r}}(z)dz&a_{1}\int_{b_{2}}^{\infty}\left(z-b_{2% }\right)p_{\mathrm{r}}(z)dz&\cdots&a_{1}\int_{b_{N}}^{\infty}\left(z-b_{N}% \right)p_{\mathrm{r}}(z)dz\\ \\ a_{2}\int_{b_{2}}^{\infty}\left(z-b_{1}\right)p_{\mathrm{r}}(z)dz&a_{2}\int_{b% _{2}}^{\infty}\left(z-b_{2}\right)p_{\mathrm{r}}(z)dz&\cdots&a_{2}\int_{b_{N}}% ^{\infty}\left(z-b_{N}\right)p_{\mathrm{r}}(z)dz\\ \vdots&\vdots&\ddots&\vdots\\ a_{N}\int_{b_{N}}^{\infty}\left(z-b_{1}\right)p_{\mathrm{r}}(z)dz&a_{N}\int_{b% _{N}}^{\infty}\left(z-b_{2}\right)p_{\mathrm{r}}(z)dz&\cdots&a_{N}\int_{b_{N}}% ^{\infty}\left(z-b_{N}\right)p_{\mathrm{r}}(z)dz\end{pmatrix},$
		$\displaystyle\left(G_{\mathrm{W}}^{aa}\right)_{ij}=\int_{\max\{b_{i},b_{j}\}}^% {\infty}\left(z-b_{j}\right)\left(z-b_{i}\right)p_{\mathrm{r}}(z)dz.$

Proof.

We first calculate the derivative of the neural network map $f(\theta,z)$ w.r.t. network parameters $\theta$

\displaystyle\partial_{b_{i}}f(\theta,z)=\left\{\begin{aligned} &\ 0,\quad z<b% _{i},\\ &-\frac{a_{i}}{N},\quad z>b_{i},\end{aligned}\right.\quad\partial_{a_{i}}f(% \theta,z)=\frac{1}{N}\sigma\left(z-b_{i}\right),

(20)

while the value at the singular point $b_{i}$ does not exist and can be omitted from the measure-theoretical perspective. According to definition 4, one can evaluate different blocks of the metric tensor as the following integral

	$\displaystyle\left(G_{\mathrm{W}}^{bb}\right)_{ij}$	$\displaystyle=\int_{\mathbb{R}}\partial_{b_{i}}f(\theta,z)\partial_{b_{j}}f(% \theta,z)p_{\mathrm{r}}(z)dz=\frac{a_{i}a_{j}}{N^{2}}\left(1-\mathfrak{F}_{0}% \left(\max\{b_{i},b_{j}\}\right)\right),$
	$\displaystyle\left(G_{\mathrm{W}}^{ba}\right)_{ij}$	$\displaystyle=\int_{\mathbb{R}}\partial_{b_{i}}f(\theta,z)\partial_{a_{j}}f(% \theta,z)p_{\mathrm{r}}(z)dz=-\frac{a_{i}}{N^{2}}\int_{\max\{b_{i},b_{j}\}}^{% \infty}\left(z-b_{j}\right)p_{\mathrm{r}}(z)dz,$
	$\displaystyle\left(G_{\mathrm{W}}^{aa}\right)_{ij}$	$\displaystyle=\int_{\mathbb{R}}\partial_{a_{i}}f(\theta,z)\partial_{a_{j}}f(% \theta,z)p_{\mathrm{r}}(z)dz=\frac{1}{N^{2}}\int_{\max\{b_{i},b_{j}\}}^{\infty% }\left(z-b_{j}\right)\left(z-b_{i}\right)p_{\mathrm{r}}(z)dz.$

∎
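The closed form of the $G_{\mathrm{W}}^{bb}$ entries in the proof can be checked numerically. The sketch below assumes a standard Gaussian reference measure and arbitrary illustrative values of $a$ and $b$ ; it compares the formula $\frac{a_{i}a_{j}}{N^{2}}(1-\mathfrak{F}_{0}(\max\{b_{i},b_{j}\}))$ with a midpoint-rule approximation of the defining integral, using the derivatives in (20). All function names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def metric_bb_closed_form(a, b, i, j):
    """Entry (i, j) of the b-b block, including the 1/N^2 factor."""
    N = len(a)
    return a[i] * a[j] * (1.0 - std_normal_cdf(max(b[i], b[j]))) / N ** 2


def metric_bb_quadrature(a, b, i, j, upper=10.0, n=20000):
    """Midpoint rule for the integral of d_{b_i} f * d_{b_j} f * p_r.

    The product of the two derivatives is nonzero only for z above both
    kinks, so the integration starts at max(b_i, b_j); the Gaussian tail
    beyond `upper` is negligible.
    """
    N = len(a)
    lo = max(b[i], b[j])
    width = (upper - lo) / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * width
        total += (-a[i] / N) * (-a[j] / N) * std_normal_pdf(z) * width
    return total


# Illustrative parameters (assumed, not taken from the experiments).
a = [1.0, 0.5, 2.0]
b = [-1.0, 0.0, 0.8]
max_gap = max(
    abs(metric_bb_closed_form(a, b, i, j) - metric_bb_quadrature(a, b, i, j))
    for i in range(3)
    for j in range(3)
)
```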

For general reference measure $p_{\mathrm{r}}(\cdot)$ , the matrix elements of the $G_{\mathrm{W}}^{ba},G_{\mathrm{W}}^{aa}$ relate to the first and second moments of the measure which may not have an analytic formula. Here, we consider a special neural mapping model with a Gaussian reference measure, thus rendering the metric with analytic elements.

Corollary 1.

With the same setting as proposition 4 and a standard Gaussian reference measure, the matrix elements of the neural mapping metric can be written analytically as

		$\displaystyle\left(G_{\mathrm{W}}^{ba}\right)_{ij}=-a_{i}\left[p_{\mathrm{r}}(b_{i})-b_{j}(1-\mathfrak{F}_{0}(b_{i}))\right],\quad b_{i}>b_{j},$		(21)
		$\displaystyle\left(G_{\mathrm{W}}^{aa}\right)_{ij}=b_{i}b_{j}(1-\mathfrak{F}_{0}(b_{i}))-b_{j}p_{\mathrm{r}}(b_{i})+(1-\mathfrak{F}_{0}(b_{i})),\quad b_{i}>b_{j}.$

The other half of the elements can be obtained via switching $b_{i},b_{j}$ .

Proof.

The proof follows from an elementary integration:

	$\displaystyle\ \int_{b_{i}}^{\infty}\left(z-b_{j}\right)p_{\mathrm{r}}(z)dz$	(22)
$\displaystyle=$	$\displaystyle\ p_{\mathrm{r}}(z)\Big|_{\infty}^{b_{i}}-b_{j}(1-\mathfrak{F}_{0}(b_{i}))=p_{\mathrm{r}}(b_{i})-b_{j}(1-\mathfrak{F}_{0}(b_{i})),$
	$\displaystyle\ \int_{b_{i}}^{\infty}\left(z-b_{j}\right)\left(z-b_{i}\right)p_% {\mathrm{r}}(z)dz$
$\displaystyle=$	$\displaystyle\ b_{i}b_{j}(1-\mathfrak{F}_{0}(b_{i}))-(b_{i}+b_{j})p_{\mathrm{r% }}(b_{i})+b_{i}p_{\mathrm{r}}(b_{i})+(1-\mathfrak{F}_{0}(b_{i}))$
$\displaystyle=$	$\displaystyle\ b_{i}b_{j}(1-\mathfrak{F}_{0}(b_{i}))-b_{j}p_{\mathrm{r}}(b_{i}% )+(1-\mathfrak{F}_{0}(b_{i})).$

∎
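The two Gaussian tail integrals computed in the proof can be verified by quadrature. The following sketch assumes a standard Gaussian reference measure and arbitrary illustrative values $b_{i}=0.7$ , $b_{j}=-0.3$ ; the helper names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def first_moment_tail(bi, bj):
    """Closed form of int_{bi}^inf (z - bj) p_r(z) dz."""
    return std_normal_pdf(bi) - bj * (1.0 - std_normal_cdf(bi))


def second_moment_tail(bi, bj):
    """Closed form of int_{bi}^inf (z - bj)(z - bi) p_r(z) dz."""
    tail = 1.0 - std_normal_cdf(bi)
    return bi * bj * tail - bj * std_normal_pdf(bi) + tail


def midpoint_integral(g, lo, upper=12.0, n=40000):
    """Midpoint rule on [lo, upper]; the tail beyond `upper` is negligible."""
    width = (upper - lo) / n
    return sum(g(lo + (k + 0.5) * width) for k in range(n)) * width


bi, bj = 0.7, -0.3
numeric_first = midpoint_integral(lambda z: (z - bj) * std_normal_pdf(z), bi)
numeric_second = midpoint_integral(
    lambda z: (z - bj) * (z - bi) * std_normal_pdf(z), bi
)
```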

Now, we focus on the upper left block $G_{\mathrm{W}}^{bb}$ of the neural mapping metric. We will establish an analytical formula for the inverse of this matrix.

Theorem 2 (Analytic inverse of the neural mapping metric).

The inverse matrix of the $G_{\mathrm{W}}^{bb}$ block in proposition 4 can be written analytically as

\displaystyle\frac{1}{N^{2}}\left(G_{\mathrm{W}}^{-1}(\mathbf{b})\right)_{ij}=% \left\{\begin{aligned} \frac{1}{a_{i}^{2}}\left(\frac{1}{\mathfrak{F}_{0}(b_{i% })-\mathfrak{F}_{0}(b_{i-1})}+\frac{1}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_% {0}(b_{i})}\right),\quad i=j\neq 1,N,\\ \frac{1}{a_{i}^{2}}\left(\frac{1}{\mathfrak{F}_{0}(b_{N})-\mathfrak{F}_{0}(b_{% N-1})}+\frac{1}{1-\mathfrak{F}_{0}(b_{N})}\right),\quad i=j=N,\\ \frac{1}{a_{i}^{2}}\frac{1}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})},% \quad i=j=1,\\ -\frac{1}{a_{i}a_{i-1}}\frac{1}{\mathfrak{F}_{0}(b_{i})-\mathfrak{F}_{0}(b_{i-% 1})},\quad j=i-1,\\ -\frac{1}{a_{i}a_{i+1}}\frac{1}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{% i})},\quad j=i+1,\\ 0,\qquad\qquad o.w.\end{aligned}\right.

(23)

Proof.

First, we decompose the neural mapping metric into the following matrix product

G_{\mathrm{W}}=\frac{1}{N^{2}}D\begin{pmatrix}1-\mathfrak{F}_{0}(b_{1})&1-% \mathfrak{F}_{0}(b_{2})&\cdots&1-\mathfrak{F}_{0}(b_{N})\\ 1-\mathfrak{F}_{0}(b_{2})&1-\mathfrak{F}_{0}(b_{2})&\cdots&1-\mathfrak{F}_{0}(% b_{N})\\ \vdots&\vdots&\ddots&\vdots\\ 1-\mathfrak{F}_{0}(b_{N})&1-\mathfrak{F}_{0}(b_{N})&\cdots&1-\mathfrak{F}_{0}(% b_{N})\\ \end{pmatrix}D,

(24)

where $D=\operatorname{diag}(a_{1},a_{2},\cdots,a_{N})$ is a diagonal matrix. Then, it is straightforward to check that the middle matrix has the following tri-diagonal analytic inverse:

\begin{pmatrix}\frac{1}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})}&-\frac{1}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})}&0&\cdots&0\\ -\frac{1}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})}&\frac{1}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})}+\frac{1}{\mathfrak{F}_{0}(b_{3})-\mathfrak{F}_{0}(b_{2})}&-\frac{1}{\mathfrak{F}_{0}(b_{3})-\mathfrak{F}_{0}(b_{2})}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&\frac{1}{\mathfrak{F}_{0}(b_{N})-\mathfrak{F}_{0}(b_{N-1})}+\frac{1}{1-\mathfrak{F}_{0}(b_{N})}\end{pmatrix}.

(25)

Multiplying this matrix with the inverse of the diagonal matrix $D$ on both sides concludes this proof. ∎
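The tri-diagonal inverse (25) is easy to confirm numerically. The sketch below assumes a standard Gaussian reference measure and strictly increasing nodes $b_{1}<\cdots<b_{N}$ with $N\geq 2$ ; it builds the middle matrix of (24) and the candidate inverse (25) and multiplies them. Function names and node values are illustrative.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_matrix(b):
    """Matrix with entries 1 - F0(b_max(i, j)) for increasing nodes b."""
    N = len(b)
    tail = [1.0 - std_normal_cdf(x) for x in b]
    return [[tail[max(i, j)] for j in range(N)] for i in range(N)]


def tridiagonal_inverse(b):
    """Candidate inverse (25), assembled gap by gap."""
    N = len(b)
    F = [std_normal_cdf(x) for x in b]
    inv_gap = [1.0 / (F[i + 1] - F[i]) for i in range(N - 1)]
    inv = [[0.0] * N for _ in range(N)]
    for i in range(N):
        if i > 0:  # coupling with the left neighbour
            inv[i][i] += inv_gap[i - 1]
            inv[i][i - 1] = -inv_gap[i - 1]
        if i < N - 1:  # coupling with the right neighbour
            inv[i][i] += inv_gap[i]
            inv[i][i + 1] = -inv_gap[i]
    inv[N - 1][N - 1] += 1.0 / (1.0 - F[N - 1])  # tail term of the last node
    return inv


b = [-1.2, -0.3, 0.4, 1.1]
M_mat, M_inv = middle_matrix(b), tridiagonal_inverse(b)
N = len(b)
product = [
    [sum(M_mat[i][k] * M_inv[k][j] for k in range(N)) for j in range(N)]
    for i in range(N)
]
max_deviation = max(
    abs(product[i][j] - (1.0 if i == j else 0.0))
    for i in range(N)
    for j in range(N)
)
```

Because the inverse is tri-diagonal, applying $G_{\mathrm{W}}^{-1}$ to a gradient costs $O(N)$ operations instead of the $O(N^{3})$ of a general dense solve.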

This analytic form of the inverse metric will be used intensively in the next subsection to prove the consistency of the numerical scheme based on the ReLU neural network.

4.2. Truncation error analysis of the neural projected Wasserstein gradient flows based on analytic formula

In this section, we perform the numerical analysis of the neural mapping projected Wasserstein flows introduced in section 3.5 based on the analytic formula in section 4.1. Because of the analytic inverse of the neural mapping metric, the right-hand side of the Wasserstein projected gradient flow can be calculated explicitly, and one can thus talk about its consistency and order of accuracy following the same spirit as classical numerical analysis. We perform this derivation for the Wasserstein projected gradient flows of the potential functional explicitly.

Let us first recall that the formula for neural projected Wasserstein gradient flow is given by

\frac{d\theta}{dt}=-G_{\mathrm{W}}^{-1}(\theta)\cdot\mathbb{E}_{\tilde{z}\sim p% _{\mathrm{r}}}\Big{[}\nabla_{\theta}V(f(\theta,\tilde{z}))\Big{]}.

(26)

We have the following analytic formula for the projected gradient flow in the ReLU network model that we introduced in section 4.1.

Proposition 5 (Wasserstein gradient flow of potential functionals in ReLU network).

The projected potential flow in the ReLU network model eq. 18 has the following form:

$\displaystyle\dot{b}_{i}$	$\displaystyle=\frac{N}{a_{i}}\left[\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{% \prime}(f(b,z))\mathbf{1}_{[b_{i},b_{i+1}]}]}{\mathfrak{F}_{0}(b_{i+1})-% \mathfrak{F}_{0}(b_{i})}-\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(% b,z))\mathbf{1}_{[b_{i-1},b_{i}]}]}{\mathfrak{F}_{0}(b_{i})-\mathfrak{F}_{0}(b% _{i-1})}\right],\quad i\neq 1,N,$	(27)
$\displaystyle\dot{b}_{N}$	$\displaystyle=\frac{N}{a_{N}}\left[\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{% \prime}(f(b,z))\mathbf{1}_{[b_{N},\infty)}]}{1-\mathfrak{F}_{0}(b_{N})}-\frac{% \mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b,z))\mathbf{1}_{[b_{N-1},b_{N}% ]}]}{\mathfrak{F}_{0}(b_{N})-\mathfrak{F}_{0}(b_{N-1})}\right],$
$\displaystyle\dot{b}_{1}$	$\displaystyle=\frac{N}{a_{1}}\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime% }(f(b,z))\mathbf{1}_{[b_{1},b_{2}]}]}{\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}% (b_{1})}.$

Using the trapezoid rule to calculate the integration gives the following spatial discretization, which can be used to simulate the projected gradient flow:

\dot{b}_{i}=\frac{N}{2a_{i}}\left(V^{\prime}(f(b,b_{i+1}))-V^{\prime}(f(b,b_{i% -1}))\right).

(28)

Proof.

It suffices to calculate the gradient of the linear potential functional in this model. Let us start with the calculation of the functional form of the potential energy in the ReLU network mapping model as follows

\displaystyle\mathbb{E}_{x\sim f_{b\#}p_{\mathrm{r}}}[V(x)]=

\displaystyle\ \mathbb{E}_{z\sim p_{\mathrm{r}}}[V(f(b,z))],

(29)

where we use the change of the integration variable above. Therefore, the gradient of this functional w.r.t. $b$ can be simplified to

\displaystyle\partial_{b_{i}}\mathbb{E}_{z\sim p_{\mathrm{r}}}[V(f(b,z))]=

\displaystyle\ \mathbb{E}_{z\sim p_{\mathrm{r}}}[\partial_{b_{i}}V(f(b,z))]=-% \frac{a_{i}}{N}\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b,z))\mathbf{1}_% {[b_{i},\infty)}(z)],

(30)

where we use $\mathbf{1}_{A}$ to denote the characteristic function on the interval $A$ . Now, plugging this result into the projected gradient flow eq. 26 with the analytical formula for the inverse matrix $G_{\mathrm{W}}^{-1}$ in theorem 2, we obtain

$\displaystyle\dot{b_{i}}=$	$\displaystyle\ \frac{N^{2}}{a_{i}^{2}}\left(\frac{1}{\mathfrak{F}_{0}(b_{i})-% \mathfrak{F}_{0}(b_{i-1})}+\frac{1}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}% (b_{i})}\right)\frac{a_{i}}{N}\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b% ,z))\mathbf{1}_{[b_{i},\infty)}(z)]$	(31)
	$\displaystyle\ -\frac{N^{2}}{a_{i}a_{i-1}}\frac{1}{\mathfrak{F}_{0}(b_{i})-% \mathfrak{F}_{0}(b_{i-1})}\frac{a_{i-1}}{N}\mathbb{E}_{z\sim p_{\mathrm{r}}}[V% ^{\prime}(f(b,z))\mathbf{1}_{[b_{i-1},\infty)}(z)]$
	$\displaystyle\ -\frac{N^{2}}{a_{i}a_{i+1}}\frac{1}{\mathfrak{F}_{0}(b_{i+1})-% \mathfrak{F}_{0}(b_{i})}\frac{a_{i+1}}{N}\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{% \prime}(f(b,z))\mathbf{1}_{[b_{i+1},\infty)}(z)]$
$\displaystyle=$	$\displaystyle\ \frac{N}{a_{i}}\left[\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b,z))\mathbf{1}_{[b_{i},b_{i+1}]}(z)]}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i})}-\frac{\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b,z))\mathbf{1}_{[b_{i-1},b_{i}]}(z)]}{\mathfrak{F}_{0}(b_{i})-\mathfrak{F}_{0}(b_{i-1})}\right].$

Taking a close look at the terms inside the brackets, one finds that they are calculating the average value of $V^{\prime}$ inside the intervals $[b_{i-1},b_{i}],[b_{i},b_{i+1}]$ weighted by the base distribution $p_{\mathrm{r}}(\cdot)$ . Lastly, in order to complete the spatial discretization, one needs to choose a quadrature rule to calculate the integration in the above formula. One example is the trapezoid rule:

\mathbb{E}_{z\sim p_{\mathrm{r}}}[V^{\prime}(f(b,z))\mathbf{1}_{[b_{i},b_{i+1}% ]}(z)]\approx(\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i}))\frac{V^{% \prime}(f(b,b_{i}))+V^{\prime}(f(b,b_{i+1}))}{2},

which provides the desired discretization. Special attention should be paid to the boundary nodes $b_{1},b_{N}$ to obtain their corresponding evolution equations and discretizations. ∎

Given this spatial discretization, we can analyze its order of consistency, which is treated in the following proposition.
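The semi-discrete scheme (28) can be sketched as follows for the interior nodes. This is a minimal illustration with assumed parameter values; the boundary nodes $b_{1},b_{N}$ require their own formulas and are deliberately left out. The names `relu_map` and `interior_rhs` are hypothetical.

```python
def relu_map(a, b, z):
    """f(b, z) = (1/N) * sum_i a_i * max(z - b_i, 0), as in (18)."""
    N = len(a)
    return sum(ai * max(z - bi, 0.0) for ai, bi in zip(a, b)) / N


def interior_rhs(a, b, dV):
    """Right-hand side of (28) for the interior nodes only.

    Returns a dict mapping the 0-based node index i (1 <= i <= N-2) to
    N / (2 a_i) * (V'(f(b, b_{i+1})) - V'(f(b, b_{i-1}))).
    """
    N = len(a)
    rhs = {}
    for i in range(1, N - 1):
        v_next = dV(relu_map(a, b, b[i + 1]))
        v_prev = dV(relu_map(a, b, b[i - 1]))
        rhs[i] = N / (2.0 * a[i]) * (v_next - v_prev)
    return rhs


# Illustrative parameters (assumed): unit slopes and unit-spaced nodes.
a = [1.0, 1.0, 1.0, 1.0]
b = [-1.0, 0.0, 1.0, 2.0]
rhs_quadratic = interior_rhs(a, b, dV=lambda x: x)   # V(x) = x**2 / 2
rhs_linear = interior_rhs(a, b, dV=lambda x: 1.0)    # V(x) = x
```

For the linear potential $V(x)=x$ the interior right-hand side vanishes, since $V^{\prime}$ is constant and its centered difference is zero.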

Proposition 6 (Consistency of the projected gradient flow).

Assume the potential satisfies $\left\|V^{\prime\prime}\right\|_{\infty}<\infty$ . The spatial discretization eq. 28 is of first-order accuracy both in the mapping and the density coordinates.

Proof.

We prove this statement in two parts: consistency in the space of mapping functions, followed by consistency in the space of mapping distributions. For the mapping function, we have

$\displaystyle\partial_{t}f(b(t),z)=$	$\displaystyle\ \dot{b}^{T}\partial_{b}f(b,z)$	(32)
$\displaystyle=$	$\displaystyle\ -\sum_{i=1}^{N}\frac{N}{2a_{i}}\left(V^{\prime}(f(b,b_{i+1}))-V% ^{\prime}(f(b,b_{i-1}))\right)\frac{a_{i}}{N}\mathbf{1}_{[b_{i},\infty)}(z)$
$\displaystyle=$	$\displaystyle\ -\sum_{i=1}^{N}\frac{V^{\prime}(f(b,b_{i+1}))-V^{\prime}(f(b,b_% {i-1}))}{2}\mathbf{1}_{[b_{i},\infty)}(z)$
$\displaystyle=$	$\displaystyle\ -\frac{V^{\prime}(f(b,b_{i+1}))+V^{\prime}(f(b,b_{i}))}{2},% \quad z\in[b_{i},b_{i+1}].$

In the above derivation, we slightly abuse notation by applying the same evolution formula to all nodes $b_{i}$ , including the boundary nodes. It is easy to conclude that our discretization corresponds to the evolution of the mapping function $f$ with constant speed $-\frac{V^{\prime}(f(b,b_{i+1}))+V^{\prime}(f(b,b_{i}))}{2}$ on each interval $[b_{i},b_{i+1}]$ . Now, recall that in mapping space, the Wasserstein gradient flow of the potential function $V(x)$ corresponds to the velocity field $-V^{\prime}(x)$ . Therefore, given that the length of each interval is of order $\Delta b$ and that $\left\|V^{\prime\prime}\right\|_{\infty}<\infty$ , we conclude that our spatial discretization is first-order consistent on the mapping space.

Next, we prove the statement for the mapping distribution. To do this, we need to derive the evolution equation for the mapping distribution according to eq. 28. We have for $x\in[f(b,b_{i}),f(b,b_{i+1})]$

	$\displaystyle\ \partial_{t}p(t,x)=\dot{b}^{T}\partial_{b}p(t,x)$	(33)
$\displaystyle=$	$\displaystyle\ \sum_{i=1}^{N}\frac{N}{2a_{i}}\left(V^{\prime}(f(b,b_{i+1}))-V^% {\prime}(f(b,b_{i-1}))\right)\frac{a_{i}}{N}\left(\partial_{x}p(t,x)\mathbf{1}% _{[f(b,b_{i}),\infty)}(x)+p(t,x)\delta_{f(b,b_{i})}(x)\right)$
$\displaystyle=$	$\displaystyle\ \partial_{x}p(t,x)\sum_{i=1}^{N}\frac{V^{\prime}(f(b,b_{i+1}))-% V^{\prime}(f(b,b_{i-1}))}{2}\mathbf{1}_{[f(b,b_{i}),\infty)}(x)$
	$\displaystyle\ +p(t,x)\sum_{i=1}^{N}\frac{V^{\prime}(f(b,b_{i+1}))-V^{\prime}(% f(b,b_{i-1}))}{2}\delta_{f(b,b_{i})}(x)$
$\displaystyle=$	$\displaystyle\ \partial_{x}p(t,x)\frac{V^{\prime}(f(b,b_{i+1}))+V^{\prime}(f(b% ,b_{i}))}{2}-p(t,x)\frac{V^{\prime}(f(b,b_{i+1}))+V^{\prime}(f(b,b_{i-1}))}{2}% \delta_{f(b,b_{i})}(x).$

A quick method to derive the formula for $\partial_{b}p(t,x)$ is to view it as a probability flow corresponding to the cotangent vector $\partial_{b}f$ and then use the Wasserstein metric to calculate it via a continuity equation, as in proposition 2. Recall that the potential gradient flow in the density manifold is given by

\partial_{t}p(t,x)=\nabla\cdot(p(t,x)\nabla V(x))=\partial_{x}p(t,x)\partial_{x}V(x)+p(t,x)\partial_{xx}V(x).

(34)

Comparing eq. 33 and eq. 34, it is not difficult to recognize that the first term in eq. 33 approximates its continuous counterpart in eq. 34 to first order. The remaining terms also correspond to each other, although the approximation is first order in the weak sense rather than the strong sense, since eq. 33 contains a Dirac measure. Combining the above two parts, we finish the proof. ∎
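The first-order behaviour in the mapping coordinates can be illustrated with a small experiment. The sketch below is an assumption-laden example rather than part of the proof: it uses unit slopes $a_{i}=1$ , uniform nodes on $[-2,2]$ , and the quadratic potential with $V^{\prime}(x)=x$ . It measures, at the interval endpoints, the gap between the exact velocity $-V^{\prime}(f)$ and the piecewise-constant velocity $-\frac{V^{\prime}(f(b,b_{i+1}))+V^{\prime}(f(b,b_{i}))}{2}$ . For this linear $V^{\prime}$ the endpoint gap equals the maximum over each interval.

```python
def node_values(a, b):
    """Values f(b, b_k) of the ReLU map (18) at its own nodes."""
    N = len(a)
    return [sum(a[j] * max(bk - b[j], 0.0) for j in range(N)) / N for bk in b]


def max_velocity_error(N, dV):
    """Largest endpoint gap between -V'(f) and the averaged velocity."""
    a = [1.0] * N                                      # assumed unit slopes
    b = [-2.0 + 4.0 * k / (N - 1) for k in range(N)]   # uniform nodes on [-2, 2]
    f = node_values(a, b)
    worst = 0.0
    for i in range(N - 1):
        avg = 0.5 * (dV(f[i]) + dV(f[i + 1]))
        worst = max(worst, abs(dV(f[i]) - avg), abs(dV(f[i + 1]) - avg))
    return worst


err_coarse = max_velocity_error(20, dV=lambda x: x)
err_fine = max_velocity_error(40, dV=lambda x: x)
ratio = err_coarse / err_fine
```

In this setting the error is $2/N$ , so doubling the number of nodes halves it, which is the behaviour expected of a first-order consistent discretization.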

4.2.1. Projected dynamics of Negative entropy gradient flow

The potential functional can be viewed as a linear functional whose projected gradient flow has a rather simple expression. The corresponding formula has a more complex expression for general nonlinear internal energy, such as entropy. We begin with calculating the negative entropy functional of a neural mapping measure $f_{\#}p_{\mathrm{r}}$ :

$\displaystyle H\left(f_{\#}p_{\mathrm{r}}\right)=$	$\displaystyle\ \mathbb{E}_{x\sim cont\left(f_{\#}p_{\mathrm{r}}\right)}\left[% \log f_{\#}p_{\mathrm{r}}\left(x\right)\right]+\mathfrak{F}_{0}\left(b_{1}% \right)\log\mathfrak{F}_{0}\left(b_{1}\right)$	(35)
$\displaystyle=$	$\displaystyle\ \mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[\log f_% {\#}p_{\mathrm{r}}\left(f\left(z\right)\right)\right]+\mathfrak{F}_{0}\left(b_% {1}\right)\log\mathfrak{F}_{0}\left(b_{1}\right)$
$\displaystyle=$	$\displaystyle\ \mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[\log% \frac{p_{\mathrm{r}}\left(z\right)}{f^{\prime}\left(z\right)}\right]+\mathfrak% {F}_{0}\left(b_{1}\right)\log\mathfrak{F}_{0}\left(b_{1}\right)$
$\displaystyle=$	$\displaystyle\ \mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[\log p_% {\mathrm{r}}\left(z\right)\right]-\mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}% \right)}\left[{\log f^{\prime}\left(z\right)}\right]+\mathfrak{F}_{0}\left(b_{% 1}\right)\log\mathfrak{F}_{0}\left(b_{1}\right),$

where we use the Monge-Ampère equation $f_{\#}p_{\mathrm{r}}\left(f\left(z\right)\right)=\frac{p_{\mathrm{r}}(z)}{f^{% \prime}\left(z\right)}$ in one dimension. Moreover, notice that the last term corresponds to the entropy of the discrete part of distribution $f_{\#}p_{\mathrm{r}}$ as the ReLU mapping function maps $(-\infty,b_{1}]$ to $0$ and $cont\left(\cdot\right)$ refers to the continuous part of a distribution. Similarly, the relative entropy functional is given by

	$\displaystyle\mathrm{D}_{\mathrm{KL}}\left(f_{\#}p_{\mathrm{r}}\big\|\nu\right)=$	$\displaystyle\ \quad\mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[\log p_{\mathrm{r}}\left(z\right)\right]-\mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[{\log f^{\prime}\left(z\right)}\right]$
		$\displaystyle-\ \mathbb{E}_{z\sim cont\left(p_{\mathrm{r}}\right)}\left[{\log% \nu\left(f\left(z\right)\right)}\right]+\mathfrak{F}_{0}\left(b_{1}\right)% \left(\log\mathfrak{F}_{0}\left(b_{1}\right)\right).$

Moreover, the gradient flow of the KL-divergence differs from that of the negative entropy only by a term that appears in the derivation of the potential functional gradient flow. This also manifests in calculus on the density manifold, in the relation between the heat and Fokker-Planck equations. Now, one calculates the derivative of the continuous parts w.r.t. the parameter $b_{i}$ :

	$\displaystyle\partial_{b_{i}}\mathbb{E}_{x\sim p_{\mathrm{r}}}\left[{\log f^{\prime}\left(x\right)}\right]$	$\displaystyle\ =\left\{\begin{aligned} &\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{i}a_{j}}p_{\mathrm{r}}\left(b_{i}\right),\quad i\neq 1,\\ &-p_{\mathrm{r}}\left(b_{1}\right)\log\frac{a_{1}}{N},\hskip 28.45274pti=1.\end{aligned}\right.$
	$\displaystyle\partial_{b_{i}}\mathbb{E}_{x\sim p_{\mathrm{r}}}\left[{\log\nu\left(f\left(x\right)\right)}\right]$	$\displaystyle\ =\mathbb{E}_{x\sim p_{\mathrm{r}}}\left[\frac{\nu^{\prime}\left(f\left(x\right)\right)\partial_{b_{i}}f\left(x\right)}{\nu\left(f\left(x\right)\right)}\right].$

The first formula is based on the observation that $\log f^{\prime}\left(x\right)$ is a step function whose value changes at each $b_{i}$ : it takes the value $\log\frac{1}{N}\sum_{j=1}^{i-1}a_{j}$ on the interval $\left[b_{i-1},b_{i}\right]$ . Hence the desired conclusion follows, where the factor $p_{\mathrm{r}}\left(b_{i}\right)$ appears since the expectation is taken w.r.t. the distribution $p_{\mathrm{r}}\left(x\right)$ . Therefore, the derivatives of the entropy and relative entropy functionals read as follows

	$\displaystyle\partial_{b_{i}}H\left(f_{\#}p_{\mathrm{r}}\right)=$	$\displaystyle\ \left\{\begin{aligned} &-\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_% {j=1}^{i}a_{j}}p_{\mathrm{r}}\left(b_{i}\right),\hskip 59.75095pti\neq 1,\\ &p_{\mathrm{r}}\left(b_{1}\right)\left(\log\mathfrak{F}_{0}\left(b_{1}\right)+% 1+\log\frac{a_{1}}{N}\right),\quad i=1.\end{aligned}\right.$		(36)
	$\displaystyle\partial_{b_{i}}\mathrm{D}_{\mathrm{KL}}\left(f_{\#}p_{\mathrm{r}}\big\|\nu\right)=$	$\displaystyle\ -\mathbb{E}_{x\sim p_{\mathrm{r}}}\left[\frac{\nu^{\prime}\left(f\left(x\right)\right)\partial_{b_{i}}f\left(x\right)}{\nu\left(f\left(x\right)\right)}\right]-\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{i}a_{j}}p_{\mathrm{r}}\left(b_{i}\right),\quad i\neq 1.$
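The step-function structure of $\log f^{\prime}$ makes the derivative of $\mathbb{E}\left[\log f^{\prime}\right]$ easy to check by finite differences. The sketch below assumes a standard Gaussian reference measure, strictly positive slopes, and illustrative parameter values. It uses $f^{\prime}=\frac{1}{N}\sum_{j\leq i}a_{j}$ on $[b_{i},b_{i+1}]$ and restricts the expectation to the continuous part $z>b_{1}$ . All names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_log_slope(a, b):
    """E[log f'(z)] over z > b_1, summed interval by interval."""
    N = len(a)
    total, partial = 0.0, 0.0
    for i in range(N):
        partial += a[i]  # a_1 + ... + a_i (1-based)
        upper = std_normal_cdf(b[i + 1]) if i + 1 < N else 1.0
        total += math.log(partial / N) * (upper - std_normal_cdf(b[i]))
    return total


def analytic_derivative(a, b, i):
    """Closed-form derivative w.r.t. b_i (0-based index i)."""
    N = len(a)
    if i == 0:
        return -std_normal_pdf(b[0]) * math.log(a[0] / N)
    s_prev, s_curr = sum(a[:i]), sum(a[: i + 1])
    return math.log(s_prev / s_curr) * std_normal_pdf(b[i])


def finite_difference(a, b, i, eps=1e-6):
    """Central difference of expected_log_slope in the i-th node."""
    up, down = list(b), list(b)
    up[i] += eps
    down[i] -= eps
    return (expected_log_slope(a, up) - expected_log_slope(a, down)) / (2.0 * eps)


a = [1.0, 0.5, 2.0]
b = [-1.0, 0.0, 0.8]
gaps = [abs(analytic_derivative(a, b, i) - finite_difference(a, b, i)) for i in range(3)]
```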

With all these preparations, we can write out the gradient flow equation of the entropy functional:

$\displaystyle\dot{b_{i}}=$	$\displaystyle\ \frac{1}{\mathfrak{F}_{0}\left(b_{i}\right)-\mathfrak{F}_{0}% \left(b_{i-1}\right)}\left(\frac{\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{% i}a_{j}}p_{\mathrm{r}}\left(b_{i}\right)}{a_{i}^{2}}-\frac{\log\frac{\sum_{j=1% }^{i-2}a_{j}}{\sum_{j=1}^{i-1}a_{j}}p_{\mathrm{r}}\left(b_{i-1}\right)}{a_{i}a% _{i-1}}\right)$	(37)
	$\displaystyle\ +\frac{1}{\mathfrak{F}_{0}\left(b_{i+1}\right)-\mathfrak{F}_{0}% \left(b_{i}\right)}\left(\frac{\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{i}% a_{j}}p_{\mathrm{r}}\left(b_{i}\right)}{a_{i}^{2}}-\frac{\log\frac{\sum_{j=1}^% {i}a_{j}}{\sum_{j=1}^{i+1}a_{j}}p_{\mathrm{r}}\left(b_{i+1}\right)}{a_{i}a_{i+% 1}}\right),\quad i=2,\cdots,N-1,$
$\displaystyle\dot{b_{1}}=$	$\displaystyle\ -\frac{1}{a_{1}(\mathfrak{F}_{0}(b_{2})-\mathfrak{F}_{0}(b_{1})% )}\left(\frac{p_{\mathrm{r}}\left(b_{1}\right)\left(\log\mathfrak{F}_{0}\left(% b_{1}\right)+1+\log\frac{a_{1}}{N}\right)}{a_{1}}+\frac{\log\frac{\sum_{j=1}^{% 1}a_{j}}{\sum_{j=1}^{2}a_{j}}p_{\mathrm{r}}\left(b_{2}\right)}{a_{2}}\right),$
$\displaystyle\dot{b_{N}}=$	$\displaystyle\ \frac{\log\frac{\sum_{j=1}^{N-1}a_{j}}{\sum_{j=1}^{N}a_{j}}p_{% \mathrm{r}}\left(b_{N}\right)}{a_{N}^{2}(1-\mathfrak{F}_{0}\left(b_{N}\right))% }-\frac{1}{\mathfrak{F}_{0}\left(b_{N}\right)-\mathfrak{F}_{0}\left(b_{N-1}% \right)}\left(\frac{\log\frac{\sum_{j=1}^{N-2}a_{j}}{\sum_{j=1}^{N-1}a_{j}}p_{% \mathrm{r}}\left(b_{N-1}\right)}{a_{N}a_{N-1}}-\frac{\log\frac{\sum_{j=1}^{N-1% }a_{j}}{\sum_{j=1}^{N}a_{j}}p_{\mathrm{r}}\left(b_{N}\right)}{a_{N}^{2}}\right).$

Similar to the proof of proposition 6, one can carefully expand the neural projected dynamics of the entropy functional and prove that it converges to the heat equation in the limit where the number of neurons tends to infinity and the gap between neighboring nodes tends to zero.

4.2.2. Analysis of the long-time existence of the neural-projected heat flow

In general, the projected Wasserstein gradient flow need not be linear even when the original gradient flow is linear; e.g., the projected gradient flow corresponding to the heat equation is highly nonlinear. This poses great difficulties in analyzing and establishing the long-time existence of the projected dynamics, as mentioned in [25]. Specifically, we focus on the nonlinear projected gradient flow of the entropy, which corresponds to the heat equation in the full space. If we view all nodes $b_{i},i\in[N]$ as grid points and view the scheme as an instance of the moving mesh method [17], then the mesh quality is an important quantity to monitor during the simulation: one does not want the mesh quality to deteriorate too much, let alone become degenerate. Therefore, we consider the well-posedness of the nonlinear ODE eq. 37.

Proposition 7.

The neural projected dynamics eq. 37 of the heat flow is well-posed, i.e. the solution extends to arbitrary time.

Proof.

We consider a special scenario when two adjacent nodes $b_{i},b_{i+1}$ become close to each other while maintaining a relatively large gap with all other nodes, i.e.

o(1)=b_{i+1}-b_{i}=o(b_{p}-b_{q}),\quad\forall q\in[N]\backslash\{i,i+1\},% \quad p=i,i+1.

(38)

WLOG, we assume $b_{i+1}=b_{i}+\Delta b>b_{i}$ and simplify the following term, which appears in both of their time derivatives in eq. 37

	$\displaystyle\ \frac{1}{\mathfrak{F}_{0}\left(b_{i}\right)-\mathfrak{F}_{0}% \left(b_{i-1}\right)}\left(\frac{\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{% i}a_{j}}p_{\mathrm{r}}\left(b_{i}\right)}{a_{i}^{2}}-\frac{\log\frac{\sum_{j=1% }^{i-2}a_{j}}{\sum_{j=1}^{i-1}a_{j}}p_{\mathrm{r}}\left(b_{i-1}\right)}{a_{i}a% _{i-1}}\right)$	(39)
$\displaystyle=$	$\displaystyle\ \left(\frac{1}{p_{\mathrm{r}}(b_{i})\Delta b}+O(1)\right)\left(% \log\frac{i-1}{i}p_{\mathrm{r}}(b_{i})-\log\frac{i-2}{i-1}p_{\mathrm{r}}(b_{i-% 1})\right)$
$\displaystyle=$	$\displaystyle\ \left(\frac{1}{p_{\mathrm{r}}(b_{i})\Delta b}+O(1)\right)\left(% \log\frac{i-1}{i}p_{\mathrm{r}}(b_{i})-\log\frac{i-2}{i-1}\left(p_{\mathrm{r}}% (b_{i})+O(\Delta b)\right)\right)$
$\displaystyle=$	$\displaystyle\ \left(\frac{1}{p_{\mathrm{r}}(b_{i})\Delta b}+O(1)\right)\left(% \log\frac{i^{2}-2i+1}{i^{2}-2i}p_{\mathrm{r}}(b_{i})+O(\Delta b)\right)$
$\displaystyle=$	$\displaystyle\ \frac{1}{\Delta b}\log\frac{i^{2}-2i+1}{i^{2}-2i}+O(1)% \rightarrow+\infty,\quad\Delta b\rightarrow 0^{+},$

where we use the simplified model in which all the weights $a_{i}$ are set to $1$, together with a Taylor expansion, to conclude that $\mathfrak{F}_{0}\left(b_{i}\right)-\mathfrak{F}_{0}\left(b_{i-1}\right)=p_{\mathrm{r}}(b_{i})\Delta b+O(\Delta b^{2})$; the expansion of $p_{\mathrm{r}}(b_{i-1})$ follows in the same spirit. This term appears with a positive sign in the RHS of $\dot{b}_{i}$ and with a negative sign in the RHS of $\dot{b}_{i-1}$, indicating that the left node $b_{i-1}$ moves rapidly to the left while the right node $b_{i}$ moves rapidly to the right. This repulsion guarantees that the Lagrangian coordinates never collide with each other, so mesh degeneracy does not occur.

Next, we analyze our scheme using the time derivative of the Lagrangian coordinate. It is a well-known result that under the heat flow the mean of the distribution is fixed. Therefore, due to the diffusive nature of the heat equation, one can imagine that the position of the quantile greater than the mean should move right in the heat equation and vice versa. Suppose $x\in[b_{i},b_{i+1}]$ is a quantile with $b_{i}$ greater than the mean $0$ . As the base measure is a standard Gaussian distribution whose probability density function decreases over $[0,\infty)$ , we conclude that

0<b_{i}<b_{i+1}\Longrightarrow p_{\mathrm{r}}(b_{i})>p_{\mathrm{r}}(b_{i+1})% \Longrightarrow\log\frac{\sum_{j=1}^{i-1}a_{j}}{\sum_{j=1}^{i}a_{j}}p_{\mathrm% {r}}\left(b_{i}\right)<\log\frac{\sum_{j=1}^{i}a_{j}}{\sum_{j=1}^{i+1}a_{j}}p_% {\mathrm{r}}\left(b_{i+1}\right)<0.

(40)

Consequently, the Lagrangian coordinate $f_{b}(z)$ indeed moves to the right, which matches the intuition from the heat equation.

∎

The neural projected dynamics can be understood as a Lagrangian scheme [8, 24, 26] with a neural network basis. Specifically, fixing the basis as the ReLU components in eq. 18, one can view the $a_{i}$'s and $b_{i}$'s as the shape and location coefficients of the basis functions, respectively. Updating the $a_{i}$'s is similar to the classical finite element method with fixed basis functions, while adding the $b_{i}$'s as degrees of freedom is similar to the moving mesh method. Lagrangian schemes can handle free boundary problems such as the porous medium equation; e.g., in [24] a finite element method is used to solve for the mapping function of the porous medium equation with high accuracy. While most Lagrangian schemes are based on updating only the $a_{i}$ parameters, our method has more flexibility and expressivity as it takes more degrees of freedom into account. Lagrangian schemes have also made extensive use of the primal-dual structure of the Wasserstein gradient flow [8].

On the other hand, our numerical algorithm is closely related to the moving mesh method. The principal ingredients of the moving mesh method include the equidistribution principle, the moving mesh equation, and the method of lines approach [36]. The moving mesh equation is solved during the simulation to ensure adaptivity, so that the mesh can resolve detailed structures of the solution. In many classical moving mesh methods, the mesh equations are solved separately from the governing PDE itself to guarantee the adaptivity of the numerical method. This implies that the way the mesh changes does not depend explicitly on the underlying PDE. There also exist moving mesh methods in which the mesh updates take the governing PDE into account (e.g., the arbitrary Lagrangian-Eulerian methods [5]). From this perspective, the projected dynamics provide a PDE-specific moving mesh equation, i.e. the mesh moves according to the PDE dynamics being simulated, which is more adaptive and efficient. Moreover, through a detailed study of the simple case, we can establish a theoretical guarantee on the quality of our moving mesh method in proposition 7.

4.3. Truncation error analysis for general neural projected Wasserstein gradient flow

The consistency proof given above relies on the analytic formula derived earlier, which is restrictive. In this section, we provide another methodology to prove the consistency of the numerical scheme derived in this paper. Instead of calculating the evolution of the mapping explicitly, we calculate the deviation of the projected gradient from the original gradient direction. Let us first state a geometric proposition where we attempt to be as general as possible. This result is also proved in [25] and we prove it here for completeness.

Let $\mathcal{X}$ be a manifold (possibly infinite-dimensional) with a Riemannian metric $g_{\mathcal{X}}$ , which provides an inner product on the tangent space $T_{x}\mathcal{X}$ (possibly infinite-dimensional Hilbert space) for each $x\in\mathcal{X}$ . Let $\mathcal{Y}\subset\mathcal{X}$ be its submanifold with induced metric denoted by $g_{\mathcal{Y}}$ , i.e. $\forall y\in\mathcal{Y}$ :

g_{\mathcal{Y}}(y):T_{y}\mathcal{Y}\times T_{y}\mathcal{Y}\rightarrow\mathbb{R% },\quad g_{\mathcal{Y}}(y)(v,w)=g_{\mathcal{X}}(y)(v,w),\quad\forall v,w\in T_% {y}\mathcal{Y}.

Furthermore, let $H:\mathcal{X}\rightarrow\mathbb{R}$ be a functional defined over $\mathcal{X}$ and we use $\widetilde{H}:\mathcal{Y}\rightarrow\mathbb{R}$ for its restriction on $\mathcal{Y}$ . We have the following proposition.

Proposition 8.

Let $\nabla_{g_{\mathcal{X}}}H(y)\in T_{y}\mathcal{X}$ $(\nabla_{g_{\mathcal{Y}}}\widetilde{H}(y)\in T_{y}\mathcal{Y})$ denote the gradient of the functional $H$ w.r.t. the metric $g_{\mathcal{X}}$ $(g_{\mathcal{Y}})$ at $y\in\mathcal{X}$ $(y\in\mathcal{Y})$ . Then, we have

\nabla_{g_{\mathcal{Y}}}\widetilde{H}(y)=\Pi(y)\nabla_{g_{\mathcal{X}}}H(y),

(41)

where $\Pi(y)$ is the orthogonal projection operator from $T_{y}\mathcal{X}$ to $T_{y}\mathcal{Y}$ .

Proof.

As $\mathcal{Y}$ is a submanifold of $\mathcal{X}$, we have the inclusion map $\mathrm{I}(y):T_{y}\mathcal{Y}\rightarrow T_{y}\mathcal{X}$ and the restriction map $\mathrm{I}^{*}(y):T_{y}^{*}\mathcal{X}\rightarrow T_{y}^{*}\mathcal{Y}$ for each $y\in\mathcal{Y}$. Both mappings are linear and adjoint to each other. Therefore, viewing the metric tensor $g_{\mathcal{Y}}(y)$ as a linear mapping $T_{y}\mathcal{Y}\rightarrow T_{y}^{*}\mathcal{Y}$, we have

g_{\mathcal{Y}}(y)=\mathrm{I}^{*}(y)\circ g_{\mathcal{X}}(y)\circ\mathrm{I}(y)% ,\quad\forall y\in\mathcal{Y}.

Moreover, the inner product $g_{\mathcal{X}}(y)$ on the Hilbert space $T_{y}\mathcal{X}$ induces an orthogonal decomposition:

T_{y}\mathcal{X}=T_{y}\mathcal{Y}\oplus T_{y}\mathcal{Y}^{\perp},\quad\forall y% \in\mathcal{Y},

along with an orthogonal projection operator $\Pi(y)$ . Now, recall that the Riemannian gradient $\nabla_{g_{\mathcal{X}}}H(y)$ is defined as

g_{\mathcal{X}}(y)\nabla_{g_{\mathcal{X}}}H(y)=dH(y).

The differentials of $H(\cdot)$ and $\widetilde{H}(\cdot)$ are related by

d\widetilde{H}(y)=\mathrm{I}^{*}(y)dH(y),\quad\forall y\in\mathcal{Y}.

Therefore, gathering all the ingredients, we have the following commutative diagram

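\begin{tikzcd}
T_{y}\mathcal{X}\arrow[r,"g_{\mathcal{X}}(y)"] & T_{y}^{*}\mathcal{X}\arrow[d,"\mathrm{I}^{*}(y)"]\\
T_{y}\mathcal{Y}\arrow[u,"\mathrm{I}(y)"]\arrow[r,"g_{\mathcal{Y}}(y)"'] & T_{y}^{*}\mathcal{Y}
\end{tikzcd}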

As $\Pi(y)$ is the orthogonal projection, we conclude that

\nabla_{g_{\mathcal{Y}}}\widetilde{H}(y)=(\mathrm{I}^{*}(y)g_{\mathcal{X}}(y)% \mathrm{I}(y))^{-1}\mathrm{I}^{*}(y)dH(y)=\Pi(y)\nabla_{g_{\mathcal{X}}}H(y).

∎
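As a finite-dimensional sanity check of proposition 8, the following sketch takes $\mathcal{X}=\mathbb{R}^{3}$ with a hand-picked metric matrix `G` and lets $T_{y}\mathcal{Y}$ be the span of the two columns of `J`. The matrices, the vector `grad_X`, and all helper names are illustrative assumptions and do not come from the paper. The code forms the gradient on the subspace through $(\mathrm{I}^{*}g_{\mathcal{X}}\mathrm{I})^{-1}\mathrm{I}^{*}dH$ and checks that the residual is orthogonal to the subspace in the $g_{\mathcal{X}}$ inner product, which is what characterizes the orthogonal projection.

```python
# Finite-dimensional illustration of eq. 41 (all numbers are arbitrary choices).

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve2(M, r):
    # Solve a 2x2 linear system M c = r by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]

G = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 3.0]]        # symmetric positive definite metric on X
J = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 2.0]]             # inclusion map I: columns span the tangent space of Y
grad_X = [1.0, -2.0, 0.5]    # gradient in the full space
dH = matvec(G, grad_X)       # differential, since g_X grad_X = dH

Jt = transpose(J)                        # restriction map I^*
G_Y = matmul(matmul(Jt, G), J)           # induced metric I^* g_X I
coeffs = solve2(G_Y, matvec(Jt, dH))     # gradient on Y in subspace coordinates
grad_Y = matvec(J, coeffs)               # the same vector viewed inside X

# The residual must be G-orthogonal to every column of J.
residual = [gx - gy for gx, gy in zip(grad_X, grad_Y)]
orth = matvec(Jt, matvec(G, residual))

def g_norm_sq(w):
    return sum(wi * gi for wi, gi in zip(w, matvec(G, w)))

print(orth, g_norm_sq(residual))
```

Any other element of the subspace gives a strictly larger value of `g_norm_sq`, which is the least-squares reading of the projection.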

We can prove the consistency of our numerical schemes for different PDEs with Wasserstein gradient flow structures by leveraging this proposition in the case where $\mathcal{X}=P_{2}^{\infty}(\mathbb{R})$ is the density manifold and $g_{\mathcal{X}}$ is chosen to be the $W_{2}$ metric. To achieve this, we can rewrite eq. 41 as

\nabla_{g_{\mathcal{Y}}}\widetilde{H}(y)=\operatorname{argmin}_{v\in T_{y}% \mathcal{Y}}\left\|\nabla_{g_{\mathcal{X}}}H(y)-v\right\|_{g_{\mathcal{X}}}.

(42)

Therefore, any $v\in T_{y}\mathcal{Y}$ provides an upper bound, namely $\left\|\nabla_{g_{\mathcal{X}}}H(y)-v\right\|_{g_{\mathcal{X}}}$, for the truncation error of our approximation scheme. Moreover, if we assume that the submanifold $\mathcal{Y}\subset P_{2}^{\infty}(\mathbb{R})$ is given by a generative model via the mapping function $f_{\theta\#}$, i.e. $\mathcal{Y}=\{f_{\theta\#}p_{\mathrm{r}}:\theta\in\Theta\}$ with $p_{\mathrm{r}}$ the base measure, then the projected gradient direction can also be characterized using the metric over the mapping space, i.e.

\nabla_{\Theta}\widetilde{H}(\theta)=\operatorname{argmin}_{v\in T_{\theta}% \Theta}\int(\nabla_{\theta}H(\theta)(x)-v(x))^{2}f_{\theta\#}p_{\mathrm{r}}(x)dx,

(43)

where $\theta$ is mapped to the point $y$ and we abuse the notation $\nabla_{\Theta}H(\theta)$ to denote the gradient vector in the mapping coordinate. Moreover, we can perform the truncation error analysis directly over the mapping space, which is more convenient and transparent. Let us focus on the ReLU network mapping eq. 18. The tangent space in the mapping coordinate is spanned by the vectors in eq. 20. Meanwhile, the tangent space in the density coordinate is spanned by

\partial_{b_{i}}f_{\theta\#}p_{\mathrm{r}}(x)=\frac{a_{i}}{N}p_{\mathrm{r}}^{% \prime}(x)\mathbf{1}_{[f(\theta,b_{i}),\infty)},\quad\partial_{a_{i}}f_{\theta% \#}p_{\mathrm{r}}(x)=\frac{f_{\theta}^{-1}(x)-b_{i}}{N}p_{\mathrm{r}}^{\prime}% (x)\mathbf{1}_{[f(\theta,b_{i}),\infty)},

(44)

where we use the notation $f_{\theta}(\cdot)=f(\theta,\cdot)$. Here we use the fact that the mapping $f_{\theta}$ is linear with slope $\frac{\sum_{j=1}^{i}a_{j}}{N}$ over the interval $[b_{i},b_{i+1}]$. If the $b_{i}$'s are fixed, the projected dynamics belongs to the class of projection-based model reduction methods [6], where the basis is fixed to be the neurons, while changing the $b_{i}$'s corresponds to model reduction with an adaptive basis.

Proposition 9.

The numerical scheme based on ReLU network mapping is consistent with order $2$ using both $a,b$ parameters and of order $1$ with either $a$ or $b$ parameters.

Proof.

In view of eq. 20, the approximation using only $\partial_{b_{i}}f_{\theta}$ is simply a piecewise constant approximation: each basis element has the shape of a Heaviside function, so the scheme is consistent of order $1$. The approximation using both $\partial_{b_{i}}f_{\theta}$ and $\partial_{a_{i}}f_{\theta}$ is a piecewise linear approximation, because another set of ReLU-shaped functions is added to the basis, and is therefore consistent of order $2$. ∎
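The two orders in proposition 9 can be observed numerically with a small experiment. The sketch below is only an illustration under simplifying assumptions that are not in the paper: a fixed uniform mesh on $[-2,2]$, a uniform weight instead of $p_{\mathrm{r}}$, the test function $v=\sin$, and midpoint quadrature. It computes the $L^{2}$ error of the best piecewise constant and best (discontinuous) piecewise linear approximations and estimates the convergence order by halving the mesh size.

```python
import math

def cell_errors(v, left, right, n_quad=200):
    # Midpoint quadrature on one cell with uniform weight.
    h = (right - left) / n_quad
    zs = [left + (k + 0.5) * h for k in range(n_quad)]
    mid = 0.5 * (left + right)
    mean = sum(v(z) for z in zs) / n_quad
    slope = (sum(v(z) * (z - mid) for z in zs)
             / sum((z - mid) ** 2 for z in zs))
    # Squared L2 error of the best constant and of the best linear fit.
    err_const = sum((v(z) - mean) ** 2 * h for z in zs)
    err_lin = sum((v(z) - mean - slope * (z - mid)) ** 2 * h for z in zs)
    return err_const, err_lin

def l2_errors(v, n_cells, a=-2.0, b=2.0):
    width = (b - a) / n_cells
    tot_const = tot_lin = 0.0
    for i in range(n_cells):
        ec, el = cell_errors(v, a + i * width, a + (i + 1) * width)
        tot_const += ec
        tot_lin += el
    return math.sqrt(tot_const), math.sqrt(tot_lin)

coarse = l2_errors(math.sin, 16)
fine = l2_errors(math.sin, 32)
order_const = math.log2(coarse[0] / fine[0])   # expected close to 1
order_lin = math.log2(coarse[1] / fine[1])     # expected close to 2
print(order_const, order_lin)
```

With a moving, non-uniform mesh the constants change, but the same halving experiment can be repeated to estimate the observed order.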

The connection between the ReLU neural network and the linear finite element space is systematically studied in [15]. They theoretically establish that at least two hidden layers are needed in a ReLU neural network to represent any linear finite element functions in $\Omega\subset\mathbb{R}^{d}$ when $d\geq 2.$

Based on this concrete understanding of the structure of the tangent space, we can calculate the local truncation error of the projected gradient flow.

Theorem 3.

Given a tangent vector $v(x)\in T_{f_{\theta\#}p_{\mathrm{r}}}\mathcal{P}(\mathbb{R})$ whose approximated tangent vector in projected dynamics is given by $\nabla_{\theta}H(\theta)$ , the local truncation error in the ReLU network mapping is given by

\sum_{i=1}^{N}\left[\int_{b_{i}}^{b_{i+1}}v^{2}(f_{\theta}(z))p_{\mathrm{r}}(z)dz-\frac{\left(\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))(z-m_{i})p_{\mathrm{r}}(z)dz\right)^{2}}{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}-\frac{\left(\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))p_{\mathrm{r}}(z)dz\right)^{2}}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i})}\right],

(45)

where $m_{i}$ is the center of mass of $p_{\mathrm{r}}(z)$ in $[b_{i},b_{i+1}]$ and $b_{N+1}$ is understood as $+\infty$. Under the assumption that $v$ has a bounded second-order derivative and $b_{i+1}-b_{i}<\Delta b$ for all $i$, we have

\left\|v(x)-\nabla_{\theta}H(\theta)\right\|_{L^{2}(f_{\theta\#}p_{\mathrm{r}}% )}^{2}=\frac{1}{4}\left(\frac{\sum_{j=1}^{N}a_{j}}{N}\right)^{2}\left\|v^{% \prime\prime}\right\|_{\infty}O(\Delta b^{4}).

(46)

Proof.

As mentioned in the proof of proposition 9, the approximation using the ReLU network mapping is equivalent to a piecewise linear approximation in the mapping coordinate. Moreover, at each node $b_{i}$, the slope and the value of the approximating function do not need to be continuous, which is exactly the same as the linear spline interpolation. The main difference is that the grid points $b_{i}$ are not fixed since they can evolve over time. Therefore, we rewrite the optimization problem eq. 43 as

\arg\min_{c_{i},d_{i}}\quad\sum_{i=1}^{N}\int_{f_{\theta}(b_{i})}^{f_{\theta}(% b_{i+1})}\left(v(x)-c_{i}x-d_{i}\right)^{2}f_{\theta\#}p_{\mathrm{r}}(x)dx,

(47)

which can be further reduced to $N-1$ separate optimization problems in $c_{i},d_{i}$, each over a small interval $[f_{\theta}(b_{i}),f_{\theta}(b_{i+1})]$. For each subproblem, we have

\displaystyle\int_{f_{\theta}(b_{i})}^{f_{\theta}(b_{i+1})}\left(v(x)-c_{i}x-d% _{i}\right)^{2}f_{\theta\#}p_{\mathrm{r}}(x)dx=\int_{b_{i}}^{b_{i+1}}\left(v(f% _{\theta}(z))-c_{i}f_{\theta}(z)-d_{i}\right)^{2}p_{\mathrm{r}}(z)dz.

This is a quadratic optimization problem of $c_{i},d_{i}$ with positive definite Hessian matrix. Taking derivative w.r.t. $c_{i},d_{i}$ , we obtain

	$\displaystyle\int_{b_{i}}^{b_{i+1}}\left(v(f_{\theta}(z))-c_{i}f_{\theta}(z)-d% _{i}\right)p_{\mathrm{r}}(z)dz$	$\displaystyle=0,$
	$\displaystyle\int_{b_{i}}^{b_{i+1}}f_{\theta}(z)\left(v(f_{\theta}(z))-c_{i}f_% {\theta}(z)-d_{i}\right)p_{\mathrm{r}}(z)dz$	$\displaystyle=0.$

Now, using the fact that $f_{\theta}(z)$ is a linear function over the interval $[b_{i},b_{i+1}]$ , we have

\displaystyle c_{i}f_{\theta}(z)+d_{i}=\frac{\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))(z-m_{i})p_{\mathrm{r}}(z)dz}{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}(z-m_{i})+\frac{\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))p_{\mathrm{r}}(z)dz}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i})}.

(48)

Plugging back, we obtain the approximation error as

	$\displaystyle\int_{b_{i}}^{b_{i+1}}\left(v(f_{\theta}(z))-c_{i}f_{\theta}(z)-d% _{i}\right)^{2}p_{\mathrm{r}}(z)dz$	(49)
$\displaystyle=$	$\displaystyle\ \int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))\left(v(f_{\theta}(z))-c_% {i}f_{\theta}(z)-d_{i}\right)p_{\mathrm{r}}(z)dz$
$\displaystyle=$	$\displaystyle\ \int_{b_{i}}^{b_{i+1}}v^{2}(f_{\theta}(z))p_{\mathrm{r}}(z)dz-% \frac{\left(\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))(z-m_{i})p_{\mathrm{r}}(z)dz% \right)^{2}}{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}-\frac{% \left(\int_{b_{i}}^{b_{i+1}}v(f_{\theta}(z))p_{\mathrm{r}}(z)dz\right)^{2}}{% \mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i})}.$

Next, we assume all the intervals $[b_{i},b_{i+1}]$ are short (of scale $O(\Delta)$) and expand $v$ in a Taylor series around $m_{i}$, i.e.

	$\displaystyle v(f_{\theta}(z))=$	$\displaystyle\ v(f_{\theta}(m_{i}))+\frac{\sum_{j=1}^{i}a_{j}}{N}v^{\prime}(f_% {\theta}(m_{i}))(z-m_{i})$		(50)
		$\displaystyle\ +\frac{1}{2}\left(\frac{\sum_{j=1}^{i}a_{j}}{N}\right)^{2}v^{% \prime\prime}(f_{\theta}(m_{i}))(z-m_{i})^{2}+O(\Delta^{3}).$		(50)

Here, we use the fact that $f_{\theta}(z)$ is a linear function with slope $\frac{\sum_{j=1}^{i}a_{j}}{N}$ over the interval $[b_{i},b_{i+1}]$ . Plugging into eq. 48, we obtain

$\displaystyle c_{i}f_{\theta}(z)+d_{i}=$	$\displaystyle\ \frac{\sum_{j=1}^{i}a_{j}}{N}v^{\prime}(f_{\theta}(m_{i}))(z-m_{i})+v(f_{\theta}(m_{i}))$	(51)
	$\displaystyle\ +\frac{1}{2}\left(\frac{\sum_{j=1}^{i}a_{j}}{N}\right)^{2}v^{% \prime\prime}(f_{\theta}(m_{i}))\frac{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{% \mathrm{r}}(z)dz}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F}_{0}(b_{i})}$
	$\displaystyle\ +\frac{1}{2}\left(\frac{\sum_{j=1}^{i}a_{j}}{N}\right)^{2}v^{% \prime\prime}(f_{\theta}(m_{i}))\frac{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{3}p_{% \mathrm{r}}(z)dz}{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}(z-m_% {i})+O(\Delta^{3}).$

Notice that the first two terms are exactly the zeroth- and first-order terms of the function $v(f_{\theta}(z))$, as in the classical linear approximation obtained by discarding all higher-order terms. The residual terms appear because the approximation is taken in the $L^{2}(p)$ sense. To calculate the $L^{2}$-approximation error, we have

	$\displaystyle\ \left[\frac{1}{2}\left(\frac{\sum_{j=1}^{i}a_{j}}{N}\right)^{2}% v^{\prime\prime}(f_{\theta}(m_{i}))\right]^{2}$	(52)
	$\displaystyle\ \int_{b_{i}}^{b_{i+1}}\left((z-m_{i})^{2}-\frac{\int_{b_{i}}^{b% _{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}{\mathfrak{F}_{0}(b_{i+1})-\mathfrak{F% }_{0}(b_{i})}-\frac{\int_{b_{i}}^{b_{i+1}}(z-m_{i})^{3}p_{\mathrm{r}}(z)dz}{% \int_{b_{i}}^{b_{i+1}}(z-m_{i})^{2}p_{\mathrm{r}}(z)dz}(z-m_{i})+O(\Delta^{3})% \right)^{2}p_{\mathrm{r}}(z)dz$
$\displaystyle=$	$\displaystyle\ \left[\frac{1}{2}\left(\frac{\sum_{j=1}^{i}a_{j}}{N}\right)^{2}% v^{\prime\prime}(f_{\theta}(m_{i}))\right]^{2}O((b_{i+1}-b_{i})^{5})p_{\mathrm% {r}}(b_{i})+O((b_{i+1}-b_{i})^{6})p_{\mathrm{r}}(b_{i}).$

In summary, the $L^{2}$ approximation error is a sum over all the intervals $[b_{i},b_{i+1}]$, with each term depending on the $a_{i}$'s through the factor $\frac{\sum_{j=1}^{i}a_{j}}{N}$, and on the $b_{i}$'s through $(b_{i+1}-b_{i})^{5}$ and through the term $v^{\prime\prime}(f_{\theta}(m_{i}))$, which also contains $a_{i},b_{i}$. ∎
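The per-interval computation in the proof can be checked numerically on a single cell. The following sketch is illustrative only: the choice $v=\sin$, the affine map $f_{\theta}(z)=\texttt{slope}\cdot z+\texttt{offset}$, the cell endpoints, and the midpoint quadrature are assumptions made here for the example. It evaluates the closed-form cell contribution appearing in eq. 45 and compares it with the directly computed residual of the minimizer in eq. 48, then halves the cell width to observe the fifth-power scaling of a single cell's squared error.

```python
import math

def gauss_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cell_error(v, slope, offset, b_lo, b_hi, n_quad=4000):
    # Midpoint nodes with Gaussian-weighted quadrature weights p_r(z) dz.
    h = (b_hi - b_lo) / n_quad
    zs = [b_lo + (k + 0.5) * h for k in range(n_quad)]
    ws = [gauss_pdf(z) * h for z in zs]
    g = [v(slope * z + offset) for z in zs]          # v(f_theta(z)) on the cell

    mass = sum(ws)                                   # F_0(b_hi) - F_0(b_lo)
    m = sum(w * z for w, z in zip(ws, zs)) / mass    # centre of mass m_i
    second = sum(w * (z - m) ** 2 for w, z in zip(ws, zs))
    proj_lin = sum(w * gi * (z - m) for w, gi, z in zip(ws, g, zs))
    proj_const = sum(w * gi for w, gi in zip(ws, g))

    # Closed-form contribution of this cell, as in eq. 45.
    formula = (sum(w * gi * gi for w, gi in zip(ws, g))
               - proj_lin ** 2 / second - proj_const ** 2 / mass)
    # Residual of the explicit minimiser from eq. 48, integrated directly.
    direct = sum(w * (gi - proj_lin / second * (z - m) - proj_const / mass) ** 2
                 for w, gi, z in zip(ws, g, zs))
    return formula, direct

wide = cell_error(math.sin, 1.0, 0.0, 1.0, 1.4)
narrow = cell_error(math.sin, 1.0, 0.0, 1.0, 1.2)
ratio = wide[1] / narrow[1]     # halving the width should give roughly 2**5
print(wide, narrow, ratio)
```

Summing the fifth-power cell contributions over roughly $1/\Delta b$ cells is what yields the $O(\Delta b^{4})$ bound on the squared error.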

Let us calculate a special case of the Fokker-Planck equation

\partial_{t}p(t,x)-\nabla\cdot(p(t,x)\nabla V(x))-\gamma\Delta p(t,x)=0.

Under the Wasserstein metric, the tangent vector in the mapping space is given by

v(x)=-V^{\prime}(x)-\gamma\frac{p^{\prime}(t,x)}{p(t,x)}.

In this case, we have that

v^{\prime\prime}(x)=-V^{(3)}(x)-\gamma\frac{p^{(3)}(t,x)p(t,x)^{2}+2p^{\prime}% (t,x)^{3}-3p(t,x)p^{\prime}(t,x)p^{\prime\prime}(t,x)}{p(t,x)^{3}}.

The above function will determine the approximation quality of the projected dynamics.
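The expression for $v^{\prime\prime}$ can be checked against a finite difference. In the sketch below, the potential $V(x)=x^{4}/4$, the unnormalized density $p\propto e^{-x^{4}/4}$, and the value of $\gamma$ are arbitrary choices made for illustration; the derivatives of $p$ are hand-coded, and the normalization constant is irrelevant because only ratios of $p$ and its derivatives enter.

```python
import math

GAMMA = 0.5  # illustrative diffusion coefficient

def p(x):
    return math.exp(-x ** 4 / 4.0)

def dp(x):
    return -x ** 3 * p(x)

def d2p(x):
    return (x ** 6 - 3.0 * x ** 2) * p(x)

def d3p(x):
    return (-x ** 9 + 9.0 * x ** 5 - 6.0 * x) * p(x)

def v(x):
    # Tangent vector v = -V' - gamma * p'/p with V(x) = x^4 / 4.
    return -x ** 3 - GAMMA * dp(x) / p(x)

def v_second_formula(x):
    # Second derivative of v from the closed-form expression, with V''' = 6x.
    num = d3p(x) * p(x) ** 2 + 2.0 * dp(x) ** 3 - 3.0 * p(x) * dp(x) * d2p(x)
    return -6.0 * x - GAMMA * num / p(x) ** 3

def v_second_fd(x, h=1e-3):
    # Central second difference of v.
    return (v(x + h) - 2.0 * v(x) + v(x - h)) / h ** 2

for x in (-1.3, 0.2, 0.7):
    print(x, v_second_formula(x), v_second_fd(x))
```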

Remark 1.

The high-order neural mapping function class and associated high-order projected dynamics can also be derived following a similar procedure. For example, we can add a quadratic term of the ReLU function into the network mapping function as

f\left(\theta,z\right)=\frac{1}{N}\sum_{i=1}^{N}a_{i}\sigma\left(z-b_{i}\right% )+c_{i}\sigma^{2}\left(z-b_{i}\right).

(53)

Notice that adding a high-order ReLU term is different from increasing the number of layers in the ReLU neural network, which corresponds to function composition. We leave the detailed analysis and numerical experiments on high-order methods to future work.

5. Numerical Examples

In this section, we provide several numerical experiments to test our algorithm and theories. We focus our attention on the linear transport equation, Fokker-Planck equation, porous medium equations, and Keller-Segel equation. They all correspond to some specific energy functionals in the probability space equipped with the Wasserstein-2 distance.

5.1. Neural Network structure

We first describe the structure of our neural network for the experiments. We focus on a two-layer neural network with ReLU activation functions:

f(\theta,z)=\sum_{i=1}^{N}a_{i}\cdot\sigma(z-b_{i})+\sum_{i=N+1}^{2N}a_{i}% \cdot\sigma(b_{i}-z)\,.

(54)

Here $\theta\in\mathbb{R}^{4N}$ represents the collection of weights $\{a_{i}\}_{i=1}^{2N}$ and biases $\{b_{i}\}_{i=1}^{2N}$. To simplify our notation, we have absorbed the $1/N$ factor into the $a_{i}$'s. At initialization, we set $a_{i}=1/N$ for $i\in\{1,\ldots,N\}$ and $a_{i}=-1/N$ for $i\in\{N+1,\ldots,2N\}$. To choose the $b_{i}$'s, we first set $\mathbf{b}=\textrm{linspace}(-B,B,N)$ for some positive constant $B$ (e.g. $B=4$ or $B=10$). We then set $b_{i}=\mathbf{b}[i]$ for $i=1,\ldots,N$ and $b_{j}=\mathbf{b}[j-N]+\varepsilon$ for $j=N+1,\ldots,2N$. Here $\varepsilon=5\times 10^{-6}$ is a small offset which will be explained later in Section 5.3. Our initialization is chosen such that $f(\theta,\cdot)$ approximates the identity map. In practice, we find it beneficial to rescale the weights $a_{i}$: we replace $a_{i}$ with $\overline{a}_{i}/\beta$ for some fixed constant $\beta>0$, and we initialize $\overline{a}_{i}=\beta/N$ for $i\in\{1,\ldots,N\}$ and $\overline{a}_{i}=-\beta/N$ for $i\in\{N+1,\ldots,2N\}$. This rescaling ensures that $f(\theta,\cdot)$ still approximates the identity map at initialization. We provide a brief intuition for the rescaling. Let us consider $g(a,b,z)=a\cdot\sigma(b-z)$ for $b=\mathcal{O}(1)$, $z=\mathcal{O}(1)$, $a=\mathcal{O}(1/N)$. Then $\partial_{a}g=\sigma(b-z)=\mathcal{O}(1)$. On the other hand, $\partial_{b}g=a\cdot\sigma^{\prime}(b-z)=\mathcal{O}(1/N)$. This simple calculation shows that the partial derivatives of (54) with respect to the weights and the biases are of different scales. Therefore, to bring them to the same scale, a natural choice is $\beta=\mathcal{O}(N)$.
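The initialization described above can be sketched in a few lines. The function names `init_params` and `f`, the list-based parameter layout, and the choice $N=50$ are assumptions made for this illustration, and the $\beta$-rescaling is omitted. The check at the end confirms that the initialized network deviates from the identity map by at most roughly $\varepsilon$ at a few sample points.

```python
def relu(u):
    return u if u > 0.0 else 0.0

def init_params(n, bound=4.0, eps=5e-6):
    # linspace(-bound, bound, n), written out without external libraries.
    grid = [-bound + 2.0 * bound * k / (n - 1) for k in range(n)]
    a = [1.0 / n] * n + [-1.0 / n] * n        # weights of the two ReLU families
    b = grid + [g + eps for g in grid]        # second family is shifted by eps
    return a, b

def f(a, b, z):
    # Network of eq. 54: right-facing ReLUs plus left-facing ReLUs.
    n = len(a) // 2
    right = sum(a[i] * relu(z - b[i]) for i in range(n))
    left = sum(a[i] * relu(b[i] - z) for i in range(n, 2 * n))
    return right + left

a, b = init_params(50)
samples = [-5.0, -1.3, 0.0, 0.37, 2.0, 6.0]
max_dev = max(abs(f(a, b, z) - z) for z in samples)
print(max_dev)
```

The identity is reproduced because $\sigma(u)-\sigma(-u)=u$ and the grid is symmetric about the origin, so the averaged bias terms cancel.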

Remark 2.

The choice of neural network (54) is slightly more complicated than the one studied in Section 4. This symmetric structure is used in the numerical experiments to overcome the drawback of ReLU that only positive inputs are activated. Moreover, (54) allows us to construct an approximation to the identity map over $\mathbb{R}$ easily. The results of Proposition 9 still hold for (54), and Theorem 3 can be generalized to neural networks of the form given in (54) in a straightforward manner. The metric tensor $G_{\mathrm{W}}$ is now a $4N\times 4N$ matrix. The calculations of the individual components of $G_{\mathrm{W}}$ follow the same procedure presented in proposition 4.

Remark 3.

We remind our readers that our algorithm takes the form of

\theta^{k+1}=\theta^{k}-hG_{\mathrm{W}}(\theta^{k})^{\dagger}\nabla_{\theta}\tilde{F}(\theta^{k})\,.

During implementation, $\nabla_{\theta}\tilde{F}(\theta)$ can be obtained by backpropagating $\tilde{F}(\theta)$ in the case of Example 4 and Example 5. However, we need to pay special attention to $\partial_{b_{i}}F(\theta)$ when dealing with Example 6. This will be elaborated further in Section 5.3 and Section 5.4.

5.2. Linear transport PDE

We investigate the linear transport PDE given by Eq. (14) with several choices of potential $V(x)$, each of which corresponds to the gradient flow of the associated potential energy under the Wasserstein metric. For a simple potential function, this example can serve as a sanity check of the projected dynamics formulation. The trajectories of the particles for Eq. (14) (i.e. the Lagrangian formulation) follow the ODE

\dot{x}(t)=-\nabla V(x)\,.

(55)

Let us denote by $T(t,z_{0})$ the solution to Eq. (55) with initial condition $x(0)=z_{0}$ . In other words, $T(t,z_{0})$ is the transport map at time $t$ starting from position $z_{0}$ . We define the error at time $t$ by

	$\displaystyle\mathrm{error}$	$\displaystyle=\int_{-\infty}^{\infty}\|f(\theta_{t},z_{0})-T(t,z_{0})\|p_{0}(z_{% 0})\;dz_{0}$
		$\displaystyle\approx\frac{1}{N_{1}}\sum_{j=1}^{N_{1}}\|f(\theta_{t},z_{j})-T(t,% z_{j})\|p_{0}(z_{j})\,,$		(56)

where we discretize the integration domain by $N_{1}$ equally spaced points to approximate the integral. And $p_{0}(z_{0})$ denotes the initial distribution of $z_{0}$ . Below we test our projected dynamics under three choices of potential functions and investigate the convergence behavior of two projected dynamics: (i) fixing the bias terms $b_{i}$ and only updating the weights $a_{i}$ and (ii) updating both bias $b_{i}$ and weights $a_{i}$ . Note that when the bias terms $b_{i}$ ’s are fixed, we have that $G_{\mathrm{W}}\in\mathbb{R}^{2N\times 2N}$ . Recall that we are essentially simulating the gradient flow on parameter $\theta_{t}$ given by Eq. (11). We use $M=5\times 10^{5}$ particles sampled from a standard Gaussian distribution for approximating $\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}V(f(\theta,\tilde{z}))\Big{]}$ . Once we have the empirical loss function

\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}V(f(\theta,\tilde{z}))\Big{]}% \approx\frac{1}{M}\sum_{i=1}^{M}V(f(\theta,z_{i}))\,,

we can backpropagate this loss to obtain

\mathbb{E}_{\tilde{z}\sim p_{\mathrm{r}}}\Big{[}\nabla_{\theta}V(f(\theta,% \tilde{z}))\Big{]}\approx\frac{1}{M}\sum_{i=1}^{M}\nabla_{\theta}V(f(\theta,z_% {i}))\,,

which will be used in the update of $\theta_{t}$ given by Eq. (11).

5.2.1. Quadratic potential

As the first example for linear transport PDE, we consider the quadratic potential $V(x)=\frac{1}{2}(x-\mu_{0})^{2}$ as a sanity check. The stationary distribution will be the delta mass supported at $\mu_{0}$ . Using the method of characteristics, one can show that the solution at time $t>0$ is given by

p(t,x)=p_{0}\big{(}(x-\mu_{0})e^{t}+\mu_{0}\big{)}e^{t}\,,

(57)

where $p_{0}(x)=p(0,x)$ is the initial distribution. In Lagrangian coordinates, the transport map of a point $z_{0}$ at time $t$ is given by

T(t,z_{0})=\mu_{0}+e^{-t}(z_{0}-\mu_{0})\,.

(58)

One can check that $T(0,z_{0})=z_{0}$ and $T(t,z_{0})\to\mu_{0}$ as $t\to\infty$. It is worth mentioning that at each $t>0$, the Lagrangian map $x_{t}(z_{0}):z_{0}\mapsto T(t,z_{0})$ is a linear map. For simplicity, we take $\mu_{0}=0$. We choose $dt=10^{-3}$ and run for 1000 steps. We compare our numerical results with Eq. (58). The result is demonstrated in Fig. 2. In Fig. 1(b), we have provided a visualization of the analytic solution to the linear transport PDE in Lagrangian coordinates at $t=1$ and our computed solution. As shown in Fig. 1(b), the analytic transport map is linear while the neural mapping function is piecewise linear. Increasing $N$ does not necessarily give a smaller approximation error. In fact, we see in Fig. 1(a) that larger $N$ usually gives a larger error, a behavior commonly known as overfitting in machine learning.
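For reference, the closed-form map in Eq. (58) can be compared with a direct forward Euler integration of the particle ODE (55), using the same step size and number of steps as above. This is only a sketch of the reference computation for a single particle, not the neural projected scheme; the function names and the starting point $z_{0}=2$ are illustrative choices.

```python
import math

def transport_map(t, z0, mu0=0.0):
    # Closed-form Lagrangian map of Eq. (58).
    return mu0 + math.exp(-t) * (z0 - mu0)

def euler_particle(z0, mu0=0.0, dt=1e-3, steps=1000):
    # Forward Euler for x' = -(x - mu0), i.e. Eq. (55) with the quadratic potential.
    x = z0
    for _ in range(steps):
        x -= dt * (x - mu0)
    return x

z0 = 2.0
gap = abs(euler_particle(z0) - transport_map(1.0, z0))
print(gap)
```

The gap is of the order of the step size, which gives a rough sense of the time-discretization error that is present independently of the network approximation.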

5.2.2. Quartic potential

Let us consider $V(x)=(x-1)^{4}/4-(x-1)^{2}/2$ . The analytic solution of the transport map is given by

T(t,z_{0})=\left\{\begin{aligned} &\mathrm{sgn}(z_{0}-1)\frac{e^{t}}{\sqrt{(z_% {0}-1)^{-2}+e^{2t}-1}}+1,\quad z_{0}\neq 1\,,\\ &1,\hskip 184.9429ptz_{0}=1\,.\end{aligned}\right.

(59)

Basic settings are the same as the previous case. We choose $dt=2\times 10^{-4}$ and run for 1000 steps. We compare our numerical results with Eq. (59). We present our results in Fig. 3. In Fig. 2(a), we observe a clear decrease in error as the number of neurons becomes larger. In Fig. 2(b), we visualize the analytic solution to the linear transport PDE in Lagrangian coordinates at $t=0.2$ and our computed solution. We can see that even when the optimal transport map is nonlinear, our computed solution still matches the analytic solution very accurately.
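The closed-form map in Eq. (59) can be verified by checking that it satisfies the particle ODE (55). The sketch below, with illustrative function names and sample points, differentiates the map in time by a central difference and adds $V^{\prime}(T)$; the result should vanish up to finite-difference error.

```python
import math

def quartic_map(t, z0):
    # Closed-form map of Eq. (59) for V(x) = (x-1)^4/4 - (x-1)^2/2.
    u0 = z0 - 1.0
    if u0 == 0.0:
        return 1.0
    sign = 1.0 if u0 > 0.0 else -1.0
    return 1.0 + sign * math.exp(t) / math.sqrt(u0 ** -2 + math.exp(2.0 * t) - 1.0)

def grad_V(x):
    return (x - 1.0) ** 3 - (x - 1.0)

def ode_residual(t, z0, h=1e-5):
    # dT/dt + V'(T), which is zero along exact solutions of Eq. (55).
    dT = (quartic_map(t + h, z0) - quartic_map(t - h, z0)) / (2.0 * h)
    return dT + grad_V(quartic_map(t, z0))

starts = [-3.0, -0.5, 0.4, 1.0, 2.5]
worst = max(abs(ode_residual(0.2, z0)) for z0 in starts)
print(worst)
```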

5.2.3. Sixth order polynomial potential

Let us consider $V(x)=(x-4)^{6}/6$ . The analytic solution of the transport map is given by

T(t,z_{0})=\left\{\begin{aligned} &4+\mathrm{sgn}(z_{0}-4)\frac{1}{\sqrt{2% \sqrt{\frac{1}{4(z_{0}-4)^{4}}+t}}},\quad z_{0}\neq 4\,,\\ &4,\hskip 161.61143ptz_{0}=4\,.\end{aligned}\right.

(60)

We choose $dt=10^{-6}$ and run for 1000 steps. Such a small step size is needed because the ODE (55) is stiff when $V(x)$ is a sixth order polynomial. This can be seen from the forward Euler scheme for (55), which is the familiar gradient descent algorithm. The step size that guarantees convergence of gradient descent is at most $2/L$, where $L$ is the Lipschitz constant of the gradient. In our case, the gradient $\nabla V(x)$ is not globally Lipschitz. Even on a fixed interval $(-l,l)$, the Lipschitz constant is $L=5(l+4)^{4}$. Taking $l=10$ gives $L=5\cdot 14^{4}\approx 1.9\times 10^{5}$, so the admissible step size $2/L$ is only of order $10^{-5}$. We compare our numerical results with Eq. (60). We have chosen $\{z_{j}\}_{j=1}^{N_{1}}$ to be a uniform mesh of size $N_{1}=4\times 10^{6}$ on $[-6,6]$ in Eq. (56), and $p_{0}$ is the standard Gaussian distribution. Note that $N_{1}$ is chosen to be much larger than the number of neurons $N$ in the network mapping function, since it is used to evaluate the accuracy of our algorithm.

We present our results in Fig. 4. Fig. 3(a) shows a clear decrease in error as $N$ increases. It is also clear from Fig. 3(a) that updating both the weights and the biases tends to give a smaller error than updating only the weights, although the difference shrinks as $N$ increases and more mesh points become available. Comparing the dashed and solid lines in Fig. 3(a), we find that the initialization of the $b_{i}$ also affects the overall accuracy of our solution. The error is smaller when the initial mesh points (i.e., the $b_{i}$'s) are more concentrated near the center of the reference measure. In our case, the reference measure is a standard Gaussian, whose mass is almost entirely contained in $[-4,4]$. Hence the solid lines show a smaller error than the dashed lines in Fig. 3(a). Fig. 3(b) visualizes the analytic solution to the linear transport PDE in Lagrangian coordinates at $t=10^{-3}$ together with our computed solution. It is worth noting from Fig. 3(b) that our learned Lagrangian map approximates the analytic Lagrangian map well near the center of the reference distribution, which is concentrated near the origin. Even though the error of the learned Lagrangian map is larger outside of $[-4,4]$, the overall error from Eq. (56) is still small, since the mass of the reference measure (standard Gaussian) on $\mathbb{R}\setminus[-4,4]$ is exponentially small.
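The stiffness estimate and the analytic map in Eq. (60) can be checked numerically. The following is a minimal sketch, assuming that the ODE (55) is the characteristic equation $\dot{z}=-V'(z)$ with $V(x)=(x-4)^{6}/6$ (which is the equation that Eq. (60) solves); the function names are illustrative and are not taken from the authors' code.

```python
import numpy as np

def lipschitz_bound(l):
    """Supremum of |V''(x)| = 5 (x - 4)^4 over (-l, l), attained as x -> -l."""
    return 5.0 * (l + 4.0) ** 4

def analytic_map(t, z0):
    """Analytic Lagrangian map of Eq. (60) for V(x) = (x - 4)^6 / 6."""
    z0 = np.atleast_1d(np.asarray(z0, dtype=float))
    out = np.full_like(z0, 4.0)              # z0 = 4 is a fixed point
    mask = z0 != 4.0
    y0 = z0[mask] - 4.0
    out[mask] = 4.0 + np.sign(y0) / np.sqrt(2.0 * np.sqrt(1.0 / (4.0 * y0**4) + t))
    return out

def forward_euler(z0, dt, n_steps):
    """Forward Euler (gradient descent) for dz/dt = -V'(z) = -(z - 4)^5."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = z - dt * (z - 4.0) ** 5
    return z

L = lipschitz_bound(10.0)
print("L =", L, " step-size bound 2/L =", 2.0 / L)

z0 = np.linspace(-6.0, 6.0, 241)
err = np.max(np.abs(forward_euler(z0, 1e-6, 1000) - analytic_map(1e-3, z0)))
print("max |forward Euler - analytic| at t = 1e-3:", err)
```

With $dt=10^{-6}$ the explicit iteration stays well inside the stability bound on $[-6,6]$, which is why it can serve as a cross-check of Eq. (60) here.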

Remark 4.

According to Proposition 9, updating both $a_{i}$ and $b_{i}$ is a second order method. This can be seen from Fig. 3(a) when $N$ is small. When $N$ is large, the numerical advantage of updating both $a_{i}$ and $b_{i}$ over updating only $a_{i}$ is less significant. This is partially explained by the fact that the condition number of $G_{\mathrm{W}}(\theta)$ grows too large when $\theta$ contains all of the $a_{i}$ and $b_{i}$. This phenomenon is also observed in our other experiments. Using an implicit or proximal scheme (which avoids directly solving the linear system involving $G_{\mathrm{W}}(\theta)$) might help with this difficulty; we leave this for future study.

5.3. Fokker-Planck Equation

We next consider Fokker-Planck equations. In general, there is no closed-form solution in either Eulerian or Lagrangian coordinates, except for special potentials $V$ (e.g. quadratic). We can still approximate the analytic transport map by noting that the optimal transport map of a point $z_{0}$ at time $t$ is given by

T(t,z_{0})=\mathfrak{F}_{t}^{-1}\big(\mathfrak{F}_{0}(z_{0})\big)\,,

(61)

where $\mathfrak{F}_{t}$ is the cumulative distribution function (CDF) of $p(t,x)$. $\mathfrak{F}_{0}$ has a closed-form expression when we choose our reference measure to be a standard Gaussian, but we still need to know $p(t,x)$. Therefore, to investigate the performance of our algorithm, we use a numerical solver for $p(t,x)$, with central differences in space, an implicit discretization in time, and vanishing boundary conditions.

Recall that we are essentially simulating the gradient flow of the parameter $\theta_{t}$ given by Eq. (11) and Eq. (13). To calculate the derivative of the energy functionals, we use $M=10^{6}$ particles sampled from a standard Gaussian distribution to approximate $\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[V(f(\theta,z))+\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]$. Approximating $\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}V(f(\theta,z))\Big]$ is straightforward and has been explained in detail in Section 5.2. On the other hand, some care is needed when approximating $\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]$, as explained in Section 4.2.1. Suppose that the $\{b_{k}\}_{k=1}^{2N}$ are all distinct, and that the first $N$ of them are ordered so that $b_{1}\leq b_{2}\leq\cdots\leq b_{N}$. Then, for $2\leq j\leq N$, we have

\begin{aligned}\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{j}}\log(D_{z}f(\theta,z))&=\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{j}}\log\left(\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(z)-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(z)\right)\\ &=p_{\mathrm{r}}(b_{j})\log\left(\frac{\sum_{i=1}^{j-1}a_{i}-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(b_{j})}{\sum_{i=1}^{j}a_{i}-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(b_{j})}\right)\,.\end{aligned}

(62)

And

\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{1}}\log(D_{z}f(\theta,z))=p_{\mathrm{r}}(b_{1})\log\left(\frac{-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(b_{1})}{a_{1}-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(b_{1})}\right)\,.

(63)

Similarly, if we assume that $b_{N+1}\geq b_{N+2}\geq\cdots\geq b_{2N}$ and let $N+2\leq j\leq 2N$ , we have

\begin{aligned}\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{j}}\log(D_{z}f(\theta,z))&=\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{j}}\log\left(\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(z)-\sum_{i=N+1}^{2N}a_{i}\mathbf{1}_{(-\infty,b_{i}]}(z)\right)\\ &=p_{\mathrm{r}}(b_{j})\log\left(\frac{\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(b_{j})-\sum_{i=N+1}^{j}a_{i}}{\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(b_{j})-\sum_{i=N+1}^{j-1}a_{i}}\right)\,.\end{aligned}

(64)

And

\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{N+1}}\log(D_{z}f(\theta,z))=p_{\mathrm{r}}(b_{N+1})\log\left(\frac{\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(b_{N+1})-a_{N+1}}{\sum_{i=1}^{N}a_{i}\mathbf{1}_{[b_{i},\infty)}(b_{N+1})}\right)\,.

(65)

Note that during implementation, we do not have to order the $b_{j}$ ’s in order to evaluate the above partial derivatives. Let $0<\delta\leq\frac{1}{2}\min_{i\neq j}|b_{i}-b_{j}|$ . Then by a straightforward calculation, we have

\mathbb{E}_{z\sim p_{\mathrm{r}}}\partial_{b_{j}}\log(D_{z}f(\theta,z))=\begin{cases}p_{\mathrm{r}}(b_{j})\log\left(\dfrac{D_{z}f(\theta,b_{j}-\delta)}{D_{z}f(\theta,b_{j})}\right)\,,&1\leq j\leq N\,,\\ p_{\mathrm{r}}(b_{j})\log\left(\dfrac{D_{z}f(\theta,b_{j})}{D_{z}f(\theta,b_{j}+\delta)}\right)\,,&N+1\leq j\leq 2N\,.\end{cases}

(66)

In our experiment, we set $\delta=\varepsilon/2$ where $\varepsilon$ is the small offset we introduced in Section 5.1 during initialization.
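The one-sided evaluation in Eq. (66) can be sanity-checked against a finite difference of the exact expectation. The sketch below is an illustration under stated assumptions: $D_{z}f$ is taken to be the piecewise-constant function written inside the logarithm in Eq. (62), the reference measure is a standard Gaussian, and the small parameter set (`a`, `b` with $N=2$) is hypothetical, chosen only so that $D_{z}f>0$ everywhere. Indices are zero-based in the code.

```python
import math
import numpy as np

def pdf(x):
    """Standard Gaussian density p_r."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dz_f(a, b, n, z):
    """D_z f(theta, z) as written inside the logarithm of Eq. (62)."""
    inc = np.sum(a[:n] * (z >= b[:n]))    # a_i 1_{[b_i, inf)}(z),  i <= N
    dec = np.sum(a[n:] * (z <= b[n:]))    # a_i 1_{(-inf, b_i]}(z), i >  N
    return float(inc - dec)

def expected_log_dz_f(a, b, n):
    """Exact E_{z ~ N(0,1)}[log D_z f]: D_z f is constant between sorted biases."""
    knots = np.sort(b)
    edges = np.concatenate(([-np.inf], knots, [np.inf]))
    mids = np.concatenate(([knots[0] - 1.0],
                           0.5 * (knots[:-1] + knots[1:]),
                           [knots[-1] + 1.0]))
    return sum((cdf(hi) - cdf(lo)) * math.log(dz_f(a, b, n, mid))
               for lo, hi, mid in zip(edges[:-1], edges[1:], mids))

def grad_b_formula(a, b, n, j, delta):
    """Right-hand side of Eq. (66); no ordering of the biases is required."""
    if j < n:
        ratio = dz_f(a, b, n, b[j] - delta) / dz_f(a, b, n, b[j])
    else:
        ratio = dz_f(a, b, n, b[j]) / dz_f(a, b, n, b[j] + delta)
    return pdf(b[j]) * math.log(ratio)

def grad_b_finite_difference(a, b, n, j, h=1e-5):
    """Central difference of the exact expectation with respect to b_j."""
    bp, bm = b.copy(), b.copy()
    bp[j] += h
    bm[j] -= h
    return (expected_log_dz_f(a, bp, n) - expected_log_dz_f(a, bm, n)) / (2.0 * h)

# Hypothetical parameters (N = 2), unordered on purpose.
n = 2
a = np.array([0.7, 0.4, -0.5, -0.6])
b = np.array([-1.0, 0.5, 1.5, -0.3])
delta = 0.5 * np.min(np.diff(np.sort(b)))   # half the smallest gap between biases

for j in range(2 * n):
    print(j, grad_b_formula(a, b, n, j, delta), grad_b_finite_difference(a, b, n, j))
```

Because $D_{z}f$ is evaluated at $b_{j}\pm\delta$ directly, the check works for unordered biases, which matches the remark preceding Eq. (66).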

5.3.1. Quadratic potential

As a first example for the Fokker-Planck equation, we use a quadratic potential $V(x)$ as a sanity check. This is one of the rare cases where the Fokker-Planck equation has a closed-form analytic solution. In Lagrangian coordinates, the particle trajectories satisfy the following SDE, commonly known as the Ornstein-Uhlenbeck process:

dX_{t}=-\gamma_{0}(X_{t}-\mu_{0})dt+\sigma_{0}dW_{t}\,.

(67)

The corresponding Fokker-Planck equation for the density $p(t,x)$ is given by

\frac{\partial p}{\partial t}=\gamma_{0}\frac{\partial}{\partial x}\big((x-\mu_{0})p\big)+D\frac{\partial^{2}p}{\partial x^{2}}\,,

(68)

where $D=\sigma_{0}^{2}/2$ . It can be shown that the solution to (68) is given by

p(t,x)=\sqrt{\frac{\gamma_{0}}{2\pi D(1-\mathrm{e}^{-2\gamma_{0}t})}}\int_{-\infty}^{\infty}\mathrm{exp}\Big(-\frac{\gamma_{0}}{2D}\frac{\big(x-\mu_{0}-(x^{\prime}-\mu_{0})\mathrm{e}^{-\gamma_{0}t}\big)^{2}}{1-\mathrm{e}^{-2\gamma_{0}t}}\Big)p_{0}(x^{\prime})\,\mathrm{d}x^{\prime}\,,

(69)

where $p_{0}(x)=p(0,x)$ is the initial distribution. In our experiment, $p_{0}(x)$ is a standard Gaussian. Then (69) implies that $p(t,x)$ is also Gaussian with mean $\mu_{0}(1-e^{-\gamma_{0}t})$ and variance $e^{-2\gamma_{0}t}+\frac{D(1-e^{-2\gamma_{0}t})}{\gamma_{0}}$ . Then the transport map is given by the optimal transport map between two Gaussians, which has a closed form expression. In this example, the transport map is

T(t,z)=\mu_{0}(1-e^{-\gamma_{0}t})+z\sqrt{e^{-2\gamma_{0}t}+D(1-e^{-2\gamma_{0}t})/\gamma_{0}}\,,

(70)

which is always an affine map in $z$, regardless of the choice of $\mu_{0}$, $\gamma_{0}$, and $D$. We use $M=10^{6}$ particles sampled from a standard Gaussian distribution to approximate $\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}V(f(\theta,z))+\nabla_{\theta}\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]$. We choose $dt=10^{-3}$ and run for 1000 steps. We use a neural network with $N=32$ and $B=4$, following the setup in Section 5.1. We consider the following two choices of parameters, corresponding to different dynamics.

• Moving and widening Gaussian. We choose $\gamma_{0}=1$, $\mu_{0}=30$, $\sigma_{0}=4$. Under this setting, the solution at time $t$ is a Gaussian distribution with mean $30(1-e^{-t})$ and variance $e^{-2t}+8(1-e^{-2t})$. This evolution is shown on the left panel of Fig. 5.

• Moving and shrinking Gaussian. We choose $\gamma_{0}=1$, $\mu_{0}=10$, $\sigma_{0}=0.01$. Under this setting, the solution at time $t$ is a Gaussian distribution with mean $10(1-e^{-t})$ and variance $e^{-2t}+5\times 10^{-5}(1-e^{-2t})$. This evolution is shown on the right panel of Fig. 5.

As shown in Fig. 5, the computed density closely follows the analytic density of the Fokker-Planck equation from $t=0$ to $t=1$.
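The closed-form quantities used in this example are easy to tabulate. The following minimal sketch evaluates the Gaussian mean and variance of the Ornstein-Uhlenbeck solution for a standard Gaussian initial condition, together with the transport map of Eq. (70), for the two parameter choices above; the function names are illustrative only.

```python
import math

def ou_moments(t, gamma0, mu0, sigma0):
    """Mean and variance of p(t, .) when p(0, .) is a standard Gaussian."""
    D = 0.5 * sigma0**2
    decay = math.exp(-gamma0 * t)
    mean = mu0 * (1.0 - decay)
    var = decay**2 + D * (1.0 - decay**2) / gamma0
    return mean, var

def transport_map(t, z, gamma0, mu0, sigma0):
    """Affine transport map of Eq. (70) from N(0, 1) to p(t, .)."""
    mean, var = ou_moments(t, gamma0, mu0, sigma0)
    return mean + z * math.sqrt(var)

cases = {
    "moving and widening": dict(gamma0=1.0, mu0=30.0, sigma0=4.0),
    "moving and shrinking": dict(gamma0=1.0, mu0=10.0, sigma0=0.01),
}
for name, params in cases.items():
    for t in (0.0, 0.5, 1.0):
        mean, var = ou_moments(t, **params)
        print(f"{name:22s} t={t:.1f}  mean={mean:.4f}  var={var:.6f}")
```

Since the map is affine in $z$, a network that represents affine functions exactly on the sampled range should incur only time-discretization error in this test.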

5.3.2. Quartic potential

We consider $V(x)=(x-1)^{4}/4-(x-1)^{2}/2$. We choose $dt=2\times 10^{-4}$ and run for 1000 steps. We compare our numerical results with the transport map computed from Eq. (61). The results are shown in Fig. 6. In Fig. 5(a), we observe a clear decrease in error as the number of neurons increases. In Fig. 5(b), we compare our computed Lagrangian map $f(\theta,z)$ with the transport map computed from Eq. (61) using a numerical solver. The evolution of the density from $t=0$ to $t=0.2$ is shown in Fig. 5(c).
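A reference map of the kind described around Eq. (61) can be evaluated from a density on a grid by composing the reference CDF with the inverse CDF of $p(t,\cdot)$, which is the monotone rearrangement in one dimension. The sketch below is illustrative only: the trapezoidal CDF and the linear interpolation used for the inverse are assumptions made here, and a Gaussian target stands in for the output of a numerical solver so that the result can be checked.

```python
import math
import numpy as np

def reference_cdf(z):
    """CDF of the standard Gaussian reference measure."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def transport_map_from_density(x, p, z):
    """Monotone map sending N(0, 1) to the density p sampled on the grid x."""
    # Trapezoidal cumulative integral of p, normalized to end at 1.
    increments = 0.5 * (p[1:] + p[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(increments)))
    cdf /= cdf[-1]
    # Inverse CDF by linear interpolation of the pairs (cdf, x).
    return np.interp(reference_cdf(z), cdf, x)

# Stand-in target: N(2, 0.5^2), for which the exact map is z -> 2 + 0.5 z.
x = np.linspace(-2.0, 6.0, 4001)
p = np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2) / (0.5 * math.sqrt(2.0 * math.pi))
z = np.linspace(-2.0, 2.0, 41)
T = transport_map_from_density(x, p, z)
print("max error vs exact affine map:", np.max(np.abs(T - (2.0 + 0.5 * z))))
```

The interpolation is only reliable where the CDF is strictly increasing on the grid, so query points deep in the tails should be treated with care.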

5.3.3. Sixth order polynomial potential

We consider $V(x)=(x-4)^{6}/6$. We choose $dt=10^{-6}$ and run for 1000 steps. We compare our numerical results with the transport map computed from Eq. (61). The results are shown in Fig. 7. We observe behavior similar to the case of the linear transport PDE: the error becomes smaller as $N$ increases. Moreover, comparing the dashed and solid lines in Fig. 6(a), we see that the errors are smaller when the initial mesh points (i.e. the $b_{i}$'s) are concentrated nearer the center of our reference measure. In Fig. 6(b) we compare the Lagrangian maps computed by our method and by the numerical solver. The evolution of the density from $t=0$ to $t=10^{-3}$ is shown in Fig. 6(c).

5.4. Porous Medium Equation

We consider Example 6 with the functional $U(p(x))=\frac{1}{m-1}p(x)^{m}$ , $m>1$ . This choice of $U$ yields the porous medium equation

\partial_{t}p(t,x)=\Delta p(t,x)^{m}\,.

(71)

It is known that Eq. (71) admits solutions given by the Barenblatt profile

p(t,x)=(t_{0}+t)^{-\alpha}\Big(C-\beta\|x\|^{2}(t_{0}+t)^{-2\alpha/d}\Big)^{\frac{1}{m-1}}_{+}\,,

(72)

where $x\in\mathbb{R}^{d}$, $\alpha=\frac{d}{d(m-1)+2}$, $\beta=\frac{(m-1)\alpha}{2dm}$, $t_{0}>0$, and $C$ is a normalization constant chosen so that Eq. (72) integrates to 1 for all $t\geq 0$. In this example, we consider the one-dimensional case ($d=1$) with $m=2$. Then $\alpha=\frac{1}{3}$, $\beta=\frac{1}{12}$ and $C=\frac{3^{1/3}}{4}$. Eq. (72) also shows that the support of the reference measure $p_{0}(x)=p(0,x)$ is the bounded interval $[-3^{2/3}t_{0}^{1/3},3^{2/3}t_{0}^{1/3}]$, which helps us initialize the biases. More precisely, we can initialize our $b_{i}$'s following Section 5.1 with $B=3^{2/3}t_{0}^{1/3}$. In our experiment, we set $t_{0}=1$. We use $dt=10^{-3}$ and run for 1000 steps. We use $M=10^{6}$ particles sampled from $p(0,x)$ defined in Eq. (72) using importance sampling to approximate $\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\nabla_{\theta}\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]$, where $\hat{U}(p)=p$. As in the case of the Fokker-Planck equation, special care is needed when evaluating $\partial_{b_{i}}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]$. An analysis similar to that of Section 5.3 gives

\partial_{b_{i}}\mathbb{E}_{z\sim p_{\mathrm{r}}}\Big[\hat{U}\Big(\frac{p_{\mathrm{r}}(z)}{D_{z}f(\theta,z)}\Big)\Big]=\begin{cases}\dfrac{p_{\mathrm{r}}(b_{i})^{2}}{D_{z}f(\theta,b_{i}-\delta)}-\dfrac{p_{\mathrm{r}}(b_{i})^{2}}{D_{z}f(\theta,b_{i})}\,,&1\leq i\leq N\,,\\ \dfrac{p_{\mathrm{r}}(b_{i})^{2}}{D_{z}f(\theta,b_{i})}-\dfrac{p_{\mathrm{r}}(b_{i})^{2}}{D_{z}f(\theta,b_{i}+\delta)}\,,&N+1\leq i\leq 2N\,.\end{cases}

(73)

Our results are shown in Fig. 8. In Fig. 7(b) and 7(c) we use $N=32$ and update both the biases and the weights.
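The constants quoted for the Barenblatt profile can be verified numerically. The short sketch below evaluates Eq. (72) for $d=1$, $m=2$ with the stated $\alpha$, $\beta$, and $C$, and checks the normalization and the support radius; the function names and the integration grid are illustrative choices.

```python
import numpy as np

ALPHA = 1.0 / 3.0
BETA = 1.0 / 12.0
C = 3.0 ** (1.0 / 3.0) / 4.0

def barenblatt_1d(t, x, t0=1.0):
    """Barenblatt profile of Eq. (72) for d = 1 and m = 2."""
    s = t0 + t
    return s ** (-ALPHA) * np.maximum(C - BETA * x**2 * s ** (-2.0 * ALPHA), 0.0)

def support_radius(t, t0=1.0):
    """Half-width of the support, 3^{2/3} (t0 + t)^{1/3}."""
    return 3.0 ** (2.0 / 3.0) * (t0 + t) ** (1.0 / 3.0)

def total_mass(t, t0=1.0, n=8001):
    """Trapezoidal integral of the profile over a grid covering the support."""
    x = np.linspace(-4.0, 4.0, n)
    p = barenblatt_1d(t, x, t0)
    return np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))

for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  mass={total_mass(t):.6f}  support radius={support_radius(t):.4f}")
```

At $t=0$ and $t_{0}=1$ the support radius equals $3^{2/3}$, which is the value of $B$ used to initialize the biases.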

5.5. Keller-Segel equation

We consider the one-dimensional modified Keller-Segel equation, which combines the interaction energy in Example 5 with the internal energy in Example 6:

\partial_{t}p(t,x)=\nabla\cdot\big{(}p(t,x)\nabla(U^{\prime}(p)+W*p)\big{)},

(74)

where $U(p)=p\log p$, $W(x)=2\chi\log|x|$, and $\chi>0$ is a constant. The second moment of $p(t,x)$ has the analytic form

\mathbb{E}_{z\sim p(t,\cdot)}[z^{2}]=\mathbb{E}_{z\sim p(0,\cdot)}[z^{2}]+2(1-\chi)t\,.

(75)

It is clear from Eq. (75) that $\chi=1$ is a critical value. When $\chi>1$, the second moment decreases in time and the solution blows up in finite time. We therefore consider two cases, one on each side of the critical value: $\chi=1.5$ and $\chi=0.5$. We present our results in Figs. 9, 10, and 11. We use 2000 particles with a standard Gaussian initial distribution. We set $dt=3\times 10^{-4}$ and run for 1000 steps. The interaction term $W*p$ is evaluated using the 2000 particles with self-interaction excluded. We use a neural network with $N=32$ and $B=4$, following the setup and initialization in Section 5.1. We update both the bias and weight terms in our experiment.
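The particle evaluation of the interaction term and the role of the critical value $\chi=1$ can be illustrated as follows. This is a sketch under assumptions: the drift is taken to be $(W'*p)(x_{i})\approx\frac{1}{n}\sum_{j\neq i}2\chi/(x_{i}-x_{j})$ (the $1/n$ normalization is an assumption), and the rate of change of the second moment is computed from Eq. (74) as $2-2\,\mathbb{E}[z\,(W'*p)(z)]$. The function names are hypothetical.

```python
import numpy as np

def grad_interaction(x, chi):
    """Particle estimate of (W' * p)(x_i) for W(x) = 2 chi log|x|, self term excluded."""
    diff = x[:, None] - x[None, :]
    np.fill_diagonal(diff, np.inf)          # 2 chi / inf = 0 removes the j = i term
    return (2.0 * chi / diff).sum(axis=1) / x.size

def second_moment_rate(x, chi):
    """Rate of change of E[z^2] implied by Eq. (74) for the particle system."""
    return 2.0 - 2.0 * np.mean(x * grad_interaction(x, chi))

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)               # standard Gaussian initial particles
n = x.size
for chi in (0.5, 1.5):
    print(f"chi={chi}: particle rate={second_moment_rate(x, chi):.6f}, "
          f"continuum rate 2(1-chi)={2.0 * (1.0 - chi):.6f}")
```

By antisymmetry of $1/(x_{i}-x_{j})$, the particle rate equals $2-2\chi(n-1)/n$ for any configuration of distinct particles, so it is positive for $\chi=0.5$ and negative for $\chi=1.5$.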

6. Discussion

This paper analyzes neural network projected dynamics for one-dimensional Wasserstein gradient flows of general energy functionals. For two-layer neural network functions with ReLU activations, we analyze the convergence and stability of the proposed numerical schemes with respect to the location parameter $b$ and the scale parameter $a$. In numerical experiments, we demonstrate the second-order accuracy in the spatial domain discussed in the numerical analysis.

In future work, we shall study neural projected dynamics as a computational framework for building machine learning numerical schemes with theoretical guarantees. Various topics in this direction remain to be studied. First, we shall design neural network approximations for initial value problems of high-dimensional PDEs, which traditional PDE solvers cannot solve efficiently due to the curse of dimensionality. In particular, how can we understand the numerical accuracy of deep neural network functions in high dimensions when approximating PDEs? Next, we shall generalize the neural projected dynamics to dynamical systems for conservative-dissipative equations in statistical physics. These equations include Hamiltonian structures induced by the conservative system and the related mean-field control problems. Furthermore, considering the close relationship between the Wasserstein density manifold and sampling algorithms, we shall investigate sampling using the projected dynamics on neural parameter spaces and study their theoretical and statistical properties. We shall also consider time-implicit (proximal-type) computations of the proposed algorithm [23, 19], which could improve the performance and stability of the scheme.

References

[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
[2] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
[3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
[5] Andrew J Barlow, Pierre-Henri Maire, William J Rider, Robert N Rieben, and Mikhail J Shashkov. Arbitrary Lagrangian–Eulerian methods for modeling high-speed compressible multimaterial flows. Journal of Computational Physics, 322:603–665, 2016.
[6] Peter Benner, Serkan Gugercin, and Karen Willcox. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM review, 57(4):483–531, 2015.
[7] Joan Bruna, Benjamin Peherstorfer, and Eric Vanden-Eijnden. Neural Galerkin schemes with active learning for high-dimensional evolution equations. Journal of Computational Physics, 496:112588, 2024.
[8] Jose A Carrillo, Daniel Matthes, and Marie-Therese Wolfram. Lagrangian schemes for Wasserstein gradient flows. Handbook of Numerical Analysis, 22:271–311, 2021.
[9] JS Chang and G Cooper. A practical difference scheme for Fokker–Planck equations. Journal of Computational Physics, 6(1):1–16, 1970.
[10] Yifan Chen and Wuchen Li. Optimal transport natural gradient for statistical manifolds with continuous sample space. Information Geometry, 3(1):1–32, 2020.
[11] Casey Chu, Kentaro Minami, and Kenji Fukumizu. The equivalence between Stein variational gradient descent and black-box variational inference. arXiv preprint arXiv:2004.01822, 2020.
[12] Yifan Du and Tamer A Zaki. Evolutional deep neural network. Physical Review E, 104(4):045303, 2021.
[13] Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. arXiv preprint arXiv:2112.02424, 2021.
[14] Nathan Gaby, Xiaojing Ye, and Haomin Zhou. Neural control of parametric solutions for high-dimensional evolution PDEs. arXiv preprint arXiv:2302.00045, 2023.
[15] Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, June 2020.
[16] Ziqing Hu, Chun Liu, Yiwei Wang, and Zhiliang Xu. Energetic variational neural network discretizations to gradient flows. arXiv preprint arXiv:2206.07303, 2022.
[17] Weizhang Huang and Robert D Russell. Adaptive moving mesh methods, volume 174. Springer Science & Business Media, 2010.
[18] Wonjun Lee, Li Wang, and Wuchen Li. Deep JKO: time-implicit particle methods for general nonlinear gradient flows. arXiv preprint arXiv:2311.06700, 2023.
[19] Wuchen Li, Alex Tong Lin, and Guido Montúfar. Affine natural proximal learning. In Geometric Science of Information: 4th International Conference, GSI 2019, Toulouse, France, August 27–29, 2019, Proceedings 4, pages 705–714. Springer, 2019.
[20] Wuchen Li and Guido Montúfar. Natural gradient via optimal transport. Information Geometry, 1:181–214, 2018.
[21] Wuchen Li and Jiaxi Zhao. Scaling limits of the Wasserstein information matrix on Gaussian mixture models. arXiv preprint arXiv:2309.12997, 2023.
[22] Wuchen Li and Jiaxi Zhao. Wasserstein information matrix. Information Geometry, pages 1–53, 2023.
[23] Alex Tong Lin, Wuchen Li, Stanley Osher, and Guido Montúfar. Wasserstein proximal of GANs. In International Conference on Geometric Science of Information, pages 524–533. Springer, 2021.
[24] Chun Liu and Yiwei Wang. On Lagrangian schemes for porous medium type generalized diffusion equations: A discrete energetic variational approach. Journal of Computational Physics, 417:109566, 2020.
[25] Shu Liu, Wuchen Li, Hongyuan Zha, and Haomin Zhou. Neural parametric Fokker–Planck equation. SIAM Journal on Numerical Analysis, 60(3):1385–1449, 2022.
[26] Pierre-Henri Maire, Rémi Abgrall, Jérôme Breil, and Jean Ovadia. A cell-centered Lagrangian scheme for two-dimensional compressible flow problems. SIAM Journal on Scientific Computing, 29(4):1781–1824, 2007.
[27] Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243–15256, 2021.
[28] Kirill Neklyudov, Rob Brekelmans, Alexander Tong, Lazar Atanackovic, Qiang Liu, and Alireza Makhzani. A computational framework for solving Wasserstein Lagrangian flows. arXiv preprint arXiv:2310.10649, 2023.
[29] Levon Nurbekyan, Wanzhou Lei, and Yunan Yang. Efficient natural gradient descent methods for large-scale PDE-based optimization problems. SIAM Journal on Scientific Computing, 45(4):A1621–A1655, 2023.
[30] N Nüsken. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24:1–39, 2023.
[31] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.
[32] Lars Onsager. Reciprocal relations in irreversible processes. I. Phys. Rev., 37:405–426, Feb 1931.
[33] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
[34] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
[35] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[36] Tao Tang. Moving mesh methods for computational fluid dynamics. Contemporary mathematics, 383(8):141–173, 2005.
[37] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
[38] Hao Wu, Shu Liu, Xiaojing Ye, and Haomin Zhou. Parameterized Wasserstein Hamiltonian flow. arXiv preprint arXiv:2306.00191, 2023.