Numerical analysis on Neural network projected schemes for approximating one dimensional Wasserstein Gradient flows
Abstract.
We provide a numerical analysis and computation of neural network projected schemes for approximating one dimensional Wasserstein gradient flows. We approximate the Lagrangian mapping functions of gradient flows by the class of two-layer neural network functions with ReLU (rectified linear unit) activation functions. The numerical scheme is based on a projected gradient method, namely the Wasserstein natural gradient, where the projection is constructed from the mapping spaces onto the neural network parameterized mapping space. We establish theoretical guarantees for the performance of the neural projected dynamics. We derive a closed-form update for the scheme with well-posedness and explicit consistency guarantee for a particular choice of network structure. General truncation error analysis is also established on the basis of the projective nature of the dynamics. Numerical examples, including gradient drift Fokker-Planck equations, porous medium equations, and Keller-Segel models, verify the accuracy and effectiveness of the proposed neural projected algorithm.
Key words and phrases:
Optimal transport; Information Geometry; Natural gradient; Neural network functions; Convergence analysis.1. Introduction
Simulating gradient flows of free energies is a central problem in the computational physics of complex systems [8] and data science [1, 2]. In physics, gradient flows often arise from first-order principles, such as the Onsager principle [32]. The Onsager gradient flows are widely used in phase fields, chemistry, and biology modeling. In recent years, a particular type of Onsager gradient flow, known as Wasserstein gradient flow, has been widely studied in optimal transport communities [3, 33, 37]. It studies an infinite-dimensional pseudo-Riemannian metric in the probability distribution space known as the density manifold. The gradient flow in the Wasserstein space naturally captures the free energy dissipation properties. Depending on the choices of free energies, the Wasserstein gradient flow contains a vast class of differential equations, such as gradient drift Fokker-Planck equations, porous medium equations, and Keller-Segel models. These models are widely used in population dynamics and sampling-related optimization problems.
In recent years, machine learning has brought a class of new methods in computational physics, where free energies are identified with the loss functions [11, 30]. Meanwhile, computing Wasserstein gradient flows of loss functions in terms of samples also finds their various applications, such as generative artificial intelligence [4] and transport map-based sampling methods [35]. In these applications, one often relies on the Lagrangian mapping functions to describe the Wasserstein gradient flows and deep neural networks to approximate the mapping functions due to their high expressivity and adaptivity from the compositional structure. While empirical successes of this framework have been observed in various applications [35, 4], very few theoretical results exist to explain the underlying mechanism.
Moreover, projected dynamics in neural network space are widely used to approximate Wasserstein gradient flows [14, 25]. These dynamics restrict the space of probabilities onto a finite-dimensional subspace parameterized by neural network mapping functions. For this reason, we call it the neural projected gradient dynamics. This approach originates from the natural gradient method in information geometry [1] and extends the framework set by [20]. Some basic questions about its accuracy and efficiency remain: Even in one-dimensional space, how well do the neural projected dynamics approximate the Wasserstein gradient flow? What is the accuracy of the neural network approximation in Lagrangian mapping functions?
In this paper, we study the numerical analysis and computational neural network projected schemes for one-dimensional Wasserstein gradient flows. The main result is sketched below. By formulating gradient flows in Lagrangian coordinates, the proposed numerical scheme takes the form of a ‘preconditioned’ gradient descent, where the preconditioner is the metric tensor of the statistical manifold of the parameter space. Theoretically, we first provide the derivation of the analytic solution for the inverse neural mapping metric. It is based on a special class of the ReLU network in theorem 2. We use the analytic form of the projected gradient flow formula to prove the consistency of the numerical scheme. Then, we prove in theorem 3 that the numerical schemes derived from the neural projected dynamics are of first or second-order consistency for the general Wasserstein gradient directions. These include cases of the heat flow and the Fokker-Planck equation. Furthermore, viewing our neural network model as a moving mesh method, we show in proposition 7 that the mesh will not degenerate during the simulation.
In numerics, the advantages of the proposed method are twofold. First, using a two-layer neural network as our basis function, the proposed method can be regarded as a ‘moving-mesh’ method in Lagrangian coordinate, which demonstrates very promising performance even when the number of parameters of the neural network is very limited. In particular, our numerical examples can achieve an accuracy of with less than 100 neurons. Second, using the Wasserstein gradient flow formulation, the proposed method is very easy to implement since it can make use of the automatic differentiation feature from popular machine learning libraries such as PyTorch.
Nowadays, the computation of Wasserstein gradient flows (WGFs) has attracted great interests from researchers in various communities such as mathematics, physics, statistics, and machine learning. Classical numerical methods [9] have been introduced to directly evaluate the probability density function. Recently, algorithms that approximate the Lagrangian mapping functions associated with WGFs have been invented. We refer the readers to [8] and references therein for related discussions. These treatments automatically preserve non-negativity and total mass. Together with the fast-developing deep learning techniques, they inspire a series of research on composing scalable, sampling-friendly computational methods for WGFs in higher-dimensional spaces [25, 27, 13, 16, 18]. Recently, deep learning-based algorithms for computing the Lagrangian coordinates of the Wasserstein Hamiltonian flows, or more generally mean field control problems, have also been introduced in [38, 28, 34].
Our treatment of projecting the WGFs onto the parameter space is also known as the natural gradient method, which are first introduced in [1] (w.r.t. Fisher-Rao metric) and [10] (w.r.t. Wasserstein metric). Here the projected matrix is often named information matrix, namely Fisher information matrix and Wasserstein information matrix, depending on the usage of metrics in probability space. This method recently finds its application in large-scale optimization problems [29]. In recent research [12, 7, 14], the authors aim to calculate general evolution equations by directly leveraging the neural network representation of the time-dependent solution. They endow the evolution of the equation in the functional space into the parameter space of the neural network to obtain a finite-dimensional ordinary differential equation, which can be readily integrated via the Runge-Kutta solvers. Numerical properties of the ReLU neural network families have been investigated in [15].
Compared to previous studies, we study the numerical analysis of neural network projected dynamics for approximating WGFs. In one-dimensional space, we provide the error analysis for the neural projected dynamics with a two-layer neural network. We numerically verify the proposed error analysis. In particular, we formulate a class of explicit schemes from the neural network projected dynamics. This study continues the study of the Wasserstein information matrix on neural network models; see related discussions in [22, 21, 25].
The paper is organized as follows. In Section 2, we briefly review the formulation of Wasserstein gradient flows of free energies in both Eulerian and Lagrangian coordinates. We formulate the projected Wasserstein gradient flows over neural network models in Section 3. In Section 4, we conduct the numerical analysis of the proposed neural projected dynamics in two-layer neural network functions. In Section 5, we verify the accuracy of the proposed algorithm with numerical examples in Fokker-Planck equations, porous medium equations, and Keller-Segel models.
2. Review of Wasserstein gradient flows and Lagrangian coordinates
In this section, we prepare the theoretical foundations of Wasserstein gradient flows with a focus on Lagrangian description (diffeomorphism mapping functions) and the associated microscopic particle dynamics. See details in [3, 37].
2.1. Wasserstein gradient flows
Suppose is a domain in the Euclidean space . Denote the probability space
Given an energy functional , we consider the following evolution equation associated with ,
(1) |
with Neumann boundary condition where is the outward pointing vector on boundary . is the first variation operator w.r.t. density variable . The mass of is conserved and always equals . An important fact about (1) is that this equation can be treated as the gradient flow of on . To be more specific, by endowing the probability space with the Wasserstein metric , we can view as a Riemannian manifold, and (1) is the gradient flow on such manifold with respect to .
Let us briefly review several facts. We first define the metric at arbitrary , which is identified via the continuity equation (that is, tangent vectors) whose driving vector field belongs to the closure of all gradient fields with in -norm. Consider a smooth curve () passing through at on . Suppose the probability evolution is driven by the gradient field at , i.e., solves
We define the Wasserstein metric at as a symmetric, positive-definite bilinear form,
Recall the definition of the gradient of a smooth function on a Riemannian manifold as
for any smooth curves passing through at . Switching back to our case, for the functional defined on , we define the gradient of w.r.t. Wasserstein metric at as
Here is arbitrary curve on with . Suppose is guided by the gradient field at time Then the right-hand side can be computed as
Recall the definition of the metric , it is not difficult to verify that the gradient field associated with is . Thus,
and the Wasserstein gradient flow can be formulated as equation (1).
We provide several examples of WGFs. In these examples, we assume .
-
•
(Fokker-Planck equation) Consider
Then the Wasserstein gradient of equals
The corresponding WGF is the Fokker-Planck equation
(2) -
•
(Porous medium equation) Consider
One computes
Thus, the corresponding WGF yields the porous medium equation
(3) -
•
(Keller-Segel equation) Another well-known WGF is by choosing as the sum of the internal energy and the interaction energy
where is a certain smooth function defined on , and is a kernel function.
We calculate
where we denote the convolution . The WGF associated with this functional is the Keller-Segel equation
(4)
2.2. Lagrangian coordinates & Particle dynamics
Consider a mapping function . Here is an input space, is the domain on which WGF is defined. To alleviate our discussion, we assume . Let us further assume , and the Jacobian matrix is non-singular for all , i.e., on . This also guarantees that is injective. Given a smooth reference probability density , we denote the pushforwarded probability density of by as
where is the pushforward operator defined as
The density function of satisfies
(5) |
Such pushforward map used for constructing probability distribution is usually called the Lagrangian coordinate. We now imitate the derivation of the WGF to help formulate its counterpart under the Lagrangian coordinate.
We denote as the space of smooth, integrable pushforward maps with non-zeros Jacobian, i.e.,
Then the pushforward operation introduces a submersion from the space of pushforward maps (diffeomorphisms) to the space of probability densities.
In order to derive the Wasserstein gradient flows (WGFs) on the space of pushforward maps instead of the probability space , we first build up certain metric on that corresponds to the Wasserstein metric . As illustrated in [33], is obtained by pulling back the norm on via submersion . Thus, a way of choosing the metric is
Now for any smooth functional , the composition defines its corresponding functional on . Follow similar arguments presented in 2.1, we compute the gradient of with respect to the metric as
Here, is the ( denotes the Lebesgue measure) first variational w.r.t. the pushforward map .
Thus, the gradient flow of on is formulated as
The variation is calculated as
The above equation can also be written as
(6) |
If we denote , one can verify that exactly solves equation (1) for WGF with , which justifies the equivalence between the gradient flow (6) in Lagrangian coordinates (i.e., the map ) and the WGF (1) expressed by using Eulerian coordinate (i.e., the density function ).
Such gradient flow (6) on the space of diffeomorphisms also forms a microscopic picture of particle dynamics of the WGF (1). For any random reference sample , by setting , it is not hard to verify that evolves w.r.t. the dynamic
(7) |
Here we denote . can be equivalently treated as the probability density of the random particle . In this dynamic, the movement of a single agent is determined by the instant population density evaluated at . Such an approach offers a microscopic and deterministic interpretation of various diffusive processes possessing WGF structures.
The aforementioned examples of WGF can be formulated as the gradient flows under Lagrangian coordinates (6) as well as the particle dynamics (7). We summarize this in the following Table 1. We assume as the initial condition for (6), and as the initial distribution of the random particle in (7). We denote in equation (6). Accordingly, we denote as the probability density of the stochastic particle in the dynamic (7).
3. Neural projected Wassersetin gradient flows and their algorithms
As discussed in Section 2, instead of the direct evaluation of the density function of the Wasserstein gradient flow, it suffices to compute the time-dependent Lagrangian mapping . In this research, we approximate via neural networks parametrized by time-dependent parameter . The evolution of is obtained by projecting the gradient flow (6) onto the parameter space . In this section, we briefly review the basic definitions of neural network mapping functions. We next study a metric space for neural mapping functions and formulate several neural mapping dynamics for .
3.1. Neural network activation functions
We first provide the definition of a neural network mapping function. Consider a mapping function
where is the latent space, is the sample space and is the parameter space. In this paper, we consider the following network structure
where , . Here is the number of hidden units (neurons). is the weight of unit . is an offset (location variable). is an activation function, which satisfies , . From now on, we assume that is invertible, monotone, and is continuous w.r.t. both and variables.
For example, let , and . Define a two layer neural network by
{tikzpicture}The following neural network mapping functions have been widely used.
Example 1 (Linear).
Denote . Consider
Example 2 (ReLU).
Denote . Consider
Example 3 (Sigmoid).
Denote . Consider
3.2. Neural mapping models and energies
In this subsection, we consider the following probability density functions generated by neural network mapping functions. We call them the neural mapping models.
Definition 1 (Neural mapping models).
Let us define a fixed input reference probability density . Denote a probability density generated by a neural network mapping function by the pushforward operator:
In other words, satisfies the following Monge-Ampère equation by
(8) |
where is the Jacobian of the mapping function w.r.t. variable .
Definition 2 (Neural mapping energies).
Given an energy functional , we can construct a neural mapping energy by
Many applications in machine learning and scientific computing can be cast into the following optimization problem
Here, often measures the closeness between the neural mapping model and the target or data density distribution. Several concrete examples of neural mapping energies are given below. For simplicity of presentation, we often write the integration operator w.r.t. density over domain by the expectation operator . Later in Section 3.5, we provide several examples of the energy functional including the potential, the interaction (E.g. maximum mean discrepancy ) and the internal (information entropy/divergence) functionals. They are commonly used in machine learning and optimal transport communities; see details in [3, Section 9].
To summarize, the neural mapping energies are functionals written in terms of the mapping functions . This allows us to perform optimization on the finite dimensional space instead of the infinite dimensional space .
3.3. Neural mapping metric space
We next consider a mapping space parameterized by a neural mapping function . We can measure the difference between two neural mapping functions by the distance thanks to the following definition.
Definition 3 (Neural mapping distance).
Define a distance function as
where , are two sets of neural network parameters and is the Euclidean norm in .
In the above definition, represents a distance function for two given neural mapping functions and . In fact, the distance between neural mapping functions induces a metric on neural network parameters. Similar Riemannian geometry for feed-forward neural networks is also studied in [31].
We next consider the Taylor expansion of the distance function. Let ,
Here is a Gram-type matrix function. We summarize its definition below.
Definition 4 (Neural mapping metric).
Define a matrix function . Denote , such that
We also write
where we denote .
From now on, we call the neural mapping metric space. Here we always assume that is a positive definite matrix in .
3.4. Neural mapping dynamics
In this subsection, we derive some analogies of Wasserstein gradient flows in the neural mapping metric space . Shortly, we apply them to define the neural mapping dynamics and compare them with their counterparts in mapping metric space and Wasserstein metric probability space. From now on, we assume that is smooth w.r.t. parameter . This is not true for the ReLU activation function, which will be studied in detail in later sections.
The next proposition provides gradient operators of a function in the neural mapping metric space .
Proposition 1 (Neural mapping gradient operators).
The gradient operator of in , , is given by
Proof.
We briefly derive the gradient operator of in below. Suppose is a smooth curve passing through the point . Consider a Taylor expansion of at by
(9) |
where we denote . Comparing linear terms of in (9), we have
for any . Thus
∎
We are ready to present the neural mapping gradient flow, which will be used for our first-order algorithm in neural mapping optimization problems.
Proposition 2 (Neural mapping gradient flows).
Consider an energy functional . Then the gradient flow of function in is given by
(10) |
In particular,
where is the –first variation w.r.t. variable , .
Proof.
As the neural mapping metric is given in definition 4, it suffices to calculate the formula for the Euclidean gradient as follows:
Here we denote . ∎
3.5. Neural projected Wasserstein flows
The dynamics in parameter space can be formulated in terms of mappings and probability densities. For simplicity of discussion, we demonstrate that the neural mapping gradient flow is a projected Wasserstein gradient flow. Here the projection is from the full mapping space into a neural parameterized mapping space. Concretely, we present the following reformulations of equation (10), which are in terms of mapping functions and probability density functions. The proof is based on the gradient flow equation in proposition 2 and the application of the chain rule.
Proposition 3 (Neural projected Wasserstein gradient flows).
Dynamic (10) in term of mapping functions leads to
We present several examples of neural mapping Wasserstein gradient flows from proposition 2.
Example 4 (Neural projected linear transport equation).
Consider a linear energy given by
In this case, the neural projected gradient flow satisfies
(11) |
In details,
Example 5 (Neural projected interaction transport equation).
Consider an interaction energy given by
In this case, the neural mapping gradient flow satisfies
(12) |
In details,
Example 6 (Neural projected negative entropy).
Consider a negative entropy functional given by
In this case, the neural mapping gradient flow satisfies
(13) |
where . This is because:
The choice and corresponds to the negative entropy. This belongs to the family of internal energy. In details,
Here we denote , for matrices , .
The above examples are projected Wasserstein gradient flows in neural mapping metric space. In particular, Examples 4, 5, 6 correspond to the following classical PDEs, respectively.
(14) | ||||
(15) | ||||
(16) |
The above dynamics include potential transport, interaction transport, and porous medium equations. The Fokker-Planck equation is a combination of the above first and third equations.
3.6. Algorithm
In this section, we discuss the implementations of gradient flows projected onto the parameter space. We apply the forward Euler discretization of the natural gradient flow (10). Let be the step size. Then the update is given by
(17) |
where , are empirical estimates of the matrix and the gradient , respectively. In details, if , where is the number of empirical samples, then
In practice, the condition number of could be very large and it is more stable to use instead the pseudoinverse of in (17). Therefore, the update is
When the reference measure is a one-dimensional standard Gaussian distribution, can be explicitly computed for our choice of neural network. In this case, we have
We summarize the above explicitly update formulas below.
4. Numerical analysis on neural network projected gradient flows
In this section, we establish theoretical guarantees for the performance of the neural projected dynamics. We start by deriving an analytic formula for the inverse of the neural mapping metric of a special ReLU family in section 4.1. Based on the closed-form projected dynamics equations, we can establish the truncated error analysis for the projected dynamics in section 4.2. The analysis of truncated error for general dynamics is presented in section 4.3.
4.1. Analytic formula for the inverse of neural mapping metric
In this section, we consider the following special case of the ReLU model in 1D. We first rewrite the neural network mapping function into the following form:
(18) |
In particular, we combine into one parameter in the 1D case. Under this reparameterization, s represent the slopes of each ReLU component and s are the intercepts. To make the last assumption on this ReLU network mapping function which facilitates the analytic formula of the neural mapping metric, we require all the slope parameters to stay non-negative, i.e. . Although this is an artificial assumption to enforce analyticity, it is natural in the sense that positive slope parameters induce monotone mapping function. Meanwhile, solutions of the Monge problems in 1D are known to be monotone. In fig. 1, we plot a typical ReLU mapping function.
We start with the analytic formula for the neural mapping metric, assuming the reference measure is given by with associated cumulative distribution function .
Proposition 4 (Neural mapping metric of two-layer ReLU network).
The neural mapping metric of the two-layer ReLU network with reference measure is given as
(19) | ||||
Proof.
We first calculate the derivative of the neural network map w.r.t. network parameters
(20) |
while the value at the singular point does not exist and can be omitted from the measure-theoretical perspective. According to definition 4, one can evaluate different blocks of the metric tensor as the following integral
∎
For general reference measure , the matrix elements of the relate to the first and second moments of the measure which may not have an analytic formula. Here, we consider a special neural mapping model with a Gaussian reference measure, thus rendering the metric with analytic elements.
Corollary 1.
With the same setting as proposition 4 and Gaussian reference measure, the matrix element of the neural mapping metric can be written analytically as
(21) | ||||
The other half of the elements can be obtained via switching .
Proof.
The proof is obtained by elementary integration calculation
(22) | ||||
∎
Now, we focus on the upper right corner of the neural mapping metric. We will establish an analytical formula for the inverse of this matrix.
Theorem 2 (Analytic inverse of the neural mapping metric).
The inverse matrix of the block in proposition 4 can be written analytically as
(23) |
Proof.
First, we decompose the neural mapping metric into the following matrix product
(24) |
where is a diagonal matrix. Then, it is direct to check that the middle matrix has the following tri-diagonal analytic inverse below:
(25) |
Multiplying this matrix with the inverse of the diagonal matrix on both sides concludes this proof. ∎
This analytic form of the inverse metric will be used intensively in the next subsection to prove the consistency of the numerical scheme based on the ReLU neural network.
4.2. Truncated error analysis of the neural projected Wasserstein gradient flows based on analytic formula
In this section, we perform the numerical analysis of the neural mapping projected Wasserstein flows introduced in section 3.5 based on the analytic formula in section 4.1. Because of the analytic inverse of the neural mapping metric, the right-hand side of the Wasserstein projected gradient flow can be calculated explicitly, and one can thus talk about its consistency and order of accuracy following the same spirit as classical numerical analysis. We perform this derivation for the Wasserstein projected gradient flows of the potential functional explicitly.
Let us first recall that the formula for neural projected Wasserstein gradient flow is given by
(26) |
We have the following analytic formula for the projected gradient flow in the ReLU network model that we introduced in section 4.1.
Proposition 5 (Wasserstein gradient flow of potential functionals in ReLU network).
The projected potential flow in the ReLU network model eq. 18 has the following form:
(27) | ||||
Using the trapezoid rule to calculate the integration gives the following spatial discretization, which can be used to simulate the projected gradient flow:
(28) |
Proof.
It suffices to calculate the gradient of the linear potential functional in this model. Let us start with the calculation of the functional form of the potential energy in the ReLU network mapping model as follows
(29) |
where we use the change of the integration variable above. Therefore, the gradient of this functional w.r.t. can be simplified to
(30) |
where we use to denote the characteristic function on the interval . Now, plugging this result into the projected gradient flow eq. 26 with the analytical formula for the inverse matrix in theorem 2, we obtain
(31) | ||||
Taking a close look at the terms inside the brackets, one finds that they are calculating the average value of inside the intervals weighted by the base distribution . Lastly, in order to complete the spatial discretization, one needs to choose a quadrature rule to calculate the integration in the above formula. One example is the trapezoid rule:
which provides the desired discretization. Special attention should be paid to the boundary node to obtain their corresponding evolution equation and discretization. ∎
Given this spatial discretization, we can analyze the order of consistency of it, which is treated in the following proposition.
Proposition 6 (Consistency of the projected gradient flow).
Assume potential functional satisfies . The spatial discretization eq. 28 is of first-order accuracy both in the mapping and the density coordinates.
Proof.
We prove this statement from two directions, i.e. consistency in the space of mapping distribution and consistency in the space of mapping function. We have
(32) | ||||
In the above derivation, we slightly cheat in the derivation so we can use the consistent formula for the evolution equations for all the nodes . It is easy to conclude that our discretization corresponds to the evolution of the mapping function of constant speed on each interval . Now, recall that in mapping space, the Wasserstein gradient flow of the potential function corresponds to the velocity field . Therefore, given that the length of each interval is of order , we conclude that our spatial discretization is first order consistent on the mapping space.
Next, we prove the statement for the mapping distribution. To do this, we need to derive the evolution equation for the mapping distribution according to eq. 28. We have for
(33) | ||||
A quick method to derive the formula of is to view it as a probability flow corresponds to the cotangent vector and then use the Wasserstein metric to calculate via a continuity equation as in proposition 2. Recall that the potential gradient flow in the density manifold is given by
(34) |
Comparing eq. 33 and eq. 34, it is not difficult to recognize that the first term in eq. 33 approximates the continuous counterpart in eq. 34 in the first order. The remaining parts correspond to each other: the approximation is first order not in the strong sense, but rather in the weak sense as there is Dirac measure in eq. 33. Combining the above two parts, we finish the proof. ∎
4.2.1. Projected dynamics of Negative entropy gradient flow
The potential functional can be viewed as a linear functional whose projected gradient flow has a rather simple expression. The corresponding formula has a more complex expression for general nonlinear internal energy, such as entropy. We begin with calculating the negative entropy functional of a neural mapping measure :
(35) | ||||
where we use the Monge-Ampère equation in one dimension. Moreover, notice that the last term corresponds to the entropy of the discrete part of distribution as the ReLU mapping function maps to and refers to the continuous part of a distribution. Similarly, the relative entropy functional is given by
Moreover, the gradient flow of the KL-divergence differs from that of negative entropy only by a term that appears in the derivation in the potential functional gradient flow. This This also manifests in calculus on the density manifold between the heat and Fokker-Planck equations. Now, one calculates the derivative of continuous parts w.r.t. parameter
The first derivation is based on the observation that the function is a step function which changes its value at . It takes value at interval . Hence the desired conclusion follows, where comes in since this is the expectation w.r.t. distribution . Therefore, the derivative of the entropy and relative entropy functional reads as follows
(36) | ||||
With all these preparations, we can write out the gradient flow equation of the entropy functional:
(37) | ||||
Similar to the proof in proposition 6, one can carefully expand the neural projected dynamics of the entropy functional and prove that it converges to the heat equation in the limit that number of neurons tends to infinity and the gap between neurons nodes tends to zero.
4.2.2. Analysis of the long-time existence of the neural-projected heat flow
In general, the projected Wasserstein gradient flow does not necessarily need to be a linear dynamics even though the original gradient flow is linear, e.g., the projected gradient flow corresponding to the heat equation is highly nonlinear. This poses great difficulties in analyzing and establishing the long-time existence of the projected dynamics, as mentioned in [25]. Specifically, we focus on the nonlinear projected gradient flow of the entropy, which corresponds to the Heat equation in the full space. If we view all nodes as grid points and view the scheme as an example of the moving mesh method [17], then the mesh quality is an important quantity to observe during simulation. One does not want the mesh quality to decrease too much and even become degenerate during the simulations. Therefore, we consider the well-posedness of the non-linear ODE eq. 37.
Proposition 7.
The neural projected dynamics eq. 37 of the heat flow is well-posed, e.g. the solution extends to arbitrary time.
Proof.
We consider a special scenario when two adjacent nodes become close to each other while maintaining a relatively large gap with all other nodes, i.e.
(38) |
WLOG, we assume and reduce the following term which appears both in their time derivative expression in eq. 37
(39) | ||||
where we use the simplified model where all the weights are set to and Taylor expansion to conclude that and follows the same spirit. This term appears with positive sign in the RHS of and negative sign in the RHS of , indicating that the left (right) node () will move fast towards left (right) respectively. This repulsion behavior guarantees that the Lagrangian coordinates will never collide with each other and the mesh degeneracy will not appear.
Next, we analyze our scheme using the time derivative of the Lagrangian coordinate. It is a well-known result that under the heat flow the mean of the distribution is fixed. Therefore, due to the diffusive nature of the heat equation, one can imagine that the position of the quantile greater than the mean should move right in the heat equation and vice versa. Suppose is a quantile with greater than the mean . As the base measure is a standard Gaussian distribution whose probability density function decreases over , we conclude that
(40) |
Consequently, the Lagrangian coordinate is indeed moving towards right, which matchs the intuition from the heat equation.
∎
The neural projected dynamics can be understood as a Lagrangian scheme [8, 24, 26] with neural network basis. Specifically, fixing basis as ReLU components in eq. 18, one can view ’s and ’s as the shape and location coefficients of the basis functions respectively. Updating ’s is similar to classical finite-element method with fixed basis functions, while adding the degree of freedom of ’s is similar to the moving mesh method. The Lagrangian schemes can handle the problem of the free boundary such as porous medium, e.g. in [24], they use finite element method to solve the mapping function of the porous medium equation with high accuracy. While most Lagrangian schemes are based on updating the ’s parameters, our methods have more flexibility and expressivity as it takes more degree of freedom into account. The primal-dual structure of the Wasserstein gradient flow also leverages a lot of usage of Lagrangian schemes [8].
On the other hand, our numerical algorithm and the moving mesh method. The principal ingredients of the moving mesh method include the equidistribution principle, the moving mesh equation, and the method of lines approach [36]. The moving mesh equation is solved during the simulation to ensure the adaptivity such that the mesh can resolve to the detailed structure. In many classical moving mesh methods, the mesh equations are solved separately from the governing PDE itself to guarantee the adaptivity of the numerical methods. This implies that how the mesh change will not depend explicitly on the underlying PDE. There also exist moving mesh methods such that the mesh updates take into account of the governing PDE (e.g., the arbitrary Lagrangian-Eularian methods [5]). From this perspective, the projected dynamics provide a PDE-specific moving mesh equation, i.e. the mesh moved according to the PDE dynamics to simulate which is more adaptive and efficient. Moreover, through a detailed study of the simple case, we can establish a theoretical guarantee on the quality of our moving mesh method in proposition 7.
4.3. Truncated error analysis for general neural projected Wasserstein gradient flow
The proof of the consistency of the numerical scheme relies on the analytic formula derived before which is restrictive. In this section, we provide another methodology to prove the consistency of the numerical scheme we derived in this paper. Instead of calculating the evolution of the mapping explicitly, we calculate the deviation of the projected gradient w.r.t. the original gradient direction. Let us first state a geometric proposition where we attempt to be as general as possible. This result is also proved in [25] and we prove it here for completeness.
Let be a manifold (possibly infinite-dimensional) with a Riemannian metric , which provides an inner product on the tangent space (possibly infinite-dimensional Hilbert space) for each . Let be its submanifold with induced metric denoted by , i.e. :
Furthermore, let be a functional defined over and we use for its restriction on . We have the following proposition.
Proposition 8.
Let denote the gradient of the functional w.r.t. the metric at . Then, we have
(41) |
where is the orthogonal projection operator from to .
Proof.
As is a submanfold of , we have inclusion map and restriction map for each . Both mappings are linear and are adjoint to each other. Therefore, viewing the metric tensor as a linear mapping between , we have
Moreover, the inner product on the Hilbert space induces an orthogonal decomposition:
along with an orthogonal projection operator . Now, recall that the Riemannian gradient is defined as
The differential of and is related by
Therefore, gathering all the ingredients, we have the following commutative diagram
As is the orthogonal projection, we conclude that
∎
We can prove the consistency of our numerical schemes over different PDEs with the Wasserstein gradient flow structures by leveraging this proposition in the case is the density manifold and is chosen to be the metric. To achieve this, we can rewrite eq. 41 as
(42) |
Therefore, will provide an upper bound for the truncated error of our approximation scheme. Moreover, if we assume that the submanifold is identical to a generative model via mapping function , i.e. with and the base measure. Then, the projected gradient direction can also be characterized using the metric over the mapping space, i.e.
(43) |
where is mapped to point and we abuse the notion of to denote the gradient vector in the mapping coordinate. Moreover, we can perform truncated error analysis directly over the mapping space, which is more convenient and clear. Let us focus on the ReLU network mapping eq. 18. The tangent space in the mapping coordinate is spanned by the vectors in eq. 20. Meanwhile, the tangent space in the density coordinate is spanned by
(44) |
where the notation . Here we use the fact that the mapping is linear with slope over the interval . If s are fixed, the projected dynamics belongs to projection-based model reduction [6] where the basis is fixed to be neurons. While changing s correspond to model reduction with adaptive basis.
Proposition 9.
The numerical scheme based on ReLU network mapping is consistent with order using both parameters and of order with either or parameters.
Proof.
In view of eq. 20, we have that the approximation using only is simply piece-wise constant approximation. As each ingredient has the shape of a Heaviside function, it is consistent with order . While the approximation using both and is a piece-wise linear approximation, thereby consistent of order . This is because another set of ReLU-shape functions is added to the basis. ∎
The connection between the ReLU neural network and the linear finite element space is systematically studied in [15]. They theoretically establish that at least two hidden layers are needed in a ReLU neural network to represent any linear finite element functions in when
Based on this concrete understanding of the structure of the tangent space, we can calculate the local truncation error of the projected gradient flow.
Theorem 3.
Given a tangent vector whose approximated tangent vector in projected dynamics is given by , the local truncation error in the ReLU network mapping is given by
(45) |
where is the center of mass of in and is understood as . Under the assumption that has bounded second order derivative and
(46) |
Proof.
As mentioned in the above theorem, the approximation using ReLU network mapping is equivalent to piecewise linear approximation in the mapping coordinate. Moreover, at each node , the slope and value of the The function does not need to be continuous, which is exactly the same as the linear spline interpolation. The main difference is that the grid points are not fixed since they can evolve over time. Therefore, we rewrite the optimization problem eq. 43 as
(47) |
which can be further reduced to separated optimization problem of over small interval . For each subproblem, we have
This is a quadratic optimization problem of with positive definite Hessian matrix. Taking derivative w.r.t. , we obtain
Now, using the fact that is a linear function over the interval , we have
(48) |
Plugging back, we obtain the approximation error as
(49) | ||||
Next, we assume all the intervals are short (of scale ) and consider expanding the as Taylor series around , i.e.
(50) | ||||
Here, we use the fact that is a linear function with slope over the interval . Plugging into eq. 48, we obtain
(51) | ||||
Notice that the first two terms are exactly the zero-th and first order term of the function which is similar to classical linear function approximation by discarding all the higher order term. The appearance of residue terms is due to approximation in sense. To calculate the -approximation error, we have
(52) | ||||
In summary, the approximation error consists of the sum over all the interval , with each term depends on through the factor , on through and the term , which also contains . ∎
Let us calculate a special case of the Fokker-Planck equation
Under the Wasserstein metric, the tangent vector in the mapping space is given by
In this case, we have that
The above function will determine the approximation quality of the projected dynamics.
Remark 1.
The high-order neural mapping function class and associated high-order projected dynamics can also be derived following a similar procedure. For example, we can add a quadratic term of the ReLU function into the network mapping function as
(53) |
Notice that adding high order ReLU term is different from increasing the layers in the ReLU neural network which corresponds to function composition. We leave the detailed analysis and numerical experiments on high-order methods in future work.
5. Numerical Examples
In this section, we provide several numerical experiments to test our algorithm and theories. We focus our attention on the linear transport equation, Fokker-Planck equation, porous medium equations, and Keller-Segel equation. They all correspond to some specific energy functionals in the probability space equipped with the Wasserstein-2 distance.
5.1. Neural Network structure
We first describe the structure of our neural network for the experiment. We focus on two-layer neural network with ReLU as activation functions.
(54) |
Here represents the collection of weights and bias . To simplify our notation, we have absorbed the factor into ’s. At initialization, we set for and for . To choose the ’s, we first set for some positive constant (e.g. or ). We then set for and for . Here is a small offset which will be explained later in Section 5.3. Our initialization is chosen such that approximates the identity map at initialization. In practice, we find it beneficial to perform a rescaling of the weights ’s. We replace with for some fixed constant . And we initialize for and for . This rescaling makes sure that still approximates the identity map at initialization. We provide a brief intuition for rescaling. Let us consider for , , . Then . On the other hand, . This simple calculation shows that the partial gradient of with respect to weights and bias are of different scales. Therefore, to make them the same scale, a natural choice is choosing .
Remark 2.
The choice of neural network (54) is slightly more complicated than the one studied in Section 4. This symmetric structure is used in numerical experiments to overcome ReLU’s drawback such that only the positive input is activated. Moreover, (54) allows us to construct an approximation to the identity map over easily. However, the results of Proposition 9 still hold for (54). And Theorem 3 can be generalized to neural network of the form given in (54) in a straightforward manner. The metric tensor is now a matrix. The calculations of the individual components of follow the same procedure presented in proposition 4.
Remark 3.
5.2. Linear transport PDE
We investigate the linear transport PDE given by Eq. (14) with several choices of potential , corresponding to the gradient flow of them under the Wasserstein metric. For a simple potential function, this example can serve as a sanity check of the projected dynamics formulation. The trajectories of the particles for Eq. (14) (i.e. Lagrangian formulation) follows the following ODE
(55) |
Let us denote by the solution to Eq. (55) with initial condition . In other words, is the transport map at time starting from position . We define the error at time by
(56) |
where we discretize the integration domain by equally spaced points to approximate the integral. And denotes the initial distribution of . Below we test our projected dynamics under three choices of potential functions and investigate the convergence behavior of two projected dynamics: (i) fixing the bias terms and only updating the weights and (ii) updating both bias and weights . Note that when the bias terms ’s are fixed, we have that . Recall that we are essentially simulating the gradient flow on parameter given by Eq. (11). We use particles sampled from a standard Gaussian distribution for approximating . Once we have the empirical loss function
we can backpropagate this loss to obtain
which will be used in the update of given by Eq. (11).
5.2.1. Quadratic potential
As the first example for linear transport PDE, we consider the quadratic potential as a sanity check. The stationary distribution will be the delta mass supported at . Using the method of characteristics, one can show that the solution at time is given by
(57) |
where is the initial distribution. In Lagrangian coordinates, the transport map of a point at time is given by
(58) |
One can check that and as . It is worthwhile mentioning that at each , the Lagrangian map is a linear map. For simplicity, we take . We choose and run for 1000 steps. We compare our numerical results with Eq. (58). The result is demonstrated in Fig. 2. In Fig. 1(b), we have provided a visualization of the analytic solution to the linear transport PDE in Lagrangian coordinates at and our computed solution. As shown in Fig. 1(b), the analytic transport map is linear while the neural mapping function is piecewise linear. Increasing does not necessarily give a smaller approximation error. In fact, we see in Fig. 1(a) that larger usually gives a larger error, commonly known as overfitting in machine learning.
5.2.2. Quartic potential
Let us consider . The analytic solution of the transport map is given by
(59) |
Basic settings are the same as the previous case. We choose and run for 1000 steps. We compare our numerical results with Eq. (59). We present our results in Fig. 3. In Fig. 2(a), we observe a clear decrease in error as the number of neurons becomes larger. In Fig. 2(b), we visualize the analytic solution to the linear transport PDE in Lagrangian coordinates at and our computed solution. We can see that even when the optimal transport map is nonlinear, our computed solution still matches the analytic solution very accurately.
5.2.3. Sixth order polynomial potential
Let us consider . The analytic solution of the transport map is given by
(60) |
We choose and run for 1000 steps. The reason to choose such a small step size is that the ODE (55) is stiff when is a sixth order polynomial. This can be readily seen by considering the forward Euler scheme for solving (55), which results in the popular gradient descent algorithm. The step size that can guarantee convergence in gradient descent is at most where is the Lipschitz constant of the gradient function. In our case, the gradient function is not globally Lipschitz. Even if we consider a fixed interval , the Lipschitz constant is . If we take , then we get . We compare our numerical results with Eq. (60). We have chosen to be a uniform mesh of size on in Eq. (56) and is the standard Gaussian distribution. Note that is chosen to be much larger than the number of neurons in the network mapping function as it is used to evaluate the accuracy of our algorithm. We present our results in Fig. 4. We can see a clear decrease in error when increases from Fig. 3(a). It is also clear from Fig. 3(a) that updating both weights and bias tends to have a smaller error than just updating the weights, although the difference becomes smaller when increases and more mesh points become available. Comparing dashed and solid lines in Fig. 3(a), we find that the initialization of also plays a role in the overall performance of our solution. The error is smaller when the initial mesh points (i.e., the ’s) are more concentrated near the center of the reference measure. In our case, the reference measure is a standard Gaussian, whose measure is “almost” supported on . Hence we see that the solid lines show a smaller error than the dashed lines in Fig. 3(a). In Fig. 3(b), we have given a visualization of the analytic solution to the linear transport PDE in Lagrangian coordinates at and our computed solution. It is worth noting from Fig. 3(b) that our learned Lagrangian map approximates the analytic Lagrangian map well near the center of the reference distribution, which is concentrated near the origin. Even though the error of the learned Lagrangian map is larger outside of , the overall error from Eq. (56) is still small since the reference measure (standard Gaussian measure) on is exponentially small.
Remark 4.
According to Proposition 9, updating both and is a second order method. This can be seen from Fig. 3(a) when is small. When is large, the numerical advantage of updating both and is less significant compared with updating only . This is partially explained by the condition number of the grows too large when contains all of and . This phenomenon is also observed in our other experiments. Using the implicit scheme or proximal scheme (without solving the linear system that involves directly) might help with this difficulty, which we leave as a future study.
5.3. Fokker-Planck Equation
We consider Fokker-Planck equations. In general, there is no closed-form solution for either the Eulerian or Lagrangian coordinate except for some special forms of potential (e.g. quadratic). We can still have an approximation of the analytic transport map by realizing that the optimal transport map of a point at time is given by
(61) |
where is the cumulative distribution function (CDF) of . has a closed form expression when we choose our reference measure to be a standard Gaussian. But we still need to know . Therefore, to investigate the performance of our algorithm, we need to use a numerical solver to solve for . We choose a center difference in space, implicit in time discretization as our choice of numerical solver with vanishing boundary condition. Recall that we are essentially simulating the gradient flow on parameter given by Eq. (11) and Eq. (13). To calculate the derivative of the energy functionals, we used particles sampled from a standard Gaussian distribution for approximating . Approximating is straightforward and has been explained in detail in Section 5.2. On the other hand, some care needs to be taken when approximating as explained in Section 4.2.1. Suppose that all of the are different. Take . Let us also assume that the ’s are ordered so that .
(62) |
And
(63) |
Similarly, if we assume that and let , we have
(64) |
And
(65) |
Note that during implementation, we do not have to order the ’s in order to evaluate the above partial derivatives. Let . Then by a straightforward calculation, we have
(66) |
In our experiment, we set where is the small offset we introduced in Section 5.1 during initialization.
5.3.1. Quadratic potential
As a first example for the Fokker-Planck equation, we use the quadratic potential as a sanity check. Here is chosen to be a quadratic function. This is one of the rare cases where the Fokker-Planck equation has a closed-form analytic solution. In Lagrangian coordinates, the trajectories of the particles follow the following SDE, commonly known as the Ornstein-Uhlenbeck process:
(67) |
The corresponding Langevin equation for the density is given by
(68) |
where . It can be shown that the solution to (68) is given by
(69) |
where is the initial distribution. In our experiment, is a standard Gaussian. Then (69) implies that is also Gaussian with mean and variance . Then the transport map is given by the optimal transport map between two Gaussians, which has a closed form expression. In this example, the transport map is
(70) |
which is always a linear map, no matter the choice of , and . We use particles sampled from a standard Gaussian distribution for approximating . We choose and run for 1000 steps. We used a neural network with and following the setup in Section 5.1. We have the following two choices of parameters corresponding to different dynamics.
-
•
Moving and widening Gaussian. We choose , , . Under this setting, the solution at time is a Gaussian distribution with mean and variance . This evolution is shown on the left panel of Fig. 5.
-
•
Moving and shrinking Gaussian. We choose , , . Under this setting, the solution at time is a Gaussian distribution with mean and variance . This evolution is shown on the right panel of Fig. 5.
Our results are demonstrated in Fig. 5. As shown in Fig. 5, the computed density closely follows the analytic density of the Fokker-Planck equation from to .
5.3.2. Quartic potential
We consider . We choose and run for 1000 steps. We compare our numerical results with the transport map computed from Eq. (61). The results are shown in Fig. 6. In Fig. 5(a), we observe a clear decrease in error when the number of neurons increases. In Fig. 5(b), we plot a comparison between our computed Lagrangian map vs the transport map computed from Eq. (61) using a numerical solver. The evolution of the density is demonstrated in Fig. 5(c) from to .
5.3.3. Sixth order polynomial potential
We consider . We choose and run for 1000 steps. We compare our numerical results with the transport map computed from Eq. (61). The results are shown in Fig. 7. We have observed similar behavior as in the case of linear transport PDE: the error becomes smaller when increases. Moreover, comparing dashed and solid lines in Fig. 6(a) we see that as the initial mesh points (i.e. the ’s) concentrate nearer the center of our reference measure, the errors are smaller. In Fig. 6(b) we show a comparison between Lagrangian maps computed by our method and the numerical solver. We have also plotted the evolution of the density in Fig. 6(c) from to .
5.4. Porous Medium Equation
We consider Example 6 with the functional , . This choice of yields the porous medium equation
(71) |
It is known that Eq. (71) admits solutions given by the Barenblatt profile
(72) |
where , , , and is a normalization constant so that Eq. (72) integrates to 1 for all . In this example, we consider the case when . Then , and . Eq. (72) also suggests that the support of the reference measure is bounded in , which could help us with initializing the bias. More precisely, we cound initialize our ’s following Section 5.1 with . In our experiment, we set . We use and run for 1000 steps. We use particles sampled from defined in Eq. (72) using importance sampling to approximate , where . Similar to the case of Fokker-Planck equation, special care needs to be taken when evaluating . Using similar analysis from Section 5.3, we have that
(73) |
Our results are demonstrated in Fig. 8. In Fig. 7(b), 7(c) we have ; both the bias and weights are updated.
5.5. Keller-Segel equation
We consider the one-dimensional modified Keller-Segel equation, which is a combination of interaction energy in Example 5 and potential energy in Example 6:
(74) |
where and , is a constant. The second moment of has an analytic form given by
(75) |
It is clear from Eq. (75) that is a critical value. When , the solution blows up as . So we consider two cases: , and . We present our results in Fig. 9, 10 and 11. We used 2000 particles with a standard Gaussian initial distribution. We set and run for 1000 steps. The interaction term is evaluated using the 2000 particles with self-interaction excluded. We used a neural network with and following the setup and initialization in Section 5.1. We update both the bias and weights terms in our experiment.
6. Discussion
This paper analyzes the neural network projected dynamics for one-dimensional Wasserstein gradient flows of general energy functionals. For two-layer neural network functions with ReLU activations, we analyze the convergence and stability issues for the proposed numerical schemes from location parameter and scale parameter . In numerical experiments, we demonstrate the second-order spatial domain accuracy as discussed in the numerical analysis.
In future work, we shall study neural projected dynamics as a computational framework for building theoretical guaranteed machine learning numerical schemes. Various topics in this direction remain to be studied. First, we shall design neural network approximations to approximate the initial value of high-dimensional PDEs, which traditional PDE solvers cannot efficiently solve due to the curse of dimensionality. In particular, how can we understand the numerical accuracy of deep neural network functions in high dimensions when approximating PDEs? Next, we shall generalize the neural projected dynamics to dynamical systems for conservative-dissipative equations in statistical physics. The equation includes Hamiltonian structures induced from the conservative system and the related mean-field control problems. Furthermore, considering the closed relationship between the Wasserstein density manifold and sampling algorithms, we shall investigate sampling using the projected dynamics on neural parameter spaces and study their theoretical and statistical properties. We also consider the time-implicit (proximal-type) computations of the proposed algorithm [23, 19], which could improve the performance and stability of the scheme.
References
- [1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
- [2] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
- [3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
- [4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
- [5] Andrew J Barlow, Pierre-Henri Maire, William J Rider, Robert N Rieben, and Mikhail J Shashkov. Arbitrary Lagrangian–Eulerian methods for modeling high-speed compressible multimaterial flows. Journal of Computational Physics, 322:603–665, 2016.
- [6] Peter Benner, Serkan Gugercin, and Karen Willcox. A survey of projection-based model reduction methods for parametric dynamical systems. SIAM review, 57(4):483–531, 2015.
- [7] Joan Bruna, Benjamin Peherstorfer, and Eric Vanden-Eijnden. Neural Galerkin schemes with active learning for high-dimensional evolution equations. Journal of Computational Physics, 496:112588, 2024.
- [8] Jose A Carrillo, Daniel Matthes, and Marie-Therese Wolfram. Lagrangian schemes for Wasserstein gradient flows. Handbook of Numerical Analysis, 22:271–311, 2021.
- [9] JS Chang and G Cooper. A practical difference scheme for Fokker–Planck equations. Journal of Computational Physics, 6(1):1–16, 1970.
- [10] Yifan Chen and Wuchen Li. Optimal transport natural gradient for statistical manifolds with continuous sample space. Information Geometry, 3(1):1–32, 2020.
- [11] Casey Chu, Kentaro Minami, and Kenji Fukumizu. The equivalence between Stein variational gradient descent and black-box variational inference. arXiv preprint arXiv:2004.01822, 2020.
- [12] Yifan Du and Tamer A Zaki. Evolutional deep neural network. Physical Review E, 104(4):045303, 2021.
- [13] Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. arXiv preprint arXiv:2112.02424, 2021.
- [14] Nathan Gaby, Xiaojing Ye, and Haomin Zhou. Neural control of parametric solutions for high-dimensional evolution PDEs. arXiv preprint arXiv:2302.00045, 2023.
- [15] Juncai He, Lin Li, Jinchao Xu, and Chunyue Zheng. ReLU deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, June 2020.
- [16] Ziqing Hu, Chun Liu, Yiwei Wang, and Zhiliang Xu. Energetic variational neural network discretizations to gradient flows. arXiv preprint arXiv:2206.07303, 2022.
- [17] Weizhang Huang and Robert D Russell. Adaptive moving mesh methods, volume 174. Springer Science & Business Media, 2010.
- [18] Wonjun Lee, Li Wang, and Wuchen Li. Deep JKO: time-implicit particle methods for general nonlinear gradient flows. arXiv preprint arXiv:2311.06700, 2023.
- [19] Wuchen Li, Alex Tong Lin, and Guido Montúfar. Affine natural proximal learning. In Geometric Science of Information: 4th International Conference, GSI 2019, Toulouse, France, August 27–29, 2019, Proceedings 4, pages 705–714. Springer, 2019.
- [20] Wuchen Li and Guido Montúfar. Natural gradient via optimal transport. Information Geometry, 1:181–214, 2018.
- [21] Wuchen Li and Jiaxi Zhao. Scaling limits of the Wasserstein information matrix on Gaussian mixture models. arXiv preprint arXiv:2309.12997, 2023.
- [22] Wuchen Li and Jiaxi Zhao. Wasserstein information matrix. Information Geometry, pages 1–53, 2023.
- [23] Alex Tong Lin, Wuchen Li, Stanley Osher, and Guido Montúfar. Wasserstein proximal of gans. In International Conference on Geometric Science of Information, pages 524–533. Springer, 2021.
- [24] Chun Liu and Yiwei Wang. On Lagrangian schemes for porous medium type generalized diffusion equations: A discrete energetic variational approach. Journal of Computational Physics, 417:109566, 2020.
- [25] Shu Liu, Wuchen Li, Hongyuan Zha, and Haomin Zhou. Neural parametric Fokker–Planck equation. SIAM Journal on Numerical Analysis, 60(3):1385–1449, 2022.
- [26] Pierre-Henri Maire, Rémi Abgrall, Jérôme Breil, and Jean Ovadia. A cell-centered Lagrangian scheme for two-dimensional compressible flow problems. SIAM Journal on Scientific Computing, 29(4):1781–1824, 2007.
- [27] Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243–15256, 2021.
- [28] Kirill Neklyudov, Rob Brekelmans, Alexander Tong, Lazar Atanackovic, Qiang Liu, and Alireza Makhzani. A computational framework for solving Wasserstein Lagrangian flows. arXiv preprint arXiv:2310.10649, 2023.
- [29] Levon Nurbekyan, Wanzhou Lei, and Yunan Yang. Efficient natural gradient descent methods for large-scale PDE-based optimization problems. SIAM Journal on Scientific Computing, 45(4):A1621–A1655, 2023.
- [30] N Nüsken. On the geometry of Stein variational gradient descent. Journal of Machine Learning Research, 24:1–39, 2023.
- [31] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015.
- [32] Lars Onsager. Reciprocal relations in irreversible processes. I. Phys. Rev., 37:405–426, Feb 1931.
- [33] Felix Otto. The geometry of dissipative evolution equations the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
- [34] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020.
- [35] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- [36] Tao Tang. Moving mesh methods for computational fluid dynamics. Contemporary mathematics, 383(8):141–173, 2005.
- [37] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
- [38] Hao Wu, Shu Liu, Xiaojing Ye, and Haomin Zhou. Parameterized Wasserstein Hamiltonian flow. arXiv preprint arXiv:2306.00191, 2023.