1. Introduction
Deep neural networks are high-dimensional machine learning models that have demonstrated impressive performance on a number of challenging tasks in computer vision, natural language processing and reinforcement learning [1,2]. These models are typically trained to minimize a misprediction error over large amounts of human-annotated data. A large number of diverse models with varying properties are prevalent in these application domains. Despite this diversity, stochastic gradient descent (SGD) is the gold standard for training deep neural networks: it has been shown to obtain good generalization performance, i.e., to train models that perform well on new data, across a wide range of applications. In spite of this popularity and efficacy, a precise understanding of SGD for deep learning remains elusive.
This paper develops a geometric understanding of stochastic gradient descent. We build upon the work of [3], wherein the authors model the dynamics of SGD as a stochastic differential equation with state-dependent Gaussian noise. We interpret the covariance of this noise, called the diffusion matrix $D$ henceforth, as a metric on the parameter space. Our result provides a deterministic equation, Equation (10), that can be compared to the SGD dynamics, Equation (1), near equilibrium points. We write the diffusion matrix $D$ in the form of Equation (3) to show how it fundamentally captures the anisotropy of the dynamical system underlying SGD. This clarifies how $D$ is one of the key factors differentiating the steady-state solutions of SGD from those of ordinary gradient descent (GD) (see the comparison in [3,4,5]). Using the diffusion matrix, we then define a family of metrics on the parameter space that we call diffusion metrics. We then take the Einstein equation describing geodesics on a Riemannian manifold, which governs the motion of a particle subject to a gravitational and an electromagnetic field. We replace the electromagnetic force with the ordinary gradient of the loss, while gravity is taken into account through the diffusion metric itself. Under some mild hypotheses on the architecture of the neural network, we obtain the result that geodesics with respect to this equation correspond precisely to the evolution of a dynamical system which is subject not to Euclidean gradient descent but to relativistic gradient descent (RGD) with respect to the family of diffusion metrics.
In the end, we obtain Equation (10), which is in the same vein as natural gradient descent [6], but whose significance is deeper in the context of SGD, since it stems from the anisotropy of the gradients with respect to the various parameters; this anisotropy encodes the difference between the dynamics of GD and those of SGD. We also compare our result with the ones in [3] and show them to be perfectly compatible. Finally, for the reader's convenience, we provide Appendix A with a quick review of some key facts of Riemannian geometry.
2. Continuous-Time SGD and the Diffusion Matrix
Stochastic gradient descent performs an update of the weights $x$ of a neural network, replacing the ordinary gradient of the loss function, $\nabla f = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i$, with the mini-batch gradient $\nabla f_B = \frac{1}{b}\sum_{i \in B} \nabla f_i$:

$$ dx = -\nabla f_B(x)\, dt, \qquad (1) $$

where $dx$ represents the continuous version of the weight update at step $j$, $x_{j+1} = x_j - \eta\, \nabla f_B(x_j)$, with the learning rate $\eta$ incorporated into the expression of $dt$, and $B \subset \{1, \dots, N\}$ is the mini-batch, of size $b = |B|$. In the expression of the loss function $f = \frac{1}{N}\sum_{i=1}^{N} f_i$, the function $f_i$ is the loss relative to the $i$-th element in our dataset $\Xi$ of size $N = |\Xi|$. We assume that the weights belong to a compact subset $\Omega \subset \mathbb{R}^d$ and that the $f_i$ satisfy suitable regularity conditions (see [3], Section 2, for more details).
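To fix ideas, here is a minimal NumPy sketch of the discrete update $x_{j+1} = x_j - \eta\,\nabla f_B(x_j)$ on a toy least-squares problem; the quadratic per-sample losses, the data `A`, `y` and all the sizes are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, b, eta = 256, 10, 32, 0.1        # dataset size, number of weights, mini-batch size, learning rate
A, y = rng.normal(size=(N, d)), rng.normal(size=N)

def grad_fi(x, i):
    """Gradient of the hypothetical per-sample loss f_i(x) = 0.5 * (a_i . x - y_i)^2."""
    return (A[i] @ x - y[i]) * A[i]

def grad_fB(x, batch):
    """Mini-batch gradient: the average of grad f_i over i in B."""
    return np.mean([grad_fi(x, i) for i in batch], axis=0)

x = rng.normal(size=d)
for _ in range(1000):                  # SGD: x_{j+1} = x_j - eta * grad f_B(x_j)
    batch = rng.choice(N, size=b, replace=False)
    x -= eta * grad_fB(x, batch)
```

The randomness of the update enters only through the choice of the mini-batch $B$, which is exactly the randomness that the diffusion matrix below quantifies.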
We define the diffusion matrix $D$ as the product of the size of the mini-batch and the variance of $\nabla f_B$, viewed as a random variable of the mini-batch, $B \mapsto \nabla f_B(x)$, $B \subset \{1, \dots, N\}$:

$$ D(x) := b \cdot \mathrm{var}\big(\nabla f_B(x)\big) = \frac{1}{N}\sum_{k=1}^{N} \nabla f_k\, \nabla f_k^{T} - \nabla f\, \nabla f^{T}. \qquad (2) $$
Notice that $D(x) \succeq 0$ and does not depend on the size of the mini-batch; it only depends on the weights $x$, the loss function $f$ and the dataset $\Xi$. With a direct calculation one can show that:

$$ D = \frac{1}{2N^2} \sum_{i,j=1}^{N} (\nabla f_i - \nabla f_j) \otimes (\nabla f_i - \nabla f_j), \qquad (3) $$

where $u \otimes v$ denotes the rank-one operator $(u \otimes v)\, w = \langle v, w \rangle\, u$, and $\langle\,,\,\rangle$ is the euclidean scalar product. In fact:

$$ \frac{1}{2N^2} \sum_{i,j=1}^{N} (\nabla f_i - \nabla f_j) \otimes (\nabla f_i - \nabla f_j) = \frac{1}{2N^2} \sum_{i,j=1}^{N} \big( \nabla f_i \otimes \nabla f_i + \nabla f_j \otimes \nabla f_j - \nabla f_i \otimes \nabla f_j - \nabla f_j \otimes \nabla f_i \big), $$

which gives:

$$ \frac{1}{N} \sum_{k=1}^{N} \nabla f_k \otimes \nabla f_k - \nabla f \otimes \nabla f = D. $$
The diffusion matrix effectively measures the anisotropy of our data: $D = 0$ if and only if $\nabla f_i = \nabla f_j$ for all $i, j \in \{1, \dots, N\}$ and all $x \in \Omega$. In other words, the diffusion matrix measures how the loss of each datum depends, at first order, on the weights in a different way with respect to the loss of each of the other data points. So, it tells us how much we should expect the SGD dynamics to differ from the GD dynamics. Notice that the expression in Equation (3) immediately gives us a bound on the rank of $D$; namely, $\mathrm{rk}(D) \leq N$.
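The equality between Equations (2) and (3) and this rank bound are easy to check numerically. The following sketch builds $D$ both ways from a stand-in matrix of per-sample gradients (randomly generated here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 8, 50                               # few samples, many weights: d >> N
G = rng.normal(size=(N, d))                # row i = a hypothetical per-sample gradient  grad f_i(x)
gbar = G.mean(axis=0)                      # full gradient  grad f(x)

# Equation (2): D = (1/N) sum_k grad f_k grad f_k^T  -  grad f grad f^T
D2 = G.T @ G / N - np.outer(gbar, gbar)

# Equation (3): D = 1/(2 N^2) sum_{i,j} (grad f_i - grad f_j) (x) (grad f_i - grad f_j)
V = G[:, None, :] - G[None, :, :]          # V[i, j] = grad f_i - grad f_j
D3 = np.einsum('iju,ijv->uv', V, V) / (2 * N**2)

assert np.allclose(D2, D3)                 # the two expressions agree
print(np.linalg.matrix_rank(D2))           # bounded by N (prints 7, far below d = 50): D is singular
```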
Table 1 suggests that, in many models currently in use, the diffusion matrix has low rank, and hence is singular: it suffices to compare the size $d$ of the $d \times d$ matrix $D$ with its rank, which is bounded by the dataset size $N$. This fact turns out to be very important in the construction of the diffusion metrics, as we will see below.
3. Diffusion Metrics and General Relativity
The evolution of a dynamical system in general relativity takes place along geodesics with respect to the metric imposed on Minkowski space by the presence of gravitational masses. The equation for such geodesics, once Einstein's equation is solved, is:

$$ \frac{d^2 x_w}{dt^2} + \sum_{u,v} \Gamma^{w}_{uv}\, \frac{dx_u}{dt} \frac{dx_v}{dt} = F_w, \qquad (4) $$

where the $\Gamma^{w}_{uv}$ are the Christoffel symbols for the Levi–Civita connection:

$$ \Gamma^{w}_{uv} = \frac{1}{2} \sum_{z} g^{wz} \big( \partial_u g_{vz} + \partial_v g_{uz} - \partial_z g_{uv} \big), \qquad (5) $$

and $F$ is a term accounting for an external force; e.g., one coming from an electromagnetic field.
If we take the time derivative of the differential equation underlying the ordinary (i.e., non-stochastic) gradient descent:

$$ \frac{dx}{dt} = -\nabla f(x), \qquad \text{so that} \qquad \frac{d^2 x_w}{dt^2} = -\frac{d}{dt}\,\partial_w f, $$

and compare it with Equation (4), we observe that $-\frac{d}{dt}\nabla f$ effectively replaces the force term $F$. Hence, the geodesic equation, Equation (4), models the ordinary GD equation if we take a constant metric and replace the force term with the gradient of the loss; furthermore, this corresponds to the condition $D = 0$ in the SGD dynamics of Equation (1).
This suggests that one may define a metric depending on the diffusion matrix; this metric should become constant when $D = 0$. As a side remark, notice that since $D$ is singular in many important practical applications (see Table 1), it is not reasonable to use it to define the metric itself. On the other hand, since $D$ measures the anisotropy of the weight space, it is reasonable to employ it to perturb the euclidean metric. The stochastic nature of the dynamical system ruled by SGD is thus replaced by a perturbation of the dynamics of the ordinary gradient descent. As an analogy, in the weak field approximation of general relativity (see [7]), the presence of (small) masses in space generates gravity; this motivates our (small) deformation of the euclidean metric using the diffusion matrix.
At each point $x \in \Omega$, we define a metric, called the diffusion metric, as

$$ g_{uv}(x) := \delta_{uv} - 2\beta^{-1}\, d_{uv}(x), \qquad (6) $$

with $0 < 2\beta^{-1} < 1/M$, where $M := \max_{x \in \Omega} \lVert D(x) \rVert$. This ensures that $g$ is non-singular at each $x \in \Omega$. We have, thus, defined a family of metrics that depend on the real parameter $\beta^{-1}$. We expect this model to approximate the solution to the Fokker–Planck equation when the parameter $\beta^{-1}$, whose interpretation is related to the temperature, is very small (see [3] for the notation and more details).
Notice that our heuristic hypothesis on the smallness of $\beta^{-1}$ allows us to make the so-called weak field approximation (see [7]):

$$ g_{uv} = \delta_{uv} - h_{uv}, \qquad g^{uv} \approx \delta_{uv} + h_{uv}, \qquad |h_{uv}| \ll 1, \qquad h_{uv} := 2\beta^{-1}\, d_{uv}. \qquad (7) $$

Hence, we have the following expression for the Christoffel symbols (in this approximation we discard the terms quadratic in $h$):

$$ \Gamma^{w}_{uv} = -\beta^{-1} \sum_{z} \delta_{wz} \big( \partial_u d_{vz} + \partial_v d_{uz} - \partial_z d_{uv} \big), $$

where the $d_{ij}$ are the coefficients of $D$, and $\delta_{wz}$ is the Kronecker delta.
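As a sanity check on this approximation, one can compare the exact Christoffel symbols of Equation (5) with the weak-field expression on a toy metric of the form of Equation (6); in the sketch below, the field $D(x)$, the dimension and the value of $\beta^{-1}$ are all hypothetical, and the derivatives are taken by finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta_inv, eps = 3, 1e-2, 1e-5

# A toy smooth field of symmetric matrices d_{uv}(x), standing in for the diffusion matrix.
W = rng.normal(size=(d, d, d))
W = W + W.transpose(1, 0, 2)                       # make d_{uv} symmetric in u, v

def Dmat(x):
    return np.einsum('uvz,z->uv', W, np.sin(x))

def g(x):
    """Diffusion metric, Equation (6): g = 1 - 2 beta^{-1} D(x)."""
    return np.eye(d) - 2 * beta_inv * Dmat(x)

def dg(x):
    """P[z, u, v] = partial_z g_{uv}, by central finite differences."""
    P = np.zeros((d, d, d))
    for z in range(d):
        e = np.zeros(d); e[z] = eps
        P[z] = (g(x + e) - g(x - e)) / (2 * eps)
    return P

x = rng.normal(size=d)
P = dg(x)
T = P + P.transpose(1, 0, 2) - P.transpose(1, 2, 0)  # T[u,v,z] = d_u g_{vz} + d_v g_{uz} - d_z g_{uv}

# Exact Christoffel symbols, Equation (5), versus the weak-field approximation,
# where the inverse metric is replaced by the Kronecker delta.
exact = 0.5 * np.einsum('wz,uvz->wuv', np.linalg.inv(g(x)), T)
approx = 0.5 * T.transpose(2, 0, 1)
print(np.max(np.abs(exact - approx)))                # of order (beta_inv)^2: the discarded terms
```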
Let us now compute the Christoffel symbols and then substitute them into the geodesic equation given by Equation (4). Substituting in Equation (4) (now writing the sum explicitly, and using the symmetry of the first two terms of the Christoffel symbols under the exchange of $u$ and $v$):

$$ \frac{d^2 x_w}{dt^2} - \beta^{-1} \sum_{u,v} \big( 2\,\partial_u d_{vw} - \partial_w d_{uv} \big)\, \frac{dx_u}{dt}\frac{dx_v}{dt} = -\frac{d}{dt}\,\partial_w f. \qquad (8) $$

Let us concentrate on the expression:

$$ \sum_{u,v} \partial_u d_{vw}\,\frac{dx_u}{dt}\frac{dx_v}{dt} = \sum_{v} \frac{d\,d_{vw}}{dt}\,\frac{dx_v}{dt}. $$

Now, we take the integral in $dt$, computing it by parts:

$$ \int \sum_{v} \frac{d\,d_{vw}}{dt}\,\frac{dx_v}{dt}\,dt = \sum_{v} d_{vw}\,\frac{dx_v}{dt} - \int \sum_{v} d_{vw}\,\frac{d^2 x_v}{dt^2}\,dt. $$

Notice that, in many practical applications, we have:

$$ \sum_{u,v} \partial_w d_{uv}\,\frac{dx_u}{dt}\frac{dx_v}{dt} \approx 0, \qquad (9) $$

because this term is quadratic in the velocity $dx/dt = -\nabla f_B$, which is small near the equilibrium points, and the coefficients of $D$ vary slowly with respect to the weights. We now substitute the obtained expression into Equation (8), taking the integral in $dt$:

$$ \frac{dx_w}{dt} - 2\beta^{-1} \sum_{v} d_{vw}\,\frac{dx_v}{dt} + 2\beta^{-1} \int \sum_{v} d_{vw}\,\frac{d^2 x_v}{dt^2}\,dt = -\partial_w f. $$

We may assume the acceleration $d^2x/dt^2$ to be very small, as motivated by Equation (1); hence, we discard the remaining integral term. Writing the equation in vector form, we have:

$$ \big( 1 - 2\beta^{-1} D \big)\,\frac{dx}{dt} = -\nabla f. $$
By the weak field approximation, $\big(1 - 2\beta^{-1} D\big)^{-1} \approx 1 + 2\beta^{-1} D$, and we can write:

$$ \frac{dx}{dt} = -\big( 1 + 2\beta^{-1} D(x) \big)\,\nabla f(x) =: -\nabla_D f(x), \qquad (10) $$

where $\nabla_D f$ is the gradient computed according to the diffusion metric of Equation (6).
We can summarize our result as follows: the SGD equation, Equation (1), can be replaced, provided that the approximation of Equation (9) holds, by the deterministic equation, Equation (10), where the dynamical system evolves with respect to the gradient computed according to the diffusion metric of Equation (6).
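In discrete time, Equation (10) amounts to preconditioning the usual gradient step with $1 + 2\beta^{-1} D(x)$. Below is a minimal sketch of one such step, assuming the per-sample gradients at $x$ are available and treating $\beta^{-1}$ as a free small parameter; all names and values are hypothetical.

```python
import numpy as np

def rgd_step(x, grads, eta=0.1, beta_inv=0.01):
    """One step of relativistic gradient descent, Equation (10):
    x <- x - eta * (1 + 2 beta^{-1} D(x)) grad f(x),
    with D(x) built from the per-sample gradients as in Equation (2)."""
    G = np.asarray(grads)                  # row i = grad f_i(x), shape (N, d)
    gbar = G.mean(axis=0)                  # grad f(x)
    D = G.T @ G / len(G) - np.outer(gbar, gbar)
    return x - eta * (gbar + 2 * beta_inv * D @ gbar)

# Usage with hypothetical gradients:
rng = np.random.default_rng(3)
x = rng.normal(size=5)
grads = rng.normal(size=(20, 5))
x = rgd_step(x, grads)
```

When $D = 0$ (isotropic data), the step reduces to ordinary gradient descent, consistent with the discussion above.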
We now want to compare our result, Equation (10), with [3], Section 1, in order to understand how the steady-state solutions of Equation (10) compare to the SGD steady-state solutions described by Equation (1). In [3], the authors regard SGD as minimizing a potential $\Phi$ instead of our loss $f$. Let us focus on (8) in [3], where the relation between $f$ and $\Phi$ is discussed:

$$ \nabla f(x) = D(x)\,\nabla \Phi(x) - \beta^{-1}\, \nabla \cdot D(x) + j(x). \qquad (11) $$

In our case, we take $\nabla \Phi = \nabla_D f$, so that Equation (11) becomes

$$ \nabla f(x) = D(x)\,\nabla_D f(x) - \beta^{-1}\, \nabla \cdot D(x) + j(x), $$

where $\nabla_D$ is the gradient computed according to the diffusion metric, Equation (6). If the term $D(x)$ in (8) in [3] is spelled out as our diffusion matrix, Equation (2), we can write such an equation as:

$$ \nabla f(x) = \Big( \frac{1}{N} \sum_{k=1}^{N} \nabla f_k \otimes \nabla f_k - \nabla f \otimes \nabla f \Big)\, \nabla_D f(x) - \beta^{-1}\, \nabla \cdot D(x) + j(x). \qquad (12) $$

Notice that, according to our approximation, Equation (9), the terms involving the derivatives of the coefficients of $D$ are negligible; in particular, $\beta^{-1}\nabla \cdot D(x) \approx 0$. Hence, Equation (12) (that is, (8) in [3]) is perfectly compatible with our treatment, and furthermore, assumption 4 in [3] is fully justified by the fact that $j(x) = 0$.