1. Introduction
Training of neural networks is based on the progressive correction of their weights and biases (model parameters), performed by algorithms such as gradient descent, which compare actual outputs with the desired ones for a large set of input samples [1]. Consequently, the understanding of the internal operation of neural networks should be intrinsically based on detailed knowledge of the evolution of their weights in the process of training, starting from their initial configuration. Recently, Li and Liang [2] revealed that, during training, weights in neural networks only slightly deviate from their initial values in most practical scenarios. In this paper, we explore in detail how training changes the initial configuration of weights, and the relations between those changes and the effectiveness of the networks' function. We track the evolution of the weights of networks consisting of two Rectified Linear Unit (ReLU) hidden layers trained on three different classification tasks with Stochastic Gradient Descent (SGD), and measure how the distribution of deviations from an initial weight depends on that initial value. In all of our experiments, the results are consistent across the three tasks.
By experimenting with networks of different sizes, we have observed that, to reach an arbitrarily chosen loss value, the weights of larger networks tend to deviate less from their initial values than those of smaller networks. This suggests that larger networks tend to converge to minima which are closer to their initialization. On the other hand, we observe that for a certain range of network sizes, the deviations from initial weights abruptly increase at some moment during their training within the overfitting regime.
This effect is illustrated in Figure 1 by the persistence and disappearance of an initialization mask in panels (a) and (b), respectively, for two network sizes. (The letters are stamped onto a network's initial configuration of weights by creating a bitmap of the same shape as the weight matrix of the layer being marked, rasterizing the letter to the bitmap, and using the resulting binary mask to set to zero the weights lying outside the mark's area; see the sketch below.) We find that the sharp increase in the deviations of the weights closely correlates with the crossover between two regimes of the network, trainability and untrainability, occurring in the course of the training.
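The stamping procedure might be implemented along the following lines. This is a minimal sketch assuming NumPy and Pillow; the function name `stamp_letter`, the font choice, and the example layer size are ours, not taken from the paper.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def stamp_letter(weights, letter="a"):
    """Zero all weights lying outside a rasterized letter mask.

    `weights` is a 2-D NumPy array (one layer's weight matrix); the letter
    is rasterized to a bitmap of the same shape and used as a binary mask.
    """
    h, w = weights.shape
    bitmap = Image.new("L", (w, h), color=0)               # black background
    draw = ImageDraw.Draw(bitmap)
    try:                                                   # font size ~ matrix height
        font = ImageFont.truetype("DejaVuSans.ttf", int(0.9 * h))
    except OSError:
        font = ImageFont.load_default()
    draw.text((0, 0), letter, fill=255, font=font)
    mask = np.asarray(bitmap) > 0                          # True inside the letter
    return weights * mask                                  # zero outside the mark

# Example: mark a Glorot-initialized 1000 x 1000 layer with the letter 'a'.
rng = np.random.default_rng(0)
limit = np.sqrt(6 / (1000 + 1000))
W0 = stamp_letter(rng.uniform(-limit, limit, size=(1000, 1000)))
```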
The main contributions of this work are the following: (I) a quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes from their initial random configuration, and (II) a demonstration of the correlation between the magnitude of the deviations of weights and the successful training of a network. Recent works [2,3,4] showed that in highly over-parametrized networks, the training process amounts to a fine-tuning of the initial configuration of weights, significantly adjusting only a small portion of them. Our quantitative statistical characterization describes this phenomenon in greater detail and empirically verifies the small deviations that occur when the training process is successful. Furthermore, our analysis allows us to draw some insights regarding the training process of neural networks and paves the way for future research.
Our paper is organized as follows. In Section 2, we summarize some background topics on neural network initialization and review a series of recent papers related to ours. Section 3 presents the problem formulation, experimental settings, and datasets used in this paper. In Section 4, we explore the shape of the distribution of the deviations of weights from their initial values and its dependence on the initial weights. We continue these studies in Section 5 by experimenting with networks of different widths and find that, whenever a network's training is successful, the network does not travel far from its initial configuration. Finally, Section 6 provides concluding remarks and points out directions for future research.
3. Problem Formulation
Our aim in this work is to contribute to the conceptual understanding of the influence that random initializations have on the solutions of feedforward neural networks trained with SGD. In order to avoid undesirable effects specific to particular architectures, training methods, etc., we set up our experiments with very simple, vanilla settings.
We trained feedforward neural networks with two layers of hidden nodes (three layers of links) with all-to-all connectivity between adjacent layers. In our experiments, we vary the widths (i.e., the numbers of nodes) of the two hidden layers between 10 and 1000 nodes simultaneously, always keeping them equal to each other. The numbers of nodes of the input and output layers are determined by the dataset, specifically by the number of pixels of the input images and the number of classes, respectively. This architecture is largely based on the multilayer perceptron created by Keras for MNIST (
https://raw.githubusercontent.com/ShawDa/Keras-examples/master/mnist_mlp.py, accessed on 8 September 2021).
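A minimal sketch of this architecture in Keras might look as follows; the function name `build_mlp` and the default argument values are ours. Note that Keras Dense layers use Glorot uniform initialization by default, matching the setup described below.

```python
import tensorflow as tf

def build_mlp(width, num_pixels=784, num_classes=10):
    """Two equal-width ReLU hidden layers with all-to-all connectivity,
    followed by a softmax output layer; defaults match 28 x 28 inputs
    and 10 classes."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="relu",
                              input_shape=(num_pixels,)),
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```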
Let us denote the weight of the link connecting node $i$ in a given layer to node $j$ in the next layer by $w_{ij}$. The output of node $j$ in the hidden and output layers, denoted by $o_j$, is determined by an activation function $f$ of the weighted sum of the outputs of the previous layer plus the node's bias, $b_j$, as

$$o_j = f\Big(\sum_i w_{ij} o_i + b_j\Big).$$

The nodes of the two hidden layers employ the Rectified Linear Unit (ReLU) activation function

$$\mathrm{ReLU}(x) = \max(0, x).$$
The ReLU is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
The nodes in the output layer employ the softmax activation function

$$\mathrm{softmax}(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \quad j = 1, \ldots, K,$$

where $K$ is the number of elements in the input vector (i.e., the number of classes of the dataset). The softmax activation function is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network, to normalize the output of the network to a probability distribution over predicted output classes.
Unless otherwise stated, the biases of the networks are initialized at zero and the weights are initialized with Glorot's uniform initialization [7]:

$$w_{ij} \sim U\left(-\sqrt{\frac{6}{m+n}}, \; \sqrt{\frac{6}{m+n}}\right),$$

where $U(a, b)$ is the uniform distribution in the interval $(a, b)$, and $m$ and $n$ are the numbers of units of the two layers that weight $w_{ij}$ connects. In some of our experiments, we apply various masks to these uniformly distributed weights, setting to zero all weights $w_{ij}$ not covered by a mask (see Figure 1).
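Applying such a mask to an already-initialized Keras layer could be sketched as below; `apply_mask` is a hypothetical helper of ours (a letter-shaped mask could be produced as in the `stamp_letter` sketch of Section 1).

```python
import numpy as np

def apply_mask(layer, mask):
    """Set to zero all weights of a Keras Dense layer not covered by `mask`.

    `mask` is a binary NumPy array with the same shape as the layer's
    kernel: 1 keeps a weight, 0 zeroes it; biases are left untouched.
    """
    kernel, bias = layer.get_weights()
    layer.set_weights([kernel * np.asarray(mask), bias])

# Usage (assuming `model` built as in the previous sketch):
# apply_mask(model.layers[0], letter_mask)
```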
The loss function to be minimized is the categorical cross-entropy, i.e.,

$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i,$$

where $C$ is the number of output classes, $y_i$ is the $i$-th target output, and $\hat{y}_i$ is the
$i$-th output of the network. The neural networks were optimized with Stochastic Gradient Descent with a learning rate of 0.1 and in mini-batches of size 128. The networks were defined and trained in Keras [30] using its TensorFlow [31] back-end.
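Putting the pieces together, the training setup described here might be sketched as follows; `build_mlp` is the sketch above, and `x_train`/`y_train` are assumed to be flattened, one-hot-encoded data (a loading sketch follows in the datasets discussion below).

```python
import tensorflow as tf

model = build_mlp(width=1000)                 # sketch defined above
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # lr = 0.1
    loss="categorical_crossentropy",          # the loss defined above
)
initial_weights = model.get_weights()         # snapshot for later comparison
history = model.fit(x_train, y_train,
                    batch_size=128,           # mini-batches of size 128
                    epochs=1000,
                    validation_data=(x_test, y_test))
```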
Throughout this paper, we use three datasets to train our networks: MNIST, Fashion MNIST, and HASYv2; Figure 2 displays samples from each. These are among the most standard datasets used in research on supervised machine learning.
MNIST (http://yann.lecun.com/exdb/mnist/, accessed on 5 September 2021) [32] is a database of gray-scale handwritten digits. It consists of $6 \times 10^4$ training and $10^4$ test images of size 28 × 28, each showing one of the numerals 0 to 9. It was chosen due to its popularity and widespread familiarity.
Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist, accessed on 5 September 2021) [33] is a dataset intended to be a drop-in replacement for the original MNIST dataset in machine learning experiments. It features 10 classes of clothing categories (e.g., coat, shirt, etc.) and is otherwise very similar to MNIST. It also consists of 28 × 28 gray-scale images, $6 \times 10^4$ samples for training and $10^4$ for testing.
HASYv2 (https://github.com/MartinThoma/HASY, accessed on 5 September 2021) [34] is a dataset of 32 × 32 binary images of handwritten symbols (mostly LaTeX symbols, such as ∫). It differs from the previous two datasets mainly in that it has many more classes (369) and is much larger.
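MNIST and Fashion MNIST can be loaded directly through Keras; a sketch of the preprocessing we assume (flattening, rescaling, and one-hot encoding, in the spirit of the mnist_mlp example cited above) is given below. HASYv2 is not bundled with Keras and must be downloaded separately from the URL above.

```python
import tensorflow as tf

# MNIST (use tf.keras.datasets.fashion_mnist for Fashion MNIST).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Flatten the 28 x 28 images and rescale pixel values to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255

# One-hot encode the 10 class labels for the categorical cross-entropy loss.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
```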
In this paper, the number of epochs elapsed in the process of training is denoted by $t$. We typically trained networks for very long periods, and, consequently, for most of their training, the networks were in the overfitting regime. However, since we are studying the training process of these networks and making no claims concerning the networks' ability to generalize to different data, overfitting does not affect our conclusions. In fact, our results are usually even stronger prior to overfitting. For similar reasons, we consider only the loss function of the networks (and not other metrics such as their accuracy), since it is the loss function that the networks are optimizing.
4. Statistics of Deviations of Weights from Initial Values
To illustrate the reduced scale of the deviations of weights during training, let us mark a network's initial configuration of weights using a mask in the shape of a letter, and observe how the marking evolves as the network is trained. Naturally, if the mark is still visible after the weights undergo a large number of updates and the network converges, it indicates that the training process does not shift the majority of the weights of a network far from their initial states.
Figure 1a shows typical results of training a large network whose initial configuration is marked with the letter 'a'. One can see that the letter is clearly visible after training for as many as 1000 epochs. In fact, one observes the initial mark during all of the network's training, without any sign that it will disappear. Even more surprisingly, these marks do not affect the quality of training. Independently of the shape marked (or whether there is a mark at all), the network trains to approximately the same loss across different realizations of the initial weights. This demonstrates that randomly initialized networks preserve features of their initial configuration throughout their whole training; these features are ultimately transferred into the networks' final applications.
Figure 1b demonstrates the opposite effect for midsize networks that cross over between the regimes of trainability and untrainability. As it illustrates, the initial configuration of weights of these unstable networks tends to be progressively lost, suffering the largest changes when the networks diverge (the loss function sharply increases at some moment).
By inspecting the distribution of the final (i.e., after training) values of the weights of the network of Figure 1a versus their initial values, portrayed in Figure 3, we see that weights that start with larger absolute values are more likely to suffer larger updates (in the direction that their sign points to). This trend can be observed in the plot by the tilt of the interquartile range (yellow region in the middle) with respect to the median (dotted line). The figure demonstrates that weights with initially large absolute values have a tendency to become even larger, keeping their original sign; it also shows the maximum concentration of weights near the identity line $w(t) = w(0)$, indicating that most weights change very little or not at all throughout training.
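The kind of summary shown in Figure 3 could be reproduced with a binned quantile analysis along the following lines; this is our own sketch of such an analysis, not the paper's actual plotting code.

```python
import numpy as np

def deviation_quantiles(w_initial, w_final, num_bins=50):
    """Median and interquartile range of final weights, binned by initial value.

    `w_initial` and `w_final` are flat arrays of a network's weights before
    and after training; each bin contains an equal number of weights.
    """
    order = np.argsort(w_initial)
    w0, wt = w_initial[order], w_final[order]
    bins = np.array_split(np.arange(w0.size), num_bins)
    centers = np.array([w0[b].mean() for b in bins])
    q25 = np.array([np.percentile(wt[b], 25) for b in bins])
    q50 = np.array([np.percentile(wt[b], 50) for b in bins])   # median
    q75 = np.array([np.percentile(wt[b], 75) for b in bins])
    return centers, q25, q50, q75
```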
This effect may be explained by the presence of a 'winning ticket' in the network's initialization. Our results suggest that the role of the over-parametrized initial configuration is decisive for successful training: when we reduce the level of over-parametrization to the point where the initial configuration stops containing such winning tickets, the network becomes untrainable by SGD.
The skewness in the distribution of the final weights can be explained by the randomness of the initial configuration, which initializes certain groups of weights with more appropriate values than others, making them better suited for certain features of the dataset. This subset of weights does not need to be particularly good, but as long as it provides slightly better or more consistent outputs than the rest of the weights, the learning process may favor its training, improving it further than the rest. Over the course of many epochs, the small preference that the learning algorithm keeps giving these weights adds up and causes them to become the best recognizers of the features that they initially, by chance, happened to be better at. Under this hypothesis, it is highly likely that weights with larger initial values are more prone to be deemed important by the learning algorithm, which will try to amplify their 'signal'. This effect resembles, for instance, the real-life effect of the month of birth in sports [35].
5. Evolution of Deviations of Weights and Trainability
One may understand the relationship between the success of training and the fine-tuning process observed in Section 4, during which a large fraction of the weights of a network suffer very tiny updates (and many are not changed at all), in the following way. We suggest that the neural networks typically trained are so over-parameterized that, when initialized at random, their initial configuration has a high probability of being close to a proper minimum (i.e., a global minimum where the training loss approaches zero). Hence, to reach such a minimum, the network needs to adjust its weights only slightly, which causes its final configuration of weights to retain strong traces of the initial configuration (in agreement with our observations).
This hypothesis raises the question of what happens when we train networks that have a small number of parameters. At some point, do they simply start to train worse? Or do they stop training at all? It turns out that, during the course of their training, neural networks cross over between two regimes: trainability and untrainability. The trainability region may be further split into two distinct regimes: a regime of high trainability, where training drives the networks towards global minima (with zero training loss), and a regime of low trainability, where the networks converge to sub-optimal minima of significantly higher loss. Only high trainability allows a network to train successfully (i.e., to be trainable), since in the remaining two regimes, untrainability and low trainability, either the networks do not learn at all or they learn very poorly.
Figure 4 illustrates these three regimes. Note that we use the terms trainability/untrainability to refer to the regimes of the training process in which the loss and the deviations of weights are, respectively, small/large. We reserve the terms trainable/untrainable to refer to the capability of a network to keep a low train loss after infinite training, which depends essentially on the network's architecture.
We measure the dependence of the time at which these crossovers happen on the network size and build a diagram showing the network's training regime for each network width and training time. This diagram, Figure 4c, resembles a phase diagram, although the variable $t$, the training time, is not a control parameter but rather a measure of the duration of the 'relaxation' process that SGD training represents. One may speak about a phase transition in these systems only in respect of their stationary state, that is, the regime in which they finally end up after being trained for a very long (infinite) time.
Figure 4c shows three characteristic instants (times) of training for each network width: (i) the time at which the minimum of the test loss occurs, (ii) the time of the minimum of the train loss, and (iii) the time at which the loss abruptly increases ('diverges'). Each of these times differs between runs, and, for some widths, these fluctuations are strong or even diverge. The points in this plot are the averages over ten independent runs; the error bars show the scale of the fluctuations between runs. Notice that times (ii) and (iii) approach infinity as the width approaches, from below, a threshold of about 300 nodes (which is specific to the network's architecture and dataset). Therefore, wide networks (≳300 nodes in each hidden layer) never cross over to the untrainability regime; they should stabilize in the trainability regime as $t \to \infty$. The untrainability region of the diagram exists only for widths smaller than the threshold, which is in the neighborhood of 300 nodes. Networks with such widths initially demonstrate a consistent decrease in the train loss. However, at some later moment during the training process, these systems abruptly cross over from the trainability regime, with small deviations of weights from their initial values and decreasing train loss, to the untrainability regime, with large loss and large deviations of weights.
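For concreteness, the three characteristic times could be extracted from recorded loss histories roughly as follows; the divergence criterion used here (train loss exceeding a multiple of its running minimum) is our own heuristic stand-in, not the paper's stated definition.

```python
import numpy as np

def characteristic_times(train_loss, test_loss, spike_factor=10.0):
    """Epoch indices of (i) the test-loss minimum, (ii) the train-loss
    minimum, and (iii) the divergence of the train loss, if any."""
    train_loss = np.asarray(train_loss)
    t_test_min = int(np.argmin(test_loss))               # (i)
    t_train_min = int(np.argmin(train_loss))             # (ii)
    running_min = np.minimum.accumulate(train_loss)
    spikes = np.flatnonzero(train_loss > spike_factor * running_min)
    t_diverge = int(spikes[0]) if spikes.size else None  # (iii)
    return t_test_min, t_train_min, t_diverge
```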
By gradually reducing the width and looking at the trainability regime in the limit of infinite training time, we find a phase transition from trainable to untrainable networks. In the diagram of Figure 4c, this transition corresponds to a horizontal line at $t = \infty$ or, equivalently, to the projection of the diagram on the horizontal axis (notice that the border between regimes is concave).
The phase diagram in Figure 4 (the bottom left panel) suggests the existence of three different classes of networks: weak, strong, and, in between them, unstable learners. Weak learners are small networks that, throughout their training, do not tend to diverge, but only train to poor minima. They mostly operate in the regime of low trainability, since they can be trained, but only to ill-suited solutions. On the other side of the spectrum of trainability are large networks. These are strong learners, as they train to very small loss values and are not prone to diverge (they operate mostly in the regime of high trainability). In between these two classes are the unstable learners: midsize networks that train to progressively better minima (as their size increases) but that, at some point in the course of their training, are likely to diverge and become untrainable (i.e., they tend to cross over from the regime of trainability to that of untrainability).
Remarkably, we observe the different regimes of operation of a network not only in the behavior of its loss function, but also in the distance it travels from its initial configuration of weights. We have already demonstrated in Figure 1 how the mark of the initial configuration of weights persists in large networks (i.e., strong learners that over the course of their training were always in the regime of high trainability), and vanishes for midsize networks that ultimately cross over to the regime of untrainability. In Appendix A, we supply a detailed description of the evolution of the statistics of weights during the training of the networks used to draw Figure 4.
Figure A1 and Figure A2 show that, as the network width is reduced, the highly structured correlation between initial and final weights, illustrated by the coincidence of the median with the line $w(t) = w(0)$ (see Figure 3), remains in effect in all layers of weights down to the trainability threshold. Below that point, the structure of the correlations eventually breaks down, given enough training time. The reliability of the observation of this breakdown in Figure A1 and Figure A2, for widths below ∼300, is reinforced by the robust fitting method based on cumulative distributions that is explained in Appendix B.
To quantitatively describe how distant a network becomes from its initial configuration of weights, we consider the root mean square deviation (RMSD) of its system of weights at time $t$ with respect to its initial configuration, i.e.,

$$\mathrm{RMSD}(t) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left[ w_j(t) - w_j(0) \right]^2},$$

where $m$ is the number of weights of the network (which depends on its width), and $w_j(t)$ is the weight of edge $j$ at time $t$.
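This quantity is straightforward to compute from weight snapshots; a sketch is given below, where we exclude bias vectors on the assumption that 'weights' in the definition above refers to the link weights only.

```python
import numpy as np

def rmsd(current_weights, initial_weights):
    """Root mean square deviation of a network's weights from their initial
    values; both arguments are lists as returned by model.get_weights()."""
    sq_sum, count = 0.0, 0
    for w0, wt in zip(initial_weights, current_weights):
        if w0.ndim < 2:            # skip bias vectors
            continue
        sq_sum += np.sum((wt - w0) ** 2)
        count += w0.size
    return np.sqrt(sq_sum / count)

# Usage: rmsd(model.get_weights(), initial_weights)
```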
Figure 5 plots, for the three datasets, the evolution of the loss function of networks of various widths alongside the deviation of their configuration of weights from its initial state. These plots evidence a link, described below, between the distance a network travels away from its initialization and the regime in which it is operating.
For all the datasets considered, the blue circles (•) show the training of networks that are weak learners; hence, they only achieve very high losses and continuously operate in a regime of low trainability. These networks experience very large deviations in their configuration of weights, moving further and further away from their initial state. In contrast, the red left triangles (◂) show the training of large networks that are strong learners (in fact, for MNIST, all the networks marked with triangles are strong learners; in our experiments, we could not identify unstable learners on this dataset). These networks always operate in the regime of high trainability, and over the course of their training they deviate only slightly from their initial configuration (compare with the results of Li and Liang [2]). Finally, for the Fashion MNIST and HASYv2 datasets, orange down (▾) and green up (▴) triangles show unstable networks of different widths (the former being smaller than the latter). While in the regime of trainability, these networks deviate much further from their initial configuration than strong learners (but less than weak learners). However, as they diverge and cross over into the untrainability regime (which we could only observe in networks trained on the HASYv2 dataset), the RMSD suffers a sharp increase and reaches a plateau. These observations highlight the persistent coupling between a network's trainability (measured as train loss) and the distance it travels away from its initial configuration (measured as RMSD), as well as their dependence on the network's width.
To complete the description of the behavior of these networks in the different regimes, Figure 6 plots, for networks of different widths, the time at which they reach a loss below a certain value, and the RMSD between their configuration of weights at that time and the initial one. It shows that networks that are small and operating in the low trainability regime fail to reach even moderate losses (e.g., on Fashion MNIST, no network of width 10 reaches a loss of 0.1, whereas networks of width 100 reach losses that are three orders of magnitude smaller). Moreover, even when they do reach these loss values, they take a significantly longer time to do so, as the plots for MNIST demonstrate. Finally, the figure also shows that, as the networks grow in size, the displacement each weight has to undergo for the network to reach a particular loss decreases, meaning that the networks progressively converge to minima that are closer to their initialization. We can treat this displacement as a measure of the work the optimization algorithm performs during the training of a network to make it reach a particular quality (i.e., value of loss). One can then say that using larger networks eases training by decreasing the average work the optimizer has to spend on each weight.
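The quantities of Figure 6 could be computed from recorded histories roughly as follows; this is our sketch, where `loss_history` and `rmsd_history` are assumed to hold one value per epoch.

```python
import numpy as np

def time_to_loss(loss_history, rmsd_history, threshold):
    """First epoch at which the train loss drops below `threshold`, and
    the RMSD from the initial weights at that epoch; returns (None, None)
    if the loss never reaches the threshold."""
    below = np.flatnonzero(np.asarray(loss_history) < threshold)
    if below.size == 0:
        return None, None
    t = int(below[0])
    return t, rmsd_history[t]
```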