1. Introduction
Training of neural networks is based on the progressive correction of their weights and biases (model parameters), performed by algorithms such as gradient descent, which compare actual outputs with the desired ones for a large set of input samples [1]. Consequently, the understanding of the internal operation of neural networks should be intrinsically based on detailed knowledge of the evolution of their weights in the process of training, starting from their initial configuration. Recently, Li and Liang [2] revealed that, during training, weights in neural networks only slightly deviate from their initial values in most practical scenarios. In this paper, we explore in detail how training changes the initial configuration of weights, and the relations between those changes and the effectiveness of the networks' function. We track the evolution of the weights of networks consisting of two Rectified Linear Unit (ReLU) hidden layers trained on three different classification tasks with Stochastic Gradient Descent (SGD), and measure how the distribution of deviations from an initial weight depends on that initial value. In all of our experiments, the results are consistent across the three tasks.
By experimenting with networks of different sizes, we have observed that, to reach an arbitrarily chosen loss value, the weights of larger networks tend to deviate less from their initial values than those of smaller networks. This suggests that larger networks tend to converge to minima which are closer to their initialization. On the other hand, we observe that for a certain range of network sizes, the deviations from initial weights abruptly increase at some moment during their training within the overfitting regime.
This effect is illustrated in Figure 1 by the persistence and disappearance of an initialization mask in panels (a) and (b), respectively, for two network sizes. (The letters are stamped onto a network's initial configuration of weights by creating a bitmap of the same shape as the weight matrix of the layer being marked, rasterizing the letter to the bitmap, and using the resulting binary mask to set to zero the weights lying outside the mark's area; see the sketch below.) We find that the sharp increase in the deviations of the weights closely correlates with the crossover between two regimes of the network, trainability and untrainability, occurring in the course of the training.
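The stamping procedure might be implemented along the following lines. This is a minimal sketch assuming NumPy and Pillow; the function name `stamp_letter`, the font choice, and the example layer size are ours, not taken from the paper.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def stamp_letter(weights, letter="a"):
    """Zero all weights lying outside a rasterized letter mask.

    `weights` is a 2-D NumPy array (one layer's weight matrix); the letter
    is rasterized to a bitmap of the same shape and used as a binary mask.
    """
    h, w = weights.shape
    bitmap = Image.new("L", (w, h), color=0)               # black background
    draw = ImageDraw.Draw(bitmap)
    try:                                                   # font size ~ matrix height
        font = ImageFont.truetype("DejaVuSans.ttf", int(0.9 * h))
    except OSError:
        font = ImageFont.load_default()
    draw.text((0, 0), letter, fill=255, font=font)
    mask = np.asarray(bitmap) > 0                          # True inside the letter
    return weights * mask                                  # zero outside the mark

# Example: mark a Glorot-initialized 1000 x 1000 layer with the letter 'a'.
rng = np.random.default_rng(0)
limit = np.sqrt(6 / (1000 + 1000))
W0 = stamp_letter(rng.uniform(-limit, limit, size=(1000, 1000)))
```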
The main contributions of this work are the following: (I) a quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes from their initial random configuration, and (II) a demonstration of the correlation between the magnitude of the deviations of weights and the successful training of a network. Recent works [2,3,4] showed that in highly over-parametrized networks, the training process amounts to a fine-tuning of the initial configuration of weights, significantly adjusting only a small portion of them. Our quantitative statistical characterization describes this phenomenon in greater detail and empirically verifies the small deviations that occur when the training process is successful. Furthermore, our analysis allows us to draw some insights regarding the training process of neural networks and paves the way for future research.
Our paper is organized as follows. In Section 2, we summarize some background topics on neural network initialization and review a series of recent papers related to ours. Section 3 presents the problem formulation, experimental settings, and datasets used in this paper. In Section 4, we explore the shape of the distribution of the deviations of weights from their initial values and its dependence on the initial weights. We continue these studies in Section 5 by experimenting with networks of different widths and find that, whenever a network's training is successful, the network does not travel far from its initial configuration. Finally, Section 6 provides concluding remarks and points out directions for future research.
3. Problem Formulation
Our aim in this work is to contribute to the conceptual understanding of the influence that random initializations have on the solutions of feedforward neural networks trained with SGD. In order to avoid undesirable effects specific to particular architectures, training methods, etc., we set up our experiments with very simple, vanilla settings.
We trained feedforward neural networks with two layers of hidden nodes (three layers of links) with all-to-all connectivity between adjacent layers. In our experiments, we vary the widths (i.e., the numbers of nodes) of the two hidden layers between 10 and 1000 nodes simultaneously, always keeping them equal to each other. The numbers of nodes of the input and output layers are determined by the dataset, specifically by the number of pixels of the input images and the number of classes, respectively. This architecture is largely based on the multilayer perceptron created by Keras for MNIST (
https://raw.githubusercontent.com/ShawDa/Keras-examples/master/mnist_mlp.py, accessed on 8 September 2021).
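A minimal sketch of this architecture in Keras might look as follows; the function name `build_mlp` and the default argument values are ours. Note that Keras Dense layers use Glorot uniform initialization by default, matching the setup described below.

```python
import tensorflow as tf

def build_mlp(width, num_pixels=784, num_classes=10):
    """Two equal-width ReLU hidden layers with all-to-all connectivity,
    followed by a softmax output layer; defaults match 28 x 28 inputs
    and 10 classes."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation="relu",
                              input_shape=(num_pixels,)),
        tf.keras.layers.Dense(width, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```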
Let us denote the weight of the link connecting node $i$ in a given layer to node $j$ in the next layer by $w_{ij}$. The output of node $j$ in the hidden and output layers, denoted by $o_j$, is determined by an activation function $f$ of the weighted sum of the outputs of the previous layer plus the node's bias, $b_j$, as

$$o_j = f\Big(\sum_i w_{ij} o_i + b_j\Big).$$

The nodes of the two hidden layers employ the Rectified Linear Unit (ReLU) activation function

$$\mathrm{ReLU}(x) = \max(0, x).$$
The ReLU is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
The nodes in the output layer employ the softmax activation function

$$\mathrm{softmax}(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}}, \quad j = 1, \ldots, K,$$

where $K$ is the number of elements in the input vector (i.e., the number of classes of the dataset). The softmax activation function is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network, to normalize the output of the network to a probability distribution over predicted output classes.
Unless otherwise stated, the biases of the networks are initialized at zero and the weights are initialized with Glorot's uniform initialization [7]:

$$w_{ij} \sim U\left(-\sqrt{\frac{6}{m+n}}, \; \sqrt{\frac{6}{m+n}}\right),$$

where $U(a, b)$ is the uniform distribution in the interval $(a, b)$, and $m$ and $n$ are the numbers of units of the two layers that weight $w_{ij}$ connects. In some of our experiments, we apply various masks to these uniformly distributed weights, setting to zero all weights $w_{ij}$ not covered by a mask (see Figure 1).
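Applying such a mask to an already-initialized Keras layer could be sketched as below; `apply_mask` is a hypothetical helper of ours (a letter-shaped mask could be produced as in the `stamp_letter` sketch of Section 1).

```python
import numpy as np

def apply_mask(layer, mask):
    """Set to zero all weights of a Keras Dense layer not covered by `mask`.

    `mask` is a binary NumPy array with the same shape as the layer's
    kernel: 1 keeps a weight, 0 zeroes it; biases are left untouched.
    """
    kernel, bias = layer.get_weights()
    layer.set_weights([kernel * np.asarray(mask), bias])

# Usage (assuming `model` built as in the previous sketch):
# apply_mask(model.layers[0], letter_mask)
```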
The loss function to be minimized is the categorical cross-entropy, i.e.,

$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i,$$

where $C$ is the number of output classes, $y_i$ is the $i$-th target output, and $\hat{y}_i$ is the
$i$-th output of the network. The neural networks were optimized with Stochastic Gradient Descent with a learning rate of 0.1 and in mini-batches of size 128. The networks were defined and trained in Keras [30] using its TensorFlow [31] back-end.
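Putting the pieces together, the training setup described here might be sketched as follows; `build_mlp` is the sketch above, and `x_train`/`y_train` are assumed to be flattened, one-hot-encoded data (a loading sketch follows in the datasets discussion below).

```python
import tensorflow as tf

model = build_mlp(width=1000)                 # sketch defined above
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # lr = 0.1
    loss="categorical_crossentropy",          # the loss defined above
)
initial_weights = model.get_weights()         # snapshot for later comparison
history = model.fit(x_train, y_train,
                    batch_size=128,           # mini-batches of size 128
                    epochs=1000,
                    validation_data=(x_test, y_test))
```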
Throughout this paper, we use three datasets to train our networks: MNIST, Fashion MNIST, and HASYv2; Figure 2 displays samples from each. These are among the most standard datasets used in research on supervised machine learning.
MNIST (http://yann.lecun.com/exdb/mnist/, accessed on 5 September 2021) [32] is a database of gray-scale handwritten digits. It consists of $6 \times 10^4$ training and $10^4$ test images of size 28 × 28, each showing one of the numerals 0 to 9. It was chosen due to its popularity and widespread familiarity.
Fashion MNIST (https://github.com/zalandoresearch/fashion-mnist, accessed on 5 September 2021) [33] is a dataset intended to be a drop-in replacement for the original MNIST dataset in machine learning experiments. It features 10 classes of clothing categories (e.g., coat, shirt, etc.) and is otherwise very similar to MNIST. It also consists of 28 × 28 gray-scale images, $6 \times 10^4$ samples for training and $10^4$ for testing.
HASYv2 (https://github.com/MartinThoma/HASY, accessed on 5 September 2021) [34] is a dataset of 32 × 32 binary images of handwritten symbols (mostly LaTeX symbols, such as ∫). It differs from the previous two datasets mainly in that it has many more classes (369) and is much larger.
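MNIST and Fashion MNIST can be loaded directly through Keras; a sketch of the preprocessing we assume (flattening, rescaling, and one-hot encoding, in the spirit of the mnist_mlp example cited above) is given below. HASYv2 is not bundled with Keras and must be downloaded separately from the URL above.

```python
import tensorflow as tf

# MNIST (use tf.keras.datasets.fashion_mnist for Fashion MNIST).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Flatten the 28 x 28 images and rescale pixel values to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255
x_test = x_test.reshape(-1, 784).astype("float32") / 255

# One-hot encode the 10 class labels for the categorical cross-entropy loss.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
```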
In this paper, the number of epochs elapsed in the process of training is denoted by $t$. We typically trained networks for very long periods, and, consequently, for most of their training, the networks were in the overfitting regime. However, since we are studying the training process of these networks and making no claims concerning the networks' ability to generalize to different data, overfitting does not affect our conclusions. In fact, our results are usually even stronger prior to overfitting. For similar reasons, we consider only the loss function of the networks (and not other metrics such as their accuracy), since it is the loss function that the networks are optimizing.
4. Statistics of Deviations of Weights from Initial Values
To illustrate the reduced scale of the deviations of weights during training, let us mark a network's initial configuration of weights using a mask in the shape of a letter, and observe how the marking evolves as the network is trained. Naturally, if the mark is still visible after the weights undergo a large number of updates and the network converges, it indicates that the training process does not shift the majority of the weights of a network far from their initial states.
Figure 1a shows typical results of training a large network whose initial configuration is marked with the letter 'a'. One can see that the letter is clearly visible after training for as many as 1000 epochs. In fact, one observes the initial mark during all of the network's training, without any sign that it will disappear. Even more surprisingly, these marks do not affect the quality of training. Independently of the shape marked (or whether there is a mark at all), the network trains to approximately the same loss across different realizations of the initial weights. This demonstrates that randomly initialized networks preserve features of their initial configuration throughout their whole training; these features are ultimately transferred into the networks' final applications.
Figure 1b demonstrates the opposite effect for midsize networks that cross over between the regimes of trainability and untrainability. As it illustrates, the initial configuration of weights of these unstable networks tends to be progressively lost, suffering the largest changes when the networks diverge (the loss function sharply increases at some moment).
By inspecting the distribution of the final (i.e., after training) values of the weights of the network of Figure 1a versus their initial values, portrayed in Figure 3, we see that weights that start with larger absolute values are more likely to suffer larger updates (in the direction that their sign points to). This trend can be observed in the plot by the tilt of the interquartile range (yellow region in the middle) with respect to the median (dotted line). The figure demonstrates that weights with initially large absolute values have a tendency to become even larger, keeping their original sign; it also shows the maximum concentration of weights near the identity line $w(t) = w(0)$, indicating that most weights change very little or not at all throughout training.
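The kind of summary shown in Figure 3 could be reproduced with a binned quantile analysis along the following lines; this is our own sketch of such an analysis, not the paper's actual plotting code.

```python
import numpy as np

def deviation_quantiles(w_initial, w_final, num_bins=50):
    """Median and interquartile range of final weights, binned by initial value.

    `w_initial` and `w_final` are flat arrays of a network's weights before
    and after training; each bin contains an equal number of weights.
    """
    order = np.argsort(w_initial)
    w0, wt = w_initial[order], w_final[order]
    bins = np.array_split(np.arange(w0.size), num_bins)
    centers = np.array([w0[b].mean() for b in bins])
    q25 = np.array([np.percentile(wt[b], 25) for b in bins])
    q50 = np.array([np.percentile(wt[b], 50) for b in bins])   # median
    q75 = np.array([np.percentile(wt[b], 75) for b in bins])
    return centers, q25, q50, q75
```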
This effect may be explained by the presence of a 'winning ticket' in the network's initialization. Our results suggest that the role of the over-parametrized initial configuration is decisive for successful training: when we reduce the level of over-parametrization to the point where the initial configuration stops containing such winning tickets, the network becomes untrainable by SGD.
The skewness in the distribution of the final weights can be explained by the randomness of the initial configuration, which initializes certain groups of weights with more appropriate values than others, making them better suited for certain features of the dataset. This subset of weights does not need to be particularly good, but as long as it provides slightly better or more consistent outputs than the rest of the weights, the learning process may favor its training, improving it further than the rest. Over the course of many epochs, the small preference that the learning algorithm keeps giving these weights adds up and causes them to become the best recognizers of the features that they initially, by chance, happened to be better at. Under this hypothesis, it is highly likely that weights with larger initial values are more prone to be deemed important by the learning algorithm, which will try to amplify their 'signal'. This effect resembles, for instance, the real-life effect of the month of birth in sports [35].
5. Evolution of Deviations of Weights and Trainability
One may understand the relationship between the success of training and the fine-tuning process observed in Section 4, during which a large fraction of the weights of a network suffer very tiny updates (and many are not changed at all), in the following way. We suggest that the neural networks typically trained are so over-parameterized that, when initialized at random, their initial configuration has a high probability of being close to a proper minimum (i.e., a global minimum where the training loss approaches zero). Hence, to reach such a minimum, the network needs to adjust its weights only slightly, which causes its final configuration of weights to retain strong traces of the initial configuration (in agreement with our observations).
This hypothesis raises the question of what happens when we train networks that have a small number of parameters. At some point, do they simply start to train worse? Or do they stop training at all? It turns out that, during the course of their training, neural networks cross over between two regimes: trainability and untrainability. The trainability region may be further split into two distinct regimes: a regime of high trainability, where training drives the networks towards global minima (with zero training loss), and a regime of low trainability, where the networks converge to sub-optimal minima of significantly higher loss. Only high trainability allows a network to train successfully (i.e., to be trainable), since in the remaining two regimes, untrainability and low trainability, either the networks do not learn at all or they learn very poorly.
Figure 4 illustrates these three regimes. Note that we use the terms trainability/untrainability to refer to the regimes of the training process in which the loss and the deviations of weights are, respectively, small/large. We reserve the terms trainable/untrainable to refer to the capability of a network to keep a low train loss after infinite training, which depends essentially on the network's architecture.
We measure the dependence of the time at which these crossovers happen on the network size and build a diagram showing the network's training regime for each network width and training time. This diagram, Figure 4c, resembles a phase diagram, although the variable $t$, the training time, is not a control parameter but rather a measure of the duration of the 'relaxation' process that SGD training represents. One may speak about a phase transition in these systems only in respect of their stationary state, that is, the regime in which they finally end up after being trained for a very long (infinite) time.
Figure 4c shows three characteristic instants (times) of training for each network width: (i) the time at which the minimum of the test loss occurs, (ii) the time of the minimum of the train loss, and (iii) the time at which the loss abruptly increases ('diverges'). Each of these times differs between runs, and, for some widths, these fluctuations are strong or even diverge. The points in this plot are the averages over ten independent runs; the error bars show the scale of the fluctuations between runs. Notice that times (ii) and (iii) approach infinity as the width approaches, from below, a threshold of about 300 nodes (which is specific to the network's architecture and dataset). Therefore, wide networks (≳300 nodes in each hidden layer) never cross over to the untrainability regime; they should stabilize in the trainability regime as $t \to \infty$. The untrainability region of the diagram exists only for widths smaller than the threshold, which is in the neighborhood of 300 nodes. Networks with such widths initially demonstrate a consistent decrease in the train loss. However, at some later moment during the training process, these systems abruptly cross over from the trainability regime, with small deviations of weights from their initial values and decreasing train loss, to the untrainability regime, with large loss and large deviations of weights.
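For concreteness, the three characteristic times could be extracted from recorded loss histories roughly as follows; the divergence criterion used here (train loss exceeding a multiple of its running minimum) is our own heuristic stand-in, not the paper's stated definition.

```python
import numpy as np

def characteristic_times(train_loss, test_loss, spike_factor=10.0):
    """Epoch indices of (i) the test-loss minimum, (ii) the train-loss
    minimum, and (iii) the divergence of the train loss, if any."""
    train_loss = np.asarray(train_loss)
    t_test_min = int(np.argmin(test_loss))               # (i)
    t_train_min = int(np.argmin(train_loss))             # (ii)
    running_min = np.minimum.accumulate(train_loss)
    spikes = np.flatnonzero(train_loss > spike_factor * running_min)
    t_diverge = int(spikes[0]) if spikes.size else None  # (iii)
    return t_test_min, t_train_min, t_diverge
```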
By gradually reducing the width and looking at the trainability regime in the limit of infinite training time, we find a phase transition from trainable to untrainable networks. In the diagram of Figure 4c, this transition corresponds to a horizontal line at $t = \infty$ or, equivalently, to the projection of the diagram on the horizontal axis (notice that the border between regimes is concave).
The phase diagram in Figure 4 (the bottom left panel) suggests the existence of three different classes of networks: weak, strong, and, in between them, unstable learners. Weak learners are small networks that, throughout their training, do not tend to diverge, but only train to poor minima. They mostly operate in the regime of low trainability, since they can be trained, but only to ill-suited solutions. On the other side of the spectrum of trainability are large networks. These are strong learners, as they train to very small loss values and are not prone to diverge (they operate mostly in the regime of high trainability). In between these two classes are the unstable learners: midsize networks that train to progressively better minima (as their size increases) but that, at some point in the course of their training, are likely to diverge and become untrainable (i.e., they tend to cross over from the regime of trainability to that of untrainability).
Remarkably, we observe the different regimes of operation of a network not only in the behavior of its loss function, but also in the distance it travels from its initial configuration of weights. We have already demonstrated in Figure 1 how the mark of the initial configuration of weights persists in large networks (i.e., strong learners that over the course of their training were always in the regime of high trainability), and vanishes for midsize networks that ultimately cross over to the regime of untrainability. In Appendix A, we supply a detailed description of the evolution of the statistics of weights during the training of the networks used to draw Figure 4.
Figure A1 and Figure A2 show that, as the network width is reduced, the highly structured correlation between initial and final weights, illustrated by the coincidence of the median with the line $w(t) = w(0)$ (see Figure 3), remains in effect in all layers of weights down to the trainability threshold. Below that point, the structure of the correlations eventually breaks down, given enough training time. The reliability of the observation of this breakdown in Figure A1 and Figure A2, for widths below ∼300, is reinforced by the robust fitting method based on cumulative distributions that is explained in Appendix B.
To quantitatively describe how distant a network becomes from its initial configuration of weights, we consider the root mean square deviation (RMSD) of its system of weights at time $t$ with respect to its initial configuration, i.e.,

$$\mathrm{RMSD}(t) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left[ w_j(t) - w_j(0) \right]^2},$$

where $m$ is the number of weights of the network (which depends on its width), and $w_j(t)$ is the weight of edge $j$ at time $t$.
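This quantity is straightforward to compute from weight snapshots; a sketch is given below, where we exclude bias vectors on the assumption that 'weights' in the definition above refers to the link weights only.

```python
import numpy as np

def rmsd(current_weights, initial_weights):
    """Root mean square deviation of a network's weights from their initial
    values; both arguments are lists as returned by model.get_weights()."""
    sq_sum, count = 0.0, 0
    for w0, wt in zip(initial_weights, current_weights):
        if w0.ndim < 2:            # skip bias vectors
            continue
        sq_sum += np.sum((wt - w0) ** 2)
        count += w0.size
    return np.sqrt(sq_sum / count)

# Usage: rmsd(model.get_weights(), initial_weights)
```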
Figure 5 plots, for the three datasets, the evolution of the loss function of networks of various widths alongside the deviation of their configuration of weights from its initial state. These plots evidence a link, described below, between the distance a network travels away from its initialization and the regime in which it is operating.
For all the datasets considered, the blue circles (•) show the training of networks that are weak learners; hence, they only achieve very high losses and continuously operate in a regime of low trainability. These networks experience very large deviations in their configuration of weights, moving further and further away from their initial state. In contrast, the red left triangles (◂) show the training of large networks that are strong learners (in fact, for MNIST, all the networks marked with triangles are strong learners; in our experiments, we could not identify unstable learners on this dataset). These networks always operate in the regime of high trainability, and over the course of their training they deviate only slightly from their initial configuration (compare with the results of Li and Liang [2]). Finally, for the Fashion MNIST and HASYv2 datasets, orange down (▾) and green up (▴) triangles show unstable networks of different widths (the former being smaller than the latter). While in the regime of trainability, these networks deviate much further from their initial configuration than strong learners (but less than weak learners). However, as they diverge and cross over into the untrainability regime (which we could only observe in networks trained on the HASYv2 dataset), the RMSD suffers a sharp increase and reaches a plateau. These observations highlight the persistent coupling between a network's trainability (measured as train loss) and the distance it travels away from its initial configuration (measured as RMSD), as well as their dependence on the network's width.
To complete the description of the behavior of these networks in the different regimes, Figure 6 plots, for networks of different widths, the time at which they reach a loss below a certain value, and the RMSD between their configuration of weights at that time and the initial one. It shows that networks that are small and operating in the low trainability regime fail to reach even moderate losses (e.g., on Fashion MNIST, no network of width 10 reaches a loss of 0.1, whereas networks of width 100 reach losses that are three orders of magnitude smaller). Moreover, even when they do reach these loss values, they take a significantly longer time to do so, as the plots for MNIST demonstrate. Finally, the figure also shows that, as the networks grow in size, the displacement each weight has to undergo for the network to reach a particular loss decreases, meaning that the networks progressively converge to minima that are closer to their initialization. We can treat this displacement as a measure of the work the optimization algorithm performs during the training of a network to make it reach a particular quality (i.e., value of loss). One can then say that using larger networks eases training by decreasing the average work the optimizer has to spend on each weight.
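The quantities of Figure 6 could be computed from recorded histories roughly as follows; this is our sketch, where `loss_history` and `rmsd_history` are assumed to hold one value per epoch.

```python
import numpy as np

def time_to_loss(loss_history, rmsd_history, threshold):
    """First epoch at which the train loss drops below `threshold`, and
    the RMSD from the initial weights at that epoch; returns (None, None)
    if the loss never reaches the threshold."""
    below = np.flatnonzero(np.asarray(loss_history) < threshold)
    if below.size == 0:
        return None, None
    t = int(below[0])
    return t, rmsd_history[t]
```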