Abstract
Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, curvature is estimated from a local quadratic model, using only noisy gradient approximations. The result is a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
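For context, recall the classical (deterministic) Barzilai-Borwein step-size alluded to above (see [6]); this is background, not the paper's Algorithm 2: with \(s_{k-1}=\theta _k-\theta _{k-1}\) and \(y_{k-1}=\nabla {\mathcal {J}}(\theta _k)-\nabla {\mathcal {J}}(\theta _{k-1})\) (where \({\mathcal {J}}\) denotes the objective),
\[
\gamma _k^{\mathrm {BB}} = \frac{\langle s_{k-1}, y_{k-1}\rangle }{\langle y_{k-1}, y_{k-1}\rangle },
\]
which estimates the inverse of the local curvature along the last step; the method above can be seen as a mini-batch counterpart of this rule.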
Notes
For a fair comparison, we implement this method with the scaling factor \(\alpha \) of Algorithm 1.
There is also the possibility of computing additional estimates as [41] previously did for a stochastic BFGS algorithm, but this would double the computational cost.
Step-Tuned SGD achieves the same small level of error as SGD when run for additional epochs, thanks to the decay schedule in Algorithm 2.
Default values: \((\nu ,\beta ,{\tilde{m}},{\tilde{M}},\delta ) = (2,0.9,0.5,2,0.001)\).
An alternative common practice consists in manually decaying the step-size at pre-defined epochs. Although efficient in practice for achieving state-of-the-art results, this technique makes the comparison of algorithms harder; hence we stick to a usual Robbins-Monro type of decay.
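To make the role of these quantities concrete, here is a minimal sketch (our own naming; the value of \(\alpha \) below is a placeholder, and this is not the authors' code) of a clipped, Robbins-Monro-decayed step-size of the form \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\) used in the analysis, with the curvature-based factor \(\gamma _k\) kept in \([{\tilde{m}},{\tilde{M}}]\):

```python
def step_size(k, gamma_k, alpha=0.01, m_tilde=0.5, M_tilde=2.0, delta=0.001):
    """Clipped, decaying step-size: eta_k = alpha * gamma_k * (k+1)^-(1/2+delta)."""
    gamma_clipped = min(max(gamma_k, m_tilde), M_tilde)  # keep curvature factor in [m_tilde, M_tilde]
    decay = (k + 1) ** (-(0.5 + delta))                  # Robbins-Monro type decay
    return alpha * gamma_clipped * decay

# e.g. step_size(0, 1.3) == 0.013 since the decay factor equals 1 at k = 0
```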
References
Alber YI, Iusem AN, Solodov MV (1998) On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Math Program 81(1):23–35
Allen-Zhu Z (2018) Natasha 2: faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems (NIPS), pp 2675–2686
Alvarez F, Cabot A (2004) Steepest descent with curvature dynamical system. J Optim Theory Appl 120(2):247–273
Babaie-Kafaki S, Fatemi M (2013) A modified two-point stepsize gradient algorithm for unconstrained minimization. Optim Methods Softw 28(5):1040–1050
Barakat A, Bianchi P (2018) Convergence of the ADAM algorithm from a dynamical system viewpoint. arXiv:1810.02263
Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148
Bertsekas DP, Hager W, Mangasarian O (1998) Nonlinear programming. Athena Scientific, Belmont, MA
Biglari F, Solimanpur M (2013) Scaling on the spectral gradient method. J Optim Theory Appl 158:626–635
Bolte J, Pauwels E (2020) A mathematical model for automatic differentiation in machine learning. In: Advances in Neural Information Processing Systems (NIPS)
Carmon Y, Duchi JC, Hinder O, Sidford A (2017) Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the International Conference on Machine Learning (ICML), pp 654–663
Castera C, Bolte J, Févotte C, Pauwels E (2021) An inertial Newton algorithm for deep learning. J Mach Learn Res 22(134):1–31
Curtis FE, Guo W (2016) Handling nonpositive curvature in a limited memory steepest descent method. IMA J Numer Anal 36(2):717–742
Curtis FE, Robinson DP (2019) Exploiting negative curvature in deterministic and stochastic optimization. Math Program 176(1–2):69–94
Dai Y, Yuan J, Yuan YX (2002) Modified two-point stepsize gradient methods for unconstrained optimization. Comput Optim Appl 22(1):103–109
Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Found Comput Math 20(1):119–154
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7)
Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J Optim 28(4):3229–3259
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
Idelbayev Y (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456
Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp 315–323
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR)
Krishnan S, Xiao Y, Saurous RA (2018) Neumann optimizer: a practical optimization algorithm for deep neural networks. In: Proceedings of the International Conference on Learning Representations (ICLR)
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep, Canadian Institute for Advanced Research
LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
LeCun Y, Cortes C, Burges C (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist
Li X, Orabona F (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp 983–992
Liang J, Xu Y, Bao C, Quan Y, Ji H (2019) Barzilai-Borwein-based adaptive learning rate for deep learning. Pattern Recognit Lett 128:197–203
Lin M, Chen Q, Yan S (2013) Network in network. arXiv:1312.4400
Liu M, Yang T (2017) On noisy negative curvature descent: Competing with gradient descent for faster non-convex optimization. arXiv:1709.08571
Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the International Conference on Machine Learning (ICML), pp 2408–2417
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NIPS), pp 8026–8037
Raydan M (1997) The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J Optim 7(1):26–33
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(1):400–407
Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, Elsevier, pp 233–257
Robles-Kelly A, Nazari A (2019) Incorporating the Barzilai-Borwein adaptive step size into subgradient methods for deep network training. In: 2019 Digital Image Computing: Techniques and Applications (DICTA), pp 1–6
Rossum G (1995) Python reference manual. CWI (Centre for Mathematics and Computer Science)
Royer CW, Wright SJ (2018) Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J Optim 28(2):1448–1477
Schraudolph NN, Yu J, Günter S (2007) A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)
Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems (NIPS), pp 685–693
Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31
van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NIPS), pp 4148–4158
Xiao Y, Wang Q, Wang D (2010) Notes on the Dai-Yuan-Yuan modified spectral gradient method. J Comput Appl Math 234(10):2986–2992
Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in Neural Information Processing Systems (NIPS), vol 33
Acknowledgements
The authors acknowledge the support of the European Research Council (ERC FACTORY-CoG-6681839), the Agence Nationale de la Recherche (ANR 3IA-ANITI, ANR-17-EURE-0010 CHESS, ANR-19-CE23-0017 MASDOL) and the Air Force Office of Scientific Research (FA9550-18-1-0226). Part of the numerical experiments was carried out using the OSIRIM platform of IRIT, supported by the CNRS, the FEDER, Région Occitanie and the French government (http://osirim.irit.fr/site/en). We thank the development teams of the following libraries that were used in the experiments: Python [39], NumPy [44], Matplotlib [20], PyTorch [34], and the PyTorch implementation of ResNets from [21]. We thank Emmanuel Soubies and Sixin Zhang for useful discussions and Sébastien Gadat for pointing out flaws in the original proof.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
The last three authors are listed in alphabetical order.
Appendices
Appendix A: Details About Deep Learning Experiments
In addition to the experimental methodology described in Sect. 5.1, we provide in Table 1 a summary of each problem considered.
In the DL experiments of Sect. 5, we display the training error and the test accuracy of each algorithm as a function of the number of stochastic gradient estimates computed. Due to their adaptive procedures, ADAM, RMSprop and Step-Tuned SGD have additional sub-routines compared to SGD. Thus, in Table 2 we additionally provide the wall-clock time per epoch of these methods relative to SGD. Unlike the number of back-propagations performed, wall-clock time depends on many factors: the network and datasets considered, the computer used, and most importantly, the implementation. Regarding implementation, we emphasize that we used the versions of SGD, ADAM and RMSprop provided in PyTorch, which are fully optimized (and in particular parallelized). Table 2 indicates that Step-Tuned SGD is slower than the other adaptive methods for large networks, but this is due to our non-parallel implementation. On small networks (where the benefit of parallel computing is small), running Step-Tuned SGD for one epoch is actually faster than running SGD. As a conclusion, the number of back-propagations is a more suitable metric for comparing the algorithms, and all methods considered require a single back-propagation per iteration.
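For illustration, here is a minimal sketch (placeholder names, not the benchmarking code behind Table 2) of the two metrics discussed above, namely the number of back-propagations and the wall-clock time of one epoch:

```python
import time

def run_epoch(model, loader, optimizer, loss_fn):
    """Run one training epoch; return (#back-propagations, wall-clock seconds)."""
    n_backprops = 0
    start = time.perf_counter()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()        # one back-propagation per mini-batch
        optimizer.step()
        n_backprops += 1
    return n_backprops, time.perf_counter() - start
```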
Appendix B: Proof of the Theoretical Results
We state a lemma that we will use to prove Theorem 1.
1.1 Preliminary Lemma
The result is the following.
Lemma 1
([1, Proposition 2]) Let \((u_k)_{k\in {\mathbb {N}}}\) and \((v_k)_{k\in {\mathbb {N}}}\) be two non-negative real sequences. Assume that \(\sum _{k=0}^{+\infty } u_k v_k <+\infty \) and \(\sum _{k=0}^{+\infty } v_k =+\infty \). If there exists a constant \(C>0\) such that \(\vert u_{k+1} - u_k \vert \le C v_k\) for all \(k\in {\mathbb {N}}\), then \(u_k\xrightarrow [k\rightarrow +\infty ]{}0\).
1.2 Proof of the Main Theorem
We can now prove Theorem 1.
Proof of Theorem 1
We first clarify the random process induced by the draw of the mini-batches. Algorithm 2 takes a sequence of mini-batches as input. This sequence is represented by the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\) as described in Sect. 3.2. Each of these random variables is independent of the others. In particular, for \(k\in {\mathbb {N}}_{>0}\), \({\mathsf {B}}_k\) is independent of the previous mini-batches \({\mathsf {B}}_0,\ldots , {\mathsf {B}}_{k-1}\). For convenience, we denote by \(\underline{{\mathsf {B}}}_k = \left\{ {\mathsf {B}}_0,\ldots ,{\mathsf {B}}_k\right\} \) the mini-batches up to iteration k. Due to the randomness of the mini-batches, the algorithm is a random process as well. As such, \(\theta _{k}\) is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\) and is independent of \({\mathsf {B}}_k\). However, \(\theta _{k+\frac{1}{2}}\) and \({\mathsf {B}}_{k}\) are not independent. Similarly, \(\gamma _k\) is constructed as a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\), hence independent of \({\mathsf {B}}_k\). This dependency structure will be crucial for deriving and bounding conditional expectations. Finally, we highlight the following important identity: for any \(k\in {\mathbb {N}}_{>0}\),
\[
{\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k}) \,\middle \vert \, \underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _{k}).
\]
Indeed, the iterate \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), so taking the expectation over \({\mathsf {B}}_k\), which is independent of \(\underline{{\mathsf {B}}}_{k-1}\), we recover the full gradient of \({\mathcal {J}}\) as the distribution of \({\mathsf {B}}_k\) is the same as that of \({\mathsf {S}}\) in Sect. 3.2. Notice in addition that a similar identity does not hold for \(\theta _{k+\frac{1}{2}}\) (as it depends on \({\mathsf {B}}_k\)).
We now provide estimates that will be used extensively in the rest of the proof. The gradient of the loss function \(\nabla {\mathcal {J}}\) is locally Lipschitz continuous as \({\mathcal {J}}\) is twice continuously differentiable. By assumption, there exists a compact convex set \({\mathsf {C}}\subset {\mathbb {R}}^P\), such that with probability 1, the sequence of iterates \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) belongs to \({\mathsf {C}}\). Therefore, by local Lipschitz continuity, the restriction of \(\nabla {\mathcal {J}}\) to \({\mathsf {C}}\) is Lipschitz continuous on \({\mathsf {C}}\). Similarly, each \(\nabla {\mathcal {J}}_n\) is also Lipschitz continuous on \({\mathsf {C}}\). We denote by \(L>0\) a Lipschitz constant common to each \(\nabla {\mathcal {J}}_n\), \(n=1,\ldots , N\). Notice that the Lipschitz continuity is preserved by averaging; in other words,
\[
\Vert \nabla {\mathcal {J}}(\theta ) - \nabla {\mathcal {J}}(\theta ')\Vert \le L \Vert \theta - \theta '\Vert \quad \text {for all } \theta , \theta '\in {\mathsf {C}}. \qquad \mathrm {(26)}
\]
In addition, using the continuity of the \(\nabla {\mathcal {J}}_n\)’s, there exists a constant \(C_2>0\) such that,
\[
\Vert \nabla {\mathcal {J}}_{n}(\theta )\Vert \le C_2 \quad \text {for all } \theta \in {\mathsf {C}} \text { and } n=1,\ldots ,N. \qquad \mathrm {(27)}
\]
By averaging, the same bound holds on \({\mathsf {C}}\) for \(\nabla {\mathcal {J}}\) and for any mini-batch gradient estimate \(\nabla {\mathcal {J}}_{{\mathsf {B}}}\).
Finally, for a function \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) with L-Lipschitz continuous gradient, we recall the following inequality, called the descent lemma (see for example [7, Proposition A.24]): for any \(\theta \in {\mathbb {R}}^P\) and any \(d\in {\mathbb {R}}^P\),
\[
g(\theta + d) \le g(\theta ) + \langle \nabla g(\theta ), d\rangle + \frac{L}{2}\Vert d\Vert ^2. \qquad \mathrm {(28)}
\]
In our case, since we only have the L-Lipschitz continuity of \(\nabla {\mathcal {J}}\) on the convex set \({\mathsf {C}}\), we have a similar bound for \({\mathcal {J}}\) on \({\mathsf {C}}\): for any \(\theta \in {\mathsf {C}}\) and any \(d\in {\mathbb {R}}^P\) such that \(\theta +d\in {\mathsf {C}}\),
\[
{\mathcal {J}}(\theta + d) \le {\mathcal {J}}(\theta ) + \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \frac{L}{2}\Vert d\Vert ^2. \qquad \mathrm {(29)}
\]
Let \(\theta _0\in {\mathbb {R}}^P\) and let \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) be a sequence generated by Algorithm 2 initialized at \(\theta _0\). By assumption this sequence belongs to \({\mathsf {C}}\) almost surely. To simplify, for \(k\in {\mathbb {N}}\), we denote \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\). Fix an iteration \(k\in {\mathbb {N}}\); we can use (29) with \(\theta = \theta _k\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\), so that, almost surely (with respect to the boundedness assumption),
\[
{\mathcal {J}}(\theta _{k+\frac{1}{2}}) \le {\mathcal {J}}(\theta _{k}) - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k})\rangle + \frac{L\eta _k^2}{2} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k})\Vert ^2. \qquad \mathrm {(30)}
\]
Similarly, with \(\theta = \theta _{k+\frac{1}{2}}\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\), almost surely,
\[
{\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k+\frac{1}{2}}) - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\rangle + \frac{L\eta _k^2}{2} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2. \qquad \mathrm {(31)}
\]
We combine (30) and (31), almost surely,
Using the boundedness assumption and (27), almost surely,
So almost surely,
Then, taking the conditional expectation of (34) with respect to \({\mathsf {B}}_k\), conditionally on \(\underline{{\mathsf {B}}}_{k-1}\) (the mini-batches used up to iteration \(k-1\)), we obtain,
As explained at the beginning of the proof, \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\); thus, the terms depending only on \(\theta _{k}\) can be taken out of the conditional expectation. Similarly, by construction, \(\eta _k\) is independent of the current mini-batch \({\mathsf {B}}_k\): it is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\). Hence, (35) reads,
Then, we use the fact that \({\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k}) \,\middle \vert \, \underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _{k})\). Overall, we obtain,
We will now bound the last term of (37). First we write,
Using the Cauchy-Schwarz inequality, as well as (26) and (27), almost surely,
Hence,
We perform similar computations on the last term of (40), almost surely
Finally we obtain by combining (38), (40) and (41), almost surely,
Going back to the last term of (37), we have, taking the conditional expectation of (42), almost surely
In the end we obtain, for an arbitrary iteration \(k\in {\mathbb {N}}\), almost surely
To simplify, we assume that \({{\tilde{M}}}\ge \nu \) (otherwise set \({\tilde{M}} = \max ({{\tilde{M}}},\nu )\)). We use the fact that \(\eta _k\in [\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }},\frac{\alpha {{\tilde{M}}}}{(k+1)^{1/2+\delta }}]\) to obtain, almost surely,
Since, by assumption, the last term is summable, we can now invoke the Robbins-Siegmund convergence theorem [37] to obtain that, almost surely, \(({\mathcal {J}}(\theta _{k}))_{k\in {\mathbb {N}}}\) converges and,
\[
\sum _{k=0}^{+\infty } \frac{\Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2}{(k+1)^{1/2+\delta }} < +\infty .
\]
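For the reader's convenience, we recall the standard form of the Robbins-Siegmund theorem [37] used here: for a filtration \(({\mathcal {F}}_k)_{k\in {\mathbb {N}}}\) and non-negative \({\mathcal {F}}_k\)-measurable random variables \(V_k, a_k, b_k, c_k\) satisfying
\[
{\mathbb {E}}\left[ V_{k+1} \,\middle \vert \, {\mathcal {F}}_k\right] \le (1+a_k)V_k + b_k - c_k, \qquad \sum _{k=0}^{+\infty } a_k< +\infty , \quad \sum _{k=0}^{+\infty } b_k < +\infty \ \text { a.s.},
\]
the sequence \((V_k)_{k\in {\mathbb {N}}}\) converges almost surely to a finite random variable and \(\sum _{k=0}^{+\infty } c_k < +\infty \) almost surely. Here it is applied with \(V_k\) given (up to an additive constant) by \({\mathcal {J}}(\theta _{k})\) and \(c_k\) proportional to \((k+1)^{-(1/2+\delta )}\Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\).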
Since \(\sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}=+\infty \), this implies at least that, almost surely,
\[
\liminf _{k\rightarrow +\infty } \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2 = 0.
\]
To prove that in addition \(\displaystyle \lim _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 = 0\), we will use Lemma 1 with \(u_k = \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\) and \(v_k = \frac{1}{(k+1)^{1/2+\delta }}\), for all \(k\in {\mathbb {N}}\). So we need to prove that there exists \(C_3>0\) such that \(\vert u_{k+1} - u_k\vert \le C_3 v_k\). To do so, we use the L-Lipschitz continuity of the gradients on \({\mathsf {C}}\), triangle inequalities and (27). Since \(\theta _{k+1}-\theta _{k} = -\eta _k\left( \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k}) + \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\right) \), it holds, almost surely, for all \(k \in {\mathbb {N}}\),
\[
\vert u_{k+1} - u_k\vert = \left( \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert + \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right) \big \vert \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \big \vert \le 2C_2 L \Vert \theta _{k+1}-\theta _{k}\Vert \le 4C_2^2 L \eta _k \le \frac{4C_2^2 L \alpha {\tilde{M}}}{(k+1)^{1/2+\delta }}.
\]
So taking \(C_3 =4C_2^2 L\alpha {\tilde{M}} \), by Lemma 1, almost surely, \(\lim _{k\rightarrow +\infty } \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2=0\). This concludes the almost sure convergence proof.
As for the rate, consider the expectation of (45) (with respect to the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\)). The tower property of the conditional expectation gives \( {\mathbb {E}}[{\mathbb {E}}[{\mathcal {J}}(\theta _{k+1})|\underline{{\mathsf {B}}}_{k-1}]]={\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] \), so we obtain, for all \(k\in {\mathbb {N}}\),
Then, for \(K\ge 1\), we sum from \(k=0\) to \(K-1\),
The right-hand side is finite, so there is a constant \(C_4>0\) such that for any \(K\in {\mathbb {N}}\), it holds,
and we obtain the rate. \(\square \)
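To make the rate explicit (a standard computation, under the natural reading that the bound above controls \(\sum _{k=0}^{K-1}(k+1)^{-(1/2+\delta )}\,{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] \le C_4\)): since
\[
\sum _{k=0}^{K-1}\frac{1}{(k+1)^{1/2+\delta }} \ge \int _{1}^{K+1} t^{-(1/2+\delta )}\,\mathrm {d}t = \frac{(K+1)^{1/2-\delta }-1}{1/2-\delta },
\]
it follows that \(\min _{0\le k<K} {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] = O\big (K^{-(1/2-\delta )}\big )\).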
1.3 Proof of the Corollary
Before proving the corollary we recall the following result.
Lemma 2
Let \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) be an L-Lipschitz continuous and differentiable function. Then \(\nabla g\) is uniformly bounded on \({\mathbb {R}}^P\).
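For completeness, a one-line justification (standard, not spelled out in the text): for any \(\theta , d\in {\mathbb {R}}^P\),
\[
\langle \nabla g(\theta ), d\rangle = \lim _{t\rightarrow 0^+} \frac{g(\theta +td)-g(\theta )}{t} \le L\Vert d\Vert ,
\]
and taking \(d=\nabla g(\theta )\) yields \(\Vert \nabla g(\theta )\Vert \le L\).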
We can now prove the corollary.
Proof of Corollary 1
The proof is very similar to that of Theorem 1. Denote by L the Lipschitz constant of \(\nabla {\mathcal {J}}\). Then, the descent lemma (30) holds surely. Furthermore, since each \({\mathcal {J}}_n\), \(n\in \{1,\ldots ,N\}\), is Lipschitz continuous, so is \({\mathcal {J}}\), and by Lemma 2 globally Lipschitz differentiable functions have uniformly bounded gradients, so \(\nabla {\mathcal {J}}\) is bounded. This is enough to obtain (45). Similarly, at iteration \(k\in {\mathbb {N}}\), \({\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\Vert \right] \) is also uniformly bounded. Overall, these arguments allow us to follow the lines of the proof of Theorem 1, and the same conclusions follow. \(\square \)
Appendix C: Details on the Synthetic Experiments
We detail the non-convex regression problem presented in Figs. 2 and 3. Given a matrix \(A\in {\mathbb {R}}^{N \times P}\) and a vector \(b\in {\mathbb {R}}^N\), denote by \(A_n\) the n-th row of A. The problem consists in minimizing a loss function of the form,
\[
{\mathcal {J}}(\theta ) = \frac{1}{N}\sum _{n=1}^{N} \phi \left( A_n \theta - b_n\right) , \qquad \mathrm {(52)}
\]
where the non-convexity comes from the function \(t\in {\mathbb {R}}\mapsto \phi (t) = t^2/(1+t^2)\). For more details on the initialization of A and b we refer to [10], where this problem was initially proposed. In the experiments of Fig. 3, the mini-batch approximation was made by selecting a subset of the rows of A, which amounts to computing only a few terms of the full sum in (52). We used \(N=500\), \(P=30\) and mini-batches of size 50.
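The following minimal sketch (our own naming, not the authors' code; the initialization of A and b is a placeholder, see [10] for the actual one) illustrates the loss and its mini-batch approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 500, 30, 50
A = rng.standard_normal((N, P))      # placeholder initialization; see [10]
b = rng.standard_normal(N)

def phi(t):
    return t**2 / (1 + t**2)         # non-convex robust penalty

def loss(theta, idx=None):
    # idx = indices of the rows of A forming the mini-batch (all rows if None)
    idx = np.arange(N) if idx is None else idx
    return np.mean(phi(A[idx] @ theta - b[idx]))

theta = np.zeros(P)
batch = rng.choice(N, size=batch_size, replace=False)
print(loss(theta), loss(theta, batch))   # full loss vs. mini-batch estimate
```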
In the deterministic setting, we ran each algorithm for 250 iterations and selected the hyper-parameters of each algorithm such that it achieved \(\vert {\mathcal {J}}(\theta )-{\mathcal {J}}^\star \vert <10^{-1}\) as fast as possible. In the mini-batch experiments, we ran each algorithm for 250 epochs and selected the hyper-parameters that yielded the smallest value of \({\mathcal {J}}(\theta )\) after 50 epochs.
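A hedged sketch of the deterministic tuning protocol described above (all names are ours; the actual candidate grids and runner are not specified here):

```python
def tune(run_algorithm, candidate_params, J, J_star, tol=1e-1, max_iter=250):
    """Keep the candidate whose run first reaches |J(theta) - J_star| < tol."""
    best, best_hit = None, float("inf")
    for params in candidate_params:
        trajectory = run_algorithm(params, max_iter)   # list of iterates theta_0, ..., theta_max_iter
        hit = next((i for i, th in enumerate(trajectory)
                    if abs(J(th) - J_star) < tol), float("inf"))
        if hit < best_hit:
            best, best_hit = params, hit
    return best
```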
Appendix D: Description of Auxiliary Algorithms
We describe the heuristic algorithms used in Fig. 3 and discussed in Sect. 3.3. Note that the step-size in Algorithm 5 is equivalent to that of Expected-GV but is written differently to avoid storing an additional gradient estimate.
Cite this article
Castera, C., Bolte, J., Févotte, C. et al. Second-Order Step-Size Tuning of SGD for Non-Convex Optimization. Neural Process Lett 54, 1727–1752 (2022). https://doi.org/10.1007/s11063-021-10705-5