Second-Order Step-Size Tuning of SGD for Non-Convex Optimization


Abstract

Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, we estimate curvature from a local quadratic model, using only noisy gradient approximations. This yields a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach: for such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.


Notes

  1. For a fair comparison we implement this method with the scaling-factor \(\alpha \) of Algorithm 1.

  2. There is also the possibility of computing additional estimates as [41] previously did for a stochastic BFGS algorithm, but this would double the computational cost.

  3. Step-Tuned SGD achieves the same small level of error as SGD when doing additional epochs thanks to the decay schedule present in Algorithm 2.

  4. Default values: \((\nu ,\beta ,{\tilde{m}},{\tilde{M}},\delta ) = (2,0.9,0.5,2,0.001)\).

  5. An alternative common practice consists in manually decaying the step-size at pre-defined epochs. Although efficient in practice for reaching state-of-the-art results, this technique makes the comparison of algorithms harder, hence we stick to a usual Robbins-Monro type of decay (a small sketch of this schedule is given after these notes).
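For concreteness, here is a minimal Python sketch (our illustration, not the authors' code) of the Robbins-Monro-type decay analyzed in Appendix B, where the effective step-size is \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\) with the curvature-based factor \(\gamma _k\) kept in \([{\tilde{m}},{\tilde{M}}]\). The values of \(\delta \), \({\tilde{m}}\), \({\tilde{M}}\) follow the defaults of note 4; \(\alpha \) and \(\gamma _k\) are placeholders here.

```python
def step_size(k, gamma_k, alpha=0.1, delta=0.001, m_tilde=0.5, M_tilde=2.0):
    # eta_k = alpha * gamma_k * (k+1)^(-(1/2 + delta)), with gamma_k clipped to [m_tilde, M_tilde].
    # gamma_k stands for the curvature-based factor of Algorithm 2 (not reproduced here).
    gamma_k = min(max(gamma_k, m_tilde), M_tilde)
    return alpha * gamma_k * (k + 1) ** (-(0.5 + delta))

# With delta close to 0, the decay is essentially 1/sqrt(k+1):
print([round(step_size(k, gamma_k=1.0), 4) for k in (0, 9, 99, 999)])
```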

References

  1. Alber YI, Iusem AN, Solodov MV (1998) On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Math Program 81(1):23–35

  2. Allen-Zhu Z (2018) Natasha 2: faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems (NIPS), pp 2675–2686

  3. Alvarez F, Cabot A (2004) Steepest descent with curvature dynamical system. J Optim Theory Appl 120(2):247–273

  4. Babaie-Kafaki S, Fatemi M (2013) A modified two-point stepsize gradient algorithm for unconstrained minimization. Optim Methods Softw 28(5):1040–1050

  5. Barakat A, Bianchi P (2018) Convergence of the ADAM algorithm from a dynamical system viewpoint. arXiv:1810.02263

  6. Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148

  7. Bertsekas DP, Hager W, Mangasarian O (1998) Nonlinear programming. Athena Scientific, Belmont, MA

  8. Biglari F, Solimanpur M (2013) Scaling on the spectral gradient method. J Optim Theory Appl 158:626–635

  9. Bolte J, Pauwels E (2020) A mathematical model for automatic differentiation in machine learning. In: Advances in Neural Information Processing Systems (NIPS)

  10. Carmon Y, Duchi JC, Hinder O, Sidford A (2017) Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the International Conference on Machine Learning (ICML), pp 654–663

  11. Castera C, Bolte J, Févotte C, Pauwels E (2021) An inertial Newton algorithm for deep learning. J Mach Learn Res 22(134):1–31

  12. Curtis FE, Guo W (2016) Handling nonpositive curvature in a limited memory steepest descent method. IMA J Numer Anal 36(2):717–742

  13. Curtis FE, Robinson DP (2019) Exploiting negative curvature in deterministic and stochastic optimization. Math Program 176(1–2):69–94

  14. Dai Y, Yuan J, Yuan YX (2002) Modified two-point stepsize gradient methods for unconstrained optimization. Comput Optim Appl 22(1):103–109

  15. Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Found Comput Math 20(1):119–154

  16. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(7)

  17. Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J Optim 28(4):3229–3259

  18. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778

  19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  20. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95

  21. Idelbayev Y (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10

  22. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the International Conference on Machine Learning (ICML), pp 448–456

  23. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp 315–323

  24. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR)

  25. Krishnan S, Xiao Y, Saurous RA (2018) Neumann optimizer: a practical optimization algorithm for deep neural networks. In: Proceedings of the International Conference on Learning Representations (ICLR)

  26. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep., Canadian Institute for Advanced Research

  27. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  28. LeCun Y, Cortes C, Burges C (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist

  29. Li X, Orabona F (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp 983–992

  30. Liang J, Xu Y, Bao C, Quan Y, Ji H (2019) Barzilai-Borwein-based adaptive learning rate for deep learning. Pattern Recognit Lett 128:197–203

  31. Lin M, Chen Q, Yan S (2013) Network in network. arXiv:1312.4400

  32. Liu M, Yang T (2017) On noisy negative curvature descent: competing with gradient descent for faster non-convex optimization. arXiv:1709.08571

  33. Martens J, Grosse R (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In: Proceedings of the International Conference on Machine Learning (ICML), pp 2408–2417

  34. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NIPS), pp 8026–8037

  35. Raydan M (1997) The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J Optim 7(1):26–33

  36. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(1):400–407

  37. Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, Elsevier, pp 233–257

  38. Robles-Kelly A, Nazari A (2019) Incorporating the Barzilai-Borwein adaptive step size into subgradient methods for deep network training. In: 2019 Digital Image Computing: Techniques and Applications (DICTA), pp 1–6

  39. Rossum G (1995) Python reference manual. CWI (Centre for Mathematics and Computer Science)

  40. Royer CW, Wright SJ (2018) Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J Optim 28(2):1448–1477

  41. Schraudolph NN, Yu J, Günter S (2007) A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)

  42. Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems (NIPS), pp 685–693

  43. Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31

  44. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30

  45. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NIPS), pp 4148–4158

  46. Xiao Y, Wang Q, Wang D (2010) Notes on the Dai-Yuan-Yuan modified spectral gradient method. J Comput Appl Math 234(10):2986–2992

  47. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in Neural Information Processing Systems (NIPS) 33


Acknowledgements

The authors acknowledge the support of the European Research Council (ERC FACTORY-CoG-6681839), the Agence Nationale de la Recherche (ANR 3IA-ANITI, ANR-17-EURE-0010 CHESS, ANR-19-CE23-0017 MASDOL) and the Air Force Office of Scientific Research (FA9550-18-1-0226). Part of the numerical experiments were done using the OSIRIM platform of IRIT, supported by the CNRS, the FEDER, Région Occitanie and the French government (http://osirim.irit.fr/site/en). We thank the development teams of the following libraries that were used in the experiments: Python [39], Numpy [44], Matplotlib [20], PyTorch [34], and the PyTorch implementation of ResNets from [21]. We thank Emmanuel Soubies and Sixin Zhang for useful discussions and Sébastien Gadat for pointing out flaws in the original proof.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The last three authors are listed in alphabetical order.

Appendices

Appendix A: Details About Deep Learning Experiments

In addition to the method described in Sect. 5.1, we provide in Table 1 a summary of each problem considered.

Table 1 Setting of the four different deep learning experiments

In the DL experiments of Sect. 5, we display the training error and the test accuracy of each algorithm as a function of the number of stochastic gradient estimates computed. Due to their adaptive procedures, ADAM, RMSprop and Step-Tuned SGD have additional sub-routines compared to SGD. Thus, in Table 2 we additionally provide the wall-clock time per epoch of these methods relative to SGD. Unlike the number of back-propagations performed, wall-clock time depends on many factors: the network and datasets considered, the computer used, and most importantly, the implementation. Regarding implementation, we emphasize that we used the versions of SGD, ADAM and RMSprop provided in PyTorch, which are fully optimized (and in particular parallelized). Table 2 indicates that Step-Tuned SGD is slower than the other adaptive methods on large networks, but this is due to our non-parallel implementation. On small networks (where the benefit of parallel computing is small), running Step-Tuned SGD for one epoch is actually faster than running SGD. In conclusion, the number of back-propagations is a more suitable metric for comparing the algorithms, and all methods considered require a single back-propagation per iteration.

Table 2 Relative wall-clock time per epoch compared to SGD
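The following is a minimal sketch (our illustration, not the benchmarking code behind Table 2) of how such a relative timing can be measured: time one epoch of a candidate optimizer and divide by the time of one SGD epoch. The tiny model, random data, and hyper-parameters below are illustrative assumptions.

```python
import time
import torch
import torch.nn as nn

def epoch_time(make_optimizer, n_batches=100, batch_size=128):
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = make_optimizer(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(n_batches, batch_size, 32)
    y = torch.randint(0, 10, (n_batches, batch_size))
    start = time.perf_counter()
    for i in range(n_batches):          # one back-propagation per iteration
        optimizer.zero_grad()
        loss_fn(model(x[i]), y[i]).backward()
        optimizer.step()
    return time.perf_counter() - start

t_sgd = epoch_time(lambda p: torch.optim.SGD(p, lr=0.1))
t_adam = epoch_time(lambda p: torch.optim.Adam(p, lr=1e-3))
# A careful benchmark would warm up first and average over several epochs.
print(f"ADAM / SGD wall-clock ratio: {t_adam / t_sgd:.2f}")
```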

Appendix B: Proof of the Theoretical Results

We state a lemma that we will use to prove Theorem 1.

1.1 Preliminary Lemma

The result is the following.

Lemma 1

( [1, Proposition 2]) Let \((u_k)_{k\in {\mathbb {N}}}\) and \((v_k)_{k\in {\mathbb {N}}}\) be two non-negative real sequences. Assume that \(\sum _{k=0}^{+\infty } u_k v_k <+\infty \), and \(\sum _{k=0}^{+\infty } v_k =+\infty \). If there exists a constant \(C>0\) such that \(\forall k\in {\mathbb {N}}, \vert u_{k+1} - u_k \vert \le C v_k\), then \(u_k\xrightarrow [k\rightarrow +\infty ]{}0\).
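As a quick numerical illustration (ours, not part of [1]), take \(u_k=(k+1)^{-1/2}\) and \(v_k=(k+1)^{-0.6}\): the three assumptions of Lemma 1 hold and \(u_k\) indeed tends to 0.

```python
import numpy as np

k = np.arange(200_000, dtype=float)
u = (k + 1) ** (-0.5)
v = (k + 1) ** (-0.6)
print(np.sum(u * v))                        # partial sums stay bounded (exponent 1.1 > 1)
print(np.max(np.abs(np.diff(u)) / v[:-1]))  # a valid constant C (about 0.29)
print(u[-1])                                # u_k approaches 0
```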

1.2 Proof of the main theorem

We can now prove Theorem 1.

Proof of Theorem 1

We first clarify the random process induced by the draw of the mini-batches. Algorithm 2 takes a sequence of mini-batches as input. This sequence is represented by the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\) as described in Sect. 3.2. Each of these random variables is independent of the others. In particular, for \(k\in {\mathbb {N}}_{>0}\), \({\mathsf {B}}_k\) is independent of the previous mini-batches \({\mathsf {B}}_0,\ldots , {\mathsf {B}}_{k-1}\). For convenience, we will denote \(\underline{{\mathsf {B}}}_k = \left\{ {\mathsf {B}}_0,\ldots ,{\mathsf {B}}_k\right\} \), the mini-batches up to iteration k. Due to the randomness of the mini-batches, the algorithm is a random process as well. As such, \(\theta _{k}\) is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\) and is independent of \({\mathsf {B}}_k\). However, \(\theta _{k+\frac{1}{2}}\) and \({\mathsf {B}}_{k}\) are not independent. Similarly, we constructed \(\gamma _k\) such that it is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\), which is independent of \({\mathsf {B}}_k\). This dependency structure will be crucial to derive and bound conditional expectations. Finally, we highlight the following important identity, for any \(k\in {\mathbb {N}}_{>0}\),

$$\begin{aligned} {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _{k}). \end{aligned}$$
(25)

Indeed, the iterate \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), so taking the expectation over \({\mathsf {B}}_k\), which is independent of \(\underline{{\mathsf {B}}}_{k-1}\), we recover the full gradient of \({\mathcal {J}}\) as the distribution of \({\mathsf {B}}_k\) is the same as that of \({\mathsf {S}}\) in Sect. 3.2. Notice in addition that a similar identity does not hold for \(\theta _{k+\frac{1}{2}}\) (as it depends on \({\mathsf {B}}_k\)).
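Below is a small numerical sanity check of identity (25) (our illustration, on a toy least-squares finite sum rather than \({\mathcal {J}}\)): for a fixed parameter vector, averaging mini-batch gradients over many independent uniform draws recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, batch = 200, 10, 20
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)
theta = rng.standard_normal(P)

def minibatch_grad(idx):
    # gradient of (1 / |idx|) * sum_{n in idx} 0.5 * (A_n^T theta - b_n)^2
    residual = A[idx] @ theta - b[idx]
    return A[idx].T @ residual / len(idx)

full_grad = minibatch_grad(np.arange(N))
draws = [minibatch_grad(rng.choice(N, batch, replace=False)) for _ in range(20000)]
print(np.linalg.norm(np.mean(draws, axis=0) - full_grad))  # close to 0: unbiasedness
```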

We now provide estimates that will be used extensively in the rest of the proof. The gradient of the loss function \(\nabla {\mathcal {J}}\) is locally Lipschitz continuous as \({\mathcal {J}}\) is twice continuously differentiable. By assumption, there exists a compact convex set \({\mathsf {C}}\subset {\mathbb {R}}^P\), such that with probability 1, the sequence of iterates \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) belongs to \({\mathsf {C}}\). Therefore, by local Lipschitz continuity, the restriction of \(\nabla {\mathcal {J}}\) to \({\mathsf {C}}\) is Lipschitz continuous on \({\mathsf {C}}\). Similarly, each \(\nabla {\mathcal {J}}_n\) is also Lipschitz continuous on \({\mathsf {C}}\). We denote by \(L>0\) a Lipschitz constant common to each \(\nabla {\mathcal {J}}_n\), \(n=1,\ldots , N\). Notice that the Lipschitz continuity is preserved by averaging, in other words,

$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi _1,\psi _2\in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi _1) -\nabla {\mathcal {J}}_{\mathsf {B}}(\psi _2) \Vert \le L\Vert \psi _1-\psi _2\Vert . \end{aligned}$$
(26)

In addition, using the continuity of the \(\nabla {\mathcal {J}}_n\)’s, there exists a constant \(C_2>0\), such that,

$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi \in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi )\Vert \le C_2. \end{aligned}$$
(27)

Finally, for a function \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) with L-Lipschitz continuous gradient, we recall the following inequality called descent lemma (see for example [7, Proposition A.24]). For any \(\theta \in {\mathbb {R}}^P\) and any \(d\in {\mathbb {R}}^P\),

$$\begin{aligned} g(\theta +d) \le g(\theta ) + \langle \nabla g(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(28)

In our case since we only have the L-Lipschitz continuity of \(\nabla {\mathcal {J}}\) on \({\mathsf {C}}\) which is convex, we have a similar bound for \(\nabla {\mathcal {J}}\) on \({\mathsf {C}}\): for any \(\theta \in {\mathsf {C}}\) and any \(d\in {\mathbb {R}}^P\) such that \(\theta +d\in {\mathsf {C}}\),

$$\begin{aligned} {\mathcal {J}}(\theta +d) \le {\mathcal {J}}(\theta ) + \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(29)
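For intuition, here is a quick numerical check (ours) of inequality (28) on a quadratic function, whose gradient is L-Lipschitz with L equal to the largest eigenvalue of its Hessian.

```python
import numpy as np

rng = np.random.default_rng(1)
P = 20
M = rng.standard_normal((P, P))
Q = M @ M.T                          # symmetric positive semidefinite Hessian
L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of the gradient

def g(theta):
    return 0.5 * theta @ Q @ theta

def grad_g(theta):
    return Q @ theta

for _ in range(1000):
    theta, d = rng.standard_normal(P), rng.standard_normal(P)
    # (28): g(theta + d) <= g(theta) + <grad g(theta), d> + (L/2) ||d||^2
    assert g(theta + d) <= g(theta) + grad_g(theta) @ d + 0.5 * L * (d @ d) + 1e-9
print("descent lemma (28) verified on random samples")
```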

Let \(\theta _0\in {\mathbb {R}}^P\) and let \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) be a sequence generated by Algorithm 2 initialized at \(\theta _0\). By assumption this sequence belongs to \({\mathsf {C}}\) almost surely. To simplify, for \(k\in {\mathbb {N}}\), we denote \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\). Fix an iteration \(k\in {\mathbb {N}}\); we can use (29) with \(\theta = \theta _k\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\) to obtain, almost surely (with respect to the boundedness assumption),

$$\begin{aligned} {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \le {\mathcal {J}}(\theta _k) - \eta _k \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2. \end{aligned}$$
(30)

Similarly with \(\theta = \theta _{k+\frac{1}{2}}\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\), almost surely,

$$\begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k+\frac{1}{2}}) - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2. \end{aligned}$$
(31)

We combine (30) and (31), almost surely,

$$\begin{aligned} \begin{aligned}&{\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k}) - \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\&\quad + \frac{\eta _k^2}{2}L \left( \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2+ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2\right) . \end{aligned} \end{aligned}$$
(32)

Using the boundedness assumption and (27), almost surely,

$$\begin{aligned} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2 \le C_2 \quad \text {and}\quad \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2 \le C_2. \end{aligned}$$
(33)

So almost surely,

$$\begin{aligned} \begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k})&- \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\ {}&+ \eta _k^2L C_2. \end{aligned} \end{aligned}$$
(34)

Then, taking the expectation of (34) with respect to \({\mathsf {B}}_k\) conditionally on \(\underline{{\mathsf {B}}}_{k-1}\) (the mini-batches used up to iteration \(k-1\)), we obtain,

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k})\,|\,\underline{{\mathsf {B}}}_{k-1}\right]&- {\mathbb {E}}\left[ \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \,|\,\underline{{\mathsf {B}}}_{k-1}\right] \\ {}&+ {\mathbb {E}}\left[ \eta _k^2\,|\,\underline{{\mathsf {B}}}_{k-1}\right] L C_2. \end{aligned} \end{aligned}$$
(35)

As explained at the beginning of the proof, \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), thus \({\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] = {\mathcal {J}}(\theta _{k})\). Similarly, by construction \(\eta _k\) is independent of the current mini-batch \({\mathsf {B}}_k\): it is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\). Hence, (35) reads,

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k})&- \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \rangle + {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\,\underline{{\mathsf {B}}}_{k-1}\right] \right) \\ {}&+ \eta _k^2 L C_2. \end{aligned} \end{aligned}$$
(36)

Then, we use identity (25), i.e., \({\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\,|\,\underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _k)\). Overall, we obtain,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - \eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 - \eta _k {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\,\underline{{\mathsf {B}}}_{k-1}\right] + \eta _k^2 L C_2. \end{aligned}$$
(37)

We will now bound the last term of (37). First we write,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \\&\quad =-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(38)

Using the Cauchy-Schwarz inequality, as well as (26) and (27), almost surely,

$$\begin{aligned} \begin{aligned} |\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle |&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_k}(\theta _{k})\Vert \\&\le LC_2^2\eta _k. \end{aligned} \end{aligned}$$
(39)

Hence,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(40)

We perform similar computations on the last term of (40), almost surely

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&= -\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k})\Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(41)

Finally we obtain by combining (38), (40) and (41), almost surely,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le 2LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(42)

Going back to the last term of (37), taking the conditional expectation of (42) gives, almost surely,

$$\begin{aligned} -{\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le 2LC_2^2\eta _k - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2. \end{aligned}$$
(43)

In the end we obtain, for an arbitrary iteration \(k\in {\mathbb {N}}\), almost surely,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \eta _k^2 L\left( C_2+2C_2^2\right) . \end{aligned}$$
(44)

To simplify we assume that \({{\tilde{M}}}\ge \nu \) (otherwise set \({{\tilde{M}}} = \max ({{\tilde{M}}},\nu )\)). We use the fact that \(\eta _k\in [\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }},\frac{\alpha {{\tilde{M}}}}{(k+1)^{1/2+\delta }}]\) to obtain, almost surely,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\,|\,\underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L\left( C_2+2C_2^2\right) . \end{aligned}$$
(45)

Since, by assumption, the last term is summable, we can now invoke the Robbins-Siegmund convergence theorem [37] to obtain that, almost surely, \(({\mathcal {J}}(\theta _{k}))_{k\in {\mathbb {N}}}\) converges and,

$$\begin{aligned} \sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2 < + \infty . \end{aligned}$$
(46)

Since \(\sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}=+\infty \), this implies at least that almost surely,

$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2=0. \end{aligned}$$
(47)

To prove that in addition \(\displaystyle \lim _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 = 0\), we will use Lemma 1 with \(u_k = \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\) and \(v_k = \frac{1}{(k+1)^{1/2+\delta }}\), for all \(k\in {\mathbb {N}}\). So we need to prove that there exists \(C_3>0\) such that \(\vert u_{k+1} - u_k\vert \le C_3 v_k\). To do so, we use the L-Lipschitz continuity of the gradients on \({\mathsf {C}}\), triangle inequalities and (27). It holds, almost surely, for all \(k \in {\mathbb {N}}\)

$$\begin{aligned}&\left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert ^2-\Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right| \nonumber \\&\quad = \;\left( \;\Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert + \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \;\right) \;\times \; \left| \;\Vert \;\nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\;\Vert \;\right| \nonumber \\&\quad \le 2C_2 \left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right| \nonumber \\&\quad \le 2C_2 \Vert \nabla {\mathcal {J}}(\theta _{k+1})-\nabla {\mathcal {J}}(\theta _{k})\Vert \nonumber \\&\quad \le 2C_2 L \Vert \theta _{k+1}-\theta _{k}\Vert \\&\quad \le 2C_2 L \left\| -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k) -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\right\| \nonumber \\&\quad \le 2C_2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)+\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert \nonumber \\&\quad \le 4C_2^2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }}.\nonumber \end{aligned}$$
(48)

So taking \(C_3 =4C_2^2 L\alpha {\tilde{M}} \), by Lemma 1, almost surely, \(\lim _{k\rightarrow +\infty } \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2=0\). This concludes the almost sure convergence proof.

As for the rate, consider the expectation of (45) (with respect to the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\)). The tower property of the conditional expectation gives \( {\mathbb {E}}[{\mathbb {E}}[{\mathcal {J}}(\theta _{k+1})|\underline{{\mathsf {B}}}_{k-1}]]={\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] \), so we obtain, for all \(k\in {\mathbb {N}}\),

$$\begin{aligned} \begin{aligned} 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \le&{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2). \end{aligned} \end{aligned}$$
(49)

Then for \(K\ge 1\), we sum from 0 to \(K-1\),

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K-1}2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}&{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \\&\le \sum _{k=0}^{K-1}{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] -\sum _{k=0}^{K-1} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \sum _{k=0}^{K-1} \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&={\mathcal {J}}(\theta _0) - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{K})\right] + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&\le {\mathcal {J}}(\theta _0) - \inf _{\psi \in {\mathbb {R}}^P}{\mathcal {J}}(\psi ) + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2),\ \end{aligned} \end{aligned}$$
(50)

The right-hand side is finite, so there is a constant \(C_4>0\) such that for any \(K\in {\mathbb {N}}\), it holds,

$$\begin{aligned} C_4\ge \sum _{k=0}^K \frac{1}{(k+1)^{1/2+\delta }} {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right]&\ge \min _{k\in \left\{ 1,\ldots ,K\right\} }{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] \sum _{k=0}^K \frac{1}{(k+1)^{1/2+\delta }} \nonumber \\&\ge \left( K+1\right) ^{1/2-\delta }\min _{k\in \left\{ 1,\ldots ,K\right\} } {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] , \end{aligned}$$
(51)

and we obtain the rate. \(\square \)

1.3 Proof of the Corollary

Before proving the corollary we recall the following result.

Lemma 2

Let \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) be an L-Lipschitz continuous and differentiable function. Then \(\nabla g\) is uniformly bounded on \({\mathbb {R}}^P\).

We can now prove the corollary.

Proof of Corollary 1

The proof is very similar to that of Theorem 1. Denote by L the Lipschitz constant of \(\nabla {\mathcal {J}}\). Then, the descent lemma (30) holds surely. Furthermore, since each \({\mathcal {J}}_n\), \(n\in \{1,\ldots ,N\}\), is Lipschitz continuous, so is \({\mathcal {J}}\), and globally Lipschitz functions have uniformly bounded gradients (Lemma 2), so \(\nabla {\mathcal {J}}\) is uniformly bounded. This is enough to obtain (45). Similarly, at iteration \(k\in {\mathbb {N}}\), \({\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\Vert \right] \) is also uniformly bounded. Overall, these arguments allow us to follow the lines of the proof of Theorem 1, and the same conclusions follow. \(\square \)

Appendix C: Details on the Synthetic Experiments

We detail the non-convex regression problem presented in Figs. 2 and 3. Given a matrix \(A\in {\mathbb {R}}^{N \times P}\) and a vector \(b\in {\mathbb {R}}^N\), denote by \(A_n\) the n-th row of A. The problem consists in minimizing a loss function of the form,

$$\begin{aligned} \theta \in {\mathbb {R}}^P\mapsto {\mathcal {J}}(\theta ) = \frac{1}{N}\sum _{n=1}^{N} \phi (A_n^T\theta -b_n), \end{aligned}$$
(52)

where the non-convexity comes from the function \(t\in {\mathbb {R}}\mapsto \phi (t) = t^2/(1+t^2)\). For more details on the initialization of A and b we refer to [10], where this problem was originally proposed. In the experiments of Fig. 3, the mini-batch approximation was made by selecting a subset of the rows of A, which amounts to computing only a few terms of the full sum in (52). We used \(N=500\), \(P=30\) and mini-batches of size 50.
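A minimal NumPy sketch of the objective (52) and its (mini-batch) gradient is given below; the Gaussian initialization of A and b is an assumption made here for illustration (see [10] for the original construction).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 500, 30, 50               # values used in the experiments
A = rng.standard_normal((N, P))              # assumption: Gaussian init, see [10]
b = rng.standard_normal(N)

def loss(theta, idx=None):
    idx = np.arange(N) if idx is None else idx
    t = A[idx] @ theta - b[idx]
    return np.mean(t ** 2 / (1.0 + t ** 2))  # phi(t) = t^2 / (1 + t^2)

def grad(theta, idx=None):
    idx = np.arange(N) if idx is None else idx
    t = A[idx] @ theta - b[idx]
    # phi'(t) = 2t / (1 + t^2)^2, chain rule with A_n^T theta - b_n
    return A[idx].T @ (2.0 * t / (1.0 + t ** 2) ** 2) / len(idx)

theta = rng.standard_normal(P)
mini_batch = rng.choice(N, batch_size, replace=False)   # subset of the rows of A
print(loss(theta), np.linalg.norm(grad(theta, mini_batch)))
```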

In the deterministic setting we ran each algorithm for 250 iterations and selected the hyper-parameters of each algorithm such that it achieved \(\vert {\mathcal {J}}(\theta )-{\mathcal {J}}^\star \vert <10^{-1}\) as fast as possible. In the mini-batch experiments we ran each algorithm for 250 epochs and selected the hyper-parameters that yielded the smallest value of \({\mathcal {J}}(\theta )\) after 50 epochs.

Appendix D: Description of Auxiliary Algorithms

We detail the heuristic algorithms used in Fig. 3 and discussed in Sect. 3.3. Note that the step-size in Algorithm 5 is equivalent to that of Expected-GV but is written differently to avoid storing an additional gradient estimate.

[The pseudo-code of these auxiliary algorithms is provided as figures in the original article.]


About this article


Cite this article

Castera, C., Bolte, J., Févotte, C. et al. Second-Order Step-Size Tuning of SGD for Non-Convex Optimization. Neural Process Lett 54, 1727–1752 (2022). https://doi.org/10.1007/s11063-021-10705-5
