Benefits of Learning Rate Annealing for
Tuning-Robustness in Stochastic Optimization
Abstract
The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
1 Introduction
Stochastic Gradient Descent (SGD, Robbins and Monro, 1951) is a cornerstone of modern machine learning. Starting at a point $x_1$, the update step of SGD takes the form $x_{t+1} = x_t - \eta_t g_t$, where $\eta_t$ is the stepsize at step $t$ and $g_t$ is a stochastic gradient at $x_t$. An effective stepsize sequence is critical for performance, yet it is notoriously hard to tune in many scenarios and applications (e.g., Bottou, 2012; Schaul et al., 2013). Furthermore, as models continue to scale, the computational burden of stepsize tuning becomes increasingly demanding.
A common approach to tuning the stepsize sequence is simply using a fixed stepsize, selecting the best fixed value by performing a geometric grid search (Bengio, 2012). In this method, the stepsize is selected based on its performance on a validation set, with the grid resolution determining the (multiplicative) proximity to the best stepsize within the specified range.
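The grid-search procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' tuning code: the helper names `geometric_grid` and `grid_search`, the grid range, and the toy validation loss are all hypothetical.

```python
import math

def geometric_grid(lo, hi, factor):
    """Return stepsizes lo, lo*factor, lo*factor**2, ... that do not exceed hi."""
    n = int(math.floor(math.log(hi / lo) / math.log(factor) + 1e-9)) + 1
    return [lo * factor ** i for i in range(n)]

def grid_search(validation_loss, grid):
    """Pick the grid point with the smallest validation loss."""
    return min(grid, key=validation_loss)

# Hypothetical validation loss whose minimizer is a stepsize of 0.03.
def toy_validation_loss(eta):
    return (math.log10(eta) - math.log10(0.03)) ** 2

grid = geometric_grid(1e-4, 1.0, 10.0)
best = grid_search(toy_validation_loss, grid)
# The selected stepsize is within a multiplicative factor of 10 (the grid
# resolution) of the best stepsize in the range.
print(grid, best)
```

With a grid factor of 10, the selected value can be off from the best stepsize by up to that factor, which is the multiplicative misspecification the paper studies.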
A primary approach to moving beyond fixed stepsize sequences is stepsize scheduling. In stepsize scheduling (e.g., Smith, 2017; Loshchilov and Hutter, 2017; Ge et al., 2019), the step at time is determined by multiplying a baseline stepsize parameter with a parametric sequence. While the approach enables more versatile stepsize sequences and often leads to improved performance, it still requires tuning the baseline stepsize parameter, typically through grid search. Some stepsize schedules also exhibit theoretical benefits, such as anytime convergence guarantees and better last-iterate guarantees (e.g., Jain et al., 2019; Zamani and Glineur, 2023; Liu and Zhou, 2024; Defazio et al., 2024a).
While stepsize tuning is a widely adopted practice, its theoretical foundations remain under-explored. One key question is how sensitive this procedure is to the grid resolution. Limited computational budgets restrict the resolution of grid searches, an issue that has become increasingly prominent with the emergence of modern models consisting of billions of parameters that take days, sometimes weeks, to train. In fact, at massive scales, it is often the case that any methodical tuning of the stepsize is prohibitive and therefore abandoned entirely.
Standard analyses of fixed stepsize SGD in the convex setting demonstrate a linear degradation in convergence rate as a function of the multiplicative misspecification of the stepsize, which can be significant when performing a coarse—or even absent—grid search. This work investigates to what extent stepsize schedules can mitigate this dependency, providing more robust performance at lower grid resolutions.
Focusing our analysis on stochastic convex optimization, we establish convergence guarantees for SGD with stepsize schedules that decay polynomially to zero, which reveals a key advantage of automatically adapting to multiplicative overestimation of the stepsize. For commonly used schedules, such as cosine annealing, our guarantees yield a sublinear dependence on the misspecification factor, in contrast to the linear dependence that arises with fixed stepsizes. We further validate our theoretical findings through experiments on synthetic and real data, demonstrating improved robustness to stepsize tuning using decaying schedules compared to tuning a constant stepsize using a grid-search.
1.1 Summary of Contributions
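In more detail, we consider stochastic first-order convex optimization settings, where we aim to minimize a convex objective $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is a convex set with diameter $D$, while accessing $f$ only through a (sub-)gradient oracle $g$ (i.e., $\mathbb{E}[g(x)] \in \partial f(x)$ for all $x \in \mathcal{X}$). Given an initial stepsize $\eta$, a schedule is specified by a function $h : [0,1] \to [0,1]$ through $\eta_t = \eta\, h\big(\tfrac{t-1}{T}\big)$, where $T$ is the total number of SGD update steps.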
Our main results are the following:
• Our first main result in the convex (non-smooth) case, where we assume that the second moment of the oracle is bounded, is a convergence guarantee of the last iterate of -steps SGD using a decaying schedule (which satisfies some mild assumptions), of the form
where is the convergence rate using a tuned stepsize, and are certain functions that depend on the schedule , and is the multiplicative overestimation factor of compared to the tuned stepsize . The infimum above is at most , but as we discuss below, may become sublinear in depending on the particular schedule .
• Our second main result deals with the convex smooth case, where we assume that is -smooth and that the oracle has a bounded variance. We obtain a similar convergence guarantee of
where is the fraction of steps with , is the convergence rate with a tuned stepsize, and is the multiplicative overestimation factor compared to the tuned stepsize. The dependence on is unavoidable (up to constants), as convergence in smooth optimization requires step size smaller than . For sufficiently small , the infimum is again in the worst case.
• Applying our main result to the cosine annealing schedule in the convex Lipschitz case, we obtain that the last-iterate convergence rate of SGD is . Similarly, applying the same result to the polynomially decaying schedule for some constant degree , we obtain that the last-iterate convergence rate of SGD is . In the convex smooth case, assuming , we obtain the same multiplicative sub-optimality of and for cosine annealing and (degree -)polynomially decaying schedules.
• Additionally, we validate the robustness of various learning rate schedules to tuning in experiments, by performing grid search on two tasks: a synthetic logistic regression task with a linear model and the CIFAR-10 classification task with a deep neural network. We find that, when using a coarse grid, annealing schemes (specifically cosine annealing and linear decay) demonstrate greater robustness compared to a fixed stepsize schedule.
Our theoretical results show that polynomially decaying schedules, including cosine annealing, achieve convergence rates with a sublinear dependence on the misspecification factor, in contrast to the linear dependence observed in SGD with a fixed stepsize (which we demonstrate in detail in Appendix D). This distinction is particularly striking since, while both fixed and annealed stepsizes are able to attain the optimal convergence rate when properly tuned, the latter exhibits significantly greater robustness to parameter misspecifications. When tuning the stepsize using a coarse grid search under a limited computational budget, this difference in robustness can significantly impact performance, as also seen in our synthetic and real-data experiments.
1.2 Additional Related Work
Adaptive and parameter-free methods.
Beyond learning rate scheduling, several approaches have been developed to minimize the need for extensive tuning in first-order optimization. These include adaptive methods, such as AdaGrad and Adam (e.g., Duchi et al., 2011; Kingma and Ba, 2015), as well as recent theoretical advancements (Reddi et al., 2018; Tran et al., 2019; Kavis et al., 2019; Alacaoglu et al., 2020; Faw et al., 2022; Kavis et al., 2022; Attia and Koren, 2023; Liu et al., 2023), which utilize gradient statistics to dynamically adjust learning rates. Additionally, parameter-free methods (e.g., Chaudhuri et al., 2009; Streeter and McMahan, 2012; Luo and Schapire, 2015; Orabona and Pál, 2016; Cutkosky and Orabona, 2018; Orabona and Pál, 2021; Carmon and Hinder, 2022) primarily focus on automatically adapting to the problem’s complexity, such as the distance to the optimal solution. Recently, several parameter-free approaches demonstrated impressive practical performance, narrowing the gap to finely-tuned methods (Ivgi et al., 2023; Defazio and Mishchenko, 2023; Mishchenko and Defazio, 2023). While these approaches take different paths to reduce tuning, adaptive methods and scheduling schemes are often used together in practice.
Theoretical analyses of stepsize annealing.
Several studies have analyzed different stepsize schedules. The influential work of Jain et al. (2019) showed that the schedules and yield suboptimal last-iterate guarantees and proposed a new schedule with optimal last-iterate performance. Later, Defazio et al. (2024a) demonstrated that a linear decay schedule also achieves an optimal last-iterate guarantee. Additionally, Defazio et al. (2024b) introduced "schedule-free" SGD, which eliminates the need to know the training length in advance. While these works focus on optimality with well-tuned stepsizes and last-iterate guarantees, our work examines the robustness of these schedules when the stepsize is not finely tuned. Additionally, new scheduling schemes continue to emerge, such as those proposed by Zhai et al. (2022) and Hu et al. (2024), which incorporate a cooldown phase to accommodate varying training durations. The robustness perspective we propose helps us better understand the benefits of different schedules and guides the design of more robust ones.
2 Preliminaries
2.1 Problem Setup
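In this work, we are interested in first-order stochastic optimization over a bounded domain within the $d$-dimensional Euclidean space, $\mathbb{R}^d$, equipped with the Euclidean norm, defined as $\|x\| = \sqrt{\langle x, x \rangle}$. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a convex set with diameter $D$ (i.e., for all $x, y \in \mathcal{X}$, $\|x - y\| \le D$) and let $f : \mathcal{X} \to \mathbb{R}$ be a convex function. Our goal is to find some $\hat{x} \in \mathcal{X}$ such that $f(\hat{x}) - \min_{x \in \mathcal{X}} f(x)$ is small, where we access $f$ only through an unbiased sub-gradient oracle $g$ (i.e., $\mathbb{E}[g(x)] \in \partial f(x)$ for all $x \in \mathcal{X}$, where we denote with a slight abuse of notation $\nabla f(x) \triangleq \mathbb{E}[g(x)]$). We consider two optimization scenarios: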
(i) Convex and Lipschitz setting. Here we assume $g$ has a second moment bound, that is, for some $G > 0$, $\mathbb{E}\|g(x)\|^2 \le G^2$ for all $x \in \mathcal{X}$. This implies in particular that $f$ is $G$-Lipschitz.
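(ii) Convex and smooth setting. In this scenario we assume that $f$ is $\beta$-smooth (see footnote 1), and instead of a second moment bound we assume that $g$ has a variance bound, that is, for some $\sigma > 0$, $\mathbb{E}\|g(x) - \nabla f(x)\|^2 \le \sigma^2$ for all $x \in \mathcal{X}$.
(Footnote 1: A function $f$ is said to be $\beta$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$ for all $x, y$. In particular, this implies that $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\beta}{2}\|y - x\|^2$ for all $x, y$.)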
Stochastic gradient descent.
We will analyze the (projected) Stochastic Gradient Descent (SGD) algorithm, which starts at some $x_1 \in \mathcal{X}$ and performs update steps of the form $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t g_t)$, where $\eta_t$ is the stepsize at step $t$, $g_t = g(x_t)$ is a stochastic sub-gradient at $x_t$, and $\Pi_{\mathcal{X}}(\cdot)$ is the Euclidean projection to $\mathcal{X}$. The output of $T$-steps SGD is typically some average of the iterates or the last iterate. The convergence rate guarantee of the average iterate of fixed stepsize SGD with tuned stepsize is $O(DG/\sqrt{T})$ in the convex Lipschitz case and $O(\beta D^2/T + D\sigma/\sqrt{T})$ in the convex smooth case (see, e.g., Lan, 2012).
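As an illustration of the projected update, here is a minimal sketch in plain Python. It assumes, purely for the example, that the domain is a Euclidean ball centered at the origin; the names `project_ball` and `projected_sgd` and the noisy quadratic oracle are hypothetical and not taken from the paper.

```python
import math
import random

def project_ball(x, radius):
    """Euclidean projection onto the ball {z : ||z|| <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def projected_sgd(grad_oracle, x1, stepsizes, radius):
    """Run one projected SGD step per stepsize and return the last iterate."""
    x = list(x1)
    for eta in stepsizes:
        g = grad_oracle(x)  # stochastic (sub-)gradient at the current iterate
        x = project_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return x

# Hypothetical oracle: gradient of 0.5 * ||x - c||^2 plus Gaussian noise.
rng = random.Random(0)
center = [0.3, -0.2]

def noisy_quadratic_grad(x):
    return [xi - ci + rng.gauss(0.0, 0.1) for xi, ci in zip(x, center)]

T = 200
last_iterate = projected_sgd(noisy_quadratic_grad, [1.0, 0.0], [0.1] * T, radius=1.0)
print(last_iterate)
```

The function returns the last iterate; an averaged output would accumulate the iterates inside the loop instead.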
Stepsize scheduling.
Our focus will be on stepsizes of the form $\eta_t = \eta\, h\big(\tfrac{t-1}{T}\big)$, for some $\eta > 0$ and $h : [0,1] \to [0,1]$, where $T$ is the number of SGD steps that are performed. Common schedules include $h(u) = 1$ (fixed stepsize), $h(u) = \tfrac{1}{2}(1 + \cos(\pi u))$ (cosine annealing), and $h(u) = (1-u)^p$ for some $p \ge 1$ (polynomial decay). In particular, we will assume that $h$ is monotonically non-increasing and satisfies $h(1) = 0$; we will call such a schedule annealed for brevity. We additionally assume for technical reasons that the annealed schedules we consider are differentiable and $p$-Lipschitz. Using an annealed schedule, SGD with a properly tuned stepsize yields the same rate as optimally tuned fixed stepsize SGD, up to constant factors (where we treat $p$ as a constant). See Appendix E for additional details. Notable annealed schedules include cosine annealing and polynomial decay.
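The schedules named in this paragraph can be written down in a few lines. The sketch rests on two assumptions: the standard closed forms for cosine annealing and polynomial decay, and evaluating the schedule at (t-1)/T so that the first step uses the full base stepsize. The function names are hypothetical.

```python
import math

def fixed(u):
    """Fixed stepsize: the schedule is constant (and therefore not annealed)."""
    return 1.0

def cosine(u):
    """Cosine annealing in its standard form, decaying from 1 at u=0 to 0 at u=1."""
    return 0.5 * (1.0 + math.cos(math.pi * u))

def polynomial(p):
    """Polynomial decay of degree p; p=1 gives linear decay."""
    return lambda u: (1.0 - u) ** p

def stepsizes(eta, h, T):
    """Stepsizes eta * h((t-1)/T) for t = 1..T (indexing is an assumption)."""
    return [eta * h((t - 1) / T) for t in range(1, T + 1)]

linear_steps = stepsizes(0.1, polynomial(1), 4)
cosine_steps = stepsizes(0.1, cosine, 4)
print(linear_steps)
print(cosine_steps)
```

Both annealed schedules are non-increasing and vanish at u = 1, which is the property the analysis relies on.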
Robustness to stepsize misspecification.
Fixing an initialization $x_1$ and a stepsize schedule $h$, it remains to tune the base stepsize $\eta$. Considering a tuned stepsize $\eta^\star$ (see footnote 2), we investigate the sensitivity of SGD when the stepsize is only tuned up to a multiplicative misspecification factor $\rho \ge 1$ (i.e., stepsize $\eta = \rho\,\eta^\star$, where $\rho$ is of course unknown to the algorithm). In this case, the convergence rate will likely degrade as $\rho$ increases. For instance, the standard guarantee of fixed stepsize SGD degrades linearly in $\rho$; we demonstrate this fact in the convex Lipschitz setting in Appendix D.
(Footnote 2: By tuned we mean a stepsize that minimizes a corresponding convergence guarantee that depends on $\eta$, possibly ignoring lower-order terms for simplicity.)
Our main inquiry is to what extent stepsize schedules can mitigate this degradation, enabling more robust performance when the stepsize is crudely tuned (e.g., when tuned using a coarse grid search), and achieving convergence rates with sublinear dependence on , for .
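To make the question concrete, here is a small deterministic toy, not one of the paper's experiments, that compares a fixed and a linearly decaying stepsize when the base stepsize is too large. The objective |x| on [-1, 1], the base stepsize 0.3, and the horizon of 100 steps are all illustrative assumptions.

```python
def run_subgradient(stepsizes, x=1.0):
    """Projected subgradient descent on f(x) = |x| over [-1, 1].

    Returns the final suboptimality f(x) - min f, which equals |x|.
    """
    for eta in stepsizes:
        g = (x > 0) - (x < 0)                   # a subgradient of |x| at x
        x = max(-1.0, min(1.0, x - eta * g))    # step, then project onto [-1, 1]
    return abs(x)

T = 100
eta = 0.3  # hypothetical, deliberately overestimated base stepsize

fixed_steps = [eta] * T
linear_steps = [eta * (1 - (t - 1) / T) for t in range(1, T + 1)]

fixed_gap = run_subgradient(fixed_steps)
linear_gap = run_subgradient(linear_steps)
print(fixed_gap, linear_gap)
```

With the fixed stepsize the iterate keeps jumping across the minimizer at a distance on the order of the stepsize, while the decaying schedule ends with steps small enough to settle close to it. This only illustrates the mechanism; the paper's guarantees concern the stochastic setting.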
2.2 Convergence Analysis with Stepsize Schedules
Here we present convergence guarantees for SGD using an annealed schedule. The tuned stepsizes and respective convergence rates will serve as the baseline for establishing a sublinear dependence on the misspecification parameter. For their proofs, see Appendix E.
Let be a differentiable -Lipschitz annealed schedule . We define the following two functions associated with :
$$H(v) \triangleq \int_v^1 h(u)\,du, \qquad Q(v) \triangleq \int_v^1 h(u)^2\,du. \tag{1}$$
Throughout, convergence bounds will be expressed in terms of and . We begin with the convex Lipschitz case.
Lemma 1.
Let be a convex set with diameter , a convex function, , and an unbiased first-order oracle of with second-moment bounded by . Let be the iterates produced by -steps SGD with stepsizes using the oracle , where is a differentiable -Lipschitz annealed schedule. Then it holds that
We denote the tuned stepsize and respective convergence guarantee (up to lower-order terms) by
(2) |
We proceed to the convergence guarantee in the convex smooth case.
Lemma 2.
Let be a convex set with diameter , a -smooth convex function, , and an unbiased first-order oracle of with variance bounded by . Let be the iterates produced by -steps SGD with stepsizes using the oracle , where is a differentiable -Lipschitz annealed schedule and . Then it holds that
Similarly, we denote the tuned stepsize over as
(3) |
and the respective convergence guarantee (up to lower-order terms) as
(4) |
As we previously mentioned, under the mild assumption that (the Lipschitz parameter of ), the guarantees match the rates of optimally tuned fixed stepsize SGD (see Appendix E for details). In both cases, a multiplicative overestimation of the optimal stepsize degrades the guarantee linearly (in the convex smooth case, for a large enough overestimation, and the guarantee does not even hold).
3 Convex and Lipschitz Setting
This section considers a convex objective where the second moment of the sub-gradient oracle is bounded. The main result of this section is a convergence guarantee that mitigates the imbalance caused by overestimation by automatically adapting to the tails of and . The key observation in obtaining this result is that any suffix of iterates can be viewed as a -steps SGD starting at , effectively ignoring the large stepsizes prior to step that would otherwise degrade the convergence bound.
Next, we present the general guarantee, followed by corollaries for specific schedules.
Theorem 1.
Let be a convex set with diameter , a convex function, , and an unbiased first-order oracle of with second-moment bounded by . For any , let be the iterates produced by -steps SGD with stepsizes using the oracle , where and is a differentiable -Lipschitz annealed schedule. Then it holds that
(5) |
where , , , and are given in Eqs. 1 and 2. In particular, the optimal satisfies (or if there is no solution).
First, note that for , Theorem 1 recovers up to low order terms, as the infimum is at most . Furthermore, as and both and are decreasing and equal at , the infimum adapts to the imbalance of the and terms which are introduced by the overestimation.
We defer the proof of Theorem 1 to Section 3.2. Following are corollaries for polynomially decaying and cosine annealing schedules which provide concrete examples for the power of Theorem 1.
Corollary 2.
We observe that for , the optimal rate is the same as tuned SGD with fixed stepsize (up to constants), while the dependence on is sublinear, as we aimed to achieve. The dependence might lead to the idea that a larger is always better, but as increases the optimal rate degrades at a rate of . In particular, using the convergence rate will be , and increasing beyond this point will not improve the final rate.
Proof of Corollary 2.
First note that is non-increasing, differentiable, -Lipschitz (since ) and satisfy . Hence, is annealed and we can use Theorem 1. A simple integration yields that , and . Thus,
We proceed to solve the optimality equation of Theorem 1, :
While this value is optimal, it may be negative for small , so we select a slightly sub-optimal value of which is always valid. Using this value,
(6) |
Hence, using this value to bound the infimum of Eq. 5,
We conclude by plugging the values of and to Eq. 2 (and using ),
We proceed to the cosine annealing guarantee. Given its similarities to Corollary 2, the proof is deferred to Appendix A.
Corollary 3.
Again we observe a sublinear dependence on with an optimal rate of . Note that this is the same behavior as in Corollary 2 with , which arises from the tail behavior of . To see that, one can verify that for all .
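One way to see the quadratic tail behavior is the following short derivation. It assumes the standard form of the cosine schedule, $h(u) = \tfrac{1}{2}(1+\cos(\pi u))$; the constants here are worked out independently and need not match the inequality stated in the paper.

```latex
% Assumption: h is the standard cosine schedule.
h(u) = \tfrac{1}{2}\bigl(1 + \cos(\pi u)\bigr)
     = \cos^2\!\bigl(\tfrac{\pi u}{2}\bigr)
     = \sin^2\!\bigl(\tfrac{\pi (1-u)}{2}\bigr).
% For y in [0,1], concavity of sine on [0, pi/2] gives
%   y <= sin(pi y / 2) <= pi y / 2.
% Taking y = 1 - u and squaring:
(1-u)^2 \;\le\; h(u) \;\le\; \tfrac{\pi^2}{4}\,(1-u)^2
\qquad \text{for all } u \in [0,1].
```

So, up to a constant factor of at most $\pi^2/4$, the cosine schedule is sandwiched by the degree-2 polynomial decay, consistent with the comparison to Corollary 2.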
3.1 Tighter Constants using Numerical Analysis

The constants of Corollary 3 are not tight; in particular, the bound is established using crude (up to constants) bounds for and . While a tighter bound can be obtained, the framework easily yields to numerical analysis as we demonstrate next.
The convergence guarantee of Theorem 1 is not posed as a closed-form equation but rather as a minimization over integrals that depend on the schedule. For a specific schedule and misspecification parameter, we use Scipy’s (Virtanen et al., 2020) quad integration to evaluate , and fsolve to solve the minimization of Theorem 1.
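The integral evaluations can be sketched as follows. A plain midpoint rule stands in for SciPy's quad so that the example has no dependencies, and the tail integrals H and Q are assumed to be the integrals of the schedule and of its square over [v, 1]. The minimization step with fsolve is omitted because it depends on the exact form of the bound in Theorem 1.

```python
import math

def tail_integral(fn, v, n=20000):
    """Midpoint-rule approximation of the integral of fn over [v, 1]."""
    width = (1.0 - v) / n
    return width * sum(fn(v + (i + 0.5) * width) for i in range(n))

def H(h, v):
    """Assumed form: integral of the schedule h over [v, 1]."""
    return tail_integral(h, v)

def Q(h, v):
    """Assumed form: integral of the squared schedule h^2 over [v, 1]."""
    return tail_integral(lambda u: h(u) ** 2, v)

def cosine(u):
    return 0.5 * (1.0 + math.cos(math.pi * u))

p = 2

def poly(u):
    return (1.0 - u) ** p

v = 0.25
# For polynomial decay the tails have closed forms, which gives a sanity check.
H_closed = (1.0 - v) ** (p + 1) / (p + 1)
Q_closed = (1.0 - v) ** (2 * p + 1) / (2 * p + 1)
print(H(poly, v), H_closed)
print(Q(poly, v), Q_closed)
print(H(cosine, 0.0), Q(cosine, 0.0))
```

For schedules without a closed-form tail, the same two helpers can be evaluated on a grid of v values and fed to a one-dimensional root finder or minimizer.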
In Fig. 1 we provide a numerical analysis for the convergence guarantee of Theorem 1 with several decaying schedules, including the cosine annealing, showing in particular that the convergence rate of SGD with cosine annealing is bounded by . We observe that the cosine annealing schedule and the quadratic decay schedule have similar convergence guarantees with a coefficient between to . In addition, even for a somewhat large misspecification parameter of size , the difference between cosine annealing and the different polynomial decaying schedules is at most a factor of , which indicates that even mild decay might be sufficient if the grid is not too coarse.
3.2 Proof of Theorem 1
Before proving our main theorem, we first state a few lemmas we will require. The first is a last-iterate convergence guarantee, using the techniques of Zamani and Glineur (2023); Liu and Zhou (2024) (proof appearing in Appendix C).
Lemma 3.
Let be a convex set, a convex function, and an unbiased first-order oracle of with second-moment bounded by . Let be the iterates produced by -steps SGD with stepsizes using the oracle . Then for any ,
Next is a key lemma, translating the suffix of the last-iterate bound in Lemma 3 to one based on integrating the stepsize schedule (proof given later in the section).
Lemma 4.
Let , , and for some differentiable -Lipschitz annealed schedule . Then for any ,
We proceed to the proof of Theorem 1.
Proof of Theorem 1.
Let and let . Consider the suffix as an SGD sequence starting at . By Lemma 3 with ,
As , by Lemma 4 with and ,
( and Eq. 2) | ||||
(, Eq. 1) |
This inequality holds for any , hence it holds for the infimum over all . It is left to find the which minimizes the bound. Let
(7) |
By the fundamental theorem of calculus,
and
Thus,
Hence, when satisfy , . For ,
so and . Similarly, for , and . Hence, satisfying is the minimizer. If no such exists, the derivative is always positive (as is continuous and ), and the minimizer is at . ∎
3.3 Proof of Lemma 4
Let . As is non-increasing, we can use integration to obtain the following bound,
(changing integration variables) | ||||
(Eq. 1) |
Again bounding by integration and changing variables,
As is differentiable, -Lipschitz, and , for any ,
(8) |
Hence, and since ,
Plugging back,
4 Convex and Smooth Setting
In the following section, we extend our robustness result to the convex smooth setting, in which we replace the second-moment gradient oracle assumption with the assumptions that the gradient oracle has bounded variance and that is -smooth. The core technique is the same as in Section 3, with some additional considerations due to the requirement in standard smooth analysis that the stepsizes satisfy for some constant .
Next is the main result of this section, a convergence guarantee robust to a multiplicative misspecification of the stepsize.
Theorem 4.
Let be a convex set with diameter , a -smooth convex function, , and an unbiased first-order oracle of with variance bounded by . For any , let be the iterates produced by -steps SGD with stepsizes using the oracle , where and is a differentiable -Lipschitz annealed schedule. Denote . Then it holds that
(9) |
where , , , and are given in Eqs. 1, 3 and 4. In particular, the optimal satisfies (or if there is no solution).
As in Theorem 1, in Theorem 4 we observe a similar adaptivity to using the tails of and . One small yet important difference is that the infimum is limited to the range , where denotes the fraction of iterations in which the stepsize exceeds . This dependency is somewhat unavoidable (up to constants) as stepsizes larger or equal to do not converge. Additionally, note that the above guarantee holds even if we specify a stepsize that is larger than , which is not the case with fixed stepsize SGD.
Next are corollaries of Theorem 4 with polynomial decay and cosine annealing schedules. Due to space constraints and similarities to the convex Lipschitz case, we defer the proofs of Theorem 4 and of the corollaries to Appendix B.
Corollary 5.
Corollary 6.
Observing Corollaries 5 and 6, a similar improved dependence on as in Corollaries 2 and 3 holds when is sufficiently small. When is large, we obtain the expected inverse dependence on the fraction of steps with small enough stepsizes, which is unavoidable as we explained above.
5 Experimental Evaluation
Our theory predicts that learning rate annealing schemes exhibit greater robustness to learning rate tuning compared to tuning a fixed learning rate. To support the prediction, we perform experiments to compare the performances of different scheduling strategies under varying grid search resolutions for learning rate tuning.
We conduct two types of experiments: the first involves a synthetic logistic regression task closely aligned with the theoretical setting, while the second involves training a neural network classifier.
5.1 Experimental setup
We consider common schedules, namely, fixed learning rate (as our baseline), in addition to the decaying cosine annealing, and linear decay schedules. To simulate varying grid resolutions, we train the models using a geometric grid of learning rates with a multiplicative factor of approximately (the values multiplied by with different ’s), and consider the different subsets with resolutions , etc. For example, with range and resolution of , we find the best model for each of the grids (see footnote 3) and report the average test loss/top-1 error across grids.
(Footnote 3: We average the performance over 3 runs per learning rate.)
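One plausible reading of this sub-grid procedure can be sketched as follows. It assumes that a coarser grid is formed by keeping every k-th point of the fine grid, once for each possible offset; the helper names and the loss values are hypothetical.

```python
def sub_grids(values, stride):
    """All sub-grids obtained by keeping every `stride`-th entry of `values`."""
    return [values[offset::stride] for offset in range(stride)]

def average_best(losses, stride):
    """Average, over the sub-grids, of the best loss found on each sub-grid."""
    bests = [min(sub) for sub in sub_grids(losses, stride)]
    return sum(bests) / len(bests)

# Hypothetical per-learning-rate losses on a fine geometric grid,
# ordered from the smallest to the largest learning rate.
fine_grid_losses = [5.0, 3.0, 1.0, 2.0, 4.0, 6.0]

for stride in (1, 2, 3):
    print(stride, average_best(fine_grid_losses, stride))
```

A stride of k on a grid with multiplicative factor c corresponds to a coarser grid with factor c to the power k, so the averaged best loss can only stay the same or get worse as the stride grows.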
Synthetic logistic regression.
In the synthetic experiment, we generate 100,000 samples of dimension 100, drawn from a normal distribution. Labels are assigned based on thresholding probabilities determined by a "true weights" vector of size 100, also sampled from a normal distribution. To introduce additional noise, we flip each label with a probability of 0.1. A test set of the same size is generated similarly. We train a linear classifier using binary cross-entropy loss, SGD without momentum, a batch size of 1,000, and a single epoch (updating the scheduler after each step). For the fixed learning rate scheduler, we report both the last iterate and the averaged iterate performances.
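The data generation can be sketched as follows, as a rough stand-in for the authors' setup. It assumes standard normal features and weights and a threshold of 0.5 on the sigmoid probability (equivalent to the sign of the logit). The name `make_dataset` is hypothetical, and the example uses a small size so that plain Python runs quickly.

```python
import random

def make_dataset(n, d, flip_prob, rng):
    """Generate n samples of dimension d with noisy linear-threshold labels."""
    w_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
    features, labels = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        logit = sum(wi * xi for wi, xi in zip(w_true, x))
        # sigmoid(logit) > 0.5 holds exactly when logit > 0.
        y = 1 if logit > 0 else 0
        if rng.random() < flip_prob:  # label noise
            y = 1 - y
        features.append(x)
        labels.append(y)
    return features, labels, w_true

# Small illustrative size; the paper uses 100,000 samples of dimension 100.
X, y, w = make_dataset(n=1000, d=10, flip_prob=0.1, rng=random.Random(0))
print(len(X), len(X[0]), sum(y))
```

An array library would be the natural choice at the paper's scale; the loop form is kept here only to make each step explicit.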
Wide ResNet on CIFAR-10.
We train a Wide ResNet 28-10 model (see footnote 4) (Zagoruyko and Komodakis, 2016) without dropout on the CIFAR-10 dataset (Krizhevsky, 2009). We train for 200 epochs, using a batch size of , Nesterov momentum of , and weight decay of . The scheduler is updated after each epoch. As the last iterate of fixed stepsize SGD is under-performing, we use polynomial averaging as proposed by Shamir and Zhang (2013), with parameter , following Ivgi et al. (2023).
(Footnote 4: We use the PyTorch implementation of Wide ResNet at https://github.com/bmsookim/wide-resnet.pytorch.)
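Polynomial-decay averaging in the style of Shamir and Zhang (2013) can be sketched as a running update. The function name is hypothetical, and the averaging parameter is left as an argument because its value is not legible in this copy; the value 8 in the example is only a placeholder.

```python
def polynomial_average(iterates, gamma):
    """Running polynomial-decay average of a sequence of iterates.

    At step t the new iterate gets weight (1 + gamma) / (t + gamma), so
    gamma = 0 is the uniform average and larger gamma favors late iterates.
    """
    avg = None
    for t, x in enumerate(iterates, start=1):
        weight = (1.0 + gamma) / (t + gamma)
        if avg is None:
            avg = list(x)  # at t = 1 the weight is exactly 1
        else:
            avg = [(1.0 - weight) * a + weight * xi for a, xi in zip(avg, x)]
    return avg

iterates = [[1.0], [2.0], [3.0]]
uniform = polynomial_average(iterates, gamma=0)
weighted = polynomial_average(iterates, gamma=8)  # placeholder parameter value
print(uniform, weighted)
```

In a training loop the same update would be applied to the model parameters after every optimizer step, keeping a single averaged copy of the weights.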
5.2 Results




The test loss per learning rate appears in Fig. 2(a). For each resolution, Fig. 2(b) illustrates the logistic regression test loss averaged across the best models for each sub-grid. At high resolutions (e.g., grid factors up to 10), we observe a comparable performance degradation across different schedules (except for the fixed stepsize without averaging, which underperforms). However, as grid resolution decreases, the gap between the fixed learning rate schedule and the decaying schedules widens. For instance, with a grid factor of approximately 100, the test loss of the fixed learning rate (with averaging) degrades by 0.08, whereas the cosine annealing and linear decay schedules degrade by only 0.01 and 0.014, respectively, with similar trends observed for grids with lower resolutions.
Fig. 3(b) shows the CIFAR-10 top-1 test error for each resolution, averaged over the best models per sub-grid, with the raw test error per learning rate appearing in Fig. 3(a). Similar to the logistic regression task, degradation remains similar for high resolutions while the gap between the fixed learning rate schedule and the decaying schedules widens for large grid factors. With a grid factor of approximately 22, the test error of the fixed learning rate degrades by 0.61, with smaller degradations of 0.3 and 0.35 observed for the cosine annealing and linear decay schedules, respectively, and the trend continues for grids with lower resolutions.
5.3 Discussion
The experiments show that decaying schedules are more robust to coarse grids, while performance differences on fine grids remain minimal. These findings align with our theory, which suggests that all decaying schedules perform similarly to iterate averaging under small multiplicative misspecification but outperform it when misspecification is large. However, our theory also predicts robustness variations across decay rates, which are not observed in the real-data experiments. A possible explanation is the small difference in convergence rates among decaying schedules when misspecification is low, as illustrated in Fig. 1.
Acknowledgements
We are grateful to Noga Bar, Yair Carmon and Tomer Porian for helpful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 101078075). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant number 3174/23), a grant from the Tel Aviv University Center for AI and Data Science (TAD), and a fellowship from the Israeli Council of Higher Education.
References
- Alacaoglu et al. (2020) A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, and V. Cevher. A new regret analysis for adam-type algorithms. In International conference on machine learning, pages 202–210. PMLR, 2020.
- Attia and Koren (2023) A. Attia and T. Koren. Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, 2023.
- Bengio (2012) Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012.
- Bottou (2012) L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
- Carmon and Hinder (2022) Y. Carmon and O. Hinder. Making sgd parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022.
- Chaudhuri et al. (2009) K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. Advances in neural information processing systems, 22, 2009.
- Cutkosky and Orabona (2018) A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learning in banach spaces. In Conference On Learning Theory, pages 1493–1529. PMLR, 2018.
- Defazio and Mishchenko (2023) A. Defazio and K. Mishchenko. Learning-rate-free learning by d-adaptation. In International Conference on Machine Learning, 2023.
- Defazio et al. (2024a) A. Defazio, A. Cutkosky, H. Mehta, and K. Mishchenko. Optimal linear decay learning rate schedules and further refinements. arXiv preprint arXiv:2310.07831, 2024a.
- Defazio et al. (2024b) A. Defazio, X. A. Yang, A. Khaled, K. Mishchenko, H. Mehta, and A. Cutkosky. The road less scheduled. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
- Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Faw et al. (2022) M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. A. Ward. The power of adaptivity in sgd: Self-tuning step sizes with unbounded gradients and affine variance. In COLT, 2022.
- Ge et al. (2019) R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. Advances in neural information processing systems, 32, 2019.
- Hu et al. (2024) S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, X. Zhang, Z. L. Thai, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, dahai li, Z. Liu, and M. Sun. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling, 2024.
- Ivgi et al. (2023) M. Ivgi, O. Hinder, and Y. Carmon. Dog is sgd’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
- Jain et al. (2019) P. Jain, D. Nagaraj, and P. Netrapalli. Making the last iterate of sgd information theoretically optimal. In Conference on Learning Theory, pages 1752–1755. PMLR, 2019.
- Kavis et al. (2019) A. Kavis, K. Y. Levy, F. Bach, and V. Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
- Kavis et al. (2022) A. Kavis, K. Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with adagrad stepsize. In International Conference on Learning Representations, 2022.
- Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
- Liu and Zhou (2024) Z. Liu and Z. Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024.
- Liu et al. (2023) Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen. High probability convergence of stochastic gradient methods. arXiv preprint arXiv:2302.14843, 2023.
- Loshchilov and Hutter (2017) I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
- Luo and Schapire (2015) H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304. PMLR, 2015.
- Mishchenko and Defazio (2023) K. Mishchenko and A. Defazio. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101, 2023.
- Orabona and Pál (2016) F. Orabona and D. Pál. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.
- Orabona and Pál (2021) F. Orabona and D. Pál. Parameter-free stochastic optimization of variationally coherent functions. arXiv preprint arXiv:2102.00236, 2021.
- Reddi et al. (2018) S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
- Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- Schaul et al. (2013) T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351. PMLR, 2013.
- Shamir and Zhang (2013) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79. PMLR, 2013.
- Smith (2017) L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
- Streeter and McMahan (2012) M. J. Streeter and H. B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Neural Information Processing Systems, 2012.
- Tran et al. (2019) P. T. Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716, 2019.
- Virtanen et al. (2020) P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
- Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
- Zamani and Glineur (2023) M. Zamani and F. Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
- Zhai et al. (2022) X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
Appendix A Proofs of Section 3
A.1 Proof of Corollary 3
Note that is non-increasing, differentiable (), -Lipschitz (as ) and satisfies . Hence, is annealed and by Theorem 1,
Next, we will bound and using polynomials. As and ,
(10)
On the other hand, for ,
Using the fundamental theorem of calculus, for all ,
(11)
Integrating, Eqs. 10 and 11 also imply that
(12)
Using the above inequalities,
(13)
and
(14)
Using the bounds, setting , and noting that ,
(15)
Thus,
Again noting that and using Eq. 13,
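As a sanity check on the polynomial bounds used in this proof, the following sketch writes out one elementary sandwich for the cosine schedule. It assumes the standard form $h(u) = \tfrac{1}{2}(1 + \cos(\pi u))$ on $[0, 1]$, which is an assumption here since the definition is not restated in this appendix, and its constants need not match those of Eqs. 10 to 14.

```latex
\begin{align*}
h(u) &= \tfrac{1}{2}\bigl(1 + \cos(\pi u)\bigr)
      = \tfrac{1}{2}\bigl(1 - \cos(\pi(1-u))\bigr)
      = \sin^2\!\bigl(\tfrac{\pi}{2}(1-u)\bigr), \\
h(u) &\le \tfrac{\pi^2}{4}(1-u)^2
      && \text{since } \sin y \le y \text{ for } y \ge 0, \\
h(u) &\ge (1-u)^2
      && \text{since } \sin y \ge \tfrac{2}{\pi}\, y \text{ for } y \in [0, \tfrac{\pi}{2}].
\end{align*}
```

Under this assumption, the cosine schedule is bounded above and below by constant multiples of $(1-u)^2$, so it decays to zero quadratically as $u \to 1$.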
Appendix B Proofs of Section 4
B.1 Proof of Theorem 4
In the proof, we use the following last-iterate guarantee for convex-smooth optimization in place of Lemma 3, which we used in the convex Lipschitz case. The lemma is based on the technique introduced by Liu and Zhou (2024), and the proof appears in Appendix C.
Lemma 5.
Let be a convex set, a convex function, and an unbiased first-order oracle of with variance bounded by . Let be the iterates produced by -step SGD with stepsizes (satisfying for all ) and using the oracle . Then for any ,
We proceed to the proof of Theorem 4.
Proof of Theorem 4.
Let and let . Consider the suffix as an SGD sequence starting at and note that since is non-increasing, . Thus, by Lemma 5 with ,
As , invoking Lemma 4 with and ,
Substituting and using Eqs. 3 and 4,
This inequality holds for any , hence it holds for the infimum over all . It remains to find the which minimizes the right-hand side. Let
This is the same function as in Eq. 7, so the same solution to is the minimizer of the function, and if there is no solution, the function is increasing (positive derivative) and the minimizer is at . ∎
B.2 Proof of Corollary 5
As in the proof of Corollary 2, is annealed as is non-increasing, differentiable, -Lipschitz and satisfies . Hence, we can use Theorem 4. In addition, , and , so
If we can pick , as . In this case,
If and , picking and using the -Lipschitz property of ,
and
where the last two transitions use and the assumption . Since , there is no case where and . Bounding the infimum of Eq. 9 in the two cases with our choices of , if ,
and if ,
Noting that by the assumption ,
we obtain our final convergence guarantees. The bound of follows from plugging and into Eq. 4. ∎
B.3 Proof of Corollary 6
As in the proof of Corollary 3, is annealed as is non-increasing, differentiable, -Lipschitz and satisfies . Hence, we can use Theorem 4. We already established in Eq. 15 of the proof of Corollary 3 that
If we can pick , as . In this case,
If and , picking , using the definition of and the Lipschitz property of ,
implying (with ) that
In addition to the assumption ,
where the last transition uses the assumption . Since , there is no case where and . Bounding the infimum of Eq. 9 in the two cases with our choices of , if ,
and if ,
We obtain our final convergence guarantees by noting that , which, together with the fact that implies
and plugging back into the above bounds. The bound of is immediate from Eq. 4 as and (as we established in Eq. 14).
∎
Appendix C Last Iterate Guarantees for Stochastic Gradient Descent
A convergence analysis of Stochastic Gradient Descent (SGD) for convex Lipschitz and convex smooth functions follows. The technique, introduced by Zamani and Glineur (2023) and later refined by Liu and Zhou (2024), is based on comparing the iterates of SGD with iterates of the form
(16)
for some non-increasing sequence , starting at some . Note that by Jensen’s inequality, for any ,
(17)
In particular, for any , we will use
(18)
and , similarly to Liu and Zhou (2024). Next, we restate the convergence results; their proofs follow. See Lemma 3 and Lemma 5.
C.1 Proof of Lemmas 3 and 5
To prove the last-iterate guarantees we need the following lemmas. Their proofs follow. The first translates from an average regret-like guarantee to a last-iterate guarantee.
Lemma 6.
Lemma 7.
Let be a convex set, , a convex function and . Then for any , the iterates of SGD satisfy
where are defined by Eq. 16.
We proceed to the proof.
Proof of Lemmas 3 and 5.
By Lemma 7,
where . By the definition of and the fact that ,
Combining with our previous inequality multiplied by ,
Summing for , and removing ,
Combining with Lemma 6, and noting that ,
(19)
Next, we assume a second-moment bound (as in Lemma 3). From convexity,
where we used the inequality . Similarly, . Hence, using the second-moment bound, . Plugging the bound of into Eq. 19 concludes the proof of Lemma 3. Next, we assume that is -smooth, that the oracle has bounded variance, and that for all (as in Lemma 5). By smoothness,
By the inequality ,
Hence, using the variance bound,
Plugging the bound of into Eq. 19 concludes the proof of Lemma 5. ∎
C.2 Proof of Lemma 6
Proof.
C.3 Proof of Lemma 7
Proof.
Using the convexity of ,
(20)
Focusing on the first term, as does not depend on ,
Note that the update step is
From the first-order optimality condition,
Rearranging,
Thus,
Returning to Eq. 20, we conclude that
Appendix D Sensitivity of Fixed Stepsize Gradient Descent to Misspecification of the Stepsize
Given a -Lipschitz function , where is a convex set with diameter , the standard average-iterate convergence guarantee of -step Gradient Descent (GD) with a fixed stepsize is
The optimal satisfies . Given a multiplicative overestimation of the optimal stepsize, for , the convergence guarantee is
A natural follow-up question is whether this linear dependence on is simply an artifact of the analysis or a true degradation in the convergence rate of GD. Next, we show that for any weights , the worst-case convergence rate of the (weighted) average iterate is .
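To make the upper-bound arithmetic explicit, the following is a sketch of the textbook computation. The notation is assumed here for illustration: $G$ is the Lipschitz constant, $D$ the diameter, $T$ the number of steps, $\eta$ the fixed stepsize, $\rho \ge 1$ the misspecification factor, $\bar{x}$ the average iterate and $x^\star$ a minimizer. The constants follow the standard analysis and may differ from those in the displayed guarantees above.

```latex
\begin{align*}
f(\bar{x}) - f(x^\star) &\le \frac{D^2}{2\eta T} + \frac{\eta G^2}{2}, \\
\eta^\star = \frac{D}{G\sqrt{T}} \quad &\Longrightarrow \quad
  f(\bar{x}) - f(x^\star) \le \frac{DG}{\sqrt{T}}, \\
\eta = \rho\,\eta^\star,\ \rho \ge 1 \quad &\Longrightarrow \quad
  f(\bar{x}) - f(x^\star) \le \frac{DG}{2\sqrt{T}}\Bigl(\frac{1}{\rho} + \rho\Bigr)
  \le \frac{\rho\, DG}{\sqrt{T}}.
\end{align*}
```

The second term of the bound, which grows with $\eta$, is the one that produces the linear dependence on $\rho$.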
Let , , , and . First we will assume that . Let for some , defined over the domain , and let . After a single gradient step, . After another update step, . Hence, the iterates will move back and forth between and , and the average iterate will satisfy
where we used our assumption that . Hence,
If, on the other hand, it holds that , we can initialize , and mirroring the same argument concludes the proof.
Hence, the worst-case convergence rate of fixed stepsize GD degrades linearly in a multiplicative misspecification of the stepsize. As GD is a special case of SGD, the lower bound also holds for SGD with a second-moment bound .
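The oscillation behind this lower bound can be reproduced numerically. The sketch below is an illustrative instance chosen for this example, not necessarily the exact construction used above. It assumes the one-dimensional function f(x) = G·|x − eps| on [0, D], the stepsize ρ·D/(G·√T), a start at x = ηG, and uniform averaging weights. The function name `fixed_step_gd_gap` and all parameter values are hypothetical.

```python
import math

def fixed_step_gd_gap(rho, T=100, G=1.0, D=1.0, eps=1e-3):
    """Suboptimality of the uniformly averaged iterate of projected GD on
    f(x) = G * |x - eps| over [0, D], with stepsize rho * D / (G * sqrt(T))."""
    eta = rho * D / (G * math.sqrt(T))        # stepsize overestimated by a factor rho
    assert eta * G <= D, "both oscillation points must lie inside the domain"
    x = eta * G                               # assumed starting point
    total = 0.0
    for _ in range(T):
        total += x
        grad = G if x > eps else -G           # a subgradient of G * |x - eps|
        x = min(max(x - eta * grad, 0.0), D)  # projected gradient step
    x_avg = total / T
    return G * abs(x_avg - eps)               # f(x_avg) - min f, where min f = 0

for rho in (1, 2, 4):
    print(rho, fixed_step_gd_gap(rho))
```

In this instance the iterates alternate between ηG and 0, so the uniform average sits near ηG/2 and the suboptimality is about ρ·D·G/(2√T) − G·eps, which doubles when ρ doubles.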
Appendix E Convergence Analysis with Stepsize Schedules
In this section, we provide convergence guarantees for SGD with an annealed schedule in the convex Lipschitz and convex smooth settings. The guarantees are established by combining a last-iterate guarantee with Lemma 4, which translates the sums of stepsizes to integrals that depend on the schedule. The proofs follow.
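The sum-to-integral step can be illustrated with the elementary Riemann-sum sandwich for a non-increasing schedule. This is a sketch of the underlying idea only, not a statement of Lemma 4. It assumes the cosine form h(u) = (1 + cos(πu))/2 and the polynomial form h(u) = (1 − u)^p, with stepsizes sampled at u = t/T. The indexing and the helper names are assumptions made for this example.

```python
import math

def cosine(u):
    # Assumed cosine schedule shape on [0, 1]
    return 0.5 * (1.0 + math.cos(math.pi * u))

def poly(p):
    # Assumed polynomial decay shape of degree p on [0, 1]
    return lambda u: (1.0 - u) ** p

def riemann_bounds(h, T):
    """Return (lower, upper) Riemann sums of a non-increasing h on [0, 1]."""
    upper = sum(h(t / T) for t in range(T)) / T          # left endpoints, t = 0..T-1
    lower = sum(h(t / T) for t in range(1, T + 1)) / T   # right endpoints, t = 1..T
    return lower, upper

T = 1000
lo_c, hi_c = riemann_bounds(cosine, T)    # brackets the integral 1/2
lo_p, hi_p = riemann_bounds(poly(2), T)   # brackets the integral 1/3
print(lo_c, hi_c)
print(lo_p, hi_p)
```

For a non-increasing h, the two sums differ by exactly (h(0) − h(1))/T, so the normalized sum of stepsizes approaches the integral of the schedule at rate 1/T.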
See 1
Note that when we tune according to Eq. 2, we obtain a convergence rate of
See 2
Similarly, when we tune according to Eq. 3, we obtain a convergence rate of
Note that using the fact that is non-increasing and the Lipschitz condition,
Additionally,
and using Eq. 8,
Hence, assuming and , and are , and the rates above match those of optimally tuned fixed stepsize SGD up to constant factors.