Benefits of Learning Rate Annealing for
Tuning-Robustness in Stochastic Optimization

Amit Attia Blavatnik School of Computer Science, Tel Aviv University; amitattia@mail.tau.ac.il. Tomer Koren Blavatnik School of Computer Science, Tel Aviv University, and Google Research Tel Aviv; tkoren@tauex.tau.ac.il.

Abstract

The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$ . Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.

1 Introduction

Stochastic Gradient Descent (SGD, Robbins and Monro, 1951) is a cornerstone of modern machine learning. Starting at a point $x_{1}$ , the update step of SGD takes the form $x_{t+1}=x_{t}-\eta_{t}g_{t}$ , where $\eta_{t}$ is the stepsize at step $t$ and $g_{t}$ is a stochastic gradient at $x_{t}$ . An effective stepsize sequence $\eta_{1},\eta_{2},\ldots$ is critical for performance, yet it is notoriously hard to tune in many scenarios and applications (e.g., Bottou, 2012; Schaul et al., 2013). Furthermore, as models continue to scale, the computational burden of stepsize tuning becomes increasingly demanding.

A common approach to tuning the stepsize sequence is simply using a fixed stepsize, selecting the best fixed value by performing a geometric grid search (Bengio, 2012). In this method, the stepsize is selected based on its performance on a validation set, with the grid resolution determining the (multiplicative) proximity to the best stepsize within the specified range.

A primary approach to moving beyond fixed stepsize sequences is stepsize scheduling. In stepsize scheduling (e.g., Smith, 2017; Loshchilov and Hutter, 2017; Ge et al., 2019), the step at time $t$ is determined by multiplying a baseline stepsize parameter with a parametric sequence. While the approach enables more versatile stepsize sequences and often leads to improved performance, it still requires tuning the baseline stepsize parameter, typically through grid search. Some stepsize schedules also exhibit theoretical benefits, such as anytime convergence guarantees and better last-iterate guarantees (e.g., Jain et al., 2019; Zamani and Glineur, 2023; Liu and Zhou, 2024; Defazio et al., 2024a).

While stepsize tuning is a widely adopted practice, its theoretical foundations remain under-explored. One key question is how sensitive this procedure is to the grid resolution. Limited computational budgets restrict the resolution of grid searches, an issue that has become increasingly prominent with the emergence of modern models consisting of billions of parameters that take days—sometimes weeks—to train. In fact, at massive scales, it is often the case that any methodological tuning of the stepsize is prohibitive and therefore abandoned entirely.

Standard analyses of fixed stepsize SGD in the convex setting demonstrate a linear degradation in convergence rate as a function of the multiplicative misspecification of the stepsize, which can be significant when performing a coarse—or even absent—grid search. This work investigates to what extent stepsize schedules can mitigate this dependency, providing more robust performance at lower grid resolutions.

Focusing our analysis on stochastic convex optimization, we establish convergence guarantees for SGD with stepsize schedules that decay polynomially to zero, which reveals a key advantage of automatically adapting to multiplicative overestimation of the stepsize. For commonly used schedules, such as cosine annealing, our guarantees yield a sublinear dependence on the misspecification factor, in contrast to the linear dependence that arises with fixed stepsizes. We further validate our theoretical findings through experiments on synthetic and real data, demonstrating improved robustness to stepsize tuning using decaying schedules compared to tuning a constant stepsize using a grid-search.

1.1 Summary of Contributions

In more detail, we consider stochastic first-order convex optimization settings, where we aim to minimize a convex objective $f:\mathcal{X}\to{\mathbb{R}}$ , where $\mathcal{X}\subset{\mathbb{R}}^{d}$ is a convex set with diameter $D$ , while accessing $f$ only through a (sub-)gradient oracle $g$ (i.e., $\mathbb{E}[g(x)]\in\partial f(x)$ for all $x\in\mathcal{X}$ ). Given an initial stepsize $\eta>0$ , a schedule is specified by a function $h:[0,1]\to[0,1]$ through $\eta_{t}=\eta h(\frac{t-1}{T})$ , where $T$ is the total number of SGD update steps.

Our main results are the following:

•

Our first main result in the convex (non-smooth) case, where we assume that the second moment of the oracle is bounded, is a convergence guarantee of the last iterate of $T$ -steps SGD using a decaying schedule $h$ (which satisfies some mild assumptions), of the form

\displaystyle O\brk*{\mathsf{Rate}_{h,T}^{\mathsf{tu}}}\cdot\!\inf_{\tau\in[0,1)}\brk[c]*{\frac{1}{\rho H_{h}(\tau)}+\rho Q_{h}(\tau)},

where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ is the convergence rate using a tuned stepsize, $H_{h}$ and $Q_{h}$ are certain functions that depend on the schedule $h$ , and $\rho=\ifrac{\eta}{\eta_{\mathsf{tu}}}\geq 1$ is the multiplicative overestimation factor of $\eta$ compared to the tuned stepsize $\eta_{\mathsf{tu}}$ . The infimum above is at most $O(\rho)$ , but as we discuss below, may become sublinear in $\rho$ depending on the particular schedule $h$ .

•

Our second main result deals with the convex smooth case, where we assume that $f$ is $\beta$ -smooth and that the oracle has a bounded variance. We obtain a similar convergence guarantee of

\displaystyle O\brk*{\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}}\cdot\inf_{\tau\in[\tau_{0},1)}\brk[c]*{\frac{1}{\rho H_{h}(\tau)}+\rho Q_{h}(\tau)},

where $\tau_{0}$ is the fraction of steps with $\eta_{t}>1/(2\beta)$ , $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ is the convergence rate with a tuned stepsize, and $\rho\geq 1$ is the multiplicative overestimation factor compared to the tuned stepsize. The dependence on $\tau_{0}$ is unavoidable (up to constants), as convergence in smooth optimization requires a stepsize smaller than $\ifrac{2}{\beta}$ . For sufficiently small $\tau_{0}$ , the infimum is again $O(\rho)$ in the worst case.

•

Applying our main result to the cosine annealing schedule in the convex Lipschitz case, we obtain that the last-iterate convergence rate of SGD is $O(\ifrac{\rho^{0.2}DG}{\sqrt{T}})$ . Similarly, applying the same result to the polynomially decaying schedule $(1-\frac{t-1}{T})^{p}$ for some constant degree $p\geq 1$ , we obtain that the last-iterate convergence rate of SGD is $O(\ifrac{\rho^{1/(2p+1)}DG}{\sqrt{T}})$ . In the convex smooth case, assuming $\eta_{1}=\eta h(0)\leq 1/2\beta$ , we obtain the same multiplicative sub-optimality of $\rho^{0.2}$ and $\rho^{1/(2p+1)}$ for cosine annealing and (degree $p$ -)polynomially decaying schedules.
•

Additionally, we validate the robustness of various learning rate schedules to tuning in experiments, by performing grid search on two tasks: a synthetic logistic regression task with a linear model and the CIFAR-10 classification task with a deep neural network. We find that, when using a coarse grid, annealing schemes—specifically cosine annealing and linear decay—demonstrate greater robustness compared to a fixed step size schedule.

Our theoretical results show that polynomially decaying schedules, including cosine annealing, achieve convergence rates with a sublinear dependence on the misspecification factor, in contrast to the linear dependence observed in SGD with a fixed stepsize (which we demonstrate in detail in Appendix D). This distinction is particularly striking since, while both fixed and annealed stepsizes are able to attain the optimal convergence rate when properly tuned, the latter exhibits significantly greater robustness to parameter misspecifications. When tuning the stepsize using a coarse grid search under a limited computational budget, this difference in robustness can significantly impact performance, as also seen in our synthetic and real-data experiments.

1.2 Additional Related Work

Adaptive and parameter-free methods.

Beyond learning rate scheduling, several approaches have been developed to minimize the need for extensive tuning in first-order optimization. These include adaptive methods, such as AdaGrad and Adam (e.g., Duchi et al., 2011; Kingma and Ba, 2015), as well as recent theoretical advancements (Reddi et al., 2018; Tran et al., 2019; Kavis et al., 2019; Alacaoglu et al., 2020; Faw et al., 2022; Kavis et al., 2022; Attia and Koren, 2023; Liu et al., 2023), which utilize gradient statistics to dynamically adjust learning rates. Additionally, parameter-free methods (e.g., Chaudhuri et al., 2009; Streeter and McMahan, 2012; Luo and Schapire, 2015; Orabona and Pál, 2016; Cutkosky and Orabona, 2018; Orabona and Pál, 2021; Carmon and Hinder, 2022) primarily focus on automatically adapting to the problem’s complexity, such as the distance to the optimal solution. Recently, several parameter-free approaches demonstrated impressive practical performance, narrowing the gap to finely-tuned methods (Ivgi et al., 2023; Defazio and Mishchenko, 2023; Mishchenko and Defazio, 2023). While these approaches take different paths to reduce tuning, adaptive methods and scheduling schemes are often used together in practice.

Theoretical analyses of stepsize annealing.

Several studies have analyzed different stepsize schedules. The influential work of Jain et al. (2019) showed that the schedules $\eta_{t}=\eta/t$ and $\eta_{t}=\eta/\sqrt{t}$ yield suboptimal last-iterate guarantees and proposed a new schedule with optimal last-iterate performance. Later, Defazio et al. (2024a) demonstrated that a linear decay schedule also achieves an optimal last-iterate guarantee. Additionally, Defazio et al. (2024b) introduced “schedule-free” SGD, which eliminates the need to know the training length $T$ in advance. While these works focus on optimality with well-tuned stepsizes and last-iterate guarantees, our work examines the robustness of these schedules when the stepsize is not finely tuned. Additionally, new scheduling schemes continue to emerge, such as those proposed by Zhai et al. (2022) and Hu et al. (2024), which incorporate a cooldown phase to accommodate varying training durations. The robustness perspective we propose helps us better understand the benefits of different schedules and guides the design of more robust ones.

2 Preliminaries

2.1 Problem Setup

In this work, we are interested in first-order stochastic optimization over a bounded domain within the $d$ -dimensional Euclidean space, $\mathbb{R}^{d}$ , equipped with the Euclidean norm, defined as $\norm{\cdot}\triangleq\norm{\cdot}_{2}$ . Let $\mathcal{X}\subset{\mathbb{R}}^{d}$ be a convex set with diameter $D$ (i.e., for all $x,y\in\mathcal{X}$ , $\norm{x-y}\leq D$ ) and let $f:\mathcal{X}\to{\mathbb{R}}$ be a convex function. Our goal is to find some $\overline{x}\in\mathcal{X}$ such that $f(\overline{x})-\min_{x\in\mathcal{X}}f(x)$ is small, where we access $f$ only through an unbiased sub-gradient oracle $g:\mathcal{X}\to{\mathbb{R}}^{d}$ (i.e., $\mathbb{E}[g(x)]\in\partial f(x)$ for all $x\in\mathcal{X}$ , where we denote with a slight abuse of notation $\nabla f(x)\triangleq\mathbb{E}[g(x)]$ ). We consider two optimization scenarios:

(i)

Convex and Lipschitz setting. Here we assume $g$ has a second moment bound, that is, for some $G>0$ , $\mathbb{E}\norm{g(x)}^{2}\leq G^{2}$ for all $x\in\mathcal{X}$ . This implies in particular that $f$ is $G$ -Lipschitz.
(ii)

Convex and smooth setting. In this scenario we assume that $f$ is $\beta$ -smooth (a function $f:\mathcal{X}\to{\mathbb{R}}$ is said to be $\beta$ -smooth if $\norm{\nabla f(x)-\nabla f(y)}\leq\beta\norm{x-y}$ for all $x,y\in\mathcal{X}$ ; in particular, this implies that $\abs{f(y)-f(x)-\nabla f(x)\bm{\cdot}(y-x)}\leq\frac{\beta}{2}\norm{y-x}^{2}$ for all $x,y\in\mathcal{X}$ ), and instead of a second moment bound we assume that $g$ has a variance bound, that is, for some $\sigma>0$ , $\mathbb{E}[\norm{g(x)-\nabla f(x)}^{2}]\leq\sigma^{2}$ for all $x\in\mathcal{X}$ .

Stochastic gradient descent.

We will analyze the (projected) Stochastic Gradient Descent (SGD) algorithm, which starts at some $x_{1}\in\mathcal{X}$ and performs update steps of the form $x_{t+1}=\Pi_{\mathcal{X}}\brk*{x_{t}-\eta_{t}g_{t}}$ , where $\eta_{t}$ is the stepsize at step $t$ , $g_{t}=g(x_{t})$ is a stochastic sub-gradient at $x_{t}$ , and $\Pi_{\mathcal{X}}\brk*{\cdot}$ is the Euclidean projection to $\mathcal{X}$ . The output of $T$ -steps SGD is typically some average of the iterates or the last iterate. The convergence rate guarantee of the average iterate of fixed stepsize SGD with tuned stepsize is $O(\ifrac{DG}{\sqrt{T}})$ in the convex Lipschitz case and $O(\ifrac{\beta D^{2}}{T}+\ifrac{D\sigma}{\sqrt{T}})$ in the convex smooth case (See, e.g., Lan, 2012).
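The update rule above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: the domain is taken to be a Euclidean ball centered at the origin (so $D$ is twice the radius), the objective is the hypothetical $f(x)=\norm{x-c}_{1}$, and the function names are made up for the example.

```python
import math
import random

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x|| <= radius} (diameter D = 2 * radius)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def projected_sgd(x1, oracle, stepsizes, radius):
    """Run x_{t+1} = Pi_X(x_t - eta_t * g_t); return the iterates x_1, ..., x_{T+1}."""
    iterates = [list(x1)]
    for eta_t in stepsizes:
        x = iterates[-1]
        g = oracle(x)
        step = [xi - eta_t * gi for xi, gi in zip(x, g)]
        iterates.append(project_ball(step, radius))
    return iterates

def sign(z):
    return (z > 0) - (z < 0)

# Hypothetical objective f(x) = ||x - c||_1 with sub-gradient sign(x - c);
# zero-mean Gaussian noise keeps the oracle unbiased with a bounded second moment.
rng = random.Random(0)
c = [0.3, -0.2]

def noisy_subgradient(x):
    return [sign(xi - ci) + rng.gauss(0.0, 0.1) for xi, ci in zip(x, c)]

T = 1000
fixed_steps = [0.5 / math.sqrt(T)] * T  # fixed stepsize, chosen by hand for the demo
iterates = projected_sgd([1.0, 0.0], noisy_subgradient, fixed_steps, radius=1.0)
last = iterates[-1]
print(sum(abs(xi - ci) for xi, ci in zip(last, c)))  # f(x_{T+1}) - min f
```

Returning every iterate, rather than only the last one, makes it easy to compare last-iterate and averaged outputs on the same run.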

Stepsize scheduling.

Our focus will be on stepsizes of the form $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ , for some $\eta>0$ and $h:[0,1]\to[0,1]$ , where $T\in{\mathbb{N}}$ is the number of SGD steps that are performed. Common schedules include $h(u)=1$ (fixed stepsize), $h(u)=\frac{1}{2}+\frac{1}{2}\cos(\pi u)$ (cosine annealing), and $h(u)=(1-u)^{p}$ for some $p\geq 1$ (polynomial decay). In particular, we will assume that $h(u)$ is monotonically non-increasing and satisfies $h(u)=0\Leftrightarrow u=1$ ; we will call such a schedule annealed for brevity. We additionally assume for technical reasons that the annealed schedules we consider are differentiable and $p$ -Lipschitz. Using an annealed schedule, SGD with a properly tuned stepsize yields the same rate as optimally tuned fixed stepsize SGD, up to constant factors (where we treat $p$ as a constant). See Appendix E for additional details. Notable annealed schedules include cosine annealing and polynomial decay.
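As a concrete reading of $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$, the following Python sketch builds the stepsize sequence for the three schedules named above. The function names are chosen for illustration only.

```python
import math

def fixed(u):
    return 1.0

def cosine(u):
    return 0.5 + 0.5 * math.cos(math.pi * u)

def polynomial(p):
    """Polynomial decay h(u) = (1 - u)^p of degree p >= 1."""
    return lambda u: (1.0 - u) ** p

def stepsizes(eta, h, T):
    """eta_t = eta * h((t - 1) / T) for t = 1, ..., T."""
    return [eta * h((t - 1) / T) for t in range(1, T + 1)]

T = 8
for name, h in [("fixed", fixed), ("cosine", cosine), ("linear", polynomial(1))]:
    print(name, [round(s, 3) for s in stepsizes(0.1, h, T)])
```

Because the schedule is evaluated at $\frac{t-1}{T}$ rather than $\frac{t}{T}$, the first stepsize equals $\eta h(0)$ and the last stepsize $\eta h(\frac{T-1}{T})$ is still strictly positive for annealed schedules.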

Robustness to stepsize misspecification.

Fixing an initialization $x_{1}\in\mathcal{X}$ and a stepsize schedule $h(\cdot)$ , it remains to tune the base stepsize $\eta$ . Considering a tuned stepsize $\eta_{\mathsf{tu}}$ (by tuned we mean a stepsize that minimizes a corresponding convergence guarantee that depends on $\eta$ , possibly ignoring lower-order terms for simplicity), we investigate the sensitivity of SGD when the stepsize is only tuned up to a multiplicative misspecification factor $\rho\geq 1$ (i.e., stepsize $\eta=\rho\eta_{\mathsf{tu}}$ , where $\rho$ is of course unknown to the algorithm). In this case, the convergence rate will likely degrade as $\rho$ increases. For instance, the standard guarantee of fixed stepsize SGD degrades linearly in $\rho$ ; we demonstrate this fact in the convex Lipschitz setting in Appendix D.

Our main inquiry is to what extent stepsize schedules can mitigate this degradation, enabling more robust performance when the stepsize is crudely tuned (e.g., when tuned using a coarse grid search), and achieving convergence rates with sublinear dependence on $\rho$ , for $\rho\geq 1$ .
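To make the link between grid resolution and $\rho$ concrete, the sketch below builds a geometric grid and measures the overestimation factor of the smallest grid point that is at least the tuned stepsize. It is a simplification under stated assumptions: real grid search selects by validation performance rather than by comparing to $\eta_{\mathsf{tu}}$, and the value of `eta_tuned` and the grid range are hypothetical.

```python
def geometric_grid(lo, hi, ratio):
    """Stepsizes lo, lo * ratio, lo * ratio^2, ... up to the first value >= hi."""
    grid = [lo]
    while grid[-1] < hi:
        grid.append(grid[-1] * ratio)
    return grid

def overestimation_factor(grid, eta_tuned):
    """rho = eta / eta_tuned for the smallest grid point eta >= eta_tuned."""
    eta = min(g for g in grid if g >= eta_tuned)
    return eta / eta_tuned

eta_tuned = 0.0137  # hypothetical tuned stepsize, unknown in practice
for ratio in (2.0, 10.0, 100.0):
    grid = geometric_grid(1e-5, 1.0, ratio)
    print(ratio, len(grid), overestimation_factor(grid, eta_tuned))
```

A grid with multiplicative ratio $r$ that covers $\eta_{\mathsf{tu}}$ always contains a point with $1\leq\rho<r$, so a coarser grid needs fewer training runs but allows a larger $\rho$.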

2.2 Convergence Analysis with Stepsize Schedules

Here we present convergence guarantees for SGD using an annealed schedule. The tuned stepsizes and respective convergence rates will serve as the baseline for establishing a sublinear dependence on the misspecification parameter. For their proofs, see Appendix E.

Let $h$ be a differentiable $p$ -Lipschitz annealed schedule. We define the following two functions associated with $h$ :

\displaystyle H_{h}(v)\triangleq\int_{v}^{1}h(u)du\qquad\text{ and }\qquad Q_{h}(v)\triangleq\int_{v}^{1}\frac{H_{h}^{\prime}(u)^{2}}{H_{h}(u)}du.

(1)
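For intuition, $H_{h}$ and $Q_{h}$ can be approximated numerically. The sketch below uses a plain composite midpoint rule (an assumption made for self-containedness, not the method used later in the text) and compares against the closed forms for quadratic decay, $H_{h}(0)=\frac{1}{p+1}$ and $Q_{h}(0)=\frac{p+1}{p}$ with $p=2$, which are derived in the proof of Corollary 2.

```python
def midpoint_integral(func, a, b, n=400):
    """Composite midpoint rule; it never evaluates func at the endpoints."""
    width = (b - a) / n
    return width * sum(func(a + (i + 0.5) * width) for i in range(n))

def H(h, v):
    """H_h(v): integral of h over [v, 1]."""
    return midpoint_integral(h, v, 1.0)

def Q(h, v):
    """Q_h(v): integral of H_h'(u)^2 / H_h(u) over [v, 1], using H_h'(u) = -h(u)."""
    return midpoint_integral(lambda u: h(u) ** 2 / H(h, u), v, 1.0)

def quadratic(u):
    return (1.0 - u) ** 2  # polynomial decay with p = 2

print(H(quadratic, 0.0), Q(quadratic, 0.0))  # closed forms: 1/3 and 3/2
```

The midpoint rule is convenient here because the integrand of $Q_{h}$ has $H_{h}(1)=0$ in its denominator, and the rule never touches $u=1$.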

Throughout, convergence bounds will be expressed in terms of $H_{h}$ and $Q_{h}$ . We begin with the convex Lipschitz case.

Lemma 1.

Let $\mathcal{X}\subset{\mathbb{R}}^{d}$ be a convex set with diameter $D>0$ , $f:\mathcal{X}\to{\mathbb{R}}$ a convex function, $x^{\star}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$ , and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^{2}>0$ . Let $x_{1},x_{2},\ldots,x_{T+1}$ be the iterates produced by $T$ -steps SGD with stepsizes $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ using the oracle $g$ , where $h$ is a differentiable $p$ -Lipschitz annealed schedule. Then it holds that

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(0)}+2\eta G^{2}Q_{h}(0)+\frac{8p% \eta G^{2}}{T}.

We denote the tuned stepsize and respective convergence guarantee (up to lower-order terms) by

\displaystyle\eta_{\mathsf{tu}}

\displaystyle\triangleq\frac{D}{2G\sqrt{TH_{h}(0)Q_{h}(0)}}\qquad\text{and}% \qquad\mathsf{Rate}_{h,T}^{\mathsf{tu}}\triangleq\frac{2DG}{\sqrt{T}}\sqrt{Q_{% h}(0)/H_{h}(0)}.

(2)
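The following sketch evaluates the leading terms of the Lemma 1 bound at $\eta=\rho\eta_{\mathsf{tu}}$ and shows the linear degradation in $\rho$ that the later sections aim to avoid. The numeric values of $D$, $G$, and $T$ are arbitrary, and linear decay is used because its constants $H_{h}(0)=\frac{1}{2}$ and $Q_{h}(0)=2$ follow from the closed forms in the proof of Corollary 2.

```python
import math

def lemma1_bound(eta, D, G, T, H0, Q0):
    """Leading terms of the Lemma 1 bound (the O(1/T) term is dropped)."""
    return D ** 2 / (2 * eta * T * H0) + 2 * eta * G ** 2 * Q0

def tuned_stepsize(D, G, T, H0, Q0):
    """eta_tu from Eq. 2."""
    return D / (2 * G * math.sqrt(T * H0 * Q0))

def tuned_rate(D, G, T, H0, Q0):
    """Rate^tu from Eq. 2."""
    return 2 * D * G / math.sqrt(T) * math.sqrt(Q0 / H0)

# Linear decay h(u) = 1 - u: H_h(0) = 1/2 and Q_h(0) = 2.
D, G, T, H0, Q0 = 1.0, 1.0, 10_000, 0.5, 2.0
eta_tu = tuned_stepsize(D, G, T, H0, Q0)
for rho in (1, 4, 16):
    ratio = lemma1_bound(rho * eta_tu, D, G, T, H0, Q0) / tuned_rate(D, G, T, H0, Q0)
    print(rho, ratio)  # equals (rho + 1 / rho) / 2
```

At $\rho=1$ the two terms of the bound balance exactly, which is what makes $\eta_{\mathsf{tu}}$ the minimizer; for $\rho>1$ the second term dominates and grows like $\rho$.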

We proceed to the convergence guarantee in the convex smooth case.

Lemma 2.

Let $\mathcal{X}\subset{\mathbb{R}}^{d}$ be a convex set with diameter $D>0$ , $f:\mathcal{X}\to{\mathbb{R}}$ a $\beta$ -smooth convex function, $x^{\star}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$ , and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^{2}\geq 0$ . Let $x_{1},x_{2},\ldots,x_{T+1}$ be the iterates produced by $T$ -steps SGD with stepsizes $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ using the oracle $g$ , where $h$ is a differentiable $p$ -Lipschitz annealed schedule and $\eta h(0)\leq\frac{1}{2\beta}$ . Then it holds that

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(0)}+\eta\sigma^{2}Q_{h}(0)+\frac{4p% \eta\sigma^{2}}{T}.

Similarly, we denote the tuned stepsize over $\eta\in(0,\frac{1}{2\beta h(0)}]$ as

\displaystyle\eta_{\mathsf{tu}}^{\mathsf{sm}}

\displaystyle\triangleq\min\set*{\frac{1}{2\beta h(0)},\frac{D}{\sigma\sqrt{2% TH_{h}(0)Q_{h}(0)}}},

(3)

and the respective convergence guarantee (up to lower-order terms) as

\displaystyle\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}

\displaystyle\triangleq\frac{D^{2}}{2\eta_{\mathsf{tu}}^{\mathsf{sm}}TH_{h}(0)% }+\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}Q_{h}(0)\leq\frac{\beta D^{2}h(0)}% {TH_{h}(0)}+\frac{D\sigma}{\sqrt{T}}\sqrt{2Q_{h}(0)/H_{h}(0)}.

(4)

As we previously mentioned, under the mild assumption that $p=\Theta(1)$ (the Lipschitz parameter of $h$ ), the guarantees match the rates of optimally tuned fixed stepsize SGD (see Appendix E for details). In both cases, a multiplicative overestimation of the optimal stepsize degrades the guarantee linearly (in the convex smooth case, for a large enough overestimation, $\eta_{1}>\frac{1}{2\beta}$ and the guarantee does not even hold).
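A small sketch of Eqs. 3 and 4 may help in reading the smooth-case quantities. The parameter values below are arbitrary, the function names are illustrative, and the special case for $\sigma=0$ is an implementation detail added to avoid a division by zero.

```python
import math

def tuned_stepsize_smooth(D, sigma, beta, T, h0, H0, Q0):
    """Eq. 3: the variance-balancing stepsize, capped at 1 / (2 * beta * h(0))."""
    cap = 1.0 / (2.0 * beta * h0)
    if sigma == 0.0:
        return cap
    return min(cap, D / (sigma * math.sqrt(2.0 * T * H0 * Q0)))

def smooth_rate(eta, D, sigma, T, H0, Q0):
    """Leading terms of the Lemma 2 bound at stepsize eta."""
    return D ** 2 / (2 * eta * T * H0) + eta * sigma ** 2 * Q0

def smooth_rate_upper(D, sigma, beta, T, h0, H0, Q0):
    """Right-hand side of Eq. 4."""
    return beta * D ** 2 * h0 / (T * H0) + D * sigma / math.sqrt(T) * math.sqrt(2 * Q0 / H0)

D, sigma, beta, T = 1.0, 0.5, 10.0, 10_000
h0, H0, Q0 = 1.0, 0.5, 2.0  # linear decay h(u) = 1 - u
eta = tuned_stepsize_smooth(D, sigma, beta, T, h0, H0, Q0)
print(eta, smooth_rate(eta, D, sigma, T, H0, Q0),
      smooth_rate_upper(D, sigma, beta, T, h0, H0, Q0))
```

When the cap is active the $\ifrac{\beta D^{2}}{T}$ term dominates, and otherwise the two terms of the bound balance, so Eq. 4 covers both cases with a single expression.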

3 Convex and Lipschitz Setting

This section considers a convex objective where the second moment of the sub-gradient oracle is bounded. The main result of this section is a convergence guarantee that mitigates the imbalance caused by overestimation by automatically adapting to the tails of $H_{h}(v)$ and $Q_{h}(v)$ . The key observation in obtaining this result is that any suffix of iterates $x_{k},\ldots,x_{T+1}$ can be viewed as a $(T-k+1)$ -steps SGD starting at $x_{k}$ , effectively ignoring the large stepsizes prior to step $k$ that would otherwise degrade the convergence bound.

Next, we present the general guarantee, followed by corollaries for specific schedules.

Theorem 1.

Let $\mathcal{X}\subset{\mathbb{R}}^{d}$ be a convex set with diameter $D>0$ , $f:\mathcal{X}\to{\mathbb{R}}$ a convex function, $x^{\star}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$ , and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^{2}>0$ . For any $\rho\geq 1$ , let $x_{1},x_{2},\ldots,x_{T+1}$ be the iterates produced by $T$ -steps SGD with stepsizes $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ using the oracle $g$ , where $\eta=\rho\cdot\eta_{\mathsf{tu}}$ and $h$ is a differentiable $p$ -Lipschitz annealed schedule. Then it holds that

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]\leq\frac{1}{2}\mathsf{Rate}_{% h,T}^{\mathsf{tu}}\cdot\inf_{\tau\in[0,1)}\brk*{\frac{H_{h}(0)}{\rho H_{h}(% \tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}}% G^{2}}{T}},

(5)

where $H_{h}$ , $Q_{h}$ , $\eta_{\mathsf{tu}}$ , and $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ are given in Eqs. 1 and 2. In particular, the optimal $\tau$ satisfies $H_{h}(\tau)H_{h}^{\prime}(\tau)=\ifrac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ (or $\tau=0$ if there is no solution).

First, note that for $\rho=1$ , Theorem 1 recovers $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ up to lower-order terms, as the infimum is at most $2$ (attained at $\tau=0$ ). Furthermore, as $\rho\geq 1$ and both $H_{h}(v)$ and $Q_{h}(v)$ are decreasing and equal $0$ at $v=1$ , the infimum adapts to the imbalance of the $\frac{1}{\rho}$ and $\rho$ terms which are introduced by the overestimation.

We defer the proof of Theorem 1 to Section 3.2. Following are corollaries for polynomially decaying and cosine annealing schedules which provide concrete examples for the power of Theorem 1.

Corollary 2.

In the setting of Theorem 1, assuming $h(u)=(1-u)^{p}$ for some $p\geq 1$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\rho^{\frac{1}{2p+1}}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}}G^{2}}{T}},

where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}=\frac{p+1}{\sqrt{p}}\cdot\frac{2DG}{\sqrt{T}% }=O\brk*{\frac{\sqrt{p}DG}{\sqrt{T}}}$ .

We observe that for $p=\Theta(1)$ , the optimal rate is the same as tuned SGD with fixed stepsize (up to constants), while the dependence on $\rho\geq 1$ is sublinear, as we aimed to achieve. The dependence $\rho^{\frac{1}{2p+1}}$ might lead to the idea that a larger $p$ is always better, but as $p$ increases the optimal rate degrades at a rate of $O(\sqrt{p})$ . In particular, using $p=\Theta(\log\rho)$ the convergence rate will be $O\brk{\ifrac{DG\sqrt{\log\rho}}{\sqrt{T}}}$ , and increasing beyond this point will not improve the final rate.
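The trade-off in $p$ can be checked directly from the leading term of Corollary 2, whose coefficient of $\ifrac{DG}{\sqrt{T}}$ is $\frac{2(p+1)}{\sqrt{p}}\rho^{\frac{1}{2p+1}}$. The sketch below searches over integer degrees only, and the cap `max_p` is an arbitrary choice for the example.

```python
import math

def poly_coefficient(p, rho):
    """Coefficient of DG / sqrt(T): 2 (p + 1) / sqrt(p) * rho^(1 / (2p + 1))."""
    return 2.0 * (p + 1) / math.sqrt(p) * rho ** (1.0 / (2 * p + 1))

def best_integer_degree(rho, max_p=64):
    """Integer degree in [1, max_p] minimizing the coefficient above."""
    return min(range(1, max_p + 1), key=lambda p: poly_coefficient(p, rho))

for rho in (1.0, 10.0, 1e3, 1e6):
    p = best_integer_degree(rho)
    print(rho, p, poly_coefficient(p, rho), poly_coefficient(1, rho))
```

With no misspecification the smallest degree is best, and the preferred degree grows slowly as $\rho$ increases, in line with the $p=\Theta(\log\rho)$ remark.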

Proof of Corollary 2.

First note that $h(u)=(1-u)^{p}$ is non-increasing, differentiable, $p$ -Lipschitz (since $\abs{h^{\prime}(u)}=p(1-u)^{p-1}\leq p$ ) and satisfies $h(u)=0\Leftrightarrow u=1$ . Hence, $h$ is annealed and we can use Theorem 1. A simple integration yields that $H_{h}(\tau)=\frac{1}{p+1}(1-\tau)^{p+1}$ , and $H_{h}^{\prime}(\tau)=-(1-\tau)^{p}$ . Thus,

\displaystyle Q_{h}(\tau)

\displaystyle=\int_{\tau}^{1}\frac{H_{h}^{\prime}(u)^{2}}{H_{h}(u)}du=\frac{p+% 1}{p}(1-\tau)^{p}.

We proceed to solve the optimality equation of Theorem 1, $H_{h}(\bar{\tau})H_{h}^{\prime}(\bar{\tau})=\frac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ :

\displaystyle\frac{-(1-\bar{\tau})^{2p+1}}{p+1}

\displaystyle=\frac{-1}{p\rho^{2}}\implies\bar{\tau}=1-\brk*{\frac{p+1}{p\rho^% {2}}}^{\frac{1}{2p+1}}.

While this value is optimal, it may be negative for small $\rho$ , so we select a slightly sub-optimal value of $\bar{\tau}=1-\rho^{\frac{-2}{2p+1}}\in[0,1)$ which is always valid. Using this value,

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})}+\frac{\rho Q_{h}(\bar{% \tau})}{Q_{h}(0)}=\frac{(1-\bar{\tau})^{-(p+1)}}{\rho}+\rho(1-\bar{\tau})^{p}

\displaystyle=\frac{\rho^{\frac{2(p+1)}{2p+1}}}{\rho}+\rho\rho^{\frac{-2p}{2p+% 1}}=2\rho^{\frac{1}{2p+1}}.

(6)

Hence, using this value to bound the infimum of Eq. 5,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\rho^{\frac{1}{2p+1}}+O% \brk*{\frac{p\rho\eta_{\mathsf{tu}}G^{2}}{T}}.

We conclude by plugging the values of $H_{h}(0)$ and $Q_{h}(0)$ to Eq. 2 (and using $p\geq 1$ ),

\displaystyle\mathsf{Rate}_{h,T}^{\mathsf{tu}}

\displaystyle=\frac{2DG}{\sqrt{T}}\sqrt{\frac{(p+1)^{2}}{p}}=O\brk*{\frac{% \sqrt{p}DG}{\sqrt{T}}}.\qed
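The algebra in Eq. 6 can be sanity-checked numerically. The sketch below evaluates the bracketed term of Theorem 1 for $h(u)=(1-u)^{p}$ using the closed forms of $H_{h}$ and $Q_{h}$ from the proof, at both the simplified choice $\bar{\tau}=1-\rho^{\frac{-2}{2p+1}}$ and the stationary point of the optimality equation (clipped to $0$ when negative). Function names are illustrative.

```python
def objective(tau, rho, p):
    """H_h(0) / (rho H_h(tau)) + rho Q_h(tau) / Q_h(0) for h(u) = (1 - u)^p."""
    s = 1.0 - tau
    return s ** (-(p + 1)) / rho + rho * s ** p

def tau_bar(rho, p):
    """Simplified choice used in the proof; always in [0, 1) for rho >= 1."""
    return 1.0 - rho ** (-2.0 / (2 * p + 1))

def tau_opt(rho, p):
    """Solution of the optimality equation, clipped to 0 when it is negative."""
    return max(0.0, 1.0 - ((p + 1) / (p * rho ** 2)) ** (1.0 / (2 * p + 1)))

for p in (1, 2, 3):
    for rho in (1.0, 8.0, 50.0):
        print(p, rho, objective(tau_bar(rho, p), rho, p),
              2 * rho ** (1.0 / (2 * p + 1)),
              objective(tau_opt(rho, p), rho, p))
```

The first two printed values agree, as Eq. 6 states, and the third is never larger, which shows how little is lost by using the simpler $\bar{\tau}$.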

We proceed to the cosine annealing guarantee. Given its similarities to Corollary 2, the proof is deferred to Appendix A.

Corollary 3.

In the setting of Theorem 1, assuming $h(u)=\frac{1}{2}(1+\cos\brk{\pi u})$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot 18\rho^{\frac{1}{5}}+O% \brk*{\frac{\rho\eta_{\mathsf{tu}}G^{2}}{T}},

where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}=\frac{2DG}{\sqrt{T}}\sqrt{2Q_{h}(0)}\leq% \frac{10DG}{\sqrt{T}}$ .

Again we observe a sublinear dependence on $\rho$ with an optimal rate of $O(\frac{DG}{\sqrt{T}})$ . Note that this is the same behavior as in Corollary 2 with $p=2$ , which arises from the tail behavior of $h(u)$ . To see that, one can verify that $(1-u)^{2}\leq h(u)\leq\frac{5}{2}(1-u)^{2}$ for all $u\in[0,1]$ .

3.1 Tighter Constants using Numerical Analysis

Figure 1: Numerically evaluating the coefficient of $\ifrac{DG}{\sqrt{T}}$ for the convergence guarantee of Theorem 1 with different schedules and varying multiplicative misspecification parameter $\rho$ .

The constants of Corollary 3 are not tight; in particular, the bound is established using crude (up to constants) bounds for $H_{h}(u)$ and $Q_{h}(u)$ . While a tighter bound can be obtained analytically, the framework also lends itself to numerical analysis, as we demonstrate next.

The convergence guarantee of Theorem 1 is not posed as a closed-form expression but rather as a minimization over integrals that depend on the schedule. For a specific schedule and misspecification parameter, we use SciPy’s (Virtanen et al., 2020) quad integration to evaluate $H_{h},Q_{h}$ , and fsolve to solve the minimization of Theorem 1.

In Fig. 1 we provide a numerical analysis for the convergence guarantee of Theorem 1 with several decaying schedules, including the cosine annealing, showing in particular that the convergence rate of SGD with cosine annealing is bounded by $5\rho^{\frac{1}{5}}\frac{DG}{\sqrt{T}}$ . We observe that the cosine annealing schedule and the quadratic decay schedule have similar convergence guarantees with a coefficient between $4\rho^{\frac{1}{5}}$ and $5\rho^{\frac{1}{5}}$ . In addition, even for a somewhat large misspecification parameter of size $50$ , the difference between cosine annealing and the different polynomial decaying schedules is at most a factor of $2$ , which indicates that even mild decay might be sufficient if the grid is not too coarse.
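A rough, dependency-free version of this numerical evaluation is sketched below. It is not the procedure described above: instead of SciPy's quad and fsolve it uses midpoint sums on a uniform grid and a grid search over $\tau$, so the values are approximations whose accuracy depends on the hypothetical resolution parameter `n`.

```python
import math

def cosine(u):
    return 0.5 + 0.5 * math.cos(math.pi * u)

def linear(u):
    return 1.0 - u

def coefficient(h, rho, n=4000):
    """Approximate coefficient of DG / sqrt(T) in the leading term of Theorem 1."""
    width = 1.0 / n
    # H_tail[i] approximates H_h(i / n); Q_tail[i] approximates Q_h(i / n).
    H_tail = [0.0] * (n + 1)
    Q_tail = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        hm = h((i + 0.5) * width)
        H_tail[i] = H_tail[i + 1] + width * hm
        H_mid = H_tail[i + 1] + 0.5 * width * hm  # approximate H_h at the cell midpoint
        Q_tail[i] = Q_tail[i + 1] + width * hm * hm / H_mid
    H0, Q0 = H_tail[0], Q_tail[0]
    # Grid search over tau = i / n replaces the root-finding step.
    best = min(H0 / (rho * H_tail[i]) + rho * Q_tail[i] / Q0 for i in range(n))
    # (1/2) * Rate^tu / (DG / sqrt(T)) = sqrt(Q0 / H0).
    return math.sqrt(Q0 / H0) * best

for rho in (1.0, 10.0, 50.0):
    print(rho, coefficient(cosine, rho), coefficient(linear, rho))
```

Accumulating the tail sums from $u=1$ backwards gives $H_{h}$ and $Q_{h}$ on the whole grid in a single pass, which keeps the search over $\tau$ cheap.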

3.2 Proof of Theorem 1

Before proving our main theorem, we first state a few lemmas we will require. The first is a last-iterate convergence guarantee, using the techniques of Zamani and Glineur (2023); Liu and Zhou (2024) (proof appearing in Appendix C).

Lemma 3.

Let $\mathcal{X}\subseteq{\mathbb{R}}^{d}$ be a convex set, $f:\mathcal{X}\to{\mathbb{R}}$ a convex function, and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^{2}>0$ . Let $x_{1},\ldots,x_{T+1}$ be the iterates produced by $T$ -steps SGD with stepsizes $\eta_{1},\ldots,\eta_{T}$ using the oracle $g$ . Then for any $\hat{x}\in\mathcal{X}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(\hat{x})]

\displaystyle\leq\frac{\norm{x_{1}-\hat{x}}^{2}}{2\sum_{s=1}^{T}\eta_{s}}+2G^{% 2}\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.
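The right-hand side of Lemma 3 is easy to evaluate for any stepsize sequence, which makes the suffix idea from the start of this section tangible. The sketch below applies the bound to the suffix starting at step $k$, replacing $\norm{x_{k}-\hat{x}}$ by the diameter $D$; the linear-decay schedule, the value $\rho=20$, and the function name are choices made for illustration.

```python
import math

def lemma3_suffix_bound(etas, k, D, G):
    """Lemma 3 bound for the SGD suffix starting at step k (1-indexed),
    with the initial distance bounded by the diameter D."""
    tail = 0.0    # running sum of eta_s for s = t, ..., T
    second = 0.0  # running sum of eta_t^2 / (sum of eta_s for s >= t)
    for eta_t in reversed(etas[k - 1:]):
        tail += eta_t
        second += eta_t ** 2 / tail
    return D ** 2 / (2 * tail) + 2 * G ** 2 * second

T, D, G = 1000, 1.0, 1.0
eta_tuned = D / (2 * G * math.sqrt(T))  # Eq. 2 for linear decay, where H_h(0) Q_h(0) = 1
rho = 20.0
etas = [rho * eta_tuned * (1 - (t - 1) / T) for t in range(1, T + 1)]
full = lemma3_suffix_bound(etas, 1, D, G)
best = min(lemma3_suffix_bound(etas, k, D, G) for k in range(1, T + 1))
print(full, best)
```

With an overestimated stepsize, the best suffix yields a markedly smaller bound than the full sequence, because it discards the early steps whose stepsizes are too large.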

Next is a key lemma, translating the suffix of the last-iterate bound in Lemma 3 to one based on integrating the stepsize schedule (proof given later in the section).

Lemma 4.

Let $k\in[T]$ , $c_{1},c_{2},\eta>0$ , and $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ for some differentiable $p$ -Lipschitz annealed schedule $h$ . Then for any $\tau\in[\frac{k-1}{T},\frac{k}{T})$ ,

\displaystyle\frac{c_{1}}{\sum_{s=k}^{T}\eta_{s}}+c_{2}\sum_{t=k}^{T}\frac{% \eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}

\displaystyle\leq\frac{c_{1}}{\eta TH_{h}(\tau)}+c_{2}\eta\int_{\tau}^{1-\frac% {1}{T}}\frac{h(u)^{2}}{H_{h}(u)}du+\frac{4\eta c_{2}p}{T}.

We proceed to the proof of Theorem 1.

Proof of Theorem 1.

Let $\tau\in[0,1)$ and let $k=\lfloor\tau T\rfloor+1\in[T]$ . Consider the suffix $x_{k},x_{k+1},\ldots,x_{T+1}$ as an SGD sequence starting at $x_{k}$ . By Lemma 3 with $\hat{x}=x^{\star}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\sum_{s=k}^{T}\eta_{s}}+2G^{2}\sum_{t=k}^{T}% \frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.

As $\tau\in[\frac{k-1}{T},\frac{k}{T})$ , by Lemma 4 with $c_{1}=\frac{D^{2}}{2}$ and $c_{2}=2G^{2}$ ,

$\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]$	$\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(\tau)}+2\eta G^{2}\int_{\tau}^{1}% \frac{h(u)^{2}}{H_{h}(u)}du+\frac{8p\eta G^{2}}{T}$
	$\displaystyle=\frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\brk*{\frac{H_{% h}(0)}{\rho H_{h}(\tau)}+\frac{\rho\int_{\tau}^{1}\frac{h(u)^{2}}{H_{h}(u)}du}% {\int_{0}^{1}\frac{h(u)^{2}}{H_{h}(u)}du}}+\frac{8p\rho\eta_{\mathsf{tu}}G^{2}% }{T}$	( $\eta=\rho\eta_{\mathsf{tu}}$ and Eq. 2)
	$\displaystyle=\frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\brk*{\frac{H_{% h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}}+\frac{8p\rho\eta_{% \mathsf{tu}}G^{2}}{T}.$	( $H_{h}^{\prime}(u)^{2}=h(u)^{2}$ , Eq. 1)

This inequality holds for any $\tau\in[0,1)$ , hence it holds for the infimum over all $\tau\in[0,1)$ . It is left to find the $\tau$ which minimizes the bound. Let

\displaystyle g(v)=\frac{H_{h}(0)}{\rho H_{h}(v)}+\frac{\rho Q_{h}(v)}{Q_{h}(0% )}.

(7)

By the fundamental theorem of calculus,

\displaystyle H_{h}^{\prime}(v)=\brk*{\int_{0}^{1}h(u)du-\int_{0}^{v}h(u)du}^{\prime}=-h(v)

and

\displaystyle Q_{h}^{\prime}(v)

\displaystyle=\brk*{\int_{0}^{1}\frac{H_{h}^{\prime}(u)^{2}}{H_{h}(u)}du-\!% \int_{0}^{v}\frac{H_{h}^{\prime}(u)^{2}}{H_{h}(u)}du}^{\prime}=-\frac{H_{h}^{% \prime}(v)^{2}}{H_{h}(v)}.

Thus,

	$\displaystyle g^{\prime}(v)$	$\displaystyle=\frac{-H_{h}(0)H_{h}^{\prime}(v)}{\rho H_{h}(v)^{2}}-\frac{\rho% \frac{H_{h}^{\prime}(v)^{2}}{H_{h}(v)}}{Q_{h}(0)}=\frac{-\rho H_{h}^{\prime}(v% )}{Q_{h}(0)H_{h}(v)^{2}}\brk*{\frac{H_{h}(0)Q_{h}(0)}{\rho^{2}}+H_{h}(v)H_{h}^% {\prime}(v)}$
		$\displaystyle=\frac{\rho h(v)}{Q_{h}(0)H_{h}(v)^{2}}\brk*{H_{h}(v)H_{h}^{% \prime}(v)+\frac{H_{h}(0)Q_{h}(0)}{\rho^{2}}}.$

Hence, when $\tau$ satisfies $H_{h}(\tau)H_{h}^{\prime}(\tau)=\frac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ , we have $g^{\prime}(\tau)=0$ . For $v>\tau$ ,

\displaystyle H_{h}(v)H_{h}^{\prime}(v)

\displaystyle=-H_{h}(v)h(v)\geq-H_{h}(v)h(\tau)>-H_{h}(\tau)h(\tau)=-\frac{H_{% h}(0)Q_{h}(0)}{\rho^{2}},

so $g^{\prime}(v)>0$ and $g(v)>g(\tau)$ . Similarly, for $v<\tau$ , $g^{\prime}(v)<0$ and $g(v)>g(\tau)$ . Hence, $\tau$ satisfying $H_{h}(\tau)H_{h}^{\prime}(\tau)=\frac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ is the minimizer. If no such $\tau$ exists, the derivative is always positive (as $h$ is continuous and $H_{h}(1)H_{h}^{\prime}(1)=0$ ), and the minimizer is at $\tau=0$ . ∎

3.3 Proof of Lemma 4

Let $\tau\in[\frac{k-1}{T},\frac{k}{T})$ . As $h$ is non-increasing, we can use integration to obtain the following bound,

$\displaystyle\frac{c_{1}}{\sum_{t=k}^{T}\eta_{t}}+c_{2}\sum_{t=k}^{T}\frac{% \eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}$	$\displaystyle\leq\frac{c_{1}}{\eta\int_{k}^{T+1}h\brk{\frac{t-1}{T}}dt}+\frac% {c_{2}}{\eta}\sum_{t=k}^{T}\frac{\eta_{t}^{2}}{\int_{t}^{T+1}h\brk{\frac{s-1}% {T}}ds}$
	$\displaystyle=\frac{c_{1}}{\eta T\int_{\frac{k-1}{T}}^{1}h\brk{u}du}+\frac{c_% {2}}{\eta T}\sum_{t=k}^{T}\frac{\eta_{t}^{2}}{\int_{\frac{t-1}{T}}^{1}h\brk{u% }du}$	(changing integration variables)
	$\displaystyle=\frac{c_{1}}{\eta TH_{h}\brk{\frac{k-1}{T}}}+\frac{c_{2}}{\eta T% }\sum_{t=k}^{T}\frac{\eta_{t}^{2}}{H_{h}\brk{\frac{t-1}{T}}}.$	(Eq. 1)

Again bounding by integration and changing variables,

\displaystyle\frac{c_{2}}{\eta T}\!\sum_{t=k}^{T}\frac{\eta_{t}^{2}}{H_{h}\brk% *{\frac{t-1}{T}}}

\displaystyle\leq\frac{c_{2}\eta}{T}\brk*{\frac{h\brk*{\frac{k-1}{T}}^{2}}{H_{% h}(\frac{k-1}{T})}\!+\!\int_{k}^{T}\frac{h\brk*{\frac{t-1}{T}}^{2}}{H_{h}\brk*% {\frac{t-1}{T}}}dt}=\frac{c_{2}\eta}{T}\brk*{\frac{h\brk*{\frac{k-1}{T}}^{2}}{% H_{h}(\frac{k-1}{T})}+T\!\int_{\frac{k-1}{T}}^{1-\frac{1}{T}}\frac{h\brk*{u}^{% 2}}{H_{h}\brk*{u}}du}.

As $h(u)$ is differentiable, $p$ -Lipschitz, and $h(1)=0$ , for any $v\in[0,1)$ ,

\displaystyle 2pH_{h}(v)

\displaystyle=2p\int_{v}^{1}h(u)du\geq 2\int_{v}^{1}h(u)(-h(u))^{\prime}du=h(v% )^{2}-h(1)^{2}=h(v)^{2}.

(8)

Hence, $\frac{h\brk*{\frac{k-1}{T}}^{2}}{H_{h}(\frac{k-1}{T})}\leq 2p$ and since $\abs{\tau-\frac{k-1}{T}}\leq\frac{1}{T}$ ,

	$\displaystyle\int_{\frac{k-1}{T}}^{1-\frac{1}{T}}\frac{h\brk{u}^{2}}{H_{h}% \brk{u}}du$	$\displaystyle=\int_{\tau}^{1-\frac{1}{T}}\frac{h\brk{u}^{2}}{H_{h}\brk{u}}du% +\int_{\frac{k-1}{T}}^{\tau}\frac{h\brk{u}^{2}}{H_{h}\brk{u}}du\leq\int_{% \tau}^{1-\frac{1}{T}}\frac{h\brk{u}^{2}}{H_{h}\brk{u}}du+\abs*{\int_{\frac{k% -1}{T}}^{\tau}2pdu}$
		$\displaystyle\leq\int_{\tau}^{1-\frac{1}{T}}\frac{h\brk{u}^{2}}{H_{h}\brk{u}% }du+\frac{2p}{T}.$

Plugging back,

\displaystyle\frac{c_{1}}{\sum_{t=k}^{T}\eta_{t}}+c_{2}\sum_{t=k}^{T}\frac{% \eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}

\displaystyle\leq\frac{c_{1}}{\eta TH_{h}\brk*{\tau}}+\eta c_{2}\int_{\tau}^{1% -\frac{1}{T}}\frac{h\brk*{u}^{2}}{H_{h}\brk*{u}}du+\frac{4\eta c_{2}p}{T}.\qed
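The bound can be checked numerically for a concrete schedule. The sketch below assumes the polynomial schedule $h(u)=(1-u)^{p}$ (for which $H_{h}(\tau)=\frac{(1-\tau)^{p+1}}{p+1}$ and $\int_{\tau}^{1-\frac{1}{T}}\frac{h(u)^{2}}{H_{h}(u)}du=\frac{p+1}{p}\brk{(1-\tau)^{p}-T^{-p}}$) and takes $\tau=\frac{k-1}{T}$; the function name and default constants are hypothetical.

```python
import numpy as np

def lemma4_sides(T, k, p, eta=1.0, c1=1.0, c2=1.0):
    """Return (lhs, rhs) of the bound for h(u) = (1-u)^p with tau = (k-1)/T."""
    t = np.arange(1, T + 1)
    etas = eta * (1.0 - (t - 1) / T) ** p            # eta_t = eta * h((t-1)/T)
    suffix = np.cumsum(etas[::-1])[::-1]             # suffix[t-1] = sum_{s=t}^T eta_s
    lhs = c1 / suffix[k - 1] + c2 * np.sum(etas[k - 1:] ** 2 / suffix[k - 1:])
    tau = (k - 1) / T
    H_tau = (1.0 - tau) ** (p + 1) / (p + 1)         # H_h(tau)
    # Closed form of int_tau^{1-1/T} h(u)^2 / H_h(u) du for the polynomial schedule.
    integral = (p + 1) / p * ((1.0 - tau) ** p - T ** (-float(p)))
    rhs = c1 / (eta * T * H_tau) + eta * c2 * integral + 4 * eta * c2 * p / T
    return lhs, rhs

lhs, rhs = lemma4_sides(T=200, k=50, p=2)
print(lhs, rhs)
```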

4 Convex and Smooth Setting

In this section, we extend our robustness result to the convex smooth setting, in which we replace the second-moment gradient oracle assumption with the assumptions that the gradient oracle has bounded variance and that $f$ is $\beta$-smooth. The core technique is the same as in Section 3, with some additional considerations due to the requirement in standard smooth analysis that the stepsizes satisfy $\eta_{1},\ldots,\eta_{T}\leq\frac{c}{\beta}$ for some constant $c<2$.

Next is the main result of this section, a convergence guarantee robust to a multiplicative misspecification of the stepsize.

Theorem 4.

Let $\mathcal{X}\subset{\mathbb{R}}^{d}$ be a convex set with diameter $D>0$ , $f:\mathcal{X}\to{\mathbb{R}}$ a $\beta$ -smooth convex function, $x^{\star}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$ , and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^{2}\geq 0$ . For any $\rho\geq 1$ , let $x_{1},x_{2},\ldots,x_{T+1}$ be the iterates produced by $T$ -steps SGD with stepsizes $\eta_{t}=\eta h\brk{\frac{t-1}{T}}$ using the oracle $g$ , where $\eta=\rho\cdot\eta_{\mathsf{tu}}^{\mathsf{sm}}$ and $h$ is a differentiable $p$ -Lipschitz annealed schedule. Denote $\tau_{0}\triangleq\min\set{\tau\in[0,1):\eta h\brk{\frac{\lfloor\tau T\rfloor}% {T}}\leq\frac{1}{2\beta}}$ . Then it holds that

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot\inf_{\tau\in[\tau_{% 0},1)}\brk*{\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)% }}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}},

(9)

where $H_{h}$, $Q_{h}$, $\eta_{\mathsf{tu}}^{\mathsf{sm}}$, and $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ are given in Eqs. 1, 3 and 4. In particular, the optimal $\tau$ satisfies $H_{h}(\tau)H_{h}^{\prime}(\tau)=\frac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ (or $\tau=\tau_{0}$ if there is no solution).

As in Theorem 1, Theorem 4 exhibits a similar adaptivity to $\rho$ through the tails of $H_{h}$ and $Q_{h}$. One small yet important difference is that the infimum is restricted to the range $[\tau_{0},1)$, where $\tau_{0}$ denotes the fraction of iterations in which the stepsize exceeds $\frac{1}{2\beta}$. This dependency is somewhat unavoidable (up to constants), as SGD with stepsizes larger than or equal to $\frac{2}{\beta}$ does not converge in general. Additionally, note that the above guarantee holds even if we specify a stepsize that is larger than $\frac{2}{\beta}$, which is not the case with fixed stepsize SGD.
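To make the role of $\tau_{0}$ concrete, here is a minimal sketch that computes it directly from the definition in Theorem 4. Since the condition depends on $\tau$ only through $\lfloor\tau T\rfloor$, it suffices to scan the grid points $\frac{k}{T}$. The helper name `tau_zero` and the example schedules are assumptions made for illustration.

```python
import math

def tau_zero(h, eta, beta, T):
    """Smallest tau in [0, 1) with eta * h(floor(tau*T)/T) <= 1/(2*beta)."""
    for k in range(T):                       # floor(tau*T) takes values 0, ..., T-1
        if eta * h(k / T) <= 1.0 / (2.0 * beta):
            return k / T                     # the condition first holds at tau = k/T
    return None                              # no admissible tau in [0, 1)

linear = lambda u: 1.0 - u
cosine = lambda u: 0.5 * (1.0 + math.cos(math.pi * u))

# A stepsize eta = 4 with beta = 1 is eight times larger than 1/(2*beta).
print(tau_zero(linear, eta=4.0, beta=1.0, T=100))
print(tau_zero(cosine, eta=4.0, beta=1.0, T=100))
```

A well-specified base stepsize ($\eta\leq\frac{1}{2\beta}$) gives $\tau_{0}=0$, in which case the infimum in Eq. 9 ranges over all of $[0,1)$.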

Next are corollaries of Theorem 4 with polynomial decay and cosine annealing schedules. Due to space constraints and similarities to the convex Lipschitz case, we defer the proofs of Theorem 4 and of the corollaries to Appendix B.

Corollary 5.

In the setting of Theorem 4, let $h(u)=(1-u)^{p}$ for some $p\geq 1$ and $\rho\leq\frac{T}{2p}$ . Then if $\rho^{2}\geq(1-\tau_{0})^{-(2p+1)}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]=\mathsf{Rate}^{\mathsf{sm}}_{% h,T}(\eta_{\mathsf{tu}}^{\mathsf{sm}})\cdot O\brk*{\rho^{\frac{1}{2p+1}}},

and if $\rho^{2}<(1-\tau_{0})^{-(2p+1)}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]=\mathsf{Rate}^{\mathsf{sm}}_{% h,T}(\eta_{\mathsf{tu}}^{\mathsf{sm}})\cdot O\brk*{\frac{1}{1-\tau_{0}}}.

In addition, $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}=O\brk*{\frac{p\beta D^{2}}{T}+\frac{\sqrt% {p}D\sigma}{\sqrt{T}}}$ .

Corollary 6.

In the setting of Theorem 4, let $h(u)=\frac{1}{2}(1+\cos(\pi u))$ and $\rho\leq\frac{2T}{\pi}$ . Then if $\rho^{2}\geq(1-\tau_{0})^{-5}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\brk*{\rho^{\frac{1}{% 5}}},

and if $\rho^{2}<(1-\tau_{0})^{-5}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\brk*{\frac{1}{1-\tau% _{0}}}.

In addition, $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}=O\brk*{\frac{\beta D^{2}}{T}+\frac{D% \sigma}{\sqrt{T}}}$ .

Observing Corollaries 5 and 6, a similar improved dependence on $\rho$ as in Corollaries 2 and 3 holds when $\tau_{0}$ is sufficiently small. When $\tau_{0}$ is large, we obtain the expected inverse dependence on the fraction of steps with small enough stepsizes, which is unavoidable as we explained above.
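For a sense of scale, the following sketch evaluates the multiplicative factor from the proof of Corollary 5 for the polynomial schedule, $\frac{1}{\rho(1-\tau)^{p+1}}+\rho(1-\tau)^{p}$ at $\tau=\max\set{\bar{\tau},\tau_{0}}$ with $\bar{\tau}=1-\rho^{-2/(2p+1)}$, and prints it next to $\rho$ itself (the linear dependence of fixed stepsizes). The function name and the chosen values of $\rho$ are illustrative assumptions.

```python
def annealed_factor(rho, p, tau0=0.0):
    """Factor H(0)/(rho*H(tau)) + rho*Q(tau)/Q(0) for h(u) = (1-u)^p at the tau used in the proof."""
    tau_bar = 1.0 - rho ** (-2.0 / (2 * p + 1))      # choice made when it lies in [tau0, 1)
    tau = max(tau_bar, tau0)
    return 1.0 / (rho * (1.0 - tau) ** (p + 1)) + rho * (1.0 - tau) ** p

for rho in (1, 10, 100, 1000):
    # Columns: rho (fixed-stepsize dependence), factor for p = 1, factor for p = 2.
    print(rho, annealed_factor(rho, 1), annealed_factor(rho, 2))
```

With $\tau_{0}=0$ the factor equals $2\rho^{\frac{1}{2p+1}}$, so it grows far more slowly than $\rho$ as the misspecification increases.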

5 Experimental Evaluation

Our theory predicts that learning rate annealing schemes exhibit greater robustness to learning rate tuning compared to tuning a fixed learning rate. To support the prediction, we perform experiments to compare the performances of different scheduling strategies under varying grid search resolutions for learning rate tuning.

We conduct two types of experiments: the first involves a synthetic logistic regression task closely aligned with the theoretical setting, while the second involves training a neural network classifier.

5.1 Experimental setup

We consider common schedules, namely a fixed learning rate (our baseline), cosine annealing, and linear decay. To simulate varying grid resolutions, we train the models using a geometric grid of learning rates with a multiplicative factor of approximately $\sqrt[3]{10}\approx 2.15$ (the values $\{1,2.2,5\}$ multiplied by $10^{i}$ for different values of $i$), and consider the different subsets with resolutions $2.15,2.15^{2},2.15^{3}$, etc. For example, with range $[0.01,5]$ and resolution of $2.15^{3}$, we find the best model for each of the grids $\{0.01,0.1,1\},\{0.022,0.22,2.2\},\{0.05,0.5,5\}$, and report the average test loss/top-1 error across grids. (Footnote 3: We average the performance over 3 runs per learning rate.)
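The sub-grid construction can be sketched as follows. This is a minimal illustration of the procedure described above, not the code used for the experiments; the helper names and the `loss_by_lr` dictionary (mapping each learning rate to its averaged test metric) are hypothetical.

```python
def geometric_grid(min_exp, max_exp, mantissas=(1.0, 2.2, 5.0)):
    """Learning rates m * 10**i for i in [min_exp, max_exp], in ascending order."""
    return [m * 10.0 ** i for i in range(min_exp, max_exp + 1) for m in mantissas]

def sub_grids(grid, stride):
    """Split a grid into `stride` interleaved sub-grids, each coarser by a factor of about 2.15**stride."""
    return [grid[offset::stride] for offset in range(stride)]

def coarse_grid_score(loss_by_lr, stride):
    """Average, over the sub-grids, of the best (lowest) metric found within each sub-grid."""
    grid = sorted(loss_by_lr)
    best = [min(loss_by_lr[lr] for lr in g) for g in sub_grids(grid, stride)]
    return sum(best) / len(best)

grid = geometric_grid(-2, 0)          # 0.01, 0.022, 0.05, ..., 1, 2.2, 5
coarse = sub_grids(grid, stride=3)    # three sub-grids with resolution of about 10
print(coarse)
```

With `stride=3` the three sub-grids coincide (up to floating-point rounding) with the example grids listed above.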

Synthetic logistic regression.

In the synthetic experiment, we generate 100,000 samples of dimension 100, drawn from a normal distribution. Labels are assigned based on thresholding probabilities determined by a “true weights” vector of size 100, also sampled from a normal distribution. To introduce additional noise, we flip each label with a probability of 0.1. A test set of the same size is generated similarly. We train a linear classifier using binary cross-entropy loss, SGD without momentum, a batch size of 1,000, and a single epoch (updating the scheduler after each step). For the fixed learning rate scheduler, we report both the last iterate and the averaged iterate performances.
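A minimal NumPy sketch of this setup is given below. It is an illustration under stated assumptions rather than the experimental code: we assume standard normal features and weights, a probability threshold of $\frac{1}{2}$ (equivalently, thresholding the logit at zero), and the function names are hypothetical.

```python
import numpy as np

def make_data(n, dim, w_true, rng, flip_prob=0.1):
    """Gaussian features; labels threshold sigmoid(x @ w_true) at 1/2, then flip w.p. flip_prob."""
    x = rng.standard_normal((n, dim))
    y = (x @ w_true > 0).astype(np.float64)        # sigmoid(z) > 1/2  <=>  z > 0
    flip = rng.random(n) < flip_prob
    y[flip] = 1.0 - y[flip]
    return x, y

def bce(w, x, y):
    """Mean binary cross-entropy of the linear classifier w (numerically stable form)."""
    z = x @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def train_sgd(x, y, lr, schedule, batch_size=1000):
    """Single-epoch mini-batch SGD without momentum; the schedule is updated every step."""
    n, dim = x.shape
    steps = n // batch_size
    w = np.zeros(dim)
    for t in range(steps):                         # t is 0-based, so t/steps = (step-1)/T
        xb = x[t * batch_size:(t + 1) * batch_size]
        yb = y[t * batch_size:(t + 1) * batch_size]
        probs = 1.0 / (1.0 + np.exp(-(xb @ w)))
        grad = xb.T @ (probs - yb) / batch_size    # gradient of the mean BCE on the batch
        w -= lr * schedule(t / steps) * grad
    return w

cosine = lambda u: 0.5 * (1.0 + np.cos(np.pi * u))
linear = lambda u: 1.0 - u
fixed = lambda u: 1.0
```

Calling `train_sgd(x, y, lr, cosine)` for each `lr` in the grid and evaluating `bce` on a test set drawn with the same `w_true` yields the per-learning-rate losses that the grid comparison consumes.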

Wide ResNet on CIFAR-10.

We train a Wide ResNet 28-10 model (Zagoruyko and Komodakis, 2016) without dropout on the CIFAR-10 dataset (Krizhevsky, 2009), using the PyTorch implementation of Wide ResNet at https://github.com/bmsookim/wide-resnet.pytorch. We train for 200 epochs, using a batch size of $128$, Nesterov momentum of $0.9$, and weight decay of $0.0005$. The scheduler is updated after each epoch. As the last iterate of fixed stepsize SGD underperforms, we use polynomial averaging as proposed by Shamir and Zhang (2013), with parameter $\gamma=8$, following Ivgi et al. (2023).
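For reference, polynomial-decay averaging in the sense of Shamir and Zhang (2013) maintains the running average $\bar{x}_{t}=\brk{1-\frac{\gamma+1}{t+\gamma}}\bar{x}_{t-1}+\frac{\gamma+1}{t+\gamma}x_{t}$. The sketch below implements this recursion on a generic sequence of iterates; the function name is hypothetical, and how the average is interleaved with training (per step or per epoch) is not specified here.

```python
import numpy as np

def polynomial_average(iterates, gamma=8.0):
    """Running polynomial-decay average: avg_t = (1 - a_t) * avg_{t-1} + a_t * x_t,
    with a_t = (gamma + 1) / (t + gamma) for t = 1, 2, ..."""
    avg = None
    for t, x in enumerate(iterates, start=1):
        a_t = (gamma + 1.0) / (t + gamma)
        x = np.asarray(x, dtype=float)
        # a_1 = 1, so the first average is simply the first iterate.
        avg = x.copy() if avg is None else (1.0 - a_t) * avg + a_t * x
    return avg
```

Setting `gamma=0` recovers the uniform average, while larger values place more weight on recent iterates.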

5.2 Results

The test loss per learning rate appears in Fig. 2(a). For each resolution, Fig. 2(b) illustrates the logistic regression test loss averaged across the best models for each sub-grid. At high resolutions (e.g., grid factors up to 10), we observe a comparable performance degradation across different schedules (except for the fixed stepsize without averaging, which underperforms). However, as grid resolution decreases, the gap between the fixed learning rate schedule and the decaying schedules widens. For instance, with a grid factor of approximately 100, the test loss of the fixed learning rate (with averaging) degrades by 0.08, whereas cosine annealing and linear decay schedules experience smaller degradations of 0.01 and 0.014, respectively, with similar trends observed for grids with lower resolutions.

Fig. 3(b) shows the CIFAR-10 top-1 test error for each resolution, averaged over the best models per sub-grid, with the raw test error per learning rate appearing in Fig. 3(a). As in the logistic regression task, the degradation is comparable across schedules at high resolutions, while the gap between the fixed learning rate schedule and the decaying schedules widens for large grid factors. With a grid factor of approximately 22, the test error of the fixed learning rate degrades by 0.61, with smaller degradations of 0.3 and 0.35 observed for cosine annealing and linear decay schedules, respectively, and the trend continues for grids with lower resolutions.

5.3 Discussion

The experiments show that decaying schedules are more robust to coarse grids, while performance differences on fine grids remain minimal. These findings align with our theory, which suggests that all decaying schedules perform similarly to iterate averaging under small multiplicative misspecification but outperform it when misspecification is large. However, our theory also predicts robustness variations across decay rates, which are not observed in the real-data experiments. A possible explanation is the small difference in convergence rates among decaying schedules when misspecification is low, as illustrated in Fig. 1.

Acknowledgements

We are grateful to Noga Bar, Yair Carmon and Tomer Porian for helpful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 101078075). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant number 3174/23), a grant from the Tel Aviv University Center for AI and Data Science (TAD), and a fellowship from the Israeli Council of Higher Education.

References

Alacaoglu et al. (2020) A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, and V. Cevher. A new regret analysis for Adam-type algorithms. In International Conference on Machine Learning, pages 202–210. PMLR, 2020.
Attia and Koren (2023) A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, 2023.
Bengio (2012) Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012.
Bottou (2012) L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
Carmon and Hinder (2022) Y. Carmon and O. Hinder. Making SGD parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022.
Chaudhuri et al. (2009) K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. Advances in neural information processing systems, 22, 2009.
Cutkosky and Orabona (2018) A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, pages 1493–1529. PMLR, 2018.
Defazio and Mishchenko (2023) A. Defazio and K. Mishchenko. Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, 2023.
Defazio et al. (2024a) A. Defazio, A. Cutkosky, H. Mehta, and K. Mishchenko. Optimal linear decay learning rate schedules and further refinements. arXiv preprint arXiv:2310.07831, 2024a.
Defazio et al. (2024b) A. Defazio, X. A. Yang, A. Khaled, K. Mishchenko, H. Mehta, and A. Cutkosky. The road less scheduled. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
Faw et al. (2022) M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. A. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In COLT, 2022.
Ge et al. (2019) R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. Advances in neural information processing systems, 32, 2019.
Hu et al. (2024) S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, X. Zhang, Z. L. Thai, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li, Z. Liu, and M. Sun. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling, 2024.
Ivgi et al. (2023) M. Ivgi, O. Hinder, and Y. Carmon. DoG is SGD’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
Jain et al. (2019) P. Jain, D. Nagaraj, and P. Netrapalli. Making the last iterate of SGD information theoretically optimal. In Conference on Learning Theory, pages 1752–1755. PMLR, 2019.
Kavis et al. (2019) A. Kavis, K. Y. Levy, F. Bach, and V. Cevher. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
Kavis et al. (2022) A. Kavis, K. Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In International Conference on Learning Representations, 2022.
Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
Liu and Zhou (2024) Z. Liu and Z. Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024.
Liu et al. (2023) Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen. High probability convergence of stochastic gradient methods. arXiv preprint arXiv:2302.14843, 2023.
Loshchilov and Hutter (2017) I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
Luo and Schapire (2015) H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304. PMLR, 2015.
Mishchenko and Defazio (2023) K. Mishchenko and A. Defazio. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101, 2023.
Orabona and Pál (2016) F. Orabona and D. Pál. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.
Orabona and Pál (2021) F. Orabona and D. Pál. Parameter-free stochastic optimization of variationally coherent functions. arXiv preprint arXiv:2102.00236, 2021.
Reddi et al. (2018) S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
Schaul et al. (2013) T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International conference on machine learning, pages 343–351. PMLR, 2013.
Shamir and Zhang (2013) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79. PMLR, 2013.
Smith (2017) L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017.
Streeter and McMahan (2012) M. J. Streeter and H. B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Neural Information Processing Systems, 2012.
Tran et al. (2019) P. T. Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716, 2019.
Virtanen et al. (2020) P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
Zamani and Glineur (2023) M. Zamani and F. Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
Zhai et al. (2022) X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.

Appendix A Proofs of Section 3

A.1 Proof of Corollary 3

Proof.

Note that $h(u)$ is non-increasing, differentiable ($h^{\prime}(u)=\frac{-\pi}{2}\sin(\pi u)$), $\frac{\pi}{2}$-Lipschitz (as $\abs{h^{\prime}(u)}\leq\frac{\pi}{2}$) and satisfies $h(u)=0\Leftrightarrow u=1$. Hence, $h$ is annealed and by Theorem 1,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\inf_{\tau% \in[0,1)}\brk*{\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}% (0)}}+O\brk*{\frac{\rho\eta_{\mathsf{tu}}G^{2}}{T}}.

Next, we will bound $h(u),H_{h}(u)$ and $Q_{h}(u)$ using polynomials. As $\cos(\pi-\theta)=-\cos(\theta)$ and $\cos(\theta)\geq 1-\frac{\theta^{2}}{2}$ ,

\displaystyle h(u)=\frac{1}{2}(1-\cos(\pi(1-u)))\leq\frac{\pi^{2}}{4}(1-u)^{2}% \leq\frac{5}{2}(1-u)^{2}.

(10)

On the other hand, for $u\in[0,1)$ ,

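\displaystyle\brk*{\frac{h(u)}{(1-u)^{2}}}^{\prime}=\frac{-\frac{\pi}{2}\sin(\pi u)(1-u)^{2}+2h(u)(1-u)}{(1-u)^{4}}=\frac{2(1+\cos(\pi u))-\pi(1-u)\sin(\pi u)}{2(1-u)^{3}}\geq 0,

where the last inequality follows by writing $s=\pi(1-u)\in(0,\pi]$, so that the numerator equals $2(1-\cos s)-s\sin s=2\sin\frac{s}{2}\brk*{2\sin\frac{s}{2}-s\cos\frac{s}{2}}\geq 0$, since $\tan\frac{s}{2}\geq\frac{s}{2}$ for $s\in(0,\pi)$ (and $\cos\frac{s}{2}=0$ at $s=\pi$).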

Using the fundamental theorem of calculus, for all $u\in[0,1)$ ,

\displaystyle\frac{h(u)}{(1-u)^{2}}=\frac{h(0)}{(1-0)^{2}}+\int_{0}^{u}\brk*{% \frac{h(v)}{(1-v)^{2}}}^{\prime}dv\geq 1+\int_{0}^{u}0\cdot dv=1\implies h(u)% \geq(1-u)^{2}.

(11)

Using integration, Eqs. 10 and 11 also imply that

\displaystyle\frac{1}{3}(1-u)^{3}\leq H_{h}(u)\leq\frac{5}{6}(1-u)^{3}.

(12)
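The polynomial sandwich bounds on the cosine schedule are easy to verify numerically. The sketch below checks Eqs. 10 to 12 on a grid of $u\in[0,0.99]$, using the closed form $H_{h}(u)=\frac{1-u}{2}-\frac{\sin(\pi u)}{2\pi}$ obtained by integrating $h$; the grid and the small tolerance are arbitrary choices made for this illustration.

```python
import numpy as np

u = np.linspace(0.0, 0.99, 1000)
h = 0.5 * (1.0 + np.cos(np.pi * u))                    # cosine schedule
H = 0.5 * (1.0 - u) - np.sin(np.pi * u) / (2 * np.pi)  # H_h(u) = int_u^1 h(v) dv

lower_h, upper_h = (1 - u) ** 2, (np.pi ** 2 / 4) * (1 - u) ** 2   # Eqs. 10 and 11
lower_H, upper_H = (1 - u) ** 3 / 3, 5 * (1 - u) ** 3 / 6          # Eq. 12

tol = 1e-12  # guards against floating-point rounding only
h_ok = bool(np.all((lower_h <= h + tol) & (h <= upper_h + tol)))
H_ok = bool(np.all((lower_H <= H + tol) & (H <= upper_H + tol)))
print(h_ok, H_ok)
```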

Using the above inequalities,

\displaystyle Q_{h}(v)

\displaystyle=\int_{v}^{1}\frac{h(u)^{2}}{H_{h}(u)}du\leq\frac{75}{4}\int_{v}^% {1}(1-u)du=\frac{75}{8}(1-v)^{2}

(13)

and

\displaystyle Q_{h}(v)\geq\int_{v}^{1}\frac{6(1-u)^{4}}{5(1-u)^{3}}du=\frac{3}% {5}(1-v)^{2}.

(14)

Using the bounds, setting $\bar{\tau}=1-\rho^{-0.4}\in[0,1)$ , and noting that $H_{h}(0)=\frac{1}{2}\int_{0}^{1}(1+\cos(\pi u))du=\frac{1}{2}$ ,

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})}+\frac{\rho Q_{h}(\bar{\tau})}{Q_{h}(0)}\leq\frac{3}{2\rho(1-\bar{\tau})^{3}}+\frac{125\rho(1-\bar{\tau})^{2}}{8}=\frac{3}{2\rho\rho^{-1.2}}+\frac{125\rho\rho^{-0.8}}{8}\leq 18\rho^{0.2}.

(15)

Thus,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot 18\rho^{\frac{1}{5}}+O% \brk*{\frac{\rho\eta_{\mathsf{tu}}G^{2}}{T}}.

Again noting that $H_{h}(0)=\frac{1}{2}$ and using Eq. 13,

\displaystyle\mathsf{Rate}_{h,T}^{\mathsf{tu}}

\displaystyle=\frac{2DG}{\sqrt{T}}\sqrt{Q_{h}(0)/H_{h}(0)}=\frac{2DG}{\sqrt{T}% }\sqrt{2Q_{h}(0)}\leq\frac{2DG\sqrt{\frac{150}{8}}}{\sqrt{T}}\leq\frac{10DG}{% \sqrt{T}}.\qed

Appendix B Proofs of Section 4

B.1 Proof of Theorem 4

In the proof, we use the following last-iterate guarantee for convex-smooth optimization, replacing Lemma 3 which we used in the convex Lipschitz case. The lemma is based on the technique introduced by Liu and Zhou (2024) and its proof appears in Appendix C.

Lemma 5.

Let $\mathcal{X}\subseteq{\mathbb{R}}^{d}$ be a convex set, $f:\mathcal{X}\to{\mathbb{R}}$ a $\beta$-smooth convex function, and $g:\mathcal{X}\to{\mathbb{R}}^{d}$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^{2}\geq 0$. Let $x_{1},x_{2},\ldots,x_{T+1}$ be the iterates produced by $T$-step SGD with stepsizes $\eta_{1},\ldots,\eta_{T}$ (satisfying $\eta_{t}\leq\frac{1}{2\beta}$ for all $t\in[T]$) and using the oracle $g$. Then for any $\hat{x}\in\mathcal{X}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(\hat{x})]

\displaystyle\leq\frac{\norm{x_{1}-\hat{x}}^{2}}{2\sum_{s=1}^{T}\eta_{s}}+% \sigma^{2}\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.

We proceed to the proof of Theorem 4.

Proof of Theorem 4.

Let $\tau\in[\tau_{0},1)$ and let $k=\lfloor\tau T\rfloor+1\in[T]$. Consider the suffix $x_{k},x_{k+1},\ldots,x_{T+1}$ as an SGD sequence starting at $x_{k}$ and note that since $h$ is non-increasing and $\frac{k-1}{T}=\frac{\lfloor\tau T\rfloor}{T}\geq\frac{\lfloor\tau_{0}T\rfloor}{T}$, for all $t\geq k$ we have $\eta_{t}\leq\eta_{k}=\eta h\brk{\frac{k-1}{T}}\leq\eta h\brk{\frac{\lfloor\tau_{0}T\rfloor}{T}}\leq\frac{1}{2\beta}$. Thus, by Lemma 5 with $\hat{x}=x^{\star}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\sum_{s=k}^{T}\eta_{s}}+\sigma^{2}\sum_{t=k}^{T% }\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.

As $\tau\in[\frac{k-1}{T},\frac{k}{T})$ , invoking Lemma 4 with $c_{1}=D^{2}/2$ and $c_{2}=\sigma^{2}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(\tau)}+\eta\sigma^{2}\int_{\tau}^{1% }\frac{h(u)^{2}}{H_{h}(u)}du+\frac{4\eta p\sigma^{2}}{T}.

Substituting $\eta=\rho\cdot\eta_{\mathsf{tu}}^{\mathsf{sm}}$ and using Eqs. 3 and 4,

	$\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]$	$\displaystyle\leq\frac{1}{\rho H_{h}(\tau)}\cdot\frac{D^{2}}{2\eta_{\mathsf{tu% }}^{\mathsf{sm}}T}+\brk*{\rho\int_{\tau}^{1}\frac{h(u)^{2}}{H_{h}(u)}du}\cdot% \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}+\frac{4p\rho}{T}\cdot\eta_{\mathsf{% tu}}^{\mathsf{sm}}\sigma^{2}$
		$\displaystyle=\frac{H_{h}(0)}{\rho H_{h}(\tau)}\cdot\frac{D^{2}}{2\eta_{% \mathsf{tu}}^{\mathsf{sm}}TH_{h}(0)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}\cdot% \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}Q_{h}(0)+\frac{4p\rho\eta_{\mathsf{% tu}}^{\mathsf{sm}}\sigma^{2}}{T}$
		$\displaystyle\leq\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot\brk*{\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}}+\frac{4p\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}.$

This inequality holds for any $\tau\in[\tau_{0},1)$ , hence it holds for the infimum over all $\tau\in[\tau_{0},1)$ . It is left to find the $\tau$ which minimizes the right-hand side. Let

\displaystyle g(v)=\frac{H_{h}(0)}{\rho H_{h}(v)}+\frac{\rho Q_{h}(v)}{Q_{h}(0)}.

This is the same function as in Eq. 7, so the same solution to $H_{h}(\tau)H_{h}^{\prime}(\tau)=\frac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}$ is the minimizer of the function, and if there is no solution, the function is increasing (positive derivative) and the minimizer is at $\tau=\tau_{0}$ . ∎

B.2 Proof of Corollary 5

Proof.

As in the proof of Corollary 2, $h(u)$ is annealed as $h(u)$ is non-increasing, differentiable, $p$-Lipschitz and satisfies $h(u)=0\Leftrightarrow u=1$. Hence, we can use Theorem 4. In addition, $H_{h}(\tau)=\frac{1}{p+1}(1-\tau)^{p+1}$, $H_{h}^{\prime}(\tau)=-(1-\tau)^{p}$ and $Q_{h}(\tau)=\frac{p+1}{p}(1-\tau)^{p}$, so

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}

\displaystyle=\frac{1}{\rho(1-\tau)^{p+1}}+\rho(1-\tau)^{p}.

If $\rho^{2}\geq(1-\tau_{0})^{-(2p+1)}$ we can pick $\bar{\tau}=1-\rho^{\frac{-2}{2p+1}}$ , as $\bar{\tau}\in[\tau_{0},1)$ . In this case,

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})}+\frac{\rho Q_{h}(\bar{% \tau})}{Q_{h}(0)}

\displaystyle=\frac{1}{\rho\cdot\rho^{\frac{-2(p+1)}{2p+1}}}+\rho\cdot\rho^{% \frac{-2p}{2p+1}}=2\rho^{\frac{1}{2p+1}}.

If $\rho^{2}<(1-\tau_{0})^{-(2p+1)}$ and $\tau_{0}>0$ , picking $\bar{\tau}=\tau_{0}$ and using the $p$ -Lipschitz property of $h$ ,

	$\displaystyle\frac{1}{2\beta}$	$\displaystyle\leq\eta h\brk{\tau_{0}-\frac{1}{T}}=\rho\eta_{\mathsf{tu}}^{% \mathsf{sm}}h\brk{\tau_{0}-\frac{1}{T}}\leq\rho\eta_{\mathsf{tu}}^{\mathsf{sm% }}h(\tau_{0})+\frac{p\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}}{T}\leq\frac{\rho h(% \tau_{0})}{2\beta h(0)}+\frac{p\rho}{2\beta h(0)T}$		( $\eta_{\mathsf{tu}}^{\mathsf{sm}}h(0)\leq\frac{1}{2\beta}$ )
		$\displaystyle\implies\rho\geq\brk*{1-\frac{p\rho}{T}}(1-\tau_{0})^{-p}$		( $h(0)=1$ )

and

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\tau_{0})}+\frac{\rho Q_{h}(\tau_{0})}% {Q_{h}(0)}

\displaystyle=\frac{1}{\rho(1-\tau_{0})^{p+1}}+\rho(1-\tau_{0})^{p}\leq\frac{1% }{\brk*{1-\frac{p\rho}{T}}(1-\tau_{0})}+\sqrt{\frac{1}{1-\tau_{0}}}=O\brk*{% \frac{1}{1-\tau_{0}}},

where the last two transitions use $\rho^{2}<(1-\tau_{0})^{-(2p+1)}$ and the assumption $\rho\leq\frac{T}{2p}$ . Since $\rho\geq 1$ there is no case where $\rho^{2}<(1-\tau_{0})^{-(2p+1)}$ and $\tau_{0}=0$ . Bounding the infimum of Eq. 9 in the two cases with our choices of $\bar{\tau}$ , if $\rho^{2}\geq(1-\tau_{0})^{-(2p+1)}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]=\mathsf{Rate}^{\mathsf{sm,tu}% }_{h,T}\cdot O\brk*{\rho^{\frac{1}{2p+1}}}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}% }^{\mathsf{sm}}\sigma^{2}}{T}},

and if $\rho^{2}<(1-\tau_{0})^{-(2p+1)}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]=\mathsf{Rate}^{\mathsf{sm,tu}% }_{h,T}\cdot O\brk*{\frac{1}{1-\tau_{0}}}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}}% ^{\mathsf{sm}}\sigma^{2}}{T}}.

Noting that by the assumption $\rho\leq\frac{T}{2p}$ ,

\displaystyle\frac{p\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}\leq% \frac{\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{2}\leq\frac{\mathsf{Rate}^{% \mathsf{sm,tu}}_{h,T}}{2Q_{h}(0)}=O(\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}),

we obtain our final convergence guarantees. The bound of $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ follows from plugging $H_{h}(0)=\frac{1}{p+1}$ and $Q_{h}(0)=\frac{p+1}{p}$ into Eq. 4. ∎

B.3 Proof of Corollary 6

Proof.

As in the proof of Corollary 3, $h$ is annealed as $h(u)$ is non-increasing, differentiable, $\frac{\pi}{2}$-Lipschitz and satisfies $h(u)=0\Leftrightarrow u=1$. Hence, we can use Theorem 4. We already established in Eq. 15 of the proof of Corollary 3 that

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)}

\displaystyle\leq\frac{3}{2\rho(1-\tau)^{3}}+\frac{125\rho(1-\tau)^{2}}{8}.

If $\rho^{2}\geq(1-\tau_{0})^{-5}$ we can pick $\bar{\tau}=1-\rho^{\frac{-2}{5}}$ , as $\bar{\tau}\in[\tau_{0},1)$ . In this case,

\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})}+\frac{\rho Q_{h}(\bar{% \tau})}{Q_{h}(0)}

\displaystyle\leq\frac{3}{2\rho\cdot\rho^{\frac{-6}{5}}}+\frac{125\rho\cdot% \rho^{\frac{-4}{5}}}{8}=\frac{137\rho^{\frac{1}{5}}}{8}\leq 18\rho^{\frac{1}{5% }}.

If $\rho^{2}<(1-\tau_{0})^{-5}$ and $\tau_{0}>0$ , picking $\bar{\tau}=\tau_{0}$ , using the definition of $\tau_{0}$ and the Lipschitz property of $h$ ,

\displaystyle\frac{1}{2\beta}\leq\eta h\brk*{\tau_{0}-\frac{1}{T}}=\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}h\brk*{\tau_{0}-\frac{1}{T}}\leq\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}h(\tau_{0})+\frac{\pi\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}}{2T}\leq\frac{\rho h(\tau_{0})}{2\beta h(0)}+\frac{\pi\rho}{4\beta h(0)T},

(using $\eta_{\mathsf{tu}}^{\mathsf{sm}}h(0)\leq\frac{1}{2\beta}$ in the last step)

implying (with $h(0)=1$ and the upper bound $h(\tau_{0})\leq\frac{5}{2}(1-\tau_{0})^{2}$ of Eq. 10) that

\displaystyle\rho\geq\brk*{1-\frac{\pi\rho}{2T}}\frac{1}{h(\tau_{0})}\geq\frac{2}{5}\brk*{1-\frac{\pi\rho}{2T}}(1-\tau_{0})^{-2}.

Combining this with the assumption $\rho^{2}<(1-\tau_{0})^{-5}$ ,

	$\displaystyle\frac{H_{h}(0)}{\rho H_{h}(\tau_{0})}+\frac{\rho Q_{h}(\tau_{0})}{Q_{h}(0)}$	$\displaystyle\leq\frac{3}{2\rho(1-\tau_{0})^{3}}+\frac{125\rho(1-\tau_{0})^{2}}{8}$
		$\displaystyle\leq\frac{15}{4\brk{1-\frac{\pi\rho}{2T}}(1-\tau_{0})}+\frac{125}{8}\sqrt{\frac{1}{1-\tau_{0}}}=O\brk{\frac{1}{1-\tau_{0}}},$

where the last transition uses the assumption $\rho\leq\frac{2T}{\pi}$. Since $\rho\geq 1$ there is no case where $\rho^{2}<(1-\tau_{0})^{-5}$ and $\tau_{0}=0$. Bounding the infimum of Eq. 9 in the two cases with our choices of $\bar{\tau}$, if $\rho^{2}\geq(1-\tau_{0})^{-5}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\brk*{\rho^{\frac{1}{% 5}}}+O\brk*{\frac{\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}},

and if $\rho^{2}<(1-\tau_{0})^{-5}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\brk*{\frac{1}{1-\tau% _{0}}}+O\brk*{\frac{\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}}.

We obtain our final convergence guarantees by noting that $\rho\leq\frac{2T}{\pi}$ , which, together with the fact that $Q_{h}(0)=\Theta(1)$ implies

\displaystyle\frac{\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}\leq\frac% {2\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{\pi}\leq\frac{2\mathsf{Rate}^{% \mathsf{sm,tu}}_{h,T}}{\pi Q_{h}(0)}=O(\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}),

and plugging back into the above bounds. The bound of $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ is immediate from Eq. 4 as $H_{h}(0)=\frac{1}{2}$ and $Q_{h}(0)=\Theta(1)$ (as we established in Eqs. 13 and 14).

∎

Appendix C Last Iterate Guarantees for Stochastic Gradient Descent

A convergence analysis of Stochastic Gradient Descent (SGD) for convex Lipschitz and convex smooth functions follows. The technique, introduced by Zamani and Glineur (2023) and later refined by Liu and Zhou (2024), is based on comparing the iterates of SGD $(x_{1},x_{2},\ldots)$ with iterates of the form

\displaystyle z_{t}\triangleq\frac{v_{t-1}}{v_{t}}z_{t-1}+\brk*{1-\frac{v_{t-1% }}{v_{t}}}x_{t}=\frac{v_{0}}{v_{t}}\hat{x}+\sum_{s=1}^{t}\frac{v_{s}-v_{s-1}}{% v_{t}}x_{s}

(16)

for some non-decreasing sequence $v_{0},v_{1},v_{2},\ldots$ , starting at some $z_{0}=\hat{x}\in\mathcal{X}$ . Note that by Jensen’s inequality, for any $t\geq 2$ ,

\displaystyle f(z_{t})\leq\frac{v_{0}}{v_{t}}f(\hat{x})+\sum_{s=1}^{t}\frac{v_% {s}-v_{s-1}}{v_{t}}f(x_{s}).

(17)

In particular, for any $t\in[T]$ , we will use

\displaystyle v_{t}\triangleq\frac{\eta_{T}}{\sum_{s=t}^{T}\eta_{s}}

(18)

and $v_{0}=v_{1}$ , similarly to Liu and Zhou (2024). Next, we restate the convergence results, Lemmas 3 and 5. Their proofs follow.
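To make the construction in Eqs. 16 and 18 concrete, the following is a minimal sketch that builds the weights $v_{t}$ for a cosine-annealed stepsize sequence and checks that the recursive and closed-form definitions of $z_{t}$ agree. The discretization $\eta_{t}=\eta h((t-1)/T)$ with $h(u)=\frac{1}{2}(1+\cos\pi u)$ , the scalar iterates, and all function names are assumptions made for illustration only.

```python
import math

def cosine_stepsizes(eta, T):
    # Assumed discretization: eta_t = eta * h((t-1)/T), h(u) = (1 + cos(pi*u)) / 2.
    return [eta * 0.5 * (1.0 + math.cos(math.pi * (t - 1) / T)) for t in range(1, T + 1)]

def weights_v(etas):
    # v_t = eta_T / sum_{s=t}^T eta_s for t = 1..T (Eq. 18), and v_0 = v_1.
    T = len(etas)
    v = [0.0] * (T + 1)
    suffix = 0.0
    for t in range(T, 0, -1):
        suffix += etas[t - 1]
        v[t] = etas[-1] / suffix
    v[0] = v[1]
    return v

def comparator_z(xs, x_hat, v):
    # Recursive form of Eq. 16 with z_0 = x_hat; xs[t-1] plays the role of x_t.
    zs = [x_hat]
    for t in range(1, len(xs) + 1):
        r = v[t - 1] / v[t]
        zs.append(r * zs[-1] + (1.0 - r) * xs[t - 1])
    return zs

def comparator_z_closed(xs, x_hat, v, t):
    # Closed form of Eq. 16: a convex combination of x_hat and x_1, ..., x_t.
    acc = v[0] * x_hat
    for s in range(1, t + 1):
        acc += (v[s] - v[s - 1]) * xs[s - 1]
    return acc / v[t]
```

In this sketch the weights grow with $t$ and end at $v_{T}=1$ , which is the property $v_{t}\geq v_{t-1}$ that the proof below relies on.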

C.1 Proof of Lemmas 3 and 5

To prove the last-iterate guarantees we need the following lemmas. Their proofs follow. The first translates from an average regret-like guarantee to a last-iterate guarantee.

Lemma 6.

Let $\mathcal{X}\subseteq{\mathbb{R}}^{d}$ be a convex set, $x_{1},\hat{x}\in\mathcal{X}$ , $f:\mathcal{X}\to{\mathbb{R}}$ a convex function and $T\in{\mathbb{N}}$ . Then for any sequences $g_{1},\ldots,g_{T}\in{\mathbb{R}}^{d}$ and $\eta_{1},\ldots,\eta_{T}>0$ , the iterates defined by $x_{t+1}=x_{t}-\eta_{t}g_{t}$ satisfy

\displaystyle\eta_{T}v_{T}(f(x_{T+1})-f(\hat{x}))\leq\sum_{t=1}^{T}\eta_{t}v_{% t}(f(x_{t+1})-f(z_{t})),

where $z_{0},\ldots,z_{T}$ and $v_{0},\ldots,v_{T}$ are defined by Eqs. 18 and 16.

Lemma 7.

\displaystyle\mathbb{E}[f(x_{t+1})-f(z_{t})]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{\norm{x_{t}-z_{t}}^{2}-\norm{x_{t+1}% -z_{t}}^{2}-\norm{x_{t+1}-x_{t}}^{2}}{2\eta_{t}}+f(x_{t+1})-f(x_{t})+g_{t}\bm{% \cdot}(x_{t}-x_{t+1})},

where $z_{0},\ldots,z_{T}$ are defined by Eq. 16.

We proceed to the proof.

Proof of Lemmas 3 and 5.

By Lemma 7,

\displaystyle\mathbb{E}[f(x_{t+1})-f(z_{t})]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{\norm{x_{t}-z_{t}}^{2}-\norm{x_{t+1}% -z_{t}}^{2}}{2\eta_{t}}+\Delta_{t}},

where $\Delta_{t}\triangleq f(x_{t+1})-f(x_{t})+g_{t}\bm{\cdot}(x_{t}-x_{t+1})-\frac{% \norm{x_{t+1}-x_{t}}^{2}}{2\eta_{t}}$ . By the definition of $z_{t}$ and the fact that $v_{t}\geq v_{t-1}$ ,

\displaystyle\norm{x_{t}-z_{t}}^{2}

\displaystyle=\frac{v_{t-1}^{2}}{v_{t}^{2}}\norm{x_{t}-z_{t-1}}^{2}\leq\frac{v_{t-1}}{v_{t}}\norm{x_{t}-z_{t-1}}^{2}.

Combining with our previous inequality multiplied by $\eta_{t}v_{t}$ ,

\displaystyle\mathbb{E}[\eta_{t}v_{t}(f(x_{t+1})-f(z_{t}))]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{v_{t-1}\norm{x_{t}-z_{t-1}}^{2}-v_{t% }\norm{x_{t+1}-z_{t}}^{2}}{2}+\eta_{t}v_{t}\Delta_{t}}.

Summing for $t=1,\ldots,T$ , and removing $-v_{T}\norm{x_{T+1}-z_{T}}^{2}\leq 0$ ,

\displaystyle\mathbb{E}\brk[s]*{\sum_{t=1}^{T}\eta_{t}v_{t}(f(x_{t+1})-f(z_{t}% ))}

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{v_{0}\norm{x_{1}-z_{0}}^{2}}{2}+\sum% _{t=1}^{T}\eta_{t}v_{t}\Delta_{t}}.

Combining with Lemma 6, and noting that $z_{0}=\hat{x}$ ,

\displaystyle\mathbb{E}[\eta_{T}v_{T}(f(x_{T+1})-f(\hat{x}))]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{v_{0}\norm{x_{1}-\hat{x}}^{2}}{2}+% \sum_{t=1}^{T}\eta_{t}v_{t}\Delta_{t}}.

(19)

Next, we assume a second-moment bound (as in Lemma 3). From convexity,

\displaystyle\mathbb{E}[f(x_{t+1})-f(x_{t})]\leq\mathbb{E}[\nabla f(x_{t+1})\bm{\cdot}(x_{t+1}-x_{t})]\leq\mathbb{E}\brk[s]*{\eta_{t}\norm{\nabla f(x_{t+1})}^{2}+\frac{\norm{x_{t+1}-x_{t}}^{2}}{4\eta_{t}}},

where we used the inequality $2u\bm{\cdot}v\leq\norm{u}^{2}+\norm{v}^{2}$ . Similarly, $g_{t}\bm{\cdot}(x_{t}-x_{t+1})\leq\eta_{t}\norm{g_{t}}^{2}+\frac{\norm{x_{t+1}-x_{t}}^{2}}{4\eta_{t}}$ . Hence, using the second-moment bound, $\mathbb{E}\Delta_{t}\leq 2\eta_{t}G^{2}$ . Plugging the bound on $\mathbb{E}[\Delta_{t}]$ into Eq. 19 concludes the proof of Lemma 3. Next, we assume that $f$ is $\beta$ -smooth, that a variance bound holds, and that $\eta_{t}\leq\frac{1}{2\beta}$ for all $t\in[T]$ (as in Lemma 5). By smoothness,

	$\displaystyle f(x_{t+1})-f(x_{t})$	$\displaystyle\leq\nabla f(x_{t})\bm{\cdot}(x_{t+1}-x_{t})+\frac{\beta}{2}\norm% {x_{t+1}-x_{t}}^{2}$
		$\displaystyle\leq\nabla f(x_{t})\bm{\cdot}(x_{t+1}-x_{t})+\frac{1}{4\eta_{t}}% \norm{x_{t+1}-x_{t}}^{2}.$		( $\eta_{t}\leq\frac{1}{2\beta}$ )

By the inequality $2u\bm{\cdot}v\leq\norm{u}^{2}+\norm{v}^{2}$ ,

	$\displaystyle\nabla f(x_{t})\bm{\cdot}(x_{t+1}-x_{t})$	$\displaystyle=(\nabla f(x_{t})-g_{t})\bm{\cdot}(x_{t+1}-x_{t})+g_{t}\bm{\cdot}% (x_{t+1}-x_{t})$
		$\displaystyle\leq\eta_{t}\norm{\nabla f(x_{t})-g_{t}}^{2}+\frac{\norm{x_{t+1}-% x_{t}}^{2}}{4\eta_{t}}+g_{t}\bm{\cdot}(x_{t+1}-x_{t}).$

Hence, using the variance bound,

\displaystyle\mathbb{E}\Delta_{t}

\displaystyle\leq\mathbb{E}\brk[s]*{\eta_{t}\norm{\nabla f(x_{t})-g_{t}}^{2}+% \frac{\norm{x_{t+1}-x_{t}}^{2}}{4\eta_{t}}+\frac{\norm{x_{t+1}-x_{t}}^{2}}{4% \eta_{t}}-\frac{\norm{x_{t+1}-x_{t}}^{2}}{2\eta_{t}}}\leq\eta_{t}\sigma^{2}.

Plugging the bound on $\mathbb{E}[\Delta_{t}]$ into Eq. 19 concludes the proof of Lemma 5. ∎

C.2 Proof of Lemma 6

Proof.

By Eq. 17,

	$\displaystyle\sum_{t=1}^{T}\eta_{t}v_{t}(f(x_{t+1})-f(z_{t}))$	$\displaystyle\geq\sum_{t=1}^{T}\eta_{t}v_{t}\brk{f(x_{t+1})-\brk{\frac{v_{0}% }{v_{t}}f(\hat{x})+\sum_{s=1}^{t}\frac{v_{s}-v_{s-1}}{v_{t}}f(x_{s})}}$
		$\displaystyle=\sum_{t=1}^{T}\eta_{t}\brk{v_{t}f(x_{t+1})-\brk{v_{0}f(\hat{x}% )+\sum_{s=1}^{t}\brk{v_{s}-v_{s-1}}f(x_{s})}}$
		$\displaystyle=\sum_{t=1}^{T}\eta_{t}\brk*{v_{t}f(x_{t+1})-v_{t}f(\hat{x})-\sum% _{s=1}^{t}\brk{v_{s}-v_{s-1}}(f(x_{s})-f(\hat{x}))}$
		$\displaystyle=\eta_{T}v_{T}(f(x_{T+1})-f(\hat{x}))$
		$\displaystyle+\sum_{t=1}^{T}\brk*{\eta_{t-1}v_{t-1}-(v_{t}-v_{t-1})\sum_{s=t}^% {T}\eta_{s}}(f(x_{t})-f(\hat{x})).$

Note that $v_{1}=v_{0}$ and for $2\leq t\leq T$ ,

\displaystyle\eta_{t-1}v_{t-1}-(v_{t}-v_{t-1})\sum_{s=t}^{T}\eta_{s}

\displaystyle=\eta_{T}\brk*{\frac{\eta_{t-1}}{\sum_{s=t-1}^{T}\eta_{s}}-\frac{% \eta_{t-1}\sum_{s=t}^{T}\eta_{s}}{\sum_{s=t-1}^{T}\eta_{s}\cdot\sum_{s=t}^{T}% \eta_{s}}}=0.

Thus,

\displaystyle\eta_{T}v_{T}(f(x_{T+1})-f(\hat{x}))

\displaystyle\leq\sum_{t=1}^{T}\eta_{t}v_{t}(f(x_{t+1})-f(z_{t})).\qed
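The cancellation used in the proof above, namely that $\eta_{t-1}v_{t-1}-(v_{t}-v_{t-1})\sum_{s=t}^{T}\eta_{s}=0$ for $2\leq t\leq T$ , can be checked in exact rational arithmetic. The sketch below does so for an arbitrary positive stepsize sequence; the function name and the linearly decaying test sequence are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def lemma6_coefficients(etas):
    """Return eta_{t-1} v_{t-1} - (v_t - v_{t-1}) * sum_{s=t}^T eta_s for t = 2..T.

    etas[t-1] holds eta_t; v_t = eta_T / sum_{s=t}^T eta_s as in Eq. 18.
    """
    T = len(etas)
    # suffix[t] = sum_{s=t}^T eta_s, stored 1-indexed with suffix[T+1] = 0.
    suffix = [Fraction(0)] * (T + 2)
    for t in range(T, 0, -1):
        suffix[t] = suffix[t + 1] + etas[t - 1]
    v = [None] + [etas[-1] / suffix[t] for t in range(1, T + 1)]
    coeffs = []
    for t in range(2, T + 1):
        eta_prev = etas[t - 2]  # eta_{t-1}
        coeffs.append(eta_prev * v[t - 1] - (v[t] - v[t - 1]) * suffix[t])
    return coeffs
```

Because `Fraction` avoids rounding, every returned coefficient should be exactly zero, so only the term $\eta_{T}v_{T}(f(x_{T+1})-f(\hat{x}))$ survives the rearrangement.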

C.3 Proof of Lemma 7

Proof.

Using the convexity of $f$ ,

\displaystyle\mathbb{E}[f(x_{t+1})-f(z_{t})]

\displaystyle=\mathbb{E}[f(x_{t})-f(z_{t})+f(x_{t+1})-f(x_{t})]\leq\mathbb{E}[% \nabla f(x_{t})\bm{\cdot}(x_{t}-z_{t})+f(x_{t+1})-f(x_{t})].

(20)

Focusing on the first term, as $z_{t}$ does not depend on $g_{t}$ ,

\displaystyle\mathbb{E}[\nabla f(x_{t})\bm{\cdot}(x_{t}-z_{t})]

\displaystyle=\mathbb{E}[g_{t}\bm{\cdot}(x_{t}-z_{t})]=\mathbb{E}[g_{t}\bm{% \cdot}(x_{t+1}-z_{t})+g_{t}\bm{\cdot}(x_{t}-x_{t+1})].

Note that the update step is

\displaystyle x_{t+1}=\operatorname*{arg\,min}_{x\in\mathcal{X}}\set*{f(x_{t})% +g_{t}\bm{\cdot}(x-x_{t})+\frac{1}{2\eta_{t}}\norm{x-x_{t}}^{2}}.

From the first-order optimality condition,

\displaystyle\frac{1}{\eta_{t}}(x_{t+1}-x_{t}+\eta_{t}g_{t})\bm{\cdot}(z_{t}-x% _{t+1})\geq 0.

Rearranging,

\displaystyle g_{t}\bm{\cdot}(x_{t+1}-z_{t})

\displaystyle\leq\frac{\norm{x_{t}-z_{t}}^{2}-\norm{x_{t+1}-z_{t}}^{2}-\norm{x% _{t+1}-x_{t}}^{2}}{2\eta_{t}}.

Thus,

\displaystyle\mathbb{E}[\nabla f(x_{t})\bm{\cdot}(x_{t}-z_{t})]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{\norm{x_{t}-z_{t}}^{2}-\norm{x_{t+1}% -z_{t}}^{2}-\norm{x_{t+1}-x_{t}}^{2}}{2\eta_{t}}+g_{t}\bm{\cdot}(x_{t}-x_{t+1}% )}.

Returning to Eq. 20, we conclude that

\displaystyle\mathbb{E}[f(x_{t+1})-f(z_{t})]

\displaystyle\leq\mathbb{E}\brk[s]*{\frac{\norm{x_{t}-z_{t}}^{2}-\norm{x_{t+1}% -z_{t}}^{2}-\norm{x_{t+1}-x_{t}}^{2}}{2\eta_{t}}+f(x_{t+1})-f(x_{t})+g_{t}\bm{% \cdot}(x_{t}-x_{t+1})}.\qed
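The "Rearranging" step in the proof above relies on a standard three-point identity that is not written out. As a short worked derivation, with the shorthand $a=x_{t+1}-x_{t}$ and $b=z_{t}-x_{t+1}$ introduced here only for illustration (so that $a+b=z_{t}-x_{t}$ ):

```latex
\begin{align*}
\|a+b\|^{2} &= \|a\|^{2}+\|b\|^{2}+2\,a\cdot b
\;\Longrightarrow\;
(x_{t+1}-x_{t})\cdot(z_{t}-x_{t+1})
  = \frac{\|x_{t}-z_{t}\|^{2}-\|x_{t+1}-z_{t}\|^{2}-\|x_{t+1}-x_{t}\|^{2}}{2}, \\
g_{t}\cdot(x_{t+1}-z_{t})
  &\le \frac{1}{\eta_{t}}\,(x_{t+1}-x_{t})\cdot(z_{t}-x_{t+1})
  = \frac{\|x_{t}-z_{t}\|^{2}-\|x_{t+1}-z_{t}\|^{2}-\|x_{t+1}-x_{t}\|^{2}}{2\eta_{t}}.
\end{align*}
```

The inequality in the second line is the first-order optimality condition after moving the $g_{t}$ term to the other side.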

Appendix D Sensitivity of Fixed Stepsize Gradient Descent to Misspecification of the Stepsize

Given a $G$ -Lipschitz function $f:\mathcal{X}\to{\mathbb{R}}$ , where $\mathcal{X}\subset{\mathbb{R}}^{d}$ is a convex set with diameter $D$ , the standard average-iterate convergence guarantee of $T$ -step Gradient Descent (GD) with a fixed stepsize $\eta>0$ is

\displaystyle f\brk*{\frac{1}{T}\sum_{t=1}^{T}x_{t}}-\min_{x\in\mathcal{X}}f(x% )\leq\mathsf{Rate}_{\mathsf{con},T}(\eta)\triangleq\frac{D^{2}}{2\eta T}+\frac% {\eta G^{2}}{2}.

The optimal stepsize $\eta_{\mathsf{tu}}=\frac{D}{G\sqrt{T}}$ satisfies $\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})=\frac{DG}{\sqrt{T}}$ . Given a multiplicative overestimation of the optimal stepsize, $\eta=\rho\eta_{\mathsf{tu}}$ for $\rho\geq 1$ , the convergence guarantee is

\displaystyle\mathsf{Rate}_{\mathsf{con},T}(\rho\eta_{\mathsf{tu}})=\mathsf{% Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})\brk*{\frac{1}{2\rho}+\frac{\rho}{2}% }=\Omega(\rho\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})).
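As a quick numerical illustration of this identity, the sketch below evaluates the fixed-stepsize guarantee at the tuned and at an overestimated stepsize, and compares the resulting factor $\frac{1}{2\rho}+\frac{\rho}{2}$ with the sublinear dependence $\rho^{1/(2p+1)}$ quoted in the abstract for annealed schedules. The function names are hypothetical, and `annealed_factor` keeps only the $\rho$ -dependence with all constants hidden by the $O(\cdot)$ notation dropped, so the comparison is indicative rather than exact.

```python
import math

def rate_con(eta, D, G, T):
    # Average-iterate guarantee D^2 / (2 eta T) + eta G^2 / 2 for fixed-stepsize GD.
    return D**2 / (2.0 * eta * T) + eta * G**2 / 2.0

def tuned_stepsize(D, G, T):
    # Minimizer of rate_con over eta: eta_tu = D / (G sqrt(T)).
    return D / (G * math.sqrt(T))

def fixed_stepsize_factor(rho):
    # Ratio rate_con(rho * eta_tu) / rate_con(eta_tu) = 1/(2 rho) + rho/2.
    return 1.0 / (2.0 * rho) + rho / 2.0

def annealed_factor(rho, p):
    # rho-dependence of the annealed rate from the abstract; constants are omitted.
    return rho ** (1.0 / (2 * p + 1))
```

For instance, under these simplifications a misspecification of $\rho=32$ would multiply the fixed-stepsize guarantee by roughly $16$ , while the annealed dependence with $p=2$ is $32^{1/5}=2$ .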

A natural follow-up question is whether this linear dependence on $\rho$ is simply an artifact of the analysis or a true degradation in the convergence rate of GD. Next, we show that for any weights $w_{1},\ldots,w_{T}$ , the worst-case convergence rate of the (weighted) average iterate is $\Omega(\rho\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}}))$ .

Let $T\in{\mathbb{N}}$ , $D>0$ , $G>0$ , $0<\rho<\frac{1}{2}\sqrt{T}$ and $w_{1},\ldots,w_{T}>0$ . First we will assume that $w_{1}+w_{3}+\ldots+w_{2\lfloor(T-1)/2\rfloor+1}\geq w_{2}+w_{4}+\ldots+w_{2% \lfloor T/2\rfloor}$ . Let $\eta=\frac{\rho D}{G\sqrt{T}}$ for some $\rho\geq 1$ , $f(x)=G\abs{x}$ defined over the domain $[-\frac{D}{2},\frac{D}{2}]$ , and let $x_{1}=\frac{3}{4}G\eta$ . After a single gradient step, $x_{2}=x_{1}-\eta G=-\frac{1}{4}G\eta$ . After another update step, $x_{3}=x_{2}+\eta G=\frac{3}{4}G\eta=x_{1}$ . Hence, the iterates will move back and forth between $\frac{3}{4}G\eta$ and $-\frac{1}{4}G\eta$ , and the average iterate $\overline{x}$ will satisfy

\displaystyle\overline{x}=\frac{1}{\sum_{t=1}^{T}w_{t}}\sum_{t=1}^{T}w_{t}x_{t}

\displaystyle=\frac{G\eta\brk*{3\sum_{t=1,3,\ldots}w_{t}-\sum_{t=2,4,\ldots}w_% {t}}}{4\sum_{t=1}^{T}w_{t}}\geq\frac{G\eta\brk*{2\sum_{t=1,3,\ldots}w_{t}}}{8% \sum_{t=1,3,\ldots}w_{t}}=\frac{\eta G}{4},

where we used our assumption that $w_{1}+w_{3}+\ldots+w_{2\lfloor(T-1)/2\rfloor+1}\geq w_{2}+w_{4}+\ldots+w_{2% \lfloor T/2\rfloor}$ . Hence,

\displaystyle f(\overline{x})\geq\frac{\eta G^{2}}{4}=\frac{\rho DG}{4\sqrt{T}% }=\Omega(\rho\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})).

If, on the other hand, it holds that $w_{1}+w_{3}+\ldots+w_{2\lfloor(T-1)/2\rfloor+1}<w_{2}+w_{4}+\ldots+w_{2\lfloor T/2\rfloor}$ , we can initialize $x_{1}=-\frac{G\eta}{4}$ , and mirroring the same argument concludes the proof.

Hence, the worst-case convergence rate of fixed stepsize GD degrades linearly in a multiplicative misspecification of the stepsize. As GD is a special case of SGD, the lower bound also holds for SGD with a second-moment bound $G^{2}$ .
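The oscillation in this lower-bound construction is easy to reproduce. The following minimal sketch runs projected GD on $f(x)=G\abs{x}$ over $[-\frac{D}{2},\frac{D}{2}]$ from $x_{1}=\frac{3}{4}G\eta$ and returns the suboptimality of the uniformly weighted average iterate. The function name, the choice of uniform weights, and the convention of using the subgradient $0$ at $x=0$ are assumptions made for illustration.

```python
import math

def gd_on_abs(T, D, G, rho):
    """Fixed-stepsize GD on f(x) = G|x| with eta = rho * D / (G sqrt(T))."""
    eta = rho * D / (G * math.sqrt(T))
    x = 0.75 * G * eta  # initialization x_1 from the construction
    iterates = []
    for _ in range(T):
        iterates.append(x)
        # Subgradient of G|x|; the value at x = 0 is an arbitrary valid choice.
        grad = G * math.copysign(1.0, x) if x != 0.0 else 0.0
        # Gradient step followed by projection onto [-D/2, D/2].
        x = min(max(x - eta * grad, -D / 2.0), D / 2.0)
    x_bar = sum(iterates) / T
    # min f = 0, so f(x_bar) is the suboptimality of the average iterate.
    return iterates, G * abs(x_bar)
```

With uniform weights the odd-indexed iterates carry at least as much weight as the even-indexed ones, so the first case of the argument applies and the returned gap should be at least $\frac{\rho DG}{4\sqrt{T}}$ up to floating-point error.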

Appendix E Convergence Analysis with Stepsize Schedules

In this section, we provide convergence guarantees for SGD with an annealed schedule in the convex Lipschitz and convex smooth settings. The guarantees are established by combining a last-iterate guarantee with Lemma 4, which translates the sums of stepsizes to integrals that depend on the schedule. The proofs follow.

See Lemma 1.

Note that when we tune $\eta$ according to Eq. 2, we obtain a convergence rate of

\displaystyle\frac{2DG}{\sqrt{T}}\sqrt{Q_{h}(0)/H_{h}(0)}+O\brk*{\frac{pDG/% \sqrt{H_{h}(0)Q_{h}(0)}}{T^{3/2}}}.

See Lemma 2.

Similarly, when we tune $\eta$ according to Eq. 3, we obtain a convergence rate of

\displaystyle\frac{\beta D^{2}h(0)}{TH_{h}(0)}+\frac{D\sigma}{\sqrt{T}}\sqrt{2% Q_{h}(0)/H_{h}(0)}+O\brk*{\frac{pD\sigma/\sqrt{H_{h}(0)Q_{h}(0)}}{T^{3/2}}}.

Note that using the fact that $h$ is non-increasing and the Lipschitz condition,

\displaystyle h(0)\geq H_{h}(0)=\int_{0}^{1}h(u)du\geq\int_{0}^{\min\{1,h(0)/2% p\}}\frac{1}{2}h(0)du=\frac{1}{2}h(0)\min\{1,h(0)/2p\}.

Additionally,

\displaystyle Q_{h}(0)=\int_{0}^{1}\frac{h(u)^{2}}{H_{h}(u)}du\geq\int_{0}^{1}% h(u)du=H_{h}(0)\geq\frac{1}{2}h(0)\min\{1,h(0)/2p\}

and using Eq. 8,

\displaystyle Q_{h}(0)=\int_{0}^{1}\frac{H_{h}^{\prime}(u)^{2}}{H_{h}(u)}du% \leq 2p.

Hence, assuming $h(0)=\Theta(1)$ and $p=\Theta(1)$ , $H_{h}(0)$ and $Q_{h}(0)$ are $\Theta(1)$ , and the rates above match those of optimally tuned fixed stepsize SGD up to constant factors.
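For a concrete schedule, these quantities can be evaluated numerically. The sketch below does this for the cosine schedule, assuming the common form $h(u)=\frac{1}{2}(1+\cos\pi u)$ (this explicit form is an assumption here), for which $H_{h}(u)=\int_{u}^{1}h(s)ds$ has the closed form used in `H_cos`. The integral $Q_{h}(0)$ is approximated with a simple midpoint rule; the function names and the grid size are illustrative choices.

```python
import math

def h_cos(u):
    # Assumed cosine schedule h(u) = (1 + cos(pi u)) / 2 on [0, 1].
    return 0.5 * (1.0 + math.cos(math.pi * u))

def H_cos(u):
    # Closed form of int_u^1 h(s) ds = (1 - u)/2 - sin(pi u) / (2 pi).
    return 0.5 * (1.0 - u) - math.sin(math.pi * u) / (2.0 * math.pi)

def Q_cos(n=2000):
    # Midpoint-rule approximation of Q_h(0) = int_0^1 h(u)^2 / H_h(u) du.
    # Midpoints avoid u = 1, where both h and H_h vanish.
    total = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        total += h_cos(u) ** 2 / H_cos(u)
    return total / n
```

Under this assumed form, `H_cos(0.0)` equals $\frac{1}{2}$ , and the numerical value of `Q_cos()` should fall between the lower bound $H_{h}(0)$ and the upper bound $2p$ derived above, where $p=\frac{\pi}{2}$ is the Lipschitz constant of this particular $h$ .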

E.1 Proofs of Lemmas 1 and 2

Proof of Lemma 1.

By Lemma 3 with $\hat{x}=x^{\star}$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\sum_{s=1}^{T}\eta_{s}}+2G^{2}\sum_{t=1}^{T}% \frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.

Using Lemma 4 with $c_{1}=D^{2}/2$ , $c_{2}=2G^{2}$ , $k=1$ and $\tau=0$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(0)}\!+\!2\eta G^{2}\int_{0}^{1-% \frac{1}{T}}\frac{h(u)^{2}}{H_{h}(u)}du\!+\!\frac{8p\eta G^{2}}{T}\leq\frac{D^% {2}}{2\eta TH_{h}(0)}+2\eta G^{2}Q_{h}(0)+\frac{8p\eta G^{2}}{T},

where the last inequality follows from the fact that $h(u)$ and $H_{h}(u)$ are non-negative and from the definition of $Q_{h}(u)$ . ∎
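The discrete last-iterate bound that enters Lemma 4 can be evaluated directly for any stepsize sequence. The helper below is a minimal sketch of the expression $\frac{c_{1}}{\sum_{s=1}^{T}\eta_{s}}+c_{2}\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}$ , where $c_{1}=D^{2}/2$ and $c_{2}=2G^{2}$ recover the Lipschitz case above and $c_{2}=\sigma^{2}$ the smooth case. The function name is hypothetical.

```python
def last_iterate_bound(etas, c1, c2):
    """Evaluate c1 / sum_s eta_s + c2 * sum_t eta_t^2 / sum_{s>=t} eta_s.

    etas[t-1] holds eta_t and must be positive.
    """
    suffix = 0.0  # running value of sum_{s=t}^T eta_s, built from t = T down to 1
    noise = 0.0
    for eta_t in reversed(etas):
        suffix += eta_t
        noise += eta_t**2 / suffix
    # After the loop, suffix equals the full sum of stepsizes.
    return c1 / suffix + c2 * noise
```

Evaluating this helper for a schedule scaled by different values of $\rho$ gives a direct, if informal, way to inspect how the guarantee changes with misspecification before passing to the integral form.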

Proof of Lemma 2.

As $\eta_{1}=\eta h(0)\leq\frac{1}{2\beta}$ and $h$ is non-increasing, $\eta_{t}\leq\frac{1}{2\beta}$ and we can use Lemma 5 with $\hat{x}=x^{\star}$ , obtaining

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\sum_{s=1}^{T}\eta_{s}}+\sigma^{2}\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.

Invoking Lemma 4 with $c_{1}=D^{2}/2$ , $c_{2}=\sigma^{2}$ , $k=1$ and $\tau=0$ ,

\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]

\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(0)}+\eta\sigma^{2}\int_{0}^{1-\frac% {1}{T}}\frac{h(u)^{2}}{H_{h}(u)}du+\frac{4\eta p\sigma^{2}}{T}\leq\frac{D^{2}}% {2\eta TH_{h}(0)}+\eta\sigma^{2}Q_{h}(0)+\frac{4p\eta\sigma^{2}}{T},

where the last inequality follows from the fact that $h(u)$ and $H_{h}(u)$ are non-negative and from the definition of $Q_{h}(u)$ . ∎
