Benefits of Learning Rate Annealing for
Tuning-Robustness in Stochastic Optimization

Amit Attia Blavatnik School of Computer Science, Tel Aviv University; amitattia@mail.tau.ac.il.    Tomer Koren Blavatnik School of Computer Science, Tel Aviv University, and Google Research Tel Aviv; tkoren@tauex.tau.ac.il.
Abstract

The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.

1 Introduction

Stochastic Gradient Descent (SGD, Robbins and Monro, 1951) is a cornerstone of modern machine learning. Starting at a point $x_1$, the update step of SGD takes the form $x_{t+1} = x_t - \eta_t g_t$, where $\eta_t$ is the stepsize at step $t$ and $g_t$ is a stochastic gradient at $x_t$. An effective stepsize sequence $\eta_1, \eta_2, \ldots$ is critical for performance, yet it is notoriously hard to tune in many scenarios and applications (e.g., Bottou, 2012; Schaul et al., 2013). Furthermore, as models continue to scale, the computational burden of stepsize tuning becomes increasingly demanding.

A common approach to tuning the stepsize sequence is simply using a fixed stepsize, selecting the best fixed value by performing a geometric grid search (Bengio, 2012). In this method, the stepsize is selected based on its performance on a validation set, with the grid resolution determining the (multiplicative) proximity to the best stepsize within the specified range.
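To make the link between grid resolution and multiplicative proximity concrete, the following sketch builds a geometric grid and measures how far the nearest grid point at or above a target stepsize lies from that target. The function names, the grid range, and the choice to look only at grid points above the target are assumptions made for this illustration; they are not taken from the paper.

```python
def geometric_grid(lo, hi, ratio):
    """Return stepsizes lo, lo*ratio, lo*ratio**2, ... until hi is covered."""
    if lo <= 0 or ratio <= 1:
        raise ValueError("need lo > 0 and ratio > 1")
    grid = [lo]
    while grid[-1] < hi:
        grid.append(grid[-1] * ratio)
    return grid


def overestimation_factor(grid, eta_best):
    """Smallest eta / eta_best over grid points eta >= eta_best (hypothetical helper)."""
    candidates = [eta / eta_best for eta in grid if eta >= eta_best]
    if not candidates:
        raise ValueError("eta_best lies above the grid")
    return min(candidates)


# Illustrative values: a coarse grid with ratio 10 over [1e-4, 1].
grid = geometric_grid(1e-4, 1.0, 10.0)
rho = overestimation_factor(grid, eta_best=3e-3)
print(f"grid size = {len(grid)}, overestimation factor = {rho:.3f}")
```

For a target inside the grid range, this factor is at most the grid ratio, which is why a coarser grid allows a larger multiplicative gap.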

A primary approach to moving beyond fixed stepsize sequences is stepsize scheduling. In stepsize scheduling (e.g., Smith, 2017; Loshchilov and Hutter, 2017; Ge et al., 2019), the step at time $t$ is determined by multiplying a baseline stepsize parameter with a parametric sequence. While the approach enables more versatile stepsize sequences and often leads to improved performance, it still requires tuning the baseline stepsize parameter, typically through grid search. Some stepsize schedules also exhibit theoretical benefits, such as anytime convergence guarantees and better last-iterate guarantees (e.g., Jain et al., 2019; Zamani and Glineur, 2023; Liu and Zhou, 2024; Defazio et al., 2024a).

While stepsize tuning is a widely adopted practice, its theoretical foundations remain under-explored. One key question is how sensitive this procedure is to the grid resolution. Limited computational budgets restrict the resolution of grid searches, an issue that has become increasingly prominent with the emergence of modern models consisting of billions of parameters that take days, and sometimes weeks, to train. In fact, at massive scales, any methodical tuning of the stepsize is often prohibitive and is therefore abandoned entirely.

Standard analyses of fixed stepsize SGD in the convex setting demonstrate a linear degradation in convergence rate as a function of the multiplicative misspecification of the stepsize, which can be significant when performing a coarse (or even absent) grid search. This work investigates to what extent stepsize schedules can mitigate this dependency, providing more robust performance at lower grid resolutions.

Focusing our analysis on stochastic convex optimization, we establish convergence guarantees for SGD with stepsize schedules that decay polynomially to zero, which reveals a key advantage of automatically adapting to multiplicative overestimation of the stepsize. For commonly used schedules, such as cosine annealing, our guarantees yield a sublinear dependence on the misspecification factor, in contrast to the linear dependence that arises with fixed stepsizes. We further validate our theoretical findings through experiments on synthetic and real data, demonstrating improved robustness to stepsize tuning using decaying schedules compared to tuning a constant stepsize using a grid-search.

1.1 Summary of Contributions

In more detail, we consider stochastic first-order convex optimization settings, where we aim to minimize a convex objective $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ is a convex set with diameter $D$, while accessing $f$ only through a (sub-)gradient oracle $g$ (i.e., $\mathbb{E}[g(x)] \in \partial f(x)$ for all $x \in \mathcal{X}$). Given an initial stepsize $\eta > 0$, a schedule is specified by a function $h : [0,1] \to [0,1]$ through $\eta_t = \eta h(\frac{t-1}{T})$, where $T$ is the total number of SGD update steps.

Our main results are the following:

  • Our first main result in the convex (non-smooth) case, where we assume that the second moment of the oracle is bounded, is a convergence guarantee of the last iterate of $T$-steps SGD using a decaying schedule $h$ (which satisfies some mild assumptions), of the form

    $$O\left(\mathsf{Rate}_{h,T}^{\mathsf{tu}}\right) \cdot \inf_{\tau \in [0,1)} \left\{ \frac{1}{\rho H_h(\tau)} + \rho Q_h(\tau) \right\},$$

    where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ is the convergence rate using a tuned stepsize, $H_h$ and $Q_h$ are certain functions that depend on the schedule $h$, and $\rho = \eta / \eta_{\mathsf{tu}} \geq 1$ is the multiplicative overestimation factor of $\eta$ compared to the tuned stepsize $\eta_{\mathsf{tu}}$. The infimum above is at most $O(\rho)$, but as we discuss below, may become sublinear in $\rho$ depending on the particular schedule $h$.

  • Our second main result deals with the convex smooth case, where we assume that $f$ is $\beta$-smooth and that the oracle has a bounded variance. We obtain a similar convergence guarantee of

    $$O\left(\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\right) \cdot \inf_{\tau \in [\tau_0,1)} \left\{ \frac{1}{\rho H_h(\tau)} + \rho Q_h(\tau) \right\},$$

    where $\tau_0$ is the fraction of steps with $\eta_t > 1/(2\beta)$, $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ is the convergence rate with a tuned stepsize, and $\rho \geq 1$ is the multiplicative overestimation factor compared to the tuned stepsize. The dependence on $\tau_0$ is unavoidable (up to constants), as convergence in smooth optimization requires a stepsize smaller than $2/\beta$. For sufficiently small $\tau_0$, the infimum is again $O(\rho)$ in the worst case.

  • Applying our main result to the cosine annealing schedule in the convex Lipschitz case, we obtain that the last-iterate convergence rate of SGD is $O(\rho^{0.2} DG / \sqrt{T})$. Similarly, applying the same result to the polynomially decaying schedule $(1 - \frac{t-1}{T})^p$ for some constant degree $p \geq 1$, we obtain that the last-iterate convergence rate of SGD is $O(\rho^{1/(2p+1)} DG / \sqrt{T})$. In the convex smooth case, assuming $\eta_1 = \eta h(0) \leq 1/(2\beta)$, we obtain the same multiplicative sub-optimality of $\rho^{0.2}$ and $\rho^{1/(2p+1)}$ for cosine annealing and (degree $p$-)polynomially decaying schedules.

  • Additionally, we validate the robustness of various learning rate schedules to tuning in experiments, by performing grid search on two tasks: a synthetic logistic regression task with a linear model and the CIFAR-10 classification task with a deep neural network. We find that, when using a coarse grid, annealing schemes (specifically cosine annealing and linear decay) demonstrate greater robustness compared to a fixed stepsize schedule.

Our theoretical results show that polynomially decaying schedules, including cosine annealing, achieve convergence rates with a sublinear dependence on the misspecification factor, in contrast to the linear dependence observed in SGD with a fixed stepsize (which we demonstrate in detail in Appendix D). This distinction is particularly striking since, while both fixed and annealed stepsizes are able to attain the optimal convergence rate when properly tuned, the latter exhibits significantly greater robustness to parameter misspecifications. When tuning the stepsize using a coarse grid search under a limited computational budget, this difference in robustness can significantly impact performance, as also seen in our synthetic and real-data experiments.
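As a rough numerical illustration of the gap between linear and sublinear dependence, the sketch below evaluates the misspecification factors quoted above for a few values of $\rho$. It ignores the constants hidden in the $O(\cdot)$ notation, and the function name and schedule labels are hypothetical choices for this example. (Numerically, the cosine exponent $0.2$ equals $1/(2p+1)$ at $p = 2$.)

```python
def robustness_factor(rho, schedule, p=1):
    """Multiplicative slowdown in the rates quoted above, up to constants."""
    if rho < 1:
        raise ValueError("the stated rates assume rho >= 1")
    if schedule == "fixed":
        return rho                           # linear dependence on rho
    if schedule == "cosine":
        return rho ** 0.2                    # rho^{0.2} for cosine annealing
    if schedule == "polynomial":
        return rho ** (1.0 / (2 * p + 1))    # rho^{1/(2p+1)} for (1 - u)^p
    raise ValueError(f"unknown schedule: {schedule}")


for rho in (2.0, 10.0, 100.0):
    row = {
        "fixed": robustness_factor(rho, "fixed"),
        "linear decay (p=1)": robustness_factor(rho, "polynomial", p=1),
        "cosine": robustness_factor(rho, "cosine"),
    }
    print(rho, {name: round(value, 3) for name, value in row.items()})
```

Under these formulas, the factor for a fixed stepsize grows in proportion to $\rho$, while the factors for the annealed schedules grow only as a small power of $\rho$.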

1.2 Additional Related Work

Adaptive and parameter-free methods.

Beyond learning rate scheduling, several approaches have been developed to minimize the need for extensive tuning in first-order optimization. These include adaptive methods, such as AdaGrad and Adam (e.g., Duchi et al., 2011; Kingma and Ba, 2015), as well as recent theoretical advancements (Reddi et al., 2018; Tran et al., 2019; Kavis et al., 2019; Alacaoglu et al., 2020; Faw et al., 2022; Kavis et al., 2022; Attia and Koren, 2023; Liu et al., 2023), which utilize gradient statistics to dynamically adjust learning rates. Additionally, parameter-free methods (e.g., Chaudhuri et al., 2009; Streeter and McMahan, 2012; Luo and Schapire, 2015; Orabona and Pál, 2016; Cutkosky and Orabona, 2018; Orabona and Pál, 2021; Carmon and Hinder, 2022) primarily focus on automatically adapting to the problem’s complexity, such as the distance to the optimal solution. Recently, several parameter-free approaches demonstrated impressive practical performance, narrowing the gap to finely-tuned methods (Ivgi et al., 2023; Defazio and Mishchenko, 2023; Mishchenko and Defazio, 2023). While these approaches take different paths to reduce tuning, adaptive methods and scheduling schemes are often used together in practice.

Theoretical analyses of stepsize annealing.

Several studies have analyzed different stepsize schedules. The influential work of Jain et al. (2019) showed that the schedules $\eta_t = \eta / t$ and $\eta_t = \eta / \sqrt{t}$ yield suboptimal last-iterate guarantees and proposed a new schedule with optimal last-iterate performance. Later, Defazio et al. (2024a) demonstrated that a linear decay schedule also achieves an optimal last-iterate guarantee. Additionally, Defazio et al. (2024b) introduced "schedule-free" SGD, which eliminates the need to know the training length $T$ in advance. While these works focus on optimality with well-tuned stepsizes and last-iterate guarantees, our work examines the robustness of these schedules when the stepsize is not finely tuned. Additionally, new scheduling schemes continue to emerge, such as those proposed by Zhai et al. (2022) and Hu et al. (2024), which incorporate a cooldown phase to accommodate varying training durations. The robustness perspective we propose helps us better understand the benefits of different schedules and guides the design of more robust ones.

2 Preliminaries

2.1 Problem Setup

In this work, we are interested in first-order stochastic optimization over a bounded domain within the $d$-dimensional Euclidean space, $\mathbb{R}^d$, equipped with the Euclidean norm, defined as $\|\cdot\| \triangleq \|\cdot\|_2$. Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set with diameter $D$ (i.e., for all $x, y \in \mathcal{X}$, $\|x - y\| \leq D$) and let $f : \mathcal{X} \to \mathbb{R}$ be a convex function. Our goal is to find some $\overline{x} \in \mathcal{X}$ such that $f(\overline{x}) - \min_{x \in \mathcal{X}} f(x)$ is small, where we access $f$ only through an unbiased sub-gradient oracle $g : \mathcal{X} \to \mathbb{R}^d$ (i.e., $\mathbb{E}[g(x)] \in \partial f(x)$ for all $x \in \mathcal{X}$, where we denote with a slight abuse of notation $\nabla f(x) \triangleq \mathbb{E}[g(x)]$). We consider two optimization scenarios:

  (i) Convex and Lipschitz setting. Here we assume $g$ has a second moment bound, that is, for some $G > 0$, $\mathbb{E}\|g(x)\|^2 \leq G^2$ for all $x \in \mathcal{X}$. This implies in particular that $f$ is $G$-Lipschitz.

  (ii) Convex and smooth setting. In this scenario we assume that $f$ is $\beta$-smooth,¹ and instead of a second moment bound we assume that $g$ has a variance bound, that is, for some $\sigma > 0$, $\mathbb{E}[\|g(x) - \nabla f(x)\|^2] \leq \sigma^2$ for all $x \in \mathcal{X}$.

    ¹ A function $f : \mathcal{X} \to \mathbb{R}$ is said to be $\beta$-smooth if $\|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\|$ for all $x, y \in \mathcal{X}$. In particular, this implies that $|f(y) - f(x) - \nabla f(x) \cdot (y - x)| \leq \frac{\beta}{2} \|y - x\|^2$ for all $x, y \in \mathcal{X}$.

Stochastic gradient descent.

We will analyze the (projected) Stochastic Gradient Descent (SGD) algorithm, which starts at some $x_1 \in \mathcal{X}$ and performs update steps of the form $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t g_t)$, where $\eta_t$ is the stepsize at step $t$, $g_t = g(x_t)$ is a stochastic sub-gradient at $x_t$, and $\Pi_{\mathcal{X}}(\cdot)$ is the Euclidean projection to $\mathcal{X}$. The output of $T$-steps SGD is typically some average of the iterates or the last iterate. The convergence rate guarantee of the average iterate of fixed stepsize SGD with tuned stepsize is $O(DG / \sqrt{T})$ in the convex Lipschitz case and $O(\beta D^2 / T + D\sigma / \sqrt{T})$ in the convex smooth case (see, e.g., Lan, 2012).
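The projected update can be sketched in a few lines. The example below is a minimal one-dimensional illustration, assuming the domain is an interval (so the Euclidean projection is a clamp) and using the toy objective $f(x) = |x - c|$ with a Gaussian-noise sub-gradient oracle; the helper names, the objective, and the noise model are assumptions for this illustration and do not come from the paper.

```python
import random


def project_interval(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def projected_sgd(x1, stepsizes, oracle, lo, hi):
    """Run x_{t+1} = Proj(x_t - eta_t * g_t) and return the last iterate."""
    x = x1
    for eta_t in stepsizes:
        g_t = oracle(x)  # stochastic sub-gradient at x_t
        x = project_interval(x - eta_t * g_t, lo, hi)
    return x


def make_abs_oracle(center, noise_std, seed=0):
    """Noisy sub-gradient oracle for f(x) = |x - center| (toy example)."""
    rng = random.Random(seed)

    def oracle(x):
        subgrad = 1.0 if x > center else -1.0  # a valid sub-gradient of |x - center|
        return subgrad + rng.gauss(0.0, noise_std)

    return oracle


T = 1000
oracle = make_abs_oracle(center=0.3, noise_std=0.5)
x_last = projected_sgd(x1=-1.0, stepsizes=[0.01] * T, oracle=oracle, lo=-1.0, hi=1.0)
print(f"last iterate: {x_last:.3f}")
```

The function takes the stepsize sequence as a plain list, so a fixed stepsize and a scheduled one can be compared by swapping that argument only.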

Stepsize scheduling.

Our focus will be on stepsizes of the form $\eta_t = \eta h(\frac{t-1}{T})$, for some $\eta > 0$ and $h : [0,1] \to [0,1]$, where $T \in \mathbb{N}$ is the number of SGD steps that are performed. Common schedules include $h(u) = 1$ (fixed stepsize), $h(u) = \frac{1}{2} + \frac{1}{2}\cos(\pi u)$ (cosine annealing), and $h(u) = (1-u)^p$ for some $p \geq 1$ (polynomial decay). In particular, we will assume that $h(u)$ is monotonically non-increasing and satisfies $h(u) = 0 \Leftrightarrow u = 1$; we will call such a schedule annealed for brevity. We additionally assume for technical reasons that the annealed schedules we consider are differentiable and $p$-Lipschitz. Using an annealed schedule, SGD with a properly tuned stepsize yields the same rate as optimally tuned fixed stepsize SGD, up to constant factors (where we treat $p$ as a constant). See Appendix E for additional details. Notable annealed schedules include cosine annealing and polynomial decay.
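The schedules named above translate directly into code. The following sketch implements the three schedule functions, the stepsize rule $\eta_t = \eta h(\frac{t-1}{T})$, and a grid-based numerical check of the two annealing conditions (monotonicity and vanishing only at $u = 1$). The function names and the tolerance in the check are assumptions for this illustration, and the check is a sanity test on a finite grid rather than a proof.

```python
import math


def fixed(u):
    """Fixed stepsize: h(u) = 1."""
    return 1.0


def cosine(u):
    """Cosine annealing: h(u) = 1/2 + 1/2 * cos(pi * u)."""
    return 0.5 + 0.5 * math.cos(math.pi * u)


def polynomial(p):
    """Return h(u) = (1 - u)^p for a degree p >= 1."""
    def h(u):
        return (1.0 - u) ** p
    return h


def stepsizes(eta, h, T):
    """Stepsizes eta_t = eta * h((t - 1) / T) for t = 1, ..., T."""
    return [eta * h((t - 1) / T) for t in range(1, T + 1)]


def is_annealed_on_grid(h, n=1000):
    """Check on a grid that h is non-increasing and zero only at u = 1."""
    values = [h(k / n) for k in range(n + 1)]
    non_increasing = all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
    zero_only_at_one = all(v > 0 for v in values[:-1]) and abs(values[-1]) < 1e-12
    return non_increasing and zero_only_at_one


print(stepsizes(0.1, polynomial(1), 4))
print(is_annealed_on_grid(cosine), is_annealed_on_grid(fixed))
```

Note that the last stepsize used is $\eta_T = \eta h(\frac{T-1}{T})$, which is still positive for an annealed schedule since $h$ vanishes only at $u = 1$.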

Robustness to stepsize misspecification.

Fixing an initialization $x_1 \in \mathcal{X}$ and a stepsize schedule $h(\cdot)$, it remains to tune the base stepsize $\eta$. Considering a tuned stepsize $\eta_{\mathsf{tu}}$,² we investigate the sensitivity of SGD when the stepsize is only tuned to a multiplicative misspecification factor $\rho \geq 1$ (i.e., stepsize $\eta = \rho \eta_{\mathsf{tu}}$, where $\rho$ is of course unknown to the algorithm). In this case, the convergence rate will likely degrade as $\rho$ increases. For instance, the standard guarantee of fixed stepsize SGD degrades linearly in $\rho$; we demonstrate this fact in the convex Lipschitz setting in Appendix D.

² By tuned we mean a stepsize that minimizes a corresponding convergence guarantee that depends on $\eta$, possibly ignoring lower-order terms for simplicity.

Our main inquiry is to what extent stepsize schedules can mitigate this degradation, enabling more robust performance when the stepsize is crudely tuned (e.g., when tuned using a coarse grid search), and achieving convergence rates with sublinear dependence on $\rho$, for $\rho \geq 1$.

2.2 Convergence Analysis with Stepsize Schedules

Here we present convergence guarantees for SGD using an annealed schedule. The tuned stepsizes and respective convergence rates will serve as the baseline for establishing a sublinear dependence on the misspecification parameter. For their proofs, see Appendix E.

Let $h$ be a differentiable $p$-Lipschitz annealed schedule. We define the following two functions associated with $h$:

$$H_h(v) \triangleq \int_v^1 h(u)\,du \qquad \text{and} \qquad Q_h(v) \triangleq \int_v^1 \frac{H_h'(u)^2}{H_h(u)}\,du. \tag{1}$$
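To get a feel for these two functions, they can be estimated numerically. The sketch below reads $H_h(v)$ as the tail integral of $h$ from $v$ to $1$, so that $H_h'(u) = -h(u)$ and the integrand of $Q_h$ becomes $h(u)^2 / H_h(u)$. It uses a simple midpoint rule, accumulated from $u = 1$ backwards; the function name, the cell count, and the integration scheme are assumptions for this illustration. For $h(u) = (1-u)^p$ a direct calculation gives $H_h(v) = (1-v)^{p+1}/(p+1)$ and $Q_h(v) = \frac{p+1}{p}(1-v)^p$, which serves as a check.

```python
import math


def H_and_Q(h, v=0.0, n=20000):
    """Estimate H_h(v) and Q_h(v) with n midpoint cells on [v, 1]."""
    du = (1.0 - v) / n
    H_right = 0.0  # estimate of H_h at the right edge of the current cell
    Q = 0.0
    for k in reversed(range(n)):
        u_mid = v + (k + 0.5) * du
        h_mid = h(u_mid)
        # H_h(u_mid) = H_h(right edge) + integral of h over the right half-cell.
        H_mid = H_right + 0.5 * du * h(u_mid + 0.25 * du)
        Q += du * h_mid ** 2 / H_mid  # integrand uses H_h'(u) = -h(u)
        H_right += du * h_mid
    return H_right, Q


def cosine(u):
    return 0.5 + 0.5 * math.cos(math.pi * u)


H_lin, Q_lin = H_and_Q(lambda u: 1.0 - u)
H_cos, Q_cos = H_and_Q(cosine)
print(f"linear decay: H(0) = {H_lin:.4f}, Q(0) = {Q_lin:.4f}")
print(f"cosine:       H(0) = {H_cos:.4f}, Q(0) = {Q_cos:.4f}")
```

For the fixed schedule $h(u) = 1$ the integrand of $Q_h$ is $1/(1-u)$, which is not integrable near $u = 1$; the numerical estimate then keeps growing with `n` instead of converging.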

Throughout, convergence bounds will be expressed in terms of $H_h$ and $Q_h$. We begin with the convex Lipschitz case.

Lemma 1.

Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set with diameter $D > 0$, $f : \mathcal{X} \to \mathbb{R}$ a convex function, $x^\star \in \operatorname{arg\,min}_{x \in \mathcal{X}} f(x)$, and $g : \mathcal{X} \to \mathbb{R}^d$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^2 > 0$. Let $x_1, x_2, \ldots, x_{T+1}$ be the iterates produced by $T$-steps SGD with stepsizes $\eta_t = \eta h(\frac{t-1}{T})$ using the oracle $g$, where $h$ is a differentiable $p$-Lipschitz annealed schedule. Then it holds that

$$\mathbb{E}[f(x_{T+1}) - f(x^\star)] \leq \frac{D^2}{2\eta T H_h(0)} + 2\eta G^2 Q_h(0) + \frac{8p\eta G^2}{T}.$$

We denote the tuned stepsize and respective convergence guarantee (up to lower-order terms) by

$$\eta_{\mathsf{tu}} \triangleq \frac{D}{2G\sqrt{T H_h(0) Q_h(0)}} \qquad \text{and} \qquad \mathsf{Rate}_{h,T}^{\mathsf{tu}} \triangleq \frac{2DG}{\sqrt{T}} \sqrt{Q_h(0) / H_h(0)}. \tag{2}$$
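A quick numerical check of equation (2) against the bound of Lemma 1 may help. The sketch below evaluates the tuned stepsize, the tuned rate, and the right-hand side of Lemma 1 at an overestimated stepsize $\rho \eta_{\mathsf{tu}}$. It uses linear decay $h(u) = 1 - u$, for which $H_h(0) = 1/2$ and $Q_h(0) = 2$; the function names and the values of $D$, $G$, and $T$ are illustrative assumptions.

```python
import math


def tuned_stepsize(D, G, T, H0, Q0):
    """eta_tu from equation (2), with H0 = H_h(0) and Q0 = Q_h(0)."""
    return D / (2.0 * G * math.sqrt(T * H0 * Q0))


def tuned_rate(D, G, T, H0, Q0):
    """Tuned rate from equation (2)."""
    return 2.0 * D * G / math.sqrt(T) * math.sqrt(Q0 / H0)


def lemma1_bound(eta, D, G, T, H0, Q0, p):
    """Right-hand side of the Lemma 1 bound for a base stepsize eta."""
    return (D ** 2 / (2.0 * eta * T * H0)
            + 2.0 * eta * G ** 2 * Q0
            + 8.0 * p * eta * G ** 2 / T)


# Linear decay h(u) = 1 - u: H_h(0) = 1/2, Q_h(0) = 2, and p = 1.
D, G, T, H0, Q0, p = 1.0, 1.0, 10_000, 0.5, 2.0, 1
eta_tu = tuned_stepsize(D, G, T, H0, Q0)
rate = tuned_rate(D, G, T, H0, Q0)
for rho in (1.0, 4.0, 16.0):
    bound = lemma1_bound(rho * eta_tu, D, G, T, H0, Q0, p)
    print(f"rho = {rho:5.1f}: bound / tuned rate = {bound / rate:.3f}")
```

At $\eta = \rho \eta_{\mathsf{tu}}$ the two leading terms of Lemma 1 sum to $\frac{1}{2}(1/\rho + \rho)$ times the tuned rate, so Lemma 1 on its own grows linearly in $\rho$. This is why the tuned quantities serve only as the baseline here; the sublinear dependence summarized in Section 1.1 comes from the bound with the infimum over $\tau$.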

We proceed to the convergence guarantee in the convex smooth case.

Lemma 2.

Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set with diameter $D > 0$, $f : \mathcal{X} \to \mathbb{R}$ a $\beta$-smooth convex function, $x^\star \in \operatorname{arg\,min}_{x \in \mathcal{X}} f(x)$, and $g : \mathcal{X} \to \mathbb{R}^d$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^2 \geq 0$. Let $x_1, x_2, \ldots, x_{T+1}$ be the iterates produced by $T$-steps SGD with stepsizes $\eta_t = \eta h(\frac{t-1}{T})$ using the oracle $g$, where $h$ is a differentiable $p$-Lipschitz annealed schedule and $\eta h(0) \leq \frac{1}{2\beta}$. Then it holds that

𝔼[f(xT+1)f(x)]𝔼delimited-[]𝑓subscript𝑥𝑇1𝑓superscript𝑥\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT italic_T + 1 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ] D22ηTHh(0)+ησ2Qh(0)+4pησ2T.absentsuperscript𝐷22𝜂𝑇subscript𝐻0𝜂superscript𝜎2subscript𝑄04𝑝𝜂superscript𝜎2𝑇\displaystyle\leq\frac{D^{2}}{2\eta TH_{h}(0)}+\eta\sigma^{2}Q_{h}(0)+\frac{4p% \eta\sigma^{2}}{T}.≤ divide start_ARG italic_D start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_η italic_T italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) end_ARG + italic_η italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_Q start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) + divide start_ARG 4 italic_p italic_η italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_T end_ARG .

Similarly, we denote the tuned stepsize over $\eta\in\left(0,\frac{1}{2\beta h(0)}\right]$ as

$$\eta_{\mathsf{tu}}^{\mathsf{sm}} \triangleq \min\left\{\frac{1}{2\beta h(0)},\ \frac{D}{\sigma\sqrt{2T H_h(0) Q_h(0)}}\right\}, \tag{3}$$

and the respective convergence guarantee (up to lower-order terms) as

$$\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T} \triangleq \frac{D^2}{2\eta_{\mathsf{tu}}^{\mathsf{sm}} T H_h(0)} + \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^2 Q_h(0) \leq \frac{\beta D^2 h(0)}{T H_h(0)} + \frac{D\sigma}{\sqrt{T}}\sqrt{2Q_h(0)/H_h(0)}. \tag{4}$$
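
The inequality in Eq. 4 can be checked directly from Eq. 3. The following is a short sketch, writing $a=\frac{1}{2\beta h(0)}$ and $b=\frac{D}{\sigma\sqrt{2TH_h(0)Q_h(0)}}$ (shorthand introduced only for this derivation, and assuming $\sigma>0$ so that $b$ is finite). Since $\eta_{\mathsf{tu}}^{\mathsf{sm}}=\min\{a,b\}$, we have $1/\eta_{\mathsf{tu}}^{\mathsf{sm}}\leq 1/a+1/b$ and $\eta_{\mathsf{tu}}^{\mathsf{sm}}\leq b$, hence

$$\frac{D^2}{2\eta_{\mathsf{tu}}^{\mathsf{sm}} T H_h(0)} + \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^2 Q_h(0) \leq \frac{D^2}{2aTH_h(0)} + \frac{D^2}{2bTH_h(0)} + b\sigma^2 Q_h(0) = \frac{\beta D^2 h(0)}{TH_h(0)} + \frac{D\sigma}{\sqrt{T}}\sqrt{\frac{Q_h(0)}{2H_h(0)}} + \frac{D\sigma}{\sqrt{T}}\sqrt{\frac{Q_h(0)}{2H_h(0)}},$$

and the last two terms sum to $\frac{D\sigma}{\sqrt{T}}\sqrt{2Q_h(0)/H_h(0)}$.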

As we previously mentioned, under the mild assumption that $p=\Theta(1)$ (the Lipschitz parameter of $h$), the guarantees match the rates of optimally tuned fixed stepsize SGD (see Appendix E for details). In both cases, a multiplicative overestimation of the optimal stepsize degrades the guarantee linearly (in the convex smooth case, for a large enough overestimation, $\eta_1>\frac{1}{2\beta}$ and the guarantee does not even hold).

3 Convex and Lipschitz Setting

This section considers a convex objective where the second moment of the sub-gradient oracle is bounded. The main result of this section is a convergence guarantee that mitigates the imbalance caused by overestimation by automatically adapting to the tails of $H_h(v)$ and $Q_h(v)$. The key observation in obtaining this result is that any suffix of iterates $x_k,\ldots,x_{T+1}$ can be viewed as a $(T-k+1)$-steps SGD starting at $x_k$, effectively ignoring the large stepsizes prior to step $k$ that would otherwise degrade the convergence bound.

Next, we present the general guarantee, followed by corollaries for specific schedules.

Theorem 1.

Let $\mathcal{X}\subset\mathbb{R}^d$ be a convex set with diameter $D>0$, $f:\mathcal{X}\to\mathbb{R}$ a convex function, $x^\star\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$, and $g:\mathcal{X}\to\mathbb{R}^d$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^2>0$. For any $\rho\geq 1$, let $x_1,x_2,\ldots,x_{T+1}$ be the iterates produced by $T$-steps SGD with stepsizes $\eta_t=\eta h\left(\frac{t-1}{T}\right)$ using the oracle $g$, where $\eta=\rho\cdot\eta_{\mathsf{tu}}$ and $h$ is a differentiable $p$-Lipschitz annealed schedule. Then it holds that

$$\mathbb{E}[f(x_{T+1})-f(x^\star)] \leq \frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\inf_{\tau\in[0,1)}\left(\frac{H_h(0)}{\rho H_h(\tau)}+\frac{\rho Q_h(\tau)}{Q_h(0)}\right) + O\left(\frac{p\rho\eta_{\mathsf{tu}}G^2}{T}\right), \tag{5}$$

where $H_h$, $Q_h$, $\eta_{\mathsf{tu}}$, and $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ are given in Eqs. 1 and 2. In particular, the optimal $\tau$ satisfies $H_h(\tau)H_h'(\tau)=-H_h(0)Q_h(0)/\rho^2$ (or $\tau=0$ if there is no solution).
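
The optimality condition at the end of Theorem 1 follows from differentiating the expression inside the infimum. A brief sketch, assuming the definitions $H_h(\tau)=\int_\tau^1 h(u)\,du$ and $Q_h(\tau)=\int_\tau^1 \frac{h(u)^2}{H_h(u)}\,du$ from Eq. 1 (this is how both quantities are used in the proofs of this section), so that $H_h'(\tau)=-h(\tau)$ and $Q_h'(\tau)=-H_h'(\tau)^2/H_h(\tau)$:

$$\frac{d}{d\tau}\left(\frac{H_h(0)}{\rho H_h(\tau)}+\frac{\rho Q_h(\tau)}{Q_h(0)}\right) = -\frac{H_h'(\tau)}{H_h(\tau)}\left(\frac{H_h(0)}{\rho H_h(\tau)}+\frac{\rho H_h'(\tau)}{Q_h(0)}\right).$$

Since $-H_h'(\tau)/H_h(\tau)=h(\tau)/H_h(\tau)>0$ on $[0,1)$, the derivative vanishes exactly when the bracket is zero, which rearranges to $H_h(\tau)H_h'(\tau)=-H_h(0)Q_h(0)/\rho^2$.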

First, note that for $\rho=1$, Theorem 1 recovers $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$ up to lower-order terms, as the infimum is at most $2$. Furthermore, as $\rho\geq 1$ and both $H_h(v)$ and $Q_h(v)$ are decreasing and equal $0$ at $v=1$, the infimum adapts to the imbalance of the $\frac{1}{\rho}$ and $\rho$ terms which are introduced by the overestimation.
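
To make the first remark concrete, one can evaluate the expression inside the infimum of Eq. 5 at $\tau=0$, a choice that is always feasible:

$$\frac{H_h(0)}{\rho H_h(0)}+\frac{\rho Q_h(0)}{Q_h(0)}=\frac{1}{\rho}+\rho.$$

At $\rho=1$ this equals $2$, so the leading term of Eq. 5 is at most $\mathsf{Rate}_{h,T}^{\mathsf{tu}}$. For $\rho>1$ the same choice grows linearly in $\rho$, which is the degradation that a larger $\tau$ can avoid.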

We defer the proof of Theorem 1 to Section 3.2. Following are corollaries for polynomially decaying and cosine annealing schedules which provide concrete examples for the power of Theorem 1.

Corollary 2.

In the setting of Theorem 1, assuming $h(u)=(1-u)^p$ for some $p\geq 1$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)] \leq \mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\rho^{\frac{1}{2p+1}} + O\left(\frac{p\rho\eta_{\mathsf{tu}}G^2}{T}\right),$$

where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}=\frac{p+1}{\sqrt{p}}\cdot\frac{2DG}{\sqrt{T}}=O\left(\frac{\sqrt{p}DG}{\sqrt{T}}\right)$.

We observe that for $p=\Theta(1)$, the optimal rate is the same as tuned SGD with fixed stepsize (up to constants), while the dependence on $\rho\geq 1$ is sublinear, as we aimed to achieve. The dependence $\rho^{\frac{1}{2p+1}}$ might lead to the idea that a larger $p$ is always better, but as $p$ increases the optimal rate degrades at a rate of $O(\sqrt{p})$. In particular, using $p=\Theta(\log\rho)$ the convergence rate will be $O\left(DG\sqrt{\log\rho}/\sqrt{T}\right)$, and increasing beyond this point will not improve the final rate.
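
The trade-off in $p$ can be tabulated directly from Corollary 2. The following is a minimal sketch rather than part of the original analysis: the function name `poly_coefficient` and the choice $\rho=10^6$ are illustrative assumptions, and only the leading term (the coefficient multiplying $DG/\sqrt{T}$) is evaluated, ignoring the lower-order $O(\cdot)$ term.

```python
import math

def poly_coefficient(p, rho):
    """Coefficient of DG/sqrt(T) in the leading term of Corollary 2:
    2 * (p + 1) / sqrt(p) * rho^(1 / (2p + 1))."""
    return 2.0 * (p + 1) / math.sqrt(p) * rho ** (1.0 / (2 * p + 1))

rho = 1e6  # illustrative misspecification factor (an assumption of this sketch)
for p in (1, 2, 4, 7, 14, 28, 56):
    print(f"p={p:2d}  coefficient={poly_coefficient(p, rho):8.2f}")
```

For this value of $\rho$ the coefficient first drops sharply as $p$ grows and then rises again once $p$ exceeds roughly $\log\rho$, in line with the discussion above.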

Proof of Corollary 2.

First note that $h(u)=(1-u)^p$ is non-increasing, differentiable, $p$-Lipschitz (since $|h'(u)|=p(1-u)^{p-1}\leq p$) and satisfies $h(u)=0\Leftrightarrow u=1$. Hence, $h$ is annealed and we can use Theorem 1. A simple integration yields that $H_h(\tau)=\frac{1}{p+1}(1-\tau)^{p+1}$, and $H_h'(\tau)=-(1-\tau)^p$. Thus,

$$Q_h(\tau)=\int_\tau^1\frac{H_h'(u)^2}{H_h(u)}\,du=\frac{p+1}{p}(1-\tau)^p.$$

We proceed to solve the optimality equation of Theorem 1, $H_h(\bar{\tau})H_h'(\bar{\tau})=\frac{-H_h(0)Q_h(0)}{\rho^2}$:

$$\frac{-(1-\bar{\tau})^{2p+1}}{p+1}=\frac{-1}{p\rho^2}\implies\bar{\tau}=1-\left(\frac{p+1}{p\rho^2}\right)^{\frac{1}{2p+1}}.$$

While this value is optimal, it may be negative for small $\rho$, so we select a slightly sub-optimal value of $\bar{\tau}=1-\rho^{\frac{-2}{2p+1}}\in[0,1)$ which is always valid. Using this value,

$$\frac{H_h(0)}{\rho H_h(\bar{\tau})}+\frac{\rho Q_h(\bar{\tau})}{Q_h(0)}=\frac{(1-\bar{\tau})^{-(p+1)}}{\rho}+\rho(1-\bar{\tau})^p=\frac{\rho^{\frac{2(p+1)}{2p+1}}}{\rho}+\rho\rho^{\frac{-2p}{2p+1}}=2\rho^{\frac{1}{2p+1}}. \tag{6}$$

Hence, using this value to bound the infimum of Eq. 5,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)] \leq \mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\rho^{\frac{1}{2p+1}} + O\left(\frac{p\rho\eta_{\mathsf{tu}}G^2}{T}\right).$$

We conclude by plugging the values of $H_h(0)$ and $Q_h(0)$ into Eq. 2 (and using $p\geq 1$),

$$\mathsf{Rate}_{h,T}^{\mathsf{tu}}=\frac{2DG}{\sqrt{T}}\sqrt{\frac{(p+1)^2}{p}}=O\left(\frac{\sqrt{p}DG}{\sqrt{T}}\right).\qed$$
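
The algebra in the proof of Corollary 2 can be sanity-checked numerically. The sketch below uses the closed forms of $H_h$ and $Q_h$ derived in the proof; the helper names (`objective`, `tau_bar`, `tau_opt`) are hypothetical and chosen only for this illustration. It compares the expression inside the infimum of Eq. 5 at the sub-optimal $\bar{\tau}$ with the closed form $2\rho^{\frac{1}{2p+1}}$ of Eq. 6, and with its value at the optimal $\tau$ (clipped at $0$).

```python
import math

def H(tau, p):
    # H_h(tau) for h(u) = (1 - u)^p, as derived in the proof.
    return (1.0 - tau) ** (p + 1) / (p + 1)

def Q(tau, p):
    # Q_h(tau) for h(u) = (1 - u)^p, as derived in the proof.
    return (p + 1) / p * (1.0 - tau) ** p

def objective(tau, p, rho):
    # Expression inside the infimum of Eq. 5.
    return H(0.0, p) / (rho * H(tau, p)) + rho * Q(tau, p) / Q(0.0, p)

def tau_bar(p, rho):
    # Sub-optimal but always valid choice used in the proof.
    return 1.0 - rho ** (-2.0 / (2 * p + 1))

def tau_opt(p, rho):
    # Solution of the optimality equation, clipped to [0, 1).
    return max(0.0, 1.0 - ((p + 1) / (p * rho ** 2)) ** (1.0 / (2 * p + 1)))

for p in (1, 2, 3):
    for rho in (1.0, 2.0, 10.0, 50.0):
        closed_form = 2.0 * rho ** (1.0 / (2 * p + 1))
        at_bar = objective(tau_bar(p, rho), p, rho)
        at_opt = objective(tau_opt(p, rho), p, rho)
        print(p, rho, at_bar, closed_form, at_opt)
```

The value at $\bar{\tau}$ should agree with the closed form up to floating-point error, while the value at the clipped optimum is never larger.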

We proceed to the cosine annealing guarantee. Given its similarities to Corollary 2, the proof is deferred to Appendix A.

Corollary 3.

In the setting of Theorem 1, assuming $h(u)=\frac{1}{2}(1+\cos(\pi u))$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)] \leq \mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot 18\rho^{\frac{1}{5}} + O\left(\frac{\rho\eta_{\mathsf{tu}}G^2}{T}\right),$$

where $\mathsf{Rate}_{h,T}^{\mathsf{tu}}=\frac{2DG}{\sqrt{T}}\sqrt{2Q_h(0)}\leq\frac{10DG}{\sqrt{T}}$.

Again we observe a sublinear dependence on $\rho$ with an optimal rate of $O\left(\frac{DG}{\sqrt{T}}\right)$. Note that this is the same behavior as in Corollary 2 with $p=2$, which arises from the tail behavior of $h(u)$. To see that, one can verify that $(1-u)^2\leq h(u)\leq\frac{5}{2}(1-u)^2$ for all $u\in[0,1]$.
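
The sandwich bound above is easy to check on a grid. The following sketch defines the two schedules and the stepsize rule $\eta_t=\eta h\left(\frac{t-1}{T}\right)$ used throughout; the function names and the grid resolution are assumptions made for illustration, and a grid check is of course not a proof.

```python
import math

def poly_schedule(u, p=2):
    # Polynomial decay h(u) = (1 - u)^p.
    return (1.0 - u) ** p

def cosine_schedule(u):
    # Cosine annealing h(u) = (1 + cos(pi * u)) / 2.
    return 0.5 * (1.0 + math.cos(math.pi * u))

def stepsizes(eta, h, T):
    # eta_t = eta * h((t - 1) / T) for t = 1, ..., T.
    return [eta * h((t - 1) / T) for t in range(1, T + 1)]

# Check (1 - u)^2 <= h(u) <= 2.5 * (1 - u)^2 on a uniform grid of [0, 1],
# with a small tolerance for floating-point rounding.
grid = [i / 1000 for i in range(1001)]
sandwich_holds = all(
    poly_schedule(u) - 1e-12 <= cosine_schedule(u) <= 2.5 * poly_schedule(u) + 1e-12
    for u in grid
)
print(sandwich_holds)
```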

3.1 Tighter Constants using Numerical Analysis

Figure 1: Numerically evaluating the coefficient of $DG/\sqrt{T}$ for the convergence guarantee of Theorem 1 with different schedules and varying multiplicative misspecification parameter $\rho$.

The constants of Corollary 3 are not tight; in particular, the bound is established using crude (up to constants) bounds for $H_h(u)$ and $Q_h(u)$. While a tighter bound can be obtained analytically, the framework also lends itself readily to numerical analysis, as we demonstrate next.

The convergence guarantee of Theorem 1 is not posed as a closed-form equation but rather as a minimization over integrals that depend on the schedule. For a specific schedule and misspecification parameter, we use SciPy's (Virtanen et al., 2020) quad integration to evaluate $H_h$ and $Q_h$, and fsolve to solve the minimization of Theorem 1.
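
A dependency-free sketch of such an evaluation is given below. It is not the authors' script: in place of SciPy's quad and fsolve it substitutes a plain midpoint rule and a grid search over $\tau$, and the helper names (`tabulate`, `coefficient`) and grid size `n` are assumptions of this illustration. It assumes $H_h(\tau)=\int_\tau^1 h(u)\,du$ and $Q_h(\tau)=\int_\tau^1 h(u)^2/H_h(u)\,du$, and returns the coefficient of $DG/\sqrt{T}$ in the leading term of Eq. 5, namely $\sqrt{Q_h(0)/H_h(0)}$ times the infimum.

```python
import math

def cosine(u):
    return 0.5 * (1.0 + math.cos(math.pi * u))

def quadratic(u):
    return (1.0 - u) ** 2

def tabulate(h, n=20000):
    """Midpoint-rule tables of H_h and Q_h on the grid tau_i = i / n."""
    du = 1.0 / n
    H = [0.0] * (n + 1)  # H[i] approximates the integral of h over [i/n, 1]
    Q = [0.0] * (n + 1)  # Q[i] approximates the integral of h^2 / H_h over [i/n, 1]
    for i in range(n - 1, -1, -1):
        H[i] = H[i + 1] + h((i + 0.5) * du) * du
    for i in range(n - 1, -1, -1):
        # H_h at the cell midpoint: value at the right endpoint plus half a cell.
        H_mid = H[i + 1] + h((i + 0.75) * du) * (0.5 * du)
        Q[i] = Q[i + 1] + h((i + 0.5) * du) ** 2 / H_mid * du
    return H, Q

def coefficient(h, rho, n=20000):
    """Coefficient of DG/sqrt(T) in the leading term of Eq. 5 (grid search over tau)."""
    H, Q = tabulate(h, n)
    inf_term = min(H[0] / (rho * H[i]) + rho * Q[i] / Q[0] for i in range(n))
    return math.sqrt(Q[0] / H[0]) * inf_term

for rho in (1, 10, 50):
    print(rho, coefficient(cosine, rho), coefficient(quadratic, rho))
```

A grid search is used here instead of root finding because it needs no starting point and automatically handles the case where the optimality equation has no solution in $[0,1)$.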

In Fig. 1 we provide a numerical analysis for the convergence guarantee of Theorem 1 with several decaying schedules, including the cosine annealing, showing in particular that the convergence rate of SGD with cosine annealing is bounded by $5\rho^{\frac{1}{5}}\frac{DG}{\sqrt{T}}$. We observe that the cosine annealing schedule and the quadratic decay schedule have similar convergence guarantees, with a coefficient between $4\rho^{\frac{1}{5}}$ and $5\rho^{\frac{1}{5}}$. In addition, even for a somewhat large misspecification parameter of size $50$, the difference between cosine annealing and the different polynomial decaying schedules is at most a factor of $2$, which indicates that even mild decay might be sufficient if the grid is not too coarse.

3.2 Proof of Theorem 1

Before proving our main theorem, we first state a few lemmas we will require. The first is a last-iterate convergence guarantee, using the techniques of Zamani and Glineur (2023); Liu and Zhou (2024) (proof appearing in Appendix C).

Lemma 3.

Let $\mathcal{X}\subseteq\mathbb{R}^d$ be a convex set, $f:\mathcal{X}\to\mathbb{R}$ a convex function, and $g:\mathcal{X}\to\mathbb{R}^d$ an unbiased first-order oracle of $f$ with second-moment bounded by $G^2>0$. Let $x_1,\ldots,x_{T+1}$ be the iterates produced by $T$-steps SGD with stepsizes $\eta_1,\ldots,\eta_T$ using the oracle $g$. Then for any $\hat{x}\in\mathcal{X}$,

$$\mathbb{E}[f(x_{T+1})-f(\hat{x})] \leq \frac{\|x_1-\hat{x}\|^2}{2\sum_{s=1}^{T}\eta_s} + 2G^2\sum_{t=1}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}.$$
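
The right-hand side of Lemma 3 is straightforward to evaluate for any positive stepsize sequence, and doing so for every suffix illustrates the key observation of this section. The sketch below is an illustration under stated assumptions: the function name `suffix_bounds` is hypothetical, the distance $\|x_k-x^\star\|$ is replaced by the diameter $D$ for every suffix, and the example uses the linear schedule $h(u)=1-u$ (for which $H_h(0)Q_h(0)=1$ in Eq. 2) with an overestimation factor $\rho=20$.

```python
def suffix_bounds(etas, D, G):
    """bounds[k] is the Lemma 3 bound applied to the suffix of stepsizes
    etas[k:], with the initial distance bounded by the diameter D."""
    T = len(etas)
    bounds = [0.0] * T
    S = 0.0  # running suffix sum of stepsizes
    V = 0.0  # running sum of eta_t^2 / (sum of eta_s for s >= t)
    for t in range(T - 1, -1, -1):
        S += etas[t]
        V += etas[t] ** 2 / S
        bounds[t] = D ** 2 / (2 * S) + 2 * G ** 2 * V
    return bounds

T, D, G, rho = 1000, 1.0, 1.0, 20.0
eta_tu = D / (2 * G * T ** 0.5)  # Eq. 2 with H_h(0) * Q_h(0) = 1 for h(u) = 1 - u
etas = [rho * eta_tu * (1 - (t - 1) / T) for t in range(1, T + 1)]
bounds = suffix_bounds(etas, D, G)
print(bounds[0], min(bounds))
```

Here `bounds[0]` is the bound obtained from the full run, while `min(bounds)` is the best bound over all suffixes; with an overestimated stepsize the latter is noticeably smaller.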

Next is a key lemma, translating the suffix of the last-iterate bound in Lemma 3 to one based on integrating the stepsize schedule (proof given later in the section).

Lemma 4.

Let $k\in[T]$, $c_1,c_2,\eta>0$, and $\eta_t=\eta h\left(\frac{t-1}{T}\right)$ for some differentiable $p$-Lipschitz annealed schedule $h$. Then for any $\tau\in\left[\frac{k-1}{T},\frac{k}{T}\right)$,

$$\frac{c_1}{\sum_{s=k}^{T}\eta_s} + c_2\sum_{t=k}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s} \leq \frac{c_1}{\eta T H_h(\tau)} + c_2\eta\int_\tau^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du + \frac{4\eta c_2 p}{T}.$$

We proceed to the proof of Theorem 1.

Proof of Theorem 1.

Let τ[0,1)𝜏01\tau\in[0,1)italic_τ ∈ [ 0 , 1 ) and let k=τT+1[T]𝑘𝜏𝑇1delimited-[]𝑇k=\lfloor\tau T\rfloor+1\in[T]italic_k = ⌊ italic_τ italic_T ⌋ + 1 ∈ [ italic_T ]. Consider the suffix xk,xk+1,,xT+1subscript𝑥𝑘subscript𝑥𝑘1subscript𝑥𝑇1x_{k},x_{k+1},\ldots,x_{T+1}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_k + 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_T + 1 end_POSTSUBSCRIPT as an SGD sequence starting at xksubscript𝑥𝑘x_{k}italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. By Lemma 3 with x^=x^𝑥superscript𝑥\hat{x}=x^{\star}over^ start_ARG italic_x end_ARG = italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT,

$$
\mathbb{E}[f(x_{T+1}) - f(x^\star)] \le \frac{D^2}{2\sum_{s=k}^{T}\eta_s} + 2G^2\sum_{t=k}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}.
$$

As $\tau \in [\frac{k-1}{T}, \frac{k}{T})$, by Lemma 4 with $c_1 = \frac{D^2}{2}$ and $c_2 = 2G^2$ (and extending the upper limit of integration from $1-\frac{1}{T}$ to $1$, which can only increase the bound since the integrand is nonnegative),

$$
\begin{aligned}
\mathbb{E}[f(x_{T+1}) - f(x^\star)]
&\le \frac{D^2}{2\eta T H_h(\tau)} + 2\eta G^2\int_{\tau}^{1}\frac{h(u)^2}{H_h(u)}\,du + \frac{8p\eta G^2}{T} \\
&= \frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}} \cdot \left(\frac{H_h(0)}{\rho H_h(\tau)} + \frac{\rho\int_{\tau}^{1}\frac{h(u)^2}{H_h(u)}\,du}{\int_{0}^{1}\frac{h(u)^2}{H_h(u)}\,du}\right) + \frac{8p\rho\,\eta_{\mathsf{tu}}G^2}{T}
&& \text{($\eta = \rho\,\eta_{\mathsf{tu}}$ and Eq. 2)} \\
&= \frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}} \cdot \left(\frac{H_h(0)}{\rho H_h(\tau)} + \frac{\rho Q_h(\tau)}{Q_h(0)}\right) + \frac{8p\rho\,\eta_{\mathsf{tu}}G^2}{T}.
&& \text{($H_h'(u)^2 = h(u)^2$, Eq. 1)}
\end{aligned}
$$

This inequality holds for any $\tau \in [0,1)$, hence it holds for the infimum over all $\tau \in [0,1)$. It remains to find the $\tau$ that minimizes the bound. Let

$$
g(v) = \frac{H_h(0)}{\rho H_h(v)} + \frac{\rho Q_h(v)}{Q_h(0)}. \tag{7}
$$

By the fundamental theorem of calculus,

$$
H_h'(v) = \left(\int_{0}^{1}h(u)\,du - \int_{0}^{v}h(u)\,du\right)' = -h(v)
$$

and

$$
Q_h'(v) = \left(\int_{0}^{1}\frac{H_h'(u)^2}{H_h(u)}\,du - \int_{0}^{v}\frac{H_h'(u)^2}{H_h(u)}\,du\right)' = -\frac{H_h'(v)^2}{H_h(v)}.
$$

Thus,

$$
\begin{aligned}
g'(v) &= \frac{-H_h(0)H_h'(v)}{\rho H_h(v)^2} - \frac{\rho\,\frac{H_h'(v)^2}{H_h(v)}}{Q_h(0)}
= \frac{-\rho H_h'(v)}{Q_h(0)H_h(v)^2}\left(\frac{H_h(0)Q_h(0)}{\rho^2} + H_h(v)H_h'(v)\right) \\
&= \frac{\rho\,h(v)}{Q_h(0)H_h(v)^2}\left(H_h(v)H_h'(v) + \frac{H_h(0)Q_h(0)}{\rho^2}\right).
\end{aligned}
$$

Hence, when $\tau$ satisfies $H_h(\tau)H_h'(\tau) = -\frac{H_h(0)Q_h(0)}{\rho^2}$, we have $g'(\tau) = 0$. For $v > \tau$,

$$
H_h(v)H_h'(v) = -H_h(v)h(v) \ge -H_h(v)h(\tau) > -H_h(\tau)h(\tau) = -\frac{H_h(0)Q_h(0)}{\rho^2},
$$

so $g'(v) > 0$ and $g(v) > g(\tau)$. Similarly, for $v < \tau$, $g'(v) < 0$ and $g(v) > g(\tau)$. Hence, $\tau$ satisfying $H_h(\tau)H_h'(\tau) = -\frac{H_h(0)Q_h(0)}{\rho^2}$ is the minimizer. If no such $\tau$ exists, the derivative is always positive (as $h$ is continuous and $H_h(1)H_h'(1) = 0$), and the minimizer is at $\tau = 0$. ∎
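To make the minimization of $g$ concrete, here is a small Python sketch for the linear schedule $h(u) = 1-u$ (so $p = 1$), an illustrative choice assumed here. In this case $H_h(v) = \frac{(1-v)^2}{2}$ and $Q_h(v) = 2(1-v)$, so the stationarity condition reads $(1-\tau)^3 = 2/\rho^2$, and for $\rho \ge \sqrt{2}$ one gets $g(\tau) = 3(\rho/4)^{1/3}$, compared with $g(0) = \rho + 1/\rho$. The function names (`g_linear`, `tau_star_linear`, `grid_argmin`) are hypothetical helpers, not part of the paper.

```python
import math


def g_linear(v: float, rho: float) -> float:
    """g(v) from Eq. 7 for h(u) = 1 - u, using the closed forms
    H_h(v) = (1 - v)^2 / 2 and Q_h(v) = 2 (1 - v)."""
    H0, Q0 = 0.5, 2.0
    H = (1.0 - v) ** 2 / 2.0
    Q = 2.0 * (1.0 - v)
    return H0 / (rho * H) + rho * Q / Q0


def tau_star_linear(rho: float) -> float:
    """Root of (1 - tau)^3 = 2 / rho^2; falls back to tau = 0
    when no root lies in [0, 1), as in the proof."""
    return max(0.0, 1.0 - (2.0 / rho ** 2) ** (1.0 / 3.0))


def grid_argmin(rho: float, n: int = 100_000) -> float:
    """Brute-force minimizer of g over a uniform grid on [0, 1)."""
    return min((i / n for i in range(n)), key=lambda v: g_linear(v, rho))


for rho in (1.0, 2.0, 10.0, 100.0):
    tau = tau_star_linear(rho)
    print(f"rho={rho:6.1f}  tau*={tau:.4f}  grid={grid_argmin(rho):.4f}  "
          f"g(tau*)={g_linear(tau, rho):.4f}  g(0)={g_linear(0.0, rho):.4f}")
```

The printed values of `g(tau*)` grow like $\rho^{1/3}$ while `g(0)` grows linearly in $\rho$, which is the sublinear dependence for $p = 1$ in this special case.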

3.3 Proof of Lemma 4


Let $\tau \in [\frac{k-1}{T}, \frac{k}{T})$. As $h$ is non-increasing, we can use integration to obtain the following bound,

$$
\begin{aligned}
\frac{c_1}{\sum_{t=k}^{T}\eta_t} + c_2\sum_{t=k}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}
&\le \frac{c_1}{\eta\int_{k}^{T+1}h\left(\frac{t-1}{T}\right)dt} + \frac{c_2}{\eta}\sum_{t=k}^{T}\frac{\eta_t^2}{\int_{t}^{T+1}h\left(\frac{s-1}{T}\right)ds} \\
&= \frac{c_1}{\eta T\int_{\frac{k-1}{T}}^{1}h(u)\,du} + \frac{c_2}{\eta T}\sum_{t=k}^{T}\frac{\eta_t^2}{\int_{\frac{t-1}{T}}^{1}h(u)\,du}
&& \text{(changing integration variables)} \\
&= \frac{c_1}{\eta T H_h\left(\frac{k-1}{T}\right)} + \frac{c_2}{\eta T}\sum_{t=k}^{T}\frac{\eta_t^2}{H_h\left(\frac{t-1}{T}\right)}.
&& \text{(Eq. 1)}
\end{aligned}
$$

Again bounding by integration and changing variables,

$$
\frac{c_2}{\eta T}\sum_{t=k}^{T}\frac{\eta_t^2}{H_h\left(\frac{t-1}{T}\right)}
\le \frac{c_2\eta}{T}\left(\frac{h\left(\frac{k-1}{T}\right)^2}{H_h\left(\frac{k-1}{T}\right)} + \int_{k}^{T}\frac{h\left(\frac{t-1}{T}\right)^2}{H_h\left(\frac{t-1}{T}\right)}\,dt\right)
= \frac{c_2\eta}{T}\left(\frac{h\left(\frac{k-1}{T}\right)^2}{H_h\left(\frac{k-1}{T}\right)} + T\int_{\frac{k-1}{T}}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du\right).
$$

As $h(u)$ is differentiable, $p$-Lipschitz, and $h(1) = 0$, for any $v \in [0,1)$,

$$
2pH_h(v) = 2p\int_{v}^{1}h(u)\,du \ge 2\int_{v}^{1}h(u)\left(-h(u)\right)'du = h(v)^2 - h(1)^2 = h(v)^2. \tag{8}
$$
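As a quick numerical sanity check of Eq. 8 (not part of the proof), the following Python sketch evaluates both sides for a cosine schedule $h(u) = \frac{1}{2}(1 + \cos(\pi u))$, an illustrative form assumed here. It is differentiable, non-increasing, satisfies $h(1) = 0$, and is $\frac{\pi}{2}$-Lipschitz, with $H_h(v) = \frac{1-v}{2} - \frac{\sin(\pi v)}{2\pi}$. The names `h_cos`, `H_cos`, and `eq8_gap` are hypothetical helpers.

```python
import math

# Lipschitz constant of the cosine schedule: |h'(u)| = (pi/2)|sin(pi u)| <= pi/2.
P_COS = math.pi / 2.0


def h_cos(u: float) -> float:
    """Cosine schedule h(u) = (1 + cos(pi u)) / 2 (assumed illustrative form)."""
    return 0.5 * (1.0 + math.cos(math.pi * u))


def H_cos(v: float) -> float:
    """Closed form of H_h(v), the integral of h_cos over [v, 1]."""
    return 0.5 * (1.0 - v) - math.sin(math.pi * v) / (2.0 * math.pi)


def eq8_gap(v: float) -> float:
    """Left side minus right side of Eq. 8: 2 p H_h(v) - h(v)^2."""
    return 2.0 * P_COS * H_cos(v) - h_cos(v) ** 2


# Eq. 8 predicts a nonnegative gap for every v in [0, 1).
for v in (0.0, 0.25, 0.5, 0.75, 0.99):
    print(f"v={v:.2f}  2pH={2.0 * P_COS * H_cos(v):.6f}  h^2={h_cos(v) ** 2:.6f}")
```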

Hence, $\frac{h\left(\frac{k-1}{T}\right)^2}{H_h\left(\frac{k-1}{T}\right)} \le 2p$ and since $\left|\tau - \frac{k-1}{T}\right| \le \frac{1}{T}$,

$$
\begin{aligned}
\int_{\frac{k-1}{T}}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du
&= \int_{\tau}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du + \int_{\frac{k-1}{T}}^{\tau}\frac{h(u)^2}{H_h(u)}\,du
\le \int_{\tau}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du + \left|\int_{\frac{k-1}{T}}^{\tau}2p\,du\right| \\
&\le \int_{\tau}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du + \frac{2p}{T}.
\end{aligned}
$$

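Plugging back, and using $H_h\left(\frac{k-1}{T}\right) \ge H_h(\tau)$ (as $\tau \ge \frac{k-1}{T}$ and $H_h$ is non-increasing),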

$$
\frac{c_1}{\sum_{t=k}^{T}\eta_t} + c_2\sum_{t=k}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}
\le \frac{c_1}{\eta T H_h(\tau)} + \eta c_2\int_{\tau}^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du + \frac{4\eta c_2 p}{T}. \qquad ∎
$$
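The inequality of Lemma 4 can also be checked numerically. The Python sketch below does so for the linear schedule $h(u) = 1-u$ (an illustrative choice with $p = 1$, assumed here), for which $H_h(v) = \frac{(1-v)^2}{2}$ and $\frac{h(u)^2}{H_h(u)} = 2$, so the integral on the right-hand side equals $2\left(1 - \frac{1}{T} - \tau\right)$. The function name `lemma4_sides` and its default arguments are hypothetical.

```python
def lemma4_sides(T: int, k: int, tau: float,
                 eta: float = 1.0, c1: float = 1.0, c2: float = 1.0):
    """Return (lhs, rhs) of the Lemma 4 inequality for h(u) = 1 - u (p = 1).

    Requires tau in [(k - 1) / T, k / T), as in the lemma.
    """
    assert (k - 1) / T <= tau < k / T
    p = 1.0
    # Stepsizes eta_t = eta * h((t - 1) / T) for t = 1..T (stored 0-indexed).
    etas = [eta * (1.0 - (t - 1) / T) for t in range(1, T + 1)]
    # suffix[i] = sum of etas[i:], i.e. sum_{s = i + 1}^{T} eta_s.
    suffix = [0.0] * (T + 1)
    for i in range(T - 1, -1, -1):
        suffix[i] = suffix[i + 1] + etas[i]
    lhs = c1 / suffix[k - 1] + c2 * sum(
        etas[t - 1] ** 2 / suffix[t - 1] for t in range(k, T + 1)
    )
    H_tau = (1.0 - tau) ** 2 / 2.0            # H_h(tau) for the linear schedule
    integral = 2.0 * (1.0 - 1.0 / T - tau)    # integral of h^2 / H_h over [tau, 1 - 1/T]
    rhs = c1 / (eta * T * H_tau) + c2 * eta * integral + 4.0 * eta * c2 * p / T
    return lhs, rhs


for T, k in ((4, 1), (50, 10), (50, 50)):
    lhs, rhs = lemma4_sides(T, k, tau=(k - 1) / T)
    print(f"T={T:3d} k={k:3d}  lhs={lhs:.4f}  rhs={rhs:.4f}")
```

For every admissible pair of `k` and `tau`, the returned `lhs` should not exceed `rhs`.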

4 Convex and Smooth Setting

In this section, we extend our robustness result to the convex smooth setting, in which we replace the second-moment gradient oracle assumption with the assumptions that the gradient oracle has bounded variance and that $f$ is $\beta$-smooth. The core technique is the same as in Section 3, with some additional considerations due to the requirement in standard smooth analysis that the stepsizes satisfy $\eta_1, \ldots, \eta_T \le \frac{c}{\beta}$ for some constant $c < 2$.

Next is the main result of this section, a convergence guarantee robust to a multiplicative misspecification of the stepsize.

Theorem 4.

Let $\mathcal{X} \subset \mathbb{R}^d$ be a convex set with diameter $D > 0$, $f : \mathcal{X} \to \mathbb{R}$ a $\beta$-smooth convex function, $x^\star \in \operatorname*{arg\,min}_{x \in \mathcal{X}} f(x)$, and $g : \mathcal{X} \to \mathbb{R}^d$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^2 \ge 0$. For any $\rho \ge 1$, let $x_1, x_2, \ldots, x_{T+1}$ be the iterates produced by $T$-step SGD with stepsizes $\eta_t = \eta h\left(\frac{t-1}{T}\right)$ using the oracle $g$, where $\eta = \rho \cdot \eta_{\mathsf{tu}}^{\mathsf{sm}}$ and $h$ is a differentiable $p$-Lipschitz annealed schedule. Denote $\tau_0 \triangleq \min\left\{\tau \in [0,1) : \eta h\left(\frac{\lfloor \tau T \rfloor}{T}\right) \le \frac{1}{2\beta}\right\}$. Then it holds that

𝔼[f(xT+1)f(x)]𝔼delimited-[]𝑓subscript𝑥𝑇1𝑓superscript𝑥\displaystyle\mathbb{E}[f(x_{T+1})-f(x^{\star})]blackboard_E [ italic_f ( italic_x start_POSTSUBSCRIPT italic_T + 1 end_POSTSUBSCRIPT ) - italic_f ( italic_x start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ) ] 𝖱𝖺𝗍𝖾h,T𝗌𝗆,𝗍𝗎infτ[τ0,1)\brkHh(0)ρHh(τ)+ρQh(τ)Qh(0)+O\brkpρη𝗍𝗎𝗌𝗆σ2T,absentsubscriptsuperscript𝖱𝖺𝗍𝖾𝗌𝗆𝗍𝗎𝑇subscriptinfimum𝜏subscript𝜏01\brksubscript𝐻0𝜌subscript𝐻𝜏𝜌subscript𝑄𝜏subscript𝑄0𝑂\brk𝑝𝜌superscriptsubscript𝜂𝗍𝗎𝗌𝗆superscript𝜎2𝑇\displaystyle\leq\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot\inf_{\tau\in[\tau_{% 0},1)}\brk*{\frac{H_{h}(0)}{\rho H_{h}(\tau)}+\frac{\rho Q_{h}(\tau)}{Q_{h}(0)% }}+O\brk*{\frac{p\rho\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}},≤ sansserif_Rate start_POSTSUPERSCRIPT sansserif_sm , sansserif_tu end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h , italic_T end_POSTSUBSCRIPT ⋅ roman_inf start_POSTSUBSCRIPT italic_τ ∈ [ italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , 1 ) end_POSTSUBSCRIPT ∗ divide start_ARG italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) end_ARG start_ARG italic_ρ italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_τ ) end_ARG + divide start_ARG italic_ρ italic_Q start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_τ ) end_ARG start_ARG italic_Q start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) end_ARG + italic_O ∗ divide start_ARG italic_p italic_ρ italic_η start_POSTSUBSCRIPT sansserif_tu end_POSTSUBSCRIPT start_POSTSUPERSCRIPT sansserif_sm end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_T end_ARG , (9)

where Hhsubscript𝐻H_{h}italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT, Qhsubscript𝑄Q_{h}italic_Q start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT, η𝗍𝗎𝗌𝗆superscriptsubscript𝜂𝗍𝗎𝗌𝗆\eta_{\mathsf{tu}}^{\mathsf{sm}}italic_η start_POSTSUBSCRIPT sansserif_tu end_POSTSUBSCRIPT start_POSTSUPERSCRIPT sansserif_sm end_POSTSUPERSCRIPT, and 𝖱𝖺𝗍𝖾h,T𝗌𝗆,𝗍𝗎subscriptsuperscript𝖱𝖺𝗍𝖾𝗌𝗆𝗍𝗎𝑇\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}sansserif_Rate start_POSTSUPERSCRIPT sansserif_sm , sansserif_tu end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_h , italic_T end_POSTSUBSCRIPT are given in Eqs. 1, 3 and 4. In particular, the optimal τ𝜏\tauitalic_τ satisfies Hh(τ)Hh(τ)=Hh(0)Qh(0)ρ2subscript𝐻𝜏superscriptsubscript𝐻𝜏subscript𝐻0subscript𝑄0superscript𝜌2H_{h}(\tau)H_{h}^{\prime}(\tau)=\ifrac{-H_{h}(0)Q_{h}(0)}{\rho^{2}}italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( italic_τ ) italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_τ ) = ∕ start_ARG - italic_H start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) italic_Q start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ( 0 ) end_ARG start_ARG italic_ρ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (or τ=τ0𝜏subscript𝜏0\tau=\tau_{0}italic_τ = italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT if there is no solution).

As in Theorem 1, in Theorem 4 we observe a similar adaptivity to $\rho$ using the tails of $H_h$ and $Q_h$. One small yet important difference is that the infimum is limited to the range $[\tau_0,1)$, where $\tau_0$ denotes the fraction of iterations in which the stepsize exceeds $\frac{1}{2\beta}$. This dependency is somewhat unavoidable (up to constants), as gradient descent with stepsizes larger than or equal to $\frac{2}{\beta}$ does not converge in general. Additionally, note that the above guarantee holds even if we specify a stepsize that is larger than $\frac{2}{\beta}$, which is not the case with fixed stepsize SGD.
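To make the role of $\tau_0$ concrete, the following minimal Python sketch computes it directly from the definition in Theorem 4. The function name `tau0`, the example schedule `linear`, and the numeric values of `eta`, `beta`, and `T` are illustrative assumptions that do not come from the paper. The sketch relies only on the fact that $h(\lfloor\tau T\rfloor/T)$ is piecewise constant in $\tau$, so the minimum is attained at some $\tau=k/T$.

```python
def linear(u):
    """Linear decay schedule h(u) = 1 - u (polynomial decay with p = 1)."""
    return 1.0 - u


def tau0(eta, beta, T, h):
    """Smallest tau in [0, 1) with eta * h(floor(tau * T) / T) <= 1 / (2 * beta).

    Returns None if no step satisfies the condition (an edge case this
    sketch does not try to handle further).
    """
    threshold = 1.0 / (2.0 * beta)
    for k in range(T):  # k plays the role of floor(tau * T)
        if eta * h(k / T) <= threshold:
            return k / T
    return None


# Hypothetical values: a base stepsize above 2 / beta, annealed over 100 steps.
print(tau0(eta=4.0, beta=1.0, T=100, h=linear))
```

With these hypothetical values the condition $4(1-k/100)\leq\frac{1}{2}$ first holds at $k=88$, so only the last 12 of the 100 steps use a stepsize of at most $\frac{1}{2\beta}$, and the infimum in Eq. 9 is restricted to $[0.88,1)$.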

Next are corollaries of Theorem 4 with polynomial decay and cosine annealing schedules. Due to space constraints and similarities to the convex Lipschitz case, we defer the proofs of Theorem 4 and of the corollaries to Appendix B.

Corollary 5.

In the setting of Theorem 4, let $h(u)=(1-u)^p$ for some $p\geq 1$ and $\rho\leq\frac{T}{2p}$. Then if $\rho^2\geq(1-\tau_0)^{-(2p+1)}$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)]=\mathsf{Rate}^{\mathsf{sm}}_{h,T}(\eta_{\mathsf{tu}}^{\mathsf{sm}})\cdot O\left(\rho^{\frac{1}{2p+1}}\right),$$

and if $\rho^2<(1-\tau_0)^{-(2p+1)}$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)]=\mathsf{Rate}^{\mathsf{sm}}_{h,T}(\eta_{\mathsf{tu}}^{\mathsf{sm}})\cdot O\left(\frac{1}{1-\tau_0}\right).$$

In addition, $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}=O\left(\frac{p\beta D^2}{T}+\frac{\sqrt{p}D\sigma}{\sqrt{T}}\right)$.

Corollary 6.

In the setting of Theorem 4, let $h(u)=\frac{1}{2}(1+\cos(\pi u))$ and $\rho\leq\frac{2T}{\pi}$. Then if $\rho^2\geq(1-\tau_0)^{-5}$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)]=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\left(\rho^{\frac{1}{5}}\right),$$

and if $\rho^2<(1-\tau_0)^{-5}$,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)]=\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}\cdot O\left(\frac{1}{1-\tau_0}\right).$$

In addition, $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}=O\left(\frac{\beta D^2}{T}+\frac{D\sigma}{\sqrt{T}}\right)$.

Observing Corollaries 5 and 6, a similar improved dependence on $\rho$ as in Corollaries 2 and 3 holds when $\tau_0$ is sufficiently small. When $\tau_0$ is large, we obtain the expected inverse dependence on the fraction of steps with small enough stepsizes, which is unavoidable as we explained above.
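As a rough numerical illustration of the sublinear dependence on $\rho$, the sketch below tabulates the factor $\rho^{1/(2p+1)}$ from Corollary 5 next to the linear factor $\rho$ that arises with fixed stepsizes. It drops all constants hidden in the $O(\cdot)$ notation and assumes the regime $\rho^2\geq(1-\tau_0)^{-(2p+1)}$; the function names and the chosen values of $\rho$ are illustrative assumptions. The exponent $\frac{1}{5}$ of Corollary 6 coincides with the case $p=2$.

```python
def annealed_factor(rho, p):
    """Leading-order misspecification factor rho**(1/(2p+1)), constants dropped."""
    return rho ** (1.0 / (2 * p + 1))


def fixed_factor(rho):
    """Linear dependence on rho that arises with a fixed stepsize."""
    return rho


# Hypothetical misspecification factors, e.g. coarse grid resolutions.
for rho in (2.15, 10.0, 100.0):
    print(
        f"rho={rho:7.2f}  fixed={fixed_factor(rho):7.2f}  "
        f"p=1: {annealed_factor(rho, 1):5.2f}  "
        f"p=2: {annealed_factor(rho, 2):5.2f}"
    )
```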

5 Experimental Evaluation

Our theory predicts that learning rate annealing schemes exhibit greater robustness to learning rate tuning compared to tuning a fixed learning rate. To support the prediction, we perform experiments to compare the performances of different scheduling strategies under varying grid search resolutions for learning rate tuning.

We conduct two types of experiments: the first involves a synthetic logistic regression task closely aligned with the theoretical setting, while the second involves training a neural network classifier.

5.1 Experimental setup

We consider common schedules: a fixed learning rate (our baseline), cosine annealing, and linear decay. To simulate varying grid resolutions, we train the models using a geometric grid of learning rates with a multiplicative factor of approximately $\sqrt[3]{10}\approx 2.15$ (the values $\{1,2.2,5\}$ multiplied by $10^i$ for different values of $i$), and consider the different subsets with resolutions $2.15,2.15^2,2.15^3$, etc. For example, with range $[0.01,5]$ and resolution of $2.15^3$, we find the best model for each of the grids $\{0.01,0.1,1\}$, $\{0.022,0.22,2.2\}$, $\{0.05,0.5,5\}$, and report the average test loss/top-1 error across grids. (Footnote 3: We average the performance over 3 runs per learning rate.)
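The sub-grid protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the experiments: the function names are hypothetical, and the scores passed to `average_best_over_sub_grids` are assumed to be test losses or errors (lower is better) already averaged over runs.

```python
def geometric_grid(min_exp, max_exp, mantissas=(1.0, 2.2, 5.0)):
    """Learning rates m * 10**i for i in [min_exp, max_exp], in increasing order."""
    return [m * 10.0 ** i for i in range(min_exp, max_exp + 1) for m in mantissas]


def sub_grids(grid, stride):
    """Split a sorted grid into `stride` interleaved sub-grids.

    A stride of k corresponds to a multiplicative resolution of about 2.15**k.
    """
    return [grid[offset::stride] for offset in range(stride)]


def average_best_over_sub_grids(score_by_lr, stride):
    """Average, over sub-grids, of the best (lowest) score within each sub-grid."""
    grid = sorted(score_by_lr)
    best = [min(score_by_lr[lr] for lr in sg) for sg in sub_grids(grid, stride)]
    return sum(best) / len(best)


grid = geometric_grid(-2, 0)  # nine learning rates spanning [0.01, 5]
for sg in sub_grids(grid, 3):
    print([round(lr, 3) for lr in sg])
```

A stride of 1 recovers the full grid, so the reported value is simply the best score over all learning rates.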

Synthetic logistic regression.

In the synthetic experiment, we generate 100,000 samples of dimension 100, drawn from a normal distribution. Labels are assigned based on thresholding probabilities determined by a “true weights” vector of size 100, also sampled from a normal distribution. To introduce additional noise, we flip each label with a probability of 0.1. A test set of the same size is generated similarly. We train a linear classifier using binary cross-entropy loss, SGD without momentum, a batch size of 1,000, and a single epoch (updating the scheduler after each step). For the fixed learning rate scheduler, we report both the last iterate and the averaged iterate performances.
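One possible reading of this data-generation procedure is sketched below using only the Python standard library. Several details are assumptions rather than facts from the paper: the normal distribution is taken to be standard normal, the probabilities are taken to be sigmoid outputs thresholded at 0.5, and the function name, seed handling, and small example sizes are illustrative.

```python
import random


def make_dataset(n, d, flip_prob=0.1, seed=0, w_true=None):
    """Generate n noisy, linearly labelled samples of dimension d.

    Assumptions: features and true weights are standard normal, and the clean
    label is 1 exactly when sigmoid(<w_true, x>) > 0.5, i.e. <w_true, x> > 0.
    """
    rng = random.Random(seed)
    if w_true is None:
        w_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        logit = sum(wi * xi for wi, xi in zip(w_true, x))
        label = 1 if logit > 0 else 0  # equivalent to sigmoid(logit) > 0.5
        if rng.random() < flip_prob:  # label noise
            label = 1 - label
        X.append(x)
        y.append(label)
    return X, y, w_true


# Small sizes for illustration; the experiment uses n = 100,000 and d = 100.
X_train, y_train, w = make_dataset(n=1000, d=100, seed=0)
X_test, y_test, _ = make_dataset(n=1000, d=100, seed=1, w_true=w)
```

Passing the same `w_true` to the second call keeps the test set consistent with the training labels, which is how "generated similarly" is interpreted here.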

Wide ResNet on CIFAR-10.

We train a Wide ResNet 28-10 model (Zagoruyko and Komodakis, 2016) without dropout on the CIFAR-10 dataset (Krizhevsky, 2009). (Footnote 4: We use the PyTorch implementation of Wide ResNet at https://github.com/bmsookim/wide-resnet.pytorch.) We train for 200 epochs, using a batch size of 128, Nesterov momentum of 0.9, and weight decay of 0.0005. The scheduler is updated after each epoch. As the last iterate of fixed stepsize SGD underperforms, we use polynomial averaging as proposed by Shamir and Zhang (2013), with parameter $\gamma=8$, following Ivgi et al. (2023).
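For readers unfamiliar with polynomial averaging, the sketch below implements the recursion $\bar{w}_t=\left(1-\frac{\gamma+1}{t+\gamma}\right)\bar{w}_{t-1}+\frac{\gamma+1}{t+\gamma}w_t$, which is how the polynomial-decay averaging rule of Shamir and Zhang (2013) is commonly stated. This form is an assumption on our part and is not spelled out in this paper; the function name and the representation of iterates as plain lists of floats are likewise illustrative.

```python
def polynomial_average(iterates, gamma=8.0):
    """Polynomial-decay average of a sequence of parameter vectors.

    With gamma = 0 this reduces to the uniform average; larger gamma puts
    more weight on recent iterates.
    """
    avg = None
    for t, w in enumerate(iterates, start=1):
        c = (gamma + 1.0) / (t + gamma)  # weight of the newest iterate
        if avg is None:
            avg = list(w)  # at t = 1 the weight c equals 1
        else:
            avg = [(1.0 - c) * a + c * wi for a, wi in zip(avg, w)]
    return avg


# Toy one-dimensional iterates, purely for illustration.
print(polynomial_average([[0.0], [1.0]], gamma=8.0))
```

In practice the same recursion would be applied to the model parameters after each SGD step, with the averaged parameters used only for evaluation.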

5.2 Results

Figure 2: (a) Test loss for the logistic regression task with varying learning rates and different learning rate schedules. Each point represents 3 runs, reporting average and standard deviation. (b) Test loss of the best model in a sub-grid averaged over multiple sub-grids with the same multiplicative grid factor. The smallest multiplicative factor represents the full grid of (a). “Fixed stepsize w/ AVG” stands for fixed stepsize SGD with iterate averaging.

Figure 3: (a) CIFAR-10 top-1 test error of WideResNet28-10 with varying learning rates and different learning rate schedules. Each point represents 3 runs, reporting average and standard deviation. (b) Test error of the best model in a sub-grid averaged over multiple sub-grids with the same multiplicative grid factor. The smallest multiplicative factor represents the full grid of (a). “Fixed stepsize w/ AVG” stands for fixed stepsize SGD with polynomial iterate averaging.

The test loss per learning rate appears in Fig. 2(a). For each resolution, Fig. 2(b) illustrates the logistic regression test loss averaged across the best models for each sub-grid. At high resolutions (e.g., grid factors up to 10), we observe a comparable performance degradation across different schedules (besides fixed stepsize without averaging, which underperforms). However, as grid resolution decreases, the gap between the fixed learning rate schedule and the decaying schedules widens. For instance, with a grid factor of approximately 100, the test loss of the fixed learning rate (with averaging) degrades by 0.08, whereas cosine annealing and linear decay schedules experience smaller degradations of 0.01 and 0.014, respectively, with similar trends observed for grids with lower resolutions.

Fig. 3(b) shows the CIFAR-10 top-1 test error for each resolution, averaged over the best models per sub-grid, with the raw test error per learning rate appearing in Fig. 3(a). Similar to the logistic regression task, degradation remains similar for high resolutions while the gap between the fixed learning rate schedule and the decaying schedules widens for large grid factors. With a grid factor of approximately 22, the test error of the fixed learning rate degrades by 0.61, with smaller degradations of 0.3 and 0.35 observed for cosine annealing and linear decay schedules, respectively, and the trend continues for grids with lower resolutions.

5.3 Discussion

The experiments show that decaying schedules are more robust to coarse grids, while performance differences on fine grids remain minimal. These findings align with our theory, which suggests that all decaying schedules perform similarly to iterate averaging under small multiplicative misspecification but outperform it when misspecification is large. However, our theory also predicts robustness variations across decay rates, which are not observed in the real-data experiments. A possible explanation is the small difference in convergence rates among decaying schedules when misspecification is low, as illustrated in Fig. 1.

Acknowledgements

We are grateful to Noga Bar, Yair Carmon and Tomer Porian for helpful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 101078075). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant number 3174/23), a grant from the Tel Aviv University Center for AI and Data Science (TAD), and a fellowship from the Israeli Council of Higher Education.

References

  • Alacaoglu et al. (2020) A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, and V. Cevher. A new regret analysis for Adam-type algorithms. In International conference on machine learning, pages 202–210. PMLR, 2020.
  • Attia and Koren (2023) A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. In International Conference on Machine Learning, 2023.
  • Bengio (2012) Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade: Second edition, pages 437–478. Springer, 2012.
  • Bottou (2012) L. Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
  • Carmon and Hinder (2022) Y. Carmon and O. Hinder. Making SGD parameter-free. In Conference on Learning Theory, pages 2360–2389. PMLR, 2022.
  • Chaudhuri et al. (2009) K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. Advances in neural information processing systems, 22, 2009.
  • Cutkosky and Orabona (2018) A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference On Learning Theory, pages 1493–1529. PMLR, 2018.
  • Defazio and Mishchenko (2023) A. Defazio and K. Mishchenko. Learning-rate-free learning by D-Adaptation. In International Conference on Machine Learning, 2023.
  • Defazio et al. (2024a) A. Defazio, A. Cutkosky, H. Mehta, and K. Mishchenko. Optimal linear decay learning rate schedules and further refinements. arXiv preprint arXiv:2310.07831, 2024a.
  • Defazio et al. (2024b) A. Defazio, X. A. Yang, A. Khaled, K. Mishchenko, H. Mehta, and A. Cutkosky. The road less scheduled. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Faw et al. (2022) M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. A. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In COLT, 2022.
  • Ge et al. (2019) R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. Advances in neural information processing systems, 32, 2019.
  • Hu et al. (2024) S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, X. Zhang, Z. L. Thai, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li, Z. Liu, and M. Sun. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling, 2024.
  • Ivgi et al. (2023) M. Ivgi, O. Hinder, and Y. Carmon. DoG is SGD’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023.
  • Jain et al. (2019) P. Jain, D. Nagaraj, and P. Netrapalli. Making the last iterate of SGD information theoretically optimal. In Conference on Learning Theory, pages 1752–1755. PMLR, 2019.
  • Kavis et al. (2019) A. Kavis, K. Y. Levy, F. Bach, and V. Cevher. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
  • Kavis et al. (2022) A. Kavis, K. Y. Levy, and V. Cevher. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. In International Conference on Learning Representations, 2022.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
  • Liu and Zhou (2024) Z. Liu and Z. Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. In The Twelfth International Conference on Learning Representations, 2024.
  • Liu et al. (2023) Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. L. Nguyen. High probability convergence of stochastic gradient methods. arXiv preprint arXiv:2302.14843, 2023.
  • Loshchilov and Hutter (2017) I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • Luo and Schapire (2015) H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304. PMLR, 2015.
  • Mishchenko and Defazio (2023) K. Mishchenko and A. Defazio. Prodigy: An expeditiously adaptive parameter-free learner. arXiv preprint arXiv:2306.06101, 2023.
  • Orabona and Pál (2016) F. Orabona and D. Pál. Coin betting and parameter-free online learning. Advances in Neural Information Processing Systems, 29, 2016.
  • Orabona and Pál (2021) F. Orabona and D. Pál. Parameter-free stochastic optimization of variationally coherent functions. arXiv preprint arXiv:2102.00236, 2021.
  • Reddi et al. (2018) S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Schaul et al. (2013) T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International conference on machine learning, pages 343–351. PMLR, 2013.
  • Shamir and Zhang (2013) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71–79. PMLR, 2013.
  • Smith (2017) L. N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE, 2017.
  • Streeter and McMahan (2012) M. J. Streeter and H. B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Neural Information Processing Systems, 2012.
  • Tran et al. (2019) P. T. Tran et al. On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716, 2019.
  • Virtanen et al. (2020) P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
  • Zagoruyko and Komodakis (2016) S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • Zamani and Glineur (2023) M. Zamani and F. Glineur. Exact convergence rate of the last iterate in subgradient methods. arXiv preprint arXiv:2307.11134, 2023.
  • Zhai et al. (2022) X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.

Appendix A Proofs of Section 3

A.1 Proof of Corollary 3

Proof.

Note that $h(u)$ is non-increasing, differentiable ($h'(u)=-\frac{\pi}{2}\sin(\pi u)$), $\frac{\pi}{2}$-Lipschitz (as $\left|h'(u)\right|\leq\frac{\pi}{2}$) and satisfies $h(u)=0\Leftrightarrow u=1$. Hence, $h$ is annealed and by Theorem 1,

$$\mathbb{E}[f(x_{T+1})-f(x^\star)]\leq\frac{1}{2}\mathsf{Rate}_{h,T}^{\mathsf{tu}}\cdot\inf_{\tau\in[0,1)}\left(\frac{H_h(0)}{\rho H_h(\tau)}+\frac{\rho Q_h(\tau)}{Q_h(0)}\right)+O\left(\frac{\rho\eta_{\mathsf{tu}}G^2}{T}\right).$$

Next, we will bound $h(u)$, $H_h(u)$ and $Q_h(u)$ using polynomials. As $\cos(\pi-\theta)=-\cos(\theta)$ and $\cos(\theta)\geq 1-\frac{\theta^2}{2}$,

$$h(u)=\frac{1}{2}\left(1-\cos(\pi(1-u))\right)\leq\frac{\pi^2}{4}(1-u)^2\leq\frac{5}{2}(1-u)^2. \tag{10}$$

On the other hand, for $u\in[0,1)$, the quotient rule gives

$$\left(\frac{h(u)}{(1-u)^2}\right)'=\frac{-\frac{\pi}{2}\sin(\pi u)(1-u)^2+2(1-u)h(u)}{(1-u)^4}=\frac{4h(u)-\pi(1-u)\sin(\pi u)}{2(1-u)^3}\geq 0,$$

where the inequality holds since, writing $v=\pi(1-u)\in(0,\pi]$, the numerator equals $2\sin(v/2)\left(2\sin(v/2)-v\cos(v/2)\right)$, which is non-negative because $\tan(x)\geq x$ for $x\in[0,\pi/2)$.

Using the fundamental theorem of calculus, for all $u\in[0,1)$,

$$\frac{h(u)}{(1-u)^2}=\frac{h(0)}{(1-0)^2}+\int_0^u\left(\frac{h(v)}{(1-v)^2}\right)'dv\geq 1+\int_0^u 0\cdot dv=1\implies h(u)\geq(1-u)^2. \tag{11}$$

By integration, Eqs. 10 and 11 also imply that

$$\frac{1}{3}(1-u)^3\leq H_h(u)\leq\frac{5}{6}(1-u)^3. \tag{12}$$
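As a quick numerical sanity check of Eqs. 10 to 12 for the cosine schedule, the sketch below evaluates the polynomial sandwich bounds on a grid. It assumes that $H_h(u)=\int_u^1 h(v)\,dv$, which is consistent with the integration step above but whose definition (Eq. 1) lies outside this appendix; the closed form used for $H_h$, the function names, the grid size, and the tolerance are illustrative choices.

```python
import math


def h(u):
    """Cosine annealing schedule h(u) = (1 + cos(pi * u)) / 2."""
    return 0.5 * (1.0 + math.cos(math.pi * u))


def H(u):
    """Closed form of the tail integral of h from u to 1 (assumed definition of H_h)."""
    return 0.5 * (1.0 - u) - math.sin(math.pi * u) / (2.0 * math.pi)


def bounds_hold(n=100, tol=1e-12):
    """Check Eqs. 10-12 at u = 0, 1/n, ..., (n-1)/n, up to a small tolerance."""
    for k in range(n):
        u = k / n
        s = 1.0 - u
        if not (s ** 2 - tol <= h(u) <= 2.5 * s ** 2 + tol):
            return False
        if not (s ** 3 / 3.0 - tol <= H(u) <= 5.0 * s ** 3 / 6.0 + tol):
            return False
    return True


print(bounds_hold())
```

Such a check does not replace the proof, but it is a cheap way to catch slips in the constants.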

Using the above inequalities,

$$Q_{h}(v) = \int_{v}^{1} \frac{h(u)^{2}}{H_{h}(u)}\,du \leq \frac{75}{4}\int_{v}^{1}(1-u)\,du = \frac{75}{8}(1-v)^{2} \tag{13}$$

and

$$Q_{h}(v) \geq \int_{v}^{1} \frac{6(1-u)^{4}}{5(1-u)^{3}}\,du = \frac{3}{5}(1-v)^{2}. \tag{14}$$

Using the bounds, setting $\bar{\tau} = 1 - \rho^{-0.4} \in [0,1)$, and noting that $H_{h}(0) = \frac{1}{2}\int_{0}^{1}(1+\cos(\pi u))\,du = \frac{1}{2}$,

$$\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})} + \frac{\rho Q_{h}(\bar{\tau})}{Q_{h}(0)} \leq \frac{3}{2\rho(1-\bar{\tau})^{3}} + \frac{125\rho(1-\bar{\tau})^{2}}{8} = \frac{3}{2\rho\rho^{-1.2}} + \frac{125\rho\rho^{-0.8}}{8} \leq 18\rho^{0.2}. \tag{15}$$

Thus,

$$\mathbb{E}[f(x_{T+1}) - f(x^{\star})] \leq \mathsf{Rate}_{h,T}^{\mathsf{tu}} \cdot 18\rho^{\frac{1}{5}} + O\left(\frac{\rho\,\eta_{\mathsf{tu}} G^{2}}{T}\right).$$

Again noting that $H_{h}(0) = \frac{1}{2}$ and using Eq. 13,

$$\mathsf{Rate}_{h,T}^{\mathsf{tu}} = \frac{2DG}{\sqrt{T}}\sqrt{Q_{h}(0)/H_{h}(0)} = \frac{2DG}{\sqrt{T}}\sqrt{2Q_{h}(0)} \leq \frac{2DG\sqrt{\frac{150}{8}}}{\sqrt{T}} \leq \frac{10DG}{\sqrt{T}}.$$ ∎
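The constants in Eqs. 12 to 15 can be sanity-checked numerically. The sketch below is not part of the proof. It assumes the cosine schedule $h(u) = \frac{1}{2}(1+\cos(\pi u))$ and $H_h(u) = \int_u^1 h(v)\,dv$, as suggested by the computation of $H_h(0)$ above, and it approximates $Q_h$ with a midpoint rule. The function names and the grid size `n` are illustrative choices.

```python
import math

def h(u):
    """Cosine schedule h(u) = (1 + cos(pi * u)) / 2 on [0, 1]."""
    return 0.5 * (1.0 + math.cos(math.pi * u))

def H(u):
    """Closed form of the integral of h over [u, 1]."""
    return 0.5 * (1.0 - u) - math.sin(math.pi * u) / (2.0 * math.pi)

def Q(v, n=4000):
    """Midpoint-rule approximation of the integral of h^2 / H over [v, 1].

    Midpoints never touch u = 1, where H vanishes.
    """
    step = (1.0 - v) / n
    total = 0.0
    for i in range(n):
        u = v + (i + 0.5) * step
        total += h(u) ** 2 / H(u)
    return total * step

def bounds_hold(u):
    """Check Eq. 12 for H and Eqs. 13 and 14 for Q at a single point u."""
    w = 1.0 - u
    q = Q(u)
    eq12 = w ** 3 / 3 <= H(u) <= 5 * w ** 3 / 6
    eq13_14 = 0.6 * w ** 2 <= q <= 75 / 8 * w ** 2
    return eq12 and eq13_14

def misspecification_factor(rho):
    """Left-hand side of Eq. 15 evaluated at tau_bar = 1 - rho^(-0.4)."""
    tau_bar = 1.0 - rho ** -0.4
    return H(0.0) / (rho * H(tau_bar)) + rho * Q(tau_bar) / Q(0.0)

if __name__ == "__main__":
    for u in (0.0, 0.25, 0.5, 0.75, 0.9):
        print(f"u={u}: bounds hold = {bounds_hold(u)}")
    for rho in (1.0, 2.0, 10.0, 100.0):
        print(f"rho={rho}: factor={misspecification_factor(rho):.3f}, "
              f"18*rho^0.2={18 * rho ** 0.2:.3f}")
```

A numerical check of this kind only probes the inequalities at a few points; it does not replace the analytic argument.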

Appendix B Proofs of Section 4

B.1 Proof of Theorem 4

In the proof, we use the following last-iterate guarantee for convex-smooth optimization, replacing Lemma 3, which we used in the convex Lipschitz case. The lemma is based on the technique introduced by Liu and Zhou (2024), and its proof appears in Appendix C.

Lemma 5.

Let $\mathcal{X} \subseteq \mathbb{R}^{d}$ be a convex set, $f : \mathcal{X} \to \mathbb{R}$ a convex function, and $g : \mathcal{X} \to \mathbb{R}^{d}$ an unbiased first-order oracle of $f$ with variance bounded by $\sigma^{2} \geq 0$. Let $x_{1}, x_{2}, \ldots, x_{T+1}$ be the iterates produced by $T$-steps SGD with stepsizes $\eta_{1}, \ldots, \eta_{T}$ (satisfying $\eta_{t} \leq \frac{1}{2\beta}$ for all $t \in [T]$) and using the oracle $g$. Then for any $\hat{x} \in \mathcal{X}$,

$$\mathbb{E}[f(x_{T+1}) - f(\hat{x})] \leq \frac{\lVert x_{1} - \hat{x} \rVert^{2}}{2\sum_{s=1}^{T}\eta_{s}} + \sigma^{2}\sum_{t=1}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.$$
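To get a feel for how the right-hand side of Lemma 5 behaves under different schedules, it can be evaluated directly. The following is a minimal sketch, not the authors' code. The helper names `annealed_stepsizes` and `lemma5_bound` are hypothetical, the schedule follows the convention $\eta_{t} = \eta\, h\left(\frac{t-1}{T}\right)$ used in the proof of Theorem 4, and the smoothness condition $\eta_{t} \leq \frac{1}{2\beta}$ is assumed to hold and is not checked.

```python
def annealed_stepsizes(eta, T, h):
    """Stepsizes eta_t = eta * h((t - 1) / T) for t = 1, ..., T."""
    return [eta * h((t - 1) / T) for t in range(1, T + 1)]

def lemma5_bound(etas, dist_sq, sigma_sq):
    """Right-hand side of Lemma 5 for positive stepsizes `etas`.

    dist_sq  : squared distance ||x_1 - x_hat||^2
    sigma_sq : variance bound sigma^2 of the gradient oracle
    """
    suffix_sum = 0.0   # running value of sum_{s=t}^{T} eta_s
    noise_term = 0.0   # accumulates eta_t^2 / sum_{s=t}^{T} eta_s
    for eta in reversed(etas):
        suffix_sum += eta
        noise_term += eta ** 2 / suffix_sum
    # After the loop, suffix_sum equals the full sum of the stepsizes.
    return dist_sq / (2.0 * suffix_sum) + sigma_sq * noise_term

if __name__ == "__main__":
    T = 1000
    constant = annealed_stepsizes(0.05, T, lambda u: 1.0)
    linear = annealed_stepsizes(0.05, T, lambda u: 1.0 - u)
    print("constant:", lemma5_bound(constant, dist_sq=1.0, sigma_sq=1.0))
    print("linear decay:", lemma5_bound(linear, dist_sq=1.0, sigma_sq=1.0))
```

Iterating over the stepsizes in reverse keeps the suffix sums in a single running variable, so the bound is computed in $O(T)$ time.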

We proceed to the proof of Theorem 4.

Proof of Theorem 4.

Let $\tau \in [\tau_{0}, 1)$ and let $k = \lfloor \tau T \rfloor + 1 \in [T]$. Consider the suffix $x_{k}, x_{k+1}, \ldots, x_{T+1}$ as an SGD sequence starting at $x_{k}$ and note that since $h$ is non-increasing, $\eta_{k} = \eta h\left(\frac{k-1}{T}\right) \leq \eta h(\tau_{0}) \leq \frac{1}{2\beta}$. Thus, by Lemma 5 with $\hat{x} = x^{\star}$,

$$\mathbb{E}[f(x_{T+1}) - f(x^{\star})] \leq \frac{D^{2}}{2\sum_{s=k}^{T}\eta_{s}} + \sigma^{2}\sum_{t=k}^{T}\frac{\eta_{t}^{2}}{\sum_{s=t}^{T}\eta_{s}}.$$

As $\tau \in \left[\frac{k-1}{T}, \frac{k}{T}\right)$, invoking Lemma 4 with $c_{1} = D^{2}/2$ and $c_{2} = \sigma^{2}$,

$$\mathbb{E}[f(x_{T+1}) - f(x^{\star})] \leq \frac{D^{2}}{2\eta T H_{h}(\tau)} + \eta\sigma^{2}\int_{\tau}^{1}\frac{h(u)^{2}}{H_{h}(u)}\,du + \frac{4\eta p\sigma^{2}}{T}.$$

Substituting $\eta = \rho \cdot \eta_{\mathsf{tu}}^{\mathsf{sm}}$ and using Eqs. 3 and 4,

$$\begin{aligned}
\mathbb{E}[f(x_{T+1}) - f(x^{\star})] &\leq \frac{1}{\rho H_{h}(\tau)} \cdot \frac{D^{2}}{2\eta_{\mathsf{tu}}^{\mathsf{sm}} T} + \left(\rho\int_{\tau}^{1}\frac{h(u)^{2}}{H_{h}(u)}\,du\right) \cdot \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2} + \frac{4p\rho}{T} \cdot \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2} \\
&= \frac{H_{h}(0)}{\rho H_{h}(\tau)} \cdot \frac{D^{2}}{2\eta_{\mathsf{tu}}^{\mathsf{sm}} T H_{h}(0)} + \frac{\rho Q_{h}(\tau)}{Q_{h}(0)} \cdot \eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2} Q_{h}(0) + \frac{4p\rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T} \\
&\leq \mathsf{Rate}^{\mathsf{sm}}_{h,T} \cdot \left(\frac{H_{h}(0)}{\rho H_{h}(\tau)} + \frac{\rho Q_{h}(\tau)}{Q_{h}(0)}\right) + \frac{4p\rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}.
\end{aligned}$$

This inequality holds for any $\tau \in [\tau_{0}, 1)$, hence it holds for the infimum over all $\tau \in [\tau_{0}, 1)$. It is left to find the $\tau$ which minimizes the right-hand side. Let

$$g(\tau) = \frac{H_{h}(0)}{\rho H_{h}(\tau)} + \frac{\rho Q_{h}(\tau)}{Q_{h}(0)}.$$

This is the same function as in Eq. 7, so the same solution to $H_{h}(\tau) H_{h}'(\tau) = \frac{-H_{h}(0) Q_{h}(0)}{\rho^{2}}$ is the minimizer of the function, and if there is no solution, the function is increasing (positive derivative) and the minimizer is at $\tau = \tau_{0}$. ∎

B.2 Proof of Corollary 5

Proof.

As in the proof of Corollary 2, $h$ is annealed, since it is non-increasing, differentiable, $p$-Lipschitz and satisfies $h(u) = 0 \Leftrightarrow u = 1$. Hence, we can use Theorem 4. In addition, $H_{h}(\tau) = \frac{1}{p+1}(1-\tau)^{p+1}$, $H_{h}'(\tau) = -(1-\tau)^{p}$ and $Q_{h}(\tau) = \frac{p+1}{p}(1-\tau)^{p}$, so

$$\frac{H_{h}(0)}{\rho H_{h}(\tau)} + \frac{\rho Q_{h}(\tau)}{Q_{h}(0)} = \frac{1}{\rho(1-\tau)^{p+1}} + \rho(1-\tau)^{p}.$$

If $\rho^{2} \geq (1-\tau_{0})^{-(2p+1)}$ we can pick $\bar{\tau} = 1 - \rho^{\frac{-2}{2p+1}}$, as $\bar{\tau} \in [\tau_{0}, 1)$. In this case,

$$\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})} + \frac{\rho Q_{h}(\bar{\tau})}{Q_{h}(0)} = \frac{1}{\rho \cdot \rho^{\frac{-2(p+1)}{2p+1}}} + \rho \cdot \rho^{\frac{-2p}{2p+1}} = 2\rho^{\frac{1}{2p+1}}.$$

If $\rho^{2} < (1-\tau_{0})^{-(2p+1)}$ and $\tau_{0} > 0$, picking $\bar{\tau} = \tau_{0}$ and using the $p$-Lipschitz property of $h$,

$$\begin{aligned}
\frac{1}{2\beta} &\leq \eta h\left(\tau_{0} - \frac{1}{T}\right) = \rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}} h\left(\tau_{0} - \frac{1}{T}\right) \leq \rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}} h(\tau_{0}) + \frac{p\rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}}}{T} \leq \frac{\rho h(\tau_{0})}{2\beta h(0)} + \frac{p\rho}{2\beta h(0) T} && \left(\eta_{\mathsf{tu}}^{\mathsf{sm}} h(0) \leq \tfrac{1}{2\beta}\right) \\
\implies \rho &\geq \left(1 - \frac{p\rho}{T}\right)(1-\tau_{0})^{-p} && \left(h(0) = 1\right)
\end{aligned}$$

and

$$\frac{H_{h}(0)}{\rho H_{h}(\tau_{0})} + \frac{\rho Q_{h}(\tau_{0})}{Q_{h}(0)} = \frac{1}{\rho(1-\tau_{0})^{p+1}} + \rho(1-\tau_{0})^{p} \leq \frac{1}{\left(1 - \frac{p\rho}{T}\right)(1-\tau_{0})} + \sqrt{\frac{1}{1-\tau_{0}}} = O\left(\frac{1}{1-\tau_{0}}\right),$$

where the last two transitions use $\rho^{2} < (1-\tau_{0})^{-(2p+1)}$ and the assumption $\rho \leq \frac{T}{2p}$. Since $\rho \geq 1$ there is no case where $\rho^{2} < (1-\tau_{0})^{-(2p+1)}$ and $\tau_{0} = 0$. Bounding the infimum of Eq. 9 in the two cases with our choices of $\bar{\tau}$, if $\rho^{2} \geq (1-\tau_{0})^{-(2p+1)}$,

$$\mathbb{E}[f(x_{T+1}) - f(x^{\star})] = \mathsf{Rate}^{\mathsf{sm,tu}}_{h,T} \cdot O\left(\rho^{\frac{1}{2p+1}}\right) + O\left(\frac{p\rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}\right),$$

and if $\rho^{2} < (1-\tau_{0})^{-(2p+1)}$,

$$\mathbb{E}[f(x_{T+1}) - f(x^{\star})] = \mathsf{Rate}^{\mathsf{sm,tu}}_{h,T} \cdot O\left(\frac{1}{1-\tau_{0}}\right) + O\left(\frac{p\rho\,\eta_{\mathsf{tu}}^{\mathsf{sm}}\sigma^{2}}{T}\right).$$

Noting that by the assumption $\rho \leq \frac{T}{2p}$,

\[
\frac{p \rho \eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{T} \leq \frac{\eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{2} \leq \frac{\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}}{2 Q_{h}(0)} = O(\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}),
\]

we obtain our final convergence guarantees. The bound on $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ follows from plugging $H_{h}(0) = \frac{1}{p+1}$ and $Q_{h}(0) = \frac{p+1}{p}$ into Eq. 4. ∎
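The two constants plugged into Eq. 4 can be sanity-checked numerically. The sketch below is illustrative only and rests on assumptions that are not restated in this part of the appendix: that the polynomial schedule is $h(u) = (1-u)^{p}$, that $H_{h}(v) = \int_{v}^{1} h(u)\,du$, and that $Q_{h}(v) = \int_{v}^{1} h(u)^{2} / H_{h}(u)\,du$. The helper names `tail_integrals` and `poly_schedule` are hypothetical.

```python
def tail_integrals(h, n=20000):
    """Approximate (H_h(0), Q_h(0)) with a midpoint rule on [0, 1].

    Assumed definitions (not taken from this section):
        H_h(v) = integral over [v, 1] of h(u) du
        Q_h(v) = integral over [v, 1] of h(u)^2 / H_h(u) du
    """
    du = 1.0 / n
    tail = 0.0  # integral of h over the cells strictly to the right of the current one
    q = 0.0
    for i in reversed(range(n)):
        mid = (i + 0.5) * du
        # H_h(mid): cells to the right plus the right half of the current cell
        h_tail_at_mid = tail + 0.5 * du * h(mid + 0.25 * du)
        q += h(mid) ** 2 / h_tail_at_mid * du
        tail += h(mid) * du
    return tail, q


def poly_schedule(p):
    """Assumed polynomial decay schedule h(u) = (1 - u)^p."""
    return lambda u: (1.0 - u) ** p


for p in (1, 2, 3):
    H0, Q0 = tail_integrals(poly_schedule(p))
    print(p, H0, 1.0 / (p + 1), Q0, (p + 1.0) / p)
```

Under these assumed definitions, the printed pairs should agree to several digits for each integer $p$, matching $H_{h}(0) = \frac{1}{p+1}$ and $Q_{h}(0) = \frac{p+1}{p}$.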

B.3 Proof of Corollary 6

As in the proof of Corollary 3, $h$ is annealed, since $h(u)$ is non-increasing, differentiable, $\frac{\pi}{2}$-Lipschitz and satisfies $h(u) = 0 \Leftrightarrow u = 1$. Hence, we can use Theorem 4. We already established in Eq. 15 of the proof of Corollary 3 that

\[
\frac{H_{h}(0)}{\rho H_{h}(\tau)} + \frac{\rho Q_{h}(\tau)}{Q_{h}(0)} \leq \frac{3}{2\rho(1-\tau)^{3}} + \frac{125\rho(1-\tau)^{2}}{8}.
\]

If $\rho^{2} \geq (1-\tau_{0})^{-5}$ we can pick $\bar{\tau} = 1 - \rho^{-\frac{2}{5}}$, as $\bar{\tau} \in [\tau_{0}, 1)$. In this case,

\[
\frac{H_{h}(0)}{\rho H_{h}(\bar{\tau})} + \frac{\rho Q_{h}(\bar{\tau})}{Q_{h}(0)} \leq \frac{3}{2\rho \cdot \rho^{-\frac{6}{5}}} + \frac{125\rho \cdot \rho^{-\frac{4}{5}}}{8} = \frac{137\rho^{\frac{1}{5}}}{8} \leq 18\rho^{\frac{1}{5}}.
\]
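The arithmetic behind the choice $\bar{\tau} = 1 - \rho^{-2/5}$ can be checked directly. The following is a small sketch, with hypothetical function names, that evaluates the right-hand side of the bound from Eq. 15 at $\bar{\tau}$ and compares it with $\frac{137}{8}\rho^{1/5}$ for a few values of $\rho \geq 1$.

```python
def cosine_tradeoff_bound(rho, tau):
    """Evaluate 3 / (2 rho (1 - tau)^3) + 125 rho (1 - tau)^2 / 8."""
    return 3.0 / (2.0 * rho * (1.0 - tau) ** 3) + 125.0 * rho * (1.0 - tau) ** 2 / 8.0


def balanced_tau(rho):
    """The choice tau_bar = 1 - rho^(-2/5) made in the proof (needs rho >= 1)."""
    return 1.0 - rho ** (-2.0 / 5.0)


for rho in (1.0, 2.0, 10.0, 100.0, 1e4):
    value = cosine_tradeoff_bound(rho, balanced_tau(rho))
    # value should match (137 / 8) * rho^(1/5) and stay below 18 * rho^(1/5)
    print(rho, value, 137.0 / 8.0 * rho ** 0.2, 18.0 * rho ** 0.2)
```

With this choice both terms scale as $\rho^{1/5}$: the first contributes $\frac{12}{8}\rho^{1/5}$ and the second $\frac{125}{8}\rho^{1/5}$.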

If $\rho^{2} < (1-\tau_{0})^{-5}$ and $\tau_{0} > 0$, picking $\bar{\tau} = \tau_{0}$, using the definition of $\tau_{0}$ and the Lipschitz property of $h$,

\[
\frac{1}{2\beta} \leq \eta h\left(\tau_{0} - \frac{1}{T}\right) = \rho \eta_{\mathsf{tu}}^{\mathsf{sm}} h\left(\tau_{0} - \frac{1}{T}\right) \leq \rho \eta_{\mathsf{tu}}^{\mathsf{sm}} h(\tau_{0}) + \frac{\pi \rho \eta_{\mathsf{tu}}^{\mathsf{sm}}}{2T} \leq \frac{\rho h(\tau_{0})}{2\beta h(0)} + \frac{\pi\rho}{4\beta h(0) T}, \qquad \left(\eta_{\mathsf{tu}}^{\mathsf{sm}} h(0) \leq \frac{1}{2\beta}\right)
\]

implying (with $h(0) = 1$) that

\[
\rho \geq \left(1 - \frac{\pi\rho}{2T}\right) h(\tau_{0}) \geq \left(1 - \frac{\pi\rho}{2T}\right) (1-\tau_{0})^{2}.
\]

Combining this with the assumption $\rho^{2} < (1-\tau_{0})^{-5}$,

\[
\begin{aligned}
\frac{H_{h}(0)}{\rho H_{h}(\tau_{0})} + \frac{\rho Q_{h}(\tau_{0})}{Q_{h}(0)} &\leq \frac{3}{2\rho(1-\tau_{0})^{3}} + \frac{125\rho(1-\tau_{0})^{2}}{8} \\
&\leq \frac{3}{2\left(1 - \frac{\pi\rho}{2T}\right)(1-\tau_{0})} + \frac{125}{8}\sqrt{\frac{1}{1-\tau_{0}}} = O\left(\frac{1}{1-\tau_{0}}\right),
\end{aligned}
\]

where the last transition uses the assumption $\rho \leq \frac{2T}{\pi}$. Since $\rho \geq 1$, there is no case where $\rho^{2} < (1-\tau_{0})^{-5}$ and $\tau_{0} = 0$. Bounding the infimum of Eq. 9 in the two cases with our choices of $\bar{\tau}$, if $\rho^{2} \geq (1-\tau_{0})^{-5}$,

\[
\mathbb{E}[f(x_{T+1}) - f(x^{\star})] = \mathsf{Rate}^{\mathsf{sm,tu}}_{h,T} \cdot O\left(\rho^{\frac{1}{5}}\right) + O\left(\frac{\rho \eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{T}\right),
\]

and if $\rho^{2} < (1-\tau_{0})^{-5}$,

\[
\mathbb{E}[f(x_{T+1}) - f(x^{\star})] = \mathsf{Rate}^{\mathsf{sm,tu}}_{h,T} \cdot O\left(\frac{1}{1-\tau_{0}}\right) + O\left(\frac{\rho \eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{T}\right).
\]

We obtain our final convergence guarantees by noting that $\rho \leq \frac{2T}{\pi}$, which, together with the fact that $Q_{h}(0) = \Theta(1)$, implies

\[
\frac{\rho \eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{T} \leq \frac{2 \eta_{\mathsf{tu}}^{\mathsf{sm}} \sigma^{2}}{\pi} \leq \frac{2\,\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}}{\pi Q_{h}(0)} = O(\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}),
\]

and plugging this back into the above bounds. The bound on $\mathsf{Rate}^{\mathsf{sm,tu}}_{h,T}$ is immediate from Eq. 4, as $H_{h}(0) = \frac{1}{2}$ and $Q_{h}(0) = \Theta(1)$ (as we established in Eq. 14).

Appendix C Last Iterate Guarantees for Stochastic Gradient Descent

A convergence analysis of Stochastic Gradient Descent (SGD) for convex Lipschitz and convex smooth functions follows. The technique, introduced by Zamani and Glineur (2023) and later refined by Liu and Zhou (2024), is based on comparing the iterates of SGD $(x_{1}, x_{2}, \ldots)$ with iterates of the form

\[
z_{t} \triangleq \frac{v_{t-1}}{v_{t}} z_{t-1} + \left(1 - \frac{v_{t-1}}{v_{t}}\right) x_{t} = \frac{v_{0}}{v_{t}} \hat{x} + \sum_{s=1}^{t} \frac{v_{s} - v_{s-1}}{v_{t}} x_{s} \tag{16}
\]

for some non-decreasing sequence $v_{0}, v_{1}, v_{2}, \ldots$, starting at some $z_{0} = \hat{x} \in \mathcal{X}$. Note that by Jensen's inequality, for any $t \geq 2$,

\[
f(z_{t}) \leq \frac{v_{0}}{v_{t}} f(\hat{x}) + \sum_{s=1}^{t} \frac{v_{s} - v_{s-1}}{v_{t}} f(x_{s}). \tag{17}
\]

In particular, for any $t \in [T]$, we will use

\[
v_{t} \triangleq \frac{\eta_{T}}{\sum_{s=t}^{T} \eta_{s}} \tag{18}
\]

and $v_{0} = v_{1}$, similarly to Liu and Zhou (2024). Next, we restate the convergence results (Lemmas 3 and 5); their proofs follow.
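To make the construction concrete, here is a minimal sketch (scalar iterates, hypothetical helper names, arbitrary stand-in values for the iterates) that computes the weights of Eq. 18, runs the recursion of Eq. 16, and compares it with the closed-form convex combination on the right-hand side of Eq. 16.

```python
def comparison_weights(etas):
    """Weights v_0..v_T: v_t = eta_T / sum_{s=t}^T eta_s (Eq. 18), with v_0 = v_1."""
    T = len(etas)
    v = [0.0] * (T + 1)
    tail = 0.0
    for t in range(T, 0, -1):      # t = T, ..., 1
        tail += etas[t - 1]        # running value of sum_{s=t}^T eta_s
        v[t] = etas[T - 1] / tail
    v[0] = v[1]
    return v


def comparison_points(xs, x_hat, v):
    """z_0..z_T from the recursion in Eq. 16; xs[t] holds x_t, xs[0] is padding."""
    z = [x_hat]
    for t in range(1, len(v)):
        w = v[t - 1] / v[t]
        z.append(w * z[t - 1] + (1.0 - w) * xs[t])
    return z


def closed_form_point(xs, x_hat, v, t):
    """Right-hand side of Eq. 16: a convex combination of x_hat and x_1..x_t."""
    total = v[0] / v[t] * x_hat
    for s in range(1, t + 1):
        total += (v[s] - v[s - 1]) / v[t] * xs[s]
    return total


etas = [0.5 * (1.0 - t / 8) for t in range(8)]      # linearly annealed, all positive
xs = [None] + [float(t * t) for t in range(1, 9)]   # arbitrary stand-in iterates x_1..x_8
v = comparison_weights(etas)
z = comparison_points(xs, x_hat=-1.0, v=v)
for t in range(len(v)):
    print(t, v[t], z[t], closed_form_point(xs, -1.0, v, t))
```

Since the tail sums $\sum_{s=t}^{T} \eta_{s}$ shrink as $t$ grows, the weights $v_{t}$ increase up to $v_{T} = 1$, and the choice $v_{0} = v_{1}$ gives $z_{1} = \hat{x}$.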

C.1 Proof of Lemmas 3 and 5

To prove the last-iterate guarantees we need the following lemmas. Their proofs follow. The first translates from an average regret-like guarantee to a last-iterate guarantee.

Lemma 6.

Let $\mathcal{X} \subseteq \mathbb{R}^{d}$ be a convex set, $x_{1}, \hat{x} \in \mathcal{X}$, $f : \mathcal{X} \to \mathbb{R}$ a convex function and $T \in \mathbb{N}$. Then for any sequences $g_{1}, \ldots, g_{T} \in \mathbb{R}^{d}$ and $\eta_{1}, \ldots, \eta_{T} > 0$, the iterates defined by $x_{t+1} = x_{t} - \eta_{t} g_{t}$ satisfy

\[
\eta_{T} v_{T} (f(x_{T+1}) - f(\hat{x})) \leq \sum_{t=1}^{T} \eta_{t} v_{t} (f(x_{t+1}) - f(z_{t})),
\]

where $z_{0}, \ldots, z_{T}$ and $v_{0}, \ldots, v_{T}$ are defined by Eqs. 16 and 18.
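Lemma 6 is a deterministic statement, so it can be probed numerically on random inputs. The sketch below is an illustration rather than part of the proof: it assumes scalar iterates, takes $\mathcal{X} = \mathbb{R}$ and $f(x) = x^{2}$, and draws arbitrary sequences $g_{t}$ and $\eta_{t}$; the function name `lemma6_gap` is hypothetical.

```python
import random


def lemma6_gap(f, x1, x_hat, gs, etas):
    """Return RHS - LHS of Lemma 6 for scalar iterates x_{t+1} = x_t - eta_t g_t."""
    T = len(etas)
    # v_t = eta_T / sum_{s=t}^T eta_s (Eq. 18), with v_0 = v_1.
    v = [0.0] * (T + 1)
    tail = 0.0
    for t in range(T, 0, -1):
        tail += etas[t - 1]
        v[t] = etas[T - 1] / tail
    v[0] = v[1]

    x, z = x1, x_hat  # x_1 and z_0
    rhs = 0.0
    for t in range(1, T + 1):
        w = v[t - 1] / v[t]
        z = w * z + (1.0 - w) * x            # z_t from z_{t-1} and x_t (Eq. 16)
        x = x - etas[t - 1] * gs[t - 1]      # x_{t+1}
        rhs += etas[t - 1] * v[t] * (f(x) - f(z))
    lhs = etas[T - 1] * v[T] * (f(x) - f(x_hat))  # x now holds x_{T+1}
    return rhs - lhs


random.seed(0)
square = lambda x: x * x  # an assumed convex test function
gaps = []
for _ in range(200):
    T = random.randint(1, 20)
    etas = [random.uniform(0.01, 1.0) for _ in range(T)]
    gs = [random.uniform(-3.0, 3.0) for _ in range(T)]  # arbitrary, not necessarily gradients
    gaps.append(lemma6_gap(square, random.uniform(-5, 5), random.uniform(-5, 5), gs, etas))
print(min(gaps))
```

The gap should be non-negative up to floating-point rounding in every trial, and it vanishes when $T = 1$, where both sides reduce to $\eta_{1}(f(x_{2}) - f(\hat{x}))$.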

Lemma 7.

Let $\mathcal{X} \subseteq \mathbb{R}^{d}$ be a convex set, $x_{1}, \hat{x} \in \mathcal{X}$, $f : \mathcal{X} \to \mathbb{R}$ a convex function and $T \in \mathbb{N}$. Then for any $t \in [T]$, the iterates of SGD satisfy

\[
\mathbb{E}[f(x_{t+1}) - f(z_{t})] \leq \mathbb{E}\left[\frac{\|x_{t} - z_{t}\|^{2} - \|x_{t+1} - z_{t}\|^{2} - \|x_{t+1} - x_{t}\|^{2}}{2\eta_{t}} + f(x_{t+1}) - f(x_{t}) + g_{t} \cdot (x_{t} - x_{t+1})\right],
\]

where $z_{0}, \ldots, z_{T}$ are defined by Eq. 16.

We proceed to the proof.

Proof of Lemmas 3 and 5.

By Lemma 7,

\[
\mathbb{E}[f(x_{t+1}) - f(z_{t})] \leq \mathbb{E}\left[\frac{\|x_{t} - z_{t}\|^{2} - \|x_{t+1} - z_{t}\|^{2}}{2\eta_{t}} + \Delta_{t}\right],
\]

where $\Delta_{t} \triangleq f(x_{t+1}) - f(x_{t}) + g_{t} \cdot (x_{t} - x_{t+1}) - \frac{\|x_{t+1} - x_{t}\|^{2}}{2\eta_{t}}$. By the definition of $z_{t}$ we have $x_{t} - z_{t} = \frac{v_{t-1}}{v_{t}}(x_{t} - z_{t-1})$, and since $v_{t} \geq v_{t-1}$,

\[
\|x_{t} - z_{t}\|^{2} = \frac{v_{t-1}^{2}}{v_{t}^{2}} \|x_{t} - z_{t-1}\|^{2} \leq \frac{v_{t-1}}{v_{t}} \|x_{t} - z_{t-1}\|^{2}.
\]

Combining with our previous inequality multiplied by ηtvtsubscript𝜂𝑡subscript𝑣𝑡\eta_{t}v_{t}italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT,

𝔼[ηtvt(f(xt+1)f(zt))]𝔼delimited-[]subscript𝜂𝑡subscript𝑣𝑡𝑓subscript𝑥𝑡1𝑓subscript𝑧𝑡\displaystyle\mathbb{E}[\eta_{t}v_{t}(f(x_{t+1})-f(z_{t}))]blackboard_E [ italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_f ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) - italic_f ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ] 𝔼\brk[s]vt1\normxtzt12vt\normxt+1zt22+ηtvtΔt.absent𝔼\brkdelimited-[]𝑠subscript𝑣𝑡1\normsubscript𝑥𝑡superscriptsubscript𝑧𝑡12subscript𝑣𝑡\normsubscript𝑥𝑡1superscriptsubscript𝑧𝑡22subscript𝜂𝑡subscript𝑣𝑡subscriptΔ𝑡\displaystyle\leq\mathbb{E}\brk[s]*{\frac{v_{t-1}\norm{x_{t}-z_{t-1}}^{2}-v_{t% }\norm{x_{t+1}-z_{t}}^{2}}{2}+\eta_{t}v_{t}\Delta_{t}}.≤ blackboard_E [ italic_s ] ∗ divide start_ARG italic_v start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG + italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT .

Summing for t=1,,T𝑡1𝑇t=1,\ldots,Titalic_t = 1 , … , italic_T, and removing vT\normxT+1zT20subscript𝑣𝑇\normsubscript𝑥𝑇1superscriptsubscript𝑧𝑇20-v_{T}\norm{x_{T+1}-z_{T}}^{2}\leq 0- italic_v start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_T + 1 end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ 0,

𝔼\brk[s]t=1Tηtvt(f(xt+1)f(zt))𝔼\brkdelimited-[]𝑠superscriptsubscript𝑡1𝑇subscript𝜂𝑡subscript𝑣𝑡𝑓subscript𝑥𝑡1𝑓subscript𝑧𝑡\displaystyle\mathbb{E}\brk[s]*{\sum_{t=1}^{T}\eta_{t}v_{t}(f(x_{t+1})-f(z_{t}% ))}blackboard_E [ italic_s ] ∗ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_f ( italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) - italic_f ( italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) 𝔼\brk[s]v0\normx1z022+t=1TηtvtΔt.absent𝔼\brkdelimited-[]𝑠subscript𝑣0\normsubscript𝑥1superscriptsubscript𝑧022superscriptsubscript𝑡1𝑇subscript𝜂𝑡subscript𝑣𝑡subscriptΔ𝑡\displaystyle\leq\mathbb{E}\brk[s]*{\frac{v_{0}\norm{x_{1}-z_{0}}^{2}}{2}+\sum% _{t=1}^{T}\eta_{t}v_{t}\Delta_{t}}.≤ blackboard_E [ italic_s ] ∗ divide start_ARG italic_v start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG + ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_η start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT roman_Δ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT .

Combining with Lemma 6, and noting that $z_0 = \hat{x}$,

$$\eta_T v_T \left(f(x_{T+1}) - f(\hat{x})\right) \leq \mathbb{E}\left[\frac{v_0\|x_1 - \hat{x}\|^2}{2} + \sum_{t=1}^{T} \eta_t v_t \Delta_t\right]. \tag{19}$$

Next, we assume a second-moment bound (as in Lemma 3). From convexity,

$$\mathbb{E}[f(x_{t+1}) - f(x_t)] \leq \mathbb{E}[\nabla f(x_{t+1}) \cdot (x_{t+1} - x_t)] \leq \mathbb{E}\left[\eta_t \|\nabla f(x_t)\|^2 + \frac{\|x_{t+1} - x_t\|^2}{4\eta_t}\right],$$

where we used the inequality $2u \cdot v \leq \|u\|^2 + \|v\|^2$. Similarly, $g_t \cdot (x_t - x_{t+1}) \leq \eta_t \|g_t\|^2 + \frac{\|x_{t+1} - x_t\|^2}{4\eta_t}$. Hence, using the second-moment bound, $\mathbb{E}\Delta_t \leq 2\eta_t G^2$. Plugging the bound on $\mathbb{E}[\Delta_t]$ into Eq. 19 concludes the proof of Lemma 3.

Next, we assume that $f$ is $\beta$-smooth, that a variance bound holds, and that $\eta_t \leq \frac{1}{2\beta}$ for all $t \in [T]$ (as in Lemma 5). By smoothness,

$$\begin{aligned}
f(x_{t+1}) - f(x_t) &\leq \nabla f(x_t) \cdot (x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 \\
&\leq \nabla f(x_t) \cdot (x_{t+1} - x_t) + \frac{1}{4\eta_t}\|x_{t+1} - x_t\|^2. && \left(\eta_t \leq \tfrac{1}{2\beta}\right)
\end{aligned}$$

By the inequality $2u \cdot v \leq \|u\|^2 + \|v\|^2$,

$$\begin{aligned}
\nabla f(x_t) \cdot (x_{t+1} - x_t) &= (\nabla f(x_t) - g_t) \cdot (x_{t+1} - x_t) + g_t \cdot (x_{t+1} - x_t) \\
&\leq \eta_t \|\nabla f(x_t) - g_t\|^2 + \frac{\|x_{t+1} - x_t\|^2}{4\eta_t} + g_t \cdot (x_{t+1} - x_t).
\end{aligned}$$

Hence, using the variance bound,

$$\mathbb{E}\Delta_t \leq \mathbb{E}\left[\eta_t \|\nabla f(x_t) - g_t\|^2 + \frac{\|x_{t+1} - x_t\|^2}{4\eta_t} + \frac{\|x_{t+1} - x_t\|^2}{4\eta_t} - \frac{\|x_{t+1} - x_t\|^2}{2\eta_t}\right] \leq \eta_t \sigma^2.$$

Plugging the bound on $\mathbb{E}[\Delta_t]$ into Eq. 19 concludes the proof of Lemma 5. ∎

C.2 Proof of Lemma 6

Proof.

By Eq. 17,

$$\begin{aligned}
\sum_{t=1}^{T} \eta_t v_t \left(f(x_{t+1}) - f(z_t)\right) &\geq \sum_{t=1}^{T} \eta_t v_t \left(f(x_{t+1}) - \left(\frac{v_0}{v_t} f(\hat{x}) + \sum_{s=1}^{t} \frac{v_s - v_{s-1}}{v_t} f(x_s)\right)\right) \\
&= \sum_{t=1}^{T} \eta_t \left(v_t f(x_{t+1}) - \left(v_0 f(\hat{x}) + \sum_{s=1}^{t} (v_s - v_{s-1}) f(x_s)\right)\right) \\
&= \sum_{t=1}^{T} \eta_t \left(v_t f(x_{t+1}) - v_t f(\hat{x}) - \sum_{s=1}^{t} (v_s - v_{s-1}) \left(f(x_s) - f(\hat{x})\right)\right) \\
&= \eta_T v_T \left(f(x_{T+1}) - f(\hat{x})\right) \\
&\quad + \sum_{t=1}^{T} \left(\eta_{t-1} v_{t-1} - (v_t - v_{t-1}) \sum_{s=t}^{T} \eta_s\right) \left(f(x_t) - f(\hat{x})\right).
\end{aligned}$$

Note that $v_1 = v_0$ and for $2 \leq t \leq T$,

$$\eta_{t-1} v_{t-1} - (v_t - v_{t-1}) \sum_{s=t}^{T} \eta_s = \eta_T \left(\frac{\eta_{t-1}}{\sum_{s=t-1}^{T} \eta_s} - \frac{\eta_{t-1} \sum_{s=t}^{T} \eta_s}{\sum_{s=t-1}^{T} \eta_s \cdot \sum_{s=t}^{T} \eta_s}\right) = 0.$$
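The cancellation above can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes the weights are $v_t = \eta_T / \sum_{s=t}^{T} \eta_s$ for $t \geq 1$, which is the form implied by the displayed identity (the formal definition appears elsewhere in the paper), and it uses an arbitrary cosine-like stepsize schedule chosen only for illustration. The function names `tail_weights` and `telescoping_residuals` are hypothetical.

```python
import numpy as np


def tail_weights(eta):
    """Return v_1..v_T with v_t = eta_T / sum_{s=t}^T eta_s (assumed definition)."""
    eta = np.asarray(eta, dtype=float)
    tail_sums = np.cumsum(eta[::-1])[::-1]  # tail_sums[t-1] = sum_{s=t}^T eta_s
    return eta[-1] / tail_sums


def telescoping_residuals(eta):
    """Evaluate eta_{t-1} v_{t-1} - (v_t - v_{t-1}) * sum_{s=t}^T eta_s for t = 2..T."""
    eta = np.asarray(eta, dtype=float)
    v = tail_weights(eta)
    tail_sums = np.cumsum(eta[::-1])[::-1]
    # 0-based index i corresponds to t = i + 1, so slices [:-1] and [1:]
    # pair up (t-1, t) for t = 2..T.
    return eta[:-1] * v[:-1] - (v[1:] - v[:-1]) * tail_sums[1:]


T = 50
steps = np.arange(1, T + 1)
# Illustrative, strictly positive cosine-like schedule (an assumption).
eta = 0.1 * 0.5 * (1.0 + np.cos(np.pi * steps / (T + 1)))

residuals = telescoping_residuals(eta)
print("max |residual| over t = 2..T:", np.max(np.abs(residuals)))
```

Under the assumed definition of $v_t$, every residual should vanish up to floating-point error for any strictly positive schedule, which is why only the last-iterate term survives in the final equality.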

Thus,

$$\eta_T v_T \left(f(x_{T+1}) - f(\hat{x})\right) \leq \sum_{t=1}^{T} \eta_t v_t \left(f(x_{t+1}) - f(z_t)\right).$$ ∎

C.3 Proof of Lemma 7

Proof.

Using the convexity of $f$,

$$\mathbb{E}[f(x_{t+1}) - f(z_t)] = \mathbb{E}[f(x_t) - f(z_t) + f(x_{t+1}) - f(x_t)] \leq \mathbb{E}[\nabla f(x_t) \cdot (x_t - z_t) + f(x_{t+1}) - f(x_t)]. \tag{20}$$

Focusing on the first term, as $z_t$ does not depend on $g_t$,

$$\mathbb{E}[\nabla f(x_t) \cdot (x_t - z_t)] = \mathbb{E}[g_t \cdot (x_t - z_t)] = \mathbb{E}[g_t \cdot (x_{t+1} - z_t) + g_t \cdot (x_t - x_{t+1})].$$

Note that the update step is

$$x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ f(x_t) + g_t \cdot (x - x_t) + \frac{1}{2\eta_t}\|x - x_t\|^2 \right\}.$$

From the first-order optimality condition,

$$\frac{1}{\eta_t}(x_{t+1} - x_t + \eta_t g_t) \cdot (z_t - x_{t+1}) \geq 0.$$

Rearranging,

$$g_t \cdot (x_{t+1} - z_t) \leq \frac{\|x_t - z_t\|^2 - \|x_{t+1} - z_t\|^2 - \|x_{t+1} - x_t\|^2}{2\eta_t}.$$
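As a sanity check, the three-point inequality above can be tested numerically. The sketch below is illustrative only and makes several assumptions that are not in the original text: the domain $\mathcal{X}$ is taken to be a Euclidean ball (so the argmin update reduces to a projected step), and the points $x_t$, $z_t$, the vector $g_t$, and the stepsize $\eta_t$ are drawn at random. The helper names `project_ball` and `three_point_gap` are hypothetical.

```python
import numpy as np


def project_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def three_point_gap(x_t, g_t, z_t, eta_t, radius):
    """Return RHS - LHS of the inequality; it should be nonnegative."""
    # Over a ball, the argmin update is a projected gradient step.
    x_next = project_ball(x_t - eta_t * g_t, radius)
    lhs = g_t @ (x_next - z_t)
    rhs = (
        np.sum((x_t - z_t) ** 2)
        - np.sum((x_next - z_t) ** 2)
        - np.sum((x_next - x_t) ** 2)
    ) / (2.0 * eta_t)
    return rhs - lhs


rng = np.random.default_rng(0)
radius, dim = 1.0, 5
gaps = []
for _ in range(1000):
    x_t = project_ball(rng.normal(size=dim), radius)  # current iterate in X
    z_t = project_ball(rng.normal(size=dim), radius)  # comparator in X
    g_t = 3.0 * rng.normal(size=dim)                  # arbitrary "gradient"
    eta_t = rng.uniform(0.01, 1.0)
    gaps.append(three_point_gap(x_t, g_t, z_t, eta_t, radius))

print("smallest gap (RHS - LHS):", min(gaps))
```

When the projection is inactive the gap is zero up to rounding, since the optimality condition then holds with equality; when the step leaves the ball, the gap is strictly positive.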

Thus,

$$\mathbb{E}[\nabla f(x_t) \cdot (x_t - z_t)] \leq \mathbb{E}\left[\frac{\|x_t - z_t\|^2 - \|x_{t+1} - z_t\|^2 - \|x_{t+1} - x_t\|^2}{2\eta_t} + g_t \cdot (x_t - x_{t+1})\right].$$

Returning to Eq. 20, we conclude that

$$\mathbb{E}[f(x_{t+1}) - f(z_t)] \leq \mathbb{E}\left[\frac{\|x_t - z_t\|^2 - \|x_{t+1} - z_t\|^2 - \|x_{t+1} - x_t\|^2}{2\eta_t} + f(x_{t+1}) - f(x_t) + g_t \cdot (x_t - x_{t+1})\right].$$ ∎

Appendix D Sensitivity of Fixed Stepsize Gradient Descent to Misspecification of the Stepsize

Given a $G$-Lipschitz function $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ is a convex set with diameter $D$, the standard average-iterate convergence guarantee of $T$-step Gradient Descent (GD) with a fixed stepsize $\eta > 0$ is

$$f\left(\frac{1}{T}\sum_{t=1}^{T} x_t\right) - \min_{x \in \mathcal{X}} f(x) \leq \mathsf{Rate}_{\mathsf{con},T}(\eta) \triangleq \frac{D^2}{2\eta T} + \frac{\eta G^2}{2}.$$

The optimal stepsize $\eta_{\mathsf{tu}} = \frac{D}{G\sqrt{T}}$ satisfies $\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}}) = \frac{DG}{\sqrt{T}}$. Given a multiplicative overestimation of the optimal stepsize, $\eta = \rho\eta_{\mathsf{tu}}$ for $\rho \geq 1$, the convergence guarantee is

$$\mathsf{Rate}_{\mathsf{con},T}(\rho\eta_{\mathsf{tu}}) = \mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}}) \left(\frac{1}{2\rho} + \frac{\rho}{2}\right) = \Omega\left(\rho\,\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})\right).$$
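To make the linear dependence on $\rho$ concrete, the following short sketch evaluates the bound $\mathsf{Rate}_{\mathsf{con},T}$ at the tuned stepsize and at several overestimates of it. The values of $D$, $G$, $T$ and the grid of $\rho$ are arbitrary choices for illustration, and the function names `rate_con` and `tuned_stepsize` are hypothetical.

```python
import math


def rate_con(eta, D, G, T):
    """Upper bound D^2 / (2 eta T) + eta G^2 / 2 for fixed-stepsize GD."""
    return D**2 / (2.0 * eta * T) + eta * G**2 / 2.0


def tuned_stepsize(D, G, T):
    """Minimizer of rate_con over eta: D / (G sqrt(T))."""
    return D / (G * math.sqrt(T))


# Illustrative problem constants (assumptions, not from the paper).
D, G, T = 2.0, 3.0, 10_000
eta_tu = tuned_stepsize(D, G, T)
base = rate_con(eta_tu, D, G, T)  # equals D * G / sqrt(T)

ratios = {}
for rho in (1, 2, 4, 8, 16):
    ratios[rho] = rate_con(rho * eta_tu, D, G, T) / base
    closed_form = 1.0 / (2.0 * rho) + rho / 2.0
    print(f"rho={rho:2d}  ratio={ratios[rho]:.4f}  closed form={closed_form:.4f}")
```

For large $\rho$ the ratio approaches $\rho / 2$, so doubling the misspecification factor roughly doubles the guarantee.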

A natural follow-up question is whether this linear dependence on $\rho$ is simply an artifact of the analysis or a true degradation in the convergence rate of GD. Next, we show that for any weights $w_1, \ldots, w_T$, the worst-case convergence rate of the (weighted) average iterate is $\Omega(\rho\,\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}}))$.

Let $T \in \mathbb{N}$, $D > 0$, $G > 0$, $0 < \rho < \frac{1}{2}\sqrt{T}$ and $w_1, \ldots, w_T > 0$. First, we assume that $w_1 + w_3 + \ldots + w_{2\lfloor (T-1)/2 \rfloor + 1} \geq w_2 + w_4 + \ldots + w_{2\lfloor T/2 \rfloor}$. Let $\eta = \frac{\rho D}{G\sqrt{T}}$ for some $\rho \geq 1$, $f(x) = G|x|$ defined over the domain $[-\frac{D}{2}, \frac{D}{2}]$, and let $x_1 = \frac{3}{4}G\eta$. After a single gradient step, $x_2 = x_1 - \eta G = -\frac{1}{4}G\eta$. After another update step, $x_3 = x_2 + \eta G = \frac{3}{4}G\eta = x_1$. Hence, the iterates will move back and forth between $\frac{3}{4}G\eta$ and $-\frac{1}{4}G\eta$, and the average iterate $\overline{x}$ will satisfy

$$\overline{x}=\frac{1}{\sum_{t=1}^{T}w_t}\sum_{t=1}^{T}w_t x_t=\frac{G\eta\left(3\sum_{t=1,3,\ldots}w_t-\sum_{t=2,4,\ldots}w_t\right)}{4\sum_{t=1}^{T}w_t}\geq\frac{G\eta\left(2\sum_{t=1,3,\ldots}w_t\right)}{8\sum_{t=1,3,\ldots}w_t}=\frac{\eta G}{4},$$

where we used our assumption that $w_1+w_3+\ldots+w_{2\lfloor (T-1)/2\rfloor+1}\geq w_2+w_4+\ldots+w_{2\lfloor T/2\rfloor}$. Hence,

$$f(\overline{x})\geq\frac{\eta G^2}{4}=\frac{\rho DG}{4\sqrt{T}}=\Omega(\rho\,\mathsf{Rate}_{\mathsf{con},T}(\eta_{\mathsf{tu}})).$$

If, on the other hand, it holds that $w_1+w_3+\ldots+w_{2\lfloor (T-1)/2\rfloor+1}<w_2+w_4+\ldots+w_{2\lfloor T/2\rfloor}$, we can initialize $x_1=-\frac{G\eta}{4}$, so that the even-indexed iterates sit at $\frac{3}{4}G\eta$, and mirroring the same argument concludes the proof.

Hence, the worst-case convergence rate of fixed stepsize GD degrades linearly in a multiplicative misspecification of the stepsize. As GD is a special case of SGD, the lower bound also holds for SGD with a second-moment bound $G^2$.
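The oscillation in this construction is easy to reproduce numerically. The following is a minimal sketch and not part of the original analysis: the function name `gd_abs_lower_bound`, the uniform default weights, and the parameter values in the usage lines are illustrative assumptions.

```python
import math


def gd_abs_lower_bound(T, D, G, rho, weights=None):
    """Run fixed-stepsize projected GD on f(x) = G*|x| over [-D/2, D/2].

    Returns (f(x_bar), rho*D*G / (4*sqrt(T))): the value at the weighted
    average iterate and the lower bound from the construction above.
    """
    eta = rho * D / (G * math.sqrt(T))      # misspecified stepsize
    if weights is None:
        weights = [1.0] * T                 # illustrative choice: uniform average
    odd_mass = sum(weights[0::2])           # w_1 + w_3 + ...
    even_mass = sum(weights[1::2])          # w_2 + w_4 + ...
    # Start so that the heavier half of the weights sits at 3/4 * G * eta.
    x = 0.75 * G * eta if odd_mass >= even_mass else -0.25 * G * eta

    iterates = []
    for _ in range(T):
        iterates.append(x)
        grad = G * ((x > 0) - (x < 0))      # subgradient of G*|x|
        x = min(max(x - eta * grad, -D / 2), D / 2)  # projected step

    x_bar = sum(w * xt for w, xt in zip(weights, iterates)) / sum(weights)
    return G * abs(x_bar), rho * D * G / (4 * math.sqrt(T))


value, bound = gd_abs_lower_bound(T=100, D=1.0, G=2.0, rho=4.0)
print(value, bound)
```

With uniform weights and even $T$ the two printed values agree up to floating-point rounding; weights that favor one parity push $f(\overline{x})$ strictly above the bound.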

Appendix E Convergence Analysis with Stepsize Schedules

In this section, we provide convergence guarantees for SGD with an annealed schedule in the convex Lipschitz and convex smooth settings. The guarantees are established by combining a last-iterate guarantee with Lemma 4, which translates the sums of stepsizes to integrals that depend on the schedule. The proofs follow.

See Lemma 1.

Note that when we tune $\eta$ according to Eq. 2, we obtain a convergence rate of

$$\frac{2DG}{\sqrt{T}}\sqrt{Q_h(0)/H_h(0)}+O\left(\frac{pDG/\sqrt{H_h(0)Q_h(0)}}{T^{3/2}}\right).$$
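To see where the leading term comes from, one can balance the two dominant terms of the Lemma 1 bound, $\frac{D^2}{2\eta T H_h(0)}$ and $2\eta G^2 Q_h(0)$. The sketch below assumes that this balancing yields $\eta=D/(2G\sqrt{T H_h(0)Q_h(0)})$, which is our reading of the tuning rule and is not copied from Eq. 2. It then checks numerically that the bound equals the displayed rate. The function names and parameter values are illustrative.

```python
import math


def lemma1_bound(eta, T, D, G, H0, Q0, p):
    """Right-hand side of the Lemma 1 bound, as it appears in its proof."""
    return D**2 / (2 * eta * T * H0) + 2 * eta * G**2 * Q0 + 8 * p * eta * G**2 / T


def balanced_eta(T, D, G, H0, Q0):
    """Stepsize minimizing the first two terms (assumed form of the tuned eta)."""
    return D / (2 * G * math.sqrt(T * H0 * Q0))


# Linear decay h(u) = 1 - u: H_h(0) = 1/2, Q_h(0) = 2, Lipschitz constant p = 1.
T, D, G, H0, Q0, p = 10_000, 1.0, 1.0, 0.5, 2.0, 1.0
eta = balanced_eta(T, D, G, H0, Q0)
leading = 2 * D * G / math.sqrt(T) * math.sqrt(Q0 / H0)
lower_order = 4 * p * D * G / (math.sqrt(H0 * Q0) * T**1.5)
print(lemma1_bound(eta, T, D, G, H0, Q0, p), leading + lower_order)
```

Under this choice the third term of the bound becomes $4pDG/(\sqrt{H_h(0)Q_h(0)}\,T^{3/2})$, which is the quantity hidden inside the $O(\cdot)$.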

See Lemma 2.

Similarly, when we tune $\eta$ according to Eq. 3, we obtain a convergence rate of

$$\frac{\beta D^2 h(0)}{T H_h(0)}+\frac{D\sigma}{\sqrt{T}}\sqrt{2Q_h(0)/H_h(0)}+O\left(\frac{pD\sigma/\sqrt{H_h(0)Q_h(0)}}{T^{3/2}}\right).$$

Note that using the fact that $h$ is non-increasing and the Lipschitz condition,

$$h(0)\geq H_h(0)=\int_0^1 h(u)\,du\geq\int_0^{\min\{1,h(0)/2p\}}\frac{1}{2}h(0)\,du=\frac{1}{2}h(0)\min\{1,h(0)/2p\}.$$

Additionally,

$$Q_h(0)=\int_0^1\frac{h(u)^2}{H_h(u)}\,du\geq\int_0^1 h(u)\,du=H_h(0)\geq\frac{1}{2}h(0)\min\{1,h(0)/2p\}$$

and using Eq. 8,

$$Q_h(0)=\int_0^1\frac{H_h'(u)^2}{H_h(u)}\,du\leq 2p.$$

Hence, assuming $h(0)=\Theta(1)$ and $p=\Theta(1)$, $H_h(0)$ and $Q_h(0)$ are $\Theta(1)$, and the rates above match those of optimally tuned fixed stepsize SGD up to constant factors.
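As a numerical sanity check of these bounds, the sketch below estimates $H_h(0)$ and $Q_h(0)$ for three concrete schedules and compares them with the lower bound $\frac{1}{2}h(0)\min\{1,h(0)/2p\}$ and the upper bound $2p$. It assumes that $H_h(u)=\int_u^1 h(v)\,dv$ and that $p$ is the Lipschitz constant of $h$, matching how the Lipschitz condition is used above. The helper name `H_and_Q`, the chosen schedules, and the midpoint-rule resolution are illustrative.

```python
import math


def H_and_Q(h, H, n=2000):
    """Midpoint-rule estimates of H_h(0) = int_0^1 h and Q_h(0) = int_0^1 h^2 / H_h.

    `h` is the schedule and `H` its tail integral H_h(u) = int_u^1 h(v) dv,
    supplied in closed form to avoid nested quadrature.
    """
    us = [(i + 0.5) / n for i in range(n)]   # midpoints avoid u = 1, where H_h = 0
    H0 = sum(h(u) for u in us) / n
    Q0 = sum(h(u) ** 2 / H(u) for u in us) / n
    return H0, Q0


schedules = {
    # name: (h, H_h, Lipschitz constant p of h)
    "linear": (lambda u: 1 - u, lambda u: (1 - u) ** 2 / 2, 1.0),
    "quadratic": (lambda u: (1 - u) ** 2, lambda u: (1 - u) ** 3 / 3, 2.0),
    "cosine": (
        lambda u: 0.5 * (1 + math.cos(math.pi * u)),
        lambda u: 0.5 * (1 - u) - math.sin(math.pi * u) / (2 * math.pi),
        math.pi / 2,
    ),
}

results = {}
for name, (h, H, p) in schedules.items():
    H0, Q0 = H_and_Q(h, H)
    lower = 0.5 * h(0) * min(1.0, h(0) / (2 * p))
    results[name] = (H0, Q0, lower, 2 * p)
    print(f"{name}: H={H0:.4f}, Q={Q0:.4f}, lower={lower:.4f}, 2p={2 * p:.4f}")
```

For polynomial decay $h(u)=(1-u)^p$ the integrals have closed forms, $H_h(0)=1/(p+1)$ and $Q_h(0)=(p+1)/p$, so the linear schedule attains the upper bound $Q_h(0)=2p$ with equality.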

E.1 Proofs of Lemmas 1 and 2

Proof of Lemma 1.

By Lemma 3 with $\hat{x}=x^{\star}$,

$$\mathbb{E}[f(x_{T+1})-f(x^{\star})]\leq\frac{D^2}{2\sum_{s=1}^{T}\eta_s}+2G^2\sum_{t=1}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}.$$

Using Lemma 4 with $c_1=D^2/2$, $c_2=2G^2$, $k=1$ and $\tau=0$,

$$\mathbb{E}[f(x_{T+1})-f(x^{\star})]\leq\frac{D^2}{2\eta T H_h(0)}+2\eta G^2\int_0^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du+\frac{8p\eta G^2}{T}\leq\frac{D^2}{2\eta T H_h(0)}+2\eta G^2 Q_h(0)+\frac{8p\eta G^2}{T},$$

where the last inequality follows from the fact that $h(u)$ and $H_h(u)$ are non-negative and from the definition of $Q_h(u)$. ∎
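The passage from stepsize sums to integrals can be checked numerically on a concrete schedule. The sketch below evaluates the discrete last-iterate bound and the final integral bound for linear decay $h(u)=1-u$, for which $H_h(0)=1/2$, $Q_h(0)=2$ and $p=1$. It assumes the stepsizes are indexed as $\eta_t=\eta h((t-1)/T)$, which is consistent with $\eta_1=\eta h(0)$ but is otherwise an assumption. The function names and parameter values are illustrative.

```python
def last_iterate_sums(eta, T, D, G, h):
    """Discrete bound from Lemma 3 with eta_t = eta * h((t - 1) / T) (assumed indexing)."""
    etas = [eta * h((t - 1) / T) for t in range(1, T + 1)]
    tail = 0.0
    tails = [0.0] * T
    for i in range(T - 1, -1, -1):           # tails[i] = eta_{i+1} + ... + eta_T
        tail += etas[i]
        tails[i] = tail
    noise = sum(e**2 / s for e, s in zip(etas, tails))
    return D**2 / (2 * tails[0]) + 2 * G**2 * noise


def integral_bound(eta, T, D, G, H0, Q0, p):
    """Final bound of Lemma 1 in terms of H_h(0) and Q_h(0)."""
    return D**2 / (2 * eta * T * H0) + 2 * eta * G**2 * Q0 + 8 * p * eta * G**2 / T


# Linear decay h(u) = 1 - u, for which H_h(0) = 1/2, Q_h(0) = 2 and p = 1.
T, D, G, eta = 1000, 1.0, 1.0, 0.01
discrete = last_iterate_sums(eta, T, D, G, lambda u: 1 - u)
continuous = integral_bound(eta, T, D, G, H0=0.5, Q0=2.0, p=1.0)
print(discrete, continuous)
```

For this schedule the discrete quantity stays below the integral bound, as the lemma predicts; other schedules can be tried by passing a different `h` together with its own $H_h(0)$, $Q_h(0)$ and $p$.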

Proof of Lemma 2.

As $\eta_1=\eta h(0)\leq\frac{1}{2\beta}$ and $h$ is non-increasing, $\eta_t\leq\frac{1}{2\beta}$ and we can use Lemma 5 with $\hat{x}=x^{\star}$, obtaining

$$\mathbb{E}[f(x_{T+1})-f(x^{\star})]\leq\frac{D^2}{2\sum_{s=k}^{T}\eta_s}+\sigma^2\sum_{t=k}^{T}\frac{\eta_t^2}{\sum_{s=t}^{T}\eta_s}.$$

Invoking Lemma 4 with $c_1=D^2/2$, $c_2=\sigma^2$, $k=1$ and $\tau=0$,

$$\mathbb{E}[f(x_{T+1})-f(x^{\star})]\leq\frac{D^2}{2\eta T H_h(0)}+\eta\sigma^2\int_0^{1-\frac{1}{T}}\frac{h(u)^2}{H_h(u)}\,du+\frac{4\eta p\sigma^2}{T}\leq\frac{D^2}{2\eta T H_h(0)}+\eta\sigma^2 Q_h(0)+\frac{4p\eta\sigma^2}{T},$$

where the last inequality follows from the fact that $h(u)$ and $H_h(u)$ are non-negative and from the definition of $Q_h(u)$. ∎