License: arXiv.org perpetual non-exclusive license
arXiv:2305.17665v2 [cs.LG] 01 Feb 2024

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Kejie Tang (School of Mathematical Sciences, Shanghai Jiao Tong University), Weidong Liu (School of Mathematical Sciences, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University), Yichen Zhang (Daniels School of Business, Purdue University), Xi Chen (Stern School of Business, New York University)
Abstract

Stochastic gradient descent with momentum (SGDM) has been widely used in many machine learning and statistical applications. Despite the observed empirical benefits of SGDM over traditional SGD, the theoretical understanding of the role of momentum for different learning rates in the optimization process remains largely open. We analyze the finite-sample convergence rate of SGDM in the strongly convex setting and show that, with a large batch size, the mini-batch SGDM converges faster than the mini-batch SGD to a neighborhood of the optimal value. Additionally, our findings, supported by theoretical analysis and numerical experiments, indicate that SGDM permits broader choices of learning rates. Furthermore, we analyze the Polyak-averaging version of the SGDM estimator, establish its asymptotic normality, and justify its asymptotic equivalence to the averaged SGD. The asymptotic distribution of the averaged SGDM enables uncertainty quantification of the algorithm output and statistical inference of the model parameters.

1 Introduction

In this paper, we are interested in solving the following stochastic optimization problem:

$$x^{*}=\operatorname*{argmin}_{x\in\mathbb{R}^{d}}\mathbb{E}[f_{\xi}(x)], \tag{1.1}$$

where $f_{\xi}(x)$ is a stochastic convex loss function, $\xi$ is a random element, and $\mathbb{E}[\cdot]$ is the expectation with respect to $\xi$. Stochastic optimization plays an important role in many statistics and machine learning problems. For a wide range of applications, the objective function is strongly convex, and stochastic gradient descent (SGD) is often preferred over gradient descent (GD) (see, e.g., Nesterov, 2003; Nocedal and Wright, 2006) due to its computational advantage. SGD is a first-order optimization algorithm that approximates the expected loss by averaging the loss function over a mini-batch of training examples. At each iteration, the algorithm updates the model parameters in the direction of the negative gradient of the mini-batch loss, scaled by a learning rate parameter.

While SGD is simple and easy to implement, it may suffer from slow convergence rates or oscillations in high-dimensional optimization problems, particularly when the loss function is noisy and ill-conditioned. Momentum-based methods enhance SGD by introducing an exponentially weighted moving average of the past gradients to the update rule, which serves to dampen oscillations and accelerate convergence, allowing the algorithm to maintain a more consistent direction of movement even in the presence of noisy gradients. SGD with momentum (SGDM) has become increasingly popular in modern applications, e.g., large-scale deep neural networks. Numerical studies have shown that the use of momentum-based optimization methods improves the convergence rate and the generalization performance, and reduces the sensitivity to the choice of hyperparameters. However, the theoretical analysis of SGDM is still an active area of research.

In this paper, we focus on the theoretical analysis from an optimization perspective under strong convexity, which ensures the existence of a unique and well-defined global minimizer. Our goal is to establish the convergence properties of SGDM to provide insights into the role of momentum and other hyperparameters in the optimization process. For solving (1.1), SGDM updates the target estimator through a weighted combination of historical gradients:

$$m_{t+1}=\gamma m_{t}+(1-\gamma)\nabla f_{\xi_{t}}(x_{t}), \tag{1.2}$$
$$x_{t+1}=x_{t}-\alpha m_{t+1}, \tag{1.3}$$

where $\xi_{t}$, $t\geq 1$, are i.i.d. samples from the distribution of $\xi$, $\gamma$ is the momentum weight, and $\alpha$ is the learning rate. The classical SGD is a special case of SGDM with $\gamma=0$ and $m_{t+1}=\nabla f_{\xi_{t}}(x_{t})$.
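The recursion (1.2)-(1.3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sgdm`, the callable `noisy_grad`, the zero initialization of the momentum buffer, and the toy loss $f_{\xi}(x)=\frac{1}{2}\|x-\xi\|^{2}$ are all assumptions made for the example.

```python
import numpy as np

def sgdm(grad, x0, alpha, gamma, steps, rng):
    """Run the SGDM recursion (1.2)-(1.3) and return every iterate.

    grad(x, rng) is a hypothetical callable returning one stochastic gradient.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # assumed initialization of the momentum buffer
    path = [x.copy()]
    for _ in range(steps):
        m = gamma * m + (1.0 - gamma) * grad(x, rng)  # (1.2)
        x = x - alpha * m                             # (1.3)
        path.append(x.copy())
    return np.array(path)

# Toy problem (assumed): f_xi(x) = 0.5 * ||x - xi||^2 with xi ~ N(x_star, 0.01 I),
# so the minimizer of E[f_xi] is x_star and the stochastic gradient is x - xi.
x_star = np.array([1.0, -2.0])

def noisy_grad(x, rng):
    xi = x_star + 0.1 * rng.standard_normal(x.shape)
    return x - xi

rng = np.random.default_rng(0)
path = sgdm(noisy_grad, x0=np.zeros(2), alpha=0.1, gamma=0.9, steps=2000, rng=rng)
final_error = np.linalg.norm(path[-1] - x_star)
```

Setting `gamma=0.0` in this sketch reduces the loop to plain SGD with step size `alpha`, matching the special case noted above.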

In practice, a large momentum weight, for example $\gamma=0.9$, is often used to accelerate the algorithm. That being said, the theoretical properties of SGDM under a general specification of $\gamma$ remain an open problem. Existing theoretical analyses have neither given an affirmative answer to the open problem posed by Liu et al. (2020) nor settled the assertion by Kidambi et al. (2018). In this paper, we consider the mini-batch SGDM with batch size $B$. We establish finite-sample rates for SGDM and the averaging of its trajectories under smooth and strongly convex loss functions. Our results give partial answers to these open questions and make related contributions in several aspects.

  1. Drawing upon the assertion presented in Kidambi et al. (2018), this study addresses the open question posed by Liu et al. (2020) by demonstrating that mini-batch SGDM, with appropriate momentum weights, converges to a local neighborhood of the minimum at a faster rate than SGD. We rigorously establish the finite-sample convergence rate, and we further provide an adaptive choice of the momentum weight which theoretically attains the optimal convergence rate. Our experiments support the theoretical results on convergence and the optimal momentum weight.

  2. We provide a non-asymptotic analysis of the Polyak-Ruppert averaging version of SGDM. The averaging of SGDM trajectories $\bar{x}_{t}=\frac{1}{t-n_{0}}\sum_{i=n_{0}+1}^{t}x_{i}$ can accelerate SGDM with a wide range of learning rates $\alpha$ to the rate $\mathcal{O}(\frac{\sigma}{\sqrt{t}})$, where $\sigma^{2}$ is the variance of the stochastic gradient. Furthermore, we show theoretically that averaged SGDM converges faster than averaged SGD in the early phases of the iteration process. Moreover, through both theoretical analysis and numerical experiments, we demonstrate that SGDM is less sensitive to the choice of learning rates, and that the averaged SGDM is less sensitive to the iteration $n_{0}$ at which averaging starts.

  3. We further establish the asymptotic normality of the averaged iterate $\bar{x}_{t}$ as $t\rightarrow\infty$ with a decaying learning rate $\alpha$ or a diverging batch size $B$. In particular, for $\alpha=\Theta(t^{-\epsilon})$ with $\epsilon\in(\frac{1}{2},1)$, the asymptotic normality holds for averaged SGDM under strongly convex loss functions. The asymptotic covariance matrix depends only on the Hessian and Gram matrix of the stochastic gradients at $x^{*}$, and the batch size $B$. Interestingly, mini-batch averaged SGDM is asymptotically equivalent to mini-batch averaged SGD as $t\rightarrow\infty$. We further demonstrate that the optimal learning rates of averaged SGDM that correspond to asymptotic normality are $\alpha=\Theta(t^{-1/2})$ for quadratic losses and $\alpha=\Theta(t^{-2/3})$ for general strongly convex losses. To the best of our knowledge, this is the first work to analyze the convergence of averaged SGDM to asymptotic normality under mini-batching. The results enable us to perform uncertainty quantification for the output $\bar{x}_{t}$ of the averaged SGDM algorithm, and statistical inference for the model parameters $x^{*}$.
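The averaged iterate $\bar{x}_{t}=\frac{1}{t-n_{0}}\sum_{i=n_{0}+1}^{t}x_{i}$ from the second contribution can be computed either from a stored trajectory or online. The sketch below shows both; the function names and the convention that `path[i]` stores $x_{i}$ (starting from `path[0]`) are assumptions of this example, not notation from the paper.

```python
import numpy as np

def averaged_iterate(path, n0):
    """Return x_bar_t, the mean of x_{n0+1}, ..., x_t, where path[i] stores x_i."""
    path = np.asarray(path, dtype=float)
    t = path.shape[0] - 1  # path holds x_0, ..., x_t
    if not 0 <= n0 < t:
        raise ValueError("need 0 <= n0 < t")
    return path[n0 + 1:].mean(axis=0)

def running_average(path, n0):
    """Same quantity, updated one iterate at a time without storing a sum."""
    x_bar = np.zeros_like(np.asarray(path[0], dtype=float))
    for k, x_i in enumerate(path[n0 + 1:], start=1):
        # After k iterates, x_bar is the mean of the first k post-burn-in iterates.
        x_bar = x_bar + (np.asarray(x_i, dtype=float) - x_bar) / k
    return x_bar

# Deterministic toy trajectory x_0, ..., x_4 in R^2 (rows are iterates).
path = np.arange(10, dtype=float).reshape(5, 2)
x_bar = averaged_iterate(path, n0=2)        # mean of x_3 = [6, 7] and x_4 = [8, 9]
x_bar_online = running_average(path, n0=2)
```

The online form needs only the current average and a counter, which is convenient when the trajectory is too long to store.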

In addition to the convergence rate under pre-specified learning rates, the selection of appropriate learning rates is a critical aspect when evaluating the performance of optimizers, as either an excessively high or low learning rate can detrimentally affect convergence. We study the sensitivity of the convergence over different learning rates. Extensive research has been conducted to tackle this challenge and accelerate convergence, such as Zeiler (2012); Kingma and Ba (2014). Recent results (Paquette and Paquette, 2021; Bollapragada et al., 2022) demonstrated the acceleration of SGDM on quadratic forms, but only under a range of allowable learning rates that shrinks as the momentum grows. This finding contradicts the advantages of SGDM observed in practice and indicates an increased sensitivity to the choice of learning rates in theoretical analysis.

1.1 Related Works

Since the seminal work by Robbins and Monro (1951), the convergence properties of SGD have been extensively studied in the literature (see, e.g., Moulines and Bach, 2011; Bottou et al., 2018; Nguyen et al., 2018). In contrast, the theory for SGDM had rarely been explored until recently, although SGDM is very popular for improving the training speed and accuracy of many modern machine learning models such as neural networks.

Kidambi et al. (2018) showed that (1.2) cannot achieve any improvement over SGD for a specially constructed linear regression problem. Together with some numerical experiments, they asserted that the only reason for the superiority of stochastic momentum methods in practice is mini-batching. In a recent paper, Liu et al. (2020) proved that SGDM can be as fast as SGD by establishing the same convergence bound as for SGD. They also posed an open problem of whether it is possible to show that SGDM converges faster than SGD for special objectives such as quadratic ones. Other studies on the theoretical analysis of SGDM can be found in Loizou and Richtárik (2017), Loizou and Richtárik (2020), and Gitman et al. (2019) for linear systems and quadratic losses, and in Sebbouh et al. (2021) for almost sure convergence rates under smooth and convex loss functions with a time-varying momentum weight. Mai and Johansson (2020) studied a class of general convex losses and obtained convergence rates of time averages regardless of the momentum weight. We refer the readers to the latter paper and Liu et al. (2020) for a few more papers on the convergence rate for SGDM. Richtárik and Takác (2020) considered the problem of solving a consistent linear system $Ax=b$ in least-squares regression. SGDM is also called the stochastic heavy-ball method, which originates from Polyak’s heavy-ball method for deterministic optimization (Polyak, 1964). Some works (Loizou and Richtárik, 2017; Kidambi et al., 2018; Loizou and Richtárik, 2020; Paquette and Paquette, 2021; Bollapragada et al., 2022; Lee et al., 2022) based on the stochastic heavy-ball method (SHB) achieved a linear convergence rate. Yang et al. (2016); Defazio (2020); Jin et al. (2022); Li et al. (2022) investigated the impact of momentum on the convergence properties of non-convex optimization problems and provided insights into its practical use. Gitman et al. (2019) focused on the noise reduction properties of momentum.

The averaged SGDM has not been widely analyzed in the literature, but averaging tools have been studied for SGD and its variants since Ruppert (1988); Polyak and Juditsky (1992); Moulines and Bach (2011) to accelerate the convergence and establish the asymptotic normality. Building statistical inference and uncertainty quantification for SGD iterates has been an emerging topic, and an extensive literature has followed, including Chen et al. (2020); Su and Zhu (2023); Zhu et al. (2023), which studied averaged SGD and provided inference procedures based on plug-in, batch-means, and tree-based construction of the confidence intervals, respectively. Zhu and Dong (2021); Lee et al. (2022) applied a process-level functional central limit theorem and utilized it to construct confidence regions seamlessly. Toulis et al. (2021); Chen et al. (2023) studied the uncertainty quantification of the implicit and gradient-free variants of the SGD procedures.

2 Preliminaries

In this section, we present the mini-batch SGDM settings considered in this paper. Particularly, we define a mini-batch stochastic loss

$$g_{\eta_{t}}(x)=\frac{1}{B}\sum_{i=1}^{B}f_{\xi_{ti}}(x), \tag{2.4}$$

where $\eta_{t}=\{\xi_{t1},\ldots,\xi_{tB}\}$ is a mini-batch of size $B$ and the elements $\xi_{ti}$ are sampled i.i.d. from the distribution of $\xi$. The update of $m_{t}$ in (1.2) uses the mini-batch stochastic gradient $\nabla g_{\eta_{t}}(x_{t})$ instead of the individual stochastic gradient $\nabla f_{\xi_{ti}}(x_{t})$.
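To make the mini-batch gradient concrete, the sketch below averages $B$ per-sample gradients as in (2.4). The least-squares loss $f_{\xi}(x)=\frac{1}{2}(a^{\top}x-b)^{2}$ and the function names are assumptions chosen for illustration; any differentiable per-sample loss could take their place.

```python
import numpy as np

def per_sample_grad(x, a, b):
    """Gradient of the assumed toy loss f_xi(x) = 0.5 * (a @ x - b)**2."""
    return (a @ x - b) * a

def minibatch_grad(x, A, b):
    """Gradient of g_eta(x) in (2.4): the mean of the B per-sample gradients.

    A has shape (B, d) with rows a_i; b has shape (B,).
    """
    residual = A @ x - b                # shape (B,)
    return A.T @ residual / A.shape[0]  # shape (d,)

rng = np.random.default_rng(1)
B, d = 8, 3
A = rng.standard_normal((B, d))
b = rng.standard_normal(B)
x = rng.standard_normal(d)

g_batch = minibatch_grad(x, A, b)
# Reference computation: explicit average over the batch.
g_loop = np.mean([per_sample_grad(x, A[i], b[i]) for i in range(B)], axis=0)
```

The vectorized and looped versions agree up to floating-point rounding; the vectorized form is what one would pass to an SGDM loop in place of the single-sample gradient.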

We first introduce some regularity assumptions and discuss their use in the theoretical results. We denote the $\ell_{2}$-norm of a vector as $\|\cdot\|$ and the operator norm as $\|A\|=\max_{\|x\|=1}\|Ax\|$.

(A1). Assume that the loss function $f_{\xi}(x)$ is twice differentiable. Let $\kappa_{1}\leq\cdots\leq\kappa_{d}$ be the eigenvalues of the Hessian matrix $\Sigma=\mathbb{E}[\nabla^{2}f_{\xi}(x^{*})]$. Assume that $\mu:=\kappa_{1}>0$ and $L:=\kappa_{d}<\infty$.

(A2). There exists a constant $\overline{L}\geq 0$ such that the Hessian matrix of the loss $f(x):=\mathbb{E}[f_{\xi}(x)]$, $\Sigma(x)=\nabla^{2}f(x)$, satisfies $\|\Sigma(x)-\Sigma(x^{*})\|\leq\overline{L}\|x-x^{*}\|$, $\forall x$.

(A3). Define $L_{\xi}=\sup_{x}\frac{\|\nabla f_{\xi}(x)-\nabla f_{\xi}(x^{*})\|}{\|x-x^{*}\|}$. Assume that $\mathbb{E}[\|\nabla f_{\xi}(x^{*})\|^{2}]\leq\sigma^{2}$ and $\mathbb{E}[L_{\xi}^{2}]\leq L_{f}^{2}$.

(A3’). Assume that $\nabla f_{\xi}(x^{*})$ and $L_{\xi}$ satisfy

$$\sup_{\|v\|=1}\mathbb{E}\left[\exp\left(\left|v^{\top}\nabla f_{\xi}(x^{*})\right|/\sigma\right)\right]\leq 2,\quad\mathbb{E}\left[\exp\left(L_{\xi}/L_{f}\right)\right]\leq 2.$$

A few remarks on the assumptions above follow. Assumption (A1) is a regularity condition on the smoothness and strong convexity of the loss function $f_{\xi}(x)$ at $x^{*}$. Beyond that, (A2) assumes a Lipschitz condition on the Hessian matrix of the loss function. Notably, the $\Sigma$ defined in (A1) is shorthand for $\mathbb{E}[\Sigma_{\xi}(x^{*})]$ defined in (A2). For quadratic losses, the Hessian matrix $\Sigma(x)$ is identical for different $x$, and therefore (A2) holds with $\overline{L}=0$. For general strongly convex losses, we illustrate (A2) under a logistic regression setting in Example 1 below. In the following, we consider two scenarios: quadratic losses and general (strongly convex) losses.

Assumptions (A3) and (A3’) are two separate conditions on the smoothness of quadratic and general stochastic loss functions, respectively. Assumption (A3) is weaker than (A3’) and indeed weaker than many in the literature, such as those in Yan et al. (2018); Liu et al. (2020) that $\mathbb{E}[\|\nabla f_{\xi}(x)-\nabla f(x)\|^{2}]\leq\sigma^{2}$ for all $x$, which does not hold for linear regression problems with an unbounded domain of $x$. On the other hand, Bottou et al. (2018); Wang and Johansson (2022) assumed that $\mathbb{E}[\|\nabla f_{\xi}(x)-\nabla f(x)\|^{2}]\leq\rho\|\nabla f(x)\|^{2}+\sigma^{2}$ for $\rho>0$. Meanwhile, our assumptions lead to $\mathbb{E}[\|\nabla f_{\xi}(x)-\nabla f(x)\|^{2}]\leq\frac{3(L_{f}^{2}+L^{2})}{\mu^{2}}\|\nabla f(x)\|^{2}+3\sigma^{2}$, due to the triangle inequality and strong convexity. For any bounded domain $\{\|x\|\leq r\}$, (A3) and (A3’) are easily satisfied for bounded gradients and smoothness. For unbounded problems, they are usually satisfied given certain design properties in many popular statistical models. For instance, we illustrate the assumptions under logistic regression in the following example.
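The variance bound stated above can be traced through a short calculation. The following is a sketch rather than the authors' proof: it assumes, beyond (A1) and (A3), that $f$ is globally $\mu$-strongly convex and that $\nabla f$ is globally $L$-Lipschitz, and it uses $\nabla f(x^{*})=0$. The labels $a$, $b$, $c$ are introduced only for this derivation.

```latex
\begin{align*}
\nabla f_{\xi}(x)-\nabla f(x)
 &= \underbrace{\nabla f_{\xi}(x)-\nabla f_{\xi}(x^{*})}_{a}
  + \underbrace{\nabla f_{\xi}(x^{*})}_{b}
  - \underbrace{\bigl(\nabla f(x)-\nabla f(x^{*})\bigr)}_{c},\\
\mathbb{E}\|a+b-c\|^{2}
 &\leq 3\,\mathbb{E}\|a\|^{2}+3\,\mathbb{E}\|b\|^{2}+3\,\|c\|^{2}
  \leq 3L_{f}^{2}\|x-x^{*}\|^{2}+3\sigma^{2}+3L^{2}\|x-x^{*}\|^{2},\\
\|x-x^{*}\|^{2}
 &\leq \mu^{-2}\|\nabla f(x)\|^{2}
  \quad\text{(strong convexity)},\\
\mathbb{E}\|\nabla f_{\xi}(x)-\nabla f(x)\|^{2}
 &\leq \frac{3(L_{f}^{2}+L^{2})}{\mu^{2}}\|\nabla f(x)\|^{2}+3\sigma^{2}.
\end{align*}
```

Here $\mathbb{E}\|a\|^{2}\leq\mathbb{E}[L_{\xi}^{2}]\|x-x^{*}\|^{2}$ and $\mathbb{E}\|b\|^{2}\leq\sigma^{2}$ follow from (A3), and the factor 3 comes from $\|a+b-c\|^{2}\leq 3(\|a\|^{2}+\|b\|^{2}+\|c\|^{2})$.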

Example 1

Consider an $\ell_{2}$-regularized logistic regression model with samples $\{a_{\xi},b_{\xi}\}$ such that $a_{\xi}\in\mathbb{R}^{d}$ and $b_{\xi}\in\{0,1\}$ are generated by $b_{\xi}=1$ with probability $p_{x}(a_{\xi})$ and $b_{\xi}=0$ otherwise, where $p_{x}(a)=1/(1+\exp(-x^{\top}a))$. We consider the loss function defined as:

$$f_{\xi}(x)=-b_{\xi}\log(p_{x}(a_{\xi}))-(1-b_{\xi})\log(1-p_{x}(a_{\xi}))+\frac{\nu}{2}\|x\|^{2}.$$

The gradient and the Hessian matrix are

$$\nabla f_{\xi}(x)=(p_{x}(a_{\xi})-b_{\xi})a_{\xi}+\nu x,\quad\Sigma(x)=\mathbb{E}[p_{x}(a_{\xi})(1-p_{x}(a_{\xi}))a_{\xi}a_{\xi}^{\top}]+\nu I_{d}.$$

By the fact that $|p_{x}(a)-p_{y}(a)|\leq\frac{\sqrt{3}}{18}\|a\|\|x-y\|$, (A2) is satisfied with $\overline{L}=\frac{\sqrt{3}}{6}\mathbb{E}\|a_{\xi}\|^{3}+\nu$. Suppose $\{a_{\xi}\}$ in logistic regression satisfy $\mathbb{E}[a_{\xi}]=0_{d}$ and $\sup_{\|v\|=1}\mathbb{E}\big[\exp\big(|v^{\top}a_{\xi}|^{2}/\sigma^{2}\big)\big]\leq 2$. Then we have that $|p_{x}(a)-b|\leq 2$, $\|(p_{x}(a)-p_{y}(a))a\|\leq\frac{\sqrt{3}}{18}\|a\|^{2}\|x-y\|$, and therefore,

$$\sup_{\|v\|=1}\mathbb{E}\left[\exp\left(\left|v^{\top}\nabla f_{\xi}(x^{*})\right|/(c\sigma)\right)\right]\leq 2,\quad\mathbb{E}\left[\exp\left(L_{\xi}/(cL_{f})\right)\right]\leq 2,$$

for some absolute constant $c>0$ and $L_{f}=\mathbb{E}\|a_{\xi}\|^{2}+\nu$.
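The gradient and Hessian formulas of Example 1 can be checked numerically. The sketch below implements the per-sample loss, its gradient, and the per-sample Hessian (the integrand of $\Sigma(x)$ plus $\nu I_{d}$), then compares the gradient against central finite differences. The function names, the random test point, and the values of `d` and `nu` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, a, b, nu):
    """Regularized logistic loss f_xi(x) for one sample (a, b), b in {0, 1}."""
    z = a @ x
    # -log p = log(1 + e^{-z}) and -log(1 - p) = log(1 + e^{z}), computed stably.
    return b * np.logaddexp(0.0, -z) + (1 - b) * np.logaddexp(0.0, z) + 0.5 * nu * (x @ x)

def grad(x, a, b, nu):
    """Gradient (p_x(a) - b) a + nu x."""
    return (sigmoid(a @ x) - b) * a + nu * x

def sample_hessian(x, a, nu):
    """Per-sample Hessian p (1 - p) a a^T + nu I; its expectation is Sigma(x)."""
    p = sigmoid(a @ x)
    return p * (1.0 - p) * np.outer(a, a) + nu * np.eye(x.size)

rng = np.random.default_rng(2)
d, nu = 4, 0.1
x = rng.standard_normal(d)
a = rng.standard_normal(d)
b = 1.0

# Central finite differences along each coordinate direction.
eps = 1e-6
fd_grad = np.array([
    (loss(x + eps * e, a, b, nu) - loss(x - eps * e, a, b, nu)) / (2 * eps)
    for e in np.eye(d)
])
grad_error = np.max(np.abs(fd_grad - grad(x, a, b, nu)))
min_eig = np.linalg.eigvalsh(sample_hessian(x, a, nu)).min()
```

Because $p(1-p)\,aa^{\top}$ is positive semidefinite, the smallest eigenvalue of the per-sample Hessian is at least $\nu$, which is how the regularizer supplies strong convexity in this example.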

3 Finite-sample Convergence Rates for SGDM

In this section, we first present the finite-sample convergence results for SGDM with general momentum weight $\gamma$. We consider the two cases separately: $\overline{L} = 0$ corresponds to the quadratic losses, and $\overline{L} > 0$ corresponds to general strongly convex losses.

3.1 The finite-sample rates for SGDM on quadratic losses

We first establish the finite-sample rates under quadratic losses, where

$$f_\xi(x) = \frac{1}{2}x^\top A_\xi x - b_\xi^\top x + c. \tag{3.5}$$

The Hessian matrix of the loss function is $\Sigma = \mathbb{E}[A_\xi]$. The stochastic loss is $g_{\eta_t}(x) = \frac{1}{B}\sum_{i=1}^{B} f_{\xi_{ti}}(x)$ in the mini-batch setting. The following theorem shows the convergence rate of the last iterate $x_{t+1}$.

Theorem 1

Under (A1)-(A3) and $\overline{L} = 0$, for any momentum $\gamma \in [0,1)$ and fixed $\delta \in (0,1]$, assume the learning rate $\alpha > 0$ satisfies $\alpha L < 2(1+\gamma)/(1-\gamma)$ and $16M^2\alpha^2 L_f^2 \le B\delta\lambda^{2(1-\delta)}(1-\lambda)$, where

$$M = \frac{4}{\sqrt{\Delta}}\left(2(1-\gamma)(1+\alpha L+L) + 3\alpha\gamma\right), \qquad \Delta = \min_k\left\{\left|\left(\gamma + 1 - \alpha(1-\gamma)\kappa_k\right)^2 - 4\gamma\right|\right\} > 0, \tag{3.6}$$

and $\lambda$ is the spectral radius of the matrix

$$\Gamma = \begin{pmatrix} \gamma I & (1-\gamma)\Sigma \\ -\alpha\gamma I & I - \alpha(1-\gamma)\Sigma \end{pmatrix}. \tag{3.9}$$

Let $\widetilde{m}_{t+1} = (1-\gamma)\sum_{j=1}^{t}\gamma^{t-j}\Sigma(x_j - x^*)$, we have for $t \ge 1$,

$$\mathbb{E}\left[\|\widetilde{m}_{t+1}\|^2 + \|x_{t+1} - x^*\|^2\right] \le 2M^2\left(\frac{4}{B(1-\lambda)}\alpha^2\sigma^2 + \|x_1 - x^*\|^2\lambda^{2(1-\delta)t}\right). \tag{3.10}$$

Theorem 1 provides a finite-sample bound simultaneously for the error of the last iterate $x_{t+1} - x^*$ and for a weighted average of the stochastic gradients $\widetilde{m}_{t+1}$. The first term in the error bound (3.10), $\frac{8M^2}{B(1-\lambda)}\alpha^2\sigma^2$, corresponds to a non-decaying bias, due to the noisy observation in the stochastic gradient. The bias term is proportional to the squared learning rate $\alpha^2$ and the variance of the stochastic gradient $\sigma^2$, which is the same as the one in SGD. The second term is exponentially decaying when $\lambda < 1$, and establishes the convergence of the SGDM algorithm from any initialization $x_1$ to the true solution $x^*$. The linear convergence is up to a neighborhood of $x^*$ with size $\sqrt{\frac{8M^2}{B(1-\lambda)}}\,\alpha\sigma$.
With a larger batch size $B$ or a smaller $\sigma$, the size of the neighborhood will be smaller, which aligns with the experimental findings reported in Kidambi et al. (2018) that the superiority of momentum methods is mainly due to mini-batching. Meanwhile, the second term in (3.10), $\|x_1 - x^*\|^2\lambda^{2(1-\delta)t}$, remains important to determine the convergence rate in the initial stage. Subsequently, we will demonstrate that this term, particularly the quantity $\lambda$ in SGDM, is improved compared to the $\lambda$ in SGD.
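To make the two terms of (3.10) concrete, the following sketch evaluates the right-hand side of the bound. The function name `sgdm_bound` and all numerical constants are illustrative assumptions rather than values from the paper; in practice $M$ and $\lambda$ would be obtained from (3.6) and (3.9).

```python
def sgdm_bound(M, lam, alpha, sigma, B, init_dist_sq, delta, t):
    """Right-hand side of (3.10), split into its two terms.

    Returns (noise_floor, transient, total), where
      noise_floor = 8 M^2 alpha^2 sigma^2 / (B (1 - lam))
      transient   = 2 M^2 ||x_1 - x*||^2 lam^(2 (1 - delta) t)
    """
    noise_floor = 8.0 * M**2 * alpha**2 * sigma**2 / (B * (1.0 - lam))
    transient = 2.0 * M**2 * init_dist_sq * lam ** (2.0 * (1.0 - delta) * t)
    return noise_floor, transient, noise_floor + transient


# Assumed constants, chosen only to illustrate the scaling in B and t.
for B in (1, 16, 256):
    floor, transient, total = sgdm_bound(
        M=10.0, lam=0.9, alpha=0.1, sigma=1.0,
        B=B, init_dist_sq=4.0, delta=0.1, t=50,
    )
    print(f"B={B:4d}  noise floor={floor:.4f}  transient={transient:.3e}")
```

The noise floor scales as $1/B$ and does not depend on $t$, while the transient term decays geometrically in $t$ and does not depend on $B$.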

3.1.1 Linear convergence to a local neighborhood

The second term in (3.10) determines a linear convergence of SGDM to a local neighborhood of $x^*$ whose size is determined by the first term. The rate of this linear convergence is determined by $\lambda$, the spectral radius of the matrix (3.9). We first provide some intuition for deriving $\lambda$. Consider the noiseless setting where the stochastic gradient is the same as the true gradient, i.e., $\nabla f_\xi(x) = \nabla f(x)$ for any $x$ and $\xi$. Then $\nabla g(x) = \Sigma(x - x^*)$ and the SGDM updating rule (1.2) can be rewritten as

$$\begin{pmatrix} m_{t+1} \\ x_{t+1} - x^* \end{pmatrix} = \Gamma \begin{pmatrix} m_t \\ x_t - x^* \end{pmatrix} = \Gamma^t \begin{pmatrix} m_1 \\ x_1 - x^* \end{pmatrix},$$

where $\Gamma$ is the matrix in (3.9). When the spectral radius of $\Gamma$ is less than 1, the full-batch gradient descent with momentum enjoys linear convergence. In Theorem 1, the matrix $\Gamma$ satisfies $\|\Gamma^t\| \le M\lambda^t$ with $\lambda < 1$, which is proved in the appendix.
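The noiseless recursion can be checked numerically. The sketch below builds $\Gamma$ from (3.9) for an assumed diagonal $\Sigma$ with eigenvalues in $[\mu, L] = [1, 5]$, computes its spectral radius, and iterates the linear system. The helper names and the particular values of $\alpha$ and $\gamma$ are assumptions made for illustration.

```python
import numpy as np


def gamma_matrix(Sigma, alpha, gamma):
    """Block matrix Gamma from (3.9) acting on the state (m_t, x_t - x*)."""
    I = np.eye(Sigma.shape[0])
    return np.block([
        [gamma * I, (1.0 - gamma) * Sigma],
        [-alpha * gamma * I, I - alpha * (1.0 - gamma) * Sigma],
    ])


def spectral_radius(G):
    """Largest modulus among the eigenvalues of G."""
    return float(np.max(np.abs(np.linalg.eigvals(G))))


# Assumed problem: Sigma = diag(1, 2, 5), so mu = 1 and L = 5.
Sigma = np.diag([1.0, 2.0, 5.0])
alpha, gamma = 0.3, 0.5
G = gamma_matrix(Sigma, alpha, gamma)
lam = spectral_radius(G)

# Start from m_1 = 0 and x_1 - x* = (1, 1, 1), then apply Gamma repeatedly.
state = np.concatenate([np.zeros(3), np.ones(3)])
norms = []
for _ in range(200):
    state = G @ state
    norms.append(float(np.linalg.norm(state)))

print(f"spectral radius = {lam:.4f}, final state norm = {norms[-1]:.3e}")
```

Because the spectral radius is below 1 for this choice, the state norm decays geometrically, which is the linear convergence described above.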

Remark 2

The assumption that $\Delta > 0$ in (3.6) is placed for the diagonalization of the matrix $\Gamma$, which is required in our analysis to provide a last-iterate convergence analysis and can be relaxed if we aim for a time-average convergence analysis of the sum $\sum_t \mathbb{E}[\|\widetilde{m}_t\|^2 + \|x_t - x^*\|^2]$. More particularly, a time-average convergence analysis computes $\left\|\sum_{t=0}^{\infty}\Gamma^t\right\| = \left\|(I - \Gamma)^{-1}\right\|$ and requires only that the spectral radius $\lambda < 1$.

In Theorem 1, the convergence rate mainly depends on the spectral radius $\lambda$ of the matrix (3.9). We characterize its explicit form in the following theorem.

Theorem 3

For any momentum weight $\gamma \in [0,1)$, let $\alpha, \gamma$ satisfy $\alpha L < 2(1+\gamma)/(1-\gamma)$ and define $\phi = \min\left\{\alpha\mu,\ 2(1+\gamma)/(1-\gamma) - \alpha L\right\}$. If the momentum $\gamma < (1-\phi)^2/(1+\phi)^2$, we have that the spectral radius $\lambda$ of $\Gamma$ defined by (3.9) satisfies

$$\lambda = \frac{\gamma + 1 - (1-\gamma)\phi + \sqrt{\left(\gamma + 1 - (1-\gamma)\phi\right)^2 - 4\gamma}}{2}. \tag{3.12}$$

On the other hand, if the momentum $\gamma \ge (1-\phi)^2/(1+\phi)^2$, we have

$$\lambda = \sqrt{\gamma}.$$
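As a sanity check on Theorem 3, the following sketch compares the closed form with a direct eigenvalue computation on $\Gamma$. It assumes a diagonal $\Sigma$ whose smallest and largest eigenvalues play the roles of $\mu$ and $L$; the function names and test points are illustrative and not part of the paper.

```python
import numpy as np


def lambda_closed_form(alpha, gamma, mu, L):
    """Spectral radius of Gamma according to Theorem 3."""
    limit = 2.0 * (1.0 + gamma) / (1.0 - gamma)
    if not alpha * L < limit:
        raise ValueError("need alpha * L < 2 * (1 + gamma) / (1 - gamma)")
    phi = min(alpha * mu, limit - alpha * L)
    if gamma < (1.0 - phi) ** 2 / (1.0 + phi) ** 2:
        trace = gamma + 1.0 - (1.0 - gamma) * phi
        # Guard against tiny negative values caused by rounding.
        disc = max(trace * trace - 4.0 * gamma, 0.0)
        return float((trace + np.sqrt(disc)) / 2.0)  # formula (3.12)
    return float(np.sqrt(gamma))


def lambda_numeric(alpha, gamma, kappas):
    """Largest eigenvalue modulus of Gamma in (3.9) for Sigma = diag(kappas)."""
    Sigma = np.diag(kappas)
    I = np.eye(len(kappas))
    Gamma = np.block([
        [gamma * I, (1.0 - gamma) * Sigma],
        [-alpha * gamma * I, I - alpha * (1.0 - gamma) * Sigma],
    ])
    return float(np.max(np.abs(np.linalg.eigvals(Gamma))))


kappas = [1.0, 2.0, 5.0]  # assumed spectrum, mu = 1 and L = 5
mu, L = min(kappas), max(kappas)
for alpha, gamma in [(0.3, 0.05), (0.3, 0.5), (0.39, 0.0)]:
    closed = lambda_closed_form(alpha, gamma, mu, L)
    numeric = lambda_numeric(alpha, gamma, kappas)
    print(f"alpha={alpha}, gamma={gamma}: closed={closed:.6f}, numeric={numeric:.6f}")
```

The test points cover both regimes: the first and third fall under (3.12), while the second has $\gamma$ above the threshold and gives $\lambda = \sqrt{\gamma}$. Points exactly on the threshold are avoided because $\Gamma$ is not diagonalizable there and numerical eigenvalues lose accuracy.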

Theorem 3 reveals that the behavior of the convergence rate $\lambda$ is essentially different in two ranges of momentum weights $\gamma$, exhibiting a phase transition at $\gamma = (1-\phi)^2/(1+\phi)^2$. With Theorem 3, the following remark sheds light on the optimal choice of $\alpha$ to achieve the fastest convergence rate.

Remark 4

When $\gamma$ increases from $0$ to $1$, the spectral radius $\lambda$ first decreases and then increases, and the minimal spectral radius is achieved under the condition $\gamma = (1-\phi)^2/(1+\phi)^2$. Particularly, for $\gamma \le (1-\phi)^2/(1+\phi)^2$, the quantity $\phi$ in Theorem 3 is non-decreasing, and $\lambda$ decreases as $\gamma$ increases. For $\gamma \ge (1-\phi)^2/(1+\phi)^2$, the spectral radius $\lambda = \sqrt{\gamma}$ increases as $\gamma$ increases. Moreover, the minimal spectral radius is

$$\lambda^* = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}},$$

if we specify $\alpha = 1/\sqrt{\mu L}$ and $\gamma = (\sqrt{L} - \sqrt{\mu})^2/(\sqrt{L} + \sqrt{\mu})^2$, and it follows that $\phi = \sqrt{\mu/L}$.

Figure 1 illustrates the spectral radius $\lambda$ presented in Theorem 3 with respect to different $\gamma$ and $\alpha$ when $L/\mu$ is set to $5$. A brighter color corresponds to a smaller $\lambda$, and hence to faster convergence. Figure 1 verifies that the fastest convergence rate is achieved when $\alpha$ is approximately $1/\sqrt{\mu L}$, $\gamma$ is approximately $(\sqrt{5}-1)^2/(\sqrt{5}+1)^2 \approx 0.146$, and the optimal $\lambda^*$ is near $\frac{\sqrt{5}-\sqrt{1}}{\sqrt{5}+\sqrt{1}} \approx 0.382$.
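The surface shown in Figure 1 can be approximated by scanning the closed form of Theorem 3 over a grid. The sketch below assumes $\mu = 1$ and $L = 5$ and a grid with step $0.01$ in both $\alpha$ and $\gamma$; the grid resolution and function name are illustrative choices, so the reported minimizer is only accurate up to the grid spacing.

```python
import math


def spectral_radius(alpha, gamma, mu, L):
    """Closed form of Theorem 3; returns None where alpha * L is too large."""
    limit = 2.0 * (1.0 + gamma) / (1.0 - gamma)
    if alpha * L >= limit:
        return None  # outside the admissible region of learning rates
    phi = min(alpha * mu, limit - alpha * L)
    if gamma < (1.0 - phi) ** 2 / (1.0 + phi) ** 2:
        trace = gamma + 1.0 - (1.0 - gamma) * phi
        return (trace + math.sqrt(max(trace * trace - 4.0 * gamma, 0.0))) / 2.0
    return math.sqrt(gamma)


mu, L = 1.0, 5.0  # assumed condition number L / mu = 5
best = (float("inf"), None, None)  # (lambda, alpha, gamma)
for i in range(1, 201):        # alpha = 0.01, 0.02, ..., 2.00
    alpha = i / 100
    for j in range(100):       # gamma = 0.00, 0.01, ..., 0.99
        gamma = j / 100
        lam = spectral_radius(alpha, gamma, mu, L)
        if lam is not None and lam < best[0]:
            best = (lam, alpha, gamma)

lam_star = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
print(f"grid minimum: lambda={best[0]:.4f} at alpha={best[1]}, gamma={best[2]}")
print(f"Remark 4:     lambda*={lam_star:.4f} at alpha={1 / math.sqrt(mu * L):.4f}")
```

On this grid the minimizer should land close to the values stated in Remark 4, with a small gap because the exact optimum does not lie on a grid point.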

Remark 5

Figure 1 shows that SGDM with large batch sizes converges faster than SGD to a local neighborhood of $x^*$. This is reflected by the observation that, in the figure, the minimal spectral radius $\lambda$ for SGDM (with best $\gamma \approx 0.146$) is smaller than the minimal $\lambda$ of SGD ($\gamma = 0$). A smaller $\lambda$ for SGDM corresponds to a smaller second term in (3.10) of Theorem 1, and thus implies that SGDM will converge faster to enter a local neighborhood of $x^*$ determined by the first term in (3.10). This indicates a faster convergence rate of SGDM in the initial stage with mini-batching.

Moreover, compared to SGD ($\gamma = 0$), SGDM with a larger $\gamma$ permits more flexible choices of the learning rate $\alpha$, i.e., the colored region in the figure is larger for larger $\gamma$. This means that the convergence of SGDM is less sensitive to learning rates. The conclusion is validated by numerical experiments in the subsequent section. Particularly, Figure 6 in Section 5.2 shows that SGDM permits a wider range of learning rates to achieve convergence.
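The widening of the admissible learning-rate range follows directly from the condition $\alpha L < 2(1+\gamma)/(1-\gamma)$ in Theorems 1 and 3. A minimal sketch, with an assumed $L = 5$ and a hypothetical helper name, tabulates the upper limit on $\alpha$ for a few momentum weights:

```python
def max_learning_rate(gamma, L):
    """Supremum of alpha satisfying alpha * L < 2 * (1 + gamma) / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 2.0 * (1.0 + gamma) / ((1.0 - gamma) * L)


L = 5.0  # assumed largest eigenvalue of the Hessian
for gamma in (0.0, 0.5, 0.9):
    print(f"gamma={gamma}: alpha must be below {max_learning_rate(gamma, L):.2f}")
```

For $\gamma = 0$ this recovers the familiar SGD limit $\alpha < 2/L$, and the limit grows without bound as $\gamma \to 1$.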

Figure 1: The spectral radius of $\Gamma$ with respect to $\gamma$ and $\alpha$ in Theorem 3, where $L/\mu = 5/1$.

3.1.2 Explicit convergence rates for SGDM with specified $\gamma$ and $\alpha$

In the following, we formalize the conclusions drawn from Theorem 1 and Remark 4 in two corollaries on the explicit convergence rates of SGDM within the two phases of specifications of $\gamma$.

Corollary 6 (Small momentum weight $\gamma$)

Under the assumptions in Theorem 1, for $0 \le \gamma \le (1-\alpha L)/(4+4\alpha^2L^2)$, we have that $\Delta \ge (1-\alpha L)^2/2$, $\alpha L^2 L_f^2 = \mathcal{O}(B\mu)$ and

$$\mathbb{E}\left[\|\widetilde{m}_{t+1}\|^2 + \|x_{t+1} - x^*\|^2\right] = \mathcal{O}\left(\frac{L^2}{B\mu}\alpha\sigma^2 + L^2\lambda^{2(1-\delta)t}\right).$$

Corollary 6 establishes the explicit finite-sample rate of convergence of SGDM under small $\gamma$, and shows that the spectral radius $\lambda$ decreases slowly as $\gamma$ increases from $0$ to $(1-\alpha L)/(4+4\alpha^2L^2)$.

Corollary 7 (Large momentum weight $\gamma$)

Under the assumptions in Theorem 1, for $0 < 1-\gamma \le 2\alpha\mu/(1+\alpha\mu)^2$, we have that $\Delta \ge 2(1-\gamma)\alpha\mu$, $\alpha L_f^2\left(L^2 + \frac{\alpha^2}{(1-\gamma)^2}\right) = \mathcal{O}(B\mu)$, and

$$\mathbb{E}\left[\|\widetilde{m}_{t+1}\|^2 + \|x_{t+1} - x^*\|^2\right] = \mathcal{O}\left(\frac{1}{B\mu}\left(L^2 + \frac{\alpha^2}{(1-\gamma)^2}\right)\alpha\sigma^2 + \left(L^2 + \frac{\alpha}{(1-\gamma)\mu}\right)\lambda^{2(1-\delta)t}\right),$$

where $\lambda$ is defined in (3.12) above. Furthermore, if $\alpha < \frac{2(1+\gamma)}{(\mu+L)(1-\gamma)}$, we have $\lambda = \sqrt{\gamma}$.

Corollary 7 demonstrates that the spectral radius $\lambda$ increases with $\gamma$ as $\gamma$ approaches $1$, exhibiting the opposite behavior to the small-$\gamma$ setting in Corollary 6. As we demonstrated in Remark 5 above, compared to mini-batch SGD ($\gamma = 0$), mini-batch SGDM with an appropriate $\gamma$ converges faster to the local neighborhood of $x^*$ by achieving a smaller second term in its convergence rate. The following Remark 8 implements an optimal choice of the momentum weight $\gamma$ in SGDM, to provide explicit convergence rates of $\widetilde{m}_{t+1}$ and $x_{t+1}$ with respect to $L$, $\mu$ and $B$.

Remark 8

As one increases $\gamma$ from $0$ to $1$, the corresponding spectral radius $\lambda$ first decreases and then increases. Particularly, if one specifies $\gamma$ and $\alpha$ as

$$\gamma = \frac{(1 - c\alpha\mu)^2}{(1 + c\alpha\mu)^2}, \quad\text{and}\quad \alpha \le \sqrt{\frac{1}{\mu L}},$$

for $c < 1$ sufficiently close to 1, we have that $\Delta = \mathcal{O}(\alpha^2\mu^2)$ and

$$\mathbb{E}\left[\|\widetilde{m}_{t+1}\|^2 + \|x_{t+1} - x^*\|^2\right] = \mathcal{O}\left(\frac{(L + 1/\mu)^2}{B\mu}\alpha\sigma^2 + (L + 1/\mu)^2\gamma^{(1-\delta)t}\right).$$
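The choice in Remark 8 places $\gamma$ in the regime where $\lambda = \sqrt{\gamma}$, which is why the decaying term is written in terms of $\gamma^{(1-\delta)t} = \lambda^{2(1-\delta)t}$. The sketch below checks this numerically for assumed values $\mu = 1$, $L = 5$, $c = 0.9$ and $\alpha = 1/\sqrt{\mu L}$; the diagonal $\Sigma$ and the helper name `remark8_gamma` are illustrative assumptions.

```python
import numpy as np


def remark8_gamma(alpha, mu, c):
    """Momentum weight gamma = (1 - c*alpha*mu)^2 / (1 + c*alpha*mu)^2."""
    return (1.0 - c * alpha * mu) ** 2 / (1.0 + c * alpha * mu) ** 2


mu, L, c = 1.0, 5.0, 0.9          # assumed constants
alpha = 1.0 / np.sqrt(mu * L)     # largest learning rate allowed in Remark 8
gamma = remark8_gamma(alpha, mu, c)

# Gamma from (3.9) for an assumed diagonal Sigma with spectrum in [mu, L].
Sigma = np.diag([mu, 3.0, L])
I = np.eye(3)
Gamma = np.block([
    [gamma * I, (1.0 - gamma) * Sigma],
    [-alpha * gamma * I, I - alpha * (1.0 - gamma) * Sigma],
])
lam = float(np.max(np.abs(np.linalg.eigvals(Gamma))))

print(f"gamma = {gamma:.4f}, sqrt(gamma) = {np.sqrt(gamma):.4f}, lambda = {lam:.4f}")
```

Taking $c$ strictly below 1 keeps all eigenvalues of $\Gamma$ complex with modulus $\sqrt{\gamma}$ and keeps $\Delta > 0$, at the cost of a slightly larger $\gamma$ than the optimum in Remark 4.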

We now compare our rate of convergence with several results in the existing literature.

Comparison to SGD.

Theorem 4.6 in Bottou et al. (2018) proved the convergence rate of SGD under strong convexity, $\mathcal{O}\left(\frac{L}{\mu}\alpha\sigma^2 + (1-\alpha\mu)^t\right)$, under the assumption $\mathbb{E}[\|\nabla f_\xi(x) - \nabla f(x)\|^2] \le \sigma^2$ for all $x$. Comparing with their result, our analysis in Theorem 3 can apply with $\gamma = 0$ and $\alpha < 2/(\mu + L)$ such that $\phi = \alpha\mu$, which is consistent with Bottou et al. (2018). Our Remark 4 also shows that SGDM with large momentum weight can accelerate the convergence rate in Bottou et al. (2018).

Comparison to existing results for SGDM.

Liu et al. (2020) proved that the convergence rate for strongly convex loss is $\mathcal{O}\left(\frac{L}{\mu}\alpha\sigma^2 + \max\{1-\alpha\mu, \gamma\}^t\right)$ for $t \ge t_0 = \mathcal{O}(-1/\log(\gamma))$. Comparing that with Corollary 6 above, our result improves the rate when the momentum weight $\gamma$ is small. Moreover, Liu et al. (2020) required learning rate $\alpha = \mathcal{O}(1-\gamma)$, which is considerably more restrictive than the range permitted in our theorems, namely $\alpha L < 2(1+\gamma)/(1-\gamma)$. This requirement is in opposition to our finding that SGDM permits a wider range of learning rates. The stochastic heavy ball (SHB) method is equivalent to SGDM for $\alpha' = \alpha(1-\gamma)$ and $\gamma' = \gamma$.
Under the noiseless setting, for α=1/Lsuperscript𝛼1𝐿\alpha^{\prime}=1/Litalic_α start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = 1 / italic_L and γ=(L/μ1)/(L/μ+1)superscript𝛾𝐿𝜇1𝐿𝜇1\gamma^{\prime}=(\sqrt{L/\mu}-1)/(\sqrt{L/\mu}+1)italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( square-root start_ARG italic_L / italic_μ end_ARG - 1 ) / ( square-root start_ARG italic_L / italic_μ end_ARG + 1 ), Nesterov (1983) gave the 𝒪((1μ/L)t)𝒪superscript1𝜇𝐿𝑡\mathcal{O}((1-\sqrt{\mu/L})^{t})caligraphic_O ( ( 1 - square-root start_ARG italic_μ / italic_L end_ARG ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) convergence rate. For α=𝒪(1/L)superscript𝛼𝒪1𝐿\alpha^{\prime}=\mathcal{O}(1/L)italic_α start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = caligraphic_O ( 1 / italic_L ) and α=𝒪(1/μL)𝛼𝒪1𝜇𝐿\alpha=\mathcal{O}(\sqrt{1/\mu L})italic_α = caligraphic_O ( square-root start_ARG 1 / italic_μ italic_L end_ARG ) in Remark 8, our convergence factor is γ=𝒪(1μ/L)𝛾𝒪1𝜇𝐿\gamma=\mathcal{O}(1-\sqrt{\mu/L})italic_γ = caligraphic_O ( 1 - square-root start_ARG italic_μ / italic_L end_ARG ), which improves the factor 𝒪(1μ/L)𝒪1𝜇𝐿\mathcal{O}(1-\mu/L)caligraphic_O ( 1 - italic_μ / italic_L ) in SGD and is consistent with Nesterov (1983). Theorem 3.5 in Bollapragada et al. 
(2022) proved that for α=2L+ϵ+μϵsuperscript𝛼2𝐿italic-ϵ𝜇italic-ϵ\sqrt{\alpha^{\prime}}=\frac{2}{\sqrt{L+\epsilon}+\sqrt{\mu-\epsilon}}square-root start_ARG italic_α start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG = divide start_ARG 2 end_ARG start_ARG square-root start_ARG italic_L + italic_ϵ end_ARG + square-root start_ARG italic_μ - italic_ϵ end_ARG end_ARG and γ=(L+ϵ)/(μϵ)1(L+ϵ)/(μϵ)+1superscript𝛾𝐿italic-ϵ𝜇italic-ϵ1𝐿italic-ϵ𝜇italic-ϵ1\sqrt{\gamma^{\prime}}=\frac{\sqrt{(L+\epsilon)/(\mu-\epsilon)}-1}{\sqrt{(L+% \epsilon)/(\mu-\epsilon)}+1}square-root start_ARG italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG = divide start_ARG square-root start_ARG ( italic_L + italic_ϵ ) / ( italic_μ - italic_ϵ ) end_ARG - 1 end_ARG start_ARG square-root start_ARG ( italic_L + italic_ϵ ) / ( italic_μ - italic_ϵ ) end_ARG + 1 end_ARG, ϵ(0,μ)italic-ϵ0𝜇\epsilon\in(0,\mu)italic_ϵ ∈ ( 0 , italic_μ ), the convergence rate of the SHB method on a consistent linear system Ax=b𝐴𝑥𝑏Ax=bitalic_A italic_x = italic_b is 𝒪(L2ϵt2(γ)t+L4ϵμ2σ2)𝒪superscript𝐿2italic-ϵsuperscript𝑡2superscriptsuperscript𝛾𝑡superscript𝐿4italic-ϵsuperscript𝜇2superscript𝜎2\mathcal{O}(\frac{L^{2}}{\epsilon}t^{2}(\gamma^{\prime})^{t}+\frac{L^{4}}{% \epsilon\mu^{2}}\sigma^{2})caligraphic_O ( divide start_ARG italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ϵ end_ARG italic_t start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT + divide start_ARG italic_L start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ϵ italic_μ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ). Their convergence rate is the same as ours in Remark 8 under a different setting, while their results are more restrictive on the choice of learning rates and momentum weights.
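The reparameterization $\alpha^{\prime}=\alpha(1-\gamma)$, $\gamma^{\prime}=\gamma$ can be checked numerically. The sketch below assumes the normalized-momentum form of SGDM, $m_t=\gamma m_{t-1}+(1-\gamma)g_t$, $x_{t+1}=x_t-\alpha m_t$ with $m_0=0$, which is the form consistent with the stated equivalence. The function names and the toy quadratic are hypothetical, and the gradients are taken noiseless so that both recursions see identical inputs.

```python
import numpy as np

def sgdm(grad, x1, alpha, gamma, steps):
    """Assumed SGDM form: m_t = gamma*m_{t-1} + (1-gamma)*g_t, x_{t+1} = x_t - alpha*m_t."""
    x, m = x1.copy(), np.zeros_like(x1)
    path = [x.copy()]
    for _ in range(steps):
        m = gamma * m + (1.0 - gamma) * grad(x)
        x = x - alpha * m
        path.append(x.copy())
    return np.array(path)

def heavy_ball(grad, x1, alpha_p, gamma_p, steps):
    """Heavy ball: x_{t+1} = x_t - alpha'*g_t + gamma'*(x_t - x_{t-1}), started with x_0 = x_1."""
    x_prev, x = x1.copy(), x1.copy()
    path = [x.copy()]
    for _ in range(steps):
        x_next = x - alpha_p * grad(x) + gamma_p * (x - x_prev)
        x_prev, x = x, x_next
        path.append(x.copy())
    return np.array(path)

# Hypothetical test problem: f(x) = 0.5 * x^T H x with mu = 1 and L = 10.
H = np.diag([1.0, 10.0])
grad = lambda x: H @ x
x1 = np.array([1.0, -2.0])
alpha, gamma = 0.3, 0.6

path_sgdm = sgdm(grad, x1, alpha, gamma, steps=50)
path_shb = heavy_ball(grad, x1, alpha * (1.0 - gamma), gamma, steps=50)
max_gap = float(np.max(np.abs(path_sgdm - path_shb)))
print("largest difference between the two trajectories:", max_gap)
```

The two trajectories agree up to floating-point rounding. The same identity holds in the stochastic case provided both recursions are fed the same mini-batch gradients.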

3.2 The finite-sample rates for SGDM on general losses

For the general strongly convex loss where $\overline{L}>0$, we provide the following convergence rate.

Theorem 9

Under (A1), (A2), and (A3') and $\overline{L}>0$, for any momentum $\gamma\in[0,1)$, assume the learning rate $\alpha>0$ satisfies $\alpha L<2(1+\gamma)/(1-\gamma)$ and

$$6\sqrt{2}c(2\log T+4d)M\left(\frac{2L_{f}}{\lambda^{(1-\delta)}\sqrt{\delta}(1-\lambda)^{1/2}}+\frac{3\overline{L}M\alpha\sigma}{(1-\lambda)^{3/2}}\right)\alpha\leq\sqrt{B},$$

for fixed $\delta\in(0,1]$ and an absolute constant $c>0$, where $M$ is defined in (3.6). In addition, the initialization $x_{1}$ satisfies $\left\|x_{1}-x^{*}\right\|\leq\frac{\delta(1-\lambda)\lambda^{1-\delta}}{36M^{2}\alpha\overline{L}}$. Then with probability $1-2T^{-1}$,

$$\sqrt{\left\|\widetilde{m}_{t+1}\right\|^{2}+\left\|x_{t+1}-x^{*}\right\|^{2}}\leq\frac{3\sqrt{2}c\left(2\log T+4d\right)M}{\sqrt{B(1-\lambda)}}\alpha\sigma+3M\left\|x_{1}-x^{*}\right\|\lambda^{(1-\delta)t},\quad\text{for }1\leq t\leq T.$$

Compared to Theorem 1, the convergence rate in Theorem 9 is worse only by a logarithmic factor, which indicates that SGDM for general losses achieves almost the same rate as that for quadratic ones. The bound for general losses holds with high probability, in contrast to the bound in expectation established for quadratic losses. Nonetheless, it holds uniformly over all iterations $1\leq t\leq T$.

Mini-batch SGDM for general convex loss still converges faster than mini-batch SGD with high probability, as discussed in Theorem 3. In contrast to the quadratic setting $\overline{L}=0$, the general convex setting additionally requires that the initialization $x_{1}$ is close to $x^{*}$, i.e., $\left\|x_{1}-x^{*}\right\|$ is small. This assumption is expected since the loss function is not guaranteed to be convex everywhere under the weak assumptions in (A1)–(A2). Particularly, for $\gamma$ close to $0$, we assume $\left\|x_{1}-x^{*}\right\|=\mathcal{O}\left(\mu/\overline{L}L^{2}\right)$. For $\gamma$ close to $1$, we assume $\left\|x_{1}-x^{*}\right\|=\mathcal{O}\left(\mu/\overline{L}\left(L^{2}+\alpha^{2}/(1-\gamma)^{2}\right)\right)$.
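The learning-rate condition $\alpha L<2(1+\gamma)/(1-\gamma)$ of Theorem 9 can be illustrated on a noiseless one-dimensional quadratic. The sketch below is an illustration under stated assumptions, not the paper's proof technique: it assumes the normalized-momentum update $m_t=\gamma m_{t-1}+(1-\gamma)g_t$, $x_{t+1}=x_t-\alpha m_t$, rewrites it as a linear recursion in $(x_t,x_{t-1})$, and inspects the spectral radius of the resulting $2\times 2$ matrix. The helper names are hypothetical.

```python
import numpy as np

def max_alpha_times_L(gamma):
    """Upper end of the range alpha * L < 2 * (1 + gamma) / (1 - gamma)."""
    return 2.0 * (1.0 + gamma) / (1.0 - gamma)

def spectral_radius(alpha, gamma, curvature):
    """Spectral radius of the noiseless recursion on f(x) = 0.5 * curvature * x^2,
    written as (x_{t+1}, x_t) = T (x_t, x_{t-1})."""
    a = alpha * (1.0 - gamma) * curvature
    T = np.array([[1.0 + gamma - a, -gamma],
                  [1.0, 0.0]])
    return float(np.max(np.abs(np.linalg.eigvals(T))))

L = 10.0
for gamma in (0.0, 0.5, 0.9):
    alpha_max = max_alpha_times_L(gamma) / L
    rho_in = spectral_radius(0.99 * alpha_max, gamma, L)    # just inside the range
    rho_out = spectral_radius(1.01 * alpha_max, gamma, L)   # just outside the range
    print(f"gamma={gamma:.1f}  alpha_max={alpha_max:.2f}  rho_in={rho_in:.4f}  rho_out={rho_out:.4f}")
```

In this toy setting the recursion contracts just inside the range and diverges just outside it, and the admissible range of $\alpha$ widens as $\gamma$ grows.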

Based on Theorem 9, we have the following corollaries on the convergence rates for small and large specifications of $\gamma$ under general losses, similar to Corollaries 6–7 for quadratic losses.

Corollary 10 (Small momentum weight $\gamma$)

Following Theorem 9, for $0\leq\gamma\leq(1-\alpha L)/(4+4\alpha^{2}L^{2})$, we have that $\Delta\geq(1-\alpha L)^{2}/2$, $\sqrt{\alpha}L_{f}L\left(\mu+\overline{L}L\right)=\mathcal{O}\left(B^{1/2}\mu^{3/2}\right)$ and with high probability,

$$\sqrt{\left\|\widetilde{m}_{t+1}\right\|^{2}+\left\|x_{t+1}-x^{*}\right\|^{2}}=\mathcal{O}\left(\frac{L\log T}{\sqrt{B\mu}}\sqrt{\alpha}\sigma+L\lambda^{(1-\delta)t}\right),$$

where $\lambda$ is defined in (3.12) above.

Corollary 11 (Large momentum weight $\gamma$)

Following Theorem 9, for $0<1-\gamma\leq 2\alpha\mu/(1+\alpha\mu)^{2}$, we have that $\Delta\geq 2(1-\gamma)\alpha\mu$,

$$\sqrt{\alpha}L_{f}\left(L+\frac{\alpha}{1-\gamma}\right)\left\{1+\overline{L}\left(L+\frac{\alpha}{1-\gamma}\right)\sqrt{\frac{\alpha}{(1-\gamma)\mu}}\right\}=\mathcal{O}\left(B\mu\right),$$

and with high probability

$$\sqrt{\left\|\widetilde{m}_{t+1}\right\|^{2}+\left\|x_{t+1}-x^{*}\right\|^{2}}=\mathcal{O}\left(\frac{\log T}{\sqrt{B\mu}}\left(L+\frac{\alpha}{1-\gamma}\right)\sqrt{\alpha}\sigma+\left(L+\frac{\sqrt{\alpha}}{\sqrt{(1-\gamma)\mu}}\right)\lambda^{(1-\delta)t}\right),$$

where $\lambda$ is defined in (3.12) above. Furthermore, if $\alpha<\frac{2(1+\gamma)}{(\mu+L)(1-\gamma)}$, we have $\lambda=\sqrt{\gamma}$.

In the following remark, we choose the optimal momentum weights $\gamma$ to show the explicit convergence results with respect to $L$, $\mu$, and $B$.

Remark 12

If one specifies $\gamma$ and $\alpha$ as

$$\gamma=\frac{(1-c\alpha\mu)^{2}}{(1+c\alpha\mu)^{2}},\quad\text{and}\quad\alpha\leq\sqrt{\frac{1}{\mu L}},$$

for $c<1$ sufficiently close to 1, we have that $\Delta=\mathcal{O}(\alpha^{2}\mu^{2})$ and with high probability

$$\left\|\widetilde{m}_{t+1}\right\|^{2}+\left\|x_{t+1}-x^{*}\right\|^{2}=\mathcal{O}\left(\frac{(L+1/\mu)^{2}(\log T)^{2}}{B\mu}\alpha\sigma^{2}+(L+1/\mu)^{2}\gamma^{(1-\delta)t}\right).$$

We arrive at the same conclusion as in the quadratic setting: SGDM can accelerate convergence over SGD under general strongly convex losses when $\overline{L}>0$.
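To make the choice in Remark 12 concrete, the following sketch evaluates $\gamma$ at the largest admissible learning rate $\alpha=1/\sqrt{\mu L}$ and compares it with the $1-\mu/L$ factor quoted for SGD. The function name, the default $c=0.9$, and the values of $\mu$ and $L$ are hypothetical choices made only for illustration.

```python
import math

def remark12_parameters(mu, L, c=0.9):
    """Return (alpha, gamma) with alpha = 1/sqrt(mu*L) and
    gamma = (1 - c*alpha*mu)^2 / (1 + c*alpha*mu)^2."""
    alpha = 1.0 / math.sqrt(mu * L)      # largest value allowed by alpha <= sqrt(1/(mu*L))
    s = c * alpha * mu                    # equals c * sqrt(mu / L) for this alpha
    gamma = (1.0 - s) ** 2 / (1.0 + s) ** 2
    return alpha, gamma

mu, L = 1.0, 100.0
alpha, gamma = remark12_parameters(mu, L)
sgd_factor = 1.0 - mu / L                 # the O(1 - mu/L) factor quoted for SGD
range_bound = 2.0 * (1.0 + gamma) / (1.0 - gamma)

print(f"alpha = {alpha:.3f}, gamma = {gamma:.4f}, SGD factor = {sgd_factor:.4f}")
print(f"alpha * L = {alpha * L:.2f} versus range bound {range_bound:.2f}")
```

With these illustrative values the momentum factor $\gamma$ is well below $1-\mu/L$. The pair $(\alpha,\gamma)$ also stays inside the range $\alpha L<2(1+\gamma)/(1-\gamma)$: for this $\alpha$ the condition reduces to $c<1+c^{2}\mu/L$, which holds for every $c<1$.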

4 Acceleration by Averaging and Asymptotic Normality

In this section, we study the Polyak-averaging SGDM (referred to as averaged SGDM). Particularly, we aim at building the convergence result of $\frac{1}{n-n_{0}}\sum_{t=n_{0}+1}^{n}x_{t}$, which is an average of all iterates $x_{t}$ starting from a constant period $n_{0}$. We show that the averaging leads to acceleration for SGDM, compared to the last-iterate bounds established in the previous section.

4.1 Averaged SGDM under quadratic losses

For the mini-batch model (2.4) under quadratic losses $(\overline{L}=0)$, we now characterize a decomposition of the error of the averaged SGDM $\frac{1}{n-n_{0}}\sum_{t=n_{0}+1}^{n}(x_{t}-x^{*})$, which leads to an acceleration over the last iterate of SGDM $x_{n}$.

Theorem 13

Under (A1)-(A3) and $\overline{L}=0$, suppose that the conditions in Theorem 1 hold, and the $t$-th iteration $(\widetilde{m}_{t},x_{t})$ satisfies

$$\mathbb{E}\left[\left\|\widetilde{m}_{t}\right\|^{2}+\left\|x_{t}-x^{*}\right\|^{2}\right]\leq\frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2}+C_{2}\left\|x_{1}-x^{*}\right\|^{2}\lambda^{2(1-\delta)(t-1)},\quad t\geq 1. \tag{4.13}$$

Then for $n\geq 2n_{0}$ where $\lambda^{2n_{0}}=B^{-1}(1-\lambda)$, we have

$$\frac{\sum_{t=n_{0}+1}^{n}(x_{t}-x^{*})}{n-n_{0}}=-\frac{\sum_{t=n_{0}+1}^{n}\sum_{i=1}^{B}\Sigma^{-1}(A_{\xi_{ti}}x^{*}-b_{\xi_{ti}})}{B(n-n_{0})}+R_{n}, \tag{4.14}$$

where $\mathbb{E}\left\|R_{n}\right\|^{2}\leq\frac{\tilde{C}_{1}}{Bn^{2}}+\frac{\tilde{C}_{2}}{B^{2}n}$, and

$$\tilde{C}_{1}=40\alpha^{2}L_{f}^{2}M^{2}\frac{C_{2}\left\|x_{1}-x^{*}\right\|^{2}}{(1-\delta)(1-\lambda)^{3}}+120\alpha^{2}\sigma^{2}M^{2}\frac{1}{(1-\lambda)^{3}}+\frac{4M^{2}\left\|x_{1}-x^{*}\right\|^{2}}{1-\lambda},$$

$$\tilde{C}_{2}=30\alpha^{2}\sigma^{2}L_{f}^{2}M^{2}\frac{C_{1}\alpha^{2}}{(1-\lambda)^{3}},$$

and $M$ is defined in (3.6).

In the above theorem, we obtain the convergence rate of averaged SGDM in (4.14). While (4.13) is assumed in order to make explicit the dependence of $\tilde{C}_{1}$, $\tilde{C}_{2}$ on the constants $C_{1}$, $C_{2}$ in (4.13), it is nothing but a rephrasing of the last-iterate bounds established in (3.10). By Theorem 13, the averaged SGDM converges to $x^{*}$ without a bias as $n$ increases. The leading term $-\frac{\sum_{t=n_{0}+1}^{n}\sum_{i=1}^{B}\Sigma^{-1}(A_{\xi_{ti}}x^{*}-b_{\xi_{ti}})}{B(n-n_{0})}$ is an average of $B(n-n_{0})$ i.i.d. vectors when the $\xi_{ti}$ are sampled i.i.d. Therefore the expected squared norm of the leading term is $\mathcal{O}(\frac{1}{Bn})$, and the remainder term $R_{n}$ in (4.14) satisfies $\mathbb{E}\left\|R_{n}\right\|^{2}\leq\frac{\tilde{C}_{1}}{Bn^{2}}+\frac{\tilde{C}_{2}}{B^{2}n}$. Averaging thus improves the convergence rate of SGDM from the biased bound in (4.13) to an unbiased rate of $\mathcal{O}(\frac{1}{n})$, which vanishes as $n\rightarrow\infty$. Notably, for averaged SGDM the initialization error $\|x_{1}-x^{*}\|$ is forgotten at the rate of $n^{-2}$, slower than the exponential forgetting of the initialization in the last-iterate convergence established in Theorem 1.
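The contrast between the last iterate and the averaged iterate can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: it uses a synthetic least-squares problem with standard normal covariates, the normalized-momentum update $m_t=\gamma m_{t-1}+(1-\gamma)g_t$, $x_{t+1}=x_t-\alpha m_t$, and hypothetical values for $d$, $B$, $n$, $n_0$, $\alpha$, and $\gamma$ that are not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, n, n0 = 5, 4, 5000, 500            # dimension, batch size, iterations, burn-in
alpha, gamma, noise_sd = 0.1, 0.5, 1.0   # constant learning rate and momentum weight
x_star = np.ones(d)

x = np.zeros(d)                          # x_1
m = np.zeros(d)                          # m_0
running_sum = np.zeros(d)
tail_sq_err = []

for t in range(1, n + 1):
    if t > n0:
        running_sum += x                 # accumulate x_t for t = n0+1, ..., n
        tail_sq_err.append(float(np.sum((x - x_star) ** 2)))
    a = rng.standard_normal((B, d))      # mini-batch of covariates
    y = a @ x_star + noise_sd * rng.standard_normal(B)
    g = a.T @ (a @ x - y) / B            # mini-batch gradient of 0.5 * (a^T x - y)^2
    m = gamma * m + (1.0 - gamma) * g    # assumed normalized-momentum update
    x = x - alpha * m

x_bar = running_sum / (n - n0)
avg_err = float(np.linalg.norm(x_bar - x_star))
last_rms = float(np.sqrt(np.mean(tail_sq_err)))
print(f"averaged error {avg_err:.4f}  vs  typical last-iterate error {last_rms:.4f}")
```

With a constant learning rate the last iterate keeps fluctuating at a level set by $\alpha\sigma^{2}/B$, whereas the error of the average keeps shrinking as $n$ grows.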

Remark 14

Comparing the convergence rates of SGDM and averaged SGDM in Theorem 1 and Theorem 13, respectively, we see that the leading term in the averaged SGDM does not depend on the learning rate $\alpha$. Moreover, the averaging process starts from $n_{0}$, effectively excluding the initial iterations that may have larger deviations. In practice, the starting point $n_{0}$ is selected manually, since it depends on $\mu$ and $L$. With a small $\lambda$, averaged SGDM permits a smaller $n_{0}$ to fulfill the condition $\lambda^{2n_{0}}=B^{-1}(1-\lambda)$, and is therefore less sensitive to the choice of $n_{0}$. In addition, the leading term in (4.14) is independent of both the momentum weight $\gamma$ and the learning rate $\alpha$, which indicates that the averaged SGDM has the same rate of convergence as the averaged SGD.
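The burn-in condition $\lambda^{2n_{0}}=B^{-1}(1-\lambda)$ can be solved for $n_{0}$ in closed form, $n_{0}=\log\left((1-\lambda)/B\right)/(2\log\lambda)$. The sketch below rounds this value up to an integer, which is an assumption made here for practicality (the condition then holds with "$\leq$" instead of equality). The function name and the sample values of $\lambda$ and $B$ are hypothetical.

```python
import math

def burn_in_length(lam, B):
    """Smallest integer n0 with lam**(2*n0) <= (1 - lam) / B."""
    if not (0.0 < lam < 1.0) or B < 1:
        raise ValueError("need 0 < lam < 1 and B >= 1")
    return math.ceil(math.log((1.0 - lam) / B) / (2.0 * math.log(lam)))

for lam in (0.5, 0.9, 0.99):
    print(f"lambda = {lam}: n0 = {burn_in_length(lam, B=10)}")
```

A smaller contraction factor $\lambda$ gives a markedly shorter burn-in, which matches the remark that a small $\lambda$ allows a smaller $n_{0}$.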

The following Corollary 15 establishes the asymptotic distribution of averaged SGDM, which is indeed the distribution of the leading term in (4.14), and the remainder term $R_{n}$ in (4.14) establishes the convergence of the averaged SGDM algorithm to the asymptotic normal distribution.

Corollary 15

Following Theorem 13, we have

$$\frac{\sqrt{B}}{\sigma}\frac{\sum_{t=n_{0}+1}^{n}(x_{t}-x^{*})}{\sqrt{n-n_{0}}}\Rightarrow\mathcal{N}(0,\Sigma^{-1}\Omega\Sigma^{-1}),\quad\text{as}~\left(\alpha n,\frac{B}{\alpha}\right)\to\infty,$$

where

$$\Omega=\frac{1}{\sigma^{2}}\mathbb{E}\left[(A_{\xi}x^{*}-b_{\xi})(A_{\xi}x^{*}-b_{\xi})^{\top}\right]. \tag{4.15}$$

The asymptotic distribution of the averaged SGDM is the same as that of the averaged SGD in the existing literature (Polyak and Juditsky, 1992), with different remainder terms $R_{n}$. As the averaged SGDM is close to $x^{*}$, the limiting distribution is established with covariance matrix $\Sigma^{-1}\Omega\Sigma^{-1}$, where $\Sigma$ and $\Omega$ are respectively the Hessian and the Gram matrix of $A_{\xi}x^{*}-b_{\xi}$.
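A plug-in estimate of the sandwich covariance can be sketched for the least-squares case. The sketch assumes $A_{\xi}=a a^{\top}$ and $b_{\xi}=a y$, so that $\Sigma=\mathbb{E}[a a^{\top}]$, and it estimates the unnormalized product $\sigma^{2}\Sigma^{-1}\Omega\Sigma^{-1}=\Sigma^{-1}S\Sigma^{-1}$ with $S=\mathbb{E}[(A_{\xi}x^{*}-b_{\xi})(A_{\xi}x^{*}-b_{\xi})^{\top}]$. The function name and the synthetic data are hypothetical, and this estimator is one common choice rather than a procedure prescribed by the paper.

```python
import numpy as np

def sandwich_covariance(a, y, x_hat):
    """Plug-in estimate of Sigma^{-1} S Sigma^{-1} for least squares, with
    Sigma = E[a a^T] and S = E[g g^T], g = a * (a^T x - y) evaluated at x_hat."""
    n_obs = a.shape[0]
    Sigma_hat = a.T @ a / n_obs
    g = a * (a @ x_hat - y)[:, None]     # per-sample gradients, shape (n_obs, d)
    S_hat = g.T @ g / n_obs
    Sigma_inv = np.linalg.inv(Sigma_hat)
    return Sigma_inv @ S_hat @ Sigma_inv

rng = np.random.default_rng(1)
d, n_obs = 3, 20000
x_star = np.array([1.0, -1.0, 0.5])
a = rng.standard_normal((n_obs, d))
y = a @ x_star + rng.standard_normal(n_obs)    # unit-variance noise

# x_star stands in for the averaged SGDM output in this toy check.
V = sandwich_covariance(a, y, x_star)
print(np.round(V, 2))
```

For standard normal covariates and independent unit-variance noise, both $\Sigma$ and $S$ equal the identity, so the estimate should be close to the identity matrix. In practice `x_hat` would be the averaged SGDM output, and the approximate covariance of that average would be the estimate divided by $B(n-n_{0})$.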

Remark 16

Asymptotic normality in Corollary 15 holds under the asymptotics $\alpha n,B/\alpha\rightarrow\infty$. We specify three common scenarios of the learning rate $\alpha$, batch size $B$, and sample size $n$.

  • For a constant learning rate α𝛼\alphaitalic_α, Corollary 15 holds as n𝑛nitalic_n and B𝐵Bitalic_B tend to infinity. As the batch size B𝐵Bitalic_B increases, the bias term in (4.13) approaches zero, reducing the difference between the stochastic gradient at xtsubscript𝑥𝑡x_{t}italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and x*superscript𝑥x^{*}italic_x start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT.

  • For decaying learning rate α𝛼\alphaitalic_α, the asymptotic normality holds with a fixed batch size B𝐵Bitalic_B as long as αn𝛼𝑛\alpha nitalic_α italic_n diverges. The idea is intuitive: as α𝛼\alphaitalic_α decreases, the random effect reduces. For instance, Corollary 15 holds when the batch size B𝐵Bitalic_B is fixed and α=Θ(nϵ)𝛼Θsuperscript𝑛italic-ϵ\alpha=\Theta(n^{-\epsilon})italic_α = roman_Θ ( italic_n start_POSTSUPERSCRIPT - italic_ϵ end_POSTSUPERSCRIPT ) with ϵ(0,1)italic-ϵ01\epsilon\in(0,1)italic_ϵ ∈ ( 0 , 1 ) as n𝑛n\to\inftyitalic_n → ∞.

  • By minimizing the remainder term Rnsubscript𝑅𝑛R_{n}italic_R start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT in (4.13), the optimal learning rate is α=Θ((n/B)1/2)𝛼Θsuperscript𝑛𝐵12\alpha=\Theta((n/B)^{-1/2})italic_α = roman_Θ ( ( italic_n / italic_B ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT ). With such a specified learning rate, the averaged SGDM correspondingly converges to asymptotic normality with rate 𝒪((nB)1/2)𝒪superscript𝑛𝐵12\mathcal{O}((nB)^{-1/2})caligraphic_O ( ( italic_n italic_B ) start_POSTSUPERSCRIPT - 1 / 2 end_POSTSUPERSCRIPT ).

In addition, Corollary 15 provides a rigorous foundation for constructing asymptotically valid confidence intervals from the asymptotic normality of averaged SGDM. By estimating the covariance matrix, uncertainty quantification and statistical inference can be performed with SGDM methods. Given the asymptotic normality and the covariance matrix, the $95\%$ confidence region of $x^*$ can be constructed as

$$I_n=\left\{x\in\mathbb{R}^d:\ \frac{B(n-n_0)}{\sigma^2}\left(\frac{\sum_{t=n_0+1}^{n}x_t}{n-n_0}-x\right)^{\top}\Sigma\Omega^{-1}\Sigma\left(\frac{\sum_{t=n_0+1}^{n}x_t}{n-n_0}-x\right)\leq\chi^2_{d,0.05}\right\},$$

where $\chi^2_{d,0.05}$ is the $0.95$-quantile of the chi-squared distribution with $d$ degrees of freedom. Notably, for a test vector $\omega\in\mathbb{R}^d$ with $\|\omega\|=1$, we can define the random variable

$$Z=\frac{\sqrt{B}}{\sigma\sqrt{\omega^{\top}\Sigma^{-1}\Omega\Sigma^{-1}\omega}}\,\frac{\sum_{t=n_0+1}^{n}\omega^{\top}(x_t-x^*)}{\sqrt{n-n_0}}\Rightarrow\mathcal{N}(0,1),\quad\text{as}\quad n\to\infty, \tag{4.16}$$

based on which one can construct a one-dimensional, asymptotically exact confidence interval for $\omega^{\top}x^*$,

$$I_n^{\omega}=\left[\frac{\omega^{\top}\sum_{t=n_0+1}^{n}x_t}{n-n_0}-z_{0.025}\frac{\sigma\sqrt{\omega^{\top}\Sigma^{-1}\Omega\Sigma^{-1}\omega}}{\sqrt{B(n-n_0)}},\ \frac{\omega^{\top}\sum_{t=n_0+1}^{n}x_t}{n-n_0}+z_{0.025}\frac{\sigma\sqrt{\omega^{\top}\Sigma^{-1}\Omega\Sigma^{-1}\omega}}{\sqrt{B(n-n_0)}}\right],$$

where $z_{0.025}$ is the $0.975$-quantile of the standard normal distribution. In particular, $\Pr(\omega^{\top}x^*\in I_n^{\omega})\to 0.95$ as $n\to\infty$. In Section 5.2 below, we conduct a simulation experiment on constructing the confidence intervals and report the outcomes in Figure 9. The proposed construction achieves convincing coverage for averaged SGDM, and its performance benefits from the lower sensitivity to the learning rate. The confidence interval can also serve as an uncertainty quantification of the averaged SGDM estimator $\frac{\omega^{\top}\sum_{t=n_0+1}^{n}x_t}{n-n_0}$, and one may use the length of the confidence interval as a criterion for stopping the algorithm or for other decision-making purposes.
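The two constructions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy inputs are hypothetical, and the sketch works with $S=\sigma^2\Omega=\mathbb{E}[(A_\xi x^*-b_\xi)(A_\xi x^*-b_\xi)^{\top}]$ (by (4.15)), so that $\sigma$ cancels and never has to be supplied separately. The chi-squared critical value is passed in by the caller; for $d=2$ it equals $-2\log(0.05)$ exactly.

```python
import numpy as np
from statistics import NormalDist

def projection_interval(xbar, omega, Sigma, S, B, n_eff, level=0.95):
    """Two-sided interval for omega^T x*, centered at omega^T xbar.

    S = sigma^2 * Omega is the second-moment matrix of the gradient at x*;
    n_eff = n - n0 is the number of averaged iterates.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)     # 0.975-quantile when level = 0.95
    u = np.linalg.solve(Sigma, omega)               # Sigma^{-1} omega (Sigma symmetric)
    half = z * np.sqrt(u @ S @ u / (B * n_eff))
    center = omega @ xbar
    return center - half, center + half

def in_confidence_region(x, xbar, Sigma, S, B, n_eff, chi2_crit):
    """Ellipsoid test: B*n_eff*(xbar-x)^T Sigma S^{-1} Sigma (xbar-x) <= chi2_crit."""
    r = Sigma @ (xbar - x)
    return B * n_eff * (r @ np.linalg.solve(S, r)) <= chi2_crit

# Toy inputs (hypothetical): identity Hessian and gradient second moment, d = 2.
Sigma, S = np.eye(2), np.eye(2)
xbar = np.array([0.3, -0.1])
lo, hi = projection_interval(xbar, np.array([1.0, 0.0]), Sigma, S, B=4, n_eff=100)
crit = -2.0 * np.log(0.05)                          # chi-squared 0.95-quantile for d = 2
inside = in_confidence_region(xbar + np.array([0.05, 0.0]), xbar, Sigma, S, 4, 100, crit)
outside = in_confidence_region(xbar + np.array([0.2, 0.0]), xbar, Sigma, S, 4, 100, crit)
print(lo, hi, inside, outside)
```

In practice $\Sigma$ and $S$ would be replaced by estimates, which this sketch does not address.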

4.2 Averaged SGDM under general losses

For the general strongly convex loss function, we provide the convergence rate of the averaged SGDM in the following theorem.

Theorem 17

For $\overline{L}>0$, suppose that the conditions in Theorem 9 hold, and the $t$-th iteration $(\widetilde{m}_t,x_t)$ satisfies, for fixed $\delta\in(0,\frac{1}{2}]$,

$$\sqrt{\left\|\widetilde{m}_t\right\|^2+\left\|x_t-x^*\right\|^2}\leq\frac{C_1}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_2\left\|x_1-x^*\right\|\lambda^{(1-\delta)(t-1)},\quad 1\leq t\leq T. \tag{4.17}$$

Then for $2n_0\leq n\leq T$, where $\lambda^{n_0}=B^{-1/2}(1-\lambda)^{1/2}$, with probability at least $1-4T^{-1}$, we have

$$\frac{\sum_{t=n_0+1}^{n}(x_t-x^*)}{n-n_0}=-\frac{\sum_{t=n_0+1}^{n}\sum_{i=1}^{B}\Sigma^{-1}\triangledown f_{\xi_{ti}}(x^*)}{B(n-n_0)}+R_n, \tag{4.18}$$

where

$$\left\|R_n\right\|\leq\frac{\tilde{C}_1+\tilde{C}_3}{\sqrt{B}\,n}+\frac{\tilde{C}_2}{B\sqrt{n}}+\frac{\tilde{C}_4}{B}, \tag{4.19}$$

$$\tilde{C}_1=\alpha L_f M\frac{16c(2\log T+4d)C_2\left\|x_1-x^*\right\|}{(1-\lambda)^{3/2}}+\alpha\sigma M\frac{4\sqrt{3}c(2\log T+4d)}{(1-\lambda)^{3/2}}+\frac{2M\left\|x_1-x^*\right\|}{(1-\lambda)^{1/2}},$$

$$\tilde{C}_2=\alpha L_f\sigma M\frac{8\sqrt{2}c(2\log T+4d)C_1\alpha}{(1-\lambda)^{3/2}},\quad\tilde{C}_3=8\alpha M\overline{L}\frac{C_2^2\left\|x_1-x^*\right\|^2\left(n_0(1-\lambda)+1\right)}{(1-\lambda)^{3/2}},\quad\tilde{C}_4=4\alpha\sigma^2 M\overline{L}\frac{C_1^2\alpha^2}{(1-\lambda)^2}.$$

Here $M$ is defined in (3.6).

For non-quadratic settings with $\overline{L}>0$, there are two additional terms, $\tilde{C}_3/(\sqrt{B}n)$ and $\tilde{C}_4/B$, in (4.19) compared to the case $\overline{L}=0$. As $n\to\infty$, the term $\tilde{C}_4/B$ appears as a non-vanishing bias if the batch size $B$ and the learning rate $\alpha$ both stay fixed, in which case the asymptotic normality does not hold. Nonetheless, if one decays $\alpha$ or increases $B$ appropriately as $n$ increases, the asymptotic normality result continues to hold, as stated in the following corollary.

Corollary 18

Following Theorem 17, we have

$$\frac{\sqrt{B}}{\sigma}\,\frac{\sum_{t=n_0+1}^{n}(x_t-x^*)}{\sqrt{n-n_0}}\Rightarrow\mathcal{N}(0,\Sigma^{-1}\Omega\Sigma^{-1}),\quad\text{as}~\left(\frac{\sqrt{\alpha n}}{\log(B/\alpha)},\frac{\sqrt{B}}{\alpha\sqrt{n}}\right)\to\infty,$$

where

$$\Omega=\frac{1}{\sigma^2}\mathbb{E}\left[\triangledown f_\xi(x^*)\triangledown f_\xi(x^*)^{\top}\right]. \tag{4.20}$$
Remark 19

The conditions on $n$, $B$, and $\alpha$ in this corollary are more restrictive than those in Corollary 15 due to the presence of the approximation errors $\overline{L}\left\|x-x^*\right\|^2$. For a decaying learning rate $\alpha$, the condition on the batch size is much relaxed, since a small learning rate reduces the error caused by randomness, as shown in the $C_1$ term in (4.17). In particular, when the batch size $B$ is fixed, the condition of Corollary 18 is met for $\alpha=\Theta(n^{-\epsilon})$ with $\epsilon\in(\frac{1}{2},1)$ as $n\to\infty$. With diverging $n$ and $B$, the nearly optimal learning rate is $\alpha=\Theta((n/\sqrt{B})^{-2/3})$, and the corresponding rate of convergence to asymptotic normality is $\mathcal{O}((nB)^{-1/3}\log^2(nB))$.
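As a quick check of the fixed-batch claim, one can substitute $\alpha=c\,n^{-\epsilon}$ into the two quantities that Corollary 18 requires to diverge. The constant $c>0$ is a hypothetical placeholder for the factor hidden in the $\Theta(\cdot)$ notation, and $B$ is held fixed.

```latex
\begin{align*}
\frac{\sqrt{\alpha n}}{\log(B/\alpha)}
  &= \frac{\sqrt{c}\,n^{(1-\epsilon)/2}}{\log(B/c)+\epsilon\log n}\to\infty
  && \text{when } 0<\epsilon<1, \\
\frac{\sqrt{B}}{\alpha\sqrt{n}}
  &= \frac{\sqrt{B}}{c}\,n^{\epsilon-1/2}\to\infty
  && \text{when } \epsilon>\tfrac{1}{2}.
\end{align*}
```

Both requirements hold simultaneously exactly when $\epsilon\in(\frac{1}{2},1)$. By contrast, the conditions of Corollary 15 read $\alpha n=c\,n^{1-\epsilon}\to\infty$ and $B/\alpha=(B/c)\,n^{\epsilon}\to\infty$, which hold for every $\epsilon\in(0,1)$; the lower bound $\epsilon>\frac{1}{2}$ is the extra price of the non-quadratic term.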

5 Experiments

In this section, we support our theoretical results with simulations in a quadratic example and a logistic loss example of the general strongly convex loss, and real-data experiments of a multinomial logistic regression on MNIST hand-written digit classification.

In the numerical experiments, we verify the convergence results established in the paper on a training set of size $N$. Setting $\xi$ to be a discrete uniform random variable with values in $\{1,2,\dots,N\}$, we minimize $\frac{1}{N}\sum_{i=1}^{N}f_i(x)$ over $x$ as an example of the stochastic optimization model (1.1), where the objective is either the empirical risk $\mathbb{E}[f_\xi(x)]=\frac{1}{N}\sum_{i=1}^{N}f_i(x)$ or, more specifically in maximum likelihood estimation, the negative log-likelihood. In all the simulation results below, we repeat the experiment 200 times. For the real-data analysis on MNIST, we make the experiment replicable by fixing the random seeds to 1, 2, and 3 in training.

5.1 Simulation: quadratic loss

In a quadratic loss model (3.5), we first conduct a simulation study with a sample of size $N=20,000$ and dimension $d=10$. We generate the parameters $\{(A_i,b_i)\}_{i=1}^{N}$ in (3.5) i.i.d., with $b_i\sim\mathcal{N}(0_d,I_d)$ and the positive definite matrix $A_i=\rho V_i^{\top}V_i+10I_d$. Here $V_i$ is a matrix in $\mathbb{R}^{d\times d}$ whose rows are each drawn from the normal distribution $\mathcal{N}(0_d,I_d)$, and $\rho>0$ affects the condition number $L/\mu$. We choose $\rho=1$, for which the average condition number $L/\mu$ is $35/10$. The deterministic minimizer $x^*$ is computed by $x^*=\left(\sum_{i=1}^{N}A_i\right)^{-1}\left(\sum_{i=1}^{N}b_i\right)$.
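The data-generating process described above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' script: the function names are hypothetical, the default $N=2000$ is smaller than the paper's $N=20,000$ to keep the example light, and the loss is assumed to have the form $f_i(x)=\frac{1}{2}x^{\top}A_ix-b_i^{\top}x$, which is consistent with the stated minimizer. Averaging the per-sample extreme eigenvalues is only one plausible reading of "average condition number".

```python
import numpy as np

def make_quadratic_problem(N=2000, d=10, rho=1.0, shift=10.0, seed=0):
    """Draw (A_i, b_i) with A_i = rho * V_i^T V_i + shift * I_d and b_i ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((N, d, d))               # each row of V_i is N(0_d, I_d)
    A = rho * np.transpose(V, (0, 2, 1)) @ V + shift * np.eye(d)
    b = rng.standard_normal((N, d))
    return A, b

def minimizer(A, b):
    """x* = (sum_i A_i)^{-1} (sum_i b_i)."""
    return np.linalg.solve(A.sum(axis=0), b.sum(axis=0))

A, b = make_quadratic_problem()
x_star = minimizer(A, b)

# Per-sample eigenvalues (ascending along the last axis), averaged over samples.
eigs = np.linalg.eigvalsh(A)
mu_avg, L_avg = eigs[:, 0].mean(), eigs[:, -1].mean()
print(f"average smallest eigenvalue {mu_avg:.1f}, average largest eigenvalue {L_avg:.1f}")
```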

We consider the mini-batch SGDM with batch size $B=0.2N$, sampled with replacement. The learning rate is fixed at $\alpha=0.001$. According to Theorem 3, we choose the momentum weight as

$$\gamma=\left(\frac{1-\mu\alpha}{1+\mu\alpha}\right)^2. \tag{5.21}$$

We refer to the above $\gamma$ as the adaptive momentum weight in SGDM (SGDM-adap), and we compare it with other fixed values of $\gamma$, as well as with SGD as the special case of SGDM with $\gamma=0$, in Figure 2.
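A minimal sketch of the adaptive weight (5.21) and a mini-batch SGDM loop is given below. The update rule is an assumption, since it is not restated in this section: the sketch uses the normalized form $m_t=\gamma m_{t-1}+(1-\gamma)g_t$, $x_t=x_{t-1}-\alpha m_t$, with $g_t$ the mini-batch gradient of the assumed loss $f_i(x)=\frac{1}{2}x^{\top}A_ix-b_i^{\top}x$. The function names, the small problem size, and the starting point are hypothetical.

```python
import numpy as np

def adaptive_momentum(mu, alpha):
    """Momentum weight from (5.21): ((1 - mu*alpha) / (1 + mu*alpha))**2."""
    return ((1.0 - mu * alpha) / (1.0 + mu * alpha)) ** 2

def sgdm(A, b, alpha, gamma, B, n_iters, x0=None, seed=0):
    """Mini-batch SGDM (assumed normalized update); gamma = 0 recovers mini-batch SGD."""
    rng = np.random.default_rng(seed)
    N, d = b.shape
    x = np.zeros(d) if x0 is None else np.array(x0, dtype=float)
    m = np.zeros(d)
    path = np.empty((n_iters, d))
    for t in range(n_iters):
        idx = rng.integers(0, N, size=B)                  # sample with replacement
        g = A[idx].mean(axis=0) @ x - b[idx].mean(axis=0) # mini-batch gradient
        m = gamma * m + (1.0 - gamma) * g
        x = x - alpha * m
        path[t] = x
    return path

# Hypothetical small instance of the quadratic model.
rng = np.random.default_rng(1)
N, d = 500, 5
V = rng.standard_normal((N, d, d))
A = np.transpose(V, (0, 2, 1)) @ V + 10.0 * np.eye(d)
b = rng.standard_normal((N, d))
x_star = np.linalg.solve(A.sum(axis=0), b.sum(axis=0))

gamma = adaptive_momentum(mu=10.0, alpha=0.001)
path = sgdm(A, b, alpha=0.001, gamma=gamma, B=100, n_iters=2000, x0=np.ones(d))
print(f"gamma = {gamma:.4f}, final error = {np.linalg.norm(path[-1] - x_star):.2e}")
```

With $\mu=10$ and $\alpha=0.001$, formula (5.21) evaluates to $\gamma=(0.99/1.01)^2\approx 0.9608$.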

Figure 2: Performance of SGD and SGDM on the quadratic loss; panels: (a) small $\gamma$, (b) large $\gamma$. The average value of the adaptive momentum weight is 0.96.

Figure 2 supports the theoretical results in Theorems 1 and 3. Both SGD and SGDM enjoy linear convergence at the beginning and have the same order of error at the end. The adaptive momentum SGDM-adap with $\gamma=(1-\phi)^2/(1+\phi)^2$ in (5.21) converges fastest, where the average value of $\gamma$ over 200 experiments is $0.95$. The acceleration of convergence with small momentum weights is not significant, while SGDM with a very large $\gamma$ becomes slower and unstable. In particular, SGDM with $\gamma=0.5$ converges as fast as SGD, and SGDM with momentum weight 0.9 converges faster than SGD. For the large momentum weight $\gamma=0.99$, the linear convergence factor is $\sqrt{\gamma}$, and Figure 2 shows that its convergence is the slowest.

We can see that appropriate momentum weights lead to faster convergence, as shown in Figure 2(a). However, when the momentum weight is exceedingly large, SGDM may cause the $\ell_2$ error to oscillate, as shown in Figure 2(b). This is because momentum causes the algorithm to continue moving in the direction of past gradients, even if the current gradient is opposite to the momentum. As a result, SGDM oscillates back and forth near the minimum of the loss function instead of converging steadily towards it. Moreover, it also leads to a decrease in the convergence speed. The observation matches the finding in Theorem 3.

We further compare the averaged SGD and SGDM. We set $n_0$ in Theorem 13 to $n_0=200$ and $500$, and report the convergence of the averaged SGD and SGDM in terms of

$$\bar{x}_{n_0+t}=\frac{\sum_{j=n_0+1}^{n_0+t}x_j}{t}. \tag{5.22}$$
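The burn-in average (5.22) can be computed for all $t$ at once with a cumulative sum. The sketch below is illustrative; the function name and the convention that row $j-1$ of `path` stores $x_j$ are assumptions of this example.

```python
import numpy as np

def tail_averages(path, n0):
    """Running averages xbar_{n0+t}, t = 1, ..., n - n0, as in (5.22).

    path[j - 1] holds the iterate x_j; the first n0 iterates are discarded as burn-in.
    """
    tail = np.asarray(path, dtype=float)[n0:]
    t = np.arange(1, tail.shape[0] + 1)[:, None]
    return np.cumsum(tail, axis=0) / t

# One-dimensional toy trajectory x_1, ..., x_4 with n0 = 2:
path = np.array([[1.0], [3.0], [5.0], [7.0]])
print(tail_averages(path, n0=2).ravel())   # averages of (x_3) and of (x_3, x_4): [5. 6.]
```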
Figure 3: Performance of Averaged SGD and Averaged SGDM on the quadratic loss; panels: (a) small $\gamma$, (b) large $\gamma$. The average value of the adaptive momentum weight is 0.96.

Figure 4: Performance of Averaged SGD and Averaged SGDM on the quadratic loss with $n_0=500$; panels: (a) small $\gamma$, (b) large $\gamma$.

In Figures 3 and 4, averaged SGDM with the adaptive momentum weight defined in (5.21) converges faster than the others. Specifically, it converges significantly faster than the averaged SGD. When $n_0=500$, Figure 4 illustrates that the averaging technique reduces the order of the bias from $10^{-8}$, as seen in Figure 2, to $10^{-9}$.

We further conduct a simulation study to verify the asymptotic normality result of Corollary 15. In this experiment, we evaluate the one-dimensional projection statistic $Z$ in (4.16) with $n_0=1000$, $n=2000$ and $\Omega=\left(N\sigma^2\right)^{-1}\sum_{i=1}^{N}(A_ix^*-b_i)(A_ix^*-b_i)^{\top}$. We generate the trajectory $\{x_1,\cdots,x_n\}$ 1000 times and obtain the replications $Z^{(1)},Z^{(2)},\cdots,Z^{(1000)}$.
Figure 5 shows the frequency of $\{Z^{(1)},Z^{(2)},\cdots,Z^{(1000)}\}$. From Figure 5 we can see that both averaged SGD and averaged SGDM approximate the normal distribution well. We can further see that the frequency for averaged SGDM is slightly closer to the normal distribution than that for averaged SGD under the finite number of rounds $n=2000$. These observations are consistent with the theoretical results established in Theorem 13.

Figure 5: Frequency of $Y$ for Averaged SGD and Averaged SGDM with $\gamma=0.9$.

5.2 Sensitivity to learning rates

In this section, we show that SGDM and averaged SGDM admit a wider range of tunable learning rates than SGD and averaged SGD. In the quadratic loss model with $N=20,000$ and $d=10$, we generate the parameters $\{(A_{i},b_{i})\}_{i=1}^{N}$ in (3.5) i.i.d., with $b_{i}\sim\mathcal{N}(0_{d},I_{d})$ and the positive definite matrix $A_{i}=V_{i}^{\top}V_{i}+I_{d}$, where each row of $V_{i}$ is generated from the normal distribution $\mathcal{N}(0_{d},I_{d})$. The average condition number $L/\mu$ is $26/1$.
The learning rates are chosen from $\{2^{1},2^{0},2^{-1},2^{-2},\cdots\}$. The batch size is $B=0.2N$.
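The data-generating step above can be sketched as follows. This is only an illustration under stated assumptions: the source does not give the number of rows of $V_{i}$, so the sketch takes $V_{i}$ to be $d\times d$; it uses a smaller $N$ than the paper for speed; and the helper name `generate_quadratic_data` is hypothetical.

```python
import numpy as np

def generate_quadratic_data(N, d, rng):
    """Draw (A_i, b_i) with A_i = V_i^T V_i + I_d and b_i ~ N(0_d, I_d)."""
    V = rng.standard_normal((N, d, d))          # assumption: each V_i is d x d
    A = np.swapaxes(V, 1, 2) @ V + np.eye(d)    # batched V_i^T V_i + I_d, shape (N, d, d)
    b = rng.standard_normal((N, d))
    return A, b

rng = np.random.default_rng(0)
A, b = generate_quadratic_data(N=2000, d=10, rng=rng)   # smaller N than the paper

eigvals = np.linalg.eigvalsh(A)                 # ascending eigenvalues of each A_i
cond_numbers = eigvals[:, -1] / eigvals[:, 0]   # per-matrix condition number
print("smallest eigenvalue over all A_i:", eigvals.min())
print("average condition number of A_i:", cond_numbers.mean())
```

Adding $I_{d}$ bounds every eigenvalue of $A_{i}$ below by one, which is what makes each $A_{i}$ positive definite regardless of the draw of $V_{i}$.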

Figure 6: Performance of SGD and SGDM with $\gamma=0.8$ under different learning rates. SGD fails to converge with $\alpha=2^{-1}$ and $\alpha=2^{-3}$.

Figure 6 shows the convergence behaviors of SGD and SGDM for a wide range of learning rates $\alpha=2^{-1},2^{-3},2^{-5},2^{-7},2^{-9}$. Notably, SGD fails to converge with learning rates $\alpha\geq 2^{-3}$. In contrast, SGDM with $\gamma=0.8$ exhibits robust convergence, converging even with the larger learning rate $\alpha=2^{-1}$. This comparison emphasizes that SGDM is less sensitive to the learning rate. Figure 7 further illustrates the finite-sample errors of SGD and SGDM across different learning rates, given a fixed number of iterations $T=500$. Intuitively, smaller learning rates require more iterations to converge and are therefore associated with an increased bias for each method at a fixed $T$. Notably, SGDM is capable of converging with a relatively large learning rate. The maximum learning rate ensuring convergence is $\alpha=2^{-4}$ for SGD, $\alpha=2^{-1}$ for SGDM with $\gamma=0.8$ and $\alpha=2^{0}$ for SGDM with $\gamma=0.9$.
Reflecting upon the learning rate condition $\alpha L<(1+\gamma)/(1-\gamma)$, it can be confirmed that $(1+\gamma)/(1-\gamma)=9\approx 2^{-1}/2^{-4}=8$ for $\gamma=0.8$ and $(1+\gamma)/(1-\gamma)=19\approx 2^{0}/2^{-4}=16$ for $\gamma=0.9$. This agreement indicates that the figure is consistent with the theoretical analysis.
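As a quick arithmetic check of this comparison, the following sketch evaluates $(1+\gamma)/(1-\gamma)$ and sets it against the ratio of the largest convergent learning rates reported above. The function name `momentum_factor` is ours and is used for illustration only.

```python
def momentum_factor(gamma: float) -> float:
    """Right-hand side of the condition alpha * L < (1 + gamma) / (1 - gamma)."""
    return (1.0 + gamma) / (1.0 - gamma)

# Largest convergent learning rates as reported in the text.
alpha_max_sgd = 2.0 ** -4
alpha_max_sgdm = {0.8: 2.0 ** -1, 0.9: 2.0 ** 0}

for gamma, alpha_max in alpha_max_sgdm.items():
    observed_ratio = alpha_max / alpha_max_sgd   # empirical gain over SGD
    predicted = momentum_factor(gamma)           # factor from the condition
    print(f"gamma={gamma}: predicted {predicted:.1f}, observed {observed_ratio:.1f}")
```

Since the learning rates are searched on a grid of powers of two, the observed ratios can only be resolved up to a factor of two; the predicted factors 9 and 19 fall between the observed ratios (8 and 16) and the next grid points (16 and 32).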

Figure 7: Performance of SGD and SGDM with $\gamma=0.8,~0.9$ under different learning rates, where $T=500$ is fixed.
Figure 8: Performance of Averaged SGD and Averaged SGDM with $\gamma=0.8$ under different learning rates.

Moreover, we examine the performance of averaged SGD and averaged SGDM across a range of learning rates. For both algorithms, we set $n_{0}=500$ and define $\bar{x}_{n_{0}+t}$ as in (5.22). Figure 8 illustrates that all algorithms exhibit a sublinear rate of convergence. SGDM with learning rates ranging from $\alpha=2^{-1}$ to $2^{-7}$, as well as SGD with learning rates $\alpha=2^{-5},2^{-7}$, converge to a similar error level. Compared to Figure 6, we can see that the averaging technique improves convergence and is less dependent on the learning rate. However, it is noteworthy that algorithms with learning rate $\alpha=2^{-9}$ require a substantially larger initial iteration count $n_{0}$ to mitigate the effect of early iterates with significant deviations.
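To make the role of the burn-in $n_{0}$ concrete, here is a minimal sketch of averaging with the first $n_{0}$ iterates discarded. It assumes that $\bar{x}_{n_{0}+t}$ is the plain mean of $x_{n_{0}+1},\dots,x_{n_{0}+t}$; the exact definition is the one in (5.22), and the helper `averaged_iterates` is hypothetical.

```python
import numpy as np

def averaged_iterates(iterates, n0):
    """Running mean of x_{n0+1}, ..., x_{n0+t} for t = 1, 2, ...

    Rows of `iterates` are x_1, x_2, ...; the first n0 rows are discarded.
    """
    tail = np.asarray(iterates, dtype=float)[n0:]
    counts = np.arange(1, tail.shape[0] + 1)[:, None]
    return np.cumsum(tail, axis=0) / counts

xs = np.array([[10.0], [8.0], [2.0], [4.0], [0.0]])   # toy 1-D trajectory x_1..x_5
print(averaged_iterates(xs, n0=2).ravel())             # means of x_3, x_3..x_4, x_3..x_5
```

In the toy trajectory the two large early iterates are excluded from the average, which is the effect a larger $n_{0}$ is meant to achieve when the learning rate is small.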

Figure 9: The probability $P(|Z|<1.96)$ for SGD and SGDM with $\gamma=0.8,~0.9$ under different learning rates, where $Z$ is defined as (4.16).

Additionally, we investigate the asymptotic normality outlined in Corollary 15. Figure 9 illustrates the asymptotic behavior of the statistic $Z$ in (4.16) with $n=1000$ and $n_{0}=500$, and supports its convergence to $\mathcal{N}(0,1)$ under a range of learning rates. The $y$-axis displays the empirical probability $P(|Z|<1.96)$, that is, the frequency with which the statistic $Z$ falls within the range $(-1.96,1.96)$ across $1000$ trials; for a standard normal $Z$ this probability is $95\%$. The $x$-axis specifies the learning rates employed in the numerical experiments. Figure 9 supports the assertions in the corollary, demonstrating the robustness of the asymptotic normality with respect to different learning rates. The averaged SGDM with large momentum permits a broader selection of learning rates and exhibits reduced sensitivity to their variation.
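The quantity on the $y$-axis can be computed from the replications of $Z$ as sketched below. The statistic itself is defined in (4.16) and is not reproduced here; exact standard normal draws are used as placeholder replications, and the function name `empirical_coverage` is ours.

```python
import numpy as np

def empirical_coverage(z, critical_value=1.96):
    """Fraction of replications with |Z| < critical_value."""
    z = np.asarray(z, dtype=float)
    return float(np.mean(np.abs(z) < critical_value))

# Placeholder replications: N(0, 1) draws stand in for the 1000 values of Z.
rng = np.random.default_rng(0)
z_replications = rng.standard_normal(1000)
print(empirical_coverage(z_replications))
```

If $Z$ is close to standard normal, the returned value should be close to $0.95$, up to Monte Carlo error from the finite number of trials.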

5.3 Simulation: logistic regression

In Example 1, we generate the data $a_{i}\in\mathbb{R}^{d}$ i.i.d. from $\mathcal{N}(0_{d},I_{d})$, and $b_{i}\in\{0,1\}$ is generated by $b_{i}=1$ with probability $p_{x}(a_{i})$ and $b_{i}=0$ otherwise. Here $p_{x}(a)=1/(1+\exp(-x^{\top}a))$, and we use $x=\frac{1}{\sqrt{d}}(1,1,\cdots,1)^{\top}$. We fix the sample size $N=20,000$ and the dimension $d=10$. We set the regularization parameter $\nu$ to zero. Before each simulation, we run full-batch gradient descent to obtain the minimizer $x^{*}$, then compute the Hessian matrix at $x^{*}$,

$$\Sigma=\frac{1}{N}\sum_{i=1}^{N}p_{x^{*}}(a_{i})(1-p_{x^{*}}(a_{i}))a_{i}a_{i}^{\top}.$$
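The data generation and the Hessian formula above can be sketched as follows. The paper evaluates the Hessian at the minimizer $x^{*}$ obtained by full-batch gradient descent; to keep the sketch short, the generating parameter $x$ is used as a stand-in for $x^{*}$, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def logistic_hessian(a, x):
    """(1/N) * sum_i p_x(a_i) (1 - p_x(a_i)) a_i a_i^T, with a_i the rows of `a`."""
    p = sigmoid(a @ x)                      # p_x(a_i), shape (N,)
    w = p * (1.0 - p)                       # per-sample curvature weights
    return (a * w[:, None]).T @ a / a.shape[0]

rng = np.random.default_rng(0)
N, d = 20_000, 10
x_true = np.ones(d) / np.sqrt(d)            # x = (1, ..., 1)^T / sqrt(d)
a = rng.standard_normal((N, d))             # a_i ~ N(0_d, I_d)
b = (rng.random(N) < sigmoid(a @ x_true)).astype(float)   # b_i = 1 w.p. p_x(a_i)

# Stand-in: evaluated at x_true rather than at the minimizer x*.
H = logistic_hessian(a, x_true)
print(np.linalg.eigvalsh(H))                # eigenvalues of the estimated Hessian
```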

We specify the batch size $B=0.2N$ and the learning rate $\alpha=0.5$. We consider SGDM with fixed momentum weights $\gamma=0.3,0.5,0.7,0.8,0.9$, as well as the adaptive momentum weight in (5.21). Under this data generation procedure, the average of the adaptive momentum weight is $0.75$.

Figure 10: Performance of SGD and SGDM on the logistic loss, with panels (a) small $\gamma$ and (b) large $\gamma$. The average value of the adaptive momentum weight is 0.75.

Figure 10 illustrates that for momentum weights $\gamma$ less than 0.5, SGDM converges slightly faster than SGD, while for $\gamma=0.7$ and the adaptive weight $\gamma$ specified in (5.21), the convergence is much faster than that of SGD. For $\gamma=0.8$, the convergence is similar to that of the adaptive weight but less stable. For $\gamma=0.9$, the convergence is much slower and more unstable.

Figure 11: Performance of Averaged SGD and Averaged SGDM on the logistic loss, with panels (a) small $\gamma$ and (b) large $\gamma$. The average value of the adaptive momentum weight is 0.75.
Figure 12: Performance of Averaged SGD and Averaged SGDM on the logistic loss with $n_{0}=40$, with panels (a) small $\gamma$ and (b) large $\gamma$.

To further compare the convergence of averaged SGD and SGDM, we set $n_{0}$ in Theorem 13 to $n_{0}=10$ and $40$. We plot the $\ell_{2}$ errors of averaged SGDM and averaged SGD for comparison. Figure 11(a) shows that for momentum weights smaller than or equal to the adaptive momentum weight in (5.21), averaged SGDM converges faster than averaged SGD. In Figure 11(b), we can see that averaged SGDM with $\gamma=0.8$ converges even faster than averaged SGDM with adaptive $\gamma$, while for $\gamma=0.9$, averaged SGDM converges faster at the beginning but then fluctuates, which causes its $\ell_{2}$ error to increase. In comparison with Figure 2(b) for the quadratic loss, we can see that although a large momentum weight causes fluctuation in both cases, the convergence is flatter for the logistic loss in Figure 10(b). This may also explain why averaged SGDM performs well with $\gamma$ slightly larger than the adaptive momentum weight in Figure 11(b). Figure 12 demonstrates that with $n_{0}=40$, the order of the bias improves from $10^{-3}$, as seen in Figure 10, to $10^{-4}$. This improvement reflects a sublinear convergence that agrees with the behavior observed in Figure 4 for the quadratic loss. Consequently, this suggests that averaged SGDM has reduced sensitivity to the choice of a smaller $n_{0}$.

5.4 Real data: MNIST classification

In this section, we consider multinomial logistic regression on the MNIST dataset, which consists of $N=60,000$ images of handwritten digits of size $28\times 28$. We reshape the images into vectors of size $784\times 1$. For the samples $(a_{i},b_{i})$, where $a_{i}\in\mathbb{R}^{784}$ are the vectorized images and $b_{i}\in\mathbb{R}^{10}$ are one-hot indicators corresponding to the digits $0,1,\cdots,9$, the loss function is $f_{i}(X)=H(X^{\top}a_{i},b_{i})$, where $X\in\mathbb{R}^{784\times 10}$ and $H$ is the multi-class cross entropy. We specify the batch size $B=256$ and the learning rate $\alpha=1.0$. Due to the difficulty of computing the condition number $L/\mu$, we use the fixed momentum weights $\gamma=0.1,0.3,0.5,0.7,0.9,0.99$ in training.
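For concreteness, the per-sample loss $f_{i}(X)=H(X^{\top}a_{i},b_{i})$ can be sketched as below. The sketch assumes that $H$ applies a softmax to the logits $X^{\top}a_{i}$ before taking the cross entropy with the one-hot label, which is the usual convention for multinomial logistic regression; the function name is ours.

```python
import numpy as np

def softmax_cross_entropy(X, a, b):
    """Cross entropy between softmax(X^T a) and the one-hot label b."""
    logits = X.T @ a                         # shape (10,)
    logits = logits - logits.max()           # shift for numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return float(-np.sum(b * log_probs))

X = np.zeros((784, 10))                      # zero weights give uniform class probabilities
a = np.ones(784)                             # placeholder vectorized image
b = np.eye(10)[3]                            # one-hot label for digit 3
print(softmax_cross_entropy(X, a, b))        # equals log(10) under uniform probabilities
```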

Figure 13: Performance of SGD and SGDM on MNIST, with panels (a) small $\gamma$ and (b) large $\gamma$.

Figure 13 shows the convergence of the training loss over iterations $t$. For clarity of presentation, the reported training loss at each iteration $t$ is an average of the mini-batch stochastic losses $g_{\eta_{t}}$ over the past $\lfloor N/B\rfloor$ batches, where $B$ is the batch size.
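The smoothing described above amounts to a trailing mean over a window of $\lfloor N/B\rfloor$ mini-batch losses. A minimal sketch follows; how the first few iterations (with fewer than a full window of past batches) are handled is not stated in the text, so averaging over the batches available so far is an assumption, and the function name is hypothetical.

```python
from collections import deque

def trailing_mean_losses(batch_losses, N, B):
    """Average each mini-batch loss with its predecessors over a window of floor(N / B) batches."""
    window = N // B
    recent = deque(maxlen=window)            # keeps only the last `window` losses
    smoothed = []
    for loss in batch_losses:
        recent.append(loss)
        smoothed.append(sum(recent) / len(recent))
    return smoothed

# For MNIST with N = 60,000 and B = 256, the window is 60000 // 256 = 234 batches.
print(trailing_mean_losses([4.0, 2.0, 0.0, 2.0], N=4, B=2))   # toy example, window of 2
```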

From Figure 13, we see that SGDM with momentum weight $\gamma$ ranging from $0.1$ to $0.9$ greatly outperforms SGD. SGDM with $\gamma=0.5$ converges fastest in the earlier iterations, while the training losses are almost identical for different momentum weights in the later iterations, except for $\gamma=0.99$. The convergence rate is not very sensitive to the momentum weight, and a wide range of $\gamma\in[0.3,0.9]$ leads to similar performance.

6 Conclusion

Our study sheds light on the performance of mini-batch SGDM in solving optimization problems under strongly convex loss functions. The convergence rate of mini-batch SGDM is influenced by several factors, including the batch size, the momentum weight, and the learning rate. Our analysis rigorously shows that certain choices of momentum weight with a reasonably large batch size can lead to faster convergence compared to SGD. This finding is consistent with previous numerical studies on SGDM. Additionally, our findings, supported by theoretical analysis and numerical experiments, indicate that SGDM permits a broader selection of learning rates.

We further establish the asymptotic normality of averaged SGDM in the quadratic setting and show that it has a non-vanishing bias in the general setting. Our investigation reveals that averaged SGDM is asymptotically equivalent to averaged SGD. By minimizing the remainder term, we give the optimal learning rate and the corresponding rate of convergence to asymptotic normality. In addition, we present the asymptotic covariance matrix for averaged SGDM, enabling uncertainty quantification of the algorithm outputs and statistical inference of the true model parameters based on SGDM, as opposed to SGD.

In summary, our study contributes to the theoretical understanding of mini-batch SGDM and has practical implications for developing efficient optimization algorithms in machine learning. It is worth noting that our study assumes that the loss function is smooth and strongly convex. This work can be extended in several ways. It would be interesting to investigate the performance of mini-batch SGDM under non-convex loss functions, since this would have important implications for the application of SGDM in deep learning. It is also of interest to extend the analysis to the case when the learning rate decays over time.


References

  • Bollapragada et al. (2022) Bollapragada, R., T. Chen, and R. Ward (2022). On the fast convergence of minibatch heavy ball momentum. arXiv preprint arXiv:2206.07553.
  • Bottou et al. (2018) Bottou, L., F. E. Curtis, and J. Nocedal (2018). Optimization methods for large-scale machine learning. SIAM Review 60(2), 223–311.
  • Chen et al. (2023) Chen, X., Z. Lai, H. Li, and Y. Zhang (2023). Online statistical inference for stochastic optimization via Kiefer-Wolfowitz methods. Journal of the American Statistical Association (To appear).
  • Chen et al. (2020) Chen, X., J. D. Lee, X. T. Tong, and Y. Zhang (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics 48(1), 251 – 273.
  • Defazio (2020) Defazio, A. (2020). Understanding the role of momentum in non-convex optimization: Practical insights from a lyapunov analysis. arXiv preprint arXiv:2010.00406.
  • Gitman et al. (2019) Gitman, I., H. Lang, P. Zhang, and L. Xiao (2019). Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems 32.
  • Jin et al. (2022) Jin, R., Y. Xing, and X. He (2022). On the convergence of mSGD and AdaGrad for stochastic optimization. arXiv preprint arXiv:2201.11204.
  • Kidambi et al. (2018) Kidambi, R., P. Netrapalli, P. Jain, and S. Kakade (2018). On the insufficiency of existing momentum schemes for stochastic optimization. In Information Theory and Applications Workshop, pp.  1–9.
  • Kingma and Ba (2014) Kingma, D. P. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lee et al. (2022) Lee, K., A. Cheng, E. Paquette, and C. Paquette (2022). Trajectory of mini-batch momentum: Batch size saturation and convergence in high dimensions. Advances in Neural Information Processing Systems 35, 36944–36957.
  • Lee et al. (2022) Lee, S., Y. Liao, M. H. Seo, and Y. Shin (2022). Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence, Volume 36, pp.  7381–7389.
  • Li et al. (2022) Li, X., M. Liu, and F. Orabona (2022). On the last iterate convergence of momentum methods. In International Conference on Algorithmic Learning Theory, pp.  699–717.
  • Liu et al. (2020) Liu, Y., Y. Gao, and W. Yin (2020). An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems 33, 18261–18271.
  • Loizou and Richtárik (2017) Loizou, N. and P. Richtárik (2017). Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737.
  • Loizou and Richtárik (2020) Loizou, N. and P. Richtárik (2020). Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Computational Optimization and Applications 77(3), 653–710.
  • Mai and Johansson (2020) Mai, V. and M. Johansson (2020). Convergence of a stochastic gradient method with momentum for non-smooth non-convex optimization. In International Conference on Machine Learning, pp.  6630–6639.
  • Moulines and Bach (2011) Moulines, E. and F. Bach (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems 24.
  • Nesterov (1983) Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence $o(1/k^{2})$. In Doklady AN USSR, Volume 269, pp.  543–547.
  • Nesterov (2003) Nesterov, Y. (2003). Introductory lectures on convex optimization: A basic course, Volume 87. Springer.
  • Nguyen et al. (2018) Nguyen, L., P. H. Nguyen, M. Dijk, P. Richtárik, K. Scheinberg, and M. Takác (2018). SGD and Hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pp.  3750–3758.
  • Nocedal and Wright (2006) Nocedal, J. and S. J. Wright (2006). Numerical optimization. springer series in operations research. SIAM J Optimization.
  • Paquette and Paquette (2021) Paquette, C. and E. Paquette (2021). Dynamics of stochastic momentum methods on large-scale, quadratic models. Advances in Neural Information Processing Systems 34, 9229–9240.
  • Polyak (1964) Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17.
  • Polyak and Juditsky (1992) Polyak, B. T. and A. B. Juditsky (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30(4), 838–855.
  • Richtárik and Takác (2020) Richtárik, P. and M. Takác (2020). Stochastic reformulations of linear systems: algorithms and convergence theory. SIAM Journal on Matrix Analysis and Applications 41(2), 487–524.
  • Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.
  • Ruppert (1988) Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
  • Sebbouh et al. (2021) Sebbouh, O., R. M. Gower, and A. Defazio (2021). Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, pp.  3935–3971. PMLR.
  • Su and Zhu (2023) Su, W. J. and Y. Zhu (2023). Higrad: Uncertainty quantification for online learning and stochastic approximation. Journal of Machine Learning Research 24(124), 1–53.
  • Toulis et al. (2021) Toulis, P., T. Horel, and E. M. Airoldi (2021). The proximal Robbins-Monro method. Journal of the Royal Statistical Society Series B: Statistical Methodology 83(1), 188–212.
  • Wang and Johansson (2022) Wang, X. and M. Johansson (2022). On uniform boundedness properties of sgd and its momentum variants. arXiv preprint arXiv:2201.10245.
  • Yan et al. (2018) Yan, Y., T. Yang, Z. Li, Q. Lin, and Y. Yang (2018). A unified analysis of stochastic momentum methods for deep learning. arXiv preprint arXiv:1808.10396.
  • Yang et al. (2016) Yang, T., Q. Lin, and Z. Li (2016). Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257.
  • Zeiler (2012) Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  • Zhu et al. (2023) Zhu, W., X. Chen, and W. B. Wu (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association 118(541), 393–404.
  • Zhu and Dong (2021) Zhu, Y. and J. Dong (2021). On constructing confidence region for model parameters in stochastic gradient descent via batch means. In 2021 Winter Simulation Conference (WSC), pp.  1–12.

Appendix A Proofs of finite-time convergence rates of SGDM

A.1 Proof of Theorem 1

Recall that

$$m_{t+1}=\gamma m_{t}+(1-\gamma)\triangledown g_{\eta_{t}}(x_{t}),\quad x_{t+1}=x_{t}-\alpha m_{t+1}.$$
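The recursion recalled above can be sketched in a few lines. The example runs it on a toy deterministic quadratic with gradient $\Sigma(x-x^{*})$, so there is no mini-batch noise; the zero initialization of the momentum and the particular values of $\Sigma$, $x^{*}$, $\alpha$ and $\gamma$ are assumptions chosen for illustration.

```python
import numpy as np

def sgdm(grad, x1, alpha, gamma, num_steps):
    """Iterate m_{t+1} = gamma m_t + (1 - gamma) grad(x_t), x_{t+1} = x_t - alpha m_{t+1}."""
    x = np.array(x1, dtype=float)
    m = np.zeros_like(x)                     # assumption: momentum starts at zero
    for _ in range(num_steps):
        m = gamma * m + (1.0 - gamma) * grad(x)
        x = x - alpha * m
    return x

# Toy quadratic: exact gradient Sigma (x - x_star), no stochastic noise.
Sigma = np.diag([1.0, 4.0])
x_star = np.array([1.0, -2.0])
x_final = sgdm(lambda x: Sigma @ (x - x_star), x1=np.zeros(2),
               alpha=0.2, gamma=0.5, num_steps=500)
print(x_final)                               # close to x_star
```

Replacing the exact gradient by a mini-batch gradient $\triangledown g_{\eta_{t}}(x_{t})$ gives the stochastic version analyzed in the proof.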

Put

$$\Gamma=\begin{pmatrix}\gamma I & (1-\gamma)\Sigma\\ -\alpha\gamma I & I-\alpha(1-\gamma)\Sigma\end{pmatrix}.$$

Letting $\widetilde{x}_{t}=x_{t}-x^{*}$ and $\widetilde{m}_{t+1}=(1-\gamma)\sum_{j=1}^{t}\gamma^{t-j}\Sigma\widetilde{x}_{j}$, we can write

\begin{align}
\begin{pmatrix}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{pmatrix}
&=\Gamma\begin{pmatrix}\widetilde{m}_{t}\\ \widetilde{x}_{t}\end{pmatrix}-\alpha(1-\gamma)\begin{pmatrix}0\\ \sum_{j=1}^{t}\gamma^{t-j}(\triangledown g_{\eta_{j}}(x_{j})-\Sigma\widetilde{x}_{j})\end{pmatrix} \tag{A.35}\\
&=\Gamma^{t}\begin{pmatrix}\widetilde{m}_{1}\\ \widetilde{x}_{1}\end{pmatrix}-\alpha(1-\gamma)\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}(\triangledown g_{\eta_{k}}(x_{k})-\Sigma\widetilde{x}_{k})\end{pmatrix}. \nonumber
\end{align}
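As a sanity check on the linear part of this recursion, the sketch below builds $\Gamma$ for a small diagonal $\Sigma$ and verifies that, when the stochastic gradient equals $\Sigma\widetilde{x}_{t}$ (so the second term vanishes), one step of the update coincides with multiplication by $\Gamma$. The numerical values and the helper `build_gamma` are assumptions made for illustration.

```python
import numpy as np

def build_gamma(Sigma, alpha, gamma):
    """Block matrix [[gamma I, (1-gamma) Sigma], [-alpha gamma I, I - alpha (1-gamma) Sigma]]."""
    d = Sigma.shape[0]
    I = np.eye(d)
    return np.block([
        [gamma * I,          (1.0 - gamma) * Sigma],
        [-alpha * gamma * I, I - alpha * (1.0 - gamma) * Sigma],
    ])

Sigma = np.diag([1.0, 4.0])
alpha, gamma = 0.2, 0.5
G = build_gamma(Sigma, alpha, gamma)

# One noiseless step written directly from the update rule.
m_tilde, x_tilde = np.array([0.3, -0.1]), np.array([1.0, 2.0])
m_next = gamma * m_tilde + (1.0 - gamma) * (Sigma @ x_tilde)
x_next = x_tilde - alpha * m_next

# The same step expressed through Gamma acting on the stacked state.
state_next = G @ np.concatenate([m_tilde, x_tilde])
spectral_radius = np.max(np.abs(np.linalg.eigvals(G)))
print(np.allclose(state_next, np.concatenate([m_next, x_next])), spectral_radius)
```

For these values the spectral radius of $\Gamma$ is below one, so the term $\Gamma^{t}(\widetilde{m}_{1},\widetilde{x}_{1})^{\top}$ decays geometrically in $t$.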

Define

\[
q_{t} := \mathbb{E}\big[\|\widetilde{m}_{t}\|^{2} + \|\widetilde{x}_{t}\|^{2}\big].
\]

Recall that for $\overline{L}=0$ it holds that $\triangledown g(x)=\mathbb{E}[\triangledown g_{\eta}(x)]=\Sigma(x-x^{*})$ and

\[
\triangledown g_{\eta_{k}}(x_{k})-\Sigma\widetilde{x}_{k} = \big(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\big)+\triangledown g_{\eta_{k}}(x^{*}).
\]

By the $L_{f}$-smoothness of the individual function $f_{\xi}(x)$ in (A3) and the independence of $L_{\xi}$ and $\widetilde{x}_{k}$, we have

\[
\mathbb{E}\big\|\triangledown f_{\xi}(x_{k})-\triangledown f_{\xi}(x^{*})+\triangledown f(x^{*})-\triangledown f(x_{k})\big\|^{2} \leq \mathbb{E}\big\|\triangledown f_{\xi}(x_{k})-\triangledown f_{\xi}(x^{*})\big\|^{2} \leq L_{f}^{2}\,\mathbb{E}\|\widetilde{x}_{k}\|^{2}.
\]
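The first inequality in this display is the usual bound of a variance by a second moment. A short sketch of that step, conditioning on $x_{k}$ and assuming $\xi$ is independent of $x_{k}$; the symbol $X$ is a shorthand introduced here for illustration only:

```latex
% X := \triangledown f_{\xi}(x_k) - \triangledown f_{\xi}(x^*), so that
% E[X | x_k] = \triangledown f(x_k) - \triangledown f(x^*) when \xi is independent of x_k.
\[
\mathbb{E}\big[\|X-\mathbb{E}[X\mid x_{k}]\|^{2}\,\big|\,x_{k}\big]
 = \mathbb{E}\big[\|X\|^{2}\mid x_{k}\big]-\big\|\mathbb{E}[X\mid x_{k}]\big\|^{2}
 \leq \mathbb{E}\big[\|X\|^{2}\mid x_{k}\big].
\]
```

Taking expectations over $x_{k}$ then gives the unconditional form used in the proof.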

Then from the iteration (A.35), we have

\[
\begin{aligned}
q_{t+1} &\leq \|\Gamma^{t}\|^{2}q_{1} + 2\alpha^{2}(1-\gamma)^{2}\frac{\sigma^{2}}{B}\sum_{k=1}^{t}\Big(\sum_{j=k}^{t}\|\Gamma^{t-j}\|\gamma^{j-k}\Big)^{2} \\
&\quad + 2\alpha^{2}(1-\gamma)^{2}\frac{L_{f}^{2}}{B}\sum_{k=1}^{t}\Big(\sum_{j=k}^{t}\|\Gamma^{t-j}\|\gamma^{j-k}\Big)^{2}q_{k}.
\end{aligned}
\tag{A.36}
\]

In the remainder of the proof, we bound $q_{t}$ by induction. Let $C_{1}\geq 1$ and $C_{2}\geq 1$ be constants to be specified later. We will prove that, for any $t\geq 1$, the inequalities

\[
q_{j} \leq \frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2} + C_{2}q_{1}\lambda^{2(1-\delta)(j-1)}, \quad 1\leq j\leq t
\tag{A.37}
\]

with fixed $\delta\in(0,1]$, imply that

\[
q_{t+1} \leq \frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2} + C_{2}q_{1}\lambda^{2(1-\delta)t}.
\tag{A.38}
\]

By Lemma 20 below, we have $\|\Gamma^{j}\|\leq M\lambda^{j}$ for all $j\geq 0$. Theorem 3 gives $\lambda\geq\sqrt{\gamma}$, and Lemma 27 lets us bound the sums in (A.36). Consequently, from (A.36) and (A.37), we deduce that

\[
\begin{aligned}
q_{t+1} &\leq M^{2}\lambda^{2t}q_{1} + 2\alpha^{2}\frac{\sigma^{2}}{B}M^{2}\frac{2}{1-\lambda} \\
&\quad + 2\alpha^{2}(1-\gamma)^{2}\frac{L_{f}^{2}}{B}M^{2}\left(\frac{2C_{1}\alpha^{2}\sigma^{2}}{B(1-\gamma)^{2}(1-\lambda)^{2}} + C_{2}q_{1}\lambda^{2(1-\delta)t}\frac{1}{\lambda^{2(1-\delta)}}\frac{4}{\delta(1-\gamma)^{2}(1-\lambda)}\right) \\
&\leq \frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2}M^{2}\left(\frac{4}{C_{1}}+\frac{4\alpha^{2}L_{f}^{2}}{B(1-\lambda)}\right) + C_{2}q_{1}\lambda^{2(1-\delta)t}M^{2}\left(\frac{\lambda^{2\delta t}}{C_{2}}+\frac{8\alpha^{2}L_{f}^{2}}{B\lambda^{2(1-\delta)}\delta(1-\lambda)}\right).
\end{aligned}
\]

Let the learning rate $\alpha$ and the batch size $B$ satisfy

\[
8M^{2}\frac{1}{\lambda^{2(1-\delta)}}\frac{1}{\delta(1-\lambda)}\alpha^{2}\frac{L_{f}^{2}}{B} \leq \frac{1}{2},
\]

so that we can take

\[
C_{1}=8M^{2},\quad C_{2}=2M^{2}.
\]

It is straightforward to verify that (A.38) is satisfied, and the induction is completed.
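A sketch of that verification, written out under the assumptions $0<\lambda<1$ and $\delta\in(0,1]$ (so that $\lambda^{2(1-\delta)}\delta\leq 1$ and $\lambda^{2\delta t}\leq 1$). This is our expansion of the step, using only the last bound on $q_{t+1}$ and the condition on $\alpha$ and $B$ stated above:

```latex
% With C_1 = 8M^2 and C_2 = 2M^2, both bracketed factors in the bound on q_{t+1} are at most 1.
\begin{align*}
M^{2}\Big(\frac{4}{C_{1}}+\frac{4\alpha^{2}L_{f}^{2}}{B(1-\lambda)}\Big)
 &= \frac{1}{2}+\frac{4M^{2}\alpha^{2}L_{f}^{2}}{B(1-\lambda)}
 \leq \frac{1}{2}+\frac{8M^{2}\alpha^{2}L_{f}^{2}}{B\lambda^{2(1-\delta)}\delta(1-\lambda)}
 \leq 1, \\
M^{2}\Big(\frac{\lambda^{2\delta t}}{C_{2}}+\frac{8\alpha^{2}L_{f}^{2}}{B\lambda^{2(1-\delta)}\delta(1-\lambda)}\Big)
 &= \frac{\lambda^{2\delta t}}{2}+\frac{8M^{2}\alpha^{2}L_{f}^{2}}{B\lambda^{2(1-\delta)}\delta(1-\lambda)}
 \leq \frac{1}{2}+\frac{1}{2} = 1.
\end{align*}
```

Substituting these two bounds into the last inequality for $q_{t+1}$ gives exactly (A.38).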

A.2 Proof of Theorem 3

Put

\[
\Gamma = \begin{pmatrix} U & 0_{d} \\ 0_{d} & U \end{pmatrix}\begin{pmatrix} \gamma I & (1-\gamma)A \\ -\alpha\gamma I & I-\alpha(1-\gamma)A \end{pmatrix}\begin{pmatrix} U^{\top} & 0_{d} \\ 0_{d} & U^{\top} \end{pmatrix}
\]

where $\Sigma = U\,\mathrm{diag}(\kappa_{1},\ldots,\kappa_{d})\,U^{\top} =: UAU^{\top}$ with $\mu=\kappa_{1}\leq\cdots\leq\kappa_{d}=L$, and $U$ is an orthogonal matrix. Now we can define

\[
\lambda^{\pm}_{k} = \frac{\gamma+1-\alpha(1-\gamma)\kappa_{k} \pm \sqrt{(\alpha(1-\gamma)\kappa_{k}-\gamma-1)^{2}-4\gamma}}{2},
\]

for $1\leq k\leq d$, and the diagonal matrix

\[
\Lambda = \begin{pmatrix} \mathrm{diag}(\lambda^{+}_{1},\ldots,\lambda^{+}_{d}) & 0_{d} \\ 0_{d} & \mathrm{diag}(\lambda^{-}_{1},\ldots,\lambda^{-}_{d}) \end{pmatrix}.
\tag{A.42}
\]

Notice that if $(\alpha(1-\gamma)\kappa_{k}-\gamma-1)^{2}<4\gamma$, then $\lambda^{\pm}_{k}$ is complex and is given by

\[
\lambda^{\pm}_{k} = \frac{\gamma+1-\alpha(1-\gamma)\kappa_{k} \pm \sqrt{-1}\sqrt{4\gamma-(\alpha(1-\gamma)\kappa_{k}-\gamma-1)^{2}}}{2}.
\]
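As a numerical sanity check of these closed-form expressions, the following minimal sketch compares them with the eigenvalues that NumPy computes for the middle block matrix (which shares its spectrum with $\Gamma$, since the outer factors are orthogonal). The values of `alpha`, `gamma`, and `kappa` and the helper names are illustrative assumptions, not taken from the paper; they are chosen so that one pair is real and the others are complex.

```python
import numpy as np

def closed_form_eigs(alpha, gamma, kappa):
    """Closed-form lambda_k^+ and lambda_k^- for each curvature kappa_k."""
    # s_k = gamma + 1 - alpha (1 - gamma) kappa_k is the sum lambda_k^+ + lambda_k^-
    s = gamma + 1.0 - alpha * (1.0 - gamma) * np.asarray(kappa, dtype=float)
    # a complex square root covers both the real and the complex case
    root = np.sqrt((s**2 - 4.0 * gamma).astype(complex))
    return (s + root) / 2.0, (s - root) / 2.0

def middle_matrix(alpha, gamma, kappa):
    """The 2d x 2d block matrix built from A = diag(kappa)."""
    d = len(kappa)
    A, I = np.diag(kappa), np.eye(d)
    return np.block([[gamma * I, (1.0 - gamma) * A],
                     [-alpha * gamma * I, I - alpha * (1.0 - gamma) * A]])

# Illustrative (hypothetical) parameters satisfying alpha * L < 2 (1 + gamma) / (1 - gamma)
alpha, gamma = 0.5, 0.9
kappa = np.array([0.02, 1.0, 3.0])

lam_plus, lam_minus = closed_form_eigs(alpha, gamma, kappa)
closed_form = np.concatenate([lam_plus, lam_minus])
numeric = np.linalg.eigvals(middle_matrix(alpha, gamma, kappa))

# Distance from each closed-form eigenvalue to the nearest numerical one
max_mismatch = max(np.min(np.abs(numeric - c)) for c in closed_form)
spectral_radius = np.max(np.abs(closed_form))
print(max_mismatch, spectral_radius)
```

The complex square root lets one function handle both branches of the definition, which is why no case split appears in the code.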

Before proving Theorem 3, we first show that $\Gamma$ is diagonalizable, with $\Lambda$ the resulting diagonal matrix. We then bound the spectral radius of $\Lambda$.

Lemma 20

For $\alpha,\gamma,L$ satisfying $\alpha L<2(1+\gamma)/(1-\gamma)$ and $\left|\gamma+1-\alpha(1-\gamma)\kappa_{k}\right|\neq 2\sqrt{\gamma}$ for all $1\leq k\leq d$, we have

\[
\begin{pmatrix} \gamma I & (1-\gamma)A \\ -\alpha\gamma I & I-\alpha(1-\gamma)A \end{pmatrix} = P\Lambda P^{-1},
\]

for some invertible matrix $P$ that satisfies

\[
\|P\|\leq 2,\quad \|P^{-1}\|\leq\frac{2}{\sqrt{\Delta}}\big(2(1-\gamma)(1+\alpha L+L)+3\alpha\gamma\big),
\]

where $\Delta=\min_{k}\left\{\left|\left(\gamma+1-\alpha(1-\gamma)\kappa_{k}\right)^{2}-4\gamma\right|\right\}$. Therefore, $\|\Gamma^{j}\|\leq M\lambda^{j}$ for any $j\geq 1$, where

\[
M=\frac{4}{\sqrt{\Delta}}\big(2(1-\gamma)(1+\alpha L+L)+3\alpha\gamma\big),
\]

and $\lambda$ is the spectral radius of $\Lambda$.

Proof.   The eigenvalues satisfy

\[
\begin{aligned}
&(\lambda^{\pm}_{k})^{2}+\big(-\gamma-1+\alpha(1-\gamma)\kappa_{k}\big)\lambda^{\pm}_{k}+\gamma=0, \\
&\lambda^{+}_{k}+\lambda^{-}_{k}=\gamma+1-\alpha(1-\gamma)\kappa_{k},\quad \lambda^{+}_{k}\lambda^{-}_{k}=\gamma.
\end{aligned}
\]

Since $\alpha L<2(1+\gamma)/(1-\gamma)$ and $\mu\leq\kappa_{k}\leq L$, we have $|\gamma+1-\alpha(1-\gamma)\kappa_{k}|<1+\gamma$ for all $k$. Together with the product of the eigenvalues $\lambda^{+}_{k}\lambda^{-}_{k}=\gamma<1$, a direct calculation shows that $\max_{k}\{|\lambda^{\pm}_{k}|\}<1$, which holds even in the case of complex eigenvalues. Moreover, since $|\gamma+1-\alpha(1-\gamma)\kappa_{k}|\neq 2\sqrt{\gamma}$ for every $k$, we have $\lambda^{+}_{k}\neq\lambda^{-}_{k}$, so the matrix is diagonalizable.
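The calculation behind $\max_{k}\{|\lambda^{\pm}_{k}|\}<1$ can be sketched as follows, assuming $0\leq\gamma<1$. The shorthand $s_{k}$ is introduced here for illustration and does not appear in the original argument:

```latex
% s_k := \gamma + 1 - \alpha(1-\gamma)\kappa_k, with |s_k| < 1 + \gamma from the step-size condition.
\begin{align*}
\text{Complex case } (s_{k}^{2}<4\gamma):\quad
 & |\lambda^{\pm}_{k}|^{2}=\lambda^{+}_{k}\lambda^{-}_{k}=\gamma<1
   \quad\text{since } \lambda^{-}_{k}=\overline{\lambda^{+}_{k}}. \\
\text{Real case } (s_{k}^{2}>4\gamma):\quad
 & \max\{|\lambda^{+}_{k}|,|\lambda^{-}_{k}|\}
   =\frac{|s_{k}|+\sqrt{s_{k}^{2}-4\gamma}}{2}
   <\frac{(1+\gamma)+\sqrt{(1+\gamma)^{2}-4\gamma}}{2}
   =\frac{(1+\gamma)+(1-\gamma)}{2}=1.
\end{align*}
```

The strict inequality in the real case uses that the middle expression is increasing in $|s_{k}|$ together with $|s_{k}|<1+\gamma$.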

Let $e_{k}$ be the unit vector with 1 in the $k$-th coordinate and zeros elsewhere. Since $Ae_{k}=\kappa_{k}e_{k}$ by the definition of $A$, and $\lambda^{\pm}_{k}$ solves the quadratic equation above, we have

\[
\big(\alpha\gamma(1-\gamma)+\alpha(\lambda^{\pm}_{k}-\gamma)(1-\gamma)\big)Ae_{k}=(\lambda^{\pm}_{k}-\gamma)(1-\lambda^{\pm}_{k})e_{k},
\]

which is equivalent to

\[
(1-\gamma)Ae_{k}=\frac{\lambda^{\pm}_{k}-\gamma}{\alpha\gamma}\Big\{(1-\lambda^{\pm}_{k})e_{k}-\alpha(1-\gamma)Ae_{k}\Big\}.
\tag{A.44}
\]

Now define $\boldsymbol{z}^{\pm}_{k}$ by

\[
(\lambda^{\pm}_{k}-\gamma)\boldsymbol{z}^{\pm}_{k}=(1-\gamma)Ae_{k},
\]

which can be written as $\gamma\boldsymbol{z}_{k}^{\pm}+(1-\gamma)Ae_{k}=\lambda^{\pm}_{k}\boldsymbol{z}_{k}^{\pm}$. Combining this equation with (A.44), we have

\[
(\lambda^{\pm}_{k}-\gamma)\boldsymbol{z}^{\pm}_{k}=\frac{\lambda^{\pm}_{k}-\gamma}{\alpha\gamma}\Big\{(1-\lambda^{\pm}_{k})e_{k}-\alpha(1-\gamma)Ae_{k}\Big\},
\]

which yields that

\[
-\alpha\gamma\boldsymbol{z}^{\pm}_{k}+e_{k}-\alpha(1-\gamma)Ae_{k}=\lambda^{\pm}_{k}e_{k}.
\]

Thus

\[
\begin{pmatrix} \gamma I & (1-\gamma)A \\ -\alpha\gamma I & I-\alpha(1-\gamma)A \end{pmatrix}\begin{pmatrix} \boldsymbol{z}^{\pm}_{k} \\ e_{k} \end{pmatrix} = \lambda^{\pm}_{k}\begin{pmatrix} \boldsymbol{z}^{\pm}_{k} \\ e_{k} \end{pmatrix} = \begin{pmatrix} \boldsymbol{z}^{\pm}_{k} \\ e_{k} \end{pmatrix}\lambda^{\pm}_{k}.
\tag{A.53}
\]

Define

$$
\boldsymbol{z}^{+}=(\boldsymbol{z}^{+}_{1},\ldots,\boldsymbol{z}^{+}_{d}),\quad \boldsymbol{z}^{-}=(\boldsymbol{z}^{-}_{1},\ldots,\boldsymbol{z}^{-}_{d}),
$$

and

$$
\begin{aligned}
P &= \begin{pmatrix}
\boldsymbol{z}^{+}\left(\left|\boldsymbol{z}^{+}\right|^{2}+I\right)^{-\frac{1}{2}} & \boldsymbol{z}^{-}\left(\left|\boldsymbol{z}^{-}\right|^{2}+I\right)^{-\frac{1}{2}} \\
I\left(\left|\boldsymbol{z}^{+}\right|^{2}+I\right)^{-\frac{1}{2}} & I\left(\left|\boldsymbol{z}^{-}\right|^{2}+I\right)^{-\frac{1}{2}}
\end{pmatrix}, \\
P^{-1} &= \begin{pmatrix}
\left(\left|\boldsymbol{z}^{+}\right|^{2}+I\right)^{\frac{1}{2}}(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})^{-1} & -\left(\left|\boldsymbol{z}^{+}\right|^{2}+I\right)^{\frac{1}{2}}\boldsymbol{z}^{-}(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})^{-1} \\
-\left(\left|\boldsymbol{z}^{-}\right|^{2}+I\right)^{\frac{1}{2}}(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})^{-1} & \left(\left|\boldsymbol{z}^{-}\right|^{2}+I\right)^{\frac{1}{2}}\boldsymbol{z}^{+}(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})^{-1}
\end{pmatrix}.
\end{aligned}
$$

Note that $\boldsymbol{z}^{+}$ and $\boldsymbol{z}^{-}$ are diagonal matrices. Then by (A.53),

$$
\begin{pmatrix} \gamma I & (1-\gamma)A \\ -\alpha\gamma I & I-\alpha(1-\gamma)A \end{pmatrix}P=P\Lambda,
$$

where $\Lambda$ is defined by (A.42). By the definition of $\boldsymbol{z}^{\pm}$, we have

$$
\begin{aligned}
\boldsymbol{z}^{\pm} &= \mathrm{diag}\left(\frac{\lambda_{k}^{\mp}-\gamma}{\alpha\gamma}\right), \\
\boldsymbol{z}^{+}\boldsymbol{z}^{-} &= \mathrm{diag}\left(\frac{(1-\gamma)\kappa_{k}}{\alpha\gamma}\right), \\
(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})^{-1} &= \mathrm{diag}\left(\frac{-\alpha\gamma}{\sqrt{(\gamma+1-\alpha(1-\gamma)\kappa_{k})^{2}-4\gamma}}\right).
\end{aligned}
$$
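As a numerical sanity check of the diagonalization above, the following Python sketch builds the block matrix for a diagonal $A=\mathrm{diag}(\kappa_k)$, forms $P$, $P^{-1}$ and $\Lambda$ from the closed-form expressions, and lets the caller verify $\Gamma P = P\Lambda$. The function name `eigen_blocks`, the ordering $\Lambda=\mathrm{diag}(\lambda^{+},\lambda^{-})$, and the use of the principal complex square root for $\lambda_k^{\pm}$ are assumptions made for illustration; the definition (A.42) is not reproduced here.

```python
import numpy as np

def eigen_blocks(kappa, alpha, gamma):
    """Return (Gamma, P, P_inv, Lam) for A = diag(kappa).

    Illustrative helper (not from the paper). Assumes 0 < gamma < 1 and
    that no discriminant (gamma + 1 - alpha(1-gamma)kappa_k)^2 - 4 gamma is zero.
    """
    kappa = np.asarray(kappa, dtype=float)
    d = kappa.size
    I = np.eye(d)
    A = np.diag(kappa)
    Gamma = np.block([[gamma * I, (1 - gamma) * A],
                      [-alpha * gamma * I, I - alpha * (1 - gamma) * A]])

    # Roots of lambda^2 - s*lambda + gamma = 0, one pair per kappa_k.
    s = gamma + 1 - alpha * (1 - gamma) * kappa
    root = np.sqrt((s ** 2 - 4 * gamma).astype(complex))
    lam_p, lam_m = (s + root) / 2, (s - root) / 2

    # z^{+-} = diag((lambda^{-+} - gamma) / (alpha * gamma)).
    z_p = (lam_m - gamma) / (alpha * gamma)
    z_m = (lam_p - gamma) / (alpha * gamma)

    # Column normalisers (|z|^2 + 1)^{1/2}.
    n_p = np.sqrt(np.abs(z_p) ** 2 + 1)
    n_m = np.sqrt(np.abs(z_m) ** 2 + 1)

    P = np.block([[np.diag(z_p / n_p), np.diag(z_m / n_m)],
                  [np.diag(1 / n_p), np.diag(1 / n_m)]])

    gap_inv = 1 / (z_p - z_m)  # diagonal of (z^+ - z^-)^{-1}
    P_inv = np.block([[np.diag(n_p * gap_inv), np.diag(-n_p * z_m * gap_inv)],
                      [np.diag(-n_m * gap_inv), np.diag(n_m * z_p * gap_inv)]])

    Lam = np.diag(np.concatenate([lam_p, lam_m]))
    return Gamma, P, P_inv, Lam


if __name__ == "__main__":
    Gamma, P, P_inv, Lam = eigen_blocks([0.5, 1.0, 2.0], alpha=0.1, gamma=0.9)
    print(np.allclose(Gamma @ P, P @ Lam))        # eigen-relation
    print(np.allclose(P_inv @ P, np.eye(6)))      # closed-form inverse
```

With these example values all discriminants are negative, so the eigenvalues are complex conjugate pairs; choosing a smaller `gamma` (for instance 0.1) exercises the real-eigenvalue branch with the same code.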

For $\left|\boldsymbol{z}_{k}^{+}\right|+\left|\boldsymbol{z}_{k}^{-}\right|$, we have

$$
\left|\boldsymbol{z}_{k}^{+}\right|+\left|\boldsymbol{z}_{k}^{-}\right|\leq 2\sqrt{\left|\boldsymbol{z}_{k}^{+}\right|^{2}+\left|\boldsymbol{z}_{k}^{-}\right|^{2}}=
\begin{cases}
\frac{2}{\alpha\gamma}\sqrt{2(\lambda_{k}^{+}-\gamma)(\lambda_{k}^{-}-\gamma)}, & (\gamma+1-\alpha(1-\gamma)\kappa_{k})^{2}<4\gamma,\\
\frac{2}{\alpha\gamma}\sqrt{(\lambda_{k}^{+}-\gamma)^{2}+(\lambda_{k}^{-}-\gamma)^{2}}, & (\gamma+1-\alpha(1-\gamma)\kappa_{k})^{2}>4\gamma,
\end{cases}
$$

and

$$
\begin{aligned}
\frac{2}{\alpha\gamma}\sqrt{2(\lambda_{k}^{+}-\gamma)(\lambda_{k}^{-}-\gamma)} &= \frac{2}{\alpha\gamma}\sqrt{2\alpha\gamma(1-\gamma)\kappa_{k}}, \\
\frac{2}{\alpha\gamma}\sqrt{(\lambda_{k}^{+}-\gamma)^{2}+(\lambda_{k}^{-}-\gamma)^{2}} &= \frac{2}{\alpha\gamma}\sqrt{(\lambda_{k}^{+}+\lambda_{k}^{-}-\gamma)^{2}-2\lambda_{k}^{+}\lambda_{k}^{-}+\gamma^{2}} \\
&= \frac{2}{\alpha\gamma}\sqrt{(1-\gamma)^{2}-2\alpha(1-\gamma)\kappa_{k}+\alpha^{2}(1-\gamma)^{2}\kappa_{k}^{2}}\leq\frac{2}{\alpha\gamma}(1-\gamma)\left(1+\alpha\kappa_{k}\right).
\end{aligned}
$$

So it holds that

$$
\begin{aligned}
\left|\boldsymbol{z}_{k}^{+}\right|+\left|\boldsymbol{z}_{k}^{-}\right| &\leq \frac{2}{\alpha\gamma}\left(\sqrt{2\alpha\gamma(1-\gamma)\kappa_{k}}+(1-\gamma)(1+\alpha\kappa_{k})\right) \\
&\leq \frac{1}{\alpha\gamma}\left(2\alpha\gamma+(1-\gamma)(2+2\alpha\kappa_{k}+\kappa_{k})\right).
\end{aligned}
$$

Then for $P$ and $P^{-1}$, it holds that

$$
\begin{aligned}
\left\|P\right\| &\leq \sqrt{4}=2, \\
\left\|P^{-1}\right\| &\leq 2\max_{k}\left|\sqrt{\left|\boldsymbol{z}^{+}_{k}\right|^{2}+1}\sqrt{\left|\boldsymbol{z}^{-}_{k}\right|^{2}+1}\,(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})_{k}^{-1}\right| \\
&\leq 2\max_{k}\left(\left|(\boldsymbol{z}^{+}\boldsymbol{z}^{-})_{k}\right|+\left|\boldsymbol{z}^{+}_{k}\right|+\left|\boldsymbol{z}^{-}_{k}\right|+1\right)\left|(\boldsymbol{z}^{+}-\boldsymbol{z}^{-})_{k}^{-1}\right| \\
&\leq 2\max_{k}\left(\frac{(1-\gamma)\kappa_{k}}{\alpha\gamma}+\frac{1}{\alpha\gamma}\left(2\alpha\gamma+(1-\gamma)(2+2\alpha\kappa_{k}+\kappa_{k})\right)+1\right)\frac{\alpha\gamma}{\sqrt{\Delta}} \\
&\leq \frac{2}{\sqrt{\Delta}}\left(2(1-\gamma)(1+\alpha L+L)+3\alpha\gamma\right),
\end{aligned}
$$

where $\Delta=\min_{k}\left\{\left|\left(\gamma+1-\alpha(1-\gamma)\kappa_{k}\right)^{2}-4\gamma\right|\right\}$.

Put

$$
\tilde{P}=\begin{pmatrix} U & 0_{d} \\ 0_{d} & U \end{pmatrix}P.
$$

Then we have

$$
\left\|\Gamma^{j}\right\|=\left\|\tilde{P}\Lambda^{j}\tilde{P}^{-1}\right\|\leq\left\|P\right\|\left\|P^{-1}\right\|\lambda^{j}\leq\frac{4}{\sqrt{\Delta}}\left(2(1-\gamma)(1+\alpha L+L)+3\alpha\gamma\right)\lambda^{j}:=M\lambda^{j}.
$$
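The bound $\|\Gamma^{j}\|\leq M\lambda^{j}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $U=I$ (so $\Gamma$ is built directly from $A=\mathrm{diag}(\kappa_k)$, which leaves the spectral norm unchanged), sets $L=\max_k\kappa_k$, and computes $\lambda$ as the numerical spectral radius of $\Gamma$. The helper names `gamma_matrix`, `bound_constant` and `check_power_bound` are hypothetical.

```python
import numpy as np

def gamma_matrix(kappa, alpha, gamma):
    """Block matrix [[gamma I, (1-gamma)A], [-alpha gamma I, I - alpha(1-gamma)A]]."""
    kappa = np.asarray(kappa, dtype=float)
    I = np.eye(kappa.size)
    A = np.diag(kappa)
    return np.block([[gamma * I, (1 - gamma) * A],
                     [-alpha * gamma * I, I - alpha * (1 - gamma) * A]])

def bound_constant(kappa, alpha, gamma):
    """M = 4/sqrt(Delta) * (2(1-gamma)(1 + alpha L + L) + 3 alpha gamma).

    Assumption: L is taken to be the largest kappa_k.
    """
    kappa = np.asarray(kappa, dtype=float)
    L = kappa.max()
    s = gamma + 1 - alpha * (1 - gamma) * kappa
    delta = np.min(np.abs(s ** 2 - 4 * gamma))
    return 4 / np.sqrt(delta) * (2 * (1 - gamma) * (1 + alpha * L + L)
                                 + 3 * alpha * gamma)

def check_power_bound(kappa, alpha, gamma, j_max=50):
    """Return True if ||Gamma^j||_2 <= M * lambda^j for j = 1..j_max."""
    Gamma = gamma_matrix(kappa, alpha, gamma)
    lam = np.max(np.abs(np.linalg.eigvals(Gamma)))  # spectral radius
    M = bound_constant(kappa, alpha, gamma)
    power = np.eye(Gamma.shape[0])
    for j in range(1, j_max + 1):
        power = power @ Gamma
        # Small relative slack guards against floating-point rounding.
        if np.linalg.norm(power, 2) > M * lam ** j * (1 + 1e-9):
            return False
    return True


if __name__ == "__main__":
    print(check_power_bound([0.5, 1.0, 2.0], alpha=0.1, gamma=0.9))
    print(check_power_bound([0.5, 1.0, 2.0], alpha=0.1, gamma=0.1))
```

The constant $M$ grows like $\Delta^{-1/2}$, so the check becomes uninformative when some discriminant $(\gamma+1-\alpha(1-\gamma)\kappa_k)^2-4\gamma$ is close to zero, which is where $\Gamma$ is nearly defective.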

After diagonalizing $\Gamma$, we can obtain an upper bound for the spectral radius $\lambda$ by evaluating the maximal absolute eigenvalue $\max_{k}\left|\lambda^{\pm}_{k}\right|$ in the diagonal matrix $\Lambda$, as specified in definition (A.42).

Proof. [Theorem 3] From Lemma 20, the eigenvalues satisfy

$$
\lambda^{+}_{k}+\lambda^{-}_{k}=\gamma+1-\alpha(1-\gamma)\kappa_{k}\in[\gamma+1-\alpha(1-\gamma)L,\ \gamma+1-\alpha(1-\gamma)\mu].
$$

For real eigenvalues, there exists $k$ such that $\left|\gamma+1-\alpha(1-\gamma)\kappa_{k}\right|>2\sqrt{\gamma}$. By the fact that $\mu\leq\kappa_{k}\leq L$, we have

$$
\begin{aligned}
\lambda=\max_{k}\left|\lambda^{\pm}_{k}\right| &= \max\left\{\frac{\gamma+1-\alpha(1-\gamma)\mu+\sqrt{\left(\gamma+1-\alpha(1-\gamma)\mu\right)^{2}-4\gamma}}{2},\right. \\
&\qquad\quad \left.\frac{\alpha(1-\gamma)L-\gamma-1+\sqrt{\left(\alpha(1-\gamma)L-\gamma-1\right)^{2}-4\gamma}}{2}\right\} \\
&= \frac{\gamma+1-(1-\gamma)\phi+\sqrt{\left(\gamma+1-(1-\gamma)\phi\right)^{2}-4\gamma}}{2}:=h(\gamma),
\end{aligned} \tag{A.59}
$$

where

$$
\phi=\min\left\{\alpha\mu,\ \frac{2(1+\gamma)}{1-\gamma}-\alpha L\right\}.
$$

Under this condition, we have $\gamma+1-(1-\gamma)\phi>2\sqrt{\gamma}$. Define $\eta=\left(\gamma+1-(1-\gamma)\phi\right)^{2}-4\gamma>0$; then

$$
h^{\prime}(\gamma)=\frac{(1+\phi)h(\gamma)-1}{\sqrt{\eta}},\quad h^{\prime\prime}(\gamma)=\frac{h^{\prime}(\gamma)\left((1+\phi)\sqrt{\eta}+2-(1+\phi)(\gamma+1-(1-\gamma)\phi)\right)}{2\eta}.
$$

Since $h(0)=1-\phi$, $h^{\prime}(0)<0$, and $2-(1+\phi)(\gamma+1-(1-\gamma)\phi)\geq 0$ for $0\leq\gamma<(1-\phi)^{2}/(1+\phi)^{2}$, the function $h$ is concave on this range and

$$
1-\phi-\frac{\phi(1+\phi)}{1-\phi}\gamma\leq\lambda\leq 1-\phi-\frac{\phi^{2}}{1-\phi}\gamma.
$$

For complex eigenvalues, we have $\left|\gamma+1-\alpha(1-\gamma)\kappa_{k}\right|\leq 2\sqrt{\gamma}$ for all $k$. Then $\gamma\geq(1-\phi)^{2}/(1+\phi)^{2}$ and

$$
\lambda=\max_{k}\left|\lambda^{\pm}_{k}\right|=\max_{k}\left\{\left|\frac{\gamma+1-\alpha(1-\gamma)\kappa_{k}\pm\sqrt{-1}\sqrt{4\gamma-(\alpha(1-\gamma)\kappa_{k}-\gamma-1)^{2}}}{2}\right|\right\}=\sqrt{\gamma}.
$$
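The two regimes for $\lambda$ can be compared against a direct eigenvalue computation. The following sketch is a minimal illustration, assuming $U=I$, that $\mu$ and $L$ are the smallest and largest $\kappa_k$, and that the step size is small enough for $\phi=\alpha\mu$ (so that $\phi$ does not depend on $\gamma$ in the linear bounds). The function names are hypothetical.

```python
import numpy as np

def spectral_radius(kappa, alpha, gamma):
    """Numerical spectral radius of Gamma for A = diag(kappa)."""
    kappa = np.asarray(kappa, dtype=float)
    I = np.eye(kappa.size)
    A = np.diag(kappa)
    Gamma = np.block([[gamma * I, (1 - gamma) * A],
                      [-alpha * gamma * I, I - alpha * (1 - gamma) * A]])
    return float(np.max(np.abs(np.linalg.eigvals(Gamma))))

def closed_form_radius(mu, L, alpha, gamma):
    """h(gamma) from (A.59) in the real case, sqrt(gamma) in the complex case.

    Requires 0 <= gamma < 1.
    """
    phi = min(alpha * mu, 2 * (1 + gamma) / (1 - gamma) - alpha * L)
    s = gamma + 1 - (1 - gamma) * phi
    eta = s ** 2 - 4 * gamma
    if eta > 0:
        return (s + np.sqrt(eta)) / 2
    return float(np.sqrt(gamma))

def linear_bounds(phi, gamma):
    """Lower and upper linear bounds on lambda, for gamma below (1-phi)^2/(1+phi)^2."""
    lower = 1 - phi - phi * (1 + phi) / (1 - phi) * gamma
    upper = 1 - phi - phi ** 2 / (1 - phi) * gamma
    return lower, upper


if __name__ == "__main__":
    kappa, alpha = [0.5, 1.0, 2.0], 0.2
    mu, L = min(kappa), max(kappa)
    for g in [0.0, 0.3, 0.6, 0.8, 0.95]:
        print(g, spectral_radius(kappa, alpha, g), closed_form_radius(mu, L, alpha, g))
```

In this example $\phi=\alpha\mu=0.1$, so the switch from the real to the complex regime occurs at $\gamma=(1-\phi)^{2}/(1+\phi)^{2}\approx 0.669$; the values 0.8 and 0.95 therefore fall in the regime where $\lambda=\sqrt{\gamma}$.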

A.3 Proof of Theorem 9

Recall from the iteration (A.35) that

$$
\begin{pmatrix} \widetilde{m}_{t+1} \\ \widetilde{x}_{t+1} \end{pmatrix}
=\Gamma^{t}\begin{pmatrix} \widetilde{m}_{1} \\ \widetilde{x}_{1} \end{pmatrix}
-\alpha(1-\gamma)\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix} 0 \\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\Sigma\widetilde{x}_{k}\right) \end{pmatrix}. \tag{A.66}
$$

Define the events

$$
E_{j}=\left\{\left\|\begin{pmatrix} \widetilde{m}_{j} \\ \widetilde{x}_{j} \end{pmatrix}\right\|\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(j-1)}\right\},\quad 1\leq j\leq t.
$$

For certain constants $C_{1},C_{2}\geq 1$, which will be defined subsequently, the event $E_{1}$ has probability $\mathbb{P}(E_{1})=1$, because the norm of the initial pair equals $q_{1}^{1/2}\leq C_{2}q_{1}^{1/2}$. In the remainder of the proof, we show by induction that, for any $t\geq 1$, on the event $\cap_{j=1}^{t}E_{j}$, the event $E_{t+1}$ occurs with probability at least $1-2/T^{2}$.
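To build intuition for the shape of the bound inside $E_{j}$ (a noise floor of order $\alpha\sigma/\sqrt{B}$ plus a geometrically decaying initialization term), the following minimal Python sketch simulates momentum SGD on a toy quadratic. Everything in it is an illustrative assumption rather than the setting of the proof: the objective $\frac{1}{2}\|x\|^{2}$ (so $x^{*}=0$), Gaussian mini-batch noise with standard deviation $\sigma/\sqrt{B}$, the update $m_{t}=\gamma m_{t-1}+(1-\gamma)\hat{g}_{t}$, $x_{t+1}=x_{t}-\alpha m_{t}$, and the hypothetical function name `sgdm_quadratic`.

```python
import numpy as np


def sgdm_quadratic(alpha, gamma, batch, steps, sigma=1.0, dim=5, seed=0):
    """Run momentum SGD on f(x) = 0.5 * ||x||^2 and return ||x_t - x*|| per step.

    Assumed (illustrative) update, not taken verbatim from the paper:
        m_t     = gamma * m_{t-1} + (1 - gamma) * noisy_gradient
        x_{t+1} = x_t - alpha * m_t
    """
    rng = np.random.default_rng(seed)
    x = np.full(dim, 5.0)   # initial iterate; the minimizer is x* = 0
    m = np.zeros(dim)       # momentum buffer starts at zero
    errors = np.empty(steps)
    for t in range(steps):
        # Mini-batch noise: averaging `batch` samples shrinks the std by sqrt(batch).
        noise = rng.normal(scale=sigma / np.sqrt(batch), size=dim)
        grad = x + noise    # gradient of 0.5 * ||x||^2 plus noise
        m = gamma * m + (1.0 - gamma) * grad
        x = x - alpha * m
        errors[t] = np.linalg.norm(x)
    return errors


if __name__ == "__main__":
    for batch in (1, 100):
        errs = sgdm_quadratic(alpha=0.1, gamma=0.9, batch=batch, steps=2000)
        # Print the first error and the average error over the last 500 steps.
        print(batch, errs[0], errs[-500:].mean())
```

Under these assumptions, one would expect both runs to show the same initial geometric decay, followed by a plateau that is roughly $\sqrt{100}=10$ times lower for the larger batch, which mirrors the two terms in the definition of $E_{j}$.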

For $\overline{L}>0$, it holds that

$$\begin{aligned}
\triangledown g_{\eta_{k}}(x_{k})-\Sigma\widetilde{x}_{k}
&=\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)+\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x_{k})-\Sigma\widetilde{x}_{k}\\
&=\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)+\triangledown g_{\eta_{k}}(x^{*})+V_{k}\widetilde{x}_{k},
\end{aligned} \tag{A.68}$$

where $\left\|V_{k}\right\|=\left\|\int_{0}^{1}\left(\Sigma(x^{*}+y\widetilde{x}_{k})-\Sigma(x^{*})\right)dy\right\|\leq\overline{L}\left\|\widetilde{x}_{k}\right\|$. By applying Lemmas 21, 22 and 23, we establish bounds for each of the three components in (A.68). Then, from the iteration (A.35), with probability at least $1-2/T^{2}$, we have on the event $\cap_{j=1}^{t}E_{j}$ that

$$\begin{aligned}
\left\|\left(\begin{array}{c}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{array}\right)\right\|
&\leq M\lambda^{t}q_{1}^{1/2}+\alpha\frac{\sigma}{\sqrt{B}}M\frac{\sqrt{2}c\left(2\log T+4d\right)}{(1-\lambda)^{1/2}}+\alpha M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}}+\frac{4C_{2}^{2}q_{1}\lambda^{(1-\delta)(t-1)}}{\delta(1-\lambda)}\right)\\
&\quad+\alpha\frac{L_{f}}{\sqrt{B}}Mc(2\log T+4d)\left(\frac{4C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)}+\frac{4\sqrt{2}C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(t-1)}}{\sqrt{\delta}(1-\lambda)^{1/2}}\right)\\
&\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma M\left(\frac{\sqrt{2}c\left(2\log T+4d\right)}{C_{1}}+c(2\log T+4d)\frac{4\alpha L_{f}}{\sqrt{B}(1-\lambda)^{1/2}}+\frac{2\overline{L}C_{1}\alpha^{2}\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\right)\\
&\quad+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)t}M\left(\frac{\lambda^{\delta t}}{C_{2}}+\frac{4\sqrt{2}c(2\log T+4d)\alpha L_{f}}{\sqrt{B}\lambda^{(1-\delta)}\sqrt{\delta}(1-\lambda)^{1/2}}+\frac{4\overline{L}\alpha C_{2}q_{1}^{1/2}}{\lambda^{(1-\delta)}\delta(1-\lambda)}\right),
\end{aligned}$$

where $c>0$ is an absolute constant. To ensure the occurrence of the event $E_{t+1}$, the learning rate $\alpha$ and the batch size $B$ must fulfill the condition

$$M\left(\frac{4\sqrt{2}c(2\log T+4d)}{\lambda^{(1-\delta)}\sqrt{\delta}(1-\lambda)^{1/2}}\frac{\alpha L_{f}}{\sqrt{B}}+\frac{2\overline{L}C_{1}\alpha^{2}\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\right)\leq\frac{1}{3},$$

where

$$C_{1}=3\sqrt{2}c\left(2\log T+4d\right)M.$$

Furthermore, we require that the initialization $q_{1}=\left\|\widetilde{m}_{1}\right\|^{2}+\left\|\widetilde{x}_{1}\right\|^{2}=\left\|\widetilde{x}_{1}\right\|^{2}$ satisfy

$$4M\overline{L}\alpha C_{2}q_{1}^{1/2}\frac{1}{\lambda^{1-\delta}\delta(1-\lambda)}\leq\frac{1}{3},$$

where

$$C_{2}=3M.$$
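As a worked check of how these two conditions close the induction step, the two bracketed factors in the preceding bound on the norm of $(\widetilde{m}_{t+1},\widetilde{x}_{t+1})$ can each be shown to be at most $1/M$. The sketch below assumes $0<\lambda<1$ and $0<\delta<1$ (so that $\lambda^{1-\delta}\sqrt{\delta}\leq 1$), which is how these parameters appear to be used here; it is a reading of the argument, not a statement taken from the source.

$$\begin{aligned}
&\text{With } C_{1}=3\sqrt{2}c(2\log T+4d)M:\qquad M\,\frac{\sqrt{2}c(2\log T+4d)}{C_{1}}=\frac{1}{3},\\
&M\left(c(2\log T+4d)\frac{4\alpha L_{f}}{\sqrt{B}(1-\lambda)^{1/2}}+\frac{2\overline{L}C_{1}\alpha^{2}\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\right)
\leq M\left(\frac{4\sqrt{2}c(2\log T+4d)}{\lambda^{(1-\delta)}\sqrt{\delta}(1-\lambda)^{1/2}}\frac{\alpha L_{f}}{\sqrt{B}}+\frac{2\overline{L}C_{1}\alpha^{2}\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\right)\leq\frac{1}{3},\\
&\text{so the first bracket, multiplied by } M, \text{ is at most } \tfrac{1}{3}+\tfrac{1}{3}<1.\\[4pt]
&\text{With } C_{2}=3M:\qquad M\,\frac{\lambda^{\delta t}}{C_{2}}=\frac{\lambda^{\delta t}}{3}\leq\frac{1}{3},\\
&M\,\frac{4\sqrt{2}c(2\log T+4d)\alpha L_{f}}{\sqrt{B}\lambda^{(1-\delta)}\sqrt{\delta}(1-\lambda)^{1/2}}\leq\frac{1}{3},\qquad
M\,\frac{4\overline{L}\alpha C_{2}q_{1}^{1/2}}{\lambda^{(1-\delta)}\delta(1-\lambda)}\leq\frac{1}{3},\\
&\text{so the second bracket, multiplied by } M, \text{ is at most } \tfrac{1}{3}+\tfrac{1}{3}+\tfrac{1}{3}=1.
\end{aligned}$$

Combining the two estimates, the right-hand side of the bound is at most $\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)t}$, which is exactly the threshold in the definition of $E_{t+1}$.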

It is straightforward to verify that

$$\mathbb{P}\left(E_{t+1}\mid\cap_{j=1}^{t}E_{j}\right)=\mathbb{P}\left(\left\|\left(\begin{array}{c}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{array}\right)\right\|\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)t}\mid\cap_{j=1}^{t}E_{j}\right)\geq 1-\frac{2}{T^{2}},$$

and the induction is completed. Consequently, the intersection of the events $E_{1}$ through $E_{T}$ satisfies $\mathbb{P}\left(\bigcap_{j=1}^{T}E_{j}\right)\geq 1-\frac{2}{T}$.
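The passage from the conditional bounds to the bound on the intersection is not spelled out; one standard way to obtain it, sketched here under the assumption $T\geq 2$, is to chain the conditional probabilities and apply Bernoulli's inequality $(1-x)^{n}\geq 1-nx$ for $0\leq x\leq 1$:

$$\mathbb{P}\left(\bigcap_{j=1}^{T}E_{j}\right)=\mathbb{P}(E_{1})\prod_{t=1}^{T-1}\mathbb{P}\left(E_{t+1}\mid\cap_{j=1}^{t}E_{j}\right)\geq\left(1-\frac{2}{T^{2}}\right)^{T-1}\geq 1-\frac{2(T-1)}{T^{2}}\geq 1-\frac{2}{T}.$$

The first equality uses $\mathbb{P}(E_{1})=1$ together with the chain rule for conditional probabilities, and the last step uses $(T-1)/T^{2}\leq 1/T$.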

A.4 Supporting Lemmas for Bounds in Theorem 9 Proof

Lemma 21

Under (A1)-(A2) and (A3’), we have

$$\mathbb{P}\left(\left\|\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}\triangledown g_{\eta_{k}}(x^{*})\end{array}\right)\right\|\geq M\frac{\sigma}{\sqrt{B}}\frac{\sqrt{2}c\left(2\log T+4d\right)}{(1-\lambda)^{1/2}(1-\gamma)}\right)\leq\frac{1}{T^{2}},$$

for $1\leq t\leq T$, where $c>0$ is an absolute constant.

Proof.   Define

$$Y_{t,k}=\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\left(\begin{array}{c}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{array}\right).$$

By the definition of a sub-exponential random vector, $Y_{t,k}$ is a $\left\|\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|\frac{c\sigma}{\sqrt{B}}$-sub-exponential random vector for some absolute constant $c>0$, and $\{Y_{t,k}\}_{k=1}^{t}$ are independent. Then by Lemmas 27 and 20, we have

$$\sum_{k=1}^{t}\left\|\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|^{2}\leq M^{2}\frac{2}{(1-\lambda)(1-\gamma)^{2}}.$$
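A quick numerical sanity check of this inequality can be run on a scalar surrogate. The sketch below assumes a power bound of the form $\|\Gamma^{i}\|\leq M\lambda^{i}$ (which appears to be how $M$ and $\lambda$ enter, but is an assumption here) and takes $M=1$, so each matrix norm is replaced by the scalar sum $\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}$. The function names `weighted_square_sum` and `stated_bound` are hypothetical.

```python
def weighted_square_sum(lam, gamma, t):
    """Scalar surrogate of sum_k || sum_{j=k}^t Gamma^{t-j} gamma^{j-k} ||^2.

    Each ||Gamma^{t-j}|| is replaced by lam**(t-j) (assumed power bound with M = 1).
    """
    total = 0.0
    for k in range(1, t + 1):
        inner = sum(lam ** (t - j) * gamma ** (j - k) for j in range(k, t + 1))
        total += inner ** 2
    return total


def stated_bound(lam, gamma, M=1.0):
    """Right-hand side M^2 * 2 / ((1 - lam) * (1 - gamma)^2) of the displayed bound."""
    return M ** 2 * 2.0 / ((1.0 - lam) * (1.0 - gamma) ** 2)


if __name__ == "__main__":
    for lam in (0.5, 0.9, 0.99):
        for gamma in (0.5, 0.9, 0.99):
            lhs = weighted_square_sum(lam, gamma, t=200)
            print(lam, gamma, lhs <= stated_bound(lam, gamma))
```

In this scalar setting the inequality also follows from Young's convolution inequality, since the inner sums are a convolution of two geometric sequences; the check is only illustrative and does not replace the cited lemmas.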

By applying Lemma 28 to $\{Y_{t,k}\}_{k=1}^{t}$, with probability at least $1-1/T^{2}$, we obtain

$$\left\|\sum_{k=1}^{t}Y_{t,k}\right\|\leq M\frac{\sigma}{\sqrt{B}}\frac{\sqrt{2}c\left(2\log T+4d\right)}{(1-\lambda)^{1/2}(1-\gamma)}.$$

Lemma 22

Under (A1)-(A2) and (A3’), suppose that the $t$-th iteration $(\widetilde{m}_{t},\widetilde{x}_{t})$ satisfies

$$\sqrt{\left\|\widetilde{m}_{j}\right\|^{2}+\left\|\widetilde{x}_{j}\right\|^{2}}\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(j-1)},$$

for $1\leq j\leq t$, we have

$$\mathbb{P}\left(\left\|\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{array}\right)\right\|\geq C\right)\leq\frac{1}{T^{2}},$$

where

$$C=c(2\log T+4d)M\frac{L_{f}}{\sqrt{B}}\left(\frac{4C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)(1-\gamma)}+\frac{4\sqrt{2}C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(t-1)}}{\sqrt{\delta}(1-\lambda)^{1/2}(1-\gamma)}\right),$$

and $c>0$ is an absolute constant.

Proof.   Define

$$Y_{t,k}=\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\left(\begin{array}{c}0\\ \triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\end{array}\right).$$

According to the SGDM iteration, $\xi_{k}$ and $x_{k}$ are independent, and $\{Y_{t,k}\}_{k=1}^{t}$ is a martingale difference sequence. By the $L_{f}$-smoothness of the individual gradient $\triangledown f_{\xi}(x)$, it holds that

$$\max_{\left\|v\right\|=1}\mathbb{E}\left[\exp\left(\left|v^{\top}Y_{t,k}\right|/\sigma_{k}\right)\right]\leq 2,$$

where

$$\sigma_{k}=c\frac{L_{f}}{\sqrt{B}}\left\|\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|\left\|\widetilde{x}_{k}\right\|,$$

and $c>0$ is an absolute constant. Then according to Lemmas 27 and 20, we have

$$\sum_{k=1}^{t}\left\|\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|^{2}\left\|\widetilde{x}_{k}\right\|^{2}\leq M^{2}\left(2C_{1}^{2}\alpha^{2}\frac{\sigma^{2}}{B}\frac{2}{(1-\gamma)^{2}(1-\lambda)^{2}}+2C_{2}^{2}q_{1}\frac{4\lambda^{2(1-\delta)(t-1)}}{\delta(1-\gamma)^{2}(1-\lambda)}\right).$$

By applying Lemma 28 to $\{Y_{t,k}\}_{k=1}^{t}$, we obtain, with probability $1-1/T^{2}$, the following result:

\[
\left\|\sum_{j=1}^{t}Y_{t,j}\right\|\leq c(2\log T+4d)M\frac{L_{f}}{\sqrt{B}}\left(\frac{4C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)(1-\gamma)}+\frac{4\sqrt{2}C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(t-1)}}{\sqrt{\delta}(1-\lambda)^{1/2}(1-\gamma)}\right).
\]
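As a sanity check on the constants, the bracket in the last display is exactly twice the sum of the square roots of the two terms in the bracket of the preceding display. The worked step below uses only $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for $a,b\geq 0$. The remaining prefactor $c(2\log T+4d)L_{f}/\sqrt{B}$ and the factor $2$ are taken here to come from Lemma 28, whose statement is not reproduced in this part of the proof, so that attribution is an assumption.

\[
\sqrt{2C_{1}^{2}\alpha^{2}\frac{\sigma^{2}}{B}\cdot\frac{2}{(1-\gamma)^{2}(1-\lambda)^{2}}}=\frac{2C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)(1-\gamma)},\qquad
\sqrt{2C_{2}^{2}q_{1}\cdot\frac{4\lambda^{2(1-\delta)(t-1)}}{\delta(1-\gamma)^{2}(1-\lambda)}}=\frac{2\sqrt{2}C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(t-1)}}{\sqrt{\delta}(1-\lambda)^{1/2}(1-\gamma)}.
\]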
Lemma 23

Under (A1)-(A2) and (A3’), suppose that the $t$-th iteration $(\widetilde{m}_{t},\widetilde{x}_{t})$ satisfies

\[
\sqrt{\left\|\widetilde{m}_{j}\right\|^{2}+\left\|\widetilde{x}_{j}\right\|^{2}}\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(j-1)},
\]

for $1\leq j\leq t$. Then, for $\overline{L}>0$, we have

\[
\left\|\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}V_{k}\widetilde{x}_{k}\end{array}\right)\right\|\leq M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}(1-\gamma)}+\frac{4C_{2}^{2}q_{1}\lambda^{(1-\delta)(t-1)}}{\delta(1-\lambda)(1-\gamma)}\right).
\]

Proof. By the fact that $\sum_{j=1}^{t}\sum_{k=1}^{j}\lambda^{t-j}\gamma^{j-k}\leq(1-\lambda)^{-1}(1-\gamma)^{-1}$ and Lemma 27, we have

\begin{align*}
\left\|\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}V_{k}\widetilde{x}_{k}\end{array}\right)\right\|
&\leq\sum_{j=1}^{t}M\lambda^{t-j}\sum_{k=1}^{j}\gamma^{j-k}\overline{L}\left\|\widetilde{x}_{k}\right\|^{2}\\
&\leq M\overline{L}\sum_{j=1}^{t}\lambda^{t-j}\sum_{k=1}^{j}\gamma^{j-k}\left(\frac{2C_{1}^{2}}{B(1-\lambda)}\alpha^{2}\sigma^{2}+2C_{2}^{2}q_{1}\lambda^{2(1-\delta)(k-1)}\right)\\
&\leq M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}(1-\gamma)}+2C_{2}^{2}q_{1}\sum_{k=1}^{t}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\lambda^{(1-\delta)(k-1)}\right)\\
&\leq M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}(1-\gamma)}+2C_{2}^{2}q_{1}\frac{2\lambda^{(1-\delta)(t-1)}}{\delta(1-\gamma)(1-\lambda)}\right).
\end{align*}
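The geometric double-sum bound invoked at the start of this proof can be sanity-checked numerically. The sketch below is illustrative only: the helper names `double_geometric_sum`, `geometric_bound`, and `check_grid`, as well as the grid of test values, are assumptions introduced here and are not part of the paper. The bound itself holds because the inner sum over $k$ is at most $(1-\gamma)^{-1}$ and the outer sum over $j$ is at most $(1-\lambda)^{-1}$.

```python
def double_geometric_sum(lam, gamma, t):
    """Return sum_{j=1}^t sum_{k=1}^j lam**(t-j) * gamma**(j-k)."""
    total = 0.0
    for j in range(1, t + 1):
        # Inner geometric sum over k = 1, ..., j.
        inner = sum(gamma ** (j - k) for k in range(1, j + 1))
        total += lam ** (t - j) * inner
    return total


def geometric_bound(lam, gamma):
    """Return the claimed upper bound 1 / ((1 - lam) * (1 - gamma))."""
    return 1.0 / ((1.0 - lam) * (1.0 - gamma))


def check_grid(values=(0.1, 0.5, 0.9, 0.99), horizons=(1, 5, 50, 200)):
    """Check the bound on a small grid of (lam, gamma, t).

    A tiny relative tolerance absorbs floating-point rounding when the
    double sum is numerically indistinguishable from the bound.
    """
    tol = 1e-12
    return all(
        double_geometric_sum(lam, gamma, t) <= geometric_bound(lam, gamma) * (1.0 + tol)
        for lam in values
        for gamma in values
        for t in horizons
    )
```

Calling `check_grid()` evaluates the inequality for every combination on the grid. It is a numerical check on finitely many points and not a proof.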

Appendix B Proofs of convergence rates of averaged SGDM

B.1 Proof of Theorem 13

By the definition of $\Gamma$, one can easily check that

\[
\alpha(I-\Gamma)^{-1}\left(\begin{array}{c}0\\ \triangledown g_{\eta_{t}}(x^{*})\end{array}\right)=\alpha\left(\begin{array}{cc}I,&\frac{1}{\alpha}I\\ \frac{\gamma}{\gamma-1}\Sigma^{-1},&\frac{1}{\alpha}\Sigma^{-1}\end{array}\right)\left(\begin{array}{c}0\\ \triangledown g_{\eta_{t}}(x^{*})\end{array}\right)=\left(\begin{array}{c}\triangledown g_{\eta_{t}}(x^{*})\\ \Sigma^{-1}\triangledown g_{\eta_{t}}(x^{*})\end{array}\right),
\]

and we have

\begin{align*}
&(I-\Gamma)^{-1}\sum_{t=1}^{n}\left(\begin{array}{c}0\\ \triangledown g_{\eta_{t}}(x^{*})\end{array}\right)-(1-\gamma)\sum_{t=1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}\triangledown g_{\eta_{k}}(x^{*})\end{array}\right)\\
&=\sum_{k=1}^{n}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\left(\begin{array}{c}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{array}\right).
\end{align*}
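The second identity is an exchange of the order of summation over the index set $1\leq k\leq j\leq t\leq n$. A minimal numerical sketch of this re-indexing follows. It assumes a scalar stand-in `G` in $(0,1)$ for the matrix $\Gamma$ and scalar entries `v[k-1]` in place of the vectors $(0,\triangledown g_{\eta_{k}}(x^{*}))^{\top}$. Both simplifications, and the function names `lhs`, `rhs`, and `reindexing_gap`, are assumptions made here for illustration only.

```python
import random


def lhs(G, gamma, v):
    """Left-hand side: sums ordered as t, then j, then k."""
    n = len(v)
    first = sum(v) / (1.0 - G)  # scalar analogue of (I - Gamma)^{-1} sum_t v_t
    second = 0.0
    for t in range(1, n + 1):
        for j in range(1, t + 1):
            inner = sum(gamma ** (j - k) * v[k - 1] for k in range(1, j + 1))
            second += G ** (t - j) * inner
    return first - (1.0 - gamma) * second


def rhs(G, gamma, v):
    """Right-hand side: outer sum over k, inner sums over t >= k and k <= j <= t."""
    n = len(v)
    total = 0.0
    for k in range(1, n + 1):
        weight = sum(
            G ** (t - j) * gamma ** (j - k)
            for t in range(k, n + 1)
            for j in range(k, t + 1)
        )
        total += (1.0 / (1.0 - G) - (1.0 - gamma) * weight) * v[k - 1]
    return total


def reindexing_gap(n=12, G=0.7, gamma=0.4, seed=0):
    """Absolute difference between both sides for pseudo-random inputs."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return abs(lhs(G, gamma, v) - rhs(G, gamma, v))
```

Because $\gamma^{j-k}$ is a scalar, it commutes with powers of $\Gamma$, so the same re-indexing carries over to the matrix case. The scalar version is used here only to keep the sketch dependency-free.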

Then summing (A.35) over $t$ from $n_{0}+1$ to $n$, for $\overline{L}=0$, we obtain

\begin{align*}
&\sum_{t=n_{0}+1}^{n}\left(\begin{array}{c}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{array}\right)+\sum_{t=n_{0}+1}^{n}\left(\begin{array}{c}\triangledown g_{\eta_{t}}(x^{*})\\ \Sigma^{-1}\triangledown g_{\eta_{t}}(x^{*})\end{array}\right)=\sum_{t=n_{0}+1}^{n}\left(\begin{array}{c}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{array}\right)+\alpha\sum_{t=n_{0}+1}^{n}(I-\Gamma)^{-1}\left(\begin{array}{c}0\\ \triangledown g_{\eta_{t}}(x^{*})\end{array}\right) \tag{B.88}\\
&=-\alpha(1-\gamma)\sum_{t=1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{array}\right) \tag{B.91}\\
&\quad+\alpha(1-\gamma)\sum_{t=1}^{n_{0}}\sum_{j=1}^{t}\Gamma^{t-j}\left(\begin{array}{c}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{array}\right) \tag{B.94}\\
&\quad+\alpha\sum_{k=1}^{n}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\left(\begin{array}{c}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{array}\right) \tag{B.97}\\
&\quad-\alpha\sum_{k=1}^{n_{0}}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n_{0}}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\left(\begin{array}{c}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{array}\right)+\sum_{t=n_{0}+1}^{n}\Gamma^{t}\left(\begin{array}{c}\widetilde{m}_{1}\\ \widetilde{x}_{1}\end{array}\right). \tag{B.102}
\end{align*}

The subsequent proof bounds the expectations of the individual terms in equation (B.102). For the $t$-th iteration $(\widetilde{m}_{t},\widetilde{x}_{t})$ satisfying

\[
\mathbb{E}\left[\left\|\widetilde{m}_{t}\right\|^{2}+\left\|\widetilde{x}_{t}\right\|^{2}\right]\leq\frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2}+C_{2}q_{1}\lambda^{2(1-\delta)(t-1)},\quad t\geq 1,
\]

by the fact that $\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\leq(1-\lambda)^{-1}(1-\gamma)^{-1}$ and $\frac{1-x}{1-x^{\delta}}\leq\delta^{-1}$ for $\delta\in(0,1]$, we have

\begin{align}
&\mathbb{E}\left\|\sum_{t=1}^{n}\alpha(1-\gamma)\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{pmatrix}\right\|^{2} \tag{B.105}\\
&\leq\alpha^{2}(1-\gamma)^{2}\frac{L_{f}^{2}}{B}M^{2}\sum_{k=1}^{n}\left(\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^{2}\left(\frac{C_{1}}{B(1-\lambda)}\alpha^{2}\sigma^{2}+C_{2}q_{1}\lambda^{2(1-\delta)(k-1)}\right) \notag\\
&\leq\alpha^{2}\frac{L_{f}^{2}}{B}M^{2}\left(\frac{C_{1}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{3}}n+\frac{C_{2}q_{1}}{(1-\lambda)^{2}(1-\lambda^{2(1-\delta)})}\right) \notag\\
&\leq\alpha^{2}\frac{L_{f}^{2}}{B}M^{2}\left(\frac{C_{1}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{3}}n+\frac{C_{2}q_{1}}{(1-\delta)(1-\lambda)^{3}}\right). \tag{B.106}
\end{align}
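The two elementary facts invoked before (B.105) can be checked numerically. The following is a minimal sketch and not part of the original proof; the function names `double_geometric_sum` and `ratio`, as well as the sample values of $\lambda$, $\gamma$, $x$, and $\delta$, are illustrative assumptions.

```python
def double_geometric_sum(lam, gamma, k, n):
    """Brute-force sum_{t=k}^{n} sum_{j=k}^{t} lam^(t-j) * gamma^(j-k)."""
    total = 0.0
    for t in range(k, n + 1):
        for j in range(k, t + 1):
            total += lam ** (t - j) * gamma ** (j - k)
    return total


def ratio(x, delta):
    """Compute (1 - x) / (1 - x^delta) for x in (0, 1) and delta in (0, 1]."""
    return (1.0 - x) / (1.0 - x ** delta)


# Illustrative values only (assumed, not taken from the paper).
lam, gamma = 0.9, 0.5
bound = 1.0 / ((1.0 - lam) * (1.0 - gamma))
for n in (5, 50, 200):
    # The double sum stays below (1 - lam)^(-1) * (1 - gamma)^(-1) for every n.
    assert double_geometric_sum(lam, gamma, k=1, n=n) <= bound

for x in (0.1, 0.5, 0.99):
    for delta in (0.1, 0.5, 1.0):
        # Small tolerance covers floating-point rounding at delta = 1.
        assert ratio(x, delta) <= 1.0 / delta + 1e-12
```

Swapping the order of summation shows why the first bound holds: the inner sum over $t$ is at most $(1-\lambda)^{-1}$, and the remaining sum over $j$ is at most $(1-\gamma)^{-1}$.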

By Lemmas 27 and 29, we have

\begin{align}
&\mathbb{E}\left\|\alpha\sum_{k=1}^{n}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{pmatrix}\right\|^{2} \tag{B.109}\\
&\leq\alpha^{2}\frac{\sigma^{2}}{B}M^{2}\sum_{k=1}^{n}\left(2\gamma^{2}\left(\sum_{j=k}^{n}\lambda^{n-j}\gamma^{j-k}\right)^{2}+\frac{2\lambda^{2(n-k+1)}}{(1-\lambda)^{2}}\right) \notag\\
&\leq 2\alpha^{2}\frac{\sigma^{2}}{B}M^{2}\left(\frac{2\gamma^{2}}{(1-\gamma)^{2}(1-\lambda)}+\frac{\lambda^{2}}{(1-\lambda)^{3}}\right)\leq 6\alpha^{2}\frac{\sigma^{2}}{B}M^{2}\frac{1}{(1-\lambda)^{3}}, \tag{B.110}
\end{align}

where the last inequality is due to the fact that $\sqrt{\gamma}\leq\lambda<1$ in Theorem 3. Furthermore, according to Lemma 20, it holds that

\begin{equation}
\left\|\sum_{t=n_{0}+1}^{n}\Gamma^{t}\begin{pmatrix}\widetilde{m}_{1}\\ \widetilde{x}_{1}\end{pmatrix}\right\|^{2}\leq M^{2}\frac{\lambda^{2n_{0}}}{(1-\lambda)^{2}}q_{1}. \tag{B.113}
\end{equation}
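For completeness, the last inequality in (B.110) can be spelled out as follows. This is a sketch of the omitted arithmetic, assuming only $\gamma\geq 0$ and $\sqrt{\gamma}\leq\lambda<1$ as stated for Theorem 3.

```latex
\begin{align*}
\sqrt{\gamma}\leq\lambda
&\;\Longrightarrow\; 1-\gamma\geq 1-\lambda^{2}=(1-\lambda)(1+\lambda)\geq 1-\lambda,\\
\frac{2\gamma^{2}}{(1-\gamma)^{2}(1-\lambda)}
&\leq\frac{2}{(1-\lambda)^{3}},
\qquad
\frac{\lambda^{2}}{(1-\lambda)^{3}}\leq\frac{1}{(1-\lambda)^{3}},\\
2\left(\frac{2\gamma^{2}}{(1-\gamma)^{2}(1-\lambda)}+\frac{\lambda^{2}}{(1-\lambda)^{3}}\right)
&\leq\frac{6}{(1-\lambda)^{3}}.
\end{align*}
```

The second line uses $\gamma^{2}\leq 1$ and $\lambda^{2}\leq 1$; multiplying the third line by $\alpha^{2}\sigma^{2}M^{2}/B$ recovers the right-hand side of (B.110).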

Combining inequalities (B.106), (B.110), and (B.113), we deduce that

\begin{align*}
&(n-n_{0})^{2}\mathbb{E}\left\|\frac{\sum_{t=n_{0}+1}^{n}\widetilde{x}_{t}}{n-n_{0}}+\frac{\sum_{k=n_{0}+1}^{n}\Sigma^{-1}\triangledown g_{\eta_{k}}(x^{*})}{n-n_{0}}\right\|^{2}\\
&\leq 5\alpha^{2}\frac{L_{f}^{2}}{B}M^{2}\left(\frac{C_{1}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{3}}(n+n_{0})+\frac{2C_{2}q_{1}}{(1-\delta)(1-\lambda)^{3}}\right)\\
&\quad+30\alpha^{2}\frac{\sigma^{2}}{B}M^{2}\frac{1}{(1-\lambda)^{3}}+M^{2}\frac{\lambda^{2n_{0}}}{(1-\lambda)^{2}}q_{1}.
\end{align*}

Taking $n_{0}$ such that $\lambda^{2n_{0}}=B^{-1}(1-\lambda)$ and $n\geq 2n_{0}$, it holds that

\begin{equation*}
\mathbb{E}\left\|\frac{\sum_{t=n_{0}+1}^{n}\widetilde{x}_{t}}{n-n_{0}}+\frac{\sum_{k=n_{0}+1}^{n}\Sigma^{-1}\triangledown g_{\eta_{k}}(x^{*})}{n-n_{0}}\right\|^{2}\leq\frac{\tilde{C}_{1}}{Bn^{2}}+\frac{\tilde{C}_{2}}{B^{2}n},
\end{equation*}

where

\begin{align*}
\tilde{C}_{1}&=40\alpha^{2}L_{f}^{2}M^{2}\frac{C_{2}q_{1}}{(1-\delta)(1-\lambda)^{3}}+120\alpha^{2}\sigma^{2}M^{2}\frac{1}{(1-\lambda)^{3}}+\frac{4M^{2}q_{1}}{1-\lambda},\\
\tilde{C}_{2}&=30\alpha^{2}\sigma^{2}L_{f}^{2}M^{2}\frac{C_{1}\alpha^{2}}{(1-\lambda)^{3}}.
\end{align*}
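The passage from the combined inequality to the final rate uses only three facts: $n\geq 2n_{0}$ gives $(n-n_{0})^{2}\geq n^{2}/4$ and $n+n_{0}\leq 3n/2$, and the choice of $n_{0}$ gives $\lambda^{2n_{0}}=(1-\lambda)/B$. The sketch below checks this arithmetic numerically. The function names (`burn_in`, `combined_bound`, `final_rate`) and all parameter values are illustrative assumptions, not quantities from the paper, and $n_{0}$ is treated as a real number so that its defining equation holds exactly.

```python
import math


def burn_in(B, lam):
    """Real-valued n0 solving lam^(2 * n0) = (1 - lam) / B."""
    return math.log((1.0 - lam) / B) / (2.0 * math.log(lam))


def combined_bound(n, n0, alpha, sigma, L_f, M, B, lam, delta, C1, C2, q1):
    """Right-hand side of the combined inequality, before dividing by (n - n0)^2."""
    r = 1.0 - lam
    grad_diff_term = 5 * alpha**2 * L_f**2 / B * M**2 * (
        C1 * alpha**2 * sigma**2 / (B * r**3) * (n + n0)
        + 2 * C2 * q1 / ((1 - delta) * r**3)
    )
    noise_term = 30 * alpha**2 * sigma**2 / B * M**2 / r**3
    init_term = M**2 * lam ** (2 * n0) / r**2 * q1
    return grad_diff_term + noise_term + init_term


def final_rate(n, alpha, sigma, L_f, M, B, lam, delta, C1, C2, q1):
    """C1_tilde / (B n^2) + C2_tilde / (B^2 n) with the constants defined above."""
    r = 1.0 - lam
    C1_tilde = (
        40 * alpha**2 * L_f**2 * M**2 * C2 * q1 / ((1 - delta) * r**3)
        + 120 * alpha**2 * sigma**2 * M**2 / r**3
        + 4 * M**2 * q1 / r
    )
    C2_tilde = 30 * alpha**2 * sigma**2 * L_f**2 * M**2 * C1 * alpha**2 / r**3
    return C1_tilde / (B * n**2) + C2_tilde / (B**2 * n)


# Illustrative parameter values (assumed, not from the paper).
params = dict(alpha=0.1, sigma=1.0, L_f=2.0, M=1.5, B=16.0, lam=0.9,
              delta=0.5, C1=1.0, C2=1.0, q1=1.0)
n0 = burn_in(params["B"], params["lam"])
for n in (math.ceil(2 * n0), 100, 10_000):
    # Dividing the combined bound by (n - n0)^2 should not exceed the final rate.
    lhs = combined_bound(n, n0, **params) / (n - n0) ** 2
    assert lhs <= final_rate(n, **params)
```

Each coefficient in $\tilde{C}_{1}$ and $\tilde{C}_{2}$ arises from multiplying the corresponding term of the combined bound by $4/n^{2}$, with the additional factor $3n/2$ applied to the term proportional to $n+n_{0}$.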

B.2 Proof of Theorem 17

From (A.35), for $\overline{L}>0$, we have

\begin{align*}
&\sum_{t=n_{0}+1}^{n}\begin{pmatrix}\widetilde{m}_{t+1}\\ \widetilde{x}_{t+1}\end{pmatrix}+\sum_{t=n_{0}+1}^{n}\begin{pmatrix}\triangledown g_{\eta_{t}}(x^{*})\\ \Sigma^{-1}\triangledown g_{\eta_{t}}(x^{*})\end{pmatrix}\\
&=-\alpha(1-\gamma)\sum_{t=1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{pmatrix}\\
&\quad+\alpha(1-\gamma)\sum_{t=1}^{n_{0}}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{pmatrix}\\
&\quad+\alpha\sum_{k=1}^{n}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{pmatrix}\\
&\quad-\alpha\sum_{k=1}^{n_{0}}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n_{0}}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{pmatrix}\\
&\quad-\alpha(1-\gamma)\sum_{t=n_{0}+1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}V_{k}\widetilde{x}_{k}\end{pmatrix}+\sum_{t=n_{0}+1}^{n}\Gamma^{t}\begin{pmatrix}\widetilde{m}_{1}\\ \widetilde{x}_{1}\end{pmatrix}.
\end{align*}

Applying Lemmas 24, 25, and 26 (stated later), we bound each component in turn. Consequently, with probability at least $1-4/T$, for fixed $\delta\in(0,\frac{1}{2}]$, we have

\begin{align*}
&(n-n_{0})\left\|\frac{\sum_{t=n_{0}+1}^{n}\widetilde{x}_{t}}{n-n_{0}}+\frac{\sum_{k=n_{0}+1}^{n}\Sigma^{-1}\triangledown g_{\eta_{k}}(x^{*})}{n-n_{0}}\right\|\\
&\leq c(2\log T+4d)\alpha\frac{L_{f}}{\sqrt{B}}M\left(\frac{2\sqrt{2}C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)^{3/2}}(\sqrt{n}+\sqrt{n_{0}})+\frac{8C_{2}q_{1}^{1/2}}{(1-\lambda)^{3/2}}\right)\\
&\quad+2\alpha\frac{\sigma}{\sqrt{B}}M\frac{\sqrt{3}c(2\log T+4d)}{(1-\lambda)^{3/2}}+\alpha M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}}n+\frac{4C_{2}^{2}q_{1}\left(n_{0}(1-\lambda)+1\right)\lambda^{n_{0}}}{(1-\lambda)^{2}}\right)+M\frac{\lambda^{n_{0}}}{1-\lambda}q_{1}^{1/2},
\end{align*}

where $c>0$ is an absolute constant. Take $n_{0}$ such that $\lambda^{n_{0}}=B^{-1/2}(1-\lambda)^{1/2}$ and $n\geq 2n_{0}$; then we have

\[
\left\|\frac{\sum_{t=n_{0}+1}^{n}\widetilde{x}_{t}}{n-n_{0}}+\frac{\sum_{t=n_{0}+1}^{n}\Sigma^{-1}\triangledown g_{\eta_{t}}(x^{*})}{n-n_{0}}\right\|\leq\frac{\tilde{C}_{1}+\tilde{C}_{3}}{\sqrt{B}n}+\frac{\tilde{C}_{2}}{B\sqrt{n}}+\frac{\tilde{C}_{4}}{B},
\tag{B.120}
\]

where

\begin{align*}
\tilde{C}_{1}&=\alpha L_{f}M\frac{16c(2\log T+4d)C_{2}q_{1}^{1/2}}{(1-\lambda)^{3/2}}+\alpha\sigma M\frac{4\sqrt{3}c(2\log T+4d)}{(1-\lambda)^{3/2}}+\frac{2Mq_{1}^{1/2}}{(1-\lambda)^{1/2}},\\
\tilde{C}_{2}&=\alpha L_{f}\sigma M\frac{8\sqrt{2}c(2\log T+4d)C_{1}\alpha}{(1-\lambda)^{3/2}},\\
\tilde{C}_{3}&=8\alpha M\overline{L}\frac{C_{2}^{2}q_{1}(n_{0}(1-\lambda)+1)}{(1-\lambda)^{3/2}},\quad\tilde{C}_{4}=4\alpha\sigma^{2}M\overline{L}\frac{C_{1}^{2}\alpha^{2}}{(1-\lambda)^{2}}.
\end{align*}

So (B.120) holds for $2n_{0}\leq n\leq T$ with probability at least $1-4/T$.
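As a rough illustration of the burn-in choice above, the following Python sketch computes an integer $n_{0}$ for given $\lambda$ and $B$. The function name `burn_in_length` and the rounding-up convention are assumptions made here for illustration only: the text asks for the exact equality $\lambda^{n_{0}}=B^{-1/2}(1-\lambda)^{1/2}$, which an integer $n_{0}$ can in general only approximate.

```python
import math


def burn_in_length(lam: float, batch_size: int) -> int:
    """Smallest integer n0 >= 1 with lam**n0 <= B**(-1/2) * (1 - lam)**(1/2).

    Hypothetical helper: rounding up to an integer is an assumption of this
    sketch, not something prescribed by the text.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Right-hand side of the defining relation: B^{-1/2} (1 - lam)^{1/2}.
    target = math.sqrt((1.0 - lam) / batch_size)
    # Solve lam**n0 = target over the reals, then round up.
    return max(1, math.ceil(math.log(target) / math.log(lam)))


if __name__ == "__main__":
    for lam, batch_size in [(0.9, 100), (0.99, 100), (0.9, 10_000)]:
        n0 = burn_in_length(lam, batch_size)
        print(f"lambda={lam}, B={batch_size}: n0={n0}, bound needs n >= {2 * n0}")
```

Under this convention $n_{0}$ grows only logarithmically in $B$, so the requirement $n\geq 2n_{0}$ is mild for moderate batch sizes, while it grows as $\lambda$ approaches 1.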

B.3 Supporting Lemmas for Bounds in Theorem 17 Proof

Lemma 24

Under (A1)-(A2) and (A3’), suppose that the $t$-th iteration $(\widetilde{m}_{t},\widetilde{x}_{t})$ satisfies

\[
\sqrt{\left\|\widetilde{m}_{j}\right\|^{2}+\left\|\widetilde{x}_{j}\right\|^{2}}\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(j-1)},
\]

for $1\leq j\leq t$. Then we have

\[
\mathbb{P}\left(\left\|\alpha(1-\gamma)\sum_{t=1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ \sum_{k=1}^{j}\gamma^{j-k}\left(\triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\right)\end{pmatrix}\right\|\geq C\right)\leq\frac{1}{T^{2}},
\]

where

\[
C=c(2\log T+4d)\,\alpha\frac{L_{f}}{\sqrt{B}}M\left(\frac{2\sqrt{2}C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\sqrt{n}+\frac{2\sqrt{2}C_{2}q_{1}^{1/2}}{(1-\delta)^{1/2}(1-\lambda)^{3/2}}\right),
\]

and $c>0$ is an absolute constant.

Proof.  Define

\[
Y_{n,k}=(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x_{k})-\triangledown g_{\eta_{k}}(x^{*})+\triangledown g(x^{*})-\triangledown g(x_{k})\end{pmatrix}.
\]

By the SGDM iteration, $\xi_{k}$ and $x_{k}$ are independent, and $\{Y_{t,k}\}_{k=1}^{t}$ is a martingale difference sequence. Due to the $L_{f}$-smoothness of the individual gradient $\triangledown f_{\xi}(x)$, it holds that

\[
\max_{\left\|v\right\|=1}\mathbb{E}\left[\exp\left(\left|v^{\top}Y_{t,k}\right|/\sigma_{k}\right)\right]\leq 2,
\]

where

\[
\sigma_{k}=c\frac{L_{f}}{\sqrt{B}}(1-\gamma)\left\|\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|\left\|\widetilde{x}_{k}\right\|,
\]

and $c>0$ is an absolute constant. By the fact that $\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\leq(1-\gamma)^{-1}(1-\lambda)^{-1}$ and $\frac{1-x}{1-x^{\delta}}\leq\delta^{-1}$ for $\delta\in(0,1]$, it holds that

\begin{align*}
&\sum_{k=1}^{n}(1-\gamma)^{2}\left\|\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|^{2}\left\|\widetilde{x}_{k}\right\|^{2}\\
&\qquad\leq M^{2}(1-\gamma)^{2}\sum_{k=1}^{n}\left(\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^{2}\left(\frac{2C_{1}^{2}}{B(1-\lambda)}\alpha^{2}\sigma^{2}+2C_{2}^{2}q_{1}\lambda^{2(1-\delta)(k-1)}\right)\\
&\qquad\leq M^{2}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{3}}n+\frac{2C_{2}^{2}q_{1}}{(1-\delta)(1-\lambda)^{3}}\right).
\end{align*}
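The two elementary facts invoked in this step can be sanity-checked numerically. The sketch below is only an illustration and not part of the proof; the helper names (`double_geometric_sum`, `double_sum_bound`, `power_ratio`) are hypothetical, and $\lambda$, $\gamma$, $x$ are treated as plain scalars in $[0,1)$.

```python
def double_geometric_sum(lam: float, gamma: float, k: int, n: int) -> float:
    """Evaluate sum_{t=k}^{n} sum_{j=k}^{t} lam**(t-j) * gamma**(j-k) directly."""
    total = 0.0
    for t in range(k, n + 1):
        for j in range(k, t + 1):
            total += lam ** (t - j) * gamma ** (j - k)
    return total


def double_sum_bound(lam: float, gamma: float) -> float:
    """Claimed upper bound (1 - gamma)^{-1} (1 - lam)^{-1}."""
    return 1.0 / ((1.0 - gamma) * (1.0 - lam))


def power_ratio(x: float, delta: float) -> float:
    """Left-hand side (1 - x) / (1 - x**delta), for x in (0, 1), delta in (0, 1]."""
    return (1.0 - x) / (1.0 - x ** delta)


if __name__ == "__main__":
    lam, gamma = 0.9, 0.5
    print(double_geometric_sum(lam, gamma, k=1, n=200), "<=", double_sum_bound(lam, gamma))
    x, delta = 0.9, 0.5
    print(power_ratio(x, delta), "<=", 1.0 / delta)
```

The first bound follows by exchanging the order of summation and bounding each geometric series by its infinite sum; the second follows from the concavity of $x\mapsto x^{\delta}$, which gives $1-x^{\delta}\geq\delta(1-x)$.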

By applying Lemma 28 to $\{Y_{n,k}\}_{k=1}^{n}$, with probability $1-1/T^{2}$, we have

\[
\left\|\sum_{k=1}^{n}Y_{n,k}\right\|\leq c(2\log T+4d)\frac{L_{f}}{\sqrt{B}}M\left(\frac{2\sqrt{2}C_{1}\alpha\sigma}{\sqrt{B}(1-\lambda)^{3/2}}\sqrt{n}+\frac{2\sqrt{2}C_{2}q_{1}^{1/2}}{(1-\delta)^{1/2}(1-\lambda)^{3/2}}\right).
\]

Lemma 25

Under (A1)-(A2) and (A3’), we have

\[
\mathbb{P}\left(\left\|\alpha\sum_{k=1}^{n}\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{pmatrix}\right\|\geq C\right)\leq\frac{1}{T^{2}},
\]

where

\[
C=\alpha\frac{\sigma}{\sqrt{B}}M\frac{\sqrt{3}c(2\log T+4d)}{(1-\lambda)^{3/2}},
\]

and $c>0$ is an absolute constant.

Proof.   Define

\[
Y_{n,k}=\left((I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right)\begin{pmatrix}0\\ \triangledown g_{\eta_{k}}(x^{*})\end{pmatrix}.
\]

Then $\{Y_{n,k}\}_{k=1}^{n}$ are independent and sub-exponential. By Lemma 29 and the fact that $\sqrt{\gamma}\leq\lambda<1$, it holds that

\begin{align*}
\sum_{k=1}^{n}\left\|(I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|^{2}
&\leq M^{2}\left(\frac{2\gamma^{2}}{(1-\gamma)^{2}(1-\lambda)}+\frac{\lambda^{2}}{(1-\lambda)^{3}}\right)\leq\frac{3M^{2}}{(1-\lambda)^{3}}.
\end{align*}
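The last inequality in this display follows from $\gamma\leq\lambda^{2}\leq\lambda<1$, which gives $1-\gamma\geq 1-\lambda$, so each of the two terms is at most a constant multiple of $(1-\lambda)^{-3}$. A small numerical check, with the common factor $M^{2}$ dropped and with hypothetical function names, might look as follows; it is a sketch for intuition rather than part of the argument.

```python
def middle_term(lam: float, gamma: float) -> float:
    """2*gamma^2 / ((1-gamma)^2 (1-lam)) + lam^2 / (1-lam)^3, with M^2 dropped."""
    return (2.0 * gamma ** 2) / ((1.0 - gamma) ** 2 * (1.0 - lam)) + lam ** 2 / (1.0 - lam) ** 3


def final_term(lam: float) -> float:
    """3 / (1-lam)^3, with M^2 dropped."""
    return 3.0 / (1.0 - lam) ** 3


def check_grid(lams, fractions) -> bool:
    """Check middle_term <= final_term for gamma = frac * lam^2, i.e. sqrt(gamma) <= lam."""
    for lam in lams:
        for frac in fractions:
            gamma = frac * lam ** 2  # enforces the assumption sqrt(gamma) <= lam
            if middle_term(lam, gamma) > final_term(lam):
                return False
    return True


if __name__ == "__main__":
    print(check_grid(lams=(0.1, 0.5, 0.9, 0.99), fractions=(0.0, 0.5, 1.0)))
```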

By applying Lemma 28 to $\{Y_{n,k}\}_{k=1}^{n}$, with probability $1-1/T^{2}$, we have

\[
\left\|\sum_{k=1}^{n}Y_{n,k}\right\|\leq\frac{\sigma}{\sqrt{B}}M\frac{\sqrt{3}c(2\log T+4d)}{(1-\lambda)^{3/2}}.
\]

Lemma 26

Under (A1)-(A2) and (A3’), suppose that the $t$-th iteration $(\widetilde{m}_{t},\widetilde{x}_{t})$ satisfies

\[
\sqrt{\left\|\widetilde{m}_{j}\right\|^{2}+\left\|\widetilde{x}_{j}\right\|^{2}}\leq\frac{C_{1}}{\sqrt{B(1-\lambda)}}\alpha\sigma+C_{2}q_{1}^{1/2}\lambda^{(1-\delta)(j-1)},
\]

for $1\leq j\leq t$ and fixed $\delta\in(0,\frac{1}{2}]$. Then, for $\overline{L}>0$, we have

\begin{align*}
&\left\|\alpha(1-\gamma)\sum_{t=n_{0}+1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ -\sum_{k=1}^{j}\gamma^{j-k}V_{k}\widetilde{x}_{k}\end{pmatrix}\right\|\\
&\qquad\leq\alpha M\overline{L}\left(\frac{2C_{1}^{2}\alpha^{2}\sigma^{2}}{B(1-\lambda)^{2}}n+4C_{2}^{2}q_{1}\frac{\left(n_{0}(1-\lambda)+1\right)\lambda^{n_{0}}}{(1-\lambda)^{2}}\right).
\end{align*}

Proof.   By the fact that $\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\leq(1-\gamma)^{-1}(1-\lambda)^{-1}$, we have

$$
\begin{aligned}
&\left\|\alpha(1-\gamma)\sum_{t=n_0+1}^{n}\sum_{j=1}^{t}\Gamma^{t-j}\begin{pmatrix}0\\ -\sum_{k=1}^{j}\gamma^{j-k}V_k\widetilde{x}_k\end{pmatrix}\right\|\\
&\quad\leq\alpha(1-\gamma)M\overline{L}\sum_{k=1}^{n}\left(\sum_{t=\max\{n_0+1,k\}}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)\left(\frac{2C_1^2}{B(1-\lambda)}\alpha^2\sigma^2+2C_2^2q_1\lambda^{2(1-\delta)(k-1)}\right)\\
&\quad\leq\alpha M\overline{L}\frac{2C_1^2\alpha^2\sigma^2}{B(1-\lambda)^2}n+2\alpha(1-\gamma)M\overline{L}C_2^2q_1\sum_{k=1}^{n_0}\left(\sum_{t=n_0+1}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)\lambda^{2(1-\delta)(k-1)}\\
&\qquad+2\alpha(1-\gamma)M\overline{L}C_2^2q_1\sum_{k=n_0+1}^{n}\left(\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)\lambda^{2(1-\delta)(k-1)}.
\end{aligned}
$$

For the second term, since $\lambda\geq\sqrt{\gamma}$ and $\delta\in(0,\frac{1}{2}]$, we have

$$
\begin{aligned}
&\sum_{k=1}^{n_0}\left(\sum_{t=n_0+1}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)\lambda^{2(1-\delta)(k-1)}\\
&\quad\leq\sum_{k=1}^{n_0}\left(\lambda^{n_0+1-k}\sum_{t=n_0+1}^{n}\lambda^{t-n_0-1}\sum_{j=k}^{t}(\gamma/\lambda)^{j-k}\right)\lambda^{2(1-\delta)(k-1)}\\
&\quad\leq\sum_{k=1}^{n_0}\frac{2}{(1-\lambda)(1-\gamma)}\lambda^{n_0+1-k+2(1-\delta)(k-1)}\leq\frac{2n_0\lambda^{n_0}}{(1-\lambda)(1-\gamma)}.
\end{aligned}
$$

For the third term, we have

$$
\begin{aligned}
\sum_{k=n_0+1}^{n}\left(\sum_{t=k}^{n}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)\lambda^{2(1-\delta)(k-1)}
&\leq\sum_{k=n_0+1}^{n}\frac{1}{(1-\lambda)(1-\gamma)}\lambda^{2(1-\delta)(k-1)}\\
&\leq\frac{2\lambda^{n_0}}{(1-\lambda)^2(1-\gamma)}.
\end{aligned}
$$

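Substituting the bounds for the second and third terms and using $\frac{2n_0\lambda^{n_0}}{1-\lambda}+\frac{2\lambda^{n_0}}{(1-\lambda)^2}=\frac{2\left(n_0(1-\lambda)+1\right)\lambda^{n_0}}{(1-\lambda)^2}$, we obtain the stated conclusion.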
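
The three geometric-sum bounds used in this proof can be sanity-checked numerically. The following is only an illustrative sketch, not part of the original argument: the function names `kernel` and `proof_sums` and the parameter values are hypothetical choices made here, and a finite check of this kind does not replace the proof.

```python
import math


def kernel(lam, gam, t, k):
    """Inner sum: sum_{j=k}^{t} lam^(t-j) * gam^(j-k)."""
    return sum(lam ** (t - j) * gam ** (j - k) for j in range(k, t + 1))


def proof_sums(lam, gam, delta, n0, n):
    """Return (lhs, rhs) pairs for the three sum bounds used in the proof."""
    # Assumptions taken from the proof: lam >= sqrt(gam), delta in (0, 1/2].
    assert 0 < gam < 1 and math.sqrt(gam) <= lam < 1 and 0 < delta <= 0.5
    assert 1 <= n0 < n
    denom = (1 - lam) * (1 - gam)

    def weight(k):
        # The factor lam^{2(1-delta)(k-1)} appearing in the second and third terms.
        return lam ** (2 * (1 - delta) * (k - 1))

    # Fact quoted at the start of the proof, maximised over k = 1..n.
    fact = max(
        sum(kernel(lam, gam, t, k) for t in range(k, n + 1))
        for k in range(1, n + 1)
    )
    # Second term: k <= n0, so t runs from n0 + 1 to n.
    second = sum(
        sum(kernel(lam, gam, t, k) for t in range(n0 + 1, n + 1)) * weight(k)
        for k in range(1, n0 + 1)
    )
    # Third term: k > n0, so t runs from k to n.
    third = sum(
        sum(kernel(lam, gam, t, k) for t in range(k, n + 1)) * weight(k)
        for k in range(n0 + 1, n + 1)
    )
    return [
        (fact, 1 / denom),
        (second, 2 * n0 * lam ** n0 / denom),
        (third, 2 * lam ** n0 / ((1 - lam) * denom)),
    ]


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    for lhs, rhs in proof_sums(lam=0.9, gam=0.5, delta=0.5, n0=10, n=50):
        print(f"lhs = {lhs:.6g}  <=  rhs = {rhs:.6g}")
```

Each printed pair should satisfy lhs ≤ rhs whenever the assumptions $\lambda\geq\sqrt{\gamma}$ and $\delta\in(0,\frac{1}{2}]$ hold; the bounds are typically far from tight.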

Appendix C Several Useful Lemmas

Lemma 27

For $0\leq\lambda,\gamma<1$, we have

$$
\sum_{k=1}^{t}\left(\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^2\leq\frac{2}{(1-\gamma)^2(1-\lambda)}.
$$

In addition, if $\lambda\geq\sqrt{\gamma}$, for fixed $\delta\in(0,1]$, we have

$$
\sum_{k=1}^{t}\left(\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^2\lambda^{2(1-\delta)(k-1)}\leq\frac{4}{\delta(1-\gamma)^2(1-\lambda)}\lambda^{2(1-\delta)(t-1)},
$$

and

$$
\sum_{k=1}^{t}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\lambda^{(1-\delta)(k-1)}\leq\frac{2}{\delta(1-\gamma)(1-\lambda)}\lambda^{(1-\delta)(t-1)}.
$$

Proof.

$$
\begin{aligned}
\sum_{k=1}^{t}\left(\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^2
&=\sum_{k=1}^{t}\sum_{i=k}^{t}\sum_{j=k}^{t}\lambda^{2t-i-j}\gamma^{i+j-2k}=\sum_{i=1}^{t}\sum_{j=1}^{t}\lambda^{2t-i-j}\sum_{k=1}^{\min\{i,j\}}\gamma^{i+j-2k}\\
&\leq\frac{1}{1-\gamma}\sum_{i=1}^{t}\sum_{j=1}^{t}\lambda^{2t-i-j}\gamma^{|i-j|}\leq\frac{2}{1-\gamma}\sum_{k=0}^{t-1}\sum_{j=1}^{t-k}\lambda^{2t-2j-k}\gamma^{k}\\
&\leq\frac{2}{1-\gamma}\sum_{k=0}^{t-1}\frac{\lambda^{k}}{1-\lambda^{2}}\gamma^{k}\leq\frac{2}{(1-\gamma)(1-\lambda)(1-\lambda\gamma)}\leq\frac{2}{(1-\gamma)^2(1-\lambda)}.
\end{aligned}
$$

If $\lambda\geq\sqrt{\gamma}$, we have

$$
\begin{aligned}
&\lambda^{-2(1-\delta)(t-1)}\sum_{k=1}^{t}\left(\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\right)^2\lambda^{2(1-\delta)(k-1)}=\sum_{k=1}^{t}\sum_{i=k}^{t}\sum_{j=k}^{t}\lambda^{2k-i-j}\gamma^{i+j-2k}\lambda^{2\delta(t-k)}\\
&\quad\leq\sum_{i=1}^{t}\sum_{j=1}^{t}\lambda^{2\delta(t-\min\{i,j\})}\sum_{k=1}^{\min\{i,j\}}\left(\frac{\gamma}{\lambda}\right)^{i+j-2k}\leq\frac{1}{1-\gamma^2/\lambda^2}\sum_{i=1}^{t}\sum_{j=1}^{t}\lambda^{2\delta(t-\min\{i,j\})}\left(\sqrt{\gamma}\right)^{|i-j|}\\
&\quad\leq\frac{2}{1-\gamma}\sum_{i=1}^{t}\sum_{k=0}^{t-i}\lambda^{2\delta(t-i)}\left(\sqrt{\gamma}\right)^{k}\leq\frac{4}{(1-\gamma)^2(1-\lambda^{2\delta})}\leq\frac{4}{\delta(1-\gamma)^2(1-\lambda)},
\end{aligned}
$$

where the last inequality is due to the fact that $\frac{1-x}{1-x^{\delta}}\leq\frac{1}{\delta}$ for $x\in[0,1)$ and $\delta\in(0,1]$. Similarly, we have

$$
\begin{aligned}
&\lambda^{-(1-\delta)(t-1)}\sum_{k=1}^{t}\sum_{j=k}^{t}\lambda^{t-j}\gamma^{j-k}\lambda^{(1-\delta)(k-1)}=\sum_{k=1}^{t}\sum_{j=k}^{t}\lambda^{\delta(t-k)}\left(\frac{\gamma}{\lambda}\right)^{j-k}\\
&\quad\leq\sum_{j=1}^{t}\lambda^{\delta(t-j)}\sum_{k=1}^{j}\left(\sqrt{\gamma}\right)^{j-k}\leq\frac{2}{1-\gamma}\cdot\frac{1}{1-\lambda^{\delta}}\leq\frac{2}{\delta(1-\gamma)(1-\lambda)}.
\end{aligned}
$$
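
As a numerical illustration of Lemma 27, the three inequalities can be evaluated directly for concrete values of $\lambda$, $\gamma$, $\delta$ and $t$. This is a minimal sketch under stated assumptions: the helper names `kernel` and `lemma27_sides` and the parameter grid are hypothetical, and the check is a sanity test rather than a substitute for the proof above.

```python
import math


def kernel(lam, gam, t, k):
    """Inner sum: sum_{j=k}^{t} lam^(t-j) * gam^(j-k)."""
    return sum(lam ** (t - j) * gam ** (j - k) for j in range(k, t + 1))


def lemma27_sides(lam, gam, delta, t):
    """Return (lhs, rhs) pairs for the three inequalities of Lemma 27."""
    # The weighted bounds assume lam >= sqrt(gam) and delta in (0, 1].
    assert 0 < gam < 1 and math.sqrt(gam) <= lam < 1 and 0 < delta <= 1
    ks = range(1, t + 1)
    kern = {k: kernel(lam, gam, t, k) for k in ks}
    a, b = 1 - gam, 1 - lam

    def w(k):
        # lam^{(1-delta)(k-1)}; its square is the weight in the second bound.
        return lam ** ((1 - delta) * (k - 1))

    lhs1 = sum(kern[k] ** 2 for k in ks)
    rhs1 = 2 / (a ** 2 * b)
    lhs2 = sum(kern[k] ** 2 * w(k) ** 2 for k in ks)
    rhs2 = 4 / (delta * a ** 2 * b) * w(t) ** 2
    lhs3 = sum(kern[k] * w(k) for k in ks)
    rhs3 = 2 / (delta * a * b) * w(t)
    return [(lhs1, rhs1), (lhs2, rhs2), (lhs3, rhs3)]


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    for lhs, rhs in lemma27_sides(lam=0.9, gam=0.5, delta=0.5, t=40):
        print(f"lhs = {lhs:.6g}  <=  rhs = {rhs:.6g}")
```

The comparison uses the unnormalized sides, so for very large $t$ both sides of the weighted bounds become tiny; dividing through by $\lambda^{2(1-\delta)(t-1)}$ or $\lambda^{(1-\delta)(t-1)}$, as in the proof, would be the more robust choice in that regime.
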
Lemma 28

Suppose $x_k\in\mathbb{R}^{d}$, $k=1,\dots,n$, are independent, zero-mean, $\sigma_k$-sub-exponential random vectors. Then

$$
\mathbb{P}\left(\left\|\sum_{k=1}^{n}x_k\right\|>c\left(\log(1/\epsilon)+2d\right)\sqrt{\sum_{k=1}^{n}\sigma_k^2}\right)\leq\epsilon,
$$

where $c>0$ is an absolute constant.

Proof.   Let $\mathcal{N}$ be a $1/2$-net of the unit ball $\{z\in\mathbb{R}^{d}:\left\|z\right\|\leq 1\}$ with respect to the Euclidean norm that satisfies $\left|\mathcal{N}\right|\leq 6^{d}$. By the inequality

$$
\max_{\left\|v\right\|=1}v^{\top}x\leq\max_{z\in\mathcal{N}}z^{\top}x+\max_{\left\|v\right\|=\frac{1}{2}}v^{\top}x=\max_{z\in\mathcal{N}}z^{\top}x+\frac{1}{2}\max_{\left\|v\right\|=1}v^{\top}x,
$$

we have

$$
\mathbb{P}\left[\max_{\left\|v\right\|=1}v^{\top}\sum_{k=1}^{n}x_k>\delta\right]\leq\mathbb{P}\left[2\max_{z\in\mathcal{N}}z^{\top}\sum_{k=1}^{n}x_k>\delta\right]\leq 6^{d}\sup_{\left\|z\right\|=1}\mathbb{P}\left[2z^{\top}\sum_{k=1}^{n}x_k>\delta\right].
$$

According to Bernstein’s inequality, there holds

[2zk=1nxk>δ]exp(c0min{δ24k=1nσk2,δ2maxkσk}),delimited-[]2superscript𝑧topsuperscriptsubscript𝑘1𝑛subscript𝑥𝑘𝛿subscript𝑐0superscript𝛿24superscriptsubscript𝑘1𝑛superscriptsubscript𝜎𝑘2𝛿2subscript𝑘subscript𝜎𝑘\displaystyle\mathbb{P}[2z^{\top}\sum_{k=1}^{n}x_{k}>\delta]\leq\exp\left(-c_{% 0}\min\left\{\frac{\delta^{2}}{4\sum_{k=1}^{n}\sigma_{k}^{2}},\frac{\delta}{2% \max_{k}\sigma_{k}}\right\}\right),blackboard_P [ 2 italic_z start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT > italic_δ ] ≤ roman_exp ( - italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT roman_min { divide start_ARG italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 4 ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , divide start_ARG italic_δ end_ARG start_ARG 2 roman_max start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG } ) ,

for some absolute constant $c_{0}$. To bound $\mathbb{P}[\max_{\left\|v\right\|=1}v^{\top}\sum_{k=1}^{n}x_{k}>\delta]\leq\epsilon$, we find $\delta$ such that

$$\delta=2\max\left\{\sqrt{\sum_{k=1}^{n}\sigma_{k}^{2}}\,\frac{1}{\sqrt{c_{0}}}\sqrt{\log(1/\epsilon)+2d},\ \max_{k}\sigma_{k}\,\frac{1}{c_{0}}\left(\log(1/\epsilon)+2d\right)\right\}\leq c\sqrt{\sum_{k=1}^{n}\sigma_{k}^{2}}\left(\log(1/\epsilon)+2d\right),$$

where $c=2\max\left\{\frac{1}{c_{0}},\frac{1}{\sqrt{c_{0}}}\right\}$.
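To see why this choice of $\delta$ suffices, the following worked step may help. It is a sketch that uses only the union bound and the Bernstein bound displayed above. The shorthand $L$ and $S$ is introduced here for illustration and is not the authors' notation, and $d\geq 1$ is assumed.

```latex
% Shorthand (not in the original): L = \log(1/\epsilon) + 2d,  S = \sum_{k=1}^{n} \sigma_k^2.
\begin{align*}
\frac{\delta^{2}}{4S} &\geq \frac{L}{c_{0}},
\qquad
\frac{\delta}{2\max_{k}\sigma_{k}} \geq \frac{L}{c_{0}}
&& \text{(one term of the max in } \delta \text{ for each term of the min)} \\
6^{d}\exp\left(-c_{0}\min\left\{\frac{\delta^{2}}{4S},\frac{\delta}{2\max_{k}\sigma_{k}}\right\}\right)
&\leq 6^{d}e^{-L}
 = \epsilon\left(\frac{6}{e^{2}}\right)^{d}
 \leq \epsilon
&& \text{(since } e^{2} > 6 \text{)}
\end{align*}
% Upper bound on \delta: \sqrt{L} \leq L because L \geq 2d \geq 1 (assuming d \geq 1),
% and \max_k \sigma_k \leq \sqrt{S}, so both terms of the max are at most
% \max\{1/c_0, 1/\sqrt{c_0}\} \sqrt{S} L.
```

The factor $2d$ inside $L$ is what absorbs the $6^{d}$ cardinality factor from the net argument.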

Lemma 29

For $\Gamma$ satisfying the conditions in Lemma 20, we have

$$\left\|(I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|\leq M\left(\gamma\sum_{j=k}^{n}\lambda^{n-j}\gamma^{j-k}+\frac{\lambda^{n-k+1}}{1-\lambda}\right),$$

and

$$\left\|(I-\Gamma)^{-1}-\sum_{t=0}^{n}\Gamma^{t}\right\|\leq M\frac{\lambda^{n+1}}{1-\lambda}.$$

Proof.   From Lemma 20, we have $\Gamma=\tilde{P}\Lambda\tilde{P}^{-1}$. Then

$$\begin{aligned}
&\left\|(I-\Gamma)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Gamma^{t-j}\gamma^{j-k}\right\|\leq\left\|P\right\|\left\|P^{-1}\right\|\left\|(I-\Lambda)^{-1}-(1-\gamma)\sum_{t=k}^{n}\sum_{j=k}^{t}\Lambda^{t-j}\gamma^{j-k}\right\| \\
&\quad\leq M\left\|\sum_{j=0}^{\infty}\Lambda^{j}-(1-\gamma)\sum_{j=0}^{n-k}\Lambda^{j}\sum_{l=0}^{n-k-j}\gamma^{l}\right\|=M\left\|\sum_{j=0}^{n-k}\Lambda^{j}\left(1-(1-\gamma)\sum_{l=0}^{n-k-j}\gamma^{l}\right)+\sum_{j=n-k+1}^{\infty}\Lambda^{j}\right\| \\
&\quad\leq M\left(\gamma\sum_{j=k}^{n}\lambda^{n-j}\gamma^{j-k}+\frac{\lambda^{n-k+1}}{1-\lambda}\right).
\end{aligned}$$

Also we have

$$\left\|(I-\Gamma)^{-1}-\sum_{t=0}^{n}\Gamma^{t}\right\|\leq M\sum_{j=n+1}^{\infty}\lambda^{j}\leq M\frac{\lambda^{n+1}}{1-\lambda}.$$
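The last inequality in the first bound of Lemma 29 rests on a finite geometric sum and a change of index. The sketch below spells these steps out. It assumes, as the proof appears to take from Lemma 20, that $\left\|\Lambda^{j}\right\|\leq\lambda^{j}$ with $0\leq\lambda<1$, and that $\gamma\geq 0$; the indices $m$ and $i$ are introduced here for illustration only.

```latex
% Assumptions (not restated in this proof): \|\Lambda^j\| \leq \lambda^j, 0 \leq \lambda < 1, \gamma \geq 0.
\begin{align*}
% Finite geometric sum, applied with m = n - k - j:
1-(1-\gamma)\sum_{l=0}^{m}\gamma^{l}
  &= 1-\left(1-\gamma^{m+1}\right) = \gamma^{m+1}, \\
% First term: bound each power of \Lambda, then reindex with i = n - j:
\left\|\sum_{j=0}^{n-k}\Lambda^{j}\gamma^{n-k-j+1}\right\|
  &\leq \gamma\sum_{j=0}^{n-k}\lambda^{j}\gamma^{n-k-j}
   = \gamma\sum_{i=k}^{n}\lambda^{n-i}\gamma^{i-k}, \\
% Second term: geometric tail:
\left\|\sum_{j=n-k+1}^{\infty}\Lambda^{j}\right\|
  &\leq \sum_{j=n-k+1}^{\infty}\lambda^{j}
   = \frac{\lambda^{n-k+1}}{1-\lambda}.
\end{align*}
```

Adding the two bounds by the triangle inequality and multiplying by $M$ recovers the stated right-hand side.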