A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging

Shiqiang Wang
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
wangshiq@us.ibm.com
Mingyue Ji
Department of ECE, University of Utah
Salt Lake City, UT 84112
mingyue.ji@utah.edu
Abstract

In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works addressing this problem are usually based on global variance reduction, which requires a substantial amount of additional memory that scales multiplicatively with the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need to find optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns.

1 Introduction

We consider the problem of finding $\mathbf{x}\in\mathbb{R}^{d}$ that minimizes the distributed finite-sum objective:

$$f(\mathbf{x}) := \frac{1}{N}\sum_{n=1}^{N} F_{n}(\mathbf{x}), \tag{1}$$

where each individual (local) objective $F_{n}(\mathbf{x})$ is only computable at the client $n$. This problem often arises in the context of federated learning (FL) (Kairouz et al., 2021, Li et al., 2020a, Yang et al., 2019), where $F_{n}(\mathbf{x})$ is defined on client $n$’s local dataset, $f(\mathbf{x})$ is the global objective, and $\mathbf{x}$ is the parameter vector of the model being trained. Each client keeps its local dataset to itself, which is not shared with other clients or the server. It is possible to extend (1) to weighted average with positive coefficients multiplied to each $F_{n}(\mathbf{x})$, but for simplicity, we consider such coefficients to be included in $\{F_{n}(\mathbf{x}) : \forall n\}$ (see Appendix A.1) and do not write them out.

Federated averaging (FedAvg) is a commonly used algorithm for minimizing (1), which alternates between local updates at each client and parameter aggregation among multiple clients with the help of a server (McMahan et al., 2017). However, there are several challenges in FedAvg, including data heterogeneity and partial participation of clients, which can cause performance degradation and even non-convergence if the FedAvg algorithm is improperly configured.

Unknown, Uncontrollable, and Heterogeneous Participation of Clients. Most existing works on FL with partial client participation assume that the clients participate according to a known or controllable random process (Karimireddy et al., 2020, Yang et al., 2021, Chen et al., 2022, Fraboni et al., 2021a, Li et al., 2020b; c). In practice, however, it is common for clients to have heterogeneous and time-varying computation power and network bandwidth, which depend on both the inherent characteristics of each client and other tasks that concurrently run in the system. This generally leads to heterogeneous participation statistics across clients, which are difficult to know a priori due to their complex dependency on various factors in the system (Wang et al., 2021). It is also generally impossible to fully control the participation statistics, due to the randomness of whether a client can successfully complete a round of model updates (Bonawitz et al., 2019).

The problem with heterogeneous and unknown participation statistics is that they may cause the result of FL to be biased towards certain local objectives, so that it deviates from the optimum of the original objective in (1). In FL, data heterogeneity across clients is a common phenomenon, resulting in diverse local objectives $\{F_{n}(\mathbf{x})\}$. The participation heterogeneity is often correlated with data heterogeneity, because the characteristics of different user populations may be correlated with how powerful their devices are. Intuitively, when some clients participate more frequently than others, the final FL result will favor the local objectives of those frequently participating clients, possibly discriminating against clients that participate less frequently.

A few recent works that aim to address this problem are based on the idea of global variance reduction by saving the most recent updates of all the clients, which requires a substantial amount of additional memory on the order of $Nd$, i.e., the total number of clients times the dimension of the model parameter vector (Yang et al., 2022, Yan et al., 2020, Gu et al., 2021, Jhunjhunwala et al., 2022). This additional memory consumption is either incurred at the server or evenly distributed to all the clients. For practical FL systems with many clients, this causes unnecessary memory usage that affects the overall capability and performance of the system. Therefore, we ask the following important question in this paper:

Is there a lightweight method that provably minimizes the original objective in (1), when the participation statistics of clients are unknown, uncontrollable, and heterogeneous?

We leverage the insight that we can apply different weights to different clients’ updates in the parameter aggregation stage of FedAvg. If this is done properly, the effect of heterogeneous participation can be canceled out so that we can minimize (1), as shown in existing works that assume known participation statistics (Chen et al., 2022, Fraboni et al., 2021a, Li et al., 2020b; c). However, in our setting, we do not know the participation statistics a priori, which makes it challenging to compute (estimate) the optimal aggregation weights. It is also non-trivial to quantify the impact of estimation error on convergence.

Our Contributions. We thoroughly analyze this problem and make the following novel contributions.

  1. To motivate the need for adaptive weighting in parameter aggregation, we show that FedAvg with non-optimal weights minimizes a different objective (defined in (2)) instead of (1).

  2. We propose a lightweight procedure for estimating the optimal aggregation weight at each client $n$ as part of the overall FL process, based on client $n$’s participation history. We name this new algorithm FedAU, which stands for FedAvg with adaptive weighting to support unknown participation statistics.

  3. We analyze the convergence upper bound of FedAU, using a novel method that first obtains a weight error term in the convergence bound and then further bounds the weight error term via a bias-variance decomposition approach. Our result shows that FedAU converges to an optimal solution of the original objective (1). In addition, a desirable linear speedup of convergence with respect to the number of clients is achieved when the number of FL rounds is large enough.

  4. We verify the advantage of FedAU in experiments with several datasets and baselines, with a variety of participation patterns including those that are independent, Markovian, and cyclic.

Related Work. Earlier works on FedAvg considered the convergence analysis with full client participation (Gorbunov et al., 2021, Haddadpour et al., 2019, Lin et al., 2020, Stich, 2019, Wang & Joshi, 2019; 2021, Yu et al., 2019, Malinovsky et al., 2023), which do not capture the fact that only a subset of clients participates in each round in practical FL systems. Recently, partial client participation has come to attention. Some works analyzed the convergence of FedAvg where the statistics or patterns of client participation are known or controllable (Fraboni et al., 2021a; b, Li et al., 2020c, Yang et al., 2021, Wang & Ji, 2022, Cho et al., 2023, Karimireddy et al., 2020, Li et al., 2020b, Chen et al., 2022, Rizk et al., 2022). However, as pointed out by Wang et al. (2021) and Bonawitz et al. (2019), the participation of clients in FL can have complex dependencies on the underlying system characteristics, which makes it difficult to know or control each client’s behavior a priori. A recent work analyzed the convergence for a re-weighted objective (Patel et al., 2022), where the re-weighting is essentially arbitrary for unknown participation distributions. Some recent works (Yang et al., 2022, Yan et al., 2020, Gu et al., 2021, Jhunjhunwala et al., 2022) aimed at addressing this problem using variance reduction, by including the most recent local update of each client in the global update, even if the client does not participate in the current round. These methods require a substantial amount of additional memory to store the clients’ local updates. In contrast, our work focuses on developing a lightweight algorithm that has virtually the same memory requirement as the standard FedAvg algorithm.

A related area is adaptive FL algorithms, where adaptive gradients (Reddi et al., 2021, Wang et al., 2022b; c) and adaptive local updates (Ruan et al., 2021, Wang et al., 2020) were studied. Some recent works viewed the adaptation of aggregation weights from different perspectives (Wu & Wang, 2021, Tan et al., 2022, Wang et al., 2022a), which do not address the problem of unknown participation statistics. All these methods are orthogonal to our work and can potentially work together with our algorithm. To the best of our knowledge, no prior work has studied weight adaptation in the presence of unknown participation statistics with provable convergence guarantees.

A unique aspect of our problem is that the statistics related to participation need to be collected across multiple FL rounds. Although Wang & Ji (2022) aimed at extracting a participation-specific term in the convergence bound, that approach still requires the aggregation weights in each round to sum to one (thus coordinated participation); it also requires an amplification procedure over multiple rounds for the bound to hold, making it difficult to tune the hyperparameters. In contrast, this paper considers uncontrolled and uncoordinated participation without sophisticated amplification mechanisms.

2 FedAvg with Pluggable Aggregation Weights

Input: $\gamma$, $\eta$, $\mathbf{x}_0$, $I$;  Output: $\{\mathbf{x}_t : \forall t\}$;
Initialize $t_0 \leftarrow 0$, $\mathbf{u} \leftarrow \mathbf{0}$;
for $t = 0, \ldots, T-1$ do
    for $n = 1, \ldots, N$ in parallel do
        Sample $\mathbb{1}_t^n$ from an unknown process;
        if $\mathbb{1}_t^n = 1$ then
            $\mathbf{y}_{t,0}^n \leftarrow \mathbf{x}_t$;
            for $i = 0, \ldots, I-1$ do
                $\mathbf{y}_{t,i+1}^n \leftarrow \mathbf{y}_{t,i}^n - \gamma \mathbf{g}_n(\mathbf{y}_{t,i}^n)$;
            $\Delta_t^n \leftarrow \mathbf{y}_{t,I}^n - \mathbf{x}_t$;
        else
            $\Delta_t^n \leftarrow \mathbf{0}$;
        $\omega_t^n \leftarrow \texttt{ComputeWeight}(\{\mathbb{1}_\tau^n : \tau < t\})$;
    $\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t + \frac{\eta}{N}\sum_{n=1}^{N} \omega_t^n \Delta_t^n$;
Algorithm 1: FedAvg with pluggable aggregation weights

We begin by describing a generic FedAvg algorithm that includes a separate oracle for computing the aggregation weights, as shown in Algorithm 1. In this algorithm, there are a total of $T$ rounds, where each round $t$ includes $I$ steps of local stochastic gradient descent (SGD) at a participating client. For simplicity, we consider $I$ to be the same for all the clients, while noting that our algorithm and results can be extended to more general cases. We use $\gamma > 0$ and $\eta > 0$ to denote the local and global step sizes, respectively. The variable $\mathbf{x}_0$ is the initial model parameter, $\mathbb{1}_t^n$ is an indicator function that is equal to one if client $n$ participates in round $t$ and zero otherwise, and $\mathbf{g}_n(\cdot)$ is the stochastic gradient of the local objective $F_n(\cdot)$ for each client $n$.

The main steps of Algorithm 1 are similar to those of standard FedAvg, but with a few notable items as follows. 1) In the step that samples $\mathbb{1}_t^n$, we clearly state that we do not have prior knowledge of the sampling process of client participation. 2) The ComputeWeight step calls a separate oracle to compute the aggregation weight $\omega_t^n$ ($\omega_t^n > 0$) for client $n$ in round $t$. This computation is done on each client $n$ alone, without coordinating with other clients. We do not need to save the full sequence of participation records $\{\mathbb{1}_\tau^n : \tau < t\}$, because it is sufficient to save an aggregated metric of the participation record for weight computation. In Section 3, we will see that FedAU uses the average participation interval for weight computation, where the average can be computed in an online manner. We also note that we do not include $\mathbb{1}_t^n$ of the current round $t$ in computing the weight, which is needed for the convergence analysis so that $\omega_t^n$ is independent of the local parameter $\mathbf{y}_{t,i}^n$ when the initial parameter of round $t$ (i.e., $\mathbf{x}_t$) is given. 3) The parameter aggregation in the final step of each round is weighted by $\omega_t^n$ for each client $n$.
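To make the structure of Algorithm 1 concrete, the following is a minimal Python sketch under stated assumptions. The names `fedavg_pluggable`, `grads`, `participate`, and `compute_weight` are hypothetical and not from the paper; the local gradients are deterministic stand-ins for stochastic gradients, and the full participation history is stored only for readability (the text notes that an aggregated metric suffices).

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fedavg_pluggable(
    grads: Sequence[Callable[[Vector], Vector]],   # one gradient oracle per client
    participate: Callable[[int, int], bool],       # (round t, client n) -> participates?
    compute_weight: Callable[[Sequence[int]], float],  # history before round t -> weight
    x0: Vector,
    gamma: float,      # local step size
    eta: float,        # global step size
    local_steps: int,  # I in the text
    rounds: int,       # T in the text
) -> List[Vector]:
    n_clients = len(grads)
    history: List[List[int]] = [[] for _ in range(n_clients)]
    x = list(x0)
    iterates = [list(x)]
    for t in range(rounds):
        agg = [0.0] * len(x)
        for n in range(n_clients):
            took_part = participate(t, n)
            # The weight only sees rounds tau < t, never the current round.
            w = compute_weight(history[n])
            if took_part:
                y = list(x)
                for _ in range(local_steps):
                    g = grads[n](y)
                    y = [yi - gamma * gi for yi, gi in zip(y, g)]
                delta = [yi - xi for yi, xi in zip(y, x)]
                agg = [a + w * d for a, d in zip(agg, delta)]
            # Non-participating clients contribute a zero update.
            history[n].append(1 if took_part else 0)
        x = [xi + (eta / n_clients) * a for xi, a in zip(x, agg)]
        iterates.append(list(x))
    return iterates


# Toy check: two clients with F_n(x) = 0.5 * (x - c_n)^2, full participation,
# and unit weights. The global minimizer is then the mean of the c_n.
centers = (0.0, 10.0)
toy_grads = [lambda y, c=c: [yi - c for yi in y] for c in centers]
iterates = fedavg_pluggable(
    toy_grads, lambda t, n: True, lambda hist: 1.0,
    x0=[0.0], gamma=0.1, eta=1.0, local_steps=5, rounds=200,
)
final = iterates[-1]
```

In this toy setting every client participates in every round, so unit weights already target the original objective; the interesting cases arise once `participate` becomes heterogeneous across clients.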

Objective Inconsistency with Improper Aggregation Weights. We first show that without weight adaptation, FedAvg minimizes an alternative objective that is generally different from (1).

Theorem 1 (Objective minimized at convergence, informal).

When $\mathbb{1}_t^n \sim \mathrm{Bernoulli}(p_n)$ and the weights are time-constant, i.e., $\omega_t^n = \omega_n$ but generally $\omega_n$ may not be equal to $\omega_{n'}$ ($n \neq n'$), with properly chosen learning rates $\gamma$ and $\eta$ and some other assumptions, Algorithm 1 minimizes the following objective:

$$h(\mathbf{x}) := \frac{1}{P}\sum_{n=1}^{N} \omega_n p_n F_n(\mathbf{x}), \tag{2}$$

where $P := \sum_{n=1}^{N} \omega_n p_n$.

A formal version of the theorem is given in Appendix B.4. Theorem 1 shows that, even in the special case where each client $n$ participates according to a Bernoulli distribution with probability $p_n$, choosing a constant aggregation weight such as $\omega_n = 1, \forall n$ as in standard FedAvg causes the algorithm to minimize a different objective, in which each $F_n$ is weighted by $p_n$. As mentioned earlier, this implicit weighting discriminates against clients that participate less frequently. In addition, since the participation statistics (here, the probabilities $\{p_n\}$) of clients are unknown, the exact objective being minimized is also unknown, and it is generally unreasonable to minimize an unknown objective. This means that it is important to design an adaptive method to find the aggregation weights, so that we can minimize (1) even when the participation statistics are unknown, which is our focus in this paper.
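The effect described by Theorem 1 can be illustrated with a small worked example. The sketch below assumes scalar quadratic local objectives $F_n(x) = \frac{1}{2}(x - c_n)^2$, which are not from the paper, so that the minimizer of (2) has a closed form: a weighted mean of the $c_n$ with coefficients $\omega_n p_n / P$. The function name `minimizer_of_h` and the numbers are hypothetical.

```python
def minimizer_of_h(centers, probs, weights):
    """Minimizer of h(x) in (2) when F_n(x) = 0.5 * (x - c_n)^2.

    Setting the derivative to zero gives the mean of the c_n
    weighted by omega_n * p_n / P.
    """
    coeffs = [w * p for w, p in zip(weights, probs)]
    big_p = sum(coeffs)
    return sum(c * k for c, k in zip(centers, coeffs)) / big_p


centers = [0.0, 10.0]
probs = [0.9, 0.1]  # client 0 participates far more often than client 1

# Minimizer of the original objective (1): the plain mean of the c_n.
x_original = sum(centers) / len(centers)

# Unit weights, as in standard FedAvg: pulled towards the frequent client.
x_unit = minimizer_of_h(centers, probs, [1.0, 1.0])

# Inverse-probability weights: the participation bias cancels out.
x_inverse = minimizer_of_h(centers, probs, [1.0 / p for p in probs])
```

With these numbers, unit weights place the minimizer near the frequently participating client's optimum, while weights $1/p_n$ recover the minimizer of (1).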

The full proofs of all mathematical claims are in Appendix B.

3 FedAU: Estimation of Optimal Aggregation Weights

In this section, we describe the computation of aggregation weights $\{\omega_t^n\}$ based on the participation history observed at each client, which is the core of our FedAU algorithm that extends FedAvg. Our goal is to choose $\{\omega_t^n\}$ so that the algorithm minimizes the original objective (1) as closely as possible.

Intuition. We build from the intuition in Theorem 1 and design an aggregation weight adaptation algorithm that works for general participation patterns, i.e., not limited to the Bernoulli distribution considered in Theorem 1. From (2), we see that if we can choose $\omega_n = 1/p_n$, the objective being minimized is the same as (1). We note that $p_n \approx \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{1}_t^n$ for each client $n$ when $T$ is large, due to ergodicity of the Bernoulli distribution considered in Theorem 1. Extending to general participation patterns that are not limited to the Bernoulli distribution, intuitively, we would like to choose the weight $\omega_n$ to be inversely proportional to the average frequency of participation. In this way, the bias caused by lower participation frequency is “canceled out” by the higher weight used in aggregation. Based on this intuition, our goal of aggregation weight estimation is as follows.
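As a quick numerical illustration of this intuition, the sketch below simulates Bernoulli participation for a single client and compares the inverse of the empirical participation frequency with $1/p_n$. The probability, horizon, seed, and the name `inverse_frequency_weight` are all assumptions chosen for illustration.

```python
import random


def inverse_frequency_weight(history):
    """Return T divided by the number of participating rounds."""
    count = sum(history)
    if count == 0:
        raise ValueError("client never participated; the weight is undefined")
    return len(history) / count


rng = random.Random(0)   # fixed seed so the run is repeatable
p_true = 0.25            # unknown to the algorithm; used only to simulate
horizon = 100_000
history = [1 if rng.random() < p_true else 0 for _ in range(horizon)]

w_hat = inverse_frequency_weight(history)  # should be close to 1 / p_true = 4
```

This estimate uses the whole horizon in hindsight, which is exactly what is unavailable during training.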

Problem 1 (Goal of Weight Estimation, informal).

Choose $\{\omega_t^n\}$ so that its long-term average (i.e., for large $T$) $\frac{1}{T}\sum_{t=0}^{T-1} \omega_t^n$ is close to $\frac{1}{\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{1}_t^n}$, for each $n$.

Some previous works have discovered this need of debiasing the skewness of client participation (Li et al., 2020c, Perazzone et al., 2022) or designing the client sampling scheme to ensure that the updates are unbiased (Fraboni et al., 2021a, Li et al., 2020b). However, in our work, we consider the more realistic case where the participation statistics are unknown, uncontrollable, and heterogeneous. In this case, we are unable to directly find the optimal aggregation weights because we do not know the participation statistics a priori.

Technical Challenge. If we were to know the participation pattern for all the $T$ rounds, an immediate solution to Problem 1 is to choose $\omega_t^n$ (for each client $n$) to be equal to $T$ divided by the number of rounds where client $n$ participates. We can see that this solution is equal to the average interval between every two adjacent participating rounds, assuming that the first interval starts right before the first round $t = 0$. However, since we do not know the future participation pattern or statistics in each round $t$, we cannot directly apply this solution. In other words, in every round $t$, we need to perform an online estimation of the weight $\omega_t^n$ based on the participation history up to round $t-1$.
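The equivalence between "$T$ divided by the number of participating rounds" and the average participation interval can be checked on a small hand-made history. In the sketch below, the helper name `participation_intervals` and the example pattern are hypothetical; the pattern ends with a participating round so that the intervals cover all $T$ rounds exactly.

```python
def participation_intervals(history):
    """Lengths of the intervals between adjacent participating rounds.

    The first interval is assumed to start right before round t = 0.
    """
    intervals = []
    last = -1
    for t, took_part in enumerate(history):
        if took_part:
            intervals.append(t - last)
            last = t
    return intervals


history = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]  # T = 10, four participating rounds
intervals = participation_intervals(history)       # [3, 2, 1, 4]
avg_interval = sum(intervals) / len(intervals)      # 2.5
hindsight_weight = len(history) / sum(history)      # 10 / 4 = 2.5
```

If the history ended with non-participating rounds, the trailing rounds would not belong to a completed interval and the two quantities would differ slightly.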

A challenge in this online setting is that the estimation accuracy is related to the number of times each client $n$ has participated until round $t-1$. When $t$ is small and client $n$ has not yet participated in any of the preceding rounds, we do not have any information about how to choose $\omega_t^n$. For an intermediate value of $t$ where client $n$ has participated only in a few rounds, we have limited information about the choice of $\omega_t^n$. In this case, if we directly use the average participation interval up to the $(t-1)$-th round, the resulting $\omega_t^n$ can be far from its optimal value, i.e., the estimation has a high variance if the client participation follows a random process. This is problematic especially when there exists a long interval between two rounds (both before the $(t-1)$-th round) where the client participates. Although the probability of the occurrence of such a long interval is usually low, when it occurs, it results in a long average interval for the first $t$ rounds when $t$ is relatively small, and using this long average interval as the value of $\omega_t^n$ may cause instability to the training process.

Key Idea. To overcome this challenge, we define a positive integer K𝐾Kitalic_K as a “cutoff” interval length. If a client has not participated for K𝐾Kitalic_K rounds, we consider K𝐾Kitalic_K to be a participation interval that we sample and start a new interval thereafter. In this way, we can limit the length of each interval by adjusting K𝐾Kitalic_K. By setting ωtnsuperscriptsubscript𝜔𝑡𝑛\omega_{t}^{n}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT to be the average of this possibly cutoff participation interval, we overcome the aforementioned challenge. From a theoretical perspective, we note that ωtnsuperscriptsubscript𝜔𝑡𝑛\omega_{t}^{n}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT will be a biased estimation when K<𝐾K<\inftyitalic_K < ∞ and the bias will be larger when K𝐾Kitalic_K is smaller. In contrast, a smaller K𝐾Kitalic_K leads to a smaller variance of ωtnsuperscriptsubscript𝜔𝑡𝑛\omega_{t}^{n}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, because we collect more samples in the computation of ωtnsuperscriptsubscript𝜔𝑡𝑛\omega_{t}^{n}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT with a smaller K𝐾Kitalic_K. Therefore, an insight here is that K𝐾Kitalic_K controls the bias-variance tradeoff111Note that we focus on the aggregation weights here, which is different from classical concept of the bias-variance tradeoff of the model. of the aggregation weight ωtnsuperscriptsubscript𝜔𝑡𝑛\omega_{t}^{n}italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. In Section 4, we will formally show this property and obtain desirable convergence properties of the weight error term and the overall objective function (1), by properly choosing K𝐾Kitalic_K in the theoretical analysis. 
Our experimental results in Section 5 also confirm that choosing an appropriate value of K<𝐾K<\inftyitalic_K < ∞ improves the performance in most cases.
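
To make the cutoff idea concrete, the following minimal Python sketch splits a participation record into (possibly cutoff) intervals offline. The function name `cutoff_intervals`, the use of `None` for "no cutoff", and the example record are illustrative assumptions and do not appear in the original text.

```python
def cutoff_intervals(record, K=None):
    """Split a 0/1 participation record into (possibly cutoff) intervals.

    record: sequence of 0/1 flags, one per round (1 = client participated).
    K: cutoff interval length; None means no cutoff (hypothetical convention).
    """
    intervals = []
    run = 0  # length of the interval currently being measured
    for participated in record:
        run += 1
        # Close the interval on participation, or when it reaches the cutoff K.
        if participated or (K is not None and run == K):
            intervals.append(run)
            run = 0
    return intervals


# Example record with one long gap between two participations.
record = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
no_cutoff = cutoff_intervals(record)
with_cutoff = cutoff_intervals(record, K=4)
print(no_cutoff, sum(no_cutoff) / len(no_cutoff))
print(with_cutoff, sum(with_cutoff) / len(with_cutoff))
```

In this example the nine-round gap is kept as a single long sample without a cutoff, whereas with $K=4$ it is split into shorter samples, so the average interval is based on more samples and is less dominated by the single long gap.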

Input: $K$, $\{{\rm I\kern-.2em l}_t^n : \forall t, n\}$;  Output: $\{\omega_t^n : \forall t, n\}$;
1  for $n = 1, \ldots, N$ in parallel do
2      Initialize $M_n \leftarrow 0$, $S_n^\diamond \leftarrow 0$, $\omega_0^n \leftarrow 1$;
3      for $t = 1, \ldots, T-1$ do
4          $S_n^\diamond \leftarrow S_n^\diamond + 1$;
5          if ${\rm I\kern-.2em l}_{t-1}^n = 1$ or $S_n^\diamond = K$ then
6              $S_n \leftarrow S_n^\diamond$; // final interval computed
7              $\omega_t^n \leftarrow \begin{cases} S_n, & \textrm{if } M_n = 0 \\ \frac{M_n \cdot \omega_{t-1}^n + S_n}{M_n + 1}, & \textrm{if } M_n \geq 1 \end{cases}$;
8              $M_n \leftarrow M_n + 1$;
9              $S_n^\diamond \leftarrow 0$;
10         else
11             $\omega_t^n \leftarrow \omega_{t-1}^n$;
Algorithm 2 Weight computation in FedAU

Online Algorithm. Based on the above insight, we describe the procedure of computing the aggregation weights $\{\omega_t^n\}$, as shown in Algorithm 2. The computation is independent for each client $n$. In this algorithm, the variable $M_n$ denotes the number of (possibly cutoff) participation intervals that have been collected, and $S_n^\diamond$ denotes the length of the latest interval, which is still being computed. We compute the interval by incrementing $S_n^\diamond$ by one in every round, until the condition of the if-statement in Algorithm 2 holds. When this condition holds, $S_n = S_n^\diamond$ is the actual length of the latest participation interval with possible cutoff. As explained above, we always start a new interval when $S_n^\diamond$ reaches $K$. Also note that we consider ${\rm I\kern-.2em l}_{t-1}^n$ instead of ${\rm I\kern-.2em l}_t^n$ in this condition and start the inner loop of Algorithm 2 from $t=1$, to align with the requirement in Algorithm 1 that the weights are computed from the participation records before (not including) the current round $t$. For $t=0$, we always use $\omega_t^n = 1$. In the weight update step of Algorithm 2, we compute the weight using an online averaging method, which is equivalent to averaging over all the participation intervals that have been observed until each round $t$. With this method, we do not need to save all the previous participation intervals. Essentially, the computation in each round $t$ only requires three scalar state variables: $M_n$, $S_n^\diamond$, and the previous round's weight $\omega_{t-1}^n$. This makes the algorithm extremely memory efficient.
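
The per-client procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `WeightState`, `update_weight`, and `compute_weights` are hypothetical, and passing `float("inf")` as `K` is used here as a convention for "no cutoff".

```python
from dataclasses import dataclass


@dataclass
class WeightState:
    """The three scalar state variables kept per client."""
    num_intervals: int = 0   # M_n: number of intervals collected so far
    current_length: int = 0  # S_n^diamond: length of the interval in progress
    weight: float = 1.0      # omega_{t-1}^n, initialized to omega_0^n = 1


def update_weight(state, participated_prev, K):
    """One step of the inner loop: returns omega_t^n given the flag of round t-1."""
    state.current_length += 1
    if participated_prev or state.current_length == K:
        interval = state.current_length  # S_n: final (possibly cutoff) interval
        if state.num_intervals == 0:
            state.weight = float(interval)
        else:
            # Online average over all intervals observed so far.
            state.weight = (state.num_intervals * state.weight + interval) / (
                state.num_intervals + 1
            )
        state.num_intervals += 1
        state.current_length = 0
    return state.weight


def compute_weights(record, K):
    """Weights omega_0, ..., omega_{T-1} for one client, with T = len(record)."""
    state = WeightState()
    weights = [state.weight]  # omega_0 = 1
    for t in range(1, len(record)):
        weights.append(update_weight(state, record[t - 1], K))
    return weights


record = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(compute_weights(record, K=4))
print(compute_weights(record, K=float("inf")))
```

Only `WeightState` is carried from round to round, which mirrors the observation that no history of past intervals needs to be stored.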

In the full FedAU algorithm, we plug in the result of $\omega_t^n$ for each round $t$ obtained from Algorithm 2 into the weight computation step of Algorithm 1. In other words, ComputeWeight in Algorithm 1 calls one update step of Algorithm 2, i.e., one iteration of its inner loop over $t$.

Compatibility with Privacy-Preserving Mechanisms. In our FedAU algorithm, the aggregation weight computation (Algorithm 2) is done individually at each client, which only uses the client's participation states and does not use the training dataset or the model. When using these aggregation weights as part of FedAvg in Algorithm 1, the weight $\omega_t^n$ can be multiplied with the parameter update $\Delta_t^n$ at each client $n$ (and in each round $t$) before the update is transmitted to the server. In this way, methods such as secure aggregation (Bonawitz et al., 2017) can be applied directly, since the server only needs to compute a sum of the participating clients' updates. Differentially private FedAvg methods (McMahan et al., 2018, Andrew et al., 2021) can be applied in a similar way.
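
The following minimal Python sketch illustrates why pre-multiplying at the client leaves the server with a plain sum. The function names `client_message` and `server_sum` are hypothetical, updates are represented as plain lists of floats, and any further scaling that the server applies to the sum in Algorithm 1 is deliberately left out because Algorithm 1 is not reproduced in this section.

```python
def client_message(weight, update):
    """Client side: scale the local update Delta by its own weight omega."""
    return [weight * u for u in update]


def server_sum(messages):
    """Server side: coordinate-wise sum; individual weights are never needed."""
    return [sum(coords) for coords in zip(*messages)]


# Two participating clients with hypothetical weights and updates.
weights = [2.0, 0.5]
updates = [[1.0, 2.0], [3.0, -1.0]]

messages = [client_message(w, d) for w, d in zip(weights, updates)]
aggregate = server_sum(messages)

# Same result as a weighted sum computed with knowledge of every weight.
reference = [
    sum(w * d[i] for w, d in zip(weights, updates)) for i in range(len(updates[0]))
]
print(aggregate, reference)
```

Because the server-side operation is a sum over messages, it has the form that sum-based protocols such as secure aggregation operate on.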

Practical Implementation. We will see from our experimental results in Section 5 that a coarsely chosen value of $K$ gives a reasonably good performance in practice, which means that we do not need to fine-tune $K$. There are also other engineering tweaks that can be made in practice, such as using an exponentially weighted average in the weight update step of Algorithm 2 to put more emphasis on the recent participation characteristics of clients. In an extreme case where each client participates only once, a possible solution is to group clients that have similar computation power (e.g., same brand/model of devices) and are in similar geographical locations together. They may share the same state variables $M_n$, $S_n^\diamond$, and $\omega_{t-1}^n$ used for weight computation in Algorithm 2. We note that according to the lower bound derived by Yang et al. (2022), if each client participates only once, it is impossible to have an algorithm to converge to the original objective without sharing additional information.
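
As one possible reading of the exponentially weighted variant mentioned above, the sketch below replaces the running average over intervals with an exponential moving average. The text does not specify this variant in detail, so the function name `ewma_weight`, the smoothing parameter `beta`, and its default value are all assumptions made for illustration.

```python
def ewma_weight(prev_weight, interval, num_intervals, beta=0.9):
    """Exponentially weighted alternative to the running average of intervals.

    prev_weight: previous weight omega_{t-1}^n.
    interval: newly completed (possibly cutoff) interval S_n.
    num_intervals: M_n, number of intervals collected before this one.
    beta: hypothetical smoothing factor in [0, 1); larger values forget more slowly.
    """
    if num_intervals == 0:
        # First sample: same initialization as the running-average rule.
        return float(interval)
    # Recent intervals receive geometrically larger emphasis than old ones.
    return beta * prev_weight + (1.0 - beta) * interval


# A client whose participation intervals shrink over time (hypothetical data).
weight, count = 1.0, 0
for s in [8, 8, 2, 2, 2]:
    weight = ewma_weight(weight, s, count, beta=0.5)
    count += 1
print(weight)
```

Unlike the running average, this variant does not weight all past intervals equally, so the convergence analysis in Section 4 does not directly cover it.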

4 Convergence Analysis

Assumption 1.

The local objective functions are $L$-smooth, such that

$$\left\|\nabla F_n(\mathbf{x}) - \nabla F_n(\mathbf{y})\right\| \leq L\left\|\mathbf{x} - \mathbf{y}\right\|, \quad \forall \mathbf{x}, \mathbf{y}, n. \tag{3}$$
Assumption 2.

The local stochastic gradients are unbiased with bounded variance, such that

$$\mathbb{E}\left[\left.\mathbf{g}_n(\mathbf{x})\right|\mathbf{x}\right] = \nabla F_n(\mathbf{x}) \quad\textrm{and}\quad \mathbb{E}\left[\left.\left\|\mathbf{g}_n(\mathbf{x}) - \nabla F_n(\mathbf{x})\right\|^2\right|\mathbf{x}\right] \leq \sigma^2, \quad \forall \mathbf{x}, n. \tag{4}$$

In addition, the stochastic gradient noise $\mathbf{g}_n(\mathbf{x}) - \nabla F_n(\mathbf{x})$ is independent across different rounds (indexed by $t$), clients (indexed by $n$), and local update steps (indexed by $i$).

Assumption 3.

The divergence between local and global gradients is bounded, such that

$$\left\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x})\right\|^2 \leq \delta^2, \quad \forall \mathbf{x}, n. \tag{5}$$
Assumption 4.

The client participation random variable ${\rm I\kern-.2em l}_t^n$ is independent across different $t$ and $n$. It is also independent of the stochastic gradient noise. For each client $n$, we define $p_n$ such that $\mathbb{E}\left[{\rm I\kern-.2em l}_t^n\right] = p_n$, i.e., ${\rm I\kern-.2em l}_t^n \sim \mathrm{Bernoulli}(p_n)$, where the value of $p_n$ is unknown to the system a priori.

Assumptions 1–3 are commonly used in the literature for the convergence analysis of FL algorithms (Yang et al., 2021, Wang & Ji, 2022, Cho et al., 2023). Our consideration of independent participation across clients in Assumption 4 is more realistic than the conventional setting of sampling among all the clients with or without replacement (Li et al., 2020c, Yang et al., 2021), because it is difficult to coordinate the participation across a large number of clients in practical FL systems.

Challenge in Analyzing Time-Dependent Participation. Regarding the assumption on the independence of ${\rm I\kern-.2em l}_t^n$ across time (round) $t$ in Assumption 4, the challenge in analyzing the more general time-dependent participation is due to the complex interplay between the randomness in stochastic gradient noise, participation identities $\{{\rm I\kern-.2em l}_t^n\}$, and estimated aggregation weights $\{\omega_t^n\}$. In particular, the first step in our proof of the general descent lemma (see Appendix B.3, the specific step is in (B.3.6)) would not hold if ${\rm I\kern-.2em l}_t^n$ is dependent on the past, because the past information is contained in $\mathbf{x}_t$ and $\{\omega_t^n\}$ that are conditions of the expectation. We emphasize that this is a purely theoretical limitation, and this time-independence of client participation has been assumed in the majority of works on FL with client sampling (Fraboni et al., 2021a; b, Karimireddy et al., 2020, Li et al., 2020b; c, Yang et al., 2021). The novelty in our analysis is that we consider the true values of $\{p_n\}$ to be unknown to the system.
Our experimental results in Section 5 show that FedAU provides performance gains also for Markovian and cyclic participation patterns that are both time-dependent.

Assumption 5.

We assume that either of the following holds and define $\Psi_G$ accordingly.

  • Option 1: Nearly optimal weights. Under the assumption that $\frac{1}{N}\sum_{n=1}^{N}\left(p_n\omega_t^n - 1\right)^2 \leq \frac{1}{81}$ for all $t$, we define $\Psi_G := 0$.

  • Option 2: Bounded global gradient. Under the assumption that $\left\|\nabla f(\mathbf{x})\right\|^2 \leq G^2$ for any $\mathbf{x}$, we define $\Psi_G := G^2$.

Assumption 5 is only needed for Theorem 2 (stated below) and not for Theorem 1. Here, the bounded global gradient assumption is a relaxed variant of the bounded stochastic gradient assumption commonly used in adaptive gradient algorithms (Reddi et al., 2021, Wang et al., 2022b; c). Although focusing on very different problems, our FedAU method shares some similarities with adaptive gradient methods in the sense that both adapt the weights used in model updates, where the adaptation depends on some parameters that progressively change during the training process; see Appendix A.2 for some further discussion. For the nearly optimal weights assumption, we can see that it holds if $\nicefrac{8}{9p_n} \leq \omega_t^n \leq \nicefrac{10}{9p_n}$, which means a tolerance of a relative error of $\nicefrac{1}{9} \approx 11\%$ from the optimal weight $\nicefrac{1}{p_n}$. Theorem 2 holds under either of these two additional assumptions.
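
The sufficient condition for the nearly optimal weights option can be checked in one line. Multiplying the stated range of $\omega_t^n$ by $p_n > 0$ gives, for every client $n$ and round $t$:

```latex
\frac{8}{9p_n} \leq \omega_t^n \leq \frac{10}{9p_n}
\;\Longrightarrow\;
\frac{8}{9} \leq p_n\omega_t^n \leq \frac{10}{9}
\;\Longrightarrow\;
\left(p_n\omega_t^n - 1\right)^2 \leq \frac{1}{81}
\;\Longrightarrow\;
\frac{1}{N}\sum_{n=1}^{N}\left(p_n\omega_t^n - 1\right)^2 \leq \frac{1}{81}.
```

The per-client range is sufficient but not necessary, since Option 1 only constrains the average over clients.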

Main Results. We now present our main results, starting with the convergence of Algorithm 1 with arbitrary (but given) weights $\{\omega_t^n\}$ with respect to (w.r.t.) the original objective function in (1).

Theorem 2 (Convergence error w.r.t. (1)).

Let $\gamma \leq \frac{1}{4\sqrt{15}LI}$ and $\gamma\eta \leq \min\big\{\frac{1}{4LI}; \frac{N}{54LIQ}\big\}$, where $Q := \max_{t\in\{0,\ldots,T-1\}}\frac{1}{N}\sum_{n=1}^{N}p_n(\omega_t^n)^2$. When Assumptions 1–5 hold, the result $\{\mathbf{x}_t\}$ obtained from Algorithm 1 satisfies:

$$\begin{aligned}
&\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla f(\mathbf{x}_t)\right\|^2\right] \\
&\leq \mathcal{O}\Bigg(\frac{\mathcal{F}}{\gamma\eta IT} + \frac{\Psi_G + \delta^2 + \gamma^2L^2I\sigma^2}{NT}\sum_{t=0}^{T-1}\sum_{n=1}^{N}\mathbb{E}\left[\left(p_n\omega_t^n - 1\right)^2\right] + \frac{\gamma\eta LQ\left(I\delta^2 + \sigma^2\right)}{N} + \gamma^2L^2I\left(I\delta^2 + \sigma^2\right)\Bigg),
\end{aligned} \tag{6}$$

where $\mathcal{F} := f(\mathbf{x}_0) - f^*$, and $f^* := \min_{\mathbf{x}} f(\mathbf{x})$ is the true minimum value of the objective in (1).

The proof of Theorem 2 includes a novel step to obtain $\frac{1}{NT}\sum_{t=0}^{T-1}\sum_{n=1}^{N}\mathbb{E}\big[\left(p_n\omega_t^n - 1\right)^2\big]$ (ignoring the other constants), referred to as the weight error term, that characterizes how the aggregation weights $\{\omega_t^n\}$ affect the convergence. Next, we focus on $\{\omega_t^n\}$ obtained from Algorithm 2.
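
For intuition, the two weight-dependent quantities in Theorem 2 can be evaluated numerically for a given sequence of weights. The sketch below is only illustrative: the function names `weight_error_term` and `q_value` are hypothetical, the inputs are plain Python lists, and the expectation in the weight error term is replaced by its value on a single realized weight sequence.

```python
def weight_error_term(p, omega):
    """(1 / (N T)) * sum_t sum_n (p_n * omega[t][n] - 1)^2 for one realization.

    p: list of participation probabilities p_n (length N).
    omega: list over rounds t of lists of weights omega_t^n (T lists of length N).
    """
    T, N = len(omega), len(p)
    total = sum(
        (p[n] * omega[t][n] - 1.0) ** 2 for t in range(T) for n in range(N)
    )
    return total / (N * T)


def q_value(p, omega):
    """Q = max_t (1 / N) * sum_n p_n * (omega_t^n)^2."""
    N = len(p)
    return max(sum(p[n] * w[n] ** 2 for n in range(N)) / N for w in omega)


# Two clients, two rounds: optimal weights 1/p_n in round 0, unit weights in round 1.
p = [0.5, 0.25]
omega = [[2.0, 4.0], [1.0, 1.0]]
print(weight_error_term(p, omega), q_value(p, omega))
```

With the optimal weights $\omega_t^n = \nicefrac{1}{p_n}$ in every round, the weight error term evaluates to zero, which is the case in which the second term of the bound in (6) vanishes.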

Theorem 3 (Bounding the weight error term).

For $\{\omega_t^n\}$ obtained from Algorithm 2, when $T \geq 2$,

$$\frac{1}{NT}\sum_{t=0}^{T-1}\sum_{n=1}^{N}\mathbb{E}\left[\left(p_n\omega_t^n - 1\right)^2\right] \leq \mathcal{O}\left(\frac{K\log T}{T} + \frac{1}{N}\sum_{n=1}^{N}(1 - p_n)^{2K}\right). \tag{7}$$

The proof of Theorem 3 is based on analyzing the unique statistical properties of the possibly cutoff participation interval $S_n$ obtained in Algorithm 2. The first term of the bound in (7) is related to the variance of $\omega_t^n$. This term increases linearly in $K$, because when $K$ gets larger, the minimum number of samples of $S_n$ that are used for computing $\omega_t^n$ gets smaller, thus the variance upper bound becomes larger. The second term of the bound in (7) is related to the bias of $\omega_t^n$, which measures how far $\mathbb{E}\left[\omega_t^n\right]$ departs from the desired quantity of $\nicefrac{1}{p_n}$. Since $0 < p_n \leq 1$, this term decreases exponentially in $K$. This result confirms the bias-variance tradeoff of $\omega_t^n$ that we mentioned earlier.
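
The form of the bias term can be made plausible with a short calculation for a single interval sample. This is a sketch under Assumption 4 and not the full proof, which also has to handle the random number of samples entering $\omega_t^n$. Let $G_n$ (a symbol introduced here for illustration) denote the number of rounds until the next participation, so that $\Pr(G_n > k) = (1-p_n)^k$ and the cutoff interval is $S_n = \min\{G_n, K\}$. Then:

```latex
\mathbb{E}[S_n]
  = \sum_{k=0}^{K-1} \Pr(G_n > k)
  = \sum_{k=0}^{K-1} (1-p_n)^k
  = \frac{1 - (1-p_n)^K}{p_n},
\qquad\text{so}\qquad
\left(p_n\,\mathbb{E}[S_n] - 1\right)^2 = (1-p_n)^{2K}.
```

The squared deviation of $p_n\,\mathbb{E}[S_n]$ from one therefore has the same form as the second term in (7), and it vanishes as $K \rightarrow \infty$ whenever $p_n > 0$.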

Corollary 4 (Convergence of FedAU).

Let $K = \left\lceil\log_c T\right\rceil$ with $c := \nicefrac{1}{(1-\min_n p_n)^2}$, $\gamma = \min\left\{\frac{1}{LI\sqrt{T}}; \frac{1}{4\sqrt{15}LI}\right\}$, and choose $\eta$ such that $\gamma\eta = \min\left\{\sqrt{\frac{\mathcal{F}N}{Q\left(I\delta^2 + \sigma^2\right)LIT}}; \frac{1}{4LI}; \frac{N}{54LIQ}\right\}$. When $T \geq 2$, the result $\{\mathbf{x}_t\}$ obtained from Algorithm 1 that uses $\{\omega_t^n\}$ obtained from Algorithm 2 satisfies

$$\begin{aligned}
&\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla f(\mathbf{x}_t)\right\|^2\right] \\
&\leq \mathcal{O}\Bigg(\frac{\sigma\sqrt{L\mathcal{F}Q}}{\sqrt{NIT}} + \frac{\delta\sqrt{L\mathcal{F}Q}}{\sqrt{NT}} + \frac{\big(\Psi_G + \delta^2 + \frac{\sigma^2}{IT}\big)R\log^2 T}{T} + \frac{L\mathcal{F}\big(1 + \frac{Q}{N}\big) + \delta^2 + \frac{\sigma^2}{I}}{T}\Bigg),
\end{aligned} \tag{8}$$

where $Q$ and $\Psi_{G}$ are defined in Theorem 2 and $R:=1/\log c$.

The result in Corollary 4 is the convergence upper bound of the full FedAU algorithm. Its proof involves further bounding (7) in Theorem 3 when choosing $K=\left\lceil\log_{c}T\right\rceil$, and plugging the result, along with the values of $\gamma$ and $\eta$, back into Theorem 2. It shows that, with aggregation weights $\{\omega_{t}^{n}\}$ properly estimated using Algorithm 2, the error approaches zero as $T\rightarrow\infty$, although the actual participation statistics are unknown. The first two terms of the bound in (8), which are related to the stochastic gradient variance $\sigma^{2}$ and gradient divergence $\delta^{2}$, dominate when $T$ is large enough. The error caused by the fact that $\{p_{n}\}$ is unknown is captured by the third term of the bound in (8), which has an order of $\mathcal{O}(\log^{2}T/T)$. We also see that, as long as $T$ is large enough for the first two terms of the bound in (8) to dominate, we achieve the desirable property of linear speedup in $N$. This means that we can keep the same convergence error by increasing the number of clients ($N$) and decreasing the number of rounds ($T$), to the extent that $T$ remains large enough. 
Our result also recovers existing convergence bounds for FedAvg in the case of known participation probabilities (Karimireddy et al., 2020, Yang et al., 2021); see Appendix A.3 for details.
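
To make the parameter choices in Corollary 4 concrete, the following is a minimal Python sketch that evaluates $K$, $\gamma$, and $\eta$ from the stated formulas. The function and argument names are hypothetical and not from the paper; `Q` and `F` (standing for $\mathcal{F}$) are defined elsewhere in the paper and are simply treated as given inputs here.

```python
import math


def corollary4_hyperparameters(p_min, T, L, I, N, Q, F, delta, sigma):
    """Evaluate K, gamma, and eta from the formulas stated in Corollary 4.

    All argument names are illustrative; p_min stands for min_n p_n.
    """
    if not 0.0 < p_min < 1.0:
        raise ValueError("p_min must lie strictly between 0 and 1")
    if T < 2:
        raise ValueError("Corollary 4 is stated for T >= 2")

    # c := 1 / (1 - min_n p_n)^2 and K = ceil(log_c T)
    c = 1.0 / (1.0 - p_min) ** 2
    K = math.ceil(math.log(T) / math.log(c))

    # gamma = min{ 1 / (L I sqrt(T)) ; 1 / (4 sqrt(15) L I) }
    gamma = min(1.0 / (L * I * math.sqrt(T)),
                1.0 / (4.0 * math.sqrt(15.0) * L * I))

    # gamma * eta = min{ sqrt(F N / (Q (I delta^2 + sigma^2) L I T)) ;
    #                    1 / (4 L I) ; N / (54 L I Q) }
    gamma_eta = min(
        math.sqrt(F * N / (Q * (I * delta ** 2 + sigma ** 2) * L * I * T)),
        1.0 / (4.0 * L * I),
        N / (54.0 * L * I * Q),
    )
    eta = gamma_eta / gamma
    return K, gamma, eta
```

For instance, with $\min_{n}p_{n}=0.02$ and $T=10{,}000$, this gives $c\approx 1.041$ and a cutoff $K$ in the low hundreds; the remaining inputs only affect the two step sizes.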

5 Experiments

Table 1: Accuracy results (in %) on training and test data
| Participation pattern | Method | SVHN Train | SVHN Test | CIFAR-10 Train | CIFAR-10 Test | CIFAR-100 Train | CIFAR-100 Test | CINIC-10 Train | CINIC-10 Test |
| Bernoulli | FedAU (ours, $K\rightarrow\infty$) | 90.4±0.5 | 89.3±0.5 | 85.4±0.4 | 77.1±0.4 | 63.4±0.6 | 52.3±0.4 | 65.2±0.5 | 61.5±0.4 |
| Bernoulli | FedAU (ours, $K=50$) | 90.6±0.4 | 89.6±0.4 | 86.0±0.5 | 77.3±0.3 | 63.8±0.3 | 52.1±0.6 | 66.7±0.3 | 62.7±0.2 |
| Bernoulli | Average participating | 89.1±0.3 | 87.2±0.3 | 83.5±0.9 | 74.1±0.8 | 59.3±0.4 | 48.8±0.7 | 61.1±2.3 | 56.6±2.0 |
| Bernoulli | Average all | 88.5±0.5 | 87.0±0.3 | 81.0±0.9 | 72.7±0.9 | 58.2±0.4 | 47.9±0.5 | 60.5±2.3 | 56.2±2.0 |
| Bernoulli | FedVarp (250× memory) | 89.6±0.5 | 88.9±0.5 | 84.2±0.3 | 77.9±0.2 | 57.2±0.9 | 49.2±0.8 | 64.4±0.6 | 62.0±0.5 |
| Bernoulli | MIFA (250× memory) | 89.4±0.3 | 88.7±0.2 | 83.5±0.6 | 77.5±0.3 | 55.8±1.1 | 48.4±0.7 | 63.8±0.7 | 61.5±0.5 |
| Bernoulli | Known participation statistics | 89.2±0.5 | 88.4±0.5 | 84.3±0.5 | 77.0±0.5 | 59.4±0.7 | 50.6±0.4 | 63.2±0.6 | 60.5±0.5 |
| Markovian | FedAU (ours, $K\rightarrow\infty$) | 90.5±0.4 | 89.3±0.4 | 85.3±0.3 | 77.1±0.3 | 63.2±0.5 | 51.8±0.3 | 64.9±0.3 | 61.2±0.2 |
| Markovian | FedAU (ours, $K=50$) | 90.6±0.3 | 89.5±0.3 | 85.9±0.5 | 77.2±0.3 | 63.5±0.4 | 51.7±0.3 | 66.3±0.4 | 62.3±0.2 |
| Markovian | Average participating | 89.0±0.3 | 87.1±0.2 | 83.4±0.9 | 74.2±0.7 | 59.2±0.4 | 48.6±0.4 | 61.5±2.3 | 56.9±1.9 |
| Markovian | Average all | 88.4±0.6 | 86.8±0.7 | 80.8±1.0 | 72.5±0.5 | 57.8±0.9 | 47.7±0.5 | 59.9±2.8 | 55.7±2.2 |
| Markovian | FedVarp (250× memory) | 89.6±0.3 | 88.6±0.2 | 84.0±0.3 | 77.8±0.2 | 56.4±1.1 | 48.8±0.5 | 64.6±0.4 | 62.1±0.4 |
| Markovian | MIFA (250× memory) | 89.1±0.3 | 88.4±0.2 | 83.0±0.4 | 77.2±0.4 | 55.1±1.2 | 48.1±0.6 | 63.5±0.7 | 61.2±0.6 |
| Markovian | Known participation statistics | 89.5±0.2 | 88.6±0.2 | 84.5±0.4 | 76.9±0.3 | 59.7±0.5 | 50.3±0.5 | 63.5±0.9 | 60.7±0.6 |
| Cyclic | FedAU (ours, $K\rightarrow\infty$) | 89.8±0.6 | 88.7±0.6 | 84.2±0.8 | 76.3±0.7 | 60.9±0.6 | 50.6±0.3 | 63.5±1.0 | 60.0±0.8 |
| Cyclic | FedAU (ours, $K=50$) | 89.9±0.6 | 88.8±0.6 | 84.8±0.6 | 76.6±0.4 | 61.3±0.8 | 51.0±0.5 | 64.5±0.9 | 60.9±0.7 |
| Cyclic | Average participating | 87.4±0.5 | 85.5±0.7 | 81.6±1.2 | 73.3±0.8 | 58.1±1.0 | 48.3±0.8 | 58.9±2.1 | 55.0±1.6 |
| Cyclic | Average all | 89.1±0.8 | 87.4±0.8 | 83.1±1.0 | 73.8±0.8 | 59.7±0.3 | 48.8±0.4 | 62.9±1.7 | 57.6±1.5 |
| Cyclic | FedVarp (250× memory) | 84.8±0.5 | 83.9±0.6 | 79.7±0.9 | 75.3±0.7 | 50.9±0.5 | 45.9±0.4 | 60.4±0.7 | 58.5±0.6 |
| Cyclic | MIFA (250× memory) | 78.6±1.2 | 77.4±1.1 | 73.0±1.3 | 70.6±1.1 | 44.8±0.6 | 41.1±0.6 | 51.2±1.0 | 50.2±0.9 |
| Cyclic | Known participation statistics | 89.9±0.7 | 88.7±0.6 | 83.6±0.7 | 76.1±0.5 | 60.2±0.4 | 50.8±0.4 | 62.6±0.8 | 59.8±0.7 |
 

Note to the table. The top part of the sub-table for each participation pattern includes our method and baselines in the same setting. The bottom part of each sub-table includes baselines that require either additional memory or known participation statistics. For each column, the best values in the top and bottom parts are highlighted with bold and underline, respectively. The total number of rounds is 2,000 for SVHN; 10,000 for CIFAR-10 and CINIC-10; 20,000 for CIFAR-100. The mean and standard deviation values shown in the table are computed over experiments with 5 different random seeds, for the average accuracy over the last 200 rounds (measured at an interval of 10 rounds).
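
The reporting protocol in the note can be sketched as follows. This is only an illustration under stated assumptions: each run is assumed to be a list of accuracies recorded every 10 rounds, the helper names are hypothetical, and the sample standard deviation is used because the note does not say which estimator was applied.

```python
from statistics import mean, stdev


def last_window_mean(acc_history, window=200, interval=10):
    """Average the accuracies measured during the last `window` rounds.

    acc_history holds one accuracy per evaluation, taken every `interval` rounds.
    """
    n_points = window // interval
    return mean(acc_history[-n_points:])


def summarize_runs(runs, window=200, interval=10):
    """Return (mean, standard deviation) of the final accuracy across seeds."""
    finals = [last_window_mean(run, window, interval) for run in runs]
    return mean(finals), stdev(finals)
```

With 5 seeds, `summarize_runs` would receive 5 accuracy histories and return the two numbers that correspond to one "mean±std" cell of the table.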

We evaluate the performance of FedAU in experiments. More experimental setup details, including the link to the code, and results are in Appendices C and D, respectively.

Datasets, Models, and System. We consider four image classification tasks, with datasets including SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow et al., 2018), where CIFAR-100 has 100 classes (labels) while the other datasets have 10 classes. We use FL to train convolutional neural network (CNN) models of slightly different architectures for these tasks. We simulate an FL system that includes a total of $N=250$ clients, where each client $n$ has its own participation pattern.

Heterogeneity. Similar to existing works (Hsu et al., 2019, Reddi et al., 2021), we use a Dirichlet distribution with parameter $\alpha_{d}=0.1$ to generate the class distribution of each client’s data, for a setup with non-IID data across clients. Here, $\alpha_{d}$ specifies the degree of data heterogeneity, where a smaller $\alpha_{d}$ indicates a more heterogeneous data distribution. In addition, to simulate the correlation between data distribution and client participation frequency as motivated in Section 1, we generate a class-wide participation probability distribution that follows a Dirichlet distribution with parameter $\alpha_{p}=0.1$. Here, $\alpha_{p}$ specifies the degree of participation heterogeneity, where a smaller $\alpha_{p}$ indicates more heterogeneous participation across clients. We generate client participation patterns following a random process that is either Bernoulli (independent), Markovian, or cyclic, and study the performance of these types of participation patterns in different experiments. The participation patterns have a stationary probability $p_{n}$, for each client $n$, that is generated according to a combination of the two aforementioned Dirichlet distributions, and the details are explained in Appendix C.6. We enforce the minimum $p_{n}$, $\forall n$, to be 0.02 in the main experiments, which is relaxed later. 
This generative approach creates an experimental scenario with non-IID client participation, while our FedAU algorithm and most baselines still do not know the actual participation statistics.
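
The generative setup above can be illustrated with a short Python sketch. The way the two Dirichlet draws are combined is given in Appendix C.6 and is not reproduced in this section, so the combination below (an inner product rescaled to a target mean rate and clipped from below at 0.02) is purely an assumption for illustration; `mean_rate`, the function names, and the defaults are hypothetical.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # guard against numerical underflow for small alpha
            return [d / total for d in draws]


def sample_participation_probs(num_clients=250, num_classes=10, alpha_d=0.1,
                               alpha_p=0.1, p_min=0.02, mean_rate=0.1, seed=0):
    """Return per-client class distributions and participation probabilities."""
    rng = random.Random(seed)
    # Class-wide participation distribution, Dirichlet with parameter alpha_p.
    class_participation = sample_dirichlet(alpha_p, num_classes, rng)
    # Per-client class distribution, Dirichlet with parameter alpha_d.
    client_dists = [sample_dirichlet(alpha_d, num_classes, rng)
                    for _ in range(num_clients)]
    # ASSUMPTION: combine the two by an inner product, rescale to a target
    # mean rate, then enforce the lower bound p_min (and the upper bound 1).
    raw = [sum(q * c for q, c in zip(class_participation, dist))
           for dist in client_dists]
    scale = mean_rate * num_clients / sum(raw)
    probs = [min(1.0, max(p_min, scale * r)) for r in raw]
    return client_dists, probs
```

Because both Dirichlet parameters are small, most clients hold data from only one or two classes, and clients whose classes have low class-wide participation mass end up near the lower bound of 0.02.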

Baselines. We compare our FedAU algorithm with several baselines. The first set of baselines includes algorithms that compute an average of parameters over either all the participating clients (average participating) or all the clients (average all) in the aggregation stage of each round, where the latter case includes updates of non-participating clients that are equal to zero as part of averaging. These two baselines encompass most existing FedAvg implementations (e.g., Yang et al. (2021), McMahan et al. (2017), Patel et al. (2022)) that do not address the bias caused by heterogeneous participation statistics. They do not require additional memory or knowledge, and thus work under the same system assumptions as FedAU. The second set of baselines consists of algorithms that require extra resources or information, including FedVarp (Jhunjhunwala et al., 2022) and MIFA (Gu et al., 2021), which require $N=250$ times as much memory, and an idealized baseline that assumes known participation statistics and weights the clients’ contributions using the reciprocal of the stationary participation probability. For each baseline, we performed a separate grid search to find the best $\gamma$ and $\eta$.
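
The three weighting-based baselines differ only in the aggregation weight applied to each participating client's update. The sketch below illustrates this under the assumption that the server averages weighted updates over all $N$ clients, with non-participating clients contributing zero; the function names are hypothetical and the sketch omits the global step size.

```python
def weighted_aggregate(updates, participated, weights):
    """Return (1/N) * sum over participating clients of weight * update."""
    n_clients = len(updates)
    total = [0.0] * len(updates[0])
    for delta, active, w in zip(updates, participated, weights):
        if active:
            for j, value in enumerate(delta):
                total[j] += w * value
    return [v / n_clients for v in total]


def average_all(updates, participated):
    """Average over all clients; absent clients count as zero updates."""
    return weighted_aggregate(updates, participated, [1.0] * len(updates))


def average_participating(updates, participated):
    """Average over participating clients only (weight N / #participants)."""
    n_active = sum(participated)
    if n_active == 0:
        return [0.0] * len(updates[0])
    w = len(updates) / n_active
    return weighted_aggregate(updates, participated, [w] * len(updates))


def known_statistics(updates, participated, probs):
    """Idealized baseline: weight each client by 1 / p_n."""
    return weighted_aggregate(updates, participated, [1.0 / p for p in probs])
```

Seen this way, FedAU fits the same template but replaces the fixed weights with per-client estimates computed online from the participation history.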

Results. The main results are shown in Table 1, where we choose $K=50$ for FedAU with finite $K$ based on a simple rule of thumb without detailed search. Our general observation is that FedAU provides the highest accuracy compared to almost all the baselines, including those that require additional memory or known participation statistics, except for the test accuracy on the CIFAR-10 dataset, where FedVarp performs the best. Choosing $K=50$ generally gives better performance than choosing $K\rightarrow\infty$ for FedAU, which aligns with our discussion in Section 3.

The reason that FedAU can perform better than FedVarp and MIFA is that these baselines keep historical local updates, which may be outdated when some clients participate infrequently. Updating the global model parameter with outdated local updates can lead to slow convergence, which is similar to the consequence of having stale updates in asynchronous SGD (Recht et al., 2011). In contrast, at the beginning of each round, participating clients in FedAU always start with the latest global parameter obtained from the server. This avoids stale updates, and we compensate for heterogeneous participation statistics by adapting the aggregation weights, which is a fundamentally different and more efficient method compared to tracking historical updates as in FedVarp and MIFA.

It is surprising that FedAU even performs better than the case with known participation statistics. To understand this phenomenon, we point out that in the case of Bernoulli-distributed participation with very low probability (e.g., $p_{n}=0.02$), the empirical probability of a sample path of a client’s participation can diverge significantly from $p_{n}$. For $T=10{,}000$ rounds, the standard deviation of the total number of participated rounds is $\sigma^{\prime}:=\sqrt{Tp_{n}(1-p_{n})}=\sqrt{196}=14$ while the mean is $\mu^{\prime}:=Tp_{n}=200$. 
Considering the range within $2\sigma^{\prime}$ of the mean, we know that the optimal participation weight computed on the empirical probability ranges from $T/(\mu^{\prime}+2\sigma^{\prime})\approx 43.9$ to $T/(\mu^{\prime}-2\sigma^{\prime})\approx 58.1$, while the optimal weight computed on the model-based probability is $1/p_{n}=50$. Our FedAU algorithm computes the aggregation weights from the actual participation sample path of each client, which captures the actual client behavior and empirically performs better than using $1/p_{n}$ even if $p_{n}$ is known. Some experimental results that further explain this phenomenon are in Appendix D.4.
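
The arithmetic in this argument can be checked with a few lines of Python. The helper name `empirical_weight_range` is hypothetical; the computation only uses the binomial mean and standard deviation of the number of participated rounds.

```python
import math


def empirical_weight_range(T, p_n, num_std=2.0):
    """Range of T / (participated rounds) within num_std standard deviations."""
    mu = T * p_n                               # mean number of participated rounds
    sigma = math.sqrt(T * p_n * (1.0 - p_n))   # binomial standard deviation
    low = T / (mu + num_std * sigma)           # weight if the client over-participates
    high = T / (mu - num_std * sigma)          # weight if the client under-participates
    return mu, sigma, low, high


mu, sigma, low, high = empirical_weight_range(T=10_000, p_n=0.02)
print(f"mean={mu:.0f}, std={sigma:.0f}, weight range=[{low:.1f}, {high:.1f}]")
```

The resulting interval of roughly 43.9 to 58.1 brackets the model-based weight of 50, which shows how far the weight implied by a single sample path can be from $1/p_{n}$.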


Figure 1: FedAU with different $K$ (CIFAR-10 with Bernoulli participation).

As mentioned earlier, we lower-bounded $p_{n}$, $\forall n$, by 0.02 for the main results. Next, we consider different lower bounds of $p_{n}$, where a smaller lower bound of $p_{n}$ means that there exist clients that participate less frequently. The performance of FedAU with different choices of $K$ and different lower bounds of $p_{n}$ is shown in Figure 1. We observe that choosing $K=50$ always gives the best performance; the performance remains similar even when the lower bound of $p_{n}$ is small and there exist some clients that participate very infrequently. However, choosing a large $K$ (e.g., $K\geq 500$) significantly deteriorates the performance when the lower bound of $p_{n}$ is small. This means that having a finite cutoff interval $K$ of an intermediate value (i.e., $K=50$ in our experiments) for aggregation weight estimation, which is unique to FedAU, is essential, especially when very infrequently participating clients exist.
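
To illustrate the role of the cutoff interval, the following is a minimal sketch of an online weight estimator. Algorithm 2 is not reproduced in this section, so the sketch assumes that the weight is the running average of the lengths of the intervals between participations, with each interval cut off at $K$ rounds; the class name and its fields are hypothetical, and the actual algorithm may differ in details.

```python
class IntervalWeightEstimator:
    """Running average of participation intervals, each capped at cutoff_k rounds.

    Pass cutoff_k=None to mimic the case K -> infinity (no cutoff).
    """

    def __init__(self, cutoff_k=None):
        self.cutoff_k = cutoff_k
        self.num_intervals = 0   # number of completed (or cut-off) intervals
        self.current_len = 0     # rounds elapsed in the current interval
        self.weight = 1.0        # initial aggregation weight

    def step(self, participated):
        """Process one round and return the current weight estimate."""
        self.current_len += 1
        hit_cutoff = (self.cutoff_k is not None
                      and self.current_len >= self.cutoff_k)
        if participated or hit_cutoff:
            n = self.num_intervals
            self.weight = (n * self.weight + self.current_len) / (n + 1)
            self.num_intervals = n + 1
            self.current_len = 0
        return self.weight
```

In this sketch, a client that participates once every 4 rounds obtains a weight of 4, matching the reciprocal of its participation frequency, while a client that rarely participates has each interval capped at $K$, so a single very long gap cannot produce an extremely large weight.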

6 Conclusion

In this paper, we have studied the challenging practical FL scenario of having unknown participation statistics of clients. To address this problem, we have considered the adaptation of aggregation weights based on the participation history observed at each individual client. Using a new consideration of the bias-variance tradeoff of the aggregation weight, we have obtained the FedAU algorithm. Our analytical methodology includes a unique decomposition which yields a separate weight error term that is further bounded to obtain the convergence upper bound of FedAU. Experimental results have confirmed the advantage of FedAU with several client participation patterns. Future work can study the convergence analysis of FedAU with more general participation processes and the incorporation of aggregation weight adaptation into other types of FL algorithms.

Acknowledgment

The work of M. Ji was supported by the National Science Foundation (NSF) CAREER Award 2145835.

References

  • Andrew et al. (2021) Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455–17466, 2021.
  • Bonawitz et al. (2017) Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191, 2017.
  • Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proceedings of Machine Learning and Systems, volume 1, pp.  374–388, 2019.
  • Chen et al. (2022) Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
  • Cho et al. (2023) Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, and Tong Zhang. On the convergence of federated averaging with cyclic client participation. arXiv preprint arXiv:2302.03109, 2023.
  • Darlow et al. (2018) Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not Imagenet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.
  • Ding et al. (2020) Yucheng Ding, Chaoyue Niu, Yikai Yan, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, and Rongfei Jia. Distributed optimization over block-cyclic data. arXiv preprint arXiv:2002.07454, 2020.
  • Eichner et al. (2019) Hubert Eichner, Tomer Koren, Brendan McMahan, Nathan Srebro, and Kunal Talwar. Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp.  1764–1773. PMLR, 2019.
  • Fraboni et al. (2021a) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. Clustered sampling: Low-variance and improved representativity for clients selection in federated learning. In International Conference on Machine Learning, volume 139, pp.  3407–3416. PMLR, Jul. 2021a.
  • Fraboni et al. (2021b) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. On the impact of client sampling on federated learning convergence. arXiv preprint arXiv:2107.12211, 2021b.
  • Gorbunov et al. (2021) Eduard Gorbunov, Filip Hanzely, and Peter Richtarik. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, volume 130 of PMLR, pp.  3556–3564, 2021.
  • Gu et al. (2021) Xinran Gu, Kaixuan Huang, Jingzhao Zhang, and Longbo Huang. Fast federated learning in the presence of arbitrary device unavailability. In Advances in Neural Information Processing Systems, 2021.
  • Haddadpour et al. (2019) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, 2019.
  • Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • Jhunjhunwala et al. (2022) Divyansh Jhunjhunwala, Pranay Sharma, Aushim Nagarkatti, and Gauri Joshi. FedVARP: Tackling the variance due to partial client participation in federated learning. In Uncertainty in Artificial Intelligence, pp.  906–916. PMLR, 2022.
  • Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp.  5132–5143. PMLR, 2020.
  • Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Li et al. (2020a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a.
  • Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020b.
  • Li et al. (2020c) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2020c.
  • Lin et al. (2020) Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local SGD. In International Conference on Learning Representations, 2020.
  • Malinovsky et al. (2023) Grigory Malinovsky, Samuel Horváth, Konstantin Burlachenko, and Peter Richtárik. Federated learning with regularized client participation. arXiv preprint arXiv:2302.03662, 2023.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp.  1273–1282. PMLR, 2017.
  • McMahan et al. (2018) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Patel et al. (2022) Kumar Kshitij Patel, Lingxiao Wang, Blake Woodworth, Brian Bullins, and Nathan Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
  • Perazzone et al. (2022) Jake Perazzone, Shiqiang Wang, Mingyue Ji, and Kevin S Chan. Communication-efficient device scheduling for federated learning using stochastic optimization. In IEEE Conference on Computer Communications, pp.  1449–1458, 2022.
  • Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24, 2011.
  • Reddi et al. (2021) Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.
  • Rizk et al. (2022) Elsa Rizk, Stefan Vlaski, and Ali H Sayed. Federated learning under importance sampling. IEEE Transactions on Signal Processing, 70:5381–5396, 2022.
  • Ruan et al. (2021) Yichen Ruan, Xiaoxi Zhang, Shu-Che Liang, and Carlee Joe-Wong. Towards flexible device participation in federated learning. In International Conference on Artificial Intelligence and Statistics, pp.  3403–3411. PMLR, 2021.
  • Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
  • Tan et al. (2022) Lei Tan, Xiaoxi Zhang, Yipeng Zhou, Xinkai Che, Miao Hu, Xu Chen, and Di Wu. Adafed: Optimizing participation-aware federated learning with adaptive aggregation weights. IEEE Transactions on Network Science and Engineering, 9(4):2708–2720, 2022.
  • Wang & Joshi (2019) Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of Machine Learning and Systems, volume 1, pp.  212–229, 2019.
  • Wang & Joshi (2021) Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. Journal of Machine Learning Research, 22(213):1–50, 2021.
  • Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
  • Wang et al. (2021) Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
  • Wang et al. (2022a) Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. AsyncFedED: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797, 2022a.
  • Wang & Ji (2022) Shiqiang Wang and Mingyue Ji. A unified analysis of federated learning with arbitrary client participation. In Advances in Neural Information Processing Systems, volume 35, 2022.
  • Wang et al. (2022b) Yujia Wang, Lu Lin, and Jinghui Chen. Communication-efficient adaptive federated learning. In International Conference on Machine Learning, pp.  22802–22838. PMLR, 2022b.
  • Wang et al. (2022c) Yujia Wang, Lu Lin, and Jinghui Chen. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pp.  6292–6320. PMLR, 2022c.
  • Wu & Wang (2021) Hongda Wu and Ping Wang. Fast-convergent federated learning with adaptive weighting. IEEE Transactions on Cognitive Communications and Networking, 7(4):1078–1088, 2021.
  • Yan et al. (2020) Yikai Yan, Chaoyue Niu, Yucheng Ding, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, and Zhihua Wu. Distributed non-convex optimization with sublinear speedup under intermittent client availability. arXiv preprint arXiv:2002.07399, 2020.
  • Yang et al. (2021) Haibo Yang, Minghong Fang, and Jia Liu. Achieving linear speedup with partial worker participation in non-IID federated learning. In International Conference on Learning Representations, 2021.
  • Yang et al. (2022) Haibo Yang, Xin Zhang, Prashant Khanduri, and Jia Liu. Anarchic federated learning. In International Conference on Machine Learning, pp.  25331–25363. PMLR, 2022.
  • Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.
  • Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI Conference on Artificial Intelligence, pp.  5693–5700, 2019.

Appendix


Appendix A Additional Discussion

A.1 Extending Objective (1) to Weighted Average

We note that our objective (1) can be easily extended to a weighted average of per-client empirical risk (i.e., average of sample losses), with arbitrary weights $\{q_{n}:\forall n\}$. To see this, let $\hat{F}_{n}(\mathbf{x})$ denote the (local) empirical risk of client $n$, and let $\Gamma:=\sum_{n=1}^{N}q_{n}$. We can define the local objective of client $n$ as $F_{n}(\mathbf{x})=\frac{q_{n}N}{\Gamma}\hat{F}_{n}(\mathbf{x})$, which gives us the global objective of

f(𝐱)=1Nn=1NFn(𝐱)=1Γn=1NqnF^n(𝐱).𝑓𝐱1𝑁superscriptsubscript𝑛1𝑁subscript𝐹𝑛𝐱1Γsuperscriptsubscript𝑛1𝑁subscript𝑞𝑛subscript^𝐹𝑛𝐱\displaystyle f(\mathbf{x})=\frac{1}{N}\sum_{n=1}^{N}F_{n}(\mathbf{x})=\frac{1% }{\Gamma}\sum_{n=1}^{N}q_{n}\hat{F}_{n}(\mathbf{x}).italic_f ( bold_x ) = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x ) = divide start_ARG 1 end_ARG start_ARG roman_Γ end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT over^ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x ) . (A.1.1)

This objective is in a standard form seen in most FL papers. The extension allows us to give different importance to different clients, if needed. For simplicity, we do not write out the weights {qn}subscript𝑞𝑛\{q_{n}\}{ italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } in the main paper, because this extension to arbitrary weights {qn}subscript𝑞𝑛\{q_{n}\}{ italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } is straightforward, and such a simplification has also been made in various other works such as Jhunjhunwala et al. (2022), Karimireddy et al. (2020), Reddi et al. (2021), Wang & Ji (2022).
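As a quick numerical illustration of (A.1.1), the following sketch evaluates both sides of the identity for a toy problem. The scalar quadratic risks, the weight values, and the function name `weighted_objective_two_ways` are assumptions made for illustration only and do not come from the paper.

```python
def weighted_objective_two_ways(q, empirical_risks, x):
    """Evaluate both sides of (A.1.1) at the point x.

    q: nonnegative client weights q_n (illustrative values).
    empirical_risks: callables; the n-th one plays the role of F_hat_n.
    """
    N = len(q)
    gamma_sum = sum(q)  # Gamma := sum_n q_n
    # Rescaled local objectives F_n(x) = (q_n * N / Gamma) * F_hat_n(x)
    local = [q[n] * N / gamma_sum * empirical_risks[n](x) for n in range(N)]
    lhs = sum(local) / N  # (1/N) * sum_n F_n(x)
    rhs = sum(q[n] * empirical_risks[n](x) for n in range(N)) / gamma_sum
    return lhs, rhs


# Toy empirical risks F_hat_n(x) = 0.5 * (x - c_n)^2 (assumed for the example).
risks = [lambda x, c=c: 0.5 * (x - c) ** 2 for c in (0.0, 1.0, 3.0)]
lhs, rhs = weighted_objective_two_ways([10, 30, 60], risks, x=2.0)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

The weights here could, for instance, be the local dataset sizes, but any nonnegative values with a positive sum work in this sketch.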

A.2 Assumption on Bounded Global Gradient

As stated in Theorem 2, our convergence result holds when either the “bounded global gradient” assumption or the “nearly optimal weights” assumption holds. When the aggregation weights $\{\omega_t^n\}$ are nearly optimal, satisfying $\frac{1}{N}\sum_{n=1}^{N}\left(p_n\omega_t^n-1\right)^2 \leq \frac{1}{81}$, we do not need the bounded gradient assumption.
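The “nearly optimal weights” condition can be checked directly when the participation probabilities are available, for example in a simulation. The sketch below is for analysis only, since $p_n$ is unknown in the setting of this paper; the function name `weights_nearly_optimal` and the example values are hypothetical.

```python
def weights_nearly_optimal(p, omega, threshold=1.0 / 81.0):
    """Return True if (1/N) * sum_n (p_n * omega_n - 1)^2 <= 1/81.

    p: participation probabilities p_n (known only in simulation).
    omega: aggregation weights omega_t^n for one fixed round t.
    """
    N = len(p)
    weight_error = sum((p[n] * omega[n] - 1.0) ** 2 for n in range(N)) / N
    return weight_error <= threshold


p = [0.1, 0.5, 0.9]
ok_inverse = weights_nearly_optimal(p, [1.0 / pn for pn in p])  # omega = 1/p
ok_uniform = weights_nearly_optimal(p, [1.0, 1.0, 1.0])         # unit weights
print(ok_inverse, ok_uniform)
```

With inverse-probability weights the weight error is essentially zero, whereas unit weights violate the condition for these strongly heterogeneous probabilities.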

For the bounded gradient assumption itself, a stronger assumption of bounded stochastic gradient is used in related works on adaptive gradient algorithms (Reddi et al., 2021, Wang et al., 2022b; c), which implies an upper bound on the per-sample gradient. Compared to these works, we only require an upper bound on the global gradient, i.e., average of per-sample gradients, in our work. Although focusing on very different problems, our FedAU method shares some similarities with adaptive gradient methods in the sense that we both adapt the weights used in model updates, where the adaptation is dependent on some parameters that progressively change during the training process. The difference, however, is that our weight adaptation is based on each client’s participation history, while adaptive gradient methods adapt the element-wise weights based on the historical model update vector. Nevertheless, the similarity in both methods leads to a technical (mathematical) step of bounding a “weight error” in the proofs, which is where the bounded gradient assumption is needed especially when the “weight error” itself cannot be bounded. In our work, this step is done in the proof of Theorem 2 (in Appendix B.5). In adaptive gradient methods, as an example, this step is on page 14 until Equation (4) in Reddi et al. (2021).

Again, we note that the bounded gradient assumption is only needed when the aggregation weights are estimated and the estimation error is large. This is seen in the two choices in Assumption 5; the convergence bound holds when either of these two conditions holds. Intuitively, this aligns with the reasoning of the need for bounding the “weight error”.

A.3 Comparison with Existing Convergence Bounds for FedAvg

We compare our result in Corollary 4 with existing FedAvg convergence results, where the latter assumes known participation probabilities. Since most existing results consider equiprobable sampling of a certain number (denoted by $S$ here) of clients out of all the $N$ clients, we first convert our bound to the same setting so that it is comparable with existing results. We note that our convergence bound includes the parameter $Q$ that is defined as $Q := \max_{t\in\{0,\ldots,T-1\}} \frac{1}{N}\sum_{n=1}^{N} p_n(\omega_t^n)^2$ in Theorem 2. When we know the participation probabilities and choose $\omega_t^n = \frac{1}{p_n}$ for all $t$, we have $Q = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{p_n}$.
Further, for equiprobable sampling of $S$ clients out of a total of $N$ clients, we have $p_n = \frac{S}{N}$ and thus $Q = \frac{N}{S}$. Therefore, when $T$ is large and ignoring the other constants, our upper bound in Corollary 4 becomes $\mathcal{O}\left(\frac{\sqrt{Q}}{\sqrt{NT}}\right) = \mathcal{O}\left(\frac{1}{\sqrt{ST}}\right)$.
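The conversion above can be reproduced numerically. The following sketch computes $Q$ from its definition and confirms that it reduces to $N/S$ under equiprobable sampling with $\omega_t^n = 1/p_n$; the helper name `participation_factor_q` and the values of $N$, $S$, and $T$ are illustrative assumptions.

```python
import math


def participation_factor_q(p, omega_per_round):
    """Q := max_t (1/N) * sum_n p_n * (omega_t^n)^2.

    omega_per_round: one list of per-client weights for each round t.
    """
    N = len(p)
    return max(
        sum(p[n] * w[n] ** 2 for n in range(N)) / N for w in omega_per_round
    )


N, S, T = 100, 10, 5                    # illustrative sizes
p = [S / N] * N                         # equiprobable sampling: p_n = S/N
omega = [[1.0 / pn for pn in p] for _ in range(T)]  # omega_t^n = 1/p_n
Q = participation_factor_q(p, omega)

# Dominant rate term with constants ignored: sqrt(Q)/sqrt(N*T) vs 1/sqrt(S*T)
rate_q = math.sqrt(Q) / math.sqrt(N * T)
rate_s = 1.0 / math.sqrt(S * T)
print(Q, N / S, rate_q, rate_s)
```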

Considering existing results of FedAvg with partial participation where the probabilities are both homogeneous and known, Theorem 1 in Karimireddy et al. (2020) gives the same convergence bound of $\mathcal{O}\left(\frac{1}{\sqrt{ST}}\right)$ for non-convex objectives, and Corollary 2 in Yang et al. (2021) gives a convergence bound of $\mathcal{O}\left(\frac{\sqrt{I}}{\sqrt{ST}}\right)$. Here, we note that Karimireddy et al. (2020) express the bound on communication rounds while we give the bound on the square of gradient norm, but the two types of bounds are directly convertible to each other. Our bound of $\mathcal{O}\left(\frac{1}{\sqrt{ST}}\right)$ matches Theorem 1 in Karimireddy et al. (2020) and improves over Corollary 2 in Yang et al. (2021). We also note that, in this special case, our result shows a linear speedup with respect to the number of participating clients, i.e., $S$, which is the same as the existing results in Karimireddy et al. (2020), Yang et al. (2021).

The uniqueness of our work compared to Karimireddy et al. (2020), Yang et al. (2021) and most other existing works is that we consider heterogeneous and unknown participation statistics (probabilities), where each client $n$ has its own participation probability $p_n$ that can be different from other clients. In contrast, Karimireddy et al. (2020), Yang et al. (2021) assume uniformly sampled clients where a fixed (and known) number of $S$ clients participate in each round. Our setup is more general where the number of clients that participate in each round can vary over time. Because of this generality, we cannot define a fixed value of $S$ in our convergence bound that holds for this general setup, so we use $Q$ to capture the statistical characteristics of client participation. When the overall probability distribution of client participation remains the same, increasing the total number of clients ($N$) has the same effect as increasing the number of participating clients ($S$), as we have shown above.

As a side note, when choosing $\omega_t^n = \frac{1}{p_n}$, the weight error term $\mathbb{E}\left[\left(p_n\omega_t^n-1\right)^2\right]$ becomes zero, so the third term in (8) in Corollary 4 vanishes. See the proof in Appendix B.7 for why the third term in (8) is related to the weight error.

Appendix B Proofs

B.1 Preliminaries

We first note the following preliminary inequalities that we will use in the proofs without explaining them further.

We have

\[
\left\|\frac{1}{M}\sum_{m=1}^{M}\mathbf{z}_m\right\|^2 \leq \frac{1}{M}\sum_{m=1}^{M}\left\|\mathbf{z}_m\right\|^2 \quad\textrm{and}\quad \left\|\sum_{m=1}^{M}\mathbf{z}_m\right\|^2 \leq M\sum_{m=1}^{M}\left\|\mathbf{z}_m\right\|^2 \tag{B.1.1}
\]

for any $\mathbf{z}_m \in \mathbb{R}^d$ with $m \in \{1,2,\ldots,M\}$, which is a direct consequence of Jensen’s inequality.

We also have

\[
\langle\mathbf{z}_1,\mathbf{z}_2\rangle \leq \frac{\rho\left\|\mathbf{z}_1\right\|^2}{2} + \frac{\left\|\mathbf{z}_2\right\|^2}{2\rho}, \tag{B.1.2}
\]

for any $\mathbf{z}_1,\mathbf{z}_2 \in \mathbb{R}^d$ and $\rho > 0$, which is known as (the generalized version of) Young’s inequality and also as the Peter–Paul inequality. A direct consequence of (B.1.2) is

\[
\left\|\mathbf{z}_1+\mathbf{z}_2\right\|^2 \leq (1+b)\left\|\mathbf{z}_1\right\|^2 + \left(1+\frac{1}{b}\right)\left\|\mathbf{z}_2\right\|^2, \tag{B.1.3}
\]

for any constant $b > 0$.
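For completeness, the step from (B.1.2) to (B.1.3) can be written out by expanding the square and applying (B.1.2) with $\rho = b$ to the cross term:

```latex
\begin{align*}
\left\|\mathbf{z}_1+\mathbf{z}_2\right\|^2
&= \left\|\mathbf{z}_1\right\|^2 + 2\langle\mathbf{z}_1,\mathbf{z}_2\rangle + \left\|\mathbf{z}_2\right\|^2 \\
&\leq \left\|\mathbf{z}_1\right\|^2 + b\left\|\mathbf{z}_1\right\|^2 + \frac{1}{b}\left\|\mathbf{z}_2\right\|^2 + \left\|\mathbf{z}_2\right\|^2 \\
&= (1+b)\left\|\mathbf{z}_1\right\|^2 + \left(1+\frac{1}{b}\right)\left\|\mathbf{z}_2\right\|^2 .
\end{align*}
```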

We also use the variance relation as follows:

\[
\mathbb{E}\left[\left\|\mathbf{z}\right\|^2\right] = \left\|\mathbb{E}\left[\mathbf{z}\right]\right\|^2 + \mathbb{E}\left[\left\|\mathbf{z}-\mathbb{E}\left[\mathbf{z}\right]\right\|^2\right], \tag{B.1.4}
\]

for any random vector $\mathbf{z} \in \mathbb{R}^d$, while noting that (B.1.4) also holds when all the expectations are conditioned on the same variable(s).

In addition, we use $\mathbb{E}_t\left[\cdot\right]$ to denote $\mathbb{E}\left[\cdot\,|\,\mathbf{x}_t,\{\omega_t^n\}\right]$ in short. We also assume that Assumptions 1–4 hold throughout our analysis.

B.2 Equivalent Formulation of Algorithm 1

For the purpose of analysis, similar to Wang & Ji (2022), we consider an equivalent formulation of the original Algorithm 1, as shown in Algorithm 3. In this algorithm, we assume that all the clients compute their local updates in every round. This is logically equivalent to the practical setting where the clients that do not participate perform no computation, because their computed update $\Delta_t^n$ has no effect on the global update of $\mathbf{x}_{t+1}$ if ${\rm I\kern-.2em l}_t^n = 0$; thus Algorithm 1 and Algorithm 3 give the same output sequence $\{\mathbf{x}_t : \forall t\}$. Our proofs in the following sections consider the logically equivalent Algorithm 3 for analysis and also use the notations defined in this algorithm.

Input: $\gamma$, $\eta$, $\mathbf{x}_0$, $I$
Output: $\{\mathbf{x}_t : \forall t\}$
Initialize $t_0 \leftarrow 0$, $\mathbf{u} \leftarrow \mathbf{0}$;
for $t = 0,\ldots,T-1$ do
    for $n = 1,\ldots,N$ in parallel do
        Sample ${\rm I\kern-.2em l}_t^n$ from an unknown stochastic process;
        $\mathbf{y}^n_{t,0} \leftarrow \mathbf{x}_t$;
        for $i = 0,\ldots,I-1$ do
            $\mathbf{y}^n_{t,i+1} \leftarrow \mathbf{y}^n_{t,i} - \gamma\mathbf{g}_n(\mathbf{y}^n_{t,i})$;   // In practice, no computation if ${\rm I\kern-.2em l}_t^n = 0$
        $\Delta_t^n \leftarrow \mathbf{y}^n_{t,I} - \mathbf{x}_t$;
        $\omega_t^n \leftarrow \texttt{ComputeWeight}(\{{\rm I\kern-.2em l}_\tau^n : \tau < t\})$;
    $\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t + \frac{\eta}{N}\sum_{n=1}^{N}{\rm I\kern-.2em l}_t^n\omega_t^n\Delta_t^n$;
Algorithm 3 A logically equivalent version of Algorithm 1
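To make the control flow of Algorithm 3 concrete, the following is a minimal sketch for a scalar toy problem. Several parts are assumptions introduced only for illustration: the local objectives $F_n(y) = \frac{1}{2}(y - c_n)^2$ with noise-free gradients, independent Bernoulli participation with probabilities `probs`, and in particular `compute_weight`, which is a hypothetical stand-in (inverse empirical participation frequency) and not the weight computation defined in the main paper.

```python
import random


def compute_weight(history):
    """Hypothetical stand-in for ComputeWeight (NOT the paper's method):
    inverse of the empirical participation frequency over past rounds."""
    if not history or sum(history) == 0:
        return 1.0
    return len(history) / sum(history)


def run_rounds(targets, probs, T=200, I=5, gamma=0.05, eta=1.0, seed=0):
    """Scalar sketch of Algorithm 3 with F_n(y) = 0.5 * (y - targets[n])^2."""
    rng = random.Random(seed)
    N = len(targets)
    x = 0.0
    history = [[] for _ in range(N)]  # past participation indicators per client
    for t in range(T):
        aggregate = 0.0
        for n in range(N):
            # Participation indicator (assumed Bernoulli for this toy example)
            participates = 1 if rng.random() < probs[n] else 0
            # Every client runs I local steps, as in the equivalent formulation
            y = x
            for _ in range(I):
                grad = y - targets[n]  # exact gradient of the toy F_n
                y -= gamma * grad
            delta = y - x
            w = compute_weight(history[n])  # uses only rounds tau < t
            aggregate += participates * w * delta
            history[n].append(participates)
        x += eta / N * aggregate  # global update
    return x


# With full participation all weights equal 1 and x approaches the mean target.
x_full = run_rounds([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
x_partial = run_rounds([0.0, 2.0, 4.0], [0.9, 0.5, 0.2])
print(x_full, x_partial)
```

Updates of non-participating clients are computed but multiplied by a zero indicator, which mirrors why this formulation produces the same iterates as one where such clients do no work.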

B.3 General Descent Lemma

To prove the general descent lemma that is used to derive both Theorems 1 and 2, we first define the following generally weighted loss function.

Definition B.3.1.

Define

\[
\tilde{f}(\mathbf{x}) := \sum_{n=1}^{N}\varphi_n F_n(\mathbf{x}) \tag{B.3.1}
\]

where $\varphi_n \geq 0$ for all $n$ and $\sum_{n=1}^{N}\varphi_n = 1$.

In (B.3.1), choosing $\varphi_n = \frac{1}{N}$ gives our original objective of $f(\mathbf{x})$. Note that we consider the updates in Algorithm 3 to be still without weighting by $\varphi_n$, which allows us to quantify the convergence to a different objective when the aggregation weights are not properly chosen.

Lemma B.3.1.

Define $\tilde{\delta} := 2\delta$. Then we have

\[
\left\|\nabla F_n(\mathbf{x}) - \nabla\tilde{f}(\mathbf{x})\right\|^2 \leq \tilde{\delta}^2, \quad \forall\mathbf{x}, n. \tag{B.3.2}
\]
Proof.

From Assumption 3, we have

\begin{align*}
\left\|\nabla F_n(\mathbf{x}) - \nabla\tilde{f}(\mathbf{x})\right\|^2
&= \left\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x}) + \nabla f(\mathbf{x}) - \nabla\tilde{f}(\mathbf{x})\right\|^2 \\
&\leq 2\left\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x})\right\|^2 + 2\left\|\nabla f(\mathbf{x}) - \nabla\tilde{f}(\mathbf{x})\right\|^2 \\
&= 2\left\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x})\right\|^2 + 2\left\|\sum_{n=1}^{N}\varphi_n\left(\nabla f(\mathbf{x}) - \nabla F_n(\mathbf{x})\right)\right\|^2 \\
&\overset{(a)}{\leq} 2\left\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x})\right\|^2 + 2\sum_{n=1}^{N}\varphi_n\left\|\nabla f(\mathbf{x}) - \nabla F_n(\mathbf{x})\right\|^2 \\
&\leq 4\delta^2,
\end{align*}

where we use Jensen’s inequality in (a). The final result follows since $\tilde{\delta}^2 = 4\delta^2$. ∎
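The bound in (B.3.2) can be sanity-checked at a single point $\mathbf{x}$ with synthetic gradients. The sketch below draws random vectors to play the role of $\nabla F_n(\mathbf{x})$, takes $\delta^2$ as the largest squared deviation from their plain average (the smallest value compatible with a bound of the form $\|\nabla F_n(\mathbf{x}) - \nabla f(\mathbf{x})\|^2 \leq \delta^2$ at this point), and compares against an arbitrary convex combination. The function name, dimensions, and random data are assumptions for illustration.

```python
import random


def check_divergence_bound(grads, phi):
    """Return (max_n ||g_n - sum_m phi_m g_m||^2, 4 * delta^2) at one point.

    grads: list of N gradient vectors, standing in for grad F_n(x).
    phi: nonnegative weights summing to one, as in (B.3.1).
    """
    N, d = len(grads), len(grads[0])
    # Gradient of f: plain average of the local gradients
    mean = [sum(g[k] for g in grads) / N for k in range(d)]
    # Gradient of the generally weighted objective
    mixed = [sum(phi[n] * grads[n][k] for n in range(N)) for k in range(d)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    delta_sq = max(sq_dist(g, mean) for g in grads)
    lhs = max(sq_dist(g, mixed) for g in grads)
    return lhs, 4.0 * delta_sq


rng = random.Random(0)
grads = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
raw = [rng.random() for _ in range(6)]
phi = [r / sum(raw) for r in raw]
lhs, bound = check_divergence_bound(grads, phi)
print(lhs <= bound)
```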

Lemma B.3.2.

When $\gamma \leq \frac{1}{\sqrt{30}LI}$,

\[
\mathbb{E}_t\left[\left\|\mathbf{y}_{t,i}^n - \mathbf{x}_t\right\|^2\right] \leq 5I\gamma^2\left(\sigma^2 + 6I\tilde{\delta}^2\right) + 30I^2\gamma^2\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 \tag{B.3.3}
\]
Proof.

This lemma has the same form as in Yang et al. (2021, Lemma 2) and Reddi et al. (2021, Lemma 3), but we present it here for a single client $n$ instead of an average over multiple clients.

For $i \in \{0,1,2,\ldots,I-1\}$, we have

\begin{align*}
&\mathbb{E}_t\left[\left\|\mathbf{y}^n_{t,i+1} - \mathbf{x}_t\right\|^2\right] \\
&= \mathbb{E}_t\left[\left\|\mathbf{y}^n_{t,i} - \mathbf{x}_t - \gamma\mathbf{g}_n(\mathbf{y}^n_{t,i})\right\|^2\right] \\
&= \mathbb{E}_t\Big[\Big\|\mathbf{y}^n_{t,i} - \mathbf{x}_t - \gamma\Big(\mathbf{g}_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{y}^n_{t,i}) + \nabla F_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{x}_t) + \nabla F_n(\mathbf{x}_t) \\
&\qquad\qquad\qquad - \nabla\tilde{f}(\mathbf{x}_t) + \nabla\tilde{f}(\mathbf{x}_t)\Big)\Big\|^2\Big] \\
&\overset{(a)}{=} \mathbb{E}_t\left[\left\|\gamma\left(\mathbf{g}_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{y}^n_{t,i})\right)\right\|^2\right] \\
&\quad + 2\mathbb{E}_t\Big[\mathbb{E}_t\Big[\Big\langle\gamma\left(\mathbf{g}_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{y}^n_{t,i})\right), \mathbf{y}^n_{t,i} - \mathbf{x}_t - \gamma\Big(\nabla F_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{x}_t) + \nabla F_n(\mathbf{x}_t) \\
&\qquad\qquad\qquad - \nabla\tilde{f}(\mathbf{x}_t) + \nabla\tilde{f}(\mathbf{x}_t)\Big)\Big\rangle\,\Big|\,\mathbf{y}^n_{t,i},\mathbf{x}_t\Big]\Big] \\
&\quad + \mathbb{E}_t\left[\left\|\mathbf{y}^n_{t,i} - \mathbf{x}_t - \gamma\left(\nabla F_n(\mathbf{y}^n_{t,i}) - \nabla F_n(\mathbf{x}_t) + \nabla F_n(\mathbf{x}_t) - \nabla\tilde{f}(\mathbf{x}_t) + \nabla\tilde{f}(\mathbf{x}_t)\right)\right\|^2\right]
\end{align*}
=(b)𝔼t[γ(𝐠n(𝐲t,in)Fn(𝐲t,in))2]𝑏subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐠𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖2\displaystyle\overset{(b)}{=}\mathbb{E}_{t}\left[\left\|\gamma\left(\mathbf{g}% _{n}(\mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{y}^{n}_{t,i})\right)\right\|^{% 2}\right]start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG = end_ARG blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( bold_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+2𝔼t[𝔼t[γ(𝐠n(𝐲t,in)Fn(𝐲t,in))|𝐲t,in,𝐱t],\displaystyle\quad+\!2\mathbb{E}_{t}\left[\left\langle\mathbb{E}_{t}\left[% \left.\!\gamma\!\left(\mathbf{g}_{n}(\mathbf{y}^{n}_{t,i})\!-\!\nabla F_{n}(% \mathbf{y}^{n}_{t,i})\right)\right|\!\mathbf{y}^{n}_{t,i},\mathbf{x}_{t}\!% \right],\right.\right.+ 2 blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ⟨ blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ italic_γ ( bold_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) ) | bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ] ,
𝐲t,in𝐱tγ(Fn(𝐲t,in)Fn(𝐱t)+Fn(𝐱t)f~(𝐱t)+f~(𝐱t))]\displaystyle\quad\quad\quad\quad\quad\quad\quad\left.\left.\mathbf{y}^{n}_{t,% i}\!-\!\mathbf{x}_{t}\!-\!\gamma\!\left(\!\nabla F_{n}(\mathbf{y}^{n}_{t,i})\!% -\!\nabla F_{n}(\mathbf{x}_{t})\!+\!\nabla F_{n}(\mathbf{x}_{t})\!-\!\nabla% \tilde{f}(\mathbf{x}_{t})\!+\!\nabla\tilde{f}(\mathbf{x}_{t})\!\right)\!\right% \rangle\right]bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ⟩ ]
+𝔼t[𝐲t,in𝐱tγ(Fn(𝐲t,in)Fn(𝐱t)+Fn(𝐱t)f~(𝐱t)+f~(𝐱t))2]subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡𝛾subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscript𝐱𝑡subscript𝐹𝑛subscript𝐱𝑡~𝑓subscript𝐱𝑡~𝑓subscript𝐱𝑡2\displaystyle\quad+\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-\mathbf{x}_% {t}-\gamma\left(\nabla F_{n}(\mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{x}_{t}% )+\nabla F_{n}(\mathbf{x}_{t})-\nabla\tilde{f}(\mathbf{x}_{t})+\nabla\tilde{f}% (\mathbf{x}_{t})\right)\right\|^{2}\right]+ blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
=(c)𝔼t[γ(𝐠n(𝐲t,in)Fn(𝐲t,in))2]𝑐subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐠𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖2\displaystyle\overset{(c)}{=}\mathbb{E}_{t}\left[\left\|\gamma\left(\mathbf{g}% _{n}(\mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{y}^{n}_{t,i})\right)\right\|^{% 2}\right]start_OVERACCENT ( italic_c ) end_OVERACCENT start_ARG = end_ARG blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( bold_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+𝔼t[𝐲t,in𝐱tγ(Fn(𝐲t,in)Fn(𝐱t)+Fn(𝐱t)f~(𝐱t)+f~(𝐱t))2]subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡𝛾subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscript𝐱𝑡subscript𝐹𝑛subscript𝐱𝑡~𝑓subscript𝐱𝑡~𝑓subscript𝐱𝑡2\displaystyle\quad+\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-\mathbf{x}_% {t}-\gamma\left(\nabla F_{n}(\mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{x}_{t}% )+\nabla F_{n}(\mathbf{x}_{t})-\nabla\tilde{f}(\mathbf{x}_{t})+\nabla\tilde{f}% (\mathbf{x}_{t})\right)\right\|^{2}\right]+ blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
(d)𝔼t[γ(𝐠n(𝐲t,in)Fn(𝐲t,in))2]+(1+12I1)𝔼t[𝐲t,in𝐱t2]𝑑subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐠𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖2112𝐼1subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡2\displaystyle\overset{(d)}{\leq}\mathbb{E}_{t}\left[\left\|\gamma\left(\mathbf% {g}_{n}(\mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{y}^{n}_{t,i})\right)\right% \|^{2}\right]+\left(1+\frac{1}{2I-1}\right)\mathbb{E}_{t}\left[\left\|\mathbf{% y}^{n}_{t,i}-\mathbf{x}_{t}\right\|^{2}\right]start_OVERACCENT ( italic_d ) end_OVERACCENT start_ARG ≤ end_ARG blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( bold_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_I - 1 end_ARG ) blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+2I𝔼t[γ(Fn(𝐲t,in)Fn(𝐱t)+Fn(𝐱t)f~(𝐱t)+f~(𝐱t))2]2𝐼subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscript𝐱𝑡subscript𝐹𝑛subscript𝐱𝑡~𝑓subscript𝐱𝑡~𝑓subscript𝐱𝑡2\displaystyle\quad+2I\mathbb{E}_{t}\left[\left\|\gamma\left(\nabla F_{n}(% \mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{x}_{t})+\nabla F_{n}(\mathbf{x}_{t}% )-\nabla\tilde{f}(\mathbf{x}_{t})+\nabla\tilde{f}(\mathbf{x}_{t})\right)\right% \|^{2}\right]+ 2 italic_I blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
𝔼t[γ(𝐠n(𝐲t,in)Fn(𝐲t,in))2]+(1+12I1)𝔼t[𝐲t,in𝐱t2]absentsubscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐠𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖2112𝐼1subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡2\displaystyle\leq\mathbb{E}_{t}\left[\left\|\gamma\left(\mathbf{g}_{n}(\mathbf% {y}^{n}_{t,i})-\nabla F_{n}(\mathbf{y}^{n}_{t,i})\right)\right\|^{2}\right]+% \left(1+\frac{1}{2I-1}\right)\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-% \mathbf{x}_{t}\right\|^{2}\right]≤ blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( bold_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + ( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_I - 1 end_ARG ) blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+6I𝔼t[γ(Fn(𝐲t,in)Fn(𝐱t))2]+6I𝔼t[γ(Fn(𝐱t)f~(𝐱t))2]6𝐼subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐹𝑛subscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐹𝑛subscript𝐱𝑡26𝐼subscript𝔼𝑡delimited-[]superscriptnorm𝛾subscript𝐹𝑛subscript𝐱𝑡~𝑓subscript𝐱𝑡2\displaystyle\quad+6I\mathbb{E}_{t}\left[\left\|\gamma\left(\nabla F_{n}(% \mathbf{y}^{n}_{t,i})-\nabla F_{n}(\mathbf{x}_{t})\right)\right\|^{2}\right]+6% I\mathbb{E}_{t}\left[\left\|\gamma\left(\nabla F_{n}(\mathbf{x}_{t})-\nabla% \tilde{f}(\mathbf{x}_{t})\right)\right\|^{2}\right]+ 6 italic_I blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 6 italic_I blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ italic_γ ( ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
+6Iγf~(𝐱t)26𝐼superscriptnorm𝛾~𝑓subscript𝐱𝑡2\displaystyle\quad\quad\quad\quad\quad\quad\quad+6I\left\|\gamma\nabla\tilde{f% }(\mathbf{x}_{t})\right\|^{2}+ 6 italic_I ∥ italic_γ ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
(e)γ2σ2+(1+12I1)𝔼t[𝐲t,in𝐱t2]+6Iγ2L2𝔼t[𝐲t,in𝐱t2]+6Iγ2δ~2𝑒superscript𝛾2superscript𝜎2112𝐼1subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡26𝐼superscript𝛾2superscript𝐿2subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡26𝐼superscript𝛾2superscript~𝛿2\displaystyle\overset{(e)}{\leq}\gamma^{2}\sigma^{2}+\left(1+\frac{1}{2I-1}% \right)\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-\mathbf{x}_{t}\right\|^% {2}\right]+6I\gamma^{2}L^{2}\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-% \mathbf{x}_{t}\right\|^{2}\right]+6I\gamma^{2}\tilde{\delta}^{2}start_OVERACCENT ( italic_e ) end_OVERACCENT start_ARG ≤ end_ARG italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_I - 1 end_ARG ) blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT over~ start_ARG italic_δ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
+6Iγ2f~(𝐱t)26𝐼superscript𝛾2superscriptnorm~𝑓subscript𝐱𝑡2\displaystyle\quad\quad\quad\quad\quad\quad\quad+6I\gamma^{2}\left\|\nabla% \tilde{f}(\mathbf{x}_{t})\right\|^{2}+ 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
=(1+12I1+6Iγ2L2)𝔼t[𝐲t,in𝐱t2]+γ2σ2+6Iγ2δ~2+6Iγ2f~(𝐱t)2,absent112𝐼16𝐼superscript𝛾2superscript𝐿2subscript𝔼𝑡delimited-[]superscriptnormsubscriptsuperscript𝐲𝑛𝑡𝑖subscript𝐱𝑡2superscript𝛾2superscript𝜎26𝐼superscript𝛾2superscript~𝛿26𝐼superscript𝛾2superscriptnorm~𝑓subscript𝐱𝑡2\displaystyle=\left(1+\frac{1}{2I-1}+6I\gamma^{2}L^{2}\right)\mathbb{E}_{t}% \left[\left\|\mathbf{y}^{n}_{t,i}-\mathbf{x}_{t}\right\|^{2}\right]+\gamma^{2}% \sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(% \mathbf{x}_{t})\right\|^{2},= ( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_I - 1 end_ARG + 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ bold_y start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] + italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT over~ start_ARG italic_δ end_ARG start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 6 italic_I italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (B.3.4)

where $(a)$ follows from expanding the squared norm above and applying the law of total expectation to the second term; $(b)$ holds because the second argument of the inner product has no randomness when $\mathbf{y}^{n}_{t,i}$ and $\mathbf{x}_{t}$ are given; $(c)$ holds because the inner product is zero due to the unbiasedness of the stochastic gradient; $(d)$ follows from expanding the second term and applying the Peter-Paul inequality with parameter $\frac{1}{2I-1}$, which gives the factors $1+\frac{1}{2I-1}$ and $1+(2I-1)=2I$; the unlabeled inequality after $(d)$ uses $\|\mathbf{a}+\mathbf{b}+\mathbf{c}\|^{2}\leq 3\left(\|\mathbf{a}\|^{2}+\|\mathbf{b}\|^{2}+\|\mathbf{c}\|^{2}\right)$; and $(e)$ uses the gradient variance bound, the Lipschitz gradient property, and the gradient divergence bound.
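Step $(d)$ can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof: it draws random vectors and confirms that $\|\mathbf{a}-\mathbf{b}\|^{2}\leq\left(1+\frac{1}{2I-1}\right)\|\mathbf{a}\|^{2}+2I\|\mathbf{b}\|^{2}$, where $\mathbf{a}$ stands in for $\mathbf{y}^{n}_{t,i}-\mathbf{x}_{t}$ and $\mathbf{b}$ for the scaled gradient term. The function name `peter_paul_holds` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def peter_paul_holds(local_steps: int, dim: int = 8, trials: int = 1000, seed: int = 0) -> bool:
    """Check ||a - b||^2 <= (1 + eps) ||a||^2 + (1 + 1/eps) ||b||^2 on random vectors.

    With eps = 1 / (2I - 1), the second factor equals 1 + (2I - 1) = 2I,
    matching the coefficients that appear after step (d).
    """
    rng = np.random.default_rng(seed)
    eps = 1.0 / (2 * local_steps - 1)
    for _ in range(trials):
        a = rng.normal(size=dim)  # stand-in for y_{t,i}^n - x_t
        b = rng.normal(size=dim)  # stand-in for the gamma-scaled gradient term
        lhs = float(np.sum((a - b) ** 2))
        rhs = (1.0 + eps) * float(np.sum(a ** 2)) + 2.0 * local_steps * float(np.sum(b ** 2))
        # Small slack guards against floating-point rounding only.
        if lhs > rhs * (1.0 + 1e-12) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    for I in (1, 5, 50):
        print(I, peter_paul_holds(I))
```

A random search like this cannot prove the inequality; it only guards against a mis-transcribed coefficient.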

Because $\gamma\leq\frac{1}{\sqrt{30}LI}$, we have

\begin{align*}
\frac{1}{2I-1}+6I\gamma^{2}L^{2}\leq\frac{1}{2I-1}+\frac{1}{5I}\leq\frac{2}{2I-1}=\frac{1}{I-\frac{1}{2}}.
\end{align*}
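The chain of inequalities above can be checked at the largest admissible step size. The sketch below is illustrative only: it sets $\gamma=\frac{1}{\sqrt{30}LI}$, so that $6I\gamma^{2}L^{2}=\frac{1}{5I}$, and returns the three quantities being compared. The helper name `stepsize_chain` and the default value of the smoothness constant are assumptions made for the example.

```python
import math


def stepsize_chain(local_steps: int, smoothness: float = 1.0):
    """Return the three quantities in the chain, evaluated at gamma = 1/(sqrt(30) L I)."""
    I, L = local_steps, smoothness
    gamma = 1.0 / (math.sqrt(30.0) * L * I)  # largest step size allowed by the condition
    first = 1.0 / (2 * I - 1) + 6 * I * gamma**2 * L**2
    second = 1.0 / (2 * I - 1) + 1.0 / (5 * I)
    third = 2.0 / (2 * I - 1)  # equals 1 / (I - 1/2)
    return first, second, third


if __name__ == "__main__":
    for I in (1, 2, 10, 100):
        print(I, stepsize_chain(I))
```

At the boundary value of $\gamma$ the first two quantities coincide up to rounding; any smaller $\gamma$ only decreases the first one.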

Continuing from (B.3.4), we have

\begin{align*}
\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i+1}-\mathbf{x}_{t}\right\|^{2}\right]\leq\left(1+\frac{1}{I-\frac{1}{2}}\right)\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i}-\mathbf{x}_{t}\right\|^{2}\right]+\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}.
\end{align*}

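By unrolling the recursion, where the initial term vanishes because the local iterates start from $\mathbf{y}^{n}_{t,0}=\mathbf{x}_{t}$, we obtain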

\begin{align*}
&\mathbb{E}_{t}\left[\left\|\mathbf{y}^{n}_{t,i+1}-\mathbf{x}_{t}\right\|^{2}\right] \\
&\leq\sum_{i'=0}^{i}\left(1+\frac{1}{I-\frac{1}{2}}\right)^{i'}\left(\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right) \\
&\leq\sum_{i'=0}^{I-1}\left(1+\frac{1}{I-\frac{1}{2}}\right)^{i'}\left(\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right) \\
&=\left[\left(1+\frac{1}{I-\frac{1}{2}}\right)^{I}-1\right]\left(I-\frac{1}{2}\right)\cdot\left(\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right) \\
&=\left[\left(1+\frac{1}{I-\frac{1}{2}}\right)^{I-\frac{1}{2}}\left(1+\frac{1}{I-\frac{1}{2}}\right)^{\frac{1}{2}}-1\right]\left(I-\frac{1}{2}\right)\cdot\left(\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right) \\
&\overset{(a)}{\leq}\left[\sqrt{3}e-1\right]\left(I-\frac{1}{2}\right)\cdot\left(\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right) \\
&\leq 5I\gamma^{2}\left(\sigma^{2}+6I\tilde{\delta}^{2}\right)+30I^{2}\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2},
\end{align*}

where $(a)$ uses $\left(1+\frac{1}{z}\right)^{z}\leq e$ for any $z>0$ and $1+\frac{1}{I-\frac{1}{2}}\leq 3$, and the last inequality uses $\sqrt{3}e-1<5$ and $I-\frac{1}{2}\leq I$. ∎
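The unrolled bound can also be checked numerically. The sketch below is a hedged illustration rather than part of the proof: it normalizes the additive constant $\gamma^{2}\sigma^{2}+6I\gamma^{2}\tilde{\delta}^{2}+6I\gamma^{2}\|\nabla\tilde{f}(\mathbf{x}_{t})\|^{2}$ to $1$, iterates the recursion with equality for $I$ steps from a zero initial value (which assumes the local iterates start at $\mathbf{x}_{t}$), and compares the result with the closed-form geometric sum, the intermediate bound $(\sqrt{3}e-1)(I-\frac{1}{2})$, and the final factor $5I$. The function name `drift_bound_terms` is hypothetical.

```python
import math


def drift_bound_terms(local_steps: int):
    """Return (iterated recursion, closed form, intermediate bound, final bound) for unit constant."""
    I = local_steps
    rate = 1.0 + 1.0 / (I - 0.5)
    drift = 0.0  # assumes the local iterate starts at the global model
    for _ in range(I):
        drift = rate * drift + 1.0  # recursion taken with equality, constant normalized to 1
    closed_form = (rate**I - 1.0) * (I - 0.5)  # geometric sum of rate**i for i = 0, ..., I-1
    intermediate = (math.sqrt(3.0) * math.e - 1.0) * (I - 0.5)
    final = 5.0 * I
    return drift, closed_form, intermediate, final


if __name__ == "__main__":
    for I in (1, 2, 10, 100):
        print(I, drift_bound_terms(I))
```

Multiplying the final factor $5I$ by the normalized constant recovers the stated bound $5I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+30I^{2}\gamma^{2}\|\nabla\tilde{f}(\mathbf{x}_{t})\|^{2}$.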

Lemma B.3.3 (General descent lemma).

When $\gamma\leq\frac{1}{4\sqrt{15}LI}$ and $\gamma\eta\leq\frac{1}{4LI}$, we have

\begin{align*}
&\mathbb{E}_{t}\left[\tilde{f}(\mathbf{x}_{t+1})\right] \\
&\leq\tilde{f}(\mathbf{x}_{t})+\frac{3\gamma\eta IN}{2}\left(15L^{2}I\gamma^{2}\sigma^{2}+\frac{27\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2} \\
&\quad+\frac{5\gamma^{3}\eta L^{2}I^{2}}{2}\left(\sigma^{2}+6I\tilde{\delta}^{2}\right)+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2} \\
&\quad+\gamma\eta I\left[\frac{81N}{16}\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}+15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}\right]\cdot\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}. \tag{B.3.5}
\end{align*}
Proof.

Due to Assumption 1 ($L$-smoothness), we have

\begin{align}
\mathbb{E}_t\left[\tilde{f}(\mathbf{x}_{t+1})\right]
&\leq \tilde{f}(\mathbf{x}_t) - \gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-.2em l}_t^n\omega_t^n\sum_{i=0}^{I-1}\mathbf{g}_n(\mathbf{y}_{t,i}^n)\right\rangle\right] \nonumber\\
&\quad + \frac{\gamma^2\eta^2 L}{2}\mathbb{E}_t\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-.2em l}_t^n\omega_t^n\sum_{i=0}^{I-1}\mathbf{g}_n(\mathbf{y}_{t,i}^n)\right\|^2\right] \nonumber\\
&= \tilde{f}(\mathbf{x}_t) - \gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \frac{1}{N}\sum_{n=1}^{N}p_n\omega_t^n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\rangle\right] \nonumber\\
&\quad + \frac{\gamma^2\eta^2 L}{2}\mathbb{E}_t\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-.2em l}_t^n\omega_t^n\sum_{i=0}^{I-1}\mathbf{g}_n(\mathbf{y}_{t,i}^n)\right\|^2\right], \tag{B.3.6}
\end{align}

where the last equality is due to $\mathbb{E}_t\left[{\rm I\kern-.2em l}_t^n\right] = p_n$ and the unbiasedness of the stochastic gradient, which gives $\mathbb{E}_t\left[\mathbb{E}\left[\left.\mathbf{g}_n(\mathbf{y}_{t,i}^n)\right|\mathbf{y}_{t,i}^n\right]\right] = \mathbb{E}_t\left[\nabla F_n(\mathbf{y}_{t,i}^n)\right]$ (for simplicity, we will not write out this total expectation in subsequent steps of this proof).
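The expectation step behind the second line of (B.3.6) can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: clients participate as independent Bernoulli variables with probabilities `p`, the weights `omega` are treated as fixed given the history, and `grads[n]` is a hypothetical stand-in for the accumulated gradient $\sum_{i}\nabla F_n(\mathbf{y}_{t,i}^n)$. The function names `expected_aggregate` and `closed_form` are made up for this example and do not come from the paper. Exact rational arithmetic is used so the two sides can be compared with `==`.

```python
from fractions import Fraction
from itertools import product
from math import prod


def expected_aggregate(p, omega, grads):
    """Exact E[(1/N) * sum_n 1_n * omega_n * grads[n]].

    Enumerates all 2^N participation patterns, assuming (for this toy
    check only) independent Bernoulli(p[n]) participation per client.
    """
    n_clients = len(p)
    dim = len(grads[0])
    expected = [Fraction(0)] * dim
    for mask in product((0, 1), repeat=n_clients):
        # Probability of this participation pattern.
        prob = prod(p[n] if mask[n] else 1 - p[n] for n in range(n_clients))
        for d in range(dim):
            agg = sum(mask[n] * omega[n] * grads[n][d] for n in range(n_clients))
            expected[d] += prob * agg / n_clients
    return expected


def closed_form(p, omega, grads):
    """(1/N) * sum_n p_n * omega_n * grads[n], i.e. the indicator replaced by p_n."""
    n_clients = len(p)
    dim = len(grads[0])
    return [
        sum(p[n] * omega[n] * grads[n][d] for n in range(n_clients)) / n_clients
        for d in range(dim)
    ]


# Hypothetical toy values; omega_n = 1/p_n is chosen only so that p_n * omega_n = 1.
p = [Fraction(1, 2), Fraction(1, 5), Fraction(9, 10)]
omega = [Fraction(2), Fraction(5), Fraction(10, 9)]
grads = [
    [Fraction(1), Fraction(-2)],
    [Fraction(3), Fraction(0)],
    [Fraction(-1), Fraction(4)],
]

lhs = expected_aggregate(p, omega, grads)
rhs = closed_form(p, omega, grads)
print(lhs == rhs)
```

Because the identity follows from linearity of expectation, the independence assumption is only needed to enumerate the pattern probabilities here; it is not claimed to be required by the proof.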

Expanding the second term of (B.3.6), we have

\begin{align}
&-\gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \frac{1}{N}\sum_{n=1}^{N}p_n\omega_t^n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\rangle\right] \nonumber\\
&= -\gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \frac{1}{N}\sum_{n=1}^{N}p_n\omega_t^n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n) - \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right.\right. \nonumber\\
&\qquad\qquad\qquad \left.\left. + \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n) - I\nabla\tilde{f}(\mathbf{x}_t) + I\nabla\tilde{f}(\mathbf{x}_t)\right\rangle\right] \nonumber\\
&= -\gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \frac{1}{N}\sum_{n=1}^{N}p_n\omega_t^n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n) - \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\rangle\right] \nonumber\\
&\quad - \gamma\eta\,\mathbb{E}_t\left[\left\langle\nabla\tilde{f}(\mathbf{x}_t), \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n) - I\nabla\tilde{f}(\mathbf{x}_t)\right\rangle\right] - \gamma\eta I\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 \nonumber\\
&= -\frac{\gamma\eta}{I}\mathbb{E}_t\left[\left\langle I\nabla\tilde{f}(\mathbf{x}_t), \sum_{n=1}^{N}\left(\frac{p_n\omega_t^n}{N}-\varphi_n\right)\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\rangle\right] \nonumber\\
&\quad - \frac{\gamma\eta}{I}\mathbb{E}_t\left[\left\langle I\nabla\tilde{f}(\mathbf{x}_t), \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\left(\nabla F_n(\mathbf{y}_{t,i}^n) - \nabla F_n(\mathbf{x}_t)\right)\right\rangle\right] - \gamma\eta I\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 \nonumber\\
&\overset{(a)}{\leq} \frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 + \frac{\gamma\eta}{I}\mathbb{E}_t\left[\left\|\sum_{n=1}^{N}\left(\frac{p_n\omega_t^n}{N}-\varphi_n\right)\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] \nonumber\\
&\quad + \frac{\gamma\eta I}{2}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 + \frac{\gamma\eta}{2I}\mathbb{E}_t\left[\left\|\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\left(\nabla F_n(\mathbf{y}_{t,i}^n) - \nabla F_n(\mathbf{x}_t)\right)\right\|^2\right] \nonumber\\
&\quad - \frac{\gamma\eta}{2I}\mathbb{E}_t\left[\left\|\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] - \gamma\eta I\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 \nonumber\\
&\leq \gamma\eta N\sum_{n=1}^{N}\left(\frac{p_n\omega_t^n}{N}-\varphi_n\right)^2\sum_{i=0}^{I-1}\mathbb{E}_t\left[\left\|\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] \nonumber\\
&\quad + \frac{\gamma\eta}{2}\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\mathbb{E}_t\left[\left\|\nabla F_n(\mathbf{y}_{t,i}^n) - \nabla F_n(\mathbf{x}_t)\right\|^2\right] \nonumber\\
&\quad - \frac{\gamma\eta}{2I}\mathbb{E}_t\left[\left\|\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] - \frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2 \nonumber\\
&\leq \gamma\eta N\sum_{n=1}^{N}\left(\frac{p_n\omega_t^n}{N}-\varphi_n\right)^2\sum_{i=0}^{I-1}\mathbb{E}_t\left[\left\|\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] + \frac{\gamma\eta L^2}{2}\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\mathbb{E}_t\left[\left\|\mathbf{y}_{t,i}^n - \mathbf{x}_t\right\|^2\right] \nonumber\\
&\quad - \frac{\gamma\eta}{2I}\mathbb{E}_t\left[\left\|\sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n)\right\|^2\right] - \frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2, \tag{B.3.7}
\end{align}

where we use $\left\langle\mathbf{a},\mathbf{b}\right\rangle = \frac{1}{2}\left(\left\|\mathbf{a}+\mathbf{b}\right\|^2 - \left\|\mathbf{a}\right\|^2 - \left\|\mathbf{b}\right\|^2\right)$ to expand the second term in $(a)$.
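For readers who want step $(a)$ spelled out, the following worked substitution is one way to read it. It is a sketch rather than the authors' own wording. It assumes $\nabla\tilde{f}(\mathbf{x}_t) = \sum_{n=1}^{N}\varphi_n\nabla F_n(\mathbf{x}_t)$, which is what the rewriting just before $(a)$ in (B.3.7) relies on. The symbols $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$ are local shorthand introduced here. The bound on the first term is presumably the standard inequality $-\langle\mathbf{a},\mathbf{c}\rangle \leq \frac{1}{4}\|\mathbf{a}\|^2 + \|\mathbf{c}\|^2$; the source does not name it, but it reproduces the stated coefficients.

```latex
\begin{align*}
\mathbf{a} &:= I\nabla\tilde{f}(\mathbf{x}_t), \qquad
\mathbf{b} := \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}
  \left(\nabla F_n(\mathbf{y}_{t,i}^n) - \nabla F_n(\mathbf{x}_t)\right), \qquad
\mathbf{c} := \sum_{n=1}^{N}\left(\frac{p_n\omega_t^n}{N}-\varphi_n\right)
  \sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n), \\
% Second term: the polarization identity, using a + b = sum_n phi_n sum_i grad F_n(y).
\mathbf{a}+\mathbf{b} &= \sum_{n=1}^{N}\varphi_n\sum_{i=0}^{I-1}\nabla F_n(\mathbf{y}_{t,i}^n), \\
-\frac{\gamma\eta}{I}\langle\mathbf{a},\mathbf{b}\rangle
&= \frac{\gamma\eta}{2I}\|\mathbf{a}\|^2 + \frac{\gamma\eta}{2I}\|\mathbf{b}\|^2
   - \frac{\gamma\eta}{2I}\|\mathbf{a}+\mathbf{b}\|^2 \\
&= \frac{\gamma\eta I}{2}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2
   + \frac{\gamma\eta}{2I}\|\mathbf{b}\|^2
   - \frac{\gamma\eta}{2I}\|\mathbf{a}+\mathbf{b}\|^2, \\
% First term: assumed Young-type bound -<a,c> <= |a|^2/4 + |c|^2.
-\frac{\gamma\eta}{I}\langle\mathbf{a},\mathbf{c}\rangle
&\leq \frac{\gamma\eta}{I}\left(\frac{1}{4}\|\mathbf{a}\|^2 + \|\mathbf{c}\|^2\right)
 = \frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2
   + \frac{\gamma\eta}{I}\|\mathbf{c}\|^2 .
\end{align*}
```

Collecting the coefficients of $\left\|\nabla\tilde{f}(\mathbf{x}_t)\right\|^2$ gives $\gamma\eta I\left(\frac{1}{4} + \frac{1}{2} - 1\right) = -\frac{\gamma\eta I}{4}$, which matches the last term of (B.3.7).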

Expanding the third term of (B.3.6), we have

$$
\begin{aligned}
&\frac{\gamma^{2}\eta^{2}L}{2}\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\mathbf{g}_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&=\frac{\gamma^{2}\eta^{2}L}{2}\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\left(\mathbf{g}_{n}(\mathbf{y}_{t,i}^{n})-\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right)+\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\leq\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\left(\mathbf{g}_{n}(\mathbf{y}_{t,i}^{n})-\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right)\right\|^{2}\right] \\
&\quad+\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\overset{(a)}{=}\frac{\gamma^{2}\eta^{2}L}{N^{2}}\sum_{n=1}^{N}\mathbb{E}_{t}\left[\left\|{\rm I\kern-1.99997ptl}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\left(\mathbf{g}_{n}(\mathbf{y}_{t,i}^{n})-\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right)\right\|^{2}\right] \\
&\quad+\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}({\rm I\kern-1.99997ptl}_{t}^{n}-p_{n}+p_{n})\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\overset{(b)}{=}\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}({\rm I\kern-1.99997ptl}_{t}^{n}-p_{n})\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\quad+\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}p_{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\overset{(c)}{=}\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}L}{N^{2}}\sum_{n=1}^{N}\mathbb{E}_{t}\left[\left\|({\rm I\kern-1.99997ptl}_{t}^{n}-p_{n})\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\quad+\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}+\varphi_{n}\right)\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\overset{(d)}{\leq}\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}L}{N^{2}}\sum_{n=1}^{N}p_{n}(1-p_{n})(\omega_{t}^{n})^{2}\mathbb{E}_{t}\left[\left\|\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\quad+2\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]+2\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\leq\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\quad+2\gamma^{2}\eta^{2}LIN\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right] \\
&\quad+2\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right],
\end{aligned}
\tag{B.3.8}
$$

where we note that ${\rm I\kern-1.99997ptl}_{t}^{n}$ follows a Bernoulli distribution with probability $p_{n}$, thus $\mathbb{E}\left[{\rm I\kern-1.99997ptl}_{t}^{n}\right]=p_{n}$ and $\mathrm{Var}\left[{\rm I\kern-1.99997ptl}_{t}^{n}\right]=p_{n}(1-p_{n})$, yielding the relation in $(d)$, which also uses $\left\|\mathbf{a}+\mathbf{b}\right\|^{2}\leq 2\left\|\mathbf{a}\right\|^{2}+2\left\|\mathbf{b}\right\|^{2}$ for the last term.
We also use the independence across different $n$ and $i$ for the stochastic gradients and the independence across $n$ for the client participation random variable ${\rm I\kern-1.99997ptl}_{t}^{n}$, as well as the fact that ${\rm I\kern-1.99997ptl}_{t}^{n}$ and the stochastic gradients are independent of each other, so the local updates (the progression of $\mathbf{y}_{t,i}^{n}$) are independent of ${\rm I\kern-1.99997ptl}_{t}^{n}$ according to the logically equivalent algorithm formulation in Algorithm 3. This independence makes the corresponding inner-product terms zero, giving the results in $(a)$, $(b)$, and $(c)$. The last inequality follows from $p_{n}(1-p_{n})\leq p_{n}$ and $\left\|\sum_{k=1}^{K}\mathbf{a}_{k}\right\|^{2}\leq K\sum_{k=1}^{K}\left\|\mathbf{a}_{k}\right\|^{2}$.
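The role of the Bernoulli variance and the vanishing cross terms can be illustrated with a small exact computation. The sketch below is only an illustration under simplifying assumptions: the per-client vectors are fixed (deterministic) stand-ins for $\omega_{t}^{n}\sum_{i}\nabla F_{n}(\mathbf{y}_{t,i}^{n})$, whereas in the proof they are random but independent of the participation indicators. The function names and the numerical values are hypothetical.

```python
from itertools import product

def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)

def expected_sq_norm(p, vecs):
    """Exact E|| sum_n (1_n - p_n) v_n ||^2 for independent 1_n ~ Bernoulli(p_n).

    The expectation is computed by enumerating all 2^N participation outcomes.
    """
    dim = len(vecs[0])
    total = 0.0
    for outcome in product([0, 1], repeat=len(p)):
        # Probability of this joint outcome under independence across clients.
        prob = 1.0
        for bit, pn in zip(outcome, p):
            prob *= pn if bit else (1.0 - pn)
        # Centered, indicator-weighted sum of the client vectors.
        s = [sum((bit - pn) * v[d] for bit, pn, v in zip(outcome, p, vecs))
             for d in range(dim)]
        total += prob * sq_norm(s)
    return total

def variance_weighted_sum(p, vecs):
    """sum_n p_n (1 - p_n) ||v_n||^2, i.e. the value once cross terms vanish."""
    return sum(pn * (1.0 - pn) * sq_norm(v) for pn, v in zip(p, vecs))

# Hypothetical participation probabilities and fixed client vectors.
p = [0.2, 0.5, 0.9]
vecs = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]

lhs = expected_sq_norm(p, vecs)
rhs = variance_weighted_sum(p, vecs)
print(lhs, rhs)
```

Because the centered indicators have zero mean and are independent across clients, all cross terms cancel in the enumeration and the two printed values should coincide up to rounding.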

For $\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$, which appears in both (B.3.7) and (B.3.8), we note that

𝔼t[Fn(𝐲t,in)2]subscript𝔼𝑡delimited-[]superscriptnormsubscript𝐹𝑛superscriptsubscript𝐲𝑡𝑖𝑛2\displaystyle\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})% \right\|^{2}\right]blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
=𝔼t[Fn(𝐲t,in)Fn(𝐱t)+Fn(𝐱t)f~(𝐱t)+f~(𝐱t)2]absentsubscript𝔼𝑡delimited-[]superscriptnormsubscript𝐹𝑛superscriptsubscript𝐲𝑡𝑖𝑛subscript𝐹𝑛subscript𝐱𝑡subscript𝐹𝑛subscript𝐱𝑡~𝑓subscript𝐱𝑡~𝑓subscript𝐱𝑡2\displaystyle=\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})-% \nabla F_{n}(\mathbf{x}_{t})+\nabla F_{n}(\mathbf{x}_{t})-\nabla\tilde{f}(% \mathbf{x}_{t})+\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]= blackboard_E start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT [ ∥ ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_y start_POSTSUBSCRIPT italic_t , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) - ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ italic_F start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) - ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ]
$\displaystyle \leq 3\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})-\nabla F_{n}(\mathbf{x}_{t})\right\|^{2}\right]+3\left\|\nabla F_{n}(\mathbf{x}_{t})-\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}+3\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \leq 3L^{2}\mathbb{E}_{t}\left[\left\|\mathbf{y}_{t,i}^{n}-\mathbf{x}_{t}\right\|^{2}\right]+3\tilde{\delta}^{2}+3\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \leq 15L^{2}I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+3\tilde{\delta}^{2}+\left(90L^{2}I^{2}\gamma^{2}+3\right)\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}.$ (B.3.9)
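
The first inequality in (B.3.9) uses the elementary bound $\|a+b+c\|^{2}\leq 3\|a\|^{2}+3\|b\|^{2}+3\|c\|^{2}$, and the last line collects the coefficients of $\sigma^{2}$, $\tilde{\delta}^{2}$, and $\|\nabla\tilde{f}(\mathbf{x}_{t})\|^{2}$. The following is a minimal numerical sanity check of both facts, not part of the proof. The function names and the values of `L`, `I`, and `gamma` are illustrative assumptions; in particular, the step size $\gamma = 1/(4\sqrt{15}LI)$ is assumed here only because it makes the coefficients match the constants $27/8$ and $1/(16I)$ that appear in the combined bound.

```python
import numpy as np


def three_term_gap(a, b, c):
    """Return 3(|a|^2 + |b|^2 + |c|^2) - |a + b + c|^2, which should be >= 0."""
    lhs = np.sum((a + b + c) ** 2)
    rhs = 3.0 * (np.sum(a ** 2) + np.sum(b ** 2) + np.sum(c ** 2))
    return rhs - lhs


def b39_coefficients(L, I, gamma):
    """Coefficients of sigma^2, delta~^2 and |grad f~|^2 on the right side of (B.3.9)."""
    c_sigma = 15 * L**2 * I * gamma**2
    # 15 L^2 I gamma^2 * 6 I + 3 = 90 L^2 I^2 gamma^2 + 3
    c_delta = 90 * L**2 * I**2 * gamma**2 + 3
    c_grad = 90 * L**2 * I**2 * gamma**2 + 3
    return c_sigma, c_delta, c_grad


# Check the three-term inequality on random vectors.
rng = np.random.default_rng(0)
gaps = [three_term_gap(*rng.normal(size=(3, 5))) for _ in range(1000)]
min_gap = min(gaps)

# Hypothetical problem constants; gamma is an assumed step size (see lead-in).
L, I = 2.0, 5
gamma = 1.0 / (4.0 * np.sqrt(15.0) * L * I)

print("smallest gap over random trials:", min_gap)
print("coefficients (sigma^2, delta~^2, grad norm):", b39_coefficients(L, I, gamma))
```

Under the assumed step size, $L^{2}I^{2}\gamma^{2}=1/240$, so the $\tilde{\delta}^{2}$ and gradient-norm coefficients both equal $3+3/8$ and the $\sigma^{2}$ coefficient equals $1/(16I)$, up to floating-point rounding.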

Combining (B.3.7), (B.3.8), (B.3.9) gives

$\displaystyle -\gamma\eta\mathbb{E}_{t}\left[\left\langle\nabla\tilde{f}(\mathbf{x}_{t}),\frac{1}{N}\sum_{n=1}^{N}p_{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\rangle\right]+\frac{\gamma^{2}\eta^{2}L}{2}\mathbb{E}_{t}\left[\left\|\frac{1}{N}\sum_{n=1}^{N}{\rm I\kern-.2em l}_{t}^{n}\omega_{t}^{n}\sum_{i=0}^{I-1}\mathbf{g}_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \leq\gamma\eta N\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]+\frac{\gamma\eta L^{2}}{2}\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\mathbf{y}_{t,i}^{n}-\mathbf{x}_{t}\right\|^{2}\right]$
$\displaystyle \quad-\frac{\gamma\eta}{2I}\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]-\frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \quad+2\gamma^{2}\eta^{2}LIN\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \quad+2\gamma^{2}\eta^{2}L\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \leq\gamma\eta\left(1+2\gamma\eta LI\right)N\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma\eta L^{2}}{2}\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\mathbf{y}_{t,i}^{n}-\mathbf{x}_{t}\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \quad-\left(\frac{\gamma\eta}{2I}-2\gamma^{2}\eta^{2}L\right)\mathbb{E}_{t}\left[\left\|\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]-\frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \overset{(a)}{\leq}\gamma\eta\left(1+2\gamma\eta LI\right)N\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma\eta L^{2}}{2}\sum_{n=1}^{N}\varphi_{n}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\mathbf{y}_{t,i}^{n}-\mathbf{x}_{t}\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\sum_{i=0}^{I-1}\mathbb{E}_{t}\left[\left\|\nabla F_{n}(\mathbf{y}_{t,i}^{n})\right\|^{2}\right]-\frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \overset{(b)}{\leq}\gamma\eta\left(1+2\gamma\eta LI\right)IN\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\left[15L^{2}I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+3\tilde{\delta}^{2}+\left(90L^{2}I^{2}\gamma^{2}+3\right)\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma\eta L^{2}I}{2}\left[5I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+30I^{2}\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\left[15L^{2}I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+3\tilde{\delta}^{2}+\left(90L^{2}I^{2}\gamma^{2}+3\right)\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad-\frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \overset{(c)}{\leq}\frac{3\gamma\eta IN}{2}\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}\left[15L^{2}I\gamma^{2}\sigma^{2}+\frac{27\tilde{\delta}^{2}}{8}+\frac{27}{8}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma\eta L^{2}I}{2}\left[5I\gamma^{2}(\sigma^{2}+6I\tilde{\delta}^{2})+30I^{2}\gamma^{2}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad+\frac{\gamma^{2}\eta^{2}LI\sigma^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}+\frac{\gamma^{2}\eta^{2}LI^{2}}{N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\left[\frac{\sigma^{2}}{16I}+\frac{27\tilde{\delta}^{2}}{8}+\frac{27}{8}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}\right]$
$\displaystyle \quad-\frac{\gamma\eta I}{4}\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2}$
$\displaystyle \leq\frac{3\gamma\eta IN}{2}\left(15L^{2}I\gamma^{2}\sigma^{2}+\frac{27\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}$
$\displaystyle \quad+\frac{5\gamma^{3}\eta L^{2}I^{2}}{2}(\sigma^{2}+6I\tilde{\delta}^{2})+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}$
+γηI[81N16n=1N(pnωtnNφn)2+15L2I2γ2+27γηLI8N2n=1Npn(ωtn)214]f~(𝐱t)2,𝛾𝜂𝐼delimited-[]81𝑁16superscriptsubscript𝑛1𝑁superscriptsubscript𝑝𝑛superscriptsubscript𝜔𝑡𝑛𝑁subscript𝜑𝑛215superscript𝐿2superscript𝐼2superscript𝛾227𝛾𝜂𝐿𝐼8superscript𝑁2superscriptsubscript𝑛1𝑁subscript𝑝𝑛superscriptsuperscriptsubscript𝜔𝑡𝑛214superscriptnorm~𝑓subscript𝐱𝑡2\displaystyle\quad+\gamma\eta I\left[\frac{81N}{16}\sum_{n=1}^{N}\left(\frac{p% _{n}\omega_{t}^{n}}{N}-\varphi_{n}\right)^{2}+15L^{2}I^{2}\gamma^{2}+\frac{27% \gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}% \right]\cdot\left\|\nabla\tilde{f}(\mathbf{x}_{t})\right\|^{2},+ italic_γ italic_η italic_I [ divide start_ARG 81 italic_N end_ARG start_ARG 16 end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( divide start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_ARG start_ARG italic_N end_ARG - italic_φ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 15 italic_L start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 27 italic_γ italic_η italic_L italic_I end_ARG start_ARG 8 italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_ω start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - divide start_ARG 1 end_ARG start_ARG 4 end_ARG ] ⋅ ∥ ∇ over~ start_ARG italic_f end_ARG ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (B.3.10)

where $(a)$ uses $\gamma\eta\leq\frac{1}{4LI}$, $(b)$ uses Lemma B.3.2 and (B.3.9), and $(c)$ uses $\gamma\eta\leq\frac{1}{4LI}$ and $\gamma\leq\frac{1}{4\sqrt{15}LI}$. The final result is obtained by plugging (B.3.10) into (B.3.6). ∎

B.4 Formal Version and Proof of Theorem 1

We first state the formal version of Theorem 1 as follows.

Theorem B.4.1 (Objective minimized at convergence, formal).

Define an alternative objective function as

\begin{align}
h(\mathbf{x}):=\frac{1}{P}\sum_{n=1}^{N}\omega_{n}p_{n}F_{n}(\mathbf{x}), \tag{B.4.1}
\end{align}

where $P:=\sum_{n=1}^{N}\omega_{n}p_{n}$. Under Assumptions 1–4, when $\omega_{t}^{n}=\omega_{n}$ with some $\omega_{n}>0$ for all $t$ and $n$, choosing $\gamma\leq\frac{c}{\sqrt{T}}$ and $\eta>0$ so that $\gamma\eta=\frac{c^{\prime}}{\sqrt{T}}$ for some constants $c,c^{\prime}>0$, there exists a sufficiently large $T$ so that the result $\{\mathbf{x}_{t}\}$ obtained from Algorithm 1 satisfies $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla h(\mathbf{x}_{t})\right\|^{2}\right]\leq\epsilon^{2}$ for any $\epsilon>0$.

Proof.

According to Algorithm 3, the result remains the same when we replace $\eta$ and $\omega_{n}$ (thus $\omega_{t}^{n}$) with $\tilde{\eta}$ and $\tilde{\alpha}_{n}$, respectively, while keeping the product $\tilde{\eta}\tilde{\alpha}_{n}=\eta\omega_{n}$. We choose $\tilde{\alpha}_{n}=\frac{\omega_{n}N}{P}$ and $\tilde{\eta}=\frac{\eta P}{N}$. Then, we choose $\varphi_{n}=\frac{\omega_{n}p_{n}}{P}=\frac{p_{n}\tilde{\alpha}_{n}}{N}$ in Lemma B.3.3. We can see that this choice satisfies $\sum_{n=1}^{N}\varphi_{n}=1$, so Lemma B.3.3 holds after replacing $\eta$ and $\omega_{t}^{n}$ in the lemma with $\tilde{\eta}$ and $\tilde{\alpha}_{n}$, respectively, and $\tilde{f}(\mathbf{x})$ in Lemma B.3.3 is equal to $h(\mathbf{x})$ defined in Theorem B.4.1 with this choice of $\varphi_{n}$. Therefore,

\begin{align}
\mathbb{E}_{t}\left[h(\mathbf{x}_{t+1})\right]
&\leq h(\mathbf{x}_{t})+\frac{5\gamma^{3}\tilde{\eta}L^{2}I^{2}}{2}(\sigma^{2}+6I\tilde{\delta}^{2})+\frac{\gamma^{2}\tilde{\eta}^{2}LI}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2} \nonumber\\
&\quad+\gamma\tilde{\eta}I\left[15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\tilde{\eta}LI}{8N^{2}}\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2}-\frac{1}{4}\right]\cdot\left\|\nabla h(\mathbf{x}_{t})\right\|^{2}. \tag{B.4.2}
\end{align}
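The change of variables used at the start of this proof can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: the helper name `rescale` and the random test values are hypothetical, chosen only to illustrate that the product $\tilde{\eta}\tilde{\alpha}_{n}$ is preserved, that $\varphi_{n}$ sums to one, and that the weight-error term $\frac{p_{n}\tilde{\alpha}_{n}}{N}-\varphi_{n}$ of Lemma B.3.3 vanishes, which is why that term is absent from (B.4.2).

```python
import numpy as np

def rescale(omega, p, eta):
    """Apply the change of variables from the proof (hypothetical helper name)."""
    omega = np.asarray(omega, dtype=float)
    p = np.asarray(p, dtype=float)
    N = omega.size
    P = float(np.sum(omega * p))      # P := sum_n omega_n * p_n
    alpha_tilde = omega * N / P       # alpha~_n = omega_n * N / P
    eta_tilde = eta * P / N           # eta~ = eta * P / N
    phi = omega * p / P               # phi_n = omega_n * p_n / P
    return alpha_tilde, eta_tilde, phi

# Illustrative values only; any positive weights and probabilities would do.
rng = np.random.default_rng(0)
N = 5
omega = rng.uniform(0.5, 3.0, size=N)
p = rng.uniform(0.1, 1.0, size=N)
eta = 0.7

alpha_tilde, eta_tilde, phi = rescale(omega, p, eta)
# Weight-error term of Lemma B.3.3 under the rescaled quantities.
weight_error = np.sum((p * alpha_tilde / N - phi) ** 2)
```

Here `weight_error` is zero up to floating-point rounding for any positive `omega` and `p`, since $\frac{p_{n}\tilde{\alpha}_{n}}{N}=\frac{\omega_{n}p_{n}}{P}=\varphi_{n}$ by construction.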

Because $\gamma\leq\frac{c}{\sqrt{T}}$ and $\gamma\tilde{\eta}=\frac{c^{\prime\prime}}{\sqrt{T}}$ with $c^{\prime\prime}:=\frac{c^{\prime}P}{N}$ (which follows from $\tilde{\eta}=\frac{\eta P}{N}$ and $\gamma\eta=\frac{c^{\prime}}{\sqrt{T}}$), there exists a sufficiently large $T$ so that $15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\tilde{\eta}LI}{8N^{2}}\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2}\leq\frac{1}{8}$. In this case, after taking the total expectation of (B.4.2) and rearranging, we have

\begin{align}
\mathbb{E}\left[\left\|\nabla h(\mathbf{x}_{t})\right\|^{2}\right]
&\leq\frac{8\left(\mathbb{E}\left[h(\mathbf{x}_{t})\right]-\mathbb{E}\left[h(\mathbf{x}_{t+1})\right]\right)}{\gamma\tilde{\eta}I}+20\gamma^{2}L^{2}I(\sigma^{2}+6I\tilde{\delta}^{2})+\frac{8\gamma\tilde{\eta}L}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2}. \tag{B.4.3}
\end{align}

Then, summing up over $T$ rounds and dividing by $T$, we have

\begin{align}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left[\left\|\nabla h(\mathbf{x}_{t})\right\|^{2}\right]
&\leq\frac{8\mathcal{H}}{\gamma\tilde{\eta}IT}+20\gamma^{2}L^{2}I(\sigma^{2}+6I\tilde{\delta}^{2})+\frac{8\gamma\tilde{\eta}L}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\tilde{\delta}^{2}}{8}\right)\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2}, \tag{B.4.4}
\end{align}

where $\mathcal{H}:=h(\mathbf{x}_{0})-h^{*}$, with $h^{*}:=\min_{\mathbf{x}}h(\mathbf{x})$ denoting the true minimum value.
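For completeness, the first term of (B.4.4) comes from a telescoping sum. The short derivation below is a sketch that assumes the initial point $\mathbf{x}_{0}$ is deterministic, so that $\mathbb{E}\left[h(\mathbf{x}_{0})\right]=h(\mathbf{x}_{0})$, and uses $\mathbb{E}\left[h(\mathbf{x}_{T})\right]\geq h^{*}$:

\begin{align*}
\frac{1}{T}\sum_{t=0}^{T-1}\frac{8\left(\mathbb{E}\left[h(\mathbf{x}_{t})\right]-\mathbb{E}\left[h(\mathbf{x}_{t+1})\right]\right)}{\gamma\tilde{\eta}I}
=\frac{8\left(h(\mathbf{x}_{0})-\mathbb{E}\left[h(\mathbf{x}_{T})\right]\right)}{\gamma\tilde{\eta}IT}
\leq\frac{8\left(h(\mathbf{x}_{0})-h^{*}\right)}{\gamma\tilde{\eta}IT}
=\frac{8\mathcal{H}}{\gamma\tilde{\eta}IT}.
\end{align*}

The remaining two terms of (B.4.3) do not depend on $t$, so averaging over $T$ rounds leaves them unchanged.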

Since $\gamma\leq\frac{c}{\sqrt{T}}$ and $\gamma\tilde{\eta}=\frac{c^{\prime\prime}}{\sqrt{T}}$, we can see that the upper bound above converges to zero as $T\rightarrow\infty$. Thus, for an arbitrary $\epsilon>0$, there exists a sufficiently large $T$ for which the upper bound is at most $\epsilon^{2}$. ∎
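To make the decay of the bound concrete, the right-hand side of (B.4.4) can be evaluated for growing $T$. This is only an illustrative sketch: the function name `bound_b44`, the choice $\gamma=\frac{c}{\sqrt{T}}$ (the largest value the theorem allows), and all constant values are assumptions made for the example, and `S` stands for $\sum_{n=1}^{N}p_{n}\tilde{\alpha}_{n}^{2}$.

```python
def bound_b44(T, c, c2, H, L, I, N, sigma2, delta2, S):
    """Right-hand side of (B.4.4) with gamma = c/sqrt(T) and gamma*eta~ = c2/sqrt(T).

    sigma2 and delta2 denote sigma^2 and delta~^2; S denotes sum_n p_n * alpha~_n^2.
    All arguments are illustrative placeholders, not values from the paper.
    """
    gamma = c / T ** 0.5
    gamma_eta = c2 / T ** 0.5
    term_init = 8 * H / (gamma_eta * I * T)                       # O(1/sqrt(T))
    term_drift = 20 * gamma ** 2 * L ** 2 * I * (sigma2 + 6 * I * delta2)  # O(1/T)
    term_noise = (8 * gamma_eta * L / N ** 2
                  * (17 * sigma2 / 16 + 27 * I * delta2 / 8) * S)  # O(1/sqrt(T))
    return term_init + term_drift + term_noise

# Hypothetical constants, chosen only to show the trend in T.
params = dict(c=1.0, c2=1.0, H=1.0, L=1.0, I=5, N=10,
              sigma2=1.0, delta2=1.0, S=20.0)
horizons = [10 ** k for k in range(2, 9)]
values = [bound_b44(T, **params) for T in horizons]
```

With these placeholder constants, `values` decreases monotonically in $T$; two of the three terms scale as $1/\sqrt{T}$ and one as $1/T$, so the overall bound shrinks at rate $\mathcal{O}(1/\sqrt{T})$.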

B.5 Proof of Theorem 2

We first present the following variant of the descent lemma for the original objective defined in (1).

Lemma B.5.1 (Descent lemma for original objective).

Under the same conditions as in Lemma B.3.3,

\begin{align}
\mathbb{E}_{t}\left[f(\mathbf{x}_{t+1})\right]
&\leq f(\mathbf{x}_{t})+\frac{3\gamma\eta I}{2N}\left(15L^{2}I\gamma^{2}\sigma^{2}+\frac{27\delta^{2}}{8}\right)\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}+\frac{5\gamma^{3}\eta L^{2}I^{2}}{2}(\sigma^{2}+6I\delta^{2}) \nonumber\\
&\quad+\frac{\gamma^{2}\eta^{2}LI}{N^{2}}\left(\frac{17\sigma^{2}}{16}+\frac{27I\delta^{2}}{8}\right)\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2} \nonumber\\
&\quad+\gamma\eta I\left[\frac{81}{16N}\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}+15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}\right]\cdot\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}. \tag{B.5.1}
\end{align}
Proof.

The result follows immediately from Lemma B.3.3 by choosing $\varphi_{n}=\frac{1}{N}$ in (B.3.1) and noting that, with this choice of $\varphi_{n}$, the gradient divergence bound holds with $\delta$ according to Assumption 3. ∎
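The substitution $\varphi_{n}=\frac{1}{N}$ changes the form of the weight-error sum: $N\sum_{n}\left(\frac{p_{n}\omega_{t}^{n}}{N}-\frac{1}{N}\right)^{2}=\frac{1}{N}\sum_{n}\left(p_{n}\omega_{t}^{n}-1\right)^{2}$, which is why the coefficients $\frac{3\gamma\eta IN}{2}$ and $\frac{81N}{16}$ of Lemma B.3.3 become $\frac{3\gamma\eta I}{2N}$ and $\frac{81}{16N}$ in (B.5.1). A minimal numerical sketch of this identity follows; the function names and random inputs are hypothetical and serve only as a check.

```python
import numpy as np

def weight_error_general(p, omega, phi):
    """N * sum_n (p_n*omega_n/N - phi_n)^2, the form used in Lemma B.3.3."""
    p, omega, phi = (np.asarray(a, dtype=float) for a in (p, omega, phi))
    N = p.size
    return N * np.sum((p * omega / N - phi) ** 2)

def weight_error_uniform(p, omega):
    """(1/N) * sum_n (p_n*omega_n - 1)^2, the form used in (B.5.1)."""
    p = np.asarray(p, dtype=float)
    omega = np.asarray(omega, dtype=float)
    return np.sum((p * omega - 1.0) ** 2) / p.size

# Illustrative participation probabilities and aggregation weights.
rng = np.random.default_rng(1)
N = 8
p = rng.uniform(0.05, 1.0, size=N)
omega = rng.uniform(0.5, 10.0, size=N)

lhs = weight_error_general(p, omega, np.full(N, 1.0 / N))
rhs = weight_error_uniform(p, omega)
# With omega_n = 1/p_n the error term vanishes.
zero_error = weight_error_uniform(p, 1.0 / p)
```

The last line illustrates that the weight-error term disappears when $\omega_{t}^{n}=\frac{1}{p_{n}}$, so that $p_{n}\omega_{t}^{n}=1$ for every client.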

Proof of Theorem 2.

Consider the last term in Lemma B.5.1. Because $\gamma\leq\frac{1}{4\sqrt{15}LI}$ and $\gamma\eta\leq\min\left\{\frac{1}{4LI};\frac{N}{54LIQ}\right\}$, as specified in the theorem, we have $15L^{2}I^{2}\gamma^{2}\leq\frac{1}{16}$ and $\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\leq\frac{1}{16}$.
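The two $\frac{1}{16}$ bounds can be spelled out as follows. The first uses only the condition on $\gamma$. The second is a sketch under the assumption that $Q$, which is defined in the theorem statement and not repeated in this section, satisfies $\frac{1}{N}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}\leq Q$ for all $t$:

\begin{align*}
15L^{2}I^{2}\gamma^{2}&\leq15L^{2}I^{2}\cdot\frac{1}{16\cdot15\,L^{2}I^{2}}=\frac{1}{16},\\
\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}&\leq\frac{27LI}{8N^{2}}\cdot\frac{N}{54LIQ}\cdot NQ=\frac{27}{432}=\frac{1}{16}.
\end{align*}

Together these give $15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}\leq-\frac{1}{8}$, which is the source of the $-\frac{\gamma\eta I}{8}\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}$ term in the bounds that follow.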

Case 1: When assuming $\left\|\nabla f(\mathbf{x})\right\|^{2}\leq G^{2}$, we have

\begin{align}
&\gamma\eta I\left[\frac{81}{16N}\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}+15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}\right]\cdot\left\|\nabla f(\mathbf{x}_{t})\right\|^{2} \nonumber\\
&\leq\frac{81\gamma\eta IG^{2}}{16N}\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}-\frac{\gamma\eta I}{8}\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}. \tag{B.5.2}
\end{align}

Plugging back into Lemma B.5.1, after taking total expectation and rearranging, we obtain

\begin{align*}
&\mathbb{E}\left[\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}\right] \\
&\leq\frac{8\left(\mathbb{E}\left[f(\mathbf{x}_{t})\right]-\mathbb{E}\left[f(\mathbf{x}_{t+1})\right]\right)}{\gamma\eta I}+\frac{12}{N}\left(15L^{2}I\gamma^{2}\sigma^{2}+\frac{27(\delta^{2}+G^{2})}{8}\right)\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2} \\
&\quad+20\gamma^{2}L^{2}I(\sigma^{2}+6I\delta^{2})+\frac{\gamma\eta L}{N^{2}}\left(\frac{17\sigma^{2}}{2}+27I\delta^{2}\right)\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}. \tag{B.5.3}
\end{align*}

Case 2: When assuming $\frac{1}{N}\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}\leq\frac{1}{81}$, we have

\begin{align*}
&\gamma\eta I\left[\frac{81}{16N}\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2}+15L^{2}I^{2}\gamma^{2}+\frac{27\gamma\eta LI}{8N^{2}}\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}-\frac{1}{4}\right]\cdot\left\|\nabla f(\mathbf{x}_{t})\right\|^{2} \\
&\leq-\frac{\gamma\eta I}{16}\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}. \tag{B.5.4}
\end{align*}

Plugging back into Lemma B.5.1, after taking total expectation and rearranging, we obtain

\begin{align*}
&\mathbb{E}\left[\left\|\nabla f(\mathbf{x}_{t})\right\|^{2}\right] \\
&\leq\frac{16\left(\mathbb{E}\left[f(\mathbf{x}_{t})\right]-\mathbb{E}\left[f(\mathbf{x}_{t+1})\right]\right)}{\gamma\eta I}+\frac{24}{N}\left(15L^{2}I\gamma^{2}\sigma^{2}+\frac{27\delta^{2}}{8}\right)\sum_{n=1}^{N}\left(p_{n}\omega_{t}^{n}-1\right)^{2} \\
&\quad+40\gamma^{2}L^{2}I(\sigma^{2}+6I\delta^{2})+\frac{\gamma\eta L}{N^{2}}\left(17\sigma^{2}+54I\delta^{2}\right)\sum_{n=1}^{N}p_{n}(\omega_{t}^{n})^{2}. \tag{B.5.5}
\end{align*}

The final result is obtained by summing up either (B.5.3) or (B.5.5) over $T$ rounds and dividing by $T$, choosing $\Psi_{G}$ accordingly for each case, and absorbing the constants in $\mathcal{O}(\cdot)$ notation. ∎

B.6 Proof of Theorem 3

We start by analyzing the statistical properties of the possibly cutoff participation interval $S_{n}$. Because in every round, each client $n$ participates according to a Bernoulli distribution with probability $p_{n}$, the random variable $S_{n}$ has the following probability distribution:

\begin{align*}
\Pr\{S_{n}=k\}=\begin{cases}p_{n}(1-p_{n})^{k-1}, & \textrm{if } 1\leq k<K\\ (1-p_{n})^{k-1}, & \textrm{if } k=K\end{cases}, \tag{B.6.1}
\end{align*}

which is a “cutoff” geometric distribution with a maximum value of $K$. We will refer to this probability distribution as the $K$-cutoff geometric distribution. When $K\rightarrow\infty$, this distribution becomes the same as the geometric distribution, but we consider the general case with an arbitrary $K$ that is specified later. We also recall that the actual value of $p_{n}$ is unknown to the system, which is why we need to compute $\{\omega_{t}^{n}\}$ using the estimation procedure in Algorithm 2.
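To make the $K$-cutoff geometric distribution concrete, the following minimal sketch simulates the participation interval and compares the empirical frequencies with (B.6.1). The function names (`cutoff_pmf`, `sample_interval`, `empirical_pmf`) and the example values of $p_n$ and $K$ are illustrative assumptions, not part of the paper's code, and the sampler only models the Bernoulli participation described above rather than the full procedure in Algorithm 2.

```python
import random
from collections import Counter


def cutoff_pmf(p, K):
    """Exact PMF of the K-cutoff geometric distribution, as in (B.6.1)."""
    pmf = {k: p * (1 - p) ** (k - 1) for k in range(1, K)}
    pmf[K] = (1 - p) ** (K - 1)  # all remaining mass sits at the cutoff
    return pmf


def sample_interval(p, K, rng):
    """Count rounds until the client participates, stopping at the cutoff K."""
    for k in range(1, K):
        if rng.random() < p:  # Bernoulli(p) participation in round k
            return k
    return K  # no participation in the first K - 1 rounds


def empirical_pmf(p, K, num_samples=200_000, seed=0):
    """Estimate the PMF of the interval by Monte Carlo simulation."""
    rng = random.Random(seed)
    counts = Counter(sample_interval(p, K, rng) for _ in range(num_samples))
    return {k: counts[k] / num_samples for k in range(1, K + 1)}


if __name__ == "__main__":
    p, K = 0.3, 5  # hypothetical participation probability and cutoff
    exact = cutoff_pmf(p, K)
    approx = empirical_pmf(p, K)
    for k in range(1, K + 1):
        print(f"k={k}: exact={exact[k]:.4f}, simulated={approx[k]:.4f}")
```

The simulated frequencies should agree with the exact values up to sampling noise, with the mass at $k=K$ collecting every run in which the client did not participate during the first $K-1$ rounds.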

Lemma B.6.1.

Equation (B.6.1) defines a probability distribution, and the mean and variance of $S_{n}$ are

\begin{align*}
\mathbb{E}\left[S_{n}\right]=\frac{1}{p_{n}}-\frac{(1-p_{n})^{K}}{p_{n}};\quad\mathrm{Var}\left[S_{n}\right]=\frac{1-p_{n}}{p_{n}^{2}}-\frac{(2K-1)(1-p_{n})^{K}}{p_{n}}-\frac{(1-p_{n})^{2K}}{p_{n}^{2}}. \tag{B.6.2}
\end{align*}
Proof.

We first show that (B.6.1) defines a probability distribution, i.e., that $\sum_{k=1}^{K}\Pr\{S_{n}=k\}=1$ for any $K$. We prove this by induction. Let $S_{n}$ and $S^{\prime}_{n}$ denote random variables following the $K$-cutoff and $(K+1)$-cutoff geometric distributions, respectively. For $K=2$, we have $\sum_{k=1}^{K}\Pr\{S_{n}=k\}=p_{n}+(1-p_{n})=1$. Now assume that $\sum_{k=1}^{K}\Pr\{S_{n}=k\}=1$ holds for a certain value of $K$. For the $(K+1)$-cutoff distribution, we first note that according to (B.6.1),

\begin{align*}
\Pr\{S^{\prime}_{n}=k\}=\begin{cases}\Pr\{S_{n}=k\}, & 1\leq k<K\\ p_{n}\cdot\Pr\{S_{n}=K\}, & k=K\\ (1-p_{n})\cdot\Pr\{S_{n}=K\}, & k=K+1\end{cases}.
\end{align*}

Therefore,

\begin{align*}
\sum_{k=1}^{K+1}\Pr\{S^{\prime}_{n}=k\} &=\sum_{k=1}^{K-1}\Pr\{S_{n}=k\}+p_{n}\cdot\Pr\{S_{n}=K\}+(1-p_{n})\cdot\Pr\{S_{n}=K\} \\
&=\sum_{k=1}^{K}\Pr\{S_{n}=k\} \\
&=1.
\end{align*}

This shows that $\Pr\{S_{n}=k\}$ defined in (B.6.1) is a probability distribution.
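As a cross-check on the induction, the normalization can also be seen directly from the finite geometric sum. This is an added remark rather than part of the original argument, and it assumes $p_{n}>0$ so that the division is valid:

\begin{align*}
\sum_{k=1}^{K}\Pr\{S_{n}=k\} &= p_{n}\sum_{k=1}^{K-1}(1-p_{n})^{k-1}+(1-p_{n})^{K-1} \\
&= p_{n}\cdot\frac{1-(1-p_{n})^{K-1}}{p_{n}}+(1-p_{n})^{K-1} \\
&= 1.
\end{align*}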

In the following, we derive the mean and variance of $S_{n}$, where we use $\frac{dy}{dx}$ to denote the derivative of $y$ with respect to $x$.

We have

\begin{align*}
\mathbb{E}\left[S_{n}\right] &=\sum_{k=1}^{K-1}kp_{n}(1-p_{n})^{k-1}+K(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(-\sum_{k=1}^{K-1}(1-p_{n})^{k}\right)\right]+K(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(\frac{(1-p_{n})\left((1-p_{n})^{K-1}-1\right)}{p_{n}}\right)\right]+K(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(\frac{(1-p_{n})^{K}-1+p_{n}}{p_{n}}\right)\right]+K(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(\frac{(1-p_{n})^{K}}{p_{n}}-\frac{1}{p_{n}}+1\right)\right]+K(1-p_{n})^{K-1} \\
&=p_{n}\left[-\frac{K(1-p_{n})^{K-1}p_{n}+(1-p_{n})^{K}}{p_{n}^{2}}+\frac{1}{p_{n}^{2}}\right]+K(1-p_{n})^{K-1} \\
&=\frac{1}{p_{n}}-\frac{(1-p_{n})^{K}}{p_{n}}, \tag{B.6.3}
\end{align*}

which gives the expression for the expected value.
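The closed form above, and the variance stated in (B.6.2), can be checked numerically against direct sums over the distribution in (B.6.1). The sketch below uses exact rational arithmetic so that the comparison is an equality rather than a floating-point approximation. The function names and the test grid of $p_{n}$ and $K$ values are illustrative assumptions.

```python
from fractions import Fraction


def cutoff_pmf(p, K):
    """Exact PMF of the K-cutoff geometric distribution, as in (B.6.1)."""
    pmf = {k: p * (1 - p) ** (k - 1) for k in range(1, K)}
    pmf[K] = (1 - p) ** (K - 1)
    return pmf


def moments_from_pmf(p, K):
    """Mean and variance computed by summing directly over the PMF."""
    pmf = cutoff_pmf(p, K)
    mean = sum(k * w for k, w in pmf.items())
    second_moment = sum(k * k * w for k, w in pmf.items())
    return mean, second_moment - mean ** 2


def closed_form_mean(p, K):
    """Mean of S_n from (B.6.2): 1/p - (1-p)^K / p."""
    return (1 - (1 - p) ** K) / p


def closed_form_variance(p, K):
    """Variance of S_n from (B.6.2)."""
    q = 1 - p
    return q / p ** 2 - (2 * K - 1) * q ** K / p - q ** (2 * K) / p ** 2


if __name__ == "__main__":
    # Hypothetical grid of participation probabilities and cutoff values.
    for p in (Fraction(1, 10), Fraction(1, 3), Fraction(9, 10)):
        for K in (1, 2, 5, 10):
            mean, var = moments_from_pmf(p, K)
            assert mean == closed_form_mean(p, K)
            assert var == closed_form_variance(p, K)
    print("closed forms in (B.6.2) match the direct sums on the test grid")
```

Such a check does not replace the derivation, but it is a quick guard against sign or exponent slips when reusing the formulas.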

To compute the variance, we note that

\begin{align*}
&\mathbb{E}\left[S_{n}(S_{n}-1)\right] \\
&=\sum_{k=1}^{K-1}k(k-1)p_{n}(1-p_{n})^{k-1}+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(-\sum_{k=1}^{K-1}(k-1)(1-p_{n})^{k}\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(-(1-p_{n})^{2}\sum_{k=1}^{K-1}(k-1)(1-p_{n})^{k-2}\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(-(1-p_{n})^{2}\frac{d}{dp_{n}}\left(-\sum_{k=1}^{K-1}(1-p_{n})^{k-1}\right)\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left((1-p_{n})^{2}\frac{d}{dp_{n}}\left(\frac{1-(1-p_{n})^{K-1}}{p_{n}}\right)\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left((1-p_{n})^{2}\cdot\frac{(K-1)(1-p_{n})^{K-2}p_{n}-1+(1-p_{n})^{K-1}}{p_{n}^{2}}\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\left[\frac{d}{dp_{n}}\left(\frac{(K-1)(1-p_{n})^{K}}{p_{n}}+\frac{(1-p_{n})^{K+1}}{p_{n}^{2}}-\frac{(1-p_{n})^{2}}{p_{n}^{2}}\right)\right]+K(K-1)(1-p_{n})^{K-1} \\
&=p_{n}\Bigg[-\frac{K(K-1)(1-p_{n})^{K-1}p_{n}+(K-1)(1-p_{n})^{K}}{p_{n}^{2}} \\
&\quad\quad\quad-\frac{(K+1)(1-p_{n})^{K}p_{n}^{2}+2(1-p_{n})^{K+1}p_{n}}{p_{n}^{4}} \\
&\quad\quad\quad+\frac{2(1-p_{n})p_{n}^{2}+2(1-p_{n})^{2}p_{n}}{p_{n}^{4}}\Bigg]+K(K-1)(1-p_{n})^{K-1} \\
&=-\frac{(K-1)(1-p_{n})^{K}}{p_{n}}-\frac{(K+1)(1-p_{n})^{K}}{p_{n}}-\frac{2(1-p_{n})^{K+1}}{p_{n}^{2}}+\frac{2(1-p_{n})}{p_{n}}+\frac{2(1-p_{n})^{2}}{p_{n}^{2}}
\end{align*}
=2K(1pn)Kpn2(1pn)(1pn)Kpn2+2(1pn)pn2absent2𝐾superscript1subscript𝑝𝑛𝐾subscript𝑝𝑛21subscript𝑝𝑛superscript1subscript𝑝𝑛𝐾superscriptsubscript𝑝𝑛221subscript𝑝𝑛superscriptsubscript𝑝𝑛2\displaystyle=-\frac{2K(1-p_{n})^{K}}{p_{n}}-\frac{2(1-p_{n})(1-p_{n})^{K}}{p_% {n}^{2}}+\frac{2(1-p_{n})}{p_{n}^{2}}= - divide start_ARG 2 italic_K ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG - divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG
=2K(1pn)Kpn2(1pn)Kpn2+2(1pn)Kpn+2(1pn)pn2absent2𝐾superscript1subscript𝑝𝑛𝐾subscript𝑝𝑛2superscript1subscript𝑝𝑛𝐾superscriptsubscript𝑝𝑛22superscript1subscript𝑝𝑛𝐾subscript𝑝𝑛21subscript𝑝𝑛superscriptsubscript𝑝𝑛2\displaystyle=-\frac{2K(1-p_{n})^{K}}{p_{n}}-\frac{2(1-p_{n})^{K}}{p_{n}^{2}}+% \frac{2(1-p_{n})^{K}}{p_{n}}+\frac{2(1-p_{n})}{p_{n}^{2}}= - divide start_ARG 2 italic_K ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG - divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG + divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG
=2(K1)(1pn)Kpn2(1pn)Kpn2+2(1pn)pn2.absent2𝐾1superscript1subscript𝑝𝑛𝐾subscript𝑝𝑛2superscript1subscript𝑝𝑛𝐾superscriptsubscript𝑝𝑛221subscript𝑝𝑛superscriptsubscript𝑝𝑛2\displaystyle=-\frac{2(K-1)(1-p_{n})^{K}}{p_{n}}-\frac{2(1-p_{n})^{K}}{p_{n}^{% 2}}+\frac{2(1-p_{n})}{p_{n}^{2}}.= - divide start_ARG 2 ( italic_K - 1 ) ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG - divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + divide start_ARG 2 ( 1 - italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_ARG start_ARG italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (B.6.4)

Thus,

\displaystyle \mathbb{E}\left[S_n^{2}\right] = \mathbb{E}\left[S_n(S_n-1)\right] + \mathbb{E}\left[S_n\right]
\displaystyle = -\frac{2(K-1)(1-p_n)^{K}}{p_n} - \frac{2(1-p_n)^{K}}{p_n^{2}} + \frac{2(1-p_n)}{p_n^{2}} + \frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n}
\displaystyle = -\frac{(2K-1)(1-p_n)^{K}}{p_n} - \frac{2(1-p_n)^{K}}{p_n^{2}} + \frac{2-p_n}{p_n^{2}}. (B.6.5)

Therefore,

\displaystyle \mathrm{Var}\left[S_n\right] = \mathbb{E}\left[S_n^{2}\right] - \mathbb{E}\left[S_n\right]^{2}
\displaystyle = -\frac{(2K-1)(1-p_n)^{K}}{p_n} - \frac{2(1-p_n)^{K}}{p_n^{2}} + \frac{2-p_n}{p_n^{2}} - \left( \frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n} \right)^{2}
\displaystyle = -\frac{(2K-1)(1-p_n)^{K}}{p_n} - \frac{2(1-p_n)^{K}}{p_n^{2}} + \frac{2-p_n}{p_n^{2}} - \frac{1}{p_n^{2}} + \frac{2(1-p_n)^{K}}{p_n^{2}} - \frac{(1-p_n)^{2K}}{p_n^{2}}
\displaystyle = -\frac{(2K-1)(1-p_n)^{K}}{p_n} + \frac{1-p_n}{p_n^{2}} - \frac{(1-p_n)^{2K}}{p_n^{2}}, (B.6.6)

which gives the final variance result. ∎
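The closed forms above can be checked numerically. The following is a minimal sketch, assuming that $S_n$ is distributed as $\min\{G, K\}$ with $G$ geometric on $\{1, 2, \ldots\}$ with success probability $p_n$ (an assumption that matches the mean $\frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n}$ quoted in this appendix). The function names are hypothetical and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def truncated_geometric_moments(p, K):
    """Exact mean and variance of S = min(G, K), G ~ Geometric(p) on {1, 2, ...}.

    Assumption: this is the distribution of S_n; it is not restated in this section.
    """
    q = 1 - p
    # P(S = k) = p * q^(k-1) for k < K; the remaining mass q^(K-1) sits at k = K.
    pmf = {k: p * q ** (k - 1) for k in range(1, K)}
    pmf[K] = q ** (K - 1)
    mean = sum(k * w for k, w in pmf.items())
    second_moment = sum(k * k * w for k, w in pmf.items())
    return mean, second_moment - mean ** 2


def closed_form_moments(p, K):
    """Mean used in the proof of Theorem 3 and the variance in (B.6.6)."""
    q = 1 - p
    mean = 1 / p - q ** K / p
    var = -(2 * K - 1) * q ** K / p + q / p ** 2 - q ** (2 * K) / p ** 2
    return mean, var


# Compare the two computations exactly on a small grid of (p, K).
for p in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    for K in (1, 2, 5, 20):
        assert truncated_geometric_moments(p, K) == closed_form_moments(p, K)
```

For $K = 1$ the variable is constant, and both computations return zero variance, which is a quick sanity check on the sign pattern in (B.6.6).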

Now, we are ready to obtain an upper bound of the weight error term.

Proof of Theorem 3.

Case 1: According to Algorithm 2, we have $\omega_t^n = 1$ in the initial rounds before the first participation has occurred. This includes at least one round ($t = 0$) and at most $K$ rounds. In these initial rounds, we have

\displaystyle \mathbb{E}\left[\left(p_n \omega_t^n - 1\right)^{2}\right] \leq 1. (B.6.7)

Case 2: For all the other rounds, $\omega_t^n$ is estimated based on at least one sample of $S_n$. Therefore, using the mean and variance expressions from Lemma B.6.1, we have the following for these rounds:

\displaystyle \mathbb{E}\left[\left(p_n \omega_t^n - 1\right)^{2}\right] = (p_n)^{2}\, \mathbb{E}\left[\left(\omega_t^n - \left(\frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n}\right) - \frac{(1-p_n)^{K}}{p_n}\right)^{2}\right]
\displaystyle \overset{(a)}{=} (p_n)^{2}\, \mathbb{E}\left[\left(\omega_t^n - \left(\frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n}\right)\right)^{2}\right] + (1-p_n)^{2K}
\displaystyle \overset{(b)}{\leq} (p_n)^{2} \cdot \frac{\mathrm{Var}\left[S_n\right]}{\max\left\{\lfloor \frac{t}{K} \rfloor, 1\right\}} + (1-p_n)^{2K}
\displaystyle \leq (p_n)^{2} \cdot \frac{2K\, \mathrm{Var}\left[S_n\right]}{t} + (1-p_n)^{2K}
\displaystyle \overset{(c)}{\leq} \frac{2K(1-p_n)}{t} + (1-p_n)^{2K}, (B.6.8)

where $(a)$ is because the inner product term is zero since the mean of $\omega_t^n$ is equal to $\frac{1}{p_n} - \frac{(1-p_n)^{K}}{p_n}$; $(b)$ is due to the definition of variance, the fact that we consider the computation of $\omega_t^n$ to be based on at least one sample of $S_n$, and the fact that for any round $t$ there are at least $\lfloor \frac{t}{K} \rfloor$ samples of $S_n$ due to the cutoff interval of length $K$; the unlabeled inequality uses $\max\left\{\lfloor \frac{t}{K} \rfloor, 1\right\} \geq \frac{t}{2K}$ for $t \geq 1$; $(c)$ uses the upper bound $\mathrm{Var}\left[S_n\right] \leq \frac{1-p_n}{p_n^{2}}$.
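The two elementary inequalities used between step $(b)$ and step $(c)$ can be verified on a grid. This is a minimal sketch with hypothetical function names, not part of the original proof: it checks $\frac{1}{\max\{\lfloor t/K \rfloor, 1\}} \leq \frac{2K}{t}$ for $t \geq 1$, and it checks that the variance expression in (B.6.6) lies between $0$ and $\frac{1-p_n}{p_n^{2}}$.

```python
from fractions import Fraction


def sample_count_lower_bound(t, K):
    # Denominator in step (b): at least floor(t / K) samples, and at least one in Case 2.
    return max(t // K, 1)


def variance_closed_form(p, K):
    # Variance expression from (B.6.6).
    q = 1 - p
    return -(2 * K - 1) * q ** K / p + q / p ** 2 - q ** (2 * K) / p ** 2


def variance_upper_bound(p):
    # Bound used in step (c): Var[S_n] <= (1 - p) / p^2.
    return (1 - p) / p ** 2


# Inequality between the lines labeled (b) and (c): 1 / max(floor(t/K), 1) <= 2K / t.
for K in range(1, 30):
    for t in range(1, 500):
        assert Fraction(1, sample_count_lower_bound(t, K)) <= Fraction(2 * K, t)

# Step (c): the closed-form variance never exceeds (1 - p) / p^2 and is nonnegative.
for p in (Fraction(1, 20), Fraction(1, 2), Fraction(19, 20)):
    for K in range(1, 30):
        assert 0 <= variance_closed_form(p, K) <= variance_upper_bound(p)
```

The grid is finite, so this only illustrates the inequalities; the general argument is that $\lfloor x \rfloor \geq x/2$ for $x \geq 1$, and that (B.6.6) subtracts two nonnegative terms from $\frac{1-p_n}{p_n^{2}}$.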

We note that the bound (B.6.7) in Case 1 always applies for $t = 0$, because we always have $\omega_t^n = 1$ for $t = 0$ according to Algorithm 2. For rounds $0 < t < K$, either the bound (B.6.7) in Case 1 or the bound (B.6.8) in Case 2 applies, thus $\mathbb{E}\left[\left(p_n \omega_t^n - 1\right)^{2}\right]$ is upper bounded by the sum of both bounds in these rounds. Then, for $t \geq K$, the bound (B.6.8) in Case 2 applies. According to this fact, summing up the bounds for each round and dividing by $T$ gives

\displaystyle \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\left(p_n \omega_t^n - 1\right)^{2}\right] \leq \frac{1}{T} \left[ K + (T-1)(1-p_n)^{2K} + 2K(1-p_n) \sum_{t=1}^{T-1} \frac{1}{t} \right]
\displaystyle \leq \frac{K + 2K(1-p_n)\left(\log T + 1\right)}{T} + (1-p_n)^{2K}
\displaystyle \leq \frac{3K + 2K \log T}{T} + (1-p_n)^{2K}, (B.6.9)

where we use the relation $\sum_{t=1}^{T-1} \frac{1}{t} \leq \log T + 1$ for $T \geq 2$, and $\log$ denotes the natural logarithm.

The final result is obtained by averaging (B.6.9) over all $n$. ∎
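To see how the chain in (B.6.9) behaves for concrete values, the sketch below evaluates its first and last lines and checks the harmonic-sum relation. The function names and the sample values of $K$, $T$, and $p_n$ are illustrative assumptions only.

```python
import math


def harmonic_partial_sum(T):
    """Sum of 1/t for t = 1, ..., T-1."""
    return sum(1.0 / t for t in range(1, T))


def first_line_bound(K, T, p_n):
    """First line of the right-hand side of (B.6.9)."""
    q = 1.0 - p_n
    return (K + (T - 1) * q ** (2 * K) + 2 * K * q * harmonic_partial_sum(T)) / T


def final_bound(K, T, p_n):
    """Last line of (B.6.9): (3K + 2K log T) / T + (1 - p_n)^(2K)."""
    return (3 * K + 2 * K * math.log(T)) / T + (1.0 - p_n) ** (2 * K)


for T in (2, 10, 1_000, 100_000):
    # Relation used in the proof, with the natural logarithm.
    assert harmonic_partial_sum(T) <= math.log(T) + 1
    for K in (1, 5, 50):
        for p_n in (0.05, 0.5, 0.95):
            assert first_line_bound(K, T, p_n) <= final_bound(K, T, p_n)
```

The term $(1-p_n)^{2K}$ does not decay with $T$ for a fixed $K$, which is why the choice of $K$ in Section B.7 grows with $T$.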

B.7 Proof of Corollary 4

We first prove the upper bound of the weight error term in the following lemma.

Lemma B.7.1.

Choose $K = \left\lceil \log_c T \right\rceil$, where $c := \left(\frac{1}{1-p}\right)^{2}$ and $p := \min_n p_n$, and define $R := \frac{1}{\log c}$. When $T \geq 2$, the aggregation weights $\{\omega_t^n\}$ obtained from Algorithm 2 satisfy

\displaystyle \frac{1}{NT} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\left[\left(p_n \omega_t^n - 1\right)^{2}\right] \leq \mathcal{O}\left(\frac{R \log^{2} T}{T}\right). (B.7.1)
Proof.

With $K = \left\lceil \log_c T \right\rceil$, we have

\displaystyle \frac{1}{N} \sum_{n=1}^{N} (1-p_n)^{2K} \leq (1-p)^{2K} = (1-p)^{2\left\lceil \log_c T \right\rceil} \leq (1-p)^{2 \log_c T} = \left( \frac{1}{\left(\frac{1}{1-p}\right)^{2}} \right)^{\log_c T}
\displaystyle = \frac{1}{c^{\log_c T}} = \frac{1}{T} \leq \frac{\log_2^{2} T}{T}. (B.7.2)

This shows that by choosing $K = \left\lceil \log_c T \right\rceil$ where $T \geq 2$, the right-hand side of (7) in Theorem 3 is upper bounded by $\mathcal{O}\left(\frac{\log^{2} T}{T \log c} + \frac{\log^{2} T}{T}\right) = \mathcal{O}\left(\frac{R \log^{2} T}{T}\right)$, which proves the result. ∎
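The choice of $K$ in Lemma B.7.1 can be illustrated numerically. The following is a minimal sketch with a hypothetical helper `cutoff_interval`; it assumes $0 < p < 1$ and uses floating-point logarithms, so a small relative tolerance is allowed in the comparison with $\frac{1}{T}$.

```python
import math


def cutoff_interval(T, p_min):
    """K = ceil(log_c T) with c = (1 / (1 - p_min))^2, as in Lemma B.7.1.

    Assumes 0 < p_min < 1 and T >= 2; p_min = 1 would make c undefined here.
    """
    c = (1.0 / (1.0 - p_min)) ** 2
    return math.ceil(math.log(T) / math.log(c))


for p_min in (0.01, 0.1, 0.5, 0.9):
    R = 1.0 / math.log((1.0 / (1.0 - p_min)) ** 2)
    for T in (2, 100, 10_000, 1_000_000):
        K = cutoff_interval(T, p_min)
        assert K >= 1
        # (B.7.2): the non-vanishing term is at most 1 / T (up to rounding).
        assert (1.0 - p_min) ** (2 * K) <= (1.0 / T) * (1 + 1e-9)
        # K grows like R * log T, which is where the factor R in (B.7.1) comes from.
        assert K <= R * math.log(T) + 1 + 1e-9
```

Smaller values of $p$ give a larger $R$ and hence a longer cutoff interval, consistent with the $R \log^{2} T / T$ rate in (B.7.1).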

Proof of Corollary 4.

We note that

\displaystyle \gamma = \min\left\{ \frac{1}{LI\sqrt{T}};\ \frac{1}{4\sqrt{15}\,LI} \right\} \leq \frac{1}{LI\sqrt{T}}
\displaystyle \gamma\eta = \min\left\{ \sqrt{\frac{\mathcal{F}N}{Q\left(I\delta^{2} + \sigma^{2}\right)LIT}};\ \frac{1}{4LI};\ \frac{N}{54LIQ} \right\} \leq \sqrt{\frac{\mathcal{F}N}{Q\left(I\delta^{2} + \sigma^{2}\right)LIT}}
\displaystyle \frac{1}{\gamma\eta} = \max\left\{ \sqrt{\frac{Q\left(I\delta^{2} + \sigma^{2}\right)LIT}{\mathcal{F}N}};\ 4LI;\ \frac{54LIQ}{N} \right\} \leq \sqrt{\frac{Q\left(I\delta^{2} + \sigma^{2}\right)LIT}{\mathcal{F}N}} + 4LI + \frac{54LIQ}{N}

The result follows by plugging these upper bounds of $\gamma$, $\gamma\eta$, and $\frac{1}{\gamma\eta}$ and the result in Lemma B.7.1 into Theorem 2, where we note that $\sqrt{I\delta^{2}+\sigma^{2}} \leq \sqrt{I}\delta+\sigma$ since $I\delta^{2}+\sigma^{2} \leq I\delta^{2}+2\sqrt{I}\delta\sigma+\sigma^{2} = \left(\sqrt{I}\delta+\sigma\right)^{2}$. ∎

Appendix C Additional Setup Details of Experiments

C.1 Code

The code for reproducing our experiments is available via the following link:
https://shiqiang.wang/code/fedau

C.2 Datasets

The SVHN dataset (Netzer et al., 2011) has a citation requirement, and its license is for non-commercial use only. It includes $32\times 32$ color images of real-world house numbers with 10 different digits, containing 73,257 training data samples and 26,032 test data samples.

The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) only has a citation requirement. It includes $32\times 32$ color images of 10 different types of real-world objects, containing 50,000 training data samples and 10,000 test data samples.

The CIFAR-100 dataset (Krizhevsky & Hinton, 2009) only has a citation requirement. It includes $32\times 32$ color images of 100 different types of real-world objects, containing 50,000 training data samples and 10,000 test data samples.

The CINIC-10 dataset (Darlow et al., 2018) has an MIT license. It includes $32\times 32$ color images of 10 different types of real-world objects, containing 90,000 training data samples and 90,000 test data samples.

We have cited all the references in the main paper and conformed to all the license terms.

We applied some basic data augmentation techniques to these datasets during the training stage. For SVHN, we applied random cropping. For CIFAR-10 and CINIC-10, we applied both random cropping and random horizontal flipping. For CIFAR-100, we applied a combination of random sharpness adjustment, color jitter, random posterization, random equalization, random cropping, and random horizontal flipping.

C.3 Models

All the models include two convolutional layers with a kernel size of 3, filter size of 32, and ReLU activation, where each convolutional layer is followed by a max-pool layer. The model for the SVHN dataset has two fully connected layers, while the models for the CIFAR-10/100 and CINIC-10 datasets have three fully connected layers. All the fully connected layers use ReLU activation, except for the last layer that is connected to softmax output. For CIFAR-100 and CINIC-10 datasets, a dropout layer (with dropout probability $p=0.2$) is applied before each fully connected layer. We use Kaiming initialization for the weights. See the code for further details on model definition (the model class files are located inside the “model/” subfolder).

C.4 Hyperparameters

For each dataset and algorithm, we conducted a grid search on the learning rates $\gamma$ and $\eta$ separately. The grid for the local step size $\gamma$ is $\{10^{-2}, 10^{-1.75}, 10^{-1.5}, 10^{-1.25}, 10^{-1}, 10^{-0.75}, 10^{-0.5}\}$ and the grid for the global step size $\eta$ is $\{10^{0}, 10^{0.25}, 10^{0.5}, 10^{0.75}, 10^{1}, 10^{1.25}, 10^{1.5}\}$. To reduce the complexity of the search, we first search for the value of $\gamma$ with $\eta=1$, and then search for $\eta$ while fixing $\gamma$ to the value found in the first search. We select the best $\gamma$ and $\eta$ based on the training loss at 500 rounds. The hyperparameters found from this search and used in our experiments are shown in Table C.4.1.
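The two-stage search described above can be sketched as follows. This is only an illustrative sketch: `loss_fn` is a hypothetical callable (not part of our released code) that is assumed to run training with the given step sizes and return the training loss at the evaluation round.

```python
from typing import Callable, Sequence, Tuple

# Grids from the text: exponents are spaced by 0.25.
GAMMA_GRID = [10 ** (-2 + 0.25 * i) for i in range(7)]  # 10^-2, ..., 10^-0.5
ETA_GRID = [10 ** (0.25 * i) for i in range(7)]         # 10^0, ..., 10^1.5


def two_stage_search(
    loss_fn: Callable[[float, float], float],
    gamma_grid: Sequence[float] = GAMMA_GRID,
    eta_grid: Sequence[float] = ETA_GRID,
) -> Tuple[float, float]:
    """Pick gamma with eta fixed to 1, then pick eta with gamma fixed.

    `loss_fn(gamma, eta)` is a hypothetical stand-in for a training run
    that returns the training loss used for model selection.
    """
    best_gamma = min(gamma_grid, key=lambda g: loss_fn(g, 1.0))
    best_eta = min(eta_grid, key=lambda e: loss_fn(best_gamma, e))
    return best_gamma, best_eta
```

Compared with a joint search over all 49 pairs, this procedure needs only 14 training runs per dataset and algorithm, at the cost of possibly missing a pair that is only good jointly.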

Learning Rate Decay for CIFAR-100 Dataset. Only for the CIFAR-100 dataset, we decay the local learning rate $\gamma$ by half every 1,000 rounds, starting from the 10,000-th round.
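One possible reading of this schedule is sketched below. The function name `local_lr` is hypothetical, and we assume that the first halving takes effect at round 10,000 with a further halving every 1,000 rounds afterwards; the released code remains the reference for the exact boundary behavior.

```python
def local_lr(round_idx: int, gamma0: float,
             start: int = 10_000, period: int = 1_000) -> float:
    """Local step size at a given round under step-wise halving.

    Assumption: the first halving happens at round `start`, and one more
    halving follows after every `period` additional rounds.
    """
    if round_idx < start:
        return gamma0
    num_halvings = (round_idx - start) // period + 1
    return gamma0 * 0.5 ** num_halvings
```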

Table C.4.1: Values of hyperparameters $\gamma$ and $\eta$, where we use $10^{-1.25}\approx 0.0562$, $10^{-1.5}\approx 0.0316$, $10^{0.25}\approx 1.78$

| Method | SVHN $\gamma$ | SVHN $\eta$ | CIFAR-10 $\gamma$ | CIFAR-10 $\eta$ | CIFAR-100 $\gamma$ | CIFAR-100 $\eta$ | CINIC-10 $\gamma$ | CINIC-10 $\eta$ |
|---|---|---|---|---|---|---|---|---|
| FedAU (ours, $K\rightarrow\infty$) | 0.1 | 1.0 | 0.1 | 1.0 | 0.0562 | 1.78 | 0.1 | 1.0 |
| FedAU (ours, $K=50$) | 0.1 | 1.0 | 0.1 | 1.0 | 0.0562 | 1.78 | 0.1 | 1.0 |
| Average participating | 0.0562 | 1.78 | 0.0562 | 1.78 | 0.0316 | 1.78 | 0.0562 | 1.78 |
| Average all | 0.1 | 10.0 | 0.1 | 10.0 | 0.0562 | 10.0 | 0.1 | 10.0 |
| | | | | | | | | |
| FedVarp ($250\times$ memory) | 0.1 | 1.0 | 0.0562 | 1.0 | 0.0316 | 1.78 | 0.0562 | 1.0 |
| MIFA ($250\times$ memory) | 0.1 | 1.0 | 0.0562 | 1.0 | 0.0316 | 1.78 | 0.0562 | 1.0 |
| Known participation statistics | 0.1 | 1.0 | 0.0562 | 1.0 | 0.0316 | 1.78 | 0.0562 | 1.0 |

C.5 Computation Resources

The experiments were split between a desktop machine with an RTX 3070 GPU and an internal GPU cluster. In our experiments, the total number of rounds is 2,000 for SVHN, 10,000 for CIFAR-10 and CINIC-10, and 20,000 for CIFAR-100. Each experiment with 10,000 rounds took approximately 4 hours to complete, for one random seed on the RTX 3070 GPU. The time taken for experiments with other numbers of rounds scales accordingly. We ran experiments with 5 different random seeds for each dataset and algorithm. It was possible to run multiple experiments simultaneously on the same GPU while not exceeding the GPU memory.

C.6 Heterogeneous Participation Across Clients

C.6.1 Generating Participation Patterns

In each experiment with a specific simulation seed, we take only one sample of this Dirichlet distribution with parameter $\alpha_p$, which gives a probability vector $\mathbf{q}\sim\mathrm{Dir}(\alpha_p)$ that has a dimension equal to the total number of classes in the dataset. (Here, we use $\mathrm{Dir}(\alpha_p)$ to denote $\mathrm{Dir}(\boldsymbol{\alpha}_p)$ with all the elements in the vector $\boldsymbol{\alpha}_p$ equal to $\alpha_p$.) The participation probability $p_n$ for each client $n$ is obtained by computing an inner product between $\mathbf{q}$ and the class distribution vector of the data at client $n$, and then dividing by a normalization factor. The rationale behind this approach is that the elements in $\mathbf{q}$ indicate how different classes contribute to the participation probability. For example, if the first element of $\mathbf{q}$ is large, it means that clients with a lot of data samples in the first class will have a high participation probability, and vice versa.
Since the participation probabilities $\{p_n\}$ generated using this approach are random variables, the normalization ensures a certain mean participation probability, i.e., $\mathbb{E}\left[p_n\right]$, of any client $n$, which is set to $0.1$ in our experiments. We further enforce a minimum value of $0.02$ for every $p_n$.

Among the three participation patterns in our experiments, i.e., Bernoulli, Markovian, and cyclic, we maintain the same stationary participation probabilities $\{p_n\}$ for the clients, so the difference is in the temporal distribution of when a client participates, which is summarized as follows.

  • For Bernoulli participation, in every round $t$, each client $n$ decides whether or not to participate according to a Bernoulli distribution with probability $p_n$. This decision is independent across time, i.e., independent across different rounds.

  • For Markovian participation, each client participates according to a two-state Markov chain, where the motivation is similar to cyclic participation (see the next item) but includes more randomness. We cap the probability of a client transitioning from not participating to participating at $0.05$. The initial state of the Markov chain is determined by random sampling according to the stationary probability $p_n$, and the transition probabilities are determined so that the same stationary probability is maintained across all the subsequent rounds.

  • For cyclic participation, each client participates cyclically, i.e., it participates for a certain number of rounds and does not participate in the other rounds of a cycle. This setup has been used in existing works to simulate periodic behavior of client devices being charged (e.g., at night) (Eichner et al., 2019; Ding et al., 2020; Cho et al., 2023; Wang & Ji, 2022). We set each cycle to be 100 rounds. We apply a random initial offset to the cycle for each client, to simulate a stationary random process for each client’s participation pattern.
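The three patterns listed above can be simulated for a single client with a sketch like the following. The function names are hypothetical, and two details are assumptions rather than statements of our exact implementation: in the Markovian case, the joining probability is lowered below the 0.05 cap when needed so that the leaving probability stays at most 1, and in the cyclic case, the client is active for `round(p * cycle)` consecutive rounds of each cycle. All functions assume $0 < p \leq 1$.

```python
import random
from typing import List


def bernoulli_pattern(p: float, rounds: int, rng: random.Random) -> List[int]:
    """Independent participation in every round with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(rounds)]


def markov_pattern(p: float, rounds: int, rng: random.Random,
                   max_join: float = 0.05) -> List[int]:
    """Two-state Markov chain with stationary participation probability p."""
    # a = P(idle -> active), b = P(active -> idle); the stationary
    # probability of being active is a / (a + b), so b = a (1 - p) / p.
    # Assumption: a is reduced below the cap when needed to keep b <= 1.
    a = min(max_join, p / (1.0 - p)) if p < 1.0 else 1.0
    b = a * (1.0 - p) / p
    state = 1 if rng.random() < p else 0  # initial state from stationary law
    pattern = []
    for _ in range(rounds):
        pattern.append(state)
        if state == 1:
            state = 0 if rng.random() < b else 1
        else:
            state = 1 if rng.random() < a else 0
    return pattern


def cyclic_pattern(p: float, rounds: int, rng: random.Random,
                   cycle: int = 100) -> List[int]:
    """Active for a fixed block of each cycle, with a random initial offset."""
    active = round(p * cycle)  # assumption: block length is round(p * cycle)
    offset = rng.randrange(cycle)
    return [1 if (t + offset) % cycle < active else 0 for t in range(rounds)]
```

All three generators share the same long-run participation rate and differ only in how the participating rounds are spread over time.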

Figure C.6.1 shows examples of these three types of participation patterns.

Figure C.6.1: Illustration of different participation patterns, in the first 400 rounds of a single client with 52.7% mean participation rate.

C.6.2 Illustration of Data and Participation Heterogeneity

As described in Section 5 and Appendix C.6.1, we generate the data and participation heterogeneity with two separate Dirichlet distributions with parameters $\alpha_d$ and $\alpha_p$, respectively. In the following, we illustrate the result of this generation for a specific random instance. In Figure C.6.2, the class-wise data distribution of each client is drawn from $\mathrm{Dir}(\alpha_d)$. For computing the participation probability, we draw a vector $\mathbf{q}$ from $\mathrm{Dir}(\alpha_p)$, which gave the following result in our random trial:

$$\mathbf{q} = [0.02,\, 0.05,\, 0.12,\, 0.00,\, 0.00,\, 0.78,\, 0.00,\, 0.00,\, 0.02,\, 0.00].$$

Then, the participation probability is set as the inner product of $\mathbf{q}$ and the class distribution of each client’s data, divided by a normalization factor. For the above $\mathbf{q}$, the 6-th element has the highest value, which means that clients with a larger proportion of data in the 6-th class (label) will have a higher participation probability. This is confirmed by comparing the class distributions and the participation probabilities in Figure C.6.2.

Figure C.6.2: Illustration of data and participation heterogeneity, for an example with 20 clients.

In this procedure, $\mathbf{q}$ is kept the same for all the clients, to simulate a consistent correlation between participation probability and class distribution across all the clients. However, the value of $\mathbf{q}$ changes with the random seed, which means that we have different $\mathbf{q}$ for different experiments. We ran experiments with 5 different random seeds for each setting, which allows us to observe the general behavior.

More precisely, let $\boldsymbol{\kappa}_n\sim\mathrm{Dir}(\alpha_d)$ denote the class distribution of client $n$’s data. The participation probability $p_n$ of each client $n$ is computed as

$$p_n = \frac{1}{\lambda}\left\langle\boldsymbol{\kappa}_n, \mathbf{q}\right\rangle, \tag{C.6.1}$$

where $\lambda$ is the normalization factor to ensure that $\mathbb{E}\left[p_n\right]$ is equal to some target $\mu$, because $p_n$ is a random quantity when using this randomized generation procedure. In our experiments, we set $\mu=0.1$. Let $C$ denote the total number of classes (labels). From the mean of the Dirichlet distribution, every element of $\mathbb{E}\left[\boldsymbol{\kappa}_n\right]$ and $\mathbb{E}\left[\mathbf{q}\right]$ is equal to $\frac{1}{C}$. Together with the fact that $\boldsymbol{\kappa}_n$ and $\mathbf{q}$ are independent, we know that $\mathbb{E}\left[\left\langle\boldsymbol{\kappa}_n, \mathbf{q}\right\rangle\right] = \left\langle\mathbb{E}\left[\boldsymbol{\kappa}_n\right], \mathbb{E}\left[\mathbf{q}\right]\right\rangle = C\cdot\frac{1}{C^2} = \frac{1}{C}$. Therefore, to ensure that $\mathbb{E}\left[p_n\right]=\mu$, according to (C.6.1), the normalization factor is chosen as $\lambda=\frac{1}{C\mu}$.
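The generation of $\{p_n\}$ via (C.6.1) can be sketched as below. The helper names are hypothetical, the Dirichlet samples are produced by normalizing Gamma draws, and the upper clip at 1 is an assumption added here only so that $p_n$ stays a valid probability; the lower bound of 0.02 follows Appendix C.6.1.

```python
import random
from typing import List, Sequence


def sample_dirichlet(alpha: float, dim: int, rng: random.Random) -> List[float]:
    """Symmetric Dirichlet sample obtained by normalizing Gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow for very small alpha
            return [d / total for d in draws]


def participation_probs(class_dists: Sequence[Sequence[float]],
                        q: Sequence[float],
                        mean_target: float = 0.1,
                        p_min: float = 0.02) -> List[float]:
    """p_n = <kappa_n, q> / lambda with lambda = 1 / (C * mean_target)."""
    num_classes = len(q)
    lam = 1.0 / (num_classes * mean_target)
    probs = []
    for kappa in class_dists:
        p = sum(k * w for k, w in zip(kappa, q)) / lam
        # Lower bound from the text; the clip at 1.0 is an assumption.
        probs.append(min(1.0, max(p_min, p)))
    return probs
```

For example, with the vector $\mathbf{q}$ from the random trial above, a client whose data all belongs to the 6-th class gets $p_n = 0.78$, while a client with no data in any class that has nonzero weight in $\mathbf{q}$ is raised to the lower bound of 0.02.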

We emphasize again that this procedure is only used for simulating an experimental setup with both data and participation heterogeneity. Our FedAU algorithm still does not know the actual values of $\{p_n\}$.

Appendix D Additional Results from Experiments

D.1 Results with Different Participation Heterogeneity

We present the results with different participation heterogeneity (characterized by the Dirichlet parameter $\alpha_p$) on the CIFAR-10 dataset in Table D.1.1, for the case of Bernoulli participation. The main observations remain consistent with those in Section 5. We also see that the difference between different methods becomes larger when the heterogeneity is higher (i.e., smaller $\alpha_p$), which aligns with intuition. For all degrees of heterogeneity, our FedAU algorithm performs the best among the algorithms that work under the same setting, i.e., the top part of Table D.1.1.

Table D.1.1: Accuracy results (in %) on training and test data of CIFAR-10, with different participation heterogeneity (Bernoulli participation)

| Method | $\alpha_p=0.01$ Train | $\alpha_p=0.01$ Test | $\alpha_p=0.05$ Train | $\alpha_p=0.05$ Test | $\alpha_p=0.1$ Train | $\alpha_p=0.1$ Test | $\alpha_p=0.5$ Train | $\alpha_p=0.5$ Test | $\alpha_p=1.0$ Train | $\alpha_p=1.0$ Test |
|---|---|---|---|---|---|---|---|---|---|---|
| FedAU (ours, $K\rightarrow\infty$) | 83.7±0.8 | 76.2±0.7 | 84.3±0.8 | 76.6±0.5 | 85.4±0.4 | 77.1±0.4 | 87.3±0.5 | 77.8±0.2 | 88.1±0.7 | 78.1±0.2 |
| FedAU (ours, $K=50$) | 84.7±0.6 | 76.9±0.6 | 85.1±0.5 | 77.1±0.3 | 86.0±0.5 | 77.3±0.3 | 87.6±0.4 | 77.8±0.4 | 88.2±0.7 | 78.0±0.2 |
| Average participating | 80.6±1.2 | 72.3±1.7 | 81.5±1.1 | 72.6±1.4 | 83.5±0.9 | 74.1±0.8 | 85.9±0.7 | 75.7±0.9 | 87.0±1.0 | 76.8±0.6 |
| Average all | 76.9±2.7 | 69.5±2.7 | 78.5±1.7 | 70.6±1.8 | 81.0±0.9 | 72.7±0.9 | 83.6±1.4 | 74.6±0.8 | 84.9±1.2 | 75.9±1.0 |
| | | | | | | | | | | |
| FedVarp ($250\times$ memory) | 82.5±0.8 | 77.3±0.3 | 83.0±0.4 | 77.5±0.4 | 84.2±0.3 | 77.9±0.2 | 85.4±0.5 | 78.1±0.2 | 86.4±0.7 | 78.5±0.3 |
| MIFA ($250\times$ memory) | 82.1±0.8 | 77.0±0.7 | 82.6±0.3 | 77.3±0.4 | 83.5±0.6 | 77.5±0.3 | 84.9±0.5 | 77.9±0.3 | 85.4±0.4 | 78.0±0.4 |
| Known participation statistics | 83.1±0.8 | 76.3±0.6 | 83.6±0.6 | 76.7±0.5 | 84.3±0.5 | 77.0±0.5 | 86.1±0.6 | 77.7±0.4 | 86.8±0.9 | 77.9±0.7 |

Note to the table. The same note in Table 1 also applies to this table.

D.2 Loss and Accuracy Plots

For Bernoulli participation, we plot the loss and accuracy results in different rounds for the four datasets, as shown in Figures D.2.1 to D.2.4. In these plots, the curves show the mean values and the shaded areas show the standard deviation. We applied a moving average with a window size equal to 3% of the total number of rounds, and the mean and standard deviation are computed across samples from all experiments (with 5 different random seeds) within each moving average window.
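A minimal sketch of this smoothing is given below, assuming trailing windows and the population standard deviation; the function name is hypothetical and the plotting scripts in the released code may differ in such details. Each entry of `runs` is the per-round metric of one random seed.

```python
import statistics
from typing import List, Sequence, Tuple


def windowed_mean_std(runs: Sequence[Sequence[float]],
                      window_frac: float = 0.03) -> Tuple[List[float], List[float]]:
    """Pool the samples of all seeds inside each window and summarize them.

    Assumptions: the window is trailing (it ends at the current round) and
    the standard deviation is the population standard deviation.
    """
    total_rounds = len(runs[0])
    window = max(1, round(window_frac * total_rounds))
    means, stds = [], []
    for t in range(total_rounds):
        start = max(0, t - window + 1)
        pooled = [run[i] for run in runs for i in range(start, t + 1)]
        means.append(statistics.fmean(pooled))
        stds.append(statistics.pstdev(pooled))
    return means, stds
```

Because samples are pooled over both seeds and rounds within a window, the shaded areas reflect round-to-round fluctuation as well as seed-to-seed variation.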

The main conclusions from Figures D.2.1 to D.2.4 are similar to what we have seen from the final-round results shown in Table 1 in the main paper. We can see that our FedAU algorithm performs the best in the vast majority of cases and across most rounds. Only for the CIFAR-10 dataset, FedAU gives a slightly worse test accuracy compared to FedVarp and MIFA, which aligns with the results in Table 1. However, FedAU still gives the highest training accuracy on CIFAR-10. This implies that FedVarp and MIFA generalize slightly better on the CIFAR-10 dataset, where the reasons are worth further investigation. We emphasize again that FedVarp and MIFA both require substantially more memory than FedAU, thus they do not work under the same system assumptions as FedAU. For the CIFAR-100 dataset, there is a jump around the 10,000-th round due to the learning rate decay schedule, as mentioned in Section C.4.

Figure D.2.1: Results on SVHN dataset (Bernoulli participation).
Figure D.2.2: Results on CIFAR-10 dataset (Bernoulli participation).
Figure D.2.3: Results on CIFAR-100 dataset (Bernoulli participation).
Figure D.2.4: Results on CINIC-10 dataset (Bernoulli participation).

D.3 Client-wise Distributions of Loss and Accuracy

We plot the loss and accuracy value distributions among all the clients in Figure D.3.1, where we consider Bernoulli participation and compare with baselines that do not require extra resources or information.

Figure D.3.1: Client-wise distributions of loss value, training accuracy, and test accuracy, on CIFAR-10 dataset with Bernoulli participation. The distribution is expressed as empirical cumulative distribution function (CDF).

We can see that compared to the average-participating and average-all baselines that use the same amount of memory as FedAU, the spread in the loss and accuracy with FedAU is smaller. This is also seen in the standard deviation of all the clients’ loss and accuracy values in Table D.3.1, where we only include the standard deviation values because the mean values are the same as those in Table 1.

Table D.3.1: Client-wise statistics of loss and accuracy (CIFAR-10 dataset with Bernoulli participation)

| Method | Client-wise std. dev. of loss | Client-wise std. dev. of training accuracy | Client-wise std. dev. of test accuracy |
|---|---|---|---|
| FedAU (ours, $K\rightarrow\infty$) | 0.017 | 9.9% | 11.7% |
| FedAU (ours, $K=50$) | 0.016 | 9.5% | 11.2% |
| Average participating | 0.031 | 13.3% | 11.6% |
| Average all | 0.030 | 13.3% | 12.2% |

This shows that FedAU (especially with $K=50$) reduces the bias among clients compared to the two baselines, which aligns with our motivation mentioned in Section 1 about reducing discrimination.
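For reference, the client-wise statistics used in this subsection can be computed from one value per client (e.g., the loss of the final model on that client's data) as sketched below. The function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics
from typing import List, Sequence, Tuple


def empirical_cdf(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sorted values x_(i) and the CDF heights F(x_(i)) = i / n."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def clientwise_std(values: Sequence[float]) -> float:
    """Spread of a metric across clients (population standard deviation)."""
    return statistics.pstdev(values)
```

A steeper empirical CDF and a smaller standard deviation both indicate that the metric is more uniform across clients.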

D.4 Aggregation Weights

As shown in Figure D.4.1, with Bernoulli participation, the computed weights can be quite different from $1/p_n$, especially when the participation probability is low (Subfigures (c)-(e) of Figure D.4.1). In contrast, we see in Figure D.4.2 that with cyclic participation, the weights computed by FedAU and the known participation statistics baseline are more similar. This aligns with the fact that the accuracies in the case of cyclic participation are also more similar compared to the case of Bernoulli participation, as seen in Table 1.

Note that we use $K\rightarrow\infty$ for FedAU in both Figure D.4.1 and Figure D.4.2.
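To give some intuition for why the weights track $1/p_n$ closely under cyclic participation, the following simplified sketch estimates a client's weight as the running average of the intervals between its participations, with an interval optionally cut off after `cutoff` rounds. This is an illustration in the spirit of FedAU rather than a restatement of the exact algorithm in the main paper; the function name and the bookkeeping details are assumptions.

```python
from typing import List, Optional, Sequence


def interval_average_weights(participation: Sequence[int],
                             cutoff: Optional[int] = None) -> List[float]:
    """Weight used in each round, estimated from earlier rounds only.

    Simplified sketch: the weight is the running mean of completed
    intervals, where an interval ends when the client participates or
    (if `cutoff` is given) once it has lasted `cutoff` rounds.
    """
    weight = 1.0     # initial weight before any interval is observed
    count = 0        # number of completed intervals
    since_last = 0   # length of the current interval so far
    weights = []
    for took_part in participation:
        weights.append(weight)
        since_last += 1
        if took_part or (cutoff is not None and since_last == cutoff):
            count += 1
            weight += (since_last - weight) / count  # running mean update
            since_last = 0
    return weights
```

In this sketch, a client that participates exactly once every 4 rounds obtains a weight of 4, i.e., $1/p_n$ with $p_n = 0.25$, as soon as its first interval completes, whereas under Bernoulli participation the intervals are random and a rarely participating client has few completed intervals to average over. With a finite cutoff, no interval exceeds `cutoff`, so the estimated weight is bounded by it.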

Figure D.4.1: Aggregation weights with Bernoulli participation: (a) average aggregation weights over all clients, (b) aggregation weights of a single client with a high mean participation rate of 52.7%, (c)-(e) aggregation weights of three individual clients with a low mean participation rate of 2%.
Figure D.4.2: Aggregation weights with cyclic participation: (a) average aggregation weights over all clients, (b) aggregation weights of a single client with a high mean participation rate of 52.7%, (c)-(e) aggregation weights of three individual clients with a low mean participation rate of 2%.

D.5 Choice of Different $K$

We study the effect of the cutoff interval length $K$ by considering the performance of FedAU under different minimum participation probabilities. The distributions of participation probabilities $\{p_n\}$ for all the clients with different lower bounds are shown in Figure D.5.1, where we can see that a smaller lower bound value corresponds to having more clients with very small participation probabilities. The full set of plots complementing Figure 1 is shown in Figure D.5.2.

Figure D.5.1: Distributions of participation probabilities $\{p_n\}$ at clients, with different lower bound values of these probabilities. The distribution is expressed as empirical cumulative distribution function (CDF) with logarithmic scale on the $x$-axis.
Figure D.5.2: Results of FedAU with different $K$, on CIFAR-10 dataset with Bernoulli participation: (a) full plots, (b) plots with enlarged $y$-axis. The loss is NaN for $K=10^{5}$ with $p_n\geq 0.0$.

D.6 Low Participation Rates

To further study the performance of FedAU in the presence of clients with low participation rates, we set the lower bound of participation probabilities to $0.0$ (i.e., we do not impose a specific lower bound; see Appendix C.6.1 and Appendix D.5 for details) and compare the performance of FedAU with $K=50$ to the baseline algorithms. We consider settings with different mean participation probabilities $\mathbb{E}\left[p_n\right]$, while following the same procedure of generating heterogeneous participation patterns as described in Appendix C.6, to capture the effect of different overall participation rates of clients. The resulting distributions of $\{p_n\}$ with different $\mathbb{E}\left[p_n\right]$ are shown in Figure D.6.1.

Figure D.6.1: Distributions of participation probabilities $\{p_n\}$ at clients, with different values of $\mathbb{E}\left[p_n\right]$. The distribution is expressed as empirical cumulative distribution function (CDF) with logarithmic scale on the $x$-axis.
Table D.6.1: Accuracy results (in %) on training and test data of CIFAR-10, with different mean participation probabilities $\mathbb{E}\left[p_n\right]$ of clients (Bernoulli participation) and no minimum cap of $p_n$

| Method | $\mathbb{E}[p_n]=0.1$ Train | $\mathbb{E}[p_n]=0.1$ Test | $\mathbb{E}[p_n]=0.01$ Train | $\mathbb{E}[p_n]=0.01$ Test | $\mathbb{E}[p_n]=0.001$ Train | $\mathbb{E}[p_n]=0.001$ Test |
|---|---|---|---|---|---|---|
| FedAU (ours, $K=50$) | 84.1±0.6 | 75.9±0.5 | 71.5±1.3 | 67.7±1.1 | 45.9±1.2 | 45.6±1.2 |
| Average participating | 81.6±0.7 | 72.5±0.7 | 60.2±1.4 | 57.9±1.5 | 26.7±1.8 | 26.8±1.8 |
| Average all | 79.5±0.8 | 71.5±0.9 | 61.8±2.4 | 60.0±2.5 | 33.0±1.8 | 33.5±1.9 |
| | | | | | | |
| FedVarp ($250\times$ memory) | 61.5±25.8 | 59.5±24.8 | 10.0±0.0 | 10.0±0.0 | 12.7±3.4 | 12.8±3.5 |
| MIFA ($250\times$ memory) | 74.8±1.9 | 72.5±1.5 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 |
| Known participation statistics | 15.0±10.0 | 14.9±9.8 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 |

Note to the table. Total number of rounds is $10{,}000$. We do not enforce a minimum participation probability in these results, i.e., we allow any $p_n \geq 0.0$. We use the hyperparameters listed in Table C.4.1 for all the experiments. The same note in Table 1 also applies to this table.

Key Observations. The accuracy results are presented in Table D.6.1, from experiments with the CIFAR-10 dataset and $10{,}000$ rounds of FL. As expected, the performance of the majority of algorithms decreases as $\mathbb{E}[p_n]$ decreases. The minor increase in FedVarp's performance from the case of $\mathbb{E}[p_n]=0.01$ to $\mathbb{E}[p_n]=0.001$ is due to randomness in the experiments. We summarize the key findings from Table D.6.1 in the following.

It is interesting to see that the baseline algorithms that require additional memory or other information actually perform very poorly when the clients' participation rates are low. Note that an accuracy of $10\%$ corresponds to random guessing for the CIFAR-10 dataset, which has $10$ classes of images. The reason is that FedVarp and MIFA both perform variance reduction based on previous updates of clients. When clients participate rarely, it is likely that the saved updates are outdated, causing more distortion than benefit to parameter updates. For the case of known participation statistics, the aggregation weight of each client $n$ is chosen as $1/p_n$. When $p_n$ is very small, the aggregation weight becomes very large, which causes instability in the model training process.
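To make the instability of $1/p_n$ weighting concrete, the following minimal sketch computes the weight and the variance of the per-round scaling factor for a single client. It assumes independent Bernoulli participation, as in Table D.6.1: the factor $\mathbb{1}_n / p_n$ (where $\mathbb{1}_n$ indicates participation) has mean $1$ and variance $(1-p_n)/p_n$. The function names and the listed probabilities are illustrative assumptions, not part of the paper's code.

```python
def inverse_probability_weight(p):
    """Aggregation weight 1/p for a client with participation probability p."""
    if p <= 0.0:
        raise ValueError("1/p is undefined for p <= 0")
    return 1.0 / p


def scaling_factor_variance(p):
    """Variance of (1/p) * Bernoulli(p): the mean is 1, the variance is (1 - p) / p."""
    if p <= 0.0:
        raise ValueError("variance is undefined for p <= 0")
    return (1.0 - p) / p


# Illustrative participation probabilities (hypothetical single-client values).
probabilities = [0.1, 0.01, 0.001]
weights = [inverse_probability_weight(p) for p in probabilities]
variances = [scaling_factor_variance(p) for p in probabilities]

for p, w, v in zip(probabilities, weights, variances):
    print(f"p={p}: weight={w:.1f}, variance of scaling factor={v:.1f}")
```

Both quantities grow in proportion to $1/p_n$, so a client that participates once in a thousand rounds contributes a rare but very large update when it does appear.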

The average participating and average all baselines perform better than the FedVarp, MIFA, and known participation statistics baselines, because the aggregation weights used by the average participating and average all algorithms do not have much variation, which provides more stability in the case of low client participation rates.

FedAU (with $K=50$) gives the best performance, because the cutoff interval of length $K$ ensures that the aggregation weights are not too large, which provides stability in the training process. At the same time, clients that participate frequently still have lower aggregation weights, which balances the contributions of clients with different participation rates.
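The effect of the cutoff can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it assumes the weight of a client is estimated as the average length of its participation intervals, where an interval is closed either when the client participates or when it reaches $K$ rounds. The function name `estimate_weight` and the fallback value of $1$ before any interval completes are assumptions made for this example.

```python
import random


def estimate_weight(participation, cutoff):
    """Average participation-interval length, with each interval capped at `cutoff`.

    participation: iterable of 0/1 flags, one per round, for a single client.
    An interval is closed when the client participates or when it has lasted
    `cutoff` rounds, so no interval (and hence no weight) can exceed `cutoff`.
    """
    intervals = []
    current = 0
    for took_part in participation:
        current += 1
        if took_part or current == cutoff:
            intervals.append(current)
            current = 0
    if not intervals:
        return 1.0  # assumed fallback before the first interval completes
    return sum(intervals) / len(intervals)


K = 50
rounds = 10_000
rng = random.Random(0)

always = [1] * rounds                      # participates in every round
never = [0] * rounds                       # never participates
rare = [1 if rng.random() < 0.001 else 0 for _ in range(rounds)]

w_always = estimate_weight(always, K)
w_never = estimate_weight(never, K)
w_rare = estimate_weight(rare, K)
```

Under these assumptions, a client that always participates receives weight $1$, while the weight of a rarely participating client is bounded by $K$ instead of growing like $1/p_n$.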

Further Discussion. We further note that all the results in Table D.6.1 are from experiments using the hyperparameters listed in Table C.4.1. These near-optimal hyperparameters were found from a grid search (see Appendix C.4) when the clients participate according to a Bernoulli distribution with the statistics described in Appendix C.6. For the baseline methods that give random-guess (or close to random-guess) accuracies, it is possible that their performance can be slightly improved by choosing a much smaller learning rate, which may alleviate the impact of stale updates (for FedVarp and MIFA) or excessively large aggregation weights (for known participation statistics) when the client participation rate is low. However, it is impractical to fine-tune the learning rates depending on the participation rates, especially when the participation rates are unknown a priori. In addition, using very small learning rates generally slows down convergence, although the algorithm may converge in the end after a large number of rounds. The fact that FedAU gives the best performance compared to the baselines for a wide range of client participation rates, while keeping the learning rates unchanged (i.e., using the values in Table C.4.1), confirms its stability and usefulness in practice.