A Lightweight Method for Tackling Unknown Participation Statistics in Federated Averaging
Abstract
In federated learning (FL), clients usually have diverse participation statistics that are unknown a priori, which can significantly harm the performance of FL if not handled properly. Existing works that address this problem are usually based on global variance reduction, which requires a substantial amount of additional memory that grows by a multiplicative factor equal to the total number of clients. An important open problem is to find a lightweight method for FL in the presence of clients with unknown participation rates. In this paper, we address this problem by adapting the aggregation weights in federated averaging (FedAvg) based on the participation history of each client. We first show that, with heterogeneous participation statistics, FedAvg with non-optimal aggregation weights can diverge from the optimal solution of the original FL objective, indicating the need to find optimal aggregation weights. However, it is difficult to compute the optimal weights when the participation statistics are unknown. To address this problem, we present a new algorithm called FedAU, which improves FedAvg by adaptively weighting the client updates based on online estimates of the optimal weights without knowing the statistics of client participation. We provide a theoretical convergence analysis of FedAU using a novel methodology to connect the estimation error and convergence. Our theoretical results reveal important and interesting insights, while showing that FedAU converges to an optimal solution of the original objective and has desirable properties such as linear speedup. Our experimental results also verify the advantage of FedAU over baseline methods with various participation patterns.
1 Introduction
We consider the problem of finding that minimizes the distributed finite-sum objective:
(1) |
where each individual (local) objective is only computable at the client . This problem often arises in the context of federated learning (FL) (Kairouz et al., 2021, Li et al., 2020a, Yang et al., 2019), where is defined on client ’s local dataset, is the global objective, and is the parameter vector of the model being trained. Each client keeps its local dataset to itself, which is not shared with other clients or the server. It is possible to extend (1) to weighted average with positive coefficients multiplied to each , but for simplicity, we consider such coefficients to be included in (see Appendix A.1) and do not write them out.
Federated averaging (FedAvg) is a commonly used algorithm for minimizing (1), which alternates between local updates at each client and parameter aggregation among multiple clients with the help of a server (McMahan et al., 2017). However, there are several challenges in FedAvg, including data heterogeneity and partial participation of clients, which can cause performance degradation and even non-convergence if the FedAvg algorithm is improperly configured.
Unknown, Uncontrollable, and Heterogeneous Participation of Clients. Most existing works on FL with partial client participation assume that the clients participate according to a known or controllable random process (Karimireddy et al., 2020, Yang et al., 2021, Chen et al., 2022, Fraboni et al., 2021a, Li et al., 2020b; c). In practice, however, it is common for clients to have heterogeneous and time-varying computation power and network bandwidth, which depend on both the inherent characteristics of each client and other tasks that concurrently run in the system. This generally leads to heterogeneous participation statistics across clients, which are difficult to know a priori due to their complex dependency on various factors in the system (Wang et al., 2021). It is also generally impossible to fully control the participation statistics, due to the randomness of whether a client can successfully complete a round of model updates (Bonawitz et al., 2019).
The problem with heterogeneous and unknown participation statistics is that they may cause the result of FL to be biased towards certain local objectives, so that it diverges from the optimum of the original objective in (1). In FL, data heterogeneity across clients is a common phenomenon, resulting in diverse local objectives . The participation heterogeneity is often correlated with data heterogeneity, because the characteristics of different user populations may be correlated with how powerful their devices are. Intuitively, when some clients participate more frequently than others, the final FL result will favor the local objectives of those frequently participating clients, possibly discriminating against clients that participate less frequently.
A few recent works aiming at addressing this problem are based on the idea of global variance reduction by saving the most recent updates of all the clients, which requires a substantial amount of additional memory in the order of , i.e., the total number of clients times the dimension of the model parameter vector (Yang et al., 2022, Yan et al., 2020, Gu et al., 2021, Jhunjhunwala et al., 2022). This additional memory consumption is either incurred at the server or evenly distributed to all the clients. For practical FL systems with many clients, this causes unnecessary memory usage that affects the overall capability and performance of the system. Therefore, we ask the following important question in this paper:
Is there a lightweight method that provably minimizes the original objective in (1), when the participation statistics of clients are unknown, uncontrollable, and heterogeneous?
We leverage the insight that we can apply different weights to different clients’ updates in the parameter aggregation stage of FedAvg. If this is done properly, the effect of heterogeneous participation can be canceled out so that we can minimize (1), as shown in existing works that assume known participation statistics (Chen et al., 2022, Fraboni et al., 2021a, Li et al., 2020b; c). However, in our setting, we do not know the participation statistics a priori, which makes it challenging to compute (estimate) the optimal aggregation weights. It is also non-trivial to quantify the impact of estimation error on convergence.
Our Contributions. We thoroughly analyze this problem and make the following novel contributions.
1.
2. We propose a lightweight procedure for estimating the optimal aggregation weight at each client as part of the overall FL process, based on client ’s participation history. We name this new algorithm FedAU, which stands for FedAvg with adaptive weighting to support unknown participation statistics.
3. We analyze the convergence upper bound of FedAU, using a novel method that first obtains a weight error term in the convergence bound and then further bounds the weight error term via a bias-variance decomposition approach. Our result shows that FedAU converges to an optimal solution of the original objective (1). In addition, a desirable linear speedup of convergence with respect to the number of clients is achieved when the number of FL rounds is large enough.
4. We verify the advantage of FedAU in experiments with several datasets and baselines, with a variety of participation patterns including those that are independent, Markovian, and cyclic.
Related Work. Earlier works on FedAvg considered the convergence analysis with full client participation (Gorbunov et al., 2021, Haddadpour et al., 2019, Lin et al., 2020, Stich, 2019, Wang & Joshi, 2019; 2021, Yu et al., 2019, Malinovsky et al., 2023), which do not capture the fact that only a subset of clients participates in each round in practical FL systems. Recently, partial client participation has come to attention. Some works analyzed the convergence of FedAvg where the statistics or patterns of client participation are known or controllable (Fraboni et al., 2021a; b, Li et al., 2020c, Yang et al., 2021, Wang & Ji, 2022, Cho et al., 2023, Karimireddy et al., 2020, Li et al., 2020b, Chen et al., 2022, Rizk et al., 2022). However, as pointed out by Wang et al. (2021), Bonawitz et al. (2019), the participation of clients in FL can have complex dependencies on the underlying system characteristics, which makes it difficult to know or control each client’s behavior a priori. A recent work analyzed the convergence for a re-weighted objective (Patel et al., 2022), where the re-weighting is essentially arbitrary for unknown participation distributions. Some recent works (Yang et al., 2022, Yan et al., 2020, Gu et al., 2021, Jhunjhunwala et al., 2022) aimed at addressing this problem using variance reduction, by including the most recent local update of each client in the global update, even if they do not participate in the current round. These methods require a substantial amount of additional memory to store the clients’ local updates. In contrast, our work focuses on developing a lightweight algorithm that has virtually the same memory requirement as the standard FedAvg algorithm.
A related area is adaptive FL algorithms, where adaptive gradients (Reddi et al., 2021, Wang et al., 2022b; c) and adaptive local updates (Ruan et al., 2021, Wang et al., 2020) were studied. Some recent works viewed the adaptation of aggregation weights from different perspectives (Wu & Wang, 2021, Tan et al., 2022, Wang et al., 2022a), which do not address the problem of unknown participation statistics. All these methods are orthogonal to our work and can potentially work together with our algorithm. To the best of our knowledge, no prior work has studied weight adaptation in the presence of unknown participation statistics with provable convergence guarantees.
A unique aspect of our problem is that the statistics related to participation need to be collected across multiple FL rounds. Although Wang & Ji (2022) aimed at extracting a participation-specific term in the convergence bound, that approach still requires the aggregation weights in each round to sum to one (thus coordinated participation); it also requires an amplification procedure over multiple rounds for the bound to hold, making it difficult to tune the hyperparameters. In contrast, this paper considers uncontrolled and uncoordinated participation without sophisticated amplification mechanisms.
2 FedAvg with Pluggable Aggregation Weights
We begin by describing a generic FedAvg algorithm that includes a separate oracle for computing the aggregation weights, as shown in Algorithm 1. In this algorithm, there are a total of rounds, where each round includes steps of local stochastic gradient descent (SGD) at a participating client. For simplicity, we consider to be the same for all the clients, while noting that our algorithm and results can be extended to more general cases. We use and to denote the local and global step sizes, respectively. The variable is the initial model parameter, is an indicator function that is equal to one if client participates in round and zero otherwise, and is the stochastic gradient of the local objective for each client .
The main steps of Algorithm 1 are similar to those of standard FedAvg, but with a few notable items as follows. 1) In Line 1, we clearly state that we do not have prior knowledge of the sampling process of client participation. 2) Line 1 calls a separate oracle to compute the aggregation weight () for client in round . This computation is done on each client alone, without coordinating with other clients. We do not need to save the full sequence of participation record , because it is sufficient to save an aggregated metric of the participation record for weight computation. In Section 3, we will see that we use the average participation interval for weight computation in FedAU, where the average can be computed in an online manner. We also note that we do not include in the current round for computing the weight, which is needed for the convergence analysis so that is independent of the local parameter when the initial parameter of round (i.e., ) is given. 3) The parameter aggregation is weighted by for each client in Line 1.
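The structure described above can be sketched in a few lines of Python. This is a minimal illustration and not the paper's Algorithm 1: all names (`fedavg_pluggable`, `grad_fns`, `participates`, `compute_weight`) are hypothetical, and the global update rule (adding the weighted sum of participating clients' updates, scaled by the global step size and divided by the total number of clients) is an assumption. The sketch stores the full participation history only for simplicity; as noted above, an aggregated metric is sufficient in practice.

```python
def fedavg_pluggable(x0, grad_fns, participates, compute_weight,
                     rounds, local_steps, local_lr, global_lr):
    """FedAvg with a pluggable weight oracle (illustrative sketch).

    grad_fns[n](y)        -> stochastic gradient of client n at y (list of floats)
    participates(t, n)    -> 1 if client n participates in round t, else 0
    compute_weight(n, h)  -> aggregation weight from client n's history h
                             (h contains only rounds before the current one)
    """
    num_clients = len(grad_fns)
    x = list(x0)
    history = [[] for _ in range(num_clients)]  # kept only for simplicity
    for t in range(rounds):
        aggregate = [0.0] * len(x)
        for n in range(num_clients):
            # The weight is computed before the current round is recorded.
            weight = compute_weight(n, history[n])
            took_part = participates(t, n)
            if took_part:
                y = list(x)  # start from the latest global parameter
                for _ in range(local_steps):
                    grad = grad_fns[n](y)
                    y = [yi - local_lr * gi for yi, gi in zip(y, grad)]
                delta = [yi - xi for yi, xi in zip(y, x)]
                aggregate = [a + weight * d for a, d in zip(aggregate, delta)]
            history[n].append(took_part)
        # Assumed aggregation rule: divide by the total number of clients.
        x = [xi + global_lr * a / num_clients for xi, a in zip(x, aggregate)]
    return x
```

With full participation and unit weights, this reduces to plain FedAvg; the weight oracle is the only part that a method such as FedAU would replace.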
Objective Inconsistency with Improper Aggregation Weights. We first show that without weight adaptation, FedAvg minimizes an alternative objective that is generally different from (1).
Theorem 1 (Objective minimized at convergence, informal).
When and the weights are time-constant, i.e., but generally may not be equal to (), with properly chosen learning rates and and some other assumptions, Algorithm 1 minimizes the following objective:
(2) |
where .
A formal version of the theorem is given in Appendix B.4. Theorem 1 shows that, even in the special case where each client participates according to a Bernoulli distribution with probability , choosing a constant aggregation weight such as as in standard FedAvg causes the algorithm to converge to a different objective that is weighted by . As mentioned earlier, this implicit weighting discriminates against clients that participate less frequently. In addition, since the participation statistics (here, the probabilities ) of clients are unknown, the exact objective being minimized is also unknown, and it is generally unreasonable to minimize an unknown objective. This means that it is important to design an adaptive method to find the aggregation weights, so that we can minimize (1) even when the participation statistics are unknown, which is our focus in this paper.
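As a toy illustration of this inconsistency (not taken from the paper), consider scalar quadratic local objectives whose minimizers are given by a hypothetical list `centers`. The sketch assumes that the implicitly minimized objective weights each local objective in proportion to its participation probability times its aggregation weight, which is how we read the discussion above. Under that assumption, uniform weights pull the solution towards frequently participating clients, while weights inversely proportional to the participation probabilities recover the plain average.

```python
def weighted_minimizer(centers, probs, weights):
    """Minimizer of sum_n weights[n] * probs[n] * 0.5 * (x - centers[n])**2."""
    coeffs = [w * p for w, p in zip(weights, probs)]
    return sum(c * m for c, m in zip(coeffs, centers)) / sum(coeffs)


centers = [0.0, 10.0]   # local minimizers of two clients (illustrative values)
probs = [0.9, 0.1]      # participation probabilities (illustrative values)

# Constant weights: the frequent participant dominates the solution.
biased = weighted_minimizer(centers, probs, [1.0, 1.0])

# Weights inversely proportional to participation probability:
# the solution matches the unweighted average of the local minimizers.
debiased = weighted_minimizer(centers, probs, [1.0 / p for p in probs])

print(biased, debiased)
```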
The full proofs of all mathematical claims are in Appendix B.
3 FedAU: Estimation of Optimal Aggregation Weights
In this section, we describe the computation of aggregation weights based on the participation history observed at each client, which is the core of our FedAU algorithm that extends FedAvg. Our goal is to choose to minimize the original objective (1) as closely as possible.
Intuition. We build from the intuition in Theorem 1 and design an aggregation weight adaptation algorithm that works for general participation patterns, i.e., not limited to the Bernoulli distribution considered in Theorem 1. From (2), we see that if we can choose , the objective being minimized is the same as (1). We note that for each client when is large, due to ergodicity of the Bernoulli distribution considered in Theorem 1. Extending to general participation patterns that are not limited to the Bernoulli distribution, intuitively, we would like to choose the weight to be inversely proportional to the average frequency of participation. In this way, the bias caused by lower participation frequency is “canceled out” by the higher weight used in aggregation. Based on this intuition, our goal of aggregation weight estimation is as follows.
Problem 1 (Goal of Weight Estimation, informal).
Choose so that its long-term average (i.e., for large ) is close to , for each .
Some previous works have discovered this need of debiasing the skewness of client participation (Li et al., 2020c, Perazzone et al., 2022) or designing the client sampling scheme to ensure that the updates are unbiased (Fraboni et al., 2021a, Li et al., 2020b). However, in our work, we consider the more realistic case where the participation statistics are unknown, uncontrollable, and heterogeneous. In this case, we are unable to directly find the optimal aggregation weights because we do not know the participation statistics a priori.
Technical Challenge. If we were to know the participation pattern for all the rounds, an immediate solution to Problem 1 is to choose (for each client ) to be equal to divided by the number of rounds where client participates. We can see that this solution is equal to the average interval between every two adjacent participating rounds, assuming that the first interval starts right before the first round . However, since we do not know the future participation pattern or statistics in each round , we cannot directly apply this solution. In other words, in every round , we need to perform an online estimation of the weight based on the participation history up to round .
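The equivalence between the two views of the offline solution can be checked on a small example. The sketch below uses hypothetical helper names and a made-up participation record; it assumes, as stated above, that the first interval starts right before the first round. Note that the two quantities coincide exactly only when the record ends with a participating round, since trailing non-participating rounds do not close an interval.

```python
def offline_weight(record):
    """Number of rounds divided by the number of participating rounds."""
    return len(record) / sum(record)


def average_interval(record):
    """Average length of the intervals that end with a participating round."""
    intervals, length = [], 0
    for took_part in record:
        length += 1
        if took_part:
            intervals.append(length)
            length = 0
    return sum(intervals) / len(intervals)


record = [0, 1, 0, 0, 1, 1]  # 1 = participated, 0 = did not participate
print(offline_weight(record), average_interval(record))
```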
A challenge in this online setting is that the estimation accuracy is related to the number of times each client has participated until round . When is small and client has not yet participated in any of the preceding rounds, we do not have any information about how to choose . For an intermediate value of where client has participated only in a few rounds, we have limited information about the choice of . In this case, if we directly use the average participation interval up to the -th round, the resulting can be far from its optimal value, i.e., the estimation has a high variance if the client participation follows a random process. This is problematic especially when there exists a long interval between two rounds (both before the -th round) where the client participates. Although the probability of the occurrence of such a long interval is usually low, when it occurs, it results in a long average interval for the first rounds when is relatively small, and using this long average interval as the value of may cause instability to the training process.
Key Idea. To overcome this challenge, we define a positive integer as a “cutoff” interval length. If a client has not participated for rounds, we consider to be a participation interval that we sample and start a new interval thereafter. In this way, we can limit the length of each interval by adjusting . By setting to be the average of this possibly cutoff participation interval, we overcome the aforementioned challenge. From a theoretical perspective, we note that will be a biased estimation when and the bias will be larger when is smaller. In contrast, a smaller leads to a smaller variance of , because we collect more samples in the computation of with a smaller . Therefore, an insight here is that controls the bias-variance tradeoff of the aggregation weight . (Note that we focus on the aggregation weights here, which is different from the classical concept of the bias-variance tradeoff of the model.) In Section 4, we will formally show this property and obtain desirable convergence properties of the weight error term and the overall objective function (1), by properly choosing in the theoretical analysis. Our experimental results in Section 5 also confirm that choosing an appropriate value of improves the performance in most cases.
Online Algorithm. Based on the above insight, we describe the procedure of computing the aggregation weights , as shown in Algorithm 2. The computation is independent for each client . In this algorithm, the variable denotes the number of (possibly cutoff) participation intervals that have been collected, and denotes the length of the last interval that is being computed. We compute the interval by incrementing by one in every round, until the condition in Line 2 holds. When this condition holds, is the actual length of the latest participation interval with possible cutoff. As explained above, we always start a new interval when reaches . Also note that we consider instead of in this condition and start the loop from in Line 2, to align with the requirement in Algorithm 1 that the weights are computed from the participation records before (not including) the current round . For , we always use . In Line 2, we compute the weight using an online averaging method, which is equivalent to averaging over all the participation intervals that have been observed until each round . With this method, we do not need to save all the previous participation intervals. Essentially, the computation in each round only requires three state variables that are scalars, including , , and the previous round’s weight . This makes this algorithm extremely memory efficient.
In the full FedAU algorithm, we plug in the result of for each round obtained from Algorithm 2 into Line 1 of Algorithm 1. In other words, ComputeWeight in Algorithm 1 calls one step of update that includes Lines 2–2 of Algorithm 2.
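The online procedure described above can be sketched as follows for a single client. This is an illustrative reading of the description, not a transcription of Algorithm 2: the names `fedau_weights`, `cutoff`, `num_intervals`, and `current_length` are hypothetical, and the initial weight of 1.0 for the first round is an assumption. The running mean uses only three scalars, matching the memory argument above.

```python
def fedau_weights(participation, cutoff, initial_weight=1.0):
    """Per-round aggregation weights for one client (illustrative sketch).

    participation[t] is 1 if the client participated in round t, else 0.
    The weight returned for round t depends only on rounds 0..t-1.
    """
    weight = initial_weight     # assumed value used before any interval closes
    num_intervals = 0           # number of (possibly cutoff) intervals so far
    current_length = 0          # length of the interval currently in progress
    weights = [weight]          # weight for round 0
    for t in range(1, len(participation)):
        current_length += 1
        # Close the interval on participation in the previous round,
        # or when the interval reaches the cutoff length.
        if participation[t - 1] == 1 or current_length == cutoff:
            # Online mean over all intervals observed so far.
            weight = (num_intervals * weight + current_length) / (num_intervals + 1)
            num_intervals += 1
            current_length = 0
        weights.append(weight)
    return weights


record = [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1]
print(fedau_weights(record, cutoff=4))
```

In this example the long gap near the end of the record is split by the cutoff, so it contributes an interval of length 4 instead of a single interval of length 6.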
Compatibility with Privacy-Preserving Mechanisms. In our FedAU algorithm, the aggregation weight computation (Algorithm 2) is done individually at each client, which only uses the client’s participation states and does not use the training dataset or the model. When using these aggregation weights as part of FedAvg in Algorithm 1, the weight can be multiplied with the parameter update at each client (and in each round ) before the update is transmitted to the server. In this way, methods such as secure aggregation (Bonawitz et al., 2017) can be applied directly, since the server only needs to compute a sum of the participating clients’ updates. Differentially private FedAvg methods (McMahan et al., 2018, Andrew et al., 2021) can be applied in a similar way.
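A minimal sketch of this pre-scaling idea is given below. The function names are hypothetical and the plain summation stands in for a secure aggregation protocol, which is not implemented here. The point is only that, once each client scales its own update, the server never needs to see individual weights.

```python
def client_message(update, weight):
    """Client side: scale the local update by the locally computed weight."""
    return [weight * u for u in update]


def server_sum(messages):
    """Server side: a plain coordinate-wise sum of the received messages."""
    return [sum(values) for values in zip(*messages)]


updates = [[1.0, 2.0], [3.0, 4.0]]   # local updates of two participating clients
weights = [2.0, 0.5]                 # their locally computed aggregation weights
messages = [client_message(u, w) for u, w in zip(updates, weights)]
print(server_sum(messages))
```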
Practical Implementation. We will see from our experimental results in Section 5 that a coarsely chosen value of gives a reasonably good performance in practice, which means that we do not need to fine-tune . There are also other engineering tweaks that can be made in practice, such as using an exponentially weighted average in Line 2 of Algorithm 2 to put more emphasis on the recent participation characteristics of clients. In an extreme case where each client participates only once, a possible solution is to group clients that have similar computation power (e.g., same brand/model of devices) and are in similar geographical locations together. They may share the same state variables , , and used for weight computation in Algorithm 2. We note that according to the lower bound derived by Yang et al. (2022), if each client participates only once, it is impossible to have an algorithm to converge to the original objective without sharing additional information.
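One possible form of the exponentially weighted variant mentioned above is sketched here. The paper does not specify this update, so the function name, the `smoothing` parameter, and its default value are all assumptions made for illustration; the update would replace the running mean whenever a participation interval is closed.

```python
def ema_weight(prev_weight, interval_length, num_intervals, smoothing=0.2):
    """Exponentially weighted update of the aggregation weight (sketch).

    The first observed interval initializes the weight; later intervals are
    blended in, so recent participation behavior has more influence.
    """
    if num_intervals == 0:
        return float(interval_length)
    return (1 - smoothing) * prev_weight + smoothing * interval_length
```

A larger `smoothing` value reacts faster to changes in participation behavior, at the cost of a noisier weight.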
4 Convergence Analysis
Assumption 1.
The local objective functions are -smooth, such that
(3) |
Assumption 2.
The local stochastic gradients are unbiased with bounded variance, such that
(4) |
In addition, the stochastic gradient noise is independent across different rounds (indexed by ), clients (indexed by ), and local update steps (indexed by ).
Assumption 3.
The divergence between local and global gradients is bounded, such that
(5) |
Assumption 4.
The client participation random variable is independent across different and . It is also independent of the stochastic gradient noise. For each client , we define such that , i.e., , where the value of is unknown to the system a priori.
Assumptions 1–3 are commonly used in the literature for the convergence analysis of FL algorithms (Yang et al., 2021, Wang & Ji, 2022, Cho et al., 2023). Our consideration of independent participation across clients in Assumption 4 is more realistic than the conventional setting of sampling among all the clients with or without replacement (Li et al., 2020c, Yang et al., 2021), because it is difficult to coordinate the participation across a large number of clients in practical FL systems.
Challenge in Analyzing Time-Dependent Participation. Regarding the assumption on the independence of across time (round) in Assumption 4, the challenge in analyzing the more general time-dependent participation is due to the complex interplay between the randomness in stochastic gradient noise, participation identities , and estimated aggregation weights . In particular, the first step in our proof of the general descent lemma (see Appendix B.3, the specific step is in (B.3.6)) would not hold if is dependent on the past, because the past information is contained in and that are conditions of the expectation. We emphasize that this is a purely theoretical limitation, and this time-independence of client participation has been assumed in the majority of works on FL with client sampling (Fraboni et al., 2021a; b, Karimireddy et al., 2020, Li et al., 2020b; c, Yang et al., 2021). The novelty in our analysis is that we consider the true values of to be unknown to the system. Our experimental results in Section 5 show that FedAU provides performance gains also for Markovian and cyclic participation patterns that are both time-dependent.
Assumption 5.
We assume that either of the following holds and define accordingly.
- Option 1: Nearly optimal weights. Under the assumption that for all , we define .
- Option 2: Bounded global gradient. Under the assumption that for any , we define .
Assumption 5 is only needed for Theorem 2 (stated below) and not for Theorem 1. Here, the bounded global gradient assumption is a relaxed variant of the bounded stochastic gradient assumption commonly used in adaptive gradient algorithms (Reddi et al., 2021, Wang et al., 2022b; c). Although focusing on very different problems, our FedAU method shares some similarities with adaptive gradient methods in the sense that we both adapt the weights used in model updates, where the adaptation is dependent on some parameters that progressively change during the training process; see Appendix A.2 for some further discussion. For the nearly optimal weights assumption, we can see that it holds if , which means tolerating a relative error of from the optimal weight . Theorem 2 holds under either of these two additional assumptions.
Main Results. We now present our main results, starting with the convergence of Algorithm 1 with arbitrary (but given) weights with respect to (w.r.t.) the original objective function in (1).
Theorem 2 (Convergence error w.r.t. (1)).
The proof of Theorem 2 includes a novel step to obtain (ignoring the other constants), referred to as the weight error term, that characterizes how the aggregation weights affect the convergence. Next, we focus on obtained from Algorithm 2.
Theorem 3 (Bounding the weight error term).
For obtained from Algorithm 2, when ,
(7) |
The proof of Theorem 3 is based on analyzing the unique statistical properties of the possibly cutoff participation interval obtained in Algorithm 2. The first term of the bound in (7) is related to the variance of . This term increases linearly in , because when gets larger, the minimum number of samples of that are used for computing gets smaller, thus the variance upper bound becomes larger. The second term of the bound in (7) is related to the bias of , which measures how far departs from the desired quantity of . Since , this term decreases exponentially in . This result confirms the bias-variance tradeoff of that we mentioned earlier.
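To make the bias term concrete, here is a small sketch under the Bernoulli participation model of Assumption 4. It uses our own notation, not the paper's: a participation probability `p` and a cutoff length `cutoff`. It relies on the standard identity that a geometric waiting time truncated at `cutoff` has mean equal to the sum of `(1 - p)**k` for `k` from 0 to `cutoff - 1`. The gap to the untruncated mean `1 / p` then equals `(1 - p)**cutoff / p`, which shrinks exponentially in the cutoff, in line with the discussion above.

```python
def expected_cutoff_interval(p, cutoff):
    """Mean of min(G, cutoff), where G is geometric with success probability p."""
    return sum((1 - p) ** k for k in range(cutoff))


def cutoff_bias(p, cutoff):
    """Gap between the untruncated mean 1/p and the truncated mean."""
    return 1 / p - expected_cutoff_interval(p, cutoff)


for cutoff in (1, 5, 20, 50):
    print(cutoff, round(cutoff_bias(0.1, cutoff), 4))
```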
Corollary 4 (Convergence of FedAU).
The result in Corollary 4 is the convergence upper bound of the full FedAU algorithm. Its proof involves further bounding (7) in Theorem 3, when choosing , and plugging back the result along with the values of and into Theorem 2. It shows that, with properly estimated aggregation weights using Algorithm 2, the error approaches zero as , although the actual participation statistics are unknown. The first two terms of the bound in (8) dominate when is large enough, which are related to the stochastic gradient variance and gradient divergence . The error caused by the fact that is unknown is captured by the third term of the bound in (8), which has an order of . We also see that, as long as we maintain to be large enough so that the first two terms of the bound in (8) dominate, we can achieve the desirable property of linear speedup in . This means that we can keep the same convergence error by increasing the number of clients () and decreasing the number of rounds (), to the extent that remains large enough. Our result also recovers existing convergence bounds for FedAvg in the case of known participation probabilities (Karimireddy et al., 2020, Yang et al., 2021); see Appendix A.3 for details.
5 Experiments
Participation pattern | Method | SVHN Train | SVHN Test | CIFAR-10 Train | CIFAR-10 Test | CIFAR-100 Train | CIFAR-100 Test | CINIC-10 Train | CINIC-10 Test
Bernoulli | FedAU (ours, ) | 90.4±0.5 | 89.3±0.5 | 85.4±0.4 | 77.1±0.4 | 63.4±0.6 | 52.3±0.4 | 65.2±0.5 | 61.5±0.4
Bernoulli | FedAU (ours, ) | 90.6±0.4 | 89.6±0.4 | 86.0±0.5 | 77.3±0.3 | 63.8±0.3 | 52.1±0.6 | 66.7±0.3 | 62.7±0.2
Bernoulli | Average participating | 89.1±0.3 | 87.2±0.3 | 83.5±0.9 | 74.1±0.8 | 59.3±0.4 | 48.8±0.7 | 61.1±2.3 | 56.6±2.0
Bernoulli | Average all | 88.5±0.5 | 87.0±0.3 | 81.0±0.9 | 72.7±0.9 | 58.2±0.4 | 47.9±0.5 | 60.5±2.3 | 56.2±2.0
Bernoulli | FedVarp ( memory) | 89.6±0.5 | 88.9±0.5 | 84.2±0.3 | 77.9±0.2 | 57.2±0.9 | 49.2±0.8 | 64.4±0.6 | 62.0±0.5
Bernoulli | MIFA ( memory) | 89.4±0.3 | 88.7±0.2 | 83.5±0.6 | 77.5±0.3 | 55.8±1.1 | 48.4±0.7 | 63.8±0.7 | 61.5±0.5
Bernoulli | Known participation statistics | 89.2±0.5 | 88.4±0.5 | 84.3±0.5 | 77.0±0.5 | 59.4±0.7 | 50.6±0.4 | 63.2±0.6 | 60.5±0.5
Markovian | FedAU (ours, ) | 90.5±0.4 | 89.3±0.4 | 85.3±0.3 | 77.1±0.3 | 63.2±0.5 | 51.8±0.3 | 64.9±0.3 | 61.2±0.2
Markovian | FedAU (ours, ) | 90.6±0.3 | 89.5±0.3 | 85.9±0.5 | 77.2±0.3 | 63.5±0.4 | 51.7±0.3 | 66.3±0.4 | 62.3±0.2
Markovian | Average participating | 89.0±0.3 | 87.1±0.2 | 83.4±0.9 | 74.2±0.7 | 59.2±0.4 | 48.6±0.4 | 61.5±2.3 | 56.9±1.9
Markovian | Average all | 88.4±0.6 | 86.8±0.7 | 80.8±1.0 | 72.5±0.5 | 57.8±0.9 | 47.7±0.5 | 59.9±2.8 | 55.7±2.2
Markovian | FedVarp ( memory) | 89.6±0.3 | 88.6±0.2 | 84.0±0.3 | 77.8±0.2 | 56.4±1.1 | 48.8±0.5 | 64.6±0.4 | 62.1±0.4
Markovian | MIFA ( memory) | 89.1±0.3 | 88.4±0.2 | 83.0±0.4 | 77.2±0.4 | 55.1±1.2 | 48.1±0.6 | 63.5±0.7 | 61.2±0.6
Markovian | Known participation statistics | 89.5±0.2 | 88.6±0.2 | 84.5±0.4 | 76.9±0.3 | 59.7±0.5 | 50.3±0.5 | 63.5±0.9 | 60.7±0.6
Cyclic | FedAU (ours, ) | 89.8±0.6 | 88.7±0.6 | 84.2±0.8 | 76.3±0.7 | 60.9±0.6 | 50.6±0.3 | 63.5±1.0 | 60.0±0.8
Cyclic | FedAU (ours, ) | 89.9±0.6 | 88.8±0.6 | 84.8±0.6 | 76.6±0.4 | 61.3±0.8 | 51.0±0.5 | 64.5±0.9 | 60.9±0.7
Cyclic | Average participating | 87.4±0.5 | 85.5±0.7 | 81.6±1.2 | 73.3±0.8 | 58.1±1.0 | 48.3±0.8 | 58.9±2.1 | 55.0±1.6
Cyclic | Average all | 89.1±0.8 | 87.4±0.8 | 83.1±1.0 | 73.8±0.8 | 59.7±0.3 | 48.8±0.4 | 62.9±1.7 | 57.6±1.5
Cyclic | FedVarp ( memory) | 84.8±0.5 | 83.9±0.6 | 79.7±0.9 | 75.3±0.7 | 50.9±0.5 | 45.9±0.4 | 60.4±0.7 | 58.5±0.6
Cyclic | MIFA ( memory) | 78.6±1.2 | 77.4±1.1 | 73.0±1.3 | 70.6±1.1 | 44.8±0.6 | 41.1±0.6 | 51.2±1.0 | 50.2±0.9
Cyclic | Known participation statistics | 89.9±0.7 | 88.7±0.6 | 83.6±0.7 | 76.1±0.5 | 60.2±0.4 | 50.8±0.4 | 62.6±0.8 | 59.8±0.7
Note to the table. The top part of the sub-table for each participation pattern includes our method and baselines in the same setting. The bottom part of each sub-table includes baselines that require either additional memory or known participation statistics. For each column, the best values in the top and bottom parts are highlighted with bold and underline, respectively. The total number of rounds is for SVHN; for CIFAR-10 and CINIC-10; for CIFAR-100. The mean and standard deviation values shown in the table are computed over experiments with different random seeds, for the average accuracy over the last rounds (measured at an interval of rounds).
We evaluate the performance of FedAU in experiments. More experimental setup details, including the link to the code, and results are in Appendices C and D, respectively.
Datasets, Models, and System. We consider four image classification tasks, with datasets including SVHN (Netzer et al., 2011), CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), and CINIC-10 (Darlow et al., 2018), where CIFAR-100 has classes (labels) while the other datasets have classes. We use FL to train convolutional neural network (CNN) models of slightly different architectures for these tasks. We simulate an FL system that includes a total of clients, where each has its own participation pattern.
Heterogeneity. Similar to existing works (Hsu et al., 2019, Reddi et al., 2021), we use a Dirichlet distribution with parameter to generate the class distribution of each client’s data, for a setup with non-IID data across clients. Here, specifies the degree of data heterogeneity, where a smaller indicates a more heterogeneous data distribution. In addition, to simulate the correlation between data distribution and client participation frequency as motivated in Section 1, we generate a class-wide participation probability distribution that follows a Dirichlet distribution with parameter . Here, specifies the degree of participation heterogeneity, where a smaller indicates more heterogeneous participation across clients. We generate client participation patterns following a random process that is either Bernoulli (independent), Markovian, or cyclic, and study the performance of these types of participation patterns in different experiments. The participation patterns have a stationary probability , for each client , that is generated according to a combination of the two aforementioned Dirichlet distributions, and the details are explained in Appendix C.6. We enforce the minimum , , to be in the main experiments, which is relaxed later. This generative approach creates an experimental scenario with non-IID client participation, while our FedAU algorithm and most baselines still do not know the actual participation statistics.
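The Dirichlet-based generation of per-client class distributions can be sketched with the standard library alone, using the fact that normalized independent Gamma draws follow a symmetric Dirichlet distribution. The function name, the concentration value, and the numbers of clients and classes below are placeholders chosen for illustration; they are not the paper's settings, and the combination with the participation-side Dirichlet distribution (Appendix C.6) is not reproduced here.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension dim."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)          # fixed seed for reproducibility
num_clients, num_classes = 5, 10  # placeholder sizes
# One class distribution per client; a smaller alpha gives more skewed clients.
class_dists = [sample_dirichlet(0.5, num_classes, rng) for _ in range(num_clients)]
print([round(q, 3) for q in class_dists[0]])
```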
Baselines. We compare our FedAU algorithm with several baselines. The first set of baselines includes algorithms that compute an average of parameters over either all the participating clients (average participating) or all the clients (average all) in the aggregation stage of each round, where the latter case includes updates of non-participating clients that are equal to zero as part of averaging. These two baselines encompass most existing FedAvg implementations (e.g., Yang et al. (2021), McMahan et al. (2017), Patel et al. (2022)) that do not address the bias caused by heterogeneous participation statistics. They do not require additional memory or knowledge, thus they work under the same system assumptions as FedAU. The second set of baselines has algorithms that require extra resources or information, including FedVarp (Jhunjhunwala et al., 2022) and MIFA (Gu et al., 2021), which require times of memory, and an idealized baseline that assumes known participation statistics and weighs the clients’ contributions using the reciprocal of the stationary participation probability. For each baseline, we performed a separate grid search to find the best and .
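To make the difference between the aggregation rules concrete, the sketch below implements the two averaging baselines and a generic weighted rule. It assumes that a weighted rule takes the form of a sum of weighted participating updates divided by the total number of clients; the function names and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def aggregate_average_participating(updates, participated):
    """Average the updates of the participating clients only."""
    mask = np.asarray(participated, dtype=bool)
    if not mask.any():
        return np.zeros(updates.shape[1])
    return updates[mask].mean(axis=0)


def aggregate_average_all(updates, participated):
    """Average over all clients, treating non-participants as zero updates."""
    mask = np.asarray(participated, dtype=float)
    return (updates * mask[:, None]).sum(axis=0) / len(mask)


def aggregate_weighted(updates, participated, weights):
    """Assumed weighted rule: scale each participating update, divide by the client count."""
    mask = np.asarray(participated, dtype=float)
    scale = mask * np.asarray(weights, dtype=float)
    return (updates * scale[:, None]).sum(axis=0) / len(mask)


# Toy example with four clients and two-dimensional updates (hypothetical values).
updates = np.array([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]])
participated = [1, 0, 1, 0]
probs = np.array([0.5, 0.5, 0.5, 0.5])  # participation probabilities, assumed known here

avg_participating = aggregate_average_participating(updates, participated)
avg_all = aggregate_average_all(updates, participated)
known_stats = aggregate_weighted(updates, participated, 1.0 / probs)
print(avg_participating, avg_all, known_stats)
```

In this sketch, the idealized baseline corresponds to passing the reciprocal of the known participation probability as the weight, while an adaptive method would pass estimated weights instead.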
Results. The main results are shown in Table 1, where we choose for FedAU with finite based on a simple rule-of-thumb without detailed search. Our general observation is that FedAU provides the highest accuracy compared to almost all the baselines, including those that require additional memory and known participation statistics, except for the test accuracy on the CIFAR-10 dataset where FedVarp performs the best. Choosing generally gives a better performance than choosing for FedAU, which aligns with our discussion in Section 3.
The reason that FedAU can perform better than FedVarp and MIFA is that these baselines keep historical local updates, which may be outdated when some clients participate infrequently. Updating the global model parameter with outdated local updates can lead to slow convergence, which is similar to the consequence of having stale updates in asynchronous SGD (Recht et al., 2011). In contrast, at the beginning of each round, participating clients in FedAU always start with the latest global parameter obtained from the server. This avoids stale updates, and we compensate for heterogeneous participation statistics by adapting the aggregation weights, which is a fundamentally different and more efficient method compared to tracking historical updates as in FedVarp and MIFA.
It is surprising that FedAU even performs better than the case with known participation statistics. To understand this phenomenon, we point out that in the case of Bernoulli-distributed participation with very low probability (e.g., ), the empirical probability of a sample path of a client’s participation can diverge significantly from . For rounds, the standard deviation of the total number of participated rounds is while the mean is . Considering the range within , we know that the optimal participation weight when seen on the empirical probability ranges from to , while the optimal weight computed on the model-based probability is . Our FedAU algorithm computes the aggregation weights from the actual participation sample path of each client, which captures the actual client behavior and empirically performs better than using even if is known. Some experimental results that further explain this phenomenon are in Appendix D.4.
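The arithmetic behind this argument can be reproduced with a short calculation. The sketch below uses the standard binomial mean and standard deviation for the number of participating rounds; the specific probability, number of rounds, and the width of the range (two standard deviations) are illustrative assumptions and are not claimed to be the values used in the paper.

```python
import math


def empirical_weight_range(p, num_rounds, num_std=2.0):
    """Range of the reciprocal empirical participation frequency for a Bernoulli client."""
    mean_count = num_rounds * p
    std_count = math.sqrt(num_rounds * p * (1.0 - p))
    low_count = mean_count - num_std * std_count
    high_count = mean_count + num_std * std_count
    if low_count <= 0:
        raise ValueError("range includes zero participation; weight is unbounded")
    return {
        "mean_count": mean_count,
        "std_count": std_count,
        "weight_low": num_rounds / high_count,   # frequent sample path, smaller weight
        "weight_high": num_rounds / low_count,   # infrequent sample path, larger weight
        "weight_model": 1.0 / p,                 # weight from the model-based probability
    }


# Hypothetical values for illustration.
result = empirical_weight_range(p=0.02, num_rounds=10_000)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The gap between the two ends of the range shows why a weight computed from the realized sample path can differ noticeably from the reciprocal of the stationary probability.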
As mentioned earlier, we lower-bounded , , by for the main results. Next, we consider different lower bounds of , where a smaller lower bound of means that there exist clients that participate less frequently. The performance of FedAU with different choices of and different lower bounds of is shown in Figure 1. We observe that choosing always gives the best performance; the performance remains similar even when the lower bound of is small and there exist some clients that participate very infrequently. However, choosing a large (e.g., ) significantly deteriorates the performance when the lower bound of is small. This means that a finite cutoff interval of an intermediate value (i.e., in our experiments) for aggregation weight estimation, a feature unique to FedAU, is essential, especially when some clients participate very infrequently.
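The role of the cutoff interval can be illustrated with a small online estimator. This is a sketch under stated assumptions, not a transcription of the paper's Algorithm 2: it assumes that the weight is the running average of observed participation intervals, that an interval is closed either when the client participates or when it reaches the cutoff length, and that the weight equals one before the first interval is recorded. The function and variable names are hypothetical.

```python
import math


def estimate_weights(participation, cutoff=math.inf):
    """Return the weight estimate after each round for a single client.

    participation: iterable of 0/1 indicators, one per round.
    cutoff: maximum interval length; math.inf disables the cutoff.
    """
    weight = 1.0        # initial weight before any interval sample exists
    num_samples = 0     # number of interval samples recorded so far
    interval = 0        # rounds elapsed since the last recorded sample
    history = []
    for took_part in participation:
        interval += 1
        if took_part or interval >= cutoff:
            # Running average of interval lengths, including the new sample.
            weight = (num_samples * weight + interval) / (num_samples + 1)
            num_samples += 1
            interval = 0
        history.append(weight)
    return history


pattern = [0, 0, 1, 0, 1, 1]
no_cutoff = estimate_weights(pattern)
with_cutoff = estimate_weights(pattern, cutoff=2)
print(no_cutoff)
print(with_cutoff)
```

Without a cutoff, the estimate stays at its initial value until the first participation, which can take very long for rarely participating clients. With a finite cutoff, a (biased but lower-variance) sample is recorded after at most the cutoff number of rounds.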
6 Conclusion
In this paper, we have studied the challenging practical FL scenario of having unknown participation statistics of clients. To address this problem, we have considered the adaptation of aggregation weights based on the participation history observed at each individual client. Using a new consideration of the bias-variance tradeoff of the aggregation weight, we have obtained the FedAU algorithm. Our analytical methodology includes a unique decomposition which yields a separate weight error term that is further bounded to obtain the convergence upper bound of FedAU. Experimental results have confirmed the advantage of FedAU with several client participation patterns. Future work can study the convergence analysis of FedAU with more general participation processes and the incorporation of aggregation weight adaptation into other types of FL algorithms.
Acknowledgment
The work of M. Ji was supported by the National Science Foundation (NSF) CAREER Award 2145835.
References
- Andrew et al. (2021) Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34:17455–17466, 2021.
- Bonawitz et al. (2017) Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191, 2017.
- Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proceedings of Machine Learning and Systems, volume 1, pp. 374–388, 2019.
- Chen et al. (2022) Wenlin Chen, Samuel Horváth, and Peter Richtárik. Optimal client sampling for federated learning. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
- Cho et al. (2023) Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, and Tong Zhang. On the convergence of federated averaging with cyclic client participation. arXiv preprint arXiv:2302.03109, 2023.
- Darlow et al. (2018) Luke N Darlow, Elliot J Crowley, Antreas Antoniou, and Amos J Storkey. CINIC-10 is not Imagenet or CIFAR-10. arXiv preprint arXiv:1810.03505, 2018.
- Ding et al. (2020) Yucheng Ding, Chaoyue Niu, Yikai Yan, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, and Rongfei Jia. Distributed optimization over block-cyclic data. arXiv preprint arXiv:2002.07454, 2020.
- Eichner et al. (2019) Hubert Eichner, Tomer Koren, Brendan McMahan, Nathan Srebro, and Kunal Talwar. Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp. 1764–1773. PMLR, 2019.
- Fraboni et al. (2021a) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. Clustered sampling: Low-variance and improved representativity for clients selection in federated learning. In International Conference on Machine Learning, volume 139, pp. 3407–3416. PMLR, Jul. 2021a.
- Fraboni et al. (2021b) Yann Fraboni, Richard Vidal, Laetitia Kameni, and Marco Lorenzi. On the impact of client sampling on federated learning convergence. arXiv preprint arXiv:2107.12211, 2021b.
- Gorbunov et al. (2021) Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, volume 130 of PMLR, pp. 3556–3564, 2021.
- Gu et al. (2021) Xinran Gu, Kaixuan Huang, Jingzhao Zhang, and Longbo Huang. Fast federated learning in the presence of arbitrary device unavailability. In Advances in Neural Information Processing Systems, 2021.
- Haddadpour et al. (2019) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, 2019.
- Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
- Jhunjhunwala et al. (2022) Divyansh Jhunjhunwala, Pranay Sharma, Aushim Nagarkatti, and Gauri Joshi. FedVARP: Tackling the variance due to partial client participation in federated learning. In Uncertainty in Artificial Intelligence, pp. 906–916. PMLR, 2022.
- Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.
- Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Li et al. (2020a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020a.
- Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429–450, 2020b.
- Li et al. (2020c) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2020c.
- Lin et al. (2020) Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local SGD. In International Conference on Learning Representations, 2020.
- Malinovsky et al. (2023) Grigory Malinovsky, Samuel Horváth, Konstantin Burlachenko, and Peter Richtárik. Federated learning with regularized client participation. arXiv preprint arXiv:2302.03662, 2023.
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
- McMahan et al. (2018) H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations, 2018.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- Patel et al. (2022) Kumar Kshitij Patel, Lingxiao Wang, Blake Woodworth, Brian Bullins, and Nathan Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
- Perazzone et al. (2022) Jake Perazzone, Shiqiang Wang, Mingyue Ji, and Kevin S Chan. Communication-efficient device scheduling for federated learning using stochastic optimization. In IEEE Conference on Computer Communications, pp. 1449–1458, 2022.
- Recht et al. (2011) Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in Neural Information Processing Systems, 24, 2011.
- Reddi et al. (2021) Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.
- Rizk et al. (2022) Elsa Rizk, Stefan Vlaski, and Ali H Sayed. Federated learning under importance sampling. IEEE Transactions on Signal Processing, 70:5381–5396, 2022.
- Ruan et al. (2021) Yichen Ruan, Xiaoxi Zhang, Shu-Che Liang, and Carlee Joe-Wong. Towards flexible device participation in federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 3403–3411. PMLR, 2021.
- Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
- Tan et al. (2022) Lei Tan, Xiaoxi Zhang, Yipeng Zhou, Xinkai Che, Miao Hu, Xu Chen, and Di Wu. Adafed: Optimizing participation-aware federated learning with adaptive aggregation weights. IEEE Transactions on Network Science and Engineering, 9(4):2708–2720, 2022.
- Wang & Joshi (2019) Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of Machine Learning and Systems, volume 1, pp. 212–229, 2019.
- Wang & Joshi (2021) Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms. Journal of Machine Learning Research, 22(213):1–50, 2021.
- Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
- Wang et al. (2021) Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
- Wang et al. (2022a) Qiyuan Wang, Qianqian Yang, Shibo He, Zhiguo Shi, and Jiming Chen. AsyncFedED: Asynchronous federated learning with euclidean distance based adaptive weight aggregation. arXiv preprint arXiv:2205.13797, 2022a.
- Wang & Ji (2022) Shiqiang Wang and Mingyue Ji. A unified analysis of federated learning with arbitrary client participation. In Advances in Neural Information Processing Systems, volume 35, 2022.
- Wang et al. (2022b) Yujia Wang, Lu Lin, and Jinghui Chen. Communication-efficient adaptive federated learning. In International Conference on Machine Learning, pp. 22802–22838. PMLR, 2022b.
- Wang et al. (2022c) Yujia Wang, Lu Lin, and Jinghui Chen. Communication-compressed adaptive gradient method for distributed nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pp. 6292–6320. PMLR, 2022c.
- Wu & Wang (2021) Hongda Wu and Ping Wang. Fast-convergent federated learning with adaptive weighting. IEEE Transactions on Cognitive Communications and Networking, 7(4):1078–1088, 2021.
- Yan et al. (2020) Yikai Yan, Chaoyue Niu, Yucheng Ding, Zhenzhe Zheng, Fan Wu, Guihai Chen, Shaojie Tang, and Zhihua Wu. Distributed non-convex optimization with sublinear speedup under intermittent client availability. arXiv preprint arXiv:2002.07399, 2020.
- Yang et al. (2021) Haibo Yang, Minghong Fang, and Jia Liu. Achieving linear speedup with partial worker participation in non-IID federated learning. In International Conference on Learning Representations, 2021.
- Yang et al. (2022) Haibo Yang, Xin Zhang, Prashant Khanduri, and Jia Liu. Anarchic federated learning. In International Conference on Machine Learning, pp. 25331–25363. PMLR, 2022.
- Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.
- Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI Conference on Artificial Intelligence, pp. 5693–5700, 2019.
Appendix
Appendix A Additional Discussion
A.1 Extending Objective (1) to Weighted Average
We note that our objective (1) can be easily extended to a weighted average of per-client empirical risk (i.e., average of sample losses), with arbitrary weights . To see this, let denote the (local) empirical risk of client , and let . We can define the local objective of client as , which gives us the global objective of
(A.1.1) |
This objective is in a standard form seen in most FL papers. The extension allows us to give different importance to different clients, if needed. For simplicity, we do not write out the weights in the main paper, because this extension to arbitrary weights is straightforward, and such a simplification has also been made in various other works such as Jhunjhunwala et al. (2022), Karimireddy et al. (2020), Reddi et al. (2021), Wang & Ji (2022).
A.2 Assumption on Bounded Global Gradient
As stated in Theorem 2, our convergence result holds when either the “bounded global gradient” assumption or the “nearly optimal weights” assumption holds. When the aggregation weights are nearly optimal satisfying , we do not need the bounded gradient assumption.
For the bounded gradient assumption itself, a stronger assumption of bounded stochastic gradient is used in related works on adaptive gradient algorithms (Reddi et al., 2021, Wang et al., 2022b; c), which implies an upper bound on the per-sample gradient. Compared to these works, we only require an upper bound on the global gradient, i.e., average of per-sample gradients, in our work. Although focusing on very different problems, our FedAU method shares some similarities with adaptive gradient methods in the sense that we both adapt the weights used in model updates, where the adaptation is dependent on some parameters that progressively change during the training process. The difference, however, is that our weight adaptation is based on each client’s participation history, while adaptive gradient methods adapt the element-wise weights based on the historical model update vector. Nevertheless, the similarity in both methods leads to a technical (mathematical) step of bounding a “weight error” in the proofs, which is where the bounded gradient assumption is needed especially when the “weight error” itself cannot be bounded. In our work, this step is done in the proof of Theorem 2 (in Appendix B.5). In adaptive gradient methods, as an example, this step is on page 14 until Equation (4) in Reddi et al. (2021).
Again, we note that the bounded gradient assumption is only needed when the aggregation weights are estimated and the estimation error is large. This is seen in the two choices in Assumption 5; the convergence bound holds when either of these two conditions holds. Intuitively, this aligns with the reasoning of the need for bounding the “weight error”.
A.3 Comparison with Existing Convergence Bounds for FedAvg
We compare our result in Corollary 4 with existing FedAvg convergence results, where the latter assumes known participation probabilities. Since most existing results consider equiprobable sampling of a certain number (denoted by here) of clients out of all the clients, we first convert our bound to the same setting so that it is comparable with existing results. We note that our convergence bound includes the parameter that is defined as in Theorem 2. When we know the participation probabilities and choose for all , we have . Further, for equiprobable sampling of clients out of a total of clients, we have and thus . Therefore, when is large and ignoring the other constants, our upper bound in Corollary 4 becomes .
Considering existing results of FedAvg with partial participation where the probabilities are both homogeneous and known, Theorem 1 in Karimireddy et al. (2020) gives the same convergence bound of for non-convex objectives, and Corollary 2 in Yang et al. (2021) gives a convergence bound of . Here, we note that Karimireddy et al. (2020) express the bound on communication rounds while we give the bound on the square of gradient norm, but the two types of bounds are directly convertible to each other. Our bound of matches Theorem 1 in Karimireddy et al. (2020) and improves over Corollary 2 in Yang et al. (2021). We also note that, in this special case, our result shows a linear speedup with respect to the number of participating clients, i.e., , which is the same as the existing results in Karimireddy et al. (2020), Yang et al. (2021).
The uniqueness of our work compared to Karimireddy et al. (2020), Yang et al. (2021) and most other existing works is that we consider heterogeneous and unknown participation statistics (probabilities), where each client has its own participation probability that can be different from other clients. In contrast, Karimireddy et al. (2020), Yang et al. (2021) assume uniformly sampled clients where a fixed (and known) number of clients participate in each round. Our setup is more general where the number of clients that participate in each round can vary over time. Because of this generality, we cannot define a fixed value of in our convergence bound that holds for this general setup, so we use to capture the statistical characteristics of client participation. When the overall probability distribution of client participation remains the same, increasing the total number of clients () has the same effect as increasing the number of participating clients (), as we have shown above.
Appendix B Proofs
B.1 Preliminaries
We first note the following preliminary inequalities that we will use in the proofs without explaining them further.
We have
(B.1.1) |
for any with , which is a direct consequence of Jensen’s inequality.
We also have
(B.1.2) |
for any and , which is known as (the generalized version of) Young’s inequality, also called the Peter-Paul inequality. A direct consequence of (B.1.2) is
(B.1.3) |
for some constant .
We also use the variance relation as follows:
(B.1.4) |
for any , while noting that (B.1.4) also holds when all the expectations are conditioned on the same variable(s).
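For reference, the standard forms of the relations named above can be written as follows. The notation ($z$, $z_m$, $M$, $\rho$, $b$) is assumed here for illustration and may differ from the symbols used in (B.1.1) to (B.1.4); the four lines correspond, in order, to the consequence of Jensen's inequality, Young's (Peter-Paul) inequality, its direct consequence for a squared norm of a sum, and the variance relation.

```latex
\begin{align*}
\Big\| \frac{1}{M} \sum_{m=1}^{M} z_m \Big\|^2
  &\le \frac{1}{M} \sum_{m=1}^{M} \| z_m \|^2, \\
\langle z_1, z_2 \rangle
  &\le \frac{\rho}{2} \| z_1 \|^2 + \frac{1}{2\rho} \| z_2 \|^2,
  \qquad \rho > 0, \\
\| z_1 + z_2 \|^2
  &\le (1 + b) \| z_1 \|^2 + \Big( 1 + \frac{1}{b} \Big) \| z_2 \|^2,
  \qquad b > 0, \\
\mathbb{E}\big[ \| z - \mathbb{E}[z] \|^2 \big]
  &= \mathbb{E}\big[ \| z \|^2 \big] - \big\| \mathbb{E}[z] \big\|^2 .
\end{align*}
```

The third line follows from the second by expanding the squared norm and bounding the cross term $2 \langle z_1, z_2 \rangle$ with $\rho = b$.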
B.2 Equivalent Formulation of Algorithm 1
For the purpose of analysis, similar to Wang & Ji (2022), we consider an equivalent formulation of the original Algorithm 1, as shown in Algorithm 3. In this algorithm, we assume that all the clients compute their local updates in Lines 3–3. This is logically equivalent to the practical setting where the clients that do not participate have no computation, because their computed update has no effect in Line 3 if , thus Algorithm 1 and Algorithm 3 give the same output sequence . Our proofs in the following sections consider the logically equivalent Algorithm 3 for analysis and also use the notations defined in this algorithm.
B.3 General Descent Lemma
To prove the general descent lemma that is used to derive both Theorems 1 and 2, we first define the following generally weighted loss function.
Definition B.3.1.
Define
(B.3.1) |
where for all and .
In (B.3.1), choosing gives our original objective of . Note that we consider the updates in Algorithm 3 to be still without weighting by , which allows us to quantify the convergence to a different objective when the aggregation weights are not properly chosen.
Lemma B.3.1.
Define , we have
(B.3.2) |
Proof.
From Assumption 3, we have
where we use Jensen’s inequality in (a). The final result follows due to . ∎
Lemma B.3.2.
When ,
(B.3.3) |
Proof.
This lemma has the same form as in Yang et al. (2021, Lemma 2) and Reddi et al. (2021, Lemma 3), but we present it here for a single client instead of average over multiple clients.
For , we have
(B.3.4) |
where follows from expanding the squared norm above and applying the law of total expectation on the second term, is because the second part of the inner product has no randomness when and are given, is because the inner product is zero due to the unbiasedness of stochastic gradient, follows from expanding the second term and applying the Peter-Paul inequality, uses gradient variance bound, Lipschitz gradient, and gradient divergence bound.
By unrolling the recursion, we obtain
where uses for any and . ∎
Lemma B.3.3 (General descent lemma).
When and , we have
(B.3.5) |
Proof.
Due to Assumption 1 (-smoothness), we have
(B.3.6) |
where the last equality is due to and the unbiasedness of the stochastic gradient giving (for simplicity, we will not write out this total expectation in subsequent steps of this proof).
Expanding the third term of (B.3.6), we have
(B.3.8) |
where we note that follows Bernoulli distribution with probability , thus and , yielding the relation in . We also use the independence across different and for the stochastic gradients and the independence across for the client participation random variable , as well as the fact that and the stochastic gradients are independent of each other, so the local updates (progression of ) are independent of according to the logically equivalent algorithm formulation in Algorithm 3. The independence yields some inner product terms to be zero, giving the results in , , and .
B.4 Formal Version and Proof of Theorem 1
We first state the formal version of Theorem 1 as follows.
Theorem B.4.1 (Objective minimized at convergence, formal).
Proof.
According to Algorithm 3, the result remains the same when we replace and (thus ) with and , respectively, while keeping the product . We choose and . Then, we choose in Lemma B.3.3. We can see that this choice satisfies , so Lemma B.3.3 holds after replacing and in the lemma with and , respectively, and in Lemma B.3.3 is equal to defined in Theorem B.4.1 with this choice of . Therefore,
(B.4.2) |
Because and , there exists a sufficiently large so that . In this case, after taking the total expectation of (B.4.2) and rearranging, we have
(B.4.3) |
Then, summing up over rounds and dividing by , we have
(B.4.4) |
where with as the true minimum value.
Since and , we can see that the upper bound above converges to zero as . Thus, there exists a sufficiently large to achieve an upper bound of an arbitrary positive value of . ∎
B.5 Proof of Theorem 2
We first present the following variant of the descent lemma for the original objective defined in (1).
Lemma B.5.1 (Descent lemma for original objective).
Under the same conditions as in Lemma B.3.3,
(B.5.1) |
Proof.
Proof of Theorem 2.
Consider the last term in Lemma B.5.1. Due to , and as specified in the theorem, we have and .
Case 1: When assuming , we have
(B.5.2) |
Plugging back into Lemma B.5.1, after taking total expectation and rearranging, we obtain
(B.5.3) |
Case 2: When assuming , we have
(B.5.4) |
Plugging back into Lemma B.5.1, after taking total expectation and rearranging, we obtain
(B.5.5) |
B.6 Proof of Theorem 3
We start by analyzing the statistical properties of the possibly cutoff participation interval . Because in every round, each client participates according to a Bernoulli distribution with probability , the random variable has the following probability distribution:
(B.6.1) |
which is a “cutoff” geometric distribution with a maximum value of . We will refer to this probability distribution as -cutoff geometric distribution. We can see that when , this distribution becomes the same as the geometric distribution, but we consider the general case with an arbitrary that is specified later. We also recall that the actual value of is unknown to the system, which is why we need to compute using the estimation procedure in Algorithm 2.
Lemma B.6.1.
Equation (B.6.1) defines a probability distribution, and the mean and variance of are
(B.6.2) |
Proof.
We first show that (B.6.1) defines a probability distribution. According to the definition in (B.6.1), we have for any . We prove this by induction. Let and denote the random variables following -cutoff and -cutoff geometric distributions, respectively. For , we have . Therefore, we can assume that holds for a certain value of . For -cutoff distribution, we first note that according to (B.6.1),
Therefore,
This shows that defined in (B.6.1) is a probability distribution.
In the following, we derive the mean and variance of , where we use to denote the derivative of with respect to .
We have
(B.6.3) |
which gives the expression for the expected value.
To compute the variance, we note that
(B.6.4) |
Thus,
(B.6.5) |
Therefore,
(B.6.6) |
which gives the final variance result. ∎
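The properties in Lemma B.6.1 can be checked numerically. The sketch below assumes the usual form of a geometric distribution truncated at a maximum value, where the last point absorbs the remaining tail mass, and compares the numerically computed mean with the closed form obtained from summing the tail probabilities. The symbols `p` and `cutoff`, and the values used, are illustrative assumptions rather than the notation or settings of (B.6.1).

```python
import numpy as np


def cutoff_geometric_pmf(p, cutoff):
    """Probability mass function of a geometric variable truncated at `cutoff`."""
    k = np.arange(1, cutoff + 1)
    pmf = p * (1.0 - p) ** (k - 1)
    # The last point collects all the mass of intervals that would exceed the cutoff.
    pmf[-1] = (1.0 - p) ** (cutoff - 1)
    return k, pmf


p, cutoff = 0.1, 20  # hypothetical values
k, pmf = cutoff_geometric_pmf(p, cutoff)

total = pmf.sum()
mean = (k * pmf).sum()
variance = ((k - mean) ** 2 * pmf).sum()

# Closed-form mean from summing the tail probabilities P(S > j) for j = 0, ..., cutoff - 1.
closed_form_mean = (1.0 - (1.0 - p) ** cutoff) / p
# Variance of the untruncated geometric distribution, for comparison.
geometric_variance = (1.0 - p) / p**2

print(total, mean, closed_form_mean, variance, geometric_variance)
```

The mean is strictly below the reciprocal of the participation probability for any finite cutoff, which is the bias introduced by truncation, while the variance is smaller than that of the untruncated geometric distribution.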
Now, we are ready to obtain an upper bound of the weight error term.
Proof of Theorem 3.
Case 1: According to Algorithm 2, we have in the initial rounds before the first participation has occurred. This includes at least one round () and at most rounds. In these initial rounds, we have
(B.6.7) |
Case 2: For all the other rounds, is estimated based on at least one sample of . Therefore, using the mean and variance expressions from Lemma B.6.1, we have the following for these rounds:
(B.6.8) |
where is because the inner product term is zero since the mean of is equal to ; is due to the definition of variance, the fact that we consider the computation of to be based on at least one sample of , and for any round there are at least samples of due to the cutoff interval of length ; uses the upper bound of .
We note that the bound (B.6.7) in Case 1 always applies for , because we always have for according to Algorithm 2. For rounds , either the bound (B.6.7) in Case 1 or the bound (B.6.8) in Case 2 applies, thus is upper bounded by the sum of both bounds in these rounds. Then, for , the bound (B.6.8) in Case 2 applies. According to this fact, summing up the bounds for each round and dividing by gives
(B.6.9) |
where we use the relation that for , and the logarithm is based on .
The final result is obtained by averaging (B.6.9) over all . ∎
B.7 Proof of Corollary 4
We first prove the upper bound of the weight error term in the following lemma.
Lemma B.7.1.
Choose , where and . Define . When , the aggregation weights obtained from Algorithm 2 satisfy
(B.7.1) |
Proof.
Appendix C Additional Setup Details of Experiments
C.1 Code
The code for reproducing our experiments is available via the following link:
https://shiqiang.wang/code/fedau
C.2 Datasets
The SVHN dataset has a citation requirement (Netzer et al., 2011). Its license is for non-commercial use only. It includes color images with real-world house numbers of different digits, containing training data samples and test data samples.
The CIFAR-10 dataset only has a citation requirement (Krizhevsky & Hinton, 2009). It includes color images of different types of real-world objects, containing training data samples and test data samples.
The CIFAR-100 dataset only has a citation requirement (Krizhevsky & Hinton, 2009). It includes color images of different types of real-world objects, containing training data samples and test data samples.
The CINIC-10 dataset (Darlow et al., 2018) is released under the MIT license. It includes color images of different types of real-world objects, containing training data samples and test data samples.
We have cited all the references in the main paper and conformed to all the license terms.
We applied some basic data augmentation techniques to these datasets during the training stage. For SVHN, we applied random cropping. For CIFAR-10 and CINIC-10, we applied both random cropping and random horizontal flipping. For CIFAR-100, we applied a combination of random sharpness adjustment, color jitter, random posterization, random equalization, random cropping, and random horizontal flipping.
C.3 Models
All the models include two convolutional layers with a kernel size of , filter size of , and ReLU activation, where each convolutional layer is followed by a max-pool layer. The model for the SVHN dataset has two fully connected layers, while the models for the CIFAR-10/100 and CINIC-10 datasets have three fully connected layers. All the fully connected layers use ReLU activation, except for the last layer that is connected to softmax output. For CIFAR-100 and CINIC-10 datasets, a dropout layer (with dropout probability ) is applied before each fully connected layer. We use Kaiming initialization for the weights. See the code for further details on model definition (the model class files are located inside the “model/” subfolder).
C.4 Hyperparameters
For each dataset and algorithm, we conducted a grid search on the learning rates and separately. The grid for the local step size is and the grid for the global step size is . To reduce the complexity of the search, we first search for the value of with , and then search for while fixing to the value found in the first search. We consider the training loss at rounds for determining the best and . The hyperparameters found from this search and used in our experiments are shown in Table C.4.1.
Learning Rate Decay for CIFAR-100 Dataset. Only for the CIFAR-100 dataset, we decay the local learning rate by half every rounds, starting from the -th round.
Dataset | SVHN | CIFAR-10 | CIFAR-100 | CINIC-10 | ||||
Method / Hyperparameter | | | | | | | | |
FedAU (ours, ) | | | | | | | | |
FedAU (ours, ) | | | | | | | | |
Average participating | | | | | | | | |
Average all | | | | | | | | |
FedVarp ( memory) | | | | | | | | |
MIFA ( memory) | | | | | | | | |
Known participation statistics | | | | | | | | |
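The two-stage search and the CIFAR-100 decay schedule described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names, the `default_global` value used during the first stage, and the reading that the first halving happens exactly at the starting round are all hypothetical, and `train_loss` stands in for a full FL run that returns the training loss after a fixed number of rounds.

```python
def two_stage_search(train_loss, local_grid, global_grid, default_global=1.0):
    """Search the local step size first, then the global step size.

    train_loss(local_lr, global_lr) is a user-supplied callable (hypothetical)
    that runs training and returns the training loss used for selection.
    default_global is an assumed placeholder for the global step size that is
    held fixed during the first stage.
    """
    # Stage 1: pick the local step size with the global step size held fixed.
    best_local = min(local_grid, key=lambda lr: train_loss(lr, default_global))
    # Stage 2: pick the global step size with the local step size fixed.
    best_global = min(global_grid, key=lambda lr: train_loss(best_local, lr))
    return best_local, best_global


def decayed_local_lr(base_lr, round_idx, decay_start, decay_every):
    """Halve the local learning rate every `decay_every` rounds.

    Assumption: the first halving is applied at round `decay_start`.
    """
    if round_idx < decay_start:
        return base_lr
    num_decays = 1 + (round_idx - decay_start) // decay_every
    return base_lr * 0.5 ** num_decays
```

The sequential search evaluates one grid at a time, so the number of training runs grows with the sum of the two grid sizes rather than their product, which is the complexity reduction the text refers to.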
C.5 Computation Resources
The experiments were split between a desktop machine with an RTX 3070 GPU and an internal GPU cluster. In our experiments, the total number of rounds is for SVHN, for CIFAR-10 and CINIC-10, and for CIFAR-100. Each experiment with rounds took approximately hours to complete, for one random seed on an RTX 3070 GPU. The time taken for experiments with other numbers of rounds scales accordingly. We ran experiments with different random seeds for each dataset and algorithm. It was possible to run multiple experiments simultaneously on the same GPU without exceeding the GPU memory.
C.6 Heterogeneous Participation Across Clients
C.6.1 Generating Participation Patterns
In each experiment with a specific simulation seed, we take only one sample of this Dirichlet distribution with parameter , which gives a probability vector that has a dimension equal to the total number of classes in the dataset. (Footnote 2: We use to denote with all the elements in the vector equal to .) The participation probability for each client is obtained by computing an inner product between and the class distribution vector of the data at client , and then dividing by a normalization factor. The rationale behind this approach is that the elements in indicate how different classes contribute to the participation probability. For example, if the first element of is large, it means that clients with a lot of data samples in the first class will have a high participation probability, and vice versa. Since the participation probabilities generated using this approach are random variables, the normalization ensures a certain mean participation probability, i.e., , of any client , which is set to in our experiments. We further cap the minimum value of any to be .
Among the three participation patterns in our experiments, i.e., Bernoulli, Markovian, and cyclic, we maintain the same stationary participation probabilities for the clients, so the difference is in the temporal distribution of when a client participates, which is summarized as follows.
• For Bernoulli participation, in every round , each client decides whether or not to participate according to a Bernoulli distribution with probability . This decision is independent across time, i.e., independent across different rounds.
• For Markovian participation, each client participates according to a two-state Markov chain, where the motivation is similar to cyclic participation (see next item below) but includes more randomness. We set the maximum transition probability of a client transitioning from not participating to participating to . The initial state of the Markov chain is determined by a random sampling according to the stationary probability , and the transition probabilities are determined in a way so that the same stationary probability is maintained across all the subsequent rounds.
• For cyclic participation, each client participates cyclically, i.e., it participates for a certain number of rounds and does not participate in the other rounds of a cycle. This setup has been used in existing works to simulate periodic behavior of client devices being charged (e.g., at night) (Eichner et al., 2019, Ding et al., 2020, Cho et al., 2023, Wang & Ji, 2022). We set each cycle to be rounds. We apply a random initial offset to the cycle for each client, to simulate a stationary random process for each client’s participation pattern.
Figure C.6.1 shows examples of these three types of participation patterns.
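The three patterns listed above can be simulated with a short sketch such as the one below. It is an illustration under assumptions rather than the experiment code: the function names, the default `max_join_prob` and `cycle_len` values, the rule used to pick the Markov transition probabilities (the join probability is capped, and the leave probability is then solved from the stationary condition), and the choice of one contiguous active block per cycle are all hypothetical. Each function returns a 0/1 list over rounds for a single client with stationary participation probability `p`, where `0 < p <= 1`.

```python
import random


def bernoulli_pattern(p, num_rounds, rng):
    """Independent participation decision in every round."""
    return [1 if rng.random() < p else 0 for _ in range(num_rounds)]


def markov_pattern(p, num_rounds, rng, max_join_prob=0.05):
    """Two-state Markov chain whose stationary participation probability is p.

    join  = P(not participating -> participating), capped at max_join_prob
    leave = P(participating -> not participating), solved from
            p = join / (join + leave) so that p stays stationary.
    """
    join = min(max_join_prob, p / (1.0 - p)) if p < 1.0 else 1.0
    leave = join * (1.0 - p) / p
    # The initial state is sampled from the stationary distribution.
    state = 1 if rng.random() < p else 0
    pattern = []
    for _ in range(num_rounds):
        pattern.append(state)
        if state == 1:
            state = 0 if rng.random() < leave else 1
        else:
            state = 1 if rng.random() < join else 0
    return pattern


def cyclic_pattern(p, num_rounds, rng, cycle_len=100):
    """Participate for a contiguous block of each cycle, with a random offset."""
    active = max(1, round(p * cycle_len))
    offset = rng.randrange(cycle_len)
    return [1 if (t + offset) % cycle_len < active else 0
            for t in range(num_rounds)]
```

All three generators share the same long-run participation frequency, so any difference in training behavior between them comes only from how participation is distributed over time.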
C.6.2 Illustration of Data and Participation Heterogeneity
As described in Section 5 and Appendix C.6.1, we generate the data and participation heterogeneity with two separate Dirichlet distributions with parameters and , respectively. In the following, we illustrate the result of this generation for a specific random instance. In Figure C.6.2, the class-wise data distribution of each client is drawn from . For computing the participation probability, we draw a vector from , which gave the following result in our random trial:
Then, the participation probability is set as the inner product of and the class distribution of each client’s data, divided by a normalization factor. For the above , the -th element has the highest value, which means that clients with a larger proportion of data in the -th class (label) will have a higher participation probability. This is confirmed by comparing the class distributions and the participation probabilities in Figure C.6.2.
In this procedure, is kept the same for all the clients, to simulate a consistent correlation between participation probability and class distribution across all the clients. However, the value of changes with the random seed, which means that we have different for different experiments. We ran experiments with different random seeds for each setting, which allows us to observe the general behavior.
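More precisely, let $\kappa_n$ denote the class distribution of client $n$’s data. The participation probability $p_n$ of each client $n$ is computed as
$$p_n = \frac{1}{\lambda} \langle \kappa_n, q \rangle, \qquad \text{(C.6.1)}$$
where $\lambda$ is the normalization factor to ensure that $\mathbb{E}[p_n]$ is equal to some target $\mu$, because $p_n$ is a random quantity when using this randomized generation procedure. In our experiments, we set $\mu$ to the mean participation probability given in Appendix C.6.1. Let $C$ denote the total number of classes (labels). From the mean of the Dirichlet distribution and the fact that $\kappa_n$ and $q$ are independent, we know that $\mathbb{E}[\langle \kappa_n, q \rangle] = \sum_{c=1}^{C} \frac{1}{C} \cdot \frac{1}{C} = \frac{1}{C}$. Therefore, to ensure that $\mathbb{E}[p_n] = \mu$, according to (C.6.1), the normalization factor is chosen as $\lambda = \frac{1}{\mu C}$.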
We emphasize again that this procedure is only used for simulating an experimental setup with both data and participation heterogeneity. Our FedAU algorithm still does not know the actual values of .
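The generation procedure in this subsection can be sketched in a few lines. This is a hedged illustration, not the experiment code: the function names and variable names (`kappa`, `q`, `lam`) are our own labels, the Dirichlet sampler is a standard normalized-gamma construction, and the upper clip at 1.0 is an added safeguard that the text does not mention. Any values passed for the Dirichlet parameter, `mean_prob`, and `min_prob` are placeholders to be replaced by the settings reported for the experiments.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Sample from a symmetric Dirichlet distribution via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def participation_probs(class_dists, q, mean_prob, min_prob):
    """Map each client's class distribution to a participation probability.

    class_dists: one class-distribution vector per client (each sums to 1).
    q:           a single vector shared by all clients (sums to 1).
    mean_prob:   target mean participation probability.
    min_prob:    lower cap applied to every client.
    """
    num_classes = len(q)
    # Normalization factor: the expected inner product is 1 / num_classes,
    # so dividing by lam gives an expected probability of mean_prob.
    lam = 1.0 / (mean_prob * num_classes)
    probs = []
    for kappa in class_dists:
        inner = sum(k * w for k, w in zip(kappa, q))
        p = inner / lam
        # Lower cap from the text; the upper clip at 1.0 is our assumption.
        probs.append(min(1.0, max(min_prob, p)))
    return probs
```

Because `q` is drawn once per simulation seed and reused for every client, clients whose data concentrate on the classes favored by `q` consistently receive higher participation probabilities within one experiment.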
Appendix D Additional Results from Experiments
D.1 Results with Different Participation Heterogeneity
We present the results with different participation heterogeneity (characterized by the Dirichlet parameter ) on the CIFAR-10 dataset in Table D.1.1, for the case of Bernoulli participation. The main observations remain consistent with those in Section 5. We also see that the difference between different methods becomes larger when the heterogeneity is higher (i.e., smaller ), which aligns with intuition. For all degrees of heterogeneity, our FedAU algorithm performs the best among the algorithms that work under the same setting, i.e., the top part of Table D.1.1.
Participation heterogeneity | ||||||||||
Method / Metric | Train | Test | Train | Test | Train | Test | Train | Test | Train | Test |
FedAU (ours, ) | 83.7±0.8 | 76.2±0.7 | 84.3±0.8 | 76.6±0.5 | 85.4±0.4 | 77.1±0.4 | 87.3±0.5 | 77.8±0.2 | 88.1±0.7 | 78.1±0.2 |
FedAU (ours, ) | 84.7±0.6 | 76.9±0.6 | 85.1±0.5 | 77.1±0.3 | 86.0±0.5 | 77.3±0.3 | 87.6±0.4 | 77.8±0.4 | 88.2±0.7 | 78.0±0.2 |
Average participating | 80.6±1.2 | 72.3±1.7 | 81.5±1.1 | 72.6±1.4 | 83.5±0.9 | 74.1±0.8 | 85.9±0.7 | 75.7±0.9 | 87.0±1.0 | 76.8±0.6 |
Average all | 76.9±2.7 | 69.5±2.7 | 78.5±1.7 | 70.6±1.8 | 81.0±0.9 | 72.7±0.9 | 83.6±1.4 | 74.6±0.8 | 84.9±1.2 | 75.9±1.0 |
FedVarp ( memory) | 82.5±0.8 | 77.3±0.3 | 83.0±0.4 | 77.5±0.4 | 84.2±0.3 | 77.9±0.2 | 85.4±0.5 | 78.1±0.2 | 86.4±0.7 | 78.5±0.3 |
MIFA ( memory) | 82.1±0.8 | 77.0±0.7 | 82.6±0.3 | 77.3±0.4 | 83.5±0.6 | 77.5±0.3 | 84.9±0.5 | 77.9±0.3 | 85.4±0.4 | 78.0±0.4 |
Known participation statistics | 83.1±0.8 | 76.3±0.6 | 83.6±0.6 | 76.7±0.5 | 84.3±0.5 | 77.0±0.5 | 86.1±0.6 | 77.7±0.4 | 86.8±0.9 | 77.9±0.7 |
Note to the table. The same note in Table 1 also applies to this table.
D.2 Loss and Accuracy Plots
For Bernoulli participation, we plot the loss and accuracy results in different rounds for the four datasets, as shown in Figures D.2.1–D.2.4. In these plots, the curves show the mean values and the shaded areas show the standard deviation. We applied a moving average with a window size equal to of the total number of rounds, and the mean and standard deviation are computed across samples from all experiments (with different random seeds) within each moving average window.
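The smoothing described above, where samples from all seeds inside a window are pooled before taking the mean and standard deviation, can be sketched as follows. This is a minimal sketch under assumptions: the function name is hypothetical, the window is taken to be trailing (ending at the current round), and the population standard deviation is used; the plots in the paper may use a different window alignment or estimator.

```python
import statistics


def smoothed_mean_std(curves, window):
    """Pool values from all seeds within a trailing window of rounds.

    curves: list of per-seed metric curves, all of the same length.
    window: number of rounds in the moving-average window.
    Returns (means, stds), one entry per round.
    """
    num_rounds = len(curves[0])
    means, stds = [], []
    for t in range(num_rounds):
        start = max(0, t - window + 1)
        # Samples from every seed and every round inside the window.
        pooled = [curve[i] for curve in curves for i in range(start, t + 1)]
        means.append(statistics.fmean(pooled))
        stds.append(statistics.pstdev(pooled))
    return means, stds
```

Pooling across both seeds and rounds means the shaded area reflects round-to-round fluctuation inside the window as well as seed-to-seed variation.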
The main conclusions from Figures D.2.1–D.2.4 are similar to what we have seen from the final-round results shown in Table 1 in the main paper. We can see that our FedAU algorithm performs the best in the vast majority of cases and across most rounds. Only for the CIFAR-10 dataset does FedAU give a slightly worse test accuracy than FedVarp and MIFA, which aligns with the results in Table 1. However, FedAU still gives the highest training accuracy on CIFAR-10. This implies that FedVarp and MIFA generalize slightly better on the CIFAR-10 dataset, and the reasons are worth further investigation. We emphasize again that FedVarp and MIFA both require substantially more memory than FedAU, thus they do not work under the same system assumptions as FedAU. For the CIFAR-100 dataset, there is a jump around the -th round due to the learning rate decay schedule, as mentioned in Section C.4.
D.3 Client-wise Distributions of Loss and Accuracy
We plot the loss and accuracy value distributions among all the clients in Figure D.3.1, where we consider Bernoulli participation and compare with baselines that do not require extra resources or information.
We can see that compared to the average-participating and average-all baselines that use the same amount of memory as FedAU, the spread in the loss and accuracy with FedAU is smaller. This is also seen in the standard deviation of all the clients’ loss and accuracy values in Table D.3.1, where we only include the standard deviation values because the mean values are the same as those in Table 1.
Method | Client-wise std. dev. of loss | Client-wise std. dev. of training accuracy | Client-wise std. dev. of test accuracy |
FedAU (ours, ) | 0.017 | 9.9% | 11.7% |
FedAU (ours, ) | 0.016 | 9.5% | 11.2% |
Average participating | 0.031 | 13.3% | 11.6% |
Average all | 0.030 | 13.3% | 12.2% |
This shows that FedAU (especially with ) reduces the bias among clients compared to the two baselines, which aligns with our motivation mentioned in Section 1 about reducing discrimination.
D.4 Aggregation Weights
As shown in Figure D.4.1, with Bernoulli participation, the computed weights can be quite different from , especially when the participation probability is low (in Subfigures 1(c)–1(e)). In contrast, we see in Figure D.4.2 that with cyclic participation the weights computed by FedAU and the known participation statistics baseline are more similar. This aligns with the fact that the accuracies in the case of cyclic participation are also more similar compared to the case of Bernoulli participation, as seen in Table 1.
D.5 Choice of different
We study the effect of the cutoff interval length by considering the performance of FedAU under different minimum participation probabilities. The distributions of participation probabilities for all the clients with different lower bounds are shown in Figure D.5.1, where we can see that a smaller lower bound value corresponds to having more clients with very small participation probabilities. The full set of plots complementing Figure 1 is shown in Figure D.5.2.
D.6 Low Participation Rates
To further study the performance of FedAU in the presence of clients with low participation rates, we set the lower bound of participation probabilities to (i.e., we do not impose a specific lower bound; see Appendix C.6.1 and Appendix D.5 for details) and compare the performance of FedAU with to the baseline algorithms. We consider settings with different mean participation probabilities , while following the same procedure of generating heterogeneous participation patterns as described in Appendix C.6, to capture the effect of different overall participation rates of clients. The resulting distributions of with different are shown in Figure D.6.1.
Mean participation probability | ||||||
Method / Metric | Train | Test | Train | Test | Train | Test |
FedAU (ours, ) | 84.1±0.6 | 75.9±0.5 | 71.5±1.3 | 67.7±1.1 | 45.9±1.2 | 45.6±1.2 |
Average participating | 81.6±0.7 | 72.5±0.7 | 60.2±1.4 | 57.9±1.5 | 26.7±1.8 | 26.8±1.8 |
Average all | 79.5±0.8 | 71.5±0.9 | 61.8±2.4 | 60.0±2.5 | 33.0±1.8 | 33.5±1.9 |
FedVarp ( memory) | 61.5±25.8 | 59.5±24.8 | 10.0±0.0 | 10.0±0.0 | 12.7±3.4 | 12.8±3.5 |
MIFA ( memory) | 74.8±1.9 | 72.5±1.5 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 |
Known participation statistics | 15.0±10.0 | 14.9±9.8 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 | 10.0±0.0 |
Key Observations. The accuracy results are presented in Table D.6.1, from experiments with the CIFAR-10 dataset and rounds of FL. As expected, the performance of the majority of algorithms decreases as decreases, where the minor increase of FedVarp’s performance from the case of to is due to randomness in the experiments. We summarize the key findings from Table D.6.1 in the following.
Interestingly, the baseline algorithms that require additional memory or other information perform very poorly when the clients’ participation rates are low. Note that an accuracy of 10% corresponds to random guessing for the CIFAR-10 dataset, which has 10 classes of images. The reason is that FedVarp and MIFA both perform variance reduction based on previous updates of clients. When clients participate rarely, it is likely that the saved updates are outdated, causing more distortion than benefit to parameter updates. For the case of known participation statistics, the aggregation weight of each client is chosen as . When is very small, the aggregation weight becomes very large, which causes instability in the model training process.
The average participating and average all baselines perform better than the FedVarp, MIFA, and known participation statistics baselines, because the aggregation weights used by the average participating and average all algorithms do not have much variation, which provides more stability in the case of low client participation rates.
FedAU (with ) gives the best performance, because the cutoff interval of length ensures that the aggregation weights are not too large, which provides stability in the training process. At the same time, clients that participate frequently still have lower aggregation weights, which balances the contributions of clients with different participation rates.
Further Discussion. We further note that all the results in Table D.6.1 are from experiments using the hyperparameters listed in Table C.4.1. These near-optimal hyperparameters were found from a grid search (see Appendix C.4) when the clients participate according to a Bernoulli distribution with the statistics described in Appendix C.6. For the baseline methods that give random-guess (or close to random-guess) accuracies, it is possible that their performance can be slightly improved by choosing a much smaller learning rate, which may alleviate the impact of stale updates (for FedVarp and MIFA) or excessively large aggregation weights (for known participation statistics) when the client participation rate is low. However, it is impractical to fine-tune the learning rates depending on the participation rates, especially when the participation rates are unknown a priori. In addition, using very small learning rates generally slows down the convergence, although the algorithm may converge in the end after a large number of rounds. The fact that FedAU gives the best performance compared to the baselines for a wide range of client participation rates, while keeping the learning rates unchanged (i.e., using the values in Table C.4.1), confirms its stability and usefulness in practice.