A Practical Recipe For Federated Learning Under Statistical Heterogeneity Experimental Design

Abstract—Federated Learning (FL) has been an area of active research in recent years. There have been numerous studies in FL to make it more successful in the presence of data heterogeneity. However, despite the existence of many publications, the state of progress in the field is unknown. Many of the works use inconsistent experimental settings, and there are no comprehensive studies on the effect of FL-specific experimental variables on the results, nor practical insights for a more comparable and consistent FL experimental setup. Furthermore, the existence of several benchmarks and confounding variables has further complicated the issue of inconsistency and ambiguity. In this work, we present the first comprehensive study on the effect of FL-specific experimental variables in relation to each other and performance results, bringing several insights and recommendations for designing a meaningful and well-incentivized FL experimental setup. We further aid the community by releasing FedZoo-Bench, an open-source library based on PyTorch with pre-implementations of 22 state-of-the-art methods¹ and a broad set of standardized and customizable features, available at https://github.com/MMorafah/FedZoo-Bench. We also provide a comprehensive comparison of several state-of-the-art (SOTA) methods to better understand the current state of the field and existing limitations.

Impact Statement—Federated Learning aims to train a machine learning model using the massive decentralized data available at IoT and mobile devices and different data centers while maintaining data privacy. However, despite the existence of numerous works, the state of progress in the field is not well understood. Papers use different methodologies and experimental setups that make it hard to compare and examine the effectiveness of methods in more general settings. Moreover, the effect of federated learning experimental design factors, such as local epochs and sample rate, on the performance results has remained unstudied in the field. Our work comprehensively studies the effect of experimental design factors in federated learning, provides suggestions and insights, introduces FedZoo-Bench with pre-implementations of 22 state-of-the-art algorithms under a unified setting, and finally measures the state of progress in the field. The studies and findings discussed in our work can significantly help the federated learning field by providing a more comprehensive understanding of the impact of experimental design factors, facilitating the design of better-performing algorithms, and enabling a more accurate evaluation of the effectiveness of different methods.

Index Terms—Benchmark, Data Heterogeneity, Experimental Design, Federated Learning, Machine Learning, Non-IID Data.

M. Morafah, W. Wang and B. Lin are with the Electrical and Computer Engineering Department of the University of California San Diego, USA (e-mail: {mmorafah, wweijia, billlin}@eng.ucsd.edu). Corresponding author: Mahdi Morafah.

¹ We will continue the effort to extend FedZoo-Bench by implementing more methods and adding more features. Any contributions to FedZoo-Bench would be greatly appreciated as well.

I. INTRODUCTION

FEDERATED LEARNING (FL) is a machine learning setting which aims to collaboratively train machine learning models with the participation of several clients under the orchestration of a central server in a privacy-preserving and communication-efficient manner [34, 62]. FL has seen a surge of interest in the machine learning research community in recent years, thanks to its potential to improve the performance of edge users without compromising data privacy. This innovative approach has been successfully applied to a wide range of tasks, including image classification, natural language processing, and more [15, 34, 64, 73, 82].

The ultimate goal of standard (global) FL is to train a shared global model which performs uniformly well over almost all participating clients. However, the inherent diversity and non-independent and non-identical (Non-IID) distribution of clients' local data have made the global FL approach very challenging [24, 28, 47, 48, 68, 84]. Indeed, clients' incentives to participate in FL can be to derive personalized models rather than to learn a shared global model. This client-centric perspective, along with the challenges of the global FL approach under Non-IID data distributions, has motivated an alternative personalized FL approach. Personalized FL aims to learn personalized models performing well according to the distinct data distribution of each participating client.

Despite the significant number of works that have been done in both global and personalized FL approaches under data heterogeneity, the state of progress is not well understood in the FL community. In particular, the following key questions have remained unanswered in
the existing literature: what factors affect experimental results and how can they be controlled? Which experimental design setups are effective and well-incentivized for a fair comparison in each FL approach? What are the best practices and remedies to compare different methods and avoid evaluation failure? We primarily find that the methodologies, experimental setups, and evaluation metrics are so inconsistent between papers that a comprehensive comparison is impossible. For example, some papers consider a specific FL approach; however, they use an experimental setup that is not well-incentivized, or that has been created to match their assumptions and is not applicable to other cases. Moreover, the existence of numerous benchmarks and inconsistent implementation environments, together with different confounding variables such as data augmentation and pre-processing techniques, choice of the optimizer, and learning rate schedule, has made such comparison even more difficult.

To address the mentioned issues in the current state of FL research, we present the first comprehensive study, to the best of our knowledge, on FL-specific experimental variables, and provide new insights and best practices for a meaningful and well-incentivized FL experimental design for each FL approach. We also introduce FedZoo-Bench, an open-source library based on PyTorch that provides a commonly used set of standardized and customizable features in FL, and implementations of 22 state-of-the-art (SOTA) methods under a unified setting, to make FL research more reproducible and comparable. Finally, we present a comprehensive evaluation of several SOTA methods in terms of performance, fairness, and generalization to newcomers, to provide a clear understanding of the promises and limitations of existing methods.

Contributions. Our study makes the following key contributions:

• We conduct the first comprehensive analysis of FL experimental design variables by running extensive experiments and provide new insights and best practices for a meaningful and well-incentivized experimental setup for each FL approach.
• We introduce FedZoo-Bench, an open-source library for implementation and evaluation of FL algorithms under data heterogeneity, consisting of the implementation of 22 SOTA algorithms, available at https://github.com/MMorafah/FedZoo-Bench.
• Using FedZoo-Bench, we conduct a comprehensive evaluation of several SOTA algorithms in terms of performance, fairness, and generalization to newcomers to form a clear understanding of the strengths and limitations of existing methods.

Organization. The rest of the paper is organized as follows. In Section II we bring a concise literature review. In Section III we provide the background for each FL approach and statistical heterogeneity. In Section IV we provide our comprehensive study on FL-specific experimental variables. In Section V we discuss our recommendations. In Section VI we introduce FedZoo-Bench and compare 17 different algorithms. In Section VII we conclude and provide future works.

II. LITERATURE REVIEW

Global FL. McMahan et al. [62] proposed the first method for global FL, called FedAvg, which simply averages the local models to update the server-side model. This FL approach mainly suffers from poor convergence and degradation of the results in the presence of data heterogeneity [75, 84]. Some works attempt to address these issues by regularizing local training. FedProx [48] uses an L2 regularizer on the weight difference between the local and global model. MOON [46] utilizes contrastive learning to preserve global knowledge during local training. In FedDyn [4], an on-device dynamic regularization at each round has been used to align the local and global solutions. Another set of works studies the optimization issues of FedAvg in the presence of data heterogeneity and proposes alternative optimization procedures with better convergence guarantees [24, 35, 36, 67, 79]. FedNova [79] addresses the objective inconsistency caused by heterogeneity in the local updates by weighting the local models in server-side averaging to eliminate bias in the solution and achieve fast error convergence. Scaffold [36] proposes control variates to correct the local updates and eliminate the "client drift" which happens because of data heterogeneity, resulting in convergence rate improvement. Other approaches have focused on proposing better model fusion techniques to improve performance [23, 52, 53, 69, 78]. FedDF [52] adds a server-side KL training step after averaging local models by using the average of clients' logits on a public dataset. FedMA [78] proposes averaging on the matched neurons of the models at each layer. GAMF [53] formulates model fusion as a graph matching task by considering neurons or channels as nodes and weights as edges. For a more detailed review of methods in the global FL literature, we recommend reading the surveys [33, 58].

Personalized FL. The fundamental challenges of the global FL approach, such as poor convergence in the presence of data heterogeneity and lack of personalized solutions, have motivated the development of personalized FL. Personalized FL aims to obtain personalized models for participating clients. There are various efforts to solve this problem through different techniques. Multi-Task Learning (MTL) based techniques have been proposed in a number of works by considering clients as tasks and framing personalized FL as an MTL problem [19, 25, 49, 60, 72, 74]. Another group of studies proposes model interpolation techniques by mixing the global and
local models [17, 59, 61]. There are also works that utilize representation learning techniques by decoupling parameters (or layers) into global and local parameters and then averaging only the global parameters with other clients [5, 14, 51, 66]. Additionally, there are some meta-learning-based works that attempt to obtain a global model with the capability of being personalized fast by further local fine-tuning [11, 21, 32, 71]. Clustering-based methods have also been shown to be effective in several studies by grouping clients with similar data distributions to achieve better personalized models and faster convergence [8, 12, 20, 22, 57, 68, 77]. More recently, personalized FL has been realized with pruning-based techniques as well [7, 30, 43, 44, 76]. For a more detailed review of methods in the personalized FL literature, we recommend the surveys [39, 85].

FL Benchmarks. Current FL benchmarks primarily focus on enabling various platforms and APIs to perform FL for different applications. They often only realize the global FL approach and implement basic algorithms (e.g., FedAvg) [1, 2, 3, 55], while other benchmarks, such as FLOWER [6], FedML [26], FLUTE [18] and NIID-Bench [45], offer more customizable features and implementations of more algorithms. For a more detailed review of the applicability and comparison of the existing benchmarks, we refer the reader to UniFed [54]. While the majority of existing benchmarks are for global FL, there are a few recently released personalized FL benchmarks, including pFL-Bench [10] and Motley [80]. However, these benchmarks do not investigate the effect of experimental variables and only consider a few algorithms and their variants for comparison. In contrast, FedZoo-Bench offers support for both global and personalized FL approaches, providing implementations of 22 SOTA methods and the ability to assess generalization to newcomers. Additionally, our study on the effect of FL-specific experimental variables in relation to each other and performance results, together with the identification of more meaningful setups for each FL approach, is a new contribution to the field.

III. BACKGROUND: FEDERATED LEARNING AND STATISTICAL HETEROGENEITY

A. Overview of Federated Learning and Notations

Consider a server and a set of clients S which participate in the federation process. f_i(x; θ_i) is the L-layer neural network model of client i with parameters θ_i = (W_i^1, ..., W_i^L), where W^l stands for the l-th layer weights; client i also has training dataset D_i^train and test dataset D_i^test. At the beginning of communication round t, a subset of clients S_t is randomly selected with sampling rate C ∈ (0, 1] out of N total available clients. The selected clients receive the parameters from the server and perform local training for E epochs with batch size B and learning rate η. At the end of communication round t, the selected clients send back their updated parameters to the server for server-side processing and model fusion. This federation process continues for T total communication rounds. Algorithm 1 shows this process in detail.

Algorithm 1 Federated Learning
Require: number of clients (N), sampling rate (C ∈ (0, 1]), number of communication rounds (T), local dataset of client k (D_k), local epochs (E), local batch size (B), learning rate (η).
 1: Initialize the server model with θ_g^0
 2: for each round t = 0, 1, ..., T − 1 do
 3:   m ← max(C · N, 1)
 4:   S_t ← (random set of m clients)
 5:   for each client k ∈ S_t in parallel do
 6:     θ_k^{t+1} ← ClientUpdate(k; θ_g^t)
 7:   end for
 8:   θ_g^{t+1} ← ModelFusion({θ_k^{t+1}}_{k∈S_t})   {FedAvg [62]: θ_g^{t+1} = Σ_{k∈S_t} |D_k| θ_k^{t+1} / Σ_{k∈S_t} |D_k|}
 9: end for
10: function ClientUpdate(k, θ_g^t)
11:   θ_k^t ← θ_g^t
12:   B ← (randomly split D_k^train into batches of size B)
13:   for each local epoch ∈ {1, ..., E} do
14:     for each batch b ∈ B do
15:       θ_k^t ← θ_k^t − η ∇_{θ_k^t} ℓ(f_k(x; θ_k^t), y)
16:     end for
17:   end for
18:   θ_k^{t+1} ← θ_k^t
19: end function

B. Problem Formulation

Global Federated Learning (gFL). The objective is to train a shared global model at the server which uniformly performs well on each client, and the problem is defined as follows:

$$\hat{\theta}_g = \underset{\theta_g}{\arg\min} \sum_{i}^{N} \mathbb{E}_{(x,y)\sim D_i^{train}}\left[\ell(f_i(x; \theta_g), y)\right]. \quad (1)$$

FedAvg [62] is the first and most popular algorithm proposed to solve Equation 1, which uses parameter averaging at the server side for model fusion.

Personalized Federated Learning (pFL). The objective is to train personalized models to perform well on each client's distinctive data distribution, and the problem is defined as follows:

$$\{\hat{\theta}_i\}_{1}^{N} = \underset{\{\theta_i\}_{1}^{N}}{\arg\min} \sum_{i}^{N} \mathbb{E}_{(x,y)\sim D_i^{test}}\left[\ell(f_i(x; \theta_i), y)\right]. \quad (2)$$

FedAvg + Fine-Tuning (FT) [32] is the simplest algorithm proposed to solve Equation 2, where each client fine-tunes the global model obtained via FedAvg on their local data.
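To make Algorithm 1 and the two baselines concrete, the following is a minimal PyTorch-style sketch of one FedAvg communication round with size-weighted model fusion, plus the local fine-tuning step that FedAvg + FT adds on top for the personalized objective. This is an illustrative sketch, not FedZoo-Bench's actual API; helper names such as `client_update` and `client_loaders` are assumptions for the example, and the fusion assumes all state-dict entries are float tensors.

```python
import copy
import torch

def client_update(model, loader, epochs, lr=0.01, momentum=0.9):
    # Lines 10-19 of Algorithm 1: E local epochs of SGD on the client data.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def fedavg_round(global_model, client_loaders, sampled_ids, epochs):
    # One communication round: broadcast, local training, weighted fusion
    # (line 8 of Algorithm 1: theta_g <- sum_k |D_k| theta_k / sum_k |D_k|).
    updates, sizes = [], []
    for k in sampled_ids:
        local = copy.deepcopy(global_model)
        updates.append(client_update(local, client_loaders[k], epochs))
        sizes.append(len(client_loaders[k].dataset))
    total = float(sum(sizes))
    fused = {key: sum((n / total) * u[key].float()
                      for u, n in zip(updates, sizes))
             for key in updates[0]}  # assumes no integer buffers in the model
    global_model.load_state_dict(fused)
    return global_model

def fedavg_ft(global_model, local_loader, ft_epochs):
    # FedAvg + FT [32]: each client fine-tunes the trained global model
    # on its own local data to obtain its personalized model.
    personalized = copy.deepcopy(global_model)
    client_update(personalized, local_loader, ft_epochs)
    return personalized
```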
C. Statistical Heterogeneity

Consider the local data distribution of clients, denoted as P_i(x, y). Clients have statistical heterogeneity, or a Non-IID data distribution w.r.t. each other, when P_i(x, y) ≠ P_j(x, y). The most popular way to realize this heterogeneity in FL is label distribution skew, in which clients have different label distributions of a dataset (i.e., P_i(y) ≠ P_j(y))². Two commonly used mechanisms to create label distribution skew in FL are:

• Label skew(p) [45]: each client receives p% of the total classes of a dataset at random. The data points of each class are uniformly partitioned amongst the clients owning that class.
• Label Dir(α) [29]: each client draws a random vector with the length of the total classes of a dataset from a Dirichlet distribution with concentration parameter α. The data points of each class are partitioned amongst clients according to each client's class proportions.

² Other mechanisms to realize statistical heterogeneity do exist, but are less commonly used in the FL community.
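For concreteness, the following NumPy sketch shows one way to implement the two mechanisms. The exact bookkeeping in published benchmarks such as [45] differs in details (e.g., whether p is given as a percentage or a class count, and how the Dirichlet draws are normalized across clients), so treat this as an illustrative sketch rather than the canonical implementation.

```python
import numpy as np

def label_skew_partition(labels, n_clients, p_frac, seed=0):
    # Label skew(p): each client receives a random p% of all classes;
    # each class's points are split uniformly among the clients owning it.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    k = max(1, int(round(p_frac * len(classes))))  # p given as a fraction here
    owners = {c: [] for c in classes}
    for i in range(n_clients):
        for c in rng.choice(classes, size=k, replace=False):
            owners[c].append(i)
    parts = [[] for _ in range(n_clients)]
    for c in classes:
        if not owners[c]:          # a class no client drew is simply unused
            continue
        idx = rng.permutation(np.where(labels == c)[0])
        for owner, chunk in zip(owners[c], np.array_split(idx, len(owners[c]))):
            parts[owner].extend(chunk.tolist())
    return parts

def label_dir_partition(labels, n_clients, alpha, seed=0):
    # Label Dir(alpha): each client draws a class-proportion vector from
    # Dirichlet(alpha); each class's points are then split according to
    # the clients' (column-normalized) proportions for that class.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    props = rng.dirichlet([alpha] * len(classes), size=n_clients)
    props = props / props.sum(axis=0, keepdims=True)
    parts = [[] for _ in range(n_clients)]
    for ci, c in enumerate(classes):
        idx = rng.permutation(np.where(labels == c)[0])
        cuts = (np.cumsum(props[:, ci])[:-1] * len(idx)).astype(int)
        for i, chunk in enumerate(np.split(idx, cuts)):
            parts[i].extend(chunk.tolist())
    return parts
```

Smaller α concentrates each client's draw on fewer classes, so Label Dir(0.1) is far more heterogeneous than Label Dir(0.5), matching the levels used in the experiments below.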
IV. COMPREHENSIVE STUDY ON FL EXPERIMENTAL VARIABLES

Overview. To design an effective FL experiment, it is crucial to understand how the FL-specific variables, namely clients' data partitioning (type and level of statistical heterogeneity), local epochs (E), sample rate (C), and communication rounds (T), interact with each other and can affect the results. While communication rounds (T) serve as the equivalent of epochs in traditional centralized training and primarily determine the training budget, the other variables have a more direct impact on performance. Hence, we focus our analysis on these variables, in relation to each other, performance results, and evaluation metric failure, and derive new insights for the FL community to design meaningful and well-incentivized FL experiments.

Baselines. We use three key baselines in our study: FedAvg [62], the standard FL baseline that has been widely used in the existing literature and can serve as a good representative for the global FL approach³; FedAvg + Fine-Tuning (FT) [32], a simple personalized FL baseline that has been shown to perform well in practice and can serve as a representative for the personalized FL approach³; and SOLO, a solo (local-only) training of each client on their own dataset without participation in federation, which serves as a baseline to evaluate the benefit of federation under different experimental conditions.

Setup. We use the CIFAR-10 [37] dataset and the LeNet-5 [42] architecture, which have been used in the majority of existing works. We fix the number of clients (N), communication rounds (T), and sample rate (C) to 100, 100, and 0.1, respectively, unless specified otherwise. We use the SGD optimizer with a learning rate of 0.01 and momentum of 0.9⁴. We use this base setting for all of our experimental studies in this section. The reported results are the average over 3 independent and different runs for a fairer and more robust assessment.

³ We show in Section VI-A that the performance of this baseline is competitive with the SOTA methods.
⁴ This optimizer and learning rate have been used in some works [14, 45, 77] under a similar setup, and we also find that they work the best for our studies.

A. Evaluation Metric

The evaluation metric for performance is a critical factor in making a fair and consistent assessment in FL. However, the way in which evaluation metrics are calculated in the current FL literature is often ambiguous and varies significantly between papers. In this part, we focus on identifying the causes of evaluation metric failures for each FL approach and bring our suggestions for avoiding them.

Global FL. The evaluation metric is the performance of the global model on the test dataset at the server. We find that the causes for evaluation failures are (1) the round used to report the result and (2) the test data percentage used to evaluate the model. Figure 1a shows the global model accuracy over the last 10 rounds on Non-IID Label Dir(0.1) partitioning. We can see there is a maximum variation of 7% in the results based on which round is picked for reporting the result. Also, the difference between the final-round result and the average bar is about 4%. This shows that the round used to report the result is important, and to have a metric more robust to these variations, it is better to report the average results over a number of rounds. Figure 1b shows the variations of the reported result for the same model using different test data percentages. This clearly shows that using different test data points can cause bias in the evaluation. To avoid the mentioned failures and have a more reliable evaluation metric, we suggest the following definition:
Definition 1 (global FL evaluation metric). We define the average performance of the global model on the entire test data at the server (if available) over the last [C · N] communication rounds as the evaluation metric for the global FL approach⁵.

Personalized FL. The evaluation metric is the final average performance of all participating clients on their local test data. The factor which can cause evaluation failure is the local test data percentage used to evaluate each client's model. Prior works have allocated different amounts of data as local test sets to individual clients. Figure 1c shows the variability of the reported results for the same clients under different local test data percentages on Non-IID Label Dir(0.1) partitioning. This highlights that using a randomly selected portion of the test data can lead to inaccurate and biased evaluations, depending on whether the selected data points are easy or hard. To avoid the mentioned failure in the evaluation metric, we suggest the following definition:

Definition 2 (personalized FL evaluation metric). We define the average of the final performance of all the clients on their entire local test data (if available) as the evaluation metric for the personalized FL approach⁵,⁶.

⁵ We use this metric definition for all of our experiments.
⁶ The entire local test data consists of allocating all available test samples belonging to the classes owned by the client.
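Both definitions are simple to operationalize. Below is a minimal sketch, assuming per-round server accuracies and per-client final accuracies have already been collected, and interpreting [C · N] as rounding up:

```python
import math
import numpy as np

def gfl_metric(server_acc_per_round, C, N):
    # Definition 1: average server-side test accuracy over the last
    # [C * N] communication rounds.
    window = max(1, math.ceil(C * N))
    return float(np.mean(server_acc_per_round[-window:]))

def pfl_metric(final_client_accs):
    # Definition 2: average of all clients' final accuracy on their
    # entire local test data (all test samples of the owned classes).
    return float(np.mean(final_client_accs))
```

With the base setting used in this section (C = 0.1, N = 100), gfl_metric averages the last 10 rounds, which is exactly the window shown in Figure 1a.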
B. Statistical Heterogeneity and Local Epochs

In this part, we focus our study on understanding how different levels and types of statistical heterogeneity, together with local epochs, can affect the results and change the globalization and personalization incentives in FL.

Level of statistical heterogeneity. Figure 2a illustrates how the performance of the baselines for a fixed label Dir type of statistical heterogeneity with 10 local epochs varies with the level of statistical heterogeneity. As the level of statistical heterogeneity decreases, the global FL approach becomes more successful than the personalized one. The vertical line in the plot indicates the approximate boundary between the incentives of the two FL perspectives. We can see that in the extreme Non-IID case (α = 0.05), neither of the FL approaches is motivated, as the performance of the SOLO baseline is competitive. Additionally, from α = 0.8 onwards, the global FL approach seems to perform close to the end of the spectrum at α = ∞, which is IID partitioning. Furthermore, we find that the incentives for globalization and personalization can vary with changes in the number of local epochs for a fixed type of statistical heterogeneity (see Section A-A for more results).

Local epochs. Figures 2b and 2c show how the performance of FedAvg and FedAvg + FT for a fixed label Dir type of statistical heterogeneity varies with different levels of statistical heterogeneity and the number of local epochs. Figure 2b suggests that FedAvg favors fewer local epochs to achieve higher performance. However, Figure 2c suggests that FedAvg + FT favors more local epochs for achieving better results, but no more than 5 or 10 depending on the level of statistical heterogeneity. Our findings support the observation in [36] that client drift can have a significant impact on performance, and increasing the number of local epochs amplifies this effect in the results.

Type of statistical heterogeneity. Figure 3 illustrates the results of an experiment similar to that of Figure 2, but with a label skew type of statistical heterogeneity. Figure 3a shows the performance of the baselines with fixed 10 local epochs at different levels of statistical heterogeneity. Comparing this figure with Figure 2a reveals that this type of heterogeneity favors personalization over globalization across a wider range of heterogeneity levels. Figures 3b and 3c show the performance of FedAvg and FedAvg + FT with different levels of statistical heterogeneity and local epochs. Comparing these figures with Figures 2b and 2c for the label Dir
type of statistical heterogeneity reveals that this type of statistical heterogeneity is less affected by an increase in local epochs at each level of heterogeneity. This highlights another finding: the effect of client drift may vary for different types of statistical heterogeneity (see Section A-A for more results).

C. Sample Rate

Figures 4a, 4b and 4c, 4d demonstrate the impact of sample rate and local epochs on the performance for label Dir(0.1) and label skew(30%) types of statistical heterogeneity, respectively. Our observations show that increasing the sample rate (i.e., averaging more models) can effectively mitigate the negative impact of statistical heterogeneity on performance⁷. Additionally, it is essential to consider that averaging a higher number of models (i.e., a high sample rate) reduces the approximation error associated with the averaged models across all clients⁸. We find that sample rates of C ≥ 0.4 can significantly reduce the effect of heterogeneity on the performance, while very small sample rates of C < 0.1 result in poor performance. Based on these findings, we suggest using a sample rate in the range of 0.1 ≤ C < 0.4 for experimental design to accurately evaluate an algorithm's success in the presence of data heterogeneity.

⁷ Model averaging is a standard component of many FL algorithms.
⁸ It is also worth mentioning that generally high sample rates are not favorable, as they would increase the communication cost of the FL algorithms.

V. SUMMARY AND RECOMMENDATIONS

In this section we identify a series of best practices and recommendations for designing a well-incentivized FL experimental setting based on our findings and insights in Section IV.

Level of statistical heterogeneity and local epochs. The level of statistical heterogeneity and the number of local epochs determine the incentives for global and personalized FL. Generally, when the level of statistical heterogeneity is low, global FL is more incentivized than personalized FL, and vice versa (Figures 2a and 3a). Additionally, increasing the number of local epochs can make personalized FL more incentivized (Figures 5 and 6). We have identified the well-incentivized settings for each FL approach in Table I.

TABLE I: Well-incentivized settings for personalized and global FL approaches.

Local Epoch | Label Dir (pFL) | Label Dir (gFL) | Label Skew (pFL) | Label Skew (gFL)
E = 1       | α < 0.3         | α > 0.3         | p < 0.8          | p > 0.8
E = 5       | α < 0.3         | α > 0.3         | p < 0.8          | p > 0.8
E = 10      | α < 0.5         | α > 0.5         | p < 0.8          | p > 0.8
E = 20      | α < 0.5         | α > 0.5         | p < 0.9          | p > 0.9

Type of statistical heterogeneity. We observe that the nature of the label skew type of statistical heterogeneity favors personalized FL over a wider range of heterogeneity levels compared to label Dir (Figures 2a, 3a, 5 and 6). Additionally, we observe that the impact of local epochs on performance is more pronounced for the label Dir type of statistical heterogeneity compared to label skew (Figures 2b, 2c, 3b and 3c).
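The boundaries in Table I can be read as a simple decision rule. The sketch below encodes the table verbatim; what happens exactly at a boundary value is not specified by the table, and the skew level p is taken as a fraction of classes here.

```python
def well_incentivized_approach(heterogeneity, level, local_epochs):
    # heterogeneity: "dir" (level = alpha) or "skew" (level = p, as a
    # fraction of classes). Returns the approach that Table I marks as
    # well-incentivized for this setting.
    if heterogeneity == "dir":
        boundary = 0.3 if local_epochs <= 5 else 0.5
    else:  # "skew"
        boundary = 0.8 if local_epochs <= 10 else 0.9
    return "pFL" if level < boundary else "gFL"
```

For example, well_incentivized_approach("dir", 0.1, 10) returns "pFL", consistent with the strong heterogeneity of Label Dir(0.1).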
To provide a comprehensive perspective on an algorithm's success in the presence of statistical heterogeneity, we recommend researchers conduct experiments with both types of statistical heterogeneity.

Sample rate. This variable plays a crucial role in evaluating an algorithm's performance under statistical heterogeneity. Choosing a high sample rate (C > 0.4) can mask the effect of statistical heterogeneity, thus misrepresenting an algorithm's true ability to handle it. Additionally, it can lead to inaccurate representations of an algorithm's capability to handle the stochasticity resulting from random device selection and the inherent errors caused by model averaging approximation. On the other hand, a low sample rate (C < 0.1) may hinder convergence due to insufficient models for averaging and high errors caused by model averaging approximation (Figure 4). To avoid these pitfalls, we recommend researchers use a sample rate of 0.1 ≤ C ≤ 0.4 for their experiments.

Summary. To ensure consistency, comparability, and meaningful results in FL experiments, we have compiled a set of recommended settings in Table II. We encourage researchers to adopt these settings, as well as the evaluation metrics outlined in Section IV-A, for their experiments. This will facilitate more consistent and fair comparisons with SOTA algorithms, and eliminate concerns about evaluation failures and the impact of various experimental settings.

TABLE II: Recommended settings for pFL and gFL approaches.

Approach | Type of Heterogeneity | Level of Heterogeneity | Local Epoch    | Number of Clients | Sample Rate
pFL      | Label Dir             | α ∈ [0.01, 0.3]        | {1, 5, 10, 20} | {20, 100}         | {0.1, 0.2, 0.3, 0.4}
pFL      | Label Skew            | p ∈ {2, 3, 4, 5}       | {1, 5, 10, 20} | {20, 100, 500}    | {0.1, 0.2, 0.3, 0.4}
gFL      | Label Dir             | α ∈ (0.3, 1]           | {1, 5, 10, 20} | {20, 100}         | {0.1, 0.2, 0.3, 0.4}
gFL      | Label Skew            | p ∈ {8, 9}             | {1, 5, 10, 20} | {20, 100, 500}    | {0.1, 0.2, 0.3, 0.4}

VI. FEDZOO-BENCH

We introduce FedZoo-Bench, an open-source library for federated learning that provides researchers with a comprehensive set of standardized and customizable features, such as training, Non-IID data partitioning, fine-tuning, performance evaluation, fairness assessment, and generalization to newcomers, for both global and personalized FL approaches. Additionally, it comes pre-equipped with a set of models and datasets, and 22 different pre-implemented SOTA methods, allowing researchers to quickly test their ideas and hypotheses. FedZoo-Bench empowers researchers to explore new frontiers in federated learning and enables fair, consistent, and reproducible FL research. We have provided more details on the implemented algorithms and available features in Appendix Section C.

A. Comparison of SOTA Methods

In this section, we present a comprehensive experimental comparison of several SOTA FL methods using FedZoo-Bench. We evaluate their performance, fairness, and generalization to new clients in a consistent experimental setting. Our experiments aim to provide a better understanding of the current progress in FL research.

Training setting. Following our recommended settings indicated in Table II, we choose two different training settings, presented in Table III, to conduct the experimental comparison of the baselines for each FL approach. We run each of the baselines 3 times and report the average and standard deviation of the results.

TABLE III: Training settings

Setting | Dataset   | Architecture | Clients | Sample rate | Local epochs | Partitioning             | Communication rounds
gFL #1  | CIFAR-10  | LeNet-5      | 100     | 0.1         | 5            | Non-IID Label Skew(80%)  | 100
gFL #2  | CIFAR-100 | ResNet-9     | 20      | 0.2         | 10           | Non-IID Label Dir(0.5)   | 100
pFL #1  | CIFAR-10  | LeNet-5      | 100     | 0.1         | 10           | Non-IID Label Skew(30%)  | 100
pFL #2  | CIFAR-100 | ResNet-9     | 20      | 0.2         | 10           | Non-IID Label Dir(0.1)   | 100
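As an illustration of how such a setting can be pinned down unambiguously, the gFL #1 row of Table III can be written as an explicit configuration. The field names below are illustrative, not FedZoo-Bench's actual API; the optimizer values are those of the Section IV setup.

```python
# Hypothetical experiment configuration; values are the gFL #1 setting
# from Table III plus the SGD optimizer choices from Section IV.
GFL_SETTING_1 = {
    "dataset": "CIFAR-10",
    "architecture": "LeNet-5",
    "num_clients": 100,
    "sample_rate": 0.1,
    "local_epochs": 5,
    "partitioning": "Non-IID Label Skew(80%)",
    "communication_rounds": 100,
    "optimizer": "SGD",
    "learning_rate": 0.01,
    "momentum": 0.9,
    "independent_runs": 3,
}
```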
TABLE IV: Performance and fairness comparison for personalized FL baselines.

(a) Performance Comparison

Algorithm        | Setting (pFL #1) | Setting (pFL #2)
FedAvg + FT [32] | 69.26 ± 0.42     | 49.03 ± 0.40
LG-FedAvg [51]   | 54.03 ± 2.16     | 25.30 ± 0.50
PerFedAvg [21]   | 76.03 ± 0.31     | 2.29 ± 0.07
IFCA [22]        | 64.84 ± 3.41     | 46.73 ± 1.82
Ditto [49]       | 70.97 ± 1.27     | 48.16 ± 3.25
FedPer [5]       | 64.64 ± 0.45     | 42.87 ± 1.66
FedRep [14]      | 54.99 ± 3.16     | 29.39 ± 0.31
APFL [17]        | 68.91 ± 0.59     | 55.18 ± 0.65
pFedMe [74]      | 10.00 ± 0.98     | 2.10 ± 0.30
CFL [68]         | 16.83 ± 1.6      | 1.52 ± 0.17
SubFedAvg [76]   | 69.95 ± 1.34     | 49.83 ± 1.09
PACFL [77]       | 67.78 ± 0.11     | 50.11 ± 1.10

(b) Fairness Comparison

Algorithm        | Setting (pFL #1) | Setting (pFL #2)
FedAvg + FT [32] | 8.05 ± 0.32      | 5.40 ± 0.77
LG-FedAvg [51]   | 12.79 ± 0.64     | 5.11 ± 0.38
PerFedAvg [21]   | 7.32 ± 0.27      | −−
IFCA [22]        | 10.56 ± 2.75     | 5.03 ± 0.07
Ditto [49]       | 7.50 ± 0.37      | 4.10 ± 0.73
FedPer [5]       | 7.64 ± 0.59      | 6.19 ± 0.59
FedRep [14]      | 9.18 ± 0.50      | 5.19 ± 0.73
APFL [17]        | 8.22 ± 0.96      | 4.60 ± 1.15
pFedMe [74]      | −−               | −−
SubFedAvg [76]   | 7.44 ± 0.46      | 4.66 ± 0.56
CFL [68]         | −−               | −−
PACFL [77]       | 8.82 ± 0.71      | 4.68 ± 0.42

Performance comparison. We use the evaluation metrics outlined in Section IV-A to compare the performance results.

• Global FL. Table V shows the performance results of 6 different global FL methods. Noticeably, Scaffold and FedProx give the best results in settings (gFL #1) and (gFL #2), respectively, while FedDF gives the worst results in both settings. FedAvg, which is the simplest method, appears to be competitive in both settings and even better than 4 other algorithms in setting (gFL #2).
• Personalized FL. In Table IVa, we present the performance results of 12 different personalized FL methods. Similar to the global FL methods (Table V), we observe that each method performs differently in different settings. No single method consistently achieves the best results across all settings. For example, PerFedAvg performs well in setting (pFL #1), but poorly in setting (pFL #2). Additionally, CFL and pFedMe perform poorly in both settings. On the other hand, FedAvg + FT, the simplest baseline, performs fairly well in both settings and is competitive or even superior to several other methods.

Fairness comparison. Fairness is another important aspect of the personalized FL approach. We use the fairness metric mentioned in [49, 63], which is the standard deviation of the final local test accuracies. Table IVb shows the fairness comparison of the methods. SubFedAvg and Ditto achieved the best fairness results in the (pFL #1) and (pFL #2) settings, respectively. FedAvg + FT also demonstrated competitive fairness results in both settings. For algorithms with poor performance results, we did not report the fairness results, as they would not be meaningful.

Generalization to newcomers. To evaluate the generalization capabilities of personalized FL methods to newcomers, we reserve 20% of the clients as newcomers and train the FL models using the remaining 80% of the clients. While the adaptation process for many methods is not explicitly clear, we follow the same procedure as in [60, 77] and allow the newcomers to receive the trained FL model and perform local fine-tuning. For methods like PACFL that have a different adaptation strategy, we follow their original approach. Table VI shows the results of this evaluation.

Discussion. The experimental comparison between several SOTA methods for each FL approach outlined in this section highlights the progress that has been made in FL research. While we can see that many methods have improved compared to the simple FedAvg and FedAvg + FT baselines for the global and personalized FL approaches, respectively, there are also some limitations that are worth noting:

• There is no method that consistently performs the best across all experimental settings. Furthermore, for the personalized FL approach, a method may achieve good fairness results but lack generalization to newcomers. Thus, evaluating FL methods from different perspectives and developing algorithms that can provide a better trade-off is crucial.
• Despite the existence of numerous works for each FL approach, the performance of the simple methods FedAvg and FedAvg + FT is still competitive with or even better than several methods. Thus, there is a need for new methods that can achieve consistent improvements across different types of statistical heterogeneity.
• Fairness and generalization to newcomers are two important aspects of the personalized FL approach that are often overlooked in the literature, which has focused mainly on performance improvement. Therefore, it is crucial to consider these aspects in addition to performance improvement when designing new personalized FL methods.
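For reference, the two pFL evaluation protocols used above reduce to a few lines. This is a minimal sketch of the fairness metric and the newcomer split, assuming final local test accuracies have already been collected:

```python
import numpy as np

def fairness(final_local_accs):
    # Fairness metric of [49, 63]: standard deviation of the clients'
    # final local test accuracies (lower means a fairer method).
    return float(np.std(np.asarray(final_local_accs, dtype=float)))

def split_newcomers(client_ids, newcomer_frac=0.2, seed=0):
    # Hold out a fraction of clients as newcomers; federated training
    # uses only the remaining clients. Newcomers later receive the
    # trained model and adapt by local fine-tuning, as in [60, 77].
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.asarray(client_ids))
    n_new = int(len(ids) * newcomer_frac)
    return ids[n_new:].tolist(), ids[:n_new].tolist()  # (train, newcomers)
```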
TABLE V: gFL accuracy comparison

Algorithm     | Setting (gFL #1) | Setting (gFL #2)
FedAvg [62]   | 44.89 ± 0.20     | 56.47 ± 0.57
FedProx [48]  | 46.01 ± 0.46     | 56.85 ± 0.36
FedNova [79]  | 44.59 ± 0.60     | 53.20 ± 0.33
Scaffold [36] | 56.85 ± 1.06     | 51.71 ± 0.65
FedDF [52]    | 27.43 ± 2.32     | 30.24 ± 0.26
MOON [46]     | 45.60 ± 0.31     | 50.23 ± 0.55

TABLE VI: pFL generalization to newcomers

Algorithm        | Setting (pFL #1) | Setting (pFL #2)
FedAvg + FT [32] | 64.19 ± 4.64     | 37.14 ± 0.43
LG-FedAvg [51]   | 40.39 ± 17.98    | 21.22 ± 2.56
PerFedAvg [21]   | 74.97 ± 1.10     | 2.22 ± 0.20
IFCA [22]        | 62.64 ± 1.03     | 14.84 ± 1.86
Ditto [49]       | 62.55 ± 3.10     | 38.96 ± 0.26
FedPer [5]       | 65.3 ± 2.41      | 35.66 ± 1.61
FedRep [14]      | 64.50 ± 0.62     | 23.85 ± 1.49
APFL [17]        | 66.38 ± 1.25     | 39.52 ± 1.11
SubFedAvg [76]   | 63.54 ± 1.42     | 30.81 ± 1.28
PACFL [77]       | 68.54 ± 1.33     | 36.50 ± 1.42

VII. CONCLUSION AND FUTURE WORKS

In this paper, we present a thorough examination of key variables that influence the success of FL experiments. Firstly, we provide new insights and analysis of the FL-specific variables in relation to each other and performance results by running several experiments. We then use our analysis to identify recommendations and best practices for a meaningful and well-incentivized FL experimental design. We have also developed FedZoo-Bench, an open-source library based on PyTorch that provides a comprehensive set of standardized and customizable features, different evaluation metrics, and implementations of 22 SOTA methods. FedZoo-Bench facilitates more consistent and reproducible FL research. Lastly, we conduct a comprehensive evaluation of several SOTA methods in terms of performance, fairness, and generalization to newcomers using FedZoo-Bench. We hope that our work will help the FL community to better understand the state of progress in the field and encourage more comparable and consistent FL research.

In future work, we plan to expand our study to other domains such as natural language processing and graph neural networks to understand how FL experimental settings behave in those areas and to assess their versatility and applicability across different problem domains. Additionally, we will continue to improve our benchmark by implementing more algorithms and adding new features. We also have plans to establish an open leaderboard using FedZoo-Bench, enabling systematic evaluations of FL methods across a wide variety of datasets and settings. Based on the comparison results presented in Section VI-A, we also believe that the development of new algorithms for both global and personalized FL approaches, achieving even greater improvement and more consistent results across different experimental settings, would be an exciting future avenue. Also, more studies on evaluation metrics and the development of new metrics that can better assess different aspects of an FL algorithm would be valuable future work.

APPENDIX

Organization. We organize the supplementary materials as follows:
• In Section A, we present additional experimental results to complete our analysis in Section IV of the main paper.
• In Section B, we discuss an experimental checklist to facilitate an easier comparison of FL methods.
• In Section C, we provide more details about the available algorithms, datasets, architectures and data partitionings in FedZoo-Bench.

APPENDIX A
ADDITIONAL RESULTS

A. Globalization and Personalization Incentives

The additional results in this part complement the results discussed in Section IV-B. Comparing Figures 5 and 6 further corroborates our finding mentioned in Section IV-B that Non-IID Label Skew partitioning has a higher incentive for personalization compared to the other type of heterogeneity. Moreover, increasing the local epochs has incentivized personalization more for both types of heterogeneity.

Fig. 5: These figures show globalization and personalization incentives at different levels of heterogeneity and local epochs for Non-IID Label Dir partitioning. The approximate boundary shifts from 0.3 to 0.5 with the increase of local epochs.

Fig. 6: These figures show globalization and personalization incentives at different levels of heterogeneity and local epochs for Non-IID Label Skew partitioning. The approximate boundary shifts from 80% to 90% with the increase of local epochs.

APPENDIX B
EXPERIMENTAL CHECKLIST

To facilitate an easier comparison between FL methods in future studies, we recommend the following checklist:
• Make sure that the used experimental setting is meaningful and well-incentivized for the considered FL approach.
• State the exact setting, including local epochs, sample rate, number of clients, type of data partitioning, level of heterogeneity, communication rounds, dataset, architecture, evaluation metrics, any pre-processing on the dataset, any learning rate scheduling if used, and initialization.
• Report the average results over at least 3 independent and different runs (a small helper for this is sketched below).
• Mention the hyperparameters used to obtain the results.
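As a small helper for the reporting item above, a sketch that aggregates independent runs into the "mean ± std" format used throughout Section VI:

```python
import numpy as np

def report_mean_std(metric_per_run):
    # Aggregate at least 3 independent runs as "mean ± std".
    runs = np.asarray(metric_per_run, dtype=float)
    return f"{runs.mean():.2f} ± {runs.std():.2f}"

# With illustrative numbers: report_mean_std([44.89, 45.05, 44.70])
# returns "44.88 ± 0.14".
```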
APPENDIX C
FEDZOO-BENCH

We introduced FedZoo-Bench in Section VI. In this section we provide more details about the available features in FedZoo-Bench.
For more information on FedZoo-Bench's implementation and use cases for different settings, refer to the project's documentation at https://github.com/MMorafah/FedZoo-Bench. Additionally, FedZoo-Bench can be easily used for other variations of FedAvg [67] and different choices of optimizers.
convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
[25] Filip Hanzely, Slavomír Hanzely, Samuel Horváth, and Peter Richtárik. Lower bounds and optimal algorithms for personalized federated learning. Advances in Neural Information Processing Systems, 33:2304–2315, 2020.
[26] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518, 2020.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
[28] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The non-IID data quagmire of decentralized machine learning. In International Conference on Machine Learning, pages 4387–4398. PMLR, 2020.
[29] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
[30] Tiansheng Huang, Shiwei Liu, Li Shen, Fengxiang He, Weiwei Lin, and Dacheng Tao. Achieving personalized federated learning with sparse local models. arXiv preprint arXiv:2201.11380, 2022.
[31] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994. doi: 10.1109/34.291440.
[32] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019.
[33] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[34] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
[35] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020.
[36] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
[37] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[38] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
[39] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 794–797. IEEE, 2020.
[40] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge.
[41] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791.
[42] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[43] Ang Li, Jingwei Sun, Binghui Wang, Lin Duan, Sicheng Li, Yiran Chen, and Hai Li. LotteryFL: Personalized and communication-efficient federated learning with lottery ticket hypothesis on non-IID datasets. arXiv preprint arXiv:2008.03371, 2020.
[44] Ang Li, Jingwei Sun, Xiao Zeng, Mi Zhang, Hai Li, and Yiran Chen. FedMask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking. In Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, pages 42–55, 2021.
[45] Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-IID data silos: An experimental study. arXiv preprint arXiv:2102.02079, 2021.
[46] Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10713–10722, 2021.
[47] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. FedDANE: A federated Newton-type method. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 1227–1231. IEEE, 2019.
[48] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:
recognition, 2014. URL https://arxiv.org/abs/1409.1556.
[71] Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, John Rush, and Sushant Prakash. Federated reconstruction: Partially local federated learning. Advances in Neural Information Processing Systems, 34:11220–11232, 2021.
[72] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in Neural Information Processing Systems, 30, 2017.
[73] Dianbo Sui, Yubo Chen, Jun Zhao, Yantao Jia, Yuantao Xie, and Weijian Sun. FedED: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2118–2128, 2020.
[74] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems, 33:21394–21405, 2020.
[75] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[76] Saeed Vahidian, Mahdi Morafah, and Bill Lin. Personalized federated learning by structured and unstructured pruning under data heterogeneity. In 2021 IEEE 41st International Conference on Distributed Computing Systems Workshops (ICDCSW), pages 27–34. IEEE, 2021.
[77] Saeed Vahidian, Mahdi Morafah, Weijia Wang, Vyacheslav Kungurtsev, Chen Chen, Mubarak Shah, and Bill Lin. Efficient distribution similarity identification in clustered federated learning via principal angles between client data subspaces. arXiv preprint arXiv:2209.10526, 2022.
[78] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020.
[79] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
[80] Shanshan Wu, Tian Li, Zachary Charles, Yu Xiao, Ziyu Liu, Zheng Xu, and Virginia Smith. Motley: Benchmarking heterogeneity and personalization in federated learning. arXiv preprint arXiv:2206.09262, 2022.
[81] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
[82] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. Applied federated learning: Improving Google keyboard query suggestions. arXiv preprint arXiv:1812.02903, 2018.
[83] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M Alvarez. Personalized federated learning with first order model optimization. arXiv preprint arXiv:2012.08565, 2020.
[84] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
[85] Hangyu Zhu, Jinjin Xu, Shiqing Liu, and Yaochu Jin. Federated learning on non-IID data: A survey. Neurocomputing, 465:371–390, 2021.

Mahdi Morafah received the BS in Electrical Engineering from the Tehran Polytechnic University, Tehran, Iran, in 2019, and the MS degree in Electrical and Computer Engineering from the University of California, San Diego, in 2021. He is currently pursuing the PhD degree in Electrical and Computer Engineering at the University of California, San Diego. His research interests include Machine Learning and Deep Learning, Distributed Learning, Federated Learning, Continual Learning, and Optimization.

Weijia Wang received the B.S. degree in Electrical Engineering from Zhejiang University, Hangzhou, China, in 2016 and the M.S. degree in Electrical and Computer Engineering from the University of California, San Diego, in 2018. He is currently pursuing the Ph.D. degree in Electrical and Computer Engineering at the University of California, San Diego. His research interest is machine learning and deep learning, including the compression and acceleration of deep convolutional neural networks, algorithms of meta-learning and federated learning, and explainable artificial intelligence.

Bill Lin received the BS, MS, and the PhD degrees in electrical engineering and computer sciences from the University of California, Berkeley in 1985, 1988, and 1991, respectively. He is a Professor in Electrical and Computer Engineering at the University of California, San Diego, where he is actively involved with the Center for Wireless Communications (CWC), the Center for Networked Systems (CNS), and the Qualcomm Institute in industry-sponsored research efforts. His research has led to over 200 journal and conference publications, including a number of Best Paper awards and nominations. He also holds 5 awarded patents. He has served as the General Chair and on the executive and technical program committee of many IEEE and ACM conferences, and he has served as an Associate or Guest Editor for several IEEE and ACM journals as well.