
Behave Differently when Clustering: A Semi-asynchronous Federated Learning Approach for IoT

Published: 23 February 2024

Abstract

The Internet of Things (IoT) has revolutionized the connectivity of diverse sensing devices, generating an enormous volume of data. However, applying machine learning algorithms to sensing devices presents substantial challenges due to resource constraints and privacy concerns. Federated learning (FL) emerges as a promising solution that allows training models in a distributed manner while preserving data privacy on client devices. We contribute SAFI, a semi-asynchronous FL approach based on clustering to achieve a novel in-cluster synchronous and out-cluster asynchronous FL training mode. Specifically, we propose a three-tier architecture to enable IoT data processing on edge devices and design a clustering selection module to effectively group heterogeneous edge devices based on their processing capacities. The performance of SAFI has been extensively evaluated through experiments conducted on a real-world testbed. As the heterogeneity of edge devices increases, SAFI surpasses the baselines in terms of convergence time, achieving a speedup of approximately 3× when the heterogeneity ratio is 1:7. Moreover, SAFI demonstrates favorable performance in non-independent and identically distributed settings and requires a lower communication cost than FedAsync. Notably, SAFI is the first Java-implemented FL approach and holds significant promise to serve as an efficient FL algorithm in IoT environments.

1 Introduction

The Internet of Things (IoT) connects a large number of sensing devices that gather massive amounts of data from diverse physical environments, which motivates the next wave of innovations driven by Artificial Intelligence (AI) to process IoT data [3]. Most IoT devices have constrained resources to leverage machine learning (ML) algorithms [18, 26], and IoT data typically need to be centrally stored in server machines with strong computing capacities for modeling and analytics. Nevertheless, the sensitive nature of IoT-generated data, such as personal health records and household electricity usage, poses significant privacy challenges, thereby hindering the realization of the full potential of IoT-enabled services [12]. Consequently, effectively utilizing the wealth of sensing data while preserving privacy presents considerable challenges.
Thanks to the advancement of edge intelligence [17, 48], it is now possible to process and model IoT data using ML approaches on edge devices that possess robust computational resources. Furthermore, Federated Learning (FL) emerges as a promising solution for constructing and training models without the need to share sensitive data. FL is a distributed ML paradigm enabling training and inference in a decentralized manner across heterogeneous devices [1, 21, 43]. Instead of centralizing the data for model training, the FL approach follows an iterative process comprising the following steps: (1) clients download an initial model from a central server; (2) clients update the model using their local data for training; (3) clients upload the updated model to the server; and (4) the server performs aggregation to obtain an updated global model.
The most widely used optimization algorithm in FL is Federated Averaging (FedAvg) [32]. However, the performance of FedAvg significantly deteriorates in the presence of heterogeneous client devices [30]. In practice, edge devices possess distinct hardware components (e.g., CPU, GPU, and RAM) and diverse running environments (such as varying network connections and power supply), thereby posing challenges to the direct application of the classical FedAvg algorithm. These challenges can be summarized as follows. First, the heterogeneity of edge devices has a significant impact on the training process. For instance, devices with superior computational capabilities may complete training faster than others, resulting in variations in training time. The server can only initiate aggregation once all clients have completed their training, thus the presence of devices with weaker resources, commonly referred to as stragglers, hampers the overall training efficiency [36]. Second, although the server typically possesses powerful computation resources compared to clients, it can only perform the aggregation operation when all client updates are available. Consequently, the server often remains idle, leading to a waste of computing resources. Third, conducting FL in IoT environments is not straightforward due to the limited resources of sensing devices.
In this article, we contribute SAFI, a semi-asynchronous FL approach for heterogeneous edge devices based on clustering, to address the aforementioned challenges. SAFI leverages a novel in-cluster synchronous and out-cluster asynchronous FL training mode to mitigate the straggler issues and efficiently improve the FL training speed. With SAFI, the server can frequently conduct aggregation through asynchronous communication thus avoiding the long-time idle status. To make SAFI compatible with IoT settings, we propose a three-tier architecture to enable the FL training on edge devices. Additionally, we design a clustering selection module to split the edge devices into different clusters to serve as the component of our SAFI algorithm. We further analyze the computing and communication cost of SAFI and evaluate them against FedAvg and FedAsync. The experimental results demonstrate that our algorithm enables mitigation of heterogeneity issues with only minor additional cost. In addition to our algorithmic contributions, we implement the first Java-based FL framework to make FL compatible with more resource-constrained IoT devices. This development is significant as existing Python-based frameworks are limited to operating system-dependent environments, whereas our Java-based framework overcomes this constraint.
We conduct a series of experiments using a real-world testbed consisting of heterogeneous edge devices to evaluate the performance of SAFI. The experimental results on the UCI-HAR dataset [2] and MNIST dataset demonstrate that SAFI can achieve accelerated convergence, up to nearly 3× compared with FedAvg when the heterogeneity ratio is 1:7, while still maintaining satisfactory accuracy. We define the term “heterogeneity ratio” as the ratio of computation resources of devices. For example, if one device is equipped with a 1 GHz CPU while another device’s CPU is 2 GHz, their CPU heterogeneity ratio is considered to be 1:2. Furthermore, SAFI exhibits several advantages over FedAvg and FedAsync in terms of stability and convergence speed in various non-independent and identically distributed (non-IID) settings [24, 51], and requires less communication cost compared to the pure asynchronous FedAsync approach. The contributions of this article are threefold:
We propose SAFI, a semi-asynchronous FL algorithm that leverages a novel in-cluster synchronous and out-cluster asynchronous FL training mode, which significantly speeds up the training process in the presence of heterogeneity.
We propose a three-tier architecture for IoT to transfer the FL training from resource-constrained IoT devices to edge devices and design a clustering selection module to cluster the heterogeneous edge devices.
We implement the first Java-based FL solution and release it as open-source. Extensive experiments are conducted to demonstrate the performance of SAFI on a real-world testbed, which contains various heterogeneous edge devices, in terms of accuracy, convergence speed, non-IID data, and communication cost.
The remainder of this article is organized as follows. Section 2 discusses related work on FL with edge computing, heterogeneity and straggler issues in FL, and semi-asynchronous FL. Section 3 motivates the article with a preliminary study to reveal the limitations of synchronous and asynchronous FL in heterogeneous and homogeneous settings. Section 4 starts with the SAFI architecture, presents the clustering selection module and the detailed algorithm description, and then presents the convergence analysis and the cost analysis. We detail the experimental setup and present the evaluation results of our algorithm in Sections 5 and 6, respectively. Section 7 extensively discusses several crucial issues of SAFI and its limitations. Section 8 presents the conclusion and future research directions.

2 Related Work

2.1 Federated Learning with Edge Computing

Edge computing moves the services and utilities of cloud computing in proximity to the data sources and users, thereby exhibiting notable characteristics, such as quick application response time and elimination of bandwidth limitation. In the edge computing paradigm, data can be processed at the network edge to address the issues of response time, energy constraint, cost savings, and privacy [8, 34]. Combining edge computing with FL facilitates the application of FL to a range of applications involving resource-constrained devices, such as IoT and augmented reality [41, 50]. Federated edge computing can also address privacy concerns. He et al. [15] propose FedGKT, a group knowledge transfer training algorithm, to enable FL in resource-constrained devices. Wu et al. [45] develop a personalized FL framework in a cloud-edge architecture to cope with the heterogeneity issues in IoT environments. To optimize network communication, Li et al. [23] propose a convergence-guaranteed FL algorithm that incorporates flexible communication compression, which provides a promising solution for accommodating edge devices and reducing communication cost across them. Luo et al. [28] conduct an analysis to select essential control variables, i.e., the number of selected clients and local iterations in each FL training round, to minimize the total cost while ensuring convergence in edge environments. Based on their analysis results, they implement a low-cost sampling-based algorithm to find the best aforementioned parameters, thereby achieving the lowest energy cost. Zheng et al. [54] discover that the parameter exchanges among edge nodes in FL are bandwidth-consuming and propose a distributed hierarchical tensor deep computation model to condense the parameters, thereby alleviating the burden on the edge system.
However, existing works primarily focus on making FL feasible in edge computing architectures, overlooking the potential straggler issues introduced by heterogeneous edge devices. Besides, most of the existing research is limited to conducting simulation experiments based on some FL benchmarks [4, 6, 39] to evaluate the performance. For example, the Python-based deep learning library PyTorch [37] is challenging to deploy on IoT and edge devices due to the absence of supported operating systems. Implementing FL on real-world devices presents various challenges, and accurately assessing the actual cost associated with computing and communication is difficult with simulation-based evaluations. In contrast, we implement a real-world experimental testbed using diverse edge devices to investigate the real performance of the SAFI approach.

2.2 Straggler Issues in Federated Learning

Due to the presence of a massive number of sensing devices in FL systems, heterogeneity becomes one of the most challenging problems [29]. This heterogeneity stems from the diverse resources of devices, leading to disparate completion times for the same learning task. The slower devices, referred to as stragglers, can severely affect the global training as the server must wait for them before aggregating the global model [36, 40]. Different FL-related methods are proposed to address these issues, which can be divided into two categories according to how the clients communicate with the server, i.e., synchronous FL and asynchronous FL.
The majority of existing works are based on synchronous FL. Chai et al. [7] conduct a case study to demonstrate the impact of device heterogeneity and propose a tier selection algorithm, which divides clients into multiple tiers based on device performance. This approach allows the system to select clients from the same tier for training, thus avoiding the straggler problem. However, the authors do not describe how to decide the number of tiers. Furthermore, while the server can select devices with similar capacities in each round to reduce waiting time, the remaining unselected devices will be idle, potentially increasing the overall convergence time. Reisizadeh et al. [40] propose FLANP, a straggler-resilient FL algorithm that incorporates statistical characteristics of the clients’ data to select the proper clients in each round. The key idea is to start the training procedure with faster nodes and gradually involve the slower ones. This approach aims to mitigate the impact of stragglers on the overall training process. Horváth et al. [16] introduce a framework called FjORD, which alleviates the problem of system heterogeneity by tailoring the model width to the capabilities of each client. Since each client can efficiently train the corresponding model, no stragglers will exist. Rapp et al. [38] propose an approach that enables distributed learning in a heterogeneous system. In this setting, the parts of the neural networks that belong to different devices share the same topology, which means the parameters can be jointly learned. Experimental results show that this approach significantly improves the achievable reward on powerful devices while maintaining a high reward on weaker devices.
For the asynchronous FL approaches, Xie et al. [47] first adopt asynchronous updates in FL and propose a new asynchronous federated optimization algorithm, FedAsync. In FedAsync, the server and clients can perform updates asynchronously, effectively addressing the straggler issues that can occur in synchronous FL. Considering the number of data samples on each device is not constant, Chen et al. [10] propose ASO-Fed. This asynchronous FL framework performs online learning with continuous streaming of local updates from clients, enabling wait-free communication. Liu et al. [27] propose an adaptive asynchronous FL mechanism that intelligently adjusts the number of local models for aggregation according to their arrival orders and the network situations, thus avoiding long waiting times and eliminating straggler issues. A recent work proposes FedBuff [35], an asynchronous FL framework with buffered asynchronous aggregation. By leveraging the design of a buffer, secure aggregation and differential privacy can be easily applied to improve the security of FL training.

2.3 Semi-asynchronous Federated Learning

Beyond vanilla synchronous FL and asynchronous FL approaches, a semi-asynchronous mechanism has emerged as a practical approach combining the benefits offered by both synchronous and asynchronous FL. Hao et al. [14] propose a semi-asynchronous FL mechanism to reduce the stragglers in both synchronous and asynchronous communication manners. However, the proposed method expands the data so that devices with richer computing resources take the same processing time as resource-constrained ones, which does not truly accelerate the training process. Wu et al. [46] propose SAFA, a semi-asynchronous FL protocol to improve training efficiency. They focus on dealing with unreliable user devices and designing a client selection algorithm to decouple the central server and clients. Specifically, SAFA only chooses well-trained clients for synchronous training while conducting asynchronous training for stragglers. In contrast, we propose a clustering selection module to cluster different clients based on their processing speeds, allowing each client to conduct synchronous training inside the cluster. Ma et al. propose FedSA [31] to address the heterogeneity challenge by developing an algorithm to determine the optimal number of clients in each round so that the total training time is minimized. However, even though the authors claim FedSA is a semi-asynchronous mechanism, no synchronous communications are described in the algorithm. Instead, all the clients participate in the global updating asynchronously. CSAFL [52] is a semi-asynchronous FL framework that incorporates the idea of clustering and splits the clients into different groups. CSAFL assumes all groups are deployed on the central server, which means each client needs to communicate with the server directly. This assumption does not hold with a large number of clients, such as in typical cross-device FL [19], and can potentially lead to server crashes. Moreover, CSAFL only clusters the clients in the initial stage, which can result in training process stagnation when certain devices go offline or experience communication issues. Unlike CSAFL, our method allows clients to communicate with local coordinators, which are responsible for communicating with the central server in an asynchronous manner. This approach effectively alleviates the communication burden on the server. Besides, our clustering selection module periodically monitors the status of each client and performs reclustering when changes in their status occur. Finally, FedCH [44] is another semi-asynchronous algorithm leveraging clustering. In FedCH, the cluster topology is initially determined by solving a combinatorial optimization problem regarding local training time and communication time. Subsequently, the clients conduct hierarchical aggregation in heterogeneous settings. In contrast, SAFI leverages a clustering selection module that periodically collects the processing time of each client as a clustering feature, which has a much lower computing cost. In addition, we conduct numerical analysis and experimental validation to compare the cost of SAFI with other methods, thereby emphasizing our competitive advantages.

3 A Motivation Study

This section motivates our research with a preliminary evaluation of the synchronous FedAvg and asynchronous FedAsync algorithms in both heterogeneous and homogeneous settings. Prior to diving into this evaluation, we revisit synchronous and asynchronous FL. In synchronous FL, clients execute local updates and send the updated parameters to the server. The server awaits the completion of local training tasks by all the participating clients in each round before updating the global model. Consequently, both the server and the clients with high computation capacity often experience prolonged idle waiting times, resulting in increased overall training time and wasted computing resources. This issue is exacerbated in scenarios with highly heterogeneous devices, as we will elaborate on in detail in Section 6.2. In asynchronous FL, the server updates the global model as soon as it receives a local update from an arbitrary client, eliminating idle time on both the server and the clients. Consequently, they fully use their computation resources to speed up the entire training process. However, as each client updates at its own learning pace, the network communication overhead increases compared to synchronous FL. This increase depends heavily on the number of clients in the system. The communication overhead becomes unacceptable when the system involves many heterogeneous devices, which severely affects data transmission efficiency.
Figure 1 compares synchronous and asynchronous FL in heterogeneous and homogeneous settings. In this motivating experiment, we set the heterogeneity ratio to be 1:2. When devices are homogeneous, synchronous FL converges in approximately 15 min (depicted by the blue curve in Figure 1), while synchronous FL requires about 30 min to achieve convergence in the heterogeneous setting (depicted by the orange curve in Figure 1). The convergence time grows approximately linearly with the degree of system heterogeneity, which shows that heterogeneity significantly increases the convergence time of the synchronous algorithm. Compared with synchronous FL, asynchronous FL theoretically alleviates the heterogeneity issues because each client sends and receives updates at its own pace, without waiting for the other clients. However, as depicted by the green curve in Figure 1, the training process exhibits instability and requires about 25 min to converge, which is still slower than synchronous FL in the homogeneous setting. The frequent aggregations from various clients in a simple linear combination make the global model’s performance vary significantly for different clients. The system needs more time to stabilize this fluctuation, which increases the convergence time. In addition, it leads to more communication cost compared with the relatively fixed number of communication rounds in FedAvg. Therefore, designing an algorithm that leverages the advantages of the two methods mentioned above without introducing excessive communication overhead presents a critical challenge.
Fig. 1. Training loss and time comparison among synchronous FL and asynchronous FL in heterogeneous and homogeneous settings.

4 SAFI Approach

In this section, we introduce the three-tier architecture, present the clustering selection module, and highlight the SAFI algorithm. Finally, we analyze the communication and computing cost of SAFI.

4.1 SAFI Architecture

Figure 2 presents SAFI’s architecture, a hierarchical framework composed of three layers, i.e., IoT, edge, and cloud layers. The IoT layer includes various sensing devices generating a huge amount of data but with minimal computation capacities, e.g., smartwatches, smart home sensors, and smart cameras. The devices in the edge layer tend to have more powerful computational capacity with CPUs or even GPUs, such as smartphones, Raspberry Pis, LattePanda, and NVIDIA Jetson TX2s. Computation capabilities of these devices allow them to train ML models in this layer. The cloud layer consists of high-performance server machines capable of conducting complex and challenging tasks.
Fig. 2. Three-tier architecture of SAFI.
We design a clustering selection module in the edge layer to divide edge devices into distinct clusters based on their processing capacities. Each cluster elects a coordinator with the most powerful processing capacity. The coordinator assumes two roles, including (1) as a proxy FL server to synchronously communicate with other edge clients inside a cluster; and (2) as a general FL client to asynchronously communicate with the cloud server outside a cluster. This design minimizes idle time and maximizes the overall system performance. By adopting this approach, the system can better handle heterogeneity and mitigate straggler issues. Figure 2 presents the major components of the SAFI architecture with synchronous communication flow (depicted by blue arrows) and asynchronous communication flow (depicted by orange arrows). In the cloud layer, the server receives the updated parameters asynchronously from coordinators and conducts aggregation, then distributes the updated global model back to the edge layer for the subsequent round of training. The aggregation of the model can be vanilla averaging or an adaptive linear combination. The devices in the IoT layer deliver raw data to the edge devices, which conduct local training with this data and send the updated model parameters to each coordinator. Note that the edge devices are assumed to be trustworthy for sensing devices connecting to them, as they are typically owned by the same user. The coordinators send the local aggregated model parameters to the cloud layer and receive the updated global model. As the last step, the coordinator distributes the received global model to the other edge devices within a cluster. The system iterates through multiple rounds following this process until convergence is achieved. Once the model is well-trained, predictions and decisions can be made and then sent back to the IoT layer for specific tasks.

4.2 Clustering Selection Module

In this subsection, we outline our approach for selecting clusters and their coordinators. The performance of edge devices is intrinsically linked to available resources, e.g., CPU, memory, and network bandwidth. However, solely evaluating the performance of active devices based on their physical resource features is inadequate, as environmental factors like temperature and voltage can exert a substantial influence. Furthermore, network conditions play a pivotal role in determining the efficiency of the training process, as numerous rounds of communication occur between the server and clients.
To this end, we select the processing time as a key metric to measure the performance of an edge device. The processing time is defined as the duration taken by an edge device to perform its local training task, which is measured as the time interval between receiving the global model and returning the updated model to the server. Specifically, we improve the vanilla FedAvg algorithm by incorporating a clustering selection module at the beginning phase of training. This module works as follows. When the initial model is delivered to edge clients, each edge device conducts a local SGD update using the same batch size, similar to the normal FedAvg algorithm but only for one round. Afterwards, edge devices deliver the updated local model parameters to the cloud server. The cloud server then performs two tasks. The first task is aggregation. The server conducts a parameter averaging operation for this round to obtain the new global model. The second task is collecting the processing time of each edge client. The collected times serve as a comprehensive metric of each client’s computation capacity and network condition, which can be utilized as a feature to cluster the edge devices.
We leverage the k-medians algorithm for clustering according to the processing times. There are two reasons for choosing this algorithm. First, we do not want the clustering process to consume excessive time and resources, as it is an additional task on top of the FL training and may be conducted multiple times when the edge environment undergoes changes. Compared to more complex clustering algorithms, k-medians is known for its efficiency, making it well-suited for scenarios where the feature dimension is relatively low. Second, compared to the classic k-means algorithm, k-medians is more robust to outlier data. Considering the potential network delay, anomalous processing times cannot be avoided. Therefore, k-medians proves to be a suitable choice for our module.
Then, the crucial question becomes how to find an optimized value of k, in other words, the optimal number of clusters. Here, we define a cost function to find the optimal number of clusters:
\begin{equation} \min _{k\in \mathbb {N}^{*}} J(k)=\frac{1}{k}\sum _{i=1}^{k}\operatorname{var}(i)+\mu k, \end{equation}
(1)
where k is the number of clusters and \(var(i)\) is the variance of the \(i{\text{th}}\) cluster. We aim to minimize the resource variance within each cluster and maximize it among different clusters while choosing the least possible number of clusters, as more clusters will introduce more communication cost. Therefore, we add k into the cost function using \(\mu\) as the weight factor to balance the impact of network traffic.
Once the clustering selection is completed, the cloud server assigns a coordinator for each cluster. The coordinator’s role is to perform aggregate operations within the cluster and handle asynchronous communication with the cloud server outside the cluster. The server will select the edge device with the shortest execution time in this cluster as the coordinator, according to the processing times obtained in the previous step. Secure Aggregation [9] can be leveraged in the coordinator to protect privacy so that the coordinator cannot know where the parameters come from but still can conduct aggregation. Furthermore, since the network environments of edge devices and online status may change over time, the clustering selection module will check the processing time metrics periodically and, if necessary, update the clusters and coordinators to ensure the training is successful. Given that the processing time is only a float-type number, it can be integrated into the exchanged parameter tables with negligible communication cost.
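To make the module concrete, the following Java sketch shows how the cloud server could choose k by minimizing Equation (1) and then group clients with one-dimensional k-medians over their reported processing times. The class and method names are illustrative placeholders rather than part of the released implementation, and the weight mu must be tuned for the deployment.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the clustering selection module (all names are hypothetical).
public class ClusteringSelection {

    // One-dimensional k-medians over the reported processing times (e.g., in seconds).
    static int[] kMedians(double[] times, int k, int iterations) {
        double[] sorted = times.clone();
        Arrays.sort(sorted);
        double[] medians = new double[k];
        for (int j = 0; j < k; j++) {                         // spread the initial medians over the sorted range
            medians[j] = sorted[(int) ((j + 0.5) * sorted.length / k)];
        }
        int[] assignment = new int[times.length];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < times.length; i++) {          // assign each client to its closest median
                int best = 0;
                for (int j = 1; j < k; j++) {
                    if (Math.abs(times[i] - medians[j]) < Math.abs(times[i] - medians[best])) best = j;
                }
                assignment[i] = best;
            }
            for (int j = 0; j < k; j++) {                     // recompute each median from its members
                List<Double> members = new ArrayList<>();
                for (int i = 0; i < times.length; i++) if (assignment[i] == j) members.add(times[i]);
                if (members.isEmpty()) continue;
                Collections.sort(members);
                medians[j] = members.get(members.size() / 2);
            }
        }
        return assignment;
    }

    // J(k) from Equation (1): mean in-cluster variance plus the traffic penalty mu * k.
    static double cost(double[] times, int[] assignment, int k, double mu) {
        double sumVar = 0.0;
        for (int j = 0; j < k; j++) {
            double sum = 0, sumSq = 0;
            int n = 0;
            for (int i = 0; i < times.length; i++) {
                if (assignment[i] == j) { sum += times[i]; sumSq += times[i] * times[i]; n++; }
            }
            if (n > 0) sumVar += sumSq / n - (sum / n) * (sum / n);
        }
        return sumVar / k + mu * k;
    }

    // Try k = 1..kMax and keep the grouping with the smallest J(k).
    static int[] select(double[] times, int kMax, double mu) {
        int[] best = null;
        double bestCost = Double.MAX_VALUE;
        for (int k = 1; k <= Math.min(kMax, times.length); k++) {
            int[] assignment = kMedians(times, k, 20);
            double c = cost(times, assignment, k, mu);
            if (c < bestCost) { bestCost = c; best = assignment; }
        }
        return best;
    }
}

In this sketch, select(times, kMax, mu) returns, for each client, the index of the cluster it belongs to; the client with the smallest processing time in each cluster would then be appointed as the coordinator.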

4.3 SAFI Algorithm

Based on the above-mentioned three-tier architecture and clustering selection module, we propose SAFI, an FL algorithm with in-cluster synchronous communication and out-cluster asynchronous communication. Algorithm 1 outlines the training process of SAFI on the server side. The cloud server first initializes a global model and broadcasts the model to the clients. After collecting the processing times from the clients, the server calls the clustering selection module to divide edge devices into different clusters and assigns a coordinator for each cluster. When the server receives model updates from an arbitrary coordinator, it immediately conducts linear aggregation to update the global model and then sends the new global model back to that coordinator. This process continues iteratively until the global model achieves convergence.
Algorithm 2 illustrates the workflow of SAFI on the client side. In each round, the coordinator initiates the process by broadcasting the global model to the remaining clients within a cluster. Each edge device, including the coordinator itself, performs local updates with multiple epochs and then sends the model parameters to the coordinator. Upon receiving the model parameters from all the clients, the coordinator conducts the averaging aggregation to update the cluster-based global model. Subsequently, the coordinator asynchronously uploads the updated model parameters to the cloud server when the cluster-based global model is ready. The cloud server conducts the final averaging aggregation using the received parameters. In the end, the cloud server sends the aggregated global model back to the coordinator to complete a training round. SAFI consists of the two above-mentioned algorithms and thus combines the advantages of synchronous and asynchronous FL, significantly decreasing the total idle time and mitigating the straggler problems in the system.
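To illustrate how the two algorithms interact, the following minimal Java sketch models parameters as plain double arrays and uses a blocking queue as a stand-in for the asynchronous channel between coordinators and the cloud server. All class and method names are hypothetical and simplified, and a fixed alpha is used for brevity; the staleness-adaptive weight is introduced in Equations (2)–(4) below.

import java.util.List;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of SAFI's two loops on toy double[] models (all names are hypothetical).
public class SafiSketch {
    static final int DIM = 10;   // toy model size

    // Server side (Algorithm 1): aggregate asynchronously whenever any coordinator uploads.
    static double[] serverLoop(BlockingQueue<double[]> uploads, int globalRounds, double alpha)
            throws InterruptedException {
        double[] global = new double[DIM];
        for (int r = 0; r < globalRounds; r++) {
            double[] clusterUpdate = uploads.take();           // returns as soon as one coordinator uploads
            for (int i = 0; i < DIM; i++) {                    // linear combination as in Equation (2)
                global[i] = (1 - alpha) * global[i] + alpha * clusterUpdate[i];
            }
            // in the full system, the new global model is sent back to that coordinator here
        }
        return global;
    }

    // Coordinator side (Algorithm 2): synchronous averaging inside the cluster, then one asynchronous upload.
    static void coordinatorRound(List<double[]> localUpdates, BlockingQueue<double[]> uploads) {
        double[] clusterUpdate = new double[DIM];
        for (double[] local : localUpdates) {                  // the list holds every cluster member's update,
            for (int i = 0; i < DIM; i++) {                    // so iterating over it acts as the synchronous barrier
                clusterUpdate[i] += local[i] / localUpdates.size();
            }
        }
        uploads.offer(clusterUpdate);                          // hand off to the server without waiting for other clusters
    }
}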
Model staleness is an issue to be considered. In SAFI, the cloud server immediately conducts aggregation when receiving coordinator requests. However, the uploaded parameters may be stale. To illustrate, within a given time frame, the server might receive updates from coordinator a three times, but only once from coordinator b. In this scenario, the model from coordinator b is considered to be stale, which means it is still in the earlier stage of training. Aggregating them by averaging will degrade the performance of the global model, as the faster cluster has already trained a better model than the slower one. Therefore, we need to consider how to ensure that the global model is less affected by the undertrained local model without ignoring the contribution from the clients in the slower clusters. Inspired by Reference [47], we use an adaptive factor \(\alpha\) to mitigate the impact of staleness. Concretely, in the cloud server at the \(r{\text{th}}\) round, we have
\begin{equation} w_{r}=(1-\alpha)w_{r-1}+\alpha w_{c}, \end{equation}
(2)
where \(\alpha \in (0, 1)\), \(w_{c}\) denotes the uploaded model parameters from the coordinator, and \(w_{r}\) denotes the updated parameters of the global model. Intuitively, the staler the parameters are, the less they should contribute to the overall model. Therefore, we choose a form that satisfies this monotonicity:
\begin{equation} \sigma (t^{\prime },t)=(t^{\prime }-t+1)^{-1}, \end{equation}
(3)
where \(t^{\prime }\) represents the current global round at the server and t represents the recorded global round of the clients. Specifically, whenever the server receives an update, the global round count increases by 1. Simultaneously, the updated round count is sent back to the coordinator, serving as an indicator to track staleness. In this form, \(\alpha\) incorporates the model staleness information:
\begin{equation} \alpha =\alpha \times \sigma (t^{\prime },t). \end{equation}
(4)
When the uploaded parameters are not stale, \(\sigma =1\) means the factor \(\alpha\) is just a fixed value, such as 0.5. On the contrary, if the parameters are sent to the server late, then the value of \(\sigma\) is between 0 and 1. The staler the parameters, the closer \(\sigma\) is to 0, leading to smaller \(\alpha\) and thus decreasing its impact on the global model.
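As a concrete illustration, the server-side update of Equations (2)–(4) could be written as the following Java helper method, again on plain double arrays; the method and parameter names are examples only and do not correspond to the released code.

// Illustrative staleness-aware aggregation at the cloud server (Equations (2)-(4)); names are examples.
static double[] aggregateWithStaleness(double[] globalPrev, double[] coordinatorUpdate,
                                       int serverRound, int clientRound, double alphaBase) {
    double sigma = 1.0 / (serverRound - clientRound + 1);      // Equation (3): sigma(t', t) = (t' - t + 1)^{-1}
    double alpha = alphaBase * sigma;                          // Equation (4): staler updates get a smaller weight
    double[] globalNew = new double[globalPrev.length];
    for (int i = 0; i < globalPrev.length; i++) {
        globalNew[i] = (1 - alpha) * globalPrev[i] + alpha * coordinatorUpdate[i];   // Equation (2)
    }
    return globalNew;
}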
Non-IID data is common in IoT settings as the devices are distributed in different environments, and non-IID data can significantly affect FL training [53]. To mitigate this impact on training, the idea is to keep the local SGD direction from drifting too far from the global optimal direction, even if the local data distribution is highly different from the others. Specifically, we add a regularization term to the loss function:
\begin{equation} \min _{w} l_{c}(w;w_{g})=L_{c} (w)+\frac{\gamma }{2} \left\Vert w-w_{g} \right\Vert ^{2}, \end{equation}
(5)
where \(L_{c}\) is the original loss function, w denotes the local model parameters, and \(w_{g}\) denotes the global model parameters. \(\gamma\) is a hyperparameter to adjust the bound. With this term, if the data distribution on a client is highly different from the others, its contribution to the global model will be limited, so it will not badly influence the performance of the global model.
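For illustration, the proximal term of Equation (5) and its gradient correction could be implemented as follows on plain weight arrays; the names and the choice of gamma are example values rather than our experimental settings.

// Illustrative proximal term of Equation (5) on plain weight arrays; gamma and the names are examples.
static double proximalPenalty(double[] w, double[] wGlobal, double gamma) {
    double sq = 0.0;
    for (int i = 0; i < w.length; i++) {
        double d = w[i] - wGlobal[i];
        sq += d * d;
    }
    return 0.5 * gamma * sq;                                   // (gamma / 2) * ||w - w_g||^2, added to L_c(w)
}

// Corresponding gradient correction, added to the local gradient in each SGD step.
static void addProximalGradient(double[] grad, double[] w, double[] wGlobal, double gamma) {
    for (int i = 0; i < grad.length; i++) {
        grad[i] += gamma * (w[i] - wGlobal[i]);                // derivative of the proximal term w.r.t. w
    }
}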
Figure 3 compares the idle times of different algorithms during FL training. Assume there are five clients with different processing speeds, whose required training times are indicated in Figure 3. In FedAvg, the server must wait for the slowest client (Client 5) to complete the update before performing aggregation. Therefore, Clients 1 through 4 experience different idle times, from t1 to t4. In FedAsync, the training process is continuous without any idle time, as clients can immediately send updates to the server and receive the new global model after the server aggregation. In SAFI, we assume the clients are divided into two clusters marked with green and red dotted rectangles, respectively. Within the green cluster, the idle time is significantly reduced, e.g., t1 \(^{\prime }\) versus t1 and t3 \(^{\prime }\) versus t3, as the heterogeneity within the cluster is low. It is worth noting that with a larger number of clients and a higher level of heterogeneity, our algorithm can significantly reduce the average idle time for all clients.
Fig. 3. Idle time comparison among different FL algorithms.
Through this in-cluster synchronous and out-cluster asynchronous training mode, SAFI reduces the impact of stragglers on the overall training time and decreases the idle time of the cloud server and of devices with ample computing resources, thus improving the efficiency of the training.

4.4 Theoretical Guarantees

To conduct convergence analysis, we first make the following assumptions on the loss function F.
Assumption 1 (Smoothness).
F is L-smooth with \(L > 0\), i.e., for all \(w_{1},w_{2}\),
\begin{equation*} F(w_{2})-F(w_{1})\le \left\langle \nabla F(w_{1}), w_{2}-w_{1} \right\rangle + \frac{L}{2}\left\Vert w_{2}-w_{1} \right\Vert _2^2. \end{equation*}
Assumption 2 (Strong Convexity).
F is c-strongly convex with \(c \ge 0\), i.e., for all \(w_{1},w_{2}\),
\begin{equation*} F(w_{2})-F(w_{1})\ge \left\langle \nabla F(w_{1}), w_{2}-w_{1} \right\rangle + \frac{c}{2}\left\Vert w_{2}-w_{1} \right\Vert _2^2. \end{equation*}
Definition 1.
With Assumptions 1 and 2, we define an upper bound of \(\Vert \nabla g(w; \xi) - \nabla F(w) \Vert ^2\) as
\begin{equation*} \Vert \nabla g(w; \xi) - \nabla F(w) \Vert ^2 \le B, \end{equation*}
where \(g(w; \xi)\) is the local loss function based on the local dataset \(\xi\) and \(F(w)\) is the global loss function.
Theorem 1.
With Assumptions 1 and 2, as well as \(\eta \lt \frac{1}{L}\) , the convergence upper bound of SAFI is given by
\begin{equation} \mathbb {E}[F(w_{T}) - F(w^*)] \le \left[1-\alpha +\alpha (1-\eta c)^{H} \right]^{T}(F(w_{0})-F(w^{*}))+\frac{B}{2} \left(1-[1-\alpha +\alpha (1-\eta c)]^{T} \right), \end{equation}
(6)
where \(\alpha\) is impacted by the staleness defined in Equation (4), H is the number of local updates, \(w_{0}\) is the initial model weights, and \(w^{*}\) is the weights of the optimal model.
Proof.
We start with the L-smoothness assumption: for any \(w_{t}, w_{t+1}\in \mathbb {R} ^{d}\),
\begin{align} F(w_{t+1})-F(w_{t}) &\le \nabla F(w_{t})^{\top }(w_{t+1}-w_{t})+\frac{L}{2}\left\Vert w_{t+1}-w_{t} \right\Vert ^{2} \nonumber \\ &\le \nabla F(w_{t})^{\top }(-\eta \nabla F(w_{t}))+\frac{L}{2}\left\Vert -\eta \nabla F(w_{t}) \right\Vert ^{2} \nonumber \\ &\le -\eta \left(1-\frac{L}{2}\eta \right)\left\Vert \nabla F(w_{t}) \right\Vert ^{2} \nonumber \\ &\le -\eta \left(1-\frac{L}{2}\eta \right)\cdot 2c\left(F(w_{t})-F(w^{*})\right) \nonumber \\ &\le -\eta c(F(w_{t})-F(w^{*})). \end{align}
(7)
Then, we have
\begin{align} F(w_{t+1})-F(w^{*})+F(w^{*})-F(w_{t}) \le -\eta c(F(w_{t})-F(w^{*})) \end{align}
(8)
\begin{align} F(w_{t+1})-F(w^{*}) \le (1-\eta c)(F(w_{t})-F(w^{*})). \end{align}
(9)
Iterating further, we will have
\begin{align} F(w_{t+1})-F(w^{*}) \le (1-\eta c)^{t+1}(F(w_{0})-F(w^{*})). \end{align}
(10)
Considering that client i in cluster k has conducted H local updates and given the current model staleness \(\tau\), the convergence bound is
\begin{align} \mathbb {E} [F(w_{t}^{\tau ,H })-F(w^{*})] &\le (1-\eta c)^{H}[F(w_{t}^{\tau ,0 })-F(w^{*})] + \frac{\eta B}{2}\sum _{h=1}^{H}(1-\eta c)^{h-1} \nonumber \\ &\le (1-\eta c)^{H}[F(w_{t}^{\tau ,0 })-F(w^{*})] + \frac{\eta B}{2}\frac{1-(1-\eta c)^H}{1-(1-\eta c)} \nonumber \\ &\le (1-\eta c)^{H}[F(w_{t}^{\tau ,0 })-F(w^{*})] + \frac{\eta B}{2}\frac{H\eta c}{1-(1-\eta c)} \nonumber \\ &\le (1-\eta c)^{H}[F(w_{t}^{\tau ,0 })-F(w^{*})] + \frac{\eta B}{2}H. \end{align}
(11)
For the global loss function, we have the convergence bound
\begin{align} \mathbb {E}[F(w_{T}) - F(w^*)] &\le (1-\alpha)F(w_{t-1})+\alpha \mathbb {E}[F(w_{k})] - F(w^*) \nonumber \\ &\le (1-\alpha)F(w_{t-1})+\alpha \mathbb {E} \left[F \left(\frac{\sum _{i=1}^{n_k} |D_k^i|w_k^i}{\sum _{i=1}^{n_k}|D_k^i| } \right) \right]-F(w^*) \end{align}
(12)
\begin{align} &\le (1-\alpha)F(w_{t-1})+\alpha \mathbb {E} [F(w_k^i)] -F(w^*) \end{align}
(13)
\begin{align} &\le (1-\alpha)(F(w_{t-1})-F(w^*)+F(w^*))+\alpha \mathbb {E} [F(w_k^i)-F(w^*)] \nonumber \\ &\le (1-\alpha)(F(w_{t-1})-F(w^*))+F(w^*)-\alpha F(w^*)+\alpha \mathbb {E} [F(w_k^i)]-F(w^*) \nonumber \\ &\le (1-\alpha)(F(w_{t-1})-F(w^*))+\alpha \mathbb {E} [F(w_k^i)-F(w^*)]. \end{align}
(14)
For Equations (12) and (13), we use the fact that the expected loss of an arbitrary cluster is lower than that of any single client in this cluster. This is because, within a cluster, the training is the same as traditional synchronous FL training, where the global model has better performance after aggregation. Combining Equations (11) and (14), we have
\begin{align} \mathbb {E}[F(w_{T}) - F(w^*)] &\le (1-\alpha)(F(w_{t-1})-F(w^*))+ \alpha (1-\eta c)^H (F(w_{t-\tau , 0})-F(w^*))+\alpha \frac{H\eta B}{2} \nonumber \\ &\le (1-\alpha)(F(w_{t-1})-F(w^*))+ \alpha (1-\eta c)^H (F(w_{t-1})-F(w^*))+\alpha \frac{H\eta B}{2} \nonumber \\ &\le (1-\alpha + \alpha (1-\eta c)^H)(F(w_{t-1})-F(w^*))+\alpha \frac{H\eta B}{2}. \end{align}
(15)
Iterating further, we can derive the convergence bound after T global epochs as shown in Theorem 1. □

4.5 Cost Analysis

The total cost of SAFI contains two aspects, i.e., computing cost and communication cost. A reasonable analysis of the computing and communication cost generated during the FL process is essential for designing an effective system. Specifically, we define two kinds of rounds, i.e., local round and global round. The local round refers to the synchronous FL communication rounds within the cluster. In contrast, the global round refers to the number of asynchronous aggregations outside the cluster on the cloud server side.

4.5.1 Computing Cost.

The FL training process iterates over multiple rounds until the model converges. Considering the number of data batches \(n_b^i\) and the number of epochs \(n_{e}\), the computing cost on the client side in cluster i can be formulated as
\begin{equation} C_{c}^{i}=n_{e} \times n_{b}^{i} \times round_{l}^{i} \times C_{b}, \end{equation}
(16)
where \(round_{l}^{i}\) refers to the number of local rounds in cluster i and \(C_{b}\) is the computing cost of one batch. In addition, model aggregation also consumes computing resources, such as averaging the parameters uploaded by clients. Assuming the cost of one aggregate operation is \(C_{a}\) , the cost of the server can be formulated as
\begin{equation} C_{s}=round_{g} \times C_{a}, \end{equation}
(17)
where \(round_{g}\) refers to the number of global rounds. Furthermore, the coordinator also conducts the aggregation task, which needs to aggregate the parameters from other clients within the same cluster. The cost of the coordinator in cluster i can be formulated as follows:
\begin{equation} C_{coor}^{i}=round_{l}^{i} \times C_{a}. \end{equation}
(18)
Considering there are \(N_{e}\) edge clients and \(N_{c}\) clusters, we denote the total computing cost as \(C_{comp}\); then we have
\begin{equation} C_{comp}=\sum _{i=1}^{N_e}C_{c}^{i}+\sum _{j=1}^{N_c}C_{coor}^{j}+C_{s}. \end{equation}
(19)

4.5.2 Communication Cost.

The communication cost of SAFI consists of three parts, including data exchange between the IoT layer and edge layer, model parameter exchange between the edge layer and server layer, and model parameter exchange between the coordinator and other edge clients within a cluster.
We focus on in-cluster synchronous and out-cluster asynchronous communications, which is the core communication approach of SAFI. The in-cluster data stream refers to the clients who must communicate with the coordinator within the same cluster. In contrast, the out-cluster data stream refers to the data exchange between coordinators and the cloud server. Communication cost of FL mainly involves model delivery and model parameter updates. Once the model is initialized, the size of the model parameters is a constant value. Here, we denote the communication cost of transmitting model parameters once as \(C_{p}\) . Assuming there are \(n_{e}\) edge clients in a cluster, the communication cost within a cluster can be formulated as
\begin{equation} C_{cluster}=2 \times round_{l} \times C_{p} \times (n_{e} - 1), \end{equation}
(20)
where the factor of 2 accounts for the parameter transfer including both the upload and download processes, and subtracting 1 excludes the coordinator itself. Similarly, the communication cost outside the cluster can be formulated as follows:
\begin{equation} C_{server}=2 \times round_{g} \times C_{p}. \end{equation}
(21)
Therefore, the total communication cost of SAFI can be formulated as
\begin{equation} C_{comm}=C_{cluster} + C_{server}. \end{equation}
(22)
In real-world FL systems, participating edge devices may support different communication links, such as NB-IoT, LTE, and WiFi hotspots. In the near future, smart sensing devices will support 5G. However, communication cost remains the principal constraint for FL, and the bottleneck mainly stems from the unreliable and slow communication links of participating devices. As a reference, we deploy our client on a TicWatch Pro smartwatch and test download and upload speeds for LTE and a WiFi hotspot, respectively. Testing results show that the average LTE download speed is 17.73 Mbps, the average LTE upload speed is 15.62 Mbps, the average WiFi hotspot download speed is 172 Mbps, and the average WiFi hotspot upload speed is 151 Mbps. In addition, we test download and upload speeds for a 5G testbed, which involves two 5G antennas on a cell tower providing coverage on a university campus. The average 5G download speed is 241.61 Mbps, and the upload speed is 14.85 Mbps. Therefore, different communication links significantly affect the performance of the FL system.

4.5.3 Cost Comparison with Other FL Algorithms.

Following the above-mentioned settings, we present a cost comparison with baseline FL algorithms, including the synchronous algorithm FedAvg, the asynchronous algorithm FedAsync, and the semi-asynchronous algorithms FedCH and SAFA. We start by formulating the computing cost of FedAvg as
\begin{equation} C_{avgcomp}=\sum _{i=1}^{N_e}C_{cg}^{i}+C_{s}, \end{equation}
(23)
where \(C_{cg}^i\) has a similar definition to Equation (16), with the only modification being the replacement of \(round_{l}^{i}\) by \(round_{g}\). Compared with FedAvg, SAFI has additional computing cost related to the coordinator. However, considering the aggregation process is vanilla averaging, the extra computing cost introduced by coordinators is relatively minor compared with the model training. For FedAsync, the computing cost is comparable to Equation (23) as it does not involve any extra operation beyond local update and model aggregation. Due to the lag-tolerant mechanism, SAFA does not enforce clients to upload their models if they are becoming stragglers, leading to a smaller \(N_e\) in Equation (23). However, this mechanism prolongs the time required for model convergence, leading to a larger \(round_g\). Therefore, there is a trade-off when analyzing the computing cost of SAFA. In terms of FedCH, although it is also based on clustering to enable semi-asynchronous training, it incurs more computing cost to determine the cluster topology by solving a combinatorial optimization problem. In contrast, our algorithm, SAFI, directly uses the training time as a feature, employing the k-medians algorithm in conjunction with Equation (1) for clustering. This significantly reduces the required computing cost while still achieving superior performance, which we have validated in Section 5.
We also start the communication cost comparison with FedAvg, which is
\begin{equation} C_{avgcomm}=2 \times round_{g} \times C_{p} \times N_{e}. \end{equation}
(24)
Compared with Equation (24), our method incorporates the extra communication \(C_{cluster}\) between edge devices and their coordinators, leading to a higher communication cost of SAFI compared to FedAvg. Although FedCH might have different values of \(n_e\) and \(round_g\) due to its different clustering algorithm, the total number of clients is fixed, resulting in communication cost comparable to that of SAFI. In homogeneous scenarios, SAFA has the same communication pattern as FedAvg. However, in heterogeneous settings, due to the lag-tolerant mechanism, some clients keep their updates in a cache rather than sending them directly to the server, thereby mitigating the issues arising from stragglers. Therefore, similar to the computing cost analysis, there exists a trade-off between decreasing the number of clients involved in one round (\(N_e\)) and increasing the required global rounds for convergence (\(round_g\)). As for FedAsync, although its communication cost can also be formulated by Equation (24), the value of \(round_{g}\) is much larger than in FedAvg and SAFI, as each client in FedAsync has its own training pace, leading to numerous communication rounds with the cloud server. However, our algorithm groups similar devices into clusters, leading to a unified pace among them and thus significantly decreasing \(round_{g}\). We will present the experimental results of the communication comparison among different algorithms in Section 6.4.

5 Experimentation Setup

In this section, we present the real-world testbed in detail. We also describe the datasets and baselines in the experiments and the model details.

5.1 Real-World Testbed

Most existing research conducts FL experiments in simulated environments, i.e., by simulating multiple terminals on a standalone machine to mimic different clients or simply splitting the dataset into multiple parts and then using a loop to simulate the FL training process [20, 25, 32]. Such simulated experiments may not reflect the actual running conditions of devices, and evaluating the real communication traffic in simulated experiments is difficult.
Different from most prior work, we conduct experiments on a real-world testbed that includes real edge devices. As illustrated in Figure 4, the testbed consists of five Raspberry Pi 4 Model B (RPi), two LattePanda V1 (LPV1), two LattePanda 2 Delta 432 (LP432), three LattePanda 3 Delta 864 (LP864), and two LattePanda 2 Alpha 864s (LP864s). RPi is a single-board computer (SBC) with limited resources, commonly used as an edge device. LattePanda is another kind of SBC with more powerful resources. We utilize different versions of LattePanda to incorporate heterogeneity. Table 1 presents the detailed specifications of edge devices in our experiments. We can observe from Table 1 that the CPU frequencies increase from 1 to 3.4 GHz, which is a wide range for easily setting varying levels of heterogeneity. RPi and LattePanda serve as edge devices in the edge layer to communicate with the IoT and cloud layers. Sockets are employed to communicate with the server machine in the cloud layer.
Table 1. Specifications Comparison Among Different Devices in Our Testbed

Device        | RPi 4        | LP V1           | LP 2 Delta 432  | LP 3 Delta 864  | LP 2 Alpha 864s
CPU           | A72, 1 GHz   | Z8350, 1.9 GHz  | N4100, 2.4 GHz  | N5105, 2.9 GHz  | M3-8100, 3.4 GHz
Memory        | 8 GB         | 4 GB            | 4 GB            | 8 GB            | 8 GB
Storage       | 32 GB        | 64 GB           | 32 GB           | 64 GB           | 64 GB
Power supply  | 5 V, 3 A     | 12 V, 3 A       | 12 V, 3 A       | 12 V, 3 A       | 12 V, 3 A
OS            | Linux        | Linux, Win      | Linux, Win      | Linux, Win      | Linux, Win
Fig. 4. Experimental testbed consisting of five Raspberry Pi 4 Model B, two LattePanda V1, two LattePanda 2 Delta 432, three LattePanda 3 Delta 864, and two LattePanda 2 Alpha 864s, i.e., 14 devices in total.
The FL cloud server is deployed on a MacBook Pro equipped with a 2.2 GHz Intel Core i7 processor and 16 GB memory. All devices are connected wirelessly through a WiFi router. Our system’s server and clients are implemented in Java using the Deeplearning4j library [42], a deep learning library designed for Java. Deeplearning4j is chosen because it can be deployed on various edge devices that can run a Java Virtual Machine (JVM). To the best of our knowledge, this is the first work implementing an FL algorithm with Java, which greatly expands the potential for deploying FL on sensing and wearable devices. Section 7 discusses the comprehensive advantages of this framework in detail.
Our implementation primarily focuses on the components and communications on the edge and cloud layers. IoT devices are mainly considered data sources and actuators. Therefore, we use a computer sending streaming data from datasets to serve as a cluster of virtual IoT nodes. This approach allows us to evaluate the algorithm without relying on physical IoT devices and ensures scalability in our experiments.
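As an example of how such virtual IoT nodes feed data to the edge layer, the following Java snippet replays dataset rows over a plain socket; the host, port, file name, and pacing interval are illustrative values rather than our actual configuration.

import java.io.PrintWriter;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative virtual IoT node that replays dataset rows to an edge device over a socket.
// The host, port, file name, and pacing interval are example values, not our actual configuration.
public class VirtualIoTNode {
    public static void main(String[] args) throws Exception {
        List<String> rows = Files.readAllLines(Paths.get("har_subset.csv"));
        try (Socket socket = new Socket("192.168.1.20", 9000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            for (String row : rows) {
                out.println(row);        // one sensor reading per line
                Thread.sleep(100);       // pace the stream to mimic a real sensing device
            }
        }
    }
}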

5.2 Datasets and Models

We conduct experiments using three representative types of data: time-series IoT sensor data, image data, and text data. For the time-series IoT sensor data, we choose UCI-HAR [2], a widely used Human Activity Recognition (HAR) dataset. This dataset includes six activities collected from 30 volunteers carrying a waist-mounted smartphone with embedded inertial sensors. The number of instances is 10,299, with 561 features each. For the image dataset, we choose the MNIST handwritten digit dataset [11], which contains a training set of 60,000 examples and a test set of 10,000 examples, with each example corresponding to one of ten different classes. The MNIST dataset is commonly used in experiments involving resource-constrained devices due to its simplicity and representativeness. For the text data, we use the Shakespeare dataset, which was initially introduced in LEAF [5] and has gained popularity as a natural language processing (NLP) dataset in FL research. The Shakespeare dataset is inherently non-IID, as each client corresponds to a specific character, and all lines of dialogue for a particular character constitute the data for an individual client.
To study both the independent and identically distributed (IID) and non-IID scenarios, we divide the datasets into IID and non-IID partitions. (1) For the IID scenario, we equally split the MNIST and HAR datasets into n parts, where n is the number of edge clients. This division ensures that each client has all categories of data and the distribution is similar to the raw training set. The amount of data in each client is also the same. Due to the non-IID characteristic of the Shakespeare dataset, we shuffle the data and conduct random sampling to create a new dataset that is evenly split among different clients, which significantly reduces the degree of non-IID.
(2) For the non-IID scenario, we split the non-IID dataset from MNIST in two ways, i.e., class non-IID and number non-IID. In the class non-IID setting, data is sorted by class and divided among five clients. Each client is randomly assigned two classes and has the same amount of data. For the number non-IID setting, we divide the data for different clients according to an arithmetic sequence, where the client with minimum data has 10% of the whole data, and the remaining clients hold the data incrementally, following an increment of 5 percentage points. In other words, the clients hold 10%, 15%, 20%, 25%, and 30% of the data, respectively.
We use a multilayer perceptron (MLP) model for the HAR task. It comprises an input layer, an output layer, and one hidden layer with 1,000 units using ReLU. The number of units can be adjusted based on the complexity of the dataset. To optimize the model, we use Nesterov’s momentum as the optimizer, with a learning rate of 0.006 and a momentum of 0.9. The MLP model is suitable for comparing different FL algorithms as it is relatively simple and easy to train on edge devices.
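For reference, this MLP can be configured in Deeplearning4j roughly as follows, with 561 inputs for the HAR features and six outputs for the activity classes; the snippet is a simplified sketch and may differ from the released code in its details.

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Nesterovs;
import org.nd4j.linalg.lossfunctions.LossFunctions;

// Approximate Deeplearning4j configuration of the HAR MLP described above.
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .updater(new Nesterovs(0.006, 0.9))                   // learning rate 0.006, momentum 0.9
        .list()
        .layer(new DenseLayer.Builder().nIn(561).nOut(1000)   // 561 HAR features -> 1,000 hidden units
                .activation(Activation.RELU).build())
        .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(1000).nOut(6)                            // six activity classes
                .activation(Activation.SOFTMAX).build())
        .build();
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

Because the configuration runs on any device with a JVM, the same model definition can be reused unchanged across the Raspberry Pi and LattePanda clients in the testbed.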
For the MNIST dataset, we use a convolutional neural network (CNN) to build the model, which starts with a \(3\times 3\) kernel size convolution layer with 50 units and a ReLU activation function, followed by a max-pooling layer of size \(2\times 2\). An identical layer combination is added again beneath the above layers. Then, a dense layer of 500 units with ReLU and a softmax output layer are added to build a typical CNN structure. We use Nesterov’s momentum optimizer and set the learning rate to 0.006. CNNs are well suited to image classification tasks, and this shallow CNN structure is easily deployed on edge devices. It is worth noting that, due to the limited computing resources of Raspberry Pi and other embedded devices, larger datasets such as CIFAR-10 and CIFAR-100 and more advanced models such as ResNet-18 and MobileNet are hard to train within an acceptable time.
For the Shakespeare dataset, we use a long short-term memory (LSTM) network to train the model. This LSTM network consists of an embedding layer that transforms input character indices into embedded vectors, an LSTM layer with a hidden size of 256 followed by dropout with probability 0.5, and a fully connected linear layer that projects the LSTM output onto the output space.

5.3 Baselines

Considering our system has both synchronous and asynchronous processes, we select FedAvg, FedAsync, FedProx, FedCH, and SAFA as the baselines for evaluating SAFI.
FedAvg is the classical synchronous FL algorithm that has been applied in many applications. In FedAvg, clients conduct local updates and then send the updated models to the server. The server conducts aggregation and sends the new global model back to the clients to finish a round.
FedAsync is an asynchronous FL algorithm that takes advantage of asynchronous training and combines it with FL. The training process is continuous without any waiting time as each client communicates with the server at its own pace. A weighting function is introduced to adjust the impact of stale models.
FedProx is an improved algorithm based on FedAvg to mitigate the heterogeneous issues in FL. FedProx adds a proximal term to the local objective function, which limits the distance between the local model and the global model.
FedCH is a cluster-based FL method with a hierarchical aggregation strategy. In FedCH, clients in one cluster conduct synchronous training with the cluster head, while all cluster heads perform asynchronous global aggregation.
SAFA is a semi-asynchronous FL method designed to mitigate the problems of low round efficiency and poor convergence rate. SAFA proposes a lag-tolerant mechanism to balance fast convergence against low communication overhead.

6 Results and Analysis

We conduct a series of comparative experiments to demonstrate the performance of SAFI in both heterogeneous and homogeneous settings. This section presents the experimental results and discusses the impact of heterogeneity, the impact of non-IID data, and the comparison of the real-world communication cost. All the experimental results are based on the average of multiple runs.

6.1 Accuracy and Convergence Speed

Throughout the training process, we evaluate the accuracy and convergence time of SAFI and the baselines. Figure 5 presents the accuracy and convergence time of the six FL algorithms in the homogeneous setting with RPis. Figures 5(a)–5(c) present the experimental results of the MLP model on the HAR dataset, the CNN model on the MNIST image dataset, and the LSTM model on the Shakespeare dataset, respectively. At the initial stage of training, FedAvg, FedProx, and SAFA achieve higher accuracy than the other three algorithms, because their global model parameters are the mean values of the updated models from clients. In contrast, FedAsync, FedCH, and SAFI employ a linear combination of the old and newly updated models, which may make the model performance unstable at the initial stage. All algorithms, except FedAsync, show similar performance after approximately 15 min, reaching an accuracy of 92.5% after 50 min on the HAR dataset and 98.8% after 35 min on the MNIST dataset, respectively. The accuracy of FedAsync is slightly lower than that of the other five algorithms on both datasets, but FedAsync still converges to a promising result. A similar trend can be observed in the LSTM model training on the Shakespeare dataset.
Fig. 5. Model accuracy of different FL algorithms in the homogeneous setting.
In the homogeneous setting, the clustering selection module groups all clients into a single cluster. Thus, the process of SAFI and FedCH is similar to that of FedAvg and FedProx, with the exception of an extra linear aggregation step with the server model. For SAFA, since there are no stragglers in a homogeneous environment, the algorithm performs synchronous communication between clients and the server, making its training process also comparable to FedAvg. FedAsync shows the slowest convergence and the lowest accuracy in the homogeneous setting due to its frequent aggregations with the older model, which can hinder the convergence speed. Additionally, this result provides an insight: if a model on a client is not sufficiently trained, the linear aggregation on the cloud server will significantly affect the accuracy of the existing global model, which is one of the reasons for the relatively low accuracy of FedAsync.
Figure 6 presents the comparison results in the heterogeneous setting. In this experiment, we select three RPis and two LPV1s as the experimental devices, resulting in a heterogeneity ratio of approximately 1:2 (1 GHz:1.9 GHz). The clustering selection module divides the devices into two clusters, i.e., one cluster with the three RPis (Group Pi) and the other with the two LPV1s (Group Panda). Unlike the homogeneous setting, both SAFI and FedCH outperform the other four algorithms in terms of convergence time. For example, on the HAR dataset, SAFI achieves convergence at about 37 min, closely followed by FedCH, which converges at around 40 min, while FedAvg and FedProx require more than 60 min. SAFA's performance falls in between. This is because SAFA's semi-asynchronous mode of operation is achieved by assigning different communication priorities to clients with different model versions, which still suffers from the straggler effect in a heterogeneous setting.
Fig. 6. Model accuracy of different FL algorithms in the heterogeneous setting.
Although FedAsync performs better than FedAvg and FedProx, it exhibits less stability in both the initial and ending stages. This can be attributed to the following reasons. In the initial stage, the local model of each client is not fully trained, and frequent aggregations lead to a biased global model due to heterogeneity. In the ending stage, although most local models have been fully trained and have reached a good accuracy, some stragglers still send stale updates to the server, which lowers the performance of the global model after aggregation.
Our experiment on the MNIST dataset indicates that SAFI converges at about 20 min, while FedAvg and FedProx require more than 40 min to achieve convergence. A comparison between Figures 6(a) and 6(b) demonstrates that the training process of FedAsync is more stable with the CNN on the MNIST dataset than with the MLP on the HAR dataset. This can be attributed to the CNN's superior performance on image data, especially on the simple MNIST dataset. Therefore, the model parameters sent from the stragglers are not significantly different from the new model parameters, which does not lead to noticeable fluctuation after aggregation. Regarding Figure 6(c), the performance trends of the different algorithms are similar to those observed on the other two datasets, the only difference being that almost all algorithms exhibit some degree of fluctuation during the training process. This is primarily due to the non-IID nature of the Shakespeare dataset.
Stragglers are the main reason for the lower performance of FedAvg and FedProx. Due to resource and network conditions, the completion time of one local update round in Group Pi is approximately twice that of Group Panda. Therefore, the cloud server has to wait for Group Pi to complete the local training before conducting global aggregation. In SAFI, Group Pi and Group Panda conduct local updates at their own paces and asynchronously send the updated parameters to the cloud server. This eliminates the presence of stragglers, resulting in a faster overall convergence speed. It is worth noting that the accuracy of FedAsync fluctuates significantly on the HAR dataset. This can be attributed to the relatively small size of the HAR dataset: after it is evenly divided among the five devices, each device holds a small amount of data, which might not be sufficient to fully train the models, leading to biases among different devices. In FedAsync, each device has to communicate with the cloud server frequently, and the linear aggregation then leads to fluctuation. We also observe that the other two semi-asynchronous algorithms, FedCH and SAFA, converge faster than the purely synchronous and asynchronous algorithms on all three datasets.

6.2 Impact of Heterogeneity

To evaluate the performance of our algorithm in different settings, we design six groups of experiments using the HAR dataset, as presented in Table 2. The second column shows the number of clusters created by our clustering selection module and the composition of clients in each cluster, and the third column shows the corresponding heterogeneity ratio of the devices. For example, Group 1 consists of five RPi devices, representing a homogeneous setting where all clients share a similar processing speed. Group 2 includes three RPi and two LPV1 devices, yielding a processing speed ratio of 1:1.9 according to their CPU frequencies in Table 1. The remaining groups incrementally increase the level of heterogeneity and the number of edge clients. The processing speed ratio reflects the difference in the clients' resources; a higher ratio represents higher heterogeneity in the experiment. A minimal sketch of this speed-based grouping is given after Table 2.
Group ID | Clusters (composition) | Heterogeneity ratio
1 | 1 (5 RPi) | 1
2 | 2 (3 RPi; 2 LPV1) | 1:1.9
3 | 2 (3 RPi; 2 LP864) | 1:2.4
4 | 2 (3 RPi; 2 LP864s) | 1:2.9
5 | 3 (3 RPi; 2 LP432; 2 LP864s) | 1:1.9:3.4
6 | 3 (3 RPi; 2 LPV1 + 2 LP432; 3 LP864 + 2 LP864s) | 1:2.4:3.4
Table 2. Number of Clusters and the Corresponding Heterogeneity Ratios in Different Groups
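The exact grouping logic belongs to the clustering selection module introduced earlier; the plain-Java sketch below only illustrates the speed-based idea behind Table 2, grouping devices whose processing speeds are within a configurable ratio of each other. The threshold value, the use of CPU frequency as the speed metric, and all names are assumptions made for illustration.

```java
import java.util.*;

// Hypothetical sketch of grouping edge clients by processing speed, in the spirit of
// the clusters in Table 2. A device joins the most recent cluster when its speed is
// within `ratioThreshold` of that cluster's slowest member; otherwise it starts a new
// cluster. The threshold and the speed metric are assumptions, not SAFI's actual logic.
public class SpeedClustering {

    static List<List<Double>> cluster(List<Double> cpuFrequenciesGhz, double ratioThreshold) {
        List<Double> sorted = new ArrayList<>(cpuFrequenciesGhz);
        Collections.sort(sorted);
        List<List<Double>> clusters = new ArrayList<>();
        for (double freq : sorted) {
            if (!clusters.isEmpty()) {
                List<Double> last = clusters.get(clusters.size() - 1);
                if (freq / last.get(0) <= ratioThreshold) { // similar speed -> same cluster
                    last.add(freq);
                    continue;
                }
            }
            clusters.add(new ArrayList<>(List.of(freq)));   // start a new cluster
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Group 2 from Table 2: three RPis at ~1 GHz and two LPV1s at ~1.9 GHz.
        List<Double> devices = List.of(1.0, 1.0, 1.0, 1.9, 1.9);
        System.out.println(cluster(devices, 1.5)); // -> [[1.0, 1.0, 1.0], [1.9, 1.9]]
    }
}
```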
Table 3 presents the convergence speed of the six FL algorithms in different heterogeneous settings. We observe that SAFI achieves a remarkable convergence speed at all heterogeneity levels, accelerating by more than \(\times 1.5\) compared with FedAvg in Group 5. In Group 1, representing a homogeneous scenario with only one cluster, all methods except FedAsync exhibit comparable convergence speeds, as expected. When heterogeneity is introduced, SAFI outperforms the other five algorithms. As the level of heterogeneity increases, the required convergence time of SAFI decreases rapidly. When the maximum processing speed ratio reaches 1:3.4, SAFI achieves convergence in only 33 min, while FedAvg and FedProx still require over 50 min under the same setting. Among the other two semi-asynchronous algorithms, FedCH is slightly slower than our algorithm, because it also performs clustering-based semi-asynchronous communication but with more clusters. SAFA converges faster than FedAvg and FedProx due to its lag-tolerant semi-asynchronous mechanism, but slower than both SAFI and FedCH, as it still suffers from straggler issues as the heterogeneity level increases.
Group ID | SAFI | FedAvg | FedAsync | FedProx | FedCH | SAFA
1 | 49 min | 52 min | 67 min | 53 min | 50 min | 51 min
2 | 41 min | 51 min | 53 min | 52 min | 42 min | 50 min
3 | 37 min | 50 min | 49 min | 53 min | 38 min | 48 min
4 | 35 min | 50 min | 47 min | 52 min | 37 min | 49 min
5 | 33 min | 50 min | 46 min | 52 min | 35 min | 47 min
6 | 30 min | 51 min | 44 min | 53 min | 33 min | 46 min
Table 3. Convergence Speed of Different FL Algorithms in Heterogeneous and Homogeneous Settings
Although Group 6 contains more heterogeneous devices, its convergence speed is even faster than Group 5, at about 30 min. From Table 2, we notice that the clustering selection module splits these five kinds of edge devices into three clusters. Compared with Group 5, the middle cluster in Group 6 is faster, which means more training rounds can be conducted simultaneously, providing a better initial model for the slow cluster and thus further accelerating its training. Meanwhile, the staleness function limits the impact of the slow devices on the global model. In contrast, the performance of FedProx is worse: due to the synchronous FL training mechanism, the server has to wait for all clients to be ready before conducting the aggregation. Therefore, its convergence time remains the same as in Group 1 even in the presence of faster devices. The performance of FedAsync lies between FedProx and SAFI: it exhibits the slowest convergence speed in the homogeneous setting, while becoming relatively faster as the heterogeneity increases.
To further investigate the performance of SAFI in more complex settings, we adjust the CPU frequency of the RPi to increase the level of heterogeneity. Specifically, we use RPi and LP864s devices for this experiment and gradually decrease the CPU frequency of the RPi from 0.85 to 0.34 GHz, making the speed ratio range from 4:1 to 10:1. We do not include FedProx in this comparison, as it is also a purely synchronous FL algorithm and shows a similar trend to FedAvg when heterogeneity exists, as Table 3 shows.
Figure 7 presents the acceleration effect of our algorithm compared to the baselines. As heterogeneity increases, SAFI exhibits an excellent acceleration in convergence, achieving nearly a \(\times 3\) speedup compared to FedAvg when the heterogeneity ratio is 7:1. SAFI also has a better acceleration effect than FedAsync at every heterogeneity ratio. Notably, when the heterogeneity ratio is larger than 7, the acceleration effects of most algorithms begin to decline. For example, when the heterogeneity ratio is 10:1, the acceleration ratio of SAFI drops to 1.6, and that of SAFA drops to only 1.1. The reason is that when the heterogeneity of the edge devices is too large, the slower cluster accumulates staleness, which severely affects the training results of the faster cluster and thus prolongs the time needed for the global model to converge. FedCH reaches its peak acceleration at a heterogeneity ratio of 6, showing that SAFI outperforms FedCH when the system exhibits a high level of heterogeneity. In contrast, the acceleration brought by the other semi-asynchronous algorithm, SAFA, in heterogeneous scenarios is much smaller, with a maximum of only 1.6.
Fig. 7. Acceleration ratio comparison among different heterogeneity ratios.
In the aforementioned groups, the processing speeds of clients in the same cluster are similar, allowing for a random assignment of the coordinator. To further validate our clustering selection module, we introduce slight heterogeneity to verify whether the clustering selection module can select the most suitable coordinator and the appropriate number of clusters. In this experiment, we use the same devices, data, and models as in Group 6 but adjust the CPU frequencies of the RPis. Specifically, we set the CPU frequencies of the three RPis to 0.8, 0.9, and 1 GHz, and set the two LPV1s to 1.8 and 1.9 GHz, respectively. After conducting repeated experiments, the fastest RPi (1 GHz) and the LPV1 with 1.9 GHz are consistently selected as the coordinators of their respective clusters. This verifies that our clustering selection module can choose the appropriate coordinators even in the presence of minor heterogeneity.

6.3 Impact of Non-IID Data

Data distribution tends to be non-IID in real-world IoT applications. Therefore, we test SAFI on non-IID data to investigate its performance compared with the baselines. The details of constructing the non-IID data are described in Section 5.2. In this experiment, we choose RPi and LPV1 devices. Figures 8(a) and 8(b) present the training accuracy of different FL algorithms on the MNIST dataset with class non-IID data and number non-IID data, respectively. From Figure 8(a), we observe that SAFI can still converge within a reasonable time in the class non-IID setting, although the required time (about 50 min) is longer than in the IID setting (about 40 min). The convergence times of FedCH and FedAsync are comparable to SAFI, also around 50 min. Due to the in-cluster synchronous step, the updated weights have already been aggregated on the coordinators several times before being sent to the server. Therefore, the training process of both SAFI and FedCH is more stable than that of FedAsync, even in the class non-IID setting. Furthermore, the convergence speeds of SAFI, FedCH, and FedAsync surpass those of FedAvg and FedProx, due to the impact of heterogeneity. When the other algorithms achieve convergence, the accuracy of FedAvg is only about 70%, and it still requires more time to converge. FedProx performs better than FedAvg thanks to the proximal term, which effectively limits the distance between the local model and the global model. SAFA outperforms the purely synchronous FedAvg and FedProx due to its semi-asynchronous training scheme. Nevertheless, it still suffers from straggler issues compared to our cluster-based SAFI.
Fig. 8. Model accuracy of different FL algorithms in non-IID settings.
Figure 8(b) shows that SAFI converges faster than the other baselines in the number non-IID setting, requiring about 45 min. The convergence speed of FedCH is close to that of SAFI, at around 50 min. Due to heterogeneity, FedAvg performs the worst within the same training time, with an accuracy below 90% at 60 min. FedAsync performs better than FedProx, requiring about 56 min to achieve convergence, but its training process is still unstable due to the frequent model aggregations from different clients. SAFA has a convergence speed comparable to FedAsync due to its lag-tolerant mechanism for stragglers, which forces them to conduct asynchronous communication with the server. In addition, we observe that all six algorithms have better accuracy in the first round of training when the data is in the number non-IID setting (all above 50%) than in the class non-IID setting (around 20%). This is because, in the class non-IID setting, a single client only has two classes in its local data and cannot benefit from the models trained on other classes before communicating with other clients. These experimental results verify that the semi-asynchronous SAFI algorithm achieves a faster convergence speed and a stable training process even when the data is non-IID.

6.4 Comparison of Communication Cost

We compare the real-world communication cost of the SAFI algorithm with the baseline algorithms on the testbed. Here, we leverage RPi and LPV1 devices, so the heterogeneity ratio is about 1:2. This experiment focuses on the HAR dataset, employing an MLP model. Figure 9 presents the communication cost required to reach convergence for the different FL algorithms. Since each client communicates with the server individually, FedAsync has the highest communication cost, requiring nearly 2,500 MB to achieve convergence. Although FedAvg needs a longer time to converge, it has the lowest communication overhead, about 1,800 MB, due to its synchronous communication. FedProx has the same communication cost as FedAvg, as the proximal term is calculated on the client side without extra data exchange. The communication cost of SAFI falls between that of FedAvg and FedAsync, at about 2,200 MB. Compared with FedAvg, SAFI adds communication between the coordinators and the server in the out-cluster asynchronous process. However, due to the in-cluster synchronous process, the training is more stable and thus requires fewer rounds than asynchronous training, making the communication cost 12% less than that of FedAsync. Because FedCH also uses clustering to achieve semi-asynchronous communication, its communication overhead is similar to that of SAFI, though slightly higher due to the different clustering strategies. SAFA's communication cost is lower than SAFI's but higher than FedAvg's, as the lag-tolerant mechanism reduces the frequency of asynchronous communication between the clients and the server. These experimental results verify that SAFI achieves a favorable trade-off between training speed and network communication cost.
Fig. 9. Communication cost of different FL algorithms on HAR dataset with heterogeneity.
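As a back-of-envelope illustration (not the measurement methodology behind Figure 9), the total traffic can be approximated as rounds × participants × serialized model size × 2, the factor of two accounting for upload and download. Every number in the sketch below is a placeholder.

```java
// Rough estimate of FL communication volume; all values are illustrative placeholders,
// not the measured numbers reported in Figure 9.
public class CommCostEstimate {
    public static void main(String[] args) {
        double modelSizeMb = 2.3;   // serialized MLP parameters (assumed)
        int participants = 5;       // clients (FedAvg/FedAsync) or coordinators (SAFI)
        int rounds = 80;            // communication rounds needed to converge (assumed)

        // Each round: every participant uploads its model and downloads the new global model.
        double totalMb = rounds * participants * modelSizeMb * 2;
        System.out.printf("Approximate traffic: %.0f MB%n", totalMb);
    }
}
```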

7 Discussion

Mitigating privacy vulnerabilities. Ensuring privacy in ML systems is a crucial consideration during their design. FL offers privacy-preserving mechanisms that surpass traditional ML methods, as the raw data always remains on the client side. However, recent research has revealed that the original information can still be accessed indirectly through model parameters and updated gradients [13, 49, 55]. Instead of delivering model parameters to the cloud server directly, SAFI averages the parameters at trustful coordinators within the cluster, which provides better protection, because the model delivered to the cloud server has been mixed with the model information of many different devices. Therefore, it becomes increasingly challenging for attackers to reverse-engineer the raw data from the parameter information on the cloud server. We further use secure aggregation to ensure the data is not leaked to the coordinators.
Periodical clustering. We realize that in real-world scenarios, it is common for an application to establish connections with a huge number of heterogeneous edge nodes. These edge devices may exhibit high dynamism, with devices joining, leaving, and switching between active and passive modes. In the SAFI approach, the first step of FL training is dividing the heterogeneous edge clients into different clusters with the clustering selection module. Therefore, this step should not be run only at the initial stage; the system should periodically monitor the status of the connected edge devices, assign newly added devices to the appropriate cluster, and remove offline devices. This ongoing monitoring and adjustment ensures that the clusters remain up to date and optimized for efficient FL training. In the in-cluster synchronous FL training, coordinators do not need to wait for long, since the heterogeneity inside a cluster is minor. However, if the waiting time for a client exceeds a specific threshold, indicating that the device is offline, it should be removed from the cluster.
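A minimal sketch of such periodic membership maintenance is given below; the heartbeat mechanism, the threshold, the scheduling period, and the class names are assumptions rather than part of the SAFI implementation.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the periodic maintenance described above: clients report
// progress as heartbeats, and clients that exceed a waiting-time threshold are treated
// as offline and removed, after which the clustering selection module can be re-run.
public class ClusterMembershipMonitor {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long offlineThresholdMillis;

    ClusterMembershipMonitor(long offlineThresholdMillis) {
        this.offlineThresholdMillis = offlineThresholdMillis;
    }

    // Called whenever a client joins the system or reports training progress.
    void heartbeat(String clientId) {
        lastSeenMillis.put(clientId, System.currentTimeMillis());
    }

    // Periodically invoked to prune clients that have been silent for too long.
    Set<String> removeOfflineClients() {
        long now = System.currentTimeMillis();
        Set<String> offline = new HashSet<>();
        for (Map.Entry<String, Long> e : lastSeenMillis.entrySet()) {
            if (now - e.getValue() > offlineThresholdMillis) {
                offline.add(e.getKey());
            }
        }
        offline.forEach(lastSeenMillis::remove);
        return offline; // the caller re-runs the clustering selection for the survivors
    }

    public static void main(String[] args) throws Exception {
        ClusterMembershipMonitor monitor = new ClusterMembershipMonitor(5_000);
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(
                () -> System.out.println("Removed: " + monitor.removeOfflineClients()),
                5, 5, TimeUnit.SECONDS);
        monitor.heartbeat("rpi-1");   // illustrative client identifier
        Thread.sleep(12_000);
        ses.shutdown();
    }
}
```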
Real-world testbed. Through real-world experiments, we observe that even identical devices running under uniform environments, including network conditions, temperature, power supply, and operating system, still exhibit slight differences in performance, which leads to minor differences in processing times for the same task. This indicates that some parameters in the algorithm, such as the aggregation factor \(\alpha\) and the hyperparameter \(\mu\), should not be fixed. In the future, we will replace fixed values with probabilistic distributions, thereby introducing elements of probability and uncertainty to enhance the robustness of the system [33].
Java-based FL framework. In this work, we implement the first Java-based FL framework, which offers distinct advantages over existing Python-based solutions. These advantages encompass the following aspects. (1) Ease of deployment on IoT devices. The only requirement for running a Java program is a JVM, which can be executed on a wide range of devices. This feature aligns well with the environment of FL, where clients consist of various kinds of devices, e.g., routers, surveillance cameras, and even refrigerators. In contrast, Python relies on operating system support, which the majority of resource-constrained edge devices lack. (2) Reliable communication. Java offers a comprehensive set of built-in communication packages and methods, providing strong support for data exchange and connections between servers and clients. Python, in contrast, mainly relies on third-party communication packages, which may not provide the same level of safety and efficiency. (3) Interoperability. Our Java-based framework can seamlessly interact with other Python-based FL implementations through the standard socket protocol, enabling easy integration with other systems or frameworks. (4) Efficiency. Java programs are compiled ahead of execution, while Python code is interpreted, resulting in faster execution for the SAFI framework. By eliminating the interpretation overhead, our framework allows IoT devices to focus their limited resources on model training. (5) Flexibility and extensibility. With the support of Deeplearning4j, it is easy to modify and extend our Java-based FL framework to accommodate new algorithms, techniques, or hardware configurations. In summary, our Java-based FL framework offers advantages in terms of deployment, communication reliability, interoperability, efficiency, and flexibility. These features position our framework as a promising solution for FL implementations in IoT environments. SAFI is released as an open-source project on GitHub: https://github.com/boyufan/SAFI.
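As an illustration of point (2), the JDK's built-in socket and object-stream classes are sufficient for exchanging serialized parameters; the minimal client-side sketch below is only an example and does not reflect SAFI's actual wire protocol, host, or port.

```java
import java.io.*;
import java.net.Socket;

// Minimal illustration of exchanging model parameters over a plain TCP socket using
// only JDK classes. The host, port, and message format are placeholders.
public class ParameterExchangeClient {
    public static double[] sendAndReceive(String host, int port, double[] localWeights)
            throws IOException, ClassNotFoundException {
        try (Socket socket = new Socket(host, port);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
             ObjectInputStream in = new ObjectInputStream(socket.getInputStream())) {
            out.writeObject(localWeights);     // upload the locally updated parameters
            out.flush();
            return (double[]) in.readObject(); // download the new global parameters
        }
    }
}
```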
Limitations. Although SAFI presents promising performance, certain limitations exist. First, the clustering selection module elects a coordinator, based on its performance, to communicate with the other edge clients and the cloud server. If the elected coordinator crashes, the FL training process is disrupted, necessitating the election of a new coordinator from the remaining clients, which affects the learning process of the entire system. Second, since aggregation also requires computing resources, adding an extra coordinator role may increase the total computing cost and impact the local update efficiency of the coordinator itself. As a result, the maximum number of clients in a cluster needs to be considered, and an efficient resource scheduling module should be designed on the coordinator devices to balance the consumption of local updating and cluster-based model aggregation. Besides, because of the hardware limitations of real edge devices, evaluation on larger-scale datasets, such as CIFAR-100 [22], becomes infeasible.

8 Conclusion

This article contributes SAFI, a semi-asynchronous FL algorithm for IoT, which makes FL a feasible solution for processing large-scale IoT data when the edge devices are heterogeneous. SAFI introduces a novel clustering mechanism that incorporates synchronous FL within clusters and asynchronous FL across clusters, thereby combining the tolerance of asynchronous methods for heterogeneity with the low network overhead of synchronous methods. The experimental results demonstrate that SAFI significantly outperforms the baselines in terms of convergence speed in the presence of heterogeneity, trains faster on non-IID data, and incurs a lower communication cost than purely asynchronous FL algorithms. SAFI combines the advantages of synchronous and asynchronous FL to accelerate the convergence of the ML model while keeping the accuracy at a promising level. The proposed idea of synchronous training within clusters and asynchronous training across clusters offers an efficient solution to the straggler issue. In addition, the Java-based framework is a promising solution for FL implementations in IoT environments. In the future, we will extend this work to make SAFI more robust, for example, by adding a standby coordinator queue so that a new coordinator can quickly take over when the original one crashes, and by replacing fixed parameter values with distributions. We will also extend the experiments with various edge clients, such as smartphones, smartwatches, and other high-performance mobile workstations, to further verify the performance of our algorithm.

References

[1]
Sawsan Abdulrahman, Hanine Tout, Hakima Ould-Slimane, Azzam Mourad, Chamseddine Talhi, and Mohsen Guizani. 2021. A survey on federated learning: The journey from centralized to distributed on-site learning and beyond. IEEE Internet of Things J. 8, 7 (2021), 5476–5497.
[2]
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra Perez, and Jorge Luis Reyes Ortiz. 2013. A public domain dataset for human activity recognition using smartphones. In Proceedings of the 21th International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. 437–442.
[3]
Emna Baccour, Naram Mhaisen, Alaa Awad Abdellatif, Aiman Erbad, Amr Mohamed, Mounir Hamdi, and Mohsen Guizani. 2022. Pervasive AI for IoT applications: A survey on resource-efficient distributed artificial intelligence. IEEE Commun. Surveys Tutor. 24, 4 (2022), 2366–2418.
[4]
Daniel J. Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Javier Fernandez-Marques, Yan Gao, Lorenzo Sani, Kwing Hei Li, Titouan Parcollet, Pedro Porto Buarque de Gusmão, and Nicholas D. Lane. 2020. Flower: A Friendly Federated Learning Research Framework. Retrieved from https://arxiv.org/abs/2007.14390.
[5]
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečnỳ, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2018. Leaf: A benchmark for federated settings. Retrieved from https://arxiv.org/abs/1812.01097.
[6]
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečnỳ, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. 2018. LEAF: A Benchmark for Federated Settings. Retrieved from https://arxiv.org/abs/1812.01097.
[7]
Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. 2020. Tifl: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. 125–136.
[8]
Zhuoqing Chang, Shubo Liu, Xingxing Xiong, Zhaohui Cai, and Guoqing Tu. 2021. A survey of recent advances in edge-computing-powered artificial intelligence of things. IEEE Internet Things J. 8, 18 (2021), 13849–13875.
[9]
Wei-Ning Chen, Christopher A Choquette Choo, Peter Kairouz, and Ananda Theertha Suresh. 2022. The fundamental price of secure aggregation in differentially private federated learning. In Proceedings of the 39th International Conference on Machine Learning(Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 3056–3089. Retrieved from https://proceedings.mlr.press/v162/chen22c.html
[10]
Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. 2020. Asynchronous online federated learning for edge devices with non-IID data. In Proceedings of the IEEE International Conference on Big Data (Big Data). IEEE, 15–24.
[11]
Li Deng. 2012. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29, 6 (2012), 141–142.
[12]
Lidia Fotia, Flávia Delicato, and Giancarlo Fortino. 2023. Trust in edge-based internet of things architectures: State of the art and research challenges. ACM Comput. Surv. 55, 9, Article 182 (Jan.2023), 34 pages.
[13]
Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. 2020. Inverting gradients—How easy is it to break privacy in federated learning? In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 16937–16947. Retrieved from https://proceedings.neurips.cc/paper/2020/file/c4ede56bbd98819ae6112b20ac6bf145-Paper.pdf
[14]
Jiangshan Hao, Yanchao Zhao, and Jiale Zhang. 2020. Time efficient federated learning with semi-asynchronous communication. In Proceedings of the IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS’20). IEEE, 156–163.
[15]
Chaoyang He and Murali Annavaram. 2020. Group knowledge transfer: Federated learning of large CNNs at the edge. In Advances in Neural Information Processing Systems, Vol. 33.
[16]
Samuel Horváth, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas Lane. 2021. FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 12876–12889. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2021/file/6aed000af86a084f9cb0264161e29dd3-Paper.pdf
[17]
Haochen Hua, Yutong Li, Tonghe Wang, Nanqing Dong, Wei Li, and Junwei Cao. 2023. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv. 55, 9, Article 184 (Jan.2023), 35 pages.
[18]
Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M. Hadi Amini. 2022. A survey on federated learning for resource-constrained IoT devices. IEEE Internet Things J. 9, 1 (2022), 1–24.
[19]
Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. 2021. Breaking the centralized barrier for cross-device federated learning. In Advances in Neural Information Processing Systems, Vol. 34, 28663–28676.
[20]
Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. SCAFFOLD: Stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, Hal Daumé III and Aarti Singh (Eds.). PMLR, 5132–5143.
[21]
Latif U. Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2021. Federated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Commun. Surveys Tutor. 23, 3 (2021), 1759–1799.
[22]
Alex Krizhevsky. 2009. Learning Multiple Layers of Features from Tiny Images. University of Toronto, Toronto, ON. https://www.cs.utoronto.ca/kriz/learning-features-2009-TR.pdf
[23]
Liang Li, Dian Shi, Ronghui Hou, Hui Li, Miao Pan, and Zhu Han. 2021. To talk or to work: Flexible communication compression for energy efficient federated learning over heterogeneous mobile edge devices. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM’21). 1–10.
[24]
Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. 2022. Federated learning on non-IID data silos: An experimental study. In Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE’22). 965–978.
[25]
Qinbin Li, Bingsheng He, and Dawn Song. 2021. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 10713–10722.
[26]
Zhidu Li, Yujie Zhou, Dapeng Wu, Tong Tang, and Ruyan Wang. 2022. Fairness-aware federated learning with unreliable links in resource-constrained internet of things. IEEE Internet Things J. 9, 18 (2022), 17359–17371.
[27]
Jianchun Liu, Hongli Xu, Lun Wang, Yang Xu, Chen Qian, Jinyang Huang, and He Huang. 2023. Adaptive asynchronous federated learning in resource-constrained edge computing. IEEE Trans. Mobile Comput. 22, 2 (2023), 674–690.
[28]
Bing Luo, Xiang Li, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. 2021. Cost-effective federated learning in mobile edge networks. IEEE J. Select. Areas Commun. 39, 12 (2021), 3606–3621.
[29]
Bing Luo, Wenli Xiao, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. 2022. Tackling system and statistical heterogeneity for federated learning with adaptive client sampling. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM’22). 1739–1748.
[30]
Mi Luo, Fei Chen, Dapeng Hu, Yifan Zhang, Jian Liang, and Jiashi Feng. 2021. No fear of heterogeneity: Classifier calibration for federated learning with non-IID data. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 5972–5984. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2021/file/2f2b265625d76a6704b08093c652fd79-Paper.pdf
[31]
Qianpiao Ma, Yang Xu, Hongli Xu, Zhida Jiang, Liusheng Huang, and He Huang. 2021. FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing. IEEE J. Select. Areas Commun. 39, 12 (2021), 3654–3672.
[32]
Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y. Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
[33]
Kevin P. Murphy. 2022. Probabilistic Machine Learning: An Introduction. MIT Press.
[34]
Dinh C. Nguyen, Ming Ding, Quoc-Viet Pham, Pubudu N. Pathirana, Long Bao Le, Aruna Seneviratne, Jun Li, Dusit Niyato, and H. Vincent Poor. 2021. Federated learning meets blockchain in edge computing: Opportunities and challenges. IEEE Internet Things J. 8, 16 (2021), 12806–12825.
[35]
John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. 2022. Federated learning with buffered asynchronous aggregation. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 3581–3607.
[36]
Jungwuk Park, Dong-Jun Han, Minseok Choi, and Jaekyun Moon. 2021. Sageflow: Robust Federated Learning against Both Stragglers and Adversaries. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 840–851. Retrieved from https://proceedings.neurips.cc/paper/2021/file/076a8133735eb5d7552dc195b125a454-Paper.pdf
[37]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Info. Process. Syst. 32 (2019).
[38]
Martin Rapp, Ramin Khalili, and Jörg Henkel. 2020. Distributed learning on heterogeneous resource-constrained devices. Retrieved from https://arxiv.org/abs/2006.05403.
[39]
G. Anthony Reina, Alexey Gruzdev, Patrick Foley, Olga Perepelkina, Mansi Sharma, Igor Davidyuk, Ilya Trushkin, Maksim Radionov, Aleksandr Mokrov, Dmitry Agapov, Jason Martin, Brandon Edwards, Micah J. Sheller, Sarthak Pati, Prakash Narayana Moorthy, Shih han Wang, Prashant Shah, and Spyridon Bakas. 2021. OpenFL: An Open-Source Framework for Federated Learning. Retrieved from https://arxiv.org/abs/2105.06413.
[40]
Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. 2022. Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE J. Select. Areas Info. Theory 3, 2 (2022), 197–205.
[41]
Yimin Shi, Haihan Duan, Lei Yang, and Wei Cai. 2022. An energy-efficient and privacy-aware decomposition framework for edge-assisted federated learning. ACM Trans. Sen. Netw. 18, 4, Article 53 (Nov.2022), 24 pages.
[42]
Eclipse Deeplearning4j Development Team. 2021. Deeplearning4j: Open-source distributed deep learning for the JVM. Retrieved June 2, 2022 from https://deeplearning4j.konduit.ai/
[43]
Omar Abdel Wahab, Azzam Mourad, Hadi Otrok, and Tarik Taleb. 2021. Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems. IEEE Commun. Surveys Tutor. 23, 2 (2021), 1342–1397.
[44]
Zhiyuan Wang, Hongli Xu, Jianchun Liu, Yang Xu, He Huang, and Yangming Zhao. 2023. Accelerating federated learning with cluster construction and hierarchical aggregation. IEEE Trans. Mobile Comput. 22, 7 (2023), 3805–3822.
[45]
Qiong Wu, Kaiwen He, and Xu Chen. 2020. Personalized federated learning for intelligent IoT applications: A cloud-edge based framework. IEEE Open J. Comput. Soc. 1 (2020), 35–44.
[46]
Wentai Wu, Ligang He, Weiwei Lin, Rui Mao, Carsten Maple, and Stephen Jarvis. 2021. SAFA: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Trans. Comput. 70, 5 (2021), 655–668.
[47]
Cong Xie, Sanmi Koyejo, and Indranil Gupta. 2019. Asynchronous federated optimization. Retrieved from https://arxiv.org/abs/1903.03934.
[48]
Dianlei Xu, Tong Li, Yong Li, Xiang Su, Sasu Tarkoma, Tao Jiang, Jon Crowcroft, and Pan Hui. 2021. Edge Intelligence: Empowering intelligence to the edge of network. Proc. IEEE 109, 11 (2021), 1778–1837.
[49]
Xuefei Yin, Yanming Zhu, and Jiankun Hu. 2021. A comprehensive survey of privacy-preserving federated learning: A taxonomy, review, and future directions. ACM Comput. Surv. 54, 6, Article 131 (July2021), 36 pages.
[50]
Rong Yu and Peichun Li. 2021. Toward resource-efficient federated learning in mobile edge computing. IEEE Netw. 35, 1 (2021), 148–155.
[51]
Lin Zhang, Li Shen, Liang Ding, Dacheng Tao, and Ling-Yu Duan. 2022. Fine-tuning global model via data-free knowledge distillation for non-IID federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). 10174–10183.
[52]
Yu Zhang, Morning Duan, Duo Liu, Li Li, Ao Ren, Xianzhang Chen, Yujuan Tan, and Chengliang Wang. 2021. CSAFL: A clustered semi-asynchronous federated learning framework. In Proceedings of the International Joint Conference on Neural Networks (IJCNN’21). 1–10.
[53]
Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. 2018. Federated learning with non-IID data. Retrieved from https://arxiv.org/abs/1806.00582.
[54]
Haifeng Zheng, Min Gao, Zhizhang Chen, and Xinxin Feng. 2021. A distributed hierarchical deep computation model for federated learning in edge computing. IEEE Trans. Industr. Inform. 17, 12 (2021), 7946–7956.
[55]
Ligeng Zhu, Zhijian Liu, and Song Han. 2019. Deep leakage from gradients. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2019/file/60a6c4002cc7b29142def8871531281a-Paper.pdf
