
Topology-aware Federated Learning in Edge Computing: A Comprehensive Survey

Published: 22 June 2024
Abstract

    The ultra-low latency requirements of 5G/6G applications and privacy constraints call for distributed machine learning systems to be deployed at the edge. With its simple yet effective approach, federated learning (FL) is a natural solution for massive user-owned devices in edge computing with distributed and private training data. FL methods based on FedAvg typically follow a naive star topology, ignoring the heterogeneity and hierarchy of the volatile edge computing architectures and topologies in reality. Several other network topologies exist and can address the limitations and bottlenecks of the star topology. This motivates us to survey network topology-related FL solutions. In this paper, we conduct a comprehensive survey of the existing FL works focusing on network topologies. After a brief overview of FL and edge computing networks, we discuss various edge network topologies as well as their advantages and disadvantages. Lastly, we discuss the remaining challenges and future works for applying FL to topology-specific edge networks.

    1 Introduction

    Edge computing has been widely deployed in recent years as a strategy to reduce costly data transfer by bringing computation closer to data sources than conventional cloud computing.
    Both academia and industry have seen a surge in research relating to edge computing [70, 112, 115]. This is true specifically in the fields of Industrial Internet of Things (IIoT) [96], connected autonomous vehicles (CAVs) [72], augmented reality (AR) [1], wearable technologies [14], and hybrid architectures and systems combining cloud and edge [131, 134, 136, 143]. Edge computing is prevalent in agriculture, energy, manufacturing, telecommunications, and many other domains. It creates a tremendous amount of data at the edge with heterogeneous data distribution patterns. Distributed data of this scale and variety has created a persistent demand for machine learning to drive decision-making processes at the edge of the network.
Edge computing enables data and computational decentralization. Decentralized devices can collaboratively perform machine learning tasks, forming a cohesive network of nodes [81, 133]. Distributed learning systems are therefore poised to play a vital role as the foundation of edge-based decision-making applications [51, 146]. Back in 2016, the EU adopted the General Data Protection Regulation (GDPR) [129], a regulation in EU law governing data protection and privacy. Commercial companies in the EU are forbidden from collecting, processing, or exchanging user information without user consent. China and the United States are implementing similar legislation as well [146]. As a result, federated learning (FL) has emerged as a compelling solution for edge computing that does not violate privacy legislation. Several companies have shown interest in FL applications, including Amazon [21], Google [31], Webank [33], IBM, and ByteDance.
More recently, innovative approaches using foundation models [155], Vision Transformers (ViT) [164], and pre-trained models [41, 92] are gaining significant attention. Despite these wide applications, critical challenges of edge computing [72, 102] remain in latency, communication costs, service availability, privacy, and fairness, especially for machine learning tasks at the edge. The low-latency limits of communication to the cloud and data privacy requirements have inevitably grounded these distributed machine learning tasks: the data should never leave the edge. The deployment of edge-based applications demands that learning tasks run on edge infrastructure instead of the remote cloud [29, 135]. The ubiquitous applications of edge computing in multiple industries [57, 153] justify the strong motivation for research on distributed machine learning in edge computing.
The topology of the edge network is often overlooked. In this survey, network topology is treated both as a challenge and as a solution. As a challenge, specific topologies impose constraints such as extra layers of communication and network structure. As a solution, topologies offer new ways to address bottlenecks in edge computing such as communication overhead and over-dependence on the central server. Multiple topology structures exist in current FL works, and each topology brings its own benefits and challenges. For example, a ring topology [59] is utilized to enhance scalability and accommodate diverse client activities, thereby eliminating the need for a central server. Hosseinalipour et al. [39] propose fog learning, a paradigm that intelligently distributes ML model training across nodes, from edge devices to cloud servers. It enhances FL along three major dimensions: network, heterogeneity, and proximity.
Table 1. Existing Surveys that Discuss FL or Network Topology Design

Survey | Year | Focus | Topology | FL
Rajaraman [105] | 2002 | Topology and routing in ad-hoc networks | \(\checkmark\) | \(\times\)
Li et al. [60] | 2006 | Overview of topology control techniques | \(\checkmark\) | \(\times\)
Donnet and Friedman [23] | 2007 | Measurements of network topology | \(\checkmark\) | \(\times\)
Lim et al. [68] | 2020 | FL in mobile edge networks | \(\times\) | \(\checkmark\)
Kairouz et al. [46] | 2021 | FL advances and open problems | \(\times\) | \(\checkmark\)
Nguyen et al. [91] | 2022 | FL for smart healthcare domains | \(\times\) | \(\checkmark\)
Nguyen et al. [90] | 2023 | FL applications for IoT networks | \(\times\) | \(\checkmark\)
Zhu et al. [165] | 2023 | Blockchain-empowered FL | \(\times\) | \(\checkmark\)
Ours | 2024 | Edge network topology for FL | \(\checkmark\) | \(\checkmark\)

    1.1 Scope and Contribution

In this survey, we study the various network topology structures that exist in FL. In Table 1, we compare our survey with existing surveys discussing network topology or FL. At present, there is a notable shortage of surveys on network topology. Since the introduction of FL in 2016, we have seen a huge increase in FL-related papers. Many existing works treat network topology as a system limitation or challenge, while some propose new network topologies to improve communication or computing efficiency. Several comprehensive surveys have extensively covered the general concepts, architectures, and applications of FL [46, 63, 68, 146]. More recently, many studies tend to concentrate on specialized areas within the FL domain, possibly due to the rapid rate at which FL works are being published. Some of the recent work includes FL in the health domain [91], FL for the Internet of Things [90], and blockchain-empowered FL [165]. However, no existing surveys have discussed or organized FL research in edge computing from the network topology perspective. This gap leaves a vast open area to discuss FL from a new angle. Our study examines surveys published over a wide range of dates that discuss FL or network topology design. We found that these two topics were never discussed together, even though FL systems span many different network topologies; this motivated the comparison in Table 1. To our knowledge, no existing survey has reviewed FL works from the edge network topology perspective or promoted the development of diverse topology structures in FL. Compared with previous surveys, this paper’s main contributions are:
    (1)
    Our survey introduces a novel perspective by employing edge network topologies (the network’s structure) to categorize unique FL works.
    (2)
We provide a comprehensive classification of FL into four major topologies, including star topologies, mesh topologies, hybrid topologies, and less common network topologies, which provides clarification for future research.
    (3)
    We follow a systematic review approach using PRISMA [86] for the paper selection process.
    (4)
    We present the design, baselines, and benchmarks and then thoroughly review the key findings of some highlighted work.
    (5)
    We outline promising research directions and challenges for the future development of topology-aware FLs.
The rest of the paper is organized as follows. In Section 2, we explain our research methodology. In Section 3, we introduce an overview of FL in edge computing. In Section 4, we propose eight types of FL network topologies and summarize existing studies along each topology. In Section 5, we present some of the open issues in edge FL topology, explain the limitations, and synthesize a roadmap for future research. Last, we conclude our paper in Section 6.

    2 Research Methodology

    2.1 Research Goals Formulation

We aim to provide an in-depth and systematic overview of all papers on FL that utilize one or more unique network topologies. Furthermore, we use the PRISMA [86] search strategy to collect the papers, following an approach similar to that of Pfitzner et al. [97]. We show an example of searching for tree topology papers using the PRISMA flow diagram in Figure 1. Additionally, we aim to show evidence that different network topologies and FL can benefit from each other. We summarize our research goals into three points.
    Fig. 1. The PRISMA flow diagram with reasoning for tree topology.
    Identify existing edge network topology structures in the current FL literature.
    Examine the unique challenges and benefits different topologies bring.
    Provide readers with an overview of the baseline methods and datasets used in each paper.

    2.2 Search Strategy

For our paper search strategy, we start by searching for papers that contain the terms “federated learning AND topology.” We submitted this search query to digital scholarly databases. The three main scholarly databases we used are the digital library of the Association for Computing Machinery, the online portal of the Institute of Electrical and Electronics Engineers, and Google Scholar. This initial query returned only a few papers and was not sufficient. Therefore, we modified our search strategy to treat every topology as its own branch of work and restarted the search process. The detailed search process is listed below:
    (1)
Identify a topology structure to start the search process (star, tree, … or mesh topology).
    (2)
Initiate the PRISMA [86] search process.
    (3)
    Group the specific topology paper into major topology or minor topology.

    2.3 Inclusion and Exclusion Criteria

    Our survey aims to give readers a good understanding of FL, and the upsides and downsides of several network topology structures, so we have selected the following criteria for inclusion. After collecting the papers returned from the database search, we include papers that are:
    Peer-reviewed (Identification phase).
    Presenting one or more unique topology structures (Identification phase).
    Using FL as the primary methodology (Screening phase).
Implementing and comparing the proposed method against strong baseline methods (Eligibility phase).
However, the query terms also return many works irrelevant to this review. Some papers may contain only one or two mentions of FL and cover completely unrelated topics. Our exclusion criteria are listed below:
Different versions of the same paper published under different titles (Identification phase).
    FL is not involved at all (Screening phase).
    Experiments do not use known baseline methods (Eligibility phase).
    Application-focused or case study (Eligibility phase).
    Benchmark paper evaluating existing works (Eligibility phase).
We use the strategy proposed in Section 2.2 to search for eight topologies. We organize the results into five major topology types shown in Figure 5. A total of 42 papers meet all selection criteria. We also illustrate the number of papers from each topology in Figure 2.
    Fig. 2. Number of papers for each network topology type.

    3 An Overview of Federated Learning in Edge Computing

    3.1 Background

Recent years have seen rapid advances in Federated Learning (FL) algorithms across various applications, including IoT [85], healthcare [107], image processing [51], and the like.
In the representative federated learning approach FedAvg [83], designed under the restrictions of GDPR [129], each mobile device learns a local model and periodically uploads it to a central server. The central server then aggregates the local models using a simple yet effective method to produce a global model and distributes the global model to all participating devices for the next learning cycle. The FedAvg algorithm improves over FedSGD, which uses parallel stochastic gradient descent (SGD). FedSGD selects a set of workers each round, and the selected workers compute the gradient using the global model parameters and their local data. Gradients from each worker are sent back to the server, which performs SGD using the combined gradients and the learning rate. The process is repeated until the model converges. Compared to the baseline algorithm FedSGD, FedAvg requires significantly fewer rounds of communication to converge. As with FedSGD, FedAvg follows the general computation steps in which the server sends the model parameters to each worker, and each worker computes the gradient using the received model parameters, its local data, and a given learning rate. FedAvg differs from FedSGD in that each worker repeats the training process multiple times before sending the updated model parameters back to the server. FedAvg was developed with the intention of achieving the same level of efficacy with less communication to the server. While the overall computation task for each worker increases, fewer rounds of communication are needed compared to FedSGD, resulting in a trade-off between computation and communication costs. In many FL scenarios, the edge clients generally have limited data residing locally [5, 83]. Even though deep models are commonly used, the computational expenses are often overshadowed by the communication costs. This is why FL with the FedAvg algorithm [83], known for its communication efficiency, is particularly effective.
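To make the difference concrete, the following minimal NumPy sketch contrasts a FedSGD-style round (one gradient step aggregated per round) with a FedAvg-style round (several local steps before averaging). The linear-regression model, client data, and hyperparameters are illustrative assumptions, not from the surveyed papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): 4 clients, each with local linear-regression data.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

def grad(w, X, y):
    """Gradient of the mean-squared-error loss for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

def fedsgd_round(w, lr=0.1):
    """FedSGD: every client sends one gradient; the server applies the average."""
    g = np.mean([grad(w, X, y) for X, y in clients], axis=0)
    return w - lr * g

def fedavg_round(w, local_steps=5, lr=0.1):
    """FedAvg: every client runs several local steps, then the server averages models."""
    local_models = []
    for X, y in clients:
        w_k = w.copy()
        for _ in range(local_steps):
            w_k -= lr * grad(w_k, X, y)
        local_models.append(w_k)
    return np.mean(local_models, axis=0)

w_sgd = np.zeros(2)
w_avg = np.zeros(2)
for _ in range(20):              # same number of communication rounds for both
    w_sgd = fedsgd_round(w_sgd)
    w_avg = fedavg_round(w_avg)  # more local computation per round
print("FedSGD:", w_sgd, "FedAvg:", w_avg)
```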
    We categorize the FL algorithms we surveyed based on their emphasis on the types of challenges they tackled.

    3.1.1 Statistical and System Heterogeneity.

A significant amount of effort has been made to address the issue of user heterogeneity in FL. Specifically, the heterogeneity manifests both statistically and at the system level.
On the one hand, statistical heterogeneity in FL refers to the differences among users' local data distributions, as shown in Figure 3. Namely, the clients have datasets that are not independent and identically distributed (non-IID). When the data is collected locally, such differences are likely induced by heterogeneous user behavior. In the case of non-IID data distributions, aggregation may lead to a biased global model with sub-optimal generalization performance. This phenomenon is also known as client drift [47]: local updates drift toward client-specific optima as a result of heterogeneous data, pulling the aggregated global model away from the global optimum.
    Fig. 3. Right: An example showing the statistical heterogeneity among different types of clients in FL. Depending on how the clients generate data, the statistical distributions and patterns of data on each device can be very different. Left: A demonstration of system heterogeneity. Three different tiers of edge devices have distinct capabilities of computing, connected by links with different bandwidths.
Towards addressing this client-drift issue, previous works including FedProx [111], pFedMe [123], and SCAFFOLD [47] have proposed constraining the local model parameters to prevent them from diverging far from the global optimum. Personalized FL is an alternative strategy for handling data heterogeneity. It permits different model parameters or even architectures to be adopted by local users. Besides diversified architectures, few-shot adaptation can also achieve personalization by fine-tuning a global model using local data [62, 123].
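As a concrete illustration of this idea, the sketch below adds a FedProx-style proximal term \(\frac{\mu}{2}\lVert w - w^{t}\rVert^2\) to a client's local objective so that local updates stay close to the current global model; the toy loss, data shapes, and the choice of \(\mu\) are assumptions for illustration only.

```python
import numpy as np

def proximal_local_update(w_global, X, y, mu=0.1, lr=0.05, local_steps=10):
    """One client's local training with a FedProx-style proximal term (sketch).

    Local objective: F_k(w) + (mu / 2) * ||w - w_global||^2,
    where F_k is a mean-squared-error loss on the client's data.
    """
    w = w_global.copy()
    for _ in range(local_steps):
        grad_loss = 2 * X.T @ (X @ w - y) / len(y)   # gradient of F_k
        grad_prox = mu * (w - w_global)              # pulls w back toward the global model
        w -= lr * (grad_loss + grad_prox)
    return w
```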
In the meantime, system heterogeneity results from different user capacities in terms of computation, memory, bandwidth, and so on. We show an example of system heterogeneity in Figure 3. Adopting one unified model architecture for FL can be undesirable under such scenarios: an over-large global model might bring heavy workloads to weaker clients that lack computation or transmission resources, while an over-small global model may under-perform in capturing complex feature representations for the learning tasks. Therefore, an emerging group of algorithms is pursuing FL frameworks that support heterogeneous user model architectures [39, 79, 111, 132, 152, 156, 157].

    3.1.2 Privacy.

Although FL allows decentralized devices to participate in machine learning without directly exchanging data, there are still potential privacy concerns. In particular, adversaries may be able to deduce some of the original data from the parameters of a model.
High-level FL privacy threats include inference attacks and communication bottlenecks. Secure multi-party computation, differential privacy, VerifyNet, and adversarial training are effective techniques for preserving privacy in FL [88].

    3.1.3 Convergence Guarantee.

There have been extensive studies on the theoretical convergence properties of FL algorithms under different problem settings. Pioneering efforts along this line, such as [64, 83, 104], have analyzed the convergence speedup of FL algorithms and derived the desirable conclusion that, under commonly adopted assumptions, linear speedup can be achieved for FedAvg, the most representative FL algorithm.

    3.1.4 Communication Efficiency.

To improve communication efficiency, one popular approach is either to reduce the number of communication rounds or to reduce the amount of data transmitted per communication round [54]. Depending on the computing infrastructure, communication efficiency can also be optimized by selecting an appropriate topology design. Generally, the star topology provides the most direct communication with the central server, since all devices are connected to it directly. In tree topologies, intermediate edge servers are usually involved, and devices benefit from fast and efficient communication with nearby edge servers at a low cost. In fully meshed topologies, communication usually takes place in a P2P or D2D manner, and direct communication between devices is generally quite efficient. Furthermore, hybrid topologies are emerging, which combine common topologies and draw on the strengths of each to produce a more dynamic system.
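As a rough illustration of how topology affects the load on the central aggregator, the sketch below counts the model uploads that reach the central server in one round under a star topology versus a two-tier tree where edge servers pre-aggregate; the device and edge-server counts are arbitrary assumptions.

```python
def uploads_to_cloud(num_devices, num_edge_servers=None):
    """Uploads that reach the central server in one aggregation round (sketch).

    Star topology: every device uploads its model directly to the central server.
    Two-tier tree: devices upload to their edge server, which forwards a single
    aggregated model to the central server.
    """
    if num_edge_servers is None:      # star topology
        return num_devices
    return num_edge_servers           # tree topology: one upload per edge server

print(uploads_to_cloud(1000))                          # star: 1000 uploads to the cloud
print(uploads_to_cloud(1000, num_edge_servers=10))     # tree: 10 uploads to the cloud
```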
    The aforementioned challenges in FL can be tackled in parallel. For instance, some personalized FL algorithms [62, 123], which share only partial model parameters, can tackle user heterogeneity while achieving high communication efficiency.

    3.2 FL Characteristics Specific to Edge Computing

Unlike the typical configurations of FL, which follow a naive star topology, practical edge computing exhibits unique characteristics in architectures and network configurations, which deeply impact the design, implementation, and deployment of FL algorithms. We list the key features of edge computing frameworks below.

    3.2.1 Heterogeneity, Energy Efficiency, and Task Offloading.

    A large portion of edge networks consists of user devices. These devices include highly embedded devices such as wearable glasses and watches, as well as powerful personal servers [116, 119, 150]. These devices are mostly still powered through batteries, making it necessary to consider energy-efficient protocol and algorithm design [43, 122]. The heterogeneous devices also introduce a huge variance of computational capacities, leading to a natural research direction of task offloading [13, 127].
Heterogeneity, energy efficiency, and task offloading play substantial roles in shaping the topologies of edge networks [94]. To elaborate, energy consumption considerations prohibit a star topology in a large-scale edge network because the central edge server would be overwhelmed due to its capability limit [79]. Offloading FL tasks from less capable edge devices to more powerful edge devices is a viable and increasingly researched approach in FL and edge computing [13, 122, 127]. The offloading schemes are typically accompanied by their corresponding topology best practices [156, 157].

    3.2.2 Hierarchy and Clustering.

    The nature of edge computing and 5G/6G communications has led to hierarchical networks where a base station covers the data transmission in small areas of wireless edge devices [19, 52]. The partial coverage results in multiple base stations deployed at the edge networks. The base stations can forward the data to the central server in a three-tier network hierarchy. When the number of base stations is large enough, there can be even more than three tiers for the data to move up along the hierarchy. The multi-fold hierarchy creates the prerequisites for configurable clustering and aggregation, making space for creativity over the hierarchical edge networks.
    Hierarchical networks at the edge are often another product of the heterogeneous edge devices. Separations of capabilities have evolved into separations of hierarchies. Edge devices with lower capabilities can be dedicated to collecting sensing data and uploading it to their edge servers, whereas the edge server can be used for training local models and receiving updated global models. To minimize the exposure of the models to non-server parties, the edge servers can use different topology patterns from the central server to communicate with each other to maximize efficiency and privacy. In multi-tier edge networks, dynamic topologies can be applied based on internal and external factors for optimal learning performance.

    3.2.3 Availability and Mobility.

Compared to cloud data centers, edge servers have less redundancy and lower reliability due to space, power, and budget constraints. Mobile edge devices, such as CAVs and unmanned aerial vehicles (UAVs), have even lower availability because of their mobility. A moving edge device may enter and exit the boundaries of an edge network and switch between different clusters, leading to interruptions in task processing and computation. Figure 4 shows some application scenarios of mobile edge computing in FL.
    Fig. 4. Application scenarios of federated learning in mobile edge computing.
Mobile and volatile edge devices such as CAVs and UAVs are pushing toward dynamic topologies, where edge devices and edge servers may be added or removed at any epoch of the system.

    3.3 FL Challenges and Solutions in Edge Network Topologies

The unique characteristics of edge computing and edge networks pose fundamental challenges to performing reliable and efficient federated learning and to deploying feasible distributed learning systems at the edge. Many of these challenges can be resolved or mitigated by topology design. In the following sections, we list some of the major challenges in FL and their corresponding solutions based on different network topology structures.

    3.3.1 Scattered Data across Organizations.

As its name suggests, FL may require data from independent organizations to be federated. In this scenario, stricter data-sharing policies apply: neither raw data nor intermediate local models can be shared directly. For example, federated transfer learning (FTL) [74] can unite those organizations and leverage their data without violating privacy protection regulations. Compared with vanilla FedAvg, FTL allows learning from the entire dataset, rather than only from samples with common features.

    3.3.2 High Communication Costs.

    The original FL requires each device to directly communicate with the central server for upstream model aggregation and downstream model update. In the context of edge computing, direct communication to the central server is expensive for some edge devices and may cause high latency. The hierarchical edge computing topology can pool and aggregate the local updates from devices and hence reduce the communication costs to the cloud.

    3.3.3 Privacy Concerns and Trust Issues.

While federated learning keeps training data stored on the device, it does not eliminate the risk of exposing sensitive information through repeated uploads of aggregated local models to central servers. When a threat model considers the privacy of the central aggregation server, a network topology with decentralized model aggregation will help mitigate or eliminate the risk. The rationale is that each relaying edge server in the topology aggregates only part of the information, so a single compromised central server cannot see the fine-grained model updates from all clients, which largely reduces the differential information repeatedly exposed to the server.

    3.3.4 Imbalanced Data Distribution.

The nature of heterogeneous edge devices and networks in edge computing environments leads to significantly imbalanced data distribution and intensity depending on the type of application and device. For example, an augmented reality (AR) application may generate a large burst of data over a short period when a user is actively using the application, whereas a temperature monitoring application may generate only a small amount of data for temperature records, but produces it constantly and periodically. By utilizing the tree network topology, methods like Astraea [24] add mediators between the FL server and the clients to resolve imbalanced data problems.

    3.4 Categorization of Topology-Aware FL in Edge Computing

With the recent advancements in deep learning and increasing research interest in FL, a growing number of studies have expanded the horizon of FL applications. Numerous studies have reviewed existing FL areas [49, 61, 68, 88]. However, due to the broad applications and the nature of FL, there is no standard for systematically summarizing existing topology-aware FL studies. Many existing FL studies focus on specific characteristics of FL and categorize works accordingly. Several of these categorization schemes are summarized in the following subsections.

3.4.1 Based on Data Partition: Horizontal FL (HFL), Vertical FL (VFL), and Federated Transfer Learning (FTL).

FL can be categorized into horizontal FL (HFL), vertical FL (VFL), and federated transfer learning (FTL) based on data partition in the feature and sample spaces [146]. Horizontal FL (HFL) represents the typical FL setting where the set of features is the same across all participating clients, making it straightforward to implement and train. In most cases, studies treat horizontal FL as the default structure and may not even mention the term “horizontal”. For example, the first implementation of FL by Google [83] is an example of HFL where the feature space of all participating devices is the same.
    On the other hand, Vertical FL (VFL) [139] is catered specifically toward vertically partitioned data, where clients in VFL have different feature spaces. For example, hospitals and other healthcare facilities may have data about the same patient but different types of health information. Fusing multiple types of information from the same set of samples or overlapping samples in different institutions belongs to the VFL setting.
FTL [74] was initially designed for scenarios where participants in FL have heterogeneous data in both feature space and sample space. In this setting, neither HFL nor VFL can train efficiently, and FTL was considered the ideal solution at the time. FTL leverages the whole sample and feature space with transfer learning: two neural networks serve as feature transformation functions that project the source features of the two parties into a common feature subspace, allowing knowledge to be transferred between the two parties.

3.4.2 Based on Model Update Protocols: Synchronous, Asynchronous, and Semi-Synchronous FL.

    FL can be separated into synchronous, asynchronous, and semi-synchronous by communication protocols [121]. For synchronous FL [30, 75, 128], each learner performs a set round of local training. After every learner has finished their assigned training, they share their local models with the centralized server and then receive a new community model, and the process continues. Synchronous FL may result in the underutilization of a large number of learners and slower convergence, as others must wait for the slowest device to complete the training. With asynchronous FL [121, 141], there are no synchronization points. Instead, learners request community updates from the centralized server when their local training has been completed. As fast learners complete more rounds of training, they require more community updates, which would increase communication costs and lower the generalization of the global model. A semi-synchronous FL framework called FedRec [121] was proposed that allowed learners to continuously train on their local dataset up to a specific synchronization point where the current local models of all learners are mixed to form the community model.

3.4.3 Based on Data Distribution: Non-IID and IID Data FL.

One of the major statistical challenges of early FL is non-IID training data [159]. The consistent performance of FL relies heavily on IID data distribution across local clients. However, in most real-life cases, local data are likely non-IID, which significantly degrades the performance of FL techniques not specifically designed for non-IID data. Therefore, existing FL studies can be categorized as FL with non-IID data or FL with IID data.

    3.4.4 Based on Scale of Federation: Cross-Silo and Cross-Device FL.

Based on the scale of the federation, FL studies can be divided into cross-silo and cross-device FL [46]. Cross-silo FL focuses on coordinating a small number of large institutions, such as hospitals or banks. Cross-device FL, on the other hand, involves a large number of devices, each holding a relatively small amount of data. The key differences between the two are the number of participating parties and the amount of data stored at each participating party in FL.

    3.4.5 Based on Global Model: Centralized and Decentralized FL.

    The most straightforward method for implementing and managing FL is to connect all participating devices through a central server. For centralized FL, the central server is either used to compute a global model or to coordinate local devices [93]. Having a central server, however, may contradict the aim of decentralization in FL. For fully decentralized FL, there is no overarching central server at the top, and devices are connected in a D2D or P2P manner.
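To illustrate the fully decentralized case, the sketch below performs repeated rounds of neighbor (gossip) averaging over a fixed mesh defined by an adjacency matrix; the four-node topology, uniform mixing weights, and model vectors are illustrative assumptions rather than a method from the surveyed papers.

```python
import numpy as np

# Assumption: 4 nodes connected in a mesh, each holding its own model vector.
adjacency = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
])
models = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 1.0]])

def gossip_round(models, adjacency):
    """Each node replaces its model with the average of itself and its neighbors."""
    new_models = np.empty_like(models)
    for i in range(len(models)):
        neighbors = np.flatnonzero(adjacency[i])
        group = np.vstack([models[i], models[neighbors]])
        new_models[i] = group.mean(axis=0)
    return new_models

for _ in range(10):
    models = gossip_round(models, adjacency)
print(models)   # node models reach consensus without a central server
```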

    4 Types of FL Network Topology

    To inspire future research, our work summarizes state-of-the-art FL studies from the perspective of network topology, as opposed to existing FL reviews that focus on particular features, such as data partitioning, communication architecture, or communication protocols.
In the case of FL, the network topology represents how edge devices communicate with each other and, eventually, with a centralized server [154]. FL can benefit from the topologies of the networks by increasing communication efficiency through partial and tiered model aggregations [9, 10], enhancing privacy by avoiding transmitting local models directly to a centralized server [162, 163], and improving scalability with horizontally replicable network structures [7, 11]. The major types of FL network topology essentially come down to centralized (e.g., star topologies), decentralized (e.g., mesh topologies), or hybrid topologies, which consist of two or more traditional topology designs. Other less common network topologies, such as ring topologies, will also be covered in this section. Figure 5 shows an overview of these FL topology types.
    Fig. 5. Overview of FL topology types.

    4.1 Star Topology

The original use case of FL was to train machine learning algorithms across multiple devices in different locations. The key concept is to enable ML without centralizing or directly exchanging private user data. However, most FL implementations still require the presence of a centralized server. The most common network topology used in FL, including the original FL work [83], adopts a centralized aggregation and distribution architecture, also known as the “star topology”. As a result, the graph of the server-client architecture resembles a star. Numerous FL studies and algorithms are based on the assumption of a star topology [30, 93, 128]. While it is the most straightforward approach, a star network topology suffers from issues such as high communication costs, privacy leakage to the central server, and security concerns [26]. Some studies proposed solutions to address these issues [93]. However, star-topology-based solutions are not always the optimal network topology design for all FL systems. It is worth questioning whether the star architecture is the network topology that best fits all scenarios.
    There are a substantial number of studies in FL using the default star network topology [46, 63, 68, 70, 146, 150]. Most of those studies do not focus on the aspect of network topology or edge computing. In this section, we select various FL works that focus on optimizing the topology, communication cost, and edge computing while still using the traditional star topology structure. In Table 2, we highlight some of the works using star topology.
Table 2. Highlighted Works - Star Topology

FL Type | Baselines and Benchmarks | Key Findings
Synchronous | FedAvg and large-scale SGD with MNIST, CIFAR-10, CIFAR-100, and ILSVRC 2012 | Computation and communication bandwidth were significantly decreased [30, 128]
Synchronous | FedSGD, FedBCD-p, and FedBCD-s with MIMIC-III, MNIST, and NUS-WIDE | The models performed as well as the centralized model; communication costs were significantly reduced [75]
Synchronous | Noise-free FL, conventional RIS, random STAR-RIS, and equal power allocation with MNIST and CIFAR-10 under IID and non-IID settings | STAR-RIS used both NOMA and the AirFL framework to address spectrum scarcity and heterogeneous service issues [93]
Asynchronous/Semi-synchronous | FedAvg and single-thread SGD with CIFAR-10 and WikiText-2 | FedAsync was generally insensitive to hyperparameters, with fast convergence and staleness tolerance [141]
Asynchronous/Semi-synchronous | FedAvg, FedAsync, and FedRec with CIFAR-10 and CIFAR-100 | Faster generalization and learning convergence, better utilization of available resources and accuracy [121]
Personalized | eFD (extended Federated Dropout) and Federated Dropout (FD) with CIFAR-10, FEMNIST, and Shakespeare | Able to extract submodels of varying FLOPs and sizes without retraining; flexibility across different environment setups [37]
Personalized | pFedMe, Ditto, FedAlt, and FedSim with StackOverflow, EMNIST, GLDv2, and LibriSpeech | Partial model personalization obtains most of the benefit of full model personalization; convergence guarantees provided [98]
Personalized | FedAvg, pFedMe, Ditto, FedEM, FedRep, FedMask, and HeteroFL with EMNIST, FEMNIST, CIFAR-10, and CIFAR-100 | Significantly improved performance; thorough theoretical analysis; extensive experiments demonstrate effectiveness, efficiency, and robustness [12]
A distributed learning method called splitNN was proposed [30, 128] to facilitate collaborations of health entities without sharing raw health data. In a star topology, all subsequent nodes are connected to the master node, but data does not have to be shared directly with it. By using a single supercomputing resource, a star topology network can provide training with access to a significantly larger amount of data from multiple sources. Alice(s) represent the data entities and Bob represents the supercomputing resource, corresponding to the roles of client nodes and the central server. While all the data entities (Alices) are connected to the supercomputing resource (Bob), no raw data is shared between them. Techniques include encoding data into a different space and transmitting the encoded representation to train a deep neural network. Experimental results were obtained on the MNIST, CIFAR-10, and ILSVRC (ImageNet) 2012 datasets and showed performance similar to neural networks trained on a single machine. Compared with classic single-agent deep learning models, this technique significantly reduces client-side computational costs. Although federated learning was available at the time, the authors argued that it had not properly addressed non-vanilla settings such as vertically partitioned data, data without labels, distributed semi-supervised learning, and distributed multi-task learning.
An algorithm named Federated Stochastic Block Coordinate Descent (FedBCD) [75] was proposed, allowing multiple local updates before each communication with the central server. Through theoretical analysis, the authors showed that the algorithm requires \(O(\sqrt {T})\) communication rounds over \(T\) local iterations to achieve \(O(1/\sqrt {T})\) accuracy.
Ni et al. [93] proposed a new FL framework called STAR-RIS, which integrates non-orthogonal multiple access (NOMA) and over-the-air federated learning (AirFL). STAR-RIS uses the NOMA and AirFL frameworks to address the spectrum scarcity and heterogeneous service issues in FL. This work follows the classical star topology, where all clients need to update in a synchronized fashion and connect to the server. The proposed STAR-RIS takes a novel approach that utilizes a simultaneously transmitting and reflecting reconfigurable intelligent surface to boost performance compared with other methods. STAR-RIS addresses issues specific to the integration of communication and learning technologies for the 6G network and provides a closed-form expression for the convergence upper bound, which gives a strong theoretical guarantee.

    4.1.1 Asynchronous FL Topologies.

Stripelis and Ambite [121] identified the issue that classic FL approaches exhibit poor performance in heterogeneous environments. Synchronous FL protocols are communication efficient but have slow learning convergence, while asynchronous FL protocols have faster convergence but higher communication costs. For synchronous FL, the original FedAvg algorithm serves as a good example: after each participating device trains for a fixed number of epochs, the system waits until all devices complete their training and then computes the community model. The approach is not efficient, but it bounds communication because all devices have the same number of communication rounds. In particular, when there are fast and slow workers, the fast devices will sit idle for a long time waiting for the slow devices. For asynchronous FL, FedAsync [141] provides a thorough analysis of the subject. Asynchronous FL is the opposite of synchronous FL: whereas the synchronous protocol minimizes communication costs, asynchronous protocols seek to utilize all participating devices to their fullest capability, meaning that once a device finishes its assigned training, it can request a community update and continue training. However, this approach significantly increases network communication costs for fast devices. Semi-synchronous FL [121] seeks to combine the benefits of both protocols by setting up a synchronization point for all devices, allowing fast devices to complete more rounds of training while preventing excessive communication along the way.
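The sketch below illustrates one common asynchronous pattern, a FedAsync-style server update that mixes an arriving client model into the global model with a weight that decays with the update's staleness; the decay function and constants here are illustrative assumptions.

```python
import numpy as np

def staleness_weight(staleness, alpha=0.6, a=0.5):
    """Mixing weight that decays polynomially with staleness (illustrative choice)."""
    return alpha * (1.0 + staleness) ** (-a)

def async_server_update(w_global, w_client, client_round, server_round):
    """Mix an arriving (possibly stale) client model into the global model."""
    staleness = server_round - client_round   # rounds elapsed since the client pulled w_global
    alpha_t = staleness_weight(staleness)
    return (1.0 - alpha_t) * w_global + alpha_t * w_client

w_global = np.zeros(3)
w_client = np.array([1.0, 1.0, 1.0])          # update computed from an old global model
print(async_server_update(w_global, w_client, client_round=2, server_round=7))
```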

    4.1.2 Personalized Star Topology.

There has been great attention to personalized FL in recent studies, mainly to increase the fairness and robustness of FL [124]. Most personalized FL methods [12, 37, 62, 124] follow the traditional star topology. One interesting aspect of personalized FL is that personalized local clients may require fewer model parameters to be transmitted over the network: one approach is to partially upload and download the global model from the server [98]; another is to dynamically adapt the model size to heterogeneous data distributions or resource constraints [12, 37]. From the topology perspective, personalized FL brings unique opportunities for further optimizing communication with variously sized local models.
Pillutla et al. [98] explored the idea of training partially personalized models, in which each local model has shared and personal parameters. The authors experimented with both simultaneous-update and alternating-update approaches. In addition, another personalized FL method, pFedMe [123], employs Moreau envelopes as a way of regularizing loss functions. pFedMe follows the same structure as the conventional FedAvg algorithm with an additional parameter used for the global model update. Specifically, each client solves an inner optimization problem to obtain its personalized model, which is used for local updates; the server uniformly samples a subset of clients, whose local models are sent to the server. Horvath et al. [37] proposed Fjord, which dynamically adapts the model size with Ordered Dropout. By using this importance-based pruning approach, Fjord can create nested submodels from a main model and enable partial training only on the submodels. Fjord shows strong scalability and adaptability compared with baseline methods. Chen et al. [12] take personalized FL a step further by adapting to both clients' local data distributions and hardware resources using adaptive gated weights. The proposed pFedGate [12] can generate personalized sparse models while also considering the resource limitations of the local device. Combining both model compression and personalization, pFedGate achieves superior global and individual accuracy and efficiency compared to existing methods.
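As a minimal sketch of partial personalization, the code below keeps a shared parameter block that is averaged across clients while each client retains its own personal block; the parameter split and placeholder training step are illustrative assumptions rather than the exact procedure of [98].

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLIENTS, SHARED_DIM, PERSONAL_DIM = 3, 4, 2

# Each client's model = (shared block, personal block); only the shared block is federated.
shared_global = np.zeros(SHARED_DIM)
personal = [rng.normal(size=PERSONAL_DIM) for _ in range(NUM_CLIENTS)]

def local_train(shared, personal_k):
    """Placeholder local step: in practice both blocks are updated on local data."""
    return (shared + 0.1 * rng.normal(size=shared.shape),
            personal_k + 0.1 * rng.normal(size=personal_k.shape))

for _ in range(5):
    shared_updates = []
    for k in range(NUM_CLIENTS):
        shared_k, personal[k] = local_train(shared_global.copy(), personal[k])
        shared_updates.append(shared_k)               # only the shared block is uploaded
    shared_global = np.mean(shared_updates, axis=0)   # server averages shared blocks only

print("shared (global):", shared_global)
print("personal (stay local):", [p.round(2) for p in personal])
```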

    4.1.3 Cohorts and Secure Aggregation.

Charles et al. [11] studied how the number of clients sampled at each round affects the learned model. Challenges were encountered when using large cohorts in FL. In particular, data heterogeneity caused misalignment between the server model \(x\) and the clients' losses \(f_k\). With a threshold for “catastrophic training failure” defined, the authors revealed that the failure rate increased from 0% to 80% when the cohort size expanded from 10 to 800. While the star topology remained the same, improved methods were proposed, including dynamic cohort sizes [120] and scaling the learning rate [28, 56].
Secure aggregation protocols with poly-logarithmic communication and computation complexity were proposed in [3] and [16], requiring three rounds of interaction between the server and clients. In [6], the star topology of the communication network was replaced with a random subset of clients, and secret sharing was only used for a subset of clients instead of all client pairs. Shamir's t-out-of-n secret sharing technique prevents the split subgroups from divulging any information about the original secret. In [16], the proposed secure aggregation (CCESA) algorithm provided data privacy using substantially reduced communication and computational resources compared to other secure solutions. The key idea was to design the topology of secret-sharing nodes as a sparse random graph instead of a complete graph [6]. The resources required by CCESA are reduced by a factor of at least \(O(\sqrt {n / \log {n}})\) compared to [6].
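To make the idea of masked aggregation concrete, the following sketch shows pairwise additive masking in the style of classic secure aggregation: each pair of clients derives a shared random mask that one adds and the other subtracts, so the masks cancel in the server's sum and only the aggregate is revealed. The seed agreement, dropout handling, and secret sharing needed by a real protocol are omitted; the values are toy assumptions.

```python
import numpy as np

NUM_CLIENTS, DIM = 4, 3
rng = np.random.default_rng(42)
updates = [rng.normal(size=DIM) for _ in range(NUM_CLIENTS)]

# Assumption: each client pair (i, j) already shares a seed (e.g., via key agreement).
pair_seeds = {(i, j): 1000 * i + j
              for i in range(NUM_CLIENTS) for j in range(i + 1, NUM_CLIENTS)}

def masked_update(i):
    """Client i's upload: its update plus pairwise masks that cancel across clients."""
    masked = updates[i].copy()
    for (a, b), seed in pair_seeds.items():
        mask = np.random.default_rng(seed).normal(size=DIM)
        if i == a:
            masked += mask      # lower-indexed client adds the shared mask
        elif i == b:
            masked -= mask      # higher-indexed client subtracts the same mask
    return masked

server_sum = sum(masked_update(i) for i in range(NUM_CLIENTS))
print(np.allclose(server_sum, sum(updates)))   # True: masks cancel, only the sum is revealed
```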

    4.2 Tree Topology

    There can be additional layers between the central server and edge devices. For instance, edge servers that connect edge devices and the central server can formulate one or multiple layers, making a tree-like topology with the highest level of the tree being the central server and the lowest level being edge devices. Tree topologies must contain at least three levels. Otherwise, they are considered star topologies. Compared to traditional FL, tree topology helps overcome performance bottlenecks and single points of failure. We list the features and benefits of tree topology in Table 3. In this section, we discuss applications and motivations for adopting tree topology. A review of state-of-the-art optimization frameworks and algorithms is presented. We show some visualization of tree topology structures and their benefits for FL in Figure 6 and Figure 7. In the end, we cover some grouping strategies and privacy enhancement schemes.
Table 3. Features and Benefits of Tree Topology

Features | Benefits
Clustered clients | Adaptive strategies for in-cluster communication based on each cluster's condition.
Configurable clusters | Better scalability compared to star topology.
Configurable number of layers | Varying policies for client-edge, edge-cloud, and inter-layer aggregations.
    Fig. 6. Right: FL with tree topology enables varying communication costs in different clusters depending on their energy profiles. Left: The cluster size can be different for tree topology based on data distribution and other parameters of each cluster.
    Fig. 7. There can be arbitrary layers of clients in the tree topology FL system.
There are two major categories of FL studies in tree topology: hierarchical and dynamic. Hierarchical represents the classic two-tier hierarchy in topology design, while dynamic follows the overall structure of the tree topology with some modifications. In the following sections, we organize the works that use the tree topology based on their topics. We show a classic example of hierarchical FL in Figure 8. We highlight some of the works using tree topology in Table 4 and Table 5.
Table 4. Highlighted Works - Tree Topology - Hierarchical

FL Type | Baselines and Benchmarks | Key Findings | Performance
Hierarchical | Hierarchical FL using CNN and mini-batch SGD with MNIST and CIFAR-10 under non-IID settings | Vanilla hierarchical FL; ignores heterogeneous distributions | Reduced communication, training time, and energy cost with the cloud; also achieved efficient client-edge communication [71]
Hierarchical | Resource allocation methods and FedAvg with MNIST and FEMNIST | Multiple edge servers can be accessed by each device; optimizes device computation capacity and edge bandwidth allocation | Better global cost saving, training performance, test and training accuracy, and lower training loss than FedAvg [78]
Hierarchical | Binary tree and static saturated structure, with FSVRG and SGD algorithms on MNIST | Using a layer-by-layer approach, more edge nodes can be included in model aggregation | Scalability (time cost increases logarithmically rather than linearly as in traditional FL), reduced bandwidth usage and time consumption [8]
Hierarchical | Uniform, gradient-aware, and energy-aware scheduling with MNIST | Optimizes scheduling and resource allocation by striking a balance between three scheduling schemes | Outperformed the baselines if \(\lambda\) is chosen properly; otherwise slightly better or worse performance [140]
Hierarchical | FedAvg plus SGD using CNN with MNIST | Both the central server and the edge servers are responsible for global aggregation | Reduced global communication cost, model training time, and energy consumption [147]
Hierarchical | RF, CNN, and RegionNet with BelgiumTSC | Classic hierarchical FL in 5G and 6G settings for object detection | Faster convergence and better learning accuracy for 6G-supported IoV applications [163]
Hierarchical | FedAvg with imbalanced EMNIST, CINIC-10, and CIFAR-10 | Relieved global and local imbalance of training data; recovered accuracy | Significantly reduced communication cost and achieved better accuracy on imbalanced data [24]
Hierarchical | FedAvg with MNIST and FEMNIST under IID and non-IID settings | A clustering step was introduced to determine client similarity and form subsets of similar clients | Fewer communication rounds, especially for some non-IID settings; allowed more clients to reach target accuracy [7]
Table 5. Highlighted Works - Tree Topology - Dynamic

FL Type | Baselines and Benchmarks | Key Findings | Performance
Dynamic | FedAvg using random and heuristic sampling with MNIST and F-MNIST | Able to offload data from non-selected devices to selected devices during training | Significant improvements in data points processed, training speed, and model accuracy [132]
Dynamic | FedAvg using F-Fix and F-Opt with CNN on MNIST | Flexible system topology that optimizes computing speed and transmission power | Accelerated the federated learning process and achieved higher energy efficiency [42]
Dynamic | WAN-FL using CNN with FEMNIST and CelebA under non-IID settings | Dynamic device selection based on the network capacity of LAN domains; relied heavily on manual parameter tuning | Accelerated the training process, saved WAN traffic, and reduced monetary cost while preserving model accuracy [151]
Dynamic | FedAvg, TiFL, and FedAsync with FMNIST, CIFAR-10, and Sentiment140 | Models were updated synchronously with clients of the same tier and asynchronously with the global model across tiers | Faster convergence toward the optimal solution, improved prediction performance, and reduced communication cost [10]
Dynamic | Cloud-based FL (C-FL), cost-only CPLEX (CC), and data-only greedy (DG) with MNIST and CIFAR-10 | Groups of distributed nodes, rather than an edge server, are used for edge aggregation | Improved FL performance at very low communication cost; provided a good balance between learning performance and communication costs [20]
Dynamic | Traditional FL (TFL) low and high power modes with MNIST under IID and non-IID settings | Clients are assigned to different subnetworks of the global model based on the status of their local resources | Outperformed TFL in both low and high power modes, especially in low power; reliable in dynamic wireless communication environments [149]
    Fig. 8. Hierarchical FL following tree topologies. Typically, the FL network has a tree structure with at least three tiers: the cloud tier, the edge tier, and the device tier.

    4.2.1 Typical Tree Topology FL.

Zhou et al. [163] proposed a typical end-edge-cloud federated learning framework in 6G. The authors integrated a convolutional neural network-based approach that performs hierarchical and heterogeneous model selection and aggregation using individual vehicles and roadside units (RSUs) at the edge and cloud levels. Evaluation results showed overall better learning accuracy, precision, recall, and F1 score compared to other state-of-the-art methods in 6G network settings.
Yuan et al. [151] designed a LAN-based hierarchical federated learning platform to solve the communication bottleneck. The authors argued that existing FL protocols, usually running over a wide-area network (WAN), face a critical communication bottleneck coupled with privacy concerns. Such a WAN-driven FL design leads to significantly higher costs and much slower model convergence. An efficient FL protocol was proposed with a hierarchical aggregation mechanism that creates groups of LAN domains in P2P mode without an intermediate edge server, since the local area network (LAN) offers abundant bandwidth at almost negligible cost compared to the WAN.
The benefits of aggregating training data at the edge in HFL were acknowledged by Deng et al. [20]. When comparing HFL and cloud-based FL, \(\kappa _e\) and \(\kappa _c\) were defined as the aggregation frequencies at the edge and the cloud, respectively. Their research concluded that in the HFL framework, with fixed \(\kappa _e\) and \(\kappa _c\), a uniform distribution of the training data at the edge significantly enhances FL performance and reduces the rounds of communication. The original problem was first divided into two sub-problems to minimize the per-round communication cost and the mean Kullback–Leibler divergence (KLD) of edge aggregator data. Two lightweight algorithms were then developed, adopting a heuristic method that forms a topology encouraging a uniform distribution of training data.
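To illustrate the KLD criterion used in this line of work, the sketch below computes the Kullback–Leibler divergence between each edge aggregator's label distribution and the global label distribution, which a placement heuristic would try to minimize; the label counts and the smoothing constant are toy assumptions.

```python
import numpy as np

def kld(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy label counts (assumption): rows = edge aggregators, columns = classes.
edge_label_counts = np.array([
    [90, 5, 5],      # edge 0: heavily skewed toward class 0
    [30, 40, 30],    # edge 1: close to uniform
])
global_dist = edge_label_counts.sum(axis=0)

for e, counts in enumerate(edge_label_counts):
    print(f"edge {e}: KLD to global distribution = {kld(counts, global_dist):.4f}")
```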
Briggs et al. pointed out in [7] that in reality, most data is distributed in a non-IID fashion, including feature distribution skew, label distribution skew, and concept shift. In such cases, most FL methods suffer accuracy loss. They introduced a hierarchical clustering step (FL+HC) to separate clusters of clients by the similarity of their local updates to the global joint model, after which multiple models, each targeted toward a group of similar clients, are trained. The empirical study showed that FL+HC allowed the training to converge in fewer communication rounds with higher accuracy.

    4.2.2 Optimization: Trade-off among Energy Cost, Communication Delay, Model Accuracy, Data Privacy.

Liu et al. [71] proposed a client-edge-cloud hierarchical learning system that reduces communication with the cloud by trading off between the client-edge and edge-cloud communication costs. This is achieved by leveraging the edge servers' ability to constantly exchange local updates with clients. There are two types of aggregation rounds: one from the clients to the edge servers, and the other from the edge servers to the cloud. The proposed FL algorithm, Hierarchical Federated Averaging (HierFAVG), extends the classic FAVG algorithm. Under the HierFAVG architecture, after the local clients finish \(k_1\) rounds of local training, each edge server aggregates its clients' models; after \(k_2\) such edge aggregations, the cloud server aggregates all the edge servers' models. Compared to traditional systems following the star topology, this tree-topology-based architecture greatly reduces the total number of communication rounds with the cloud server. Standard MNIST and CIFAR-10 datasets were used for the experiments, and two additional non-IID cases for MNIST were also considered. Experiments showed promising results on reduced communication frequency and energy consumption. When the overall communication (\(k_1 k_2\)) is fixed, fewer rounds of local updates (\(k_1\)) and more communication rounds with the edge result in faster training, which effectively reduces the computation load on the local clients. For the case of IID data on the edges, fewer communication rounds with the cloud server do not degrade performance either. Regarding energy consumption, moderately increased communication between clients and the edge decreases energy consumption, but excessive communication between edge servers and clients results in extra energy consumption. Therefore, the overall communication (\(k_1 k_2\)) must be balanced to minimize energy consumption.
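The following minimal sketch shows the two-level aggregation pattern described above: clients run \(k_1\) local steps, edge servers average their clients' models, and after \(k_2\) edge aggregations the cloud averages the edge models. The toy model, data, and grouping of clients under edge servers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2
true_w = np.array([1.5, -0.5])

def make_client():
    """One client's toy linear-regression data (assumption)."""
    X = rng.normal(size=(20, DIM))
    return X, X @ true_w + 0.1 * rng.normal(size=20)

# Assumption: 2 edge servers, each covering 3 clients.
edges = [[make_client() for _ in range(3)] for _ in range(2)]

def local_steps(w, X, y, k1=5, lr=0.05):
    """k1 local gradient steps on one client's data (linear regression loss)."""
    for _ in range(k1):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def hierfavg_round(w_cloud, k1=5, k2=3):
    """One cloud round: k2 edge aggregations, each preceded by k1 local client steps."""
    edge_models = []
    for clients in edges:
        w_edge = w_cloud.copy()
        for _ in range(k2):                                    # edge-level aggregations
            locals_ = [local_steps(w_edge.copy(), X, y, k1) for X, y in clients]
            w_edge = np.mean(locals_, axis=0)                  # edge server averages its clients
        edge_models.append(w_edge)
    return np.mean(edge_models, axis=0)                        # cloud averages edge models

w = np.zeros(DIM)
for _ in range(10):
    w = hierfavg_round(w)
print("global model after 10 cloud rounds:", w)
```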
Luo et al. [78] also introduced a Hierarchical Federated Edge Learning (HFEL) framework to jointly minimize energy consumption and delay. The authors formulated a joint computation and communication resource allocation problem for global cost optimization, which considers minimizing system-wide energy and delay within one global iteration, denoted by the following equations:
\(\begin{equation} E = \sum _{i \in \kappa } \left(E^{cloud}_{i} + E^{edge}_{S_i} \right), \tag{1} \end{equation}\)
\(\begin{equation} T = \max _{i \in \kappa } \left\lbrace T^{cloud}_{i} + T^{edge}_{S_i} \right\rbrace , \tag{2} \end{equation}\)
where \(E\) is the total energy consumed by each edge server \(i \in \kappa\) aggregating data and by each set of devices \(S_i\) uploading models to edge server \(i\). \(T\) is the total delay, comprising the delay introduced by the edge servers communicating with the cloud, denoted \(T^{cloud}_{i}\), and by the sets of devices uploading models, denoted \(T^{edge}_{S_i}\). The optimization jointly minimizes \(E\) and \(T\) with varying weights. A resource scheduling algorithm was developed based on the model, which relieves the core network transmission overhead and shows great potential for low-latency and energy-efficient FL.
Cao et al. proposed a federated learning system [8] with an aggregation method that uses the topology of the edge nodes to progress model aggregation layer by layer, specifically allowing child nodes on the lower levels to complete training first and then upload results to the node one level higher. Compared to the traditional FL architecture, where all end devices connect to the same server, the proposed layered and step-wise approach ensures that at most one gradient is transmitted over any link. The simulation results show better scalability: the time cost increases logarithmically, rather than linearly as in traditional FL systems.
    Another joint optimization strategy that investigated the trade-off between computation cost and accuracy was presented by Wen et al. [140] for hierarchical federated edge learning (H-FEEL), where an optimization approach was developed to minimize the weighted sum of energy consumption and gradient divergence. The innovative contributions included three phases: local gradient computing, weighted gradient uploading, and model updating.
    Ye et al. [147] proposed EdgeFed, featuring a trade-off between privacy and computation efficiency. In the EdgeFed scheme, split training was applied, and local training outputs were merged into batches before being transmitted to the edge servers. Local updates from mobile devices were partially offloaded to the edge servers, where more computational tasks were assigned, reducing the computational overhead on mobile devices, which could focus on training the low layers. In the EdgeFed algorithm, each iteration included multiple rounds of split training between \(K\) edge devices and the corresponding edge server, followed by a global aggregation between \(m\) edge servers and the central server. Edge device \(k\) performed calculations with local data on the low layers of the multi-layer neural network model. After receiving the outputs of the low layers from all edge devices, edge server \(m\) aggregated all the data received into a larger matrix \(x^{m}_{pool}\) , which was then taken as the input of the remaining layers:
    \(\begin{equation} x_{pool}^{m} \leftarrow \left[ x_{conv}^{1}, x_{conv}^{2}, \ldots x_{conv}^{k}, \ldots , x_{conv}^{K} \right] \end{equation}\)
    (3)
    As the updates to the central server were narrowed down to those between the edge servers and the central server, and the edge servers had more computational power than the edge devices, the overall communication costs were reduced. However, the data processed by the low layers of the model and transferred to the edge server may be a threat to privacy, because the edge server may be able to restore the original data.
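    The split-training step of Equation (3) can be sketched as follows. The layer shapes, batch sizes, and the use of a single linear-plus-ReLU block as the “low layers” are illustrative assumptions; EdgeFed itself is not tied to these choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_layers(x, w_low):
    """Device-side computation of the low layers (a single linear+ReLU block here)."""
    return np.maximum(x @ w_low, 0.0)

def high_layers(x_pool, w_high):
    """Edge-server-side computation of the remaining layers."""
    return x_pool @ w_high

# K edge devices, each holding a mini-batch of 8 samples with 16 features
K, d_in, d_mid, d_out = 4, 16, 32, 10
w_low = rng.normal(size=(d_in, d_mid))
w_high = rng.normal(size=(d_mid, d_out))
device_batches = [rng.normal(size=(8, d_in)) for _ in range(K)]

# Each device transmits only its low-layer outputs to the edge server ...
device_outputs = [low_layers(x, w_low) for x in device_batches]
# ... which stacks them into x_pool as in Equation (3) and runs the remaining layers.
x_pool = np.concatenate(device_outputs, axis=0)
logits = high_layers(x_pool, w_high)
print(x_pool.shape, logits.shape)   # (32, 32) (32, 10)
```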

    4.2.3 Dynamic Topology.

    Mhaisen et al. [84] proposed that the tree topology can be dynamic with a dense edge network. Edge devices may pair with different edge servers in different rounds of data aggregation. When participants change either due to the client selection strategy [106, 132] or participants entering or exiting the network [55], the topology will subsequently change. The authors argued that user equipment (UE) had access to more than one edge server in dense networks and increased the mobility of UE. Choosing the best edge server resulted in their proposed UE-edge assignment solutions. The user assignment problem was formalized in HFL based on the analysis of learning parameters with non-IID data.
    Kourtellis et al. [55] explored the possibility of collaborative modeling across different 3rd-party applications and presented federated learning as a service (FLaaS), a system allowing 3rd-party applications to create models collaboratively. A proof-of-concept implementation was developed on a mobile phone setting, demonstrating 100 devices working collaboratively for image object detection. FedPAQ was proposed in [106] as a communication-efficient federated learning method with periodic averaging and quantization. FedPAQ’s first key feature was to run local training before synchronizing with the parameter server. The second feature of FedPAQ was to capture the constraint on the availability of active edge nodes by allowing partial node participation, leading to better scalability and a smaller communication load. The third feature of FedPAQ was that only a fraction of device participants sent a quantized version of their local information to the server during each round of communication, significantly reducing the communication overhead.
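    A minimal sketch of a FedPAQ-style round, combining partial participation with quantized model differences, is given below. The stochastic uniform quantizer, the sampling fraction, and the local_train routine are our own placeholders rather than the exact scheme in [106].

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, levels=8):
    """Stochastic uniform quantization of a model update (illustrative choice)."""
    scale = np.max(np.abs(v)) + 1e-12
    normalized = np.abs(v) / scale * (levels - 1)
    low = np.floor(normalized)
    q = low + (rng.random(v.shape) < (normalized - low))   # randomized rounding
    return np.sign(v) * q / (levels - 1) * scale

def fedpaq_round(global_model, clients, local_train, sample_frac=0.2, local_steps=5):
    """One FedPAQ-style round: sample clients, run local steps, send quantized updates."""
    n_selected = max(1, int(sample_frac * len(clients)))
    selected = rng.choice(len(clients), size=n_selected, replace=False)
    updates = []
    for c in selected:
        local_model = local_train(global_model.copy(), clients[c], local_steps)
        updates.append(quantize(local_model - global_model))  # only quantized differences travel
    return global_model + np.mean(updates, axis=0)
```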
    The device sampling in Heterogeneous FL was studied in [132]. The authors noticed that there may be significant overlaps in the local data distribution of devices. Then a joint optimization was developed with device sampling aiming at selecting the best combination of sample nodes and data offloading configurations to maximize FL training accuracy with network and device capability constraints.
    Huang et al. [42] proposed a novel topology-optimized federated edge learning (TOFEL) scheme in which any device in the system could receive gradients, aggregate them with its own, and then pass the result to other devices or edge servers for further aggregation. The system acted as a hierarchical FL topology with an adjustable gradient uploading and aggregation topology. The authors formulated a joint topology and computing speed optimization as a mixed-integer nonlinear program (MINLP) aiming to minimize energy consumption and latency. A penalty-based successive convex approximation (SCA) method was developed to transform the MINLP into an equivalent continuous optimization problem, and the results demonstrated that the proposed TOFEL scheme sped up the federated learning process while consuming less energy.
    Duan et al. [24] focused on the imbalanced data distribution in mobile systems, which led to model biases. The authors built a self-balancing FL framework called Astraea to alleviate the imbalances with a mediator to reschedule the training of clients. The methods included Z-score-based data augmentation and mediator-based multi-client rescheduling. The Astraea framework consisted of three parts: FL server, mediator, and clients.

    4.2.4 Grouping Strategy and Privacy Enhancement.

    He et al. [34] proposed a grouping mechanism called Auto-Group, which automatically generated grouped users using an optimized Genetic Algorithm without the need to specify the number of groups. The Genetic Algorithm balanced the data distribution of each group to be as close as possible to the global distribution.
    FedAT was a method with asynchronous tiers under non-IID data proposed by Chai et al. [10]. FedAT organized the topology in tiers based on the response latencies of edge devices. For intra-tier training, the synchronous method was used, as the latencies were similar; for cross-tier training, the asynchronous method was used. The clients were split into tiers based on their training speed with the help of the tier-based module, allowing faster clients to complete more local training while using server-side optimization to avoid bias. By bridging synchronous and asynchronous training through tiering, FedAT minimized the straggler effect with improved convergence speed and test accuracy. FedAT used a straggler-aware, weighted aggregation heuristic to steer and balance the training for further accuracy improvement. FedAT compressed the uplink and downlink communications using an efficient, polyline-encoding-based compression algorithm, thereby minimizing the communication cost. Results showed that FedAT improved the prediction performance by up to 21.09% and reduced the communication cost by up to 8.5 times compared to state-of-the-art FL methods.
    With the two typical FL scenarios in MEC, i.e., virtual keyboard and end-to-end autonomous driving, Yu and Li proposed a neural-structure-aware resource management approach [149] for FL. The mobile clients were assigned to different subnetworks of the global model based on the status of local resources.
    Wainakh et al. [130] discussed the implications of the hierarchical architecture of edge computing for privacy protection. The topology and algorithm enabled by hierarchical FL (HFL) may help enhance privacy compared to the original FL. These enhancements included flexible placement of defense and verification methods within the hierarchy and the possibility of employing trust between users to mitigate several threats. The methods linked to HFL were illustrated, such as sampling users, training algorithms, model broadcasting, and model updates aggregation. Group-based user update verification could also be introduced with HFL. Flexible applications of defense methods were available in HFL because of the hierarchical nature of the network topology.

    4.3 Decentralized/Mesh Topology

    Decentralized/mesh topology is a network topology where all end devices are interconnected in a local network [15, 58, 73, 118, 142, 152]. In recent studies, mesh topologies are commonly used in FL systems. Decentralized approaches like peer-to-peer (P2P) or device-to-device (D2D) FL fall under the mesh topology. Many existing FL systems still rely on a centralized/cloud server for model aggregation [54]. The decentralized approach is sometimes regarded as a poor alternative to the centralized method, used only when a centralized server is not feasible. This section covers three major FL systems using fully decentralized approaches. In Table 6, we highlight some works that utilized the decentralized topology.
    Table 6.
    FL Type | Baselines and Benchmarks | Key Findings
    Decentralized Mesh | Using 20 Newsgroups dataset integrating GBDT | Obtained high utility and accuracy; effective data leakage detection; near-real-time performance in defending against data leakage [77]
    Decentralized Mesh | FedAvg and FedGMTL using AGE and GAT with MoleculeNet | Trained GNNs in serverless scenarios; outperformed star FL even if clients can only communicate with few neighbors [32]
    Decentralized Mesh | PENS, Random, Local, FixTopology, Oracle, IFCA, FedAvg with MNIST, FMNIST, and CIFAR10 | CNI was effective in matching neighbors with similar objectives; directional communications helped to converge faster; robust in non-IID settings [66]
    Decentralized Mesh | FedAvg using ResNet-20 model with CIFAR-10 under IID and non-IID settings | Provided an unbiased estimate of the model update to the PS through relaying; optimized consensus weights of clients to improve convergence; compatible with different topologies [148]
    Decentralized Wireless | FedAvg, CDSGD, D-PSGD using CNN with MNIST, FMNIST, CIFAR-10 under IID and non-IID settings | Outperformed in accuracy; less sensitive to topology sparsity; similar performance for each user; viable on IID and non-IID data under time-invariant topology [15]
    Decentralized Wireless | DSGD, TDMA-based, local SGD (no communication) with FMNIST | Over-the-air computing can only outperform conventional star topology implementations of DSGD [142]
    Decentralized Wireless | DOL and COL with SUSY and Room Occupancy datasets | Worked better than DOL with a row-stochastic confusion matrix; usually outperformed COL in running time [33]
    Decentralized Wireless | FedAvg and gossip approach without segmentation with CIFAR-10 | Required the least training time to achieve a given accuracy; more scalable; synchronization time significantly reduced [40]
    Decentralized Wireless | Gossip and Combo with FEMNIST and Synthetic data | Maximized bandwidth utilization by segmented gossip aggregation over the network; sped up training; maintained convergence [45]
    Decentralized Wireless | DFL and C-SGD with MNIST, CIFAR-10 | Showed linear convergence behavior for convex objectives; strong convergence guarantees for both DFL and C-DFL [73]
    Decentralized Wireless | FLS with MALC dataset using QuickNAT architecture | Enabled more robust training; similar performance to centralized approaches; generic and transferable method [108]
    Decentralized Wireless | FedAvg with MNIST using CNN and LSTM | Improved convergence performance of FL, especially when the model was complex and network traffic was high [99]
    Decentralized Wireless | Gossip and GossipPGA using LEAF with FEMNIST and Synthetic data | Reduced training time and maintained good convergence; partial exchange significantly reduced latency [44]
    Table 6. Highlighted Works - Decentralized Topology
    The performance of decentralized FL algorithms was discussed in [67]. A theoretical analysis of the D-PSGD algorithm was conducted to prove the possibility of a decentralized FL algorithm outperforming centralized FL algorithms. With comparable computational complexity, the decentralized FL algorithm required much less communication cost on the busiest node. D-PSGD could be one order of magnitude faster than its well-optimized centralized counterparts.
    Lu et al. [77] developed a vehicular FL scheme based on a sub-gossip update mechanism along with a secure architecture for vehicular cyber-physical systems (VCPS). The P2P vehicular FL scheme used random sub-gossip updating without a curator, which enhanced security and efficiency. The aggregation process was done in each vehicle asynchronously. The data retrieval information was registered on nearby RSUs as a distributed hash table (DHT), and the DHT was searched for all related vehicles before FL started.
    Gossip learning [35, 95] was compared with FL in [36] as an alternative where training data also remained on the edge devices, but there was no central server for aggregation. Gossip learning can be seen as a variation of the mesh topology: nodes exchanged and aggregated models directly. Having no centralized server meant no single point of failure and led to better scalability and robustness. The performance of gossip learning was generally comparable with FL and even better in some scenarios. The experiment was conducted using PEERSIM [87].
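    A bare-bones gossip-learning round might look like the sketch below, where each node trains locally and then averages models with one randomly chosen neighbor. The node data structure and the merge-by-averaging rule are assumptions for illustration; actual gossip learning protocols [35, 95] differ in their peer sampling and model merging details.

```python
import random

def gossip_round(nodes, local_train, merge):
    """One round of gossip learning over a mesh of nodes (illustrative sketch).

    nodes: list of dicts with keys 'model', 'data', and 'neighbors' (indices into nodes)
    local_train: function(model, data) -> updated model
    merge: function(model_a, model_b) -> merged model, e.g. parameter-wise averaging
    """
    for node in nodes:
        node['model'] = local_train(node['model'], node['data'])  # train on local data
        peer = nodes[random.choice(node['neighbors'])]            # pick a random neighbor
        merged = merge(node['model'], peer['model'])              # exchange and merge models
        node['model'] = merged
        peer['model'] = merged   # the contacted peer also keeps the merged model

# merge could be as simple as: lambda a, b: 0.5 * (a + b) for numpy-array models
```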
    SpreadGNN was proposed in [32] as a novel multi-task federated training framework able to operate with partial labels of client data for graph neural networks in a fully decentralized manner. The serverless multi-task learning optimization problem was formulated, and Decentralized Periodic Averaging SGD (DPA-SGD) was introduced to solve it. The results showed that it is viable to train graph neural networks with federated learning in a fully decentralized setting.
    Li et al. [66] leveraged P2P communications between FL clients without a central server and proposed an algorithm that forms an effective communication topology in a decentralized manner without assuming the number of clusters. To design the algorithm, two novel metrics were created for measuring client similarity. A further two-stage algorithm directed the clients to match same-cluster neighbors and to discover more neighbors with similar objectives. A theoretical analysis was included, showing the effectiveness of the work compared to other P2P FL methods.
    A semi-decentralized topology was introduced by Yemini et al. [148], where a client was able to relay the update from its neighboring clients. A weighted update with both the client’s own data and its neighboring clients’ data was transmitted to the parameter server (PS). The goal was to optimize averaging weights to reduce the variance of the global update at the PS, as well as minimize the bias in the global model, eventually reducing the convergence time.

    4.3.1 Decentralized Topology in Wireless Networks.

    The mesh or decentralized FL topology has been explored in wireless networks [27, 118, 142, 145], as the wireless coverage of P2P or D2D devices overlaps with one another and no centralized server is provided.
    Trust was treated as a metric of FL in [27]. The trust was quantified upon the relationship among network entities according to their communication history. Positive contributions to the model were interpreted as an increment of trust, and vice versa.
    Shi et al. proposed over-the-air FL [118] over wireless networks, where over-the-air computation (AirComp) [145] was adopted to facilitate the local model consensus in a D2D communication manner.
    Chen et al. [15] considered the deficiency of high divergence and the necessity of model averaging in previous decentralized FL implementations like CDSGD and D-PSGD. They devised a decentralized FL implementation called DACFL that adapts better to non-ideal network topologies. DACFL allows individual users to train their own models with their own training data while exchanging intermediate models with neighbors, using FODAC (first-order dynamic average consensus) to mitigate potential over-fitting without a central server during training.
    Xing et al. considered a network of wireless devices sharing a common fading wireless channel for deploying FL [142]. Each device held a generally distinct training set, and communication typically took place in a D2D manner. In the ideal case, where all devices within communication range could communicate simultaneously and noiselessly, a standard protocol, Decentralized Stochastic Gradient Descent (DSGD), guaranteed convergence to an optimal solution of the global empirical risk minimization problem under convexity and connectivity assumptions. DSGD integrated local SGD steps with periodic consensus averages that required communication between neighboring devices. Wireless protocols were proposed for implementing DSGD by accounting for the presence of path loss, fading, blockages, and mutual interference.
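    The DSGD update alternates a local SGD step with neighbor-weighted consensus averaging, which the following sketch captures under an ideal (noiseless, synchronized) channel. The doubly stochastic ring mixing matrix is just one illustrative choice of connectivity.

```python
import numpy as np

def dsgd_step(models, grads, mixing_matrix, lr=0.1):
    """One DSGD iteration: a local SGD step followed by consensus averaging.

    models: (n_devices, dim) array of local model parameters
    grads:  (n_devices, dim) array of local stochastic gradients
    mixing_matrix: (n_devices, n_devices) doubly stochastic matrix whose nonzero
                   entries follow the D2D connectivity graph
    """
    local = models - lr * grads        # local SGD step on each device
    return mixing_matrix @ local       # neighbor-weighted consensus average

# Example: 5 devices on a ring, each averaging with itself and its two neighbors
n, dim = 5, 3
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1 / 3
models = dsgd_step(np.random.rand(n, dim), np.random.rand(n, dim), W)
```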
    He et al. explored the use cases of FL in social networks where centralized FL was not applicable [33]. Online Push-Sum (OPS) method was proposed to leverage trusted users for aggregations. OPS offered an effective tool to cooperatively train machine learning models in applications where the willingness to share is single-sided.
    Lalitha et al. considered the problem of training models in fully decentralized networks. They proposed a distributed learning algorithm [58] in which users aggregated information from their one-hop neighbors to learn a model that best fits the observations over the entire network with a small probability of error.
    Savazzi et al. proposed a fully distributed, serverless FL approach for massively dense and fully decentralized networks [113]. Devices trained independently on their local datasets and on the updates received from neighbors, then forwarded their model updates to their one-hop neighbors for a new consensus step, extending the method of gossip learning. Both model updates and gradients were iteratively exchanged to improve convergence and minimize the rounds of communication.
    Combo, a decentralized federated learning system based on a segmented gossip approach, was presented in [40] to split the FL model into segments. A worker updated its local segments with k other workers, where k was much smaller than the total number of workers. Each worker stochastically selected a few other workers in each training iteration to transfer model segments. Model replication was also introduced to ensure that workers had enough segments for training purposes.
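    The per-worker segmented gossip pull can be sketched as below. Splitting the flattened model into equal chunks and averaging each chunk with k random peers is a simplification of Combo’s mechanism; the segment count and peer selection policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def segmented_gossip_pull(worker_models, my_id, n_segments=4, k=2):
    """Combo-style segmented gossip aggregation for one worker (illustrative sketch).

    The flattened model is split into n_segments chunks; for every chunk the worker
    pulls the same segment from k randomly chosen peers and averages it with its own.
    """
    segments = np.array_split(worker_models[my_id], n_segments)
    peers = [i for i in range(len(worker_models)) if i != my_id]
    new_segments = []
    for s, seg in enumerate(segments):
        chosen = rng.choice(peers, size=k, replace=False)
        pulled = [np.array_split(worker_models[p], n_segments)[s] for p in chosen]
        new_segments.append(np.mean([seg] + pulled, axis=0))   # aggregate per segment
    return np.concatenate(new_segments)

# 5 workers, 20-parameter models; worker 0 refreshes its model via segmented gossip
models = [rng.normal(size=20) for _ in range(5)]
models[0] = segmented_gossip_pull(models, my_id=0)
```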
    Jiang et al. proposed Bandwidth Aware Combo (BACombo) [45] with a segmented gossip aggregation mechanism that made full use of node-to-node bandwidth to speed up communication. In addition, a bandwidth-aware worker selection model further reduced the transmission delay by greedily choosing bandwidth-sufficient workers. Convergence guarantees were provided for BACombo. The experimental results on various datasets demonstrated that the training time was reduced by up to 18 times compared to baselines without accuracy degradation.
    The work in [73] focused on balancing between communication-efficiency and convergence performance of Decentralized federated learning (DFL). The proposed framework performed both multiple local updates and multiple inter-node communications periodically, unifying traditional decentralized SGD methods. Strong convergence guarantees were presented for the proposed DFL algorithm without the assumption of a convex objective function. The balance of communication and computation rounds was essential to optimize decentralized federated learning under constrained communication and computation resources. To further improve the communication efficiency of FL, compressed communication was applied to DFL, which exhibited linear convergence for strongly convex objectives.
    BrainTorrent [108] was proposed to perform FL for medical centers without the use of a central server to protect patient privacy. Since a central server required trust from all clients, which was not feasible for multiple medical organizations, BrainTorrent presented a dynamic peer-to-peer environment. All medical centers directly interacted with each other, acting like a P2P network topology. Each client maintained its own version of the model and the latest versions of the models it used during merging. By sending a ping request, a client received responses from other clients with their latest model versions and subsets of the models. The client then merged the received models by weighted averaging to generate a new model.
    FedAir [99] explored enabling FL over wireless multi-hop networks, including widely deployed wireless community mesh networks. A wireless multi-hop FL system consists of a central server acting as an aggregator, connected via multi-hop wireless links to edge servers acting as workers. According to the authors, multi-hop FL faces several challenges, including a slow convergence rate, prolonged per-round training time, potential divergence of synchronous FL, and difficulties in model-based optimization over multiple hops.
    Jiang and Hu proposed gradient-partial-level decentralized federated learning (FedPGA) [44], aiming to improve on the high training latency of traditional star topology FL in real-world scenarios. The authors used a partial gradient exchange mechanism to maximize bandwidth utilization and improve communication time, and an adaptive model updating method to adaptively increase the step size. The experimental results showed up to 14 \(\times\) faster training time compared to baselines without compromising accuracy.

    4.3.2 Routing in Decentralized Topology.

    A topology design problem for cross-silo FL was analyzed in [82], since traditional FL topology designs are inefficient in cross-silo settings. The authors proposed algorithms that find the optimal topology using the theory of max-plus linear systems. By minimizing the duration of communication rounds or maximizing the largest throughput, they were able to find the optimal topology design that significantly shortens training time.
    Sacco et al. proposed Blaster [110], a federated architecture for routing packets within a distributed edge network, to improve the application’s performance and allow scalability of data-intensive applications. A path selection model was proposed using Long Short Term Memory (LSTM) to predict the optimal route. Initial results were shown with a prototype deployed over the GENI testbed. This approach showed that communications between SDN controllers could be optimized to preserve bandwidth for the data traffic.
    In the cross-device FL scenario, Ruan et al. studied flexible device participation in FL [109]. The authors assumed that, in practice, it is difficult to ensure that all devices are available during the entire training and that devices cannot be guaranteed to complete their assigned training tasks in every round as expected. Specifically, the research incorporated four situations: incompleteness, where devices submitted only partially completed work in a round; inactivity, where devices did not complete any updates or respond to the coordinator at all; early departures, where existing devices quit the training without finishing all training rounds; and late arrivals, where new devices joined after the training had already started.
    In [89], a Federated Autonomous Driving network (FADNet) was designed to improve FL model stability, ensure convergence, and handle imbalanced data distribution problems. The experiments were conducted with a dense topology called the Internet Topology Zoo (Gaia) [53].
    The problem of federated learning on a fully decentralized network was analyzed in [48], particularly how the convergence of a decentralized FL system is affected under different network settings. Several simulations were conducted with different topologies, datasets, and machine learning models. The results suggested that scale-free and small-world networks are more suitable for decentralized FL, while a hierarchical network offers convergence speed with trade-offs.

    4.3.3 Blockchain-Based Topology.

    As one of the decentralized methods for FL, numerous blockchain-based topologies have been presented in previous works. Blockchains combined with FL aim at replacing the central server for generating the global model. The potential benefits of introducing blockchains in FL systems include the following: placement of model training, incentive mechanisms to attract more participants, decentralized privacy, defense against poisoning attacks, and cross verification. Figure 9 displays a typical blockchain-based FL topology. The highlighted works on blockchain-based topology are shown in Table 7.
    Table 7.
    FL Type | Baselines and Benchmarks | Key Findings
    Blockchain | Basic FL and stand-alone training framework with FEMNIST | Higher resistance to malicious nodes; mitigates the influence of malicious central servers or nodes [65]
    Blockchain | FL-Block with CIFAR-10, Fashion-MNIST | Fully capable of supporting big data scenarios, particularly fog computing applications; provides decentralized privacy protection while preventing a single point of failure [103]
    Blockchain | LEAF with Ethereum as the underlying blockchain; tested logistic regression (LR) and NN models | Incentive mechanisms encouraged clients to provide high-quality training data; communication overhead can be significantly reduced when the dataset size is extremely large [158]
    Blockchain | Integrated FL into the consensus process of a permissioned blockchain with Reuters and 20 Newsgroups datasets | Increased efficiency of the data sharing scheme by improving utilization of computing resources; secure data sharing with high utility and efficiency [76]
    Blockchain | Assisted home appliance manufacturers using FL to predict future customer demands and behavior with MNIST | Created an incentive program to reward participants while preventing poisoning attacks from malicious customers; communication costs are small compared with the wasted training time on mobile devices [160]
    Blockchain | Evaluation based on the 3GPP LTE Cat. M1 specification | Allowed autonomous vehicles to communicate efficiently by exploiting consensus mechanisms in blockchain to enable on-vehicle machine learning (oVML) without a centralized server [100]
    Blockchain | FL with Multi-Krum and DP under poisoning attacks using Credit Card dataset and MNIST | Scalable; fault-tolerant; defends against known attacks; capable of protecting the privacy of client updates and maintaining the performance of the global model with 30% adversaries [114]
    Table 7. Highlighted Works - Blockchain Topology
    Fig. 9.
    Fig. 9. FL with blockchain as distributed ledgers to increase the availability of the central server.
    FLChain was proposed in [80] to enhance the reliability of FL in wireless networks, with separate channel selection for FL model uploading and downloading. Local model parameters were stored as blocks on a blockchain as an alternative to a central aggregation server, and the edge devices provided network resources to resource-constrained mobile devices while serving as nodes in the blockchain network of FLChain. Similar to most blockchain-based FL frameworks, FLChain placed the blockchain network above the edge devices for channel registration and global model updates.
    A blockchain-based federated learning framework with committee consensus (BFLC) was proposed in [65] to reduce the amount of consensus computing and malicious attacks. An alliance blockchain was used to manage FL nodes for permission control. Different from the traditional FL process, there was an additional committee between the training nodes and the central server for update selection. In each round of FL, updates were validated and packaged by the selected committee, allowing the most honest nodes to improve the global model continuously. A small number of incorrect or malicious node updates would be ignored to avoid damaging the global model. Nodes could join or leave at any time without damaging the training process. The blockchain acted as a distributed storage system for persisting the updates.
    Qu et al. developed FL-Block [103] to allow the exchange of local learning updates from end devices via a blockchain-based global learning model verified by miners. The central authority was replaced with an efficient blockchain-based protocol. The blockchain miners verified and stored the local model updates. A linear regression problem was presented with the objective of minimizing a loss function \(f(\omega)\) . An algorithm designed for block-enabled FL enabled block generation by the winning miner after the local model was uploaded to the fog servers. The fog servers received updates of global models from the blockchain.
    A blockchain anchoring protocol was designed in [158] for device failure detection. Specifically, the protocol built custom Merkle trees, with each leaf node representing a data record, and anchored the tree roots onto blockchains to verify Industrial IoT (IIoT) data integrity efficiently.
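    To illustrate the anchoring idea, the following sketch builds a Merkle root over a batch of IIoT records using SHA-256; only this root would be written on-chain, and individual records can later be checked against it with a logarithmic-size proof. The record format and hashing choices are illustrative, not the protocol in [158].

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records):
    """Compute the Merkle root over a list of byte-string records (illustrative)."""
    level = [sha256(r) for r in records]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical IIoT sensor records anchored in one Merkle root
root = merkle_root([b"sensor-1:23.4C", b"sensor-2:0.81bar", b"sensor-3:overload"])
print(root.hex())
```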
    In a similar research scenario of processing IIoT data, a permissioned blockchain [101] was used in [76] for recording IIoT data retrieval and data sharing transactions. Proof of Training Quality (PoQ) was proposed to replace the original Proof of Work (PoW) mechanism to reach consensus at a lower cost. A differential-privacy-preserving model was first incorporated into FL. In PoQ, the committee leader was selected according to the prediction accuracy of the trained models, measured by the mean absolute error (MAE):
    \(\begin{equation} {MAE}(m_i) = \frac{1}{n} \sum _{i = 1}^{n} \left| y_i - f(x_i) \right|, \end{equation}\)
    (4)
    where \(f(x_i)\) denoted the prediction value of model \(m_i\) and \(y_i\) was the observed value. The consensus process started with the election of the committee leader with the lowest \({MAE}^u\) by voting. This leader was then assigned to drive the consensus process. The trained models were circulated among the neighboring committee nodes, denoted by \(P_i\) , of a committee node \(P_j\) , leading to the MAE for \(P_j\) to be
    \(\begin{equation} MAE^{u}(P_j) = \gamma \cdot {MAE}(m_j) + \frac{1}{n} \sum _{i = 1}^{n} {MAE}(m_i), \end{equation}\)
    (5)
    where \({MAE}(m_j)\) was the MAE of the locally trained model, weighted by \(\gamma\) , and \({MAE}(m_i)\) referred to the MAEs of the remotely trained models.
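    The leader election in PoQ then reduces to picking the committee node with the smallest combined score. A small worked example of Equations (4) and (5), with hypothetical MAE values and a hypothetical \(\gamma\), is shown below.

```python
import numpy as np

def mae(y_true, y_pred):
    """Equation (4): mean absolute error of a candidate model's predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def committee_score(local_mae, neighbor_maes, gamma=0.5):
    """Equation (5): weighted local MAE plus the mean MAE of circulated models."""
    return gamma * local_mae + np.mean(neighbor_maes)

# Hypothetical committee of three nodes; the lowest score wins the leader election.
scores = {
    "P1": committee_score(0.12, [0.15, 0.18]),
    "P2": committee_score(0.10, [0.11, 0.14]),
    "P3": committee_score(0.20, [0.13, 0.16]),
}
leader = min(scores, key=scores.get)
print(scores, "leader:", leader)
```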
    Zhao et al. [160] replaced aggregator nodes in traditional FL systems with blockchains for traceable activities. Customer data was selected to be sent to selected miners for averaging. One of the miners, selected as the leader, uploaded the aggregated model to the blockchain. More importantly, the authors proposed a normalization technique with differential privacy preservation.
    Pokhrel and Choi [100] discussed blockchain-based FL (BFL) parameters for vehicular communication networking, considering local on-vehicle machine learning updates. The blockchain-related parameters were analyzed via a mathematical framework, including the retransmission rate, block size, block arrival rate, and frame sizes. The analytical results proved that tuning the block arrival rate was able to minimize the system delay.
    Shayan et al. proposed Biscotti [114], a fully decentralized multi-party ML system built on blockchain and cryptographic primitives with an emphasis on privacy preservation. The training process of clients was recorded in the blockchain ledger. Clients completed local training, and the results were masked using private noise. The masked updates then went through a validation process as an extra layer of security. A new block was created for every new round of training. However, due to communication overhead, Biscotti does not support large deep learning models. In a test with 200 peers, Biscotti showed utility similar to traditional star topology federated learning.

    4.4 Minor Topologies

    In addition to the above-mentioned topologies, some minor topologies combine existing topologies or utilize niche new topologies that are not widely used. Although there are only a few studies on some minor topologies, these works can still provide valuable insight into the subject. The highlighted works of minor topology are shown in Table 8.
    Table 8.
    FL Type | Baselines and Benchmarks | Key Findings
    Ring | FedAvg with MNIST, CIFAR-10 and CIFAR-100 | Improved bandwidth utilization, robustness, and system communication efficiency; reduced communication costs [137]
    Ring | G-plain (graph-based), R-plain (ring-based), and UBAR with MNIST and CIFAR-10 | Fast and computationally efficient; superior performance over SOTA in IID and non-IID settings; achieves a linear convergence rate; further scalability in parallel implementation [25]
    Ring | LeNet and VGG11 with MNIST, FMNIST, EMNIST, CIFAR-10 | Achieved higher test accuracy in fewer communication rounds; faster convergence; robustness to non-IID datasets [144]
    Clique | FedAvg with MNIST and CIFAR10 | Reduced gradient bias; convergence in heterogeneous data environments; reduction in edge and message numbers [4]
    Fog | FedAvg, HierFAVG, DPSGD with MNIST, FEMNIST, Synthetic dataset | Robust under dynamic topologies; fastest convergence rates under both static and dynamic topologies [161]
    Fog | FedAvg with MNIST, FEMNIST, Shakespeare | Gave smooth convergence curves; higher model accuracy; more scalable; communication-efficient [17]
    Fog | FL with full device participation and FL with one device sampled from each cluster with MNIST, F-MNIST | Better model accuracy and energy consumption; robustness against outages; favorable performance with non-convex loss functions [69]
    Fog | Only Cloud, INC Solution, Non-INC, and INC LB | Reached near-optimal network latency; outperformed baselines; helped the cloud node significantly decrease its network’s aggregation latency, traffic, and computing load [22]
    Fog | FedAsync and FedAvg with MNIST and CIFAR-10 | Reduced network traffic consumption; faster convergence; effective with non-IID data; deals with staleness [138]
    Semi-ring | FedAvg with MNIST under non-IID setting | Improved communication efficiency; flexible and adaptive convergence [125]
    Semi-ring | Astraea, FedAvg, HierFAVG, IFCA, MM-PSGD, SemiCyclic with FedShakespeare, MNIST | Near-linear scalability; improved model accuracy [59]
    Table 8. Highlighted Works - Minor Topologies

    4.4.1 Ring Topology.

    A ring-topology decentralized federated learning (RDFL) framework was proposed in [137] for communication-efficient learning across multiple data sources in a decentralized environment. RDFL was inspired by the idea of ring-allreduce and applied a consistent hashing technique to construct a ring topology of decentralized nodes. An IPFS-based data-sharing scheme was designed as well to reduce communication costs.
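    Consistent hashing makes the ring robust to nodes joining or leaving, since only a node’s immediate successor is affected. A minimal sketch of hashing node identifiers onto a ring and deriving each node’s forwarding target is shown below; the identifiers and hash choice are illustrative and unrelated to RDFL’s actual implementation.

```python
import hashlib

def ring_position(node_id: str, ring_size: int = 2**32) -> int:
    """Hash a node identifier onto a fixed-size ring (consistent hashing)."""
    return int(hashlib.sha256(node_id.encode()).hexdigest(), 16) % ring_size

def build_ring(node_ids):
    """Order nodes by hash value; each node forwards its model to its clockwise successor."""
    ordered = sorted(node_ids, key=ring_position)
    return {node: ordered[(i + 1) % len(ordered)] for i, node in enumerate(ordered)}

print(build_ring(["edge-a", "edge-b", "edge-c", "edge-d"]))
```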
    RingFed [144] took advantage of the ring topology, allowing clients to communicate with each other while performing pre-aggregation on clients to further reduce communication rounds. RingFed does not rely on the central server to perform model training tasks but uses the central server to assist in passing model parameters. Clients only communicate with the central server after a set number of periods. In comparison to other algorithms, an additional step of recalculating all client parameters is added. Experimental results show that RingFed outperforms FedAvg in most cases and that training remains effective on non-IID data as well.
    Elkordy et al. [25] proposed Basil, a fast and computationally efficient Byzantine robust algorithm for decentralized (serverless) training systems. In particular, the key aspect of their work is that it considers the decentralized FL and leverages the logical ring topology among nodes. Basil has also proven to achieve a linear convergence rate and further scalability in parallel implementation.

    4.4.2 Clique Topology.

    Cliques are defined in graph theory, referring to a subset of vertices of an undirected graph such that every two distinct vertices in the clique are adjacent [2]. Cliques have been well-studied in graph theory. Cliques have also been used in FL for improving the accuracy in sparse neural networks [4]. We used the clique-based topology structure from Bellet et al. [4] as a visualized example shown in Figure 10.
    Fig. 10.
    Fig. 10. The clique-based topology structured in D-cliques [4] that could affect the structure of underlying topologies.
    D-cliques [4] was a topology that reduced the gradient bias by grouping nodes in sparsely interconnected cliques such that the label distribution in a clique is representative of the global label distribution. This way, the impact of label distribution skew can be mitigated for heterogeneous data. Instead of providing a fully connected topology which may be unrealistic with large numbers of clients, D-Cliques instead provided locally fully connected neighborhoods. Each node belonged to a Clique, a set of fully connected nodes with data distribution as close as possible to the global distribution of the data through the network. Each Clique of the network provided a fair representation of the true data distribution, while substantially reducing the number of links.
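    A simple way to approximate the D-Cliques idea is to deal nodes out label by label across cliques so that every clique sees roughly the global label mix. The greedy heuristic below is our own illustrative stand-in, not the construction procedure used in [4].

```python
def greedy_cliques(node_labels, clique_size):
    """Greedily group nodes into cliques whose label mix approximates the global mix.

    node_labels: {node_id: dominant label of that node's local data}
    Nodes are dealt out label by label, round-robin across cliques, so each clique
    ends up with roughly the global label distribution (illustrative heuristic).
    """
    by_label = {}
    for node, label in node_labels.items():
        by_label.setdefault(label, []).append(node)
    n_cliques = max(1, len(node_labels) // clique_size)
    cliques = [[] for _ in range(n_cliques)]
    i = 0
    for nodes in by_label.values():
        for node in nodes:
            cliques[i % n_cliques].append(node)   # spread each label across cliques
            i += 1
    return cliques

# 12 nodes with three skewed label groups, packed into cliques of size 4
print(greedy_cliques({f"n{k}": k % 3 for k in range(12)}, clique_size=4))
```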

    4.4.3 Grid Topology.

    The grid topology also enables data transmission between adjacent clients. Compared to the ring topology, where each client has two one-hop neighbors, the grid topology gives each client four neighbors, encouraging more data exchange in local networks. Shi et al. [117] discussed a scenario of distributed federated learning in a multi-hop wireless network. The experiments evaluated the performance over line, ring, star, and grid networks, proving that more neighbors lead to faster convergence and higher accuracy.

    4.4.4 Hybrid Topology.

    In the previous section, we have introduced various types of topologies frequently seen in prior FL studies. Though these topologies cover a great portion of the use cases, each topology has its pros and cons. Researchers have explored combinations of various topologies to receive maximum benefits from network topologies. Hybrid topologies [17, 39, 59, 125] combine the strengths of at least two traditional topologies to create a more dynamic solution.

    4.4.5 Fog Topology (Star + Mesh).

    In this section, we examine the fog topology, which is essentially a fusion of star and mesh topologies. We show a visualization of fog topology from the works of Hosseinalipour et al. [39] shown in Figure 11.
    Fig. 11.
    Fig. 11. The fog learning topology [39], where D2D communications are enabled among the clients in the same cluster, as well as edge servers in the same cluster.
    The concept of fog learning was presented in [39] in comparison to FL over heterogeneous wireless networks. The word “fog” was used to address the heterogeneity across devices. Compared to FL, fog learning considers the diversity of devices with various proximities and topology structures for scalability. Fog learning features a multi-layer network architecture with both vertical and horizontal device communications. Device-to-device (D2D) communications are possible when there are fewer privacy concerns. Compared to the tree topology, additional D2D communication paths are added at the edge layer. The D2D offloading could happen among trusted devices, at the cost of a privacy compromise, if such a sacrifice were acceptable. Inter-layer data offloading could also be implemented to increase the similarity of local data and reduce model bias.
    Hosseinalipour et al. [38] developed multi-stage hybrid federated learning (MH-FL) built on fog learning, a hybrid of intra- and inter-layer model learning that considered the network as a multi-layer, hybrid structure with both mesh topology and tree topology. In MH-FL, each layer of the network consists of multiple device clusters. MH-FL considered the topology structures among the nodes in the clusters, including local networks formed via D2D communications. It orchestrated the devices at different network layers in a collaborative/cooperative manner to form a local consensus on the model parameters and combined this with multi-stage parameter relaying between layers of the tree-shaped hierarchy. These clusters were designed in two types: limited uplink transmission (LUT) clusters with limited capability to upload data to the upper layer, and extensive uplink transmission (EUT) clusters with enough resources to perform conventional FL.
    Strictly speaking, some topologies are based on tree topologies but with additional edges [161], making them more generic graphs with higher connectivity. To scale up FL, parallel FL (PFL) systems were built with multiple parameter servers (PSs). A parallel FL algorithm called P-FedAvg was proposed in [161], extending FedAvg by allowing multiple parameter servers to work together. The authors identified that a single parameter server becomes a bottleneck for two reasons: the difficulty of establishing a fast network that connects all devices to a single PS, and the limited communication capacity of a single PS. With the P-FedAvg algorithm, each client conducts several local iterations before uploading its model parameters to its PS. A PS collects model parameters from selected clients, conducts a global iteration by aggregating the model parameters uploaded from its clients, and then mixes its model parameters with its neighbor PSs. The authors optimized the weights with which PSs mix their parameters with neighbors. Essentially, this is a non-global aggregation that requires no communication with a central server. The study indicated that PFL can significantly improve the convergence rate if the network is not sparsely connected. They also compared P-FedAvg under three different network topologies: ring, 2d-torus, and star, with 2d-torus being the most robust.
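    The PS-level mixing step can be viewed as multiplying the stacked PS models by a row-stochastic matrix restricted to the PS overlay graph, as in the sketch below. The ring example and uniform weights are illustrative; P-FedAvg optimizes these weights rather than fixing them.

```python
import numpy as np

def ps_mix(ps_models, mix_weights):
    """Parameter-server mixing step in a parallel FL setting (illustrative).

    ps_models:   (n_ps, dim) array, one aggregated model per parameter server
    mix_weights: (n_ps, n_ps) row-stochastic matrix; entry (i, j) is nonzero only
                 if PS i and PS j are neighbors in the PS overlay topology
    """
    return mix_weights @ ps_models   # each PS averages with its neighbors, no cloud needed

# Example: four PSs on a ring, each mixing with itself and its two ring neighbors
W = np.array([[0.5 , 0.25, 0.  , 0.25],
              [0.25, 0.5 , 0.25, 0.  ],
              [0.  , 0.25, 0.5 , 0.25],
              [0.25, 0.  , 0.25, 0.5 ]])
mixed = ps_mix(np.random.rand(4, 3), W)
```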
    FedP2P was proposed in [17], aiming at reorganizing the connectivity structure to distribute both the training and communication onto the edge devices by leveraging P2P communication. While edge devices performed pairwise communication in a D2D manner, a central server was still in place. However, the central server only communicated with a small number of devices, each of which represented a partition, and the parameters were aggregated before being transmitted to the central server. Compared to the tree-based topology, FedP2P was robust if one or more nodes in a P2P subnetwork went down. Compared to the original star topology, FedP2P still had better scalability with clustered P2P networks. Lin et al. proposed a semi-decentralized learning architecture called TT-HF [69], which combined the traditional star topology of FL with decentralized D2D communications for model training, formulating a semi-decentralized FL topology. The problem of resource-efficient federated learning across heterogeneous local datasets at the wireless edge was studied, with D2D communications enabled. A consensus mechanism was developed to mitigate model divergence via low-power communications among nearby devices. TT-HF incorporated two timescales for model training: iterations of stochastic gradient descent at individual devices and rounds of cooperative D2D communications within clusters.
    Dinh et al. [22] proposed an edge network architecture that decentralized the model aggregation process at the server and significantly reduced the aggregation latency. First, an in-network aggregation process was designed so that the majority of aggregation computations were offloaded from the cloud server to edge nodes. Then a joint routing and resource allocation optimization problem was formulated to minimize the aggregation latency for the whole system at every learning round. Numerical results showed a 4.6 times improvement in network latency. FedCH [138] constructed a special cluster topology and performed hierarchical aggregation for training. FedCH arranged clients into multiple clusters based on their heterogeneous training capacities. The cluster head collected all updates from clients in its cluster for aggregation, and all cluster heads used the asynchronous method for global aggregation. The authors concluded that the convergence bound was related to the number of clusters and the training epochs, and then proposed an algorithm to determine the optimal number of clusters under resource budgets together with the cluster topology, showing an improvement in completion time of 49.5–79.5% and in network traffic of 57.4–80.8%.

    4.4.6 Semi-Ring Topology (Ring + Star/Tree).

    In addition to comprehensively comparing FL system structures with different topologies, the ring and tree topologies were used in [125] by Tao et al. for efficient parameter aggregation. A hybrid network topology design was proposed integrating a ring (R) and an n-ary tree (T) to provide flexible and adaptive convergecast in federated learning. Participating peers within one hop formed a local ring to adapt to the frequent joining and leaving of devices; an n-ary convergecast tree was formed from the local rings to the aggregator for communication efficiency. Theoretical analysis found that the hybrid (R+T) convergecast design was superior in terms of system latency. We show these hybrid topologies, termed semi-ring topology, in Figure 12.
    Fig. 12.
    Fig. 12. The semi-ring topology described in [59].
    Lee et al. presented an algorithm called TornadoAggregate [59] that leverages a ring architecture to improve the accuracy and scalability of FL. Traditional global aggregation in the star architecture is replaced by inter-node model transfers that keep nodes synchronized with the new model. TornadoAggregate can achieve a low convergence bound and satisfy the diurnal property condition.

    5 Challenges AND Future Research Roadmaps

    Despite great attention addressing topology-related challenges for edge computing FL in recent years [4, 15, 84, 137], the nature of the network topologies and data distribution still introduces unique challenges. Apart from the previously mentioned topology-aware FL works, much is still to be studied about network topology in FL. In this section, we provide some open challenges and research directions for topology-aware FL.

    5.1 Topology Selection

    When implementing or designing a topology-aware FL approach, there are a few things to consider. For example, in hierarchical and heterogeneous edge networks, multiple paths exist from the edge devices to the edge servers and the central server. When selecting the topology optimal for FL, the following questions need to be asked:
    (1)
    Does a server-less architecture fit the system?
    (2)
    Is the traditional star topology no longer sufficient for the system?
    (3)
    Is there a unique topology that already exists at the hardware or structural level in the system?
    When a system structure already exists, for example, a network of devices and subgroups of those devices controlled and managed by an intermediate server, the tree topology structure will be an obvious choice. As a result, the focus will shift from which topology to select to how to optimize the tree topology structure for a specific goal, i.e., for increased communication efficiency or to mitigate security bottlenecks. This area offers many opportunities for further research, including the development of new topologies or combinations of existing topologies.

    5.2 Communication Cost

    It is common for edge devices to be powered by batteries. Performing local model aggregation and wireless model transmission consumes the limited power sources of edge devices. Saving communication costs and developing energy-aware federated algorithms are therefore primary research goals. Moreover, existing solutions mainly save communication costs by reducing the amount of data transmitted. Further research can be conducted on topology control algorithms that optimize energy consumption in conjunction with network topology.
    The network heterogeneity and ever-changing nature of edge devices pose great challenges to FL. The heterogeneity of the edge networks determines that the bandwidth resources vary for links in edge networks. Meanwhile, the mobility and density of the edge devices further reduce the actual bandwidth of those links. In the worst case, certain links in the edge networks suffer connectivity loss. The changing link conditions in edge networks demand dynamic and fault-tolerant network topologies for FL to aggregate data. Based on the amount of data for model transmission, models and algorithms are needed to find the most effective topologies to deliver the model reliably on time.

    5.3 Client-Drift

    Due to the large number of edge devices and statistical heterogeneity, a phenomenon known as client drift [47] could occur. Client drift occurs when clients with non-IID data develop local models that drift far from the global optimal model. Some clients can be seen as noisy, since their updates can mislead the global model; such edge devices may become particularly “noisy”, and their local model updates can dominate the global model weights. The problem is even more severe in clusters with a single noisy edge device. If FL leverages the cluster’s local model, the result may be too biased towards the models of the noisy nodes. A solution to mitigate such biased models at the cluster heads or edge servers can be opportunistic routing that intentionally integrates models from edge devices outside a cluster. In our example on the left of Figure 13, the local models learned can contribute to multiple clusters. The slicing of the data reduces the exposure of repeated model transmissions updated on the same dataset to the same edge servers and therefore enhances privacy.
    Fig. 13.
    Fig. 13. Topology control for cluster-based model fairness with noisy edge devices.

    5.4 Ethical/Privacy Concerns

    In this section, we discuss new ethical and privacy challenges posed by different network topology structures. The primary concern with the standard star topology is the communication bottleneck and excessive reliance on a central server. The central server is heavily tasked with safeguarding all client information. The default star topology in FL therefore represents a single point of failure, potentially compromising the privacy of the entire network and raising ethical concerns. Other network structures can address some of the privacy concerns of the traditional topology structure. For the tree topology, the additional communication layers and intermediate servers present both advantages and disadvantages. The tree topology allows the central server to offload some computational tasks and client information to the intermediate servers. However, the presence of the intermediate servers requires greater efforts towards privacy at the edge. Other fully decentralized topologies, such as the mesh, gossip, clique, and grid topologies, do not require a central server. Without an overarching central server, there is no need for direct communication with the server, which would normally pose a significant privacy threat and risk ethical violations. What comes with this, however, is an increased amount of peer-to-peer (P2P) communication, which could introduce new privacy challenges.
    The extended path of model aggregations, “devices -> edge servers -> central server”, could lead to privacy concerns regarding the training data of the edge devices. When an edge device sends its aggregated model upon a global model update request, the model will be broadcast to its neighbors. Repeated rounds of sharing models among nearby neighbors will mix the device’s local model with the sub-local models. Changes in the topology select changing sets of neighbors, further increasing the diversity of local models. This model mingling reduces the vulnerability of devices to differential privacy attacks.
    In conclusion, privacy is always a trade-off. No single topology can meet all the needs of all users. As technology advances, the existing topology will face new challenges. New and niche topologies will present new challenges and opportunities. Various aspects of network topology and their impact on ethical and privacy issues require further research.

    5.5 Availability-Aware FL Assisted by Topology Overlay

    The model aggregation tasks at the edge servers, also known as cluster heads in a hierarchical FL architecture, face availability challenges when the edge servers are down. With the meshed links among the edge servers, learning task replication techniques can be used to maintain the availability levels of the model aggregation tasks. In other words, a trade-off must be made between robustness and redundancy. The problem can be further investigated from the perspectives of resource allocation, scheduling, and clustering. On the right of Figure 14, the clusters can be built logically, instead of following the physical topology of the edge networks, based on the intensity and the distribution of data generation. In the example, there are two physical clusters \(v^a\) and \(v^b\) . The three edge devices in a cluster belong to one of the three separate overlay clusters \(u^a\) , \(u^b\) , and \(u^c\) .
    Fig. 14.
    Fig. 14. An overlay network on top of edge computing clusters.

    5.6 Conduct Real-World Deployment

    In most topology-related FL works, the experiments are conducted in simulated environments, with the exceptions of [22, 31, 50, 100, 126, 127, 135, 162]. Within these works, Zhou et al. [162] only used Alibaba Cloud as the parameter server, and Tran and Pompili [127] simply designed their experiments using realistic model settings from [18]. The experiments in [22], although still simulations, involve a realistic grid network deployment inside a 500m \(\times\) 500m area. Wang et al. [135] used a unique approach that captures Xender’s trace content and requests files from active mobile users. Many FL works to date do not include or consider real-world model deployments in their experiments. However, this type of work can demonstrate the real challenges faced when deploying unique edge topologies in a realistic setting while tackling specific issues such as model deployment time, inference time, communication costs, and so on. For further research, it would be advantageous to conduct experiments with real-world deployment and evaluate topology-aware FL studies in real edge environments. This approach ensures the proposed FL techniques and algorithms can be properly validated, moving beyond proofs-of-concept in simulated environments. For topology-aware FL, much work remains to be done to develop realistic test beds and to perform real-world deployments.

    6 Conclusion

    In this survey, the role of topology-aware federated learning in edge computing is discussed in detail. Various network topologies, including star, tree, decentralized, and hybrid topologies, are summarized and compared to illustrate the substantial impact of topology on the efficiency and effectiveness of federated learning. Different topologies can bring many benefits to the network, but various topology structures will undoubtedly introduce extra complexity. When the simple star topology cannot meet the needs of a growing system and infrastructure, a choice must be made on whether to opt for another topology at the cost of increased communication and complexity. FL architectures must also account for factors such as the central server’s necessity or absence, clients’ diurnal activity patterns, and options for implementing intermediate servers.


    References

    [1]
    Ali Al-Shuwaili and Osvaldo Simeone. 2017. Energy-efficient resource allocation for mobile edge computing-based augmented reality applications. IEEE Wireless Communications Letters 6, 3 (2017), 398–401.
    [2]
    Richard D. Alba. 1973. A graph-theoretic definition of a sociometric clique. Journal of Mathematical Sociology 3, 1 (1973), 113–126.
    [3]
    James Henry Bell, Kallista A. Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. 2020. Secure single-server aggregation with (poly) logarithmic overhead. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 1253–1269.
    [4]
    Aurélien Bellet, Anne-Marie Kermarrec, and Erick Lavoie. 2021. D-Cliques: Compensating nonIIDness in decentralized federated learning with topology. arXiv preprint arXiv:2104.07365 (2021).
    [5]
    Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečnỳ, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. 2019. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems 1 (2019), 374–388.
    [6]
    Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 1175–1191.
    [7]
    Christopher Briggs, Zhong Fan, and Peter Andras. 2020. Federated learning with hierarchical clustering of local updates to improve training on non-IID data. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–9.
    [8]
    Qiming Cao, Xing Zhang, Yushun Zhang, and Yongdong Zhu. 2021. Layered model aggregation based federated learning in mobile edge networks. In 2021 IEEE/CIC International Conference on Communications in China (ICCC). IEEE, 1–6.
    [9]
    Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. 2020. TiFL: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. 125–136.
    [10]
    Zheng Chai, Yujing Chen, Liang Zhao, Yue Cheng, and Huzefa Rangwala. 2020. FedAT: A communication-efficient federated learning method with asynchronous tiers under non-IID data. arXiv.org (2020).
    [11]
    Zachary Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, and Virginia Smith. 2021. On large-cohort training for federated learning. Advances in Neural Information Processing Systems 34 (2021).
    [12]
    Daoyuan Chen, Liuyi Yao, Dawei Gao, Bolin Ding, and Yaliang Li. 2023. Efficient personalized federated learning via sparse model-adaptation. arXiv preprint arXiv:2305.02776 (2023).
    [13]
    Min Chen and Yixue Hao. 2018. Task offloading for mobile edge computing in software defined ultra-dense network. IEEE Journal on Selected Areas in Communications 36, 3 (2018), 587–597.
    [14]
    Zhuo Chen, Wenlu Hu, Junjue Wang, Siyan Zhao, Brandon Amos, Guanhang Wu, Kiryong Ha, Khalid Elgazzar, Padmanabhan Pillai, Roberta Klatzky, Daniel Siewiorek, and Mahadev Satyanarayanan. 2017. An empirical study of latency in an emerging class of edge computing applications for wearable cognitive assistance. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing. 1–14.
    [15]
    Zhikun Chen, Daofeng Li, Jinkang Zhu, and Sihai Zhang. 2021. DACFL: Dynamic average consensus based federated learning in decentralized topology. arXiv preprint arXiv:2111.05505 (2021).
    [16]
    Beongjun Choi, Jy-yong Sohn, Dong-Jun Han, and Jaekyun Moon. 2020. Communication-computation efficient secure aggregation for federated learning. arXiv preprint arXiv:2012.05433 (2020).
    [17]
    Li Chou, Zichang Liu, Zhuang Wang, and Anshumali Shrivastava. 2021. Efficient and less centralized federated learning. arXiv preprint arXiv:2106.06627 (2021).
    [18]
    Xiaoli Chu, David Lopez-Perez, Yang Yang, and Fredrik Gunnarsson. 2013. Heterogeneous Cellular Networks: Theory, Simulation and Deployment. Cambridge University Press.
    [19]
    Peijin Cong, Junlong Zhou, Liying Li, Kun Cao, Tongquan Wei, and Keqin Li. 2020. A survey of hierarchical energy optimization for mobile edge computing: A perspective from end devices to the cloud. ACM Computing Surveys (CSUR) 53, 2 (2020), 1–44.
    [20]
    Yongheng Deng, Feng Lyu, Ju Ren, Yongmin Zhang, Yuezhi Zhou, Yaoxue Zhang, and Yuanyuan Yang. 2021. SHARE: Shaping data distribution at edge for communication-efficient hierarchical federated learning. In 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE, 24–34.
    [21]
    Jie Ding, Eric Tramel, Anit Kumar Sahu, Shuang Wu, Salman Avestimehr, and Tao Zhang. 2022. Federated learning challenges and opportunities: An outlook. In ICASSP 2022. https://www.amazon.science/publications/federated-learning-challenges-and-opportunities-an-outlook
    [22]
    Thinh Quang Dinh, Diep N. Nguyen, Dinh Thai Hoang, Pham Tran Vu, and Eryk Dutkiewicz. 2021. Enabling large-scale federated learning over wireless edge networks. arXiv preprint arXiv:2109.10489 (2021).
    [23]
    Benoit Donnet and Timur Friedman. 2007. Internet topology discovery: A survey. IEEE Communications Surveys & Tutorials 9, 4 (2007), 56–69.
    [24]
    Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan Tan, and Liang Liang. 2020. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Transactions on Parallel and Distributed Systems 32, 1 (2020), 59–71.
    [25]
    Ahmed Roushdy Elkordy, Saurav Prakash, and Salman Avestimehr. 2022. Basil: A fast and Byzantine-resilient approach for decentralized training. IEEE Journal on Selected Areas in Communications 40, 9 (2022), 2694–2716.
    [26]
    Hossein Fereidooni, Samuel Marchal, Markus Miettinen, Azalia Mirhoseini, Helen Möllering, Thien Duc Nguyen, Phillip Rieger, Ahmad-Reza Sadeghi, Thomas Schneider, Hossein Yalame, et al. 2021. SAFELearn: Secure aggregation for private FEderated learning. In 2021 IEEE Security and Privacy Workshops (SPW). IEEE, 56–62.
    [27]
    Anousheh Gholami, Nariman Torkzaban, and John S. Baras. 2022. Trusted decentralized federated learning. In 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC). IEEE, 1–6.
    [28]
    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
    [29]
    Yinghao Guo, Rui Zhao, Shiwei Lai, Lisheng Fan, Xianfu Lei, and George K. Karagiannidis. 2022. Distributed machine learning for multiuser mobile edge computing systems. IEEE Journal of Selected Topics in Signal Processing (2022).
    [30]
    Otkrist Gupta and Ramesh Raskar. 2018. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 116 (2018), 1–8.
    [31]
    Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. 2018. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018).
    [32]
    Chaoyang He, Emir Ceyani, Keshav Balasubramanian, Murali Annavaram, and Salman Avestimehr. 2021. SpreadGNN: Serverless multi-task federated learning for graph neural networks. arXiv preprint arXiv:2106.02743 (2021).
    [33]
    Chaoyang He, Conghui Tan, Hanlin Tang, Shuang Qiu, and Ji Liu. 2019. Central server free federated learning over single-sided trust social networks. arXiv preprint arXiv:1910.04956 (2019).
    [34]
    Ziqi He, Lei Yang, Wanyu Lin, and Weigang Wu. 2022. Improving accuracy and convergence in group-based federated learning on non-IID data. IEEE Transactions on Network Science and Engineering (2022).
    [35]
    István Hegedűs, Árpád Berta, Levente Kocsis, András A. Benczúr, and Márk Jelasity. 2016. Robust decentralized low-rank matrix decomposition. ACM Transactions on Intelligent Systems and Technology (TIST) 7, 4 (2016), 1–24.
    [36]
    István Hegedűs, Gábor Danner, and Márk Jelasity. 2019. Gossip learning as a decentralized alternative to federated learning. In IFIP International Conference on Distributed Applications and Interoperable Systems. Springer, 74–90.
    [37]
    Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas Lane. 2021. Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout. Advances in Neural Information Processing Systems 34 (2021), 12876–12889.
    [38]
    Seyyedali Hosseinalipour, Sheikh Shams Azam, Christopher G. Brinton, Nicolo Michelusi, Vaneet Aggarwal, David J. Love, and Huaiyu Dai. 2020. Multi-stage hybrid federated learning over large-scale D2D-enabled fog networks. arXiv preprint arXiv:2007.09511 (2020).
    [39]
    Seyyedali Hosseinalipour, Christopher G. Brinton, Vaneet Aggarwal, Huaiyu Dai, and Mung Chiang. 2020. From federated to fog learning: Distributed machine learning over heterogeneous wireless networks. IEEE Communications Magazine 58, 12 (2020), 41–47.
    [40]
    Chenghao Hu, Jingyan Jiang, and Zhi Wang. 2019. Decentralized federated learning: A segmented gossip approach. arXiv preprint arXiv:1908.07782 (2019).
    [41]
    Erdong Hu, Yuxin Tang, Anastasios Kyrillidis, and Chris Jermaine. 2023. Federated learning over images: Vertical decompositions and pre-trained backbones are difficult to beat. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19385–19396.
    [42]
    Shanfeng Huang, Shuai Wang, Rui Wang, and Kaibin Huang. 2021. Joint topology and computation resource optimization for federated edge learning. In 2021 IEEE Globecom Workshops (GC Wkshps). IEEE, 1–6.
    [43]
    Congfeng Jiang, Tiantian Fan, Honghao Gao, Weisong Shi, Liangkai Liu, Christophe Cerin, and Jian Wan. 2020. Energy aware edge computing: A survey. Computer Communications 151 (2020), 556–580.
    [44]
    Jingyan Jiang and Liang Hu. 2020. Decentralised federated learning with adaptive partial gradient aggregation. CAAI Transactions on Intelligence Technology 5, 3 (2020), 230–236.
    [45]
    Jingyan Jiang, Liang Hu, Chenghao Hu, Jiate Liu, and Zhi Wang. 2020. BACombo-bandwidth-aware decentralized federated learning. Electronics 9, 3 (2020), 440.
    [46]
    Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrede Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. 2021. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14, 1–2 (2021), 1–210.
    [47]
    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning. PMLR, 5132–5143.
    [48]
    Hanna Kavalionak, Emanuele Carlini, Patrizio Dazzi, Luca Ferrucci, Matteo Mordacchini, and Massimo Coppola. 2021. Impact of network topology on the convergence of decentralized federated learning systems. In 2021 IEEE Symposium on Computers and Communications (ISCC). IEEE, 1–6.
    [49]
    Latif U. Khan, Walid Saad, Zhu Han, Ekram Hossain, and Choong Seon Hong. 2021. Federated learning for internet of things: Recent advances, taxonomy, and open challenges. IEEE Communications Surveys & Tutorials (2021).
    [50]
    Wasiq Khan, Abir Hussain, Bilal Muhammad Khan, and Keeley Crockett. 2023. Outdoor mobility aid for people with visual impairment: Obstacle detection and responsive framework for the scene perception during the outdoor mobility of people with visual impairment. Expert Systems with Applications 228 (2023), 120464.
    [51]
    Fahad Ahmed KhoKhar, Jamal Hussain Shah, Muhammad Attique Khan, Muhammad Sharif, Usman Tariq, and Seifedine Kadry. 2022. A review on federated learning towards image processing. Computers & Electrical Engineering 99 (2022), 107818.
    [52]
    Abbas Kiani and Nirwan Ansari. 2017. Toward hierarchical mobile edge computing: An auction-based profit maximization approach. IEEE Internet of Things Journal 4, 6 (2017), 2082–2091.
    [53]
    Simon Knight, Hung X. Nguyen, Nickolas Falkner, Rhys Bowden, and Matthew Roughan. 2011. The internet topology zoo. IEEE Journal on Selected Areas in Communications 29, 9 (2011), 1765–1775.
    [54]
    Jakub Konečnỳ, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
    [55]
    Nicolas Kourtellis, Kleomenis Katevas, and Diego Perino. 2020. FLaaS: Federated learning as a service. In Proceedings of the 1st Workshop on Distributed Machine Learning. 7–13.
    [56]
    Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
    [57]
    Prabhat Kumar, Govind P. Gupta, and Rakesh Tripathi. 2021. PEFL: Deep privacy-encoding based federated learning framework for smart agriculture. IEEE Micro (2021).
    [58]
    Anusha Lalitha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. 2018. Fully decentralized federated learning. In Third Workshop on Bayesian Deep Learning (NeurIPS).
    [59]
    Jin-woo Lee, Jaehoon Oh, Sungsu Lim, Se-Young Yun, and Jae-Gil Lee. 2020. TornadoAggregate: Accurate and scalable federated learning via the ring-based architecture. arXiv preprint arXiv:2012.03214 (2020).
    [60]
    Mo Li, Zhenjiang Li, and Athanasios V. Vasilakos. 2013. A survey on topology control in wireless sensor networks: Taxonomy, comparative study, and open issues. Proc. IEEE 101, 12 (2013), 2538–2557.
    [61]
    Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. 2021. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Transactions on Knowledge and Data Engineering (2021).
    [62]
    Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. 2021. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning. PMLR, 6357–6368.
    [63]
    Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine 37, 3 (2020), 50–60.
    [64]
    Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2019. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189 (2019).
    [65]
    Yuzheng Li, Chuan Chen, Nan Liu, Huawei Huang, Zibin Zheng, and Qiang Yan. 2020. A blockchain-based decentralized federated learning framework with committee consensus. IEEE Network 35, 1 (2020), 234–241.
    [66]
    Zexi Li, Jiaxun Lu, Shuang Luo, Didi Zhu, Yunfeng Shao, Yinchuan Li, Zhimeng Zhang, and Chao Wu. 2022. Mining latent relationships among clients: Peer-to-peer federated learning with adaptive neighbor matching. arXiv preprint arXiv:2203.12285 (2022).
    [67]
    Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056 (2017).
    [68]
    Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. 2020. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials 22, 3 (2020), 2031–2063.
    [69]
    Frank Po-Chen Lin, Seyyedali Hosseinalipour, Sheikh Shams Azam, Christopher G. Brinton, and Nicolo Michelusi. 2021. Semi-decentralized federated learning with cooperative D2D local model aggregations. IEEE Journal on Selected Areas in Communications (2021).
    [70]
    Fang Liu, Guoming Tang, Youhuizi Li, Zhiping Cai, Xingzhou Zhang, and Tongqing Zhou. 2019. A survey on edge computing systems and tools. Proc. IEEE 107, 8 (2019), 1537–1562.
    [71]
    Lumin Liu, Jun Zhang, S. H. Song, and Khaled B. Letaief. 2020. Client-edge-cloud hierarchical federated learning. In ICC 2020-2020 IEEE International Conference on Communications (ICC). IEEE, 1–6.
    [72]
    Shaoshan Liu, Liangkai Liu, Jie Tang, Bo Yu, Yifan Wang, and Weisong Shi. 2019. Edge computing for autonomous driving: Opportunities and challenges. Proc. IEEE 107, 8 (2019), 1697–1716.
    [73]
    Wei Liu, Li Chen, and Wenyi Zhang. 2021. Decentralized federated learning: Balancing communication and computing costs. arXiv preprint arXiv:2107.12048 (2021).
    [74]
    Yang Liu, Yan Kang, Chaoping Xing, Tianjian Chen, and Qiang Yang. 2020. A secure federated transfer learning framework. IEEE Intelligent Systems 35, 4 (2020), 70–82.
    [75]
    Yang Liu, Yan Kang, Xinwei Zhang, Liping Li, Yong Cheng, Tianjian Chen, Mingyi Hong, and Qiang Yang. 2019. A communication efficient collaborative learning framework for distributed features. arXiv preprint arXiv:1912.11187 (2019).
    [76]
    Yunlong Lu, Xiaohong Huang, Yueyue Dai, Sabita Maharjan, and Yan Zhang. 2019. Blockchain and federated learning for privacy-preserved data sharing in industrial IoT. IEEE Transactions on Industrial Informatics 16, 6 (2019), 4177–4186.
    [77]
    Yunlong Lu, Xiaohong Huang, Yueyue Dai, Sabita Maharjan, and Yan Zhang. 2020. Federated learning for data privacy preservation in vehicular cyber-physical systems. IEEE Network 34, 3 (2020), 50–56.
    [78]
    Siqi Luo, Xu Chen, Qiong Wu, Zhi Zhou, and Shuai Yu. 2020. HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning. IEEE Transactions on Wireless Communications 19, 10 (2020), 6535–6548.
    [79]
    Qianpiao Ma, Yang Xu, Hongli Xu, Zhida Jiang, Liusheng Huang, and He Huang. 2021. FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing. IEEE Journal on Selected Areas in Communications 39, 12 (2021), 3654–3672.
    [80]
    Umer Majeed and Choong Seon Hong. 2019. FLchain: Federated learning via MEC-enabled blockchain network. In 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS). IEEE, 1–4.
    [81]
    Yuyi Mao, Changsheng You, Jun Zhang, Kaibin Huang, and Khaled B. Letaief. 2017. A survey on mobile edge computing: The communication perspective. IEEE Communications Surveys & Tutorials 19, 4 (2017), 2322–2358.
    [82]
    Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal. 2020. Throughput-optimal topology design for cross-silo federated learning. Advances in Neural Information Processing Systems 33 (2020), 19478–19487.
    [83]
    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y. Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics. PMLR, 1273–1282.
    [84]
    Naram Mhaisen, Alaa Awad, Amr Mohamed, Aiman Erbad, and Mohsen Guizani. 2021. Optimal user-edge assignment in hierarchical federated learning based on statistical properties and network topology constraints. IEEE Transactions on Network Science and Engineering (2021).
    [85]
    Jed Mills, Jia Hu, and Geyong Min. 2019. Communication-efficient federated learning for wireless edge intelligence in IoT. IEEE Internet of Things Journal 7, 7 (2019), 5986–5994.
    [86]
    David Moher, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and PRISMA Group. 2010. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. International Journal of Surgery 8, 5 (2010), 336–341.
    [87]
    Alberto Montresor and Márk Jelasity. 2009. PeerSim: A scalable P2P simulator. In Proc. of the 9th Int. Conference on Peer-to-Peer (P2P’09). Seattle, WA, 99–100.
    [88]
    Viraaji Mothukuri, Reza M. Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. 2021. A survey on security and privacy of federated learning. Future Generation Computer Systems 115 (2021), 619–640.
    [89]
    Anh Nguyen, Tuong Do, Minh Tran, Binh X. Nguyen, Chien Duong, Tu Phan, Erman Tjiputra, and Quang D. Tran. 2021. Deep federated learning for autonomous driving. arXiv preprint arXiv:2110.05754 (2021).
    [90]
    Dinh C. Nguyen, Ming Ding, Pubudu N. Pathirana, Aruna Seneviratne, Jun Li, and H. Vincent Poor. 2021. Federated learning for internet of things: A comprehensive survey. IEEE Communications Surveys & Tutorials 23, 3 (2021), 1622–1658.
    [91]
    Dinh C. Nguyen, Quoc-Viet Pham, Pubudu N. Pathirana, Ming Ding, Aruna Seneviratne, Zihuai Lin, Octavia Dobre, and Won-Joo Hwang. 2022. Federated learning for smart healthcare: A survey. ACM Computing Surveys (CSUR) 55, 3 (2022), 1–37.
    [92]
    John Nguyen, Kshitiz Malik, Maziar Sanjabi, and Michael Rabbat. 2022. Where to begin? Exploring the impact of pre-training and initialization in federated learning. arXiv preprint arXiv:2206.15387 (2022).
    [93]
    Wanli Ni, Yuanwei Liu, Yonina C. Eldar, Zhaohui Yang, and Hui Tian. 2022. STAR-RIS integrated nonorthogonal multiple access and over-the-air federated learning: Framework, analysis, and optimization. IEEE Internet of Things Journal 9, 18 (2022), 17136–17156.
    [94]
    Huansheng Ning, Yunfei Li, Feifei Shi, and Laurence T. Yang. 2020. Heterogeneous edge computing open platforms and tools for internet of things. Future Generation Computer Systems 106 (2020), 67–76.
    [95]
    Róbert Ormándi, István Hegedűs, and Márk Jelasity. 2013. Gossip learning with linear models on fully distributed data. Concurrency and Computation: Practice and Experience 25, 4 (2013), 556–571.
    [96]
    Jianli Pan and James McElhannon. 2017. Future edge cloud and edge computing for internet of things applications. IEEE Internet of Things Journal 5, 1 (2017), 439–449.
    [97]
    Bjarne Pfitzner, Nico Steckhan, and Bert Arnrich. 2021. Federated learning in a medical context: A systematic literature review. ACM Transactions on Internet Technology (TOIT) 21, 2 (2021), 1–31.
    [98]
    Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. 2022. Federated learning with partial model personalization. In International Conference on Machine Learning. PMLR, 17716–17758.
    [99]
    Pinyarash Pinyoanuntapong, Prabhu Janakaraj, Pu Wang, Minwoo Lee, and Chen Chen. 2020. FedAir: Towards multi-hop federated learning over-the-air. In 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 1–5.
    [100]
    Shiva Raj Pokhrel and Jinho Choi. 2020. Federated learning with blockchain for autonomous vehicles: Analysis and design challenges. IEEE Transactions on Communications 68, 8 (2020), 4734–4746.
    [101]
    Julien Polge, Jérémy Robert, and Yves Le Traon. 2021. Permissioned blockchain frameworks in the industry: A comparison. ICT Express 7, 2 (2021), 229–233.
    [102]
    Tie Qiu, Jiancheng Chi, Xiaobo Zhou, Zhaolong Ning, Mohammed Atiquzzaman, and Dapeng Oliver Wu. 2020. Edge computing in industrial internet of things: Architecture, advances and challenges. IEEE Communications Surveys & Tutorials 22, 4 (2020), 2462–2488.
    [103]
    Youyang Qu, Longxiang Gao, Tom H. Luan, Yong Xiang, Shui Yu, Bai Li, and Gavin Zheng. 2020. Decentralized privacy using blockchain-enabled federated learning in fog computing. IEEE Internet of Things Journal 7, 6 (2020), 5171–5183.
    [104]
    Zhaonan Qu, Kaixiang Lin, Zhaojian Li, Jiayu Zhou, and Zhengyuan Zhou. 2020. A unified linear speedup analysis of stochastic FedAvg and Nesterov accelerated FedAvg. arXiv e-prints (2020), arXiv–2007.
    [105]
    Rajmohan Rajaraman. 2002. Topology control and routing in ad hoc networks: A survey. ACM SIGACT News 33, 2 (2002), 60–73.
    [106]
    Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. 2020. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021–2031.
    [107]
    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N. Galtier, Bennett A. Landman, Klaus Maier-Hein, Sébastien Ourselin, Micah Sheller, Ronald M. Summers, Andrew Trask, Daguang Xu, Maximilian Baust, and M. Jorge Cardoso. 2020. The future of digital health with federated learning. NPJ Digital Medicine 3, 1 (2020), 1–7.
    [108]
    Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. 2019. BrainTorrent: A peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731 (2019).
    [109]
    Yichen Ruan, Xiaoxi Zhang, Shu-Che Liang, and Carlee Joe-Wong. 2021. Towards flexible device participation in federated learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 3403–3411.
    [110]
    Alessio Sacco, Flavio Esposito, and Guido Marchetto. 2020. A federated learning approach to routing in challenged SDN-enabled edge networks. In 2020 6th IEEE Conference on Network Softwarization (NetSoft). IEEE, 150–154.
    [111]
    Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. 2018. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127 3 (2018), 3.
    [112]
    Mahadev Satyanarayanan. 2017. The emergence of edge computing. Computer 50, 1 (2017), 30–39.
    [113]
    Stefano Savazzi, Monica Nicoli, and Vittorio Rampa. 2020. Federated learning with cooperating devices: A consensus approach for massive IoT networks. IEEE Internet of Things Journal 7, 5 (2020), 4641–4654.
    [114]
    Muhammad Shayan, Clement Fung, Chris J. M. Yoon, and Ivan Beschastnikh. 2020. Biscotti: A blockchain system for private and secure federated learning. IEEE Transactions on Parallel and Distributed Systems 32, 7 (2020), 1513–1525.
    [115]
    Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. 2016. Edge computing: Vision and challenges. IEEE Internet of Things Journal 3, 5 (2016), 637–646.
    [116]
    Weisong Shi and Schahram Dustdar. 2016. The promise of edge computing. Computer 49, 5 (2016), 78–81.
    [117]
    Yi Shi, Yalin E. Sagduyu, and Tugba Erpek. 2022. Federated learning for distributed spectrum sensing in NextG communication networks. arXiv preprint arXiv:2204.03027 (2022).
    [118]
    Yandong Shi, Yong Zhou, and Yuanming Shi. 2021. Over-the-air decentralized federated learning. arXiv preprint arXiv:2106.08011 (2021).
    [119]
    Yushan Siriwardhana, Pawani Porambage, Madhusanka Liyanage, and Mika Ylianttila. 2021. A survey on mobile augmented reality with 5G mobile edge computing: Architectures, applications, and technical aspects. IEEE Communications Surveys & Tutorials 23, 2 (2021), 1160–1192.
    [120]
    Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2017. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489 (2017).
    [121]
    Dimitris Stripelis and José Luis Ambite. 2021. Semi-synchronous federated learning. arXiv preprint arXiv:2102.02849 (2021).
    [122]
    Haijian Sun, Fuhui Zhou, and Rose Qingyang Hu. 2019. Joint offloading and computation energy efficiency maximization in a mobile edge computing system. IEEE Transactions on Vehicular Technology 68, 3 (2019), 3052–3056.
    [123]
    Canh T. Dinh, Nguyen Tran, and Josh Nguyen. 2020. Personalized federated learning with Moreau envelopes. Advances in Neural Information Processing Systems 33 (2020), 21394–21405.
    [124]
    Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. 2022. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems (2022).
    [125]
    Yangyang Tao, Junxiu Zhou, and Shucheng Yu. 2021. Efficient parameter aggregation in federated learning with hybrid convergecast. In 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC). IEEE, 1–6.
    [126]
    Luke K. Topham, Wasiq Khan, Dhiya Al-Jumeily, Atif Waraich, and Abir J. Hussain. 2022. Gait identification using limb joint movement and deep machine learning. IEEE Access 10 (2022), 100113–100127.
    [127]
    Tuyen X. Tran and Dario Pompili. 2018. Joint task offloading and resource allocation for multi-server mobile-edge computing networks. IEEE Transactions on Vehicular Technology 68, 1 (2018), 856–868.
    [128]
    Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and Ramesh Raskar. 2018. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564 (2018).
    [129]
    Paul Voigt and Axel Von dem Bussche. 2017. The EU General Data Protection Regulation (GDPR): A Practical Guide (1st ed.). Springer International Publishing, Cham.
    [130]
    Aidmar Wainakh, Alejandro Sanchez Guinea, Tim Grube, and Max Mühlhäuser. 2020. Enhancing privacy via hierarchical federated learning. In 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 344–347.
    [131]
    Haoxin Wang, Tingting Liu, BaekGyu Kim, Chung-Wei Lin, Shinichi Shiraishi, Jiang Xie, and Zhu Han. 2020. Architectural design alternatives based on cloud/edge/fog computing for connected vehicles. IEEE Communications Surveys & Tutorials 22, 4 (2020), 2349–2377.
    [132]
    Su Wang, Mengyuan Lee, Seyyedali Hosseinalipour, Roberto Morabito, Mung Chiang, and Christopher G. Brinton. 2021. Device sampling for heterogeneous federated learning: Theory, algorithms, and implementation. arXiv preprint arXiv:2101.00787 (2021).
    [133]
    Shangguang Wang, Yali Zhao, Jinliang Xu, Jie Yuan, and Ching-Hsien Hsu. 2019. Edge server placement in mobile edge computing. J. Parallel and Distrib. Comput. 127 (2019), 160–168.
    [134]
    Tian Wang, Yucheng Lu, Jianhuang Wang, Hong-Ning Dai, Xi Zheng, and Weijia Jia. 2021. EIHDP: Edge-intelligent hierarchical dynamic pricing based on cloud-edge-client collaboration for IoT systems. IEEE Trans. Comput. 70, 8 (2021), 1285–1298.
    [135]
    Xiaofei Wang, Yiwen Han, Chenyang Wang, Qiyang Zhao, Xu Chen, and Min Chen. 2019. In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Network 33, 5 (2019), 156–165.
    [136]
    Xiaokang Wang, Laurence T. Yang, Xia Xie, Jirong Jin, and M. Jamal Deen. 2017. A cloud-edge computing framework for cyber-physical-social services. IEEE Communications Magazine 55, 11 (2017), 80–85.
    [137]
    Zhao Wang, Yifan Hu, Jun Xiao, and Chao Wu. 2021. Efficient ring-topology decentralized federated learning with deep generative models for industrial artificial intelligent. arXiv preprint arXiv:2104.08100 (2021).
    [138]
    Zhiyuan Wang, Hongli Xu, Jianchun Liu, Yang Xu, He Huang, and Yangming Zhao. 2022. Accelerating federated learning with cluster construction and hierarchical aggregation. IEEE Transactions on Mobile Computing (2022).
    [139]
    Kang Wei, Jun Li, Chuan Ma, Ming Ding, Sha Wei, Fan Wu, Guihai Chen, and Thilina Ranbaduge. 2022. Vertical federated learning: Challenges, methodologies and experiments. arXiv preprint arXiv:2202.04309 (2022).
    [140]
    Wanli Wen, Zihan Chen, Howard H. Yang, Wenchao Xia, and Tony Q. S. Quek. 2022. Joint scheduling and resource allocation for hierarchical federated edge learning. IEEE Transactions on Wireless Communications (2022).
    [141]
    Cong Xie, Sanmi Koyejo, and Indranil Gupta. 2019. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934 (2019).
    [142]
    Hong Xing, Osvaldo Simeone, and Suzhi Bi. 2020. Decentralized federated learning via SGD over wireless D2D networks. In 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 1–5.
    [143]
    Xiaolong Xu, Qingxiang Liu, Yun Luo, Kai Peng, Xuyun Zhang, Shunmei Meng, and Lianyong Qi. 2019. A computation offloading method over big data for IoT-enabled cloud-edge computing. Future Generation Computer Systems 95 (2019), 522–533.
    [144]
    Guang Yang, Ke Mu, Chunhe Song, Zhijia Yang, and Tierui Gong. 2021. RingFed: Reducing communication costs in federated learning on non-IID data. arXiv preprint arXiv:2107.08873 (2021).
    [145]
    Kai Yang, Tao Jiang, Yuanming Shi, and Zhi Ding. 2020. Federated learning via over-the-air computation. IEEE Transactions on Wireless Communications 19, 3 (2020), 2022–2035.
    [146]
    Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19.
    [147]
    Yunfan Ye, Shen Li, Fang Liu, Yonghao Tang, and Wanting Hu. 2020. EdgeFed: Optimized federated learning based on edge computing. IEEE Access 8 (2020), 209191–209198.
    [148]
    Michal Yemini, Rajarshi Saha, Emre Ozfatura, Deniz Gündüz, and Andrea J. Goldsmith. 2022. Robust federated learning with connectivity failures: A semi-decentralized framework with collaborative relaying. arXiv preprint arXiv:2202.11850 (2022).
    [149]
    Rong Yu and Peichun Li. 2021. Toward resource-efficient federated learning in mobile edge computing. IEEE Network 35, 1 (2021), 148–155.
    [150]
    Wei Yu, Fan Liang, Xiaofei He, William Grant Hatcher, Chao Lu, Jie Lin, and Xinyu Yang. 2017. A survey on the edge computing for the Internet of Things. IEEE Access 6 (2017), 6900–6919.
    [151]
    Jinliang Yuan, Mengwei Xu, Xiao Ma, Ao Zhou, Xuanzhe Liu, and Shangguang Wang. 2020. Hierarchical federated learning through LAN-WAN orchestration. arXiv preprint arXiv:2010.11612 (2020).
    [152]
    Shahryar Zehtabi, Seyyedali Hosseinalipour, and Christopher G. Brinton. 2022. Decentralized event-triggered federated learning with heterogeneous communication thresholds. arXiv preprint arXiv:2204.03726 (2022).
    [153]
    Chong Zhang, Xiao Liu, Xi Zheng, Rui Li, and Huai Liu. 2020. FengHuoLun: A federated learning based edge computing platform for cyber-physical systems. In 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE, 1–4.
    [154]
    Chen Zhang, Yu Xie, Hang Bai, Bin Yu, Weihong Li, and Yuan Gao. 2021. A survey on federated learning. Knowledge-Based Systems 216 (2021), 106775.
    [155]
    Jie Zhang, Xiaohua Qi, and Bo Zhao. 2023. Federated generative learning with foundation models. arXiv preprint arXiv:2306.16064 (2023).
    [156]
    Jing Zhang, Weiwei Xia, Feng Yan, and Lianfeng Shen. 2018. Joint computation offloading and resource allocation optimization in heterogeneous networks with mobile edge computing. IEEE Access 6 (2018), 19324–19337.
    [157]
    Ke Zhang, Yuming Mao, Supeng Leng, Quanxin Zhao, Longjiang Li, Xin Peng, Li Pan, Sabita Maharjan, and Yan Zhang. 2016. Energy-efficient offloading for mobile edge computing in 5G heterogeneous networks. IEEE Access 4 (2016), 5896–5907.
    [158]
    Weishan Zhang, Qinghua Lu, Qiuyu Yu, Zhaotong Li, Yue Liu, Sin Kit Lo, Shiping Chen, Xiwei Xu, and Liming Zhu. 2020. Blockchain-based federated learning for device failure detection in industrial IoT. IEEE Internet of Things Journal 8, 7 (2020), 5926–5937.
    [159]
    Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. 2018. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582 (2018).
    [160]
    Yang Zhao, Jun Zhao, Linshan Jiang, Rui Tan, Dusit Niyato, Zengxiang Li, Lingjuan Lyu, and Yingbo Liu. 2020. Privacy-preserving blockchain-based federated learning for IoT devices. IEEE Internet of Things Journal 8, 3 (2020), 1817–1829.
    [161]
    Zhicong Zhong, Yipeng Zhou, Di Wu, Xu Chen, Min Chen, Chao Li, and Quan Z. Sheng. 2021. P-FedAvg: Parallelizing federated learning with theoretical guarantees. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications. IEEE, 1–10.
    [162]
    Chunyi Zhou, Anmin Fu, Shui Yu, Wei Yang, Huaqun Wang, and Yuqing Zhang. 2020. Privacy-preserving federated learning in fog computing. IEEE Internet of Things Journal 7, 11 (2020), 10782–10793.
    [163]
    Xiaokang Zhou, Wei Liang, Jinhua She, Zheng Yan, and Kevin I-Kai Wang. 2021. Two-layer federated learning with heterogeneous model aggregation for 6G supported internet of vehicles. IEEE Transactions on Vehicular Technology 70, 6 (2021), 5308–5317.
    [164]
    Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, and Mahsa Shoaran. 2023. XTab: Cross-table pretraining for tabular transformers. arXiv preprint arXiv:2305.06090 (2023).
    [165]
    Juncen Zhu, Jiannong Cao, Divya Saxena, Shan Jiang, and Houda Ferradi. 2023. Blockchain-empowered federated learning: Challenges, solutions, and future directions. Comput. Surveys 55, 11 (2023), 1–31.
