
Fair Resource Allocation for Hierarchical Federated Edge Learning in Space-Air-Ground Integrated Networks via Deep Reinforcement Learning with Hybrid Control

Chong Huang, Gaojie Chen, Pei Xiao, Jonathon A. Chambers, and Wei Huang
C. Huang, G. Chen and P. Xiao are with the 5GIC & 6GIC, Institute for Communication Systems (ICS), University of Surrey, Guildford, GU2 7XH, United Kingdom. Email: {chong.huang, gaojie.chen, p.xiao}@surrey.ac.uk. (Corresponding author: G. Chen)
Jonathon A. Chambers is with the School of Engineering, University of Leicester, Leicester, LE1 7RU, United Kingdom. Email: jonathon.chambers@leicester.ac.uk.
W. Huang is with the School of Flexible Electronics (SoFE) & State Key Laboratory of Optoelectronic Materials and Technologies, Sun Yat-sen University, Guangdong, 510220, China. Email: huangw323@mail.sysu.edu.cn.
Abstract

The space-air-ground integrated network (SAGIN) has become a crucial research direction in future wireless communications due to its ubiquitous coverage, rapid and flexible deployment, and multi-layer cooperation capabilities. However, integrating hierarchical federated learning (HFL) with edge computing and SAGINs remains a complex open issue. This paper proposes a novel framework for applying HFL in SAGINs, utilizing aerial platforms and low Earth orbit (LEO) satellites as edge servers and cloud servers, respectively, to provide multi-layer aggregation capabilities for HFL. The proposed system also considers the presence of inter-satellite links (ISLs), enabling satellites to exchange federated learning models with each other. Furthermore, we consider multiple distinct computational tasks that need to be completed within a limited satellite service time. To maximize the convergence performance of all tasks while ensuring fairness, we propose the use of the distributional soft-actor-critic (DSAC) algorithm to optimize resource allocation in the SAGIN and aggregation weights in HFL. Moreover, we address the efficiency issue of hybrid action spaces in deep reinforcement learning (DRL) through a decoupling and recoupling approach, and design a new dynamic adjusting reward function to ensure fairness among multiple tasks in federated learning. Simulation results demonstrate the superiority of our proposed algorithm, consistently outperforming baseline approaches and offering a promising solution for addressing highly complex optimization problems in SAGINs.

Index Terms:
Space-air-ground integrated network, hierarchical federated learning, federated edge learning, deep reinforcement learning, satellite communications, unmanned aerial vehicle.

I Introduction

I-A Background

With the advancement of wireless communications and artificial intelligence (AI) technologies, the demand for intelligent wireless networks has increased rapidly in recent years [1]. Many intelligent applications in wireless networks are trained on local data from wireless nodes, which is stored in a distributed manner across these nodes. Traditional centralized machine learning methods require transferring vast amounts of data to the cloud or data centers for processing; this approach not only consumes a significant amount of wireless bandwidth but also incurs considerable latency [2]. Moreover, data privacy and information security have become major concerns in the current digital era, because many applications involve the processing of sensitive personal information [3]. To address these issues, federated learning has emerged as a popular machine learning framework in wireless communications [4]. Within the federated learning framework, local devices in wireless networks train models using their own local data and share only the model parameters, rather than the training data, to protect data privacy. Furthermore, federated learning significantly reduces the amount of data that needs to be transmitted over wireless links, thus saving communication resources.

On the other hand, satellite networks have attracted much attention in wireless communications due to their capability to provide global coverage and seamless connections [5]. In recent years, satellite network projects such as Starlink [6], OneWeb [7], and O3b [8] have advanced rapidly to provide global communication services, especially connections to remote areas. Thus, satellite communication has become an important component of future wireless communications. Furthermore, considering the stringent latency and data-rate requirements of sixth-generation (6G) communications, the integration of satellite networks with terrestrial communication networks has emerged as a very promising design paradigm [9]. In particular, space-air-ground integrated networks (SAGINs) seamlessly integrate satellites, aerial platforms, and terrestrial communication nodes, where satellites provide ubiquitous communication connections globally, aerial platforms offer rapid and flexible edge access for terrestrial networks, and terrestrial networks provide localized connections as the communication backbone. This integrated space-air-ground design offers unprecedented prospects for future communications.

Moreover, edge computing emerged to meet the growing demand for low-latency and high-bandwidth services by leveraging the computational and storage capabilities of terminal devices and edge servers [10]. In edge computing, computing tasks are transferred from terminals to edge servers to reduce service latency in the wireless network [11]. However, traditional edge computing requires the exchange of local data, which occupies a large amount of communication bandwidth and raises personal data privacy issues [12]. Hence, hierarchical federated learning (HFL) was developed to address these issues by combining federated learning and edge computing [13]. In HFL, models from terminals are aggregated multiple times at edge servers before being aggregated at the cloud server, so as to efficiently utilize edge network resources while protecting data privacy. Furthermore, the hierarchical training process in HFL naturally aligns with the multi-level communication architecture of SAGINs [14]. The global coverage capability and flexible aerial deployment of SAGINs enable more ground users to participate in HFL, significantly increasing the diversity and performance of task training.

I-B Related Work

Federated learning is emerging as a paradigm for providing artificial intelligence services in wireless communications, and numerous studies have investigated its performance within wireless networks [15, 16, 17, 18]. In [15], the Hungarian algorithm was introduced to optimize user selection and resource block allocation to improve the federated learning aggregation performance in wireless networks. A distributed resource allocation method was proposed in [16] to minimize the energy consumption and latency of federated learning in transmission control protocol/internet protocol (TCP/IP) networks. In [17], the delay performance was analyzed to improve the federated learning performance and energy efficiency in wireless vehicular networks. To reduce the convergence time of federated learning in wireless communications, the authors of [18] analyzed the effect of quantization errors and limited wireless resources on federated learning and proposed a model-based reinforcement learning algorithm to adjust the local model quantization and the number of local users in federated learning.

On the other hand, SAGIN has become a pivotal research direction for future wireless communications [19, 20, 21, 22]. In [19], the outage performance of the relaying scenario in SAGINs was analyzed in closed form. To strike a balance between computational and communication resource consumption, a heuristic greedy algorithm was proposed in [20] to design the virtual network functions and data routing in SAGINs. In [21], a queuing game model based method was introduced to enhance the network throughput and reduce the service delay in SAGINs. To minimize the energy consumption and latency of cloud and edge computing in SAGINs, a decision-assisted reinforcement learning algorithm was proposed in [22] to optimize the resource allocation. Because a SAGIN's global coverage and rapid, dynamic deployment enable more flexible participation of ground users in federated learning tasks, federated learning within SAGINs has been extensively researched in recent studies [23, 24]. In [23], a novel topology-aware framework was designed to speed up the convergence of federated learning in SAGINs. To jointly consider the requirements of adaptivity, communication efficiency, and model security in federated learning, a tensor-empowered federated learning framework was proposed to meet these three requirements in SAGINs [24]. However, existing research on federated learning within SAGINs has focused on two-layer federated learning architectures, failing to fully leverage the potential capabilities of the nodes within the SAGIN framework.

However, the conventional two-layer federated learning framework often fails to meet the demands of scenarios with dispersed user locations and unstable communication channels [25]. Moreover, the integration of federated learning with network edges in recent years enables low-latency edge intelligence at the data generation source. Therefore, multi-layer HFL with network edges has emerged as a highly promising research direction in wireless communications [25, 26, 27]. In [25], a semi-decentralized framework was proposed to utilize edge servers to enhance the convergence performance of federated learning. To improve the convergence performance and reduce the round delay of federated learning, a hierarchical aggregation strategy was proposed to optimize edge server scheduling and bandwidth allocation in wireless communications [26]. In [27], a hybrid deep reinforcement learning (DRL) algorithm was utilized to optimize the resource allocation strategy for fast convergence of federated learning in wireless communications. However, two significant challenges remain in the existing works. First, the deployment of HFL that utilizes space, air, and ground nodes within SAGINs as edge and cloud servers has not been investigated. Second, while existing federated learning research has explored personalized multi-task learning with non-independent and identically distributed (non-i.i.d.) data [28], balancing the convergence performance of multiple unrelated tasks, each distributed across multiple users, within constrained timeframes such as the service times of SAGIN environments remains an unaddressed issue.

I-C Motivation and Contributions

In this paper, we explore the integration of HFL into SAGINs. To the best of the authors' knowledge, existing works have not considered utilizing the HFL framework in SAGINs to leverage the capabilities of the various nodes and to balance the convergence performance of different tasks within a limited satellite service time; this paper addresses that gap. Our study also investigates the impact of dynamic deployment and user association within the SAGIN on HFL. To optimize the communication rounds of HFL and balance the final convergence performance of different tasks, dynamic planning of unmanned aerial vehicle (UAV) trajectories, user association, access between satellites and UAVs, the number of local convergence iterations, and the aggregation weights of models is crucial for HFL in SAGINs. However, traditional algorithms struggle to adapt to dynamically changing communication environments. Thus, we propose a hybrid action space DRL-based algorithm to address this challenge. The main contributions of this paper are summarized as follows.

  • We first propose a novel framework for HFL in SAGINs, where UAVs serve as edge servers, LEO satellites serve as cloud servers, and ISLs are used for aggregation across multiple satellites and for transferring aggregated models. To counteract channel fading effects in ground-aerial communications, we introduce trajectory planning for UAVs. Moreover, we consider the time constraints of LEO satellite access to aerial platforms, which is crucial for completing tasks within a limited satellite service time in practical applications and has not been taken into account by most studies.

  • Our objective is to accomplish multiple federated learning tasks within a finite satellite service time while ensuring fairness among different task performances. Thus, we formulate a complex optimization problem in our proposed system, which includes UAV trajectory planning, dynamic adjustment of ground user-UAV pairing (uplink and downlink), UAV-satellite pairing (uplink and downlink), final aggregation selection between satellites, and optimization of weights in edge and cloud aggregation.

  • We introduce the distributional soft-actor-critic (DSAC) algorithm, leveraging its ability to consider long-term returns and mitigate the overestimation issue of Q-values using return distributions. Moreover, due to the large number of coupled discrete and continuous optimization variables in the proposed problem, we propose a decoupling and recoupling algorithm to enhance the DSAC’s performance. Finally, to ensure fairness, we design a dynamic adjusting reward function for the proposed algorithm to mitigate training progress deviations among different tasks.

  • Simulation results demonstrate the superior performance of our proposed algorithm, compared to several benchmarks. Through the design of a hybrid action space, the proposed DSAC algorithm fully learns the proposed system and obtains corresponding solutions. Moreover, the use of the dynamic adjusting reward design ensures fairness among multiple tasks.

The rest of this paper is organized as follows: Section II introduces the system model, including the communication model, satellite coverage model, federated learning framework, and problem formulation. In Section III, a hybrid action space DRL algorithm with a dynamic adjusting reward function is proposed. Section IV verifies the performance of the proposed method for HFL in SAGINs via simulations. Finally, Section V concludes this work.

II System Model and Problem Formulation

II-A System Model

Figure 1: System model of a SAGIN.
TABLE I: Basic Symbol Definitions for SAGINs

Definition | Symbol | Definition | Symbol
Ground User | $G_k$ | All UAVs' Trajectory Set | $\mathcal{Q}$
UAV | $U_m$ | UAV's Trajectory | $q_m$
LEO | $S_n$ | Number of UAVs | $M$
Task | $\mathbb{T}_f$ | Number of LEOs | $N$
User-UAV Cluster | $\wp$ | Number of Users | $K$
Distance | $d$ | Number of Tasks | $F$
Dataset | $D$ | Number of Users | $K$
Bandwidth | $B$ | User-UAV Pairing Indicator | $z^{G}$
Channel Rate | $C$ | Number of UAVs Connected to Satellite $S_n$ | $MS_n$
Rician Factor | $\omega$ | Number of Users in Cluster $\wp_m$ | $K_m$
AWGN | $\sigma$ | UAV-Satellite Pairing Indicator | $z^{U}$
Antenna Gain | $\xi$ | Path Loss Exponent (NLoS) | $\tau_N$
Wavelength | $\lambda$ | Path Loss Exponent (LoS) | $\tau_L$
Antenna Phase | $\varrho_m$ | Edge Aggregation Weight | $\zeta^{f_K}$
Bessel Function | $\delta$ | Cloud Aggregation Weight | $\zeta^{f_M}$
Thermal Noise | $\varphi$ | Final Aggregation Weight | $\zeta^{f_S}$
Boltzmann Constant | $\varsigma$ | Transmit Power of LEOs | $P_S$
Height of LEOs | $R_S$ | Transmit Power of UAVs | $P_U$
Velocity of LEOs | $v_L$ | Transmit Power of Ground Users | $P_G$
Final Aggregation LEO | $S^f$ | User-UAV Pairing (Uplink) | $\Im_u$
Radius of the Earth | $R_E$ | User-UAV Pairing (Downlink) | $\Im_d$
ISL Carrier Frequency | $F_{ISL}$ | Minimum Elevation Angle | $\varpi$
Normalized Gain of LEO | $\eta^{d}$ | Peak Gain of LEO | $\eta_{\max}$

As shown in Fig. 1, we consider a three-layer SAGIN that provides global communication services. In the proposed SAGIN, there are $K$ ground users $G_k$ ($k\in\mathcal{K}=\{1,2,\dots,K\}$), $M$ UAVs $U_m$ ($m\in\mathcal{M}=\{1,2,\dots,M\}$) and $N$ low Earth orbit (LEO) satellites $S_n$ ($n\in\mathcal{N}=\{1,2,\dots,N\}$). UAVs provide ground-to-air connectivity for ground users and can also connect with satellites. Each UAV can dynamically plan its trajectory to provide communication and computation services to ground users within its coverage area, and the 3D coordinates of UAV $U_m$ at a given time slot $t$ are denoted by $q_m(t)=\{x_m(t),y_m(t),z_m(t)\}$, with $q_m\in\mathcal{Q}=\{q_1,q_2,\dots,q_M\}$. However, due to the potentially long distances between ground users, direct communication between them may not be feasible. Similarly, the UAVs serving different ground users might not be able to communicate directly with each other. Furthermore, since the satellites are constantly in high-speed motion, the communication window between UAVs and LEO satellites is limited in time.
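To make the three-layer setup concrete, the following minimal Python sketch shows one possible way to represent the SAGIN topology and the UAV trajectory $q_m(t)$ in a simulation; the class names and the numeric values are illustrative assumptions, not part of the system model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GroundUser:
    k: int                                 # user index in K = {1, ..., K}
    position: Tuple[float, float]          # fixed ground coordinates (x, y)

@dataclass
class UAV:
    m: int                                 # UAV index in M = {1, ..., M}
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)

    def q(self, t: int) -> Tuple[float, float, float]:
        """3D coordinates q_m(t) = (x_m(t), y_m(t), z_m(t)) at time slot t."""
        return self.trajectory[t]

@dataclass
class LEOSatellite:
    n: int                                 # satellite index in N = {1, ..., N}
    altitude: float                        # orbital altitude R_S (m)
    velocity: float                        # orbital velocity v_L (m/s)

# A toy three-layer SAGIN instance: K ground users, M UAVs, N LEO satellites.
users = [GroundUser(k, (100.0 * k, 0.0)) for k in range(4)]
uavs = [UAV(m, trajectory=[(50.0 + m, 50.0, 100.0)]) for m in range(2)]
leos = [LEOSatellite(n, altitude=550e3, velocity=7.6e3) for n in range(3)]
print(uavs[0].q(0))
```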

II-B Communication Model

In this work, we assume that the ground users are divided into $M$ clusters, where cluster $\wp_m$ ($m\in\{1,2,\dots,M\}$) contains $K_m$ users and $\wp=\{\wp_1,\wp_2,\dots,\wp_M\}$ denotes the cluster set. Cluster $\wp_m$ is served by UAV $U_m$, and $z^{G}_{m,k}$ indicates that ground user $G_k$ is assigned to cluster $\wp_m$. Due to the considerable distance between different clusters, direct communication between them is not feasible. We assume that the channels between UAVs and ground users follow Rician fading [29]. Since UAV $U_m$ serves cluster $\wp_m$, which includes user $G_k$, the channel coefficient $h_{m,k}$ between UAV $U_m$ and ground user $G_k$ is given as

h_{m,k}=\sqrt{\frac{\omega}{\omega+1}}\bar{H}_{m,k}+\sqrt{\frac{1}{\omega+1}}\hat{H}_{m,k},   (1)

where $\omega$ indicates the Rician factor, and $\hat{H}_{m,k}=\hat{\mathbb{g}}_{m,k}d_{m,k}^{-\tau_{N}/2}$ and $\bar{H}_{m,k}=\bar{\mathbb{g}}_{m,k}d_{m,k}^{-\tau_{L}/2}$ indicate the non-line-of-sight (NLoS) and line-of-sight (LoS) channel coefficients, respectively. $d_{m,k}$ denotes the distance between UAV $U_m$ and ground user $G_k$, and $\tau_{N}$ and $\tau_{L}$ are the path loss exponents of the NLoS and LoS channels, respectively. In the NLoS channel, the coefficient $\hat{\mathbb{g}}_{m,k}$ follows a zero-mean, unit-variance Gaussian fading distribution. Therefore, in a given cluster $\wp_m$, the transmission rate of the uplink channel from $G_k$ to $U_m$ can be expressed as

C_{m,k}=B_{k,m}\log_{2}\left(1+\frac{P_{G_k}|h_{m,k}|^{2}}{\sum_{i=1,i\neq k}^{K}P_{G_i}|h_{m,i}|^{2}+\sigma_{n}^{2}}\right),   (2)

where $B_{k,m}$ indicates the bandwidth of the channel from $G_k$ to $U_m$, $P_{G_k}$ denotes the transmit power of ground user $G_k$, and $\sigma_{n}^{2}$ denotes the variance of the additive white Gaussian noise (AWGN) at $U_m$. Similarly, the transmission rate of the downlink channel from $U_m$ to $G_k$ can be expressed as

C_{k,m}=B_{m,k}\log_{2}\left(1+\frac{P_{U_m}|h_{m,k}|^{2}}{\sigma_{n}^{2}}\right),   (3)

where $B_{m,k}$ indicates the bandwidth of the channel from $U_m$ to $G_k$, and $P_{U_m}$ denotes the transmit power of UAV $U_m$. Note that in federated learning, the uplink and downlink transmissions occur asynchronously, and the edge server (e.g., $U_m$) broadcasts the aggregated model to all ground users in the downlink; hence, there is no interference in this stage. On the other hand, we assume that a UAV can establish communication with only one LEO satellite during any given time slot, $z^{U}_{n,m}$ indicates that UAV $U_m$ communicates with satellite $S_n$, and $MS_n$ denotes the number of UAVs connected to satellite $S_n$. Thus, the channel coefficient between satellite $S_n$ and UAV $U_m$ can be expressed as [30]

\hat{h}_{n,m}=\frac{\sqrt{\xi_m}\,\lambda}{4\pi d_{n,m}}e^{j\varrho_m},   (4)

where $\xi_m$ indicates the antenna gain, $\lambda$ is the wavelength, $d_{n,m}$ denotes the distance between the satellite and the UAV, and $\varrho_m$ indicates the antenna phase. In this work, we consider the impact of outdated channel state information (CSI) due to the considerable distance between satellites and aerial platforms. The outdated CSI is formulated as [31]

h_{n,m}=\delta\hat{h}_{n,m}+\sqrt{1-\delta^{2}}\,\hat{g}_{n,m},   (5)

where $\delta=\hat{\kappa}(2\pi\hat{D}_{n,m}T_{n,m})$, $\hat{\kappa}$ denotes the zeroth-order Bessel function of the first kind, and $\hat{D}_{n,m}$ and $T_{n,m}$ indicate the maximum Doppler frequency and the transmission delay between the satellite and the UAV, respectively. $\hat{g}_{n,m}$ represents a complex Gaussian random variable with the same variance as $\hat{h}_{n,m}$. Therefore, the transmission rate of the uplink channel from UAV $U_m$ to satellite $S_n$ is given as

C_{m,n}=B_{m,n}\log_{2}\left(1+\frac{P_{U_m}|h_{n,m}|^{2}}{\sum_{i=1,i\neq m}^{M}P_{U_i}|h_{n,i}|^{2}+\sigma_{l}^{2}}\right),   (6)

where $B_{m,n}$ is the bandwidth of the channel from $U_m$ to $S_n$, $\sum_{i=1,i\neq m}^{M}P_{U_i}|h_{n,i}|^{2}$ is the interference from other UAVs, and $\sigma_{l}^{2}$ is the variance of the AWGN at $S_n$. Similarly, the transmission rate of the downlink channel from $S_n$ to $U_m$ can be expressed as

C_{n,m}=B_{n,m}\log_{2}\left(1+\frac{P_{S_n}|h_{n,m}|^{2}}{\sigma_{l}^{2}}\right),   (7)

where $B_{n,m}$ is the bandwidth of the channel from $S_n$ to $U_m$, and $P_{S_n}$ denotes the transmit power of satellite $S_n$ in downlink transmissions. As with the downlink transmission between UAVs and ground users, there is no interference at this stage. Moreover, since inter-satellite links (ISLs) exist between satellites in current satellite networks, we also consider inter-satellite communications via ISLs. The transmission data rate of the ISL channel [32] between two satellites $S_a$ and $S_b$ can be expressed as

C_{a,b}=B_{a,b}\log_{2}\left(1+\frac{P_{S_a}|\eta_{\max}|^{2}}{\varsigma\varphi B_{a,b}\left(\frac{4\pi d_{a,b}F_{ISL}}{c}\right)^{2}}\right),   (8)

where $B_{a,b}$ indicates the bandwidth of the channel from $S_a$ to $S_b$, and $P_{S_a}$ is the transmit power of the transmitter satellite $S_a$. $\eta_{\max}=\max_{J(a,b)}\eta^{d}_{a,b}$ represents the peak gain of the $S_a$ antennas in the direction of their main lobe, $J(a,b)$ denotes the relative direction between $S_a$ and $S_b$, $d_{a,b}$ indicates the distance between $S_a$ and $S_b$, $\eta^{d}_{a,b}$ is the normalized gain of the satellite antennas in the transmission direction, $\varsigma$ denotes the Boltzmann constant, $\varphi$ is the thermal noise, $F_{ISL}$ denotes the ISL carrier frequency, and $c$ indicates the speed of light. It is noteworthy that interference can be neglected by ensuring that the inter-plane ISL antennas use sufficiently narrow beams together with precise beam steering or antenna pointing capabilities [32].
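As a numerical illustration of the link models above, the short Python sketch below draws one realization of the ground-UAV Rician channel in (1), evaluates the uplink rate in (2), and computes an ISL rate following (8); all parameter values (Rician factor, path loss exponents, bandwidths, transmit powers, noise levels) are placeholder assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rician_channel(d, omega=10.0, tau_los=2.0, tau_nlos=3.0):
    """One realization of the ground-UAV channel h_{m,k} in (1) at distance d (m)."""
    g_los = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi))                          # unit-modulus LoS term
    g_nlos = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # zero-mean unit-variance NLoS term
    h_los = g_los * d ** (-tau_los / 2)
    h_nlos = g_nlos * d ** (-tau_nlos / 2)
    return np.sqrt(omega / (omega + 1)) * h_los + np.sqrt(1.0 / (omega + 1)) * h_nlos

def uplink_rate(h_desired, h_interferers, p_tx, bandwidth, noise_var):
    """Uplink rate C_{m,k} in (2), assuming equal transmit power for all users."""
    interference = sum(p_tx * abs(h) ** 2 for h in h_interferers)
    sinr = p_tx * abs(h_desired) ** 2 / (interference + noise_var)
    return bandwidth * np.log2(1 + sinr)

def isl_rate(p_tx, peak_gain, bandwidth, dist, f_isl, noise_temp,
             boltzmann=1.380649e-23, c=3.0e8):
    """ISL rate C_{a,b} in (8); noise_temp plays the role of the thermal noise term."""
    path_loss = (4 * np.pi * dist * f_isl / c) ** 2
    snr = p_tx * peak_gain ** 2 / (boltzmann * noise_temp * bandwidth * path_loss)
    return bandwidth * np.log2(1 + snr)

# Example: three users in one cluster; user 0 is the desired transmitter.
h = [rician_channel(d) for d in (120.0, 150.0, 200.0)]
print("C_{m,k}:", uplink_rate(h[0], h[1:], p_tx=0.1, bandwidth=1e6, noise_var=1e-13), "bit/s")
print("C_{a,b}:", isl_rate(p_tx=10.0, peak_gain=10 ** (30 / 10), bandwidth=50e6,
                           dist=2.0e6, f_isl=26e9, noise_temp=290.0), "bit/s")
```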

II-C Satellite Coverage Model

Considering that LEO satellites are constantly in high-speed motion, each LEO satellite provides only a limited time window for aerial (UAV) access. We consider the satellite coverage geometry illustrated in Fig. 2, where the path length traveled by an LEO satellite during the coverage access time of a specific UAV is given by [33]

Figure 2: The coverage area of an LEO satellite in SAGINs.
R_{C}=2(R_{E}+R_{S})\left(\arccos\left(\frac{R_{E}}{R_{E}+R_{S}}\cos\varpi\right)-\varpi\right),   (9)

where $R_E$ denotes the radius of the Earth, $R_S$ indicates the altitude of the LEO satellite, and $\varpi$ denotes the minimum coverage elevation angle of the LEO satellite. Therefore, the total communication time between the LEO satellite and the UAV is obtained as

T_{c}=\frac{R_{C}}{v_{L}},   (10)

where $v_L$ denotes the orbital velocity of the LEO satellite. In this work, owing to the spacing between LEO satellites, the remaining service times between the individual LEO satellites and the same UAV differ at the beginning. Moreover, since the altitude of a UAV is negligible compared with the altitude of LEO satellites and the radius of the Earth, we disregard the UAV altitude when computing the LEO coverage time window.
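A minimal sketch of (9) and (10), assuming representative LEO parameters (550 km altitude, 7.6 km/s orbital velocity, 10° minimum elevation angle); these values are illustrative and not the simulation settings of this paper.

```python
import math

def coverage_path_length(r_earth, h_leo, min_elev_rad):
    """Arc length R_C in (9) travelled by the LEO satellite while covering a UAV."""
    central_angle = math.acos(r_earth / (r_earth + h_leo) * math.cos(min_elev_rad)) - min_elev_rad
    return 2.0 * (r_earth + h_leo) * central_angle

def coverage_time(r_c, v_leo):
    """Communication window T_c = R_C / v_L in (10)."""
    return r_c / v_leo

R_E = 6371e3                     # radius of the Earth (m)
R_S = 550e3                      # LEO altitude (m), illustrative
v_L = 7.6e3                      # orbital velocity (m/s), illustrative
varpi = math.radians(10.0)       # minimum coverage elevation angle

R_C = coverage_path_length(R_E, R_S, varpi)
print(f"R_C = {R_C / 1e3:.0f} km, T_c = {coverage_time(R_C, v_L):.0f} s")
```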

II-D Federated Learning Model

Figure 3: The structure of the hierarchical federated edge learning in SAGINs.

In the proposed federated learning framework, we assume there are $F$ distinct tasks $\mathbb{T}_f$ ($f\in\{1,2,\dots,F\}$) to be trained in the SAGIN. The local dataset of each ground user consists of $F$ datasets $D_k=\{D^{1}_{k},D^{2}_{k},\dots,D^{F}_{k}\}$, corresponding to the $F$ tasks. Note that a zero-sized dataset indicates that the ground user cannot contribute to the corresponding task; for example, if $D^{F}_{k}$ is a zero-sized dataset, ground user $G_k$ cannot participate in the training of task $\mathbb{T}_F$. Furthermore, we assume that each ground user can train only one iteration of one task per time slot. Therefore, for a specific task $\mathbb{T}_f$, the federated learning training process consists of four stages: local training, edge aggregation, cloud aggregation, and global update.
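As a small illustration of this multi-task data layout, the snippet below stores the $F$ per-task datasets of each user and treats a zero-sized dataset as non-participation in the corresponding task; the variable names and data are hypothetical.

```python
import numpy as np

F = 3   # number of tasks

# D_k = {D_k^1, ..., D_k^F}; an empty array means user k holds no data for that task.
local_datasets = {
    0: [np.arange(100), np.arange(50), np.array([])],   # user 0 can train tasks 1 and 2 only
    1: [np.array([]), np.arange(80), np.arange(120)],   # user 1 can train tasks 2 and 3 only
}

def eligible_users(task_index):
    """Users able to contribute to task T_f: a zero-sized dataset means no contribution."""
    return [k for k, datasets in local_datasets.items() if datasets[task_index].size > 0]

print(eligible_users(0))   # -> [0]
```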

II-D1 Local Training and Edge Aggregation

In the local training phase, user $G_k$ in cluster $\wp_m$ selects task $\mathbb{T}_f$ for training and updates its corresponding local model. Subsequently, user $G_k$ uploads the updated local model with parameters $\mu^{f}_{k}$ to the corresponding UAV $U_m$. Since multiple tasks coexist, the local model currently being trained and the local model being uploaded may not belong to the same task; hence, the uploaded local model may have experienced multiple iterations of local training. We assume that the model size of task $\mathbb{T}_f$ is $DM_f$. However, due to the dynamic nature of wireless channels and the dynamic trajectory planning of UAVs, the uplink rates between ground users and UAVs are not constant in SAGINs, which leads to time-varying upload durations. Therefore, UAV trajectory planning is crucial in this phase, as it can mitigate the fluctuations in upload rate caused by channel variations. Subsequently, the edge server (UAV) $U_m$ performs edge aggregation after receiving the local models from the ground users within its served area. The aggregation is given by

\mu^{f}_{m}=\frac{1}{K_{m}}\sum_{i=1}^{K_{m}}\zeta^{f_{K}}_{i}\mu^{f}_{i},   (11)

where $K_m$ denotes the number of ground users in the cluster, and $\zeta^{f_{K}}_{i}$ denotes the aggregation weight of the local model from ground user $G_i$. Owing to the dynamic environment and the different locations of ground users, the uplink transmission rates of ground users differ, and thus the time required to upload models differs across ground users. Furthermore, due to the presence of multiple tasks in federated learning, the number of training iterations experienced by the models uploaded by the ground users during each edge aggregation stage may vary. Moreover, considering the non-i.i.d. distribution of local datasets for a specific task, the ground users' local datasets have different impacts on the aggregated model. Therefore, the weights of the local models in the aggregation are crucial to the result, and traditional federated learning aggregation algorithms, such as FedAvg's weighted average aggregation, cannot fully optimize the model in SAGINs.
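The edge aggregation in (11) is a weighted average of the local models received by a UAV. A minimal sketch, assuming the models are flat NumPy parameter vectors and the weights $\zeta^{f_{K}}_{i}$ are given (in this paper they are produced by the DRL agent):

```python
import numpy as np

def edge_aggregate(local_models, weights):
    """Edge aggregation (11): mu_m^f = (1/K_m) * sum_i zeta_i^{f_K} * mu_i^f."""
    k_m = len(local_models)
    return sum(w * mu for w, mu in zip(weights, local_models)) / k_m

# Three users in cluster m, each holding a 4-parameter local model of task f.
local_models = [np.array([0.10, 0.20, 0.30, 0.40]),
                np.array([0.00, 0.10, 0.20, 0.30]),
                np.array([0.20, 0.30, 0.40, 0.50])]
weights = [1.2, 0.9, 0.9]   # zeta_i^{f_K}, e.g. chosen by the DRL agent
print(edge_aggregate(local_models, weights))
```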

II-D2 Cloud Aggregation

In this work, the LEO satellites are assumed to act as the cloud servers for HFL in SAGINs. Each UAV can select one satellite within its service time for uploading. Each satellite receiving edge models transmitted by UAVs performs cloud aggregation and then transmits the result to a specific LEO satellite for final aggregation. The cloud model at LEO satellite $S_i$ can be expressed as

\mu^{f}_{i}=\frac{1}{M_{i}}\sum_{m=1}^{M_{i}}\zeta^{f_{M}}_{m}\mu^{f}_{m},   (12)

where $M_i$ denotes the number of UAVs that upload edge models to LEO satellite $S_i$, and $\zeta^{f_{M}}_{m}$ denotes the aggregation weight of the edge model from UAV $U_m$. Thus, the final aggregation at LEO satellite $S_n$ can be expressed as

\mu^{f}_{n}=\frac{1}{M_{n}+N_{n}}\left(\sum_{m=1}^{M_{n}}\zeta^{f_{S}}_{m}\mu^{f}_{m}+\sum_{j=1}^{N_{n}}\zeta^{f_{S}}_{j}\mu^{f}_{j}\right),   (13)

where $\mu^{f}_{n}$ denotes the global model, $M_n$ denotes the number of UAVs that upload edge models directly to LEO satellite $S_n$, $N_n$ denotes the number of other LEO satellites that receive edge models from other UAVs, and $\zeta^{f_{S}}_{j}$ denotes the aggregation weight of the cloud model from satellite $S_j$. Similar to the edge aggregation phase, the transmission rate between the air and space layers also varies due to the high-speed motion of satellites. Moreover, the model upload between UAVs and LEO satellites faces challenges similar to those of the previous phase, such as the non-i.i.d. data distribution of ground users, different numbers of training iterations among ground users, and different numbers of aggregation iterations among edge servers (UAVs).
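Correspondingly, (12) and (13) can be sketched as two further weighted averages over edge models and cloud models; again the weights are assumed to be given, and the example values are arbitrary.

```python
import numpy as np

def cloud_aggregate(edge_models, weights):
    """Cloud aggregation (12) at satellite S_i over the M_i received edge models."""
    return sum(w * mu for w, mu in zip(weights, edge_models)) / len(edge_models)

def final_aggregate(edge_models, w_edge, cloud_models, w_cloud):
    """Final aggregation (13) at S_n over M_n edge models and N_n cloud models."""
    total = sum(w * mu for w, mu in zip(w_edge, edge_models)) \
          + sum(w * mu for w, mu in zip(w_cloud, cloud_models))
    return total / (len(edge_models) + len(cloud_models))

# Two edge models uploaded directly to S_n and one cloud model relayed via ISL.
edge = [np.array([0.15, 0.20]), np.array([0.05, 0.10])]
cloud = [cloud_aggregate([np.array([0.12, 0.18])], [1.0])]
print(final_aggregate(edge, [1.0, 1.1], cloud, [0.9]))
```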

II-D3 Global Update

After an LEO satellite has aggregated all the models into the global model for a given task, it needs to broadcast the global model back to the ground users. Considering the service time constraints of LEO satellites, the LEO satellite holding the global model may not be able to access all UAVs directly. In such cases, the global model may need to be transmitted to other satellites via ISLs and then relayed to the corresponding UAVs. Subsequently, the UAVs transmit the global models to the ground users within their coverage areas. After updating their local models with the received global model, the ground users repeat the above steps until final convergence.
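Putting the stages together, one HFL round for a single task could be organized as in the following high-level sketch, where local training is stubbed out and all scheduling decisions (pairings, trajectories, aggregation weights) are abstracted away; it illustrates only the control flow of local training, edge aggregation (11), cloud aggregation (12), final aggregation (13), and the global broadcast.

```python
import numpy as np

def local_train(model, dataset):
    """One local training iteration (stubbed as a small gradient-like step)."""
    return model - 0.01 * np.sign(model - dataset.mean())

def weighted_avg(models, weights):
    """Weighted average used by (11)-(13)."""
    return sum(w * m for w, m in zip(weights, models)) / len(models)

# One HFL round for one task: clusters -> UAVs (edge) -> satellite (cloud/final) -> broadcast.
global_model = np.zeros(4)
clusters = {0: [np.ones(4), 2 * np.ones(4)],      # datasets of users served by UAV 0
            1: [3 * np.ones(4)]}                  # datasets of users served by UAV 1

edge_models = []
for m, datasets in clusters.items():
    local_models = [local_train(global_model.copy(), d) for d in datasets]     # local training
    edge_models.append(weighted_avg(local_models, [1.0] * len(local_models)))  # edge aggregation (11)

cloud_model = weighted_avg(edge_models, [1.0] * len(edge_models))  # cloud aggregation (12)
global_model = cloud_model       # final aggregation (13) collapses to this with a single satellite
print(global_model)              # broadcast back to ground users via UAVs (global update)
```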

II-E Problem Formulation

Our objective is to maximize the average performance of all tasks (e.g., minimizing the loss in classification tasks) while ensuring fairness among different tasks, i.e., reducing the discrepancy in performance across tasks. In the proposed SAGIN, each ground user is allocated to a cluster and may participate in one, several, or even all of the tasks, while the service time of satellites is limited. Moreover, the dynamic nature of the wireless communication environment, such as channel fading, affects the speed of model uploads and downloads. Therefore, it is necessary to plan the UAV trajectories to counteract the effects of channel fading and to optimize the clustering between ground users and UAVs. This strategy aims to enhance the communication efficiency between the ground layer and the aerial layer in the proposed HFL system.

Furthermore, based on the progress of task training, UAVs can select whether all users within their cluster area participate in ground-aerial communications (fewer users reduce interference). For example, if considerable progress has been made in training one task, UAVs can prioritize communication with users associated with other tasks. Similarly, satellites can choose whether to access UAVs within their coverage area based on the progress of task training, considering that the data distribution of the user groups served by different UAVs is uneven.

Moreover, the ISLs between satellites create a collaborative network for SAGINs. However, the selection of the LEO satellite to which UAVs upload models and the selection of the final aggregation node among satellites also affect the total number of training rounds. In addition, within the limited satellite service time, multiple rounds of global aggregation need to be performed for each task, inevitably resulting in some satellites being unable to cover UAVs in the later stages. Therefore, selecting the final aggregation satellite node and utilizing ISLs to reduce communication time and improve aggregation efficiency are crucial for HFL in SAGINs.

For the downlink transmission in the global update stage, a pairing issue between LEO satellites and UAVs also arises due to the variations in service time and rate caused by the high mobility of satellites. Furthermore, the trajectories of the UAVs affect the efficiency of broadcasting the global model to the ground users within their service clusters. Therefore, both the pairing decisions and the UAV trajectories are crucial for the communication efficiency of the broadcasting stage.

Therefore, to fairly maximize the performance of all tasks within the limited service time of the LEO satellites, we need to optimize: 1) the trajectory of each UAV; 2) the pairing between UAVs and users (uplink); 3) the pairing between UAVs and satellites (uplink); 4) the selection of the final aggregation satellite; 5) the pairing between satellites and UAVs (downlink); 6) the weights for edge aggregation on UAVs; 7) the weights for cloud aggregation on satellites; and 8) the weights for final aggregation. We formulate the optimization problem as

$$
\begin{aligned}
{\rm(P1)}:~&\min_{\mathcal{Q}(t),\wp(t),\Im_{u}(t),S^{f}(t),\Im_{d}(t),\zeta^{f_{K}}(t),\zeta^{f_{M}}(t),\zeta^{f_{S}}(t)}~\sum_{f=1}^{F}\sum_{t=1}^{T_{f}}\sum_{n=1}^{N}\sum_{m=1}^{M}{\mathcal{L}}_{k,f}(\psi_{k,f},D_{k,f}), &&(14)\\
{\rm s.t.}~&\sum_{m=1}^{M}K_{m}=K, &&(14{\rm a})\\
&\sum_{m=1}^{M}z^{G}_{m,k}=1,~\forall k\in{\mathcal{K}}, &&(14{\rm b})\\
&\sum_{n=1}^{N}z^{U}_{n,m}=1,~\forall m\in{\mathcal{M}}, &&(14{\rm c})\\
&\sum_{n=1}^{N}MS_{n}=M, &&(14{\rm d})\\
&\sum_{i=1}^{K_{m}}\zeta^{f_{K}}_{i}=1,~\forall\wp_{m}\in\wp, &&(14{\rm e})\\
&\sum_{i=1}^{MS_{n}}\zeta^{f_{M}}_{i}=1,~\forall n\in{\mathcal{N}}, &&(14{\rm f})\\
&\sum_{n=1}^{N}\zeta^{f_{S}}_{n}=1, &&(14{\rm g})\\
&v_{m}\leq v_{\max}, &&(14{\rm h})\\
&z_{\min}\leq z_{m}\leq z_{\max}, &&(14{\rm i})
\end{aligned}
$$

where $T_{f}$ denotes the number of training rounds of the $f$-th federated learning task, $\Im_{u}(t)=\{\Im^{u}_{1}(t),\Im^{u}_{2}(t),\ldots,\Im^{u}_{M}(t)\}$ denotes the pairing between UAVs and satellites for uplink transmissions, $S^{f}$ denotes the final aggregation satellite, and $\Im_{d}(t)=\{\Im^{d}_{1}(t),\Im^{d}_{2}(t),\ldots,\Im^{d}_{M}(t)\}$ denotes the pairing between UAVs and satellites for downlink transmissions. $\zeta^{f_{K}}$, $\zeta^{f_{M}}$ and $\zeta^{f_{S}}$ denote the aggregation weights for edge aggregation, cloud aggregation and final aggregation, respectively. ${\mathcal{L}}_{k,f}(\psi_{k,f},D_{k,f})$ denotes the local loss of the $f$-th task for ground user $G_{k}$ with local model parameter $\psi_{k,f}$ and local dataset $D_{k,f}$. Constraint (14a) shows that all ground users are divided into $M$ clusters, and (14b) indicates that a ground user is assigned to only one cluster. Constraints (14c) and (14d) denote that each UAV accesses exactly one satellite. Constraints (14e), (14f) and (14g) give the normalization rules for the aggregation weights. Constraint (14h) indicates the speed limitation of the UAVs, and (14i) presents their altitude constraint.
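To make the constraint structure concrete, the following sketch checks whether a candidate clustering, pairing and weight assignment satisfies (14a)-(14g). It is an illustrative check only: the array shapes and variable names are our assumptions, not part of the system model.

```python
import numpy as np

def satisfies_constraints(z_G, z_U, zeta_K, zeta_M, zeta_S, tol=1e-6):
    """Check the pairing and aggregation-weight constraints (14a)-(14g).

    z_G:    (M, K) binary matrix, z_G[m, k] = 1 if ground user k is in cluster m.
    z_U:    (N, M) binary matrix, z_U[n, m] = 1 if UAV m uploads to satellite n.
    zeta_K: list of M arrays, edge-aggregation weights of the users in each cluster.
    zeta_M: list of N arrays, cloud-aggregation weights of the UAVs served by each LEO.
    zeta_S: (N,) array, final-aggregation weights over the satellites.
    """
    ok = True
    # (14a)/(14b): every ground user belongs to exactly one cluster.
    ok &= bool(np.all(z_G.sum(axis=0) == 1))
    # (14c)/(14d): every UAV uploads its edge model to exactly one satellite.
    ok &= bool(np.all(z_U.sum(axis=0) == 1))
    # (14e): edge-aggregation weights of each cluster sum to one.
    ok &= all(abs(w.sum() - 1.0) < tol for w in zeta_K)
    # (14f): cloud-aggregation weights of each serving satellite sum to one.
    ok &= all(abs(w.sum() - 1.0) < tol for w in zeta_M if len(w) > 0)
    # (14g): final-aggregation weights over satellites sum to one.
    ok &= abs(zeta_S.sum() - 1.0) < tol
    return ok
```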

The above optimization problem includes eight variables, comprising both continuous and discrete variables, and some of the optimization variables are interdependent (e.g., the UAV trajectories and the UAV-ground user pairing). Thus, it is clearly a mixed-integer nonlinear optimization problem and is highly complex. Considering that our proposed system involves multi-round hierarchical aggregation in federated learning, it also represents a long-term planning problem. Traditional algorithms would not be efficient in addressing this problem, and real-time computational complexity is a critical concern in dynamic wireless communication scenarios. To tackle this issue, we employ a DRL algorithm, which is well-suited to this kind of optimization problem and exhibits low computational complexity after deployment. Moreover, we explore a hybrid action space framework to enhance the DRL performance and consider the fairness among different tasks in the reward design.

III Hybrid Action Space DRL Algorithm with Fairness Reward Function

The proposed problem in this work is a long-term optimization problem which can be effectively addressed by using DRL. However, the presence of both continuous and discrete optimization variables makes DRL optimization difficult to converge. Some existing approaches tried to tackle this issue by discretizing all optimization variables or by relaxing them all into continuous ones. However, these methods sacrifice a portion of the optimization space, reducing the system's controllability, and may lead to difficulties in exploring the resulting huge action space [34]. Therefore, we employ a decoupling approach to separate the hybrid optimization variables and assign them to different agents within DRL for learning. Subsequently, to efficiently combine the optimization results from different agents, we utilize a maximum a posteriori policy-based algorithm to recouple the discrete and continuous variables after optimization, providing an effective solution to the proposed optimization problem. Furthermore, considering the fairness among different tasks, we design a fair and effective reward scheme to ensure that each task achieves satisfactory training results.

III-1 MDP Design in DRL

We utilize a DRL algorithm named DSAC [35] to address the optimization problem. Firstly, we need to model the proposed system as a Markov decision process (MDP). The MDP encompasses states, actions, and rewards. The state $s(t)$ at time slot $t$ in this paper is given by

$$
\begin{aligned}
s(t)=~&\{\{h_{k,m}(t)\}_{k\in\mathcal{K},m\in\mathcal{M}},\{q_{m}(t)\}_{q_{m}\in\mathcal{Q}},\\
&~\{T_{c(m,n)}(t)\}_{m\in\mathcal{M},n\in\mathcal{N}},\{Acc_{i}(t)\}_{i\in\{\mathcal{M},\mathcal{N}\}}\},
\end{aligned}\qquad(15)
$$

where $\{h_{k,m}(t)\}_{k\in\mathcal{K},m\in\mathcal{M}}$ indicates the channel coefficients of the ground-aerial links, $\{q_{m}(t)\}_{q_{m}\in\mathcal{Q}}$ denotes the UAV trajectories, $\{T_{c(m,n)}(t)\}_{m\in\mathcal{M},n\in\mathcal{N}}$ is the remaining service time of the space-aerial links, and $\{Acc_{i}(t)\}_{i\in\{\mathcal{M},\mathcal{N}\}}$ denotes the test performance at each aggregation node (UAVs and LEO satellites). Furthermore, the action $a(t)$ at time slot $t$ is expressed as

$$a(t)=\{\mathcal{Q}(t),\wp(t),\Im_{u}(t),S^{f}(t),\Im_{d}(t),\zeta^{f_{K}}(t),\zeta^{f_{M}}(t),\zeta^{f_{S}}(t)\},\qquad(16)$$

where $\mathcal{Q}(t)$ denotes the UAV trajectory planning, $\wp(t)$ is the pairing between ground users and UAVs in uplink transmissions, $\Im_{u}(t)$ denotes the pairing between UAVs and LEO satellites in uplink transmissions, $S^{f}(t)$ denotes the selection of the final aggregation node, and $\Im_{d}(t)$ indicates the pairing between LEO satellites and UAVs in downlink transmissions. $\zeta^{f_{K}}(t)$, $\zeta^{f_{M}}(t)$ and $\zeta^{f_{S}}(t)$ present the model weights in the edge, cloud and final aggregations, respectively.
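For clarity, a minimal sketch of how the state in (15) and the hybrid action in (16) could be assembled at each time slot is given below; the array shapes and helper names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def build_state(h, q, T_c, acc):
    """Flatten the MDP state (15) into a single vector.

    h:    (K, M) channel coefficients of the ground-aerial links.
    q:    (M, 3) UAV positions (trajectory points).
    T_c:  (M, N) remaining service time of the space-aerial links.
    acc:  (M + N,) test performance at each aggregation node (UAVs and LEOs).
    """
    return np.concatenate([h.ravel(), q.ravel(), T_c.ravel(), acc.ravel()])

def group_action(a_discrete, a_continuous):
    """Group the hybrid action (16) into its two natural parts:
    discrete decisions (user-UAV pairing, UAV-LEO pairing, final aggregation
    satellite, LEO-UAV pairing) and continuous decisions (UAV trajectory
    commands and the edge/cloud/final aggregation weights)."""
    return {"discrete": a_discrete, "continuous": a_continuous}
```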

Moreover, we need to design a reward function for the MDP to provide feedback to the agents of the DRL algorithm. Since our objective is to enhance the average performance of all tasks while ensuring fairness, the reward design needs to consider fairness in task progress to guarantee that all tasks converge to satisfactory performance before the satellite service time ends. To this end, we design the reward function for $F$ tasks as:

$$r(t)=\frac{\alpha}{F}\sum_{f=1}^{F}\frac{\gamma_{f}(t)/\epsilon_{c1}}{\epsilon_{f}}+\frac{(1-\alpha)}{F}\sum_{f=1}^{F}\frac{\gamma_{f}(t)/\epsilon_{c2}}{\epsilon_{f}+\big|\frac{1}{F}\sum_{j=1}^{F}\gamma_{j}(t)/\gamma_{f}(t)-1\big|},\qquad(17)$$

where $\alpha$ denotes the time decay factor used to balance the inevitable initial imbalance in task progress against the later requirement for overall balance; we define $\alpha=\beta^{t}$, where $\beta$ is the decay factor, so that $\alpha$ equals 1 at the beginning. $\gamma_{f}(t)$ denotes the performance of task $\mathbb{T}_{f}$ at time slot $t$, $\epsilon_{c1}$ and $\epsilon_{c2}$ are constants used to normalize the rewards for the tasks, and $\epsilon_{f}$ is a constant introduced to prevent a zero-valued denominator when the reward bias is zero.

The designed reward function can be divided into two parts. At the beginning of the federated learning training process, it is not feasible to achieve balance among all tasks when the proposed system starts operating. Therefore, the focus is initially on optimizing the performance of all tasks. As time progresses, the emphasis gradually shifts towards the second part of the reward function. The designed reward bias ensures that tasks deviating too much from the mean performance incur penalties, while those approaching the mean contribute significantly to the reward. Thus, the designed reward function dynamically adjusts the reward values based on task performance and runtime, ensuring fairness among tasks while optimizing the performance of all tasks.
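As an illustration, the dynamic fairness reward (17) can be computed as in the following sketch, using the decay factor and normalizing constants given later in the simulation section ($\beta=0.995$, $\epsilon_{c1}=200$, $\epsilon_{c2}=100$, $\epsilon_{f}=0.01$); it assumes strictly positive task performances $\gamma_{f}(t)$.

```python
import numpy as np

def fairness_reward(gamma, t, beta=0.995, eps_c1=200.0, eps_c2=100.0, eps_f=0.01):
    """Dynamic fairness reward of (17) for F concurrent tasks.

    gamma: (F,) current performance of each task at time slot t (assumed > 0).
    beta:  decay factor, so that alpha = beta**t starts at 1 and decays over time.
    """
    gamma = np.asarray(gamma, dtype=float)
    F = gamma.size
    alpha = beta ** t
    # First part: average normalized performance over all tasks.
    part1 = (alpha / F) * np.sum((gamma / eps_c1) / eps_f)
    # Second part: tasks far from the mean performance are penalized by the bias term.
    bias = np.abs(np.mean(gamma) / gamma - 1.0)
    part2 = ((1.0 - alpha) / F) * np.sum((gamma / eps_c2) / (eps_f + bias))
    return part1 + part2
```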

III-2 DSAC-Based Optimization

Existing research in reinforcement learning has started to shift towards distributional algorithms because distributional DRL can estimate the entire distribution of the total return rather than only its expectation. This is highly beneficial for the convergence of long-term rewards. However, the relationship between learning the return distribution and predicting values has not been discussed in traditional distributional DRL. Therefore, we introduce the DSAC algorithm, which reduces the impact of overestimation by learning the distribution function of state-action returns. Firstly, the algorithm introduces the distributional idea into SAC; the return for state-action pairs in SAC can be represented as

$$\theta^{\pi}(s(t),a(t))=r(t)+\vartheta\,\iota(t+1),\qquad(18)$$

where $\pi$ denotes the policy in DRL, $\vartheta$ is the discount factor, and $\iota(t)=\sum_{j=t}^{\infty}\vartheta^{(j-t)}\big[r(j)-\nu\log\pi(a(j)|s(j))\big]$, where $\nu$ is an importance factor related to the entropy [35]. The Q-value in DSAC can be expressed as

$$Q^{\pi}(s(t),a(t))=\mathbb{E}\big[\theta^{\pi}(s(t),a(t))\big],\qquad(19)$$

where $\mathbb{E}[\cdot]$ indicates the expectation. Instead of the expected state-action return in (19), we can directly use the soft return in (18) to construct the algorithm. Considering the maximum entropy in this case, we can define the distributional Bellman operator as

$$\upsilon\theta^{\pi}(s(t),a(t))=r(t)+\vartheta\big(\theta^{\pi}(s(t+1),a(t+1))-\nu\log\pi(a(t+1)|s(t+1))\big).\qquad(20)$$

Then, we can update the soft return distribution based on (20) as

$$\hat{\theta}_{new}=\operatorname*{argmin}\,\mathbb{E}\big[d_{D}\big(\upsilon\hat{\theta}_{old}(\cdot|s(t),a(t)),\hat{\theta}(\cdot|s(t),a(t))\big)\big],\qquad(21)$$

where $d_{D}$ is the distance between the new and old distributions, which can be measured by the Kullback-Leibler (KL) divergence [36]. To update the soft return policy, we form the update function as

$$\Theta_{\hat{\theta}}(\chi)=-\mathop{\mathbb{E}}\limits_{\aleph_{\Theta_{\hat{\theta}}(\chi)}}\Big[\log\digamma\big(\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t))\,\big|\,\hat{\theta}(\cdot|s(t),a(t))\big)\Big],\qquad(22)$$

where $\chi$ and $\phi$ are the parameters of the distributional value function $\hat{\theta}(\cdot|s(t),a(t))$ and the agent's behavior policy $\pi_{\phi}(\cdot|s)$, respectively, and $\aleph_{\Theta_{\hat{\theta}}(\chi)}$ denotes the sampling process $(s(t),a(t),r(t),s(t+1))\sim\mathbb{B}$, $a(t+1)\sim\pi_{\phi^{\prime}}$, $\theta(s(t+1),a(t+1))\sim\hat{\theta}_{\chi^{\prime}}(\cdot|s(t+1),a(t+1))$. $\digamma$ is the probability distribution between states and actions, $\mathbb{B}$ denotes the experience buffer in DRL, and $\chi^{\prime}$ and $\phi^{\prime}$ denote the parameters of the target return distribution and the target behavior policy, respectively. Then we can update $\chi$ by using the following equation

$$\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)=-\mathop{\mathbb{E}}\limits_{\aleph_{\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)}}\Big[\nabla_{\chi}\log\digamma\big(\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t))\,\big|\,\hat{\theta}(\cdot|s(t),a(t))\big)\Big],\qquad(23)$$

where $\aleph_{\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)}$ denotes the sampling process $(s(t),a(t),r(t),s(t+1))\sim\mathbb{B}$, $a(t+1)\sim\pi_{\phi^{\prime}}$, $\theta(s(t+1),a(t+1))\sim\hat{\theta}_{\chi^{\prime}}$. We assume that $\hat{\theta}_{\chi}$ follows a Gaussian distribution; thus we can obtain

$$\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)=\mathop{\mathbb{E}}\limits_{\aleph_{\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)}}\Big[-\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial Q_{\chi}(s(t),a(t))}\nabla_{\chi}Q_{\chi}(s(t),a(t))-\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(s(t),a(t))}\nabla_{\chi}\rho_{\chi}(s(t),a(t))\Big],\qquad(24)$$

where

$$\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial Q_{\chi}(s(t),a(t))}=\frac{\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t))-Q_{\chi}(s(t),a(t))}{\rho_{\chi}(s(t),a(t))^{2}},\qquad(25)$$
$$\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(s(t),a(t))}=\frac{\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t))-Q_{\chi}(s(t),a(t))}{\rho_{\chi}(s(t),a(t))^{3}}-\frac{1}{\rho_{\chi}(s(t),a(t))},\qquad(26)$$

where $\rho_{\chi}(s(t),a(t))$ is the standard deviation of the Gaussian distribution of $\theta(\cdot|s(t),a(t))$. From (26), it can be seen that the Q-value can easily be overestimated during updates. Moreover, when the standard deviation of the Gaussian distribution tends towards zero or infinity, gradient computation issues arise. Therefore, we use the following equation to constrain the standard deviation

$$\rho_{\chi}(s(t),a(t))=\max\big(\rho_{\chi}(s(t),a(t)),\rho_{\min}\big),\qquad(27)$$

where $\rho_{\min}$ is the limitation factor. Another constraint method is to clip $\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t))$ to control the range of the updated Q-values, with the clipping scheme given by

$$\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(s(t),a(t))}=\frac{\jmath-Q_{\chi}(s(t),a(t))}{\rho_{\chi}(s(t),a(t))^{3}}-\frac{1}{\rho_{\chi}(s(t),a(t))},\qquad(28)$$

where

$$\jmath=\ell\big(\upsilon^{\pi_{\phi^{\prime}}}\theta(s(t),a(t)),\,Q_{\chi}(s(t),a(t))-\emptyset,\,Q_{\chi}(s(t),a(t))+\emptyset\big),\qquad(29)$$

where $\ell(l,i,j)$ indicates that $l$ is clipped to the interval $[i,j]$, and $\emptyset$ is the clipping factor. Subsequently, we adopt the policy update equation in DSAC as

$$\Theta_{\pi}(\phi)=\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B},a(t)\sim\pi_{\phi}}\Big[Q_{\chi}(s(t),a(t))-\nu\log\big(\pi_{\phi}(a(t)|s(t))\big)\Big].\qquad(30)$$

Then, we can update $\phi$ by

$$\nabla_{\phi}\Theta_{\pi}(\phi)=\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B},\Re}\Big[-\nu\nabla_{\phi}\log\big(\pi_{\phi}(a(t)|s(t))\big)+\big(\nabla_{a(t)}Q_{\chi}(s(t),a(t))-\nu\nabla_{a(t)}\log\big(\pi_{\phi}(a(t)|s(t))\big)\big)\nabla_{\phi}\imath(\Re;s(t))\Big],\qquad(31)$$

where $\imath(\Re;s(t))=\hat{a}+\Re\bigodot\bar{a}$, $\Re$ is sampled from a fixed distribution, $\bigodot$ is the Hadamard product, and $\hat{a}$ and $\bar{a}$ are the mean and standard deviation of $\pi_{\phi}(\cdot|s(t))$, respectively.
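A minimal sketch of the reparameterized action sampling $\imath(\Re;s(t))=\hat{a}+\Re\bigodot\bar{a}$ used in (31), assuming $\Re$ is drawn from a standard normal distribution:

```python
import numpy as np

def reparameterized_action(a_mean, a_std, rng=None):
    """Sample a continuous action via imath(R; s) = a_mean + R (Hadamard) a_std,
    where R is drawn from a fixed (here standard normal) distribution."""
    rng = np.random.default_rng() if rng is None else rng
    R = rng.standard_normal(np.shape(a_mean))
    return np.asarray(a_mean) + R * np.asarray(a_std)  # element-wise product
```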


Figure 4: The structure of DSAC.

Moreover, we add noise to the exploration of DRL as

$$a(t)=V_{\phi}(\varepsilon(t);s(t)),\qquad(32)$$

where $V_{\phi}(i;j)$ indicates the random selection between the noise and the policy network, and $\varepsilon$ is Gaussian-distributed noise. The structure of DSAC is shown in Fig. 4.
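To illustrate how the Gaussian return distribution, the standard-deviation constraint (27) and the target clipping (29) fit together, the following simplified sketch computes a per-sample distributional critic loss; it is a NumPy stand-in for the network outputs, not the exact training code.

```python
import numpy as np

def critic_loss(q_pred, rho_pred, target_return, rho_min=1.0, clip=10.0):
    """Simplified per-sample distributional critic loss in the spirit of (21)-(29).

    q_pred:        predicted mean Q_chi(s, a) of the return distribution.
    rho_pred:      predicted standard deviation rho_chi(s, a).
    target_return: sampled soft return target (the Bellman target in (20)).
    rho_min:       lower bound used in (27) to avoid a vanishing variance.
    clip:          clipping range (the factor denoted by the empty-set symbol in (29)).
    """
    # (27): constrain the standard deviation away from zero.
    rho = np.maximum(rho_pred, rho_min)
    # (29): clip the target around the current Q-value to limit overestimation.
    target = np.clip(target_return, q_pred - clip, q_pred + clip)
    # Gaussian negative log-likelihood of the clipped target (up to a constant).
    return 0.5 * ((target - q_pred) ** 2) / rho ** 2 + np.log(rho)
```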

III-3 Hybrid Action Space in DSAC

As mentioned in the previous discussion, optimization problems that involve both discrete and continuous actions pose a challenge to traditional DRL algorithms. Whether all actions are discretized or made continuous, the result is a decrease in system control capability and a loss of optimization performance. Therefore, we introduce a decoupling strategy to separate the discrete and continuous actions among the optimization variables. The actions and policies are re-expressed as

$$A=\{A_{d},A_{c}\},\qquad(33)$$
$$Q_{\phi}(a|s)=Q^{A_{d}}_{\phi}(a^{A_{d}}|s)\,Q^{A_{c}}_{\phi}(a^{A_{c}}|s)=\prod_{a_{i}\in A_{d}}Q^{A_{d}}_{\phi}(a_{i}|s)\prod_{a_{j}\in A_{c}}Q^{A_{c}}_{\phi}(a_{j}|s),\qquad(34)$$

respectively, where $A_{d}$ and $A_{c}$ indicate the discrete and continuous action spaces, respectively. Upon decomposing the optimization variables into discrete and continuous actions, we assign these two parts to different DRL agents for control, each based on an independently trained DSAC network. Therefore, we formulate the update functions of the value networks for discrete and continuous actions in the DSAC networks as

$$\Theta_{Q^{A_{d}}_{\chi}}=\mathop{\mathbb{E}}\limits_{(s(t),a(t))\sim\mathbb{B}}\bigg[\frac{1}{2}\Big(Q^{A_{d}}_{\chi}\big(s(t),a(t)\big)-Q^{A_{d}}_{\chi^{\prime}}\big(s(t),a(t)\big)\Big)^{2}\bigg],\qquad(35)$$
$$\Theta_{Q^{A_{c}}_{\chi}}=\mathop{\mathbb{E}}\limits_{(s(t),a(t))\sim\mathbb{B}}\bigg[\frac{1}{2}\Big(Q^{A_{c}}_{\chi}\big(s(t),a(t)\big)-Q^{A_{c}}_{\chi^{\prime}}\big(s(t),a(t)\big)\Big)^{2}\bigg],\qquad(36)$$

respectively. The update function of policy networks for discrete and continuous actions can be expressed as

$$\Theta_{Q^{A_{d}}_{\phi}}=\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}\bigg[\log Q^{A_{d}}_{\phi}\big(V_{\phi}(\varepsilon(t);s(t))\,\big|\,s(t)\big)-Q^{A_{d}}_{\chi}\big(s(t),V_{\phi}(\varepsilon(t);s(t))\big)\bigg],\qquad(37)$$
$$\Theta_{Q^{A_{c}}_{\phi}}=\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}\bigg[\log Q^{A_{c}}_{\phi}\big(V_{\phi}(\varepsilon(t);s(t))\,\big|\,s(t)\big)-Q^{A_{c}}_{\chi}\big(s(t),V_{\phi}(\varepsilon(t);s(t))\big)\bigg],\qquad(38)$$

respectively. After decoupling the actions and letting the different agents collect samples from the environment separately, we need to recouple the continuous and discrete action spaces. Therefore, we design a new policy $\phi_{c}$ for the proposed DSAC algorithm [37], whose update equation can be expressed as

$$\Theta_{Q_{\phi_{c}}}=\mathop{\mathbb{E}}\limits_{\phi_{c}(a(t)|s(t))}\big[\bar{Q}(s(t),a(t))\big],\qquad(39)$$

where $\bar{Q}$ indicates a new value network trained from the experience buffer [38]. When coupling the hybrid action space, we also need to constrain the deviation during policy updates. Therefore, we impose

$$\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}\big[u\big(Q_{\phi_{c}}(a(t)|s(t))\,\|\,Q_{\phi^{\prime}_{c}}(a(t)|s(t))\big)\big]<W,\qquad(40)$$

where $u$ indicates the KL divergence, $\phi^{\prime}_{c}$ denotes the old version of the policy $\phi_{c}$, and $W$ is a threshold factor to limit the deviation. The update function of the hybrid policy can be expressed as

$$
\begin{aligned}
\hat{\phi}_{c}=~&\arg\max_{\phi_{c}}\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}\big[u\big(Q_{\phi_{c}}(a|s)\,\|\,Q^{A_{d}}_{\phi_{c}}(a^{A_{d}}|s)Q^{A_{c}}_{\phi_{c}}(a^{A_{c}}|s)\big)\big], &&(41)\\
{\rm s.t.}~&\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}\big[u\big(Q^{A_{d}}_{\phi^{\prime}_{c}}(a^{A_{d}}|s)\,\|\,Q^{A_{d}}_{\phi_{c}}(a^{A_{d}}|s)\big)\big]<W_{A_{d}}, &&(41{\rm a})\\
&\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}\Big[\frac{1}{Z}\sum_{z=1}^{Z}u\big(Q^{A_{c}}_{\phi^{\prime}_{c}}(a^{A_{c}}|s)\,\|\,Q^{A_{c}}_{\phi_{c}}(a^{A_{c}}|s)\big)\Big]<W_{A_{c}}, &&(41{\rm b})
\end{aligned}
$$

where $W_{A_{d}}$ and $W_{A_{c}}$ (written as $W_{d}$ and $W_{c}$ in the simulation settings) are the thresholds for the discrete and continuous action spaces, respectively, and $Z$ represents the size of the discrete action space. In summary, we employ a decoupling strategy to allocate the coupled optimization variables to different agents, and then recouple the optimized action spaces to form a hybrid solution. This decoupling-recoupling procedure gives DRL stronger capabilities to explore the relationship between actions and states. Moreover, it is noted that the prediction complexity in DRL is substantially lower than the training complexity; once deployed, the pre-trained DRL model can make real-time decisions with very small computational overhead. The structure of the hybrid action space in the DSAC method is shown in Fig. 5. The pseudo-code of the proposed hybrid action space DSAC (H-DSAC) algorithm is given in Algorithm 1.
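The decoupling and recoupling of the hybrid action space can be summarized by the following sketch: two DSAC agents act on the discrete and continuous sub-spaces, and the factorization in (34) means that the log-probability of a hybrid action is the sum of the sub-action log-probabilities. The agent objects and method names here are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def hybrid_log_prob(log_probs_discrete, log_probs_continuous):
    """Log-probability of a hybrid action under the factored policy (34):
    the joint probability is the product over all discrete and continuous
    sub-actions, i.e. the sum of their log-probabilities."""
    return np.sum(log_probs_discrete) + np.sum(log_probs_continuous)

def select_hybrid_action(discrete_agent, continuous_agent, state):
    """Each agent optimizes its own sub-space via (35)-(38); the recoupled
    hybrid policy (39)-(41) is then trained on the combined samples."""
    a_d = discrete_agent.act(state)    # e.g. pairings and final-aggregation node
    a_c = continuous_agent.act(state)  # e.g. UAV trajectories and aggregation weights
    return {"discrete": a_d, "continuous": a_c}
```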

Figure 5: The structure of the hybrid action space in DSAC.
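To make the decoupling and re-coupling structure in Fig. 5 concrete, the following minimal Python sketch shows one possible realization: a discrete head (e.g., pairing/selection decisions) and a continuous head (e.g., UAV trajectory and aggregation weights) act on the same state and are then re-coupled into a single hybrid action. The class name, dimensions and random linear heads are our own illustrative assumptions, not part of the H-DSAC implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class HybridPolicySketch:
    """Illustrative only: a decoupled actor with a discrete head (e.g., user-UAV
    pairing / satellite selection) and a continuous head (e.g., UAV velocities and
    aggregation weights), whose outputs are re-coupled into one hybrid action."""

    def __init__(self, state_dim, n_discrete, cont_dim):
        # Random linear "networks" stand in for the trained actor heads.
        self.W_d = rng.normal(scale=0.1, size=(n_discrete, state_dim))
        self.W_c = rng.normal(scale=0.1, size=(cont_dim, state_dim))

    def act(self, state):
        # Decoupled heads: each head only sees the shared state.
        logits = self.W_d @ state                    # discrete head
        a_d = rng.choice(len(logits), p=softmax(logits))
        a_c = np.tanh(self.W_c @ state)              # continuous head, bounded to [-1, 1]
        # Re-coupling: the two sub-actions form one hybrid action a = (a_d, a_c).
        return {"discrete": int(a_d), "continuous": a_c}

policy = HybridPolicySketch(state_dim=8, n_discrete=4, cont_dim=3)
print(policy.act(rng.normal(size=8)))
```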
Algorithm 1 H-DSAC
1:  Initialize all the networks and parameters.
2:  repeat
3:      for each time slot do
4:          Choose action $a(t)$ based on the current state $s(t)$ and (32).
5:          Obtain the next state $s(t+1)$ and reward $r(t)$.
6:          Generate the experience sample $\{s(t),a(t),r(t),s(t+1)\}$ and save it to the buffer $\mathbb{B}$.
7:      end for
8:      for each training epoch do
9:          Sample a minibatch from the buffer $\mathbb{B}$.
10:         Update the importance factor $\nu\leftarrow\nu-L_{\nu}\nabla_{\nu}\Theta(\nu)$.
11:         Update $Q^{A_{d}}_{\chi}$ based on (35) and $Q^{A_{c}}_{\chi}$ based on (36).
12:         Update $Q^{A_{d}}_{\phi}$ based on (37) and $Q^{A_{c}}_{\phi}$ based on (38).
13:         Soft update the target networks $\chi^{\prime}$ and $\phi^{\prime}$.
14:         Update $\phi_{c}$ based on (41).
15:     end for
16: until convergence.
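The following Python skeleton mirrors the control flow of Algorithm 1 under the assumption that an `env` and an `agent` object expose the interaction and update steps named in the pseudo-code; all method names (e.g., `update_critics`, `update_hybrid_actor`) are placeholders for the updates in (35)-(38) and (41), not a released implementation.

```python
import collections
import random

Transition = collections.namedtuple("Transition", "s a r s_next")

def h_dsac_training_sketch(env, agent, episodes=100, slots_per_episode=50,
                           epochs_per_episode=10, batch_size=64, buffer_size=10_000):
    """Skeleton of Algorithm 1 (H-DSAC). `env` and `agent` are placeholders; the
    agent is assumed to expose update steps corresponding to the temperature
    update, (35)-(38) and (41)."""
    buffer = collections.deque(maxlen=buffer_size)
    for _ in range(episodes):                      # repeat ... until convergence
        s = env.reset()
        for _ in range(slots_per_episode):         # for each time slot
            a = agent.act(s)                       # hybrid action, cf. (32)
            s_next, r = env.step(a)                # next state and reward
            buffer.append(Transition(s, a, r, s_next))
            s = s_next
        for _ in range(epochs_per_episode):        # for each training epoch
            if len(buffer) < batch_size:
                break
            batch = random.sample(list(buffer), batch_size)
            agent.update_importance_factor(batch)  # importance factor update
            agent.update_critics(batch)            # distributional critics, cf. (35)-(36)
            agent.update_actors(batch)             # decoupled actors, cf. (37)-(38)
            agent.soft_update_targets()            # target networks
            agent.update_hybrid_actor(batch)       # re-coupled actor, cf. (41)
```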

IV Simulation Results

In this section, we select 10 datasets (from Kaggle Computer Vision Open Datasets) as 10 tasks in federated learning and use TensorFlow 2 to train the tasks in the simulations; the size of the local dataset of a ground user for any given task is randomly generated. Unless otherwise stated, the simulation parameters are set as follows: the number of ground users $K=10$, the number of UAVs $M=3$, and ground users and UAVs are randomly distributed within a 250 m × 250 m area. The ground users are randomly assigned to $M$ clusters, while the UAVs initially maintain an altitude of 50 m. The maximum speed of the UAVs is $v_{\max}=5$ m/s, $z_{\min}=40$ m and $z_{\max}=60$ m. The number of LEO satellites is $N=5$, the Rician factor is $\omega=10$ dB, the altitude of the LEO satellites is 800 km, the speed of the LEO satellites is 7.8 km/s, the minimum coverage elevation angle is $\varpi=40^{\circ}$, and the carrier frequencies for the UAVs and LEO satellites are 1 GHz and 30 GHz, respectively. The antenna gain is $\xi_{m}=25$ dB for UAVs and 40 dB for LEO satellites, and the maximum Doppler frequency follows the Ka-band setting in [39]. The bandwidth for ground-air-space transmissions is set to 10 MHz, the bandwidth for ISL transmissions is 1 GHz, the normalized gain of the links between satellites is $G^{d}=1$, and the thermal noise is $\varphi=354.81$ K [32]. The path loss exponents are $\tau_{L}=2$ and $\tau_{N}=2.5$, and the transmit powers of the ground users, UAVs and LEO satellites are 0.1 W, 1 W and 2 W, respectively. The discount factor is $\vartheta=0.99$, the decay factor is $\beta=0.995$, the thresholds are $W=0.1$, $W_{c}=0.001$ and $W_{d}=0.01$, the normalizing constants are $\epsilon_{c1}=200$, $\epsilon_{c2}=100$ and $\epsilon_{f}=0.01$, $\rho_{\min}=1$, and the training time for each round is 1 s. The distances between LEO satellites are randomly generated between 100 km and 500 km. Moreover, we adopt parametrized DQN (PDQN) [40], multi-agent proximal policy optimization (M-PPO) [41], 'H-DSAC + FedAvg' (using FedAvg [42] to calculate the aggregation weights) and 'H-DSAC + HoveringUAV' (without UAV trajectory planning, i.e., the UAVs remain hovering) as the benchmarks in the simulations.
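For readability, the simulation parameters listed above can be gathered into a single configuration structure; the sketch below simply mirrors those values, and the field names are our own (no released code is implied).

```python
# Hypothetical configuration mirroring the simulation parameters listed above.
SIM_CONFIG = {
    "num_ground_users": 10,
    "num_uavs": 3,
    "num_leo_satellites": 5,
    "area_m": (250, 250),
    "uav_altitude_init_m": 50,
    "uav_altitude_range_m": (40, 60),
    "uav_max_speed_mps": 5.0,
    "leo_altitude_km": 800,
    "leo_speed_kmps": 7.8,
    "min_elevation_deg": 40,
    "rician_factor_db": 10,
    "carrier_freq_ghz": {"uav": 1, "leo": 30},
    "antenna_gain_db": {"uav": 25, "leo": 40},
    "bandwidth_ground_air_space_mhz": 10,
    "bandwidth_isl_ghz": 1,
    "thermal_noise_k": 354.81,
    "path_loss_exponents": {"los": 2.0, "nlos": 2.5},
    "tx_power_w": {"ground_user": 0.1, "uav": 1.0, "leo": 2.0},
    "drl": {"discount": 0.99, "decay": 0.995, "W": 0.1, "W_c": 0.001, "W_d": 0.01},
    "reward_norm": {"eps_c1": 200, "eps_c2": 100, "eps_f": 0.01, "rho_min": 1},
    "local_training_time_s": 1.0,
    "isl_distance_km": (100, 500),
}
```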

Figure 6: Average test accuracy vs. communication time.

Fig. 6 presents the average test accuracy of the tasks versus the communication time for the proposed H-DSAC algorithm and the benchmarks. As shown in the figure, the average test accuracy rises as the communication time increases: as the service time progresses, the tasks continue to be trained in HFL, which improves their accuracy. H-DSAC achieves an average test accuracy of 91.4% when the communication time is 100 s, while the others achieve 90.4%, 85.6%, 84.9% and 81.9%, respectively. The proposed algorithm consistently outperforms all other algorithms, demonstrating the superiority of H-DSAC. The performance of 'H-DSAC + FedAvg' consistently ranks second, indicating that the gain from the hybrid action space method is larger than the gain from the aggregation weights in the proposed optimization problem. Compared with the other algorithms, the performance of 'H-DSAC + HoveringUAV' deteriorates significantly in the absence of UAV trajectory planning; hence, UAV trajectory planning plays a crucial role in ensuring the stability of ground-air links. In addition, M-PPO and PDQN achieve similar performance across various scenarios, which indicates that while M-PPO may offer algorithmic advantages over DQN and DDPG, PDQN also has its own strengths in optimizing hybrid action spaces.

Figure 7: Average loss vs. communication time.

Comparing the result in Fig. 7 with that in Fig. 6, it can be observed that although the loss is not directly related to the accuracy, their trends are very similar: as the average loss decreases, the average accuracy increases correspondingly. H-DSAC achieves an average loss of 0.32 when the communication time is 40 s, while the others achieve 0.37, 0.88, 0.71 and 0.74, respectively. These results confirm that the proposed algorithm exploits UAV trajectory control and weighted aggregation to enhance performance, and that the hybrid action space design offers an algorithmic advantage over the other DRL algorithms.

Figure 8: Average test accuracy vs. communication time.

As illustrated in Fig. 8, we compare the performance of two selected tasks to examine the fairness among tasks during training; 'Task $T_{1}$ Fixed Reward' refers to the performance of task $T_{1}$ when a fixed reward design is used instead of the dynamically adjusted reward function. The performance of both tasks rises rapidly at the beginning: task $T_{1}$ achieves an average test accuracy of 89.7% when the communication time is 80 s, while 'Task $T_{1}$ Fixed Reward' achieves 90.7%. This shows that, without a fairness-controlled reward function, certain tasks may converge to high performance more rapidly. Moreover, the fairness-oriented, dynamically adjusted reward function based on (17) has only a small impact during the initial stage of communication, resulting in slow progress for task $T_{2}$ at the beginning. However, as the satellite communication time approaches its end, both tasks converge to a high performance level with the assistance of the proposed dynamically adjusted reward function. Furthermore, without fairness considerations, task $T_{2}$ may be consistently undervalued, reaching a test accuracy of only around 82% even towards the end of communication. In addition, when the proposed reward function is used, the average accuracy is higher than that with the fixed reward. These results underscore the beneficial role of the proposed reward mechanism in ensuring fairness among multiple tasks: if the satellite cannot guarantee that all tasks converge to a high accuracy threshold due to the limited access time, the dynamic design ensures that all tasks fairly converge to the best possible value within the given time duration.
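The exact reward is defined in (17); as a rough illustration of the dynamic adjustment described above, the toy function below combines an average-accuracy term with a fairness gap whose weight grows as the satellite service time runs out. The functional form and the `fairness_coeff` and `ramp` parameters are our own assumptions, not the paper's reward.

```python
def fairness_adjusted_reward_sketch(task_accuracies, t, t_total,
                                    fairness_coeff=0.01, ramp=2.0):
    """Illustrative reward only (the paper's reward is defined in (17), not shown
    here): the fairness term is nearly inactive early in the satellite service
    time and is weighted more heavily as the service time runs out, pushing the
    slowest task towards the others."""
    mean_acc = sum(task_accuracies) / len(task_accuracies)
    fairness_gap = max(task_accuracies) - min(task_accuracies)
    # Time-dependent weight: ~0 at t = 0, grows towards 1 as t -> t_total.
    w_fair = (t / t_total) ** ramp
    return mean_acc - fairness_coeff * w_fair * fairness_gap

# Toy usage: the same accuracy gap is penalized more near the end of the service time.
print(fairness_adjusted_reward_sketch([0.9, 0.8], t=10, t_total=100))
print(fairness_adjusted_reward_sketch([0.9, 0.8], t=90, t_total=100))
```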

Figure 9: Average test accuracy vs. transmit power of ground users.

Fig. 9 compares the average test accuracy of the proposed method and the benchmarks under different ground-user transmit powers. It can be observed that the average test accuracy of the tasks increases as the ground-user transmit power rises. This is because a higher transmit power provides higher ground-to-air data rates, thereby reducing the model transmission time and accelerating aggregation, which allows more aggregation cycles within the limited satellite service time. Moreover, the proposed H-DSAC algorithm achieves around 97.99% accuracy when the transmit power of all ground users is 0.2 W, while the other three benchmarks only achieve 96.4%, 94.1% and 94.2%, respectively. This result demonstrates the superiority of the proposed algorithm over the benchmarks.
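As a back-of-the-envelope illustration of why a higher ground-to-air data rate yields more aggregation cycles, the sketch below counts the edge rounds that fit into a fixed service window when each round consists of one local training step plus one model upload; the model size and data rates are assumed values, not taken from the simulations.

```python
import math

def num_aggregation_rounds(service_time_s, model_bits, uplink_rate_bps,
                           local_training_time_s=1.0):
    """Back-of-the-envelope round count: each edge round needs one local training
    step plus one model upload (all numbers below are assumed for illustration)."""
    round_time = local_training_time_s + model_bits / uplink_rate_bps
    return math.floor(service_time_s / round_time)

# Toy example: a 10 Mbit model over a 100 s service window.
print(num_aggregation_rounds(100, 10e6, uplink_rate_bps=5e6))   # lower rate -> fewer rounds
print(num_aggregation_rounds(100, 10e6, uplink_rate_bps=20e6))  # higher rate -> more rounds
```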

Figure 10: Average test accuracy vs. the number of tasks (non-i.i.d. data distribution).

In Fig. 10, we compare the performance of the federated learning tasks for different numbers of tasks, where the data distribution of each task across the ground users is non-i.i.d. It can be seen that as the number of tasks increases, the average performance tends to deteriorate. This is because different tasks may originate from entirely different datasets, and an increase in the number of tasks inevitably tightens the constraints on transmission and computational resources. Moreover, the proposed algorithm consistently outperforms the other benchmarks: H-DSAC achieves 96.9% with 8 tasks, while the others achieve between 91.7% and 95.2%. This result confirms the effectiveness of the proposed algorithm.

Figure 11: Average test accuracy vs. the number of tasks (i.i.d. data distribution).

The results in Fig. 11 illustrate the performance of the federated learning tasks when the dataset of each task is i.i.d. across the ground users. In contrast to Fig. 10, i.i.d. data facilitates algorithm convergence: in this scenario, the user models are of equal importance and each user holds an equivalent position within a given task, so only a balanced resource allocation is needed to regulate their participation frequency in aggregation. Since FedAvg cannot adjust the participation weights of users based on the communication conditions of each aggregation, the advantage of the proposed algorithm in adapting the weights to the state of the communication links becomes more pronounced.
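To illustrate the difference between the two aggregation rules, the sketch below contrasts FedAvg-style weights (proportional to local dataset sizes) with a toy link-aware variant that rescales the weights by a per-user link-quality factor; the latter is only a stand-in for the learned aggregation weights in the proposed scheme.

```python
import numpy as np

def fedavg_aggregate(models, num_samples):
    """FedAvg-style aggregation: weights proportional to local dataset sizes."""
    w = np.asarray(num_samples, dtype=float)
    w /= w.sum()
    return sum(wi * mi for wi, mi in zip(w, models))

def link_aware_aggregate(models, num_samples, link_quality):
    """Illustrative alternative: scale the FedAvg weights by a per-user factor
    reflecting the current link state (a toy rule, not the learned weights)."""
    w = np.asarray(num_samples, dtype=float) * np.asarray(link_quality, dtype=float)
    w /= w.sum()
    return sum(wi * mi for wi, mi in zip(w, models))

# Toy usage with three "models" represented as parameter vectors.
models = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(fedavg_aggregate(models, [100, 100, 200]))
print(link_aware_aggregate(models, [100, 100, 200], link_quality=[0.9, 0.2, 0.7]))
```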

Figure 12: Average test accuracy vs. minimum coverage elevation angle.

In Fig. 12, we assess the average test accuracy of the proposed scheme and the benchmarks for different minimum coverage elevation angles. In all cases, the average test accuracy decreases as the minimum coverage elevation angle rises. The rationale is that, according to (9) and (10), as the minimum coverage elevation angle increases, the service time between satellites and UAVs decreases, which reduces the time available for ground users to utilize the SAGIN for task training. Consequently, both the number of federated learning training iterations and the number of aggregation iterations decrease, leading to a reduction in the average performance of all tasks.
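The dependence of the service time on the minimum elevation angle can also be seen from standard satellite-visibility geometry; the sketch below is given only as an illustration (it is not the paper's (9) and (10)) and shows how the maximum pass duration of a circular-orbit LEO satellite over a ground node shrinks as the minimum elevation angle grows.

```python
import math

def max_pass_duration_s(min_elev_deg, altitude_km=800.0, earth_radius_km=6371.0,
                        mu_km3_s2=398600.4418):
    """Standard LEO visibility geometry (illustration only): maximum pass duration
    over a ground node for a circular orbit, ignoring Earth rotation."""
    eps = math.radians(min_elev_deg)
    r = earth_radius_km + altitude_km
    # Earth-central half-angle of the coverage arc above the minimum elevation.
    lam = math.acos(earth_radius_km / r * math.cos(eps)) - eps
    orbital_period_s = 2 * math.pi * math.sqrt(r**3 / mu_km3_s2)
    return (2 * lam / (2 * math.pi)) * orbital_period_s

for elev in (20, 40, 60):
    print(elev, round(max_pass_duration_s(elev), 1))  # higher min elevation -> shorter pass
```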

V Conclusion

In this paper, we proposed a novel application of the HFL framework within SAGINs, utilizing aerial platforms and LEO satellites as edge and cloud servers for HFL, respectively. We also considered ISLs as communication channels between satellites to provide additional training resources for federated learning tasks. To maximize the average performance of multiple distinct tasks and ensure fairness among them during training, we employed a decoupling-recoupling approach in the DSAC algorithm and designed a novel dynamically adjusted reward function to guarantee fairness among multiple tasks. By optimizing the ground-to-air pairings in uplink and downlink transmissions, the air-to-satellite pairings, the UAV trajectories, the final aggregation selection between satellites, and the edge, cloud and final aggregation weights, we balanced the training effects among different tasks so that the performance of any individual task can converge to a high level while the overall performance is maximized. Through simulations, we validated the superiority of the proposed algorithm and emphasized the importance of the fairness design during task training. Furthermore, we analyzed the impact of the user transmit power, UAV trajectory planning and minimum coverage elevation angle on the results, and demonstrated the significance of resource allocation in SAGINs for federated learning performance. We also analyzed the effect of the data distribution of the tasks on algorithm convergence; the results show that the proposed algorithm exhibits strong adaptability to dynamic environments and provides a highly promising optimization tool for future federated learning frameworks and multi-access edge computing systems in SAGINs. In future work, we will consider adaptive downlink/uplink bandwidth allocation and full-duplex operation to further enhance the efficiency of federated learning in SAGINs.

References

  • [1] D. C. Nguyen, S. Hosseinalipour, D. J. Love, P. N. Pathirana, and C. G. Brinton, “Latency optimization for blockchain-empowered federated learning in multi-server edge computing,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 12, pp. 3373–3390, Dec. 2022.
  • [2] C. Feng, H. H. Yang, S. Wang, Z. Zhao, and T. Q. S. Quek, “Hybrid learning: When centralized learning meets federated learning in the mobile edge computing systems,” IEEE Transactions on Communications, vol. 71, no. 12, pp. 7008–7022, Dec. 2023.
  • [3] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, Jun. 2020.
  • [4] Z. Qin, G. Y. Li, and H. Ye, “Federated learning and wireless communications,” IEEE Wireless Communications, vol. 28, no. 5, pp. 134–140, Oct. 2021.
  • [5] M. Hosseinian, J. P. Choi, S.-H. Chang, and J. Lee, “Review of 5G NTN standards development and technical challenges for satellite integration with the 5G network,” IEEE Aerospace and Electronic Systems Magazine, vol. 36, no. 8, pp. 22–31, Aug. 2021.
  • [6] J. C. McDowell, “The low earth orbit satellite population and impacts of the Spacex Starlink constellation,” The Astrophysical Journal Letters, vol. 892, no. 2, p. L36, Apr. 2020.
  • [7] V. L. Foreman, A. Siddiqi, and O. De Weck, “Large satellite constellation orbital debris impacts: Case studies of oneweb and spacex proposals,” AIAA SPACE and Astronautics Forum and Exposition, Orlando, FL, Sep. 2017.
  • [8] G. Giambene, S. Kota, and P. Pillai, “Satellite-5G integration: A network perspective,” IEEE Network, vol. 32, no. 5, pp. 25–31, Sep. 2018.
  • [9] S. Liu, Z. Gao, Y. Wu, D. W. Kwan Ng, X. Gao, K.-K. Wong, S. Chatzinotas, and B. Ottersten, “LEO satellite constellations for 5G and beyond: How will they reshape vertical domains?,” IEEE Communications Magazine, vol. 59, no. 7, pp. 30–36, Jul. 2021.
  • [10] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili, “Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges,” IEEE Communications Magazine, vol. 55, no. 4, pp. 54–61, Apr. 2017.
  • [11] Z. Chu, P. Xiao, M. Shojafar, D. Mi, W. Hao, J. Shi, and F. Zhou, “Utility maximization for IRS assisted wireless powered mobile edge computing and caching (WP-MECC) networks,” IEEE Transactions on Communications, vol. 71, no. 1, pp. 457–472, Jan. 2023.
  • [12] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, thirdquarter, 2020.
  • [13] H. Zheng, M. Gao, Z. Chen, and X. Feng, “A distributed hierarchical deep computation model for federated learning in edge computing,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 7946–7956, Dec. 2021.
  • [14] H. Xu, S. Han, X. Li, and Z. Han, “Anomaly traffic detection based on communication-efficient federated learning in Space-Air-Ground integration network,” IEEE Transactions on Wireless Communications, vol. 22, no. 12, pp. 9346–9360, Dec. 2023.
  • [15] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269–283, Jan. 2021.
  • [16] Y. Gao, Z. Ye, H. Yu, Z. Xiong, Y. Xiao, and D. Niyato, “Multi-resource allocation for on-device distributed federated learning systems,” in IEEE Global Communications Conference (GLOBECOM), Rio de Janeiro, Brazil, Dec. 2022, pp. 160-165.
  • [17] T. Liu, H. Zhou, J. Li, F. Shu, and Z. Han, “Uplink and downlink decoupled 5G/B5G vehicular networks: A federated learning assisted client selection method,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2280–2292, Feb. 2023.
  • [18] S. Wang, M. Chen, C. G. Brinton, C. Yin, W. Saad, and S. Cui, “Performance optimization for variable bitwidth federated learning in wireless networks,” IEEE Transactions on Wireless Communications, vol. 23, no. 3, pp. 2340–2356, Mar. 2024.
  • [19] J. Ye, S. Dang, B. Shihada, and M.-S. Alouini, “Space-Air-Ground integrated networks: Outage performance analysis,” IEEE Transactions on Wireless Communications, vol. 19, no. 12, pp. 7897–7912, Dec. 2020.
  • [20] G. Wang, S. Zhou, S. Zhang, Z. Niu, and X. Shen, “SFC-based service provisioning for reconfigurable Space-Air-Ground integrated networks,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 7, pp. 1478–1489, Jul. 2020.
  • [21] C. Guo, C. Gong, H. Xu, L. Zhang, and Z. Han, “A dynamic handover software-defined transmission control scheme in Space-Air-Ground integrated networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 8, pp. 6110–6124, Aug. 2022.
  • [22] C. Huang, G. Chen, P. Xiao, Y. Xiao, Z. Han, and J. A. Chambers, “Joint offloading and resource allocation for hybrid cloud and edge computing in SAGINs: A decision assisted hybrid action space deep reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 42, no. 5, pp. 1029–1043, May. 2024.
  • [23] Q. Fang, Z. Zhai, S. Yu, Q. Wu, X. Gong, and X. Chen, “Olive branch learning: A topology-aware federated learning framework for Space–Air–Ground integrated network,” IEEE Transactions on Wireless Communications, vol. 22, no. 7, pp. 4534–4551, Jul. 2023.
  • [24] R. Zhao, L. T. Yang, D. Liu, and W. Lu, “Tensor-enabled communication-efficient and trustworthy federated learning for heterogeneous intelligent Space–Air–Ground-Integrated IoT,” IEEE Internet of Things Journal, vol. 10, no. 23, pp. 20285–20296, Dec. 2023.
  • [25] Y. Sun, J. Shao, Y. Mao, J. H. Wang, and J. Zhang, “Semi-decentralized federated edge learning with data and device heterogeneity,” IEEE Transactions on Network and Service Management, vol. 20, no. 2, pp. 1487–1501, Jun. 2023.
  • [26] C. You, K. Guo, H. H. Yang, and T. Q. S. Quek, “Hierarchical personalized federated learning over massive mobile edge computing networks,” IEEE Transactions on Wireless Communications, vol. 22, no. 11, pp. 8141–8157, Nov. 2023.
  • [27] Q. Chen, Z. You, D. Wen, and Z. Zhang, “Enhanced hybrid hierarchical federated edge learning over heterogeneous networks,” IEEE Transactions on Vehicular Technology, vol. 72, no. 11, pp. 14601–14614, Nov. 2023.
  • [28] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal, “Federated multi-task learning under a mixture of distributions,” Annual Conference on Neural Information Processing Systems (NeurIPS), vol. 34, Dec. 2021, pp. 15434-15447.
  • [29] C. Huang, G. Chen, P. Xiao, D. Mi, Y. Zhang, H. Tang, C. Lu, and R. Tafazolli, “Federated learning for RIS-assisted UAV-enabled wireless networks: Learning-based optimization for UAV trajectory, RIS phase shifts and weighted aggregation,” in Annual Conference of the IEEE Industrial Electronics Society (IECON), Singapore, Oct. 2023.
  • [30] Z. Lin, H. Niu, K. An, Y. Wang, G. Zheng, S. Chatzinotas, and Y. Hu, “Refracting RIS-aided hybrid satellite-terrestrial relay networks: Joint beamforming design and optimization,” IEEE Transactions on Aerospace and Electronic Systems, vol. 58, no. 4, pp. 3717–3724, Aug. 2022.
  • [31] Y. Liu, C. Huang, G. Chen, R. Song, S. Song, and P. Xiao, “Deep learning empowered trajectory and passive beamforming design in UAV-RIS enabled secure cognitive non-terrestrial networks,” IEEE Wireless Communications Letters, vol. 13, no. 1, pp. 188–192, Jan. 2024.
  • [32] I. Leyva-Mayorga, B. Soret, and P. Popovski, “Inter-plane inter-satellite connectivity in dense LEO constellations,” IEEE Transactions on Wireless Communications, vol. 20, no. 6, pp. 3430–3443, Jun. 2021.
  • [33] Q. Tang, Z. Fei, B. Li, and Z. Han, “Computation offloading in LEO satellite networks with hybrid cloud and edge computing,” IEEE Internet of Things Journal, vol. 8, no. 11, pp. 9164–9176, Jun. 2021.
  • [34] Y. Xu, Y. Wei, K. Jiang, L. Chen, D. Wang, and H. Deng, “Action decoupled SAC reinforcement learning with discrete-continuous hybrid action spaces,” Neurocomputing, vol. 537, pp. 141–151, Jun. 2023.
  • [35] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6584–6598, Nov. 2022.
  • [36] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” International Conference on Machine Learning (ICML), Sydney, Australia, Aug. 2017, pp. 449-458.
  • [37] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” International Conference on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
  • [38] M. Neunert, A. Abdolmaleki, M. Wulfmeier, T. Lampe, T. Springenberg, R. Hafner, F. Romano, J. Buchli, N. Heess, and M. Riedmiller, “Continuous-discrete reinforcement learning for hybrid control in robotics,” Proceedings of the Conference on Robot Learning, Nov. 2020, pp. 735-751.
  • [39] J. Shi, Z. Li, J. Hu, Z. Tie, S. Li, W. Liang, and Z. Ding, “OTFS enabled LEO satellite communications: A promising solution to severe doppler effects,” IEEE Network, Feb. 2023.
  • [40] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep Q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,” arXiv preprint arXiv:1810.06394, Oct. 2018.
  • [41] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of PPO in cooperative multi-agent games,” Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, Nov. 2022, pp. 24611-24624.
  • [42] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” International Conference on Learning Representations (ICLR), Apr. 2020.