Fair Resource Allocation For Hierarchical Federated Edge Learning in Space-Air-Ground Integrated Networks via
Deep Reinforcement Learning with Hybrid Control

Chong Huang, , Gaojie Chen, , Pei Xiao, ,
Jonathon A. Chambers, , and Wei Huang C. Huang, G. Chen and P. Xiao are with 5GIC & 6GIC, Institute for Communication Systems (ICS), Home for 5GIC & 6GIC, University of Surrey, Guildford, GU2 7XH, United Kingdom. Email: {chong.huang, gaojie.chen, p.xiao}@surrey.ac.uk. (Corresponding author: G. Chen)Jonathon A. Chambers is with the School of Engineering, University of Leicester, Leicester, LE1 7RU, United Kingdom. Email: jonathon.chambers@leicester.ac.uk.W. Huang is with School of Flexible Electronics (SoFE) & State Key Laboratory of Optoelectronic Materials and Technologies, Sun Yat-sen University, Guangdong, 510220, China. Email: huangw323@mail.sysu.edu.cn

Abstract

The space-air-ground integrated network (SAGIN) has become a crucial research direction in future wireless communications due to its ubiquitous coverage, rapid and flexible deployment, and multi-layer cooperation capabilities. However, integrating hierarchical federated learning (HFL) with edge computing and SAGINs remains a complex open issue to be resolved. This paper proposes a novel framework for applying HFL in SAGINs, utilizing aerial platforms and low Earth orbit (LEO) satellites as edge servers and cloud servers, respectively, to provide multi-layer aggregation capabilities for HFL. The proposed system also considers the presence of inter-satellite links (ISLs), enabling satellites to exchange federated learning models with each other. Furthermore, we consider multiple different computational tasks that need to be completed within a limited satellite service time. To maximize the convergence performance of all tasks while ensuring fairness, we propose the use of the distributional soft-actor-critic (DSAC) algorithm to optimize resource allocation in the SAGIN and aggregation weights in HFL. Moreover, we address the efficiency issue of hybrid action spaces in deep reinforcement learning (DRL) through a decoupling and recoupling approach, and design a new dynamic adjusting reward function to ensure fairness among multiple tasks in federated learning. Simulation results demonstrate the superiority of our proposed algorithm, consistently outperforming baseline approaches and offering a promising solution for addressing highly complex optimization problems in SAGINs.

Index Terms:

Space-air-ground integrated network, hierarchical federated Learning, federated edge learning, deep reinforcement learning, satellite communications, unmanned aerial vehicle.

I Introduction

I-A Background

With the advancement of wireless communications and artificial intelligence (AI) technologies, there has been a rapid increase in the demand for intelligent wireless networks in recent years [1]. Many intelligent applications in wireless networks are trained based on local data from wireless nodes, which are stored in a distributed manner across these nodes. In traditional centralized machine learning methods, it is necessary to transfer vast amounts of data to the cloud or data centers for processing, this approach not only consumes a significant amount of wireless bandwidth but also leads to considerable latency [2]. Moreover, data privacy and information security issues have become a major concern in the current digital era, because many applications involve the processing of sensitive personal information [3]. To address the above issues, federated learning has emerged as a very popular machine learning framework in wireless communications [4]. Within the federated learning framework, local devices in wireless networks train their models using their own local data, and only share their model parameters rather than their training data to protect data privacy. Furthermore, federated learning significantly reduces the amount of data that needs to be transmitted over wireless communications, thus saving wireless communication resources.

On the other hand, satellite networks have attracted much attention in wireless communications due to their capability to provide global coverage and seamless connections [5]. In recent years, satellite network projects like Starlink [6], OneWeb [7], and O3b [8] have been rapidly advanced to provide global communication services, especially offering connections to remote areas. Thus, satellite communication has become an important component of future wireless communications. Furthermore, considering the strong demand for communication latency and rates in sixth-generation (6G) communications, the integration of satellite networks with terrestrial communication networks has emerged as a very promising design paradigm [9]. In particular, space-air-ground integrated networks (SAGINs) seamlessly integrate satellites, aerial platforms, and terrestrial communication nodes, where satellites provide ubiquitous communication connections globally, aerial platforms offer rapid and flexible edge access capabilities for terrestrial networks, and terrestrial networks provide localized communication connections as the communication backbone. This integrated Space-Air-Ground framework design offers unprecedented prospects for future communications.

Moreover, edge computing was emerged to meet the growing demand for low-latency and high-bandwidth services by leveraging the computational and storage capabilities of terminal devices and edge servers [10]. In edge computing, computing tasks are transferred from terminals to edge servers to reduce service latency in the wireless network [11]. However, traditional edge computing requires the exchange of local data, occupying a large amount of communication bandwidth and raising issues of personal data privacy [12]. Hence, hierarchical federated learning (HFL) was developed to address these issues via combining federated learning and edge computing [13]. In HFL, models from terminals are aggregated multiple times at edge servers before being aggregated in the cloud server to efficiently utilize edge network resource while protecting data privacy. Furthermore, the hierarchical training process in HFL naturally aligns with the multi-level communication architecture of SAGINs [14]. The global coverage capability and flexible aerial deployment of SAGINs enable more ground users to participate in HFL to significantly increase the diversity and performance of task training.

I-B Related Work

Federated learning is emerging as a paradigm for providing artificial intelligence services in wireless communications, and numerous current studies have investigated its performance within wireless networks [15, 16, 17, 18]. In [15], the Hungarian algorithm was introduced to optimize the user selection and resource block allocation to improve the federated learning aggregation performance in wireless networks. A distributed resource allocation method was proposed in [16] to minimize the energy consumption and latency of federated learning in transmission control protocol/internet protocol (TCP/IP) networks. In [17], the delay performance was analyzed to improve the federated learning performance and the energy efficiency in wireless vehicular networks. To reduce the convergence time of federated learning in wireless communications, the authors of [18] analyzed the effect of quantization errors and limited wireless resources in federated learning and proposed a model-based reinforcement learning algorithm to adjust the local model quantization and the number of local users in federate learning.

On the other hand, SAGIN has become a pivotal research direction for future wireless communications [19, 20, 21, 22]. In [19], the outage performance of the relaying scenario was analyzed with a closed form for SAGIN. To strike the balance between computational resource consumption and communication resource consumption, a heuristic greedy algorithm was proposed to design the virtual network functions and data routing in SAGINs. In [21], a queuing game model based method was introduced to enhance the network throughput and reduce the service delay in SAGINs. To minimize the energy consumption and latency for cloud and edge computing in SAGINs, a decision-assisted reinforcement learning algorithm was proposed in [22] to optimize the resource allocation. Considering a SAGIN’s global coverage capability and rapid dynamic deployment, it enables a more flexible participation of ground users in federated learning tasks, thus federated learning within SAGIN has been extensively researched in recent studies [23, 24]. In [23], a novel topology-aware framework was designed to speed up the convergence for federated learning in SAGINs. To jointly consider the requirements of adaptivity, communication efficiency and model security in federated learning, a tensor-empowered federated learning framework was proposed to meet these three requirements in SAGINs [24]. The existing research on federated learning within SAGINs has focused on two-layer federated learning architectures, failing to fully leverage the potential performance capabilities of nodes within the SAGIN framework.

However, the conventional two-layer federated learning framework design often fails to meet the demands in scenarios with dispersed user locations and unstable communication channels [25]. Moreover, the integration of federated learning with network edges in recent years enables low-latency edge intelligence at the data generation source. Therefore, multi-layer HFL with network edges has emerged as a highly promising research direction in wireless communications [25, 26, 27]. In [25], a semi-decentralized framework was proposed to utilize the edge servers to enhance the convergence performance of federated learning. To improve the convergence performance and reduce the round delay in federated learning, a hierarchical aggregation strategy was proposed to optimize the edge server scheduling and bandwidth allocation in wireless communications [26]. In [27], a hybrid deep reinforcement learning (DRL) algorithm was utilized to optimize the resource allocation strategy for fast convergence of federated learning in wireless communications. However, two significant challenges remain in the existing works: Firstly, the deployment of HFL utilizing space, air, and ground nodes within SAGINs as edge and cloud servers remains an area that has not been investigated. Secondly, while existing federated learning research has explored personalized multi-task learning with non-independent and identically distributed (non-i.i.d.) data [28], achieving balance in the convergence performance of multiple completely unrelated tasks which are repeatedly distributed in multiple users within constrained timeframes, such as the service times of SAGIN environments, continues to be an unaddressed issue.

I-C Motivation and Contributions

In this paper, we explore the integration of HFL in SAGINs. To the best of the authors’ knowledge, we address a research gap as the existing works have not considered the utilization of the HFL framework in SAGIN to leverage the performance of various nodes and balance the convergence performance of different tasks within a limited satellite service time. Our study also encompasses the investigation of the impact of dynamic deployment and user association within SAGIN on HFL. To optimize the communication rounds within HFL and balance the final convergence performance of different tasks, dynamic planning of unmanned aerial vehicles (UAVs), user association, access between satellites and UAVs, the numbers of local convergence iterations and the weights of models in aggregations are crucial for HFL in SAGINs. However, traditional algorithms struggle to adapt to dynamically changing communication environments. Thus, we propose a hybrid action space DRL-based algorithm to address this challenge. The main contributions of this paper are summarized as follows.

•

We first propose a novel framework of HFL in SAGINs, where UAVs are considered as edge servers, LEO satellites as cloud servers, and the use of ISL for aggregation between multiple satellites and transferring aggregated models. To counteract channel fading effects in ground-aerial communications, we introduce trajectory planning for UAVs. Moreover, we consider the time constraints of LEO satellite access to aerial platforms, which is crucial for completing tasks within a limited satellite service time in practical applications and has not been taken into account by most studies.
•

Our objective is to accomplish multiple federated learning tasks within a finite satellite service time while ensuring fairness among different task performances. Thus, we formulate a complex optimization problem in our proposed system, which includes UAV trajectory planning, dynamic adjustment of ground user-UAV pairing (uplink and downlink), UAV-satellite pairing (uplink and downlink), final aggregation selection between satellites, and optimization of weights in edge and cloud aggregation.
•

We introduce the distributional soft-actor-critic (DSAC) algorithm, leveraging its ability to consider long-term returns and mitigate the overestimation issue of Q-values using return distributions. Moreover, due to the large number of coupled discrete and continuous optimization variables in the proposed problem, we propose a decoupling and recoupling algorithm to enhance the DSAC’s performance. Finally, to ensure fairness, we design a dynamic adjusting reward function for the proposed algorithm to mitigate training progress deviations among different tasks.
•

Simulation results demonstrate the superior performance of our proposed algorithm, compared to several benchmarks. Through the design of a hybrid action space, the proposed DSAC algorithm fully learns the proposed system and obtains corresponding solutions. Moreover, the use of the dynamic adjusting reward design ensures fairness among multiple tasks.

The rest of this paper is summarized as follows: The system model including the communication model, satellite coverage model, federated learning framework and problem formulation are introduced in Section II. In Section III, a hybrid action space DRL algorithm with a dynamic adjusting reward function is proposed. Section IV verifies the performance of the proposed method for HFL in SAGINs via simulations. Finally, Section V summarizes this work.

II System Model and Problem Formulation

II-A System Model

Refer to caption — Figure 1: System model of a SAGIN.

TABLE I: Basic Symbol definitions for SAGINs

Definition	Symbol	Definition	Symbol
Ground User	$G_{k}$	All UAV’s Trajectory Set	$\mathcal{Q}$
UAV	$U_{m}$	UAV’s Trajectory	$q_{m}$
LEO	$S_{n}$	Number of UAVs	$M$
Task	$\mathbb{T}_{f}$	Number of LEOs	$N$
User-UAV Cluster	$\wp$	Number of Users	$K$
Distance	$d$	Number of Tasks	$F$
Dataset	$D$	Number of Users	$K$
Bandwidth	$B$	User-UAV Pairing Indicator	$z^{G}$
Channel Rate	$C$	Number of UAVs Connected to Satellite $S_{n}$	$MS_{n}$
Rician Factor	$\omega$	Number of Users in Cluster $\wp_{m}$	$K_{m}$
AWGN	$\sigma$	UAV-Satellite Pairing Indicator	$z^{U}$
Antenna Gain	$\xi$	Path Loss Exponents(NLoS)	$\tau_{N}$
Wavelength	$\lambda$	Path Loss Exponents(LoS)	$\tau_{L}$
Antenna Phase	$\varrho_{m}$	Edge Aggregation Weight	$\zeta^{f_{K}}$
Bessel Function	$\delta$	Cloud Aggregation Weight	$\zeta^{f_{M}}$
Thermal Noise	$\varphi$	Final Aggregation Weight	$\zeta^{f_{S}}$
Boltzmann Constant	$\varsigma$	Transmit Power of LEOs	$P_{S}$
Height of LEOs	$R_{S}$	Transmit Power of UAVs	$P_{U}$
Velocity of LEOs	$v_{L}$	Transmit Power of Ground Users	$P_{G}$
Final Aggregation LEO	$S^{f}$	User-UAV Pairing (Uplink)	$\Im_{u}$
Radius of the Earth	$R_{E}$	User-UAV Pairing (Dplink)	$\Im_{d}$
ISL Carrier Frequency	$F_{ISL}$	Minimum Elevation Angle	$\varpi$
Normalized Gain of LEO	$\eta^{d}$	Peak Gain of LEO	$\eta_{\rm max}$

As shown in Fig. 1, we consider a three-layer SAGIN to provide global communication service. In the proposed SAGIN, there are $K$ ground users $G_{k}$ ( $k\in\mathcal{K}=\{1,2,...,K\}$ ), $M$ UAVs $U_{m}$ ( $m\in\mathcal{M}=\{1,2,...,M\}$ ) and $N$ low Earth orbit (LEO) satellites $S_{n}$ ( $n\in\mathcal{N}=\{1,2,...,N\}$ ). UAVs can provide ground-to-air connectivity services for ground users and can also connect with satellites, each UAV can dynamically plan its trajectory to provide communication and computation services to ground users within its coverage area, we assume the 3D coordinates of UAV $U_{m}$ is $q_{m}(t)=\{x_{m}(t),y_{m}(t),z_{m}(t)\}$ ( $q_{m}\in\mathcal{Q}=\{q_{1},q_{2},...,q_{M}\}$ )) at a given time slot $t$ . However, due to the potential long distances between ground users, direct communication between them may not be feasible. Similarly, UAVs which provide services for ground users might not be able to communicate directly with each other. Furthermore, we consider that satellites are constantly in high-speed motion, thus the communication window between UAVs and LEO satellites is limited in time.

II-B Communication Model

In this work, we assume that ground users are divided into $M$ clusters, a cluster $\wp_{m}$ ( $m\in\{1,2,...,M\}$ ) has $K_{m}$ users, $\wp=\{\wp_{1},\wp_{2},...,\wp_{M}\}$ denotes the cluster set. The cluster $\wp_{m}$ is serviced by the UAV $U_{m}$ , and $z^{G}_{m,k}$ denotes ground user $G_{k}$ is assigned to the cluster $\wp_{m}$ . Due to the considerable distance between different clusters, direct communication between them is not feasible. We assume the channels between UAVs and ground users follow Rician fading [29], and UAV $U_{m}$ serves cluster $\wp_{m}$ which includes users $G_{k}$ , thus the channel coefficient $h_{m,k}$ between the UAV $U_{m}$ and the ground user $G_{k}$ is given as

\small h_{m,k}=\sqrt{\frac{\omega}{\omega+1}}\bar{H}_{m,k}+\sqrt{\frac{1}{% \omega+1}}\hat{H}_{m,k},

(1)

where $\omega$ indicates the Rician factor, $\hat{H}_{m,k}=\hat{\mathbb{g}}_{m,k}d_{m,k}^{-{\tau_{N}}/2}$ and $\bar{H}_{m,k}=\bar{\mathbb{g}}_{m,k}d_{m,k}^{-{\tau_{L}}/2}$ indicate the non-line-of-sight (NLoS) and the line-of-sight (LoS) channel coefficients, respectively. $d_{m,k}$ denotes the distance between UAV $U_{m}$ and ground user $G_{k}$ . $\tau_{N}$ and $\tau_{L}$ are the path loss exponents for NLoS and LOS channel, respectively. In the NLoS channel, the coefficient $\hat{\mathbb{g}}_{m,k}$ is formed by a zero-mean unit-variance Gaussian fading channel. Therefore, in a given cluster $\wp_{m}$ , the transmission rate of the uplink channel from $G_{k}$ to $U_{m}$ can be expressed as

\small C_{m,k}=B_{k,m}{\rm{log_{2}}}\left(1+\frac{P_{G_{k}}|h_{m,k}|^{2}}{\sum% _{i=1,i\neq k}^{K}P_{G_{i}}|h_{mi}|^{2}+{{\sigma}_{n}^{2}}}\right),

(2)

where $B_{k,m}$ indicates the bandwidth of the channel from $G_{k}$ to $U_{m}$ , $P_{G_{k}}$ denotes the transmit power of ground user $G_{k}$ , ${{\sigma}_{n}^{2}}$ denotes the variance of the additive-white-Gaussian-noise (AWGN) at $U_{m}$ . Similarly, the transmission rate of the downlink channel from $U_{m}$ to $G_{k}$ can be expressed as

\small C_{k,m}=B_{m,k}{\rm{log_{2}}}\left(1+\frac{P_{U_{m}}|h_{m,k}|^{2}}{{{% \sigma}_{n}^{2}}}\right),

(3)

where $B_{m,k}$ indicates the bandwidth of the channel from $U_{m}$ to $G_{k}$ , and $P_{U_{m}}$ denotes the transmit power of UAV $U_{m}$ . Notice that in federated learning, the uplink and downlink transmissions occur asynchronously, and the edge server (e.g. $U_{m}$ ) broadcasts the aggregated model to all ground users in downlink transmissions. As such, there is no interference in this stage. On the other hand, we assume that a UAV can establish communication with only one LEO satellite during any given time slot, $z^{U}_{n,m}$ denotes UAV $U_{m}$ can communicate with satellite $S_{n}$ and $MS_{n}$ denotes the number of UAVs connected to satellite $S_{n}$ . Thus, the channel coefficient between satellite $S_{n}$ and UAV $U_{m}$ can be expressed as [30]

\small\hat{h}_{n,m}=\frac{\sqrt{\xi_{m}}\lambda}{4\pi d_{n,m}}e^{j{\varrho_{m}% }},

(4)

where $\xi_{m}$ indicates the antenna gain, $\lambda$ is the wavelength, $d_{n,m}$ denotes the distance between the satellite and the UAV, $\varrho_{m}$ indicates the phase. In this work, we consider the impact of outdated channel state information (CSI) due to the considerable distance between satellites and aerial platforms. The outdated CSI is formulated as [31]

\small h_{n,m}=\delta\hat{h}_{n,m}+\sqrt{1-\delta^{2}}\hat{g}_{n,m},

(5)

where $\delta=\hat{\kappa}(2\pi{\hat{D}}_{n,m}T_{n,m})$ , $\hat{\kappa}$ denotes the Bessel function of the first kind of order 0, ${\hat{D}}_{n,m}$ and $T_{n,m}$ indicate the maximum Doppler frequency and the transmissions delay between the satellite and the UAV, respectively. $\hat{g}_{n,m}$ represents a complex Gaussian random variable with an equivalent variance to that of $\hat{h}_{n,m}$ . Therefore, the transmission rate of the uplink channel from UAV $U_{m}$ to satellite $S_{n}$ is given as

\small C_{m,n}=B_{m,n}{\rm{log_{2}}}\left(1+\frac{P_{U_{m}}|h_{n,m}|^{2}}{\sum% _{i=1,i\neq m}^{M}P_{U_{i}}|h_{n,i}|^{2}+{{\sigma}_{l}^{2}}}\right),

(6)

where $B_{m,n}$ is the bandwidth of the channel from $U_{m}$ to $S_{n}$ , $\sum_{i=1,i\neq m}^{M}P_{U_{i}}|h_{n,i}|^{2}$ is the interference from other UAVs, ${\sigma}_{l}^{2}$ is AWGN at $S_{l}$ . Similarly, the transmission rate of the downlink channel from $S_{n}$ to $U_{m}$ can be expressed as

\small C_{n,m}=B_{n,m}{\rm{log_{2}}}\left(1+\frac{P_{S_{n}}|h_{n,m}|^{2}}{{{% \sigma}_{l}^{2}}}\right),

(7)

where $B_{n,m}$ is the bandwidth of the channel from $S_{n}$ to $U_{m}$ , $P_{S_{n}}$ denotes the transmit power of satellite $S_{n}$ in downlink transmissions. Similar to the downlink transmission between UAVs and ground users, there is no interference at this stage. Moreover, there are inter-satellite links (ISLs) between satellites in current satellite networks, we thus also consider inter-satellite communications via ISLs. The transmission data rate of the ISL channel [32] between two satellites $S_{a}$ and $S_{b}$ can be expressed as

\small C_{ab}=B_{a,b}{\rm{log_{2}}}\left(1+\frac{P_{S_{a}}|\eta_{\rm max}|^{2}% }{\varsigma\varphi B_{ab}(\frac{{4\pi d_{ab}F_{ISL}}}{c})^{2}}\right),

(8)

where $B_{a,b}$ indicates the bandwidth of channel from $S_{a}$ to $S_{b}$ , $P_{S_{a}}$ is the transmit power for the transmitter satellite $S_{a}$ . $\eta_{\rm max}={\rm max}_{J(a,b)}{\eta^{d}_{a,b}}$ represents the peak gain of the $S_{a}$ antennas in the direction of their main lobe, $J(a,b)$ denotes the relative direction between $S_{a}$ and $S_{b}$ , $d_{a,b}$ indicates the distance $S_{a}$ and $S_{b}$ , $\eta^{d}_{a,b}$ is the normalized gain of the satellite antennas in the transmission direction, $\varsigma$ denotes the Boltzmann constant, $\varphi$ is the thermal noise, $F_{ISL}$ denotes the ISL carrier frequency, and $c$ indicates the speed of light. It is noteworthy that interference can be omitted by ensuring that the inter-plane ISL antennas utilize sufficiently narrow beams along with precise beam steering or antenna pointing capabilities [32].

II-C Satellite Coverage Model

Considering that LEO satellites are constantly in high-speed motion, each LEO provides a limited time window for aerial (UAV) access. We consider the coverage area of satellites as illustrated in Fig. 2, where the motion path length of an LEO satellite during the coverage access time for a specific UAV is as [33]

\small R_{C}=2(R_{E}+R_{S})\bigg{(}\arccos\Big{(}\frac{R_{E}}{R_{E}+R_{S}}\cos% \varpi\Big{)}-\varpi\bigg{)},

(9)

where $R_{E}$ denotes the radius of the earth, $R_{S}$ indicates the height of the LEO satellite, $\varpi$ denotes the minimum coverage elevation angle for the LEO satellite. Therefore, we can obtain the total communication time for the LEO satellite and UAV as

\small T_{c}=\frac{R_{C}}{v_{L}},

(10)

where $v_{L}$ denotes the orbital velocity of the LEO satellite. In this work, considering the spacing between LEO satellites, there are differences in the remaining service time between each LEO satellite and the same UAV at the beginning. Moreover, due to the negligible altitude of the UAV compared to the altitude of LEO satellites and the radius of the Earth, we assume the altitude of the UAVs can be disregarded in the computation of LEO coverage time window.

II-D Federated Learning Model

In the proposed federated learning framework, we assume there are $F$ distinct tasks $\mathbb{T}_{f}$ ( $f\in{1,2,...,F}$ ) that need to be trained in the SAGIN. The local datasets of each ground user consists of $F$ datasets $D_{k}=\{D^{1}_{k},D^{2}_{k},...,D^{F}_{k}\}$ , corresponding to $F$ tasks. Notice that a zero-sized dataset indicates that the ground user cannot contribute to the corresponding task. For example, if $D^{F}_{k}$ is a zero-sized dataset, the ground user $G_{k}$ cannot participate in the training of task $\mathbb{T}_{F}$ . Furthermore, we assume that each ground user can train only one iteration of one task per time slot. Therefore, the federated learning training process consists of four stages for a specific task $\mathbb{T}_{f}$ : local training and edge aggregation, cloud aggregation and global update.

II-D1 Local Training and Edge Aggregation

In the local training phase, user $G_{k}$ in cluster $\wp_{m}$ selects task $\mathbb{T}_{f}$ for training and updates its corresponding local model. Subsequently, user $G_{k}$ uploads the updated local model with the parameter $\mu^{f}_{k}$ to the corresponding UAV $U_{m}$ . Considering the existence of multiple tasks, the local model being trained and the local model being uploaded may not belong to the same task. Therefore, the uploaded local model may have experienced multiple iterations of local training. We assume that the model size of task $\mathbb{T}_{f}$ is $DM_{f}$ . However, due to the dynamic nature of wireless channels and the dynamic trajectory planning of UAVs, the uplink rates between ground users and UAVs are not constant in SAGINs which leads to variations in upload time as time varies. Therefore, UAV trajectory planning is crucial in this phase as it can mitigate the fluctuations in upload rate caused by channel variations. Subsequently, the edge server (UAV) $U_{m}$ will operate edge aggregation after receiving the local models from ground users within its served area. The aggregation equation is given by

\displaystyle\mu^{f}_{m}=\frac{1}{K_{m}}\sum_{i=1}^{K_{m}}\zeta^{f_{K}}_{i}\mu% ^{f}_{i},

(11)

where $K_{m}$ denotes the number of ground users in this cluster, $\zeta^{f_{K}}_{i}$ denotes the aggregation weight for the local model from ground user $G_{i}$ . Due to the influence of the dynamic environment and different locations of ground users, the uplink transmission rate of ground users are different, thus the time required for uploading models are different among ground users. Furthermore, due to the presence of multiple tasks in federated learning, the number of training iterations experienced by the models uploaded by each ground user during each edge aggregation stage may vary. Moreover, considering the non-i.i.d. distribution of local datasets in a specific task, ground user’s local datasets have different impact on the aggregated model. Therefore, the weight of local models in the aggregation is crucial to the result, and relying on traditional federated learning aggregation algorithms such as FedAvg’s weighted average aggregation method can not fully optimize the model in SAGIN.

II-D2 Cloud Aggregation

In this work, the LEO satellites are assumed to be the cloud servers for HFL in SAGINs. Each UAV can select one satellite within its service time for uploading. Each satellite receiving the edge model transmitted by UAVs will perform cloud aggregation, and then transmit it to a specific LEO satellite for final aggregation. The cloud model at LEO $S_{i}$ can be expressed as

\displaystyle\mu^{f}_{i}=\frac{1}{M_{i}}\sum_{m=1}^{M_{i}}\zeta^{f_{M}}_{m}\mu% ^{f}_{m},

(12)

where $M_{i}$ denotes the number of UAVs which upload edge models to LEO $S_{i}$ , $\zeta^{f_{M}}_{m}$ denotes the aggregation weight for the local model from UAV $U_{m}$ . Thus, the final aggregation at LEO satellite $S_{n}$ can be expressed

\displaystyle\mu^{f}_{n}=\frac{1}{M_{n}+N_{n}}\Big{(}\sum_{m=1}^{M_{n}}\zeta^{% f_{S}}_{m}\mu^{f}_{m}+\sum_{j=1}^{N_{n}}\zeta^{f_{S}}_{j}\mu^{f}_{j}\Big{)},

(13)

where $\mu^{f}_{n}$ denotes the global model, $M_{n}$ denotes the number of UAVs which upload edge models to LEO $S_{n}$ , $N_{n}$ denotes the number of other LEO satellites which receive edge models from other UAVs, $\zeta^{f_{S}}_{j}$ denotes the aggregation weight for the local model from satellite $S_{j}$ . Similar to the edge aggregation phase, the transmission rate between the air and space layer also varies due to the high-speed motion of satellites. Moreover, the model upload between UAVs and LEO satellites presents similar challenges as the previous phase, such as non-i.i.d. data distribution from ground users, different training iterations among different ground users, and differing aggregation iterations among different edge servers (UAVs).

II-D3 Global Update

After an LEO satellite has aggregated all the models as the global model for a given task, it needs to broadcast the global model back to ground users. Considering the service time constraints of LEO satellites, an LEO with the global model may not be able to access all UAVs directly. In such cases, the global model may need to be transmitted to other satellites via ISLs and then relayed to corresponding UAVs. Subsequently, the UAVs transmit the global models to ground users within their coverage areas. Ground users continue to repeat the above steps after updating their local models by using the received global model until final convergence.

II-E Problem Formulation

Our objective is to maximize the average performance (e.g. minimizing the loss in classification tasks) of all tasks while ensuring fairness among different tasks, i.e., reducing the discrepancy in performance across tasks. In the proposed SAGIN, each ground user is allocated in a cluster and may participate in one or more or even all tasks, while the service time of satellites is limited. Moreover, the dynamic nature of the wireless communication environment such as channel fading, affects the speed of model uploads and downloads. Therefore, it is necessary to plan the UAV trajectories to counteract the effects of channel fading and to optimize the clustering between ground users and UAVs. This strategy aims to enhance the communication efficiency between the ground layer and the aerial layer for the proposed HFL system.

Furthermore, UAVs can select whether all users within their cluster area participate in ground-aerial communications based on the progress of task training (fewer users can reduce interference). For example, if considerable progress is made in training a task, UAVs can prioritize communication with users associated with other tasks. Similarly, satellites can choose whether to access UAVs within their coverage area based on the progress of task training, considering that the data distribution of user groups served by different UAVs is uneven.

Moreover, the ISLs between satellites create a collaborative network for SAGINs. However, the selection of the LEO satellite for uploading models from UAVs and the selection of the final aggregation node between satellites also affects the total training rounds. Besides, within the limited satellite service time, multiple rounds of global aggregation need to be performed for each task, inevitably resulting in some satellites being unable to cover UAVs in the later stages. Therefore, selecting the final aggregation satellite node and utilizing ISL to reduce communication time and improve aggregation efficiency are crucial for HFL in SAGINs.

For the downlink transmission in the global update stage, the pairing issue between LEO satellites and UAVs also exists due to the variations in service time and rate caused by the high mobility of satellites. Furthermore, the trajectory of UAVs affects the efficiency of broadcasting the global model to ground users within their service clusters. Therefore, the pairing issue and the UAV trajectories are also crucial for the communication efficiency during the broadcasting stage.

Therefore, to fairly maximize the performance of all tasks within the limited service time of the LEO satellites, we need to optimize: 1) the trajectory of each UAV; 2) the pairing between UAVs and users (uplink); 3) the pairing between UAVs and satellites (uplink); 4) the selection of the final aggregation satellite, 5) the pairing between satellites and UAVs (downlink); 6) the weights for edge aggregation on UAVs; 7) the weights for cloud aggregation on satellites; and 8) the weights for final aggregation. We formulate the optimization problem as

$\displaystyle\mathbb{\rm(P1)}:$	$\displaystyle\min_{\mathcal{Q}(t),\wp(t),\Im_{u}(t),S^{f}(t),\Im_{d}(t),\zeta^% {f_{K}}(t),\zeta^{f_{M}}(t),\zeta^{f_{S}}(t)}$
	$\displaystyle\sum_{f=1}^{F}\sum_{t=1}^{T_{f}}\sum_{n=1}^{N}\sum_{m=1}^{M}{% \mathcal{L}}_{k,f}(\psi_{k,f},D_{k,f}),$	(14)
$\displaystyle{\rm s.t.}$	$\displaystyle~{}\sum_{m=1}^{M}K_{m}=K,$	(14a)
	$\displaystyle\sum_{m=1}^{M}z^{G}_{m,k}=1,\forall k\in{\mathcal{K}},$	(14b)
	$\displaystyle\sum_{n=1}^{N}z^{U}_{n,m}=1,\forall n\in{\mathcal{N}},$	(14c)
	$\displaystyle\sum_{n=1}^{N}MS_{n}=M,$	(14d)
	$\displaystyle\sum_{i=1}^{K_{m}}\zeta^{f_{K}}_{i}=1,\forall\wp_{m}\in{\wp},$	(14e)
	$\displaystyle\sum_{i=1}^{MS_{n}}\zeta^{f_{M}}_{n}=1,\forall n\in{\mathcal{N}},$	(14f)
	$\displaystyle\sum_{n=1}^{N}\zeta^{f_{S}}_{n}=1,$	(14g)
	$\displaystyle v_{m}\leq v_{\max},$	(14h)
	$\displaystyle z_{m}\geq z_{\min}~{}\&~{}z_{m}\leq z_{\max},$	(14i)

where $T_{f}$ denotes the number of training rounds in $f$ -th federated learning task, $\Im_{u}(t)=\{\Im^{u}_{1}(t),\Im^{u}_{2}(t),...,\Im^{u}_{M}(t)\}$ denotes the pairing between UAVs and satellites for uplink transmissions, $S^{f}$ denotes the final aggregation satellite, $\Im_{d}(t)=\{\Im^{d}_{1}(t),\Im^{d}_{2}(t),...,\Im^{d}_{M}(t)\}$ denotes the pairing between UAVs and satellites for downlink transmissions. $\zeta^{f_{K}}$ , $\zeta^{f_{M}}$ and $\zeta^{f_{S}}$ denotes the aggregation weights for edge aggregation, cloud aggregation and final aggregation, respectively. ${\mathcal{L}}_{k,f}(\psi_{k,f},D_{k,f})$ denotes the local loss of $f$ th task for ground user $G_{k}$ with local model parameter $\psi_{k,f}$ and local dataset $D_{k,f}$ . (14a) shows that all ground users are divided into $M$ clusters. (14b) indicates that a ground user is assigned to only one cluster. (14c) and (14d) denote that each UAV need to access to one satellite. (14e), (14f) and (14g) shows that the rule of the aggregation weights. (14h) indicates the speed limitation of UAVs, and (14i) presents the altitude constraint for UAVs.

The above optimization problem includes eight variables, comprising both continuous and discrete variables, and some optimization variables are interdependent (e.g., the UAV’s trajectory and the UAV-ground user pairing). Thus, it is clear that this is a mixed-integer nonlinear optimization problem and is highly complex. Considering that our proposed system involves multi-round hierarchical aggregation in federated learning, it also represents a long-term planning problem. Traditional algorithms wouldn’t be efficient in addressing this problem, and real-time computational complexity is a critical concern in dynamic wireless communication scenarios. To tackle this issue we employ a DRL algorithm which is well-suited for this kind of optimization problem and exhibits low computational complexity after deployment. Moreover, we explore a hybrid action space framework to enhance the DRL performance and consider the fairness among different tasks in reward design.

III Hybrid Action Space DRL Algorithm with Fairness Reward Function

The proposed problem in this work is a long-term optimization problem which can be effectively addressed by using DRL. However, the presence of several continuous and discrete optimization variables in the optimization problem makes DRL optimization difficult to converge. Some existing approaches tried to tackle this issue of DRL by discretizing or continuousizing all optimization variables. However, these methods sacrifice a portion of the optimization space, reducing the system’s controllability, and may lead to challenges in exploring the huge action space when exploring different possibilities [34]. Therefore, we employ a decoupling approach to separate the hybrid optimization variables and assign them to different agents within DRL for learning. Subsequently, to efficiently combine the optimization results from different agents, we utilize a maximum posteriori policy based algorithm to mix the discretized and continuous variables after optimization, providing an effective solution to the proposed optimization problem. Furthermore, considering the fairness among different tasks, we design a fair and effective reward scheme to ensure that each task achieves satisfactory training results.

III-1 MDP Design in DRL

We utilize a DRL algorithm named DSAC [35] to address the optimization problem. Firstly, we need to model the proposed system as a Markov decision process (MDP). The MDP encompasses states, actions, and rewards. The state $s(t)$ at time slot $t$ in this paper is given by

	$\displaystyle s(t)=$	$\displaystyle~{}\{\{h_{k,m}(t)\}_{k\in\mathcal{K},m\in\mathcal{M}},\{q_{m}(t)% \}_{q_{m}\in\mathcal{Q}},$		(15)
		$\displaystyle~{}\{T_{c(m,n)}(t)\}_{m\in\mathcal{M},n\in\mathcal{N}},{Acc}_{i}(% t)_{i\in\{\mathcal{M},\mathcal{N}\}}\},$		(15)

where $\{h_{k,m}(t)\}_{k\in\mathcal{K},m\in\mathcal{M}}$ indicates the channel coefficients of ground-aerial links, $\{q_{m}(t)\}_{q_{m}\in\mathcal{Q}}$ denotes the UAV’s trajectories, $\{T_{c(m,n)}(t)\}_{m\in\mathcal{M},n\in\mathcal{N}}$ is the remain service time of space-aerial links, ${Acc}_{i}(t)_{i\in\{\mathcal{M},\mathcal{N}\}}$ denotes the test performance in each aggregation node (UAVs and LEO satellites). Furthermore, the action $a(t)$ at time slot $t$ is expressed as

\displaystyle a(t)=\{\mathbb{Q}(t),\wp(t),Im_{u}(t),S^{f}(t),Im_{d}(t),\zeta^{% f_{K}}(t),\zeta^{f_{M}}(t),\zeta^{f_{S}}(t)\},

(16)

where $\mathbb{Q}(t)$ denotes the UAV trajectory planning, $\wp(t)$ is the pairing for ground users and UAVs in uplink transmissions, $Im_{u}(t)$ denotes the pairing for UAVs and LEO satellites in uplink transmissions, $S^{f}(t)$ denotes the selection of the final aggregation node, $Im_{d}(t)$ indicates the pairing for LEO satellites and UAVs in downlink transmissions. $\zeta^{f_{K}}(t)$ , $\zeta^{f_{M}}(t)$ and $\zeta^{f_{S}}(t)$ presents the weights of models in edge, cloud and final aggregations, respectively.

Moreover, we need to design a reward function for the MDP to provide feedback to the agents of the DRL algorithm. Since our objective is to enhance the average performance of all tasks while ensuring the fairness, our reward design needs to consider fairness in task progress to guarantee that all tasks converge to satisfactory performance after the satellites service time ends. To this end, we design the reward function for $F$ tasks as :

	$\displaystyle r(t)=$	$\displaystyle~{}\frac{\alpha}{F}\sum_{f=1}^{F}\frac{\gamma_{f}(t)/\epsilon_{c1% }}{\epsilon_{f}}$		(17)
		$\displaystyle+\frac{(1-\alpha)}{F}\sum_{f=1}^{F}\frac{\gamma_{f}(t)/\epsilon_{% c2}}{\epsilon_{f}+\|\frac{1}{F}\sum_{j=1}^{F}\gamma_{j}(t)/\gamma_{f}(t)-1\|},$		(17)

where $\alpha$ denotes the time decay factor used to adjust the inevitable initial imbalance in task progress during access and the later requirement for overall balance, we define $\alpha=\beta^{t}$ , where $\beta$ is the decay factor and $\alpha$ is set as 1 at the beginning. $\gamma_{f}(t)$ denotes the performance of task $\mathbb{T}_{f}$ at time slot $t$ , $\epsilon_{c1}$ and $\epsilon_{c1}$ are constants used to normalize the rewards for tasks, $\epsilon_{f}$ is a constant introduced to prevent zero value of the denominator when the reward bias is zero.

The designed reward function can be divided into two parts. At the beginning of the federated learning training process, it’s not feasible to achieve balance among all tasks when the proposed system starts operating. Therefore, the focus is more on optimizing the performance of all tasks. As time progresses, the emphasis of rewards gradually shifts towards the second part of the reward function. The designed reward bias ensures that tasks deviating too much from the performance mean receive certain penalties, while those approaching the performance mean contribute significantly to rewards. Thus, the designed reward function dynamically adjusts the reward values based on task performance and runtime, ensuring fairness among tasks while optimizing the performance of all tasks.

III-2 DSAC-Based Optimization

Existing research in reinforcement learning has started to shift towards distributional algorithm studies because distributional DRL can estimate the entire distribution of total returns rather than just the expected distribution. This is highly beneficial for the convergence of long-term rewards. However, the relationship between learning the return distribution and predicting values has not been discussed in traditional distributional DRL. Therefore, we introduce the DSAC algorithm which reduces the impact of overestimation by learning the distribution function of state–action returns. Firstly, the algorithm introduces the distributional idea into SAC, the return for state-action pairs in SAC can be represented as:

\displaystyle\theta^{\pi}(s(t),a(t))=r(t)+\vartheta\iota(t+1),

(18)

where $\pi$ denotes the policy in DRL, $\vartheta$ is the discount factor, $\iota(j)=\sum_{j=t}^{\infty}\vartheta^{(}j-t)[r(j)-\nu{\rm log}\pi(a(j)|s(j))]$ , where $\nu$ is an importance factor related to the entropy [35]. The Q-value in DSAC can be expressed as

\displaystyle Q^{\pi}(s(t),a(t))=\mathbb{E}\big{[}\theta^{\pi}(s(t),a(t))\big{% ]},

(19)

where ${\mathbb{E}}[\cdot]$ indicates the expectation. Compared to the expected state-action returns in (19), we can directly consider using soft returns in (18) to construct the algorithm. Considering the maximum entropy in this case, we can define the distribution of the Bellman operator as

	$\displaystyle\upsilon\theta^{\pi}(s(t),a(t))=$	$\displaystyle~{}r(t)+\vartheta\big{(}\theta^{\pi}(s(t+1),a(t+1))$		(20)
		$\displaystyle-\nu{\rm log}\pi(a(t+1)\|s(t+1))\big{)}.$		(20)

Then, we can update the soft return distribution based on (20) as

\displaystyle\hat{\theta}_{new}={\rm argmin}\mathbb{E}\big{[}d_{D}(\upsilon% \hat{\theta}_{old}(\cdot|s(t),a(t)),\hat{\theta}(\cdot|s(t),a(t)))\big{]},

(21)

where $d_{D}$ is the distance between new and old distributions which can be expressed by Kullback–Leibler (KL) divergence [36]. To update the soft return policy, we form the update function as

\displaystyle\Theta_{\hat{\theta}}(\chi)=-\mathop{\mathbb{E}}\limits_{\aleph_{% \Theta_{\hat{\theta}}(\chi)}}\Big{[}{\rm log}\digamma\big{(}Psi^{\pi_{\phi^{% \prime}}}\theta(s(t),a(t))|\hat{\theta}(\cdot|s(t),a(t))\big{)}\Big{]},

(22)

where $\chi$ and $\phi$ are the parameters of the distributional value function $\hat{\theta}(\cdot|s(t),a(t))$ and the agent’s behavior policy $\pi_{\phi}(\cdot|s)$ , respectively. $\aleph_{\Theta_{\hat{\theta}}(\chi)}=(s(t),s(t+1),a(t),r(t))\sim\mathbb{B},a(t% +1)\sim\pi_{\phi_{\prime}},\theta(s(t+1),a(t+1))\sim\hat{\theta}_{\chi^{\prime% }}(\cdot|s(t+1),a(t+1))$ , $\digamma$ is the probability distribution between states and actions, $\mathbb{B}$ denotes the experience buffer in DRL, $\chi_{\phi^{\prime}}$ and $\phi_{\phi^{\prime}}$ denote the parameters of the target return and target behavior policy, respectively. Then we can update $\chi$ by using the following equation

	$\displaystyle\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)=$	$\displaystyle~{}-\mathop{\mathbb{E}}\limits_{\aleph_{\nabla_{\chi}\Theta_{\hat% {\theta}}(\chi)}}\Big{[}\nabla_{\chi}{\rm log}\digamma\big{(}Psi^{\pi_{\phi^{% \prime}}}\theta(s(t),a(t))$		(23)
		$\displaystyle~{}\|\hat{\theta}(\cdot\|s(t),a(t))\big{)}\Big{]},$		(23)

where $\aleph_{\Theta_{\hat{\theta}}(\chi)}=(s(t),s(t+1),a(t),r(t))\sim\mathbb{B},a(t% +1)\sim\pi_{\phi_{\prime}},\theta(s(t+1),a(t+1))\sim\hat{\theta}_{\chi^{\prime}}$ . We assume $\hat{\theta}_{\chi}$ has a Gaussian distribution, thus we can obtain

	$\displaystyle\nabla_{\chi}\Theta_{\hat{\theta}}(\chi)=$	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{\aleph_{\nabla_{\chi}\Theta_{\hat{% \theta}}(\chi)}}\Big{[}-\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial Q_{% \chi}(s(t),a(t))}\nabla_{\chi}Q_{\chi}(s(t),a(t))$		(24)
		$\displaystyle~{}-\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(% s(t),a(t))}\nabla_{\chi}\rho_{\chi}(s(t),a(t))\Big{]},$		(24)

where

\displaystyle\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial Q_{\chi}(s(t),a% (t))}=\frac{\upsilon^{\pi_{\phi_{\prime}}}\theta(s(t),a(t))-Q_{\chi}(s(t),a(t)% )}{\rho_{\chi}(s(t),a(t))^{2}},

(25)

	$\displaystyle\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(s(t)% ,a(t))}=$	$\displaystyle~{}\frac{\upsilon^{\pi_{\phi_{\prime}}}\theta(s(t),a(t))-Q_{\chi}% (s(t),a(t))}{\rho_{\chi}(s(t),a(t))^{3}}$		(26)
		$\displaystyle~{}-\frac{1}{\rho_{\chi}(s(t),a(t))},$		(26)

where $\rho_{\chi}(s(t),a(t))$ is a standard deviation of the Gaussian distribution for $\theta(\cdot|s(t),a(t))$ . From (26), it can be seen that the Q-value is very easy to be overestimated during updates. Moreover, when the standard deviation of the Gaussian distribution tends towards zero or infinity, gradient computation issues can arise. Therefore, we use the following equation to constrain the standard deviation

\displaystyle\rho_{\chi}(s(t),a(t))=\max\Big{(}\rho_{\chi}(s(t),a(t)),\rho_{% \min}\Big{)},

(27)

where $\rho_{\min}$ is the limitation factor. Another constraint method is to clip $\upsilon^{\pi_{\phi_{\prime}}}\theta(s(t),a(t))$ to control the range of the updated Q-values, with the clipping scheme as

	$\displaystyle\frac{\partial\Psi_{\hat{\theta}}(\chi)}{\partial\rho_{\chi}(s(t)% ,a(t))}=$	$\displaystyle~{}\frac{\jmath-Q_{\chi}(s(t),a(t))}{\rho_{\chi}(s(t),a(t))^{3}}$		(28)
		$\displaystyle~{}-\frac{1}{\rho_{\chi}(s(t),a(t))},$		(28)

where

\displaystyle\jmath=\ell\Big{(}\upsilon^{\pi_{\phi_{\prime}}}\theta(s(t),a(t))% ,Q_{\chi}(s(t),a(t))-\emptyset,Q_{\chi}(s(t),a(t))+\emptyset\Big{)},

(29)

where $\ell[l,i,j]$ indicates $l$ is clipped between $[i,j]$ , $\emptyset$ is the clipping factor. Subsequently, we adopt the policy update equation in DSAC as

\displaystyle\Theta_{\pi}(\phi)=\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}% ,a(t)\sim\pi_{\phi}}\Big{[}Q_{\chi}(s(t),a(t))-\nu{\rm log}(\pi_{\phi}(a(t)|s(% t)))\Big{]}.

(30)

Then, we can update $\phi$ by

	$\displaystyle\nabla_{\phi}\Theta_{\pi}(\phi)=$	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B},\Re}\Big{[}-\nu% \nabla_{\phi}{\rm log}(\pi_{\phi}(a(t)\|s(t)))\big{(}\nabla_{a(t)}Q_{\chi}(s(t),$		(31)
		$\displaystyle~{}a(t))-\nu\nabla_{a(t)}{\rm log}(\pi_{\phi}(a(t)\|s(t)))\big{)}% \nabla_{\phi}\imath(\Re;s(t))\Big{]},$		(31)

where $\imath(\Re;s(t))=\hat{a}+\Re\bigodot\bar{a}$ , $\Re$ is sampled from a fixed distribution, $\bigodot$ is Hadamard product, $\hat{a}$ and $\bar{a}$ are the mean and standard deviation of $\pi_{\phi}(\cdot|s(t))$ , respectively.

Moreover, we add a noise in the exploration of DRL as

\displaystyle a(t)=V_{\phi}(\varepsilon(t);s(t)),

(32)

where $V_{\phi}(i;j)$ indicates the random selection from noise and policy network, $\varepsilon$ is a Gaussian distributed noise. The structure of DSAC is shown in Fig. 4.

III-3 Hybrid Action Space in DSAC

As mentioned in previous discussion, optimization problems that involve both discrete and continuous actions pose to a challenge to traditional DRL algorithms. Whether discretize or make all actions continuous, it leads to a decrease in system control capability and loss of optimization performance. Therefore, we introduce a decoupling strategy to decompose discrete and continuous actions in the optimization variables. Actions and policies are re-expressed as

$\displaystyle A$	$\displaystyle~{}=\{A_{d},A_{c}\},$	(33)
$\displaystyle Q_{\phi}(a\|s)$	$\displaystyle~{}=Q^{A_{d}}_{\phi}(a^{A_{d}}\|s)Q^{A_{c}}_{\phi}(a^{A_{c}}\|s)$
	$\displaystyle~{}=\prod_{a_{i}\in A_{d}}Q^{A_{d}}_{\phi}(a_{i}\|s)\prod_{a_{j}% \in A_{c}}Q^{A_{c}}_{\phi}(a_{j}\|s),$	(34)

respectively, where $A_{d}$ and $A_{c}$ indicate the discrete and continuous action space, respectively. Upon decomposing the optimization variables into discrete and continuous actions, we assign these two parts to different DRL agents for control, each based on an independently trained DSAC networks. Therefore, we formulate the update functions of value networks for discrete and continuous actions in DSAC networks as

\displaystyle\Theta_{Q^{A_{d}}_{\chi}}=\mathop{\mathbb{E}}\limits_{(s(t),a(t))% \sim\mathbb{B}}\bigg{[}\frac{1}{2}\Big{(}{Q^{A_{d}}_{\chi}}\big{(}s(t),a(t)% \big{)}-{Q}^{A_{d}}_{\chi^{\prime}}(s(t),a(t))\Big{)}^{2}\bigg{]},

(35)

\displaystyle\Theta_{Q^{A_{c}}_{\chi}}=\mathop{\mathbb{E}}\limits_{(s(t),a(t))% \sim\mathbb{B}}\bigg{[}\frac{1}{2}\Big{(}{Q^{A_{c}}_{\chi}}\big{(}s(t),a(t)% \big{)}-{Q}^{A_{c}}_{\chi^{\prime}}(s(t),a(t))\Big{)}^{2}\bigg{]},

(36)

respectively. The update function of policy networks for discrete and continuous actions can be expressed as

	$\displaystyle\Theta_{Q^{A_{d}}_{\phi}}=$	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}\bigg{[}{\rm log% }Q^{A_{d}}_{\phi}(V_{\phi}\big{(}\varepsilon(t);s(t))\|s(t)\big{)}$		(37)
		$\displaystyle-Q^{A_{d}}_{\chi}\big{(}s(t),f_{\phi}(\varepsilon(t);s(t))\big{)}% \bigg{]},$		(37)

	$\displaystyle\Theta_{Q^{A_{c}}_{\phi}}=$	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}\bigg{[}{\rm log% }Q^{A_{c}}_{\phi}(V_{\phi}\big{(}\varepsilon(t);s(t))\|s(t)\big{)}$		(38)
		$\displaystyle-Q^{A_{c}}_{\chi}\big{(}s(t),f_{\phi}(\varepsilon(t);s(t))\big{)}% \bigg{]},$		(38)

respectively. After decoupling the actions and obtaining samples from the environment separately by different agents, we need to re-couple the continuous and discrete action spaces. Therefore, we design a new policy $\phi_{c}$ for the proposed DSAC algorithm [37], the update equation of $\phi_{c}$ can be expressed as

\displaystyle\Theta_{Q_{\phi_{c}}}=\mathop{\mathbb{E}}\limits_{\phi_{c}(a(t)|s% (t))}[\bar{Q}(s(t),a(t))],

(39)

where $\bar{Q}$ indicates a new value network trained from the experience buffer [38]. When coupling the hybrid action space, we also need to consider the deviation during policy updates. Therefore, we can obtain

\displaystyle\mathop{\mathbb{E}}\limits_{s(t)\sim\mathbb{B}}[u(Q_{\phi_{c}}(a(% t)|s(t))||Q_{\phi^{\prime}_{c}}(a(t)|s(t)))]<W,

(40)

where $u$ indicates the KL divergence, $\phi^{\prime}_{c}$ denotes the old policy of $\phi_{c}$ , and $W$ is a threshold factor to avoid deviation. The update function of the hybrid policy can be expressed as

$\displaystyle\hat{\phi}_{c}=~{}$	$\displaystyle\arg\max_{\phi_{c}}\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}[u% (Q_{\phi_{c}}(a\|s)\|\|Q^{A_{d}}_{\phi_{c}}(a^{A_{d}}\|s)Q^{A_{c}}_{\phi_{c}}(a^{A% _{c}}\|s))],$	(41)
$\displaystyle{\rm s.t.}$	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}[u(Q^{A_{d}}_{\phi% ^{\prime}_{c}}(a^{A_{d}}\|s)\|\|Q^{A_{d}}_{\phi_{c}}(a^{A_{d}}\|s))]<W_{A_{d}},,$	(41a)
	$\displaystyle~{}\mathop{\mathbb{E}}\limits_{s\sim\mathbb{B}}[\frac{1}{Z}\sum_{% z=1}^{Z}u(Q^{A_{c}}_{\phi^{\prime}_{c}}(a^{A_{c}}\|s)\|\|Q^{A_{c}}_{\phi_{c}}(a^{% A_{c}}\|s))]<W_{A_{c}},$	(41b)

where $W_{c}$ and $W_{d}$ are the threshold for continuous and discrete action space, respectively. $Z$ represents the value of the discrete action space. Therefore, we employed a decoupling strategy to allocate the coupled optimization variables to different agents, and then coupled the optimized action spaces to form a hybrid solution. The proposed decoupling algorithm re-couples the variables after decoupling, providing DRL with stronger capabilities to explore the relationship between actions and states. Moreover, it is noticed that the prediction complexity in DRL is substantially lower than the training complexity. Once deployed, the pre-trained DRL model can make real-time decisions with very small computational overhead. The structure of the hybrid action space in DSAC method is shown in Fig. 5. The pseudo-code of the proposed hybrid action space DSAC (H-DSAC) algorithm is in Algorithm 1.

Algorithm 1 H-DSAC:

1:Initialize all the networks and parameters.

2:repeat:

3: for each time slot do

4: Choose action

a(t)

based on the current state

s(t)

5:000000and (32).

6: Obtain the next state

s(t+1)

and reward

r(t)

7: Generate experience sample

\{s(t),a(t),r(t),

8:000000

s(t+1)\}

and save it to the buffer

\mathbb{B}

9: end for

10: for each training epoch do

11: Sample a minibatch from the buffer

\mathbb{B}

12: Update the importance factor

\nu\leftarrow\nu-L_{\nu}\nabla_{\nu}\Theta(\nu)

13: Update

Q^{A_{d}}_{\chi}

based on (35),

Q^{A_{c}}_{\chi}

based on (36).

14: Update

Q^{A_{d}}_{\phi}

based on (37),

Q^{A_{c}}_{\phi}

based on (38).

15: Soft update target networks

\chi^{\prime}

and

\phi^{\prime}

16: Update

\phi_{c}

based on 1

17: end for

18:until Convergence.

IV Simulation Results

In this section, we select 10 datasets (from Kaggle Computer Vision Open Datasets) as 10 tasks in federated learning and use TensorFlow-2 to train the tasks in simulations, the size of the local dataset of a ground user for any given task is randomly generated. Unless otherwise stated, parameters in simulations are set as: the number of ground users $K=10$ , the number of UAVs $M=3$ , ground users and UAVs are randomly distributed within a 250 m × 250 m area. The ground users are randomly assigned to $M$ clusters, while the UAVs initially maintain an altitude of 50 meters. The maximum speed of UAVs $v_{\max}=5$ m/s, $z_{\min}=40$ m, $z_{\max}=$ 60m. The number of LEO satellites $N=5$ , the Rician factor $\omega=10$ dB, the altitude of LEO satellites is 800 km, the speed of LEO satellites is 7.8 km/s, the minimum coverage elevation angle $\varpi=40^{\circ}$ , the carrier frequency for UAVs and LEOs are 1 GHz and 30 GHz, respectively. The antenna gain $\xi_{m}=25$ dB for UAVs and 40 dB for LEOs, the maximum Doppler frequency is in the Ka-band [39]. The bandwidth in ground-air-space transmissions is set as 10 MHz, the bandwidth for ISL transmissions is 1 GHz, the normalized gain $G^{d}=1$ for the links between satellites, the thermal noise $\varphi=354.81$ K [32]. The path loss exponents $\tau_{L}=2$ and $\tau_{N}=2.5$ , the transmit power of ground users, UAVs and LEO satellites are 0.1 W, 1 W and 2 W, respectively. The discount factor $\vartheta=0.99$ , the decay factor $\beta=0.995$ , the thresholds $W=0.1$ , $W_{c}=0.001$ , $W_{d}=0.01$ , the normalizing constants $\epsilon_{c1}=200$ , $\epsilon_{c2}=100$ , $\epsilon_{f}=0.01$ , and $\rho_{\min}=1$ , the training time for each round is 1 s. We consider the distances between LEO satellites to be randomly generated between 100 km and 500 km. Moreover, we utilize parametrized DQN (PDQN) [40], multi-agent proximal policy optimization (M-PPO) [41], ‘H-DSAC + FedAvg’ (Using FedAvg [42] to calculate the weights in aggregations) and ‘H-DSAC + HoveringUAV’ (Without considering UAV trajectory planning, the UAVs remain in hovering states) to calculate the weights in aggregations as the benchmarks in simulations.

Fig. 6 presents the average test accuracy of tasks versus the communication time for the proposed algorithm H-DSAC and benchmarks. As we can see in this figure, the average performance of tasks rises with an increase in the time. As the service time progresses, tasks continue to be trained in HFL which leads to the increase of accuracy. H-DSAC achieves the average test accuracy of 91.4% when the communication time is 100 s, while the others achieve 90.4%, 85.6%, 84.9% and 81.9%, respectively. The proposed algorithm consistently outperforms all other algorithms, thus demonstrating the superiority of the H-DSAC algorithm. The performance of ‘H-DSAC + FedAvg’ consistently ranks the second, indicating that the gain from the hybrid action space method is higher than the gain from aggregation weights in the proposed optimization problem. Compared to other algorithms, the performance of ‘H-DSAC + HoveringUAV’ significantly deteriorates with the absence of UAV trajectory planning. Therefore, UAV trajectory planning plays a crucial role in ensuring the stability of ground-air links. In addition, MPPO and PDQN achieve similar performance across various scenarios, this result indicates that while MPPO may offer algorithmic superiority over DQN and DDPG, PDQN also possesses its own advantages in optimizing hybrid spaces.

Compare the result in Fig. 7 to that in Fig. 6, it can be observed that while the loss is not directly related to the accuracy, their trends are very similar. As the average loss decreases continuously, the average accuracy increases correspondingly. H-DSAC achieves the average loss of 0.32 when the communication time is 40 s, while the others achieve 0.37, 0.88, 0.71 and 0.74, respectively. These results confirmed that the the proposed algorithm utilize the advantage of UAV trajectory control and weighted aggregation to enhance the performance. Moreover, The hybrid action space design offer algorithmic superiority over other DRL algorithms.

As illustrated in Fig. 8, we compare the performance of two selected task to show the fairness among tasks training, ‘Task $T_{1}$ Fixed Reward’ refers to the performance of task $T_{1}$ during training, utilizing a fixed reward design instead of a dynamically adjusted reward function. It is clear that the performance of both two tasks rise rapidly at the beginning, Task $T_{1}$ achieves the average test accuracy of 89.7% when the communication time is 80 s, while ‘Task $T_{1}$ Fixed Reward’ achieves 90.7%. It can be observed from the result that without a fairness-controlled reward function, certain tasks may converge to high performance more rapidly. Moreover, the fairness-oriented and dynamically adjusted reward function has small impact during the initial stages of communications based on (17), resulting in slow progress for Task $T_{2}$ at the beginning. However, as the satellite communication time approaches its end, both tasks can converge to a high performance level with the assistance of the proposed dynamically adjusted reward function. Furthermore, we can see that without fairness considerations, Task $T_{2}$ may be consistently undervalued, leading to only around 82% for testing accuracy even towards the end of communication. Besides, when we utilize the proposed reward function, the average accuracy is higher than that with fixed reward. This result underscores the beneficial role of the proposed reward mechanism for ensuring fairness among multiple tasks. If the satellite cannot guarantee that all tasks converge to a high accuracy threshold due to limited access time, the algorithm’s dynamic design will ensure that all tasks fairly converge to the best possible value within the given time duration.

Fig. 9 compares the average test accuracy between the proposed method with different transmit power of ground users. It can be observed that the average test accuracy of tasks increases as the ground user power rises. This is because higher ground user transmission power can provide higher ground-to-air data rates, thereby reducing model transmission time and accelerating aggregation speed, allowing for more aggregation cycles within the limited satellite service time. Moreover, the proposed H-DSAC algorithm achieves around 97.99% accuracy when the transmit power is 0.2 W for all ground users, while the other three benchmarks only achieve 96.4%, 94.1% and 94.2%, respectively. This result demonstrates the superiority of the proposed algorithm over the benchmarks.

In Fig. 10, we compare the performance of federated learning tasks across different numbers of tasks, and consider that the data distribution for each task among different ground users follows non-i.i.d. From this figure, it can be seen that as the number of tasks increases, the average performance tends to deteriorate. This is because different tasks may originate from entirely different datasets, and an increase in the number of tasks inevitably leads to tighter constraints on transmission and computational resources. Moreover, the performance of the proposed algorithm consistently outperforms other benchmarks, H-DSAC achieves 96.9% with 8 tasks, while others achieve between 91.7% and 95.2%. This result confirms the effectiveness of our proposed algorithm.

The results depicted in Fig. 11 illustrate the performance of federated learning tasks when the dataset of each task follows i.i.d. among different ground users. Contrary Fig. 10, i.i.d. data facilitates to algorithm convergence because in this scenario, the importance of user models is equalized and each user holds an equivalent position in a specific task, necessitating only a balance in resource allocation to regulate their participation frequency in aggregation. Due to the inability of FedAvg to adjust the participation weights of users based on communication conditions for each aggregation, the proposed algorithm’s advantage of adapting weights according to the state of communication links becomes more pronounced.

In Fig. 12, we assess the average test accuracy from both the proposed scheme and the benchmarks in relation to different minimum coverage elevation angle. As shown in all the results, the average test accuracy decreases as the minimum coverage elevation angle rises. The rationale is that, according to (9) and (10), as the minimum coverage elevation angle increases the service time between satellites and UAVs will decrease, which leads to a reduced service time for ground users to utilize SAGIN for task training. Therefore, both the number of federated learning training iterations and aggregation iterations will decrease, leading to a reduction in the average performance of all tasks.

V Conclusion

In this paper, we proposed a novel application of the HFL framework within SAGIN, utilizing aerial platforms and LEO satellites as edge and cloud servers for HFL, respectively. We also considered ISLs as communication channels between satellites to further provide training resources for federated learning tasks. To maximize the average performance of multiple distinct tasks and ensure fairness among them during training, we employed a decoupling-coupling approach in the DSAC algorithm and designed a novel dynamically changing reward function to guarantee fairness among multiple tasks. By optimizing ground-to-air pairings in uplink and downlink transmissions, air-to-satellite pairings, aerial UAV trajectory planning, final aggregation selection between satellites, edge aggregation weights, cloud aggregation weights, and final aggregation weights, we balanced the training effects among different tasks to ensure that the performance of any individual task can converge to a high level while maximizing the overall performance. Through simulations, we validated the superiority of the proposed algorithm and emphasized the importance of fairness design during task training. Furthermore, we analyzed the impact of user transmit power, UAV trajectory planning and minimum coverage elevation angle on the results and demonstrated the significance of resource allocation in SAGINs for federated learning training performance. Moreover, we analyzed the effects of data distribution in tasks on algorithm convergence, the results show that the proposed algorithm exhibits strong adaptability to dynamic environments and provides a highly promising optimization tool for future federated learning frameworks and multi-access edge computing systems in SAGINs. In our future work, we will consider the adaptive downlink/uplink bandwidth allocation and full-duplex mode to further enhance the efficiency of federated learning in SAGINs.

References

[1] D. C. Nguyen, S. Hosseinalipour, D. J. Love, P. N. Pathirana, and C. G. Brinton, “Latency optimization for blockchain-empowered federated learning in multi-server edge computing,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 12, pp. 3373–3390, Dec. 2022.
[2] C. Feng, H. H. Yang, S. Wang, Z. Zhao, and T. Q. S. Quek, “Hybrid learning: When centralized learning meets federated learning in the mobile edge computing systems,” IEEE Transactions on Communications, vol. 71, no. 12, pp. 7008–7022, Dec. 2023.
[3] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, Jun. 2020.
[4] Z. Qin, G. Y. Li, and H. Ye, “Federated learning and wireless communications,” IEEE Wireless Communications, vol. 28, no. 5, pp. 134–140, Oct. 2021.
[5] M. Hosseinian, J. P. Choi, S.-H. Chang, and J. Lee, “Review of 5G NTN standards development and technical challenges for satellite integration with the 5G network,” IEEE Aerospace and Electronic Systems Magazine, vol. 36, no. 8, pp. 22–31, Aug. 2021.
[6] J. C. McDowell, “The low earth orbit satellite population and impacts of the Spacex Starlink constellation,” The Astrophysical Journal Letters, vol. 892, no. 2, p. L36, Apr. 2020.
[7] V. L. Foreman, A. Siddiqi, and O. De Weck, “Large satellite constellation orbital debris impacts: Case studies of oneweb and spacex proposals,” AIAA SPACE and Astronautics Forum and Exposition, Orlando, FL, Sep. 2017.
[8] G. Giambene, S. Kota, and P. Pillai, “Satellite-5G integration: A network perspective,” IEEE Network, vol. 32, no. 5, pp. 25–31, Sep. 2018.
[9] S. Liu, Z. Gao, Y. Wu, D. W. Kwan Ng, X. Gao, K.-K. Wong, S. Chatzinotas, and B. Ottersten, “LEO satellite constellations for 5G and beyond: How will they reshape vertical domains?,” IEEE Communications Magazine, vol. 59, no. 7, pp. 30–36, Jul. 2021.
[10] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili, “Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges,” IEEE Communications Magazine, vol. 55, no. 4, pp. 54–61, Apr. 2017.
[11] Z. Chu, P. Xiao, M. Shojafar, D. Mi, W. Hao, J. Shi, and F. Zhou, “Utility maximization for IRS assisted wireless powered mobile edge computing and caching (WP-MECC) networks,” IEEE Transactions on Communications, vol. 71, no. 1, pp. 457–472, Jan. 2023.
[12] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, thirdquarter, 2020.
[13] H. Zheng, M. Gao, Z. Chen, and X. Feng, “A distributed hierarchical deep computation model for federated learning in edge computing,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 7946–7956, Dec. 2021.
[14] H. Xu, S. Han, X. Li, and Z. Han, “Anomaly traffic detection based on communication-efficient federated learning in Space-Air-Ground integration network,” IEEE Transactions on Wireless Communications, vol. 22, no. 12, pp. 9346–9360, Dec. 2023.
[15] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 269–283, Jan. 2021.
[16] Y. Gao, Z. Ye, H. Yu, Z. Xiong, Y. Xiao, and D. Niyato, “Multi-resource allocation for on-device distributed federated learning systems,” in IEEE Global Communications Conference (GLOBECOM), Rio de Janeiro, Brazil, Dec. 2022, pp. 160-165.
[17] T. Liu, H. Zhou, J. Li, F. Shu, and Z. Han, “Uplink and downlink decoupled 5G/B5G vehicular networks: A federated learning assisted client selection method,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2280–2292, Feb. 2023.
[18] S. Wang, M. Chen, C. G. Brinton, C. Yin, W. Saad, and S. Cui, “Performance optimization for variable bitwidth federated learning in wireless networks,” IEEE Transactions on Wireless Communications, vol. 23, no. 3, pp. 2340–2356, Mar. 2024.
[19] J. Ye, S. Dang, B. Shihada, and M.-S. Alouini, “Space-Air-Ground integrated networks: Outage performance analysis,” IEEE Transactions on Wireless Communications, vol. 19, no. 12, pp. 7897–7912, Dec. 2020.
[20] G. Wang, S. Zhou, S. Zhang, Z. Niu, and X. Shen, “SFC-based service provisioning for reconfigurable Space-Air-Ground integrated networks,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 7, pp. 1478–1489, Jul. 2020.
[21] C. Guo, C. Gong, H. Xu, L. Zhang, and Z. Han, “A dynamic handover software-defined transmission control scheme in Space-Air-Ground integrated networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 8, pp. 6110–6124, Aug. 2022.
[22] C. Huang, G. Chen, P. Xiao, Y. Xiao, Z. Han, and J. A. Chambers, “Joint offloading and resource allocation for hybrid cloud and edge computing in SAGINs: A decision assisted hybrid action space deep reinforcement learning approach,” IEEE Journal on Selected Areas in Communications, vol. 42, no. 5, pp. 1029–1043, May. 2024.
[23] Q. Fang, Z. Zhai, S. Yu, Q. Wu, X. Gong, and X. Chen, “Olive branch learning: A topology-aware federated learning framework for Space–Air–Ground integrated network,” IEEE Transactions on Wireless Communications, vol. 22, no. 7, pp. 4534–4551, Jul. 2023.
[24] R. Zhao, L. T. Yang, D. Liu, and W. Lu, “Tensor-enabled communication-efficient and trustworthy federated learning for heterogeneous intelligent Space–Air–Ground-Integrated IoT,” IEEE Internet of Things Journal, vol. 10, no. 23, pp. 20285–20296, Dec. 2023.
[25] Y. Sun, J. Shao, Y. Mao, J. H. Wang, and J. Zhang, “Semi-decentralized federated edge learning with data and device heterogeneity,” IEEE Transactions on Network and Service Management, vol. 20, no. 2, pp. 1487–1501, Jun. 2023.
[26] C. You, K. Guo, H. H. Yang, and T. Q. S. Quek, “Hierarchical personalized federated learning over massive mobile edge computing networks,” IEEE Transactions on Wireless Communications, vol. 22, no. 11, pp. 8141–8157, Nov. 2023.
[27] Q. Chen, Z. You, D. Wen, and Z. Zhang, “Enhanced hybrid hierarchical federated edge learning over heterogeneous networks,” IEEE Transactions on Vehicular Technology, vol. 72, no. 11, pp. 14601–14614, Nov. 2023.
[28] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal, “Federated multi-task learning under a mixture of distributions,” Annual Conference on Neural Information Processing Systems (NeurIPS), vol. 34, Dec. 2021, pp. 15434-15447.
[29] C. Huang, G. Chen, P. Xiao, D. Mi, Y. Zhang, H. Tang, C. Lu, and R. Tafazolli, “Federated learning for RIS-assisted UAV-enabled wireless networks: Learning-based optimization for UAV trajectory, RIS phase shifts and weighted aggregation,” in Annual Conference of the IEEE Industrial Electronics Society (IECON), Singapore, Oct. 2023.
[30] Z. Lin, H. Niu, K. An, Y. Wang, G. Zheng, S. Chatzinotas, and Y. Hu, “Refracting RIS-aided hybrid satellite-terrestrial relay networks: Joint beamforming design and optimization,” IEEE Transactions on Aerospace and Electronic Systems, vol. 58, no. 4, pp. 3717–3724, Aug. 2022.
[31] Y. Liu, C. Huang, G. Chen, R. Song, S. Song, and P. Xiao, “Deep learning empowered trajectory and passive beamforming design in UAV-RIS enabled secure cognitive non-terrestrial networks,” IEEE Wireless Communications Letters, vol. 13, no. 1, pp. 188–192, Jan. 2024.
[32] I. Leyva-Mayorga, B. Soret, and P. Popovski, “Inter-plane inter-satellite connectivity in dense LEO constellations,” IEEE Transactions on Wireless Communications, vol. 20, no. 6, pp. 3430–3443, Jun. 2021.
[33] Q. Tang, Z. Fei, B. Li, and Z. Han, “Computation offloading in LEO satellite networks with hybrid cloud and edge computing,” IEEE Internet of Things Journal, vol. 8, no. 11, pp. 9164–9176, Jun. 2021.
[34] Y. Xu, Y. Wei, K. Jiang, L. Chen, D. Wang, and H. Deng, “Action decoupled SAC reinforcement learning with discrete-continuous hybrid action spaces,” Neurocomputing, vol. 537, pp. 141–151, Jun. 2023.
[35] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6584–6598, Nov. 2022.
[36] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” International Conference on Machine Learning (ICML), Sydney, Australia, Aug. 2017, pp. 449-458.
[37] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” International Conference on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
[38] M. Neunert, A. Abdolmaleki, M. Wulfmeier, T. Lampe, T. Springenberg, R. Hafner, F. Romano, J. Buchli, N. Heess, and M. Riedmiller, “Continuous-discrete reinforcement learning for hybrid control in robotics,” Proceedings of the Conference on Robot Learning, Nov. 2020, pp. 735-751.
[39] J. Shi, Z. Li, J. Hu, Z. Tie, S. Li, W. Liang, and Z. Ding, “OTFS enabled LEO satellite communications: A promising solution to severe doppler effects,” IEEE Network, Feb. 2023.
[40] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep Q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,” arXiv preprint arXiv:1810.06394, Oct. 2018.
[41] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. WU, “The surprising effectiveness of PPO in cooperative multi-agent games,” Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, Nov. 2022, pp. 24611-24624.
[42] K. H. W. Y. S. W. Li, Xiang and Z. Zhang, “On the convergence of fedavg on non-iid data,” International Conference on Learning Representations (ICLR), Apr. 2020.

Fair Resource Allocation For Hierarchical Federated Edge Learning in Space-Air-Ground Integrated Networks via Deep Reinforcement Learning with Hybrid Control