Quantum Multi-Agent Reinforcement Learning for Cooperative Mobile Access in Space-Air-Ground Integrated Networks

Gyu Seon Kim, Yeryeong Cho, Jaehyun Chung, Soohyun Park, Soyi Jung, Zhu Han, and Joongheon Kim

G.S. Kim, Y. Cho, J. Chung, and J. Kim are with the School of Electrical Engineering, Korea University, Seoul 02841, Korea (e-mails: {kingdom0545,joyena0909,rupang1234,joongheon}@korea.ac.kr). S. Park is with the Division of Computer Science, Sookmyung Women’s University, Seoul 04310, Korea (e-mail: soohyun.park@sookmyung.ac.kr). S. Jung is with the Department of Electrical and Computer Engineering, Ajou University, Suwon 16499, Korea (e-mail: sjung@ajou.ac.kr). Z. Han is with the Department of Electrical and Computer Engineering, University of Houston, Texas, USA (e-mail: zhan2@uh.edu).

Abstract

Achieving global space-air-ground integrated network (SAGIN) access with CubeSats alone presents significant challenges, such as limited access sustainability in specific regions (e.g., polar regions) and the limited energy efficiency of CubeSats. To tackle these problems, high-altitude long-endurance unmanned aerial vehicles (HALE-UAVs) can complement CubeSats, cooperatively providing global access sustainability and energy efficiency. However, as the number of CubeSats and HALE-UAVs increases, the scheduling dimension of each ground station (GS) increases. As a result, each GS can fall into the curse of dimensionality, which becomes a major hurdle for efficient global access. Therefore, this paper provides a quantum multi-agent reinforcement learning (QMARL)-based method for scheduling between GSs and CubeSats/HALE-UAVs in order to improve global access availability and energy efficiency. The main benefit of the QMARL-based scheduler is that it facilitates a logarithmic-scale reduction in scheduling action dimensions, which is a critical feature as the number of CubeSats and HALE-UAVs grows. Additionally, individual GSs have different traffic demands depending on their locations and characteristics; thus, it is essential to provide differentiated access services. The superiority of the proposed scheduler is validated through data-intensive experiments in realistic CubeSat/HALE-UAV settings.

Index Terms:

Quantum Multi-Agent Reinforcement Learning (QMARL), Quantum Neural Network (QNN), Cube Satellite (CubeSat), High-Altitude Long-Endurance Unmanned Aerial Vehicle (HALE-UAV), Space-Air-Ground Integrated Network (SAGIN).

Figure 1: Reference network model.

1 Introduction

Ultra-small-scale and low-cost cube satellites (CubeSats) have recently emerged as novel electrical aerospace devices in non-terrestrial networks (NTN) and as one major component of global space-air-ground integrated network (SAGIN) systems for realizing seamless global access services [1]. In the past, geostationary (GEO) satellites at an altitude of approximately $36,000$ km were employed for global access services, yet their considerable distance from the Earth introduced extremely long propagation delays, which hindered these services [2]. Given that CubeSats operate as low Earth orbit (LEO) satellites at an altitude of approximately $500$ km, they are better suited to global access services, offering reduced delays compared to GEO-based services [3, 4]. However, the lower altitude of CubeSats results in considerably smaller coverage compared to GEO-based services. Consequently, in order to achieve seamless global access, a significantly larger fleet of CubeSats is required [5]. To manage such large-scale CubeSat fleets, efficient scheduling algorithms for global access availability and energy efficiency must be designed. In more detail, employing CubeSats to deliver global SAGIN mobile access requires determining which CubeSats should engage in the global access when a multitude of CubeSats are present. This leads to a scheduling problem, which can be formulated within the framework of multi-agent reinforcement learning (MARL) [6]. The essence of this approach stems from the necessity for multiple ground stations (GSs) to collaboratively orchestrate the scheduling and servicing of their CubeSats to facilitate global SAGIN mobile access, as depicted in Fig. 1.
In an environment where multiple CubeSats exist, each GS cooperatively schedules CubeSats to participate in global SAGIN mobile access, and corresponding efficient scheduling algorithms are needed. Because CubeSats have limited resources such as energy and bandwidth, it is impossible to optimally utilize these resources, maintain high quality of service (QoS), and provide optimal global access services without an efficient scheduling algorithm [7]. Additionally, in dynamic environments where the coverage of specific areas is constantly changing due to the CubeSats’ high orbital speed, it is important to schedule the connection of each GS to a CubeSat in order to improve access availability and energy efficiency. Furthermore, because the mobile access demands and requirements of individual GSs differ depending on their locations, differentiated scheduling algorithms that can take care of the characteristics, demands, and requirements of individual GSs are required.

Even though CubeSats can be widely used for next-generation global SAGIN mobile access, CubeSats encounter constraints in delivering global access autonomously, owing to their restricted scales and energy capacities [8]. Hence, despite the capacity of multiple CubeSats to collectively cover extensive areas, coverage gaps might persist in remote areas, polar regions, or areas experiencing significant communication burdens. Moreover, the rapid orbital velocity of CubeSats, approximately $7.5$ km/s, results in frequent handovers [9]. To maintain uninterrupted global access, it becomes necessary to integrate new aerial networks that focus on specific local regions and to consider them together with CubeSats [10]. Finally, although CubeSats experience reduced delay compared to GEO satellites, their delay is still a significant challenge when contrasted with terrestrial networks (TNs). Consequently, the deployment of innovative NTN devices to support CubeSats is essential for ensuring seamless global access.

To address these challenges, this paper proposes cooperative and differentiated global SAGIN mobile access involving both CubeSats and aerial networks. Aerial networks, possessing enhanced mobility compared to CubeSats that follow predetermined orbits, can respond more adaptably to changing environmental conditions. Consequently, unmanned aerial vehicles (UAVs) are particularly beneficial for establishing networks across diverse regions characterized by uncertainty [11]. Despite their utility, rotorcraft consume a significant amount of energy, posing challenges to seamless global SAGIN mobile access. Therefore, the system discussed in this paper employs high-altitude long-endurance (HALE)-UAVs, which are fixed-wing aircraft, to overcome these limitations. HALE-UAVs are distinguished by their capacity for long-distance flights, attributed to their substantial endurance and energy levels. Furthermore, as fixed-wing aircraft, HALE-UAVs can sustain flight longer than rotary-wing aircraft even in scenarios where their control systems are damaged [12]. Ultimately, HALE-UAVs can supplement CubeSats by providing flexible and extensible coverage for particular regions, such as polar areas lacking signal availability or regions burdened with communication overheads [13, 14]. Based on these issues and architectural characteristics, a new global SAGIN scheduling algorithm needs to be designed.

Moreover, the need for effective scheduling becomes paramount in the scenarios populated by numerous CubeSats and HALE-UAVs. In order to realize effective scheduling for CubeSats and HALE-UAVs in terms of access availability and energy efficiency, cooperative and differentiated global SAGIN mobile access should be proposed. In this scheduling problem, the goal is to simultaneously improve access availability in terms of QoS and capacity as well as energy efficiency in NTN devices, i.e., CubeSats and HALE-UAVs. To achieve this, we have to consider the hardware restrictions of CubeSats and HALE-UAVs at the same time. For CubeSats, their geographical coordinates in terms of latitude and longitude as well as the direction vector toward the sun for solar charging undergo real-time alterations due to their orbital movement. Furthermore, CubeSats frequently sustain damage from cosmic rays and solar winds. Similarly, the flight environment for HALE-UAVs is characterized by dynamic and uncertain conditions, including the presence of vortices and gusts. Moreover, due to the limited energy levels and capacities of NTN devices, collaboration among these NTN devices is crucial for the simultaneous optimization of energy efficiency and channel capacity.

Distinct from conventional scheduling algorithms, reinforcement learning (RL) exhibits robust performance in dynamic and uncertain environments [15, 16, 17]. MARL proves particularly effective in situations that require cooperation among multiple NTN devices [18]. Consequently, within global SAGIN mobile access that utilizes CubeSats and HALE-UAVs, MARL-based algorithms may be employed, with multiple GSs acting as agents. Nevertheless, conventional MARL-based schedulers are unable to ensure reward convergence as the number of agents and the action dimensions of each GS expand. To tackle these issues, this paper proposes a novel cooperative and differentiated scheduling algorithm for access availability and energy efficiency in global SAGIN mobile access, based on quantum MARL (QMARL) [19]. This innovation utilizes basis measurements, known as projection-valued measure (PVM), allowing the proposed QMARL-based scheduler to reduce the action dimension to a logarithmic scale [20]. Furthermore, a realistic experimental setting is constructed to demonstrate the superiority and real-world relevance of our proposed QMARL-based scheduler. This includes the use of actual CubeSat orbital data, aerodynamic information about real HALE-UAV environments with significant vortices, and considerations for photovoltaic (PV) charging based on the CubeSats’ positions relative to the sun, i.e., the sun side and dark side. Additionally, each GS, which is an agent, has its own differentiated maximum required channel capacity depending on the region where the GS is located, the population of that region, and the degree of communication overload. Without these settings, excessive global SAGIN mobile access may be provided to GSs that do not require communication services beyond a certain requirement, and GSs with severe communication overload may not be provided with the desired level of global access.
Eventually, this can result in the energy of NTN devices (i.e., CubeSats and HALE-UAVs) being wasted. In conclusion, the efficacy of our proposed QMARL-based scheduler is validated within realistic environments, evidencing that the algorithm fulfills its objectives by simultaneously optimizing the access availability in SAGIN and the energy efficiency of NTN devices in scenarios characterized by high action dimensions. Ultimately, in this paper, the considered SAGIN mobile access network is implemented using multiple GSs, CubeSats, and HALE-UAVs through our proposed QMARL-based scheduler at high action dimensions, and the proposed algorithm is tested in realistic environments to increase real-world applicability.

The main contributions are as follows.

• First of all, this paper is the first attempt to employ a QMARL-based global SAGIN mobile access scheduler for the coordination of CubeSats and HALE-UAVs. The uniqueness of this scheduler stems from its emphasis on reducing the action dimensions through the PVM. Furthermore, a new reward function is designed and implemented to encourage cooperative global SAGIN mobile access as well as efficient and equitable energy usage of NTN devices in multi-CubeSat and multi-HALE-UAV environments.

• Moreover, the proposed QMARL-based scheduler is designed for coordinated and differentiated global SAGIN mobile access with multiple GSs, CubeSats, and HALE-UAVs. Our proposed scheduling also targets energy efficiency in CubeSats and HALE-UAVs. To realize this, the reward function of our proposed QMARL-based scheduler is formulated to address the energy utilization efficiency of CubeSats, taking into account their exposure to the sun side or dark side, which is crucial given their limited energy capacities due to their compact sizes.

• Lastly, the efficacy of the proposed algorithm is assessed under realistic experimental environments involving CubeSats on real orbits as well as HALE-UAVs in real flight conditions. The orbital elements for CubeSats are derived from two-line element (TLE) sets, which provide the foundational orbit data for these CubeSats. The experiment incorporates a range of realistic aerodynamic characteristics of HALE-UAVs to enhance the algorithm’s real-world applicability. In addition, the differentiated maximum channel capacity of each GS is set according to the region where the GS is located, the population of that region, and the degree of communication overload, which makes the experimental environments more realistic.

The rest of this paper is organized as follows. Sec. 2 presents preliminary knowledge including related work and QMARL. Sec. 3 describes the fundamental modeling and Sec. 4 presents the details of our proposed QMARL-based scheduler. Sec. 5 evaluates the performance in realistic environments, and lastly, Sec. 6 concludes this paper.

2 Preliminaries

2.1 Related Work

Numerous projects focus on establishing wireless connections to build aerial NTNs from devices including UAVs and satellites [21]. Given that these devices rely on battery-based energy management, minimizing energy consumption is crucial for the stable and efficient operation of multiple UAVs and satellites in unknown environments [22]. In the literature, the efficient operation of multiple UAVs has garnered significant attention [23], and efficient communications are needed to keep energy consumption low in unfamiliar environments [24]. At the same time, efficient scheduling among satellites is imperative to ensure swift responses to diverse sightings and unforeseen events [25]. UAVs, characterized by remarkable acquisition flexibility and very high spatial resolution (VHSR), and LEO satellites, capable of providing time-series data across extensive areas, have traditionally been employed independently. However, the algorithm proposed in [26] minimizes total energy costs and reduces time complexity, which is crucial for the effective joint operation of UAVs and satellites. Therefore, UAVs and satellites must be controlled cooperatively to improve performance [27]. To efficiently manage both UAVs and satellites, numerous studies have demonstrated different methodologies for applying RL algorithms [28]. The algorithm proposed in [29] shows the superiority of RL, which is particularly beneficial in the management of multiple agents. However, to build global SAGIN mobile access, more agents need to be controlled [30]. Notably, quantum algorithms have advantages in managing large-scale scenarios, such as those encountered in aerial networks [31]. This paper demonstrates the superiority of quantum reinforcement learning (QRL) over RL in multi-agent scheduling.

2.2 Quantum Neural Network

In QNN architectures, a significant deviation from classical neural networks is the use of qubits as the basic unit of learning computation [32]. Within quantum systems, qubits are the fundamental units of information, and their representation is grounded in the basis states $\left|0\right\rangle:=[1,0]^{T}$ and $\left|1\right\rangle:=[0,1]^{T}$. A single-qubit state can be represented by a normalized 2D complex vector $\left|\psi\right\rangle=\alpha\left|0\right\rangle+\beta\left|1\right\rangle$ with $|\alpha|^{2}+|\beta|^{2}=1$, where $|\alpha|^{2}$ and $|\beta|^{2}$ denote the probabilities of observing $\left|0\right\rangle$ and $\left|1\right\rangle$, respectively. The QNN computation is carried out over the 3D Bloch sphere, which serves as a geometric representation of the single-qubit Hilbert space. On the Bloch sphere, the state can be written as $\left|\psi\right\rangle=\cos(\theta/2)\left|0\right\rangle+e^{i\phi}\sin(\theta/2)\left|1\right\rangle$, where $\theta$ is the parameter that determines the probabilities of measuring $\left|0\right\rangle$ and $\left|1\right\rangle$, and $\phi$ represents the relative phase, with $0\leq\theta\leq\pi$ and $0\leq\phi<2\pi$ [32]. Fig. 2(a) shows a qubit represented over the Bloch sphere. For a $q$-qubit system, a quantum state within the system’s Hilbert space is represented as $\left|\psi\right\rangle=\sum^{2^{q}-1}_{l=0}\omega_{l}\left|l\right\rangle$, where $\left|\psi\right\rangle$ denotes the quantum state, $\left|l\right\rangle$ represents the $l$-th basis state, and $\omega_{l}$ stands for the corresponding probability amplitude. The probability amplitudes fulfill $\sum^{2^{q}-1}_{l=0}|\omega_{l}|^{2}=1$.
A significant component in classical neural networks is the hidden layer, capable of representing linear and nonlinear transformations to achieve accurate function approximation within the neural network. Hence, the primary design consideration in a QNN is how to design and implement linear and nonlinear transformations over the Bloch sphere. This QNN design is what enables QRL-based control, which is achieved by using the states and actions of RL-based control as the inputs and outputs of QNN architectures.
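As a small illustration of the normalization condition on the probability amplitudes $\omega_{l}$, the following sketch normalizes an amplitude vector and lists the measurement probability of each basis state. The function names and the example amplitudes are hypothetical and chosen only for illustration; they do not come from the paper's experiments.

```python
import math

def normalize(amplitudes):
    """Scale complex amplitudes so that the sum of |w_l|^2 equals 1."""
    norm = math.sqrt(sum(abs(w) ** 2 for w in amplitudes))
    if norm == 0.0:
        raise ValueError("the zero vector is not a valid quantum state")
    return [w / norm for w in amplitudes]

def measurement_probabilities(amplitudes):
    """Probability |w_l|^2 of observing each computational basis state |l>."""
    return [abs(w) ** 2 for w in amplitudes]

def basis_label(index, num_qubits):
    """Bit-string label of the index-th basis state of a q-qubit register."""
    return format(index, "0{}b".format(num_qubits))

# Hypothetical 2-qubit example: unnormalized amplitudes for |00>, |01>, |10>, |11>.
q = 2
state = normalize([1.0, 1j, 0.0, -1.0])
probs = measurement_probabilities(state)
for l, p in enumerate(probs):
    print("|{}>: {:.3f}".format(basis_label(l, q), p))
```

A $q$-qubit register therefore carries $2^{q}$ amplitudes, which is the property the later sections rely on when mapping basis states to scheduling actions.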

In QNN architecture, there are three primary components: (i) state encoding, (ii) parameterized quantum circuit (PQC), and (iii) measurement, as illustrated in Fig. 2(b).

• State Encoding. The encoder converts the classical data, represented as $\zeta_{t}$ at a specific time $t$, into a quantum state, starting from the initialized quantum state $\left|0\right\rangle$. This step is needed because quantum circuits cannot directly accept classical bits. The encoding is achieved mathematically by applying multiple unitary matrices, denoted as $U(\cdot)$. An important point to highlight is that the encoder does not include any trainable parameters. Thus, the encoded quantum state of the QNN at a specific time $t$ is defined as $\left|\psi_{0;t}\right\rangle=U_{\text{ENC}}(\zeta_{t})\left|0\right\rangle^{\otimes q}$, where the classical data $\zeta_{t}$ serve as rotation angles within the set of encoding gates $U_{\text{ENC}}$.

• PQC. The operations performed by the PQC are analogous to the multiplications in the stacked hidden layers of classical neural networks. Quantum gates transform the state of qubits through the operations they perform [32]. In this paper, the following three types of gates are introduced: Pauli, Controlled, and rotation gates [32]. The Pauli-$\Gamma$ gates and Controlled-$\Gamma$ gates are defined as $X=\begin{bmatrix}0&1\\ 1&0\end{bmatrix}$, $Y=\begin{bmatrix}0&-i\\ i&0\end{bmatrix}$, $Z=\begin{bmatrix}1&0\\ 0&-1\end{bmatrix}$, and $C\Gamma=\begin{bmatrix}\boldsymbol{I}&0\\ 0&\Gamma\end{bmatrix}$, where $i=\sqrt{-1}$, $\forall\Gamma\in\left\{X,Y,Z\right\}$, and $\boldsymbol{I}$ stands for the identity matrix. The Pauli-$\Gamma$ gates perform $180^{\circ}$ rotations of the quantum state about the $x$-, $y$-, and $z$-axes of the Bloch sphere. Between two qubits, the Controlled-$\Gamma$ gates produce entanglement. Within a QNN, rotation gates $R_{\Gamma}$ featuring the trainable parameters $\theta_{k}$, defined within the range $[0,2\pi]$, are widely used. They can be represented as $R_{\Gamma}(\theta_{k})=e^{-i\frac{\theta_{k}}{2}\Gamma}$. Rotations and entanglement of all qubits are achieved by combining Pauli-$\Gamma$, Controlled-$\Gamma$, and rotation gates. Here, the Pauli-$\Gamma$ gates and $R_{\Gamma}$ are employed for implementing linear transformations, while the Controlled-$\Gamma$ gates are utilized for nonlinear transformations. Therefore, the PQC achieves both transformations on the Bloch sphere. Consequently, the behavior of the PQC varies depending on the configuration of the $R_{\Gamma}$ and Controlled-$\Gamma$ gates, which is an important factor in building a QNN. To thoroughly explore trainable rotation parameters and entanglement, we implement multiple quantum layers in this paper, each consisting of $R_{\Gamma}$ gates within the PQC of each QNN. At a specific time $t$, the quantum state of the QNN, denoted as $\left|\psi_{t}\right\rangle$, can be represented as $\left|\psi_{t}\right\rangle=\prod_{l=1}^{L}\boldsymbol{U}_{l}(\theta_{t})U_{\text{ENC}}(\zeta_{t})\left|0\right\rangle^{\otimes q}$, where $\boldsymbol{U}_{l}(\theta_{t})$ stands for the $l$-th quantum layer at time $t$ with its corresponding set of trainable parameters. Observe that $\boldsymbol{U}_{l}(\theta_{t})$ takes the trainable parameters as inputs; therefore, it works differently from the encoder’s gates.

• Measurement. The quantum state produced by the PQC is used as the input for measurement. In this process, the quantum data are decoded back into classical form by performing measurements on this input. The $z$-axis is commonly used for measurements, but axes in other directions can also be used if they are appropriately defined. The quantum state collapses and its properties become observable once it is measured. Upon completion of the decoding procedure, the observed property is employed to minimize the loss function. The expected decoded value of the quantum state $\left|\psi_{t}\right\rangle$ is obtained through $\left\langle\psi_{t}\right|O\left|\psi_{t}\right\rangle$, where $\left|\psi_{t}\right\rangle=\prod_{l=1}^{L}\boldsymbol{U}_{l}(\theta_{t})U_{\text{ENC}}(\zeta_{t})\left|0\right\rangle^{\otimes q}$, $\left\langle\psi_{t}\right|$ denotes the conjugate transpose of $\left|\psi_{t}\right\rangle$, and $O$ represents the observable.
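The three components above can be sketched as a tiny state-vector simulation. This is only a minimal illustration under stated assumptions: two qubits, $R_{X}$ gates for encoding, one $R_{Y}$ gate per qubit followed by a Controlled-$X$ gate in each layer, and the Pauli-$Z$ observable on each qubit. The actual circuit layout of the paper is the one in Fig. 2(b), and the function name `qnn_forward` is hypothetical.

```python
import math

def rx(a):
    """Rotation gate R_X(a) = exp(-i a X / 2)."""
    c, s = math.cos(a / 2), math.sin(a / 2)
    return [[c, -1j * s], [-1j * s, c]]

def ry(a):
    """Rotation gate R_Y(a) = exp(-i a Y / 2)."""
    c, s = math.cos(a / 2), math.sin(a / 2)
    return [[c, -s], [s, c]]

I2 = [[1, 0], [0, 1]]
Z = [[1, 0], [0, -1]]
# Controlled-X with the first qubit as control: block-diagonal (I, X).
CX = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def apply(matrix, state):
    """Matrix-vector product: apply a gate to a state vector."""
    return [sum(m * s for m, s in zip(row, state)) for row in matrix]

def expectation(observable, state):
    """Expected value <psi|O|psi> of an observable O."""
    o_psi = apply(observable, state)
    return sum(a.conjugate() * b for a, b in zip(state, o_psi)).real

def qnn_forward(zeta, theta):
    """Encode zeta, apply the layers in theta, return <Z> of each qubit."""
    state = [1, 0, 0, 0]                                    # |0> on both qubits
    state = apply(kron(rx(zeta[0]), rx(zeta[1])), state)   # encoder, no trainable parameters
    for layer in theta:                                     # one (angle, angle) pair per layer
        state = apply(kron(ry(layer[0]), ry(layer[1])), state)
        state = apply(CX, state)                            # entangle the two qubits
    return [expectation(kron(Z, I2), state), expectation(kron(I2, Z), state)]

# Illustrative inputs: classical data zeta and two layers of trainable angles.
outputs = qnn_forward([0.3, 1.2], [(0.1, 0.2), (0.5, -0.4)])
print(outputs)
```

Each output lies in $[-1,1]$ and is a differentiable function of the layer angles, which is what allows the angles to be trained against a loss function.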

2.3 QMARL for Scheduling

This section investigates the use of QMARL for scheduling CubeSats and HALE-UAVs, presenting a strong argument for its preference over conventional MARL approaches. Conventional MARL has been effective for optimizing decisions in scenarios with relatively small action dimensions. Nonetheless, within intricate systems like integrated networks using CubeSats/HALE-UAVs, characterized by exponentially vast action dimensions, the efficacy of conventional MARL diminishes due to computational burden and the inefficacy in managing extensive action spaces. The expansion of the action dimension introduces the challenge of the curse of dimensionality [33], a significant impediment in conventional MARL frameworks. QMARL, empowered by quantum computing features such as superposition and entanglement, offers a significant computational edge [34]. This quantum advantage allows QMARL to efficiently process large-scale data and complex decision matrices [35], presenting a superior solution for the extensive action dimensions encountered in integrated networks using CubeSats/HALE-UAVs. Moreover, the multi-agent dynamics of these integrated networks involving many communicating devices such as multiple GSs, CubeSats, and HALE-UAVs make the scheduling decision-making problem more complex. QMARL signifies a crucial advancement in overcoming the challenges of high-dimensional and complex scheduling tasks for integrated networks using CubeSats/HALE-UAVs. Its enhanced computational strength and ability to effectively manage multi-agent scenarios establish it as a powerful and efficient approach, facilitating the development of more sophisticated, effective, and dependable SAGIN.
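The logarithmic scaling mentioned in the introduction follows from the fact that a basis measurement on $q$ qubits has $2^{q}$ possible outcomes. The sketch below counts the qubits needed to index a given number of scheduling choices and decodes a measured bit string into an action index. The sizes used, the choice of one action per CubeSat or HALE-UAV, and the handling of unused basis states are assumptions made for illustration, not details taken from the paper.

```python
def qubits_needed(num_actions):
    """Smallest q (at least 1) such that 2**q basis states cover num_actions choices."""
    if num_actions < 1:
        raise ValueError("num_actions must be positive")
    return max(1, (num_actions - 1).bit_length())

def decode_action(bitstring, num_actions):
    """Map a measured basis state such as '0101' to an action index.

    Returns None when the outcome does not correspond to any action
    (an assumed convention for basis states left unused).
    """
    index = int(bitstring, 2)
    return index if index < num_actions else None

# Hypothetical sizes: 12 CubeSats and 4 HALE-UAVs selectable by one GS.
num_cubesats, num_uavs = 12, 4
num_actions = num_cubesats + num_uavs
q = qubits_needed(num_actions)
print(num_actions, "actions ->", q, "qubits")
print("measured 0101 ->", decode_action("0101", num_actions))
```

Under these assumptions, doubling the number of schedulable devices adds only one qubit, whereas a classical output layer with one unit per action would double in width.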

3 Modeling

3.1 Global SAGIN Access Scheduling Modeling

The considered global SAGIN is illustrated in Fig. 1 and structured around three principal elements: $N$ GSs, a fleet of $M$ CubeSats, and a group of $L$ HALE-UAVs. Each GS is denoted as $G_{i}$, $i\in\mathcal{N}$, where $|\mathcal{N}|\triangleq N$. In addition, CubeSats and HALE-UAVs are denoted as $S_{j}$, $j\in\mathcal{M}$, and $A_{l}$, $l\in\mathcal{L}$, respectively, where $|\mathcal{M}|\triangleq M$ and $|\mathcal{L}|\triangleq L$. In the proposed scheduling, each GS $G_{i}$ establishes communications with the CubeSats $S^{i}_{j}$ or HALE-UAVs $A^{i}_{l}$ that are located within the coverage of $G_{i}$ for network access services. The main purpose of this scheduling is to maximize (i) the residual energy of NTN devices, (ii) the fairness of energy consumption among NTN devices, and (iii) the global access performance in terms of capacity and QoS in SAGIN systems.

3.2 HALE-UAV

In order to ensure the maneuvers of HALE-UAVs while maintaining the equilibrium among the energy levels of HALE-UAVs, energy expenditure modeling for HALE-UAVs is essential. The required energy is the minimum amount of energy that each HALE-UAV needs to overcome aerodynamic drag and advance. This energy is equivalent to the work per unit time done by the force applied to the dynamic system, and it is defined as the dot product of force and velocity. Therefore, the required energy of the $l$-th HALE-UAV at time $t$, denoted as $E^{A}_{l}(t)$, is defined as $E^{A}_{l}(t)=DV$, where $D$ and $V$ denote its drag and velocity at time $t$, respectively. Here, the drag $D$ can be obtained as $D=\frac{1}{2}\rho V^{2}SC_{D}=qSC_{D}$, where $C_{D}$ is the drag coefficient. Because $C_{D}$ is expressed as $C_{D}=C_{D_{0}}+kC_{L}^{2}$ and $C_{L}$ is expressed as $C_{L}=\frac{W}{\frac{1}{2}\rho V^{2}S}=\frac{W}{qS}$, the required energy of the $l$-th HALE-UAV at time $t$, i.e., $E^{A}_{l}(t)$, is,

$$
\begin{aligned}
E^{A}_{l}(t) &= \underbrace{\frac{1}{2}C_{D_{0}}\rho V^{3}S}_{\text{parasite energy, } P_{p}}+\underbrace{\frac{kW^{2}}{\frac{1}{2}\rho SV}}_{\text{induced energy, } P_{i}} \\
&= \underbrace{qSC_{D_{0}}V}_{\text{parasite energy, } P_{p}}+\underbrace{\frac{W^{2}kV}{qS}}_{\text{induced energy, } P_{i}},
\end{aligned}
\tag{1}
$$

where $C_{D_{0}}$ , $\rho$ , $V$ , $S$ , $k$ , $W$ , and $q$ are the parasite drag coefficient at zero lift, density of the air, velocity, wing surface area, induced drag coefficient, HALE-UAV weight, and dynamic pressure ( $q=\frac{1}{2}\rho V^{2}$ ) [36], respectively. As expressed in (1), the required energy is composed of the parasite energy and induced energy [37]. Here, the parasite energy arises from parasite drag, encompassing skin friction drag (drag that varies with the UAV’s surface texture), form drag (drag that depends on the HALE-UAV’s size, structure, and shape), and interference drag (drag generated from the interaction between skin friction and form drag) [38]. In addition, the induced energy originates from the drag produced by generating lift. This type of drag is caused by wingtip vortices, resulting from the differential pressure on the wing’s upper and lower surfaces, which in turn creates downwash at the wing’s rear. Accordingly, $P_{p}$ increases with the cube of velocity, whereas $P_{i}$ is inversely related to velocity, demonstrating the dynamics of aerodynamic drag in relation to the UAV’s velocity [39].
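To make the two terms of (1) concrete, the sketch below evaluates the parasite and induced terms with the HALE-UAV parameters of Table I and cross-checks their sum against the direct product $DV$ with $C_{D}=C_{D_{0}}+kC_{L}^{2}$. The airspeeds in the loop are illustrative assumptions, since no operating speed is stated in this section, and the function names are hypothetical.

```python
# HALE-UAV parameters taken from Table I.
W = 17799.0    # weight [N]
S = 6.61       # wing surface area [m^2]
RHO = 0.089    # density of the air [kg/m^3]
CD0 = 0.045    # parasite drag coefficient at zero lift
K = 0.052      # induced drag coefficient

def required_energy(v):
    """Return the (parasite, induced, total) terms of Eq. (1) at airspeed v [m/s]."""
    q = 0.5 * RHO * v ** 2               # dynamic pressure
    parasite = q * S * CD0 * v           # grows with the cube of v
    induced = K * W ** 2 * v / (q * S)   # inversely proportional to v
    return parasite, induced, parasite + induced

def drag_times_velocity(v):
    """Same quantity computed as D*V with C_D = C_D0 + k*C_L^2 (cross-check)."""
    q = 0.5 * RHO * v ** 2
    c_l = W / (q * S)
    c_d = CD0 + K * c_l ** 2
    return q * S * c_d * v

for v in (100.0, 150.0, 200.0):   # illustrative airspeeds, not values from the paper
    p_p, p_i, total = required_energy(v)
    print("V = {:5.1f} m/s: P_p = {:.3e}, P_i = {:.3e}, total = {:.3e}".format(v, p_p, p_i, total))
```

Doubling the airspeed multiplies the parasite term by eight and halves the induced term, which is the trade-off described above.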

On the other hand, velocity $V$ is computed as the aggregate of velocities along each axis, formulated as $V=\sqrt{u^{2}+v^{2}+w^{2}}$ , where $u$ , $v$ , and $w$ represent the velocities over the $x$ -, $y$ -, and $z$ -axes of body axis coordinate system, respectively. Here, velocity $V$ in (1) is the velocity based on the body axis coordinate system of aircraft. Nevertheless, due to the fact that the velocities of HALE-UAVs for each axis are determined with the relation to the ground coordinate system, it is imperative to utilize coordinate transformation matrices. Therefore, velocities $u_{1}$ , $v_{1}$ , and $w_{1}$ in the ground coordinate system are transformed into the velocities $u$ , $v$ , and $w$ within the body axis coordinate system through multiplication by the coordinate transformation matrices $L_{1}$ , $L_{2}$ , and $L_{3}$ , which is expressed as,

$$
\begin{bmatrix}u\\ v\\ w\end{bmatrix}=L_{3}L_{2}L_{1}\begin{bmatrix}u_{1}\\ v_{1}\\ w_{1}\end{bmatrix},
\tag{2}
$$

where $L_{1}$, $L_{2}$, and $L_{3}$ are the transformation matrices about the $z$-, $y$-, and $x$-axes, respectively, applied in that order. The geometric relationships among these transformations are illustrated in Fig. 3, and the transformation of coordinates for each axis can be articulated via,

$$
\begin{bmatrix}u_{2}\\ v_{2}\\ w_{2}\end{bmatrix}=\underbrace{\begin{bmatrix}\cos\psi&\sin\psi&0\\ -\sin\psi&\cos\psi&0\\ 0&0&1\end{bmatrix}}_{L_{1}}\begin{bmatrix}u_{1}\\ v_{1}\\ w_{1}\end{bmatrix},
\tag{3}
$$

$$
\begin{bmatrix}u_{3}\\ v_{3}\\ w_{3}\end{bmatrix}=\underbrace{\begin{bmatrix}\cos\theta&0&-\sin\theta\\ 0&1&0\\ \sin\theta&0&\cos\theta\end{bmatrix}}_{L_{2}}\begin{bmatrix}u_{2}\\ v_{2}\\ w_{2}\end{bmatrix},
\tag{4}
$$

$$
\begin{bmatrix}u\\ v\\ w\end{bmatrix}=\underbrace{\begin{bmatrix}1&0&0\\ 0&\cos\phi&\sin\phi\\ 0&-\sin\phi&\cos\phi\end{bmatrix}}_{L_{3}}\begin{bmatrix}u_{3}\\ v_{3}\\ w_{3}\end{bmatrix},
\tag{5}
$$

where $\psi$, $\theta$, and $\phi$ represent the rotations about the $z$-, $y$-, and $x$-axes, respectively. Within the real flight environment of HALE-UAVs, disturbances such as turbulence and wind gusts have the potential to alter the UAV’s rotational orientation. Under conditions where turbulence and gusts are prevalent across all axes, the goal of each HALE-UAV is to simultaneously optimize the global access performance of the integrated network and its own energy use. Details pertaining to the HALE-UAV deployed in this paper are compiled in Table I.
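The chain of transformations in (3)-(5) can be sketched as follows, applying the three matrices one after another to a ground-frame velocity and then computing the speed $V=\sqrt{u^{2}+v^{2}+w^{2}}$. The function names and the sample angles are hypothetical and serve only as an illustration.

```python
import math

def l1(psi):
    """Rotation about the z-axis, matrix L1 of Eq. (3)."""
    c, s = math.cos(psi), math.sin(psi)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def l2(theta):
    """Rotation about the y-axis, matrix L2 of Eq. (4)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

def l3(phi):
    """Rotation about the x-axis, matrix L3 of Eq. (5)."""
    c, s = math.cos(phi), math.sin(phi)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def ground_to_body(vel_ground, psi, theta, phi):
    """Apply Eqs. (3), (4), and (5) in sequence to a ground-frame velocity."""
    v2 = matvec(l1(psi), vel_ground)
    v3 = matvec(l2(theta), v2)
    return matvec(l3(phi), v3)

def speed(vel):
    """Magnitude V = sqrt(u^2 + v^2 + w^2)."""
    return math.sqrt(sum(c * c for c in vel))

# Illustrative ground-frame velocity [m/s] and attitude angles [rad].
ground = [40.0, 5.0, -1.0]
body = ground_to_body(ground, psi=0.3, theta=0.05, phi=-0.1)
print(body, speed(body))
```

Because each matrix is a pure rotation, the speed is the same in both coordinate systems; only the split among $u$, $v$, and $w$ changes with the attitude angles.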

TABLE I: Specifications of HALE-UAV.

Parameter	Value
Mass of HALE-UAV, $m$	1,815 [$\mathrm{kg}$]
Acceleration of gravity, $g$	9.81 [$\mathrm{m/s^{2}}$]
Weight of HALE-UAV, $W=mg$	17,799 [$\mathrm{N}$]
Wing surface area, $S$	6.61 [$\mathrm{m^{2}}$]
Density of the air, $\rho$	0.089 [$\mathrm{kg/m^{3}}$]
Parasite drag coefficient at zero lift, $C_{D_{0}}$	0.045
Induced drag coefficient, $k$	0.052

3.3 CubeSat

3.3.1 Two Line Element (TLE)

To track the orbital mechanics of CubeSats, TLE data are required. Originating from the North American Aerospace Defense Command (NORAD), a TLE contains the vital details concerning the trajectories of objects orbiting the Earth, including CubeSats. NORAD, tasked with the surveillance and cataloging of space debris, introduced the TLE format to effectively disseminate orbital information. A TLE consists of two lines, as illustrated in Fig. 4, detailing specific orbital parameters and CubeSat characteristics. Fig. 4 displays the TLE for OPS-3811, a CubeSat utilized in the experiment, encompassing orbital elements such as inclination ($i$), right ascension of the ascending node ($\Omega$), eccentricity ($e$), argument of perigee ($\omega$), and mean anomaly ($M$). The inclination ($i$) is the angle of the CubeSat’s orbital plane relative to the equatorial plane of the Earth. The right ascension of the ascending node ($\Omega$) specifies the location where the CubeSat’s orbit crosses the equatorial plane from south to north. The eccentricity ($e$) is a measure of how far a CubeSat’s elliptical orbit deviates from a circle. The argument of perigee ($\omega$) is the angle from the line of nodes to the perigee of the orbit. The mean anomaly ($M$) indicates the CubeSat’s position within its orbit under the assumption of a circular path with the same semi-major axis ($a$). In other words, the mean anomaly is the angle from the perigee of the orbit to the position the CubeSat would occupy if it moved along its elliptical orbit at its average angular speed. These TLE data, such as $e$ and $\Omega$, are instrumental in calculating the CubeSat’s latitude and longitude, which in turn facilitates the determination of $x^{i}_{j}(t)$ between $G_{i}$ and $S^{i}_{j}$ by (15).
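As a rough illustration of how these elements are read out of a TLE, the sketch below slices the second line at the fixed column positions of the standard TLE format. The sample line is a widely circulated example for the International Space Station and is used here only as an assumption for demonstration; it is not the OPS-3811 data of Fig. 4, and the function name is hypothetical.

```python
def parse_tle_line2(line2):
    """Extract the orbital elements stored in fixed columns of TLE line 2."""
    return {
        "inclination_deg": float(line2[8:16]),
        "raan_deg": float(line2[17:25]),
        "eccentricity": float("0." + line2[26:33]),   # leading "0." is implied in the format
        "arg_perigee_deg": float(line2[34:42]),
        "mean_anomaly_deg": float(line2[43:51]),
        "mean_motion_rev_per_day": float(line2[52:63]),
    }

# Example line in TLE format (ISS), for illustration only.
SAMPLE_LINE2 = "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537"

elements = parse_tle_line2(SAMPLE_LINE2)
for name, value in elements.items():
    print(name, "=", value)
```

The eccentricity field is stored without its leading decimal point, which is why the parser prepends `0.` before converting it.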

3.3.2 Orbital Elements of CubeSats

As mentioned, the orbital elements expressed in TLE include eccentricity ( $e$ ), inclination ( $i$ ), right ascension of the ascending node ( $\Omega$ ), argument of perigee ( $\omega$ ), and mean anomaly ( $M$ ). The orbital elements that are not in TLE, such as semi-major axis ( $a$ ), eccentric anomaly ( $E$ ), and true anomaly ( $\nu$ ), are obtained from the orbital elements in TLE. Fig. 5(a) presents the geometric representation of orbital elements. The semi-major axis ( $a$ ), illustrated with a green line, denotes the longest radius of the CubeSat’s orbit and is crucial for calculating its eccentricity ( $e$ ). The eccentricity itself measures how much the orbit deviates from a perfect circle, with values close to $0$ indicating near circularity and values near $1$ indicating a highly elongated ellipse. The eccentricity vector ( $\overrightarrow{e}$ ) is a vector that goes from the center of the CubeSat’s orbit to the perigee of the orbit. Additionally, the orbital inclination ( $i$ ) is the angle between the axis normal to the equatorial plane ( $\overrightarrow{k}$ ) and the angular momentum vector of the orbit ( $\overrightarrow{H}$ ), with the latter perpendicular to the plane of the orbit, thereby quantifying the orbit’s tilt with respect to the equatorial plane of the Earth. The right ascension of the ascending node ( $\Omega$ ) gives the longitude of the line of nodes, i.e., the line along which the CubeSat’s orbital plane intersects the Earth’s equatorial plane. The argument of perigee ( $\omega$ ) is defined by the angle from the ascending node vector ( $\overrightarrow{n}$ ) to the eccentricity vector ( $\overrightarrow{e}$ ), with $\overrightarrow{n}$ directed along the line of nodes, depicted as a sky blue line in Fig. 5(a). This angle describes the orbit’s orientation relative to the equator, marking the perigee’s location.
The mean anomaly ( $M$ ) is a parameter for predicting the position of a CubeSat moving along an elliptical orbit over time, and is expressed as an angle representing the average position of the object within the orbital period, aiding in the calculation of the eccentric anomaly ( $E$ ). In an elliptical orbit, the CubeSat’s velocity changes as it passes through perigee (the closest point) and apogee (the farthest point), but the mean anomaly does not take these velocity changes into account and assumes that the CubeSat moves at a uniform velocity. Therefore, a difference may occur between the actual position of the CubeSat and the position calculated from the mean anomaly, and the eccentric anomaly and true anomaly are used to correct this difference. The mean anomaly does not directly correspond to the actual CubeSat position, but is used as an initial value to calculate more accurate positions, such as the eccentric anomaly and true anomaly, using the eccentricity of the orbit and other orbital elements. Therefore, the mean anomaly plays an important role when modeling trajectories as a function of time. Finally, the true anomaly ( $\nu$ ) is the angle from the perigee to the CubeSat’s actual position, represented by the angle between vectors $\overrightarrow{r}$ and $\overrightarrow{e}$ , where $\overrightarrow{r}$ points from the origin of the coordinate system to the CubeSat, and the coordinate axis $\overrightarrow{i}$ points towards the vernal equinox.

3.3.3 Latitude and Longitude of CubeSat

To track how the locations of CubeSats change over time, their positions are represented through coordinates of latitude ( $p^{\phi}_{j}(t)$ ) and longitude ( $p^{\lambda}_{j}(t)$ ) within the orbital coordinate systems. Given that the CubeSat’s unprocessed data in TLE consist of the coordinates in the celestial coordinate systems, the transformation to the orbital coordinate systems is required for the derivation of latitude and longitude. Consequently, the latitude ( $p^{\phi}_{j}(t)$ ) and longitude ( $p^{\lambda}_{j}(t)$ ) of the current position of CubeSat $S^{i}_{j}$ , i.e., the $j$ -th CubeSat located within the coverage of the $i$ -th GS, are expressed as $p^{\phi}_{j}(t)=\sin^{-1}\left(\frac{R_{f}[3]}{\lVert R_{f}\rVert}\right)$ and $p^{\lambda}_{j}(t)=\cos^{-1}\left(\frac{R_{f}[1]}{\lVert R_{f}\rVert\cos p^{\phi}_{j}(t)}\right)$ , where $R_{f}[1]$ and $R_{f}[3]$ refer to the first and third elements of $R_{f}$ , and $R_{f}$ is defined as,

R_{f}\triangleq[C_{1}\times C_{2}\times C_{3}\times C_{4}]\times V_{4}.

(6)

In (6), the coordinate transformation matrices, $C_{1}$ , $C_{2}$ , $C_{3}$ , and $C_{4}$ , are

	$\displaystyle C_{1}=\begin{bmatrix}\cos(\Omega)&\sin(\Omega)&0\\ -\sin(\Omega)&\cos(\Omega)&0\\ 0&0&1\end{bmatrix},C_{2}=\begin{bmatrix}1&0&0\\ 0&\cos(i)&\sin(i)\\ 0&-\sin(i)&\cos(i)\end{bmatrix},$
	$\displaystyle C_{3}=\begin{bmatrix}\cos({\omega})&\sin({\omega})&0\\ -\sin({\omega})&\cos({\omega})&0\\ 0&0&1\end{bmatrix},C_{4}=\begin{bmatrix}\cos({\theta})&\sin({\theta})&0\\ -\sin({\theta})&\cos({\theta})&0\\ 0&0&1\end{bmatrix},$

where $\theta$ is the angle by which the Earth has rotated in $t$ . Therefore, $\theta$ represents the product of the Earth’s rotational angular velocity and the time interval $t$ . Lastly, $V_{4}$ in (6) is,

V_{4}=\begin{bmatrix}r\cos(\nu)&r\sin(\nu)&0\end{bmatrix}^{T},

(7)

where $r$ is the orbital radius given by the conic-section equation, i.e., the distance from the origin of the coordinate system to the CubeSat. Additionally, $\overrightarrow{r}$ is the vector pointing from this origin to the current position of the CubeSat. Therefore, the current coordinates of the CubeSat measured in the celestial coordinate system are expressed as (7). However, to calculate the CubeSat’s latitude and longitude that change over time, $V_{4}$ in the celestial coordinate system must be converted to the orbital coordinate system using the coordinate transformation matrices $C_{1}$ , $C_{2}$ , $C_{3}$ , and $C_{4}$ defined above. Finally, $r$ in (7) is determined by

r=\frac{H^{2}/\mu}{1+e\cos(\nu)},

(8)

where $\mu$ and $H$ represent the standard gravitational parameter and angular momentum, respectively, with $H\triangleq\sqrt{\mu a(1-e^{2})}$ . The true anomaly is $\nu=2\tan^{-1}\left(\sqrt{\frac{1+e}{1-e}}\tan\left(\frac{E}{2}\right)\right)$ , where the eccentric anomaly is approximated to first order in $e$ as $E\approx M+e\sin{M}$ (the exact relation is Kepler’s equation, $M=E-e\sin{E}$ ). In this way, the data from TLE are transformed into geographical coordinates, i.e., latitude and longitude, over time. The constants needed to calculate the time-varying latitude and longitude of a CubeSat from TLE are summarized in Table II.

TABLE II: Parameter Settings for CubeSat Position Calculations

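Constant	Value
Gravitational constant, $G$	$6.673\times 10^{-20}$ [ $\mathrm{km^{3}\,kg^{-1}\,s^{-2}}$ ]
Mass of the Earth, $M_{e}$	$5.974\times 10^{24}$ [ $\mathrm{kg}$ ]
Radius of the Earth, $R_{e}$	$6.378\times 10^{6}$ [ $\mathrm{m}$ ]
Standard gravitational parameter, $\mu=GM_{e}$	$3.986\times 10^{14}$ [ $\mathrm{m^{3}\,s^{-2}}$ ]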
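The chain from orbital elements to latitude and longitude in (6)-(8) can be sketched numerically as follows. This is a minimal sketch rather than the implementation used in the experiments: the function names are hypothetical, the Earth rotation rate is an assumed constant that Table II does not list, and the longitude is recovered with a two-argument arctangent (swapped in for the $\cos^{-1}$ form so that the quadrant is resolved). The rotation matrices are applied literally in the order written in (6); sign and ordering conventions differ between references, so the output should be checked against a trusted propagator before use.

```python
import numpy as np

MU = 3.986e14               # standard gravitational parameter [m^3 s^-2], Table II
OMEGA_EARTH = 7.2921159e-5  # Earth rotation rate [rad/s]; assumed, not in Table II


def rot_z(angle):
    """z-axis rotation in the form used for C1, C3, and C4."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_x(angle):
    """x-axis rotation in the form used for C2."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])


def true_anomaly(mean_anomaly, ecc):
    """First-order eccentric anomaly E = M + e sin M, then the true anomaly."""
    ecc_anomaly = mean_anomaly + ecc * np.sin(mean_anomaly)
    return 2.0 * np.arctan(np.sqrt((1.0 + ecc) / (1.0 - ecc)) * np.tan(ecc_anomaly / 2.0))


def cubesat_lat_lon(a, ecc, inc, raan, argp, mean_anomaly, t):
    """Return (latitude, longitude, radius); angles in radians, a in metres, t in seconds."""
    nu = true_anomaly(mean_anomaly, ecc)
    h = np.sqrt(MU * a * (1.0 - ecc**2))                # angular momentum H
    r = (h**2 / MU) / (1.0 + ecc * np.cos(nu))          # (8)
    v4 = np.array([r * np.cos(nu), r * np.sin(nu), 0.0])  # (7)
    theta = OMEGA_EARTH * t                             # Earth rotation angle
    r_f = rot_z(raan) @ rot_x(inc) @ rot_z(argp) @ rot_z(theta) @ v4  # (6)
    lat = np.arcsin(np.clip(r_f[2] / np.linalg.norm(r_f), -1.0, 1.0))
    lon = np.arctan2(r_f[1], r_f[0])                    # quadrant-aware longitude
    return lat, lon, r
```

Because every factor in (6) is a rotation, the norm of $R_{f}$ equals $r$, and the resulting latitude never exceeds the inclination in magnitude for inclinations below $90^{\circ}$, which gives two quick sanity checks on any implementation.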

3.3.4 Distance between GS and CubeSat

The distance between GSs and NTN devices (i.e., CubeSats and HALE-UAVs) can be formulated as follows.

Lemma 1.

The distance between $G_{i}$ and $S^{i}_{j}$ varies over time as the latitude and longitude of the CubeSat are updated. It can be formulated as,

d^{i}_{j}(t)=\sqrt{H^{i}_{j}(t)^{2}+V^{i}_{j}(t)^{2}},

(9)

where $H^{i}_{j}(t)$ and $V^{i}_{j}(t)$ represent the respective horizontal and vertical distances between $G_{i}$ and $S_{j}^{i}$ , and note that $V^{i}_{j}(t)$ indicates the altitude of $S^{i}_{j}$ relative to $G_{i}$ . Then,

H^{i}_{j}(t)=R_{e}\cos^{-1}(\cos p^{\phi}_{i}(t)\cos p^{\phi}_{j}(t)\cos(p^{\lambda}_{i}(t)-p^{\lambda}_{j}(t))+\sin p^{\phi}_{i}(t)\sin p^{\phi}_{j}(t)),

(10)

where $p^{\phi}_{i}(t)$ and $p^{\lambda}_{i}(t)$ denote the latitude and longitude of $G_{i}$ ; and $R_{e}$ is the radius of the Earth.

Proof.

As illustrated in Fig. 5(b), $\vec{P}_{GS_{i}}$ and $\vec{P}_{CS_{j}}$ are positioned on the surface of the Earth. These position vectors are written in Cartesian coordinates as $\vec{P}_{GS_{i}}=(x_{i},y_{i},z_{i})$ and $\vec{P}_{CS_{j}}=(x_{j},y_{j},z_{j})$ , with components along the $x$ -, $y$ -, and $z$ -axes. In addition, the angular difference between $\vec{P}_{GS_{i}}$ and $\vec{P}_{CS_{j}}$ , i.e., $\theta$ , can be obtained as,

\theta=\cos^{-1}\frac{\vec{P}_{GS_{i}}\cdot\vec{P}_{CS_{j}}}{\left\|\vec{P}_{GS_{i}}\right\|\left\|\vec{P}_{CS_{j}}\right\|}=\cos^{-1}\frac{x_{i}x_{j}+y_{i}y_{j}+z_{i}z_{j}}{\sqrt{x_{i}^{2}+y_{i}^{2}+z_{i}^{2}}\sqrt{x_{j}^{2}+y_{j}^{2}+z_{j}^{2}}},

(11)

where $x_{i}$ , $y_{i}$ , $z_{i}$ , $x_{j}$ , $y_{j}$ , and $z_{j}$ can be represented as,

	$\displaystyle\begin{bmatrix}x_{i}\\ y_{i}\\ z_{i}\end{bmatrix}=\begin{bmatrix}R_{e}\cos p_{i}^{\phi}(t)\cos p_{i}^{\lambda}(t)\\ R_{e}\cos p_{i}^{\phi}(t)\sin p_{i}^{\lambda}(t)\\ R_{e}\sin p_{i}^{\phi}(t)\end{bmatrix},$		(12)
	$\displaystyle\begin{bmatrix}x_{j}\\ y_{j}\\ z_{j}\end{bmatrix}=\begin{bmatrix}R_{e}\cos p_{j}^{\phi}(t)\cos p_{j}^{\lambda}(t)\\ R_{e}\cos p_{j}^{\phi}(t)\sin p_{j}^{\lambda}(t)\\ R_{e}\sin p_{j}^{\phi}(t)\end{bmatrix},$		(13)

where $p_{i}^{\phi}(t)$ , $p_{i}^{\lambda}(t)$ , $p_{j}^{\phi}(t)$ , and $p_{j}^{\lambda}(t)$ are the latitude of $\vec{P}_{GS_{i}}$ , the longitude of $\vec{P}_{GS_{i}}$ , the latitude of $\vec{P}_{CS_{j}}$ , and the longitude of $\vec{P}_{CS_{j}}$ , at $t$ , respectively. The magnitudes of these vectors are equal, i.e., $\sqrt{x_{i}^{2}+y_{i}^{2}+z_{i}^{2}}=\sqrt{x_{j}^{2}+y_{j}^{2}+z_{j}^{2}}=R_{e}$ . Moreover, by (12) and (13) together with the identity $\cos p_{i}^{\lambda}(t)\cos p_{j}^{\lambda}(t)+\sin p_{i}^{\lambda}(t)\sin p_{j}^{\lambda}(t)=\cos(p_{i}^{\lambda}(t)-p_{j}^{\lambda}(t))$ , the inner product is $x_{i}x_{j}+y_{i}y_{j}+z_{i}z_{j}=R_{e}^{2}(\cos p_{i}^{\phi}(t)\cos p_{j}^{\phi}(t)\cos(p_{i}^{\lambda}(t)-p_{j}^{\lambda}(t))+\sin p_{i}^{\phi}(t)\sin p_{j}^{\phi}(t))$ , so the factor $R_{e}^{2}$ cancels in (11). Therefore, since $H^{i}_{j}(t)$ is the arc length $R_{e}\theta$ , which is depicted as the red line in Fig. 5(b), $H^{i}_{j}(t)=R_{e}\cos^{-1}(\cos p_{i}^{\phi}(t)\cos p_{j}^{\phi}(t)\cos(p_{i}^{\lambda}(t)-p_{j}^{\lambda}(t))+\sin p_{i}^{\phi}(t)\sin p_{j}^{\phi}(t))$ . ∎

Similarly, the distance between $G_{i}$ and the $l$ -th HALE-UAV within the coverage of $G_{i}$ , denoted as $A^{i}_{l}$ , is determined based on the latitude ( $p^{\phi}_{l}(t)$ ) and longitude ( $p^{\lambda}_{l}(t)$ ) of $A^{i}_{l}$ . Following (9), it is calculated as $d^{i}_{l}(t)=\sqrt{H^{i}_{l}(t)^{2}+V^{i}_{l}(t)^{2}}$ , where $H^{i}_{l}(t)$ and $V^{i}_{l}(t)$ are the horizontal and vertical distances, and $V^{i}_{l}(t)$ indicates the altitude of $A_{l}^{i}$ relative to $G_{i}$ . Furthermore, according to (10), $H^{i}_{l}(t)=R_{e}\cos^{-1}(\cos p^{\phi}_{i}(t)\cos p^{\phi}_{l}(t)\cos(p^{\lambda}_{i}(t)-p^{\lambda}_{l}(t))+\sin p^{\phi}_{i}(t)\sin p^{\phi}_{l}(t))$ , where $p^{\phi}_{l}(t)$ and $p^{\lambda}_{l}(t)$ denote the latitude and longitude of the $l$ -th HALE-UAV at time $t$ , respectively.
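The distance model of Lemma 1 applies unchanged to CubeSats and HALE-UAVs, and can be sketched as below. The function names are hypothetical; angles are assumed to be in radians and the altitude in metres, and the argument of the inverse cosine is clamped to $[-1,1]$ as a guard against floating-point rounding, which is an implementation detail not discussed in the text.

```python
import math

R_E = 6.378e6  # radius of the Earth [m], Table II


def horizontal_distance(lat_i, lon_i, lat_j, lon_j):
    """Great-circle distance (10) between the GS and the point below the NTN device."""
    cos_theta = (math.cos(lat_i) * math.cos(lat_j) * math.cos(lon_i - lon_j)
                 + math.sin(lat_i) * math.sin(lat_j))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # rounding guard
    return R_E * math.acos(cos_theta)


def slant_distance(lat_i, lon_i, lat_j, lon_j, altitude):
    """Distance (9): horizontal and vertical components combined."""
    horizontal = horizontal_distance(lat_i, lon_i, lat_j, lon_j)
    return math.hypot(horizontal, altitude)
```

For a device directly above the GS the horizontal term vanishes and the distance reduces to the altitude, while two points a quarter of the equator apart are separated horizontally by $R_{e}\pi/2$.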

4 Problem Formulation and Algorithm Design

4.1 Main Objective for Global SAGIN Mobile Access

The purpose of our proposed QMARL-based scheduler in SAGIN is to preserve the residual energy of NTN devices as much as possible while each GS improves the global access performance in terms of access availability and energy efficiency. Therefore, when each GS schedules CubeSats and HALE-UAVs for global access, it is important to simultaneously optimize the global access performance and the residual energy of NTN devices. To achieve this goal, a corresponding reward function should be designed for the MARL-based algorithm. The main objective of global SAGIN mobile access for each $i$ -th GS can be formulated as,

\max_{x^{i}_{j,l}(t)\in\{0,1\}}:\lim_{\mathcal{T}\rightarrow\infty}\frac{1}{\mathcal{T}}\sum\nolimits_{t=0}^{\mathcal{T}-1}\sum\nolimits_{\forall j\in M^{i},\forall l\in L^{i}}R_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t)),

(14)

where $d^{i}_{j,l}(t)$ and $x^{i}_{j,l}(t)$ represent the distance and the scheduling vector between $G_{i}$ and the NTN device within the coverage of $G_{i}$ (i.e., $S^{i}_{j}$ or $A^{i}_{l}$ ) at $t$ , respectively. In addition, $M^{i}$ and $L^{i}$ in (14) stand for the sets of CubeSats and HALE-UAVs within the coverage of $G_{i}$ . Furthermore, the constraint $\sum_{\forall j\in M^{i},\forall l\in L^{i}}x^{i}_{j,l}(t)\leq\bar{H}_{i}$ , with $x^{i}_{j,l}(t)\in\{0,1\}$ , $\forall j\in M^{i}$ , $\forall l\in L^{i}$ , holds, where $\bar{H}_{i}$ is the maximum number of NTN devices ( $S^{i}_{j}$ or $A^{i}_{l}$ ) that $G_{i}$ can monitor. Lastly, $R_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))$ is the reward function for seamless global access, and it can be formulated as,

R_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))=U_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))-C_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t)),

(15)

where $U_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))$ and $C_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))$ stand for the utility and cost functions. In (15),

U_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))=\sum\nolimits_{\forall j\in M^{i},\forall l\in L^{i}}\textbf{q}(d^{i}_{j,l}(t))\cdot\xi_{j,l}^{SA}(t)\cdot x^{i}_{j,l}(t),

(16)

where $\textbf{q}(d^{i}_{j,l}(t))$ and $\xi_{j,l}^{SA}(t)$ denote the quality function and capacity of the link between $G_{i}$ and its associated NTN device ( $S^{i}_{j}$ or $A^{i}_{l}$ ). In (16), the quality function can be generalized as [40],

\textbf{q}(d^{i}_{j,l}(t))\triangleq\left(1+\exp\left(-\xi_{1}\left(\Lambda^{i}_{j,l}(d^{i}_{j,l}(t))-\xi_{2}\right)\right)\right)^{-1},

(17)

where the data rate $\Lambda^{i}_{j,l}(d^{i}_{j,l}(t))$ depends on bandwidth ( $\mathrm{W}$ ) and signal-to-noise ratio (SNR), which is denoted as $\Gamma$ , thus,

\Lambda^{i}_{j,l}(d^{i}_{j,l}(t))=\mathrm{W}\cdot\log_{2}\left(1+\Gamma(d^{i}_{j,l}(t))\right).

(18)
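The quality function (17) is a logistic curve applied to the Shannon rate (18), which the following sketch evaluates. The function names are hypothetical, and the SNR is passed in directly as a linear value because the dependence of $\Gamma$ on distance is not specified here. The default values of $\xi_{1}$ and $\xi_{2}$ are those reported for the performance evaluation; the unit of the data rate is left as whatever unit the bandwidth is expressed in.

```python
import math

XI_1 = 0.01    # steepness of the quality curve (evaluation setting)
XI_2 = 1024.0  # data rate at which the quality equals 0.5 (evaluation setting)


def data_rate(bandwidth, snr):
    """Shannon rate (18); snr is a linear ratio, not a dB value."""
    return bandwidth * math.log2(1.0 + snr)


def quality(rate, xi_1=XI_1, xi_2=XI_2):
    """Logistic quality function (17), bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-xi_1 * (rate - xi_2)))
```

With these settings a link whose rate equals $\xi_{2}$ scores exactly $0.5$, and the score saturates towards $1$ as the rate grows, so additional rate on an already good link contributes little extra utility.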

Additionally, the cost function in (15) is expressed as,

C_{i}(d^{i}_{j,l}(t),x^{i}_{j,l}(t))=\sum\limits_{\forall j\in M^{i}}E^{S}_{j}(d^{i}_{j}(t),x^{i}_{j}(t))\cdot\underbrace{\sigma_{i}^{S}(t)}_{\textrm{(cooperation)}}+\sum\limits_{\forall l\in L^{i}}E^{A}_{l}(d^{i}_{l}(t),x^{i}_{l}(t))\cdot\underbrace{\sigma_{i}^{A}(t)}_{\textrm{(cooperation)}},

(19)

where $E^{S}_{j}(d^{i}_{j}(t),x^{i}_{j}(t))$ and $E^{A}_{l}(d^{i}_{l}(t),x^{i}_{l}(t))$ represent the normalized energy expenditure of $S^{i}_{j}$ and $A^{i}_{l}$ , respectively. In (19), $\sigma_{i}^{S}(t)$ and $\sigma_{i}^{A}(t)$ quantify the standard deviations of the residual energy levels of the CubeSats $S^{i}_{j}$ and the HALE-UAVs $A^{i}_{l}$ , respectively. The cooperation highlighted in (19) is essential for reducing the variance of the energy status across NTN devices (CubeSats or HALE-UAVs); it thereby averts the disproportionate energy usage of any specific CubeSat or HALE-UAV and promotes collaborative operations for minimizing total energy expenditure.

The total energy expenditure, i.e., $E^{S}_{j}(d^{i}_{j}(t),x^{i}_{j}(t))$ and $E^{A}_{l}(d^{i}_{l}(t),x^{i}_{l}(t))$ , corresponds to the amount of energy utilized during communications between $G_{i}$ and its associated NTN device ( $S^{i}_{j}$ or $A^{i}_{l}$ ). The energy consumed in $S^{i}_{j}$ , i.e., $E^{S}_{j}(d^{i}_{j}(t),x^{i}_{j}(t))$ , and in $A^{i}_{l}$ , i.e., $E^{A}_{l}(d^{i}_{l}(t),x^{i}_{l}(t))$ , is limited by the respective maximum capacities, $\bar{e}_{j}$ for $S^{i}_{j}$ and $\bar{e}_{l}$ for $A^{i}_{l}$ , which can be expressed as $E^{S}_{j}(d^{i}_{j}(t),x^{i}_{j}(t))\leq\bar{e}_{j},\forall j\in M^{i}$ and $E^{A}_{l}(d^{i}_{l}(t),x^{i}_{l}(t))\leq\bar{e}_{l},\forall l\in L^{i}$ , respectively. In addition, the maximum capacity of $G_{i}$ is taken into account, i.e.,

\xi^{GS}_{i}(t)+\sum\nolimits_{\forall j\in M^{i}}\xi^{S}_{j}(t)\cdot x^{i}_{j}(t)+\sum\nolimits_{\forall l\in L^{i}}\xi^{A}_{l}(t)\cdot x^{i}_{l}(t)\leq\bar{\xi}_{i}=\frac{\varrho}{1+e^{-\zeta(t-\tau)}},

(20)

where $\xi^{GS}_{i}(t)$ , $\xi^{S}_{j}(t)$ , $\xi^{A}_{l}(t)$ , and $\bar{\xi}_{i}$ are the capacity of $G_{i}$ , the capacity of $S^{i}_{j}$ , the capacity of $A^{i}_{l}$ , and the maximum capacity of $G_{i}$ , respectively. The value of $\bar{\xi}_{i}$ varies depending on the region where each GS is located, the population of that region, and the degree of communication overload. Additionally, $\varrho$ , $\zeta$ , $t$ , and $\tau$ are the maximum of the logistic curve, the factor controlling the steepness of the curve, the time, and the midpoint of the curve, respectively.
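To make the interplay of (15), (16), (19), and (20) concrete, the sketch below evaluates the reward and the capacity constraint for one GS at one time step. It is illustrative only: the function names and the flat array layout (one entry per NTN device) are assumptions, and the normalized energy expenditures are taken as inputs that have already been evaluated for the chosen schedule, since their closed form is not given in this section.

```python
import numpy as np


def reward(q, link_capacity, x, energy_sat, energy_uav, residual_sat, residual_uav):
    """Reward (15) = utility (16) minus cooperation-weighted energy cost (19)."""
    q, link_capacity, x = map(np.asarray, (q, link_capacity, x))
    utility = float(np.sum(q * link_capacity * x))
    # Standard deviations of residual energy act as the cooperation terms.
    cost = (np.sum(energy_sat) * np.std(residual_sat)
            + np.sum(energy_uav) * np.std(residual_uav))
    return utility - float(cost)


def capacity_ok(cap_gs, cap_sat, x_sat, cap_uav, x_uav, rho, zeta, t, tau):
    """Capacity constraint (20) with the logistic upper bound."""
    limit = rho / (1.0 + np.exp(-zeta * (t - tau)))
    load = cap_gs + np.dot(cap_sat, x_sat) + np.dot(cap_uav, x_uav)
    return bool(load <= limit)
```

When all devices of one type hold the same residual energy, the corresponding standard deviation is zero and that part of the cost disappears, which is how (19) rewards schedules that spread energy usage evenly.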

4.2 Reinforcement Learning Modeling

Owing to the dynamics of CubeSats and HALE-UAVs under uncertain environments, rapid and unexpected state changes occur over time. These dynamics and uncertainties are obstacles for large-scale global SAGIN mobile access scheduling, which can be modelled as combinatorial optimization. More specifically, such scheduling problems are generally formulated as integer programming (IP), which is known to be non-deterministic polynomial-time (NP)-hard, making them particularly difficult to solve using conventional methods. Therefore, it is highly advantageous to re-formulate the original optimization framework into RL-based sequential discrete-time decision-making for time-average scheduling utility maximization. Additionally, in the environment formalized through RL, each GS constantly interacts with the environment and learns its policy in the process; therefore, RL is well suited to such a highly dynamic and uncertain environment. However, implementing realistic global access in SAGIN requires many GSs, CubeSats, and HALE-UAVs. Because multiple GSs are required, the problem changes from RL to MARL scheduling, and because multiple CubeSats and HALE-UAVs must be used, the action dimension of each GS increases exponentially with the number of these NTN devices. Conventional MARL has the critical problem that, as the number of GSs increases or as the number of actions each GS can select (i.e., the number of CubeSats and HALE-UAVs) increases, each GS suffers from the curse of dimensionality and its learning performance deteriorates. This paper undertakes such a re-formulation using QMARL, proposing a novel approach for tackling the complexities of scheduling in time-varying dynamic environments. QMARL utilizes a QNN and alleviates the curse of dimensionality, which is the major problem in conventional MARL. With QMARL, seamless global access can be pursued by simultaneously optimizing global access performance and the residual energy of NTN devices even when numerous GSs, CubeSats, and HALE-UAVs are used.

State. In the considered network with CubeSats and HALE-UAVs, the state is defined by the observational data collected by $G_{i}$ , denoted as $\mathcal{S}_{i}(t)$ , and is given as follows,

\mathcal{S}_{i}(t)\triangleq\{P_{i}(t),\xi_{i}(t),\bigcup_{j\in M^{i}}\{P^{S}_{j}(t),E^{S}_{j}(t),\xi^{S}_{j}(t)\},\bigcup_{l\in L^{i}}\{P^{A}_{l}(t),E^{A}_{l}(t),\xi^{A}_{l}(t)\}\},

(21)

where $P_{i}(t)$ , $\xi_{i}(t)$ , $P^{S}_{j}(t)$ , $E^{S}_{j}(t)$ , $\xi^{S}_{j}(t)$ , $P^{A}_{l}(t)$ , $E^{A}_{l}(t)$ , and $\xi^{A}_{l}(t)$ stand for the position of $G_{i}$ , the capacity of $G_{i}$ , the position of $S_{j}^{i}(t)$ , the energy state of $S_{j}^{i}(t)$ , the capacity of $S_{j}^{i}(t)$ , the position of $A_{l}^{i}(t)$ , the energy state of $A_{l}^{i}(t)$ , and the capacity of $A_{l}^{i}(t)$ . Here, the positions of $G_{i}$ , $S_{j}^{i}$ , and $A_{l}^{i}$ are specified as $P_{i}(t)=\{p^{\phi}_{i}(t),p^{\lambda}_{i}(t),p^{H}_{i}(t)\}$ , $P^{S}_{j}(t)=\{p^{\phi}_{j}(t),p^{\lambda}_{j}(t),p^{H}_{j}(t),v^{S}_{j}(t)\}$ , and $P^{A}_{l}(t)=\{p^{\phi}_{l}(t),p^{\lambda}_{l}(t),p^{H}_{l}(t),v^{A}_{l}(t)\}$ , where $p^{\phi}_{i}(t)$ , $p^{\lambda}_{i}(t)$ , and $p^{H}_{i}(t)$ denote the latitude, longitude, and altitude of $G_{i}$ . Similarly, $p^{\phi}_{j}(t)$ , $p^{\lambda}_{j}(t)$ , $p^{H}_{j}(t)$ , $v^{S}_{j}(t)$ , $p^{\phi}_{l}(t)$ , $p^{\lambda}_{l}(t)$ , $p^{H}_{l}(t)$ , and $v^{A}_{l}(t)$ represent the latitude of $S_{j}^{i}$ , the longitude of $S_{j}^{i}$ , the altitude of $S_{j}^{i}$ , the velocity vector of $S_{j}^{i}$ , the latitude of $A_{l}^{i}$ , the longitude of $A_{l}^{i}$ , the altitude of $A_{l}^{i}$ , and the velocity vector of $A_{l}^{i}$ .

Action. The action at $t$ is represented as $\mathcal{A}(t)={[x^{i}_{j,l}(t)]}$ , where $x^{i}_{j,l}(t)\in\{0,1\}$ . This indicates whether the network access service between $G_{i}$ and the NTN device ( $S_{j}^{i}$ or $A_{l}^{i}$ ) is available at $t$ : the service is available when $x^{i}_{j}(t)=1$ or $x^{i}_{l}(t)=1$ , and unavailable otherwise.

Reward. The reward function is outlined in (15), with its maximization reliant on the action scheduling $x^{i}_{j,l}(t)$ made by $G_{i}$ . This reward encompasses both utility and cost functions. Fundamentally, the goal is for each GS to orchestrate the scheduling of NTN devices (CubeSats or HALE-UAVs) to enhance the access performance in global SAGIN systems. Simultaneously, our reward function aims at the reduction of (i) the overall energy usage and (ii) the standard deviation of individual energy levels of CubeSats and HALE-UAVs. This reward function facilitates the autonomous and cooperative energy management in CubeSat and HALE-UAV.

4.3 QMARL-based Scheduler Design

In the depicted scenario, each GS agent, identified as the $i$ -th GS, is responsible for executing a combinatorial scheduling decision across $M$ CubeSats and $L$ HALE-UAVs, as illustrated in Fig. 6. As the numbers of CubeSats $M$ and HALE-UAVs $L$ increase linearly, the total number of feasible scheduling decisions rises exponentially, as $2^{M+L}$ . Conventional RL policies must therefore expand their output dimensionality, i.e., action dimensions, to accommodate the $2^{M+L}$ potential combinations of these scheduling actions. However, such an increase in output dimensionality introduces difficulties in learning efficacy, a situation often described as the curse of dimensionality [41]. To tackle this challenge, this paper proposes a strategy utilizing QMARL. This approach leverages quantum measurement techniques, facilitating effective navigation through high-dimensional action decision spaces by GSs. It is noteworthy that training MARL with a substantial number of agents typically encounters reward convergence issues. Furthermore, as the number of action dimensions required by agents rises, achieving reward convergence grows more challenging. The quantum measurement-based approach introduced here is designed to surmount these challenges.

The QMARL-based scheduler outlined in this scenario is organized into three separate stages. The first two stages include encoding, which involves converting classical bits into quantum states referred to as qubits, and PQC, which involves the process of applying rotation gates to manipulate these quantum states in accordance with conventional QNN-based RL policies. The third and most important stage is measurement. During the concluding measurement stage, quantum states are transformed into an observable. This observable serves as the output obtained through the measurement of quantum states. The process of quantum measurement acts as a decoding mechanism, translating the outcomes of quantum computing into a format that classical computing systems can interpret and use. To facilitate global access performance of integrated networks through QMARL, the quantum system is established with a total of $M+L$ qubits. This total directly reflects the combined amount of CubeSats ( $M$ ) and HALE-UAVs ( $L$ ), leading to the equation: $|\psi\rangle=\sum_{k=1}^{2^{M+L}}\alpha_{k}|\mathbf{e}_{k}\rangle$ . In this context, $\alpha_{k}$ is defined as the probability amplitude, and $\mathbf{e}_{k}$ represents the $k$ -th basis within the Hilbert space.

In the domain of QNN, the Pauli-Z measurement is a prevalent method for transforming quantum states into observables. This conversion process does not depend on the number of qubits in use. In the Pauli-Z operator, each column corresponds to one of the computational basis states $|\hat{0}\rangle$ and $|\hat{1}\rangle$ . To derive the expectation value of each qubit’s state, a matrix that projects the quantum state onto the $z$ -axis is employed, which is expressed as ${\textbf{{P}}}^{k}_{Z}\triangleq{\textbf{{I}}}^{k-1}\otimes{\textbf{{Z}}}\otimes{\textbf{{I}}}^{Q-k}$ , where I is the identity matrix and $Q$ is the number of qubits. The observable associated with the $k$ -th qubit is computed as $\langle\mathcal{O}_{k}\rangle=\langle\psi|{\textbf{{P}}}^{k}_{Z}|\psi\rangle$ , where $\forall k\in\mathbbm{N}[1,Q]$ , $\langle\mathcal{O}_{k}\rangle\in\mathbbm{R}[-1,1]$ . Because this measurement yields only one output per qubit, managing the combinatorial scheduling of $M$ CubeSats and $L$ HALE-UAVs with an output dimensionality of $2^{M+L}$ would require $2^{M+L}$ qubits. This methodology therefore does not address the curse of dimensionality. In contrast, the QMARL-based scheduler proposed in this paper reduces the required number of qubits to a logarithmic scale, from $2^{M+L}$ down to $M+L$ . Consequently, this approach significantly reduces the qubit requirement, keeping it operationally feasible even under the constraints of the noisy intermediate-scale quantum (NISQ) era, where qubit availability is limited. By implementing the basis measurement, particularly through a projection-valued measure (PVM), the approach outlined in this paper determines the probabilities of all $2^{M+L}$ possible combinations with merely $M+L$ qubits. Thus, the likelihood of each of the $2^{M+L}$ conceivable actions can be ascertained using only $M+L$ qubits, expressed as $\{\text{Pr}(\mathcal{A}_{k})\}_{k=1}^{2^{M+L}}\triangleq\{\bigotimes\nolimits_{k=1}^{M+L}|x^{i}_{j,l}\rangle\}$ , where $\bigotimes$ symbolizes the Kronecker product, $\forall x^{i}_{j,l}\in\{0,1\}$ , $\forall j\in[1,M]$ , $\forall l\in[1,L]$ . Finally, the probability that the $i$ -th GS chooses the $k$ -th action from the $2^{M+L}$ possibilities at $t$ , according to its policy, is represented as,

\pi(\mathcal{A}_{k}(t)|\mathcal{S}_{i}(t);\boldsymbol{\theta}_{i})=\langle\psi|\mathbf{e}_{k}\rangle\langle\mathbf{e}_{k}|\psi\rangle=|\langle\psi|\mathbf{e}_{k}\rangle|^{2}=|\alpha_{k}|^{2},

(22)

where $|\mathbf{e}_{k}\rangle\langle\mathbf{e}_{k}|$ denotes the projector for the $k$ -th basis, with the collection of all such projectors for every basis being $\{|\mathbf{e}_{k}\rangle\langle\mathbf{e}_{k}|\}^{2^{M+L}}_{k=1}$ . Because each action probability corresponds to one basis-measurement outcome, the probabilities sum to one, i.e., $\sum\nolimits^{2^{M+L}}_{k=1}\pi(\mathcal{A}_{k}(t)|\mathcal{S}_{i}(t);\boldsymbol{\theta}_{i})=1$ . This paper adopts basis measurement as the activation function of the policy output, thereby allowing each GS to undertake action decision-making on the logarithmically reduced action dimension.
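The difference between Pauli-Z readout and basis measurement can be illustrated with a small classical state-vector simulation. This is a sketch under stated assumptions and not the circuit used in the paper: a product state built from single-qubit $R_{Y}$ rotations stands in for the output of the trained PQC, all function names are hypothetical, and basis states are indexed from $0$ rather than $1$. With $n=M+L$ qubits, Pauli-Z readout returns $n$ numbers, whereas basis measurement as in (22) returns a probability for each of the $2^{n}$ schedules.

```python
import numpy as np


def ry(theta):
    """Single-qubit R_Y rotation, exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])


def product_state(thetas):
    """Stand-in for the PQC output: each qubit rotated independently from |0>."""
    psi = np.array([1.0])
    for theta in thetas:
        psi = np.kron(psi, ry(theta) @ np.array([1.0, 0.0]))
    return psi


def action_probabilities(psi):
    """Basis measurement (22): one probability |alpha_k|^2 per basis state."""
    return np.abs(psi) ** 2


def pauli_z_expectations(psi, n):
    """Pauli-Z readout: one expectation value in [-1, 1] per qubit."""
    probs = action_probabilities(psi)
    idx = np.arange(len(psi))
    return [float(np.sum(probs * (1 - 2 * ((idx >> (n - 1 - q)) & 1)))) for q in range(n)]


def decode_action(k, n):
    """Map basis index k to a binary schedule; the first qubit is the leading bit."""
    return [(k >> (n - 1 - q)) & 1 for q in range(n)]


def sample_action(psi, rng):
    """Draw one schedule according to the basis-measurement probabilities."""
    probs = action_probabilities(psi)
    k = int(rng.choice(len(probs), p=probs / probs.sum()))
    return decode_action(k, int(np.log2(len(probs))))
```

For the evaluation setting with $M=L=8$, the same construction would use $16$ qubits and yield $2^{16}$ action probabilities, although simulating it classically requires a state vector of that length.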

4.4 QMARL-based Scheduler Training

The network under consideration is conceptualized as a multi-agent system, where each $i$ -th GS acts as the $i$ -th agent equipped with its own QNN-based RL policy, $\pi(\mathcal{A}(t)|\mathcal{S}_{i}(t);\boldsymbol{\theta}_{i})$ , parameterized by $\boldsymbol{\theta}_{i}$ . In the training phase, a unified centralized critic, parameterized by $\phi$ , assesses the policy effectiveness of multiple agents by estimating the state-value function $V_{\boldsymbol{\phi}}(\mathcal{S}(t))$ , with $\mathcal{S}(t)$ representing the ground truth, encapsulating all accessible environmental data [42]. Conversely, each GS engages in sequential decision-making based on its individual partial state (i.e., observation), $\mathcal{S}_{i}(t)$ . This training framework enables all GSs to refine their policies towards collective decision-making, notwithstanding their limited observation of the environment. Furthermore, during inference, due to the distributed approach to cooperation, it is possible to achieve effective scalability and efficient use of computing resources.

After completing this procedure, the temporal-difference (TD) error is utilized to implement multi-agent policy gradient (PG) methods for the training of quantum multi-actor centralized-critic networks. The gradient of the objective function for the $i$ -th actor ( $G_{i}$ ), denoted as $J(\boldsymbol{\theta}_{i})$ , is expressed as,

\nabla_{\theta_{i}}J(\theta_{i})=\mathbbm{E}_{\mathcal{S}}\Big[\sum\nolimits^{T}_{t=1}\sum\nolimits^{N}_{i=1}\delta_{\phi}(t)\nabla_{\theta_{i}}\log\pi(\mathcal{A}(t)|\mathcal{S}_{i}(t);\theta_{i})\Big],

(23)

where $\delta_{\phi}(t)$ , $\pi$ , $\mathcal{A}(t)$ , $\mathcal{S}_{i}(t)$ , and $\theta_{i}$ are the TD error based on Bellman optimality equation in time step $t$ , policy, action at time $t$ , state at time $t$ , and neural network parameters, respectively. The loss function pertaining to the critic, denoted by $\mathcal{L}(\phi)$ , is specified as,

\nabla_{\phi}\mathcal{L}(\phi)=\sum\nolimits^{T}_{t=1}\nabla_{\phi}\left\|\delta_{\phi}(t)\right\|^{2}.

(24)
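The quantities entering (23) and (24) can be sketched numerically as follows. The explicit form of the TD error is not written out in this section, so the common one-step form $\delta_{\phi}(t)=r(t)+\gamma V_{\phi}(\mathcal{S}(t+1))-V_{\phi}(\mathcal{S}(t))$ is assumed here; the function names are hypothetical and the default discount factor is the one listed for the evaluation.

```python
import numpy as np


def td_errors(rewards, values, next_values, gamma=0.98):
    """Assumed one-step TD error: r(t) + gamma * V(S(t+1)) - V(S(t))."""
    rewards, values, next_values = map(np.asarray, (rewards, values, next_values))
    return rewards + gamma * next_values - values


def critic_loss(deltas):
    """Sum of squared TD errors, whose gradient is (24)."""
    return float(np.sum(np.asarray(deltas) ** 2))


def actor_surrogate(deltas, log_probs):
    """Sum of delta * log pi; with delta held constant its gradient matches (23)."""
    return float(np.sum(np.asarray(deltas) * np.asarray(log_probs)))
```

The critic is trained to shrink the TD error, while each actor increases the log-probability of actions that produced a positive TD error and decreases it for actions with a negative one.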

To optimize the objective function for multiple GSs and reduce the loss function of the centralized critic, the derivatives of the $k$ -th parameters are expressed as,

	$\displaystyle\frac{\partial J(\theta_{i})}{\partial\theta_{k}}=\underbrace{\frac{\partial J(\theta_{i})}{\partial\pi_{\theta_{i}}}\cdot\frac{\partial\pi_{\theta_{i}}}{\partial\langle\mathcal{O}_{k,\boldsymbol{\theta}_{i}}\rangle}}_{\textrm{(Classical Backpropagation)}}\cdot\underbrace{\frac{\partial\langle\mathcal{O}_{k,\boldsymbol{\theta}_{i}}\rangle}{\partial\theta_{k}}}_{\textrm{(Parameter-Shift Rule)}},$		(25)
	$\displaystyle\frac{\partial\mathcal{L}(\phi)}{\partial\phi_{k}}=\underbrace{\frac{\partial\mathcal{L}(\phi)}{\partial V_{\phi}}\cdot\frac{\partial V_{\phi}}{\partial\langle\mathcal{O}_{k,\phi}\rangle}}_{\textrm{(Classical Backpropagation)}}\cdot\underbrace{\frac{\partial\langle\mathcal{O}_{k,\phi}\rangle}{\partial\phi_{k}}}_{\textrm{(Parameter-Shift Rule)}},$		(26)

and the first and second terms of the right-hand side in (25) and (26) are computed using classical partial derivatives. Nonetheless, the third term presents a challenge for classical computation methods, as the quantum state is not directly accessible before it collapses upon measurement. To overcome this problem in parameter optimization throughout the training phase, the parameter-shift rule comes into play. The rule applied for computing the first-order derivative with respect to the $k$ -th parameter of the $i$ -th GS is specified as,

\frac{\partial\langle\mathcal{O}_{k,\boldsymbol{\theta}_{i}}\rangle}{\partial\theta_{k}}=\langle\mathcal{O}_{k,\boldsymbol{\theta}_{i}+\frac{\pi}{2}\mathbf{e}_{k}}\rangle-\langle\mathcal{O}_{k,\boldsymbol{\theta}_{i}-\frac{\pi}{2}\mathbf{e}_{k}}\rangle,

(27)

where $\mathbf{e}_{k}$ denotes the $k$ -th basis. Unlike classical backpropagation, the parameter shift rule provides a more straightforward and intuitive methodology. As a result, this approach can significantly expedite the training process for QNNs.
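A single-qubit example shows how the parameter-shift rule recovers an exact derivative from two shifted evaluations of the same circuit. The sketch assumes the common gate convention $R_{Y}(\theta)=\exp(-i\theta Y/2)$, under which the shifted difference carries a factor $1/2$; (27) is written without this constant, which only rescales the gradient. The function names are hypothetical, and the expectation value is computed by classical simulation rather than on quantum hardware.

```python
import numpy as np

PAULI_Z = np.array([[1.0, 0.0], [0.0, -1.0]])


def ry(theta):
    """Single-qubit R_Y rotation, exp(-i * theta * Y / 2)."""
    c, s = np.cos(theta / 2.0), np.sin(theta / 2.0)
    return np.array([[c, -s], [s, c]])


def expval_z(theta):
    """<Z> after applying R_Y(theta) to |0>; analytically equal to cos(theta)."""
    psi = ry(theta) @ np.array([1.0, 0.0])
    return float(psi @ PAULI_Z @ psi)


def parameter_shift_grad(f, theta, shift=np.pi / 2.0):
    """Derivative of f at theta from two evaluations shifted by +/- pi/2."""
    return 0.5 * (f(theta + shift) - f(theta - shift))
```

Unlike a finite-difference estimate, the result is exact for this gate family: the returned value equals $-\sin\theta$ up to floating-point error, at the cost of two circuit evaluations per parameter.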

5 Performance Evaluation

5.1 Benchmarks and Simulation Setup

To evaluate the performance of the dimension-reduced QMARL-based scheduler, the following benchmarks are utilized: MARL, Independent Q-Learning (IQL), Deep Q-Network (DQN), and Random (i.e., Monte Carlo) schedulers. In the quality function (17), $\xi_{1}=0.01$ and $\xi_{2}=1{,}024$ , and the parameters used for this performance evaluation are presented in Table III.

TABLE III: System Parameters for Performance Evaluation

Parameter	Value
No. of GSs/CubeSats/HALE-UAVs ($N$, $M$, $L$)	$4$, $8$, $8$
Action dimension ($|\mathcal{A}|$)	$\{2^{1},2^{4},2^{16}\}$
Discount factor ($\gamma$)	$0.98$
Batch size	$64$
Initial/Min of epsilon ($\epsilon_{\mathrm{init}}$, $\epsilon_{\min}$)	$0.275$, $10^{-2}$
Annealing epsilon	$5\times 10^{-5}$
LR of actor ($\alpha_{\mathrm{actor}}$)	$10^{-3}$
LR of central critic ($\alpha_{\mathrm{critic}}$)	$2.5\times 10^{-4}$
Training epochs	$10{,}000$
Activation function	ReLU
Optimizer	Adam
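
As an illustration of how the exploration settings in Table III interact, the following sketch computes an epsilon schedule. It assumes that the annealing value is subtracted linearly once per training epoch until the minimum is reached; the text does not state the annealing rule or its granularity, so both the rule and the function name `epsilon_at` are hypothetical.

```python
EPS_INIT = 0.275    # initial epsilon (Table III)
EPS_MIN = 1e-2      # minimum epsilon (Table III)
EPS_ANNEAL = 5e-5   # annealing epsilon (Table III)
EPOCHS = 10_000     # training epochs (Table III)

def epsilon_at(epoch):
    """Exploration rate at a given epoch under assumed linear annealing."""
    return max(EPS_MIN, EPS_INIT - EPS_ANNEAL * epoch)

# Number of epochs needed to reach the floor under this assumption.
epochs_to_floor = round((EPS_INIT - EPS_MIN) / EPS_ANNEAL)

schedule = [epsilon_at(epoch) for epoch in range(EPOCHS)]
```

Under this assumed schedule the floor is reached after 5,300 of the 10,000 epochs, so the remaining epochs would run with the minimum exploration rate.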

5.2 Policy Training

Fig. 7(a) illustrates that the QMARL-based scheduling approach introduced in this paper outperforms the comparative benchmarks, achieving a maximal reward of $1.0$. In comparison, the MARL-based scheduler attains a lower reward than the QMARL-based scheduler, and its reward fluctuates without converging. Furthermore, the IQL- and DQN-based schedulers closely mirror the Random scheduler in terms of reward. Figs. 7(b)-(e) reveal that the QMARL-based scheduler attains superior QoS, capacity, and remaining energy for CubeSats/HALE-UAVs. Conversely, the MARL-based scheduler fails to concurrently optimize multiple metrics related to communication and the energy efficiency of NTN devices. Within the MARL-based scheduler, an increase in QoS and capacity correlates with a decrease in residual energy, indicating an inability to simultaneously optimize the global access performance of integrated networks (QoS, capacity) and the residual energy of CubeSats/HALE-UAVs. In contrast, the QMARL-based scheduler successfully optimizes both global access performance and energy efficiency in parallel.

TABLE IV: Performance Evaluation Results when $|\mathcal{A}|=2^{16}$

Algorithm	QoS	Capacity	Residual Energy
QMARL	$\mathbf{0.906}$	$\mathbf{0.894}$	$\mathbf{0.912}$
MARL	$0.484$	$0.321$	$0.457$
IQL	$0.148$	$0.188$	$0.419$
DQN	$0.194$	$0.258$	$0.442$
Random	$0.151$	$0.197$	$0.437$

Table IV shows that the QMARL-based scheduler significantly surpasses the MARL-based scheduler, recording an 87.2$\%$ enhancement in QoS, a 178$\%$ increase in capacity, and a 99.5$\%$ increase in residual energy. Additionally, the IQL-, DQN-, and Random-based schedulers are notably inferior in all evaluated aspects, with QoS not exceeding $0.2$, capacity remaining below $0.26$, and the residual energy of CubeSats/HALE-UAVs falling short of $0.45$, as shown in Table IV.
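
The relative gains over MARL can be recomputed from the entries of Table IV with a few lines of Python. The dictionary layout and the helper name `relative_gain` are illustrative choices, and the inputs are the three-decimal values printed in the table rather than the underlying raw measurements.

```python
# Entries copied from Table IV (|A| = 2^16).
TABLE_IV = {
    "QMARL":  {"QoS": 0.906, "Capacity": 0.894, "Residual Energy": 0.912},
    "MARL":   {"QoS": 0.484, "Capacity": 0.321, "Residual Energy": 0.457},
    "IQL":    {"QoS": 0.148, "Capacity": 0.188, "Residual Energy": 0.419},
    "DQN":    {"QoS": 0.194, "Capacity": 0.258, "Residual Energy": 0.442},
    "Random": {"QoS": 0.151, "Capacity": 0.197, "Residual Energy": 0.437},
}

def relative_gain(new, base):
    """Percentage improvement of `new` over `base`."""
    return 100.0 * (new - base) / base

# Gain of QMARL over MARL for each metric.
gains = {
    metric: relative_gain(TABLE_IV["QMARL"][metric], TABLE_IV["MARL"][metric])
    for metric in TABLE_IV["QMARL"]
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.1f}%")
```

With the rounded table entries this yields roughly 87.2$\%$, 178.5$\%$, and 99.6$\%$; the small deviations from the figures quoted in the text are consistent with rounding of the tabulated values.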

Figs. 8(a)–(b) show the relationship between the global access performance of integrated networks and the normalized residual energy of NTNs for each algorithm. The epoch on the $x$-axis is segmented into three phases: $0$ to $4k$ (initial phase), $4k$ to $7k$ (intermediate phase), and $7k$ to $10k$ (final phase). From the initial to the intermediate phase, MARL shows an increase in the energy of NTN devices, albeit with a reduction in QoS and capacity. This limitation is not exclusive to MARL but also extends to the IQL-, DQN-, and Random-based schedulers, which are unable to concurrently optimize the global access performance of integrated networks and the residual energy of NTN devices. In stark contrast, the QMARL-based scheduler consistently maintains elevated levels of QoS, capacity, and residual energy. Figs. 8(c)–(d) display the remaining energy of the $S^{i}_{j}$ and $A^{i}_{l}$. The occurrence of non-operational NTN devices is attributed to the inefficient energy utilization of the benchmarks, i.e., the MARL-, IQL-, DQN-, and Random-based schedulers. In contrast, the QMARL-based scheduler consistently exhibits higher residual energy than the benchmarks and avoids any non-functional NTN devices.

TABLE V: Total Normalized Converged Rewards

$|\mathcal{A}|$	QMARL	MARL	IQL	DQN	Random
$2^{1}$	0.9971	1.0000	0.9411	0.9527	0.2755
$2^{4}$	0.9813	1.0000	0.8267	0.9215	0.5452
$\mathbf{2^{16}}$	1.0000	0.4103	0.1730	0.2235	0.1390

Figs. 9(a)-(b) and Table V provide a comparative analysis of the rewards obtained by GSs utilizing the proposed algorithm and the benchmarks across varying sizes of the action dimension, specifically for $|\mathcal{A}|\in\{2^{1},2^{4},2^{16}\}$. The MARL-based scheduler exhibits superior reward outcomes at smaller action dimensions ($|\mathcal{A}|\in\{2^{1},2^{4}\}$); however, it encounters significant difficulties at the largest action dimension ($|\mathcal{A}|=2^{16}$), where it attains only 41.03$\%$ of the reward of the QMARL-based scheduler, due to the curse of dimensionality. In a similar vein, the IQL- and DQN-based schedulers yield outcomes that are analogous to those of the Random scheduler at the largest action dimension ($|\mathcal{A}|=2^{16}$). Fig. 9(a) depicts a box plot summarizing the reward distribution across all action dimensions throughout the training process. The median reward is represented by the red line at the center of each box, with the lower and upper boundaries of the box indicating the 25th and 75th percentiles, respectively. Outliers are marked with a red '+' symbol. Notably, at the exceedingly large action dimension ($|\mathcal{A}|=2^{16}$), the QMARL-based scheduler achieves the highest reward, while the performance of the other benchmarks deteriorates. Fig. 9(b) illustrates the converged normalized reward values according to the action dimensions. Larger action dimensions are more realistic because they correspond to a greater number of CubeSats and HALE-UAVs, which enhances real-world applicability. In global access of integrated networks involving extensive deployment of CubeSats and HALE-UAVs, only the QMARL-based scheduler trains successfully, which yields a significant performance gap relative to the other benchmarks. These training results emphasize the capability of the QMARL-based scheduler to mitigate the challenges posed by the curse of dimensionality.
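
The comparison across action dimensions can be checked directly against Table V. The sketch below is illustrative only: the rows are keyed by $\log_{2}|\mathcal{A}|$, and the helper names `best_scheduler` and `shortfall_vs` are hypothetical.

```python
# Converged normalized rewards copied from Table V, keyed by log2(|A|).
TABLE_V = {
    1:  {"QMARL": 0.9971, "MARL": 1.0000, "IQL": 0.9411, "DQN": 0.9527, "Random": 0.2755},
    4:  {"QMARL": 0.9813, "MARL": 1.0000, "IQL": 0.8267, "DQN": 0.9215, "Random": 0.5452},
    16: {"QMARL": 1.0000, "MARL": 0.4103, "IQL": 0.1730, "DQN": 0.2235, "Random": 0.1390},
}

def best_scheduler(row):
    """Name of the scheduler with the highest converged reward in a row."""
    return max(row, key=row.get)

def shortfall_vs(row, reference="QMARL"):
    """Percentage by which each scheduler falls short of the reference."""
    ref = row[reference]
    return {name: 100.0 * (1.0 - value / ref) for name, value in row.items()}

for log2_dim, row in TABLE_V.items():
    gaps = shortfall_vs(row)
    print(f"|A| = 2^{log2_dim}: best = {best_scheduler(row)}, "
          f"MARL shortfall vs QMARL = {gaps['MARL']:.2f}%")
```

At $|\mathcal{A}|=2^{16}$ the MARL entry equals 41.03$\%$ of the QMARL entry, i.e., a shortfall of about 58.97$\%$, whereas at $2^{1}$ and $2^{4}$ the shortfall is negative because MARL slightly exceeds QMARL.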

Additionally, Fig. 9(c) shows the normalized average residual energy of NTN devices with and without GS-specific capacity requirements. The pink bars represent the average residual energy of CubeSats, and the beige bars represent the average residual energy of HALE-UAVs. The two bars on the left correspond to the case without capacity requirements for each GS, and the two bars on the right correspond to the case with capacity requirements for each GS. Imposing capacity requirements on each GS prevents unnecessary energy waste in NTN devices. When the maximum capacity requirements are set differently for each GS depending on the region where the GS is located, the population of the region, and the degree of communication overload, the residual energy is 46.2$\%$ higher for CubeSats and 38.7$\%$ higher for HALE-UAVs than without such requirements.

6 Concluding Remarks

This paper introduces a novel QMARL-based global SAGIN mobile access scheduler for CubeSats and HALE-UAVs, which aims to maximize access availability and energy efficiency. The CubeSats, characterized by their limited energy resources, employ energy efficiency strategies that differentiate between sun-side and dark-side orbital segments to conserve power. The quantum-based approach is adopted because it reduces the scheduling action dimension. This attribute is particularly advantageous for ensuring the robust convergence of rewards in scenarios with large-scale action spaces, such as global access with considerable numbers of CubeSats and HALE-UAVs. The experimental setup reflects real-world conditions by incorporating the orbital dynamics of CubeSats and the aerodynamic characteristics of HALE-UAVs, thereby underscoring the practical applicability of the proposed QMARL-based scheduler. The performance evaluations across multiple metrics and benchmarks verify that the proposed scheduler achieves the desired performance improvements.

References

[1] J. Tang, J. Li, L. Zhang, X. Chen, K. Xue, Q. Sun, and J. Lu, “Opportunistic content-aware routing in satellite-terrestrial integrated networks,” IEEE Trans. Mobile Computing, pp. 1-15, 2024 (Early Access).
[2] Z. Luo, C. Wu, Z. Li, and W. Zhou, “Scaling GEO-Distributed Network Function Chains: A Prediction and Learning Framework,” IEEE J. Sel. Areas Commun., vol. 37, no. 8, pp. 1838–1850, Aug. 2019.
[3] S. Jung, M.-S. Lee, J. Kim, M.-Y. Yun, J. Kim, and J.-H. Kim, “Trustworthy handover in LEO satellite mobile networks,” ICT Express, vol. 8, no. 3, pp. 432–437, Sept. 2022.
[4] F. Tang, H. Zhang, and L. T. Yang, “Multipath Cooperative Routing with Efficient Acknowledgement for LEO Satellite Networks,” IEEE Trans. Mobile Computing, vol. 18, no. 1, pp. 179–192, Jan. 2019.
[5] S. S. Hassan, Y. M. Park, Y. K. Tun, W. Saad, Z. Han, and C. S. Hong, “Satellite-Based ITS Data Offloading & Computation in 6G Networks: A Cooperative Multi-Agent Proximal Policy Optimization DRL With Attention Approach,” IEEE Trans. Mobile Computing, vol. 23, no. 5, pp. 4956–4974, May 2024.
[6] Z. Ji, S. Wu, and C. Jiang, “Cooperative Multi-Agent Deep Reinforcement Learning for Computation Offloading in Digital Twin Satellite Edge Networks,” IEEE J. Sel. Areas Commun., vol. 41, no. 11, pp. 3414–3429, Nov. 2023.
[7] G. Pan, J. Ye, J. An, and M.-S. Alouini, “Latency Versus Reliability in LEO Mega-Constellations: Terrestrial, Aerial, or Space Relay?,” IEEE Trans. Mobile Computing, vol. 22, no. 9, pp. 5330–5345, Sept. 2023.
[8] Y. K. Tun, K. T. Kim, L. Zou, Z. Han, G. D ̵́an, and C. S. Hong, “Collaborative Computing Services at Ground, Air, and Space: An Optimization Approach,” IEEE Trans. Veh. Technol., vol. 73, no. 1, pp. 1491–1496, Jan. 2024.
[9] X. Feng, Y. Sun, and M. Peng, “Distributed Satellite-Terrestrial Cooperative Routing Strategy Based on Minimum Hop-Count Analysis in Mega LEO Satellite Constellation,” IEEE Trans. Mobile Computing, pp. 1–16, 2024 (Early Access).
[10] C. Dai, K. Zhu, and E. Hossain, “Multi-Agent Deep Reinforcement Learning for Joint Decoupled User Association and Trajectory Design in Full-Duplex Multi-UAV Networks,” IEEE Trans. Mobile Computing, vol. 22, no. 10, pp. 6056–6070, Oct. 2023.
[11] N. Qi, Z. Huang, F. Zhou, Q. Shi, Q. Wu, and M. Xiao, “Multi-Agent Deep Reinforcement Learning for Joint Decoupled User Association and Trajectory Design in Full-Duplex Multi-UAV Networks,” IEEE Trans. Mobile Computing, vol. 22, no. 10, pp. 6056–6070, Oct. 2023.
[12] P. Qi, X. Zhao, Y. Wang, R. Palacios, and A. Wynn, “Aeroelastic and Trajectory Control of High Altitude Long Endurance Aircraft,” IEEE Trans. Aerosp. Electron. Syst., vol. 54, no. 6, pp. 2992–3003, Dec. 2018.
[13] X. Dai, Z. Xiao, H. Jiang, and J. C. S. Lui, “UAV-Assisted Task Offloading in Vehicular Edge Computing Networks,” IEEE Trans. Mobile Computing, vol. 23, no. 4, pp. 2520–2534, Apr. 2024.
[14] X. Li, F. Tang, L. Fu, J. Yu, L. Chen, J. Liu, Y. Zhu, and L. T. Yang, “Optimized Controller Provisioning in Software-Defined LEO Satellite Networks,” IEEE Trans. Mobile Computing, vol. 22, no. 8, pp. 4850–4864, Aug. 2023.
[15] L. Huang, S. Bi, and Y.-J. A. Zhang, “Deep Reinforcement Learning for Online Computation Offloading in Wireless Powered Mobile-Edge Computing Networks,” IEEE Trans. Mobile Computing, vol. 19, no. 11, pp. 2581–2593, Nov. 2020.
[16] M. Tang and V. W. Wong, “Deep Reinforcement Learning for Task Offloading in Mobile Edge Computing Systems,” IEEE Trans. Mobile Computing, vol. 21, no. 6, pp. 1985–1997, Jun. 2022.
[17] G. S. Kim, J. Chung, and S. Park, “Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach,” IEEE Trans. Veh. Technol., pp. 1–6, 2024 (Early Access).
[18] J. Cui, Y. Liu, and A. Nallanathan, “Multi-Agent Reinforcement Learning-Based Resource Allocation for UAV Networks,” IEEE Trans. Wirel. Commun., vol. 19, no. 2, pp. 729–743, Feb. 2020.
[19] S. Park, J. Chung, C. Park, S. Jung, M. Choi, S. Cho, and J. Kim, “Joint Quantum Reinforcement Learning and Stabilized Control for Spatio-Temporal Coordination in Metaverse,” IEEE Trans. Mobile Computing, pp. 1–18, 2024 (Early Access).
[20] H. Baek, S. Park, and J. Kim, “Logarithmic Dimension Reduction for Quantum Neural Networks,” in Proc. ACM Conf. Int. Knowl. Manage. (CIKM), Birmingham, UK, Oct. 2023, pp. 3738–3742.
[21] W. K. New, C. Y. Leow, K. Navaie, and Z. Ding, “Aerial-Terrestrial Network NOMA for Cellular-Connected UAVs,” IEEE Trans. Veh. Technol., vol. 71, no. 6, pp. 6559–6573, Jun. 2022.
[22] J.-H. Lee, J. Park, M. Bennis, and Y.-C. Ko, “Integrating LEO Satellites and Multi-UAV Reinforcement Learning for Hybrid FSO/RF Non-Terrestrial Networks,” IEEE Trans. Veh. Technol., vol. 72, no. 3, pp. 3647–3662, Mar. 2023.
[23] H. Hu, Z. Chen, F. Zhou, Z. Han, and H. Zhu, “Joint Resource and Trajectory Optimization for Heterogeneous-UAVs Enabled Aerial-Ground Cooperative Computing Networks,” IEEE Trans. Veh. Technol., vol. 72, no. 7, pp. 8812–8826, Jul. 2023.
[24] N. Babu, M. Virgili, C. B. Papadias, P. Popovski, and A. J. Forsyth, “Cost- and Energy-Efficient Aerial Communication Networks With Interleaved Hovering and Flying,” IEEE Trans. Veh. Technol., vol. 70, no. 9, pp. 9077–9087, Sept. 2021.
[25] Y. Wang, M. Sheng, W. Zhuang, S. Zhang, N. Zhang, R. Liu, and J. Li, “Multi-Resource Coordinate Scheduling for Earth Observation in Space Information Networks,” IEEE J. Sel. Areas Commun., vol. 36, no. 2, pp. 268–279, Feb. 2018.
[26] Z. Jia, M. Sheng, J. Li, D. Niyato, and Z. Han, “LEO-Satellite-Assisted UAV: Joint Trajectory and Data Collection for Internet of Remote Things in 6G Aerial Access Networks,” IEEE Internet Things J., vol. 8, no. 12, pp. 9814–9826, Jun. 2021.
[27] T. Ma, H. Zhou, B. Qian, N. Cheng, X. Shen, X. Chen, and B. Bai, “UAV-LEO Integrated Backbone: A Ubiquitous Data Collection Approach for B5G Internet of Remote Things Networks,” IEEE J. Sel. Areas Commun., vol. 39, no. 11, pp. 3491–3505, Nov. 2021.
[28] J. Li, G. Wu, T. Liao, M. Fan, X. Mao, and W. Pedrycz, “Task Scheduling Under a Novel Framework for Data Relay Satellite Network via Deep Reinforcement Learning,” IEEE Trans. Veh. Technol., vol. 72, no. 5, pp. 6654–-6668, May 2023.
[29] C. Park, G. S. Kim, S. Park, S. Jung, and J. Kim, “Multi-Agent Reinforcement Learning for Cooperative Air Transportation Services in City-Wide Autonomous Urban Air Mobility,” IEEE Trans. Intell. Veh., vol. 8, no. 8, pp. 4016–4030, Aug. 2023.
[30] R. Chen, J. Chen, H. Wang, X. Tong, Y. Xu, N. Qi, and Y. Xu, “Joint Channel Access and Power Control Optimization in Large-Scale UAV Networks: A Hierarchical Mean Field Game Approach,” IEEE Trans. Veh. Technol., vol. 72, no. 2, pp. 1982–1996, Feb. 2023.
[31] C. Park, W. J. Yun, J. P. Kim, T. K. Rodrigues, S. Park, S. Jung, and J. Kim, “Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems,” IEEE Internet Things J., vol. 10, no. 22, pp. 20033–20048, Nov. 2023.
[32] O. Simeone, “An Introduction to Quantum Machine Learning for Engineers,” Found. Trends Signal Process., vol. 16, no. 1-2, pp. 1–223, Aug. 2022.
[33] S. Wojtowytsch and W. E, “Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective,” IEEE Trans. Artif. Intell., vol. 1, no. 2, pp. 121–129, Oct. 2020.
[34] S. Park, J. P. Kim, C. Park, S. Jung, and J. Kim, “Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation,” IEEE Commun. Mag., 2023 (Early Access).
[35] S. Park and J. Kim, C. Park, S. Jung, and J. Kim, “Quantum Reinforcement Learning for Large-Scale Multi-Agent Decision-Making in Autonomous Aerial Networks,” in Proc. IEEE VTS Asia Pac. Wirel. Commun. Symp. (APWCS), Taiwan, China, Aug. 2023, pp. 1–4.
[36] C. D. Perkins and R. E. Hage, Airplane Performance, Stability and Control, Wiley, Jan. 1991.
[37] S. Jung, W. J. Yun, M. Shin, J. Kim, and J.-H. Kim, “Orchestrated Scheduling and Multi-Agent Deep Reinforcement Learning for Cloud-Assisted Multi-UAV Charging Systems,” IEEE Trans. Veh. Technol., vol. 70, no. 6, pp. 5362–5377, Jun. 2021.
[38] A. R. S. Bramwell, D. Balmford, and G. Done, Bramwell’s Helicopter Dynamics, Elsevier, Apr. 2001.
[39] Y. Zeng, J. Xu, and R. Zhang, “Energy Minimization for Wireless Communication With Rotary-Wing UAV,” IEEE Trans. Wirel. Commun., vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
[40] J. Lee, R. R. Mazumdar, and N. B. Shroff, “Non-convex optimization and rate control for multi-class services in the Internet,” IEEE/ACM Trans. Netw., vol. 13, no. 4, pp. 827–840, Aug. 2005.
[41] W. Du and S. Ding, “A survey on multi-agent deep reinforcement learning: From the perspective of challenges and applications,” Artif. Intell. Rev., vol. 54, no. 5, pp. 3215–3238, Nov. 2020.
[42] R. Lowe, Y. Wu, A. Tamar et al., “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments,” in Adv. Neural Inf. Process. Syst. (NeurIPS), Long Beach, CA, Dec. 2017, pp. 6379–6390.