Article

Proximal Policy Optimization-Based Hierarchical Decision-Making Mechanism for Resource Allocation Optimization in UAV Networks

1 The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang 050000, China
2 School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 747; https://doi.org/10.3390/electronics14040747
Submission received: 29 December 2024 / Revised: 4 February 2025 / Accepted: 4 February 2025 / Published: 14 February 2025
(This article belongs to the Special Issue Applied Cryptography and Practical Cryptoanalysis for Web 3.0)

Abstract: To address the resource allocation problem in dynamic environments where multiple unmanned aerial vehicle base stations (UAV-BSs) provide efficient downlink services to ground users, this paper proposes a novel hierarchical decision-making mechanism based on the Proximal Policy Optimization (PPO) algorithm. The proposed method optimizes time-frequency resource allocation in the downlink, aiming to maximize the total user throughput over multiple time slots. By constructing channel and interference models, the complex multi-channel resource allocation problem is decomposed into a series of single-channel decision subproblems, significantly reducing the action space complexity. Specifically, the original exponential complexity $O(N^M)$ (where N is the number of users and M is the number of channels) is reduced to a linear complexity $O(N)$, effectively alleviating the curse of dimensionality. Simulation results demonstrate that the proposed hierarchical architecture, integrated with the PPO algorithm, achieves superior performance in terms of total throughput, convergence speed, and stability compared to existing methods. This study provides new insights and technical support for efficient resource management in UAV-BS systems operating in complex and dynamic environments.

1. Introduction

Unmanned aerial vehicles (UAVs) have emerged as a promising solution for a wide range of applications, including disaster recovery, surveillance, and communication services in rural and underserved areas [1]. Due to challenging terrain and other factors, the coverage capabilities of existing terrestrial base stations (BSs) remain insufficient to meet the increasing communication demands. As a result, the development of integrated space–air–ground networks has become a critical trend in the evolution of future wireless communication systems, with UAV networks playing a vital role in the aerial segment of such networks [2]. With wireless communication modules onboard, UAVs can be rapidly deployed as aerial BSs, providing a flexible and reliable solution for delivering connectivity to ground user equipment (UEs), particularly in temporary hotspots or large-scale public events [3].
However, the integration of UAVs into wireless communication systems presents several significant challenges, particularly in the allocation of channel resources [4]. As aerial base stations, UAVs must efficiently allocate limited wireless resources to ground users. Resource allocation in UAV-based communication networks faces several key difficulties. First, the mobility of both UAVs and users creates strong spatiotemporal correlations in the communication environment, leading to continuously fluctuating channel conditions. The varying locations and movement patterns of users introduce significant heterogeneity in their channel demands and adaptability, making traditional static channel allocation methods ineffective for managing these dynamic changes. Second, UAVs operate under strict resource constraints, such as limited power and spectrum availability, which necessitates the development of efficient strategies to balance communication quality with optimal resource utilization. Finally, the competition for channel resources among multiple users, coupled with mutual interference, demands the implementation of effective scheduling mechanisms to maximize overall network throughput. Addressing these challenges requires innovative approaches to ensure the efficient and reliable operation of UAV-based communication systems.
Given the highly dynamic nature of UAV networks [5], traditional static resource allocation methods are often inadequate. Therefore, developing novel spectrum allocation strategies tailored to UAV networks is essential to fully exploit their potential and address the unique challenges they present [6]. Many studies have focused on optimizing UAV resources using either deterministic or stochastic methods. Deterministic approaches optimize resource allocation under fixed, well-defined conditions [7], while stochastic methods handle uncertainty by introducing probabilistic and statistical models. Together, these approaches offer complementary ways to address the challenges of managing UAV resources in dynamic and unpredictable environments.
In recent years, deep reinforcement learning (DRL) has emerged as a powerful tool for solving dynamic resource allocation problems, thanks to its ability to handle high-dimensional and complex decision-making tasks [8,9]. Among DRL algorithms, Proximal Policy Optimization (PPO) has gained widespread recognition as an efficient and stable approach, striking a balance between optimization stability and computational efficiency.
PPO achieves this by alternating updates between the policy and value functions, employing a clipped surrogate objective to constrain the probability ratio during policy updates. This mechanism effectively mitigates the risk of abrupt policy fluctuations, thereby enhancing both the stability and convergence speed of the learning process. As a result, PPO has been extensively applied to resource optimization challenges in dynamic environments, demonstrating its capability to adapt to rapidly changing conditions.
This paper proposes an innovative hierarchical reinforcement learning framework to address the wireless resource allocation problem. The authors in [10] propose a deep reinforcement learning approach, but their study assumes that the communication channels are exclusively used by UAVs. In contrast, our research aims to address the resource allocation problem for efficient downlink service provision by multiple UAV base stations to ground users. Compared to the algorithm in [10], PPO is able to converge to a better strategy more quickly during training, reducing both training time and computational resources. For example, in complex multi-UAV scenarios, PPO achieves superior performance with fewer training steps. The study in [11] focuses on data collection scenarios in UAV-assisted cellular networks. The goal of optimization is to enable UAVs to collect more information and improve energy efficiency under limited energy consumption. By modeling the data collection process as a Markov decision process (MDP), the PPO algorithm is used to jointly optimize the UAV’s path and altitude, thereby enhancing system energy efficiency and enabling more efficient data collection. Our work addresses the scenario where multiple UAV base stations provide downlink services to ground users, aiming to maximize the total user throughput. We decompose the multi-channel resource allocation problem into subproblems of single-channel decisions, combining the PPO algorithm with hierarchical decision-making and progressive learning strategies. Simulation results demonstrate that the proposed method exhibits good convergence, superior performance across varying numbers of channels, and strong scalability.
Unlike traditional approaches, a novel hierarchical decision-making mechanism is designed, which decomposes the complex multi-channel resource allocation problem into a series of single-channel decision subproblems. Traditional centralized decision-making methods directly address high-dimensional state-action spaces, and their computational complexity grows exponentially with the number of drones, making them difficult to apply in large-scale systems. Although typical distributed methods reduce computational complexity, they suffer from the lack of effective coordination among agents, as each agent makes independent decisions, thus failing to achieve global optimality. The hierarchical approach proposed in this paper decomposes the complex problem into multiple subproblems, ensuring both decision-making efficiency and global coordination. This hierarchical architecture not only significantly reduces computational complexity but, more importantly, achieves an effective integration of local optimization and global coordination. At each decision step, the policy network optimizes the user allocation for a single channel rather than handling the allocation of all channels simultaneously. This hierarchical architecture significantly reduces the action space complexity from the exponential $O(N^M)$ (where N is the number of users and M is the number of channels) to a linear complexity of $O(N)$, thereby effectively mitigating the curse of dimensionality.
Moreover, a progressive learning strategy is proposed, enabling the network to first focus on learning the optimal allocation policy for a single channel. Through a carefully designed state update mechanism, the sub-decisions are then organically integrated into a complete resource allocation scheme. This sequential decision-making approach not only significantly accelerates the convergence speed and learning efficiency of the policy network but also enhances the model’s generalization capability.
The experimental results demonstrate that the proposed method achieves competitive resource allocation performance while substantially reducing the computational complexity of the algorithm. This provides a practical solution for the deployment of large-scale wireless communication systems.
The remainder of this paper is organized as follows: Section 2 reviews the related research in this field. Section 3 provides a detailed introduction to the system model, the practical scenarios under study, and the core problems to be addressed. Section 4 discusses the theoretical foundations of resource allocation based on the PPO algorithm. Section 5 describes the simulation setup and presents the simulation results. Finally, Section 6 summarizes the main contributions of this paper and outlines potential directions for future research.

2. Related Works

Substantial progress has been made in addressing spectrum management challenges, particularly through spectrum allocation. This technique plays a vital role in enhancing spectrum utilization, and current research largely focuses on two approaches: integrating machine learning into spectrum allocation and improving traditional algorithms.
Regarding machine learning-based approaches, Morozs et al. [12] proposed a dynamic spectrum access (DSA) algorithm using distributed heuristically accelerated Q-learning (DIAQ) for LTE cellular systems. The DIAQ algorithm significantly enhances quality of service (QoS) and supports higher network throughput density compared to conventional heuristic ICIC methods; however, it does not guarantee smooth spectrum access. Gao et al. [13] introduced an improved online synchronized Q-learning algorithm for dynamic spectrum access, mitigating spectrum congestion in cognitive radio networks. Naparstek et al. [14] employed deep reinforcement learning to address user interference issues in slowly varying wireless environments. Their algorithm effectively suppresses spectrum conflicts when the numbers of users and channels are limited, but performs poorly in highly dynamic scenarios with larger numbers of users and channels.
For traditional spectrum allocation algorithms, Zhang et al. [15] explored solutions to address spectrum scarcity by utilizing the abundant unoccupied bandwidth in millimeter wave (mmWave) frequencies. These bands can serve as high-throughput channels for both terrestrial and aerial UAV networks. In contrast, Wang et al. [16] proposed a dynamic spectrum allocation algorithm, which improves flexibility by allocating spectrum resources based on varying user demands, thereby enhancing spectrum utilization. In [17], Tu proposed flow scheduling and channel aggregation strategies for wireless multimedia multi-flow transmission scenarios. The Efficient Multi-flow Multicast Transmission (EMMT) algorithm was designed to increase the number of concurrent multimedia streams while ensuring their transmission performance. Deb et al. [18] developed a centralized control strategy for managing user spectrum access and switching, focusing on transmission queue stability under data burst and user mobility conditions. However, their approach does not account for the signaling overhead caused by centralized control or the challenges of highly dynamic wireless environments. In the context of UAV multicast scenarios, some studies have proposed practical solutions to ensure throughput. In [19], Tu introduced the Efficient Transition Formation (ETF) algorithm, which includes the design of a seamless checking algorithm and a trajectory formation algorithm. This approach handles various UAV transition scenarios, enabling fast and resource-efficient transitions while ensuring high-performance group communication during mobility. In [20], Tu addressed multicast scenarios in wireless mesh networks (WMNs) by proposing the Parallel Low-rate Transmission (PLT) scheme and the Alternative Rate Transmission (ART) algorithm. These approaches improve transmission coverage, mitigate interference, and expand the coverage area. By balancing rate and coverage over a broader area, the proposed methods effectively enhance network throughput.
To improve spectrum utilization through spectrum sharing, Wang et al. [21] introduced a robust spectrum sharing framework. However, UAV communications are particularly susceptible to interference due to their wireless broadcasting and line-of-sight characteristics. Chen et al. [22] proposed an anti-jamming scheme based on Markov decision processes to address interference attacks but did not consider the competition among users for limited spectrum resources or the temporal behavior between user channel selection and external interference attacks. Yao et al. [23] modeled the anti-jamming problem using Markov games and proposed a collaborative multi-agent anti-jamming algorithm to achieve optimal interference resistance strategies. However, their approach is limited to frequency sweeping interference and lacks intelligent decision-making capabilities.

3. System Model

3.1. Network Model

The temporal resource is divided into discrete time slots, denoted as the set $\mathcal{T} = \{1, 2, \ldots, T\}$. As illustrated in Figure 1, there are N UAVs, represented by the set $\mathcal{N} = \{1, 2, \ldots, N\}$, with each UAV equipped with an omnidirectional antenna array. Additionally, there are U users, denoted as the set $\mathcal{U} = \{1, 2, \ldots, U\}$, where each user is also equipped with an omnidirectional antenna array. Each UAV n serves a subset of $U_n$ users, represented by the set $\mathcal{U}_n = \{1, 2, \ldots, U_n\}$, and the coverage radius of each UAV is $r_n$.
The total available spectrum has a bandwidth of $B = f_{up} - f_{down}$ and is divided into M subchannels, represented as the set $\mathcal{M} = \{1, 2, \ldots, M\}$. Each UAV has a total transmit power budget of $P_n$, and the transmission power of UAV n to user $u_n$ is denoted as $P_{n,u_n}$.
Assuming that user locations vary slowly over time, the position of user u at time t is represented as $L_u(t) = (x_u(t), y_u(t), h_u(t))$, with a constant altitude $h_u(t) = h_u(0)$. Similarly, the position of UAV n during time slot t is represented as $L_n(t) = (x_n(t), y_n(t), h_n(t))$.

3.2. Link Model

Assuming favorable channel conditions, the channel model between UAV n and user $u_n$ during time slot t is based on a free-space path loss model. The free-space path loss for a single transmission is expressed as
$$PL_{n,u_n} = 20 \log_{10}\!\left(\frac{4 \pi f_c\, d_{n,u_n}}{c}\right)$$
where $c = 3 \times 10^8$ m/s, $f_c$ denotes the carrier frequency, and $d_{n,u_n}$ represents the distance between UAV n and user $u_n$, calculated as follows:
$$d_{n,u_n}(t) = \sqrt{\left(x_{u_n}(t) - x_n(t)\right)^2 + \left(y_{u_n}(t) - y_n(t)\right)^2 + \left(h_{u_n}(t) - h_n(t)\right)^2}$$
In time slot t, the subchannel allocation scheme is represented by $\mathbf{K}(t) \in \{0,1\}^{M \times U}$, where $[\mathbf{K}(t)]_{m,u_n} = k_{m,u_n}(t)$ is the subchannel indicator. Specifically, if the m-th subchannel is assigned to user $u_n$ during time slot t, then $k_{m,u_n}(t) = 1$; otherwise, $k_{m,u_n}(t) = 0$. The rate of user $u_n$ is
$$R_{u_n}(t) = \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right)$$
The total achievable rate can be expressed as
$$R_{total} = \sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{u_n=1}^{U_n} R_{u_n}(t) = \sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{u_n=1}^{U_n} \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right)$$
where $\mathrm{SINR}_{m,u_n}(t)$ represents the Signal-to-Interference-plus-Noise Ratio (SINR) for user $u_n$ served by UAV n in time slot t, and is given by
$$\mathrm{SINR}_{m,u_n}(t) = \frac{P_{u_n}\, h_{n,u_n}(t)}{I^{\mathrm{intra}}_{n,u_n}(t) + I^{\mathrm{inter}}_{n,u_n}(t) + \delta^2(t)}$$
where $\delta^2(t)$ is the power of the additive white Gaussian noise (AWGN), $I^{\mathrm{intra}}_{n,u_n}(t)$ represents the interference caused by the same UAV when serving its other users in time slot t, and $I^{\mathrm{inter}}_{n,u_n}(t)$ is the interference caused by other UAVs serving their respective users during time slot t.
As illustrated in Figure 2, User 1 is served by UAV A, but the distance between User 1 and UAV B is within the coverage radius of UAV B. Consequently, UAV B causes interference to User 1. To account for such interference, a variable ξ u n , j is introduced to represent the interference from other UAVs to the current user:
$$\xi_{u_n,j} = \begin{cases} 1, & d_{u_n,j} < r_j \\ 0, & d_{u_n,j} \geq r_j \end{cases}$$
where $d_{u_n,j}$, the distance between user $u_n$ and UAV j, is computed as in Equation (2).
The intra-cell interference within a UAV and the inter-cell interference caused by other UAVs to its users can be quantified as
$$I^{\mathrm{intra}}_{n,u_n}(t) = \sum_{i=1,\, i \neq u_n}^{U_n} k_{m,i}(t)\, P_{n,i}\, h_{n,u_n}(t)$$
$$I^{\mathrm{inter}}_{n,u_n}(t) = \sum_{j=1,\, j \neq n}^{N} \sum_{i=1}^{U_j} k_{m,i}(t)\, \xi_{u_n,j}(t)\, P_{j,i}\, h_{j,u_n}(t)$$
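For concreteness, the link model above can be sketched in Python as follows. The helper names are ours (not from the paper), and the snippet assumes linear-scale gains and the B/M subchannel bandwidth used in the rate expression.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def channel_gain(distance, carrier_freq):
    """Linear-scale gain derived from the free-space path loss of Equation (1)."""
    path_loss_db = 20.0 * np.log10(4.0 * np.pi * carrier_freq * distance / C)
    return 10.0 ** (-path_loss_db / 10.0)

def sinr(p_tx, gain, intra_interf, inter_interf, noise_power):
    """SINR of one user on one subchannel, following the interference model above."""
    return p_tx * gain / (intra_interf + inter_interf + noise_power)

def user_rate(k_row, sinr_row, bandwidth, num_channels):
    """Per-user rate: k_row and sinr_row are length-M vectors for one user."""
    return float(np.sum(k_row * (bandwidth / num_channels) * np.log2(1.0 + sinr_row)))
```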

3.3. Problem Formulation

Our objective is to maximize the total data rate under the constraints of the maximum transmission power and limited bandwidth for each UAV. Thus, the resource allocation problem in each time slot is formulated as an optimization problem.  
$$
\begin{aligned}
\max \quad & \sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{u_n=1}^{U_n} \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right) \\
\text{s.t.} \quad
& C1: \ \mathrm{SINR}_{m,u_n}(t) = \frac{P_{u_n}\, h_{n,u_n}(t)}{I^{\mathrm{intra}}_{n,u_n}(t) + I^{\mathrm{inter}}_{n,u_n}(t) + \delta^2(t)} \geq \mathrm{SINR}_{threshold}, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n,\ m \in \mathcal{M} \\
& C2: \ \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right) \geq R^{\mathrm{re}}_{u_n}, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n \\
& C3: \ \sum_{m=1}^{M} k_{m,u_n}(t) \leq M, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n \\
& C4: \ \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \leq B, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n \\
& C5: \ k_{m,u_n}(t) \in \{0, 1\}, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n,\ m \in \mathcal{M} \\
& C6: \ 0 \leq P_{u_n} \leq P_n, \quad \forall n \in \mathcal{N},\ u_n \in \mathcal{U}_n \\
& C7: \ \sum_{u_n=1}^{U_n} P_{u_n} \leq P_n, \quad \forall n \in \mathcal{N}
\end{aligned}
$$
In these constraints, C1 ensures that the SINR received by each user is greater than a predefined threshold. Constraint C2 guarantees that the data rate allocated to each user in a time slot meets or exceeds their required data rate. Constraints C3 and C4 ensure that the number of subchannels and the bandwidth allocated to a single user remain within the limits of the total available subchannels and bandwidth. Constraint C5 specifies that the subchannel allocation indicators are binary, i.e., a subchannel of a UAV is either assigned to a user or not. Constraint C6 ensures that the power allocated to a user is non-negative and does not exceed the maximum power limit. Finally, Constraint C7 ensures that the total power allocated to all users served by a UAV remains within the UAV’s maximum power budget.
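As an illustration of how these constraints can be checked for a candidate allocation of one UAV, the following sketch (with assumed array shapes and a helper name of our own) tests C1–C3 and C5–C7 directly; C4 is implied by C3 given the fixed per-subchannel bandwidth B/M.

```python
import numpy as np

def allocation_feasible(K, powers, p_max, sinr, sinr_min, rate_req, bandwidth):
    """K, sinr: (M, U_n) arrays for one UAV; powers: length-U_n per-user transmit powers."""
    M = K.shape[0]
    rates = np.sum(K * (bandwidth / M) * np.log2(1.0 + sinr), axis=0)  # per-user rate
    return bool(
        np.all((K == 0) | (K == 1))                    # C5: binary indicators
        and np.all(K.sum(axis=0) <= M)                 # C3: at most M subchannels per user
        and np.all(sinr[K == 1] >= sinr_min)           # C1: SINR threshold on assigned subchannels
        and np.all(rates >= rate_req)                  # C2: per-user rate requirement
        and np.all((powers >= 0) & (powers <= p_max))  # C6: per-user power bounds
        and powers.sum() <= p_max                      # C7: total power budget of the UAV
    )
```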

4. Time-Frequency Resource Allocation Based on PPO

In this section, we model the time-frequency resource allocation optimization problem as a Markov decision process (MDP) and propose a solution based on the PPO algorithm. The PPO algorithm demonstrates higher stability during the learning process and is particularly suitable for discrete action spaces. The framework of the PPO algorithm is illustrated in Figure 3, which provides an overview of its structure and operational flow.

4.1. Algorithm Formulation

In this paper, UAVs are treated as intelligent agents. The MDP framework describes the interaction between the agent and the environment, which is defined as a 4-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$. The transition probability is denoted as $\mathcal{P}$. In addition, we provide a detailed definition of the state space $\mathcal{S}$, action space $\mathcal{A}$, and reward function $\mathcal{R}$ for the optimization problem. The detailed procedure of the PPO-based resource allocation algorithm is summarized in Algorithm 1, which outlines the training process for UAVs to optimize spectrum allocation through policy gradient updates.
Algorithm 1 The PPO-based algorithm for multi-UAV resource allocation
Initialize the environment and the replay buffer B
Initialize the actor networks $\pi_\theta$ and $\pi_{\theta_{old}}$, and the critic network $V_\phi$
for each episode do
    Reset the state
    for t = 1 to T do
        Obtain $s_t$ from the environment
        Execute $a_t$ according to $\pi_{\theta_{old}}$
        Calculate $r_t$ and observe the next state $s_{t+1}$
        Store the transition $(s_t, a_t, r_t, s_{t+1})$ into B
        if len(batch) ≥ batch size then
            Calculate the advantage $A_t$ by Equation (19)
            for each epoch do
                Calculate $L_{PPO}$ by Equation (18)
                Calculate $L_{critic}$ by Equation (21)
                Update the actor network by $L_{PPO}$
                Update the critic network by $L_{critic}$
            end for
        end if
        Update $\pi_{\theta_{old}} \leftarrow \pi_\theta$
    end for
end for
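A compact Python rendering of Algorithm 1 is sketched below. The env and agent objects and their methods (reset, step, act, store, ready, update) are assumed interfaces introduced here only for illustration; they do not come from the paper.

```python
def train(env, agent, num_episodes, horizon):
    """PPO training loop following Algorithm 1 (interfaces assumed, see above)."""
    for _ in range(num_episodes):                           # for each episode
        state = env.reset()                                 # reset the state
        for _ in range(horizon):                            # for t = 1 .. T
            action = agent.act(state)                       # sample a_t from the old policy
            next_state, reward, done, _ = env.step(action)  # execute a_t, observe r_t and s_{t+1}
            agent.store(state, action, reward, next_state)  # push the transition into the buffer
            if agent.ready():                               # enough samples for a mini-batch
                agent.update()                              # actor/critic updates (Eqs. (18)-(21))
            state = next_state
            if done:
                break
```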

4.1.1. State Space

The design of the state space reflects the relationship between the system’s current channel allocation state and the relevant information. The design is detailed as follows:
The state represents the current allocation configuration. Specifically, the state is a one-dimensional vector of size $U \times (M + N)$. Each element is either 0 or 1, indicating whether a user is allocated to a certain channel or associated with a UAV.
The state space is defined by the placement function, with a range of $[0, 1]^{U \times (M + N)}$. The upper and lower bounds indicate whether a user is allocated to a channel. For example, a state of [1, 0, 0, …] indicates that the first user and the fourth user are allocated to a specific channel.
At the beginning of the simulation, the state is initialized randomly according to the channel allocation method. The initial state is a $U \times (M + N)$ matrix, which is then unfolded into a one-dimensional vector.
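A minimal sketch of this state encoding, assuming the allocation information is held in two binary arrays; the helper name encode_state and the array layout are our own choices, not the authors' implementation.

```python
import numpy as np

def encode_state(channel_alloc, uav_assoc):
    """channel_alloc: (U, M) 0/1 array; uav_assoc: (U, N) 0/1 array -> flat state vector."""
    return np.concatenate([channel_alloc, uav_assoc], axis=1).astype(np.float32).ravel()

# Example: 4 users, 3 channels, 2 UAVs -> a state vector of length 4 * (3 + 2) = 20.
state = encode_state(np.zeros((4, 3)), np.eye(2)[[0, 0, 1, 1]])
assert state.shape == (20,)
```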

4.1.2. Action Space

In the simulation environment, the design of the action space revolves around the distribution of users and channels, specifically focusing on the form of action space representation. The details are as follows:
An action is represented as an integer, which is mapped to a binary control code that specifies which users are assigned to a channel. The size of the action space is $C_N^U$, where U is the number of users and N is the number of UAVs.
The action space is defined by the placement function, with a size of $C_N^U$. This means that each action can be mapped to a binary allocation state, indicating which user is assigned to a particular channel. Specifically, in the allocation vector, a value of 1 in the i-th position indicates that the i-th user is allocated to that channel.
In the environment, an integer action is mapped to a binary control matrix through a custom mapping function. This matrix represents the specific channel allocation state. For instance, an integer action is first converted into a binary control code, then padded to a fixed-length string, and finally transformed into a vector.
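A small sketch of this integer-to-allocation mapping, under the assumption that each bit of the binary code marks one user's occupancy of the channel currently being decided; the function name is hypothetical.

```python
def action_to_allocation(action: int, num_users: int):
    """Map an integer action to a 0/1 occupancy vector over users for one channel."""
    bits = format(action, f"0{num_users}b")   # fixed-length binary control code
    return [int(b) for b in bits]

# Example with 5 users: action 9 -> '01001' -> users 2 and 5 occupy the channel being decided.
print(action_to_allocation(9, 5))  # [0, 1, 0, 0, 1]
```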

4.1.3. Reward Function

The goal of the reward function is to maximize the system’s total data transmission rate, encouraging efficient channel allocation strategies. The detailed design is as follows:
The reward represents the sum of the transmission rates of all users under the current channel allocation state. The transmission rate is calculated based on Shannon’s theorem:
$$R_{u_n}(t) = \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right)$$
The SINR of a channel is defined as follows:
$$\mathrm{SINR}_{m,u_n}(t) = \frac{P_{u_n}\, h_{m,u_n}(t)}{I^{\mathrm{intra}}_{n,m}(t) + I^{\mathrm{inter}}_{n,m}(t) + \delta^2(t)}$$
where $P_{u_n}$ is the transmission power of user $u_n$, $h_{m,u_n}(t)$ is the channel gain between user $u_n$ and channel m, $I^{\mathrm{intra}}_{n,m}(t)$ is the intra-cell interference, $I^{\mathrm{inter}}_{n,m}(t)$ is the inter-cell interference, and $\delta^2(t)$ is the noise power.
The total reward is defined as the sum of the transmission rates of all users across all channels:
$$\sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{u_n=1}^{U_n} \sum_{m=1}^{M} \frac{B}{M}\, k_{m,u_n}(t) \log_2\!\left(1 + \mathrm{SINR}_{m,u_n}(t)\right)$$
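As a one-function sketch of this reward (our own helper, assuming the indicator matrix and per-subchannel SINRs are available as (M, U) arrays):

```python
import numpy as np

def step_reward(K, sinr, bandwidth):
    """K, sinr: (M, U) indicator and SINR arrays; returns the total throughput reward."""
    M = K.shape[0]
    return float(np.sum(K * (bandwidth / M) * np.log2(1.0 + sinr)))
```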

4.2. Proximal Policy Optimization

PPO is an emerging policy gradient (PG) algorithm. To address the issues of step-size sensitivity and difficulty in determining an appropriate step size in traditional PG algorithms, PPO introduces a novel objective function for mini-batch updates, improving both stability and efficiency.
The PPO algorithm originates from the Trust Region Policy Optimization (TRPO) algorithm and introduces a clipped surrogate objective to control the magnitude of policy updates. The core idea is to measure the difference between the updated and old policies using importance sampling and to ensure the updated policy remains within the “trust region” of the old policy, thereby avoiding drastic fluctuations during updates.
Policy gradient algorithms optimize the policy function $\pi(a \mid s; \theta)$ parameterized by $\theta$. In policy gradient methods, the objective function is defined as follows:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[R(\tau)\right],$$
where $\pi_\theta$ is the stochastic policy, $\tau = (s_1, a_1, \ldots, s_T, a_T)$ represents a trajectory of length T, and $R(\tau) = \sum_{t=1}^{T} \gamma^t r_t$ is the cumulative reward with discount factor $\gamma$. By differentiating $J(\theta)$ to estimate the policy gradient, the loss function $L_p$ can be obtained as
$$L_p(\theta) = \mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid s_t)\, A_t\right],$$
where $A_t$ is an estimator of the advantage function.
Based on the policy gradient method, the parameter update rule can be expressed as follows:
$$\theta_{new} = \theta_{old} + \alpha \nabla_\theta J,$$
where α is the learning rate. This equation demonstrates that the learning rate α directly determines the quality of the updated policy. If α is not appropriate, the updated policy may not perform well, and poor policies could degrade the learning process when used for further updates, causing instability in the entire training process.
To ensure that the new policy improves the expected reward without excessive changes, the TRPO algorithm was proposed. In TRPO, the KL (Kullback–Leibler) divergence is introduced as a trust region constraint to limit the difference between the new and old policies. Specifically, the trust region is defined by the probability ratio:
$$\psi(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},$$
and the objective function is expressed as follows:
$$L_{TRPO}(\theta) = \mathbb{E}_t\!\left[\psi(\theta)\, A_t\right].$$
TRPO solves a constrained optimization problem, which involves significant computational complexity to enforce the trust region constraint. To address this limitation, the PPO algorithm was introduced. PPO achieves a similar effect to TRPO by incorporating a clipping penalty to restrict the policy update, thereby avoiding the need for second-order calculations. The new objective function can be expressed as
$$L_{PPO}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\psi(\theta)\, A_t,\ \mathrm{clip}\!\left(\psi(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\right) A_t\right)\right].$$
Here, $\varepsilon$ is a hyperparameter, and the clipping operation constrains $\psi(\theta)$ to the range $(1 - \varepsilon, 1 + \varepsilon)$, avoiding excessively large updates to the policy.
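The clipped objective can be written compactly as below, a NumPy sketch using the clipping parameter 0.2 that later appears in Table 1; a training implementation would negate the mean to obtain a loss to minimize.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective; eps matches the clipping parameter 0.2 in Table 1."""
    ratio = np.exp(logp_new - logp_old)                 # psi(theta) = pi_theta / pi_theta_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)      # constrain the ratio to [1-eps, 1+eps]
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```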

4.3. Time-Frequency Resource Allocation Based on PPO

The core idea of the PPO algorithm is to limit the magnitude of policy updates by introducing a truncated ratio objective function. Specifically, in each iteration, the probability ratio between the new and old policies is constrained within the range of [ 1 ϵ , 1 + ϵ ] , thereby preventing excessively large policy updates that could destabilize the training process. In the context of spectrum allocation in a UAV network, a hierarchical action space design is employed: each action in the action space represents the allocation of each channel, specifically the occupancy state of each user on the various channels. This hierarchical design reduces the search complexity of the action space. Additionally, a reward function based on the total system throughput is formulated, incorporating two key penalty terms: (1) a rate requirement constraint, ensuring that the communication demands of each UAV are satisfied, and (2) a spectrum resource constraint, preventing excessive competition and redundant allocation of channels. This reward function design incentivizes the algorithm to maximize system performance while satisfying practical resource constraints. Through PPO’s policy gradient updates and value function estimation, the spectrum allocation strategy is progressively optimized, ultimately achieving efficient resource distribution. Experimental results demonstrate that the PPO algorithm with a hierarchical action space effectively addresses the spectrum allocation problem in UAV networks, achieving significant improvements in system performance while ensuring the satisfaction of various constraints.
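To make the claimed complexity reduction concrete, the short sketch below (our own illustration, with user and channel counts in the range used by the later simulations) counts candidate actions per decision step under a flat joint allocation versus the per-channel hierarchical decision.

```python
def joint_action_count(num_users: int, num_channels: int) -> int:
    """One flat action assigns a user to every channel at once: N^M candidates."""
    return num_users ** num_channels

def per_channel_action_count(num_users: int) -> int:
    """A hierarchical step decides a single channel: N candidates, repeated M times."""
    return num_users

print(joint_action_count(15, 12))    # 129746337890625 joint actions
print(per_channel_action_count(15))  # 15 actions per decision step
```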
In this system, UAVs are modeled as intelligent agents, each equipped with a policy network $\pi_\theta$ and a value evaluation network $V_\varphi$. At each time step t, the agent selects an action $a_t$ based on the current state $s_t$ from the environment, where the action $a_t$ is generated through the policy network $\pi_\theta$.
After the agent executes the action $a_t$, the environment returns an immediate reward $r_t$, and the state transitions to a new state $s_{t+1}$. The trajectory $(s_1, a_1, r_1, \ldots, s_T)$ is stored in an experience replay buffer. As the buffer accumulates more trajectories, the agent periodically samples a batch of trajectories to update the actor network (policy network) and critic network (value network), thereby optimizing the policy and value function.
The actor network optimizes the policy parameters $\theta$ using the clipped objective $L_{PPO}$. Generalized advantage estimation (GAE) is adopted to calculate $A_t$, defined as follows:
$$A_t = \delta_t + \gamma \delta_{t+1} + \cdots + \gamma^{T-t+1} \delta_{T-1},$$
where
$$\delta_t = r_t + \gamma V_\varphi(s_{t+1}) - V_\varphi(s_t).$$
Here, $V_\varphi(s_t)$ represents the state value function at time step t.
The critic network updates its parameters $\varphi$ by minimizing the mean squared advantage estimate $A_t$, and its loss function is defined as follows:
$$L_{critic}(\varphi) = \frac{1}{T} \sum_{t=1}^{T} A_t^2.$$
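A NumPy sketch of the advantage recursion above (with λ = 1, i.e., discounted backward sums of the TD errors $\delta_t$) and of the critic loss; the bootstrap-value layout is our own assumption, not taken from the paper.

```python
import numpy as np

def compute_advantages(rewards, values, gamma):
    """rewards: length-T float array; values: length-(T+1) array of V(s_t) incl. a bootstrap value."""
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):                # discounted backward accumulation of deltas
        running = deltas[t] + gamma * running
        advantages[t] = running
    return advantages

def critic_loss(advantages):
    """Mean squared advantage, matching the critic objective above."""
    return float(np.mean(advantages ** 2))
```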

5. Simulation Results

In this section, the proposed scheme is compared with other baselines, and extensive simulation results are provided to validate its superiority.

5.1. Simulation Settings

In this simulation, three areas with dimensions of 30 km × 30 km are considered. Each area is equipped with a UAV-BS to provide wireless communication services for ground users. Path loss in the environment follows the free-space path loss model.
In the policy network of the Proximal Policy Optimization (PPO) algorithm, a two-layer feedforward fully connected neural network structure is adopted to generate the action probability distribution for the current state. The input layer accepts a state vector of size equal to the number of states, representing the state information of the current environment. The hidden layer contains 256 neurons, which achieve nonlinear feature mapping through linear transformations and the ReLU activation function. The output layer has a dimension equal to the number of users, and a Softmax function is used to generate the probability distribution over the action space, ensuring that the probabilities sum to one. The primary function of the policy network is to map states to action probabilities, guiding the agent’s decision-making process in the environment. In the PPO algorithm, this network represents the policy function, and the agent samples actions from the output probability distribution or selects the optimal action to interact with the environment.
The value network in PPO adopts a structure similar to the policy network but serves the purpose of estimating the value of the input states. The input layer accepts a state vector of the same size as the number of states. The hidden layer consists of 16 neurons and uses the ReLU activation function for nonlinear feature extraction. The output layer has a single scalar node that directly outputs the scalar value of the current state without additional activation functions. The main function of the value network is to map states to state values, estimating the long-term expected return of the agent in the current state. In the PPO algorithm, the value network provides a baseline for policy optimization by calculating state values, which are used to estimate the advantage function. This helps to guide the direction of policy updates, improving training stability and efficiency.
Additionally, the neural network parameters are trained using the Adam optimizer. For the value network, which aims to predict expected returns with accuracy, Adam adapts the learning rate to handle the non-stationary nature of value prediction. In the early stages of training, Adam employs bias correction mechanisms to adjust the estimates of the first and second moments, ensuring the accuracy of early training updates. This optimization approach, combined with PPO’s clipped objective function, ensures both efficient parameter updates and sustained training stability. The simulation is implemented in a Python 3.7 environment with TensorFlow 2.6. Other parameters used in the simulation are provided in Table 1.
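A sketch of the two networks as described in this subsection, written with the tf.keras API (the paper reports TensorFlow 2.6); layer sizes follow the text, and the learning rates follow Table 1 (read as 5 × 10⁻⁵ and 1 × 10⁻⁴). This is an illustrative reconstruction, not the authors' code.

```python
import tensorflow as tf

def build_policy_network(state_dim: int, num_actions: int) -> tf.keras.Model:
    """Two-layer policy network: 256-unit ReLU hidden layer, softmax over actions."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(num_actions, activation="softmax"),
    ])

def build_value_network(state_dim: int) -> tf.keras.Model:
    """Value network: 16-unit ReLU hidden layer, single linear output for V(s)."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(1),
    ])

policy_optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)  # Policynet learning rate (Table 1)
value_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)   # Valuenet learning rate (Table 1)
```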

5.2. Performance Analysis

Figure 4 illustrates the training dynamics of the PPO algorithm. The horizontal axis represents the number of training episodes, while the vertical axis indicates the reward values obtained. In the initial training phase (episodes 0–200), significant fluctuations in rewards are observed due to the agent’s exploration within the action space and the instability of the learned policy. Around episode 200, a sharp increase in rewards occurs as the agent begins to identify more optimal strategies. After this rapid improvement, the rewards stabilize near their maximum value of approximately 10,000. This stabilization, observed from episode 300 onward, indicates that the agent has successfully converged to a robust and stable policy.
Based on the data shown in Figure 5, the raw reward dynamics during PPO training exhibit significant fluctuations in the initial stage (episodes 0–304), which can be attributed to the randomness in strategy updates and exploration. However, after episode 304, the rewards stabilize near a value of approximately 2000 Mbps, indicating that the PPO algorithm has converged.
This result demonstrates the PPO algorithm’s effectiveness in achieving higher rewards and its ability to converge during the training process.
Figure 6 illustrates the deviation of PPO reward values relative to the post-convergence mean. In the early training phase (episodes 0–304), significant negative deviations are observed, with values dropping as low as −0.6 due to the instability introduced during exploration. After episode 304, the deviation rapidly decreases and begins to stabilize. By approximately episode 400, the deviation consistently falls within a ±3% threshold (indicated by red dashed lines), demonstrating that PPO rewards have stabilized and oscillate minimally around the mean. In this figure, the horizontal red dashed lines indicate the threshold range within which the rewards oscillate after convergence, while the vertical red dashed line represents the point where the algorithm achieves stability. This analysis highlights the PPO algorithm’s convergence behavior and its ability to achieve stable performance after episode 304.
Figure 7 compares the PPO raw reward values (blue line) with their 50-episode moving average (green solid line). During the initial phase (episodes 0–304), the raw reward values exhibit significant fluctuations, ranging between approximately 600 and 1200, primarily due to the instability caused by exploration and the policy learning process. The 50-episode moving average smooths these fluctuations, revealing a steady upward trend. After episode 304, both the raw reward values and the moving average stabilize near the mean reward value of 2000 Mbps (indicated by the red dashed line), highlighting the PPO algorithm’s convergence and consistent performance. In the figure, the horizontal red dashed line represents the fluctuation range of the reward values, while the vertical red dashed line marks the episode at which the algorithm reaches convergence.
Figure 8 presents the Coefficient of Variation (CV), a statistical measure of the ratio of the standard deviation to the mean, useful for assessing relative dispersion. In the early training phase (episodes 0–304), the CV is relatively high, with values reaching up to 0.15, indicating significant fluctuations in the reward values. This is primarily due to the instability introduced by exploration and policy updates, resulting in a larger standard deviation relative to the mean.
After episode 304, the CV drops sharply and stabilizes near zero, reflecting a convergence in the PPO training process. At this stage, the reward values stabilize around the mean, with minimal variability. The low CV observed in the later training stages highlights the high stability and consistency of PPO training results.
Based on the analysis of the results in Figure 9, the throughput of the PPO algorithm exhibits a clear linear growth trend with the increase in the number of channels. Specifically, as the number of channels increases from 3 to 12, the system throughput steadily rises from 983.3 Mbps to 3933.8 Mbps. With each additional channel, the throughput increases by approximately 327.8 Mbps. This linear relationship indicates that the PPO algorithm can effectively utilize additional channel resources, demonstrating good scalability. In the high-channel range of 10 to 12 channels in particular, the algorithm maintains a stable growth pattern, without encountering any performance bottlenecks or signs of slowdown, suggesting that the PPO algorithm performs excellently in multi-channel resource scheduling. Additionally, the smoothness of the growth curve indicates a steady and stable performance across different numbers of channels, with no noticeable fluctuations or abrupt changes.
The analysis of the results in Figure 10 demonstrates that the PPO algorithm exhibits stable throughput performance across different numbers of users. The throughput remains consistently around 2622 Mbps, with the maximum value observed at 9 users (2622.63 Mbps) and the minimum at 14 users (2622.44 Mbps), resulting in an overall fluctuation of less than 0.2 Mbps. This minimal variation highlights the algorithm’s robustness and stability. Even as the number of users increases from 5 to 15, no significant decline in throughput is observed, indicating that the algorithm effectively manages and allocates network resources to maintain stable system performance. Notably, in scenarios with higher user counts (e.g., 13 to 15 users), although there is a slight decrease in throughput, the reduction is minimal. This suggests that the PPO algorithm can maintain a high-quality service even under increased user loads. Such a characteristic is critical for real-world network deployments, as it demonstrates the algorithm’s scalability and ability to handle growing demands while sustaining system performance.
Based on the analysis of Figure 11, it is evident that the algorithm’s convergence characteristics undergo significant changes with an increase in the discount factor (γ). When γ ≤ 0.85, the algorithm exhibits a relatively fast convergence rate, reaching a stable value of approximately 1966 in fewer iterations. However, as γ continues to increase, the number of iterations required for convergence shows a clear upward trend. In particular, when γ reaches 0.90, the convergence rate slows considerably, making it difficult to achieve full convergence within the limited number of iterations. At γ = 0.95, a sharp decline in the observed convergence rate is noted, with the value dropping to 1752.5. This does not indicate a substantial decrease in algorithm performance, but rather reflects the fact that, within the given iteration limit, the algorithm has not yet fully converged. This phenomenon can be explained by the fact that a larger discount factor places more emphasis on long-term rewards, which necessitates more iterations to thoroughly evaluate and balance the long-term effects of various decisions, thus slowing the convergence process. Therefore, the sharp decline observed in the figure is actually a manifestation of insufficient convergence, rather than a fundamental degradation in the algorithm’s performance.
Through the analysis of the impact of the discount factor (γ) on the convergence of the PPO algorithm in Figure 12, it can be observed that within the range of γ from 0.5 to 0.9, the algorithm exhibits relatively stable characteristics, with the number of training iterations required for convergence generally remaining between 200 and 300. Specifically, when γ = 0.6, the algorithm achieves the optimal convergence effect, completing training in just 200 iterations. However, when the discount factor exceeds 0.9, particularly at γ = 0.95, the number of training iterations required for convergence sharply increases to 490. This suggests that placing excessive emphasis on long-term rewards may significantly affect the stability of the algorithm’s convergence. Based on these observations, it is recommended to select a discount factor in the range of 0.6 to 0.8 for practical applications, as this ensures both the stability of the algorithm and a relatively fast convergence rate. In particular, γ = 0.6 may be the optimal choice, as it achieves the fastest convergence while maintaining good algorithm performance. On the other hand, if a balance between short-term and long-term rewards is needed, γ = 0.8 would be a more ideal choice, as it maintains stability while enabling convergence within a reasonable number of iterations.
Based on the updated data in Figure 13, we summarize the performance comparison among the Proximal Policy Optimization (PPO) algorithm, the greedy algorithm, and the round-robin algorithm as follows. The greedy algorithm operates by making locally optimal decisions at each step with the hope that these will lead to a globally optimal solution. In the context of resource allocation, the greedy algorithm typically selects the most “promising” task or channel at each decision point, without considering the long-term impact of its choice. The round-robin algorithm is a time-sharing scheduling method where tasks or resources are assigned in a cyclic manner. In a networking context, this means that each channel is allocated in turn, without regard to the current load or demand on each channel.
The bar chart illustrates the sum rates (in Mbps) attained by PPO (represented in green), the greedy algorithm (in blue), and the round-robin algorithm (in red) across varying numbers of channels, ranging from 3 to 15. Notably, PPO outperforms the other two algorithms consistently across all channel counts.
In the case of a small number of channels (from three to five), PPO provides moderate performance advantages over the other algorithms. For example, when the number of channels is five, PPO achieves a sum rate of 1638.70 Mbps. This is approximately 31.3% higher than the sum rate of the greedy algorithm (1247.32 Mbps) and around 55.4% higher than that of the round-robin algorithm (1055.09 Mbps).
As the number of channels increases, the performance advantage of PPO becomes even more significant. At 15 channels, PPO reaches a sum rate of 4912.60 Mbps, surpassing the greedy algorithm (3741.14 Mbps) by roughly 31.3% and the round-robin algorithm (3203.94 Mbps) by about 53.3%.
PPO exhibits excellent scalability, achieving substantial performance gains as the number of channels grows. The greedy algorithm outperforms the round-robin algorithm in all scenarios, yet it lags far behind PPO, especially when the number of channels is high. The round-robin algorithm has difficulty keeping up with PPO, indicating its lower efficiency in utilizing channel resources.
This comparison underscores the PPO algorithm’s capacity to adapt and optimize in dynamic environments, achieving remarkably higher sum rates and demonstrating superior performance and scalability relative to the other two algorithms.
Algorithms with exponential complexity in action space face significant challenges when addressing large-scale problems, as the number of possible actions within the action space becomes extraordinarily large. In such cases, convergence may not occur within a finite number of steps, or the algorithm may experience significant fluctuations after convergence. In large-scale UAV network resource allocation, as the number of users and channels increases, algorithms with exponential complexity may fail to converge within a limited number of steps or may show considerable instability after convergence. In contrast, linear complexity algorithms can converge quickly within a finite number of steps and exhibit better stability, providing timely resource allocation solutions that meet real-time system requirements and improve both system response time and operational efficiency.
Algorithms with linear complexity in the action space are better suited to handle large-scale problems. As the problem size grows, the increase in the number of actions remains relatively small, leading to a more modest decline in algorithm performance. This makes such algorithms more adaptable and scalable when faced with expanding UAV networks, without the need to worry about limitations in computational resources and time. This scalability offers better support for the system’s future development.
However, simplifying the model and reducing complexity may lead to the loss of important information. When decomposing the multi-channel resource allocation problem into subproblems with single-channel decisions, the interdependencies between channels are often overlooked. The loss of this information can result in suboptimal resource allocation schemes, reducing the overall performance and efficiency of the system.

6. Conclusions

This paper introduces an innovative hierarchical reinforcement learning framework for addressing the wireless resource allocation problem. By leveraging a hierarchical decision-making mechanism, the proposed approach decomposes the complex multi-channel resource allocation task into a series of simpler single-channel decision subproblems. This decomposition greatly reduces the action space complexity from $O(N^M)$ to $O(N)$, effectively mitigating the curse of dimensionality and enabling efficient resource optimization.
To enhance learning efficiency and stability, a progressive learning strategy was adopted. This strategy allows the policy network to first concentrate on learning optimal single-channel allocation policies, which are then seamlessly integrated into a holistic resource allocation scheme through a carefully designed state update mechanism. Such a sequential decision-making process not only accelerates convergence but also improves the generalization ability of the algorithm in diverse network scenarios.
Extensive simulation results validate the effectiveness of the proposed framework. The Proximal Policy Optimization (PPO)-based model exhibits consistent convergence behavior, achieving stable and high reward values after training. The comparative analysis with baseline algorithms highlights the scalability and adaptability of the proposed method. Specifically, the PPO framework consistently outperforms the greedy algorithm and the round-robin strategy, demonstrating up to a 124.4% improvement in sum rate over the round-robin approach and a 31.3% improvement over the greedy algorithm when the number of channels increases.
In conclusion, the hierarchical reinforcement learning framework proposed in this work provides a practical and efficient solution for large-scale wireless communication systems. It achieves superior resource allocation performance while significantly reducing computational complexity, making it a promising approach for real-world deployment. Future research will explore extending this framework to highly dynamic and heterogeneous wireless environments, further enhancing its applicability and robustness in practical scenarios.

Author Contributions

Conceptualization, K.S.; methodology, J.Y.; software, J.L.; validation, B.Y.; writing—review, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Basic Research Plan in Shaanxi Province of China (2023JCYB555).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Kun Sun and Jianyong Yang were employed by The 54th Research Institute of China Electronics Technology Group Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned aerial vehicle
PPO: Proximal Policy Optimization
BS: Base station
UAV-BS: Unmanned aerial vehicle base station

References

1. Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned Aerial Vehicles (UAVs): A Survey on Civil Applications and Key Research Challenges. IEEE Access 2019, 7, 48572–48634.
2. Shahzadi, R.; Ali, M.; Khan, H.Z.; Naeem, M. UAV Assisted 5G and Beyond Wireless Networks: A Survey. J. Netw. Comput. Appl. 2021, 189, 103114.
3. Gu, X.; Zhang, G. A Survey on UAV-Assisted Wireless Communications: Recent Advances and Future Trends. Comput. Commun. 2023, 208, 44–78.
4. Jasim, M.A.; Shakhatreh, H.; Siasi, N.; Sawalmeh, A.H.; Aldalbahi, A.; Al-Fuqaha, A. A Survey on Spectrum Management for Unmanned Aerial Vehicles (UAVs). IEEE Access 2022, 10, 11443–11499.
5. Zhou, L.; Leng, S.; Wang, Q.; Quek, T.Q.S.; Guizani, M. Cooperative Digital Twins for UAV-Based Scenarios. IEEE Commun. Mag. 2024.
6. Bithas, P.S.; Michailidis, E.T.; Nomikos, N.; Vouyioukas, D.; Kanatas, A.G. A Survey on Machine-Learning Techniques for UAV-Based Communications. Sensors 2019, 19, 5170.
7. Razzaq, S.; Xydeas, C.; Mahmood, A.; Ahmed, S.; Ratyal, N.I.; Iqbal, J. Efficient optimization techniques for resource allocation in UAVs mission framework. PLoS ONE 2023, 18, e0283923.
8. Emami, Y.; Gao, H.; Li, K.; Almeida, L.; Tovar, E.; Han, Z. Age of Information Minimization Using Multi-Agent UAVs Based on AI-Enhanced Mean Field Resource Allocation. IEEE Trans. Veh. Technol. 2024, 73, 13368–13380.
9. Qi, W.; Song, Q.; Guo, L.; Jamalipour, A. Energy-Efficient Resource Allocation for UAV-Assisted Vehicular Networks With Spectrum Sharing. IEEE Trans. Veh. Technol. 2022, 71, 7691–7702.
10. Zhou, X.; Lin, Y.; Tu, Y.; Mao, S.; Dou, Z. Dynamic Channel Allocation for Multi-UAVs: A Deep Reinforcement Learning Approach. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Waikoloa, HI, USA, 9–13 December 2019; pp. 1–6.
11. Chen, T.; Dong, F.; Ye, H.; Wang, Y.; Wu, B. Data Collection Mechanism for UAV-Assisted Cellular Network Based on PPO. Electronics 2023, 12, 1376.
12. Morozs, N.; Clarke, T.; Grace, D. Distributed Heuristically Accelerated Q-Learning for Robust Cognitive Spectrum Management in LTE Cellular Systems. IEEE Trans. Mob. Comput. 2016, 15, 817–825.
13. Gao, Z.; Wen, B.; Huang, L.; Chen, C.; Su, Z. Q-Learning-Based Power Control for LTE Enterprise Femtocell Networks. IEEE Syst. J. 2016, 11, 2699–2707.
14. Naparstek, O.; Cohen, K. Deep Multi-User Reinforcement Learning for Distributed Dynamic Spectrum Access. IEEE Trans. Wirel. Commun. 2019, 18, 310–323.
15. Zhang, L.; Zhao, H.; Hou, S.; Zhao, Z.; Xu, H.; Wu, X.; Wu, Q.; Zhang, R. A Survey on 5G Millimeter Wave Communications for UAV-Assisted Wireless Networks. IEEE Access 2019, 7, 117460–117504.
16. Wang, B.; Ji, Z.; Liu, K.R.; Clancy, T.C. Primary-Prioritized Markov Approach for Dynamic Spectrum Allocation. IEEE Trans. Wirel. Commun. 2009, 8, 1854–1865.
17. Tu, W. Efficient Resource Utilization for Multi-Flow Wireless Multicasting Transmissions. IEEE J. Sel. Areas Commun. 2012, 30, 1246–1258.
18. Deb, S.; Chaporkar, P.; Karandikar, A. Stability Analysis of Device-to-Device Relay Assisted Cellular Networks. arXiv 2018, arXiv:1808.03881.
19. Tu, W. Resource-efficient seamless transitions for high-performance multi-hop UAV multicasting. Comput. Netw. 2022, 213, 109051.
20. Tu, W. Efficient Wireless Multimedia Multicast in Multi-Rate Multi-Channel Mesh Networks. IEEE Trans. Signal Inf. Process. Over Netw. 2016, 2, 376–390.
21. Wang, H.; Wang, J.; Ding, G.; Xue, Z.; Zhang, L.; Xu, Y. Robust Spectrum Sharing in Air-Ground Integrated Networks: Opportunities and Challenges. IEEE Wirel. Commun. 2020, 27, 148–155.
22. Chen, C.; Song, M.; Xin, C.; Backens, J. A Game-Theoretical Anti-Jamming Scheme for Cognitive Radio Networks. IEEE Netw. 2013, 27, 22–27.
23. Yao, F.; Jia, L. A Collaborative Multi-Agent Reinforcement Learning Anti-Jamming Algorithm in Wireless Networks. IEEE Wirel. Commun. Lett. 2019, 8, 1024–1027.
Figure 1. Deployment scenario for UAV-assisted network.
Figure 2. The interference model.
Figure 3. The framework of the PPO algorithm.
Figure 4. PPO training dynamics graph.
Figure 5. Raw reward dynamics.
Figure 6. Deviation from mean analysis.
Figure 7. Moving average comparison.
Figure 8. Coefficient of Variation.
Figure 9. Relationship between number of channels and throughput.
Figure 10. Relationship between number of users and throughput.
Figure 11. Relationship between discount factor and throughput.
Figure 12. Relationship between discount factor and convergence points.
Figure 13. The comparison results of the PPO algorithm with the greedy algorithm and the polling strategy.
Table 1. Simulation parameters.
Parameter                        Value
Lower frequency                  1440 MHz
Upper frequency                  1443 MHz
Number of iterations             900
Number of time slots             10
Policynet learning rate          5 × 10⁻⁵
Valuenet learning rate           1 × 10⁻⁴
Transmission power               1 W
Clipping parameter               0.2
Area of the region               900 km²
Training epochs                  10
Power spectral density           3.98 × 10⁻²¹ W/Hz
Number of hidden-layer neurons   256
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
