Article

A Framework of Recommendation System for Unmanned Aerial Vehicle Autonomous Maneuver Decision

1 Graduate School, Air Force Engineering University, Xi’an 710051, China
2 School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(1), 25; https://doi.org/10.3390/drones9010025
Submission received: 6 November 2024 / Revised: 22 December 2024 / Accepted: 23 December 2024 / Published: 30 December 2024
(This article belongs to the Collection Drones for Security and Defense Applications)

Abstract
Autonomous maneuvering decision-making in unmanned aerial vehicles (UAVs) is crucial for executing complex missions involving both individual and swarm UAV operations. Leveraging the successful deployment of recommendation systems in commerce and online applications, this paper pioneers a framework tailored for UAV maneuvering decisions. This novel approach harnesses recommendation systems to enhance decision-making in UAV maneuvers. Our framework incorporates a comprehensive six-degree-of-freedom dynamics model that integrates gravitational effects and defines mission success criteria. We developed an integrated learning recommendation system capable of simulating varied mission scenarios, facilitating the acquisition of optimal strategies from a blend of expert human input and algorithmic outputs. The system supports extensive simulation capabilities, including various control modes (manual, autonomous, and hybrid) and both continuous and discrete maneuver actions. Through rigorous computer-based testing, we validated the effectiveness of established recommendation algorithms within our framework. Notably, the prioritized experience replay deep deterministic policy gradient (PER-DDPG) algorithm, employing dense rewards and continuous actions, demonstrated superior performance, achieving a 69% success rate in confrontational scenarios against a versatile expert algorithm after 1000 training iterations, marking an 80% reduction in training time compared to conventional reinforcement learning methods. This framework not only streamlines the comparison of different maneuvering algorithms but also promotes the integration of multi-source expert knowledge and sophisticated algorithms, paving the way for advanced UAV applications in complex operational environments.

1. Introduction

Unmanned platforms are playing an increasingly vital role in both current and future civilian and military missions. Among these, unmanned aerial vehicles (UAVs) are at the forefront of large-scale applications. They are characterized by autonomous swarm operations and serve as a prime example of integrated space–air–ground networks (SAGINs) [1]. UAVs encompass all unmanned flight platforms equipped with specific intelligence and autonomy. Their use in cooperative and confrontational scenarios offers numerous advantages, including overcoming the physiological limits of human pilots, reducing pilot training costs, and providing high mobility, agility, and compactness [2]. Effective maneuver decision-making has emerged as a critical technology in UAV cooperation and confrontation [3].
Effective maneuver decision-making is essential for UAVs to secure advantageous positions or evade unfavorable ones in complex, dynamic scenarios [4,5]. It involves executing actions based on a UAV’s state—such as flight speed, angle, altitude, azimuth, and pitch angle—to adapt its motion state in real-time [6]. However, the dynamic nature of modern operational environments poses significant challenges to existing decision-making frameworks, necessitating innovative approaches.
Currently, the most typical and effective research methods for maneuver decision-making fall into three categories: methods based on optimization theory [7], methods based on game theory [8], and methods based on artificial intelligence [9], especially reinforcement learning [9]. The UAV maneuver decision problem is often modeled as a multi-objective optimization problem and solved with intelligent optimization algorithms such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Receding Horizon Control (RHC). However, these optimization methods suffer from high computational cost, slow convergence, and poor adaptability to the problem scale. The genetic algorithm is an evolutionary algorithm based on natural selection that evaluates solution quality by simulating biological evolution and is widely used in multi-aircraft collaborative decision-making for UAVs [10]. Xie et al. [11] improved the traditional genetic algorithm to enhance the UAV's ability to adapt to dynamic environments and to provide stable solutions to multi-objective optimization problems in complex environments. Vicsek et al. [12] proposed a simplified particle swarm model, on which Guo et al. [13] designed a multi-objective adversarial decision-making model and completed dynamic confrontation with the artificial potential field method, successfully overcoming the tendency of the traditional particle swarm optimization algorithm to fall into local optima.
Game-theoretic methods include differential games and matrix games. However, these approaches face challenges such as high computational requirements, the inability to learn strategies autonomously, reliance on predefined rules, and limited flexibility. Li et al. [14] proposed a game strategy method combining linear programming and linear inequality constraints to address the difficulty multiple UAVs face in obtaining situation information in complex environments, and designed a dimensionality-reduction matrix game solution algorithm in [15] that greatly improves decision-making speed in air combat. Geng et al. [16] proposed a hybrid strategy combining a rule base and a fuzzy Bayesian network, which effectively improves reliability and real-time decision-making under uncertain environments. Hyunju et al. [17] used differential game theory to develop an automatic maneuver generation algorithm for within-visual-range UAV engagements; the algorithm follows a hierarchical decision structure and computes a score function matrix based on differential game theory to find the optimal value.
At present, the most advanced and effective decision-making methods are mainly based on reinforcement learning [18]. These methods can fully capture the high-dimensional features involved in the UAV cooperative confrontation process without relying on hand-designed features, thereby achieving reasonable decisions. They do not require prior knowledge or human experience, only a properly defined loss function and optimizer to iteratively learn features [11]. Common dynamic decision-making methods based on reinforcement learning are divided into value-function-based methods, such as the deep Q network (DQN) and double DQN (DDQN) [12,13,14], and policy-based methods, such as actor–critic (AC) and the deep deterministic policy gradient (DDPG). Zhang et al. [19] proposed a dynamic decision-making method based on the Q network and Nash equilibrium strategies to improve the efficiency of reinforcement learning. Lillicrap et al. [20] proposed the deep deterministic policy gradient algorithm based on the actor–critic framework, extending DQN-style learning to continuous action spaces to realize end-to-end strategy learning.
Recommendation systems, a cornerstone of modern AI applications, have demonstrated remarkable success in domains such as e-commerce, personalized education, and smart transportation [21,22,23]. At present, the primary recommendation methods employed in recommendation systems include content-based recommendation algorithms, collaborative filtering algorithms, and hybrid recommendation algorithms [24]. In recent years, with the advent of artificial intelligence, reinforcement learning and deep learning-based recommendation algorithms have also gained prominence [25,26].
Content-based recommendation algorithms primarily depend on users’ personal historical interaction data. In contrast, collaborative filtering (CF) recommendation algorithms leverage the interaction history of other users to suggest relevant products to the target user. Collaborative filtering techniques are mainly categorized into two types: memory-based and model-based. Memory-based collaborative filtering further divides into user-collaborative filtering (UserCF) and item-collaborative filtering (ItemCF) [27]. UserCF predicts ratings by analyzing behavioral similarities between users, whereas ItemCF utilizes the similarity between items and users’ actual item choices to forecast ratings. Model-based collaborative filtering algorithms generate recommendations by building a predictive model. These methods typically use historical interaction data to model user preferences. Common algorithms include matrix factorization (such as Singular Value Decomposition, SVD [28]), latent factor models [29], and deep learning models that have become widely used in recent years [30]. Model-based collaborative filtering algorithms generally offer better scalability, especially for large datasets. Their advantage lies in the ability to learn latent user interest patterns through the model, thereby providing more accurate recommendations.
Traditional recommendation methods struggle to capture users' sequential characteristics, cannot model the sequential relationships among items, and suffer from data dependency problems [31]. The best-performing recommendation methods at present are based on reinforcement learning [32,33,34], which can be divided into traditional reinforcement learning and deep reinforcement learning recommendation methods. As an interactive recommendation (IR) approach, a reinforcement learning recommender can update its recommendation strategy in real time through interaction with users and obtain their real feedback, which matches actual recommendation scenarios better than traditional static methods [35]. At the same time, since reinforcement learning problems are usually formalized as Markov decision processes (MDPs), these models naturally represent user behavior sequences, fully capturing sequential characteristics and dynamic user preferences. Moreover, the exploration mechanism lets the agent explore the state and action spaces more thoroughly, increasing the diversity of the recommendation results. Finally, because these models take maximizing the cumulative return of the recommendation system, that is, the user's long-term feedback, as the optimization goal when updating the recommendation strategy, they can improve long-term user satisfaction to a certain extent [36].
Recommendation systems provide dynamic and adaptive decision-making by modeling user preferences and optimizing long-term engagement, making them particularly suited to complex real-time environments. Despite their potential, their applications in drone maneuvering decision-making remain largely unexplored, providing an opportunity to bridge the two fields.
Starting from the technical levels of recommendation systems and maneuver decision-making, this paper analyzes the similarities between the two and innovatively proposes a maneuver decision-making algorithm based on the recommendation system. Its main contributions are:
(1)
For the first time, a maneuver decision-making algorithm based on the recommendation system has been proposed, and this innovative integration provides new ideas for the application of recommendation systems in the field of UAV cooperation and confrontation.
(2)
A recommendation system that simulates UAV cooperation and confrontation has been constructed. Its key feature is an integrated framework that combines various high-performing maneuver decision-making algorithms to support different types of confrontation, including both continuous and discrete actions. On this basis, two discrete-action expert algorithms, one implementing an offensive and defensive strategy and one implementing an offensive greedy strategy, are designed as baseline algorithms.
(3)
To verify the feasibility and effectiveness of applying recommendation systems to maneuver decision-making problems, an offline integrated KNN-UserCF maneuver decision recommendation algorithm has been implemented, starting from collaborative filtering recommendation algorithms. It introduces KNN on top of UserCF to address the difficulty of computing the user similarity matrix and uses the Bagging ensemble method to integrate learners with different numbers of K-nearest neighbors, improving the robustness of the recommendations.
(4)
Because traditional offline recommendation algorithms suffer from data dependency, deep reinforcement learning has been used to design and implement online maneuver decision recommendation algorithms based on discrete-action DDQN and continuous-action DDPG, respectively. These algorithms introduce prioritized experience replay into the standard deep reinforcement learning algorithms to improve sampling efficiency and construct a dense reward guided by situation assessment that leads the UAV to quickly reach an advantageous spatial position and complete the target task, thereby accelerating convergence.
Research challenges are shown in Figure 1.
The paper is organized as follows: Section 2 outlines the framework for an integrated recommendation algorithm by establishing the simulation environment for the UAV cooperative countermeasure recommendation system. Section 3 discusses the KNN-UserCF recommendation model and introduces the integrated learning module. Section 4 designs a deep reinforcement learning recommendation algorithm based on DDQN and DDPG, incorporating prioritized experience replay to address the issue of experience forgetting caused by average experience replay. Section 5 presents simulation results that demonstrate the winning rate and stability of the proposed autonomous maneuverable decision recommender system framework for UAV cooperative countermeasure tasks. Conclusions are drawn in Section 6.

2. Construction of the UAV Cooperation and Confrontation Simulation Environment for the Recommendation System

This section proposes the development of a recommendation system for UAV collaboration and confrontation environments. It begins with reinforcement learning recommendation systems, introducing flight dynamics and control models, mission success and failure evaluation models, and the design of a maneuver library. Additionally, it presents the integrated framework and proposes two baseline algorithms: the offensive greedy strategy expert algorithm and the offensive and defensive greedy strategy expert algorithm. The environment constructed in this paper operates under ideal conditions without considering interference such as wind, communication delays, or sensor inaccuracies.

2.1. Recommendation System Based on Reinforcement Learning

In reinforcement learning, both recommendation systems and maneuver decision-making can be described by Markov processes, which are composed of the quintuple $\{S, A, P, R, \gamma\}$, as shown in Figure 2. Compared with recommendation systems, maneuver decision-making is a Markov process with high temporal dimensions and rapid changes.
Here, $S = (S_1, S_2, \ldots, S_t)$ represents the state set, which describes the states arising from the interaction between the agent and the environment. In the recommendation system, it is the set of user states; in maneuver decision-making, it is the state set of UAV cooperation and confrontation.
The variable $A = (a_1, a_2, \ldots, a_t)$ represents the action set, that is, the set of actions the agent can choose. In the recommendation system, it is the set of recommended products; in maneuver decision-making, it is the set of available maneuver actions.
The variable $P$ is the state transition probability, describing the probability distribution of the next state given the current state and action.
The variable $R = (R_1, R_2, \ldots, R_t)$ is the reward function, defined by the state and action, indicating the immediate feedback obtained by the agent after taking a certain action. In the recommendation system, it is user feedback, such as liking or disliking a product; in maneuver decision-making, it is the feedback from the current state during UAV cooperation and confrontation, such as winning or losing.
The variable γ is the discount factor, used to balance current rewards and future rewards. A larger discount factor means more emphasis on long-term benefits, while a smaller discount factor focuses more on immediate user benefits. Selecting the appropriate discount factor is an important adjustment parameter in reinforcement learning, which directly affects the strategy and value function learned by the agent. The discount return can be expressed as:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
where $G_t$ is the discounted return starting from time $t$, and $R_{t+1}, R_{t+2}, \ldots$ are the rewards received at future time steps.
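As a small illustration of the discounted return formula above, the following Python sketch sums hypothetical future rewards with a discount factor; the reward values and $\gamma$ are placeholders rather than values used in the paper.

```python
# Minimal sketch: computing the discounted return G_t from a list of future rewards,
# following the formula above. The reward values and gamma are hypothetical.
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1} over the remaining rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.5, 0.0, 2.0]))  # 1.0 + 0.9*0.5 + 0.0 + 0.729*2.0 = 2.908
```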
In the reinforcement learning recommendation system framework, there are four important components, which are the same as the components of the reinforcement learning maneuver decision framework. In the recommendation system, the state description mainly includes items, user characteristics, embedded encoding, etc., and the most common method is to use user, product, or context features as the user states (denoted as $S = (S_{item}, S_{user}, \ldots, S_{context})$). In maneuver decision-making, the state description mainly includes the UAV's three-dimensional coordinates, attitude information, and relative position information between friends and enemies (denoted as $S = (S_{position}, S_{rel\_pos}, \ldots, S_{attitude})$).
Policy optimization: In the context of the recommendation system, policy optimization aims to enhance the recommendation strategy to better match user preferences and behaviors. The general formula for policy optimization is:
$\theta' = \theta + \alpha \nabla_{\theta} J(\theta)$
where $\theta$ represents the parameters of the policy, $\theta'$ the updated parameters, $\alpha$ the learning rate, that is, the speed of learning, and $\nabla_{\theta} J(\theta)$ the gradient of the objective function with respect to the policy parameters. In the maneuver decision-making problem, the above formula is also applicable, and its goal is to enhance the maneuver strategy to better defeat the enemy.
Reward function design: In reinforcement learning, the design of the reward function directly affects the quality of the learning effect. There are two methods for setting the reward function: one is a simple numerical reward, and the other is a comprehensive reward of one or more observations. In the recommendation system, the general expression of the reward function is $R = R(u, i)$, which represents the reward obtained by recommending item $i$ to user $u$. In the maneuver decision-making problem, the expression of the reward function is $R = R(S, a)$, which represents the reward obtained by the UAV when executing action $a$ in the cooperative and confrontational environment state $S$.
Environmental modeling: Establishing an appropriate environment is crucial for evaluating the effectiveness of reinforcement learning. Environmental modeling is usually divided into three types: offline, online, and simulation environments. In the recommendation system, environmental modeling aims to capture user preferences, historical interactions, and other factors that may affect the recommendation process. The general expression is $S' = f(S, i)$, where $S'$ represents the next state of the user and $f$ represents the state transition function of the environment when the user in state $S$ is recommended item $i$. In the maneuver decision-making problem, environmental modeling is similar: $S' = f(S, a)$, where $S'$ represents the next state of the UAV and $f$ the state transition function of the environment when the UAV performs action $a$ in state $S$.
In the realm of reinforcement learning, both maneuver decision-making and recommendation systems can be conceptualized as a dynamic cycle of observation, judgment, decision-making, and action. There are inherent technological parallels between these two systems. Consequently, their components can be mapped to each other: maneuver actions align with recommended products, while the states of users correspond to the states of UAVs. The objective of a recommendation system is to propose the most satisfactory products to users, whereas the goal of maneuver decision-making is to advise the most appropriate actions for UAVs. The mapping between the recommendation system and reinforcement learning is given in Table 1:

2.2. Flight Dynamics and Control Model Construction

Referring to the point-mass modeling method in reference [37], a recommendation system simulation environment for UAV cooperation and confrontation is constructed, which includes six-degree-of-freedom flight parameters and three control parameters. The UAV in the environment is treated as a mass point, and the UAV state is regarded as the user state, with the flight parameters represented by $s = [x, y, z, v, \gamma, \varphi]$; the maneuver actions are regarded as recommended content, represented by $control = [n_x, n_z, \mu]$. The parameter schematic diagram is shown in Figure 3:
Here, $x, y, z$ represent the three-dimensional spatial coordinates of the UAV and $v$ its flight speed; $\gamma$ represents the pitch angle, defined as the angle between the UAV's velocity direction and the xoy plane; and $\varphi$ represents the trajectory deflection angle, defined as the angle between the projection of the UAV's velocity direction on the xoy plane and the positive direction of the x axis.
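To make the state and control definitions concrete, the following Python sketch implements a standard three-degree-of-freedom point-mass update for the state $s = [x, y, z, v, \gamma, \varphi]$ driven by the controls $[n_x, n_z, \mu]$. The equations follow the common point-mass formulation and are an assumption for illustration; the exact model of reference [37] may differ, and the time step and example values are placeholders.

```python
import math

# Minimal sketch of a three-degree-of-freedom point-mass model consistent with the
# state s = [x, y, z, v, gamma, phi] and controls [nx, nz, mu] described above.
# Constants, integration step, and example values are assumptions for illustration.
G = 9.81  # gravitational acceleration, m/s^2

def step(state, control, dt=0.02):
    """Advance the UAV point-mass state by one time step using simple Euler integration."""
    x, y, z, v, gamma, phi = state
    nx, nz, mu = control  # tangential overload, normal overload, roll angle

    # Kinematics: velocity direction is given by the pitch angle gamma and heading phi.
    x += v * math.cos(gamma) * math.cos(phi) * dt
    y += v * math.cos(gamma) * math.sin(phi) * dt
    z += v * math.sin(gamma) * dt

    # Dynamics driven by the control overloads and gravity.
    v += G * (nx - math.sin(gamma)) * dt
    gamma += (G / v) * (nz * math.cos(mu) - math.cos(gamma)) * dt
    phi += (G * nz * math.sin(mu)) / (v * math.cos(gamma)) * dt

    return [x, y, z, v, gamma, phi]

# Example: level flight with a gentle right turn (hypothetical values).
state = [0.0, 0.0, 3000.0, 200.0, 0.0, 0.0]
state = step(state, control=[0.0, 1.5, math.radians(30)])
```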

2.3. Mission Victory and Defeat Judgment Model

Based on the state information of the UAV within the recommendation environment, we calculate the relative situation between the two parties, including relative distance and relative angle. For the sake of clear differentiation, our UAV is depicted as the red side, while the adversary UAV is depicted as the blue side.
The variable $angle_R$ represents the angle between the red side UAV and the blue side's velocity direction. The variable $angle_B$ represents the angle between the velocity direction of the red side's agent and that of the blue side. The variable $d$ represents the distance between the red and blue sides.
The following conditions are all set as non-escapable conditions:
$angle < angle_{attack}, \quad d < d_{attack}$
where $angle_{attack}$ represents the maximum firing angle of the UAV, that is, the maximum attack angle, and $d_{attack}$ represents the maximum firing distance of the UAV, that is, the attack distance; entering this condition means that our UAV's mission has failed.
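A minimal sketch of this win/loss judgment is given below: each side is checked against the attack-angle and attack-distance conditions, and the round is scored accordingly. The threshold values and the exact angle conventions are assumptions for illustration rather than the paper's settings.

```python
import math

# Minimal sketch of the mission win/loss check: a side wins when it holds the other
# inside both its maximum attack angle and its maximum attack distance.
# Threshold values are hypothetical.
ANGLE_ATTACK = math.radians(30)   # assumed maximum firing angle
D_ATTACK = 1000.0                 # assumed maximum firing distance, metres

def in_attack_zone(angle, distance, angle_attack=ANGLE_ATTACK, d_attack=D_ATTACK):
    """True if the non-escapable condition angle < angle_attack and d < d_attack holds."""
    return angle < angle_attack and distance < d_attack

def judge(angle_r, angle_b, distance):
    """Judge one round from the red side's perspective (angles as defined in Section 2.3)."""
    red_wins = in_attack_zone(angle_r, distance)   # red satisfies the attack condition
    blue_wins = in_attack_zone(angle_b, distance)  # blue satisfies the attack condition
    if red_wins and not blue_wins:
        return "red"
    if blue_wins and not red_wins:
        return "blue"
    if red_wins and blue_wins:
        return "draw"
    return "ongoing"
```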

2.4. Maneuver Action Library

The maneuver action library is mainly divided into two categories: the typical tactical action library and the basic maneuver action library. The basic maneuver action library is better suited to the maneuver decision-making process, so it has attracted widespread attention. In the constructed recommendation system simulation environment for UAV cooperation and confrontation, a discrete maneuver action library containing 27 maneuver actions has been designed [2]. Each maneuver action corresponds to three control quantities; the discrete values of the three control quantities $n_x$, $n_z$, $\mu$ are {−2, 0, 2}, {−2, 0, 2}, and {−2, 0, 2}, respectively, and their continuous value ranges are (−3, 3), (−3, 9), and $(-\pi, \pi)$.
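The discrete library can be generated directly as the Cartesian product of the three control values, as in the sketch below; the dictionary keys and bounds simply mirror the quantities defined above.

```python
import itertools
import math

# Minimal sketch of the discrete maneuver library: every combination of the three
# control quantities (nx, nz, mu) over the discrete values {-2, 0, 2} gives
# 3^3 = 27 basic maneuver actions, as described above.
DISCRETE_VALUES = (-2, 0, 2)

MANEUVER_LIBRARY = [
    {"nx": nx, "nz": nz, "mu": mu}
    for nx, nz, mu in itertools.product(DISCRETE_VALUES, repeat=3)
]
assert len(MANEUVER_LIBRARY) == 27

# Continuous counterparts are bounded by the ranges given in the text:
CONTINUOUS_BOUNDS = {"nx": (-3, 3), "nz": (-3, 9), "mu": (-math.pi, math.pi)}
```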

2.5. Integrated Environment Construction

The constructed recommendation system simulation UAV cooperation and confrontation environment uses an integrated framework, as shown in Figure 4:
The integrated framework used can integrate various different algorithms, including rule-based, pilot operation, offline algorithms, and online algorithms.
Let the integrated module be composed of $m$ weak learners, where the $i$-th weak learner is denoted $learner_i$ and the number of nearest neighbors it uses is $k_i$. The TopN maneuver actions recommended by weak learner $learner_i$ are $action\_list_i = [a_1, a_2, \ldots, a_N]$, and the Bagging method for integrating the $m$ weak learners is:
$a_j = \sum_{i=1}^{m} a_{i,j}$
where $a_{i,j}$ denotes the recommendation of action $a_j$ by the $i$-th weak learner: $a_{i,j} = 1$ if the learner recommends the action, and $a_{i,j} = 0$ otherwise. The votes for each action are tallied, and finally the TopN maneuver actions $action\_list = [a_1, a_2, \ldots, a_N]$ are selected for recommendation.
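The following sketch shows the Bagging vote in the equation above: each weak learner contributes its TopN action list, the votes are summed per action, and the TopN most-voted actions are returned. The action indices in the example are hypothetical.

```python
from collections import Counter

# Minimal sketch of the Bagging integration: sum the per-learner recommendation
# indicators a_{i,j} for every action and keep the TopN most-voted actions.
def bagging_vote(action_lists, top_n=3):
    """action_lists: list of TopN action lists, one per weak learner."""
    votes = Counter()
    for action_list in action_lists:
        for action in action_list:
            votes[action] += 1          # a_{i,j} = 1 when learner i recommends action j
    return [action for action, _ in votes.most_common(top_n)]

# Example with three hypothetical weak learners recommending action indices:
print(bagging_vote([[3, 7, 12], [7, 12, 25], [7, 3, 19]], top_n=3))  # e.g. [7, 3, 12]
```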

2.6. Baseline Algorithm Design

Based on the constructed recommendation system simulation UAV cooperation and confrontation environment, two expert algorithms are designed as baseline algorithms to complete the task. One is the offensive greedy strategy expert algorithm, and the other is the offensive and defensive greedy strategy expert algorithm. The designed expert algorithms are all discrete maneuver decision algorithms.
(1)
Offensive Greedy Strategy Expert Algorithm
The expert algorithm with only an offensive strategy is a greedy expert algorithm based on the distance approach. The UAV will traverse the 27 types of maneuver actions in the maneuver action library, and virtually execute each action to judge the distance between the two sides after executing the action. Finally, select the action that can maximize the distance reduction between the two sides for execution.
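A minimal sketch of this offensive greedy expert is shown below: each of the 27 maneuvers is executed virtually with a one-step dynamics rollout, and the action that leaves the two sides closest is selected. The `step` function and `MANEUVER_LIBRARY` refer to the illustrative sketches in Sections 2.2 and 2.4, so the whole block is an assumption about implementation details rather than the authors' code.

```python
import math

# Minimal sketch of the offensive greedy expert: virtually roll out each of the 27
# maneuvers and pick the action that brings the red UAV closest to the blue UAV.
def distance(state_a, state_b):
    """Euclidean distance between the positional parts of two point-mass states."""
    return math.dist(state_a[:3], state_b[:3])

def offensive_greedy_action(red_state, blue_state, library, step_fn, dt=0.02):
    best_action, best_dist = None, float("inf")
    for action in library:
        control = (action["nx"], action["nz"], action["mu"])
        virtual_red = step_fn(list(red_state), control, dt)  # virtual execution only
        d = distance(virtual_red, blue_state)
        if d < best_dist:
            best_action, best_dist = action, d
    return best_action
```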
(2)
Offensive and Defensive Greedy Strategy Expert Algorithm
Compared with the offensive greedy strategy, the offensive and defensive expert algorithm adds a defensive strategy and, on the offensive side, a greedy strategy based on narrowing the attack angle. Taking our UAV as an example, the algorithm first obtains the states of the enemy UAV and our UAV, then transforms the UAV states into a spherical relative coordinate system centered on our UAV and calculates the enemy's position relative to us. Our field of view is then divided into nine areas, corresponding to the combinations of upper, middle, and lower with left, middle, and right. The 27 maneuver actions are divided by tangential overload into acceleration, deceleration, and constant speed; by normal overload into pull-up, dive, and level flight; and by roll control into left turn, right turn, and level flight.
In each decision-making process, the UAV first judges whether it has entered a dangerous altitude. If it has, it adjusts the altitude to a safe altitude. After that, it starts to calculate the situation between the two sides, first calculating the angle advantage. The calculation formula is:
$T_{angle} = angle_R - angle_B$
When $T_{angle}$ is less than 0, our side has the angle advantage; when it is greater than 0, the enemy has the angle advantage; and when it equals 0, the two sides are in a balanced situation. After judging the angle advantage, the algorithm judges our side's distance advantage. If our side is outside the enemy's attack range and has the angle advantage, it continues adjusting our attack angle and accelerates to approach the enemy. If our side is within the enemy's attack range and has the angle advantage, it continues to adjust our attack angle and approaches the enemy at constant speed. If our side is at an angle disadvantage and outside the enemy's attack range, it continues to adjust our attack angle and decelerates while approaching the enemy. If our side is at an angle disadvantage and within the enemy's attack range, it adjusts to escape the enemy's attack angle and accelerates away. These rules are condensed in the sketch below.
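In the sketch below, the altitude threshold, attack range, and the symbolic action labels (which would map onto maneuvers from the 27-action library) are assumptions for illustration.

```python
# Condensed sketch of the offensive and defensive greedy expert logic described above.
# Thresholds and the symbolic action labels are assumptions, not the authors' values.
SAFE_ALTITUDE = 500.0       # assumed dangerous-altitude threshold, metres

def offensive_defensive_policy(own_altitude, angle_r, angle_b, dist, d_attack=1000.0):
    if own_altitude < SAFE_ALTITUDE:
        return "pull_up_to_safe_altitude"

    t_angle = angle_r - angle_b          # T_angle < 0: we hold the angle advantage
    we_have_angle = t_angle < 0
    inside_enemy_range = dist < d_attack

    if we_have_angle and not inside_enemy_range:
        return "adjust_attack_angle_and_accelerate"
    if we_have_angle and inside_enemy_range:
        return "adjust_attack_angle_constant_speed"
    if not we_have_angle and not inside_enemy_range:
        return "adjust_attack_angle_and_decelerate"
    return "escape_enemy_attack_angle_and_accelerate"
```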

3. KNN-UserCF Integrated Recommendation Algorithm

This section introduces the integrated recommendation algorithm combining KNN and UserCF. It begins with an overview of the algorithm’s framework, followed by a detailed description of the KNN-UserCF module, and concludes with the presentation of the ensemble learning module.

3.1. Framework of KNN-UserCF Integrated Recommendation Algorithm

To verify the feasibility and effectiveness of applying recommendation systems to UAV maneuver decision-making problems, starting from traditional recommendation methods, the KNN-UserCF integrated learning recommendation algorithm model is implemented [38], as shown in Figure 5:
Treat each moment’s state as a user and the maneuver action as a product. First, normalize the current moment’s UAV state and input it into the recommendation model, and the state information includes 9-dimensional features. Then, the model recommends actions according to each KNN-UserCF with different numbers of K-nearest neighbors. Due to the large state space, it is difficult to directly calculate the user similarity matrix, so KNN is introduced to achieve the mapping from feature space to action space and the calculation of user similarity, while UserCF is responsible for calculating item scores based on user similarity. After obtaining the results of each weak learner, voting is carried out through the Bagging method of ensemble learning to obtain the TopN maneuver actions for recommendation.

3.2. KNN-UserCF Module

The KNN-UserCF module is the core part of the recommendation model and mainly consists of KNN and UserCF. KNN is responsible for calculating user similarity, and UserCF recommends actions based on that similarity. At the current moment, the obtained UAV state is first normalized to prevent features with large values from dominating the weights, which increases robustness. Let the state feature vector be $S_t = [s_1, s_2, \ldots, s_m]$, where $m$ is the total number of state features and $s_i$ is the $i$-th feature; the normalization of a feature can be expressed as:
$s_i^{normalize} = s_i / (s_i^{\max} - s_i^{\min})$
where $s_i^{normalize}$ is the normalized value of the $i$-th feature, $s_i^{\max}$ is the maximum value that $s_i$ can take, and $s_i^{\min}$ is the minimum value it can take. After normalizing the features, the KNN model trained on the UAV cooperation and confrontation dataset is used to calculate the similarity between $S_t$ and the state features $S_{t'}$ at other moments. Cosine similarity is widely used in fields such as text similarity, image processing, and recommendation systems; it is a simple, intuitive, and effective way to measure similarity, so it is used here to calculate user similarity:
$sim(S_t, S_{t'}) = \dfrac{\sum_{i=1}^{m} s_{t,i}\, s_{t',i}}{\sqrt{\sum_{i=1}^{m} s_{t,i}^2}\,\sqrt{\sum_{i=1}^{m} s_{t',i}^2}}$
The $k$ states with the highest similarity are taken as the $k$ most similar neighbor users of the target state user. According to the action (i.e., product) chosen by each neighbor user and its similarity to the target user, the score of each action for the target user is calculated as:
$action_{t,i} = \sum_{t'=0}^{k} sim(S_t, S_{t'}) \cdot action_{t',i}$
where $action_{t,i}$ represents the score of action $i$ under state $S_t$, and $action_{t',i}$ represents the choice of action $i$ in state $S_{t'}$: its value is 1 if action $i$ was taken and 0 otherwise. According to the score of each action, the TopN actions with the highest scores are recommended to the target user.
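A compact sketch of this module is given below: the state is normalized as in the formula above, cosine similarity selects the $k$ most similar historical states, and each action is scored by similarity-weighted neighbor votes. The array shapes and the `recommend` interface are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the KNN-UserCF module: normalize the 9-dimensional state, find the
# k most similar historical states with cosine similarity, and score each action by the
# similarity-weighted votes of those neighbours. Data arrays are hypothetical placeholders.
def normalize(state, s_min, s_max):
    return state / (s_max - s_min)        # normalization as written in the formula above

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(state, history_states, history_actions, n_actions=27, k=10, top_n=3):
    """history_states: (N, 9) normalized states; history_actions: (N,) action ids."""
    sims = np.array([cosine_similarity(state, h) for h in history_states])
    neighbours = np.argsort(sims)[-k:]                      # k most similar states
    scores = np.zeros(n_actions)
    for idx in neighbours:
        scores[history_actions[idx]] += sims[idx]           # similarity-weighted vote
    return list(np.argsort(scores)[::-1][:top_n])           # TopN action ids
```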

3.3. Ensemble Learning Module

After utilizing the KNN-UserCF module for action recommendation, we consider several such modules as multiple weak learners. Each weak learner employs a different number of neighbors, and their results are integrated using ensemble learning methods. By applying the Bagging approach to vote on the recommended actions from each weak learner, the final TopN optimal actions are determined. This ensemble learning strategy effectively enhances the performance of the recommendation system, making the model more robust and generalizable.

4. Deep Reinforcement Learning Recommendation Algorithm

This section introduces a recommendation framework for UAV maneuver decision-making based on deep reinforcement learning. It starts with an algorithm framework that integrates both discrete (PER-DDQN) and continuous (PER-DDPG) action spaces. Following this, it presents the prioritized experience replay mechanism and concludes with a description of the dense reward structure designed to enhance learning efficiency in UAV operations.

4.1. Deep Reinforcement Learning Framework for Maneuver Decision Recommendation

Traditional recommendation algorithms have some performance limitations in terms of data dependency. To overcome these problems, introducing reinforcement learning-based recommendation algorithms can improve the performance of the recommendation system by interacting with users in real time and updating the recommendation strategy online, thereby improving the UAV mission completion rate. The proposed deep reinforcement learning maneuver decision recommendation model is shown in Figure 6.
In the model, both PER-DDQN and PER-DDPG recommendation algorithms based on discrete actions and continuous actions are constructed, respectively. The following Table 2 summarizes the advantages and disadvantages of existing mainstream reinforcement learning algorithms.
Based on the evaluation of the algorithms in the table, we chose DDQN and DDPG for our recommendation system because they best address the specific needs of UAV tasks. DDQN, with its ability to mitigate Q-value overestimation, is well-suited for discrete decision-making scenarios, such as waypoint selection and mission mode changes. On the other hand, DDPG excels in handling continuous action spaces, making it ideal for tasks that require continuous control, such as angle adjustment and speed control.
Different UAV states are regarded as users, and the maneuver action library as the product library. In each round of the experiment, at each step of the recommendation process, the product library and the user state $S_t$ at time $t$ are passed to the DRL recommendation module. The DRL decision-making network outputs the product $a_t$ with the highest value and recommends it to the user. The user provides feedback $R_t$ based on the recommended product, which is a reward or punishment signal, such as a reward for winning or a punishment for losing. At the same time, the user's state changes, and the experience pool stores the experience quadruple $(S_t, a_t, R_t, S_{t+1})$. The user then passes the new state $S_{t+1}$ into the recommendation module for the next decision and recommendation, forming a closed loop until the outcome of the UAV's autonomous engagement is decided. During this process, once the number of experiences in the experience pool reaches batch_size, the network of the recommendation module is updated using PER, and its strategy is continuously optimized until convergence so as to improve the UAV's confrontation success rate as much as possible.
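The closed loop described above can be summarized in the following sketch. The `env`, `agent`, and `buffer` objects are hypothetical stand-ins for the simulation environment, the PER-DDQN/PER-DDPG recommendation module, and the experience pool; their method names are assumptions.

```python
# Minimal sketch of the closed interaction loop: the DRL recommendation module picks the
# highest-value maneuver, the environment returns a reward and next state, the transition
# is stored, and the networks are updated with prioritized experience replay once enough
# samples are buffered. env, agent, and buffer are hypothetical objects.
def run_episode(env, agent, buffer, batch_size=64, max_steps=3000):
    state = env.reset()
    for _ in range(max_steps):
        action = agent.recommend(state)               # product with the highest value
        next_state, reward, done = env.step(action)   # user feedback R_t
        buffer.add(state, action, reward, next_state) # store the experience quadruple
        if len(buffer) >= batch_size:
            agent.update(buffer.sample(batch_size))   # PER-based network update
        state = next_state
        if done:
            break
```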

4.2. Prioritized Experience Replay

The traditional experience replay method randomly selects samples from the experience buffer, while prioritized experience replay selects samples according to the importance of the samples.
In prioritized experience replay, each sample is assigned a priority, which is usually calculated based on the temporal difference (TD) error of the sample. The TD error represents the difference between the current estimate and the target value and is used to measure the prediction error of the model.
The samples with high priority have large TD errors, which means they are more important for model training because if the TD error is large, it means that the current Q-function is far from the target Q-function and should be updated multiple times. In PER, the priority value of each experience is calculated as:
$P(i) = \dfrac{p_i}{\sum_k p_k}$
where $p_i = |\delta_i| + \varepsilon$, $\delta_i$ is the TD error of experience $i$, and $\sum_k p_k$ is the sum of the priorities of all stored experiences. The small positive constant $\varepsilon$ prevents experiences with zero TD error from having a zero probability of being drawn.
During the training process, prioritized experience replay selects samples for training based on their priority. Samples with higher priority will be selected more frequently, thereby increasing the number of training times for these samples and improving the learning effect of important samples. This can make the model pay more attention to those samples that are more critical for improving the performance of the model.
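A minimal sketch of proportional prioritized experience replay consistent with the description above is shown below; production implementations typically use a sum-tree for efficiency, while plain arrays are used here for clarity.

```python
import numpy as np

# Minimal sketch of proportional prioritized experience replay: priorities are
# p_i = |delta_i| + eps and sampling probabilities are P(i) = p_i / sum_k p_k.
class SimplePER:
    def __init__(self, capacity=10000, eps=1e-6):
        self.capacity, self.eps = capacity, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:        # drop the oldest experience when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        probs = p / p.sum()                         # P(i) = p_i / sum_k p_k
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, delta in zip(idx, td_errors):        # refresh priorities after each update
            self.priorities[i] = abs(delta) + self.eps

    def __len__(self):
        return len(self.data)
```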

4.3. Dense Reward

In reinforcement learning recommendation systems, the design of the reward signal plays a crucial role in the agent’s learning effectiveness. Although sparse rewards are common in real-world tasks, they come with drawbacks such as learning challenges and exploration difficulties. Agents require more time to gather sufficient feedback and struggle to identify valuable states or behaviors.
In contrast, dense rewards provide feedback at each time step, offering significant advantages. They can accelerate the learning process and enhance the agent’s performance. In deep reinforcement learning maneuver decision recommendation algorithms, dense rewards are particularly effective. By receiving user feedback immediately after each recommendation, the system can adjust its strategies more frequently, thus speeding up the convergence of the decision network.
In summary, employing a dense reward mechanism in deep reinforcement learning recommendation systems can effectively overcome the limitations of sparse rewards. It provides the agent with richer and more timely feedback, thereby achieving more efficient learning and more accurate recommendations.

5. Simulation Results and Analysis

This section introduces the experimental setup and the analysis of simulation results.

5.1. Experimental Setup

(1)
Evaluation Metrics
The evaluation metric used is the mission success rate of the UAV. Let the number of rounds of competition between the red and blue sides be $epoch$; the score of each round is counted as:
$score = \begin{cases} 1, & \text{Red finish} \\ 0, & \text{Blue finish} \\ 0.5, & \text{Draw} \end{cases}$
According to the above counting method, the average success rate is calculated as:
$rate = \dfrac{\sum_{i=1}^{epoch} score_i}{epoch}$
where $score_i$ represents the score obtained in the $i$-th round of the experiment.
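The metric reduces to a simple average of per-round scores, as in this small sketch with hypothetical round outcomes.

```python
# Minimal sketch of the evaluation metric: per-round scores (1 red win, 0 blue win,
# 0.5 draw) averaged over all rounds.
def average_success_rate(scores):
    return sum(scores) / len(scores)

print(average_success_rate([1, 0.5, 0, 1, 1]))  # 0.7 for these hypothetical rounds
```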
(2)
Initial State Settings
In this study, the proposed recommendation model is used to conduct confrontation experiments with expert algorithms. The red side uses the trained KNN-UserCF integrated recommendation algorithm, while the blue side uses the offensive and defensive greedy strategy expert algorithm. The initial states of the two UAVs are shown in Table 3.
The time resolution of this experiment refers to the sampling frequency of Prepar3D (P3D) (https://www.prepar3d.com/). P3D is a flight simulation software developed by Lockheed Martin. In the experiment, the time interval is set to 0.02 s (i.e., a 50 Hz sampling rate), and the maximum number of steps per round of the experiment is limited to 3000 steps to ensure a reasonable duration and computational efficiency of the simulation process.

5.2. Simulation Results Analysis of Integrated KNN-UserCF Maneuver Decision Recommendation

(1)
Determination of K Nearest Neighbors and the Number of Learners
The K-fold cross-validation method is used to determine the number of K-nearest neighbors and the number of learners, respectively. In the experiment, the number of folds is set to the usual value of 10, and the range of values for the K-nearest neighbors is [0, 20]. A curve of prediction accuracy against K is drawn from the prediction accuracy corresponding to different K values. Once K exceeds 5, the performance of the model stabilizes and the prediction accuracy fluctuates around 92.6%. Therefore, we conclude that when the number of K-nearest neighbors is greater than 5, the model shows more consistent and stable performance.
Based on this result, the number of K-nearest neighbors of each weak learner is set to 10. Experiments were then conducted with the number of weak learners ranging from 1 to 100. In the ensemble learning of the recommendation system, the recommendation accuracy increased as the number of weak learners increased; however, once the number of weak learners exceeded 10, the recommendation accuracy stabilized at about 92.5%. Balancing computational cost and performance, this paper therefore set the number of weak learners to 10, which preserves the performance of the recommendation system while reducing computational complexity.
(2)
Results Analysis
To verify the impact of different numbers of recommended actions on the success rate, this study set conditions for recommending one to three actions. Under each condition of the number of recommended actions, 1000 rounds of experimental simulations were conducted. The execution strategy is as follows: when recommending one action, execute the action directly; when recommending multiple actions, randomly select one to execute with equal probability. The experimental results show that when recommending one action, the average success rate is 50.8%, when recommending two actions, it slightly increases to 51.0%, and when recommending three actions, the average success rate is 50.2%.
When one action is recommended, the KNN-UserCF ensemble learning recommendation algorithm and the expert algorithm each achieve a success rate of about 50%. As the number of recommended actions increases, the success rate still remains around 50%. Although the success rate is affected by the initial position and attitude of the UAV and fluctuates during the process, it is generally around 50%. According to these experimental results, the constructed KNN-UserCF ensemble recommendation model is feasible for autonomous maneuver decision-making problems and achieves good results: it can narrow the original 27 maneuver actions down to at most three actions in autonomous maneuver decision-making, improving decision-making efficiency.
The data generated from the confrontation between the expert algorithm based on the offensive and defensive strategy is used to train the KNN-UserCF ensemble recommendation system, and the trained recommendation system is simulated against the expert algorithm based on the offensive greedy strategy. The red side uses the KNN-UserCF ensemble maneuver decision recommendation algorithm, and the blue side uses the offensive greedy strategy expert algorithm. After nearly 1000 rounds of experiments, the red side’s average success rate is 60.8%, and the blue side’s average success rate is 39.2%, proving that the KNN-UserCF ensemble maneuver decision recommendation system constructed in this paper can learn effective maneuver strategies from the confrontation data of the expert algorithm with offensive and defensive strategies.

5.3. Simulation Results Analysis of Deep Reinforcement Learning Maneuver Decision Recommendation

(1)
Deep Reinforcement Learning Recommendation System with Sparse Rewards
Sparse rewards here mean that a reward value is returned to the red side UAV based on its state in the environment only at the end of each round of the experiment, and no reward is returned during the round.
The reward function constructed in this paper for sparse rewards is as follows:
$score = T(S_t)$
$T(S_t) = \begin{cases} T_1, & \text{if win} \\ T_2, & \text{if lose} \\ T_3, & \text{if draw} \\ T_4, & \text{otherwise} \end{cases}$
In Equation (13), $T_1$ is taken as 1000, $T_2$ as −500, $T_3$ as 10, and $T_4$ as 0.
According to the above reward function, this paper establishes the PER-DDQN and PER-DDPG deep reinforcement learning maneuver decision recommendation algorithms. After about 1000 rounds of training, the results of the two deep reinforcement learning algorithms under sparse rewards are obtained, as shown in Figure 7:
According to Figure 7a, it can be observed that after about 1000 rounds, the PER-DDQN reinforcement learning maneuver decision recommendation system built with sparse rewards can reach a maximum success rate of about 40% in the competition with the expert algorithm, but the overall success rate fluctuates around 35%.
According to Figure 7b, it can be observed that after training for about 1000 rounds, the PER-DDPG reinforcement learning maneuver decision recommendation system built with sparse rewards also has a success rate of about 35% in the competition with the expert algorithm.
According to the experimental results, for the UAV autonomous maneuver decision recommendation problem, the deep reinforcement learning recommendation algorithm using sparse rewards finds it difficult to learn good maneuver strategies from the confrontation because the reward signal obtained in the cooperation and confrontation is too scarce. Therefore, a dense reward function must be constructed to train the UAV.
(2)
Dense Reward Maneuver Decision Recommendation System
Since it is difficult for the reinforcement learning algorithm using sparse rewards to learn good strategies from the confrontation, a dense reward function is constructed to train the UAV. The constructed dense reward mainly consists of angle rewards, distance rewards, and situation assessment rewards. The expression of the dense reward is as follows:
$score = T(S_t) + T(angle\_reward) + T(angle\_punish) + T(dis) + T(score_1) - M$
where $T(S_t)$ represents the reward or penalty obtained at the end of the experiment, with the same value settings as in formula (14); $T(angle\_reward)$ is determined by the angle at which the blue side enters the red side's attack angle: it is set to 0 when the angle is greater than $\pi/6$ and to $\cos(angle_R)$ when it is less than $\pi/6$, with $angle_R$ denoting the angle between the blue side UAV and the red side's velocity direction in the environment; $T(angle\_punish)$ is determined by the angle at which the red side enters the blue side's attack angle: it is set to 0 when the angle is greater than $\pi/6$ and to $\cos(angle_B)$ when it is less than $\pi/6$, with $angle_B$ denoting the angle between the red side UAV and the blue side's velocity direction in the environment; $T(dis)$ is the distance reward between the red and blue sides, set to 0.5 when the red side approaches the blue side within a certain distance and to 0 otherwise; $T(score_1)$ is the situation assessment reward, obtained with the sliding-window CRITIC-G1 situation assessment method of reference [13] and taking values in 0–1; and $M$ is the minimum situation assessment reward value accepted in combat, set to 0.38.
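The following sketch assembles the dense reward from the terms defined above. The sign conventions follow the text as written, the combination of $T(score_1)$ and $M$ follows the reconstructed formula above, and the distance threshold and situation score are placeholders, so this is an illustrative assumption rather than the authors' exact reward code.

```python
import math

# Minimal sketch of the dense reward: terminal reward, angle reward/penalty terms,
# distance reward, and situation-assessment reward. The situation score would come
# from the CRITIC-G1 assessment in reference [13]; here it is a precomputed input.
def dense_reward(terminal, angle_r, angle_b, dist, situation_score,
                 close_dist=1000.0, m=0.38):
    angle_reward = math.cos(angle_r) if angle_r < math.pi / 6 else 0.0
    angle_punish = math.cos(angle_b) if angle_b < math.pi / 6 else 0.0
    dis_reward = 0.5 if dist < close_dist else 0.0   # close_dist is an assumed threshold
    # The subtraction of m follows the reconstructed formula above.
    return terminal + angle_reward + angle_punish + dis_reward + situation_score - m
```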
According to the dense reward constructed in Formula (14), this paper designs the PER-DDQN and PER-DDPG maneuver decision recommendation systems. After about 1000 rounds of training, the results of the two deep reinforcement learning algorithms under dense rewards are obtained, as shown in Figure 8:
In Figure 8a, it can be seen that the PER-DDQN maneuvering decision recommendation system utilizing dense rewards achieves an approximately 63% win rate against expert algorithms, exhibiting more efficient learning of effective maneuvering strategies compared to its sparse reward counterpart.
As illustrated in Figure 8b, after approximately 1000 training iterations, the PER-DDPG maneuvering decision recommendation system with dense rewards attains a win rate of approximately 69% against expert algorithms, demonstrating superior performance compared to the dense reward PER-DDQN system. In conclusion, while the PER-DDQN reinforcement learning maneuvering decision recommendation system is limited to recommending discrete actions, the PER-DDPG recommendation algorithm excels in recommending continuous maneuvering actions (products). This capability enables more refined control and enhanced adaptability in complex air combat scenarios.
To further evaluate the efficiency and performance of the proposed PER-DDPG and PER-DDQN algorithms compared to conventional reinforcement learning algorithms, a detailed comparison was conducted. Table 4 presents the convergence time (defined as the number of training iterations required to achieve stable reward values) and the mission success rates of various algorithms in scenarios involving competition against the offensive and defensive greedy strategy expert algorithm.
As shown in Table 4, PER-DDPG builds on DDPG's strong ability to handle continuous action spaces and combines it with prioritized experience replay, so that important experiences are learned first during training, thereby accelerating convergence. DQN's performance is limited by its Q-learning foundation, and its convergence is slower in large-scale state spaces or complex tasks. The PER-DDPG algorithm performs best, with a task completion rate of up to 69%, and its convergence efficiency is 80% higher than that of Q-Learning.
Figure 9 demonstrates the initial training phase of the red side UAV in the proposed PER-DDPG maneuvering decision recommendation system based on dense rewards.
In the early stage of training, the PER-DDPG recommendation system is still in the exploration stage and has not learned a good maneuver action recommendation strategy. In Figure 9a, the actions recommended by the recommendation system did not place the red side in an advantageous position. The blue side searched for the red side at the beginning of the experiment while the red side simply flew forward; the blue side then dived and adjusted its attitude to pursue the red side. In the end, the PER-DDPG recommendation algorithm recommended that the red side fly straight down and forward, and the red side crashed before the blue side reached the mission completion condition.
In Figure 9b, after training for a period of time, the recommendation system has learned to avoid crashing and has also learned some attack strategies. At the beginning of the experiment, the recommended actions made the red side try to attack the blue side, and the blue side also gradually turned right to attack the red side. Neither side reached the mission completion condition, and the recommendation system then recommended actions that made the red side fly straight up, with no tendency to seize an advantageous position. The blue side approached and circled upward around the red side, trying to reach the mission completion condition. In the end, after nearly 30,000 m, the blue side caught up with the red side and completed the task.
Figure 10 illustrates the later training phase of the red side UAV in the proposed PER-DDPG maneuvering decision recommendation system based on dense rewards.
In the later stage of training, in Figure 10a, the blue side initially had a larger angle advantage; the red side's pitch angle pointed upward while the blue side's pointed downward. Under the recommendations of the PER-DDPG algorithm, the red side performed a roll maneuver to try to approach the blue side, and the blue side also actively climbed toward the red side. After the first crossing, neither side reached the mission completion condition. The red side then climbed slowly, trying to approach the blue side again, while the blue side began to accelerate and dive (its height advantage converting into a larger energy advantage) and reached a position in front of the red side at a disadvantageous angle; because of its high speed it could not pull up in time and finally crashed.
In Figure 10b, the red side was initially in front of the blue side, and the blue side held an angle advantage over the red side. The blue side then began to adjust its attitude and approach the red side, while the recommendation algorithm recommended that the red side dive to evade. The red side then climbed to seek a height advantage; the blue side did not reach the mission completion condition after its first attack, and the two sides crossed. After climbing, the red side held the height advantage, and the blue side, still short of the mission completion condition, flew in front of the red side at a disadvantageous angle. The blue side then dived to escape, but the red side adjusted its attitude in time, dived in pursuit, reached the mission completion condition, and finally shot down the blue side to win.
In this part, we analyze the performance of the maneuver decision recommendation system based on dense rewards. We chose the dense reward structure due to its ability to accelerate learning and ensure stable gradient propagation, which significantly improves training efficiency.
However, it is important to note that while the dense reward structure offers several advantages, it also has certain limitations. Specifically, dense rewards can lead to overfitting, especially when the agent becomes too sensitive to minor environmental changes. Furthermore, there is the potential for the reward signal to be overly dependent on specific features of the environment, which could result in a narrow solution space.
Despite these limitations, we argue that the dense reward structure was the most appropriate choice for our current problem, given its efficiency and the relatively stable nature of the task environment.
(3)
Deep Reinforcement Learning and Traditional Recommendation Algorithm Performance Comparison
The KNN-UserCF ensemble learning maneuver decision recommendation algorithm and the deep reinforcement learning maneuver decision recommendation system based on dense rewards are compared with the expert algorithm for confrontation, and the success rate comparison graph is shown in Figure 11:
In the early stage of deep reinforcement learning training, the success rate was low, but as the number of training rounds increased, the success rate in competition with the expert algorithm continued to improve and finally exceeded 50%, while the success rate of the traditional maneuver decision recommendation algorithm remained at around 50%. A key advantage of the proposed framework is that, by incorporating deep reinforcement learning and prioritized experience replay, it enhances learning efficiency and promotes convergence to optimal decision-making strategies in diverse scenarios. Moreover, the integrated approach allows the incorporation of multiple human experts and advanced algorithms, enabling collaborative decision-making that adapts to dynamic, real-time environments.
Of the two deep reinforcement learning recommendation algorithms, the PER-DDPG algorithm, which recommends continuous actions, has a more stable success rate curve than PER-DDQN, which recommends discrete actions; its recommended products, that is, actions, are more refined, and its success rate is higher.
For real-time systems, the time step directly affects the response speed and decision quality of the system, so choosing an appropriate time step is essential for real-time operation. Referring to the P3D setting, this paper set the time step to 0.02 s and limited each experiment round to a maximum of 3000 steps. Under this setting, PER-DDQN responded more slowly in real-time decision-making tasks, which degraded its decision quality, while PER-DDPG showed better stability in continuous action space tasks.

6. Conclusions

This paper pioneers the application of recommendation systems to UAV autonomous maneuver decision-making, leveraging an integrated framework that combines diverse algorithms to simulate UAV cooperation and confrontation scenarios. Through extensive computer simulations, the paper demonstrates the feasibility and effectiveness of this framework and the associated recommendation algorithms in addressing maneuver decision-making challenges. Furthermore, it finds that a maneuver decision recommendation algorithm enhanced with deep reinforcement learning, particularly using dense rewards and prioritized experience replay, achieves a high success rate. Despite its contributions, the framework faces certain limitations. The computational intensity of deep reinforcement learning presents challenges for real-time deployment, and the system’s reliance on high-quality data makes it sensitive to noise and incompleteness. While the framework performs well in simulations, further testing in real-world environments is necessary to ensure robustness and adaptability. Future work should focus on improving the generalization ability of the recommendation system across diverse UAV operational contexts and exploring methods for data robustness to ensure reliable performance in real-world deployment.

Author Contributions

Conceptualization, Q.H. and Y.S.; methodology, W.W.; software, T.J., J.Z., J.W. and Z.Y.; validation, Q.H., W.W. and Z.Y.; writing—original draft preparation, T.J. and W.W.; writing—review and editing, Q.H.; supervision, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under project number GKJJ22012101.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

DURC Statement

Current research is limited to the field of Artificial Intelligence (AI), which is beneficial for enhancing technological advancements, improving efficiency across various industries, and fostering innovative solutions to complex problems. This research does not pose a threat to public health or national security. The authors acknowledge the dual-use potential of research involving AI and confirm that all necessary precautions have been taken to prevent potential misuse. As an ethical responsibility, the authors strictly adhere to relevant national and international laws concerning Dual Use Research of Concern (DURC). The authors advocate for responsible deployment, ethical considerations, regulatory compliance, and transparent reporting to mitigate misuse risks and foster beneficial outcomes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, Y.; Feng, B.; Tian, A.; Dong, P.; Yu, S.; Zhang, H. An Efficient Differentiated Routing Scheme for MEO/LEO-Based Multi-Layer Satellite Networks. IEEE Trans. Netw. Sci. Eng. 2024, 11, 1041–2024. [Google Scholar] [CrossRef]
  2. Wang, Y.; Ren, T.; Fan, Z. Unmanned aerial vehicle air combat maneuver decision-making based on guided Minimax-DDQN. Comput. Appl. 2023, 43, 2636–2643. [Google Scholar]
  3. Wei, Y.J.; Zhang, H.P.; Huang, C.Q. Maneuver Decision-Making For Autonomous Air Combat Through Curriculum Learning And Reinforcement Learning with Sparse Rewards. arXiv 2023, arXiv:2302.05838. [Google Scholar]
  4. Yin, S.; Kang, Y.; Zhao, Y.; Xue, J. Air Combat Maneuver Decision Based on Deep Reinforcement Learning and Game Theory. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 6939–6943. [Google Scholar]
  5. Hu, J.; Wang, L.; Hu, T.; Guo, C.; Wang, Y. Autonomous maneuver decision making of dual-UAV cooperative air combat based on deep reinforcement learning. Electronics 2022, 11, 467. [Google Scholar] [CrossRef]
  6. Dong, Y.; Ai, J. Decision Making in Autonomous Air Combat: Review and Prospects. Acta Aeronaut. Astronaut. Sin. 2020, 41, 724264. [Google Scholar] [CrossRef]
  7. Fu, W.; Wang, H.; Gao, S. UAV Decision-Making Expert System in Dynamic Environment Based on Heuristic Algorithm. Beijing Univ. Aeronaut. Astronaut. J. 2015, 41, 1994–1999. [Google Scholar]
  8. Fu, L.; Wang, X. Research on differential game modeling for close-range air combat of unmanned combat aircraft. Ordnance J. 2012, 33, 1210–1216. [Google Scholar]
  9. Li, K.; Zhang, K.; Zhang, Z.; Liu, Z.; Hua, S.; He, J. A UAV maneuver decision-making algorithm for autonomous airdrop based on deep reinforcement learning. Sensors 2021, 21, 2233. [Google Scholar] [CrossRef] [PubMed]
  10. Huang, C.; Xie, Q. Cooperative Multi-Objective Attack Decision-Making Method Based on Genetic Algorithm. Fire Control Command Control 2004, 29, 4–8. [Google Scholar]
  11. Xie, J.; Yang, Q.; Dai, S.; Wang, W.; Zhang, J. UAV Maneuvering Decision Research Based on Enhanced Genetic Algorithms. J. Northwestern Polytech. Univ. 2020, 6, 38. [Google Scholar]
  12. Vicsek, T. Universal Patterns of Collective Motion from Minimal Models of Flocking. In Proceedings of the 2008 Second IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Venice, Italy, 26–29 October 2008; pp. 3–11. [Google Scholar]
  13. Guo, H.; Xu, H.-J.; Gu, X.-D.; Liu, D.-Y. Air Combat Decision-Making for Cooperative Multiple Target Attack Based on Improved Particle Swarm Algorithm. Fire Control Command. Control 2011, 36, 49–51+55. [Google Scholar]
  14. Li, S.Y.; Chen, M.; Wang, Y.H.; Wu, Q.X. Air combat decision-making of multiple UCAVs based on constraint strategy games. Def. Technol. 2022, 18, 368–383. [Google Scholar] [CrossRef]
  15. Li, S.; Chen, M.; Wang, Y.; Wu, Q. A fast algorithm to solve large-scale matrix games based on dimensionality reduction and its application in multiple unmanned combat air vehicles attack-defense decision-making. Inf. Sci. 2022, 594, 305–321. [Google Scholar] [CrossRef]
  16. Geng, W.X.; Kong, F.E.; Ma, D.Q. Study on tactical decision of UAV medium range air combat. In Proceedings of the 26th Chinese Control and Decision Conference, Changsha, China, 31 May–2 June 2014; pp. 135–139. [Google Scholar]
  17. Park, H.; Lee, B.Y.; Tahk, M.J.; Yoo, D.W. Differential game based air combat maneuver generation using scoring function matrix. Int. J. Aeronaut. Space Sci. 2016, 17, 204–213. [Google Scholar] [CrossRef]
  18. Zhao, Z.; Wan, Y.; Chen, Y. Deep Reinforcement Learning-Driven Collaborative Rounding-Up for Multiple Unmanned Aerial Vehicles in Obstacle Environments. Drones 2024, 8, 464. [Google Scholar] [CrossRef]
  19. Zhang, Q.; Yang, R.; Yu, L.; Zhang, T.; Zuo, J. BVR air combat maneuvering decision by using Q-network reinforcement learning. J. Air Force Eng. Univ. (Nat. Sci. Ed.) 2018, 19, 8–14. [Google Scholar]
  20. Lillicrap, T.P. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  21. Li, F.; Li, Q.; Wang, X. Thinking on the complexity of air combat under the development trend of equipment technology. Military Digest 2024, 1, 31–33. [Google Scholar]
  22. Nguyen, L.V. OurSCARA: Awareness-Based Recommendation Services for Sustainable Tourism. World 2024, 5, 471–482. [Google Scholar] [CrossRef]
  23. Pradeep, N.; Mangalore, K.K.R.; Rajpal, B.; Prasad, N.; Shastri, R. Content Based Movie Recommendation System. Int. J. Res. Ind. Eng. 2020, 9, 337–348. [Google Scholar] [CrossRef]
  24. Shan, J. Research on Content-Based Personalized Recommendation System. Ph.D. Thesis, Northeast Normal University, Changchun, China, 2015; pp. 3–5. [Google Scholar]
  25. Bachiri, K.; Yahyaouy, A.; Gualous, H.; Malek, M.; Bennani, Y.; Makany, P.; Rogovschi, N. Multi-Agent DDPG Based Electric Vehicles Charging Station Recommendation. Energies 2023, 16, 6067. [Google Scholar] [CrossRef]
  26. Zhou, Q. A novel movies recommendation algorithm based on reinforcement learning with DDPG policy. Int. J. Intell. Comput. Cybern. 2020, 13, 67–79. [Google Scholar] [CrossRef]
  27. Wang, Y.; Zheng, Y.; Xu, J. Personalized recommendation system for library based on hybrid algorithm. Comput. Inf. Technol. 2023, 31, 39–42+50. [Google Scholar] [CrossRef]
  28. Koren, Y.; Bell, R.; Volinsky, C. Matrix factorization techniques for recommender systems. Computer 2009, 42, 30–37. [Google Scholar] [CrossRef]
  29. Rendle, S.; Schmidt-Thieme, L. Factorization Models for Collaborative Filtering. ACM Comput. Surv. 2010, 42, 1–23. [Google Scholar]
  30. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. (CSUR) 2019, 52, 1–38. [Google Scholar] [CrossRef]
  31. Feng, B.; Tian, A.; Yu, S.; Li, J.; Zhou, H.; Zhang, H. Efficient Cache Consistency Management for Transient IoT Data in Content-Centric Networking. IEEE Internet Things J. 2022, 9, 12931–12944. [Google Scholar] [CrossRef]
  32. Zhao, K.; Liu, S.; Cai, Q.; Zhao, X.; Liu, Z.; Zheng, D.; Jiang, P.; Gai, K. KuaiSim: A comprehensive simulator for recommender systems. Adv. Neural Inf. Process. Syst. 2023, 36, 44880–44897. [Google Scholar]
  33. Zhao, X.; Xia, L.; Zou, L.; Liu, H.; Yin, D.; Tang, J. Whole-chain recommendations. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Galway, Ireland, 19–23 October 2020; pp. 1883–1891. [Google Scholar]
  34. Zhao, X.; Gu, C.; Zhang, H.; Yang, X.; Liu, X.; Tang, J.; Liu, H. Dear: Deep reinforcement learning for online advertising impression in recommender systems. Proc. AAAI Conf. Artif. Intell. 2021, 35, 750–758. [Google Scholar] [CrossRef]
  35. Afsar, M.M.; Crump, T.; Far, B. Reinforcement Learning Based Recommender Systems: A Survey. ACM Comput. Surv. 2022, 55, 1–38. [Google Scholar] [CrossRef]
  36. Chu, W.T.; Tsai, Y.L. A hybrid recommender system considering visual information for predicting favorite restaurants. World Wide Web 2017, 20, 1313–1331. [Google Scholar] [CrossRef]
  37. Zhou, S.; Shi, Y.; Yang, W.; Wang, J.; Gao, L.; Gao, Y. Multi-aircraft cooperative air combat maneuver decision-making based on Cook-Seiford group decision-making algorithm. Command Control Simul. 2023, 45, 44–51. [Google Scholar]
  38. Feng, B.; Huang, Y.; Tian, A.; Wang, H.; Zhou, H.; Yu, S.; Zhang, H. DR-SDSN: An Elastic Differentiated Routing Framework for Software-Defined Satellite Networks. IEEE Wirel. Commun. 2022, 29, 86–2022. [Google Scholar] [CrossRef]
  39. Zhao, J.; Gan, Z.; Liang, J.; Wang, C.; Yue, K.; Li, W.; Li, Y.; Li, R. Path Planning Research of a UAV Base Station Searching for Disaster Victims’ Location Information Based on Deep Reinforcement Learning. Entropy 2022, 24, 1767. [Google Scholar] [CrossRef] [PubMed]
  40. Bao, T.; Syed, A.; Kennedy, W.S.; Kantarcı, M.E. Sustainable Task Offloading in Secure UAV-Assisted Smart Farm Networks: A Multi-Agent DRL with Action Mask Approach. arXiv 2024. [Google Scholar] [CrossRef]
  41. Yan, C.; Xiang, X.; Wang, C. Towards Real-Time Path Planning through Deep Reinforcement Learning for a UAV in Dynamic Environments. J. Intell. Robot. Syst. 2020, 98, 297–309. [Google Scholar] [CrossRef]
  42. Yijing, Z.; Zheng, Z.; Xiaoyi, Z.; Yang, L. Q learning algorithm based UAV path learning and obstacle avoidence approach. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 3397–3402. [Google Scholar] [CrossRef]
  43. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16), Phoenix, AZ, USA, 12–17 February 2016; pp. 1–7. [Google Scholar] [CrossRef]
  44. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937. [Google Scholar]
  45. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Figure 1. Research challenges and decision-making process.
Figure 2. Schematic diagram of Markov decision process.
Figure 3. Schematic diagram of flight parameters.
Figure 4. Integrated framework structure diagram.
Figure 5. KNN-UserCF integrated learning recommendation algorithm model.
Figure 6. Deep reinforcement learning recommendation algorithm model diagram.
Figure 7. The winning rate changes in PER-DDQN and PER-DDPG under sparse rewards. (a) The winning rate changes in PER-DDQN; (b) the winning rate changes in PER-DDPG.
Figure 8. The winning rate changes in PER-DDQN and PER-DDPG under dense rewards. (a) The winning rate changes in PER-DDQN; (b) the winning rate changes in PER-DDPG.
Figure 9. Defeat trajectory of the red side UAV in the PER-DDPG maneuvering decision recommendation system based on dense rewards in the initial training phase. (a) Trajectory 1; (b) Trajectory 2.
Figure 10. Victory trajectory of the red side UAV in the PER-DDPG maneuvering decision recommendation system based on dense rewards in the later training phase. (a) Trajectory 1; (b) Trajectory 2.
Figure 11. Comparison chart of winning rates for the three maneuver decision recommendation algorithms.
Table 1. The mapping table of the recommendation system to the reinforcement learning.
Recommendation System Module | Reinforcement Learning Module
user state | state
recommended item | action
user feedback | reward
recommendation algorithm | policy
recommendation context | environment
cold start vs. long-term optimization | exploration vs. exploitation
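As a lightweight illustration of this mapping, the following sketch casts maneuver recommendation as a reinforcement learning environment: the user state becomes the RL state, the recommended item becomes the action, and user feedback becomes the reward. The state dimension, dynamics, reward, and termination rule are placeholders and do not reproduce the paper's six-degree-of-freedom model.

import numpy as np

class ManeuverRecEnv:
    """Maneuver recommendation cast as an MDP following the Table 1 mapping (sketch only)."""

    def reset(self):
        # "user state" -> RL state (e.g., relative position/attitude of the two UAVs).
        self.state = np.random.uniform(-1.0, 1.0, size=6)
        return self.state

    def step(self, action):
        # "recommended item" -> RL action (the maneuver applied at this step).
        self.state = self.state + 0.01 * np.resize(action, 6)
        # "user feedback" -> RL reward (here, a toy signal that favors a reference geometry).
        reward = -float(np.linalg.norm(self.state))
        done = bool(np.linalg.norm(self.state) < 0.1)
        return self.state, reward, done

# Usage: one recommendation step in the toy environment.
env = ManeuverRecEnv()
state = env.reset()
state, reward, done = env.step(np.zeros(6))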
Table 2. Comparison of reinforcement learning algorithms for UAV tasks.
Algorithm | Advantages | Disadvantages | Applicable Scenarios | Relevance to This Task
DDQN [39,40] | addresses Q-value overestimation | requires discretization of continuous spaces | discrete decision-making tasks like path selection | ideal for UAV tasks requiring discrete decisions (e.g., waypoint selection or mission mode changes)
DDPG [41] | handles continuous actions; high efficiency in exploration; combines value and policy methods | prone to local optima; less stochastic | continuous control tasks like trajectory adjustment | essential for UAV tasks requiring smooth, continuous control (e.g., angle adjustment, speed control)
Q-Learning [42] | simple and effective; easy to modify | limited scalability to large state-action spaces; slow convergence | discrete action spaces; simple decision-making problems | for tasks where the UAV’s control actions can be discretized
DQN [43] | simple and stable for static or simple environments | prone to Q-value overestimation; unsuitable for complex dynamic or continuous tasks | simple discrete tasks in static environments | insufficient for dynamic UAV tasks due to lack of adaptability and overestimation issues
A3C (asynchronous advantage actor-critic) [44] | supports both action types; efficient through asynchronous updates | high computational cost; complex implementation | multi-task or dynamic environments | resource-intensive and overly complex for real-time UAV tasks focused on efficiency
PPO (proximal policy optimization) [45] | stable training and robust performance; supports both action types | high computational demands; complex implementation | complex dynamic environments | unsuitable for resource-constrained UAV tasks requiring fast, real-time decisions
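The overestimation issue that separates DQN from DDQN in Table 2 comes down to how the bootstrap target is formed: DQN takes the maximum of the target network's Q-values, whereas double DQN lets the online network select the action and the target network evaluate it. A minimal sketch (with arbitrary Q-value vectors and a generic discount factor) is given below.

import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    # Standard DQN: max over the target network's Q-values (prone to overestimation).
    return reward + (0.0 if done else gamma * float(np.max(next_q_target)))

def ddqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    # Double DQN: the online network selects the action, the target network evaluates it.
    a_star = int(np.argmax(next_q_online))
    return reward + (0.0 if done else gamma * float(next_q_target[a_star]))

# Arbitrary Q-value vectors over seven discrete maneuvers.
q_online, q_target = np.random.randn(7), np.random.randn(7)
print(dqn_target(1.0, q_target), ddqn_target(1.0, q_online, q_target))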
Table 3. Initial state settings.
Name | x | y | z | Speed | Pitch | Heading
red side | random | random | 5000–10,000 m | 300 m/s | [0–2π] | [0–2π]
blue side | random | random | 5000–10,000 m | 300 m/s | [0–2π] | [0–2π]
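A sampler consistent with Table 3 might look like the following sketch; the altitude range, speed, and angle ranges come from the table, whereas the horizontal x/y bounds are unspecified there and are therefore placeholders.

import numpy as np

def sample_initial_state(rng=None):
    """Draw one side's initial state per Table 3 (x/y bounds are placeholders)."""
    if rng is None:
        rng = np.random.default_rng()
    return {
        "x": rng.uniform(-10_000.0, 10_000.0),    # placeholder range ("random" in Table 3)
        "y": rng.uniform(-10_000.0, 10_000.0),    # placeholder range ("random" in Table 3)
        "z": rng.uniform(5_000.0, 10_000.0),      # altitude in meters (Table 3)
        "speed": 300.0,                            # m/s (Table 3)
        "pitch": rng.uniform(0.0, 2 * np.pi),      # rad (Table 3)
        "heading": rng.uniform(0.0, 2 * np.pi),    # rad (Table 3)
    }

red_state, blue_state = sample_initial_state(), sample_initial_state()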
Table 4. Comparison of convergence time and mission success rates among different methods.
Algorithm | PER-DDPG | PER-DDQN | DQN | Q-Learning
success rate | 69% | 63% | 45% | 30%
convergence time (iterations) | 1000 | 1000 | 3000 | 5000