Abstract
Integration of unmanned aerial vehicles (UAVs) with ultra-reliable and low-latency communication (URLLC) systems can improve real-time communication performance for various industrial internet of things (IIoT) applications. Designing an intelligent resource allocation scheme is one of the challenges posed by an energy-constrained UAV communication system. Therefore, we formulate a sum rate maximization problem, subject to the UAVs’ energy budgets, by jointly optimizing the blocklength allocation and the power control in the uplink UAV-assisted URLLC system, in which a probabilistic channel model between the UAVs and the users is adopted. The problem is difficult to solve due to the non-convex objective function and the energy constraints, and it is also challenging to make fast decisions in the complex communication environment. Thus, we propose a deep reinforcement learning (DRL)-based scheme to jointly optimize the blocklength allocation and power control. First, we transform the original problem into a multi-agent reinforcement learning process, where each subcarrier is regarded as an agent that optimizes its individual blocklength allocation and power control. Then, each agent makes an intelligent decision to maximize its reward based on a weighted segmented reward function, which is related to the UAV energy consumption and user rates in order to improve the rate performance. Finally, simulation results show that the proposed scheme outperforms the benchmark schemes and converges stably under different settings, such as the learning rate, the error probability, the subcarrier spacing, and the number of users.
1 Introduction
With the large-scale commercial deployment of the fifth-generation (5G) mobile networks and the industrial internet of things (IIoT), ultra-reliable and low-latency communication (URLLC) is urgently needed in many mission-critical applications, such as the automotive industry, remote surgery, and autonomous driving [1, 2]. Since URLLC scenarios have extremely stringent performance requirements, there is increasing interest in developing new communication technologies to increase reliability and reduce latency. For example, wireless transmission such as UAV communication will replace wired connections in IIoT to increase flexibility and reduce infrastructure cost [3]. This imposes challenges on wireless transmission subject to latency and reliability constraints.
URLLC has been regarded as one of the three pillar applications of 5G communications, with application scenarios including factory automation, automatic driving, and remote surgery [4]. The essential targets established by the international organization for standardization are 99.999% reliability and 1 ms latency for URLLC communication systems [4]. The standard is designed to support future industrial automation and factory deployments through wireless communication. In URLLC systems, subcarriers play a critical role in achieving the high reliability and low latency required by applications [5]. However, it is still challenging to achieve the latency and reliability objectives stated in the URLLC standard. Therefore, 5G new radio technology has been developed to deliver URLLC performance that satisfies the communication requirements of industrial automation. Mini-slot structures have been proposed as a potential solution for supporting short-packet communications [6]. This is because most existing wireless communication systems are designed for long packet transmissions, for which Shannon’s capacity formula is valid. However, the Shannon rate formula is not suitable for short-packet communication. Fortunately, a unified framework for establishing tight bounds on coding rates under the short-packet assumption is available in [7]. In the finite blocklength regime, the authors in [8] adopted the SNR distribution to derive the corresponding average rate and average error probability. Additionally, the authors in [9] jointly optimized blocklength and power allocation to minimize the decoding error probability. However, URLLC networks require real-time data transmission for IIoT applications. UAV-assisted wireless communication can play an important role in addressing this challenge, since a UAV can make fast decisions and move controllably toward the ground users to reduce the pathloss [10].
Due to their high mobility, wide coverage, and low latency, UAV communications have emerged as a promising technology for driving URLLC services in the IIoT [11]. The utilization of UAVs in the IIoT can improve the automation capabilities and data analysis of industrial processes, leading to potential benefits such as improved reliability, cost reduction, and wide coverage. UAV-assisted wireless communication can achieve a high sum rate over the primarily line-of-sight (LoS) connections with the ground users in comparison with typical terrestrial wireless communications [10]. A UAV can dynamically adjust its position to optimize performance and meet quality-of-service requirements. UAVs can function as relays or base stations (BSs) to establish connections between transmitting and receiving devices [12, 13]. To enhance URLLC communication, the authors in [14] optimized resource allocation, including blocklength allocation, power control, trajectory planning, and energy harvesting, aiming to minimize the probability of decoding errors. The authors in [15] jointly optimized UAV node placement, uplink power, and UAV-IoT device association in a multi-UAV IoT communication system, in which UAVs act as BSs to gather data from IoT devices. To jointly optimize the total rate and reduce the transmit power of the UAV, the authors in [16] investigated the resource optimization problem for UAV-assisted 5G networks. In future applications, the large-scale connection of numerous users necessitates the consideration of various quality-of-service requirements. The decision-making behavior of each user can significantly affect the interference experienced by others, while the power control and blocklength problems are interdependent. To optimize the transmit power and blocklength while addressing these issues, deep reinforcement learning (DRL) offers a more efficient approach. By leveraging stored experience and deep neural networks to learn from the environment, DRL is highly adept at non-convex problems. The authors in [17] proposed a resource allocation scheme based on DRL to guarantee high reliability and low latency for each wireless user under data rate constraints. The authors in [18] applied a DRL method to solve the subcarrier power allocation problem in device-to-device communication; the proposed algorithm is well suited to the dynamic changes of the wireless environment. A deep-learning-based method was proposed in [19] for the joint optimization of reconfigurable intelligent surfaces and the power allocation of the access point across each subcarrier. However, investigating the joint allocation of blocklength and transmit power while considering the energy constraints of UAVs remains a significant challenge in UAV-assisted URLLC communication.
In this work, we consider an uplink UAV-assisted URLLC system, in which the ground users share subcarriers to transmit their information to multiple UAVs. This work aims to jointly optimize the blocklength allocation and the subcarrier transmit power to maximize the communication sum rate, subject to the UAVs’ energy consumption. The main contributions of this work can be summarized as follows:
-
We formulate a communication rate maximization problem in the uplink UAV-assisted URLLC communication system under blocklength, energy consumption, and transmit power constraints. The objective function and constraints make it challenging to make intelligent decisions in this complex communication environment.
-
To solve this non-convex problem, we propose a distributed DRL-based scheme to jointly optimize the transmit power and the blocklength. In this scheme, each subcarrier acts as an agent, making decisions on blocklength and transmit power based on the reward value. A weighted segmented reward function related to the UAV energy consumption and user rates is proposed to improve the rate performance.
-
The effectiveness and convergence of the proposed blocklength allocation and power control scheme are evaluated. According to the simulation results, the proposed scheme outperforms the Q-learning-based scheme and the greedy scheme under different maximum transmit powers and decoding error probabilities.
2 Related Work
In this section, we review previous studies on UAV-assisted URLLC systems, with an emphasis on URLLC services, resource allocation in UAV-enabled wireless networks, and the application of DRL-based optimization methods.
Compared to traditional terrestrial networks, UAV-assisted URLLC systems can leverage the flexibility and maneuverability of UAVs to dynamically adjust their deployment positions and form good channel links, achieving higher reliability and lower latency. The authors in [20] formulated the sum rate maximization problem in a UAV-assisted URLLC system coexisting with ground users. The average packet error probability and effective throughput when the central ground station transmits control signals to the UAV were studied in [21]. The authors in [22] addressed the physical layer security problem in URLLC systems, in which UAVs are regarded as a crucial component for mitigating physical layer security risks; they compared using UAVs as relays to enhance the secrecy rate against employing UAVs as jammers to alleviate the impact of eavesdropping attacks on URLLC communication. A task offloading method from devices to a UAV was proposed in [23] to fulfill low-latency demands by jointly optimizing the computing times of the devices and the UAV, the offloading bandwidths, and the location of the UAV. Moreover, the authors in [25] investigated a resource allocation method for supporting URLLC in a bidirectional relay system by leveraging the advantages of both UAVs and URLLC; they jointly optimized the time, the bandwidth, and the UAV position to maximize the transmission rate of the backward link under the URLLC constraints of the forward link. By employing the age of information (AoI) as a new system metric, the authors in [24] proposed an energy-efficient resource allocation scheme to support UAV-assisted URLLC communication systems under blocklength and imperfect channel condition constraints.
The DRL algorithm is widely used in resource allocation and combinatorial optimization problems to solve complex nonlinear challenges effectively [26]-[28], benefiting from the combination of an exploratory learning approach and the feature extraction capability of deep neural networks in high-dimensional spaces. The authors in [26] focused on the power control for a UAV-assisted URLLC system and incorporated a deep neural network for channel estimation. The authors in [27] employed an actor-critic multi-agent deep reinforcement learning algorithm in vehicular networks to obtain the optimal allocation of frequency, computation, and caching resources. Novel centralized and federated deep reinforcement learning frameworks were proposed in [28], aimed at optimizing downlink URLLC transmission for new radio in unlicensed spectrum coexisting with WiFi systems by dynamically adjusting the energy detection threshold.
In this paper, to solve the complex combinatorial optimization problem under the UAV energy consumption, blocklength, and transmit power constraints, the DRL algorithm is more suitable than traditional optimization methods such as convex optimization and heuristic algorithms. Due to the rate expression and the blocklength constraint, the optimization problems in URLLC systems are typically non-convex [28]. Convex optimization methods aim to obtain the optimal solution [29]. However, it is difficult to solve non-convex problems in high-dimensional spaces, and the large computational complexity increases the computation time and energy consumption [30]. Approximate algorithms [31], such as the greedy algorithm, the local search algorithm, and the relaxation algorithm, can yield approximate solutions. However, when the problem is complex, the approximation performance remains unacceptable. Heuristic algorithms [32] also fail to effectively adapt to dynamic environments and lack adaptability to the system. Difficulty in effectively exploring and exploiting these spaces results in suboptimal solutions or inefficient computation. Thus, for high-dimensional and non-convex problems, the self-exploratory learning approach of DRL algorithms can achieve a rapid solution.
3 System Model
We consider an uplink UAV-assisted URLLC network consisting of a BS, \(\mathcal {M}\) single-antenna UAVs, and multiple ground users, in which the UAVs exchange information with the BS and collect information from the ground users, as shown in Fig. 1. Due to the limited energy of UAVs, each UAV has a specific communication area. For simplicity, let \({J}_m\) represent the ground user group of the m-th UAV, in which each user \(k_j \in {J}_m\), and all the ground users share \(\mathcal {N}\) subcarriers. Furthermore, let H denote the flight height of the UAV, chosen to increase the probability of the LoS link. Thus, the three-dimensional (3D) positions of the m-th UAV and the \(k_j\)-th user can be expressed as \((x_m^\textrm{U},y_m^\textrm{U}, H)\) and \((x_{k_j,m}^\textrm{G},y_{k_j,m}^\textrm{G},0)\), respectively. As a result, the distance between the m-th UAV and the \(k_j\)-th user is \(d_{k_j,m}^\textrm{3D} = \sqrt{(x_{m}^\textrm{U}-x_{k_j,m}^\textrm{G})^{2}+(y_{m}^\textrm{U}-y_{k_j,m}^\textrm{G})^{2}+H^{2}}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\). The horizontal distance between the m-th UAV and the \(k_j\)-th user is \(d_{k_j,m}^\textrm{2D} = \sqrt{ (x_{m}^\textrm{U}-x_{k_j,m}^\textrm{G})^{2}+(y_{m}^\textrm{U}-y_{k_j,m}^\textrm{G})^{2}}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\).
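To make the geometry concrete, the following minimal Python sketch evaluates the 2D and 3D distances defined above for a single UAV-user pair; the coordinates are illustrative placeholders rather than values from this work.

```python
import numpy as np

# Placeholder positions: UAV m at flight height H, ground user k_j on the plane z = 0.
x_u, y_u, H = 50.0, 30.0, 100.0   # UAV horizontal coordinates and height (m)
x_g, y_g = 120.0, -40.0           # ground user coordinates (m)

d_2d = np.hypot(x_u - x_g, y_u - y_g)   # horizontal distance d^2D
d_3d = np.sqrt(d_2d ** 2 + H ** 2)      # 3D distance d^3D

print(f"d_2D = {d_2d:.1f} m, d_3D = {d_3d:.1f} m")
```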
The probabilistic LoS channel model introduced in [34, 35] is adopted in this system to more accurately characterize the channel information. The LoS probability of the link between the m-th UAV and the \(k_j\)-th user, \(\eta _{k_j,m}^\mathrm{{LoS}}\), depends on the elevation angle \(\theta _{k_j,m}\), which is given by \(\theta _{k_j,m} = \arccos (\frac{d_{k_j,m}^\textrm{2D}}{d_{k_j,m}^\textrm{3D}})\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\). Therefore, the LoS probability is modeled as follows
where the parameters \(A_1\) and \(A_2\) depend on the environment [35]. Based on the LoS probability, the non-line-of-sight (NLoS) probability of the link between the m-th UAV and the \(k_j\)-th user is \(\eta _{k_j,m}^\mathrm{{NLoS}}(\theta _{k_j,m}) = 1 - \eta _{k_j,m}^\mathrm{{LoS}}(\theta _{k_j,m})\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\) [35]. The pathloss of the LoS link between the m-th UAV and the \(k_j\)-th user is given as [36]
where \(f_c\) is the carrier frequency. Furthermore, the pathloss of the NLoS link is given by [36]
Therefore, the channel power gain \(h_{k_j,m}, (\forall m \in \mathcal {M}, \forall k_j \in {J}_{m}) \) between the m-th UAV and the \(k_j\)-th user can be written as
Subsequently, let \({K}_n\) denote the set of user indices that use the n-th subcarrier. According to (4), the received signal between the \(k_j\)-th user and the m-th UAV over the n-th subcarrier can be expressed as
where \(s_{k_j,m,n}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\), denotes the transmit symbol between the \(k_j\)-th user and the m-th UAV over the n-th subcarrier, and \(z_{m,n}\) denotes the corresponding noise at the m-th UAV on the n-th subcarrier. For simplicity, we assume that the transmit symbol is an independent and identically distributed complex Gaussian variable, denoted by \(s_{k_j,m,n} \sim \mathbb{C}\mathbb{N}(0,1)\), and that the noise has zero mean and variance \(\sigma ^2\). Thus, the signal-to-interference-plus-noise ratio (SINR) between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier, \(\gamma _{k_j,m,n}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\), can be given by [35]
In practice, this work focuses especially on the communication between the ground users and the UAVs. For infinite blocklength communication, it has been shown that, in the limit of infinitely long blocklengths, reliable transmission with vanishing decoding error probability can be achieved [37]. In URLLC systems, the strict latency requirements impose limitations on the data size, so encoding rates based on Shannon’s channel capacity are no longer accurate. Therefore, in URLLC communication networks, the achievable rate \(R_{k_j,m,n}\) between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier must be approximated for a given error probability and finite blocklength, as shown in [7, 38]
where \(\eta _{k_j,m}\), \((\forall m \in \mathcal {M}, \forall k_j \in {J}_{m})\), denotes the required decoding error probability between the m-th UAV and the \(k_j\)-th user, and \(l_{k_j,m,n}\) denotes the blocklength between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier. It should be noted that the approximation provided in [7, 38] is highly accurate, provided that the blocklength is greater than or equal to 100. Furthermore, \(V(\gamma _{k_j,m,n}) = 1- 1/(1 + \gamma _{k_j,m,n})^2\) represents the channel dispersion, while \(\tilde{Q}^{-1}(x)\) is the inverse Gaussian Q-function with \(\tilde{Q}(x) = \frac{1}{\sqrt{2\pi } } \int _{x}^{\infty } e^{-t^2/2}dt\).
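For illustration, the following sketch evaluates a finite-blocklength rate of the normal-approximation form from [7], using the channel dispersion \(V(\gamma)\) and the inverse Q-function defined above; the exact expression in (7) may contain additional scaling terms, so this is only an assumed standard form.

```python
import numpy as np
from scipy.stats import norm

def fbl_rate(gamma, blocklength, error_prob):
    """Finite-blocklength rate (bits per channel use), normal-approximation form
    from [7]: log2(1 + gamma) - sqrt(V / l) * Q^{-1}(eps) * log2(e)."""
    V = 1.0 - 1.0 / (1.0 + gamma) ** 2   # channel dispersion as defined in the text
    q_inv = norm.isf(error_prob)         # inverse Gaussian Q-function
    return np.log2(1.0 + gamma) - np.sqrt(V / blocklength) * q_inv * np.log2(np.e)

# Example: 10 dB SINR, 200-symbol blocklength, target error probability 1e-5.
gamma_lin = 10 ** (10 / 10)
print(f"{fbl_rate(gamma_lin, 200, 1e-5):.3f} bits/channel use")
```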
Since the \(k_j\)-th user can occupy multiple subcarriers for communication, let \(\mathcal N_k\) represent the set of subcarrier indices occupied by the \(k_j\)-th user. As a result, in the UAV-assisted URLLC communication system, the communication rate between the m-th UAV and the \(k_j\)-th user can be represented as
In the UAV-assisted URLLC communication system, the transmission delay and the UAV’s total energy consumption are crucial performance indicators. To evaluate these metrics, we express the transmit delay between the m-th UAV and the \(k_j\)-th user over the n-th subcarrier as \(T_{k_j,m,n}=l_{k_j,m,n}T_{s}\) [14], where \(T_{s}\) is the symbol duration, equal to \(1 / W_\textrm{sc}\), and \(W_\textrm{sc}\) is the subcarrier spacing. Based on the differences in transmit delay, the transmit energy consumption of the m-th UAV can be described as follows
The UAV’s hovering energy consumption, which is determined by its own performance and environmental factors, is often modeled as follows [3, 39]
where \(\delta \) and \(\rho \) denote the profile drag coefficient and the air density, respectively, G and \(s_r\) represent the rotor solidity and the area of the rotor disc, and \(\Omega \) denotes the angular speed of the rotor. The parameters \(R_z\) and \(k_z\) denote the rotor radius and the incremental correction factor for power, and Y is the weight of the UAV. Thus, the total energy consumption of the m-th UAV includes the transmit energy consumption and the hovering energy consumption, which can be expressed as
In the UAV-assisted URLLC communication system, the objective is to maximize the sum rate by jointly optimizing the power control and blocklength allocation over the subcarriers. As a result, the optimization problem can be formulated as follows:
where \(P_{\max }\), \(R_{\min }\), and \(E_{\max }\) represent the maximum transmit power, the minimum rate, and the maximum energy consumption, respectively, and \(L_{\min }\) and \(L_{\max }\) denote the minimum and maximum blocklength, respectively. \(R_{\min }\) is given by \(D/L_{\max }\) [21], where D is the transmit data size of the ground user, and \(L_{\max }\) is related to the maximum transmit duration \(T_{\max }\) and the system bandwidth W, i.e., \(L_{\max } = WT_{\max }\) [40, 41], which means that the data are transmitted under the latency constraint and the transmission has to be completed within the maximum blocklength \(L_{\max }\).
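As a quick numerical illustration of the constraint quantities above, the following sketch computes \(L_{\max} = WT_{\max}\), \(R_{\min} = D/L_{\max}\), and the per-link transmit delay \(T = lT_s\); all parameter values are placeholders, not the values in Table 1.

```python
# Placeholder system parameters (illustrative only).
W_sc = 30e3          # subcarrier spacing (Hz)
N_sc = 2             # number of subcarriers
W = N_sc * W_sc      # system bandwidth (Hz)
T_max = 1e-3         # maximum transmit duration / latency budget (s)
D = 160              # data size of a ground user (bits)

L_max = W * T_max    # maximum blocklength L_max = W * T_max, per [40, 41]
R_min = D / L_max    # minimum rate R_min = D / L_max, per [21]

l = 40                     # an example blocklength choice (symbols)
T_tx = l * (1.0 / W_sc)    # transmit delay T = l * T_s with T_s = 1 / W_sc

print(f"L_max = {L_max:.0f} symbols, R_min = {R_min:.3f} bits/use, T_tx = {T_tx * 1e3:.2f} ms")
```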
However, the sum rate is affected by the dynamic environment, and the objective function is non-convex. In addition, the coupling of transmit power and blocklength has a direct impact on the SINR and the rate, making it computationally complex to search for the optimal multi-user power and blocklength in an unknown system. Hence, it is challenging to obtain the optimal solution using standard convex optimization methods. Thus, a reinforcement learning-based scheme is considered to solve this non-convex problem and make intelligent decisions. The DRL algorithm is suitable for the complex and dynamic URLLC environment considered in this work, and the deep neural network can model the high-dimensional space of the optimization problem [33].
4 Proposed Blocklength Allocation and Power Control Scheme
Recently, DRL has become one of the most promising machine learning methods for solving resource allocation problems and enabling intelligence in wireless communication systems, since it is capable of making decisions by selecting potential actions based on stored experience. Unlike traditional reinforcement learning, it uses a deep neural network to learn instead of storing a massive number of values. A Markov decision process (MDP) is applied to model the reinforcement learning process. The MDP can be modeled by a tuple \(<\mathcal {S}, \mathcal {A}, R, \gamma>\) with state space \(\mathcal {S}\), action space \(\mathcal {A}\), reward R, and discount factor \(\gamma \in [0,1]\) [42]. At step t, the agent selects the action \(a_t\) by interacting with the system environment to maximize the reward \(R_t = r_t + \sum \nolimits _{t'=1}^{t-1} \gamma ^{(t - t') } r_{t'}\).
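The sketch below evaluates the cumulative reward exactly as defined in the expression above; note that this expression discounts earlier rewards by how far in the past they occurred, whereas many DRL formulations discount future rewards instead, so the code simply follows the form stated here.

```python
def cumulative_reward(rewards, t, gamma):
    """R_t = r_t + sum_{t'=1}^{t-1} gamma^(t - t') * r_{t'}, following the
    definition above; rewards[i - 1] holds r_i (1-indexed steps)."""
    return rewards[t - 1] + sum(gamma ** (t - tp) * rewards[tp - 1]
                                for tp in range(1, t))

# Example with three step rewards and gamma = 0.9:
print(cumulative_reward([1.0, 0.5, 2.0], t=3, gamma=0.9))  # 2.0 + 0.9*0.5 + 0.81*1.0 = 3.26
```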
In this paper, a novel multi-agent reinforcement learning scheme is proposed to jointly optimize blocklength allocation and power control in the UAV-assisted URLLC system with a high-dimensional action space. To obtain an effective reinforcement learning algorithm, it is essential to define the state space of the environment and the action space of the agent, and to model a suitable reward function that satisfies the system’s constraints while maximizing the objective function of problem (P1). The proposed method offers several benefits, such as the capacity to handle complicated, high-dimensional action spaces and to support cooperative decision-making among agents, which can produce more effective and efficient solutions.
4.1 State, Action, and Reward Function
The deep Q-network (DQN) is a DRL algorithm that combines reinforcement learning with a deep neural network to approximate the Q-value function, enabling more efficient and effective decision-making in complex environments. DQN has several advantages that make it an effective tool for solving non-convex optimization problems. First, it can handle high-dimensional action spaces, which is often important in complex systems. Second, it is particularly suited to problems whose models are unknown or poorly understood, since it does not require prior knowledge of the system dynamics or the objective function. Third, DQN is capable of efficiently exploring the action space to find optimal policies, which is essential for non-convex problems where the optimal solution is often difficult to determine. Finally, DQN is able to balance exploration and exploitation to ensure that it continually learns and improves, even as the system changes over time. The detailed design framework based on DQN for the URLLC communication system is shown in Fig. 2.
In this framework, we establish a targeted action space \(\mathcal {A}\) to denote the set of available actions. By weighting the blocklength and the communication rate, we design a segmented reward function \(r^t\) with rewards and penalties to incentivize the agent to select better actions. At the same time, using the experience data stored in the replay buffer, the estimate and target neural networks are updated in mini-batches to train the agent.
Agent: In our work, each subcarrier acts as an agent in the complex system. For example, at step t, the current agent independently decides the transmit power value \(a_{k_j,m,n}^{p,t}\) and the blocklength \(a_{k_j,m,n}^{l,t}\) based on the current state \(\textbf{s}_{k_j,m,n}^t\) and the reward value \(r^t\) to satisfy the blocklength constraint and maximize the sum rate performance.
Action Space: In the UAV-assisted URLLC system, the selection of optimization parameters is crucial for meeting the stringent requirements of low latency and high reliability [25]. Therefore, the action space includes the discrete transmit power values and the blocklength values; each agent can select an action \(\textbf{a}_{k_j,m,n}^t = \{a_{k_j,m,n}^{p,t}, a_{k_j,m,n}^{l,t}\} \in \mathcal {A} = \{ \mathcal {A}^p, \mathcal {A}^l \}\) in any state to reach the next state at time slot t. For simplicity, we assume that each agent has the same action space \(\mathcal {A}^p = \{0, \frac{P_{\max }}{L_p -1}, \frac{2 P_{\max }}{L_p-1},..., P_{\max } \}\) and \(\mathcal {A}^l = \{0, \frac{L_{\max }}{L_l -1}, \frac{2 L_{\max }}{L_l-1},..., L_{\max }\}\), where \(P_{\max }\) denotes the maximum transmit power, \(L_{\max }\) denotes the system’s total blocklength, \(L_p\) denotes the length of the action space \(\mathcal {A}^p\), and \(L_l\) denotes the length of the action space \(\mathcal {A}^l\). The agent can independently select the action \(a_{k_j,m,n}^{p,t} \in \mathcal {A}^p, a_{k_j,m,n}^{l,t} \in \mathcal {A}^l\) to maximize the reward value.
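A minimal sketch of the discrete action sets described above, with \(L_p\) and \(L_l\) grid points per dimension; the numerical values are illustrative placeholders only.

```python
import numpy as np

P_max = 0.2       # maximum transmit power (W), placeholder value
L_max = 100       # system total blocklength (symbols), placeholder value
L_p, L_l = 5, 5   # sizes of the power and blocklength action sets

# A^p = {0, P_max/(L_p-1), ..., P_max} and A^l = {0, L_max/(L_l-1), ..., L_max}.
A_p = np.linspace(0.0, P_max, L_p)
A_l = np.linspace(0.0, L_max, L_l)

# The joint action of one subcarrier agent is a (power, blocklength) pair.
actions = [(p, l) for p in A_p for l in A_l]
print(len(actions), actions[:3])
```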
State Space: The state space consists of the desired power and the interference power, i.e., the state of the n-th subcarrier of the k-th user at step t is
In the initial state, i.e., \(t=0\), each agent randomly selects the subcarrier power and blocklength according to the constraints (14), (17), and (13). Based on the current state \(\textbf{s}_{k_j,m,n}^t\) and the action \(\textbf{a}_{k_j,m,n}^t\), the agent obtains the next state \(\textbf{s}_{k_j,m,n}^{t+1}\).
Reward Function: To maximize the communication rate of the k-th user and satisfy the blocklength constraint, the difference between the system blocklength and the actually used blocklength can be modeled as
Then, we take the energy consumption and the sum rate into account to design the reward function and improve the system performance by maximizing the reward value [25]. Thus, the reward function of each agent can be modeled as
where \({\lambda _1, \lambda _2} \in (0, 1)\) are the weighting parameters. It is observed that the agent receives a penalty when the blocklength constraints (17) and (13) are not satisfied, and it obtains a larger reward when the used blocklength becomes smaller. Thus, the agent can select the potential action that maximizes the rate performance.
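The following sketch only mirrors the structure of the segmented reward described above, i.e., a penalty branch for constraint violations and a weighted combination of the rate and the unused blocklength otherwise; it is an illustrative stand-in, not the paper's exact reward expression.

```python
def segmented_reward(rate, energy, block_used, block_budget, energy_max,
                     lam1=0.6, lam2=0.4, penalty=-1.0):
    """Illustrative segmented reward: a penalty when the blocklength or energy
    constraint is violated, otherwise a weighted sum of the rate and the
    normalized unused blocklength (not the paper's exact formula)."""
    if block_used > block_budget or energy > energy_max:
        return penalty
    block_margin = block_budget - block_used   # smaller used blocklength -> larger reward
    return lam1 * rate + lam2 * block_margin / block_budget

print(segmented_reward(rate=3.2, energy=0.8, block_used=60,
                       block_budget=100, energy_max=1.0))
```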
4.2 Proposed Deep Q-network Algorithm
Due to the high dimensionality of the action space, DQN is adopted to solve the non-convex problem; it learns the policy through a neural network rather than storing the Q-values. The continuous power and blocklength in the URLLC communication system need to be quantized into discrete levels, and the state-action function is formulated to characterize the influence of the selected action on the performance in a specific state. The computational complexity for all agents to calculate the reward is \(O(N \cdot |s_{k,n}|) \), with N denoting the number of agents. The complexity of action selection is usually determined by the network structure of the DQN. The neural network structure of the DQN algorithm includes a single neural network with 3 hidden layers and 3K hidden nodes in each layer. For the DQN network, the number of neurons in the m-th layer is \(U_m\), and the number of layers in the DQN network is M. Thus, the computational complexity of the DQN networks for all agents is \(O(K(|s_{k,n}| \cdot U_{2} + \sum _{m=3}^{M}(U_{m-1}U_{m}+U_{m}U_{m+1}+U_{M-1} \cdot |a_{k,n}|)))\) [43]. Given the control policy \(\xi \) for the n-th subcarrier, the Q-function is defined as [44]
where \(\gamma \in [0,1]\) is the discount factor. The current reward determines the Q-function when the discount factor \(\gamma =0\), i.e., the agent selects the action depending only on the current reward \(r_n(\textbf{s}_n^t, \textbf{a}_n^t)\). The optimal action to maximize the rate performance in (P1) is \(\textbf{a}_n^{t,*} = \arg \mathop {\max }_{\textbf{a}_n^j \in \mathcal {A}} Q^{\xi }(\textbf{s}_n^t, \textbf{a}_n^j)\), obtained by searching the Q-values of the different candidate actions.
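A minimal sketch of this greedy selection over the joint (power, blocklength) action set of one subcarrier agent; the Q-values here are stand-in random numbers for illustration.

```python
import numpy as np

# Candidate joint actions of one subcarrier agent, built as in Section 4.1
# (placeholder grids for power and blocklength).
A_p = np.linspace(0.0, 0.2, 5)
A_l = np.linspace(0.0, 100.0, 5)
actions = [(p, l) for p in A_p for l in A_l]

def select_greedy(q_of_state):
    """a* = argmax_a Q(s, a): pick the joint action with the largest Q-value."""
    idx = int(np.argmax(q_of_state))
    return idx, actions[idx]

q_of_state = np.random.default_rng(1).random(len(actions))  # stand-in Q(s, .) values
print(select_greedy(q_of_state))
```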
To derive the optimal control policy \(\xi ^{*}\), the Q-function can be updated using the following scheme [35]
where \(\nu \) denotes the learning rate. According to (22), each subcarrier updates its Q-function and learns the control strategy based on the stored Q-values, and then selects the action that maximizes the reward. To tackle action selection with limited state-action information, an \(\epsilon \)-greedy strategy is adopted to explore the environment with exploration probability \(\epsilon \), which is written as
According to this strategy, the subcarrier takes a random action with probability \(\epsilon \) to explore the URLLC communication environment. Since the subcarrier’s unknown state space would otherwise require a large memory capacity and lead to a slow convergence rate, the deep neural network can intelligently extract features from the current data sets and reduce the computational complexity by predicting the output. According to the framework in Fig. 2, the tuple consisting of the state, action, reward, and next state serves as the input of the deep neural network, which outputs the Q-values \( Q(\textbf{s}_n^{t+1}, \textbf{a}_n^t| \theta _t)\) and \( Q(\textbf{s}_n^{t+1}, \textbf{a}_n^t| \theta _t^-)\) of the estimate and target neural networks, where \(\theta _t\) and \( \theta _t^-\) denote the parameters of the estimate and target neural networks at the t-th training step, respectively. To guarantee stability, the target neural network is made an exact replica of the estimate neural network every \(N_\textrm{rep}\) steps. To acquire an optimal Q-function, it is crucial to adjust the neural network parameters \(\theta _t\) based on an appropriate loss function. The loss function is defined as follows [44]
Most optimizers, such as the gradient descent method, can be used to determine the optimal neural network parameters based on the loss function and the training data set. The deep neural network must be trained using these training data. The techniques of experience replay and random sampling are used to reduce the dependence between training samples. The proposed DQN algorithm uses an experience replay memory of size \(N_\textrm{mem}\) to save the tuples of the reinforcement learning process and updates the data every \(N_\textrm{tr}\) steps, which keeps the training data fresh. The experience data are randomly sampled from the replay memory to form a mini-batch, which smooths the transitions between the historical data and the fresh observations. Algorithm 1 displays the proposed DQN-based scheme for the URLLC communication system.
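A compact sketch of the experience replay and periodic target-network synchronization just described, with placeholder sizes; it illustrates the mechanism rather than reproducing Algorithm 1.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Experience replay of capacity N_mem with uniform random mini-batch sampling."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states

# Target-network synchronization every N_rep steps (parameters shown as plain arrays).
N_rep = 100
theta = np.zeros(8)           # estimate-network parameters (illustrative)
theta_target = theta.copy()   # target-network parameters

for step in range(1, 501):
    # ... interact with the environment, push transitions, train the estimate network ...
    if step % N_rep == 0:
        theta_target = theta.copy()   # copy the estimate network into the target network
```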
5 Simulation Results
In this section, simulation results validate the effectiveness and convergence of the proposed multi-agent DQN-based algorithm for power control and blocklength allocation. Then, the influence of different parameters on the rate performance of the UAV-assisted URLLC system is analyzed. Python 3.6 and TensorFlow 1.13 are the simulation tools used to train the deep neural network. The simulation parameters are presented in Table 1, and K users are randomly placed in a 200-meter-diameter circle centered on the UAV.
The UAVs are randomly distributed within the BS service area with radius \(R_s\). Unless otherwise stated, the number of UAVs is set to \(\mathcal {M}\)=2 and the number of ground users to K=2.
We design the neural network in the DQN-based algorithm with one input layer, three hidden layers, and one output layer, and the parameters of the neural network are optimized by the gradient descent method. Training begins after \(N_\textrm{bat}\) steps to ensure a sufficiently large batch of data. Furthermore, to verify that the proposed DQN-based algorithm is suitable for solving the optimization problem in this work, it is compared with the traditional Q-learning algorithm and a greedy scheme. The Q-learning algorithm belongs to the class of value-based reinforcement learning methods; it creates a Q-table to store the Q-values of states and actions, and then selects the action that maximizes the benefit. The greedy algorithm is a well-known scheme that selects the locally optimal choice in the current state.
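For reference, the Q-learning benchmark maintains a Q-table and applies the standard temporal-difference update with learning rate \(\nu\), discount factor \(\gamma\), and \(\epsilon\)-greedy exploration; the sketch below uses placeholder state and action set sizes.

```python
import numpy as np

n_states, n_actions = 16, 10
Q = np.zeros((n_states, n_actions))   # Q-table of the benchmark scheme
nu, gamma, eps = 0.001, 0.9, 0.1      # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def eps_greedy(s):
    """Random action with probability eps, otherwise the greedy action."""
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

def q_learning_step(s, a, r, s_next):
    """Tabular update Q(s,a) <- Q(s,a) + nu * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s, a] += nu * (r + gamma * Q[s_next].max() - Q[s, a])

s = 0
a = eps_greedy(s)
q_learning_step(s, a, r=1.0, s_next=3)
print(Q[s, a])
```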
Figure 3 illustrates the comparison of the rate performance among the proposed DQN-based scheme, the Q-learning scheme, and the greedy scheme over the learning episodes. It is observed that the proposed DQN-based scheme significantly outperforms the benchmark schemes in terms of rate performance. The state space, encompassing desired power and interference power, is vast due to the dynamic characteristics of wireless networks. The proposed DQN-based algorithm intelligently allocates resources based on the current state, leading to a substantial improvement in rate performance. The rate performance of the proposed scheme and the Q-learning scheme converges to the optimal value as the number of training episodes increases. However, the proposed scheme exhibits significantly faster and more stable convergence than the Q-learning-based scheme. The Q-learning algorithm relies on a large Q-table in wireless communication environments, which hampers query speed and the ability to represent states effectively. Furthermore, the proposed DQN-based scheme outperforms the Q-learning scheme by 7.43%. The proposed scheme adopts experience replay and random mini-batch sampling during the learning process, allowing the agent to adjust quickly and efficiently to the dynamic environment. Experience replay effectively enhances the training efficiency and stability of the intelligent agent, thereby reducing instability and convergence difficulties during the training episodes. These results demonstrate the capability of the proposed DQN scheme to address this optimization problem and its adaptability to the dynamic characteristics of wireless networks.
The principle of a mini-slot was introduced by the international organization for standardization to enable URLLC applications by reducing the transmit time interval [45]. Additionally, NR Release-15 provides a scalable numerology with subcarrier spacings of 15 kHz, 30 kHz, and 60 kHz below 6 GHz, and 120 kHz or 240 kHz above 6 GHz [46]. Figure 4 illustrates the impact of the subcarrier spacing and the error probability on the sum rate performance of the proposed DQN-based scheme. The maximum sum rate increases as the subcarrier spacing increases from 30 to 60 kHz. The proposed algorithm allocates a longer blocklength as the transmit bandwidth W increases with the subcarrier spacing. Furthermore, the rate performance decreases as the required error probability \(\eta \) decreases, which is consistent with the rate expression (7). The results also show that when the learning rate is 0.0001, although the proposed algorithm converges quickly, the converged value is not as good. When the learning rate is 0.1, the DQN algorithm converges to a local optimal solution faster, because the larger learning rate makes the parameters update faster. When the learning rate is 0.01, the smaller learning rate reduces the oscillation and instability in the training process, but it does not reach the optimal solution for this communication system. The simulation results show that when the learning rate is 0.001, the stability and convergence of the training can be guaranteed efficiently. In addition, increasing the learning rate may accelerate convergence but may lead to unstable training results.
Figure 5 illustrates the impact of the maximum transmit power on the sum rate of the proposed scheme and the benchmark schemes. By adjusting the maximum transmit power of the users, it is possible to effectively alter the overall system rate. The experimental results demonstrate a stable increase in the system rate as the maximum transmit power increases. This implies that increasing the maximum transmit power can enhance the data transfer rate of the system, thereby boosting network performance and efficiency. In the proposed DQN scheme, intelligent decision-making enables the system to better utilize the increased transmit power resources, further enhancing the overall rate. The experimental findings reveal that, compared to the traditional greedy algorithm, the proposed DQN scheme improves the total rate by 37.19%, and compared to the Q-learning algorithm, by 13.05%. This underscores the effectiveness and superiority of DQN in optimizing system performance. Thus, by appropriately adjusting the maximum transmit power and integrating intelligent algorithms, it is possible to maximize the data transfer rate and performance of the system.
Figure 6 investigates the impact of the learning rate on the rate of the proposed DQN-based scheme. This figure verifies the influence of the learning rate on the convergence speed and the converged value under different allocation policies. It can be seen that when the learning rate is \(\nu \)=0.0001, the proposed algorithm converges too fast and the obtained value is far from the optimal allocation policy. Moreover, \(\nu \)=0.1 or \(\nu \)=0.01 yields a good convergence speed and value, but the final converged result is not optimal compared with \(\nu \)=0.001. When the learning rate is \(\nu \)=0.03 or \(\nu \)=0.003, the algorithm displays inferior convergence results in terms of both speed and efficiency. Therefore, selecting the optimal learning rate for the communication environment allows the proposed scheme to achieve stable performance and obtain the optimal value for the current environment.
Figure 7 illustrates the impact of the number of users on the rate performance of the proposed scheme. In the simulation environment, each user is allocated \(N_\textrm{a}=2\) subcarriers, where \(N_\textrm{a}\) represents the number of subcarriers allocated to each user, which means that users share the subcarriers. The results indicate that as the ratio of the number of UAVs to the number of users increases, the average rate of each subcarrier assigned to each user also increases. Specifically, the average subcarrier rate with a ratio of \(\frac{\mathcal {M}}{K}=1/2\) shows a rate performance gain of up to \(6.28\%\) compared to the ratio \(\frac{\mathcal {M}}{K}=1/4\). This gain arises because the interference power increases as the number of users covered by a single UAV increases. First, the communication interference grows with the number of users, thus reducing the reliability and rate performance of the communication. Second, the increase in the number of users within the UAV communication range also affects the coverage quality of the system. When the number of users increases, the UAVs need more communication resources to meet the users' communication demands, which may lead to a shortage of communication resources and reduced coverage, thus degrading the communication quality of the system.
Figure 8 investigates the sum rate versus the number of allocated subcarriers. It can be observed that the system sum rate decreases with increasing \(N_a\), because the inter-group interference is considered in the SINR expression (6), so different users occupying the same subcarrier cause interference to each other. When \(N_a=2\), two subcarriers are allocated to each user, which can increase the user's SINR compared with \(N_a=3\) or \(N_a=4\).
6 Conclusion
In this paper, a joint blocklength allocation and power control scheme based on a multi-agent DQN algorithm was proposed to maximize the rate performance in the UAV-assisted URLLC communication system. The optimization problem is difficult to solve due to the non-convex objective and the UAVs’ energy constraints. Therefore, the non-convex optimization problem was decomposed into a multi-agent reinforcement learning process, in which each subcarrier acts as an agent that intelligently determines its own transmit power and blocklength. The proposed DQN-based scheme constructs the reward function from the blocklength, the communication rate, and the energy consumption of the UAVs. Furthermore, this work investigated the influence of the learning rate, the error probability, and the subcarrier spacing on the system performance. The simulation results show that the proposed scheme outperforms the benchmark schemes in terms of effectiveness and convergence. In the future, exploring intelligent trajectory design is an interesting research direction for UAV-assisted URLLC communication systems.
7 Future Work
In future work, the dynamic characteristics of UAVs can be studied by considering the flying trajectory, dynamic deployment, and the wireless channel in the UAV-assisted URLLC system. Building on the proposed method, it is interesting to design intelligent resource allocation schemes with high-dimensional action spaces and practical reward functions that depend on the optimization target and constraints.
Coordinated UAVs can enhance the rate and reliability performance and reduce the latency over a wide coverage area. However, it is difficult to optimize the resources and trajectories for large-scale UAV networks. On the one hand, the complex interference consists of UAV-to-UAV links and UAV-to-user communication links, which limits the power and blocklength optimization required to satisfy the rate and reliability performance. On the other hand, large-scale UAV networks require a large amount of computation due to the increasing number of UAVs and the high-dimensional action space. Using advanced algorithms such as DRL and distributed optimization, an intelligent method can be designed to autonomously adjust the large-scale UAV network.
Data Availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Wu, Y., Dai, H.N., Wang, H.: Convergence of blockchain and edge computing for secure and scalable IIoT critical infrastructures in industry 4.0. IEEE Int. Things J. 8(4), 2300–2317 (2021)
Vaezi, M., Azari, A., Khosravirad, S.R., Shirvanimoghaddam, M., Azari, M.M., Chasaki, D., Popovski, P.: Cellular, wide-area, and non-terrestrial IoT: A survey on 5G advances and the road toward 6G. IEEE Communications Surveys and Tutorials. 24(2), 1117–1174 (2022)
Xu, D., Yu, K., Ritcey, J.A.: Cross-layer device authentication with quantum encryption for 5G enabled IIoT in industry 4.0. IEEE Transactions on Industrial Informatics. 18(9), 6368–6378 (2022)
Li, Y., Hu, C., Wang, J., Xu, M.: Optimization of URLLC and eMBB multiplexing via deep reinforcement learning. IEEE/CIC International Conference on Communications Workshops in China (ICCC Workshops). 245–250 (2019)
Sutton, G.J., Zeng, J., Liu, R., et al.: Enabling technologies for ultra-reliable and low latency communications: From PHY and MAC layer perspectives. IEEE Commun. Surv. Tutorials. 21(3), 2488–2524 (2019)
Schulz, P., Matthé, M., Klessig, H., et al.: Latency critical IoT applications in 5G: Perspective on the design of radio interface and network architecture. IEEE Commun. Mag. 55(2), 70–78 (2017)
Polyanskiy, Y., Poor, H.V., Verdu, S.: Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 56(5), 2307–2359 (2010)
Hashemi, R., Ali, S., Mahmood, N.H., Latva-aho, M.: Average rate and error probability analysis in short packet communications over RIS-aided URLLC systems. IEEE Transactions on Vehicular Technology. 70(10), 10320–10334 (2021)
Ren, H., Pan, C., Deng, Y., Elkashlan, M., Nallanathan, A.: Joint power and blocklength optimization for URLLC in a factory automation scenario. IEEE Trans. Wireless Commun. 19(3), 1786–1801 (2019)
Hu, Y., Sun, G., Zhang, G., Gursoy, M.C., Schmeink, A.: Optimal resource allocation in ground wireless networks supporting unmanned aerial vehicle transmissions. IEEE Trans. Veh. Technol. 69(8), 8972–8984 (2020)
Chen, K., Wang, Y., Fei, Z., Wang, X.: Power limited ultra-reliable and low-latency communication in UAV-enabled IoT networks. IEEE Wireless Communications and Networking Conference (WCNC). 1–6 (2020)
Shiri, H., Park, J., Bennis, M.: Remote UAV online path planning via neural network-based opportunistic control. IEEE Wireless Communications Letters. 9(6), 861–865 (2020)
Ren, H., Pan, C., Wang, K., Deng, Y., Elkashlan, M., Nallanathan, A.: Achievable data rate for URLLC-enabled UAV systems with 3-D channel model. IEEE Wireless Communications Letters. 8(6), 1587–1590 (2019)
Ranjha, A., Kaddoum, G.: URLLC-enabled by laser powered UAV relay: A quasi-optimal design of resource allocation, trajectory planning and energy harvesting. IEEE Trans. Veh. Technol. 71(1), 753–765 (2022)
Mozaffari, M., Saad, W., Bennis, M., Debbah, M.: Mobile unmanned aerial vehicles (UAVs) for energy-efficient internet of things communications. IEEE Trans. Wireless Commun. 16(11), 7574–7589 (2017)
Pandey, S.R., Kim, K., Alsenwi, M., Tun, Y.K., Han, Z., Hong, C.S.: Latency-sensitive service delivery with UAV-assisted 5G networks. IEEE Wireless Communications Letters. 10(7), 1518–1522 (2021)
Kasgari, A.T.Z., Saad, W., Mozaffari, M., Poor, H.V.: Experienced deep reinforcement learning with generative adversarial networks (GANs) for model-free ultra reliable low latency communication. IEEE Trans. Commun. 69(2), 884–899 (2021)
Gu, B., Zhang, X., Lin, Z., Alazab, M.: Deep multiagent reinforcement-learning-based resource allocation for internet of controllable things. IEEE Internet Things J. 8(5), 3066–3074 (2021)
Zhong, R., Liu, Y., Mu, X., Chen, Y., Song, L.: AI empowered RIS-assisted NOMA networks: Deep learning or reinforcement learning? IEEE J. Sel. Areas Commun. 40(1), 182–196 (2022)
Elwekeil, M., Zappone, A., Buzzi, S.: Power control in cell-free massive MIMO networks for UAVs URLLC under the finite blocklength regime. IEEE Trans. Commun. 71(2), 1126–1140 (2023)
Wang, K., Pan, C., Ren, H., Xu, W., Zhang, L., Nallanathan, A.: Packet error probability and effective throughput for ultra-reliable and low-latency UAV communications. IEEE Trans. Commun. 69(1), 73–84 (2021)
Narsani, H.K., Ranjha, A., Dev, K., Memon, F.H., Qureshi, N.M.F.: Leveraging UAV-assisted communications to improve secrecy for URLLC in 6G systems. Digital Communications and Networks. 9(6), 1458–1464 (2023)
Wu, Q., Cui, M., Zhang, G., Wang, F., Wu, Q., Chu, X.: Latency minimization for UAV-enabled URLLC-based mobile edge computing systems. Early Access in IEEE Transactions on Wireless Communications. (2023)
Ranjha, A., Javed, M.A., Piran, M.J., Asif, M., Hussien, M., Zeadally, S., Frnda, J.: Towards facilitating power efficient URLLC systems in UAV networks under jittering. Early Access in IEEE Transactions on Consumer Electronics. (2023)
Cai, Y., Jiang, X., Liu, M., Zhao, N., Chen, Y., Wang, X.: Resource allocation for URLLC-oriented two-way UAV relaying. IEEE Trans. Veh. Technol. 71(3), 3344–3349 (2022)
Yang, P., Xi, X., Quek, T.Q.S., Cao, X., Chen, J.: Power control for a URLLC-enabled UAV system incorporated with DNN-based channel estimation. IEEE Wireless Communications Letters. 10(5), 1018–1022 (2021)
Hazarika, B., Singh, K.: AFL-DMAAC: Integrated resource management and cooperative caching for URLLC-IoV networks. IEEE Transactions on Intelligent Vehicles. 1–16 (2023)
Liu, Y., Zhou, H., Deng, Y., Nallanathan, A.: Channel access optimization in unlicensed spectrum for downlink URLLC: Centralized and federated DRL approaches. IEEE J. Sel. Areas Commun. 41(7), 2208–2222 (2023)
Bubeck, S., et al.: Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning. 8(3–4), 231–357 (2015)
Zhang, X., Zhang, Z., Gong, X., Yin, Y.: An exact branch-and-bound algorithm for seru scheduling problems with sequence-dependent setup time. Soft. Comput. 27(10), 6415–6436 (2023)
Zhao, J., Mao, M., Zhao, X., Zou, J.: A hybrid of deep reinforcement learning and local search for the vehicle routing problems. IEEE Trans. Intell. Transp. Syst. 22(11), 7208–7218 (2021)
Rajwar, K., Deep, K., Das, S.: An exhaustive review of the metaheuristic algorithms for search and optimization: Taxonomy, applications, and open challenges. Artif. Intell. Rev. 56(11), 13187–13257 (2023)
Hickling, T., Zenati, A., Aouf, N., Spencer, P.: Explainability in deep reinforcement learning: A review into current methods and applications. ACM Comput. Surv. 56(5), 1–35 (2023)
Al-Hourani, A., Kandeepan, S., Lardner, S.: Optimal LAP altitude for maximum coverage. IEEE Wireless Communications Letters. 3(6), 569–572 (2014)
Li, X., Xu, J.: Positioning optimization for sum-rate maximization in UAV-enabled interference channel. IEEE Signal Process. Lett. 26(10), 1466–1470 (2019)
3GPP TR 38.901: Study on channel model for frequencies from 0.5 to 100 GHz (2021)
Wang, L., Zhang, H.: Analysis of joint scheduling and power control for predictable URLLC in industrial wireless networks. IEEE International Conference on Industrial Internet (ICII). 160–169 (2019)
Fang, M., Li, D., Zhang, H., Fan, L., Trigui, I.: Performance analysis of short-packet communications with incremental relaying. Comput. Commun. 177(1), 51–56 (2021)
Qin, Y., Yuen, C., Shao, Y., Qin, B., Li, X.: Slow-varying dynamics-assisted temporal capsule network for machinery remaining useful life estimation. IEEE Transactions on Cybernetics. 53(1), 592–606 (2022)
Durisi, G., Koch, T., Popovski, P.: Toward massive, ultrareliable, and low-latency wireless communication with short packets. Proc. IEEE 104(9), 1711–1726 (2016)
Feng, R., Li, Z., Wang, Q., Huang, J.: An ADMM-based optimization method for URLLC-enabled UAV relay system. IEEE Wireless Communications Letters. 14(8), 1–5 (2022)
Yin, B., Li, X., Yan, J., Zhang, S., Zhang, X.: DQN-based power control and offloading computing for information freshness in multi-UAV-assisted V2X system. IEEE 96th Vehicular Technology Conference (VTC2022-Fall). 1–6 (2022)
Ciftler, B.S., Alwarafy, A., Abdallah, M.: Distributed DRL-based downlink power allocation for hybrid RF/VLC networks. IEEE Photonics J. 14(3), 1–10 (2022)
Li, X., Li, J., Yin, B., Yan, J., Fang, Y.: Age of information optimization in UAV-enabled intelligent transportation system via deep reinforcement learning. IEEE 96th Vehicular Technology Conference (VTC2022-Fall). 1–5 (2022)
3GPP TR38.912: Study on new radio access technology: Radio access architecture and interfaces (2016)
Sachs, J., Wikström, G., Dudda, T., Baldemair, R., Kittichokechai, K.: 5G radio network design for ultra-reliable low-latency communication. IEEE Network 32(2), 24–31 (2018)
Acknowledgements
The authors would like to thank the anonymous reviewers for their comments to improve the quality of the paper.
Funding
The work was supported in part by the National Science Foundation of China under Grant 62101386, in part by the Shanghai Sailing Program under Grants 21YF1450000, in part by the Guangdong Provincial Key Laboratory of Future Networks of Intelligence, the Chinese University of Hong Kong, Shenzhen under Grant 2022B1212010001-OF04, in part by the Natural Science Foundation of Sichuan Province under Grant 2023NSFSC1388, in part by the Open Fund of Key Laboratory of Civil Aircraft Airworthiness Technology under Grant SH2020112706, in part by the Key Laboratory of Medicinal and Edible Plant Resources Development of Sichuan Education Department, Chengdu University under Grant 10Y202201.
Author information
Contributions
XML, JHL, and XHZ conceived of the presented idea and drafted the manuscript. FYL and JHL conducted the experiments. YH and XQZ modified the manuscript and supervised this work. All the authors have read and approved the final manuscript.
Ethics declarations
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, X., Zhang, X., Li, J. et al. Blocklength Allocation and Power Control in UAV-Assisted URLLC System via Multi-agent Deep Reinforcement Learning. Int J Comput Intell Syst 17, 138 (2024). https://doi.org/10.1007/s44196-024-00530-8