DRL Minimum Throughput Maximization For Multi-UAV Enabled WPCN

This document summarizes a research paper that proposes using deep reinforcement learning to optimize unmanned aerial vehicle (UAV) trajectories and time resource allocation in a wireless powered communication network (WPCN) in order to maximize minimum throughput. Specifically, it introduces a WPCN where multiple UAVs provide wireless power transfer to charge Internet of Things devices and the devices then transmit information. The goal is to maximize minimum throughput by jointly optimizing UAV path planning and channel assignment, subject to constraints. As this problem is non-convex, the paper proposes using a multi-agent deep Q-learning algorithm to tackle the problem in an efficient manner.

SPECIAL SECTION ON ARTIFICIAL INTELLIGENCE FOR PHYSICAL-LAYER WIRELESS COMMUNICATIONS

Received December 17, 2019, accepted December 31, 2019, date of publication January 6, 2020, date of current version January 15, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2964042

Minimum Throughput Maximization for Multi-UAV Enabled WPCN: A Deep Reinforcement Learning Method
JIE TANG 1 (Senior Member, IEEE), JINGRU SONG 1, JUNHUI OU 1, JINGCI LUO 1, XIUYIN ZHANG 1 (Senior Member, IEEE), AND KAI-KIT WONG 2 (Fellow, IEEE)
1 School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China
2 Department of Electronic and Electrical Engineering, University College London, London WC1E 7JE, U.K.
Corresponding author: Junhui Ou (oujunhui@scut.edu.cn)
This work was supported in part by the National Natural Science Foundation of China under Grant 61971194 and Grant 61601186, in part
by the Natural Science Foundation of Guangdong Province under Grant 2017A030313383, and in part by the Open Research Fund of
National Mobile Communications Research Laboratory, Southeast University, under Grant 2019D06.

ABSTRACT This paper investigates joint unmanned aerial vehicle (UAV) trajectory planning and time resource allocation for minimum throughput maximization in a multiple-UAV enabled wireless powered communication network (WPCN). In particular, the UAVs perform as base stations (BSs) that broadcast energy signals in the downlink to charge IoT devices, while the IoT devices send their independent information in the uplink by utilizing the collected energy. The formulated throughput optimization problem, which involves joint optimization of 3D path design and channel resource assignment subject to constraints on the flight speed of the UAVs and the uplink transmit power of the IoT devices, is non-convex and thus extremely difficult to solve directly. We take advantage of the multi-agent deep Q learning (DQL) strategy and propose a novel algorithm to tackle this problem. Simulation results indicate that the proposed DQL-based algorithm significantly improves the minimum throughput compared with conventional WPCN schemes.

INDEX TERMS Unmanned aerial vehicle (UAV), wireless powered communication network (WPCN),
Internet of Things (IoT), trajectory design, deep reinforcement learning (DRL).

I. INTRODUCTION
The Internet of Things (IoT) enables data collection and exchange by interconnecting heterogeneous smart devices such as sensors, smart phones and smart transportation systems, which makes machine-to-machine (M2M) and seamless communication possible [1], [2]. The massive application scenarios in IoT generate rigorous communication requirements such as low latency, high reliability and safety. Long-Term Evolution (LTE) cannot support machine-type communication (MTC) effectively because it focuses on broadband communication [3]. The fifth generation (5G) mobile network brings higher throughput, lower end-to-end latency and enhanced security mechanisms, and is capable of meeting these massive IoT communication demands. Thus, IoT has received significant research attention in the 5G era.

In a conventional scenario, IoT devices are battery-constrained and cannot sustain enormous energy consumption. Radio frequency (RF) based energy harvesting (EH) can be regarded as a prospective scheme to extend the lifetime of energy-constrained IoT devices [4]. Moreover, massive ground IoT devices have large and frequent communication requirements. The wireless powered communication network (WPCN) [5], which integrates wireless power transfer (WPT) and wireless information transfer (WIT), provides a feasible solution for energy-constrained IoT devices. The authors in [6] propose a classic protocol named "harvest-then-transmit" (HTT). In this protocol, the ground users are first charged by the downlink energy flow, and then transmit their uplink information signals by utilizing the collected energy. Moreover, time division multiple access (TDMA) is adopted as a typical design for WPCN in [6] and the sum throughput is

The associate editor coordinating the review of this manuscript and approving it for publication was Guan Gui.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
9124 VOLUME 8, 2020
J. Tang et al.: Minimum Throughput Maximization for Multi-UAV Enabled WPCN

maximized by optimizing the time resource allocation. Furthermore, a multi-antenna energy beamforming and space division multiple access (SDMA) protocol is employed in [7] for higher spectrum efficiency. In [8], the authors combine multi-user multiple-input multiple-output (MIMO) technology with cognitive radio and WPCN for maximizing the sum throughput. The authors in [9] and [10] introduce the backscatter communication mode into HTT-based WPCN for the sake of maximizing the throughput. However, there is still a challenge named "doubly-near-far" in conventional WPCN [11]: compared with devices close to the base station (BS), devices far away from the BS harvest less wireless energy in the downlink but have to consume more to transmit information in the uplink.

Owing to its high maneuverability and flexibility, the unmanned aerial vehicle (UAV) can provide a greater probability of line-of-sight (LoS) channels and better connectivity compared with the conventional fixed BS. Therefore, the UAV has been applied in many research fields of wireless communication. In [12] and [13], UAVs perform as flying relays in order to achieve end-to-end throughput maximization by jointly optimizing the UAV's trajectory planning and transmit power control. In [14], the authors introduce a UAV into a conventional WPCN and propose a channel-weighted path planning method to maximize the sum throughput, where the UAV performs as an assistant to the existing BS. In [15], a UAV-aided WPCN is considered, in which the UAV performs as an aerial BS in order to provide service to a cluster of ground users; a joint successive hover-and-fly trajectory design and wireless resource allocation protocol is proposed there for throughput maximization. The authors in [16] consider a wireless network in which multiple UAVs provide wireless communication service, and the co-channel interference and transmit power control are discussed. In [17], the authors maximize the number of users in coverage subject to a minimum transmit power by optimizing the 3D placement of the UAV.

Deep learning (DL) has been proved to be a powerful tool for solving non-convex problems and high-complexity issues, and it has been widely applied in the optimization of wireless communication systems [18]–[22]. As a kind of deep reinforcement learning (DRL), deep Q learning (DQL) makes action decisions by utilizing a deep neural network (DNN) and performs well when dealing with dynamic time-variant environments [23]. Therefore, DQL provides a promising technique for the dynamic control of UAVs. The authors in [24] adopt reinforcement learning (RL) for the purpose of acquiring the optimal hover positions of UAVs. In [25], UAVs make decisions based on a deep Q network (DQN) for energy-efficient data collection while deployed in smart cities. A DRL-based UAV control strategy is proposed in [26] for maximizing both the energy efficiency and the communication coverage.

A. MAIN CONTRIBUTIONS
Previous research has investigated UAV and WPCN related systems and provided effective solutions for throughput maximization [15], [16]. However, the work in [15] only considers a single-UAV based WPCN, which is not suitable for the scene of massive IoT devices. In [16], a multi-UAV assisted wireless communication network is proposed, but the energy supply of the ground devices is not considered. The work in [17] investigates the 3D placement of a UAV for the purpose of maximizing the coverage, but flexible trajectory design is not taken into account. In a scenario with many energy-constrained IoT devices located in a large area, both multi-UAV deployment and downlink WPT are worth studying. Furthermore, the 3D trajectory design of the UAVs is necessary in order to achieve better channel quality. Motivated by the above research, we put forward a minimum throughput maximization problem for multi-UAV enabled WPCN with joint optimization of 3D trajectory design and time resource assignment. The contributions of this paper are summarized as follows.
1) We come up with a WPCN in which multiple UAVs provide reliable energy supply and communication services to IoT devices. Based on the considered model, our target is to maximize the minimum throughput by jointly scheduling the UAVs' trajectory planning and time resource assignment under the constraints of maximum flight speed, peak uplink power and flight area. Nevertheless, the minimum throughput optimization problem is non-convex and unmanageable. In order to tackle this problem, we introduce the concept of DQL.
2) We put forward a multi-agent DQL based strategy in order to maximize the minimum throughput by jointly optimizing the UAVs' path design and time resource assignment. In particular, each UAV owns an independent DQN for making action decisions, while the other UAVs are considered as a part of the environment. After each epoch, the UAVs receive a reward or penalty based on the minimum throughput.
3) The simulation results illustrate that our algorithm accomplishes significant performance improvement in the field of minimum throughput optimization compared with the traditional schemes.

B. ORGANIZATION
The rest of this paper is organized as follows. In Section II, the multi-UAV enabled WPCN model is presented, and we formulate the minimum throughput maximization problem. In Section III, the multi-agent DQL based algorithm is proposed to jointly design the UAVs' trajectory and time resource allocation. Our simulation results are provided in Section IV to demonstrate the effectiveness of the proposed algorithm. Finally, conclusions are given in Section V.

II. PRELIMINARIES
In this section, we first introduce the system model of the considered multi-UAV enabled WPCN, and then formulate the corresponding UAVs' path planning and time resource allocation problem.
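To make the clustering step concrete: each UAV serves one of L disjoint device clusters, and Section III-B later obtains this partition with K-means [34]. The following is a minimal sketch of that partition under the paper's simulation setting (25 devices in a 50 m × 50 m area, L = 3 UAVs); the helper name `kmeans` and the random device layout are illustrative assumptions, not code from the paper.

```python
import random

def kmeans(points, L, iters=50, seed=0):
    """Partition 2D device positions into L clusters (plain K-means).

    points: list of (x, y) tuples; returns (centroids, labels).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, L)  # initial centroids drawn from the devices
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each device to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(L),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its cluster.
        for c in range(L):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# 25 devices scattered in a 50 m x 50 m area, L = 3 UAVs (paper's simulation numbers)
rng = random.Random(1)
devices = [(rng.uniform(0, 50), rng.uniform(0, 50)) for _ in range(25)]
centroids, labels = kmeans(devices, L=3)
```

Each resulting label identifies the cluster (and hence the serving UAV) of a device; the clusters are disjoint by construction, matching the constraint Kl ∩ Kl′ = ∅.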


FIGURE 1. A multi-UAV enabled WPCN.

A. SYSTEM MODEL
We consider a WPCN system in which multiple UAVs perform as aerial BSs to support ground IoT devices in a given area, as shown in Fig. 1. We partition the IoT devices into L clusters, and each UAV is in charge of one cluster. All the UAVs are equipped with a single antenna and share the same frequency band. The IoT devices in the area are denoted as K = {K1, · · · , KL}, where the devices in the l-th cluster are denoted as Kl, l ∈ L = {1, 2, · · · , L}. Then, we have Kl ∩ Kl′ = ∅, l′ ≠ l, l ∈ L, which means there is no overlap between the clusters. For any cluster l, l ∈ L, we consider a UAV-enabled TDMA system which adopts the HTT protocol, where the UAVs travel through the area periodically to charge their clusters via downlink WPT, and each device utilizes its collected energy to send information in the uplink.

Let us analyze the system within a specific flight period of the UAVs, represented as t ∈ [0, T]. We describe the locations of the IoT devices and UAVs in a 3D Cartesian coordinate system. To be specific, the locations of device kl ∈ Kl and UAV l are respectively denoted as wkl = (xkl, ykl, 0) and ql(t) = (xl(t), yl(t), hl(t)), hmin ≤ hl(t) ≤ hmax, where hl(t) denotes the altitude of UAV l. To facilitate the analysis, the flight period T is discretized into N + 1 time slots. In order to make sure that the UAVs are approximately stationary within a time slot, the number N is selected to be adequately large. Suppose Vmax is the maximum speed of the UAVs; then the locations of the UAVs should satisfy

‖ql[n] − ql[n − 1]‖ ≤ Vmax · δN, (1)

where δN = (1 − α)T / N denotes the length of each uplink subslot, and α stands for the proportion of downlink WPT in a period.

The channel between the UAVs and the IoT devices in our system can be regarded as an air-to-ground channel, in which LoS and non-line-of-sight (NLoS) conditions appear randomly. The probability of LoS can be expressed as [27]

PLoS(θl,kl) = b1 ((180/π) θl,kl − ζ)^b2, (2)

where θl,kl[n] = sin⁻¹(hl[n] / dl,kl[n]) denotes the elevation angle from IoT device kl to UAV l in the n-th time slot, and dl,kl[n] = √((xl[n] − xkl)² + (yl[n] − ykl)² + hl²[n]) is the distance between UAV l and device kl. Besides, b1 and b2 are constants representing the environmental influence, and ζ is another constant determined by both the antenna and the environment. Note that the NLoS probability is PNLoS = 1 − PLoS.

The path loss model for the LoS and NLoS links between UAV l and device kl is given by [28]

Ll,kl = μ1 (4π fc dl,kl / c)^α for a LoS link, and Ll,kl = μ2 (4π fc dl,kl / c)^α for an NLoS link, (3)

where μ1 and μ2 are the attenuation coefficients of the LoS and NLoS links, fc and c denote the carrier frequency and the speed of light respectively, and α stands for the path loss exponent. Considering (2) and (3), the channel power gain between UAV l and device kl can be denoted as

gl,kl[n] = [PLoS μ1 + PNLoS μ2]⁻¹ (K0 dl,kl[n])^(−α), (4)

where K0 = 4π fc / c.

Next, we illustrate the TDMA and HTT transmission protocol of the UAV-enabled WPCN in detail. As mentioned above, there are N + 1 time slots in each flight period T. Specifically, the 0-th time slot is assigned to the downlink WPT, and the n-th time slot, n ∈ N = {1, 2, · · · , N}, is allocated to the uplink WIT. We use the binary variable al[0] to denote the downlink WET mode of UAV l: al[0] equal to 1 or 0 respectively represents that energy is or is not transferred by the l-th UAV, while al,kl[n] represents the uplink WIT allocation between UAV l and IoT device kl in the n-th time slot. Specifically, al,kl[n] equal to 1 or 0 means that IoT device kl does or does not communicate with the l-th UAV. Since the TDMA protocol is employed, the following constraints on the time resource allocation should be considered:

al[0] ∈ {0, 1}, ∀l ∈ L,
al,kl[n] ∈ {0, 1}, ∀l ∈ L, kl ∈ K, n ∈ N,
Σ_{kl∈Kl} al,kl[n] ≤ 1, ∀l ∈ L, n ∈ N. (5)

At the 0-th time slot of each flight period, the UAVs transmit the downlink energy signals with transmit power P^D. Therefore, the energy collected by each IoT device kl over period T is expressed as

Ekl = η · α · T · Σ_{i=1}^{L} ai[0] · gi,kl[0] · P^D, ∀l ∈ L, kl ∈ K, (6)

where η ∈ (0, 1] denotes the RF-to-direct-current (DC) energy conversion efficiency of each device.

Then, we consider the WIT mode for IoT device kl ∈ K at time slot n. Let P^U_kl[n] denote the uplink power of device kl


in the n-th time slot; then the available energy Ekl[n] of device kl in the n-th time slot can be represented as

Ekl[n] = Ekl − Σ_{j=1}^{n−1} al,kl[j] · δN · P^U_kl[j]. (7)

Therefore, the uplink power of IoT device kl should satisfy

al,kl[n] · δN · P^U_kl[n] ≤ Ekl[n],
Σ_{j=1}^{N} al,kl[j] · δN · P^U_kl[j] ≤ Ekl. (8)

Accordingly, the received SINR γkl[n] of UAV l connected to IoT device kl at time slot n is given by

γkl[n] = P^U_kl[n] gl,kl[n] / (Ikl[n] + σ²), (9)

where σ² = Bkl N0, and N0 represents the power spectral density of the additive white Gaussian noise (AWGN) at the receivers. Moreover, Ikl[n] = Σ_{j=1, j≠l}^{L} P^U_kj[n] gl,kj[n] is the interference received by UAV l from cluster j, j ∈ L, j ≠ l.

Then the instantaneous throughput Rkl[n] of IoT device kl can be represented as

Rkl[n] = Bkl log2(1 + P^U_kl[n] gl,kl[n] / (Ikl[n] + σ²)). (10)

Therefore, the average throughput Rkl of IoT device kl over the flight cycle T can be denoted by

Rkl = (1/T) Σ_{n=1}^{N} al,kl[n] Rkl[n]
    = (1/T) Σ_{n=1}^{N} al,kl[n] Bkl log2(1 + P^U_kl[n] gl,kl[n] / (Ikl[n] + σ²)). (11)

B. PROBLEM FORMULATION
Let A = {al[0], al,kl[n], ∀l, kl, n}, P^U = {P^U_kl[n], ∀kl, n}, and Q = {ql[n], ∀l, n}. In this work, our optimization objective is to maximize the minimum average throughput of a multi-UAV enabled WPCN by jointly optimizing the IoT devices' association {al[0], al,kl[n]}, the uplink power {P^U_kl[n]}, and the UAVs' 3D trajectory {ql[n]}. Therefore, the throughput optimization problem can be mathematically formulated as follows:

(P1) max_{Rmin, A, P^U, Q} Rmin
s.t. Kl ∩ Kl′ = ∅, l ∈ L, (12.1)
     hmin ≤ hl[n] ≤ hmax, ∀l ∈ L, (12.2)
     al[0], al,kl[n] ∈ {0, 1}, ∀l ∈ L, kl ∈ K, n ∈ N, (12.3)
     Σ_{kl∈Kl} al,kl[n] ≤ 1, ∀l ∈ L, n ∈ N, (12.4)
     Σ_{j=1}^{N} al,kl[j] · δN · P^U_kl[j] ≤ Ekl, (12.5)
     Rkl ≥ Rmin, ∀kl ∈ K, (12.6)
     ‖ql[n] − ql[n − 1]‖ ≤ Vmax · δN. (12.7)

Constraint (12.1) indicates that each device belongs to a non-overlapping cluster and associates with a specific UAV. Constraint (12.2) indicates the flight altitude range of the UAVs. Constraints (12.3) and (12.4) represent the time resource allocation restrictions. Constraint (12.5) qualifies the peak uplink power constraint of each IoT device. Constraint (12.6) indicates the minimum rate requirement of each IoT device. Constraint (12.7) represents the maximum speed constraint of the UAVs.

It can be observed that two factors make problem (P1) difficult to solve. First, (12.3) and (12.4) impose binary constraints on al[0] and al,kl[n]. Besides, constraints (12.5) and (12.6) contain complicated energy and rate functions with respect to the coupled variables al[0], al,kl[n], P^U_kl[n] and ql[n]. Therefore, problem (P1) is mixed-integer non-convex, and a feasible solution cannot be obtained by general methods. As a result, we come up with a DQL-based strategy for the purpose of optimizing the minimum throughput.

III. JOINT MULTIPLE UAVS' 3D TRAJECTORY PLANNING AND TIME RESOURCE ASSIGNMENT ALGORITHM
Since the throughput optimization problem is non-convex and complicated to solve directly, we bring in the DQL algorithm in this section to solve the minimum throughput maximization problem. In particular, we first introduce the background of DQL and then describe the proposed throughput optimization strategy in detail.

A. DEEP Q LEARNING
An RL problem can be described as a Markov Decision Process (MDP), which is defined by a 4-tuple ⟨S, A, P, R⟩. In particular, S = {s1, s2, · · · , sm} represents the state space and A = {a1, a2, · · · , am} denotes the action space. R denotes the reward function; in particular, R(s, a) represents the reward for executing action a at state s. P is the transition probability matrix. The optimal policy is obtained through the interaction between the RL agent and the environment. To be specific, an RL agent observes the environment and then obtains the current state st ∈ S. The next state st+1 is obtained after choosing and executing an action at ∈ A. At the end of a cycle, the agent receives a reward rt according to the environment.

RL is designed to find an optimal policy π(s) that maximizes the expected cumulative reward. The cumulative reward at the t-th step obtained by executing action a at state s under policy π can be represented by

Q^π(s, a) = E[ Σ_{k=0}^{∞} γ^k rt+k | st, at, π ], (13)

where γ ∈ [0, 1] is the discount factor.
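As a small numerical illustration of the return in (13), the sketch below folds a reward sequence into the discounted sum Σ_k γ^k rt+k; the reward values are made-up examples, not from the paper.

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward sum_k gamma^k * r_{t+k}, as in (13)."""
    g = 0.0
    for r in reversed(rewards):  # fold from the tail: g = r + gamma * g
        g = r + gamma * g
    return g

# Example: four steps of reward with discount factor gamma = 0.9:
# 1 + 0.9*(-1) + 0.81*1 + 0.729*1 = 1.639
assert abs(discounted_return([1, -1, 1, 1], 0.9) - 1.639) < 1e-9
```

Backward folding avoids recomputing powers of γ; the same recursion underlies the bootstrapped target used later in the DQL training step.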


As a kind of model-free RL, Q learning (QL) evaluates the value of action a executed at state s without building an environment transition model. The Q value determined by the state-action pair is stored in a look-up table and is updated as follows:

Q(s, a) = (1 − α)Q(s, a) + α(R(s, a) + λ max_{a′} Q(s′, a′)), (14)

where α ∈ (0, 1] is the learning rate, and s′, a′ are respectively the next state and next action. It has been proved that Q learning converges to Q* in the case where the state and action spaces are discrete and finite.

However, since the UAVs fly flexibly in a 3D area, our model has a large and continuous state space. The storage and search of the Q-table become impractical and the convergence rate might become slow. Function approximation has been adopted in several studies to tackle this problem [29]. As a kind of non-linear function approximation, the deep neural network (DNN) has been widely applied for large-scale reinforcement learning [23], [30], i.e., Q(s, a) ≈ Q(s, a, w), where w represents the weight parameters of the neural network. In DQL, the Q value function is approximated by a DNN, and the DNN is trained by optimizing the loss function

L(w) = E[(yt − Q(s, a, w))²], (15)

where yt is the target Q value which is set as a label and can be denoted by

yt = r + γ max_{a′} Q(s′, a′, w). (16)

As a combination of RL and DL, DRL might be unstable for two reasons. First, the training samples in RL are correlated and cannot meet the independent and identically distributed assumption of DL. Besides, a slight update of the Q parameters may cause a huge oscillation in the strategy, which brings a variation in the distribution of the training samples. Experience replay and the target network mechanism were developed to solve these issues [31]. In particular, a replay buffer is applied to store the state transition samples (s, a, r, s′) generated at each episode, which can be randomly sampled for learning. Due to the randomness of the samples, the correlation between these data can be eliminated. In addition, the target network owns the same structure as the online network but different weight parameters. In particular, the parameters in the target network remain unchanged and are duplicated from the online network periodically; thus the stability of the target can be ensured.

Since multiple agents interact simultaneously with the environment and potentially with each other, it is more complex to learn in a multi-agent environment than in the single-agent case. In [32], the authors first introduce an independent Q learning (IQL) strategy for the multi-agent scenario. Based on this work, the authors in [33] combine DQN and IQL and discuss phenomena such as cooperation, communication and competition in reinforced multi-agent systems. In [33], each agent learns its action strategy with an independent DQN and executes its action separately, and the other agents are seen as part of the environment. The Markov property becomes invalid in this approach and the environment is not stationary. Despite these disadvantages, IQL achieves great results with low complexity.

B. PROPOSED DQL-BASED SOLUTION
In our proposed multi-agent DQL-based algorithm, the IoT devices are uniformly distributed in an area, which can be partitioned into L clusters by K-means [34]. Each agent stands for a UAV, which owns an independent DQN and performs actions respectively. Meanwhile, the agents share the state with the others and regard the others as a part of the environment. After each epoch, the agents get a reward or penalty based on the shared environment.

Let us illustrate the definition of the state space, action space and reward function of the agents in our algorithm.
• The state space of each agent is made up of three parts:
  1) ql[n]: the location of UAV l;
  2) {al,kl[n]}: the number of times that each device communicates with UAV l;
  3) {Rkl[n]}: the average throughput of the devices in the l-th cluster.
• The action space contains 27 elements defined by (x, y, z), with (x, y, z) varying from (−1, −1, −1) to (1, 1, 1). To be specific, x = −1 stands for the UAV turning to the left; x = 1 signifies that the UAV flies towards the right; y = −1 implies that the UAV flies backward; y = 1 means that the UAV flies forward; z = −1 represents that the UAV descends; z = 1 means that the UAV rises; (x, y, z) = (0, 0, 0) indicates that the UAV remains still. After flying to the next location, the UAV broadcasts the energy flow or selects the device that owns the best channel condition in its cluster for uplink communication.
• The reward function is defined as follows:
  1) If a UAV flies beyond the border after performing an action, then the UAV receives a penalty of −1 and is relocated at the boundary.
  2) At each time step, if the trajectories of UAV i and UAV j cross, then UAV i and UAV j receive a penalty of −1 and stay at their previous locations.
  3) After each epoch, if the throughput of the device in communication does not increase, which means that the device has communicated with the UAV too many times, its energy is exhausted and thus the UAV only receives interference; in this situation the UAV receives a penalty of −1. If the device's throughput increases, then the UAV gets a reward of 1.
  4) After each epoch, if the minimum average throughput of the devices in a cluster is 0, which means that some devices did not communicate with the UAV in this epoch, then the UAV receives a penalty of −2.


  5) After each epoch, if the minimum average throughput of all devices does not increase, then all the UAVs receive a penalty of −1; if the minimum average throughput increases, all the UAVs receive a reward of 1.

The complete algorithm to solve the minimum throughput optimization problem for the multi-UAV enabled WPCN system with the DQL technique is summarized in Algorithm 1.

Algorithm 1 Proposed 3D Trajectory Design and Time Resource Allocation Solution Based on DQL
1: Initialize the target network and the online network;
2: Initialize the UAVs' locations and the IoT devices' locations;
3: for episode = 1, · · · , M do
4:   for time slot t = 1, · · · , T do
5:     for UAV i = 1, · · · , L do
6:       Choose an action with ε-greedy, while ε increases;
7:       Get UAV i's next location;
8:       if UAV i flies beyond the border then
9:         UAV i stays at the border, and gets a penalty of −1;
10:      end if
11:    end for
12:    for UAV i = 1, · · · , L do
13:      if the trajectories of UAV i and UAV j cross then
14:        UAV i and UAV j stay at the previous location, and get a penalty of −1;
15:      end if
16:      Execute the action, and get the next state;
17:      if the served device's throughput does not increase then
18:        UAV i gets a penalty of −1;
19:      end if
20:    end for
21:    if time slot t = T then
22:      if the minimum throughput of all devices in a cluster equals zero then
23:        The UAV gets a penalty of −2;
24:      end if
25:      if the minimum throughput of the devices does not increase then
26:        All UAVs get a penalty of −1;
27:      end if
28:    end if
29:    Store (s, a, r, s′) into the replay buffer;
30:    Randomly select a minibatch of H samples from the replay buffer;
31:    Train the network, and update the weights;
32:  end for
33: end for

IV. SIMULATION RESULTS
In this section, we present numerical results to validate the effectiveness and superiority of our proposed strategy in the field of minimum throughput maximization.

For our simulations, it is assumed that 25 IoT devices are uniformly distributed within a 50 m × 50 m district. For ease of analysis, the flight period of the UAVs is set as T = 1 s. The transmission power for the UAVs' downlink and the peak power for the IoT devices' uplink transmission are respectively P^D = 40 dBm and P^U_max = −20 dBm. The uplink power of an IoT device is defined by the available energy and the average available time slots. The maximum speed of the UAVs is set as Vmax = 6 m/s, and the height of the UAVs is constrained within [10, 20] m. The energy conversion efficiency of the devices is set to η = 0.1 [35]. Other simulation parameters are presented in Table 1.

TABLE 1. Simulation parameters.

First, we illustrate the convergence property of the proposed joint trajectory design and time resource allocation algorithm in a special case with L = 3 UAVs. In order to observe the results more intuitively, we average the throughput over every 60 periods. As shown in Fig. 2, the minimum throughput converges to a stable value after 400 iterations for the proposed algorithm.

FIGURE 2. Uplink minimum throughput with respect to iteration number.

Afterwards, we investigate the minimum throughput maximization performance of the proposed DQL-based algorithm


under different numbers of UAVs. Moreover, we compare our proposed algorithm with the following strategies [36].
• Static: the UAVs are fixed right above the centroids of their clusters c = (xc, yc), while the height of the UAVs is set as H = 15 m. The IoT devices communicate with their UAV in sequence.
• Circular trajectory: the UAVs fly in a plane at an altitude of 15 m and follow a circular trajectory scheme. In this scenario, the center c = (xc, yc) is set at the centroid of a cluster and the radius is r = min(rc, rv), in which rc = (1/Kl) Σ_{k=1}^{Kl} ‖c − uk‖ and rv = Vmax · T / (2π) respectively indicate the average distance between the centroid and the IoT devices and the maximal radius determined by the speed constraint. The same as in the static scheme, the IoT devices are served by their UAV in sequence.
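The circular-trajectory baseline fixes its radius as r = min(rc, rv). A minimal sketch of that computation follows; the four device coordinates are illustrative, while Vmax = 6 m/s and T = 1 s follow the simulation settings above.

```python
import math

def circular_radius(devices, v_max, T):
    """Baseline circle: centered at the cluster centroid, with radius min(r_c, r_v).

    r_c: average device-to-centroid distance; r_v = v_max * T / (2 * pi),
    the largest circle a UAV can complete within one flight period.
    """
    cx = sum(x for x, _ in devices) / len(devices)
    cy = sum(y for _, y in devices) / len(devices)
    r_c = sum(math.hypot(x - cx, y - cy) for x, y in devices) / len(devices)
    r_v = v_max * T / (2 * math.pi)
    return (cx, cy), min(r_c, r_v)

# Illustrative cluster of four devices; Vmax = 6 m/s and T = 1 s as in the paper.
center, r = circular_radius([(0, 0), (10, 0), (0, 10), (10, 10)], v_max=6.0, T=1.0)
```

With these settings rv = 6/(2π) ≈ 0.95 m, so the speed constraint, not the device spread, determines the baseline radius.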

FIGURE 3. Maximum minimum throughput with respect to the number of UAVs.

FIGURE 4. Trajectories of UAVs optimized by the proposed algorithm for UAV = 3.

FIGURE 5. Trajectories of UAVs optimized by the proposed algorithm for UAV = 5.

The parameters of the constraints are identical to those in the
previous simulation. The number of UAVs varies from 2 to 7.
From Fig. 3, it can be seen that, compared with the static
WPCN, the proposed method achieves better minimum
throughput, which demonstrates that the mobility of UAVs can
improve the communication quality of a WPCN. Furthermore,
for our proposed DQL, the minimum throughput increases when
the number of UAVs grows from 2 to 3, but decreases afterwards.
This is because, as the number of agents increases, the
cooperation between agents becomes more complicated. On the
other hand, as shown by the circular and static schemes, the
throughput no longer increases once the number of UAVs exceeds 5.
This is because, as the number of UAVs increases, the number of
devices in a cluster decreases, so the harvested energy and the
allocated uplink communication time increase, whereas the
distance between UAVs shrinks and thus the co-interference
grows. In the end, the gains and the interference offset each
other. Overall, our proposed DQL-based algorithm provides
better performance in maximizing the minimum throughput of a
UAV-enabled WPCN.

The optimized flight trajectories of multiple UAVs for
UAV = 3 and 5 are shown in Fig. 4 and Fig. 5, respectively.
For ease of observation, the trajectories are plotted in a
2-dimensional coordinate system. The stars represent the IoT
devices, and the triangles represent the centroids of the
clusters. As can be seen in Fig. 4, the UAVs attempt to cover
all the devices by flying around the centroid of their cluster.
Moreover, the UAVs hover close to the devices in their cluster
to improve the channel quality while staying as far away from
each other as possible to reduce the co-interference. In a word,
the optimization algorithm tends to strike a balance between
good channel conditions and the resulting co-interference.
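The cluster centroids that anchor both the baselines and the learned trajectories can be obtained with standard K-means clustering (the paper cites [34] for K-means); the sketch below is a plain Lloyd's iteration with illustrative device positions, not the paper's exact clustering procedure:

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means, initialized with the first k points.

    Illustrates how IoT devices could be grouped into one cluster per
    UAV, each centroid serving as the reference point (e.g. the hover
    or circling center) for that UAV's trajectory.
    """
    centroids = points[:k].copy()
    for _ in range(iters):
        # Assign every device to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned devices.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated groups of hypothetical device positions, one UAV each.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
cents, labs = kmeans(pts, k=2)
```

For well-separated groups like these, the two centroids settle on the two device clusters, giving each UAV a distinct service area.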

9130 VOLUME 8, 2020


J. Tang et al.: Minimum Throughput Maximization for Multi-UAV Enabled WPCN

As shown in Fig. 5, the above rules also apply to the trajectory
when 5 UAVs are deployed. These simulation results show that
the proposed algorithm can plan the trajectories well regardless
of the number of UAVs.

V. CONCLUSION
In this paper, we investigated the minimum throughput
maximization problem for a multi-UAV enabled WPCN, in which
UAVs act as wireless chargers and information receivers to
support ground IoT devices. Our target is to maximize the
minimum throughput while satisfying several constraints,
including the maximum flying speed, the maximum uplink
transmit power, and the time resource allocation. The
formulated joint optimization of the UAVs' 3D trajectories and
the time resource assignment is non-convex and difficult to
solve directly. A multi-agent DQL-based algorithm is therefore
proposed to obtain a feasible solution. Numerical results
illustrate that the proposed strategy surpasses the traditional
strategies in maximizing the minimum throughput of a multi-UAV
enabled WPCN, which confirms the advantage of jointly
optimizing the UAVs' trajectories and the time resource
allocation in a WPCN system.

REFERENCES
[1] G. A. Akpakwu, B. J. Silva, G. P. Hancke, and A. M. Abu-Mahfouz, "A survey on 5G networks for the Internet of Things: Communication technologies and challenges," IEEE Access, vol. 6, pp. 3619–3647, 2018.
[2] Q. Wu, W. Chen, D. W. K. Ng, and R. Schober, "Spectral and energy-efficient wireless powered IoT networks: NOMA or TDMA?" IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6663–6667, Jul. 2018.
[3] M. R. Palattella, M. Dohler, A. Grieco, G. Rizzo, J. Torsner, T. Engel, and L. Ladid, "Internet of Things in the 5G era: Enablers, architecture, and business models," IEEE J. Sel. Areas Commun., vol. 34, no. 3, pp. 510–527, Mar. 2016.
[4] X. Lu, P. Wang, D. Niyato, D. I. Kim, and Z. Han, "Wireless networks with RF energy harvesting: A contemporary survey," IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 757–789, 2nd Quart., 2015.
[5] F. Yang, W. Xu, Z. Zhang, L. Guo, and J. Lin, "Energy efficiency maximization for relay-assisted WPCN: Joint time duration and power allocation," IEEE Access, vol. 6, pp. 78297–78307, 2018.
[6] H. Ju and R. Zhang, "Throughput maximization in wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 13, no. 1, pp. 418–428, Jan. 2014.
[7] L. Liu, R. Zhang, and K.-C. Chua, "Multi-antenna wireless powered communication with energy beamforming," IEEE Trans. Commun., vol. 62, no. 12, pp. 4349–4361, Dec. 2014.
[8] J. Kim, H. Lee, C. Song, T. Oh, and I. Lee, "Sum throughput maximization for multi-user MIMO cognitive wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 16, no. 2, pp. 913–923, Feb. 2017.
[9] B. Lyu, Z. Yang, G. Gui, and Y. Feng, "Wireless powered communication networks assisted by backscatter communication," IEEE Access, vol. 5, pp. 7254–7262, 2017.
[10] B. Lyu, H. Guo, Z. Yang, and G. Gui, "Throughput maximization for hybrid backscatter assisted cognitive wireless powered radio networks," IEEE Internet Things J., vol. 5, no. 3, pp. 2015–2024, Jun. 2018.
[11] S. Bi, Y. Zeng, and R. Zhang, "Wireless powered communication networks: An overview," IEEE Wireless Commun., vol. 23, no. 2, pp. 10–18, Apr. 2016.
[12] Y. Zeng, R. Zhang, and T. J. Lim, "Throughput maximization for UAV-enabled mobile relaying systems," IEEE Trans. Commun., vol. 64, no. 12, pp. 4983–4996, Dec. 2016.
[13] G. Zhang, H. Yan, Y. Zeng, M. Cui, and Y. Liu, "Trajectory optimization and power allocation for multi-hop UAV relaying communications," IEEE Access, vol. 6, pp. 48566–48576, 2018.
[14] S. Cho, K. Lee, B. Kang, K. Koo, and I. Joe, "Weighted harvest-then-transmit: UAV-enabled wireless powered communication networks," IEEE Access, vol. 6, pp. 72212–72224, 2018.
[15] L. Xie, J. Xu, and R. Zhang, "Throughput maximization for UAV-enabled wireless powered communication networks," IEEE Internet Things J., vol. 6, no. 2, pp. 1690–1703, Apr. 2019.
[16] Q. Wu, Y. Zeng, and R. Zhang, "Joint trajectory and communication design for multi-UAV enabled wireless networks," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 2109–2121, Mar. 2018.
[17] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, "3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage," IEEE Wireless Commun. Lett., vol. 6, no. 4, pp. 434–437, Aug. 2017.
[18] H. Huang, S. Guo, G. Gui, Z. Yang, J. Zhang, H. Sari, and F. Adachi, "Deep learning for physical-layer 5G wireless techniques: Opportunities, challenges and solutions," IEEE Wireless Commun. Mag., to be published, doi: 10.1109/MWC.2019.1900027.
[19] H. Huang, Y. Peng, J. Yang, W. Xia, and G. Gui, "Fast beamforming design via deep learning," IEEE Trans. Veh. Technol., to be published, doi: 10.1109/TVT.2019.2949122.
[20] G. Gui, F. Liu, J. Sun, J. Yang, Z. Zhou, and D. Zhao, "Flight delay prediction based on aviation big data and machine learning," IEEE Trans. Veh. Technol., to be published.
[21] G. Gui, H. Huang, Y. Song, and H. Sari, "Deep learning for an effective nonorthogonal multiple access scheme," IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8440–8450, Sep. 2018.
[22] H. Huang, Y. Song, J. Yang, G. Gui, and F. Adachi, "Deep-learning-based millimeter-wave massive MIMO for hybrid precoding," IEEE Trans. Veh. Technol., vol. 68, no. 3, pp. 3027–3032, Mar. 2019.
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, Dec. 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[24] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, "Trajectory design and power control for multi-UAV assisted wireless networks: A machine learning approach," IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 7957–7969, Aug. 2019.
[25] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, "Learning-based energy-efficient data collection by unmanned vehicles in smart cities," IEEE Trans. Ind. Informat., vol. 14, no. 4, pp. 1666–1676, Apr. 2018.
[26] C. H. Liu, Z. Chen, J. Tang, J. Xu, and C. Piao, "Energy-efficient UAV control for effective and fair communication coverage: A deep reinforcement learning approach," IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 2059–2070, Sep. 2018.
[27] A. Al-Hourani, S. Kandeepan, and A. Jamalipour, "Modeling air-to-ground path loss for low altitude platforms in urban environments," in Proc. IEEE Global Commun. Conf., Dec. 2014, pp. 2898–2904.
[28] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Wireless communication using unmanned aerial vehicles (UAVs): Optimal transport theory for hover time optimization," IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 8052–8066, Dec. 2017.
[29] X. Xu, L. Zuo, and Z. Huang, "Reinforcement learning algorithms with function approximation: Recent advances and applications," Inf. Sci., vol. 261, pp. 1–31, Mar. 2014.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[31] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," Auton. Agents Multi-Agent Syst., vol. 33, no. 6, pp. 750–797, Nov. 2019.
[32] M. Tan, "Multi-agent reinforcement learning: Independent vs. cooperative agents," in Proc. 10th Int. Conf. Mach. Learn., 1993, pp. 330–337.
[33] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, "Multiagent cooperation and competition with deep reinforcement learning," PLoS ONE, vol. 12, no. 4, Apr. 2017, Art. no. e0172395.
[34] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, Jun. 2010.


[35] I. Krikidis, S. Timotheou, S. Nikolaou, G. Zheng, D. W. K. Ng, and R. Schober, "Simultaneous wireless information and power transfer in modern communication systems," IEEE Commun. Mag., vol. 52, no. 11, pp. 104–110, Nov. 2014.
[36] J. Park, H. Lee, S. Eom, and I. Lee, "UAV-aided wireless powered communication networks: Trajectory optimization and resource allocation for minimum throughput maximization," IEEE Access, vol. 7, pp. 134978–134991, 2019.

JIE TANG (Senior Member, IEEE) received the B.Eng. degree in information engineering from the South China University of Technology, Guangzhou, China, in 2008, the M.Sc. degree (Hons.) in communication systems and signal processing from the University of Bristol, U.K., in 2009, and the Ph.D. degree from Loughborough University, Leicestershire, U.K., in 2012. He held postdoctoral research positions at the School of Electrical and Electronic Engineering, The University of Manchester, U.K. He is currently an Associate Professor with the School of Electronic and Information Engineering, South China University of Technology, China. His research interests include green communications, NOMA, 5G networks, SWIPT, heterogeneous networks, cognitive radio, and D2D communications.
He was a co-recipient of the Best Paper Awards at IEEE ICNC 2018, CSPS 2018, and IEEE WCSP 2019. He also served as a Track Co-Chair for the IEEE Vehicular Technology Conference (VTC) Spring 2018. He is currently serving as an Editor for IEEE ACCESS, EURASIP Journal on Wireless Communications and Networking, Physical Communications, and Ad Hoc & Sensor Wireless Networks.

JINGRU SONG received the B.Eng. degree from the School of Information Science and Engineering, Shandong University, Jinan, China, in 2018. She is currently pursuing the M.Sc. degree with the School of Electronic and Information Engineering, South China University of Technology, China, under the supervision of Dr. Jie Tang. Her research interests include machine learning, unmanned aerial vehicles, wireless power transmission, simultaneous wireless information and power transfer, and 5G networks.

JUNHUI OU was born in Guangdong, China. He received the B.E. degree in automation and the M.Sc. and Ph.D. degrees in communication engineering from Sun Yat-sen University, Guangdong, in 2012, 2014, and 2018, respectively. He holds a postdoctoral position with the South China University of Technology, Guangdong. His current research interests include antenna design, RF circuit design, wireless power transmission, and simultaneous wireless information and power transmission.

JINGCI LUO received the B.Eng. degree from the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China, in 2017, where she is currently pursuing the M.Sc. degree, under the supervision of Dr. Jie Tang. Her research interests include energy efficiency optimization, machine learning, non-orthogonal multiple access, simultaneous wireless information and power transfer, and 5G networks.

XIUYIN ZHANG (Senior Member, IEEE) received the B.S. degree in communication engineering from the Chongqing University of Posts and Telecommunications, Chongqing, China, in 2001, the M.S. degree in electronic engineering from the South China University of Technology, Guangzhou, China, in 2006, and the Ph.D. degree in electronic engineering from the City University of Hong Kong, Kowloon, Hong Kong, in 2009. From 2001 to 2003, he was with ZTE Corporation, Shenzhen, China. He was a Research Assistant, from July 2006 to June 2007, and a Research Fellow, from September 2009 to February 2010, with the City University of Hong Kong. He is currently a Full Professor and the Vice Dean with the School of Electronic and Information Engineering, South China University of Technology. He also serves as the Deputy Director of the Guangdong Provincial Engineering Research Center of Antennas and RF Techniques and the Vice Director of the Engineering Research Center for Short-Distance Wireless Communications and Network, Ministry of Education. He has authored or coauthored more than 100 internationally refereed journal papers, including 55 IEEE Transactions papers, as well as around 60 conference papers. His research interests include microwave circuits and sub-systems, antennas and arrays, and SWIPT.
Dr. Zhang is a Fellow of the Institution of Engineering and Technology. He was a recipient of the National Science Foundation for Distinguished Young Scholars of China, the Young Scholar of the Chang-Jiang Scholars Program of the Chinese Ministry of Education, and the Top-notch Young Professionals of National Program of China. He was also a recipient of the Scientific and Technological Award (Hons.) of Guangdong Province. He was the supervisor of several conference best paper award winners. He has served as a Technical Program Committee (TPC) Chair/member and session organizer/Chair for a number of conferences. He is an Associate Editor for IEEE ACCESS.

KAI-KIT WONG (Fellow, IEEE) received the B.Eng., M.Phil., and Ph.D. degrees in electrical and electronic engineering from The Hong Kong University of Science and Technology, Hong Kong, in 1996, 1998, and 2001, respectively. After graduation, he took up academic and research positions at the University of Hong Kong, Lucent Technologies, Bell-Labs, Holmdel, the Smart Antennas Research Group of Stanford University, and the University of Hull, U.K. He is currently the Chair in wireless communications with the Department of Electronic and Electrical Engineering, University College London, U.K. His current research interests include 5G and beyond mobile communications, including topics such as massive MIMO, full-duplex communications, millimeter-wave communications, edge caching and fog networking, physical layer security, wireless power transfer and mobile computing, V2X communications, and cognitive radios. There are also a few other unconventional research topics that he has set his heart on, including, for example, fluid antenna communications systems and team optimization.
Dr. Wong is a Fellow of IET and is also on the editorial board of several international journals. He was a co-recipient of the 2013 IEEE Signal Processing Letters Best Paper Award, the 2000 IEEE VTS Japan Chapter Award at the IEEE Vehicular Technology Conference in Japan, in 2000, and a few other international best paper awards. He served as an Associate Editor for the IEEE SIGNAL PROCESSING LETTERS, from 2009 to 2012, and an Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, from 2005 to 2011. He was also a Guest Editor for the IEEE JSAC SI on Virtual MIMO, in 2013, and is currently a Guest Editor for the IEEE JSAC SI on physical layer security for 5G. He has been a Senior Editor for the IEEE COMMUNICATIONS LETTERS, since 2012, and for the IEEE WIRELESS COMMUNICATIONS LETTERS, since 2016. He is also an Area Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS.
