Deep Reinforcement Learning for Online Computation Offloading in Wireless Powered Mobile-Edge Computing Networks
Abstract—Wireless powered mobile-edge computing (MEC) has recently emerged as a promising paradigm to enhance the data
processing capability of low-power networks, such as wireless sensor networks and internet of things (IoT). In this paper, we consider a
wireless powered MEC network that adopts a binary offloading policy, so that each computation task of wireless devices (WDs) is
either executed locally or fully offloaded to an MEC server. Our goal is to acquire an online algorithm that optimally adapts task
offloading decisions and wireless resource allocations to the time-varying wireless channel conditions. This requires quickly solving
hard combinatorial optimization problems within the channel coherence time, which is hardly achievable with conventional numerical
optimization methods. To tackle this problem, we propose a Deep Reinforcement learning-based Online Offloading (DROO) framework
that implements a deep neural network as a scalable solution, which learns binary offloading decisions from experience. This
eliminates the need for solving combinatorial optimization problems and thus greatly reduces the computational complexity, especially
in large-size networks. To further reduce the complexity, we propose an adaptive procedure that automatically adjusts the parameters
of the DROO algorithm on the fly. Numerical results show that the proposed algorithm can achieve near-optimal performance while
significantly decreasing the computation time by more than an order of magnitude compared with existing optimization methods. For
example, the CPU execution latency of DROO is less than 0.1 second in a 30-user network, making real-time and optimal offloading
truly viable even in a fast fading environment.
Index Terms—Mobile-edge computing, wireless power transfer, reinforcement learning, resource allocation.
1 INTRODUCTION
online computation offloading policy based on DQN under random task arrivals. However, both DQN-based works take discretized channel gains as the input state vector, and thus suffer from the curse of dimensionality and slow convergence when high channel quantization accuracy is required. Besides, because of its exhaustive search nature in selecting the action in each iteration, DQN is not suitable for handling problems with high-dimensional action spaces [23]. In our problem, there are a total of 2^N offloading decisions (actions) to choose from, so DQN is evidently inapplicable even for a small N, e.g., N = 20.

3 PRELIMINARY

3.1 System Model

As shown in Fig. 1, we consider a wireless powered MEC network consisting of an AP and N fixed WDs, denoted as a set N = {1, 2, . . . , N}, where each device has a single antenna. In practice, this may correspond to a static sensor network or a low-power IoT system. The AP has a stable power supply and can broadcast RF energy to the WDs. Each WD has a rechargeable battery that stores the harvested energy to power the operations of the device. Suppose that the AP has higher computational capability than the WDs, so that the WDs may offload their computing tasks to the AP. Specifically, we suppose that WPT and communication (computation offloading) are performed in the same frequency band. Accordingly, a time-division duplexing (TDD) circuit is implemented at each device to avoid mutual interference between WPT and communication.

The system time is divided into consecutive time frames of equal length T, which is set smaller than the channel coherence time, e.g., on the scale of several seconds [24]–[26] in a static IoT environment. At each tagged time, both the amount of energy that a WD harvests from the AP and the communication speed between them are related to the wireless channel gain. Let h_i denote the wireless channel gain between the AP and the i-th WD at a tagged time frame. The channel is assumed to be reciprocal in the downlink and uplink,1 and to remain unchanged within each time frame, but it may vary across different frames. At the beginning of a time frame, aT amount of time is used for WPT, a ∈ [0, 1], where the AP broadcasts RF energy for the WDs to harvest. Specifically, the i-th WD harvests E_i = µP h_i aT amount of energy, where µ ∈ (0, 1) denotes the energy harvesting efficiency and P denotes the AP transmit power [1]. With the harvested energy, each WD needs to accomplish a prioritized computing task before the end of the time frame. A unique weight w_i is assigned to the i-th WD: the greater the weight w_i, the more computation rate is allocated to the i-th WD. In this paper, we consider a binary offloading policy, such that the task is either computed locally at the WD (such as WD2 in Fig. 1) or offloaded to the AP (such as WD1 and WD3 in Fig. 1). Let x_i ∈ {0, 1} be an indicator variable, where x_i = 1 denotes that the i-th user's computation task is offloaded to the AP, and x_i = 0 denotes that the task is computed locally.

1. The channel reciprocity assumption is made to simplify the notation of channel state. However, the results of this paper can be easily extended to the case with unequal uplink and downlink channels.

3.2 Local Computing Mode

A WD in the local computing mode can harvest energy and compute its task simultaneously [6]. Let f_i denote the processor's computing speed (cycles per second) and 0 ≤ t_i ≤ T denote the computation time. Then, the number of bits processed by the WD is f_i t_i / φ, where φ > 0 denotes the number of cycles needed to process one bit of task data. Meanwhile, the energy consumption of the WD due to the computing is constrained by k_i f_i^3 t_i ≤ E_i, where k_i denotes the computation energy efficiency coefficient [13]. It can be shown that, to process the maximum amount of data within T under the energy constraint, a WD should exhaust the harvested energy and compute throughout the time frame, i.e., t_i^* = T and accordingly f_i^* = ( E_i / (k_i T) )^{1/3}. Thus, the local computation rate (in bits per second) is

    r_{L,i}^*(a) = f_i^* t_i^* / (φ T) = η_1 (h_i / k_i)^{1/3} a^{1/3},   (1)

where η_1 ≜ (µP)^{1/3} / φ is a fixed parameter.

3.3 Edge Computing Mode

Due to the TDD constraint, a WD in the offloading mode can only offload its task to the AP after harvesting energy. We denote τ_i T as the offloading time of the i-th WD, τ_i ∈ [0, 1]. Here, we assume that the computing speed and the transmit power of the AP are much larger than those of the size- and energy-constrained WDs, e.g., by more than three orders of magnitude [6], [9]. Besides, the computation feedback to be downloaded to the WD is much shorter than the data offloaded to the edge server. Accordingly, as shown in Fig. 1, we safely neglect the time spent on task computation and downloading by the AP, such that each time frame is only occupied by WPT and task offloading, i.e.,

    Σ_{i=1}^{N} τ_i + a ≤ 1.   (2)

To maximize the computation rate, an offloading WD exhausts its harvested energy on task offloading, i.e., P_i^* = E_i / (τ_i T). Accordingly, the computation rate equals its data offloading capacity, i.e.,

    r_{O,i}^*(a, τ_i) = (B τ_i / v_u) log_2( 1 + µP a h_i^2 / (τ_i N_0) ),   (3)

where B denotes the communication bandwidth and N_0 denotes the receiver noise power.

3.4 Problem Formulation

Among all the system parameters in (1) and (3), we assume that only the wireless channel gains h = {h_i | i ∈ N} are time-varying in the considered period, while the others (e.g., the w_i's and k_i's) are fixed parameters. Accordingly, the weighted sum computation rate of the wireless powered MEC network in a tagged time frame is denoted as

    Q(h, x, τ, a) ≜ Σ_{i=1}^{N} w_i ( (1 − x_i) r_{L,i}^*(a) + x_i r_{O,i}^*(a, τ_i) ),

where x = {x_i | i ∈ N} and τ = {τ_i | i ∈ N}.
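To make the rate expressions concrete, the following is a minimal numpy sketch of the local rate (1), the offloading rate (3), and the weighted sum rate Q(h, x, τ, a). The parameter values are placeholders taken from the simulation setup in Section 5, and the function names are ours, not part of the paper.

```python
import numpy as np

# Placeholder system parameters (values from the simulation setup in Section 5).
mu, P = 0.51, 3.0            # energy harvesting efficiency and AP transmit power (W)
phi, k_i = 100, 1e-26        # cycles per bit and computation energy efficiency coefficient
B, vu, N0 = 2e6, 1.1, 1e-10  # bandwidth (Hz), vu, and receiver noise power (W)

eta1 = (mu * P) ** (1 / 3) / phi          # fixed parameter in Eq. (1)

def local_rate(h_i, a):
    """Local computation rate r*_{L,i}(a) of Eq. (1), in bits per second."""
    return eta1 * (h_i / k_i) ** (1 / 3) * a ** (1 / 3)

def offload_rate(h_i, a, tau_i):
    """Offloading rate r*_{O,i}(a, tau_i) of Eq. (3), in bits per second."""
    return B * tau_i / vu * np.log2(1 + mu * P * a * h_i ** 2 / (tau_i * N0))

def weighted_sum_rate(h, w, x, tau, a):
    """Weighted sum computation rate Q(h, x, tau, a) of Section 3.4."""
    total = 0.0
    for h_i, w_i, x_i, t_i in zip(h, w, x, tau):
        rate = offload_rate(h_i, a, t_i) if x_i == 1 else local_rate(h_i, a)
        total += w_i * rate
    return total
```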
TABLE 1
Notations used throughout the paper

Notation   Description
N          The number of WDs
T          The length of a time frame
i          Index of the i-th WD
h_i        The wireless channel gain between the i-th WD and the AP
a          The fraction of time that the AP broadcasts RF energy for the WDs to harvest
E_i        The amount of energy harvested by the i-th WD
P          The AP transmit power when broadcasting RF energy
µ          The energy harvesting efficiency
w_i        The weight assigned to the i-th WD
x_i        An offloading indicator for the i-th WD
f_i        The processor's computing speed of the i-th WD
φ          The number of cycles needed to process one bit of task data
t_i        The computation time of the i-th WD
k_i        The computation energy efficiency coefficient
τ_i        The fraction of time allocated to the i-th WD for task offloading
B          The communication bandwidth
N_0        The receiver noise power
h          The vector representation of wireless channel gains {h_i | i ∈ N}
x          The vector representation of offloading indicators {x_i | i ∈ N}
τ          The vector representation of {τ_i | i ∈ N}
Q(·)       The weighted sum computation rate function
π          Offloading policy function
θ          The parameters of the DNN
x̂_t        Relaxed computation offloading action
K          The number of quantized binary offloading actions
g_K        The quantization function
L(·)       The training loss function of the DNN
δ          The training interval of the DNN
∆          The updating interval for K

Fig. 2. The two-level optimization structure of solving (P1): computation rate maximization (the MIP problem (P1) over (x, τ, a)) is split into an offloading decision step (deep reinforcement learning produces x) and a resource allocation step (solving the convex problem (P2) over (τ, a)).

For each time frame with channel realization h, we are interested in maximizing the weighted sum computation rate:

    (P1):  Q^*(h) = maximize_{x, τ, a}  Q(h, x, τ, a)                  (4a)
           subject to  Σ_{i=1}^{N} τ_i + a ≤ 1,                        (4b)
                       a ≥ 0, τ_i ≥ 0, ∀i ∈ N,                         (4c)
                       x_i ∈ {0, 1}.                                   (4d)

We can easily infer that τ_i = 0 if x_i = 0, i.e., when the i-th WD is in the local computing mode. Problem (P1) is a mixed integer programming non-convex problem, which is hard to solve. However, once x is given, (P1) reduces to the following convex problem:

    (P2):  Q^*(h, x) = maximize_{τ, a}  Q(h, x, τ, a)
           subject to  Σ_{i=1}^{N} τ_i + a ≤ 1,
                       a ≥ 0, τ_i ≥ 0, ∀i ∈ N.
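As a quick way to evaluate Q^*(h, x) for a given offloading decision, the sketch below solves the convex problem (P2) numerically with a generic SLSQP solver. It is only an illustrative stand-in: the paper instead uses the O(N) bisection search over the dual variable from [7], and the default parameter values are taken from the simulation setup in Section 5.

```python
import numpy as np
from scipy.optimize import minimize

def solve_p2(h, w, x, mu=0.51, P=3.0, phi=100, k_i=1e-26, B=2e6, vu=1.1, N0=1e-10):
    """Return (Q*(h, x), a*, tau*) of problem (P2) for a given binary decision x."""
    h, w, x = map(np.asarray, (h, w, x))
    eta1 = (mu * P) ** (1 / 3) / phi
    off = np.flatnonzero(x == 1)          # offloading WDs
    loc = np.flatnonzero(x == 0)          # local-computing WDs

    def neg_q(z):                         # decision vector z = [a, tau_i for i in off]
        a, tau = z[0], z[1:]
        q_loc = np.sum(w[loc] * eta1 * (h[loc] / k_i) ** (1 / 3) * a ** (1 / 3))
        q_off = np.sum(w[off] * B * tau / vu *
                       np.log2(1 + mu * P * a * h[off] ** 2 / (tau * N0)))
        return -(q_loc + q_off)

    if off.size == 0:                     # all WDs compute locally: spend the whole frame on WPT
        return -neg_q(np.array([1.0])), 1.0, np.zeros(0)

    z0 = np.full(off.size + 1, 1.0 / (off.size + 2))      # strictly feasible starting point
    res = minimize(neg_q, z0, method="SLSQP",
                   bounds=[(1e-6, 1.0)] * (off.size + 1),
                   constraints=[{"type": "ineq", "fun": lambda z: 1.0 - z.sum()}])
    return -res.fun, res.x[0], res.x[1:]
```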
Accordingly, problem (P1) can be decomposed into two sub-problems, namely, offloading decision and resource allocation (P2), as shown in Fig. 2:

• Offloading Decision: One needs to search among the 2^N possible offloading decisions to find an optimal or a satisfying sub-optimal offloading decision x. For instance, meta-heuristic search algorithms are proposed in [7] and [12] to optimize the offloading decisions. However, due to the exponentially large search space, it takes a long time for such algorithms to converge.

• Resource Allocation: The optimal time allocation {a^*, τ^*} of the convex problem (P2) can be efficiently solved, e.g., using a one-dimensional bi-section search over the dual variable associated with the time allocation constraint in O(N) complexity [7].

The major difficulty of solving (P1) lies in the offloading decision problem. Traditional optimization algorithms require iteratively adjusting the offloading decisions towards the optimum [11], which is fundamentally infeasible for real-time system optimization under fast fading channels. To tackle the complexity issue, we propose a novel deep reinforcement learning-based online offloading (DROO) algorithm that can achieve a millisecond order of computational time in solving the offloading decision problem.

Before leaving this section, it is worth mentioning the advantages of applying deep reinforcement learning over supervised learning-based deep neural network (DNN) approaches (such as in [27] and [28]) in dynamic wireless applications. Other than the fact that deep reinforcement learning does not need manually labeled training samples (e.g., the (h, x) pairs in this paper) as a supervised DNN does, it is much more robust to changes of the user channel distributions. For instance, a supervised DNN needs to be completely retrained once some WDs change their locations significantly or are suddenly turned off. In contrast, the adopted deep reinforcement learning method can automatically update its offloading policy upon such channel distribution changes without manual involvement. The important notations used throughout this paper are summarized in Table 1.

4 THE DROO ALGORITHM

We aim to devise an offloading policy function π that quickly generates an optimal offloading action x^* ∈ {0, 1}^N of (P1) once the channel realization h ∈ R_{>0}^N is revealed at the beginning of each time frame. The policy is denoted as

    π : h ↦ x^*.   (5)

The proposed DROO algorithm gradually learns such a policy function π from experience.
Fig. 3. Schematics of the DROO algorithm: in the t-th time frame, the DNN maps the channel h_t to an offloading action x_t; the convex problem (P2) is solved to output {x_t, a_t, τ_t} for that frame; and the pair (h_t, x_t) is stored in a replay memory, from which training samples are drawn to train and update the DNN.
4.1 Algorithm Overview

The structure of the DROO algorithm is illustrated in Fig. 3. It is composed of two alternating stages: offloading action generation and offloading policy update. The generation of the offloading action relies on the use of a DNN, which is characterized by its embedded parameters θ, e.g., the weights that connect the hidden neurons. In the t-th time frame, the DNN takes the channel gain h_t as the input, and outputs a relaxed offloading action x̂_t (each entry is relaxed to a continuous value between 0 and 1) based on its current offloading policy π_{θ_t}, parameterized by θ_t. The relaxed action is then quantized into K binary offloading actions, among which one best action x_t^* is selected based on the achievable computation rate as in (P2). The corresponding {x_t^*, a_t^*, τ_t^*} is output as the solution for h_t, which guarantees that all the physical constraints listed in (4b)-(4d) are satisfied. The network takes the offloading action x_t^*, receives a reward Q^*(h_t, x_t^*), and adds the newly obtained state-action pair (h_t, x_t^*) to the replay memory.

Subsequently, in the policy update stage of the t-th time frame, a batch of training samples is drawn from the memory to train the DNN, which accordingly updates its parameters from θ_t to θ_{t+1} (and equivalently the offloading policy to π_{θ_{t+1}}). The new offloading policy π_{θ_{t+1}} is used in the next time frame to generate the offloading decision x_{t+1}^* according to the newly observed channel h_{t+1}. Such iterations repeat thereafter as new channel realizations are observed, and the policy π_{θ_t} of the DNN is gradually improved. The two stages are detailed in the following subsections.

4.2 Offloading Action Generation

Suppose that we observe the channel gain realization h_t in the t-th time frame, where t = 1, 2, · · ·. The parameters of the DNN, θ_t, are randomly initialized following a zero-mean normal distribution when t = 1. The DNN first outputs a relaxed computation offloading action x̂_t, represented by a parameterized function x̂_t = f_{θ_t}(h_t), where

    x̂_t = {x̂_{t,i} | x̂_{t,i} ∈ [0, 1], i = 1, · · ·, N}   (6)

and x̂_{t,i} denotes the i-th entry of x̂_t. The well-known universal approximation theorem states that one hidden layer with enough hidden neurons suffices to approximate any continuous mapping f if a proper activation function is applied at the neurons, e.g., sigmoid, ReLU, and tanh functions [29]. Here, we use ReLU as the activation function in the hidden layers, where the output y and input v of a neuron are related by y = max{v, 0}. In the output layer, we use a sigmoid activation function, i.e., y = 1/(1 + e^{−v}), such that the relaxed offloading action satisfies x̂_{t,i} ∈ (0, 1).
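A minimal tf.keras sketch of such a DNN is given below. It is an assumption-laden stand-in for the authors' TensorFlow 1.0 implementation: the hidden-layer sizes (120 and 80 neurons) and the Adam learning rate follow the simulation setup reported in Section 5, and the binary cross-entropy loss anticipates the training objective of Section 4.3.

```python
import tensorflow as tf

N = 10  # number of WDs

# Fully connected DNN: ReLU hidden layers and a sigmoid output layer, so that
# every entry of the relaxed action x_hat lies in (0, 1).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(120, activation="relu", input_shape=(N,)),
    tf.keras.layers.Dense(80, activation="relu"),
    tf.keras.layers.Dense(N, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy")

# Example forward pass for a channel realization h_t (a length-N numpy array):
# x_hat = model.predict(h_t.reshape(1, -1), verbose=0)[0]
```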
Then, we quantize x̂_t to obtain K binary offloading actions, where K is a design parameter. The quantization function, g_K, is defined as

    g_K : x̂_t ↦ {x_k | x_k ∈ {0, 1}^N, k = 1, · · ·, K}.   (7)

In general, K can be any integer within [1, 2^N] (N is the number of WDs), where a larger K results in better solution quality and higher computational complexity, and vice versa. To balance performance and complexity, we propose an order-preserving quantization method, where the value of K can be set from 1 to (N + 1). The basic idea is to preserve the ordering during quantization. That is, for each quantized action x_k, x_{k,i} ≥ x_{k,j} should hold if x̂_{t,i} ≥ x̂_{t,j}, for all i, j ∈ {1, · · ·, N}. Specifically, for a given 1 ≤ K ≤ N + 1, the set of K quantized actions {x_k} is generated from the relaxed action x̂_t as follows:

1) The first binary offloading decision x_1 is obtained as

    x_{1,i} = 1 if x̂_{t,i} > 0.5, and x_{1,i} = 0 if x̂_{t,i} ≤ 0.5,   (8)

for i = 1, · · ·, N.

2) To generate the remaining K − 1 actions, we first order the entries of x̂_t with respect to their distances to 0.5, denoted by |x̂_{t,(1)} − 0.5| ≤ |x̂_{t,(2)} − 0.5| ≤ · · · ≤ |x̂_{t,(i)} − 0.5| ≤ · · · ≤ |x̂_{t,(N)} − 0.5|, where x̂_{t,(i)} is the i-th order statistic of x̂_t. Then, the k-th offloading decision x_k, where k = 2, · · ·, K, is calculated based on x̂_{t,(k−1)} as

    x_{k,i} = 1  if x̂_{t,i} > x̂_{t,(k−1)},
    x_{k,i} = 1  if x̂_{t,i} = x̂_{t,(k−1)} and x̂_{t,(k−1)} ≤ 0.5,
    x_{k,i} = 0  if x̂_{t,i} = x̂_{t,(k−1)} and x̂_{t,(k−1)} > 0.5,
    x_{k,i} = 0  if x̂_{t,i} < x̂_{t,(k−1)},                           (9)

for i = 1, · · ·, N.

Because there are in total N order statistics of x̂_t, each of which can be used to generate one quantized action from (9), the above order-preserving quantization method in (8) and (9) generates at most (N + 1) quantized actions, i.e., K ≤ N + 1. In general, setting a large K (e.g., K = N) leads to better computation rate performance at the cost of higher complexity. However, as we will show later in Section 4.4, it is not only inefficient but also unnecessary to generate a large number of quantized actions in each time frame. Instead, setting a small K (even close to 1) suffices to achieve good computation rate performance and low complexity after a sufficiently long training period.

We use an example to illustrate the above order-preserving quantization method. Suppose that x̂_t = [0.2, 0.4, 0.7, 0.9] and K = 4. The corresponding order statistics of x̂_t are x̂_{t,(1)} = 0.4, x̂_{t,(2)} = 0.7, x̂_{t,(3)} = 0.2, and x̂_{t,(4)} = 0.9. Therefore, the 4 offloading actions generated from the above quantization method are x_1 = [0, 0, 1, 1], x_2 = [0, 1, 1, 1], x_3 = [0, 0, 0, 1], and x_4 = [1, 1, 1, 1]. In comparison, when the conventional KNN method is used, the obtained actions are x_1 = [0, 0, 1, 1], x_2 = [0, 1, 1, 1], x_3 = [0, 0, 0, 1], and x_4 = [0, 1, 0, 1].

Compared to the KNN method, where the quantized solutions are closely placed around x̂_t, the offloading actions produced by the order-preserving quantization method are separated by a larger distance. Intuitively, this creates higher diversity in the candidate action set, thus increasing the chance of finding a local maximum around x̂_t. In Section 5.1, we show that the proposed order-preserving quantization method achieves better convergence performance than the KNN method.
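A numpy sketch of the order-preserving quantizer g_K in (8)-(9) is given below; the function name is ours. Running it on the example above reproduces the four listed actions.

```python
import numpy as np

def order_preserving_quantize(x_hat, K):
    """Generate K binary actions from the relaxed action x_hat via (8)-(9).
    Requires 1 <= K <= N + 1, where N = len(x_hat)."""
    x_hat = np.asarray(x_hat, dtype=float)
    actions = [(x_hat > 0.5).astype(int)]            # first action, Eq. (8)
    order = np.argsort(np.abs(x_hat - 0.5))          # entries ordered by distance to 0.5
    for k in range(1, K):                            # remaining K - 1 actions, Eq. (9)
        ref = x_hat[order[k - 1]]                    # the (k-1)-th order statistic
        actions.append((x_hat >= ref).astype(int) if ref <= 0.5
                       else (x_hat > ref).astype(int))
    return np.array(actions)

print(order_preserving_quantize([0.2, 0.4, 0.7, 0.9], K=4))
# [[0 0 1 1]
#  [0 1 1 1]
#  [0 0 0 1]
#  [1 1 1 1]]
```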
Recall that each candidate action x_k can achieve the computation rate Q^*(h_t, x_k) by solving (P2). Therefore, the best offloading action x_t^* at the t-th time frame is chosen as

    x_t^* = arg max_{x_i ∈ {x_k}} Q^*(h_t, x_i).   (10)

Note that the K evaluations of Q^*(h_t, x_k) can be processed in parallel to speed up the computation of (10). Then, the network outputs the offloading action x_t^* along with its corresponding optimal resource allocation (τ_t^*, a_t^*).

4.3 Offloading Policy Update

The offloading solution obtained in (10) is used to update the offloading policy of the DNN. Specifically, we maintain an initially empty memory of limited capacity. At the t-th time frame, a new training data sample (h_t, x_t^*) is added to the memory. When the memory is full, the newly generated data sample replaces the oldest one.

We use the experience replay technique [15], [30] to train the DNN using the stored data samples. In the t-th time frame, we randomly select a batch of training data samples {(h_τ, x_τ^*) | τ ∈ T_t} from the memory, characterized by a set of time indices T_t. The parameters θ_t of the DNN are updated by applying the Adam algorithm [31] to reduce the averaged cross-entropy loss

    L(θ_t) = − (1/|T_t|) Σ_{τ ∈ T_t} [ (x_τ^*)^⊺ log f_{θ_t}(h_τ) + (1 − x_τ^*)^⊺ log(1 − f_{θ_t}(h_τ)) ],

where |T_t| denotes the size of T_t, the superscript ⊺ denotes the transpose operator, and the log function denotes the element-wise logarithm of a vector. The detailed update procedure of the Adam algorithm is omitted here for brevity. In practice, we train the DNN every δ time frames after collecting a sufficient number of new data samples. The experience replay technique used in our framework has several advantages. First, the batch update has a lower complexity than using the entire set of data samples. Second, the reuse of historical data reduces the variance of θ_t during the iterative update. Third, the random sampling speeds up convergence by reducing the correlation in the training samples.

Overall, the DNN iteratively learns from the best state-action pairs (h_t, x_t^*) and generates better offloading decisions as time progresses. Meanwhile, under the finite memory space constraint, the DNN only learns from the most recent data samples generated by the most recent (and more refined) offloading policies. This closed-loop reinforcement learning mechanism constantly improves the offloading policy until convergence. We provide the pseudo-code of the DROO algorithm in Algorithm 1.

Algorithm 1: An online DROO algorithm to solve the offloading decision problem.

  input : Wireless channel gain h_t at each time frame t, the number of quantized actions K
  output: Offloading action x_t^* and the corresponding optimal resource allocation for each time frame t

  1  Initialize the DNN with random parameters θ_1 and an empty memory;
  2  Set the iteration number M and the training interval δ;
  3  for t = 1, 2, . . . , M do
  4      Generate a relaxed offloading action x̂_t = f_{θ_t}(h_t);
  5      Quantize x̂_t into K binary actions {x_k} = g_K(x̂_t);
  6      Compute Q^*(h_t, x_k) for all {x_k} by solving (P2);
  7      Select the best action x_t^* = arg max_{{x_k}} Q^*(h_t, x_k);
  8      Update the memory by adding (h_t, x_t^*);
  9      if t mod δ = 0 then
 10          Uniformly sample a batch of data samples {(h_τ, x_τ^*) | τ ∈ T_t} from the memory;
 11          Train the DNN with {(h_τ, x_τ^*) | τ ∈ T_t} and update θ_t using the Adam algorithm;
 12      end
 13  end
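For concreteness, the following is a compact Python sketch of Algorithm 1, not the authors' released code. It reuses the model, order_preserving_quantize, and solve_p2 sketches above; the memory size (1024), batch size (128), and training interval δ = 10 are taken from the simulation setup in Section 5.

```python
import numpy as np
from collections import deque

memory = deque(maxlen=1024)        # replay memory; the oldest sample is overwritten when full
delta, batch_size, K = 10, 128, N  # training interval, batch size, number of quantized actions

def droo_step(t, h_t, w):
    """One time frame of Algorithm 1: action generation, then (every delta frames) a policy update."""
    x_hat = model.predict(h_t.reshape(1, -1), verbose=0)[0]      # relaxed action, line 4
    candidates = order_preserving_quantize(x_hat, K)             # quantization, line 5
    rates = [solve_p2(h_t, w, x)[0] for x in candidates]         # Q*(h_t, x_k), line 6
    best = int(np.argmax(rates))
    x_star = candidates[best]                                    # best action, line 7
    memory.append((h_t, x_star))                                 # memory update, line 8
    if t % delta == 0 and len(memory) >= batch_size:             # policy update, lines 9-11
        idx = np.random.choice(len(memory), batch_size, replace=False)
        H = np.array([memory[i][0] for i in idx])
        X = np.array([memory[i][1] for i in idx], dtype=float)
        model.train_on_batch(H, X)                               # Adam step on the cross-entropy loss
    return x_star, rates[best]
```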
4.4 Adaptive Setting of K

Compared to conventional optimization algorithms, the DROO algorithm has the advantage of removing the need to solve hard MIP problems, and thus has the potential to significantly reduce the complexity. The major computational complexity of the DROO algorithm comes from solving (P2) K times in each time frame to select the best offloading action. Evidently, a larger K (e.g., K = N) in general leads to a better offloading decision in each time frame and accordingly a better offloading policy in the long term. Therefore, there exists a fundamental performance-complexity tradeoff in setting the value of K.

In this subsection, we propose an adaptive procedure to automatically adjust the number of quantized actions generated by the order-preserving quantization method. We argue that using a large and fixed K is not only computationally inefficient but also unnecessary in terms of computation rate performance. To see this, consider a wireless powered MEC network with N = 10 WDs. We apply the DROO algorithm with a fixed K = 10 and plot in Fig. 4 the index of the best action x_t^* calculated from (10) over time, denoted as k_t^*. For instance, k_t^* = 2 indicates that the best action in the t-th time frame is ranked the second among the K ordered quantized actions. In the figure, the curve is plotted as the 50-time-frame rolling average of k_t^*, and the light shadow region spans the upper and lower bounds of k_t^* in the past 50 time frames. Apparently, most of the selected indices k_t^* are no larger than 5 when t ≥ 5000. This indicates that the generated offloading actions x_k with k > 5 are redundant. In other words, we can gradually reduce K during the learning process to speed up the algorithm without compromising the performance.

Fig. 4. The index k_t^* of the best offloading action x_t^* for the DROO algorithm when the number of WDs is N = 10 and K = N. The detailed simulation setups are presented in Section 5.

Inspired by the results in Fig. 4, we propose an adaptive method for setting K. We denote K_t as the number of binary offloading actions generated by the quantization function at the t-th time frame. We set K_1 = N initially and update K_t every ∆ time frames, where ∆ is referred to as the updating interval for K. Upon an update time frame, K_t is set as 1 plus the largest k_t^* observed in the past ∆ time frames. The reason for the additional 1 is to allow K_t to increase during the learning process, up to the maximum K = N. In Section 5.2, we numerically show that setting a proper ∆ can effectively speed up the learning process without compromising the computation rate performance.
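A small sketch of this adaptive rule is shown below. The cap at N + 1 quantized actions is our assumption (the order-preserving method produces at most N + 1 actions); the update rule itself follows the description above.

```python
def update_K(K_prev, recent_best_indices, t, N, Delta=32):
    """Adaptive setting of K (Section 4.4): every Delta frames, set K to 1 plus the
    largest 1-based rank k_t* of the selected action observed in the past Delta frames."""
    if t % Delta == 0 and recent_best_indices:
        return min(max(recent_best_indices) + 1, N + 1)   # cap at N + 1 (our assumption)
    return K_prev                                          # otherwise keep the previous K
```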
5 NUMERICAL RESULTS

In this section, we use simulations to evaluate the performance of the proposed DROO algorithm. In all simulations, we use the parameters of the Powercast TX91501-3W transmitter with P = 3 Watts for the energy transmitter at the AP, and those of the P2110 Powerharvester for the energy receiver at each WD.2 The energy harvesting efficiency is µ = 0.51. The distance from the i-th WD to the AP, denoted by d_i, is uniformly distributed in the range of (2.5, 5.2) meters, i = 1, · · ·, N. Due to the page limit, the exact values of the d_i's are omitted. The average channel gain h̄_i follows the free-space path loss model h̄_i = A_d (3·10^8 / (4π f_c d_i))^{d_e}, where A_d = 4.11 denotes the antenna gain, f_c = 915 MHz denotes the carrier frequency, and d_e = 2.8 denotes the path loss exponent. The time-varying wireless channel gain of the N WDs at time frame t, denoted by h_t = [h_1^t, h_2^t, · · ·, h_N^t], is generated from a Rayleigh fading channel model as h_i^t = h̄_i α_i^t. Here, α_i^t is the independent random channel fading factor following an exponential distribution with unit mean. Without loss of generality, the channel gains are assumed to remain the same within one time frame and to vary independently from one time frame to another. We assume equal computing efficiency k_i = 10^{−26}, i = 1, · · ·, N, and φ = 100 for all the WDs [32]. The data offloading bandwidth is B = 2 MHz, the receiver noise power is N_0 = 10^{−10}, and v_u = 1.1. Without loss of generality, we set T = 1, and w_i = 1 if i is an odd number and w_i = 1.5 otherwise. All the simulations are performed on a desktop with an Intel Core i5-4570 3.2 GHz CPU and 12 GB of memory.

2. See detailed product specifications at http://www.powercastco.com.

We consider a fully connected DNN consisting of one input layer, two hidden layers, and one output layer in the proposed DROO algorithm, where the first and second hidden layers have 120 and 80 hidden neurons, respectively. Note that the DNN can be replaced by other structures with a different number of hidden layers and neurons, or even by other types of neural networks that fit the specific learning problem, such as a convolutional neural network (CNN) or a recurrent neural network (RNN) [33]. In this paper, we find that a simple two-layer perceptron suffices to achieve satisfactory convergence performance, while better convergence performance is expected by further optimizing the DNN parameters. We implement the DROO algorithm in Python with TensorFlow 1.0 and set the training interval δ = 10, the training batch size |T| = 128, the memory size to 1024, and the learning rate of the Adam optimizer to 0.01.
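The channel model described above can be reproduced with a few lines of numpy, as sketched below; the random seed and the helper names are ours.

```python
import numpy as np

Ad, fc, de = 4.11, 915e6, 2.8        # antenna gain, carrier frequency (Hz), path loss exponent
N = 10                               # number of WDs, as in the other sketches
rng = np.random.default_rng(0)
d = rng.uniform(2.5, 5.2, size=N)    # WD-AP distances in meters

h_bar = Ad * (3e8 / (4 * np.pi * fc * d)) ** de     # average gains from free-space path loss

def channel_realization():
    """One time frame of Rayleigh fading: h_t = h_bar * alpha_t with alpha_t ~ Exp(1)."""
    return h_bar * rng.exponential(1.0, size=N)
```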
Fig. 7. Computation rates for the DROO algorithm with temporarily new weights when N = 10 and K = 10. (Recovered legend entries: alternate all weights; double WD2's weight; triple WD1's weight; reset both WDs' weights.)

Fig. 9. Moving average of Q̂ under different algorithm parameters when N = 10: (a) memory size; (b) training batch size; (c) training interval; (d) learning rate.

Fig. 10. Moving average of Q̂ under different quantization functions and K when N = 10.

Fig. 11. Moving average of Q̂ for the DROO algorithm with different updating intervals ∆ for setting an adaptive K. Here, we set N = 10.

Furthermore, a large batch size consumes more time for training. As a trade-off between convergence speed and computation time, we set the training batch size |T| = 128 in the following simulations. In Fig. 9(c), we investigate the convergence of DROO under different training intervals δ. DROO converges faster with a shorter training interval, and thus more frequent policy updates. However, numerical results show that it is unnecessary to train and update the DNN too frequently. Hence, we set the training interval δ = 10 to speed up the convergence of DROO. In Fig. 9(d), we study the impact of the learning rate of the Adam optimizer [31] on the convergence performance. We notice that either a too small or a too large learning rate causes the algorithm to converge to a local optimum. In the following simulations, we set the learning rate to 0.01.

In Fig. 10, we compare the performance of two quantization methods: the proposed order-preserving quantization method and the conventional KNN quantization method under different K. In particular, we plot the moving average of Q̂ over time.
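Since the definition of the normalized rate in (11) is not reproduced here, the sketch below assumes that Q̂ is the ratio of the rate achieved by the selected action to the best rate over all 2^N enumerated actions, consistent with the enumeration mentioned around (11) and tractable for N = 10; it reuses the solve_p2 sketch above.

```python
import numpy as np
from itertools import product

def normalized_rate(h, w, x_selected):
    """Assumed Q-hat: achieved rate divided by the best rate over all 2^N actions."""
    q_sel = solve_p2(h, w, x_selected)[0]
    q_opt = max(solve_p2(h, w, np.array(x))[0]
                for x in product([0, 1], repeat=len(h)))
    return q_sel / q_opt

def moving_average(values, window=50):
    """Rolling average used for the convergence curves."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")
```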
[Fig. 14 (recovered from figure residue): computation rate (×10^6) versus the number of WDs N ∈ {10, 20, 30}, with curves for CD and DROO.]

The following benchmark computation modes are also evaluated:

• Local Computing. All N WDs only perform local computation, i.e., setting x_i = 0, i = 1, · · ·, N in (P2).

• Edge Computing. All N WDs offload their tasks to the AP, i.e., setting x_i = 1, i = 1, · · ·, N in (P2).

In Fig. 14, we first compare the computation rate performance achieved by the different offloading algorithms under a varying number of WDs, N. Before the evaluation, DROO has been trained with 24,000 independent wireless channel realizations, and its offloading policy has converged. This is reasonable since we are more interested in the long-term operation performance [34] for field deployment. Each point is averaged over independent wireless channel realizations.

We further compare the performance of the DROO and LR algorithms. For better exposition, we plot the normalized computation rate Q̂ achievable by DROO and LR. Specifically, we enumerate all 2^N possible offloading actions as in (11) when N = 10.

3. The CVXPY package is available online at https://www.cvxpy.org/.

At last, we evaluate the execution latency of the DROO algorithm. The computational complexity of the DROO algorithm greatly depends on the complexity of solving the resource allocation sub-problem (P2). For a fair comparison, we use the same bi-section search method as the CD algorithm in [7]. The CD method is reported to achieve an O(N^3) complexity. For the DROO algorithm, we consider both a fixed K = N and an adaptive K as in Section 4.4. Note that the execution latency for DROO listed in Table 2 is averaged over 30,000 independent wireless channel realizations, including both offloading action generation and DNN training. Overall, the training of the DNN contributes only a small proportion of the CPU execution latency, which is much smaller than that of the bi-section search algorithm for resource allocation.
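Per-frame CPU latency can be measured along the lines of the sketch below, which times the droo_step and channel_realization sketches defined earlier; the weights follow the Section 5 setup (w_i = 1 for odd i and 1.5 for even i). This only illustrates the measurement procedure, not the benchmark code behind Table 2.

```python
import time
import numpy as np

w = np.array([1.0 if i % 2 == 0 else 1.5 for i in range(N)])  # WDs 1, 3, ... get weight 1
latencies = []
for t in range(1, 301):
    h_t = channel_realization()
    tic = time.perf_counter()
    droo_step(t, h_t, w)            # action generation + (every delta frames) DNN training
    latencies.append(time.perf_counter() - tic)

print(f"average CPU latency per time frame: {np.mean(latencies):.3f} s")
```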
Taking DROO with K = 10 as an example, it uses 0.034 second to generate an offloading action and 0.002 second to train the DNN in each time frame. Training the DNN is efficient here: during each offloading policy update, only a small batch of training data samples, |T| = 128, is used to train a two-hidden-layer DNN with only 200 hidden neurons in total via back-propagation. We see from Table 2 that an adaptive K can effectively reduce the CPU execution latency compared with a fixed K = N. Besides, DROO with an adaptive K requires a much shorter CPU execution latency than the CD algorithm and the LR algorithm. In particular, it generates an offloading action in less than 0.1 second when N = 30, while CD and LR take 65 times and 14 times longer CPU execution latency, respectively. Overall, DROO achieves similar rate performance as the near-optimal CD algorithm but requires substantially less CPU execution latency than the heuristic LR algorithm.

TABLE 2
Comparisons of CPU execution latency

# of WDs   DROO (Fixed K = N)   DROO (Adaptive K with ∆ = 32)   CD        LR
10         3.6e-2 s             1.2e-2 s                        2.0e-1 s  2.4e-1 s
20         1.3e-1 s             3.0e-2 s                        1.3 s     5.3e-1 s
30         3.1e-1 s             5.9e-2 s                        3.8 s     8.1e-1 s

Fig. 15. Boxplot of the normalized computation rate Q̂ for the DROO and LR algorithms under different numbers of WDs. The central mark (in red) indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively.

The wireless powered MEC network considered in this paper may correspond to a static IoT network in which both the transmitter and the receivers are fixed in location. Measurement experiments [24]–[26] show that the channel coherence time, during which we deem the channel invariant, ranges from 1 to 10 seconds and is typically no less than 2 seconds. The time frame duration is set smaller than the coherence time. Without loss of generality, let us assume that the time frame is 2 seconds. Taking the MEC network with N = 30 as an example, the total execution latency of DROO is 0.059 second, accounting for 3% of the time frame, which is an acceptable overhead for field deployment. In fact, DROO can be further improved by only generating offloading actions at the beginning of the time frame and then training the DNN during the remaining time frame, in parallel with energy transfer, task offloading, and computation. In comparison, the execution of the LR algorithm consumes 40% of the time frame, and the CD algorithm even requires a longer execution time than the time frame, which is evidently unacceptable in practical implementations. Therefore, DROO makes real-time offloading and resource allocation truly viable for wireless powered MEC networks in fading environments.

6 CONCLUSION

In this paper, we have proposed a deep reinforcement learning-based online offloading algorithm, DROO, to maximize the weighted sum computation rate in wireless powered MEC networks with binary computation offloading. The algorithm learns from past offloading experiences to improve the offloading actions generated by a DNN via reinforcement learning. An order-preserving quantization method and an adaptive parameter setting method are devised to achieve fast algorithm convergence. Compared to conventional optimization methods, the proposed DROO algorithm completely removes the need to solve hard mixed integer programming problems. Simulation results show that DROO achieves near-optimal performance similar to existing benchmark methods but reduces the CPU execution latency by more than an order of magnitude, making real-time system optimization truly viable for wireless powered MEC networks in fading environments.

Although the resource allocation subproblem is solved under a specific wireless powered network setup, the proposed DROO framework is applicable to computation offloading in general MEC networks. A major challenge, however, is that the mobility of the WDs would make it harder for DROO to converge.

As a concluding remark, we expect that the proposed framework can also be extended to solve MIP problems for various applications in wireless communications and networking that involve coupled integer decisions and continuous resource allocation, e.g., mode selection in D2D communications, user-to-base-station association in cellular systems, routing in wireless sensor networks, and caching placement in wireless networks. The proposed DROO framework is applicable as long as the resource allocation subproblems can be efficiently solved to evaluate the quality of the given integer decision variables.

REFERENCES

[1] S. Bi, C. K. Ho, and R. Zhang, "Wireless powered communication: Opportunities and challenges," IEEE Commun. Mag., vol. 53, no. 4, pp. 117–125, Apr. 2015.
[2] M. Chiang and T. Zhang, "Fog and IoT: An overview of research opportunities," IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
[3] Y. Mao, J. Zhang, and K. B. Letaief, "Dynamic computation offloading for mobile-edge computing with energy harvesting devices," IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590–3605, Dec. 2016.
[4] C. You, K. Huang, H. Chae, and B.-H. Kim, "Energy-efficient resource allocation for mobile-edge computation offloading," IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
[5] X. Chen, L. Jiao, W. Li, and X. Fu, "Efficient multi-user computation offloading for mobile-edge cloud computing," IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
[6] F. Wang, J. Xu, X. Wang, and S. Cui, "Joint offloading and computing optimization in wireless powered mobile-edge computing systems," IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784–1797, Mar. 2018.
[7] S. Bi and Y. J. A. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 4177–4190, Jun. 2018.
[8] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, "A survey on mobile edge computing: The communication perspective," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Aug. 2017.
[9] C. You, K. Huang, and H. Chae, "Energy efficient mobile cloud computing powered by wireless energy transfer," IEEE J. Sel. Areas Commun., vol. 34, no. 5, pp. 1757–1771, May 2016.
[10] P. M. Narendra and K. Fukunaga, "A branch and bound algorithm for feature subset selection," IEEE Trans. Comput., vol. C-26, no. 9, pp. 917–922, Sep. 1977.
[11] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 1995, vol. 1, no. 2.
[12] T. X. Tran and D. Pompili, "Joint task offloading and resource allocation for multi-server mobile-edge computing networks," arXiv preprint arXiv:1705.00704, 2017.
[13] S. Guo, B. Xiao, Y. Yang, and Y. Yang, "Energy-efficient dynamic offloading and resource scheduling in mobile cloud computing," in Proc. IEEE INFOCOM, Apr. 2016, pp. 1–9.
[14] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. Quek, "Offloading in mobile edge computing: Task allocation and computational frequency scaling," IEEE Trans. Commun., vol. 65, no. 8, pp. 3571–3584, Aug. 2017.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, Feb. 2015.
[16] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," arXiv preprint arXiv:1512.07679, 2015.
[17] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, May 2015.
[18] Y. He, F. R. Yu, N. Zhao, V. C. Leung, and H. Yin, "Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach," IEEE Commun. Mag., vol. 55, no. 12, pp. 31–37, Dec. 2017.
[19] L. Huang, X. Feng, A. Feng, Y. Huang, and P. Qian, "Distributed deep learning-based offloading for mobile edge computing networks," Mobile Netw. Appl., 2018, doi: 10.1007/s11036-018-1177-x.
[20] M. Min, D. Xu, L. Xiao, Y. Tang, and D. Wu, "Learning-based computation offloading for IoT devices with energy harvesting," IEEE Trans. Veh. Technol., vol. 68, no. 2, pp. 1930–1941, Feb. 2019.
[21] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Performance optimization in mobile-edge computing via deep reinforcement learning," IEEE Internet Things J., Oct. 2018.
[22] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, "Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing," Digital Communications and Networks, vol. 5, no. 1, pp. 10–17, 2019.
[23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," in Proc. ICLR, 2016.
[24] R. Bultitude, "Measurement, characterization and modeling of indoor 800/900 MHz radio channels for digital communications," IEEE Commun. Mag., vol. 25, no. 6, pp. 5–12, Jun. 1987.
[25] S. J. Howard and K. Pahlavan, "Doppler spread measurements of indoor radio channel," Electronics Letters, vol. 26, no. 2, pp. 107–109, Jan. 1990.
[26] S. Herbert, I. Wassell, T. H. Loh, and J. Rigelsford, "Characterizing the spectral properties and time variation of the in-vehicle wireless communication channel," IEEE Trans. Commun., vol. 62, no. 7, pp. 2390–2399, Jul. 2014.
[27] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," in Proc. IEEE SPAWC, Jul. 2017, pp. 1–6.
[28] H. Ye, G. Y. Li, and B. H. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[29] S. Marsland, Machine Learning: An Algorithmic Perspective. CRC Press, 2015.
[30] L.-J. Lin, "Reinforcement learning for robots using neural networks," Carnegie Mellon University, Pittsburgh, PA, School of Computer Science, Tech. Rep., 1993.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[32] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, "Mobile-edge computing: Partial computation offloading using dynamic voltage scaling," IEEE Trans. Commun., vol. 64, no. 10, pp. 4268–4282, Oct. 2016.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[34] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.