IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, VOL. 8, NO. 1, JANUARY-MARCH 2023

Service Management and Energy Scheduling Toward Low-Carbon Edge Computing
Lin Gu, Member, IEEE, Weiying Zhang, Zhongkui Wang, Deze Zeng, Member, IEEE, and Hai Jin, Fellow, IEEE

Abstract—Edge computing has become an alternative low-latency provision of cloud computing thanks to its close proximity to the users, and the geo-distributed nature of edge servers enables the utilization of green energy harvested from the environment on-site. To pursue the goal of low-carbon edge computing, it is desirable to minimize the operational expenditure by scheduling the computing resources and green energy according to the spatially and temporally varying user demands. In this article, inspired by the successful application of deep reinforcement learning (DRL) in diverse domains, we propose a DRL-based edge computing management strategy which continuously explores the states and adaptively makes decisions on service management and energy scheduling, towards long-term cost minimization. Different from model-based solutions, our proposal is a model-free method without any assumption on a priori statistical knowledge, and is therefore practical to implement. To speed up the agent training procedure, we further design a prioritized replay memory that uses the model-based solution as a guideline to set the transition priority. Extensive experimental results based on real-world traces validate that our proposed DRL-based strategy makes considerable progress compared to the one-shot greedy strategy, and that it can learn the system dynamics to manage the edge computing services at runtime.

Index Terms—Low-carbon edge computing, deep reinforcement learning, service management

• Lin Gu, Weiying Zhang, Zhongkui Wang, and Hai Jin are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China. E-mail: {lingu, zhang_wy1, zkwang, hjin}@hust.edu.cn.
• Deze Zeng is with the Hubei Key Laboratory of Intelligent Geo-Information Processing, School of Computer Science, China University of Geosciences, Wuhan, Hubei 430074, China. E-mail: deze@cug.edu.cn.

Manuscript received 26 March 2022; revised 17 June 2022; accepted 7 August 2022. Date of publication 29 September 2022; date of current version 7 March 2023. This work was supported in part by the NSF of China under Grants 61972171, 62172375, and 62232011, and in part by the Open Research Projects of Zhejiang Lab under Grant 2021KE0AB02. (Corresponding author: Deze Zeng.) Recommended for acceptance by H. Yao. Digital Object Identifier no. 10.1109/TSUSC.2022.3210564. This work is licensed under a Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).

1 INTRODUCTION

With the exponential growth of computing demands, it is increasingly difficult for traditional cloud computing to meet the requirements of various latency-sensitive applications (e.g., mobile applications, Internet-of-Things, connected and autonomous vehicles). Edge computing has become an attractive low-latency computing paradigm by pushing services to edge servers within user proximity, and it is promising for migrating the heavy burden from both the cloud computing platform and the backbone networks. However, as the computing demand grows, so do the electricity consumption and the emission of greenhouse gas. Low-carbon cloud computing has been proposed by many studies to limit the carbon emissions of the cloud [1], [2]. For example, Canada's Greenstar Network [1] connects cloud computing resources directly to wind turbines or solar power generators, using green energy to reduce carbon emissions. [3] and [4] study grid power market records and propose a computing model that directly connects cloud computing resources to green energy generators to reuse the energy wasted during over-supply and transmission congestion.

Compared with the cloud, edge computing shows great advantages for green and sustainable computing because its geo-distributed servers can naturally harvest various on-site green energy (e.g., wind, hydroelectric, and solar energy) as the primary power supply [5], [6], [7], [8], realizing low-carbon edge computing. However, edge computing is more complex than cloud computing due to its higher dynamics in a wide range of aspects, e.g., user mobility, user demands, and resource availability. Specifically, the distribution of user demands is usually related to time and user mobility, and the services should be migrated accordingly. Green energy, characterized by unpredictability and geographical diversity, further exaggerates such dynamics. To ensure a stable energy supply, it is not enough to rely exclusively on intermittent green energy, so electricity from the power grid is also needed. However, the grid company usually charges for electricity in the form of a step tariff, i.e., a time-varying price, which cannot be ignored.

From the perspective of edge computing operators, it is always desirable to reduce the operational expenditure (OPEX), of which energy cost is one of the most dominant parts, so as to maximize revenue. Therefore, how to jointly manage the services and schedule the energy in edge computing toward the goal of OPEX minimization has recently become a hot research topic. For example, Yang et al. [9] propose an offline service management strategy, under the assumption that green energy cannot be stored, to minimize short-term brown energy costs. Mao et al. [10] invent a
Lyapunov optimization based algorithm to reduce the service cost of edge computing by aggressively maximizing the utility of available green energy. Chen et al. [11] also propose an online peer offloading strategy that maximizes the performance efficiency while guaranteeing the long-term energy consumption constraints on edge servers.

Although there has been much research in this area, we notice that most proposed methods are based on optimization models with certain assumptions or prerequisites to manage the services. Even where online algorithms exist, they usually oversimplify the complex system or model it inaccurately. When these model-based optimization methods are applied in practical scenarios, the achieved performance is usually not satisfactory. Therefore, a model-free solution is desired to intelligently manage services according to runtime system dynamics. Fortunately, the successful application of the AlphaGo series to game control has again raised fervent interest in reinforcement learning (RL) from both academia and industry [12], [13]. Especially in combination with deep learning, deep reinforcement learning (DRL) and its derivative algorithms, such as prioritized experience replay (PER-DQN) [14], have already been extensively used in many control domains, e.g., robotics, autonomous driving, and traffic light control, with no exception for ICT system management. For example, Huang et al. [15] propose a deep reinforcement learning-based online offloading (DROO) framework for offloading decisions and wireless resource allocation with the goal of maximizing the weighted sum computation rate. Therefore, applying RL to design a model-free service management strategy for edge computing is a feasible way.

However, these solvers are shown to be inefficient when they come to complex environments. Taking DROO as an example, the solver adopts a deep neural network to learn the integer variables, which becomes time consuming as the action space grows. In this paper, we design a service management strategy from the perspective of edge computing operators towards the goal of long-term cost efficiency. To achieve this, a series of different spatial and temporal dynamics need to be addressed, as declared before, including the user demands, green energy generation, and brown energy pricing. Usually, the service runs in the form of a container, which can be freely migrated among different edge servers. In pursuit of cost efficiency, the operator shall make online decisions on both the service management (i.e., to which edge server to migrate the service) and the energy scheduling (i.e., how much green energy shall be utilized) in response to the real-time system environment and potential future dynamics. This naturally fits in the application scope of RL. Therefore, we leverage the representative DRL algorithm, deep Q-network (DQN), to design an efficient energy-aware service management strategy. We mainly make the following contributions in this paper:

• With the goal of minimizing the long-term OPEX, we formulate the online service management and energy scheduling problem in a mixed integer linear programming (MILP) form.
• We consider the effect of current service migration and energy scheduling actions on future energy consumption, and leverage the DQN technique to design a DQN based solution that reduces migration cost and improves green energy efficiency.
• To further accelerate the DQN agent training, we propose a prioritized DRL (pDRL) algorithm by redesigning a prioritized episodic transition sampling method according to the temporal-difference (TD) error and utilizing the solution of the MILP problem as a guideline to improve the efficiency of the actions.
• Extensive experiments are conducted to show that our pDRL accelerates the convergence speed by 37.67% and reduces the long-term OPEX by 32.70%, compared with state-of-the-art studies.

The remainder of this paper is organized as follows. First, Section 2 introduces the investigated problem through a toy example and presents the formal problem formulation. Our customized DRL algorithm is proposed in Section 3. Then, Section 4 illustrates the experimental results and Section 5 discusses some related studies. Finally, the conclusion of this paper is provided in Section 6.

Fig. 1. An example on service management and energy scheduling.

2 MOTIVATION AND PROBLEM STATEMENT

2.1 Motivation Example
To understand this problem better, let us consider a toy example of service management and energy scheduling
TABLE 1
Major Notations

Constants
N           The set of edge nodes n
F           The set of different types of services f
T           The set of time slots t
H_mn        The number of hops from edge node m to edge node n
P_f         The unit energy consumption of request processing for service f
S_f         The unit energy consumption of request transmission for service f
V_f         The unit energy consumption of service migration for service f
λ_nf(t)     The request arrival rate of service f at edge node n at time slot t
G_n(t)      The green energy reserved at edge node n at time slot t

Variables
x_nf(t)     Binary variable indicating whether edge node n hosts service f at time slot t
m_n(t)      The total amount of green energy used for all requests at edge node n at time slot t
e_n(t)      The total amount of energy needed for all requests at edge node n at time slot t
r_n(t)      The total amount of reserved green energy at node n at the end of time slot t

problem with two edge servers over 2 time slots, as shown in Fig. 1. In this example, we consider the total cost consisting of three aspects: 1) processing cost: only the edge server hosting the corresponding service can process the requests, at an energy cost of 5 per request. We consider a simple scenario where only one node runs the service while the others hibernate to save energy. 2) Service migration cost: migrating the service between edge servers costs 11 units of energy per hop at the node hosting the service. 3) Communication cost: service demands need to be routed to the edge server hosting the service at a cost of 1 per request per hop. Initially, the service is located at edge server a, and each server reserves a certain amount of green energy, i.e., 25 units at edge server a and 50 units at edge server b, and the number of hops between servers a and b is 3. The newly generated green energy amounts are {20, 25} and {5, 5}, and the request arrival rates are {1, 5} and {5, 1}, at servers a and b in time slots t1 and t2, respectively. Assume the brown energy price is 1 at time slot t1 and 2 at time slot t2, while the green energy is free.

One straightforward solution is to maximize the utility of green energy at each time slot in a Greedy way, i.e., always choosing the edge server with the maximum green energy. That is, we move the service to server b at time slot t1, where 50 units of green energy are available, and then migrate it back to server a at time slot t2, where another 50 units of green energy are available. We can see from strategy Greedy that 30 + 33 + 3 = 66 units of energy are needed by server b at time slot t1, of which 50 units can be covered by local green energy, so the energy cost is (66 − 50) × 1 = 16. Similarly, the energy cost of the service at server a at time slot t2 is 16 × 2 = 32. In this case, the Greedy strategy produces a total energy cost of 48, as shown in Fig. 1. From the Greedy strategy, we can see that greedily chasing the green energy may require migrating the service very often, hence resulting in a higher energy cost. So, we now consider leaving the service at edge server a and using its available green energy, as shown in strategy Myopic. In this case, at time slots t1 and t2, the brown energy consumptions are (30 + 15) − 25 = 20 and (30 + 3) − 25 = 8, respectively, leading to a total cost of 20 × 1 + 8 × 2 = 36. It is noticeable that the brown energy price at time slot t2 is 2, which is higher than at time slot t1. So, if we are smart enough, we should save green energy at time slot t1 for use at t2 to reduce the total energy cost efficiently. The Smart strategy in Fig. 1 shows that the total energy consumption is unchanged, yet we use only part of the green energy at server a at t1, namely 17 units, and leave the rest to t2; finally, the total brown energy cost is reduced to 28 × 1 + 0 × 2 = 28.

Comparing the results of Greedy and Myopic, we can find that greedily selecting the edge server with the highest on-site green energy supply at each time slot will not always yield a satisfactory solution. Similarly, from Myopic and Smart, it can be observed that aggressively using all the green energy at an edge server might miss the opportunity to minimize the long-term energy cost.
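To make this comparison concrete, the short sketch below recomputes the three strategies' per-slot brown-energy bills for the toy instance. The per-slot green-energy amounts are taken directly from the description above (battery dynamics are abstracted away), and the helper names are purely illustrative.

```python
# Toy instance from Fig. 1: two servers, two time slots.
PROC, MIG, COMM, HOPS = 5, 11, 1, 3   # unit costs and hop count

def slot_demand(local_reqs, remote_reqs, migrated):
    """Energy needed at the hosting server in one slot (processing +
    routing of remote requests + optional migration over HOPS hops)."""
    return (PROC * (local_reqs + remote_reqs)
            + COMM * remote_reqs * HOPS
            + (MIG * HOPS if migrated else 0))

def brown_cost(demands, green_used, prices):
    """Brown-energy bill: whatever green energy does not cover is bought."""
    return sum(max(d - g, 0) * p for d, g, p in zip(demands, green_used, prices))

prices = [1, 2]                      # brown energy price at t1, t2
# (energy demand per slot, green energy used per slot) for each strategy
greedy = ([slot_demand(5, 1, True), slot_demand(5, 1, True)], [50, 50])
myopic = ([slot_demand(1, 5, False), slot_demand(5, 1, False)], [25, 25])
smart  = ([slot_demand(1, 5, False), slot_demand(5, 1, False)], [17, 33])

for name, (dem, green) in [("Greedy", greedy), ("Myopic", myopic), ("Smart", smart)]:
    print(name, brown_cost(dem, green, prices))   # -> 48, 36, 28
```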
2.2 Problem Statement and Formulation
Based on the toy example, let us now consider a general case of the energy scheduling problem in a set of edge nodes N over discrete time slots T = {1, 2, 3, ..., T}. Table 1 shows the major notations used in this paper.

2.2.1 Service Migration and Energy Consumption
To achieve low-carbon edge computing, we need to select the proper edge node to host the service and decide the utilized green energy amount at each time slot. This long-term energy cost minimization over the time period T can be considered as a Markov decision process (MDP). Note that certain demands can only be served by the edge node hosting the corresponding service. So we are interested in reducing the total energy cost consisting of three parts: 1) request transmission, 2) request processing, and 3) service migration.

Let the binary variable x_{nf}(t) indicate whether service f is placed on edge node n at time slot t, i.e., x_{nf}(t) = 1 if service f is placed on node n at time slot t, and x_{nf}(t) = 0 otherwise.

Note that the amount of arriving requests for different services varies across time slots, which contributes to the diversity of request arrival rates. Besides, the newly generated green energy also varies over time and location. At the beginning of each time slot t, we need to decide how to place the services and schedule the energy. For each type of service f, we consider one service instance for its requests, i.e.,

\sum_{n \in N} x_{nf}(t) = 1.   (1)

Once an edge node n is selected, all the offloaded requests will be transmitted to and processed by the corresponding service instance on node n. That is, if x_{nf}(t) = 1, edge node n is selected to process the requests of service f with rate \sum_{m \in N} λ_{mf}(t) and unit processing energy consumption P_f; thus, the processing energy consumption of node n is

e^p_n(t) = \sum_{f \in F} \sum_{m \in N} x_{nf}(t) \cdot λ_{mf}(t) \cdot P_f.   (2)

The transmission energy also cannot be ignored. Since the requests λ_{mf}(t) need to be transmitted from edge node m to n with unit transmission energy consumption S_f, the transmission energy consumption e^s_n(t) is

e^s_n(t) = \sum_{f \in F} \sum_{m \in N} x_{nf}(t) \cdot λ_{mf}(t) \cdot H_{mn} \cdot S_f.   (3)

As for the service migration cost e^v_n(t), it is related to the location of the service at the previous time slot, x_{mf}(t-1), and the "distance" of the migration, i.e., the number of network hops H_{mn}. When a service migration decision is made from edge node m to n, m, n \in N, it consumes energy V_f per hop for the service migration; thus, the migration energy consumption e^v_n(t) is

e^v_n(t) = \sum_{f \in F} \sum_{m \in N} x_{nf}(t) \cdot x_{mf}(t-1) \cdot H_{mn} \cdot V_f.   (4)

Note that if no service migration happens, the migration cost e^v_n(t) = 0, since the migration distance H_{mn} is 0.

Notice that equation (4) is nonlinear due to the product of two binary variables, i.e., x_{nf}(t) \cdot x_{mf}(t-1). To linearise this equation, a new binary variable y^o_f(t), o \in N, is defined as

y^o_f(t) = x_{nf}(t) \cdot x_{mf}(t-1), \forall (m, n) \in N.   (5)

The products can then be equivalently replaced by the new variables with the following two linear constraints:

0 \leq y^o_f(t) \leq x_{nf}(t), \forall (m, n) \in N,   (6)

and, since x_{nf}(t) is binary,

x_{nf}(t) + x_{mf}(t-1) - 1 \leq y^o_f(t) \leq x_{mf}(t-1), \forall (m, n) \in N.   (7)

In this case, (4) can be rewritten as the linear equation

e^v_n(t) = \sum_{f \in F} \sum_{m \in N} y^o_f(t) \cdot H_{mn} \cdot V_f,   (8)

with constraints (6) and (7).
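As a quick illustration of how (2), (3), and (8) aggregate the per-node energy, the following sketch evaluates the three components for a given placement. The data layout (dictionaries keyed by node and service) is an assumption made only for this example.

```python
# Illustrative evaluation of the per-node energy terms (2), (3), (8).
def node_energy(n, x, x_prev, lam, hops, P, S, V, nodes, services):
    """x[n][f], x_prev[n][f]: current/previous placement (0/1);
    lam[m][f]: request arrival rate of service f at node m;
    hops[m][n]: hop count; P, S, V: unit energy costs per service."""
    e_proc = sum(x[n][f] * lam[m][f] * P[f]
                 for f in services for m in nodes)
    e_trans = sum(x[n][f] * lam[m][f] * hops[m][n] * S[f]
                  for f in services for m in nodes)
    # x[n][f] * x_prev[m][f] is the migration indicator linearized as y in (5)
    e_migr = sum(x[n][f] * x_prev[m][f] * hops[m][n] * V[f]
                 for f in services for m in nodes)
    return e_proc + e_trans + e_migr          # e_n(t), cf. (10)
```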
2.3 Green Energy Scheduling
There is no doubt that the generation of green energy varies from node to node and from time to time. For example, solar powered edge servers can harvest sufficient green energy on a sunny day while others may not. If there is not enough green energy to process all requests on the selected node n, brown energy will be utilized at a higher price. Since aggressively using up the green energy at the selected edge node n hosting the service may cause a higher brown energy cost during future peak hours, we need to determine the amount of green energy used at the start of each time slot t, denoted m_n(t). Note that we can only schedule the green energy of the selected edge node hosting the service, i.e.,

\frac{\sum_{f \in F} x_{nf}(t)}{A} \leq m_n(t) \leq A \cdot \sum_{f \in F} x_{nf}(t), \forall n \in N,   (9)

where A is an arbitrarily large number.

Of course, the scheduled green energy also cannot exceed the available amount or the total energy required. Let e_n(t) be the total energy requirement,

e_n(t) = e^p_n(t) + e^s_n(t) + e^v_n(t), \forall n \in N.   (10)

Then we have

0 \leq m_n(t) \leq G_n(t), \forall n \in N,   (11)

and

0 \leq m_n(t) \leq e_n(t), \forall n \in N.   (12)

When there are not many requests at time slot t, the green energy of the node selected for processing these requests might not be totally used; the remainder can be reserved and utilized together with the new green energy generated at time slot t + 1 (i.e., H_n(t + 1)), and the same goes for the other nodes. So, the amount of green energy at node n at time slot t + 1 is

G_n(t + 1) = \min(G_n(t) - m_n(t) + H_n(t + 1), G^{max}_n), \forall n \in N,   (13)

where G^{max}_n is the battery capacity of edge node n.

To reduce the electricity cost, it is reasonable to set different priorities for using green energy and brown energy. Based on the above definitions, the brown energy to be used can be expressed as e_n(t) - m_n(t), and the total energy cost is

C(t) = \sum_{n \in N} α(t)(e_n(t) - m_n(t)) + β(t) m_n(t),   (14)

in which the coefficients α and β can be decided by the edge computing operators to reflect the brown energy and green energy prices, respectively.

In our experiments, we aim at minimizing the consumption of brown energy. Intuitively, we can preferentially utilize green energy to replace brown energy; hence we set α much larger than β.
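The battery recursion (13) and the per-slot bill (14) can be traced with a few lines of code; the sketch below is a minimal illustration, where the node-indexed dictionaries and variable names are assumptions of this example rather than part of the formulation.

```python
# Battery update (13) and per-slot cost (14) for one time slot.
def step_battery(G, m_used, harvest, G_max):
    """G, m_used, harvest, G_max: dicts keyed by node id."""
    return {n: min(G[n] - m_used[n] + harvest[n], G_max[n]) for n in G}

def slot_cost(e, m_used, alpha, beta):
    """alpha: brown-energy price, beta: green-energy price (alpha >> beta)."""
    return sum(alpha * (e[n] - m_used[n]) + beta * m_used[n] for n in e)

# Tiny two-node example: node 0 hosts the service and burns 17 of its 25 units.
G      = {0: 25, 1: 50}
e      = {0: 45, 1: 0}
m_used = {0: 17, 1: 0}
print(slot_cost(e, m_used, alpha=1.0, beta=0.0))        # brown bill = 28
print(step_battery(G, m_used, harvest={0: 25, 1: 5}, G_max={0: 100, 1: 100}))
```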
2.4 A Joint MILP Formulation
Now, we can formulate this service management and energy scheduling problem as a mixed-integer linear program (MILP) whose objective is to minimize the long-term total energy cost. By rewriting (14), the problem is formulated as follows:

Cost-Min:
\min \sum_{t \in T} \sum_{n \in N} α(t)(e_n(t) - m_n(t)) + β(t) m_n(t)
s.t. (1), (6), (7), (9), (10), (11), (12), and (13).

In fact, when the supply of green energy is sufficient, that is, when m_n(t) can be made large enough, the problem can be transformed into minimizing the consumption of green energy.
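For readers who want to reproduce a Cost-Min baseline, the sketch below sets up a compact single-service variant of the MILP with Gurobi's Python API (the paper reports using Gurobi 9.0.2). The data containers, the big-M value, keeping only the upper bound of (9), dropping the battery cap in (13), and assuming the service starts at the first node are simplifying assumptions of this sketch, not the authors' exact implementation.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_cost_min(N, T, lam, hops, P, S, V, harvest, G0, alpha, beta, big_m=1e6):
    """N: node ids; T: number of slots; lam[m][t]: arrivals at node m; hops[m][n]:
    hop count; P/S/V: unit energies; harvest[n][t]: new green energy; G0[n]:
    initial reserve; alpha[t]/beta[t]: brown/green prices. Single service."""
    md = gp.Model("cost_min")
    x  = md.addVars(N, range(T), vtype=GRB.BINARY, name="x")      # placement, eq. (1)
    y  = md.addVars(N, N, range(T), vtype=GRB.BINARY, name="y")   # linearization, eq. (5)
    m_ = md.addVars(N, range(T), lb=0.0, name="m")                # green energy used
    e  = md.addVars(N, range(T), lb=0.0, name="e")                # total energy needed
    G  = md.addVars(N, range(T + 1), lb=0.0, name="G")            # green reserve

    md.addConstrs((G[n, 0] == G0[n] for n in N))
    md.addConstrs((gp.quicksum(x[n, t] for n in N) == 1 for t in range(T)))   # (1)
    for t in range(T):
        for n in N:
            md.addConstr(e[n, t] ==                                            # (2)+(3)+(8) -> (10)
                gp.quicksum(x[n, t] * lam[m][t] * (P + hops[m][n] * S) for m in N)
                + gp.quicksum(y[m, n, t] * hops[m][n] * V for m in N))
            md.addConstr(m_[n, t] <= G[n, t])                                  # (11)
            md.addConstr(m_[n, t] <= e[n, t])                                  # (12)
            md.addConstr(m_[n, t] <= big_m * x[n, t])                          # upper part of (9)
            md.addConstr(G[n, t + 1] == G[n, t] - m_[n, t] + harvest[n][t])    # (13), no cap
            for m in N:
                x_prev = x[m, t - 1] if t > 0 else (1.0 if m == N[0] else 0.0)
                md.addConstr(y[m, n, t] <= x[n, t])                            # (6)
                md.addConstr(y[m, n, t] <= x_prev)                             # (7)
                md.addConstr(y[m, n, t] >= x[n, t] + x_prev - 1)               # (7)
    md.setObjective(gp.quicksum(alpha[t] * (e[n, t] - m_[n, t]) + beta[t] * m_[n, t]
                                for n in N for t in range(T)), GRB.MINIMIZE)
    md.optimize()
    return md.ObjVal
```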
3 THE DRL-BASED EDGE COMPUTING ENERGY SCHEDULING ALGORITHM

To cope with the computational complexity and the complex network conditions, we propose a DRL-based algorithm to solve the service management and energy scheduling problem in edge computing. As a model-free approach, it can automatically learn from transitions and then give adequate control decisions at runtime without any prior knowledge of the energy dynamics or network statistics. At the very beginning, the DRL agent has no knowledge of how to make decisions (or take actions). It receives the states of the network conditions, including service demands, green energy amounts, and service locations, and uses them as the input of the neural network to produce the corresponding service management and energy scheduling decisions as actions. Then, a corresponding reward is calculated based on how satisfactory the decisions are, and the weights of the neural network are updated accordingly. Theoretically, if the agent experiences enough states, it will be able to make optimal-approaching decisions. So, in the first place, we define the three key elements of RL, i.e., the state space S, the action space A, and the reward r, as follows.

S: The state s \in S is represented as an array composed of the indexes of the nodes that host the services, the available green energy amounts, and the arrival rates of requests on each edge server at time slot t, i.e.,

s(t) = ((X_f(t), \forall f \in F), (G_n(t), \forall n \in N), (λ_n(t), \forall n \in N)),   (15)

where X_f(t) is the index of the node hosting service f at time slot t.

A: To minimize the long-term energy cost, two decisions should be made, expressed as

a(t) = ((X_f(t), \forall f \in F), (m_{X_f(t)}(t), \forall f \in F)),   (16)

to indicate the service migration locations and the corresponding energy scheduling. Define O = {0, O_1, O_2, ..., 1} as the set of fractions of the remaining green energy G_{X_f(t)}(t) in node X_f(t) that may be used, where the values in O can be customized by the network operators. The action m_{X_f(t)}(t), i.e., the green energy usage, is then decided as

m_{X_f(t)}(t) \in O \cdot G_{X_f(t)}(t), \forall f \in F.   (17)

Note that when m_{X_f(t)}(t) \geq e_{X_f(t)}(t), we only need to use e_{X_f(t)}(t) units of green energy, keeping the remaining m_{X_f(t)}(t) - e_{X_f(t)}(t) units reserved at node X_f(t).

r: We set the reward at time slot t as the negative of the cost (14) defined in our Cost-Min problem, i.e.,

r(t) = -C(t).   (18)
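As an illustration of how the state (15) and the discrete action set (16)-(17) might be encoded for a DQN, the sketch below flattens the state into a vector and indexes actions as (target node, green-energy fraction) pairs. The fraction grid, the flattening order, and the joint node-fraction enumeration are assumptions of this example, not the authors' exact encoding.

```python
import itertools
import numpy as np

FRACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]   # an assumed choice for the set O

def encode_state(host_idx, green, arrivals):
    """host_idx: list X_f(t) per service; green: list G_n(t); arrivals: list lambda_n(t)."""
    return np.asarray(list(host_idx) + list(green) + list(arrivals), dtype=np.float32)

def action_space(num_nodes):
    """Enumerate (target node, green fraction) pairs; one Q-value per pair."""
    return list(itertools.product(range(num_nodes), FRACTIONS))

def decode_action(index, num_nodes, green):
    node, frac = action_space(num_nodes)[index]
    return node, frac * green[node]        # (X_f(t), m_{X_f(t)}(t)) for one service

# Example: 3 nodes, one service currently hosted on node 0.
s = encode_state([0], green=[25.0, 50.0, 10.0], arrivals=[1.0, 5.0, 2.0])
print(s.shape, len(action_space(3)))       # (7,) 15
```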
With the three key elements, the agent can interact with the environment and evaluate the policy π on the basis of the action-value function

Q(s(t), a(t)) = r(t) + γ Q(s(t+1), π(s(t+1))),   (19)

where Q(s(t+1), π(s(t+1))) contains the discounted future potential reward, directing the agent with a long-term view. The optimal policy always selects the action that obtains the maximal Q value, so the state-action value function can be updated according to a transition j = (s, a, r, s') as

Q(s, a) \leftarrow Q(s, a) + α [ r + γ \max_{a'} Q(s', a') - Q(s, a) ].   (20)

DQN applies a neural network with weights θ to fit the Q values so that it can generically evaluate state-action pairs that have never been seen before. Furthermore, DQN introduces an experience replay memory M and a target network with weights θ' to eliminate the correlation of experiences and improve the stability of evaluating the future reward. A minibatch B of experiences randomly sampled from M can thus be learned at once with learning rate α. Accordingly, the loss function is expressed as

L(θ) = \sum_{j \in B} ( r_j + γ \max_{a'} Q(s'_j, a'; θ') - Q(s_j, a_j; θ) )^2.   (21)

3.1 Prioritized Replay Memory Design
The agent needs to learn sufficient transition experiences before it can make appropriate decisions. However, with all possible service placement and energy scheduling decisions, it is impossible to try all possible actions. Some existing work shows that potentially good actions should be explored with a higher priority to accelerate the training procedure [14]. It is undeniable that prioritisation in sampling transitions from the experience replay does improve training; however, prioritizing transitions with a high temporal-difference (TD) error may cause the result to jitter during long-term training. We notice that the solution of the Cost-Min problem in Section 2.2 can also be utilized as a guideline to judge the efficiency of the actions made by the agent when it is not yet well trained. We define a normalized difference error σ between the instant reward r(t) and the solution Y_CM of problem Cost-Min as

σ = \frac{1}{|Y_CM - r(t)| + 1}.   (22)

Integrating the error σ with the temporal-difference (TD) error δ, defined as

δ = r + γ \max_{a'} Q(s', a'; θ') - Q(s, a; θ),   (23)

we define the priority of transition j as consisting of two parts, the CM-error |σ| related to the Cost-Min formulation and the TD error δ calculated by the agent:

p_j = τ · σ + (1 - τ) · δ,   (24)

where τ is a controlling parameter that balances the importance of σ and δ. Initially, we set τ = 1 for the inexperienced DQN. Once it has gained some experience, we attenuate τ to 0, since the well-trained DQN can then be trusted. The probability of sampling transition j is calculated as

P_j = \frac{p_j^β}{\sum_{k \in M} p_k^β},   (25)

where β is the prioritization exponent.
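A few lines of Python make the priority bookkeeping of (22)-(25) concrete; the function names, the use of the absolute TD error in the blend, and the exponent value are illustrative assumptions of this sketch.

```python
import numpy as np

def cm_error(reward, y_cm):
    """Normalized difference between the instant reward and the Cost-Min value, eq. (22)."""
    return 1.0 / (abs(y_cm - reward) + 1.0)

def td_error(reward, q_sa, q_next_max, gamma=0.9):
    """Temporal-difference error, eq. (23)."""
    return reward + gamma * q_next_max - q_sa

def priority(sigma, delta, tau):
    """Blend of CM-error and TD error, eq. (24); tau decays from 1 to 0 during training."""
    return tau * sigma + (1.0 - tau) * abs(delta)

def sampling_probs(priorities, exponent=0.6):
    """Sampling distribution over the replay memory, eq. (25)."""
    p = np.asarray(priorities, dtype=np.float64) ** exponent
    return p / p.sum()

# Example: three stored transitions, agent still early in training (tau close to 1).
sigmas = [cm_error(r, y_cm=-20.0) for r in (-25.0, -21.0, -40.0)]
deltas = [td_error(r, q_sa=0.0, q_next_max=0.0) for r in (-25.0, -21.0, -40.0)]
ps = [priority(s, d, tau=0.9) for s, d in zip(sigmas, deltas)]
print(sampling_probs(ps))
```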
Algorithm 1. Prioritized DRL-based Agent Training
1: INPUT: minibatch B, step-size η, replay period R
2: Randomly initialize the neural network Q(s, a | θ) with weights θ and the target network Q(s, a | θ') with weights θ'
3: Initialize replay memory M
4: Receive the initial observed state s(0) and make action a(0)
5: for t = 0 to T_episode do
6:   Generate action a(t) by exploration
7:   Interact with the environment using action a(t), obtain reward r(t) and new state s(t+1), and store (s(t), a(t), r(t), s(t+1)) → M with p_t = 1
8:   if t ≡ 0 mod R then
9:     for j = 1 to |B| do
10:      Sample transition j according to P_j = p_j^β / \sum_{k \in M} p_k^β
11:      Compute the importance-sampling weight w_j = (B · P_j)^{-ρ} / \max_{i \in B} w_i
12:      Compute the TD-error δ_j and CM-error σ
13:      Update the transition priority according to (22) and (24)
14:      Accumulate the weight-change Δ ← Δ + w_j · δ_j · ∇_θ Q(s_{j-1}, a_{j-1})
15:    end for
16:    Update the neural network weights θ ← θ + η · Δ and set Δ = 0
17:    Copy the weights to the target network every h steps: θ' ← θ
18:  end if
19: end for

3.2 Agent Training Algorithm
We summarize our pDRL algorithm in Algorithm 1, which incorporates the prioritized replay memory. To start the pDRL training procedure, we first define the minibatch B, step-size η, and replay period R in line 1. Then, we randomly set the weights of the evaluation network Q(s, a | θ) as θ, and the weights of the target network Q(s, a | θ') are cloned from the evaluation network (see line 2). Now, we are ready to start the agent training with the initial state s(0) and obtain an action a(0), as shown in line 4. This action a(0) is executed to get a reward r(0) and the resulting new state s(1). This procedure is repeated, and the new transitions (s(t), a(t), r(t), s(t+1)) are stored in the replay memory with an initial priority of 1 (line 7), as the newly generated transitions should always be explored first. Every R time slots, we sample |B| transitions according to their probabilities P_j and compute the importance-sampling weights w_j (see line 11). The priority is also updated as p_j = τ · σ + (1 - τ) · δ_j according to the current TD-error δ_j and CM-error σ in line 13. After that, as shown in lines 16 and 17, the weights of the evaluation network and the target network are also updated according to the accumulated weight-change calculated in line 14. The above procedure iterates until convergence or until the predefined episode bound is reached. Once the agent is well trained, it shall be ready to make the right decision towards energy cost minimization according to any real-time observation (i.e., state).
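A compact replay-memory class along the lines of Algorithm 1 might look as follows. Storing new transitions at the current maximum priority, sampling with the probabilities (25), and correcting with importance-sampling weights are the essential moves; the buffer layout, capacity handling, and exponent values are assumptions of this sketch.

```python
import numpy as np

class PrioritizedReplayMemory:
    """Minimal prioritized replay buffer in the spirit of Algorithm 1 (lines 7, 10-13)."""

    def __init__(self, capacity=4000, exponent=0.6, is_exponent=0.4):
        self.capacity, self.exponent, self.is_exponent = capacity, exponent, is_exponent
        self.data, self.prios = [], []

    def store(self, transition):
        # New transitions enter with the current maximum priority (line 7).
        self.data.append(transition)
        self.prios.append(max(self.prios, default=1.0))
        if len(self.data) > self.capacity:          # drop the oldest entry
            self.data.pop(0)
            self.prios.pop(0)

    def sample(self, batch_size):
        probs = np.asarray(self.prios, dtype=np.float64) ** self.exponent
        probs /= probs.sum()                         # eq. (25)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (batch_size * probs[idx]) ** (-self.is_exponent)
        weights /= weights.max()                     # importance-sampling weights (line 11)
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, new_prios):
        for i, p in zip(idx, new_prios):             # priorities from eq. (24), lines 12-13
            self.prios[i] = float(p)
```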
4 PERFORMANCE EVALUATION

To validate the efficiency of our prioritized DRL-based algorithm, we conduct extensive simulations and report the results in this section. The simulated environment and all algorithms are implemented in Python 3.7. The neural networks involved in the DRL-based algorithms are implemented using the TensorFlow 1.9 framework, and the optimization problem of CM is solved using Gurobi 9.0.2. All the experiments are conducted on a server equipped with a 1.80GHz 8-core Intel Core i7-8550U processor. Our pDRL consists of two networks, namely eval_net and target_net, which have the same structure. The input size is the dimension of the state, and the input layer is connected to three fully connected layers called n_l1, n_l2, and n_l3. The size of the first layer n_l1 is 236, the size of n_l2 is 118, and that of n_l3 is 59. The activation function is ReLU and the output size is the number of edge nodes. The parameters of target_net are updated from eval_net during agent training. The default settings are summarized in Table 2.

TABLE 2
Default Settings

Parameter                                               Value
The cost of processing a request                        0.2
The cost of migrating a VM over one hop                 15
The cost of one request's communication over one hop    0.1
The weight of green energy cost in the reward           1
The weight of brown energy cost in the reward           1000
The number of replayed transitions per iteration        512
The size of the prioritized replay memory               4000
The number of fully connected layers                    3
The learning rate                                       0.0001
The discount rate                                       0.9
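Under the description above, the eval_net/target_net pair could be built roughly as in the TensorFlow 1.x sketch below; the placeholder names, the optimizer choice, and the hard copy of weights are assumptions for illustration and not the authors' exact code.

```python
import tensorflow as tf  # TensorFlow 1.x style, matching the reported TF 1.9 setup

def build_q_network(state, n_actions, scope):
    """Three fully connected ReLU layers (236, 118, 59) and a linear Q-value head."""
    with tf.variable_scope(scope):
        h1 = tf.layers.dense(state, 236, activation=tf.nn.relu, name="n_l1")
        h2 = tf.layers.dense(h1, 118, activation=tf.nn.relu, name="n_l2")
        h3 = tf.layers.dense(h2, 59, activation=tf.nn.relu, name="n_l3")
        return tf.layers.dense(h3, n_actions, name="q")   # one Q-value per edge node

state_dim, n_actions = 7, 3                      # illustrative sizes only
s = tf.placeholder(tf.float32, [None, state_dim], name="state")
s_next = tf.placeholder(tf.float32, [None, state_dim], name="state_next")
q_target = tf.placeholder(tf.float32, [None, n_actions], name="q_target")

q_eval = build_q_network(s, n_actions, "eval_net")
q_next = build_q_network(s_next, n_actions, "target_net")

loss = tf.reduce_mean(tf.square(q_target - q_eval))     # squared TD loss, cf. eq. (21)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

# Periodically copy eval_net weights into target_net (Algorithm 1, line 17).
eval_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="eval_net")
target_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="target_net")
sync_target = [tf.assign(t, e) for t, e in zip(target_vars, eval_vars)]
```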
4.1 Simulation Result
We consider two well-known network topologies, i.e., the ARPA network and the NSF network, consisting of 9 and 14 edge servers, respectively. Each server is associated with an amount of requests and green energy at each time slot. We set 5 different types of service requests drawn from the real world, including Uber requests, public work requests, food order requests, and base station requests. As shown in Fig. 2a, these requests follow different distribution patterns. The solid line indicates the average amount of one service request arriving at all the edge nodes, and the shaded area is the variation range of the request amount. As we can see, different service requests have different amounts and trends throughout the day.
The green energy generation trace is extracted from ENTSO-E (https://transparency.entsoe.eu/dashboard/show), including solar energy and wind energy. To be practical, we set the time slots from 17 to 23 as the peak hours, and the electricity prices in this period are tripled.

Fig. 2. Different request patterns.

The communication cost and migration cost are set according to the hops between any two edge servers, and different services have different unit costs. We evaluate the performance of the traditional model-based algorithm that minimizes the instant cost of each time slot in (14) ("Instant"), the greedy algorithm that chooses the node with the maximum remaining green energy ("Greedy"), the DRL-based algorithm ("DRL"), the Deep Reinforcement Learning for Online Computation Offloading algorithm ("DROO"), Prioritized Experience Replay DQN ("PER"), and our prioritized DRL-based algorithm ("pDRL").

First, Figs. 3a and 4a show the convergence trend of the episode rewards obtained by our pDRL, DRL, PER, and DROO algorithms. Taking the NSF network as an example, we trained these neural networks for 40000 episodes. With the increasing number of training episodes, both DRL and pDRL eventually converge. It can be seen that our pDRL converges much faster than the other DRL algorithms, converging by the 13000th episode. Moreover, our pDRL reaches a lower cost than the other DRL algorithms, namely 17761 for pDRL. This is because our pDRL prioritizes the transitions according to their rewards, and the good transitions can be learned more effectively, leading to faster convergence and lower cost. Fig. 3a shows the same trend on the ARPA network, validating the effectiveness of our pDRL on different network topologies.

Fig. 3. Simulation results on ARPA network.
Fig. 4. Simulation results on NSF network.

Decision-making time is a critical metric, determining the availability of an algorithm. Hence, we present the average execution time of the algorithms on the two different network topologies. As shown in Figs. 4b and 3b, all DRL-based algorithms use neural networks to make the placement decisions and can complete decisions in a short time, i.e., DRL and our pDRL spend 0.06ms and 0.07ms on both the NSF and ARPA networks, respectively, while PER and DROO take 0.08ms on the two networks. The Greedy algorithm, which primarily sorts the nodes according to their remaining green energy, has a time complexity of O((|F| + |N|) log|N|), increasing almost linearly with the number of services and edge nodes. It can be seen that its execution time is 0.02ms on the ARPA network in Fig. 3b and 0.02ms on the NSF network in Fig. 4b, respectively. The Instant algorithm, in contrast, is executed by Gurobi's ILP solver, which solves the problem by heuristically exploring the solution trees to obtain the optimal one. Consequently, its time complexity is almost O(|N|^|F|), increasing exponentially with the number of services or edge nodes. As a result, it needs 0.17ms more when the network topology expands from 9 nodes to 14 nodes. For the DRL-based algorithms, the results show no obvious execution time increase with the network scale, because the execution time of a neural network is only related to its structure. Hence they can be applied to large-scale networks and provide a fast response.

Then, we show the green energy consumption and the total cost of the algorithms. Figs. 3c and 4c show the green energy consumptions over the 24 time slots, and Figs. 3d and 4d show the costs. In our experiments, the brown energy price is tripled during peak hours, i.e., time slots 17 to 23, compared to off-peak hours, i.e., time slots 0 to 16. When there is insufficient green energy, the Greedy and Instant algorithms myopically focus on optimizing the on-site energy cost based on the present network state without any consideration of future probabilities. So, they tend to maximize the green energy utility at each off-peak time slot, as shown in Figs. 3c and 4c. When the peak hours with higher electricity prices arrive, there is not sufficient green energy available and more brown energy is needed, leading to a large cost as shown in Figs. 3d and 4d. The total costs of Greedy and Instant are 36224 and 27529, respectively. As for the DRL-based algorithms, they can learn from past experiences and evaluate the probabilities of future electricity price changes. With this knowledge, they decide to use less green energy in time slots
0 to 16, and reserve part of the green energy for time slots 17 to 23 with a higher electricity price, as shown in Figs. 3c and 4c. As a result, the reserved green energy can be used during peak hours, and the brown energy cost of pDRL and the other DRL-based algorithms is much lower than that of Greedy and Instant, especially in time slots 17 to 22. Furthermore, compared with the other DRL-based algorithms, our pDRL has an advanced prioritized replay memory to learn better transitions, and consequently manages the services and schedules the energy more efficiently. The total cost of pDRL is further reduced in Figs. 3d and 4d, compared with the other DRL-based algorithms. So we can conclude that our pDRL algorithm deals with the environmental dynamics and shows its advantages under this real-world trace.

More generally, we evaluate our pDRL with 3 different types of service request patterns, as shown in Figs. 2b, 2c, and 2d. Similar to Fig. 2a, the solid line indicates the average amount of service requests, and the shaded area represents the variation range. The request pattern in Fig. 2b follows a uniform distribution, and each service has different upper and lower bounds. The request pattern in Fig. 2c follows a normal

distribution, and each service has a different mean and variance. The number of service requests in Fig. 2d varies with time, presenting a Gaussian distribution in the time dimension, with a certain range of random fluctuations around the Gaussian value. The performance of the algorithms on the NSF network is listed in Tables 3 and 4.

TABLE 3
Green Energy Consumptions Under Different Request Patterns

Algorithm \ Pattern   real world   uniform   normal   time dependent
Greedy                36898        45266     48373    50245
Instant               36746        44935     48514    49305
DRL                   34216        33460     41306    43687
PER [14]              32473        33999     34459    37684
DROO [15]             33188        34903     40035    44330
pDRL                  35660        39010     43715    46800

TABLE 4
Costs Under Different Request Patterns

Algorithm \ Pattern   real world   uniform   normal   time dependent
Greedy                36224        32644     46062    40885
Instant               27529        16208     32223    30174
DRL                   25325        39068     48936    48421
PER [14]              28259        32662     52711    54460
DROO [15]             24829        46738     44402    47344
pDRL                  21638        30139     44360    41784

Table 3 shows the total green energy consumption of the algorithms. We can find that the green energy consumption of pDRL is always higher than that of the other DRL-based algorithms. This is because, with the prioritized replay memory design, our pDRL can learn better than DRL within the same number of training episodes, and hence can find better service locations and make better energy scheduling decisions. It is also noticeable that Greedy and Instant consume more green energy than the four DRL-based algorithms. The reason is that they always try to migrate services to the nodes with more green energy and use as much green energy as possible in each time slot, while frequent migration results in more energy consumption, and especially more brown energy cost when green energy is insufficient, as shown in Table 4. Table 4 also shows that the cost of pDRL always outperforms the other algorithms, followed by DRL. Since Instant myopically finds a single-time-slot optimal solution and is second only to the DRL-based algorithms, Greedy always shows the worst performance.

On average, our pDRL accelerates the convergence speed by 37.67% and reduces the long-term OPEX by 32.70%, compared with state-of-the-art studies. The training acceleration comes from the fact that our pDRL prioritizes the transitions according to their rewards. Moreover, the solution of the Cost-Min problem in Section 2.2 is used as a guideline to judge the quality of the actions. As a result, good transitions can be learned more effectively by the agent and the training procedure is accelerated. As for the lower cost, it can be seen from Figs. 3c and 4c that our pDRL saves green energy from time slots 0 to 16, when the price is lower, for future use, while the Greedy and Instant algorithms myopically focus on optimizing the on-site energy cost based on the present network state without any consideration of future probabilities. For example, at time slot 7, the green energy consumption of Greedy and Instant on the ARPA network is 1206 and 1125, respectively, while that of our pDRL is 563. With such long-term energy scheduling, during the peak hours with higher electricity prices, our pDRL has saved sufficient green energy and consumes less brown energy, leading to a lower cost compared with the Greedy and Instant solutions.

5 RELATED WORK

5.1 A Review on Edge Energy Management
With the ever-growing computing demand, the energy consumption of edge computing has become a notable problem. It is important for edge servers to properly manage energy while scheduling resources. For example, [16] studies how to balance mobile edge computing (MEC) system performance and energy consumption by properly migrating services between edge servers under request delay constraints. Zhang et al. [17] leverage DVFS to adjust the power consumption of the VMs, with a certain service failure probability. Guo et al. [18] propose an efficient request offloading and resource allocation strategy to reduce energy consumption and application completion time on smart mobile devices. Chen et al. [19] focus on energy-efficient offloading in MEC and propose a novel algorithm to reduce the total energy consumption. Guo et al. [20] study how to adjust the coverage range and allocate the channel resources of base stations with energy-harvesting (EH) power supplies, to reduce the brown energy cost. Xiong et al. [21] focus on a UAV system supported by wireless energy and data transfer and propose a novel DRL-based approach for the long-term utility of the UAV.

For energy-constrained edge devices, offloading requests to nearby edge servers is a feasible solution to prolong device lifetime. Sun et al. [22] focus on the optimization of the device scheduling problem under energy limitations and devise an energy-aware dynamic scheduling algorithm. Xu et al. [23] investigate request offloading of AI applications in MEC with the consideration of energy scheduling and propose an energy-aware strategy. Liu et al. [24] consider how to divide the devices into multiple edge sub-networks, each equipped with a green-powered MEC server, to minimize the operation cost with limited green energy. Geng et al. [25] leverage the big.LITTLE architecture of multicore-based edge devices to jointly make request offloading and local scheduling decisions, scheduling requests to the appropriate CPU core according to request priority, to reduce the energy consumption while satisfying the completion time constraints.

Leveraging energy harvesting technology is another viable way to lengthen the device lifetime. For example, [26] studies the trade-off between request latency and energy consumption in MEC, and proposes different request scheduling methods. However, most of the existing solutions for energy management are model-based, with simplified networks or assumptions on the environment dynamics; besides, they mainly target one-shot optimization, which may fail to adapt well in dynamically changing real scenarios.

5.2 A Review on RL in Edge Computing
Reinforcement learning, as a model-free approach without any prior knowledge, can automatically learn the dynamics and make appropriate decisions accordingly at runtime.
Many existing works leverage different RL-based algorithms for edge caching. Qian et al. [27] propose a divide-and-conquer content push and cache strategy based on RL, where Q-learning decides which file to cache and DQN decides at which BS to cache the content; this framework effectively learns the content popularity and predicts the future user demands. Wang et al. [28] propose a DQN-based algorithm with attention-weighted federated learning, which selects the cache nodes and the replacement of cache contents in mobile networks, with the goal of improving the cache hit rate. Zheng et al. [29] cast the edge caching problem as a MINLP problem and propose an RL-based method to solve it with the goal of minimizing the total energy consumption.

Request/task offloading and resource allocation are also widely studied in edge computing, and DRL is a particularly effective strategy to tackle this complex control problem. For example, Chen et al. [30] utilize DRL to jointly decide energy-efficient request offloading and computing resource allocation, which achieves a great reduction in energy consumption compared with traditional algorithms. Liu et al. [31] use DRL to solve the problem of load balancing with the consideration of communication and computation requirements. With the goal of achieving close-to-optimal performance, [32], [33] solve complex request offloading problems based on RL under various constraints, all of them achieving better performance than traditional heuristic or convex optimization algorithms. Taking [34] as an example, the authors propose a novel deep Q-learning-based system to solve the problem of resource provisioning and task scheduling with the goals of improving energy cost efficiency and fast convergence. RL is also used for traditional ICT system management. Liu et al. [35] study the problem of sub-network division in the IoT and leverage DQN to deal with the dynamic load balancing of edge servers. Kim et al. [36] design AutoScale, based on a reinforcement learning algorithm, to decide whether to run inference on the CPU or on co-processors so as to improve energy efficiency in mobile systems.

RL-based solutions have shown great improvements in edge computing resource management that are difficult for traditional model-based methods to achieve. This inspires us to introduce RL into the field of energy scheduling and service management in edge computing. However, taking advantage of RL in this field is limited by the efficiency of trial-and-error; how to make effective use of the vast amount of samples required for agent training in energy scheduling and service management is our goal.

6 CONCLUSION

In this paper, we investigate the problem of service management and energy scheduling in edge computing. To minimize the long-term energy cost, every decision should be made carefully in a foresighted way by taking the current network state and future dynamics into account. It is hard to use a traditional model-based algorithm to address such a problem. Instead, we resort to the newly proposed DRL technique, i.e., DQN, to design a model-free solution. To accelerate the convergence of DQN training, we further customize it by leveraging the traditional model-based solution as a guideline to update the transition sample priorities. Different from prioritized replay memory based on the temporal-difference error, pDRL utilizes the episode reward of a transition sample as the difference error, to adapt to the sparsity of transition samples with high rewards in real scenarios. Extensive experiments verify the correctness of our design and the efficiency of our algorithm, as it outperforms traditional model-based algorithms. Our study indicates that DRL offers the possibility of escaping from model-based solutions with assumptions. A trained agent can take a picture of the whole network as input and make control decisions according to the desired objective for complex network control problems. However, we shall not only simply apply the DRL algorithm, but shall also customize it well according to the characteristics of the control problem. In future work, several approaches could be extended from this work. For example, the unused green energy could be transferred to neighbour servers to improve energy efficiency. Moreover, pDRL could be exploited in mobile edge computing scenarios assisted by multiple agents to support distributed service management and energy scheduling in large-scale networks.

REFERENCES

[1] M. Lemay, K. K. Nguyen, B. S. Arnaud, and M. Cheriet, "Toward a zero-carbon network: Converging cloud computing and network virtualization," IEEE Internet Comput., vol. 16, no. 6, pp. 51-59, Nov./Dec. 2012.
[2] P. Steenhof et al., "A protocol for quantifying the carbon reductions achieved through the provision of low or zero carbon ICT services," Sustain. Comput.: Inform. Syst., pp. 23-32, 2012.
[3] A. A. Chien, R. Wolski, and F. Yang, "Zero-carbon cloud: A volatile resource for high-performance computing," in Proc. IEEE Int. Conf. Comput. Informat. Technol., 2015, pp. 1997-2001.
[4] F. Yang and A. A. Chien, "ZCCloud: Exploring wasted green power for high-performance computing," in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2016, pp. 1051-1060.
[5] G. Zhang, W. Zhang, Y. Cao, D. Li, and L. Wang, "Energy-delay tradeoff for dynamic offloading in mobile-edge computing system with energy harvesting devices," IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4642-4655, Oct. 2018.
[6] J. Mineraud, L. Wang, S. Balasubramaniam, and J. Kangasharju, "Hybrid renewable energy routing for ISP networks," in Proc. IEEE Int. Conf. Comput. Commun., 2016, pp. 1-9.
[7] F. Guo, L. Ma, H. Zhang, H. Ji, and X. Li, "Joint load management and resource allocation in the energy harvesting powered small cell networks with mobile edge computing," in Proc. IEEE Conf. Comput. Commun. Workshops, 2018, pp. 299-304.
[8] S. Conti, G. Faraci, R. Nicolosi, S. A. Rizzo, and G. Schembra, "Battery management in a green fog-computing node: A reinforcement-learning approach," IEEE Access, vol. 5, pp. 21126-21138, 2017.
[9] Y. Yang, D. Wang, D. Pan, and M. Xu, "Wind blows, traffic flows: Green internet routing under renewable energy," in Proc. IEEE Conf. Comput. Commun., 2016, pp. 1-9.
[10] Y. Mao, J. Zhang, and K. B. Letaief, "Dynamic computation offloading for mobile-edge computing with energy harvesting devices," IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590-3605, Dec. 2016.
[11] L. Chen, S. Zhou, and J. Xu, "Computation peer offloading for energy-constrained mobile edge computing in small-cell networks," IEEE/ACM Trans. Netw., vol. 26, no. 4, pp. 1619-1632, Aug. 2018.
[12] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354-359, 2017.
[13] Z. Xiong, Y. Zhang, D. Niyato, R. Deng, P. Wang, and L. Wang, "Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges," IEEE Veh. Technol. Mag., vol. 14, no. 2, pp. 44-52, Jun. 2019.
[14] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," in Proc. Int. Conf. Learn. Representations (Poster), 2015.
[15] L. Huang, S. Bi, and Y.-J. A. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Trans. Mobile Comput., vol. 19, no. 11, pp. 2581-2593, Nov. 2020.
[16] H. Badri, T. Bahreini, D. Grosu, and K. Yang, "Energy-aware application placement in mobile edge computing: A stochastic optimization approach," IEEE Trans. Parallel Distrib. Syst., vol. 31, no. 4, pp. 909-922, Apr. 2020.
[17] W. Zhang, Z. Zhang, S. Zeadally, H. Chao, and V. C. M. Leung, "Energy-efficient workload allocation and computation resource configuration in distributed cloud/edge computing systems with stochastic workloads," IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1118-1132, Jun. 2020.
[18] S. Guo, J. Liu, Y. Yang, B. Xiao, and Z. Li, "Energy-efficient dynamic computation offloading and cooperative task scheduling in mobile cloud computing," IEEE Trans. Mobile Comput., vol. 18, no. 2, pp. 319-333, Feb. 2019.
[19] X. Chen, J. Zhang, B. Lin, Z. Chen, K. Wolter, and G. Min, "Energy-efficient offloading for DNN-based smart IoT systems in cloud-edge environments," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 3, pp. 683-697, Mar. 2022.
[20] F. Guo, L. Ma, H. Zhang, H. Ji, and X. Li, "Joint load management and resource allocation in the energy harvesting powered small cell networks with mobile edge computing," in Proc. IEEE Conf. Comput. Commun., 2018, pp. 299-304.
[21] Z. Xiong et al., "UAV-assisted wireless energy and data transfer with deep reinforcement learning," IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 85-99, Mar. 2021.
[22] Y. Sun, S. Zhou, Z. Niu, and D. Gündüz, "Dynamic scheduling for over-the-air federated edge learning with energy constraints," IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 227-242, Jan. 2022.
[23] Z. Xu et al., "Energy-aware inference offloading for DNN-driven applications in mobile edge clouds," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 4, pp. 799-814, Apr. 2021.
[24] Y. Liu, S. Xie, Q. Yang, and Y. Zhang, "Joint computation offloading and demand response management in mobile edge network with renewable energy sources," IEEE Trans. Veh. Technol., vol. 69, no. 12, pp. 15720-15730, Dec. 2020.
[25] Y. Geng, Y. Yang, and G. Cao, "Energy-efficient computation offloading for multicore-based mobile devices," in Proc. IEEE Conf. Comput. Commun., 2018, pp. 46-54.
[26] J. Luo et al., "Container-based fog computing architecture and energy-balancing scheduling algorithm for energy IoT," Future Gener. Comput. Syst., vol. 97, pp. 50-60, 2019.
[27] Y. Qian, R. Wang, J. Wu, B. Tan, and H. Ren, "Reinforcement learning-based optimal computing and caching in mobile edge network," IEEE J. Sel. Areas Commun., vol. 38, no. 10, pp. 2343-2355, Oct. 2020.
[28] X. Wang, R. Li, C. Wang, X. Li, T. Taleb, and V. C. M. Leung, "Attention-weighted federated deep reinforcement learning for device-to-device assisted heterogeneous collaborative edge caching," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 154-169, Jan. 2021.
[29] H. Zheng, H. Zhou, N. Wang, P. Chen, and S. Xu, "Reinforcement learning for energy-efficient edge caching in mobile edge networks," in Proc. IEEE Conf. Comput. Commun. Workshops, 2021, pp. 1-6.
[30] X. Chen and G. Liu, "Energy-efficient task offloading and resource allocation via deep reinforcement learning for augmented reality in mobile edge networks," IEEE Internet Things J., vol. 8, no. 13, pp. 10843-10856, Jul. 2021.
[31] Q. Liu, T. Xia, L. Cheng, M. van Eijk, T. Ozcelebi, and Y. Mao, "Deep reinforcement learning for load-balancing aware network control in IoT edge systems," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 6, pp. 1491-1502, Jun. 2022.
[32] S. Wang, Y. Guo, N. Zhang, P. Yang, A. Zhou, and X. Shen, "Delay-aware microservice coordination in mobile edge computing: A reinforcement learning approach," IEEE Trans. Mobile Comput., vol. 20, no. 3, pp. 939-951, Mar. 2021.
[33] J. Wang, J. Hu, G. Min, A. Y. Zomaya, and N. Georgalas, "Fast adaptive task offloading in edge computing based on meta reinforcement learning," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 1, pp. 242-253, Jan. 2021.
[34] M. Cheng, J. Li, and S. Nazarian, "DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers," in Proc. 23rd Asia South Pacific Des. Automat. Conf., 2018, pp. 129-134.
[35] Q. Liu, L. Cheng, T. Ozcelebi, J. Murphy, and J. Lukkien, "Deep reinforcement learning for IoT network dynamic clustering in edge computing," in Proc. IEEE/ACM Int. Symp. Cluster, Cloud Grid Comput., 2019, pp. 600-603.
[36] Y. G. Kim and C.-J. Wu, "AutoScale: Energy efficiency optimization for stochastic edge inference using reinforcement learning," in Proc. Annu. IEEE/ACM Int. Symp. Microarchit., 2020, pp. 1082-1096.

Lin Gu (Member, IEEE) received the MS and PhD degrees in computer science from the University of Aizu, Fukushima, Japan, in 2011 and 2015. She is currently an associate professor with the School of Computer Science and Technology, Huazhong University of Science and Technology, China. Her current research interests include serverless computing, network function virtualization, cloud computing, software-defined networking, and data center networking. She has authored 2 books and more than 40 papers in refereed journals and conferences in these areas. She is a senior member of CCF.

Weiying Zhang received the BE degree from Northeastern University in 2019 and the MS degree from the Huazhong University of Science and Technology in 2022. Her current interests mainly focus on service management and energy scheduling.

Zhongkui Wang received the BE degree from Jilin University in 2018 and the MS degree from the Huazhong University of Science and Technology in 2021. His interests mainly focus on computing resource management and reinforcement learning.

Deze Zeng (Member, IEEE) is currently a full professor with the School of Computer Science, China University of Geosciences, Wuhan, China. His current research interests mainly focus on edge computing, cloud computing, and IoT. He has authored 3 books and more than 120 papers in refereed journals and conferences in these areas. He has also received 5 best paper awards from IEEE/ACM conferences and journals. He serves on the editorial boards of IEEE Transactions on Sustainable Computing, Journal of Network and Computer Applications, Frontiers of Computer Science, and Open Journal of the Computer Society. He has been in the organization or program committees of many international conferences. He is a senior member of CCF.

Hai Jin (Fellow, IEEE) received the PhD degree in computer engineering from HUST in 1994. He is a chair professor of computer science and engineering with the Huazhong University of Science and Technology (HUST), China. In 1996, he was awarded a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. He worked at The University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. He is a Fellow of CCF and a life member of the ACM. He has co-authored more than 20 books and published more than 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security.