1. Introduction
The rapid development of Internet of Things (IoT) technology has driven the global proliferation of various smart mobile devices, leading to the emergence of numerous smart applications in our daily lives, such as smart homes [
1] and smart healthcare [
2]. These applications significantly improve the Quality of Experience (QoE) for users. However, they often generate a large volume of computation tasks, exceeding the data processing and computation capabilities of resource-constrained mobile devices [
3]. This limitation severely impacts the performance and Quality of Service (QoS) of the applications. In response to these challenges, multi-access edge computing (MEC) technology has emerged as a solution. By deploying computation, storage, and other resources closer to mobile devices, MEC enables the processing of data near its point of generation [
4], significantly reducing delay and energy consumption, and enhancing data processing efficiency [
5]. Despite the numerous benefits brought by MEC, efficiently managing MEC resources and determining the optimal offloading strategy to improve QoS has become a key issue [
6].
To address these issues, some studies have focused on enhancing QoS by minimizing system delay and energy consumption in single access point environments [
7,
8,
9]. However, multi-access point environments are closer to reality, where computation tasks are often assumed to be fine-grained. Depending on factors such as data volume and network conditions, tasks with larger data volumes are offloaded to the nearest edge servers with data processing capabilities, while tasks with smaller data volumes are offloaded to more distant servers for processing. This approach achieves relatively low delay, energy consumption, and task discard rates, but it also has drawbacks. For instance, if an attacker were to monitor the edge servers, they could infer a user’s exact location from the locations of multiple edge servers and the user’s offloading preferences, thereby leaking the user’s location privacy [
10].
Existing studies rarely consider the aforementioned issues collectively, and often focus on only one or a few of them [
8,
11,
12]. Hence, our research takes a holistic approach by considering privacy protection, delay, energy consumption, and the task discard rate in a multi-access point network to enhance the QoS. However, in a complex and dynamic environment such as MEC, it is particularly difficult to solve the above problems using traditional methods such as Lyapunov optimization methods [
13], convex optimization methods [
14], and heuristic techniques [
15]. In this scenario, deep reinforcement learning (DRL) emerges as a novel solution for computation offloading in MEC, thanks to its exceptional ability to tackle complex decision-making problems [
16]. DRL combines the decision-making capability of reinforcement learning (RL) with the representation learning capability of deep learning (DL), enabling it to learn the optimal policy through interaction with the environment, without explicit instructions [
17].
Hence, this paper proposes a privacy-preserving computation offloading scheme based on DRL, and the main contributions are as follows:
- (1)
Considering multiple performances of the MEC system, this study formulates a multi-objective optimization problem aimed at maximizing the QoS within a multi-access point environment;
- (2)
For the multi-objective optimization problem, this paper proposes a computation offloading algorithm named TD3-SN-PER. The algorithm features two independent critic networks, a design choice aimed at mitigating overestimation bias. To enhance learning efficiency and stability and to address the correlation among experience samples, state normalization and prioritized experience replay are integrated, ensuring that the algorithm can better converge to the globally optimal computation offloading policy;
- (3)
Extensive experimental results demonstrate that the TD3-SN-PER algorithm significantly improves system performance. Compared with other approaches, the algorithm consistently achieves the best QoS, even as the number of users and the task arrival rate increase substantially.
The remainder of this document is structured as follows:
Section 2 discusses recent related research,
Section 3 outlines our system model and the optimization problem,
Section 4 details the algorithm we propose,
Section 5 delves into the analysis of experimental outcomes, and
Section 6 offers a conclusion to the entire study.
2. Related Works
Numerous studies in MEC propose solutions to tackle the challenges in the computation offloading process. Various problems often require distinct optimization objectives. In scenarios sensitive to delays, minimizing delay is typically the primary objective. For example, Song et al. [
18] used Branch and Bound (BnB) and a multi-objective particle swarm optimization algorithm to solve the offloading decision and bandwidth allocation problems and minimize the delay. Li et al. [
19] modeled the task offloading problem as a constrained Markov decision process (CMDP) and proposed a prioritized experience replay and dueling double deep Q-network-based algorithm to solve the CMDP problem.
Existing studies often focus on the joint optimization of delay and energy consumption. Liao et al. [
20] proposed an online DRL algorithm aimed at reducing long-term energy consumption and delay by jointly deciding the transmit power for computation offloading, the CPU frequency, and the offloading decision. Avgeris et al. [
21] proposed a two-phase DRL scheme. In the first phase, each user device independently decides to offload the task to a connected edge server or execute it locally. If the task is offloaded, it proceeds to the second phase, where load balancing is achieved by transferring the task between different edge servers. Some studies explore performance metrics beyond delay and energy consumption. For instance, Ref. [
22] proposed a predictive algorithm by combining a long short-term memory network with DRL to reduce the system task discard rate, delay, and energy consumption.
With users increasingly focusing on their privacy, protecting user privacy during computation offloading has emerged as a critical issue, and several works have been devoted to it. Ju et al. [
23] proposed a secure DRL-based computation offloading scheme by utilizing the spectrum-sharing architecture and physical layer security techniques. Lang et al. [
24] ensured the synchronization and invariance of the offloading data by applying blockchain technology to the collaborative computation offloading of on-board MEC.
In addition to differing optimization objectives, studies also differ in how tasks are offloaded. For example, some studies are oriented towards binary offloading [
8,
25]. Zheng et al. [
8] decomposed the delay minimization problem into three subproblems, achieving a fast near-optimal offloading decision under varying wireless channel conditions. Sun et al. [
25] proposed a multi-branch network-based DQN algorithm to address the problem of the number of actions in the system increasing combinatorially with the number of mobile devices. Other studies are oriented toward partial offloading [
26,
27]. Sun et al. [
26] improved the system’s utility by optimizing the servers’ resource allocation and load balancing. Wang et al. [
27] focused on how to efficiently offload dependent subtasks, and proposed a heuristic computation offloading scheduling scheme to offload appropriate subtasks to the server. Numerous existing studies have shown that dividing a task into multiple subtasks can reduce computation delay [
28].
There are also many research works on different network architectures for MEC under different application requirements. Ke et al. [
29] proposed a distributed multi-agent DRL algorithm to optimize bandwidth allocation and computation offloading strategies by training neural networks in a decentralized manner. However, it is a challenge for distributed nodes to learn collaborative strategies [
30], for which an effective approach is to combine centralized and distributed mechanisms. For example, Yao et al. [
6] proposed an experience-sharing offloading algorithm based on DRL in a distributed architecture. Wu et al. [
10] proposed a multi-agent DRL algorithm (JODRL-PP) to improve system performance while protecting user privacy. However, the scheme is prone to the overestimation bias problem and ignores the differing importance of experience samples.
Based on the analysis of the above studies, it is evident that existing computation offloading schemes can effectively optimize system performance. However, these schemes exhibit certain limitations. Some focus on optimizing a single performance metric, overlooking the importance of other metrics. Others consider multiple metrics but fail to address overestimation bias and the differing importance of experience samples. To overcome these shortcomings, this paper proposes the TD3-SN-PER algorithm, which merges clipped double Q-networks, state normalization, and prioritized experience replay. The algorithm improves training by partitioning each offloading task into multiple subtasks and exploiting global information, and it seeks the globally optimal computation offloading strategy by jointly evaluating multiple system performance metrics.
3. System Model and Problem Formulation
In this paper, we consider a cell scenario with multiple edge servers and multiple devices, as shown in
Figure 1. Assume that there are a total of N user devices and a total of M edge servers, and that the computation tasks generated by the devices are fine-grained. A decision variable is defined to indicate each device's offloading mode: the tasks may be executed entirely locally; offloaded entirely to the edge servers for processing, in which case a task can be assigned to multiple edge servers at the same time; or processed partially on the local device and partially at the edge servers.
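As an illustration of this fine-grained splitting (using notation introduced here for illustration, which may differ from the paper's own symbols), the per-slot task split can be written as
$$
d_n^{\mathrm{loc}}(t) + \sum_{m=1}^{M} d_{n,m}(t) \le D_n(t),
$$
where $D_n(t)$ is the task volume generated by device $n$ in slot $t$, $d_n^{\mathrm{loc}}(t)$ is the portion processed locally, and $d_{n,m}(t)$ is the portion offloaded to edge server $m$; full local execution, full offloading, and partial offloading then correspond to $\sum_m d_{n,m}(t)=0$, $d_n^{\mathrm{loc}}(t)=0$, and both parts being positive, respectively.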
3.1. Computation Model
When the local device executes the computation task, there is no data transmission, and thus no transmission delay or transmission energy consumption during task processing. Defining the number of CPU cycles that the user device can process in each time slot t, the delay required for local task processing can be described as
where
C denotes the CPU cycles necessary to process 1 bit of data. The energy consumption needed for the user device to execute the task can be denoted as
where
is the calculated power of the device. When the computation tasks are offloaded to the edge server for processing, the task data are transmitted by the wireless link. During data transmission, the user has some mobility. In this regard, the wireless channel gain during offloading is denoted as
where
and
are the coordinates of device
n and edge server
m, and
is the base channel gain at one meter from the edge server. In this regard, the signal-to-interference-plus-noise ratio during data transmission can be described as
where
denotes the transmission power during task offloading, and
denotes the channel noise. So, the rate of data transmission can be described as
where
is the channel bandwidth allocated to the user device. Based on the calculated data transmission rate, the transmission delay during task offloading can be expressed as
Define the computation frequency of the edge server as
. The delay incurred by the edge server in processing a task can be expressed as
The delay generated during task offloading includes data uploading delay, data processing delay, and result feedback delay, but since the result feedback data are much smaller than the task uploading data, the result feedback delay is often not considered in the calculation process. In this regard, the total delay of offloading the task to the edge server for processing can be expressed as
Taken together, the delay required to complete the task can be expressed as
Based on the transmission power and transmission delay, the transmission energy consumption can be expressed as
Since this paper aims to maximize the QoS, it focuses only on the energy consumption of the user’s device, which can be expressed as
During task processing, there are often time and energy constraints; when a task cannot be completed under these constraints, it is discarded. The amount of discarded tasks can be expressed as
where
is the time slot length,
is the maximum number of tasks that the local device can compute,
is the maximum number of tasks that can be offloaded, and the function
represents
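For concreteness, the following is a sketch of the standard forms these delay and energy quantities typically take in MEC models; the symbols below are illustrative assumptions and may differ from the paper's original notation.
$$
T_n^{\mathrm{loc}}(t) = \frac{d_n^{\mathrm{loc}}(t)\,C}{f_n^{\mathrm{loc}}}, \qquad
E_n^{\mathrm{loc}}(t) = p_n^{\mathrm{loc}}\, T_n^{\mathrm{loc}}(t),
$$
$$
r_{n,m}(t) = B_{n,m}\log_2\!\bigl(1+\mathrm{SINR}_{n,m}(t)\bigr), \qquad
T_{n,m}^{\mathrm{tx}}(t) = \frac{d_{n,m}(t)}{r_{n,m}(t)},
$$
$$
T_{n,m}^{\mathrm{exe}}(t) = \frac{d_{n,m}(t)\,C}{f_m}, \qquad
E_{n,m}^{\mathrm{tx}}(t) = p_n^{\mathrm{tx}}\, T_{n,m}^{\mathrm{tx}}(t),
$$
where $f_n^{\mathrm{loc}}$ and $f_m$ denote the local and edge computation capabilities (CPU cycles per unit time), $p_n^{\mathrm{loc}}$ and $p_n^{\mathrm{tx}}$ the local computing and transmission powers, and $B_{n,m}$ the allocated bandwidth. The total offloading delay is then the sum of the transmission and edge-processing delays, and the device-side energy is the sum of local computing and transmission energy.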
3.2. Privacy Model
In the traditional offloading strategy, tasks tend to be offloaded to the closest edge server in order to reduce energy consumption and delay. This is mainly because the greater the distance between an edge server and a user device, the worse the channel state between them, and the higher the transmission delay and transmission energy consumption. However, this offloading preference allows attackers who jointly monitor multiple edge servers to infer the user's location. While offloading tasks to remote servers can confuse an attacker's judgment, it leads to greater delay and energy consumption. Therefore, this paper introduces the concept of privacy entropy [31,32], reducing the dependence on nearby servers and increasing the use of remote servers, thereby sacrificing a certain amount of performance to protect the user's privacy while still providing good QoS. Privacy entropy measures how randomly offloading preferences are distributed among the different servers during computation offloading; a greater privacy entropy signifies a greater challenge for attackers attempting to deduce the user's location and thus a higher level of privacy protection.
The offloading preference for user device
n can be represented by the total number of offloaded tasks and the number of those tasks offloaded to each edge server:
When the local device handles the computation tasks, the attacker cannot infer the user’s location information, and the privacy entropy achieves its maximum value of
. When computation tasks are all offloaded to the nearest edge server for processing, the task offloading preference becomes very clear, making it easy for an attacker to infer the user’s location; at this time, the privacy entropy is zero. In this regard, the privacy entropy of the user’s device
n is as follows:
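A standard way to instantiate such a privacy entropy (with illustrative notation) is the Shannon entropy of the offloading-preference distribution:
$$
H_n(t) = -\sum_{m=1}^{M} \rho_{n,m}(t)\,\log_2 \rho_{n,m}(t),
\qquad
\rho_{n,m}(t) = \frac{d_{n,m}(t)}{\sum_{m'=1}^{M} d_{n,m'}(t)},
$$
where $\rho_{n,m}(t)$ is the fraction of device $n$'s offloaded data sent to server $m$. Under this form, the entropy equals zero when all tasks go to a single (e.g., the nearest) server and grows as offloading is spread more evenly, reaching $\log_2 M$ in the uniform case; as described above, the paper additionally assigns the maximum value when all tasks are processed locally, since no offloading information is exposed to an attacker.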
3.3. Problem Formulation
In this paper, we develop an optimization problem to enhance the system’s QoS by weighing the privacy protection against the delay, energy consumption, and task discard rate, which we denote as
where
is a weighting factor indicating the importance assigned to each performance metric. Constraint
indicates that the combined quantity of data processed locally and offloaded to the edge server should not surpass the total number of tasks generated by the user device, constraint
indicates that the computation resources assigned to the tasks remain within the total available computation capacity of the edge server, constraint
indicates that the computation tasks may be processed by the local device as well as by the edge server, and constraint
indicates that the local processing delay and offload processing delay must not exceed the time slot length.
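As a hedged sketch (the exact form and symbols are assumptions consistent with the description above), such a weighted QoS objective can be written as
$$
\max \;\; \sum_{t}\sum_{n=1}^{N} \bigl[\, \beta_1 H_n(t) - \beta_2 T_n(t) - \beta_3 E_n(t) - \beta_4 \Phi_n(t) \,\bigr],
$$
subject to the task-volume, edge-capacity, processing-mode, and per-slot delay constraints described above, where $T_n(t)$, $E_n(t)$, and $\Phi_n(t)$ denote the delay, device-side energy consumption, and discarded task volume of device $n$, and $\beta_1,\dots,\beta_4$ are the weighting factors.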
4. DRL-Based Computation Offloading Strategy
Considering the complexity and dynamics of MEC systems, this section proposes a DRL-based algorithm for finding an optimal computation offloading strategy. In this regard, we define several essential elements in DRL and then elaborate on the TD3-SN-PER algorithm for solving the computation offloading joint optimization problem.
4.1. State, Action, and Reward Definition
In the proposed scheme, each user device is considered an agent. The experience samples of all agents are integrated into global experience samples, and this global experience is utilized to train the networks. Defining S as the state space and A as the action space, the state set, action set, and reward set of all agents can be denoted accordingly.
(1) State space: In a dynamic MEC environment, the state space of different time slots changes dynamically, and the agent makes the corresponding offloading decision by observing the state of the current time slot. The state
of device
n at period
t can be expressed as follows:
where
and
denote the horizontal and vertical coordinates of the user device.
(2) Action space: The set of all actions that an agent can choose is called the action space. The agent selects the appropriate action for a given state, including the quantity of tasks assigned to edge servers and local execution, as well as the local computation power and transmission power. The action
can be described as
(3) Reward function: In DRL, the reward serves as the sole feedback to the agent, whose objective is to select the optimal actions that maximize the reward within a specific environment. This paper aims to enhance the QoS throughout the computation offloading process. Hence, we delineate the reward function as follows:
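As a concrete but hypothetical illustration of these definitions (the exact feature set, action layout, and weights are assumptions introduced here, not the paper's implementation), the per-agent state, action, and reward could be assembled as follows:

```python
import numpy as np

def build_state(task_size, x_coord, y_coord):
    # State: task volume generated in the current slot plus the device coordinates.
    return np.array([task_size, x_coord, y_coord], dtype=np.float32)

def build_action(local_ratio, offload_ratios, local_power, tx_power):
    # Action: share of the task executed locally, shares sent to each of the M servers,
    # local computation power, and transmission power.
    return np.concatenate(([local_ratio], offload_ratios, [local_power, tx_power])).astype(np.float32)

def reward(privacy_entropy, delay, energy, dropped, betas=(1.0, 1.0, 1.0, 1.0)):
    # Reward: weighted QoS combining privacy entropy (to be maximized) against
    # delay, energy consumption, and discarded tasks (to be minimized).
    b1, b2, b3, b4 = betas
    return b1 * privacy_entropy - b2 * delay - b3 * energy - b4 * dropped
```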
4.2. TD3-SN-PER Algorithmic Framework
The framework of the TD3-SN-PER algorithm proposed in this paper, which builds on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, is shown in
Figure 2.
In the TD3-SN-PER algorithm, each agent contains six neural networks: an actor network, a critic1 network, and a critic2 network, whose inputs involve the set of states of all agents and the set of actions of all agents other than agent n, together with a target actor network, a target critic1 network, and a target critic2 network; each network has its own parameters.
In the initial phase of the algorithm, the agent selects an action
in accordance with the current policy
and the prevailing state
of the environment. After executing the action
, it observes the reward
and the new state
and stores these samples
into the prioritized experience buffer. A batch of samples is then drawn from the buffer via prioritized experience replay, which we elaborate on below. For each sample, the target critic networks and the target actor network are used to compute the target value
y with the expression
where
,
is the noise added for target policy smoothing. After deriving the target value
y, the two critic networks are updated using the extracted samples and the target value
y. The update goal is to minimize their loss functions,
where
D is the experience buffer, and
is the importance sampling weight used to reduce the bias caused by prioritized experience sampling. In contrast to the deep deterministic policy gradient (DDPG) algorithm, the TD3-SN-PER algorithm updates the actor network only after a certain number of critic network updates have been performed, ensuring that the critic estimates are sufficiently stable. The update formula is
Subsequently, the target network undergoes a soft update to gradually align with the parameters of the main network. The formula for this update is
where
is the soft update parameter.
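The core update of this framework can be sketched in PyTorch-style code as follows; the agent object, its network and optimizer attributes, and the hyperparameter values are assumptions introduced for illustration, not the paper's exact implementation.

```python
import torch

def td3_update(step, batch, agent, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    # batch: (states, actions, rewards, next_states, dones, is_weights) sampled
    # from the prioritized replay buffer; is_weights are importance sampling weights.
    s, a, r, s2, done, w = batch

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (agent.target_actor(s2) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q learning: take the minimum of the two target critics.
        q1_t = agent.target_critic1(s2, a2)
        q2_t = agent.target_critic2(s2, a2)
        y = r + gamma * (1.0 - done) * torch.min(q1_t, q2_t)

    # Critic update, weighted by the PER importance sampling weights.
    q1 = agent.critic1(s, a)
    q2 = agent.critic2(s, a)
    td_err1, td_err2 = y - q1, y - q2
    critic_loss = (w * (td_err1.pow(2) + td_err2.pow(2))).mean()
    agent.critic_opt.zero_grad()
    critic_loss.backward()
    agent.critic_opt.step()

    # Delayed actor and target updates.
    if step % policy_delay == 0:
        actor_loss = -agent.critic1(s, agent.actor(s)).mean()
        agent.actor_opt.zero_grad()
        actor_loss.backward()
        agent.actor_opt.step()
        for net, target in [(agent.actor, agent.target_actor),
                            (agent.critic1, agent.target_critic1),
                            (agent.critic2, agent.target_critic2)]:
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)

    # New priorities for the sampled transitions (see Section 4.4).
    return torch.max(td_err1.abs(), td_err2.abs()).detach()
```

The minimum over the two target critics realizes the clipped double-Q estimate that counters overestimation, while the policy_delay counter implements the delayed actor update described above.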
4.3. State Normalization
During deep neural network training, the scales of different state features may vary greatly, which can lead to unstable network updates and to exploding or vanishing gradients, degrading the training results. Therefore, in this paper, the observed states are normalized and preprocessed to improve the training of the deep neural networks. For the input state vector
, the task features and coordinate features are normalized separately. To normalize the task features:
To normalize the coordinate features:
where
and
are the minimum and maximum values of the task features, and
,
,
, and
are the minimum and maximum values of the horizontal and vertical coordinates.
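A minimal sketch of this min-max normalization, assuming a three-dimensional state of one task feature and two coordinate features with known bounds:

```python
import numpy as np

def normalize_state(state, task_min, task_max, x_min, x_max, y_min, y_max):
    # state = [task_feature, x_coordinate, y_coordinate]; each feature is
    # rescaled to [0, 1] with its own min-max bounds to keep scales comparable.
    task, x, y = state
    eps = 1e-8  # guard against zero-width ranges
    return np.array([
        (task - task_min) / (task_max - task_min + eps),
        (x - x_min) / (x_max - x_min + eps),
        (y - y_min) / (y_max - y_min + eps),
    ], dtype=np.float32)
```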
4.4. Prioritized Experience Replay
During the training of a DRL algorithm, randomly selecting a small batch of samples from the experience pool ignores the differing importance of experience samples, which reduces learning efficiency and final performance. Therefore, the TD3-SN-PER algorithm introduces a prioritized experience replay mechanism during training, which improves learning efficiency and performance by computing a priority for each sample and preferentially selecting higher-priority samples during sampling, while importance sampling weights correct the resulting bias.
In this paper, the importance of the sample is expressed in terms of the absolute TD error, which can be calculated using two critic networks, denoted as
The priority of the sample is updated by taking the larger of the two absolute TD errors, denoted as
where
is a very small positive number, used to ensure that even samples with small TD errors have a certain priority. Based on the above prioritization, the sampling probability of the sample can be obtained as
where
is the prioritization parameter. In this process, in order to reduce the bias caused by prioritized experience sampling, this paper introduces importance sampling weights, denoted as
where
N is the size of the experience buffer and
, which is used to control the magnitude of weight changes.
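A simplified sketch of such a prioritized replay buffer (a proportional variant backed by a plain array rather than a sum-tree; names and default values are illustrative assumptions):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Simplified proportional PER; a production version would use a sum-tree."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New samples get the current maximum priority so they are replayed at least once.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                          # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        n = len(self.data)
        weights = (n * probs[idx]) ** (-self.beta)   # importance sampling weights
        weights /= weights.max()                     # normalize for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_err1, td_err2):
        # Priority: the larger of the two critics' absolute TD errors plus a small constant.
        self.priorities[idx] = np.maximum(np.abs(td_err1), np.abs(td_err2)) + self.eps
```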
Taken together, the TD3-SN-PER algorithm can be summarized as Algorithm 1.
Algorithm 1 TD3-SN-PER Algorithm
1: Input: Environment status and parameters
2: Output: Offloading policy for the user equipment
3: Initialization:
4: Initialize the actor and critic networks
5: Initialize the prioritized experience replay pool
6: Initialize the state normalization parameters
7: Set the soft update parameter and the discount factor
8: for each episode do
9:   Reset the environment and get the initial state
10:   Normalize each feature in the state
11:   for each step do
12:     for each agent do
13:       Select an action based on the current policy and exploration noise
14:     end for
15:     Execute the actions; observe the new state, the reward, and whether the episode is done
16:     Normalize the new state
17:     Repeat the normalization process used for the initial state
18:     Collect the agents' samples, combine them into global experience, and store them in the prioritized experience replay pool
19:     if enough experience has been collected then
20:       for each agent do
21:         Sample a batch of experiences and the corresponding importance sampling weights from the prioritized experience replay pool
22:         Calculate the loss for the critic networks
23:         Update the critic networks
24:         Calculate and update the priorities in the prioritized experience replay pool
25:         Calculate the policy gradient
26:         Update the actor network
27:         Update the target network parameters
28:       end for
29:     end if
30:   end for
31: end for
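Finally, a condensed Python-style rendering of Algorithm 1's main loop, reusing the illustrative helpers sketched above (the environment, agent, and buffer interfaces are assumptions, not the paper's implementation):

```python
def train(env, agents, buffer, episodes, steps_per_episode, batch_size, warmup):
    step = 0
    for _ in range(episodes):
        states = [normalize_state(s, *env.bounds) for s in env.reset()]
        for _ in range(steps_per_episode):
            # Each agent selects an action from its own normalized state plus exploration noise.
            actions = [ag.act(s, noise=True) for ag, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            next_states = [normalize_state(s, *env.bounds) for s in next_states]

            # Combine all agents' samples into one global transition and store it.
            buffer.add((states, actions, rewards, next_states, done))
            states = next_states
            step += 1

            if len(buffer.data) >= warmup:
                for ag in agents:
                    idx, batch, w = buffer.sample(batch_size)
                    td_abs = td3_update(step, ag.prepare(batch, w), ag).cpu().numpy()
                    # td3_update already returns max(|TD1|, |TD2|) per sample.
                    buffer.update_priorities(idx, td_abs, td_abs)
            if done:
                break
```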
6. Conclusions and Future Work
This paper proposes a DRL-based computation offloading algorithm for multi-user, multi-server MEC environments, aiming to maximize QoS by simultaneously considering user privacy protection, delay, energy consumption, and task discard rate. To mitigate the overestimation bias problem, we adopt the TD3 algorithm with its clipped double-Q learning technique. Additionally, to enhance the learning efficiency and performance of the algorithm, we integrate state normalization and prioritized experience replay techniques. The experimental results demonstrate that the proposed TD3-SN-PER algorithm is more effective in improving QoS.
When performing computation offloading, edge servers collaborate with each other to effectively improve system performance. However, if there are dependencies between tasks, the execution order of tasks needs to be coordinated among different edge servers, which can make the task scheduling process very complicated. Therefore, our future work will be devoted to solving the problem of computation offloading for tasks with dependencies in a collaborative MEC environment.