1. Introduction
The rapid development of Internet of Things (IoT) technology has driven the global proliferation of various smart mobile devices, leading to the emergence of numerous smart applications in our daily lives, such as smart homes [
1] and smart healthcare [
2]. These applications significantly improve the Quality of Experience (QoE) for users. However, they often generate a large volume of computation tasks, exceeding the data processing and computation capabilities of resource-constrained mobile devices [
3]. This limitation severely impacts the performance and Quality of Service (QoS) of the applications. In response to these challenges, multi-access edge computing (MEC) technology has emerged as a solution. By deploying computation, storage, and other resources closer to mobile devices, MEC enables the processing of data near its point of generation [
4], significantly reducing delay and energy consumption, and enhancing data processing efficiency [
5]. Despite the numerous benefits brought by MEC, efficiently managing MEC resources and determining the optimal offloading strategy to improve QoS has become a key issue [
6].
To address these issues, some studies have focused on enhancing QoS by minimizing system delay and energy consumption in single access point environments [
7,
8,
9]. However, multi-access point environments are closer to reality, where computation tasks are often assumed to be fine-grained. Depending on factors such as data volume and network conditions, tasks with larger data volumes are offloaded to the nearest edge servers with data processing capabilities, while tasks with smaller data volumes are offloaded to more distant servers for processing. This approach achieves relatively low delay, energy consumption, and task discard rates, but it also has drawbacks. For instance, if an attacker were to monitor the edge servers, they could infer a user’s exact location from the locations of multiple edge servers and the user’s offloading preferences, thereby leaking the user’s location privacy [
10].
Existing studies rarely consider the aforementioned issues collectively, and often focus on only one or a few of them [
8,
11,
12]. Hence, our research takes a holistic approach by considering privacy protection, delay, energy consumption, and the task discard rate in a multi-access point network to enhance the QoS. However, in a complex and dynamic environment such as MEC, it is particularly difficult to solve the above problems using traditional methods such as Lyapunov optimization methods [
13], convex optimization methods [
14], and heuristic techniques [
15]. In this scenario, deep reinforcement learning (DRL) emerges as a novel solution for computation offloading in MEC, thanks to its exceptional ability to tackle complex decision-making problems [
16]. DRL combines the decision-making capability of reinforcement learning (RL) with the representation learning capability of deep learning (DL), enabling it to learn the optimal policy through interaction with the environment, without explicit instructions [
17].
Hence, this paper proposes a privacy-preserving computation offloading scheme based on DRL, and the main contributions are as follows:
- (1)
Considering multiple performances of the MEC system, this study formulates a multi-objective optimization problem aimed at maximizing the QoS within a multi-access point environment;
- (2)
For the multi-objective optimization problem, this paper proposes a computation offloading algorithm named TD3-SN-PER. The algorithm features two independent critic networks, a design choice aimed at mitigating overestimation bias. To enhance learning efficiency and stability and to address the correlation among experience samples, state normalization and prioritized experience replay are integrated, ensuring that the algorithm can better converge to the globally optimal computation offloading policy;
- (3)
Extensive experimental results demonstrate that the TD3-SN-PER algorithm significantly improves system performance. Compared with other approaches, the algorithm consistently achieves the best QoS, even as the number of users and the task arrival rate increase substantially.
The remainder of this document is structured as follows:
Section 2 discusses recent related research,
Section 3 outlines our system model and the optimization problem,
Section 4 details the algorithm we propose,
Section 5 delves into the analysis of experimental outcomes, and
Section 6 offers a conclusion to the entire study.
2. Related Works
Numerous studies in MEC propose solutions to tackle the challenges in the computation offloading process. Various problems often require distinct optimization objectives. In scenarios sensitive to delays, minimizing delay is typically the primary objective. For example, Song et al. [
18] used Branch and Bound (BnB) and a multi-objective particle swarm optimization algorithm to solve the offloading decision and bandwidth allocation problems and minimize the delay. Li et al. [
19] modeled the task offloading problem as a constrained Markov decision process (CMDP) and proposed a prioritized experience replay and dueling double deep Q-network-based algorithm to solve the CMDP problem.
Existing studies often focus on the joint optimization of delay and energy consumption. Liao et al. [
20] proposed an online DRL algorithm aimed at reducing long-term energy consumption and delay by jointly deciding the transmit power for computation offloading, the CPU frequency, and the offloading decision. Avgeris et al. [
21] proposed a two-phase DRL scheme. In the first phase, each user device independently decides to offload the task to a connected edge server or execute it locally. If the task is offloaded, it proceeds to the second phase, where load balancing is achieved by transferring the task between different edge servers. Some studies explore performance metrics beyond delay and energy consumption. For instance, Ref. [
22] proposed a predictive algorithm by combining a long short-term memory network with DRL to reduce the system task discard rate, delay, and energy consumption.
With users increasingly focusing on their privacy, protecting user privacy during computation offloading has emerged as a critical issue, and several works have been devoted to it. Ju et al. [
23] proposed a secure DRL-based computation offloading scheme by utilizing the spectrum-sharing architecture and physical layer security techniques. Lang et al. [
24] ensured the synchronization and invariance of the offloading data by applying blockchain technology to the collaborative computation offloading of on-board MEC.
In addition to differing optimization objectives, studies also differ in how tasks are offloaded. For example, some studies are oriented towards binary offloading [
8,
25]. Zheng et al. [
8] decomposed the delay minimization problem into three subproblems, achieving a fast near-optimal offloading decision under varying wireless channel conditions. Sun et al. [
25] proposed a multi-branch network-based DQN algorithm to address the problem of the number of actions in the system increasing combinatorially with the number of mobile devices. Other studies are oriented toward partial offloading [
26,
27]. Sun et al. [
26] improved the system’s utility by optimizing the servers’ resource allocation and load balancing. Wang et al. [
27] focused on how to efficiently offload dependent subtasks, and proposed a heuristic computation offloading scheduling scheme to offload appropriate subtasks to the server. Numerous existing studies have shown that dividing a task into multiple subtasks can reduce computation delay [
28].
There are also many research works on different network architectures for MEC under different application requirements. Ke et al. [
29] proposed a distributed multi-agent DRL algorithm to optimize bandwidth allocation and computation offloading strategies by training neural networks in a decentralized manner. However, it is a challenge for distributed nodes to learn collaborative strategies [
30], for which an effective approach is to combine centralized and distributed mechanisms. For example, Yao et al. [
6] proposed an experience-sharing offloading algorithm based on DRL in a distributed architecture. Wu et al. [
10] proposed a multi-agent DRL algorithm (JODRL-PP) to improve system performance while protecting user privacy. However, the scheme is prone to the overestimation bias problem and ignores the differing importance of experience samples.
Based on the analysis of the above studies, it is evident that existing computation offloading schemes can effectively optimize system performance. However, these schemes exhibit certain limitations. Some focus on optimizing a single performance metric, overlooking the importance of other metrics. Others consider multiple metrics but fail to address overestimation bias and the differing importance of experience samples. To overcome these shortcomings, this paper proposes the TD3-SN-PER algorithm, which merges clipped double Q-networks, state normalization, and prioritized experience replay. The algorithm improves training by partitioning each offloading task into multiple subtasks and exploiting global information, and it seeks the globally optimal computation offloading strategy by jointly evaluating multiple system performance metrics.
3. System Model and Problem Formulation
In this paper, we consider a cell scenario with multiple edge servers and multiple devices, as shown in
Figure 1. Assume that there are a total of N user devices and a total of M edge servers, and that the computation tasks generated by the devices are fine-grained. A decision variable is defined to indicate each device's offloading mode: the tasks may be executed entirely locally; offloaded entirely to the edge servers for processing, in which case a task can be assigned to multiple edge servers at the same time; or processed partially on the local device and partially at the edge servers.
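As an illustration of this fine-grained splitting (using notation introduced here for illustration, which may differ from the paper's own symbols), the per-slot task split can be written as
$$
d_n^{\mathrm{loc}}(t) + \sum_{m=1}^{M} d_{n,m}(t) \le D_n(t),
$$
where $D_n(t)$ is the task volume generated by device $n$ in slot $t$, $d_n^{\mathrm{loc}}(t)$ is the portion processed locally, and $d_{n,m}(t)$ is the portion offloaded to edge server $m$; full local execution, full offloading, and partial offloading then correspond to $\sum_m d_{n,m}(t)=0$, $d_n^{\mathrm{loc}}(t)=0$, and both parts being positive, respectively.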
3.1. Computation Model
When the local device executes the computation task, there is no data transmission, and thus no transmission delay or transmission energy consumption during task processing. Defining the number of CPU cycles that the user device can process in each time slot t, the delay required for local task processing can be described as
where
C denotes the CPU cycles necessary to process 1 bit of data. The energy consumption needed for the user device to execute the task can be denoted as
where
is the calculated power of the device. When the computation tasks are offloaded to the edge server for processing, the task data are transmitted by the wireless link. During data transmission, the user has some mobility. In this regard, the wireless channel gain during offloading is denoted as
where
and
are the coordinates of device
n and edge server
m, and
is the base channel gain at one meter from the edge server. In this regard, the signal-to-interference-plus-noise ratio during data transmission can be described as
where
denotes the transmission power during task offloading, and
denotes the channel noise. So, the rate of data transmission can be described as
where
is the channel bandwidth allocated to the user device. Based on the calculated data transmission rate, the transmission delay during task offloading can be expressed as
Define the computation frequency of the edge server as
. The delay incurred by the edge server in processing a task can be expressed as
The delay generated during task offloading includes data uploading delay, data processing delay, and result feedback delay, but since the result feedback data are much smaller than the task uploading data, the result feedback delay is often not considered in the calculation process. In this regard, the total delay of offloading the task to the edge server for processing can be expressed as
Taken together, the delay required to complete the task can be expressed as
Based on the transmission power and transmission delay, the transmission energy consumption can be expressed as
Since this paper aims to maximize the QoS, it focuses only on the energy consumption of the user’s device, which can be expressed as
During task processing, there are often time and energy constraints; when a task cannot be completed under these constraints, it is discarded. The amount of discarded tasks can be expressed as
where
is the time slot length,
is the maximum number of tasks that the local device can compute,
is the maximum number of tasks that can be offloaded, and the function
represents
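For concreteness, the following is a sketch of the standard forms these delay and energy quantities typically take in MEC models; the symbols below are illustrative assumptions and may differ from the paper's original notation.
$$
T_n^{\mathrm{loc}}(t) = \frac{d_n^{\mathrm{loc}}(t)\,C}{f_n^{\mathrm{loc}}}, \qquad
E_n^{\mathrm{loc}}(t) = p_n^{\mathrm{loc}}\, T_n^{\mathrm{loc}}(t),
$$
$$
r_{n,m}(t) = B_{n,m}\log_2\!\bigl(1+\mathrm{SINR}_{n,m}(t)\bigr), \qquad
T_{n,m}^{\mathrm{tx}}(t) = \frac{d_{n,m}(t)}{r_{n,m}(t)},
$$
$$
T_{n,m}^{\mathrm{exe}}(t) = \frac{d_{n,m}(t)\,C}{f_m}, \qquad
E_{n,m}^{\mathrm{tx}}(t) = p_n^{\mathrm{tx}}\, T_{n,m}^{\mathrm{tx}}(t),
$$
where $f_n^{\mathrm{loc}}$ and $f_m$ denote the local and edge computation capabilities (CPU cycles per unit time), $p_n^{\mathrm{loc}}$ and $p_n^{\mathrm{tx}}$ the local computing and transmission powers, and $B_{n,m}$ the allocated bandwidth. The total offloading delay is then the sum of the transmission and edge-processing delays, and the device-side energy is the sum of local computing and transmission energy.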
3.2. Privacy Model
In the traditional offloading strategy, tasks tend to be offloaded to the closest edge server in order to reduce energy consumption and delay. This is mainly because the greater the distance between an edge server and a user device, the worse the channel state between them, and the higher the transmission delay and transmission energy consumption. However, this offloading preference allows attackers who jointly monitor multiple edge servers to infer the user's location. While offloading tasks to remote servers can confuse an attacker's judgment, it leads to greater delay and energy consumption. Therefore, this paper introduces the concept of privacy entropy [31,32], reducing the dependence on nearby servers and increasing the use of remote servers, thereby sacrificing a certain amount of performance to protect the user's privacy while still providing good QoS. Privacy entropy measures how randomly offloading preferences are distributed among the different servers during computation offloading; a greater privacy entropy signifies a greater challenge for attackers attempting to deduce the user's location and thus a higher level of privacy protection.
The offloading preference for user device
n can be represented by the total number of offloaded tasks and the number of those tasks offloaded to each edge server:
When the local device handles the computation tasks, the attacker cannot infer the user’s location information, and the privacy entropy achieves its maximum value of
. When computation tasks are all offloaded to the nearest edge server for processing, the task offloading preference becomes very clear, making it easy for an attacker to infer the user’s location; at this time, the privacy entropy is zero. In this regard, the privacy entropy of the user’s device
n is as follows:
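A standard way to instantiate such a privacy entropy (with illustrative notation) is the Shannon entropy of the offloading-preference distribution:
$$
H_n(t) = -\sum_{m=1}^{M} \rho_{n,m}(t)\,\log_2 \rho_{n,m}(t),
\qquad
\rho_{n,m}(t) = \frac{d_{n,m}(t)}{\sum_{m'=1}^{M} d_{n,m'}(t)},
$$
where $\rho_{n,m}(t)$ is the fraction of device $n$'s offloaded data sent to server $m$. Under this form, the entropy equals zero when all tasks go to a single (e.g., the nearest) server and grows as offloading is spread more evenly, reaching $\log_2 M$ in the uniform case; as described above, the paper additionally assigns the maximum value when all tasks are processed locally, since no offloading information is exposed to an attacker.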
3.3. Problem Formulation
In this paper, we develop an optimization problem to enhance the system’s QoS by weighing the privacy protection against the delay, energy consumption, and task discard rate, which we denote as
where
is a weighting factor indicating the importance assigned to each performance metric. Constraint
indicates that the combined quantity of data processed locally and offloaded to the edge server should not surpass the total number of tasks generated by the user device, constraint
indicates that the computation resources assigned to the tasks remain within the total available computation capacity of the edge server, constraint
indicates that the computation tasks may be processed by the local device as well as by the edge server, and constraint
indicates that the local processing delay and offload processing delay must not exceed the time slot length.
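As a hedged sketch (the exact form and symbols are assumptions consistent with the description above), such a weighted QoS objective can be written as
$$
\max \;\; \sum_{t}\sum_{n=1}^{N} \bigl[\, \beta_1 H_n(t) - \beta_2 T_n(t) - \beta_3 E_n(t) - \beta_4 \Phi_n(t) \,\bigr],
$$
subject to the task-volume, edge-capacity, processing-mode, and per-slot delay constraints described above, where $T_n(t)$, $E_n(t)$, and $\Phi_n(t)$ denote the delay, device-side energy consumption, and discarded task volume of device $n$, and $\beta_1,\dots,\beta_4$ are the weighting factors.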
4. DRL-Based Computation Offloading Strategy
Considering the complexity and dynamics of MEC systems, this section proposes a DRL-based algorithm for finding an optimal computation offloading strategy. In this regard, we define several essential elements in DRL and then elaborate on the TD3-SN-PER algorithm for solving the computation offloading joint optimization problem.
4.1. State, Action, and Reward Definition
In the proposed scheme, each user device is considered an agent. The experience samples of all agents are integrated into global experience samples, and this global experience is utilized to train the networks. Defining S as the state space and A as the action space, the state set, action set, and reward set of all agents can be denoted accordingly.
(1) State space: In a dynamic MEC environment, the state space of different time slots changes dynamically, and the agent makes the corresponding offloading decision by observing the state of the current time slot. The state
of device
n at period
t can be expressed as follows:
where
and
denote the horizontal and vertical coordinates of the user device.
(2) Action space: The set of all actions that an agent can choose is called the action space. The agent selects the appropriate action for a given state, including the quantity of tasks assigned to edge servers and local execution, as well as the local computation power and transmission power. The action
can be described as
(3) Reward function: In DRL, the reward serves as the sole feedback to the agent, whose objective is to select the optimal actions that maximize the reward within a specific environment. This paper aims to enhance the QoS throughout the computation offloading process. Hence, we delineate the reward function as follows:
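As a concrete but hypothetical illustration of these definitions (the exact feature set, action layout, and weights are assumptions introduced here, not the paper's implementation), the per-agent state, action, and reward could be assembled as follows:

```python
import numpy as np

def build_state(task_size, x_coord, y_coord):
    # State: task volume generated in the current slot plus the device coordinates.
    return np.array([task_size, x_coord, y_coord], dtype=np.float32)

def build_action(local_ratio, offload_ratios, local_power, tx_power):
    # Action: share of the task executed locally, shares sent to each of the M servers,
    # local computation power, and transmission power.
    return np.concatenate(([local_ratio], offload_ratios, [local_power, tx_power])).astype(np.float32)

def reward(privacy_entropy, delay, energy, dropped, betas=(1.0, 1.0, 1.0, 1.0)):
    # Reward: weighted QoS combining privacy entropy (to be maximized) against
    # delay, energy consumption, and discarded tasks (to be minimized).
    b1, b2, b3, b4 = betas
    return b1 * privacy_entropy - b2 * delay - b3 * energy - b4 * dropped
```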
4.2. TD3-SN-PER Algorithmic Framework
The framework of the TD3-SN-PER algorithm proposed in this paper, which builds on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, is shown in
Figure 2.
In the TD3-SN-PER algorithm, each agent contains six neural networks: an actor network, a critic1 network, and a critic2 network, whose inputs involve the set of states of all agents and the set of actions of all agents other than agent n, together with a target actor network, a target critic1 network, and a target critic2 network; each network has its own parameters.
In the initial phase of the algorithm, the agent selects an action
in accordance with the current policy
and the prevailing state
of the environment. After executing the action
, it observes the reward
and the new state
and stores these samples
into the prioritized experience buffer. A batch of samples is then drawn from the buffer via prioritized experience replay, which we elaborate on below. For each sample, the target critic networks and the target actor network are used to compute the target value
y with the expression
where
,
is the noise added for target policy smoothing. After deriving the target value
y, the two critic networks are updated using the extracted samples and the target value
y. The update goal is to minimize their loss functions,
where
D is the experience buffer, and
is the importance sampling weight used to reduce the bias caused by prioritized experience sampling. In contrast to the deep deterministic policy gradient (DDPG) algorithm, the TD3-SN-PER algorithm updates the actor network only after a certain number of critic network updates have been performed, ensuring that the critic estimates are sufficiently stable. The update formula is
Subsequently, the target network undergoes a soft update to gradually align with the parameters of the main network. The formula for this update is
where
is the soft update parameter.
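The core update of this framework can be sketched in PyTorch-style code as follows; the agent object, its network and optimizer attributes, and the hyperparameter values are assumptions introduced for illustration, not the paper's exact implementation.

```python
import torch

def td3_update(step, batch, agent, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    # batch: (states, actions, rewards, next_states, dones, is_weights) sampled
    # from the prioritized replay buffer; is_weights are importance sampling weights.
    s, a, r, s2, done, w = batch

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = (agent.target_actor(s2) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q learning: take the minimum of the two target critics.
        q1_t = agent.target_critic1(s2, a2)
        q2_t = agent.target_critic2(s2, a2)
        y = r + gamma * (1.0 - done) * torch.min(q1_t, q2_t)

    # Critic update, weighted by the PER importance sampling weights.
    q1 = agent.critic1(s, a)
    q2 = agent.critic2(s, a)
    td_err1, td_err2 = y - q1, y - q2
    critic_loss = (w * (td_err1.pow(2) + td_err2.pow(2))).mean()
    agent.critic_opt.zero_grad()
    critic_loss.backward()
    agent.critic_opt.step()

    # Delayed actor and target updates.
    if step % policy_delay == 0:
        actor_loss = -agent.critic1(s, agent.actor(s)).mean()
        agent.actor_opt.zero_grad()
        actor_loss.backward()
        agent.actor_opt.step()
        for net, target in [(agent.actor, agent.target_actor),
                            (agent.critic1, agent.target_critic1),
                            (agent.critic2, agent.target_critic2)]:
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)

    # New priorities for the sampled transitions (see Section 4.4).
    return torch.max(td_err1.abs(), td_err2.abs()).detach()
```

The minimum over the two target critics realizes the clipped double-Q estimate that counters overestimation, while the policy_delay counter implements the delayed actor update described above.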
4.3. State Normalization
During deep neural network training, the scales of different state features may vary greatly, which can lead to unstable network updates and to exploding or vanishing gradients, degrading the training results. Therefore, in this paper, the observed states are normalized and preprocessed to improve the training of the deep neural networks. For the input state vector
, the task features and coordinate features are normalized separately. To normalize the task features:
To normalize the coordinate features:
where
and
are the minimum and maximum values of the task features, and
,
,
, and
are the minimum and maximum values of the horizontal and vertical coordinates.
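A minimal sketch of this min-max normalization, assuming a three-dimensional state of one task feature and two coordinate features with known bounds:

```python
import numpy as np

def normalize_state(state, task_min, task_max, x_min, x_max, y_min, y_max):
    # state = [task_feature, x_coordinate, y_coordinate]; each feature is
    # rescaled to [0, 1] with its own min-max bounds to keep scales comparable.
    task, x, y = state
    eps = 1e-8  # guard against zero-width ranges
    return np.array([
        (task - task_min) / (task_max - task_min + eps),
        (x - x_min) / (x_max - x_min + eps),
        (y - y_min) / (y_max - y_min + eps),
    ], dtype=np.float32)
```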
4.4. Prioritized Experience Replay
During the training of a DRL algorithm, randomly selecting a small batch of samples from the experience pool ignores the differing importance of experience samples, which reduces learning efficiency and final performance. Therefore, the TD3-SN-PER algorithm introduces a prioritized experience replay mechanism during training, which improves learning efficiency and performance by computing a priority for each sample and preferentially selecting higher-priority samples during sampling, while importance sampling weights correct the resulting bias.
In this paper, the importance of the sample is expressed in terms of the absolute TD error, which can be calculated using two critic networks, denoted as
The priority of the sample is updated by taking the larger of the two absolute TD errors, denoted as
where
is a very small positive number, used to ensure that even samples with small TD errors have a certain priority. Based on the above prioritization, the sampling probability of the sample can be obtained as
where
is the prioritization parameter. In this process, in order to reduce the bias caused by prioritized experience sampling, this paper introduces importance sampling weights, denoted as
where
N is the size of the experience buffer and
, which is used to control the magnitude of weight changes.
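A simplified sketch of such a prioritized replay buffer (a proportional variant backed by a plain array rather than a sum-tree; names and default values are illustrative assumptions):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Simplified proportional PER; a production version would use a sum-tree."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New samples get the current maximum priority so they are replayed at least once.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                          # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        n = len(self.data)
        weights = (n * probs[idx]) ** (-self.beta)   # importance sampling weights
        weights /= weights.max()                     # normalize for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_err1, td_err2):
        # Priority: the larger of the two critics' absolute TD errors plus a small constant.
        self.priorities[idx] = np.maximum(np.abs(td_err1), np.abs(td_err2)) + self.eps
```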
Taken together, the TD3-SN-PER algorithm can be summarized as Algorithm 1.
Algorithm 1 TD3-SN-PER Algorithm
1: Input: Environment status and parameters
2: Output: Offloading policy for the user equipment
3: Initialization:
4: Initialize the actor and critic networks
5: Initialize the prioritized experience replay pool
6: Initialize the state normalization parameters
7: Set the soft update parameter and the discount factor
8: for each episode do
9:   Reset the environment and get the initial state
10:   Normalize each feature in the state
11:   for each step do
12:     for each agent do
13:       Select an action based on the current policy and exploration noise
14:     end for
15:     Execute the actions; observe the new state, the reward, and whether the episode is done
16:     Normalize the new state
17:     Repeat the normalization process used for the initial state
18:     Collect the agents' samples, combine them into global experience, and store them in the prioritized experience replay pool
19:     if enough experience has been collected then
20:       for each agent do
21:         Sample a batch of experiences and the corresponding importance sampling weights from the prioritized experience replay pool
22:         Calculate the loss for the critic networks
23:         Update the critic networks
24:         Calculate and update the priorities in the prioritized experience replay pool
25:         Calculate the policy gradient
26:         Update the actor network
27:         Update the target network parameters
28:       end for
29:     end if
30:   end for
31: end for
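Finally, a condensed Python-style rendering of Algorithm 1's main loop, reusing the illustrative helpers sketched above (the environment, agent, and buffer interfaces are assumptions, not the paper's implementation):

```python
def train(env, agents, buffer, episodes, steps_per_episode, batch_size, warmup):
    step = 0
    for _ in range(episodes):
        states = [normalize_state(s, *env.bounds) for s in env.reset()]
        for _ in range(steps_per_episode):
            # Each agent selects an action from its own normalized state plus exploration noise.
            actions = [ag.act(s, noise=True) for ag, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)
            next_states = [normalize_state(s, *env.bounds) for s in next_states]

            # Combine all agents' samples into one global transition and store it.
            buffer.add((states, actions, rewards, next_states, done))
            states = next_states
            step += 1

            if len(buffer.data) >= warmup:
                for ag in agents:
                    idx, batch, w = buffer.sample(batch_size)
                    td_abs = td3_update(step, ag.prepare(batch, w), ag).cpu().numpy()
                    # td3_update already returns max(|TD1|, |TD2|) per sample.
                    buffer.update_priorities(idx, td_abs, td_abs)
            if done:
                break
```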
6. Conclusions and Future Work
This paper proposes a DRL-based computation offloading algorithm for multi-user, multi-server MEC environments, aiming to maximize QoS by simultaneously considering user privacy protection, delay, energy consumption, and task discard rate. To mitigate the overestimation bias problem, we adopt the TD3 algorithm with its clipped double-Q learning technique. Additionally, to enhance the learning efficiency and performance of the algorithm, we integrate state normalization and prioritized experience replay techniques. The experimental results demonstrate that the proposed TD3-SN-PER algorithm is more effective in improving QoS.
When performing computation offloading, edge servers collaborate with each other to effectively improve system performance. However, if there are dependencies between tasks, the execution order of tasks needs to be coordinated among different edge servers, which can make the task scheduling process very complicated. Therefore, our future work will be devoted to solving the problem of computation offloading for tasks with dependencies in a collaborative MEC environment.