1. Introduction
With the ongoing development and application of the Internet of Vehicles (IoV) and autonomous driving technology, the Internet of Autonomous Vehicles (IoAV) has gained increasing attention [1,2]. This concept includes advanced automation, intelligent sensing, and decision-making capabilities, and it presents technological complexities and cost challenges. Meanwhile, the continued advancement of IoAV has enabled the deployment of a wide range of related applications such as augmented reality (AR), real-time video analysis, and human behavior recognition [3,4]. These applications are compute-intensive and delay-sensitive, requiring substantial computing resources for processing multi-dimensional and diverse data, posing challenges for deployment on resource-limited vehicles. Moreover, these applications exacerbate inter-task dependencies, extending overall task execution times and increasing energy consumption. This can prevent tasks from being completed within designated times, negatively impacting the quality of experience (QoE) and potentially endangering lives. For instance, real-time video analysis often relies on preliminary information from human action recognition for downstream tasks, which must wait until the information arrives. If the task is not completed within the specified time, it will affect the subsequent decision-making in the autonomous driving vehicle, which may lead to serious safety accidents.
Mobile cloud computing (MCC), a technical mode for handling large-scale dynamic data streams, provides a viable solution to the challenges mentioned above [5]. This technology divides vehicle tasks into multiple sub-tasks and offloads them to cloud servers with powerful computing capabilities for execution, offering mobile users ample computing resources and efficient online resource management. However, for reasons of cost and stable energy supply, cloud computing centers are often concentrated in specific geographic locations, typically far from densely populated areas. This geographical separation inevitably leads to higher communication costs and increased latency, which can compromise the reliability of autonomous vehicles and heighten safety risks for users [6,7]. Mobile edge computing (MEC), which provides computing services by deploying distributed infrastructure near data sources, effectively closes this gap left by MCC [8,9,10]. In IoAV, MEC servers are deployed in roadside units (RSUs) along both sides of roads in dense traffic areas and sections prone to frequent accidents, or at base stations (BS) that cover large areas. Through vehicle-to-infrastructure (V2I) or vehicle-to-vehicle (V2V) communication, autonomous vehicles offload computationally intensive tasks to edge devices or nearby vehicles with spare resources for processing, which spreads the computational load, reduces latency and bandwidth consumption, and improves task execution efficiency [11,12]. However, edge servers and service vehicles with constrained computing resources often struggle to meet the demands of large-scale applications in scenarios with dense traffic flow [13]. This limitation can lead to communication congestion and network paralysis, which seriously affect traffic order and the safety of autonomous vehicles. Moreover, in-vehicle applications such as autonomous driving, IoV services, and advanced driver assistance systems (ADAS) exhibit significant task dependencies. Improper offloading strategies can directly impact system performance, response speed, and resource utilization [14]. Therefore, efficiently utilizing edge-side resources and ensuring task execution priority has become a critical challenge.
The task offloading of autonomous vehicles is commonly framed as a decision-making issue [15]. Traditional methods such as dynamic programming (DP) and the genetic algorithm (GA) struggle to adapt to dynamically changing environments, making it difficult to converge to global optima in complex and highly dynamic IoAV scenarios. Additionally, these traditional methods lack energy optimization and learning capabilities, suffering from high computational complexity and increased latency when multiple complex dependencies are involved. This can be catastrophic for tasks with strict latency requirements [16]. Deep reinforcement learning (DRL), a machine learning approach based on trial-and-error learning, has opened a new research direction for offloading decision-making [17]. It can learn complex dependencies and adaptively adjust the task-offloading strategy through continuous interaction with the environment. This enables flexible real-time scheduling and allocation in highly dynamic and uncertain IoAV scenarios, effectively handling offloading decision issues with non-convex optimization constraints. With the increasing complexity of task scenarios, traditional single neural networks (NNs) often face limitations in complex, dynamic, and resource-constrained environments. The multi-layer perceptron (MLP), in particular, has limited learning ability and struggles with high-dimensional, nonlinear problems [18]. In the highly dynamic, task-dependent vehicle network, these traditional networks often lack strong adaptability and generalization and fail to produce accurate operational strategies in response to emergencies, thus reducing system performance and efficiency. In scenarios with high safety requirements, such as autonomous driving, they may be incapable of responding in time to unexpected situations, compromising the safety and reliability of the system. Therefore, optimizing the DRL network architecture to enhance adaptability and robustness is essential for managing complex and dynamic vehicle task-offloading scenarios.
In this context, we propose an innovative task-offloading scheme that models the vehicle task-offloading optimization issues as a constrained Markov decision process (CMDP). Subsequently, we introduce an improved version of the twin-delayed deep deterministic policy gradient (TD3) algorithm, termed LT-TD3, which integrates a long short-term memory (LSTM) network [19] and a self-attention mechanism [20] to enhance network performance and efficiency. Considering the strong task dependencies, we employ a topological sorting algorithm to allocate execution priorities for subtasks, thereby optimizing task offloading efficiency.
The main contributions of this paper are as follows:
We consider the issues of vehicle task offloading and task dependence during offloading in an IoAV scenario within a densely populated urban area, which is modeled as a Markov decision process to minimize the delay and improve the offloading efficiency.
The proposed innovative deep reinforcement learning (DRL) algorithm, LT-TD3, integrates the TD3 algorithm with LSTM networks and the self-attention mechanism to enhance algorithm performance and efficiency.
To address the issue of strongly dependent tasks in the offloading process, we employ a topological sorting algorithm to assign offloading priorities to subtasks, optimizing the task-offloading sequence and reducing the impact of task dependencies on the overall offloading efficiency.
Simulation results demonstrate the effectiveness of the proposed algorithm in highly dynamic, densely populated scenarios with strongly dependent tasks. The proposed algorithm significantly improves convergence, energy consumption, and latency compared with baseline methods.
The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the system model and problem formulation. Section 4 presents the design and implementation of LT-TD3. Section 5 evaluates the simulation results of the proposed approach and compares it with existing methods. Section 6 concludes with a summary of the article and discusses potential future research areas.
3. System Model and Problem Formulation
The task dependence and overall utilization optimization of available computing resources are considered when constructing the system model for autonomous vehicles.
3.1. Vehicle Network Model
The designed cloud environment for autonomous vehicles primarily involves RSUs, autonomous vehicles, conventional vehicles, and the cloud control center, as shown in Figure 1. The cloud control center is deployed on the RSU, which communicates wirelessly with vehicles through the V2I transmission mode and is equipped with an MEC server to augment computational capabilities. Moreover, we introduce the concepts of resource units and resource pools to better represent the computing resources within the autonomous vehicle cloud. The resource unit is defined as the smallest unit of resources in the autonomous vehicle cloud, and all resource units form a resource pool, which is centrally managed by the cloud control center.
Vehicles that generate computation tasks within the coverage area of the vehicle cloud are referred to as task vehicles. When a task vehicle with limited computing resources potentially fails to meet high task requirements, it uploads task request information to the cloud control center via a wireless link. The control center evaluates its capability to process the task based on the received information and the available resources in the resource pool. If the task can be processed, the task vehicle uploads the task details to the autonomous vehicle cloud, and the control center allocates the necessary resource units for processing.
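The request-and-allocate exchange described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class and method names (`TaskRequest`, `ResourcePool`, `can_admit`, `allocate`, `release`) are hypothetical, resource units are treated as identical and countable, and admission is decided purely on free capacity, which is simpler than whatever evaluation the control center actually performs.

```python
from dataclasses import dataclass


@dataclass
class TaskRequest:
    """Request metadata a task vehicle uploads before sending the task itself."""
    data_size_mb: float   # task data size
    required_units: int   # resource units needed to complete the task
    max_delay_s: float    # maximum tolerable delay


class ResourcePool:
    """Hypothetical pool of identical resource units managed by the control center."""

    def __init__(self, total_units: int) -> None:
        self.total_units = total_units
        self.free_units = total_units

    def can_admit(self, request: TaskRequest) -> bool:
        # Assumption: admission is purely capacity-based.
        return request.required_units <= self.free_units

    def allocate(self, request: TaskRequest) -> bool:
        """Reserve units if possible; False means the caller falls back to local processing."""
        if not self.can_admit(request):
            return False
        self.free_units -= request.required_units
        return True

    def release(self, units: int) -> None:
        # Returned units never push the pool above its total capacity.
        self.free_units = min(self.total_units, self.free_units + units)
```

A rejected request leaves the pool unchanged, so the task vehicle can process the task locally or retry once units have been released.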
In addition, the scenario involves more complex vehicle tasks with dependencies; namely, the completion of a certain task depends on the data provided by other tasks. The set of tasks to be processed in the autonomous vehicle cloud is denoted as $\mathcal{K} = \{1, 2, \dots, K\}$; each task $k \in \mathcal{K}$ has the triple of properties $(D_k, C_k, T_k^{\max})$, where $D_k$ denotes the task data size, $C_k$ denotes the number of resources required to complete the task, and $T_k^{\max}$ denotes the maximum tolerable delay of the task. Each task $k$ can in turn be expressed as $k = \{k_1, k_2, \dots, k_{N_k}\}$, where $k_i$ represents the $i$th subtask of task $k$, and $N_k$ represents the total number of subtasks of task $k$. To simplify representation, we express each subtask with a triple $(D_i, C_i, T_i^{\max})$. Here, $D_i$ denotes the data size of subtask $i$. Moreover, $C_i$ represents the computing resources required by subtask $i$, and $T_i^{\max}$ represents the maximum delay that subtask $i$ can tolerate.
3.2. Task Dependency Model
A directed acyclic graph (DAG) is a structure consisting of nodes connected by directed edges, where no cycles are present, meaning it is impossible to start at one node, follow a sequence of directed edges, and return to the same node. This property enables us to model tasks in a way that inherently respects the order of dependencies. Because a DAG can comprehensively describe a task processing sequence, we use it to model vehicle tasks with dependencies. In this paper, tasks with dependencies are denoted as $G = (V, E)$, where $V$ represents the subtask set, with each subtask being a vertex of the DAG, and $E$ denotes the set of dependencies, with each dependency between subtasks forming a DAG edge. Consider a DAG edge $e_{ij} = (v_i, v_j) \in E$, which represents the dependency between subtasks $v_i$ and $v_j$. In this case, $v_j$ needs to obtain the output data from $v_i$ to proceed; namely, $v_i$ denotes the predecessor task of $v_j$, and $v_j$ denotes the successor task of $v_i$. It is noted that a subtask is called the entry or start task if it has no predecessor task. Similarly, a subtask is called the exit or end task if it has no successor task.
Figure 2a illustrates a vehicle task with dependencies using an augmented reality (AR) program. The video source and renderer cannot be offloaded and must be processed locally within the vehicle, whereas the tracker, mapper, and object recognizer can be offloaded. The dependencies between components are visible. For instance, the object recognizer requires the mapper's output to execute, and this output is subsequently used as input for the renderer.
Figure 2b shows the DAG modeling of vehicle-dependent tasks, where task 1 is the predecessor to task 4, and task 5 is the successor to task 4. In the DAG, task 1 is the start task, while task 13 is the end task.
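As an illustration of how execution priorities can be derived from such a DAG, the following sketch applies Kahn's algorithm, one standard topological sorting method; the paper does not state which variant it uses, so this choice is an assumption. The function name `topological_order` and the five-subtask example graph are hypothetical and are not the graph of Figure 2b.

```python
from collections import deque


def topological_order(num_tasks, edges):
    """Return subtask ids in an order where every predecessor comes first.

    edges is a list of (u, v) pairs meaning subtask v needs the output of u.
    Raises ValueError if the graph contains a cycle, i.e. is not a DAG.
    """
    successors = {t: [] for t in range(1, num_tasks + 1)}
    in_degree = {t: 0 for t in range(1, num_tasks + 1)}
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Entry tasks (no predecessor) are ready immediately.
    ready = deque(t for t in sorted(in_degree) if in_degree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in successors[task]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:  # all predecessors are now scheduled
                ready.append(nxt)
    if len(order) != num_tasks:
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical 5-subtask DAG: 1 is the entry task, 5 is the exit task.
example_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(topological_order(5, example_edges))  # prints [1, 2, 3, 4, 5]
```

Any order returned this way guarantees that a subtask is scheduled only after all of its predecessors, which is the property the offloading sequence relies on.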
3.3. Transport Model and Computation Model
In autonomous vehicle cloud task offloading, vehicles typically offload tasks to RSUs through V2I communication, in which the task communication link remains temporarily stable. The transmission rate in the V2I mode can be expressed using the Shannon formula:
$$R = W \log_2\left(1 + \frac{P h d^{-v}}{\sigma^2}\right),$$
where $W$ represents the bandwidth, $P$ represents the transmission power, $h$ and $v$ represent the channel gain and the path loss exponent, respectively, $\sigma^2$ denotes the noise power, and $d$ represents the distance between the task vehicle and the RSU. In this case, the transmission time that the task vehicle needs to offload a subtask is as follows:
$$t_i^{\mathrm{trans}} = \frac{D_i}{R},$$
where $D_i$ denotes the data size of the subtask. The execution time $t_i^{\mathrm{exe}}$ for task $i$ is denoted as follows:
If the task is processed locally, the resource pool does not allocate resources for it; instead, only the local computing resources of the task vehicle are used for processing. The local processing time, $t_i^{\mathrm{loc}}$, is denoted by the following:
$$t_i^{\mathrm{loc}} = \frac{C_i}{f^{\mathrm{loc}}},$$
where $C_i$ denotes the number of resources required to complete the task, and $f^{\mathrm{loc}}$ denotes the locally available resources.
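The timing quantities of this subsection can be evaluated numerically as in the sketch below. It assumes the common distance-based form of the Shannon rate, $R = W \log_2(1 + P h d^{-v} / \sigma^2)$, a transmission time equal to data size divided by rate, and a local processing time equal to required resources divided by locally available resources. The function names and the numeric values in the usage lines are hypothetical.

```python
import math


def v2i_rate_bps(bandwidth_hz, tx_power_w, channel_gain, distance_m,
                 path_loss_exp, noise_power_w):
    """Shannon rate with a d^(-v) path-loss term (assumed form)."""
    snr = tx_power_w * channel_gain * distance_m ** (-path_loss_exp) / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)


def transmission_time_s(data_bits, rate_bps):
    """Time to upload a subtask of the given size over the V2I link."""
    return data_bits / rate_bps


def local_time_s(required_resources, local_resources):
    """Processing time when only the task vehicle's own resources are used."""
    return required_resources / local_resources


# Hypothetical link: 1 MHz bandwidth and an SNR of exactly 3 at 1 m.
rate = v2i_rate_bps(1e6, 3.0, 1.0, 1.0, 2.0, 1.0)
print(rate, transmission_time_s(8e6, rate), local_time_s(10.0, 4.0))
```

Because the rate falls as the distance term grows, the same subtask takes longer to upload from a vehicle farther from the RSU, which is one reason local processing can still be preferable for small tasks.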
3.4. Energy Consumption Model
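Transmitting and processing vehicle tasks consumes energy, which should be minimized while adhering to the maximum tolerable task delay. The local processing energy consumption, $E_i^{\mathrm{loc}}$, is expressed as follows:
$$E_i^{\mathrm{loc}} = p^{\mathrm{loc}} \, t_i^{\mathrm{loc}},$$
where $p^{\mathrm{loc}}$ denotes the local processing power, and $t_i^{\mathrm{loc}}$ is the local processing time of task $i$.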
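If the task is transmitted to the autonomous vehicle cloud for processing, the energy consumed, $E_i^{\mathrm{cloud}}$, is expressed as follows:
$$E_i^{\mathrm{cloud}} = p^{\mathrm{cloud}} \, t_i^{\mathrm{exe}},$$
where $p^{\mathrm{cloud}}$ denotes the cloud processing power of autonomous vehicles, and $t_i^{\mathrm{exe}}$ is the execution time of task $i$ in the cloud.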
3.5. Dependency Task-Offloading Problem Formulation
In the DAG model used for modeling dependent tasks, time-sensitive or resource-constrained tasks are given additional priority by the topological sorting algorithm. This ensures that critical, latency-sensitive operations are not delayed when resource availability is limited. Thanks to this approach, the LT-TD3 can more effectively manage complex, interdependent task sequences, which improves the overall responsiveness of the system and ensures robust performance in real-time vehicular environments when resources are scarce. In this paper, the control center within the autonomous vehicle cloud allocates resource units to complete the vehicle tasks with dependencies, ensuring that tasks are completed within the maximum tolerable delay while reducing energy consumption. The formulation is as follows:
where
denotes the maximum tolerable task delay, and
denotes the total task completion time, which can be expressed as
. It has the following constraints:
$$\mathrm{C1}: \; x_n + y_n = 1, \quad x_n, y_n \in \{0, 1\},$$
$$\mathrm{C2}: \; T_n \le T_n^{\max},$$
where constraint C1 represents the offloading location of the task; $x_n$ and $y_n$ are marking bits, in which $x_n = 1$ indicates execution in the autonomous vehicle cloud and $y_n = 1$ indicates local execution in the task vehicle, meaning a subtask can only be executed in one location. Constraint C2 indicates that the delay $T_n$ of every task $n$ cannot exceed its maximum tolerable delay $T_n^{\max}$.
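A candidate offloading decision can be checked against the two constraints described above with a short helper. This is a minimal sketch: the function name `satisfies_constraints` and the representation of a decision as a list of (cloud flag, local flag) pairs are assumptions made for illustration only.

```python
def satisfies_constraints(placements, delays, max_delays):
    """Check the placement and deadline constraints for every subtask.

    placements: list of (x, y) flag pairs; x == 1 means cloud, y == 1 means local.
    delays:     completion delay of each subtask under this decision.
    max_delays: maximum tolerable delay of each subtask.
    """
    for (x, y), delay, limit in zip(placements, delays, max_delays):
        # Each subtask runs in exactly one location.
        if x not in (0, 1) or y not in (0, 1) or x + y != 1:
            return False
        # No subtask may exceed its maximum tolerable delay.
        if delay > limit:
            return False
    return True
```

For example, a decision that places one subtask in the cloud and one locally is feasible only if both finish within their own deadlines; a single late subtask makes the whole decision infeasible.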
4. Algorithm Design and Implementation
Aimed at the DAG task model in the autonomous vehicle cloud, this paper proposes an improved twin-delayed deep deterministic policy gradient (TD3) algorithm. The deep deterministic policy gradient (DDPG) algorithm [44] builds on the deep Q-network (DQN) algorithm [45], which cannot address continuous action control. While DDPG resolves this limitation, it still suffers from certain drawbacks, which TD3 [46] aims to overcome. Moreover, we have enhanced the foundational TD3 network by integrating the self-attention mechanism and LSTM to further boost network performance and efficiency. The self-attention mechanism focuses on key sections of the input data, dynamically adjusting weights to emphasize critical information. This approach effectively manages long-term dependencies, enabling the model to prioritize relevant details and filter out less essential data. LSTM is designed to capture both short-term and long-term dependencies in sequential data. It mitigates the vanishing gradient problem and retains important information over long sequences, which suits dynamic task requirements.
4.1. Environment Model
Firstly, the agent and environment are defined: the state space $\mathcal{S}$, the action space $\mathcal{A}$, and the reward function $r$.
4.1.1. State Space
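The state space $\mathcal{S}$ is defined as follows:
$$s = \{D_i, C_i, T_i^{\max}, U^{\mathrm{av}}, U^{\mathrm{tv}}, f^{\mathrm{loc}}, Q\},$$
where $D_i$, $C_i$, and $T_i^{\max}$ represent the triple-attribute task information, $U^{\mathrm{av}}$ represents the available resource units of the autonomous vehicles, $U^{\mathrm{tv}}$ represents the available computing resources of traditional vehicles, $f^{\mathrm{loc}}$ denotes the local computing resources, and $Q$ represents the task-sorting results in the DAG.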
4.1.2. Action Space
The action space $\mathcal{A}$ is defined as follows:
$$a = \{a_1, a_2, \dots, a_N\},$$
where $a_n$ refers to the resource units allocated to the $n$th task, and $a_n = 0$ represents that the task is processed locally. Moreover, the allocated resource units cannot exceed the total number $k$ of resource units in the current autonomous vehicle cloud.
4.1.3. Reward Function
The objective of this paper is to complete the tasks within the maximum tolerable delay while reducing energy consumption as much as possible. The reward
is defined as follows:
where
denotes the maximum acceptable task delay,
denotes the total time consumed in task processing,
K denotes the positive constant value,
and
denote flag bits representing the task-offloading positions.
and
represent the total energy consumption values generated by the tasks in the autonomous vehicle cloud and during local processing, respectively. In addition, a penalty should be given when the total task delay exceeds the maximum tolerable task delay, where the penalty value is
.
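To make the roles of the quantities above concrete, the following sketch shows one possible shape for such a reward. It is an assumption rather than the paper's exact expression: a positive constant minus the energy of the selected execution location when the deadline is met, and a fixed negative penalty otherwise. The function name, the default constant, and the default penalty value are hypothetical.

```python
def offloading_reward(total_delay, max_delay, x_cloud, y_local,
                      e_cloud, e_local, k_const=10.0, penalty=-10.0):
    """Assumed reward shape: constant minus energy, or a penalty on a deadline miss.

    x_cloud and y_local are the flag bits marking where the task ran;
    exactly one of them is expected to be 1.
    """
    if total_delay > max_delay:
        # Deadline violated: return the (hypothetical) fixed penalty.
        return penalty
    # Only the energy of the chosen location contributes.
    energy = x_cloud * e_cloud + y_local * e_local
    return k_const - energy
```

Under this shape, two decisions that both meet the deadline are ranked by energy alone, so the agent is pushed toward the cheaper location once timeliness is secured.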
4.2. LT-TD3-Based Task-Offloading Algorithm
To address the challenge of task offloading with dependencies, tasks are decomposed into subtasks that are processed using corresponding resource units allocated by the control center. This paper utilizes an improved TD3 algorithm, which introduces three key enhancements to DDPG and additionally integrates the self-attention mechanism and LSTM to manage these tasks. Firstly, following the double DQN (DDQN) approach, TD3 uses two critic networks to estimate the Q-value and takes the smaller of the two estimates to form the target, which mitigates the overestimation issue in DDPG. Secondly, TD3 implements a delayed update mechanism, where the actor is updated only after multiple updates to the critics, which reduces fluctuation during optimization and yields a more stable policy. Finally, TD3 introduces target policy smoothing, which injects clipped noise into the target action so that similar actions receive similar value estimates. This prevents the policy network from prematurely focusing on narrow action choices, enhancing sample efficiency and accelerating algorithm convergence.
Moreover, integrating the self-attention mechanism and LSTM into the critic and actor improves the network's ability to capture long-range dependencies and its interoperability, allowing it to capture temporal and contextual dependencies in real time. By evaluating and weighting time series data, attention focuses on the key information in the data to support informed decision-making. Unlike traditional DRL methods, LT-TD3 utilizes LSTM to model long-term temporal relationships in vehicle mobility and task demands, allowing the system to predict and adjust its strategy with greater stability under environmental fluctuations, which enhances its capability to adapt dynamically to vehicular environments. Furthermore, the self-attention mechanism allows the algorithm to focus dynamically on critical features and changes in the network environment, enhancing the decision-making of LT-TD3 in scenarios where multiple factors are changing rapidly. This combination enables LT-TD3 to offer more precise and efficient task-offloading decisions under highly dynamic conditions. The network parameter optimization details are described below.
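The weighting behaviour attributed to self-attention can be illustrated with a dependency-free sketch of scaled dot-product self-attention over a short sequence of state vectors. The attention variant used in LT-TD3 is not spelled out here, so this form is an assumption; for brevity the query, key, and value projections are taken to be the identity, whereas a trained network would learn them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    sequence: list of equal-length feature vectors, one per time step.
    Returns one context vector per time step.
    """
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        # Similarity of this step to every step, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in sequence]
        weights = softmax(scores)
        # Weighted average of all steps: higher weight means more influence.
        outputs.append([sum(w * value[j] for w, value in zip(weights, sequence))
                        for j in range(dim)])
    return outputs
```

Each output is a convex combination of the input steps, so time steps that resemble the current one contribute more to its context vector, which is the sense in which attention emphasizes key information.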
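Following the principles of DDQN, TD3 employs two critic networks in both the main and target networks. The smaller Q-value from the two target critic networks is chosen to compute the target value $y$, described as follows:
$$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}),$$
where $r$ is the reward, $\gamma$ is the discount factor, $s'$ is the next state, and $\tilde{a}$ denotes the combination of the target action $\pi_{\phi'}(s')$ and random noise, which is introduced to avoid falling into local optima during exploration of the action space. This is similar to the $\epsilon$-greedy principle in the DQN algorithm. $\tilde{a}$ is specified as follows:
$$\tilde{a} = \pi_{\phi'}(s') + \epsilon,$$
where $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$, and $c$ denotes the cut-off value.
The actor network updates its parameters using deterministic policy gradients, which are expressed as follows:
$$\nabla_{\phi} J(\phi) = \frac{1}{N} \sum \nabla_{a} Q_{\theta_1}(s, a)\big|_{a = \pi_{\phi}(s)} \, \nabla_{\phi} \pi_{\phi}(s),$$
where $N$ is the mini-batch size.
The parameters are updated according to the following formulation:
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\, \theta_i', \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\, \phi',$$
where $\tau$ is the soft-update coefficient.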
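The target computation and parameter update can be sketched without any deep learning framework by treating the networks as plain callables and the parameters as lists of floats. The sketch follows the standard TD3 formulation (clipped Gaussian noise on the target action, the minimum of two target critics, and a soft update of the target parameters); the function names and the action bounds are hypothetical.

```python
import random


def clip(value, low, high):
    return max(low, min(high, value))


def smoothed_target_action(target_actor, next_state, sigma, noise_clip,
                           action_low, action_high, rng=random):
    """Target action plus clipped Gaussian noise (target policy smoothing)."""
    noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    return clip(target_actor(next_state) + noise, action_low, action_high)


def td3_target(reward, gamma, next_state, target_action,
               target_critic_1, target_critic_2, done=False):
    """Bootstrapped target using the smaller of the two target-critic estimates."""
    q_min = min(target_critic_1(next_state, target_action),
                target_critic_2(next_state, target_action))
    return reward + (0.0 if done else gamma * q_min)


def soft_update(target_params, main_params, tau):
    """Move each target parameter a fraction tau toward its main-network value."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(main_params, target_params)]
```

Taking the minimum of the two critics biases the target downward, which is how the overestimation seen with a single critic is counteracted; a small `tau` keeps the target networks changing slowly.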
Considering the time-varying variables in the autonomous vehicle environments, this section proposes LT-TD3, a dependent task-offloading algorithm. LT-TD3 comprises three components: the main network, target network, and experience pool. Both the main network and target network consist of three deep neural networks: two critic networks and one actor. The actor maps the state in the autonomous vehicle cloud to a specific action (for example, the allocation of resource units), exploring and identifying the optimal policy. The two critic networks evaluate the performance of the current policy, providing feedback to support the actor learning process. Additionally, the self-attention mechanism and LSTM are integrated into both the critic and actor networks to capture long-term dependencies, which accelerate network convergence, enhancing system decision-making capabilities. The specific structure of the LT-TD3 algorithm is illustrated in Figure 3, and its pseudocode is provided in Algorithm 1.
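Algorithm 1: LT-TD3 algorithm.
Input: System state information
Output: Resource unit allocation
Initialize the experience pool $D$
Initialize critics $Q_{\theta_1}$, $Q_{\theta_2}$ and actor $\pi_{\phi}$ with random parameters $\theta_1$, $\theta_2$, $\phi$
Initialize the target networks $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi' \leftarrow \phi$
for each episode do
    for $t = 1$ to $T$ do
        Observe the state $s_t$ and reward $r_t$
        Choose the action $a_t$ with exploration noise
        Update state $s_{t+1}$
        Store quadruple $(s_t, a_t, r_t, s_{t+1})$ in $D$
        Sample mini-batch from $D$
        Calculate target Q:
            $y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a})$
        Update the parameters $\theta_i$:
            $\theta_i \leftarrow \arg\min_{\theta_i} \frac{1}{N} \sum \left(y - Q_{\theta_i}(s, a)\right)^2$
        if $t \bmod d = 0$ then
            Update $\phi$ with the deterministic policy gradient:
                $\nabla_{\phi} J(\phi) = \frac{1}{N} \sum \nabla_{a} Q_{\theta_1}(s, a)\big|_{a = \pi_{\phi}(s)} \nabla_{\phi} \pi_{\phi}(s)$
            Update the target network parameters:
                $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\, \theta_i'$
                $\phi' \leftarrow \tau \phi + (1 - \tau)\, \phi'$
        end if
    end for
end for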
5. Simulation Results and Analysis
5.1. Experimental Environments
Data collected using the VISSIM simulator was employed for the experiment. The experimental environment included Windows 10, an i7-6700HQ CPU, and 16 GB of memory. To enhance algorithm stability and speed up convergence, an adaptive cosine annealing learning rate strategy with warm-up training was used to guide network parameters progressively and smoothly toward optimal values. The learning rate was set at 0.01, with a discount factor of 0.9. Moreover, the replay memory size and batch size were configured to 500 and 32, respectively, to reduce memory consumption during training, thus improving the efficiency of algorithm training. It is noteworthy that VISSIM offers a detailed and controllable environment for simulating traffic flow and network conditions, but it may not fully capture the complexities of real-world vehicular networks, including environmental interference, hardware variability, and unpredictable congestion patterns. Therefore, we integrated real-world vehicular environment data and added vehicular network conditions to VISSIM to narrow this gap. Furthermore, the modular design of LT-TD3 allows components such as the reward function and resource constraints to be changed to adapt to unfamiliar scenarios without redesign.
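A learning rate schedule of this kind can be sketched as a linear warm-up followed by cosine decay. The base rate of 0.01 comes from the setup above; the warm-up length, the total number of steps, the minimum rate, and the function name are hypothetical, and the adaptive rule mentioned in the text is not specified, so a plain cosine schedule is assumed.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=0.01, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early updates stay small.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 100 steps with a 10-step warm-up.
schedule = [warmup_cosine_lr(s, 100, 10) for s in range(101)]
```

The warm-up phase avoids large parameter jumps while the critics are still poorly fitted, and the cosine phase lowers the rate smoothly toward the end of training.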
5.2. Experimental Results and Analysis
To evaluate the performance of the proposed LT-TD3 algorithm, this paper compares it against foundational models such as TD3, the DDPG algorithm, and local-only processing—namely, no offloading. The experimental metrics include average task delay and task completion rate, which significantly impact the performance and user experience of real-world vehicles. Vehicle tasks with dependencies are generally more complex. A lower average task delay typically indicates better performance and faster task completion, ensuring that the vehicle receives processing results in a timely manner, especially in scenarios that require rapid responses, such as sudden obstacle avoidance and traffic signal recognition. This significantly improves the real-time performance and stability of the system. Similarly, the task completion rate is a crucial metric, measuring the percentage of tasks completed within their deadlines. A higher task completion rate reflects more efficient resource utilization and fewer tasks exceeding their maximum tolerable delay. A high task completion rate enables vehicles to continuously perform various computation-intensive tasks, including path planning, object detection, and driving decisions, thus enhancing the safety and reliability of driving.
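The two metrics can be computed from per-task records as in the following sketch, which assumes each task is described by its measured delay and its maximum tolerable delay. The function names are hypothetical, and the completion rate is expressed as a percentage.

```python
def average_task_delay(delays):
    """Mean delay over all tasks, in the same unit as the inputs."""
    return sum(delays) / len(delays)


def task_completion_rate(delays, deadlines):
    """Percentage of tasks whose delay does not exceed their deadline."""
    completed = sum(1 for d, limit in zip(delays, deadlines) if d <= limit)
    return 100.0 * completed / len(delays)
```

The two metrics are complementary: a low average delay can hide a few tasks that miss their deadlines, which the completion rate exposes.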
Table 1 and Figure 4 illustrate the average task delay under varying numbers of vehicles in the autonomous vehicle cloud, consisting of 50% autonomous vehicles and 50% conventional vehicles. The experimental results indicate that an increasing number of vehicles raises the complexity of the state and action spaces, while the presence of more nearby vehicles introduces additional resource units, providing a larger resource pool for allocation. Compared to the baselines, LT-TD3 can allocate the optimal number of resource units, effectively reducing the task delay as the number of vehicles increases, with the average task delay maintained at about 1 s. In contrast, the worst delay of the baseline models reaches 2.90 s, which affects the decision-making efficiency of the system.
Table 2 and Figure 5 display the average task delay for different task data sizes in the autonomous vehicle cloud. The task data size is a critical factor affecting the average task delay: the larger the task data, the greater the transmission time and computing resources required, making the task more challenging to complete. In this experiment, the initial task size (100%) is set between 10 and 40 MB. As the task size increases, the average task latency for all algorithms rises. When the task size increases to 120%, the delay of baseline models is over 2 s. However, LT-TD3 consistently maintains a lower average task latency compared to the other algorithms, with the task delay being 1.59 s. This demonstrates that LT-TD3 effectively allocates an appropriate number of resource units, ensuring the timely completion of tasks even when handling large volumes of task data.
Table 3 and Figure 6 illustrate the task completion rates under varying maximum tolerable delays for different tasks. The task completion rate reflects the proportion of tasks completed within the maximum acceptable delay, which is critical in determining whether a task can be completed on time. The experiment simulates different acceptable delay limits, with the initial acceptable delay (100%) set at 2–3 s. As the tolerable delay decreases, the task completion rate declines for all algorithms, and the lowest completion rate among the baseline models reaches 49.6%. In contrast, the LT-TD3 algorithm maintains a task completion rate of 99.2%. Conversely, as the acceptable delay increases, indicating greater tolerance for delay, the highest completion rate of the baseline models improves to 98.5%, but it still falls short of the LT-TD3 algorithm. Despite lower available local resources, which limit task completion rates across all algorithms, LT-TD3 demonstrates the critical importance of efficient task offloading. In summary, LT-TD3 maintains a consistently high task completion rate, whether handling tasks with strict or relaxed delay requirements.
Table 4 and Figure 7 present the average task delay under varying resource availability at edge devices. This study focuses on vehicle tasks with dependencies, aiming to identify an optimal resource allocation strategy that ensures efficient task completion. The number of resource units provided by edge devices, such as those from roadside units (RSUs), traditional vehicles, and autonomous vehicles, is a critical factor influencing whether offloaded tasks can be completed within the required time. Initially, the number of resource units is set to 100%. As resources decrease, the average task delay rises across all algorithms. The average task delay of the baseline models all exceeded 2 s, while the LT-TD3 algorithm always maintained a lower delay, with an average task delay of 1.81 s. Even when edge device resources increased to 120%, the LT-TD3 algorithm still achieved the lowest average task delay of 1.03 s, while the baseline models improved less, with a lowest delay of 1.49 s. These experimental results highlight that the LT-TD3 algorithm can dynamically adapt to different levels of edge device resources by selecting the optimal resource allocation strategy, ensuring lower task delay, improved vehicle safety, and enhanced user driving experience.
While the proposed LT-TD3 algorithm demonstrates promising results in optimizing task offloading and resource allocation in vehicular networks, the system performance may be degraded when confronting various extreme emergencies in large-scale networks with a high density of edge nodes and vehicles. Moreover, managing task dependencies using DAGs and optimizing them with topological sorting algorithms may not be optimal as the network size increases. In future work, we will enhance the LT-TD3 network by designing powerful feature extraction modules and integrating V2I and V2V networks to provide more options for the system; this expansion aims to broaden its applicability, especially in scenarios with high dynamic complexity. Furthermore, exploring alternative solutions for managing task dependencies will refine task dependency management, enabling more effective handling of intricate dependency structures, which contributes to greater system stability and adaptability.
6. Conclusions
Addressing the issue of delay and energy consumption in vehicle task offloading within the autonomous vehicle cloud environment, this paper utilizes RSUs and nearby vehicles (both traditional and autonomous) to form an autonomous driving network that provides resource units for task vehicles. Additionally, the task-offloading issue is formulated as a Markov decision process, and the LT-TD3 algorithm is proposed to solve it, which jointly optimizes delay and energy consumption under the constraint of limited resource unit availability. To further enhance agent performance and efficiency, LSTM and a self-attention mechanism are integrated into the LT-TD3 algorithm. This enables the network to capture long-term dependencies, which accelerates network convergence and enhances feature extraction capabilities. Considering the strongly dependent tasks, a topological sorting algorithm is employed to prioritize subtasks with dependencies, allowing for execution in a logical order, enhancing the efficiency of task scheduling, and minimizing processing delays and energy consumption. Comprehensive experiments demonstrate that the average task delay and completion rate of LT-TD3 outperform those of baseline models under various conditions. In the future, we will further enhance the algorithm by designing reward functions and resource constraints tailored for various autonomous environments. This will be validated in diverse settings, including rural areas with sparse edge nodes and urban scenarios with high communication interference, with the aim to increase the robustness and flexibility of the task-offloading strategy to meet the demands of high-bandwidth, low-latency applications required by autonomous vehicles. Moreover, the development of 5G and 6G technologies allows the algorithm to be extended to leverage these networks, further improving offloading efficiency and reducing task latency. 
Increasing compatibility with different vehicle communication standards and handling interoperability with transportation systems to ensure that the LT-TD3 can be readily deployed in real-world networks is also a challenge we will explore in the future.