Article

Autonomous Driving Control Using the DDPG and RDPG Algorithms

Che-Cheng Chang, Jichiang Tsai, Jun-Han Lin and Yee-Ming Ooi
1 Department of Information Engineering and Computer Science, Feng Chia University, Taichung City 407, Taiwan
2 Department of Electrical Engineering, Graduate Institute of Communication Engineering, National Chung Hsing University, Taichung City 402, Taiwan
3 Department of Electrical Engineering, National Chung Hsing University, Taichung City 402, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(22), 10659; https://doi.org/10.3390/app112210659
Submission received: 6 October 2021 / Revised: 23 October 2021 / Accepted: 9 November 2021 / Published: 12 November 2021
(This article belongs to the Special Issue New Trends in Robotics, Automation and Mechatronics (RAM))

Abstract

Recently, autonomous driving has become one of the most popular topics for smart vehicles. However, traditional control strategies are mostly rule-based and adapt poorly to time-varying traffic conditions; they also have difficulty coping with unexpected situations that may occur at any time in a real-world environment. Hence, in this paper, we exploited Deep Reinforcement Learning (DRL) to enhance the quality and safety of autonomous driving control. Based on the road scenes and self-driving simulation modules provided by AirSim, we used the Deep Deterministic Policy Gradient (DDPG) and Recurrent Deterministic Policy Gradient (RDPG) algorithms, combined with a Convolutional Neural Network (CNN), to realize the autonomous driving control of self-driving cars. In particular, by using the real-time road images provided by AirSim as the training data, we carefully formulated an appropriate reward-generation method to improve both the convergence speed of the adopted DDPG and RDPG models and the control performance of the moving driverless cars.

1. Introduction

During the past decade, there have been many use cases for artificial intelligence in smart vehicles, e.g., convenience, exploration, rescue, and so on. Among these applications, the most important topic is to make a vehicle capable of moving autonomously, i.e., an autonomous vehicle, also known as a driverless car. At first, autonomous vehicles were implemented through rule-based techniques. Obviously, it is impossible in practice to consider and add all the necessary rules to a system as complex as an autonomous vehicle. Namely, rule-based control strategies are unfit for time-varying traffic conditions, and they also have difficulty coping with unexpected traffic situations in a real-world environment. On the other hand, some control strategies are implemented based on the absolute positioning information from the Global Positioning System (GPS), which may result in precision and availability problems; that is, in some scenarios, GPS may not always be precise and available because of the effects of signal attenuation and multipath propagation [1,2,3]. Hence, instead of rule-based control strategies and absolute positioning information, we exploited DRL with a CNN to indirectly use relative positioning information to enhance the quality and safety of autonomous driving control. More specifically, via the road scenes and self-driving simulation modules provided by AirSim [4], we utilized the DDPG and RDPG algorithms, combined with the CNN, to realize autonomous driving control.
Next, two important issues need to be discussed in advance, i.e., the adopted algorithms and the sensor data. First, there are several kinds of DRL algorithms in the literature, and each has its own characteristics. For example, the Deep Q Network (DQN) approach involves the design of a discrete action space, and it does not perform well in complex applications in practical scenarios. Thus, two algorithms with a continuous action space, the DDPG and RDPG, were chosen to implement the autonomous driving control strategies in this research. All existing DRL concepts proposed in the literature are only frameworks, so we still need to elaborately perform the design and experiments to realize a specific application; this is the main contribution of this work. On the other hand, the two main kinds of sensors used to realize autonomous driving control are cameras and Light Detection and Ranging (LiDAR). LiDAR uses pulses of light to detect objects, that is, to determine the distance and range of an object. For instance, for collision avoidance, according to the distance information retrieved by LiDAR, we can detect the distance to an object and then slow down the vehicle if needed. However, the raw LiDAR data are massive and troublesome to store, transfer, and process. This was the critical motivation for us to realize an autonomous control strategy using the camera, which provides lightweight visual data, as opposed to the raw LiDAR data. Furthermore, with the development of computer vision algorithms, objects can be identified immediately while driving. This helps us implement many control strategies, e.g., collision avoidance, lane changes, recognizing traffic signs (even reading the text from a sign using Optical Character Recognition (OCR)), and so on.
There are several related research works in the literature, which we compared with our approach before conducting our experiments. To the best of our knowledge, ours is the only work that simultaneously considers both camera vision and distance information based on different methods for a simulated vehicle model in order to present a comprehensive discussion. For example, the authors in [5] implemented their work with only a LiDAR sensor based on the DDPG method. In [6], although the authors realized their work based on LiDAR and an odometer, they only considered the ϵ-greedy policy of the DQN with different parameters to update the neural network. In [7], the method was implemented based on virtual robots. Namely, rather than using a simulated vehicle model, they directly assumed the simulation properties of a virtual robot, e.g., the gyration radius and mass, the maximum speed and the maximum acceleration, and so on. Obviously, this may reduce the level of confidence in the results. More importantly, we used a dedicated procedure to properly process the camera vision before importing it into the training and testing procedures instead of retrieving the information from the simulation environment directly. This improves the portability of our autonomous control strategy to a real vehicle in a real-world environment later. For instance, in [3,8,9], the vision information captured from the simulation software was utilized directly; again, this may reduce the level of confidence when realizing the designs in a real-world environment.
The rest of our paper is organized as follows. In Section 2, the preliminary knowledge related to this work is reviewed, e.g., AirSim, the DDPG, the RDPG, and so on. Then, we detail our designs in Section 3, and the experimental results corresponding to different simulation settings are analyzed and discussed in Section 4. In particular, by using the real-time images of the road provided by AirSim as the training data, we carefully formulated an appropriate reward-generation architecture for the purpose of improving the convergence speed of the adopted models and the control performance of moving driverless cars. Finally, this research is concluded in Section 5, and we also contemplate some possible future research.

2. Preliminaries

2.1. AirSim

AirSim is a simulator for drones, cars, and other vehicles. It is open-source and cross-platform and supports software-in-the-loop and hardware-in-the-loop experiments with several popular controllers for physically and visually realistic simulations [4]. AirSim is built on Unreal Engine and developed as a plugin that can be dropped into Unreal environments. Unreal Engine is a complete suite of development tools for anyone working with real-time technology [10]; it gives creators freedom and control to build immersive virtual worlds.
Consequently, we can generate training data from AirSim for machine learning. For example, we can record the vehicle pose, views, distance information from the ego-vehicle to particular objects, and so on, for each frame. Notice that AirSim supports many sensors, including cameras, Inertial Measurement Units (IMUs), GPS, distance sensors, LiDAR, and so forth, which can be configured for distinct scenarios and applications.
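To make this data flow concrete, the following Python sketch pulls one training sample from AirSim's Python API (a front-camera image, the car state, and two distance sensors) and applies a control command. It is a minimal sketch rather than the exact code of this work; in particular, the sensor names "DistanceLeft" and "DistanceRight" are hypothetical and must match those configured in the AirSim settings.json.

    import numpy as np
    import airsim

    # Connect to the AirSim car simulation and take API control of the vehicle.
    client = airsim.CarClient()
    client.confirmConnection()
    client.enableApiControl(True)

    # Request one uncompressed color frame from the front-center camera.
    response = client.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene, pixels_as_float=False, compress=False)
    ])[0]
    frame = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    frame = frame.reshape(response.height, response.width, 3)  # H x W x 3 image

    # Read the driving speed (m/s) and the two assumed distance sensors.
    speed = client.getCarState().speed
    d_left = client.getDistanceSensorData("DistanceLeft").distance
    d_right = client.getDistanceSensorData("DistanceRight").distance

    # Apply a control command computed by the agent (throttle, steering, brake).
    client.setCarControls(airsim.CarControls(throttle=0.5, steering=0.0, brake=0.0))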

2.2. Elements of Reinforcement Learning

Beyond the agent and the environment, four main subelements of a Reinforcement Learning (RL) system are the policy, reward, value function, and model of the environment [11,12,13] (a minimal interaction loop tying these elements together is sketched after this list):
  • A policy is a mapping from perceived states of the environment to actions to be taken. In some cases, it may be a simple lookup table, while in other cases it may involve extensive computation;
  • After each step, the environment sends to the agent a single number called the reward. Then, the objective is to maximize the total reward in the entire procedure;
  • Although the reward indicates what is better/worse immediately, a value function specifies what is better/worse in the long run; namely, the value of a state is the total reward that an agent can expect to accumulate over time, starting from that state;
  • A model of the environment is something that mimics or infers the behavior of the environment. A model is used for planning, by which we can consider possible futures before we actually experience them. Solving an RL problem with a model to consider possible futures is called a model-based method, as opposed to a model-free method, which relies on pure trial and error.
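The following schematic Python loop illustrates one episode of the agent-environment interaction described above; env and agent are placeholders for any environment and agent implementation, not a specific library API.

    def run_episode(env, agent, max_steps=1000):
        """Run one episode: the policy maps states to actions, the environment
        returns rewards, and the agent learns value estimates from the feedback."""
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.act(state)                 # policy: state -> action
            next_state, reward, done = env.step(action)
            agent.observe(state, action, reward, next_state, done)  # for value learning
            total_reward += reward                    # objective: maximize the total reward
            state = next_state
            if done:
                break
        return total_reward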

2.3. Model-Free Methods

Since the model of a model-based method must be accurate enough to be useful, a model-free method can have advantages in complex applications. Accordingly, when we use a model-free method, we are not confronted with the problem of constructing a sufficiently accurate environment model. More specifically, a model-free strategy relies on stored values for state–action pairs, i.e., estimates of the return that the agent can expect for each action taken from each state, which are obtained over many trials from start to finish. Once these estimates are good enough approximations of the optimal returns, the agent simply selects the action with the largest action value at each state in order to make optimal decisions [11].
Among model-free approaches, there are several primary branches, e.g., the value-based DQN, the policy-based DDPG, the policy-based RDPG, and so forth. In particular, the DQN tries to estimate the value of each state–action pair, so its performance drops when the considered action space is complex/continuous. However, this issue is easily dealt with by the DDPG, which inherits the ability to handle high-dimensional, continuous outputs from the Deterministic Policy Gradient (DPG) [14,15]. On the other hand, the RDPG is another extended version of the DPG. It is coupled with the memory concept in order to solve a variety of physical control problems; more specifically, the RDPG approach addresses both the short-term integration of information and long-term memory problems [16]. According to the above discussion of the properties of RL, we utilized both the DDPG and RDPG, combined with the CNN, to realize our autonomous driving control strategies in Section 3.

2.4. DDPG

Applying the DQN to a continuous action space is not possible, since finding the greedy policy of the DQN at every time step takes too long to be practical. Hence, in [15], the authors used an actor–critic-based concept to realize the DDPG algorithm. In the actor–critic architecture, the actor function specifies the current policy by mapping states to a specific action, while the critic function acquires knowledge as in the DQN. Such an architecture allows the DDPG to utilize neural network approximators efficiently in continuous state–action spaces. The abstract flowchart of the DDPG is shown in Figure 1.
In Figure 1, the actor part takes the input state s and outputs the action a. Then, the next state s′ is obtained from the feedback of the environment. On the other side, the critic part provides its critique (the Q-value) by taking state s and action a as the inputs, which is used to update the DDPG later.
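To illustrate this actor–critic interplay, the following PyTorch sketch shows the core DDPG update on one minibatch: the critic regresses toward a bootstrapped target, the actor is updated through the critic's gradient, and the target networks are softly updated. The network classes, optimizers, and replay buffer are assumed to be defined elsewhere, and the constants are illustrative rather than the ones used in this paper.

    import torch
    import torch.nn.functional as F

    GAMMA, TAU = 0.99, 0.005  # discount factor and soft-update rate (illustrative)

    def ddpg_update(batch, actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt):
        """One DDPG gradient step on a minibatch (s, a, r, s2, done)."""
        s, a, r, s2, done = batch

        # Critic: regress Q(s, a) toward r + gamma * Q'(s2, mu'(s2)).
        with torch.no_grad():
            target_q = r + GAMMA * (1 - done) * target_critic(s2, target_actor(s2))
        critic_loss = F.mse_loss(critic(s, a), target_q)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: maximize Q(s, mu(s)) by minimizing its negative mean.
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft (Polyak) updates of the target networks.
        for net, target in ((actor, target_actor), (critic, target_critic)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1 - TAU).add_(TAU * p.data)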

2.5. RDPG

The partial observability of a control problem is the trickiest part of an implementation in the real world. For example, consider the scenario of using a camera in a dynamic scene: we may be confronted with problems such as a static image providing no information regarding velocity, occlusion, and a restricted field of view. These cause many problems for the control strategy. Consequently, in [16], the authors demonstrated that their RDPG algorithm, which is based on a recurrent neural network, could learn more effectively in partially observed problems. The abstract flowchart of the RDPG is shown in Figure 2.
Figure 2 illustrates that we need to input several states in a time sequence s_0, s_1, …, s_n and output the action a. In particular, the next sequence of states s′_0, s′_1, …, s′_n is obtained from the feedback of the environment; this reflects the memory property of the RDPG. On the other side, the critic part provides its critique (the Q-value) by taking the sequence of input states s_0, s_1, …, s_n and the action a as the inputs, which is used to update the RDPG later.

2.6. HSV Color Space

The Hue, Saturation, Value (HSV) color space is an alternative representation of the Red, Green, Blue (RGB) color space. The HSV color space is more closely aligned with the attributes of human vision than the RGB model; namely, it better represents the color gradations found in nature [17,18]. In particular, a color is specified by h, s, and v, as shown in Figure 3:
  • h corresponds to an angle from 0° to 360° for a specific color, e.g., red is at 0° and yellow at 60°;
  • s ∈ [0, 1] measures the departure from white;
  • v ∈ [0, 1] measures the departure from black;
  • Figure 3a presents the HSV color model and Figure 3b the HSV triangle.

3. Our DRL Control Strategies

In this section, we start by detailing our autonomous driving control strategies. All existing DRL concepts proposed in the literature are only frameworks, so we still needed to elaborately perform the design and experiments to realize a specific application. This is the main contribution of this work. First, the block diagram shown in Figure 4 is utilized to explain the relationship between AirSim and our autonomous driving control strategies. Thanks to the design of the simulation architecture, for the different DRL approaches, the DDPG and RDPG, we merely need to replace the source code in the DRL part (right component) with another one. Specifically, AirSim (left component) generates training data and sends them to the DRL approach for the training procedure. After the training procedure has finished, upon receipt of the necessary information, the latter continuously provides responses to the former for the purposes of controlling the vehicle autonomously.

3.1. Designs for the Reward Mechanism

The reward mechanism is composed of several main components, i.e., R_area, R_sensor, R_velocity, R_direction, and R_punishment. Consider the following scenario: if a car is following the road accurately, the image captured by the front camera mounted on the vehicle will include a higher percentage of the road area (Figure 5a). On the contrary, if the image includes a lower percentage of the road area, the car is not properly following the road and will go out of control in a short time (Figure 5b). According to this idea, the first component of the reward mechanism was set to:
R_area = (A_road − T_1) × F_1,
where A_road is the measure of the road area, T_1 the threshold, and F_1 the weight. Both T_1 and F_1 are responsible for confining R_area within a certain range.
Here, we want to detail how to use computer vision to obtain the measure of the road area. In accordance with the discussion presented in Section 2.6, the HSV color model better represents the color gradations found in nature. Hence, using Figure 6a as an example, the first step is to convert the color space (Figure 6b). Then, binarization [19,20] is applied to replace each pixel in the image with a black/white pixel (Figure 6c); this helps us obtain the measure of the road area. The corresponding result for the example in Figure 6a is illustrated in Figure 6d.
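A minimal OpenCV sketch of this measurement is given below; it assumes the road shows up as a low-saturation (grayish) region, and the HSV thresholds are illustrative values that would need to be tuned to the actual AirSim scene.

    import cv2

    def road_area_ratio(bgr_image):
        """Estimate A_road as the fraction of the frame occupied by the road:
        convert to HSV, binarize with a color threshold, and take the largest
        contour as the road region."""
        hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)

        # Binarization: keep low-saturation pixels as road candidates
        # (illustrative bounds; tune them to the simulated scene).
        mask = cv2.inRange(hsv, (0, 0, 50), (180, 60, 220))

        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return 0.0
        road = max(contours, key=cv2.contourArea)     # assume the largest contour is the road
        h, w = mask.shape
        return cv2.contourArea(road) / float(h * w)   # A_road as an area ratio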
The second component is R_sensor, which is motivated by the difference between the distances from the ego-vehicle to the right and left road edges. Since the environment uses right-hand traffic (RHT), a higher reward is obtained when the vehicle keeps to the right side. Note that the distances from the ego-vehicle to the right and left road edges can be obtained by attaching distance sensors to the vehicle.
R_sensor = (D_left − D_right) × F_2,
where D_left and D_right are obtained by the distance sensors and F_2 is responsible for confining R_sensor within a certain range.
The next two components, R_velocity and R_direction, concern the stability of the ego-vehicle while driving. More specifically, if the vehicle speed is similar to that at the previous point in time, a larger reward is given. The concept for the vehicle direction is similar to that for the vehicle speed. Hence, we obtain the following two designs for the reward components:
R_velocity = 1 / ((|V_t − V_{t−1}| − T_3) × F_3),
R_direction = 1 / ((|D_t − D_{t−1}| − T_4) × F_4),
where V_t is the vehicle speed at time t and D_t the vehicle direction at time t. Moreover, T_3, T_4, F_3, and F_4 are responsible for confining R_velocity and R_direction within certain limits.
The last component is the punishment part, R_punishment. Here, violating a traffic regulation incurs a negative reward and then resets the environment to the initial state. The design is as follows:
R_punishment = { −C (constant), if the environment is reset; 0, otherwise }.
Finally, the complete reward mechanism is composed of all the above components:
R_total = R_area + R_sensor + R_velocity + R_direction + R_punishment.
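A sketch of the resulting reward computation is given below. The thresholds, weights, and the reset constant C are placeholders rather than the values used in our experiments, and a small guard is added only to avoid division by zero in the reciprocal terms.

    def compute_reward(a_road, d_left, d_right, v_t, v_prev, dir_t, dir_prev, reset,
                       T1=0.3, F1=10.0, F2=0.5, T3=0.5, F3=1.0, T4=0.5, F4=1.0, C=100.0):
        """Total reward R_total as the sum of the five components (placeholder constants)."""
        r_area = (a_road - T1) * F1                        # road-area component
        r_sensor = (d_left - d_right) * F2                 # keep-to-the-right component

        denom_v = (abs(v_t - v_prev) - T3) * F3            # speed-stability denominator
        denom_d = (abs(dir_t - dir_prev) - T4) * F4        # heading-stability denominator
        r_velocity = 1.0 / denom_v if denom_v != 0 else 0.0    # guard only; not in the equation
        r_direction = 1.0 / denom_d if denom_d != 0 else 0.0   # guard only; not in the equation

        r_punishment = -C if reset else 0.0                # penalty when a violation resets the env
        return r_area + r_sensor + r_velocity + r_direction + r_punishment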

3.2. Details of the Actor–Critic Network

The designs of the actor–critic networks of the DDPG and RDPG are illustrated in this subsection. In Figure 7, the driving view image is taken as the input of the actor architecture of the DDPG. After performing the procedures of 2D convolution and batch normalization three times, the resulting features are concatenated with the driving speed. Then, the concatenated data are reweighted to output the throttle, steering, and brake information. In Figure 8, the critic architecture of the DDPG is shown, where the driving view image, driving speed, and actor action are taken as the inputs. These are then concatenated to obtain the Q-value.
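The following PyTorch sketch mirrors the spirit of Figures 7 and 8: three convolution plus batch-normalization stages on the driving view image, concatenation with the driving speed, and a three-dimensional action output (or a scalar Q-value for the critic). The channel counts, kernel sizes, and fully connected widths are illustrative, not the exact ones in the figures.

    import torch
    import torch.nn as nn

    class DDPGActor(nn.Module):
        """Driving view image + speed -> (steering, throttle, brake); a sketch of Figure 7."""
        def __init__(self, image_size=224):
            super().__init__()
            self.conv = nn.Sequential(                    # three conv + batch-norm stages
                nn.Conv2d(3, 32, 5, stride=4), nn.BatchNorm2d(32), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=4), nn.BatchNorm2d(64), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
                nn.Flatten(),
            )
            with torch.no_grad():                         # infer the flattened feature size
                n_feat = self.conv(torch.zeros(1, 3, image_size, image_size)).shape[1]
            self.head = nn.Sequential(
                nn.Linear(n_feat + 1, 256), nn.ReLU(),    # +1 for the driving speed
                nn.Linear(256, 3),                        # steering, throttle, brake
            )

        def forward(self, image, speed):                  # image: (N, 3, H, W); speed: (N, 1)
            x = torch.cat([self.conv(image), speed], dim=1)
            out = self.head(x)
            steering = torch.tanh(out[:, 0:1])            # steering in [-1, 1]
            pedals = torch.sigmoid(out[:, 1:3])           # throttle, brake in [0, 1]
            return torch.cat([steering, pedals], dim=1)

    class DDPGCritic(nn.Module):
        """Driving view image + speed + action -> Q-value; a sketch of Figure 8."""
        def __init__(self, image_size=224):
            super().__init__()
            self.conv = DDPGActor(image_size).conv        # same conv stack shape as the actor
            with torch.no_grad():
                n_feat = self.conv(torch.zeros(1, 3, image_size, image_size)).shape[1]
            self.head = nn.Sequential(
                nn.Linear(n_feat + 1 + 3, 256), nn.ReLU(),  # features + speed + 3-dim action
                nn.Linear(256, 1),
            )

        def forward(self, image, speed, action):
            x = torch.cat([self.conv(image), speed, action], dim=1)
            return self.head(x)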
On the other hand, Figure 9 shows the actor architecture of the RDPG. Here, the historical sequence data, i.e., the driving view images and speed information, are taken as the inputs. Then, a Long Short-Term Memory (LSTM) layer is used to incorporate the memory property. In Figure 10, along with the LSTM layer, the historical driving view images, driving speeds, and actor actions are taken as the inputs to obtain the Q-value.
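A corresponding sketch of the recurrent actor in Figure 9 is given below, reusing the convolutional stack from the DDPG sketch above: per-frame features and speeds of a short history are passed through an LSTM, and the action is produced from the final hidden state. Again, the layer sizes are illustrative.

    class RDPGActor(nn.Module):
        """Sequence of (image, speed) pairs -> (steering, throttle, brake); a sketch of Figure 9."""
        def __init__(self, image_size=224, hidden=256):
            super().__init__()
            self.conv = DDPGActor(image_size).conv         # per-frame feature extractor
            with torch.no_grad():
                n_feat = self.conv(torch.zeros(1, 3, image_size, image_size)).shape[1]
            self.lstm = nn.LSTM(n_feat + 1, hidden, batch_first=True)  # memory over the history
            self.head = nn.Linear(hidden, 3)               # steering, throttle, brake

        def forward(self, images, speeds):                 # images: (N, T, 3, H, W); speeds: (N, T, 1)
            n, t = images.shape[:2]
            feats = self.conv(images.reshape(n * t, *images.shape[2:])).reshape(n, t, -1)
            out, _ = self.lstm(torch.cat([feats, speeds], dim=2))
            last = self.head(out[:, -1])                   # act from the final hidden state
            return torch.cat([torch.tanh(last[:, 0:1]),    # steering in [-1, 1]
                              torch.sigmoid(last[:, 1:3])],  # throttle, brake in [0, 1]
                             dim=1)

The critic of Figure 10 can be sketched analogously by appending the action to the LSTM output before the final Q-value layer.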

4. Experimental Results

In this section, we describe our simulation in detail. The Coastline map provided by AirSim was divided into training and testing parts. The training path and two driving scenarios, i.e., the starting point and the destination, are shown in Figure 11. Notice that the training and testing paths are mutually exclusive to provide a fair assessment of the experiment; this is important and significant. The simulation settings are as follows:
  • Inputs: driving view image and driving speed;
  • Due to the memory limitation, the driving view image was resized to 224 × 224 for both the training and testing procedures;
  • For the memory property of the RDPG, the inputs at the previous timestamps were treated as inputs as well;
  • Outputs: accelerator, brake, and steering;
  • The range of the accelerator was set to [0, 1];
  • The range of the brake was set to [0, 1];
  • The range of the steering was set to [−1, 1] from left to right;
  • Hyperparameters (collected in the configuration sketch after this list): learning rate: 0.0001 (actor) and 0.001 (critic); learning rate decay: 0.9; replay buffer: 10,000; replay buffer threshold: 500; batch size: 64; ϵ at the start: 1; ϵ decay: 0.99; minimum ϵ: 0.01;
  • The simulation was executed 1000 times for each DRL method;
  • Arriving at the destination or violating a traffic regulation will reset the environment to the initial state;
  • The specification of the experimental computer: CPU: Intel Core i7-9700 3.00 GHz, RAM: 16 GB DDR4 3200 MHz, and GPU: Nvidia GeForce RTX 2080 8 GB*2 with the Scalable Link Interface (SLI).
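For reference, these hyperparameters can be collected in a single configuration dictionary such as the sketch below; the field names are our own, not those of a specific library.

    CONFIG = {
        "actor_lr": 1e-4,               # actor learning rate
        "critic_lr": 1e-3,              # critic learning rate
        "lr_decay": 0.9,                # learning-rate decay
        "replay_buffer_size": 10_000,
        "replay_buffer_threshold": 500, # replay buffer threshold (see the list above)
        "batch_size": 64,
        "epsilon_start": 1.0,           # initial exploration rate
        "epsilon_decay": 0.99,
        "epsilon_min": 0.01,
        "episodes": 1000,               # simulation runs per DRL method
        "image_size": (224, 224),       # resized driving view image
        "action_ranges": {"accelerator": (0.0, 1.0),
                          "brake": (0.0, 1.0),
                          "steering": (-1.0, 1.0)},
    }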
Figure 12 shows the reward during the training procedure, i.e., the average reward values of the corresponding algorithms over every 100 driving rounds. Note that the reason for considering every 100 driving rounds as a whole was that doing so presents the simulation results more clearly. We can observe that the driving strategy with the RDPG approach had better performance: the reward values accumulated via the RDPG approach were apparently higher in the early phase. Namely, the RDPG had better convergence performance than the DDPG. Moreover, in the training procedure, the driving strategy with the RDPG first reached the destination in the 223rd round, whereas that with the DDPG did so only in the 419th round. Similarly, the RDPG also had a higher arrival rate than the DDPG during the training procedure (Figure 13), especially in the early phase.
Next, for the testing procedure, the trajectories on the testing path of the DDPG and RDPG are presented in Figure 14 and Figure 15, respectively. Obviously, the autonomous driving mission was completed when using the RDPG, but not when using the DDPG. In particular, the autonomous control strategy using the DDPG could not pass the sole U-turn, which did not appear in the training path. The intuition is that since the RDPG has the memory property, which can adopt the inputs at sequential timestamps, it has better adaptability to a scenario that the model has never seen before, i.e., a U-turn. We performed the testing experiments 20 times. The strategy with the RDPG approach possessed an apparently higher average reward than that with the DDPG approach, i.e., 13,183.8819 vs. 7398.3034.
Last but not least, we used the route before the U-turn in the testing path to further discuss the performance of the DDPG and RDPG control strategies, i.e., the vehicle steering and speed information presented in Figure 16, Figure 17, Figure 18 and Figure 19. Note that the reason for choosing this part was that both the DDPG and RDPG control strategies could complete it. As described above, we performed the testing experiments 20 times. Since the vehicle steering and speed information was similar every time, we picked one run for the discussion. The steering data are shown in Figure 16 and Figure 18. We can observe that the DDPG control strategy exhibited more dramatic changes over a short period. Without the memory property, the DDPG makes no preparation for the meandering path. However, the RDPG could think ahead regarding this situation and spread the steering changes over more steps.
On the other hand, we found a similar situation in the speed data (Figure 17 and Figure 19). Due to the memory property, the RDPG kept the speed within a range of only about 10 km/h, whereas the DDPG could not. Thus, the DDPG eventually confronted a situation it could not deal with, in which the vehicle steering and speed had to change dramatically, i.e., the U-turn. As shown in the last parts of Figure 18 and Figure 19, the RDPG maintained its steering angle for a short while and slowed down, and it made the turn safely.

5. Conclusions and Future Work

Since all existing DRL concepts proposed in the literature are only frameworks, we still needed to elaborately perform the design and experiments to realize a viable DRL model based on a certain concept for a specific application. Therefore, we made the aforementioned contribution in this paper, e.g., using the transformation between different color spaces to design the reward-generation architecture, adopting the inputs at sequential timestamps to improve the adaptability to deal properly with a scenario that had not appeared in the training data, and so on. The detailed results were presented in the previous section, along with a comprehensive discussion.
For future work, we intend to design novel methods with the memory property that find the most appropriate information at sequential timestamps to obtain better performance. For instance, if excessive inputs are used, the model would clearly demand a large amount of system memory and may not achieve better performance. This is the tradeoff among the amount of data, the model complexity, the model accuracy, the training efficiency, and so forth.

Author Contributions

Methodology, C.-C.C. and J.T.; software, J.-H.L. and Y.-M.O.; validation, C.-C.C., J.-H.L. and Y.-M.O.; formal analysis, C.-C.C. and Y.-M.O.; writing—original draft preparation, C.-C.C.; writing—review and editing, C.-C.C. and J.T.; funding acquisition, C.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Ministry of Science and Technology, Taiwan, R.O.C., under Grants 109-2221-E-035-067-MY3 and 109-2622-H-035-001.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dionisio-Ortega, S.; Rojas-Perez, L.O.; Martinez-Carranza, J.; Cruz-Vega, I. A Deep Learning Approach towards Autonomous Flight in Forest Environments. In Proceedings of the 2018 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, Mexico, 21–23 February 2018; pp. 139–144.
  2. Maximov, V.; Tabarovsky, O. Survey of Accuracy Improvement Approaches for Tightly Coupled ToA/IMU Personal Indoor Navigation System. In Proceedings of the International Conference on Indoor Positioning and Indoor Navigation, Montbeliard-Belfort, France, 28–31 October 2013.
  3. Chang, C.-C.; Tsai, J.; Lu, P.-C.; Lai, C.-A. Accuracy Improvement of Autonomous Straight Take-off, Flying Forward, and Landing of a Drone with Deep Reinforcement Learning. Int. J. Comput. Intell. Syst. 2020, 13, 914–919.
  4. Home—AirSim. Available online: https://microsoft.github.io/AirSim/ (accessed on 12 November 2021).
  5. Chen, W.; Zhou, S.; Pan, Z.; Zheng, H.; Liu, Y. Mapless Collaborative Navigation for a Multi-Robot System Based on the Deep Reinforcement Learning. Appl. Sci. 2019, 9, 4198.
  6. Feng, S.; Sebastian, B.; Ben-Tzvi, P. A Collision Avoidance Method Based on Deep Reinforcement Learning. Robotics 2021, 10, 73.
  7. Zhu, P.; Dai, W.; Yao, W.; Ma, J.; Zeng, Z.; Lu, H. Multi-Robot Flocking Control Based on Deep Reinforcement Learning. IEEE Access 2020, 8, 150397–150406.
  8. Krishnan, S.; Boroujerdian, B.; Fu, W.; Faust, A.; Reddi, V.J. Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation. Mach. Learn. 2021, 110, 2501–2540.
  9. Shin, S.-Y.; Kang, Y.-W.; Kim, Y.-G. Obstacle Avoidance Drone by Deep Reinforcement Learning and Its Racing with Human Pilot. Appl. Sci. 2019, 9, 5571.
  10. The Most Powerful Real-Time 3D Creation Platform—Unreal Engine. Available online: https://www.unrealengine.com/en-US/ (accessed on 12 November 2021).
  11. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
  12. Martin-Guerrero, J.D.; Lamata, L. Reinforcement Learning and Physics. Appl. Sci. 2021, 11, 8589.
  13. Jembre, Y.Z.; Nugroho, Y.W.; Khan, M.T.R.; Attique, M.; Paul, R.; Shah, S.H.A.; Kim, B. Evaluation of Reinforcement and Deep Learning Algorithms in Controlling Unmanned Aerial Vehicles. Appl. Sci. 2021, 11, 7240.
  14. Deep Reinforcement Learning. Available online: https://julien-vitay.net/deeprl/ (accessed on 12 November 2021).
  15. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2019, arXiv:1509.02971.
  16. Heess, N.; Hunt, J.J.; Lillicrap, T.P.; Silver, D. Memory-based Control with Recurrent Neural Networks. arXiv 2015, arXiv:1512.04455.
  17. Agoston, M.K. Computer Graphics and Geometric Modeling: Implementation and Algorithms; Springer: London, UK, 2005.
  18. Cheng, H.D.; Jiang, X.H.; Sun, Y.; Wang, J. Color Image Segmentation: Advances and Prospects. Pattern Recognit. 2001, 34, 2259–2281.
  19. Chaki, N.; Shaikh, S.H.; Saeed, K. Exploring Image Binarization Techniques; Springer: New Delhi, India, 2014.
  20. Stockman, G.; Shapiro, L.G. Computer Vision; Prentice Hall: Upper Saddle River, NJ, USA, 2001.
Figure 1. The flowchart of the DDPG.
Figure 2. The flowchart of the RDPG.
Figure 3. (a) HSV color model and (b) HSV triangle.
Figure 4. The architecture of our work.
Figure 5. (a) An example of an image including a higher percentage of the road area; (b) an example of an image including a lower percentage of the road area.
Figure 6. An example of an image to obtain the measure of the road area: (a) original image; (b) converted to the HSV color space; (c) binarization; (d) result with the contour line.
Figure 7. Actor architecture of the DDPG.
Figure 8. Critic architecture of the DDPG.
Figure 9. Actor architecture of the RDPG.
Figure 10. Critic architecture of the RDPG.
Figure 11. (a) The starting point on the training path on the Coastline map; (b) the destination on the training path on the Coastline map; (c) the training path on the Coastline map.
Figure 12. The reward comparison between the DDPG and RDPG in the training procedure.
Figure 13. The arrival rate comparison between the DDPG and RDPG in the training procedure.
Figure 14. The trajectory of the testing path of the DDPG.
Figure 15. The trajectory of the testing path of the RDPG.
Figure 16. The steering data of the DDPG.
Figure 17. The speed data of the DDPG.
Figure 18. The steering data of the RDPG.
Figure 19. The speed data of the RDPG.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
