14

electronics
Article
Optimization Control of Adaptive Traffic Signal with Deep
Reinforcement Learning
Kerang Cao 1 , Liwei Wang 1 , Shuo Zhang 1 , Lini Duan 2 , Guimin Jiang 3 , Stefano Sfarra 4 , Hai Zhang 5
and Hoekyung Jung 6, *
1 Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province,

College of Computer Science and Technology, Shenyang University of Chemical Technology,
Shenyang 110142, China; caokerang@syuct.edu.cn (K.C.); victorywlw@163.com (L.W.);
syzs1210@163.com (S.Z.)
2 Big Data Management and Application, Shenyang University of Chemical Technology,
Shenyang 110142, China; liniduan@163.com
3 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China;
22bg18171@stu.hit.edu.cn
4 Department of Industrial and Information Engineering and Economics, University of L’Aquila,
I-67100 L’Aquila, Italy; stefano.sfarra@univaq.it
5 Centre for Composite Materials and Structures (CCMS), Harbin Institute of Technology, Harbin 150001, China;
hai.zhang.1@ulaval.ca
6 Computer Engineering Department, Paichai University, Daejeon 35345, Republic of Korea
* Correspondence: hkjung@pcu.ac.kr
Abstract: The optimization and control of traffic signals is very important for logistics transportation.
It not only improves the operational efficiency and safety of road traffic, but also conforms to the
direction of the intelligent, green, and sustainable development of modern cities. In order to improve
the optimization effect of traffic signal control, this paper proposes a traffic signal optimization
method based on deep reinforcement learning and Simulation of Urban Mobility (SUMO) software
for urban traffic scenarios. The intersection training scenario was established using SUMO micro
traffic simulation software, and the maximum vehicle queue length and vehicle queue time were
selected as performance evaluation indicators. In order to be more relevant to the real environment,
the experiment uses Weibull distribution to simulate vehicle generation. Since deep reinforcement
Citation: Cao, K.; Wang, L.; Zhang, S.;
learning takes into account both perceptual and decision-making capabilities, this study proposes a
Duan, L.; Jiang, G.; Sfarra, S.; Zhang,
traffic signal optimization control model based on the deep reinforcement learning Deep Q Network
H.; Jung, H. Optimization Control of
(DQN) algorithm by considering the realism and complexity of traffic intersections, and first uses
Adaptive Traffic Signal with Deep
Reinforcement Learning. Electronics
the DQN algorithm to train the model in a training scenario. After that, the G-DQN (Grouping-
2024, 13, 198. https://doi.org/ DQN) algorithm is proposed to address the problems that the definition of states in existing studies
10.3390/electronics13010198 cannot accurately represent the traffic states and the slow convergence of neural networks. Finally,
the performance of the G-DQN algorithm model was compared with the original DQN algorithm
Academic Editor: Mohamed Karray
model and Advantage Actor-Critic (A2C) algorithm model. The experimental results show that the
Received: 27 October 2023 improved algorithm increased the main indicators in all aspects.
Revised: 6 December 2023
Accepted: 6 December 2023 Keywords: logistics transportation; traffic signal optimization control; intelligent optimization; deep
Published: 2 January 2024 reinforcement learning; DQN
Copyright: © 2024 by the authors.

1. Introduction
Licensee MDPI, Basel, Switzerland.
This article is an open access article Traffic signal optimization control is a method of the intelligent management of traffic
distributed under the terms and facilities such as intersections and road sections through technological means, in order
conditions of the Creative Commons to improve the operational efficiency, safety, and comfort of road traffic. It can improve
Attribution (CC BY) license (https:// logistics efficiency, reduce costs, and improve delivery reliability, while also being in line
creativecommons.org/licenses/by/ with the trend of green environmental protection and information technology development;
4.0/). in addition, it has a positive role in promoting the sustainable development of the logistics
Electronics 2024, 13, 198. https://doi.org/10.3390/electronics13010198 https://www.mdpi.com/journal/electronics

Electronics 2024, 13, 198 2 of 20
industry [1]. The proposal of the concept of intelligent transportation can effectively
enhance the capacity of road transportation. Smart transportation combines technologies
such as the Internet of Things, cloud computing, big data, and artificial intelligence to gather
traffic information through high technology and support the control of traffic management,
transportation, and other transportation fields in order to fully guarantee traffic safety, bring
into play the effectiveness of transportation infrastructure, and improve the management
of transportation systems for smooth public travel and economic development [2]. Among
them, intelligent optimization of traffic signal control is an important application. Most
traditional traffic signal optimization control systems use traditional fixed timing schemes,
which cannot be adjusted according to the real-time status of intersections and are prone
to traffic congestion if the traffic flow is too high. Therefore, many scholars have started
to study Adaptive Traffic Signal Control (ATSC), such as developing new adaptive signal
optimization control systems that set different signals according to different geographical
areas [3]; using pressurized routing algorithms in the field of communication to effectively
reduce traffic congestion and decrease the average travel time of vehicles on the road [4];
introducing greedy ideas and modeling two-dimensional automata matrices for vehicles so
that they can explore rerouting paths in congested areas [5]; and setting up new traffic light
schedules that take holiday factors into account to model the behavior of traffic lights [6].
However, all these ATSC systems rely on manually designed traffic signal schemes, which
have many limitations.
With the development of artificial intelligence and IoT technologies, intelligent traffic
light signal optimization control methods are becoming more and more sophisticated. Com-
pared with traditional traffic signal optimization control methods, intelligent traffic signal
optimization control methods are able to adjust the strategy according to real-time road
conditions. Since the traffic signal optimization control problem is a sequential decision
problem, reinforcement learning is well suited to solve the problem. Noaeen et al. [7]
summarized the applications of reinforcement learning in various areas of traffic signal
optimization control over the last 26 years, exploring all the applied methods and defin-
ing the main first events in the scope, and finally giving suggestions for development.
KuangLi et al. [8] simplified the representation of the states using a generated Q-table, and
proposed a bi-objective reward function to enable an intelligent body to quickly learn the
optimal signal timing strategy. Müller et al. [9] argued that the deployment of reinforcement
learning for traffic signal optimization control lacks safety in realistic intersections, so they
designed a safe action space and action masking mechanism to ensure safety in practi-
cal applications, driving the application of reinforcement learning in realistic scenarios.
Subsequently, Zhao et al. [10] argued that existing reinforcement learning-based traffic
signal optimization control progressed slowly due to poorly designed states and rewards in
reinforcement learning, so they introduced the concept of strength, which ensured that their
states and rewards reflected vehicle states more accurately. Genders et al. [11] compared
two reinforcement learning adaptive traffic signal controllers with the Webster controller
to demonstrate the superiority of reinforcement learning, by developing a new action
selection strategy. Zhang et al. [12] combine reinforcement learning with vehicle wireless
technology. Vehicle wireless technology uses vehicle-to-infrastructure (V2I) wireless com-
munication to detect vehicles, but traditional V2I intelligence detects vehicles equipped
with wireless communication capabilities, and reinforcement learning gives help to this
problem. Boukerche et al. [13], to address the problem that existing methods ignore the
impact of transmission delay on the system exchanging traffic flow information, proposed
a traffic state detection method, and proved to solve the data transmission delay problem
by an experimental comparison. Alegre et al. [14] proposed the TOS(λ)-FB algorithm
and proved its efficiency by combining the Fourier basis function and the reinforcement
learning SARSA(λ) algorithm in order to solve the dimensional explosion problem due to
the large state space. Wang et al. [15] investigated multi-intelligent reinforcement learning
for large-scale traffic signal optimization control problems. A multi-intelligent reinforce-
ment learning algorithm called Co-DQL was proposed with improved local state sharing
Electronics 2024, 13, 198 3 of 20
methods and reward allocation mechanisms to improve the robustness of the learning
process. Antes et al. [16] optimized the problem of passing information between intelli-
gences in a multi-intersection traffic environment. They used a hierarchical approach to
increase the basis for action selection of the intelligences and proposed an information-up,
recommendation-down framework to organize reinforcement learning. Experiments show
that this approach outperforms the fixed timing approach and the reinforcement learning
approach without layering. However, when the amount of data increases or the state space
is high-dimensional, the computational and learning capabilities of the above-mentioned
reinforcement learning methods without using neural networks are severely limited.
With the rapid development of deep learning, the application of deep reinforcement
learning is becoming more and more widespread [17], which can be a good solution to
the shortcomings of reinforcement learning. Park et al. [18] developed a traffic signal
optimization control model for a single intersection and two coordinated intersections,
and used DQN and Synchro 6.0 software optimized fixed-time timing for comparison to
demonstrate the effectiveness of deep reinforcement learning algorithms. Zhao et al. [19]
used a combination of the convolutional neural network (CNN) and Recurrent Neural
Network (RNN) with an additional layer of Long and Short-Term Memory neural net-
work (LSTM) in the outer layer of CNN by proposing a new Deep Recurrent Q-Learning
(DRQN) algorithm able to define the reinforcement learning states as matrices to better
characterize the environment. Wan et al. [20], on the other hand, used a CNN model to
compare with a DNN model experiments and modeled traffic intersections using VISSIM
simulation software. The experiments proved that the CNN model has a better ability to
extract features than the DNN model. Zou et al. [21] combined Bayesian methods with
reinforcement learning to solve the overfitting problem of neural networks and to speed
up the training of neural network models. Chu et al. [22] used for the first time a three-
dimensional reinforcement learning state; the data were taken using photos by intersection
cameras and vehicles were visualized using Sumo-web3d 3D visualization tool. To better
extract data features, experiments were conducted using three algorithms and two combi-
nations of neural network model permutations, and the results showed that DQN + CNN
worked best. Muresan et al. [23] used the idea of migration learning to apply the model
training method for a single intersection to multiple intersections and simulated the real
scenario using SUMO simulation software. Alhassan et al. [24] used the DQN algorithm
to perform traffic optimization on four different SUMO-simulated streets and re-planned
the streets. Bouktif et al. [25] combined a dual deep Q-network (DDQN) with a priority
experience replay (PER) mechanism and proposed a new method for defining states and
rewards. They adopted this method to optimize the agent’s strategy for selecting actions,
allowing the agent to select the best action faster. Su et al. [26] believed that the scalability
of existing optimization algorithms for signal light control was bad, and the current deep
reinforcement learning algorithms had the problem of overestimating the Q-value. There-
fore, they reduced the dimensionality of the action space and proposed a dynamic delayed
update method based on Exponential Weighted Moving Average (EWMA) to optimize the
algorithm and improve the performance of the model. Due to the increasingly widespread
application of neural networks, they are being used in more than just transportation [27,28].
This proves the widespread application of neural networks in various industries.
In response to the above findings, it was found that the original DQN algorithm using
DNN to represent the Q function was not as effective as CNN and RNN, and CNN was the
most effective. Many studies overlook the detailed description of the traffic environment,
which can affect the convergence speed of neural networks. This also leads to the inability
of ordinary structured CNN to adapt well to complex traffic situations. This article starts
with the design of reinforcement learning states, then improvement of the CNN structure,
and adjustment of action selection strategies. Thus, the main contributions of this paper,
based on the previous academic studies, are the following:
Electronics 2024, 13, 198 4 of 20
(1) By improving the action selection strategy of deep reinforcement learning, the model
can be trained faster, and the number of vehicle queues can be effectively reduced by
experimental verification.
(2) Based on the DQN algorithm, the G-DQN algorithm is proposed. The algorithm
redefines the state and reward in the three elements of reinforcement learning, and
improves the original CNN network structure into a dual-channel CNN, so that
it can reflect the state of the traffic intersection more accurately and improve the
performance of the model, and its optimization effect is proved to be better than the
DQN algorithm, the A2C algorithm and the fixed signal timing strategy through
comparison experiments.
(3) The experiments are modeled and simulated by SUMO simulation software, and the
Weibull distribution is used to simulate the traffic flow in the real environment.
The composition of the remaining part of the paper can be summarized as follows.
Section 2 provides a detailed introduction of the basic tools and methods used in the study.
Section 3 explains the construction and implementation of the G-DQN model. Section 4
explains the method of data generation and the way of conducting the experiments to
demonstrate the universality and superiority of the proposed G-DQN in two optimization
objectives by comparing it with the original DQN algorithm, A2C algorithm, and current
fixed signal timing methods. Finally, in Section 5, the research conclusions and innovative
directions for future works are presented.
2. Related Work
This section introduces SUMO 1.14.1 simulation software and deep reinforcement
learning methods, providing theoretical support for the following experiments.
2.1. SUMO Traffic Simulation Software

SUMO (Simulation of Urban Mobility) is an open source, microscopic, multi-modal
traffic simulation software. Its microcosm is reflected in the fact that each vehicle is explicitly
modeled with its own individual route of movement in the road network structure. It
allows to simulate how a complex traffic flow consisting of many individual vehicles moves
through a given road network structure after a given traffic demand. The simulation
software solves a large number of traffic management problems by simulating vehicle
movements for different vehicle types at spatially continuous and temporally discrete
times, and by managing road network structures of up to 20 lanes in both directions,
supporting lane changes, multi-lane streets, and different access rules. SUMO has a fast
OpenGL graphical user interface that enables the user to design each element of each
roadway at an intersection, including the number of roadway strips, directions, and the
number of lanes, the location of traffic signals and the phase sequence and duration of
traffic signals. In addition, the package Traci in SUMO can easily enable it to communicate
with python, which facilitates the simulation. By calling Traci, various data from the traffic
simulation, such as vehicle waiting time, number of vehicles on the road, and vehicle speed,
can be obtained in real time. From this, the real-time traffic status of the intersection can be
obtained and the optimization control strategy of the signal can be adjusted [29].
In SUMO, a simulation requires several configurations: a road network file (.net.xml),
a traffic flow file (.rou.xml), an additional file (.add.xml), and an executable file (.sumocfg).
Among them, road network files, traffic flow files, and executable files are indispensable.
The working principle of SUMO simulation software is as follows:
(1) Create a road network file. A road network file is used to generate the scenarios
required for the experiment, including various parameters of the intersection, such as
the number of lanes, lane distribution, lane length and transition rules. It has three
ways to generate these. The first is to write the code manually. The second is to design
through SUMO’s own netedit road network editor. The Netedit road network editor
can be very convenient for users to directly draw any shape of road network in the
editor. The signal light control phase can also be generated by the editor. The third is
Electronics 2024, 13, 198 5 of 20
to download the real-world road network through OpenStreetMap to generate the

osm file, and then convert it into a road network file.
(2) Create a traffic flow file. A traffic flow file is used to generate vehicles and their
attributes, including various parameters of vehicles, such as vehicle ID, vehicle type,
vehicle color, and lanes entered and left. The traffic file is intelligently generated by
manually writing code.
(3) Create an executable file. An executable file should be configured with the file names
of the two files mentioned above, so that the road network and traffic flow can be
combined for simulation.
(4) Start simulation. Open sumo-gui in the bin directory and open the configured exe-
cutable file to start the simulation.
2.2. Deep Reinforcement Learning

Deep reinforcement learning (DRL) has become one of the hot spots in artificial
intelligence research. DRL comes from deep learning that has been applied in the recent
past in several research fields [30–34]. DRL has also been widely used in various fields such
as recommender systems, robot control, end-to-end control, and natural language dialogue
systems. It integrates the perceptual ability of deep learning and the decision-making
ability of reinforcement learning to first explore and optimize the system objectives through
reinforcement learning, and then, second, to solve the optimization control strategy problem
through deep learning algorithms. This is with the aim to understand the dimensional
explosion problem caused by the lack of storage space of reinforcement learning algorithms,
so as to optimize the objective function of the system and seek the optimal value of
the system.
As shown in Figure 1, in the process of reinforcement learning, the agent and the
environment are constantly interacting, which is a sequential decision problem. Generally,
reinforcement learning problems are regarded as Markov decision processes (MDPs). MDPs
are a mathematical model for sequential decision making, which include a set of interactive
objects such as environment and agent. In addition, they also includes five model elements:
state, action, reward, strategy, and value function. Based on these elements, MDPs calculate
the optimal strategy, which is how to select the best action in each state to maximize the
expected cumulative reward [35]. At each moment the intelligence receives a state from
the environment, and based on such a state, the intelligence makes a corresponding action,
which then acts on the environment and gives the intelligence a reward and the next state,
and the intelligence selects the action for the next state based on the effect of the reward on
the strategy. By continuously cycling the above process, the optimal strategy to achieve the
goal is finally obtained. The value function is divided into a state value function
Electronics 2024, 12, x FOR PEER REVIEW 6 of 22 and an
action value function, which evaluate the state and action, respectively, and their results
are the expectation of cumulative rewards.
Intensive
Figure1.1.Intensive
Figure learning
learning process.
process.
The neural network in deep learning has strong feature extraction capability, which
can process features more easily and no longer requires manual feature extraction. Espe-
cially in large-capacity data processing, the effect of deep learning algorithms has obvious
advantages over traditional artificial intelligence algorithms.
Electronics 2024, 13, 198 6 of 20
The neural network in deep learning has strong feature extraction capability, which
can process features more easily and no longer requires manual feature extraction. Espe-
cially in large-capacity data processing, the effect of deep learning algorithms has obvious
advantages over traditional artificial intelligence algorithms.
This study applies a convolutional neural network, as shown in Figure 2. The learning
process of deep reinforcement learning can be described as follows:
(a) The established single-intersection training scenario is implanted into SUMO sim-
ulation software to form SUMO’s intersection environment and we use Weibull
distribution to simulate vehicle generation and generate traffic flow data.
(b) The traffic flow data are processed using the functions under the vehicle module in
the Traci interface to calculate the state and reward, the Q-value is calculated from the
reward, and the state and Q-value are passed into the convolutional neural network
for training.
(c) The convolutional neural network will use the output generated by each turn pre-
diction, i.e., action, for decision making until the specified training turn is reached,
and it will feed back to the SUMO intersection environment to change the phase of
the signal in SUMO. In addition, the queue length and queue time of vehicles in the
traffic system for each round will be counted.
(d) When the convolutional neural network completes the prescribed training rounds, 7 of 22
Electronics 2024, 12, x FOR PEER REVIEW
the queue length and queue time of vehicles in the traffic system for each round are
compared to verify the performance improvement in the improved algorithm.
2. Technology
Figure 2.
Figure Technologyroadmap.
roadmap.
Since the value function-based reinforcement learning algorithm is more suitable for
Since the value function-based reinforcement learning algorithm is more suitable for
dealing with continuous state space and discrete action space problems, this study uses
dealing
the DQNwith continuous
algorithm stateforspace
as the basis and discrete
improvement, action
so as to space
perform problems,
intelligent this study uses
optimization
the DQN algorithm
of regional as the
traffic signal basis for improvement,
optimization so as totable
control. The naming perform intelligent
of variables optimization
is shown
of regional
in Table 1. traffic signal optimization control. The naming table of variables is shown in
Table 1.
Table 1. Nomenclature table.
Parameter Name Parameter

State s
Action a
Reward r
Time step t
Action selection strategy π
Electronics 2024, 13, 198 7 of 20
Table 1. Nomenclature table.
Parameter Name Parameter

State s
Action a
Reward r
Time step t
Action selection strategy π
Neural network parameters θ
Experience pool size M
Learning rate α
Discount factor γ
Target network update period C
The DQN algorithm is an improved algorithm based on the q-learning algorithm of

reinforcement learning, which uses neural networks in deep learning to record the action
values in each state. The input of the network is the state information and the output is the
value of each action to optimize the action decision in the next state. the DQN algorithm
contains a Q-value function and a V-value function, and the algorithm uses the Q-value
function to evaluate the goodness of the current state it is in. The Q-value function is the
value of taking an action at with a state shift probability of P in the current state st under
a specific strategy π, and the V-value function is the average value in the current state st .
The Q-value function is calculated as Equation (1) and the V-value function is calculated as
Equation (2) as follows [36]:
Qπ (st , at ) = ∑ Psatt→st+1 [r(st , at ) + γV (st+1 )] (1)

s t +1
Vπ (st ) = ∑ π (st , at ) ∑ Psatt→st+1 [r(st , at ) + γV (st+1 )] (2)
at s t +1
where Qπ (st ,at ) is the Q-valued function, and Vπ (st ) is the V-valued function; γ represents
the degree to which the agent is concerned about future benefits, and a larger value indicates
that the agent is more concerned about future benefits.
The DQN algorithm adds experience replay and objective network mechanisms [31].
The experience replay mechanism can store the experience (consisting of five parameters:
current state st , action at , reward r, next-turn state st + 1 , and whether to reach the scheduled
turn True/False) in the experience pool, and then each time the input of the neural network
is randomly sampled from the experience pool, which can break the correlation between
the data and make the data satisfy independent identical distribution, thus reducing the
parameter update. This can break the correlation between the data and make the data
meet independent homogeneous distribution, thus reducing the variance in the parameter
update and improving the convergence speed of the network. Meanwhile, the target of
the network mechanism is to build a network with the same structure on the basis of
the original network, but the time to update the parameter θ is different. θ is the weight
parameter of the convolutional neural network model. A continuous function Qθ (s,a)
with parameter θ is established in the DQN algorithm to approximate Qπ (s,a), and Q(st ,at )
is calculated from the current state st , action at as the output result of the network r +
Q(st+1 ,at+1 ) is the target value, and the iterative update Formula (3) is as follows:
Q(st+1 , at+1 ) ← Q(s, a) + α[r + γmaxQ(st+1 , at+1 ) − Q(s, a)] (3)

a t +1
Electronics 2024, 13, 198 8 of 20
Since the parameters of the original network are updated in real time during the
training process, while the target network parameters and the target value r + Q(st+1 ,at+1 )
are constant, the introduction of the target network increases the stability of learning.
These two mechanisms are a good solution to the problems of neural networks that cannot
guarantee convergence, training instability and training difficulties.
In addition, the DQN algorithm also has a loss function to calculate the error between
the predicted value of the intelligent agent and the actual value. The definition of loss
function L is shown in Formula (4):
L = [r + γmaxQ(st+1 , at+1 ; θ‘) − Q(s, a; θ )]2 (4)
The loss function can help the agent learn more accurate Q values, promote their
exploratory behavior, and avoid the overfitting of neural networks.
3. Traffic Signal Optimization Control Model

In the reinforcement learning framework, the intelligent body is always interacting
with the environment. The intelligent body contains three elements, i.e., state, action and
reward. The definition of the elements directly affects the performance of the algorithm,
and how to define the elements is the key to build the traffic light optimization control
model. The following is the definition of the environment, the three elements of rein-
forcement learning, the learning strategy and the convolutional neural network in the
G-DQN algorithm.
3.1. Environment
The scenario used in this study is shown in Figure 3, which is an intersection with eight
lanes in both directions. The traffic flow file is generated using a Weibull distribution, where
each generated car is assigned a random origin and destination. The Weibull distribution
is a continuous probability distribution function with an overall increase followed9 by
Electronics 2024, 12, x FOR PEER REVIEW a
of 22
decrease, which fits well with the pattern of high and low traffic peaks. The environment
for the scenario is defined as follows:
Figure 3.
Figure 3. SUMO
SUMO interface.
interface.
Lanes: The
Lanes: Thelanes
lanesof
ofananintersection
intersectionare
aredivided
dividedinto
intoanan approaching
approaching lane
lane and
and an an exit-
exiting
lane. The The
ing lane. lanelane
in which the vehicle
in which is approaching
the vehicle the intersection
is approaching is theisentry
the intersection lane,lane,
the entry and
the
andlane
the in which
lane the vehicle
in which is exiting
the vehicle the intersection
is exiting is the is
the intersection exit
thelane,
exit with
lane,four
withexit
fourlanes
exit
and
lanesfour
andentry
four lanes
entryinlanes
the same
in thedirection. The four
same direction. entry
The fourlanes
entryconsist
lanes of two straight
consist of two
straight lanes, one left-turn lane and one right-turn lane, and the vehicle movements are
randomly assigned to straight, left-turn and right-turn movements.
Traffic signals: In the traffic movement, each approach lane to the corresponding exit
lane has a corresponding traffic signal to direct the vehicle movement. Among them, the
red light represents the prohibition of vehicle movement, the green light represents the
Electronics 2024, 13, 198 9 of 20
lanes, one left-turn lane and one right-turn lane, and the vehicle movements are randomly
assigned to straight, left-turn and right-turn movements.
Traffic signals: In the traffic movement, each approach lane to the corresponding exit
lane has a corresponding traffic signal to direct the vehicle movement. Among them, the
red light represents the prohibition of vehicle movement, the green light represents the
permission of vehicle movement, and the yellow light represents the buffer between the
red and green lights, and the vehicles that have crossed the stop line can continue to pass.
In general, right turns are not subject to traffic signals.
Phase: Phase represents the set of all non-conflicting signal combinations at an in-
tersection, each representing the state of passage at the intersection at that moment. For
example, the east–west direction allows straight ahead and right turn, but not left turn; at
this time the traffic signal for the east–west straight ahead lane is green, while the signal for
the left turn lane is red.
Phase sequence: Phase sequence refers to the combination of several phases in the
phase set, which is used to control the sequence of phase changes and is the strategy
adopted by the intelligence after learning, and is also the main optimization goal.
The problem of traffic signal optimization control can be defined as the intelligent
body as an intersection, the state s of the intersection, the action a selected by the intelligent
body from a predefined set of actions, the reward r derived from the reward function,
the time step t, the policy π, and the value function Qπ (s,a) under that policy. First, the
intelligence perceives the environment and receives the current state st . Then, the next
action at is selected based on the current state st and previous experience, and the selected
action yields the reward rt and the value function Qπ (st ,at ) for this time. The value of
this action is then evaluated by the reward rt and the value function Qπ (st ,at ) for this
time. Finally, these experiences are used to train the convolutional neural network, which
updates itself by continuously receiving states and rewards from the environment and
can dynamically select the phase sequence of traffic signals by learning from historical
experiences. However, the traffic signals must follow the corresponding rules of operation,
and the duration of the phases in this study model is fixed and can be adjusted according
to the experimental effects. Within the same time step, there must be lanes in the phase
that are in the green or yellow light state, and there cannot be a state where the entire
intersection is red.
The design process of the simulation environment is as follows:
(1) Construct a bidirectional eight lane intersection environment in netedit of SUMO
simulation software.
(2) Use Python to write classes that generate vehicles. This function uses the Weibull
distribution function to generate traffic flow files. It ensures that the vehicles in the
environment are randomly generated.
(3) Write simulation classes through the Traci interface in Python. Traci is an interface
between SUMO and Python, and its built-in functions can help Python set or obtain
information in SUMO. For example, traci.start() can call SUMO to start simulation;
traci.vehicle.getIDList() can be used to obtain the vehicle ID; traci.vehicle.getRoadID()
can be used to obtain the road ID; traci.vehicle.getSpeed() can be used to obtain the
vehicle speed; traci.vehicle.getAccumulatedWaitingTime() can be used to obtain the
cumulative waiting time of the vehicle; traci.trafficlight.setPhase() sets the phase of
the signal light; traci.vehicle.getPosition() can be used to obtain vehicle position; etc.
We use these functions to control and obtain information from simulation software.
(4) Import the simulation environment code into the deep reinforcement learning algo-
rithm and connect it with the neural network model framework for experiments.
3.2. State
The state in deep reinforcement learning is used to describe the current characteristics
of the environment, which is generally used as the input of the neural network to provide
the basis for the decision of the intelligence. In related studies, the state of an intersection is
Electronics 2024, 13, 198 10 of 20
usually represented by a matrix as input to a convolutional neural network in order to better

represent the features of the intersection and to minimize the training time of the neural
network. First is the vehicle location matrix: since the congestion level of the intersection
depends mainly on the approach lanes, the four approach lanes in each direction of the
intersection are divided into 10 cells of the same size; i.e., a 16 × 10 matrix can be generated.
If a vehicle exists in the cell, the state is set to 1; otherwise, it is 0. Secondly, the speed
matrix: the speed matrix is also a 16 × 10 matrix, and the data in the speed matrix are
the average speed of all vehicles in the cell. In addition, to facilitate convolutional neural
network training, the speed matrix is normalized and all the data in the matrix are divided
by the road speed limit.
It is indeed a simple and efficient way to model the intersection by discretizing the
approaching lanes into cells, but for the actual road conditions, the closer the vehicles are
to the intersection, the greater the impact on the current traffic environment. Therefore,
in order to more accurately reflect the state of the intersection, this paper introduces the
idea of cluster series. The cluster series originates from the Warring States “Sun Bin Art of
War—Ten Array”, which is a series of numbers {an } grouped according to a certain regular
geometric pattern. The common arrangements of numbers are from small to large order,
an “S” arrangement, the Yang Hui triangle arrangement, and other forms. The common
grouping forms are triangle, rectangle, trapezoid, sawtooth, a long snake formation form,
etc. For example, given a series {an }: a1 , a2 , a3 , a4 . . . in accordance with the rules of the
equal difference series tolerance of one grouping order, then you can obtain the sequence of
the group as a unit: (a1 ), (a2 , a3 ), (a4 , a5 , a6 ) . . . In order to reflect the impact of the vehicle
distance from the intersection on the road conditions, it will drive into the lane. The 10
cells {a10 } delineated (in order from near to far from the intersection) are divided into four
groups, namely (a1 ), (a2 , a3 ), (a4 , a5 , a6 ), (a7 , a8 , a9 , a10 ); for the reasonableness of weight
assignment and convenience of neural network training, the normalized matrix should be
kept as normalized as possible, so the values in it are assigned weights 0.4, 0.3, 0.2, and 0.1,
respectively, indicating the degree of influence on road conditions.
3.3. Action
Action in deep reinforcement learning is the interaction between the intelligence and
the environment, determined by the current state and the reward of the feedback. In order
to allow the traffic to pass through the intersection as soon as possible and to improve the
efficiency of vehicular traffic at the intersection, the actions defined in this paper caused
the signal of a set of lanes to turn green in one cycle. All possible actions are defined as a
set of phases:
A = { EW, EW L, SN, SNL} (5)
where EW is the east–west-direction-driving vehicles that can go straight and turn right;
EWL is the east–west-direction-driving vehicles that can only turn left; SN is the north–
south-direction-driving vehicles that can go straight and turn right; and SNL is the north–
south-direction-driving vehicles that can only turn left. Since the phase sequence is the
main optimization target, in this study we set the duration of green and yellow lights as
fixed values, which were set to 12 s and 4 s, respectively, and the remaining time was the
duration of red lights. When the intelligent body selected the action, if the two actions
before and after were consistent, it would continue to maintain the original signal light; if
not, it would add a yellow light to buffer between the switching of red and green lights.
3.4. Reward
The goal of reinforcement learning intelligence is to maximize the total reward that
can be obtained when the learning is completed, so the setting of the reward function
is particularly important, which is the learning direction of reinforcement learning. In
previous studies, most models simply designed the reward function as the change in
cumulative waiting time between adjacent cycles [37]. This approach has appeared in many
previous studies, and using only a single cumulative waiting time change cannot accurately
Electronics 2024, 13, 198 11 of 20
describe the traffic environment in the environment. The variables in the reward function
need to be closely related to the optimization objective of the model, so the reward function
was defined in this paper as a weighted linear combination of the change in cumulative
vehicle waiting time T and the cumulative duration of the yellow light Y as the intersection
road conditions deteriorate. Both indicators reflect the effect of the model to improve traffic
after learning. The value of T is the waiting time of all vehicles in the previous time step
minus the waiting time of all vehicles in the current time step, which is negative when
the road conditions become worse. The Y value is the duration when the signal light
switches to yellow. Due to the high traffic flow at the intersection studied in this article,
the longer duration of the traffic lights can improve the passing rate of vehicles. Frequent
phase switching can easily lead to vehicle congestion. Finally, the reward function can be
expressed as follows:
R = k1 T + k2 Y k1 + k2 = 1, k2 < k1 (6)
where k1 and k2 are the weights of T and Y, respectively, and the size of k1 and k2 is
constrained because of the large influence of T on the optimization effect.
3.5. Action Selection Strategy

In order to prevent the neural network training from always training in the direction
of the maximum value function, i.e., the intelligent body always uses the action that
maximizes the reward function when making action decisions, which leads to overfitting
the neural network, in this paper we used ε-greedy as the action selection strategy. The
idea is that the intelligence selects the action with the greatest known reward with the
probability of 1 − ε at each decision, called exploitation, and randomly selects an action
from the set of actions with probability of ε, called exploration:
(
argmaxQ(s, a), 1 − ε
a= a (7)
random( A), ε
The purpose of this strategy is to prevent the neural network from overfitting caused by
training in the direction of maximum Q value all the time after it starts training, which can
enable the intelligence to choose the appropriate learning strategy in a limited number of
rounds and achieve a balance between exploitation and exploration to obtain the maximum
cumulative reward. Although the strategy does play a certain optimization role for the
neural network, the direction of action selection is basically fixed after the neural network
is trained for many rounds, and too high a ε value will affect the convergence speed of
the model. Therefore, in this paper, ε was dynamically adjusted during the training of
the neural network, and was set to 1 at the beginning of the training, and then decreased
according to the number of training rounds. This avoided the overfitting of the neural
network and also sped up the convergence of the model.
3.6. Convolutional Neural Network

Since the deep reinforcement learning state of this study extracted two feature matrices,
a dual-channel CNN design was adopted in order for the neural network to obtain more
information about the traffic intersection. The dual-channel CNN obtained double the
local features by different inputs, which could describe the current traffic state more
accurately compared with the single-channel CNN. The model structure is shown in
Figure 4. Firstly, the model contained two parallel networks with input position and speed
matrices, respectively, which underwent feature extraction by two convolutional layers;
then, the Flatten layer flattened the refined features; the concatenate layer was responsible
for uniting the features; and finally the fully connected layer was added to integrate and
extract the data.
Electronics 2024,13,
Electronics2024, 12,198
x FOR PEER REVIEW 12 of
13 of 20
22
Figure4.4.Convolutional
Figure Convolutionalneural
neuralnetwork
networkstructure.
structure.
The
Thenetwork
networkuses
usesrelu
reluactivation
activationfunction
functionfor
forall
allconvolutional
convolutionalandandfully
fullyconnected
connected
layers except the last fully connected layer, which uses liner activation function
layers except the last fully connected layer, which uses liner activation function to to output a
output
traffic
a trafficlight signal
light that
signal matches
that matchesthethe
action space.
action space.
The
TheG-DQN
G-DQNalgorithm
algorithmprocess
processisisshown
shownininAlgorithm
Algorithm1:1:
Algorithm
Algorithm 11StepsStepsofofG-DQN
G-DQN algorithm.
algorithm.
Input
Input
Initialize thenumber
Initialize the number ofof training
training rounds
rounds episodes,
episodes, the number
the number of training
of training steps persteps per
round
Initialize
round the optimization network Q and the parameter θ. The target network Q’= Q and the
parameter θ’
Initialize the optimization network Q and the parameter θ. The target network Q’= Q
Initialization batch, experience pool size M, learning rate α, discount factor γ, learning strategy
and the parameter θ’
probability ε, target network update period C
Initialization
Start batch, experience pool size M, learning rate α, discount factor γ, learning
strategy
For each probability
episode: ε, target network update period C
Initialize the environment and divide the lanes into groups to obtain the state s
Start
ForFor each
each step of episode:
episode:
Select an action a randomly with probability ε or with probability 1 − ε a = argmaxQ(s, a),
Initialize the environment and divide the lanes into groups to obtain the state as
ε = 1 − (episode/episodes)
For each step of episode:
Execute action a to get new state s‘ and reward r
Select an action a randomly with probability ε or with probability 1 − ε a=
Store (s,a,r,s‘) in the experience pool
𝒂𝒓𝒈 𝐦𝐚𝐱 Get𝑸the 𝒔,batch
𝒂 , ε group(s,a,r,s‘)
= 1 − (episode/episodes)
from the experience pool
𝒂
target r + maxaQto
y = action
Execute (s’,get
a) new state s` and reward r
a
Update the parameter
Store (s,a,r,s`) of the Q network
in theθexperience pool so that Q(s,a) approximates the target y
Update state s = s’, step += 1
Get the batch group(s,a,r,s`) from the experience pool
When step is divisible by period C, update the target network Q’ = Q
target y = 𝐫 + 𝐦𝐚𝐱 𝑸 𝒔’, 𝒂
End for 𝒂
End forUpdate the parameter θ of the Q network so that Q(s,a) approximates the target y
Update state s = s’, step += 1
When step is divisible by period C, update the target network Q’ = Q
Electronics 2024, 13, 198 13 of 20
4. Results
This section is the experimental part of the article, which first introduces the experimen-
tal environment and explains the method of generating data. Then, the experimental results
are explained in detail, and a comprehensive comparison is made between the G-DQN
algorithm and the other three methods to prove the effectiveness of the G-DQN algorithm.
4.1. Experimental Environment

The hardware environment in this paper is 11th Gen Intel(R) Core(TM) i5-11300H @
3.10GHz, the simulation environment is SUMO 1.14.1, the deep learning framework used
is Keras, and the experiments are run in python 3.8.
The hyperparameter settings are shown in the following Table 2.
Table 2. Hyperparameters.
Hyperparameter Value
Number of training rounds episodes 30
Number of training steps per round steps 5400
Batch size 20
Experience pool size M 50,000
Learning Rate α 0.001
Discount Factor γ 0.75
Initial learning strategy probability ε 1.0
Target network update cycle C 500
Green light duration 12
Yellow light duration 4
k1 0.8
k2 0.2
4.2. Experimental Data

The two parameters of the Weibull distribution, shape parameters and scale parame-
ters, both have intuitive physical meanings. The shape parameter determines the shape of
the distribution, while the scale parameter determines the scale of the distribution. This
makes the analysis results easier to interpret and apply to practical problems. As shown
in Figure 5, when the shape parameter of the Weibull distribution is 2, the Weibull distri-
bution coincides with the peak hour traffic flow, showing a trend of increasing first and
then decreasing [38]. Therefore, in order to make the experiment more realistic, we used
the Weibull distribution in simulating vehicle generation. The Weibull distribution is a
continuous probability distribution with probability density of [39]:
(
k x k −1 −( λx )k
f ( x; λ, k ) = λ(λ) e x≥0 (8)
0 x<0
where x is the random variable, λ > 0 is the scale parameter, and k > 0 is the shape parameter.
When k = 1, it is an exponential distribution; when k = 2, it is a Rayleigh distribution. The
number of vehicles during the peak traffic hours showed a trend of rising and then falling,
with the number of vehicles rising steadily at the beginning and gradually entering the
peak hours; after the peak, the number of vehicles gradually decreased. Our experiments
used the Traci interface in SUMO to associate SUMO with python, specifically defining a
class for generating vehicles; the class initialized the total number of vehicles and the total
number of steps in the experiment, and the class defined a function for generating vehicles,
the function’s function would randomly generate 1000 vehicles as the experimental data
source; and each vehicle was set with a random initial place and destination.
defining a class for generating vehicles; the class initialized the total numb
and the total number of steps in the experiment, and the class defined a fun
erating vehicles, the function’s function would randomly generate 1000 ve
experimental data source; and each vehicle was set with a random initial pla
Electronics 2024, 13, 198 14 of 20
nation.
Figure 5. Weibull distribution for shape parameter of 2.

Figure 5. Weibull distribution for shape parameter of 2.
4.3. Experimental Results
4.3. Experimental
In this paper, weResults
first experimented with a dynamic ε-greedy action selection strategy.
The experimental process made ε dynamically adjustable, set to 1 at the beginning of
In this
training, and thenpaper, we first
afterwards experimented
it was each round minus withthea current
dynamic numberε-greedy
of rounds, action se
egy. The
divided by experimental
the total number of process made
rounds. This ε dynamically
avoided adjustable,
overfitting of the neural networkset toand1 at the
also sped up the convergence of the model. In order to verify
training, and then afterwards it was each round minus the current numb the feasibility of this approach,
a comparison experiment was set up in this paper, with a fixed ε = 0.9 action selection
divided by the total
strategy according to thenumber
reference, ofandrounds. This avoided
both experiments used theoverfitting
convolutionalof the neural
neural
also sped
network DQN upalgorithm
the convergence of the
for model training and model. In of
a final round order toThe
testing. verify
analysisthewas feasibili
performed in terms of the metric of the number of vehicles
proach, a comparison experiment was set up in this paper, with in the queue, and the comparison
Electronics 2024, 12, x FOR PEER REVIEW 16 ofa 22
fixed ε =
of the two methods is shown in Figure 6. During testing, the number of vehicles queuing in
lection strategy
the environment according
optimized by theto the reference,
dynamic ε-greedy methodand both experiments
was significantly used
less than a the c
neural network
fixed ε-Greedy DQN
method. Thisalgorithm
indicates thefor model
dynamic training
ε-greedy methodand hasaafinal round
stronger trafficof testi
diversion
ysis was ability
performedand can in
maintain
termsvehicles
of theatmetric
intersections
of theat anumber
relatively of
lowvehicles
level. in the qu
comparison of the two methods is shown in Figure 6. During testing, the nu
cles queuing in the environment optimized by the dynamic ε-greedy metho
cantly less than a fixed ε-Greedy method. This indicates the dynamic ε-gr
has a stronger traffic diversion ability and can maintain vehicles at intersect
tively low level.
Figure
Figure Comparisonof
6.6.Comparison of the number
numberofof
queued vehicles
queued for dynamic
vehicles and static
for dynamic and action
static selection strategies.
action selection
strategies.
In summary, the dynamic ε-greedy action selection strategy is significantly better

than the static ε-greedy action selection strategy. Therefore, we applied dynamic ε-greedy
to the original DQN algorithm and continued to optimize the algorithm. In this paper, we
proposed a road division strategy based on subgroup series for obtaining traffic states.
Electronics 2024, 13, 198 15 of 20
In summary, the dynamic ε-greedy action selection strategy is significantly better than
the static ε-greedy action selection strategy. Therefore, we applied dynamic ε-greedy to
the original DQN algorithm and continued to optimize the algorithm. In this paper, we
proposed a road division strategy based on subgroup series for obtaining traffic states. Since
vehicles close to intersections have a greater degree of traffic impact and thus cause more
traffic pressure, we proposed the G-DQN algorithm by dividing the divided intersection
cells into groups. The larger the weight of the group closer to the intersection, the greater
its impact on the traffic road condition was indicated. Compared with the average way of
dividing roads, the method proposed in this paper can reflect the current traffic state more
accurately, and at the same time can normalize the matrix, which is convenient for neural
network training.
We conducted model training and testing on the G-DQN algorithm, DQN algorithm,
A2C algorithm, and signal light fixed-timing scheme in the same environment, and com-
pared and analyzed three indicators: cumulative reward, average waiting time, and average
queue length. Among them, the cumulative reward reflects the speed of the rate of conver-
gence of the neural network, which is used to display the quality of the traffic conditions
laterally. Due to the lack of cumulative reward indicators in the fixed timing scheme of
signal lights, the comparison of cumulative rewards among the three deep reinforcement
algorithms is shown in Figure 7. At the beginning of the algorithm, the cumulative rewards
of the G-DQN algorithm first declined to a certain extent, and the cumulative rewards of
the DQN algorithm were more stable, while the A2C algorithm, as a new deep reinforce-
Electronics 2024, 12, x FOR PEER REVIEW 17 of 22
ment learning algorithm, had a more stable state. With the training to about five episodes,
the performance of the G-DQN algorithm was significantly better than that of the DQN
algorithm, the G-DQN algorithm had a faster rate of convergence than DQN, and the
performance of the final convergence was similar to that of the A2C algorithm.
Figure7.7.Comparison
Figure Comparisonof
ofthe
theaccumulated
accumulatedrewards
rewardsofofthe
thethree
threealgorithms.
algorithms.
The
Theaverage
averagequeue
queuelength
lengthcomparison
comparison of of
thethe
four methods
four methodsis shown in Figure
is shown 8. At8.the
in Figure At
beginning of training, the number of vehicle queues in the G-DQN algorithm
the beginning of training, the number of vehicle queues in the G-DQN algorithm environ- environment
was
menthigher than the
was higher other
than the three
other methods, and the
three methods, andstability of these
the stability three three
of these methods was
methods
better than that
was better thanofthat
the of
G-DQN algorithm
the G-DQN at the beginning.
algorithm However,
at the beginning. after several
However, afterrounds
several
of training, the number of queued vehicles in the G-DQN algorithm environment
rounds of training, the number of queued vehicles in the G-DQN algorithm environment sharply
decreased. The number of queued vehicles in the environment was the lowest among
sharply decreased. The number of queued vehicles in the environment was the lowest
the four methods, and the convergence trend appeared faster than in the DQN algorithm
among the four methods, and the convergence trend appeared faster than in the DQN
environment, reflecting the performance superiority of the G-DQN algorithm.
algorithm environment, reflecting the performance superiority of the G-DQN algorithm.
was better than that of the G-DQN algorithm at the beginning. However, after several
rounds of training, the number of queued vehicles in the G-DQN algorithm environment
sharply decreased. The number of queued vehicles in the environment was the lowest
among the four methods, and the convergence trend appeared faster than in the DQN
algorithm environment, reflecting the performance superiority of the G-DQN algorithm.
Electronics 2024, 13, 198 16 of 20
Figure
Figure 8. Comparison
8. Comparison of the
of the average
average length
length of the
of the four
four methods.
methods.
Electronics 2024, 12, x FOR PEER REVIEW 18 of 22
TheThe cumulative
cumulative queue
queue timetime comparison
comparison of of
thethe four
four methods
methods is shown
is shown in Figure
in Figure 9. 9.
Similarly,atatthe
Similarly, thebeginning
beginning of of the
the training,
training, the
thethree
threedeep
deepreinforcement
reinforcement learning
learningalgorithms
algo-
did not
rithms didconverge,
not converge,with with
largelarge
fluctuations, and the
fluctuations, andcumulative queue
the cumulative time time
queue was even higher
was even
the signal lightfixed-timing
fixed-timingscheme.
scheme.However,
Compared with the DQN algorithm, thethe
cumulative
higher than the signal fixed-timing scheme. However, after several rounds of training, the of
than the signal after several rounds of training, effect
waiting
the three time under the G-DQN algorithm environment was less than thatthat
under the DQN
effect of thedeep
threereinforcement learning
deep reinforcement schemes
learning was obviously
schemes better
was obviously than
better of the
than thatsignal
of
algorithm
light environment,
fixed-timing scheme. and the rate of
Compared convergence
with of the G-DQN
the DQN algorithm, algorithm iswaiting
the cumulative signifi-
cantly
time faster
under thethan
G-DQNthat algorithm
of the DQN algorithm. was
environment Compared
less thanwith
thatA2C,
under although
the DQNthe G-DQN
algorithm
environment, and the rate of convergence of the G-DQN algorithm is significantlyshorter
algorithm had lower stability, the vehicle queuing time in the environment was faster
than that
than that of
of the
A2CDQNalgorithm.
algorithm. Compared with A2C, although the G-DQN algorithm
had lower stability, the vehicle queuing time in the environment was shorter than that of
A2C algorithm.
Figure9.9.Comparison
Figure Comparisonof
ofcumulative
cumulativequeuing
queuingtime
timeof
ofthe
thefour
fourmethods.
methods.
Finally,
Finally,we
weconducted
conducted a round
a round of testing on four
of testing methods
on four and analyzed
methods and analyzedand compared
and com-
the
parednumber of queued
the number vehicles
of queued under under
vehicles simulated road conditions
simulated lasting
road conditions for 90for
lasting min. The
90 min.
experimental results are shown in Figure 10. At the beginning of the testing
The experimental results are shown in Figure 10. At the beginning of the testing phase, phase, the
number of vehicles queuing in all four methods’ environments increased,
the number of vehicles queuing in all four methods’ environments increased, with the with the A2C al-
gorithm environment having the least number of vehicles queuing. Under
A2C algorithm environment having the least number of vehicles queuing. Under the en- the environment
of fixed-signal
vironment timing scheme,
of fixed-signal timingthescheme,
numberthe of vehicles
number queuing upqueuing
of vehicles was obviously
up wasmore
obvi-
than the three deep reinforcement learning algorithms, which proves the
ously more than the three deep reinforcement learning algorithms, which proves the feasibility of the
fea-
deep reinforcement learning algorithm. And within 1200 steps of testing, the
sibility of the deep reinforcement learning algorithm. And within 1200 steps of testing, thenumber of
queued vehicles in the DQN algorithm and A2C algorithm environments was less than that
number of queued vehicles in the DQN algorithm and A2C algorithm environments was
in the G-DQN algorithm environment, which is consistent with the conclusion of the model
less than that in the G-DQN algorithm environment, which is consistent with the conclu-
sion of the model training process mentioned above. As the testing progressed, the stabil-
ity of the G-DQN algorithm began to manifest. In the G-DQN algorithm environment, the
increasing trend of the number of queued vehicles was slow, and the peak value of queued
vehicles was smaller than that in the A2C algorithm environment. After 2000 steps, there
Electronics 2024, 13, 198 17 of 20
training process mentioned above. As the testing progressed, the stability of the G-DQN
algorithm began to manifest. In the G-DQN algorithm environment, the increasing trend
of the number of queued vehicles was slow, and the peak value of queued vehicles was
smaller than that in the A2C algorithm environment. After 2000 steps, there was a clear
downward trend, and the number of queued vehicles in the environment was almost zero
when the algorithm converged. Due to the lack of improvement in the state and rewards of
the DQN algorithm, it cannot effectively capture the features in the environment, which
can have an impact on optimization. Therefore, there was a significant fluctuation in the
number of queued vehicles in the DQN algorithm environment, which proves that the
DQN algorithm is not very stable, and the number of queued vehicles when the algorithm
converges is significantly higher than in the G-DQN algorithm. Although the number of
queued
Electronics 2024, 12, x FOR PEER REVIEW vehicles in both the G-DQN and A2C algorithm environments tended to 19zero
of 22in
the end, it can be seen that the number of queued vehicles decreased faster in the G-DQN
algorithm environment. Therefore, the comprehensive performance of the G-DQN algo-
rithm is better. However, the G-DQN algorithm started to have poor stability, which is an
aspect to be optimized in the future.
Figure
Figure 10.10. Comparison
Comparison ofof the
the number
number ofof vehicles
vehicles inin queue
queue forfor the
the four
four methods.
methods.
InInsummary,
summary,as asshown
shownin in Table
Table 3,3, during
during the
the entire
entireexperiment,
experiment,the thefixed
fixedtiming
timing ofof
the
signal
the signal lights diddid
lights notnot
learn anyany
learn traffic information,
traffic information,so the performance
so the performance waswaspoor. The The
poor. other
threethree
other reinforcement learning
reinforcement algorithms
learning all apply
algorithms allbetter
applydynamics than the ε-Greedy
better dynamics than thestrat-
ε-
egy to select actions. According to the cumulative reward comparison
Greedy strategy to select actions. According to the cumulative reward comparison in Fig- in Figure 7, all three
deep
ure reinforcement
7, all learning algorithms
three deep reinforcement learning tended to converge
algorithms tended during the training
to converge duringprocess,
the
and the G-DQN and A2C algorithms converged more stably. According
training process, and the G-DQN and A2C algorithms converged more stably. According to the comparison
toofthe
vehicle queue length
comparison andqueue
of vehicle cumulative
lengthwaiting time in Figures
and cumulative 8 and
waiting time9,in
G-DQN
Figuresand A2C
8 and
showed better performance during the training process, with
9, G-DQN and A2C showed better performance during the training process, with both both indicators maintained
at lower maintained
indicators levels. Therefore,
at lowerin levels.
the final testing phase,
Therefore, in the the
finalG-DQN
testing and A2C
phase, thealgorithms
G-DQN
were able to handle 1000 vehicles in the environment well, ensuring
and A2C algorithms were able to handle 1000 vehicles in the environment well, ensuring that there were no
congested vehicles in the final testing phase of the environment. The difference lies in the
that there were no congested vehicles in the final testing phase of the environment. The
G-DQN reflecting the state of the intersection in more detail, optimizing the details of the
difference lies in the G-DQN reflecting the state of the intersection in more detail, opti-
model, designing a more realistic state space, and providing more comprehensive rewards.
mizing the details of the model, designing a more realistic state space, and providing more
Therefore, the neural network has faster convergence speed and better performance. The
comprehensive rewards. Therefore, the neural network has faster convergence speed and
experimental results show that when the intersection enters peak hours, the model trained
better performance. The experimental results show that when the intersection enters peak
by the G-DQN algorithm can arrange the traffic lights in a more reasonable phase. This
hours, the model trained by the G-DQN algorithm can arrange the traffic lights in a more
optimizes the traffic environment with signal lights, improving the operational efficiency
reasonable phase. This optimizes the traffic environment with signal lights, improving the
and safety of road traffic.
operational efficiency and safety of road traffic.
Table 3. Comparison of experimental results.
Number of
Cumulative Vehicle Neural Network
Accumulated Average Queue
Queuing Queues Convergence Step
Rewards Length (Vehicles)
Electronics 2024, 13, 198 18 of 20
Table 3. Comparison of experimental results.
Number of Vehicle Neural Network

Accumulated Average Queue Cumulative
Queues Tested Convergence Step
Rewards Length (Vehicles) Queuing Time (s)
(Vehicles) during Testing
G-DQN −12,935.5 4.9 26,676 0 3200
DQN −22,209.0 7.2 38,870 32 4100
A2C −13,709.3 5.5 29,524 0 3600
Fixed - 8.0 40,538 95 -
5. Conclusions
The optimization and control of traffic signals is of great significance for improving
logistics efficiency, improving distribution reliability, optimizing distribution networks, and
improving the level of logistics informatization. It can promote the healthy development
of the logistics industry. Deep reinforcement learning provides a method to optimize the
signal optimization control system and provides a solution for urban traffic optimization
control. In this paper, the DQN algorithm based on a convolutional neural network (CNN)
model is used to model regional traffic, and then the concept of grouping sequence is
introduced on the basis of the algorithm, and the G-DQN algorithm is proposed. We
compared the optimization performance of the DQN algorithm, G-DQN algorithm, and
A2C algorithm. Experiments have shown that the G-DQN algorithm has significant advan-
tages in dealing with the optimization control of regional traffic lights during peak hours.
After optimization, the number and time of vehicle queues in the area are significantly
better than other methods, making it more suitable for dealing with the current traffic
congestion problems in large cities. This work systematically studied the optimization
control of regional traffic signal lights and achieved some research results with theoretical
and practical application value. However, with the increase in urban vehicles, the number
of high-volume traffic intersections will also increase day by day, and constructing a single
intersection traffic model can no longer meet the needs of the intelligent transportation field.
In the future, we will first consider optimizing traffic signal control in multi-intersection
areas and regulating traffic signals in more complex environments by using more complex
road network environments, such as irregular multiple intersections and intersections
with more than four directions, and using more complex traffic environments, such as
adding buses or sidewalks. Secondly, we will focus on improving more advanced deep
reinforcement learning algorithms, such as A3C, Twin Delayed Deep Deterministic Policy
Gradients (TD3) and Reinforcement Learning with Expert Demonstrations and Minimal
Losses (REM). We will add convolutional layers to the neural networks in these algorithms
for finer feature extraction, and consider adding residual networks to improve network
performance. Finally, we will further adjust the definitions of status and rewards to provide
a more detailed description of the transportation environment.
Author Contributions: Conceptualization, K.C. and L.W.; methodology, L.W.; software, L.W.; vali-
dation, S.Z. and L.D.; formal analysis, H.J.; investigation, H.J.; resources, K.C.; data curation, L.W.;
writing—original draft preparation, L.W., G.J., S.S. and H.Z.; writing—review and editing, K.C.;
visualization, L.W.; supervision, S.Z.; project administration, L.D.; funding acquisition, K.C. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the MSIT (Ministry of Science and ICT), Republic of Korea,
under the Innovative Human Resource Development for Local Intellectualization support program
(IITP-2024-RS-2022-00156334) supervised by the IITP (Institute for Information & communications
Technology Planning & Evaluation). We also received support from the 2023 Liaoning Provincial
Education Department’s Basic Research General Project (JYTMS20231518).
Data Availability Statement: Data supporting this study are included within the article.
Conflicts of Interest: The authors declare no conflict of interest.
Electronics 2024, 13, 198 19 of 20
References
1. Min, Z. Enhance the management of roads in central to the development of the “two-oriented society” the significance of logistics.
Logist. Eng. Manag. 2009, 31, 40–41.
2. Naidoo, K. Using intelligent transport technologies in SA’s largest urban areas. Civ. Eng. Mag. S. Afr. Inst. Civ. Eng. 2016, 2016,
51–54.
3. Jin, W.; Salek, M.S.; Chowdhury, M.; Torkjazi, M.; Huynh, N.; Gerard, P. Assessment of Operational Effectiveness of Synchro
Green Adaptive Signal Control System in South Carolina. Transp. Res. Rec. 2021, 2675, 714–728. [CrossRef]
4. Maipradit, A.; Kawakami, T.; Liu, Y.; Gao, J.; Ito, M. An Adaptive Traffic Signal Control Scheme Based on Back-Pressure with
Global Information. J. Inf. Process. 2021, 29, 124–131. [CrossRef]
5. Tai, T.S.; Yeung, C.H. Adaptive strategies for route selection en-route in transportation networks. Chin. J. Phys. 2022, 77, 712–720.
[CrossRef]
6. Cahyono, S.D.; Tristono, T.; Sutomo; Utomo, P. Model of demand order method of traffic lights phases. J. Phys. Conf. Ser. 2019,
1211, 012036. [CrossRef]
7. Noaeen, M.; Naik, A.; Goodman, L.; Crebo, J.; Abrar, T.; Abad, Z.S.H.; Bazzan, A.L.; Far, B. Reinforcement learning in urban
network traffic signal control: A systematic literature review. Expert Syst. Appl. 2022, 199, 116830. [CrossRef]
8. Kuang, L.; Zheng, J.; Li, K.; Gao, H. Intelligent Traffic Signal Control Based on Reinforcement Learning with State Reduction for
Smart Cities. ACM Trans. Internet Technol. (TOIT) 2021, 21, 102. [CrossRef]
9. Müller, A.; Sabatelli, M. Safe and Psychologically Pleasant Traffic Signal Control with Reinforcement Learning Using Action
Masking. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau,
China, 8–12 October 2022.
10. Zhao, W.; Ye, Y.; Ding, J.; Wang, T.; Wei, T.; Chen, M. IPDALight: Intensity- and phase duration-aware traffic signal control based
on Reinforcement Learning. J. Syst. Archit. 2022, 123, 102374. [CrossRef]
11. Genders, W.; Razavi, S.; Asce, A.M. Policy Analysis of Adaptive Traffic Signal Control Using Reinforcement Learning. J. Comput.
Civ. Eng. 2019, 34, 04019046. [CrossRef]
12. Zhang, R.; Ishikawa, A.; Wang, W.; Striner, B.; Tonguz, O.K. Using Reinforcement Learning with Partial Vehicle Detection for
Intelligent Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2021, 22, 404–415. [CrossRef]
13. Boukerche, A.; Zhong, D.; Sun, P. A Novel Reinforcement Learning-Based Cooperative Traffic Signal System through Max-pressure
Control. IEEE Trans. Veh. Technol. 2021, 71, 1187–1198. [CrossRef]
14. Alegre, L.N.; Ziemke, T.; Bazzan, A.L.C. Using Reinforcement Learning to Control Traffic Signals in a Real-World Scenario: An
Approach Based on Linear Function Approximation. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9126–9135. [CrossRef]
15. Wang, X.; Ke, L.; Qiao, Z.; Chai, X. Large-Scale Traffic Signal Control Using a Novel Multiagent Reinforcement Learning. IEEE
Trans. Cybern. 2020, 51, 174–187. [CrossRef] [PubMed]
16. Antes, T.D.O.; Bazzan, A.L.C.; Tavares, A.R. Information upwards, recommendation downwards: Reinforcement learning with
hierarchy for traffic signal control. Procedia Comput. Sci. 2022, 201, 24–31. [CrossRef]
17. Li, H.; Kumar, N.; Chen, R.; Georgiou, P. A deep reinforcement learning framework for Identifying funny scenes in movies.
In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB,
Canada, 15–20 April 2018.
18. Park, S.; Han, E.; Park, S.; Jeong, H.; Yun, I. Deep Q-network-based traffic signal control models. PLoS ONE 2021, 16, e0256405.
[CrossRef] [PubMed]
19. Zhao, T.; Wang, P.; Li, S. Traffic Signal Control with Deep Reinforcement Learning. In Proceedings of the 2019 International
Conference on Intelligent Computing, Automation and Systems (ICICAS), Madurai, India, 15–17 May 2019.
20. Wan, C.H.; Hwang, M.C. Adaptive Traffic Signal Control Methods Based on Deep Reinforcement Learning. In Intelligent Transport
Systems for Everyone’s Mobility; Springer: Singapore, 2019; pp. 195–209.
21. Zou, Y.; Qin, Z. Value-Based Bayesian Meta-Reinforcement Learning and Traffic Signal Control. arXiv 2020, arXiv:2010.00163.
22. Chu, K.F.; Lam, A.Y.S.; Li, V.O.K. Traffic Signal Control Using End-to-End Off-Policy Deep Reinforcement Learning. IEEE Trans.
Intell. Transp. Syst. 2021, 23, 7184–7195. [CrossRef]
23. Muresan, M.; Pan, G.; Fu, L. Multi-Intersection Control with Deep Reinforcement Learning and Ring-and-Barrier Controllers.
Transp. Res. Rec. 2021, 2675, 308–319. [CrossRef]
24. Alhassan, A.; Saeed, M. Adjusting Street Plans Using Deep Reinforcement Learning. In Proceedings of the 2020 International
Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum, Sudan, 26 February–1 March 2021.
25. Bouktif, S.; Cheniki, A.; Ouni, A.; El-Sayed, H. Deep reinforcement learning for traffic signal control with consistent state and
reward design approach. Knowl. Based Syst. 2023, 267, 110440. [CrossRef]
26. Su, H.; Zhong, Y.D.; Chow, J.Y.; Dey, B.; Jin, L. EMVLight: A multi-agent reinforcement learning framework for an emergency
vehicle decentralized routing and traffic signal control system. Transp. Res. Part C Emerg. Technol. 2023, 146, 103955. [CrossRef]
27. Ramos-Martinez, M.; Torres-Cantero, C.A.; Ortiz-Torres, G.; Sorcia-Vázquez, F.D.; Avila-George, H.; Lozoya-Ponce, R.E.;
Vargas-Méndez, R.A.; Renteria-Vargas, E.M.; Rumbo-Morales, J.Y. Control for Bioethanol Production in a Pressure Swing
Adsorption Process Using an Artificial Neural Network. Mathematics 2023, 11, 3967. [CrossRef]
Electronics 2024, 13, 198 20 of 20
28. Rentería-Vargas, E.M.; Aguilar, C.J.Z.; Morales, J.Y.R.; De-La-Torre, M.; Cervantes, J.A.; Huerta, J.R.L.; Torres, G.O.; Vázquez,
F.D.J.S.; Sánchez, R.O. Identification by Recurrent Neural Networks applied to a Pressure Swing Adsorption Process for Ethanol
Purification. In Proceedings of the 2022 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA),
Poznan, Poland, 21–22 September 2022; pp. 128–134.
29. Behrisch, M.; Bieker, L.; Erdmann, J.; Krajzewicz, D. SUMO—Simulation of Urban MObility: An Overview. In Proceedings of the
SIMUL 2011, the Third International Conference on Advances in System Simulation, Barcelona, Spain, 23–28 October 2011.
30. Özdil, A.; Yilmaz, B. Medical infrared thermal image based fatty liver classification using machine and deep learning. Quant.
InfraRed Thermogr. J. 2023. [CrossRef]
31. Garrido, I.; Lagüela, S.; Fang, Q.; Arias, P. Introduction of the combination of thermal fundamentals and Deep Learning for the
automatic thermographic inspection of thermal bridges and water-related problems in infrastructures. Quant. InfraRed Thermogr.
J. 2022, 20, 231–255. [CrossRef]
32. Chebbah, N.K.; Ouslim, M.; Benabid, S. New computer aided diagnostic system using deep neural network and SVM to detect
breast cancer in thermography. Quant. InfraRed Thermogr. J. 2023, 20, 62–77. [CrossRef]
33. Mahoro, E.; Akhloufi, M.A. Breast cancer classification on thermograms using deep CNN and transformers. Quant. InfraRed
Thermogr. J. 2022. [CrossRef]
34. Ervural, S.; Ceylan, M. Thermogram classification using deep siamese network for neonatal disease detection with limited data.
Quant. InfraRed Thermogr. J. 2022, 19, 312–330. [CrossRef]
35. Barto, A.G.; Sutton, R.S. Reinforcement learning: An introduction (Adaptive computation and machine learning). IEEE Trans.
Neural Netw. 1998, 9, 1054.
36. Kanis, S.; Samson, L.; Bloembergen, D.; Bakker, T. Back to Basics: Deep Reinforcement Learning in Traffic Signal Control. arXiv
2021, arXiv:2109.07180.
37. Xu, Y.; Ni, Z. Intelligent traffic signal coordination control method based on YOLOv3 and DQN. In Proceedings of the International
Conference on Signal Processing and Communication Technology (SPCT 2022), Harbin, China, 23–25 December 2022.
38. Modarres, M.; Kaminskiy, M.P.; Krivtsov, V. Reliability Engineering and Risk Analysis: A Practical Guide, 2nd ed.; CRC Press: Boca
Raton, FL, USA, 2009.
39. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 1994;
Volume 1.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.

14

Uploaded by

Copyright:

Available Formats

14

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

14

Uploaded by

Copyright:

Available Formats

electronics

1 Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province,

Copyright: © 2024 by the authors.

Electronics 2024, 13, 198. https://doi.org/10.3390/electronics13010198 https://www.mdpi.com/journal/electronics

2.1. SUMO Traffic Simulation Software

to download the real-world road network through OpenStreetMap to generate the

2.2. Deep Reinforcement Learning

Table 1. Nomenclature table.

Parameter Name Parameter

Table 1. Nomenclature table.

Parameter Name Parameter

The DQN algorithm is an improved algorithm based on the q-learning algorithm of

Qπ (st , at ) = ∑ Psatt→st+1 [r(st , at ) + γV (st+1 )] (1)

Q(st+1 , at+1 ) ← Q(s, a) + α[r + γmaxQ(st+1 , at+1 ) − Q(s, a)] (3)

L = [r + γmaxQ(st+1 , at+1 ; θ‘) − Q(s, a; θ )]2 (4)

3. Traffic Signal Optimization Control Model

usually represented by a matrix as input to a convolutional neural network in order to better

3.5. Action Selection Strategy

3.6. Convolutional Neural Network

4.1. Experimental Environment

4.2. Experimental Data

Figure 5. Weibull distribution for shape parameter of 2.

In summary, the dynamic ε-greedy action selection strategy is significantly better

Table 3. Comparison of experimental results.

Table 3. Comparison of experimental results.

Number of Vehicle Neural Network

You might also like