1. Introduction
Urban traffic both reflects and propels the development of the entire functional urban layout. With rising economic levels and the continuous acceleration of urbanization, urban road traffic congestion has become a problem that restricts the development of modern cities. Intersections are the critical nodes and main bottlenecks of the road traffic network, so intelligent control of intersection signals plays a crucial role in alleviating traffic congestion [1]. Previous traffic signal control research was mostly vehicle-oriented; in recent years, however, researchers have increasingly focused on intelligent traffic signal control for pedestrians. This focus is necessary because when pedestrians' delay exceeds their tolerance time, they become impatient, which can result in violations of traffic rules [2] and even traffic accidents, endangering personal safety. According to the World Health Organization's Global Status Report on Road Safety 2018, more than half of all road traffic fatalities are among vulnerable road users: pedestrians, cyclists, and motorcyclists, with pedestrians and cyclists accounting for 26% of all deaths [3]. Therefore, the right design, appropriate setting, and effective use of traffic control signals are of great significance for improving the traffic capacity of vehicles at intersections, reducing the waiting time of pedestrians crossing the street, and ensuring the safety of pedestrians and vehicles.
The current crossing control methods include unsignalized control, fixed-time signal control, adaptive signal control, and intelligent traffic signal control (ITSC) [4]. The fixed-time method operates according to a preset timing scheme. The adaptive method is a more comprehensive scheme, formulated by jointly considering traffic flow, stopping time, and other factors. The ITSC method is more flexible than fixed-time control and can improve the traffic capacity of the intersection: it collects the traffic information of vehicles and pedestrians in all directions at the intersection in real time and selects the signal phase according to the real-time traffic situation, reducing the number of vehicle stops and the waiting time of pedestrians and avoiding wasted green-light time.
Most ITSC systems are aimed only at general and special vehicles on the road, and research on pedestrians is scarce. In addition, “pedestrian priority” should be a basic principle of urban road traffic management, so the design of an ITSC system should also fully consider the needs of pedestrians; only in this way can urban road traffic congestion be relieved efficiently [5]. Most existing studies add pedestrians on top of vehicle-oriented formulations rather than treating vehicles and pedestrians as a whole; in other words, the two are handled separately.
To address the long pedestrian waiting times caused by vehicle-based signal control systems, we propose a real-time optimization model that coordinates signal control between pedestrians and vehicles at intersections. Pedestrian and vehicle factors are both considered in the state and reward design. We modify the discrete traffic state encoding method proposed by Genders et al. [6] to add sidewalks and thereby obtain pedestrian location information. Furthermore, simulation experiments using real intersection data demonstrate the validity of the method.
The main contributions of this work are as follows:
To improve the sampling efficiency of the model, we combine a Dueling Double Deep Q Network (3DQN) with a multi-process operation method. The method trains in multiple environments at the same time to generate multiple experiences, improving the sampling efficiency of the experience pool and accelerating convergence;
We modify the discrete traffic state encoding method and add sidewalks to the range of detection lanes to extract the location information of pedestrians. The factors of pedestrians and vehicles are jointly considered, and more comprehensive state and reward designs are defined;
To evaluate the model, we employ the SUMO simulation to conduct experiments based on real intersections.
This work is organized as follows:
Section 2 introduces the related work on traffic signal control.
Section 3 discusses the traffic signal model, which presents the algorithm architecture of the proposed model, the reinforcement learning model, and the design of the deep reinforcement learning algorithm. In
Section 4, we conduct simulation experiments and discuss the results. Finally, Section 5 concludes the work.
2. Literature Review
The traditional traffic signal control method adopts a fixed timing scheme. As traffic flow increases, a fixed signal timing scheme struggles to deal with complex traffic scenarios, so researchers have proposed variable signal control schemes. For example, the British Transport and Road Research Laboratory proposed the TRANSYT system [7]; Hunt et al. [8] proposed the SCOOT system, an improvement on TRANSYT; and Luk et al. [9] proposed the SCATS system, which can select a signal scheme in real time.
With the development of variable signal technology, researchers have applied reinforcement learning (RL) [10] to traffic signal control [11,12]. Mikami [13] was the first to employ RL in traffic signal control. However, the method led to a sharp increase in algorithmic complexity and hindered model training. Researchers have therefore combined deep learning with reinforcement learning, namely deep reinforcement learning (DRL) [14,15], to address the problem of high algorithmic complexity.
DRL performs well in traffic signal control and is therefore widely used for traffic signals [16]. To improve the efficiency of the traffic system, Genders et al. [6] were the first to apply a DQN to traffic signal control, obtaining the optimal signal control scheme through agent training. Subsequently, Zeng et al. [17] used a memory-based deep reinforcement learning method combined with a recurrent neural network to solve the signal control problem; the proposed DRQN algorithm achieved better results than the DQN. Liang et al. [18] proposed a D3QN algorithm by combining Dueling DQN and Double DQN; simulation experiments showed that their method could reduce the average waiting time of vehicles and learned faster. Gu et al. [19] proposed a double-agent double DQN algorithm to improve the stability of system training, and experiments showed that it could effectively improve the capacity of intersections. Although these models improved the traffic efficiency of vehicles at intersections, they ignored the relationship between pedestrians and vehicles at the intersections.
In terms of pedestrian crossings, some researchers have used traditional control methods. For example, Zhuang [20] proposed a push-button activation method for pedestrian crossings, in which pedestrians trigger the change of the signal phase. This method did not consider pedestrians and vehicles as a whole, which is not conducive to signal light control. Ma [21] used an Exclusive Pedestrian Phase (EPP), which eliminated all interactions between vehicles and pedestrians, improving safety but reducing traffic efficiency. Other researchers have used different algorithms to optimize the signal control system. Liu et al. [22] proposed an intelligent traffic light control method based on distributed multi-agent Q-learning, which minimized vehicle delays and pedestrian crossing risks. Zhang et al. [23] proposed a traffic signal method that considers pedestrian safety and vehicle delay; at the end of each pedestrian Flashing Green (FG) interval, they introduced an additional Dynamic All-Red (DAR) stage to reduce the rate of pedestrian violations at crosswalks. Wu et al. [24] proposed a multi-agent recurrent deep deterministic policy gradient (MARDDPG) method that includes the number of pedestrians in the state space, which alleviated traffic congestion at intersections. Zhu et al. [25] proposed a new context-aware multi-agent generalized reinforcement learning method and applied a semi-synthetic data set to traffic signal control, addressing the problem of missing pedestrian data in open data sets. Xu et al. [26] modified the max-pressure signal control strategy to consider pedestrian factors; an activation function based on pedestrian waiting and tolerance times was designed to effectively limit pedestrian waiting time. Although these studies considered pedestrian factors, they did not consider pedestrian location information. As a result, the demands of pedestrians could not be determined in time and the relationship between pedestrians and vehicles could not be well balanced.
Most studies have used only discrete traffic state encoding [6] to extract vehicle location information when defining the reinforcement learning states, ignoring pedestrian traffic conditions, signal phases, and waiting times. On this basis, and under the premise of considering the safety and traffic efficiency of both pedestrians and vehicles at intersections, we propose an optimization method for signal control that coordinates pedestrians and vehicles. Pedestrian location information, pedestrian waiting time, and other information are taken into account in the state space. In the reward function, the maximum tolerance times of vehicles and pedestrians are considered on top of vehicle throughput and pedestrian waiting time, taking pedestrian crossing safety into account and reducing the average delay of vehicles and pedestrians. Building on the 3DQN method, multiple processes are used during training to improve the training efficiency of the system.
3. Traffic Signal Control Model
3.1. Algorithmic Framework
We propose a signal control system for a single intersection to reduce the waiting times of pedestrians and vehicles. The system uses a deep reinforcement learning method that is split into three main parts: the main network, the target network, and the experience pool. The model framework is shown in
Figure 1.
The main network calculates the Q value and selects actions through a convolutional neural network. We integrate the position states of pedestrians and vehicles and feed the integrated result into the convolutional layer; its output, together with the signal phase and light-duration states, is flattened and fed into the fully connected layer. Following Dueling DQN, the fully connected layer is split into two streams, a state value V and an advantage function A, which are added to obtain the final Q value.
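To make the dueling structure concrete, the following is a minimal PyTorch sketch of a main network of this kind. The layer sizes, kernel sizes, and the single shared fully connected layer are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling head: convolution over the position grid, followed by value (V)
    and advantage (A) streams whose sum gives the Q values."""

    def __init__(self, grid_channels, n_actions):
        super().__init__()
        # Convolution over the integrated pedestrian/vehicle position grid.
        self.conv = nn.Sequential(
            nn.Conv2d(grid_channels, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=2, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Shared fully connected layer over conv features + phase/duration inputs.
        self.fc = nn.LazyLinear(128)
        self.value = nn.Linear(128, 1)              # state value V(s)
        self.advantage = nn.Linear(128, n_actions)  # advantage A(s, a)

    def forward(self, grid, aux):
        x = torch.cat([self.conv(grid), aux], dim=1)
        x = torch.relu(self.fc(x))
        v = self.value(x)
        a = self.advantage(x)
        # Dueling aggregation: Q = V + (A - mean(A)), cf. Equation (10).
        return v + a - a.mean(dim=1, keepdim=True)
```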
The target network is a mechanism for breaking correlations, and it has the same structure as the main network. The main network is trained on the latest experience data, while the target network uses older parameters to predict the target Q value. The value output by the main network evaluates the current state, the target network provides the target Q value, and the parameters of the main network are updated according to the loss function. After a number of iterations, the model copies the parameters of the main network to the target network. Double DQN uses the main network to select the action and the target network to evaluate that action's value, changing how the target value is calculated.
When the amount of experience data reaches a preset value, the model randomly extracts a batch of experience data for neural network training. This sampling method breaks the correlation between samples, making learning more efficient and stable.
3.2. Reinforcement Learning Model
This work designs the relevant elements of deep reinforcement learning around single-intersection signal control. We define the agent's states, actions, and rewards according to traffic control standards and signal light configuration requirements.
3.2.1. States
In reinforcement learning, the agent decides what action to take based on the state of the environment. The state space is the input to training, and its design largely determines the quality of the control strategy, so state selection is particularly critical. Because actual intersection conditions differ, a traffic signal control system based on deep reinforcement learning can use many different state representations. To meet the needs of both vehicles and pedestrians passing through the intersection, we define four states in this module: vehicle and pedestrian position information, the signal phase scheme, the pedestrian red-light duration, and the vehicle green-light duration.
As shown in
Figure 2, we set the position state information of pedestrians and vehicles as the first state. We use and improve discrete traffic state encoding (DTSE) to reflect the real-time positions of pedestrians and to help predict the next state. Unlike the original DTSE, which only considers vehicle positions, we also consider pedestrians. The road is divided into a grid according to the actual layout: each lane and sidewalk is split into several cells, vehicle and pedestrian information is stored in separate cells, and the resulting position information of the traffic flow forms a state input in vector form.
Figure 3 takes the west entrance of a standard intersection as an example. In addition to the original lanes, we add a sidewalk on the right side of the lanes so that the positions of pedestrians and vehicles are detected simultaneously. Assuming that the length of the westward entrance road is l and the length of a vehicle is c, the lane is divided into ⌈l/c⌉ grid cells. Each cell stores information about the presence of vehicles and pedestrians in its lane. If there are no vehicles or pedestrians in a cell, its value is set to 0; if a vehicle is detected in the cell, it is recorded as 1; and if there are several pedestrians in the same cell, their counts are superimposed. In the figure, two pedestrians are detected in one cell, so its value is set to 2. The first state (Equation (1)) is obtained by integrating the position state information of vehicles and pedestrians in the four directions of the intersection.
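As an illustration of the modified DTSE described above (a sketch, not the implementation used in this work), the following Python function encodes a single approach consisting of one lane plus the added sidewalk; the position lists are assumed to be distances from the stop line supplied by detectors or the simulator.

```python
import math

def dtse_with_sidewalk(road_len, cell_len, vehicle_pos, pedestrian_pos):
    """Illustrative DTSE-style encoding for one approach.

    Vehicles mark their cell with 1; pedestrians are superimposed, so a cell
    containing two pedestrians stores 2, as in the example of Figure 3.
    """
    n_cells = math.ceil(road_len / cell_len)
    lane_cells = [0] * n_cells   # one vehicle lane
    walk_cells = [0] * n_cells   # the added sidewalk

    for pos in vehicle_pos:
        idx = min(int(pos // cell_len), n_cells - 1)
        lane_cells[idx] = 1
    for pos in pedestrian_pos:
        idx = min(int(pos // cell_len), n_cells - 1)
        walk_cells[idx] += 1

    return lane_cells + walk_cells  # flattened vector for this approach

# Example: a 100 m approach with 5 m cells, two vehicles and two pedestrians.
print(dtse_with_sidewalk(100, 5, vehicle_pos=[3.0, 12.0], pedestrian_pos=[4.0, 4.5]))
```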
To reduce the waiting times of pedestrians crossing the street, we capture the positions of vehicles and pedestrians in each lane of the intersection. Because the current state of the signal lights also affects decisions, we collect it as well and set the signal phase scheme information as the second state (Equation (2)). According to the driving characteristics of typical intersections, we design a basic four-phase scheme, and the signal phase is represented by a four-dimensional one-hot vector.
In this vector, the four elements correspond to the four phases. If the current phase is the first phase, the first element is set to 1 and the remaining elements are 0; the second state is thus the one-hot vector of the current phase.
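A minimal sketch of this one-hot phase encoding follows; the labels EW and EWL appear in Table 1, while the north–south counterparts (here called NS and NSL) are assumed by symmetry.

```python
# Hypothetical phase labels: EW and EWL appear in Table 1; the north-south
# counterparts (NS, NSL) are assumed by symmetry.
PHASES = ["EW", "EWL", "NS", "NSL"]

def phase_one_hot(current_phase):
    """One-hot encoding of the current signal phase (the second state)."""
    return [1 if p == current_phase else 0 for p in PHASES]

print(phase_one_hot("EW"))  # [1, 0, 0, 0]
```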
To increase pedestrians' right of way and reduce their waiting times, we consider pedestrian safety factors. The third state is therefore the duration of the pedestrian red light, which records the accumulated red time for each sidewalk. At the same time, to prevent the vehicle green light from becoming too short because of the high priority given to pedestrians, the fourth state is the duration of the vehicle green light. Both states are vectors: the red-light and green-light durations are calculated for each sidewalk or lane, respectively, and then integrated. This work takes a standard three-lane intersection as an example; the specific formulas are shown in Equations (3) and (4).
3.2.2. Agent Actions
After the agent obtains the current state, it selects an action from the action space and observes the reward for that action. The agent selects the action phase every second, so the execution phase and duration of the signal lights are not fixed. The agent can choose the optimal action based on the actual positions of pedestrians and vehicles, which increases the flexibility of the signal control system.
The action phases in consecutive seconds can be the same or different. If the phase changes, we insert a 3 s yellow light to ensure traffic safety. The resulting action phases are shown in
Table 1.
As shown in
Table 1, the action space contains four actions. The traffic direction corresponding to the first action is EW, meaning that when this action is executed, vehicles in the east–west direction are allowed to go straight and pedestrians in the east–west direction cross the intersection. EWL indicates that left-turn vehicles in the east–west direction are allowed to pass while pedestrians in all directions are held, irrespective of the right-turn signal control. The corresponding signal light states are listed in Table 1 (where ‘G’ denotes a green light and ‘r’ a red light).
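The per-second action rule with the inserted 3 s yellow interval can be sketched as follows; the phase labels and the 'yellow' placeholder are illustrative, not the exact signal-program encoding used in the experiments.

```python
YELLOW_TIME = 3  # seconds of yellow inserted on every phase change, as stated above

def next_signal_steps(current_phase, chosen_phase):
    """Keep the phase if the agent repeats it; otherwise insert a 3 s yellow
    interval before switching to the newly chosen phase."""
    if chosen_phase == current_phase:
        return [(chosen_phase, 1)]                      # hold the phase for another second
    return [("yellow", YELLOW_TIME), (chosen_phase, 1)]  # transition via yellow

print(next_signal_steps("EW", "EWL"))  # [('yellow', 3), ('EWL', 1)]
```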
3.2.3. Reward Function
The reward function in the reinforcement learning model is the feedback on the previous action and should take into account the correlation between the control objectives [27]. In this work, we design a reward function that combines vehicle throughput, pedestrian waiting time, a penalty for exceeding the maximum pedestrian red time, and a penalty for cutting the vehicle green time below its minimum. The aim is to reduce the waiting time of pedestrians crossing the street without degrading vehicle traffic efficiency. The agent observes the environment every second, an interval too short to compute the reward reliably, so we evaluate the reward over a fixed time period. The first part of the reward is the pedestrian waiting time (Equation (5)): this term is the total waiting time of all pedestrians at the intersection during time period i, the period length is a constant that we set to 20 s, and the term is weighted negatively so that reducing pedestrian waiting time increases the reward.
To prevent the model from considering only pedestrians' right of way, we also calculate the throughput over the same period. The second part of the reward is vehicle throughput, as shown in Equation (6): the term is the throughput of the intersection during time period i, counted excluding right-turn vehicles.
To ensure pedestrians' right of way, the pedestrian red-light time should be less than the maximum waiting time pedestrians can tolerate. The third part therefore penalizes any excess over the maximum red-light duration, as shown in Equation (7): for each entrance n at time t, the current pedestrian red-light duration is compared with the maximum red-light duration. According to the relevant literature, the maximum waiting time pedestrians will tolerate when crossing the street is about 70 s, so we set the maximum red-light duration to 70 s [28,29].
For rationality and safety, setting a minimum green-light duration is also necessary: when a green light lasts less than the minimum green duration, the model should incur a penalty. We therefore define a vehicle minimum-green-time penalty, as shown in Equation (8): for each entrance n at time t, the current vehicle green-light duration is compared with the minimum green-light duration. Following the literature [30], the minimum green time for vehicles is 10 s, so we take this value as 10 s.
After integrating the four parts of the reward function, we obtain the final reward function, as shown in Equation (9).
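A compact sketch of how the four parts could be combined is shown below; the weighting coefficients w_* are placeholders for whatever coefficients Equation (9) actually uses, and the per-entrance duration lists are assumed inputs.

```python
MAX_RED = 70    # s, maximum tolerable pedestrian red time [28,29]
MIN_GREEN = 10  # s, minimum vehicle green time [30]

def reward(ped_wait_sum, throughput, red_durations, ended_green_durations,
           w_wait=1.0, w_thru=1.0, w_red=1.0, w_green=1.0):
    """Illustrative composition of the four reward parts (Equations (5)-(9)).

    ped_wait_sum: total pedestrian waiting time in the 20 s period;
    throughput: vehicles passing the intersection, excluding right turns;
    red_durations: per-entrance pedestrian red durations;
    ended_green_durations: durations of vehicle greens terminated in this period.
    """
    r_wait  = -w_wait * ped_wait_sum                                           # Eq. (5)
    r_thru  =  w_thru * throughput                                             # Eq. (6)
    r_red   = -w_red * sum(max(0, d - MAX_RED) for d in red_durations)         # Eq. (7)
    r_green = -w_green * sum(max(0, MIN_GREEN - g) for g in ended_green_durations)  # Eq. (8)
    return r_wait + r_thru + r_red + r_green                                   # Eq. (9)
```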
3.3. Algorithm Design and Model Training
Our network adopts the 3DQN (Dueling Double Deep Q Network) algorithm, which combines Double DQN and Dueling DQN. Double DQN uses the target network to evaluate the selected optimal action when computing the target Q value, which reduces the bias of the final value, while Dueling DQN uses two fully connected streams to estimate the Q value.
The dueling network feeds the state, after the convolutional layers, into two fully connected streams: one generates a value function V(s; θ, β) and the other an advantage function A(s, a; θ, α). The value function V and the advantage function A are then added to obtain the final Q value, where θ denotes the parameters of the convolutional layers and α and β the parameters of the two fully connected streams. The Q value is calculated as shown in Equation (10):
Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α) − mean_{a′} A(s, a′; θ, α),
where V(s; θ, β) is the estimate of the state value and the second part of the formula estimates the advantage function relative to its mean over all actions. The Q value thus reflects the relative advantages of the different actions, so the agent can identify more valuable states without having to learn the effect of every action in every state.
Double DQN uses two action-value functions to reduce the maximization bias: one selects the action and the other evaluates that action's value. The main network is used to select the action and the target network to estimate its value. The target value and loss function are shown in Equations (11) and (12):
y_t = r_t + γ Q(s_{t+1}, argmax_{a′} Q(s_{t+1}, a′; θ); θ⁻),
L(θ) = E[(y_t − Q(s_t, a_t; θ))²],
where θ denotes the main-network parameters, θ⁻ the target-network parameters, and the loss function is the mean squared error.
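The Double DQN target and mean-squared-error loss described above can be sketched in PyTorch as follows; main_net and target_net are assumed to map a batched state tensor to per-action Q values (a simplification of the two-input network sketched earlier), and batch is an assumed tuple of tensors.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(main_net, target_net, batch, gamma=0.99):
    """Sketch of Equations (11)-(12): the main network selects the next action,
    the target network evaluates it, and the loss is the mean squared error."""
    state, action, reward_, next_state = batch
    with torch.no_grad():
        next_action = main_net(next_state).argmax(dim=1, keepdim=True)          # action selection
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)       # action evaluation
        target = reward_ + gamma * next_q                                       # Eq. (11)
    q = main_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q, target)                                                # Eq. (12)
```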
The design of the neural network is shown in
Figure 1. Because the position information of vehicles and pedestrians in the first state is highly correlated, we first pass the position information for the four directions through a separate convolutional neural network. Its output is then fed, together with the information of the other three states, into the next convolutional neural network.
In addition, to improve the efficiency of neural network training [31], we collect experience using multiple processes. In standard DQN training, each step adds only one sample to the experience pool, so the number of samples grows slowly and the training efficiency of the model is reduced. We therefore adopt a multi-process approach, running multiple environments simultaneously for training. This generates multiple samples at a time, helps the neural network quickly draw empirical data for training, and improves the training efficiency of the model. The experience generation and neural network training processes in this work are as follows:
In Algorithm 1, the model trains with multiple traffic environments simultaneously and, according to the random probability factor ε, either selects the action with the maximum Q value or selects an action at random. We execute the chosen action in the current environment and then obtain the reward and the next state. Finally, the model stores the gained experience tuple (state, action, reward, next state) in the experience pool.
Algorithm 1 Experience Pool Experience Generation
Input: the maximum capacity M of the experience pool D, the total simulation step S, the total simulation time T, the random probability factor ε
Output: experience matrix (s, a, r, s′)
1: Initialize simulation step i = 0
2: while i < S do
3:  Launch multiple simulation environments
4:  if simulation time < T then
5:   if random number in (0, 1) < ε then
6:    Randomly choose an action a
7:   else
8:    Choose the action a with the maximum Q value
9:   end if
10:   Action a acts on the road network to obtain a new traffic state s′ and reward r
11:   Update state s ← s′
12:   Store the experience matrix (s, a, r, s′) in the experience pool D
13:  end if
14:  i ← i + 1
15: end while
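A possible Python realization of the multi-process experience generation in Algorithm 1 is sketched below using multiprocessing.Pool. The environment interface (reset, step, sample_action, n_actions), make_env, and q_values are hypothetical stand-ins for the SUMO-based environment and the main network, and they would need to be picklable top-level callables.

```python
import multiprocessing as mp
import random

def rollout(args):
    """One worker: run an independent environment for a fixed number of steps
    and return its (s, a, r, s') transitions."""
    make_env, q_values, steps, epsilon = args
    env = make_env()
    state = env.reset()
    transitions = []
    for _ in range(steps):
        if random.random() < epsilon:
            action = env.sample_action()                                        # explore
        else:
            action = max(range(env.n_actions), key=lambda a: q_values(state, a))  # exploit
        next_state, reward = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

def collect_parallel(make_env, q_values, n_envs=4, steps=200, epsilon=0.1):
    """Run several environments in parallel (Algorithm 1) and merge their
    experiences into one batch destined for the shared experience pool."""
    with mp.Pool(n_envs) as pool:
        results = pool.map(rollout, [(make_env, q_values, steps, epsilon)] * n_envs)
    return [t for worker_result in results for t in worker_result]
```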
As shown in Algorithm 2, the model randomly selects several samples from the experience pool to form mini-batches and feeds them into the neural network, whose parameters are updated by gradient descent. As training proceeds, the experience generated in the early stages has little reference value for the current action selection, so once the pool reaches a certain size the model deletes the earliest experience data to reduce the operating burden of the system. The agent decides the optimal action at the current stage according to the action with the optimal Q value obtained by training; after iteration, the agent chooses the best action based on the maximum Q value to maximize the reward.
Algorithm 2 Neural Network Training
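The training procedure of Algorithm 2 can be sketched as follows (an illustration of the description above, not the authors' code). It reuses the double_dqn_loss sketch given after Equations (11) and (12); the states are assumed to be flat feature vectors, and the batch size, discount factor, and synchronization interval are assumed values.

```python
import random
from collections import deque

import torch

# Experience pool: a deque with maxlen M discards the oldest samples
# automatically, mirroring the deletion of early experience described above.
pool = deque(maxlen=100_000)

def train_step(main_net, target_net, optimizer, step,
               batch_size=64, gamma=0.99, sync_every=500):
    """One illustrative training step: sample a mini-batch, update the main
    network by gradient descent on the Double DQN loss, and periodically copy
    the main network into the target network."""
    if len(pool) < batch_size:
        return
    states, actions, rewards, next_states = zip(*random.sample(list(pool), batch_size))
    batch = (torch.as_tensor(states, dtype=torch.float32),
             torch.as_tensor(actions, dtype=torch.int64),
             torch.as_tensor(rewards, dtype=torch.float32),
             torch.as_tensor(next_states, dtype=torch.float32))

    loss = double_dqn_loss(main_net, target_net, batch, gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:  # synchronize the target network with the main network
        target_net.load_state_dict(main_net.state_dict())
```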
5. Conclusions
This work proposed a deep reinforcement learning signal control method for single intersections that takes pedestrians into account. The method exploited the interactions between pedestrians and intersection signal control optimization, effectively alleviated the drawbacks of signal timing driven only by vehicle status, balanced the demands of pedestrians and vehicles, and was closer to real conditions. At the same time, we ran multiple environments to improve the training efficiency of the system. In addition, we built a standard intersection and a simulation environment based on an actual intersection in Fuzhou to verify the effectiveness of our method. The SUMO simulation results showed that our method effectively reduced the waiting time of pedestrians and also achieved good results in the environment based on the actual intersection.
Nevertheless, our study has some limitations. First, there may be inaccuracies in the vehicle and pedestrian traffic information collected by manual counting. Second, the impact of latency in inter-intersection communications on our model was not taken into account. In future work, it is worth considering the latency in inter-intersection communications and a more accurate statistical scheme. In addition, we should consider more intersection types, such as T-shaped intersections and roundabouts, and apply the single intersection method to multiple intersections.