Deep Reinforcement Learning Algorithm With Experience Replay and Target Network
Abstract—Adaptive traffic signal control, which adjusts traffic signal timing according to real-time traffic, has been shown to be an effective method to reduce traffic congestion. Available works on adaptive traffic signal control make responsive traffic signal control decisions based on human-crafted features (e.g., vehicle queue length). However, human-crafted features are abstractions of raw traffic data (e.g., position and speed of vehicles), which ignore some useful traffic information and lead to suboptimal traffic signal controls. In this paper, we propose a deep reinforcement learning algorithm that automatically extracts all useful features (machine-crafted features) from raw real-time traffic data and learns the optimal policy for adaptive traffic signal control. To improve algorithm stability, we adopt experience replay and target network mechanisms. Simulation results show that our algorithm reduces vehicle delay by up to 47% and 86% when compared to another two popular traffic signal control algorithms, the longest queue first algorithm and the fixed time control algorithm, respectively.

J. Gao and M. Ito are with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan. E-mail: {jtgao,ito}@is.naist.jp. Y. Shen is with the School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, P.R. China. E-mail: ylshen@mail.xidian.edu.cn. J. Liu is with the National Institute of Informatics, Japan. N. Shiratori is with Tohoku University, Sendai, Japan.

I. INTRODUCTION

Traffic congestion has led to serious social problems: long travelling time, fuel consumption, air pollution, etc. [1], [2]. Factors responsible for traffic congestion include the proliferation of vehicles, inadequate traffic infrastructure and inefficient traffic signal control. However, we cannot stop people from buying vehicles, and building new traffic infrastructure is costly. The relatively easy solution is to improve the efficiency of traffic signal control. Fixed-time traffic signal control is commonly used, where traffic signal timing at an intersection is predetermined and optimized offline based on history traffic data (not real-time traffic demands). However, traffic demands may change from time to time, making predetermined settings of traffic signal timing out of date. Therefore, fixed-time traffic signal control cannot adapt to dynamic and bursty traffic demands, resulting in traffic congestion.

In contrast, adaptive traffic signal control, which adjusts traffic signal timing according to real-time traffic demand, has been shown to be an effective method to reduce traffic congestion [3]–[8]. For example, Zaidi et al. [3] and Gregoire et al. [4] proposed adaptive traffic signal control algorithms based on the back-pressure method, which is similar to pushing water (here vehicles) to flow through a network of pipes (roads) by pressure gradients (the number of queued vehicles) [9]. The authors in [5]–[8] proposed to use reinforcement learning methods to adaptively control traffic signals, where they modelled the control problem as a Markov decision process [10]. However, all these works make responsive traffic signal control decisions based on human-crafted features, such as vehicle queue length and average vehicle delay. Human-crafted features are abstractions of raw traffic data (e.g., position and speed of vehicles), which ignore some useful traffic information and lead to suboptimal traffic signal controls. For example, vehicle queue length does not consider vehicles that are not yet in the queue but will arrive soon, which is also useful information for controlling traffic signals; average vehicle delay only reflects history traffic data, not real-time traffic demand.

In this paper, instead of using human-crafted features, we propose a deep reinforcement learning algorithm that automatically extracts all features (machine-crafted features) useful for adaptive traffic signal control from raw real-time traffic data and learns the optimal traffic signal control policy. Specifically, we model the control problem as a reinforcement learning problem [10]. Then, we use a deep convolutional neural network to extract useful features from raw real-time traffic data (i.e., vehicle position, speed and traffic signal state) and output the optimal traffic signal control decision. A well-known problem with deep reinforcement learning is that the algorithm may be unstable or even diverge in decision making [11]. To improve algorithm stability, we adopt two methods proposed in [11]: experience replay and target network (see details in Section III).

The rest of this paper is organized as follows. In Section II, we introduce the intersection model and define the reinforcement learning components: intersection state, agent action, reward and agent goal. In Section III, we present details of our proposed deep reinforcement learning algorithm for traffic signal control. In Section IV, we verify our algorithm by simulations and compare its performance to popular traffic signal control algorithms. In Section V, we review related work on adopting deep reinforcement learning for traffic signal control and its limitations, and we conclude the whole paper in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first introduce the intersection model and then formulate the traffic signal control problem as a reinforcement learning problem.
Fig. 3. (a) Snapshot of traffic at road 0. (b) Matrix of vehicle position. (c) Matrix of normalized vehicle speed.
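For concreteness, the sketch below shows one way such position and speed matrices could be assembled from raw vehicle data. The cell length, lane count, observation range and speed limit used here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative assumptions (not the paper's exact settings):
# each lane is discretized into cells of CELL_LEN meters.
CELL_LEN = 5.0      # meters per cell
NUM_LANES = 4       # lanes observed on one road
NUM_CELLS = 12      # cells per lane covered by the observation
SPEED_LIMIT = 16.7  # m/s, used to normalize speeds into [0, 1]

def build_state_matrices(vehicles):
    """vehicles: list of (lane_index, distance_to_stop_line_m, speed_mps).

    Returns a binary position matrix P and a normalized speed matrix V
    for one road, in the spirit of Fig. 3(b) and Fig. 3(c).
    """
    P = np.zeros((NUM_LANES, NUM_CELLS))
    V = np.zeros((NUM_LANES, NUM_CELLS))
    for lane, dist, speed in vehicles:
        cell = int(dist // CELL_LEN)
        if 0 <= lane < NUM_LANES and 0 <= cell < NUM_CELLS:
            P[lane, cell] = 1.0                             # vehicle present
            V[lane, cell] = min(speed / SPEED_LIMIT, 1.0)   # normalized speed
    return P, V

# Example: a moving vehicle on lane 0 and a stopped vehicle on lane 3.
P, V = build_state_matrices([(0, 14.0, 15.0), (3, 2.0, 0.0)])
```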
Fig. 4. Agent events at each time step: 1. observe intersection state; 2. choose action; 3. execute action.

Fig. 5. Example of traffic signal timing for actions in Fig. 4 (r: red light; G: green light; g: green light for vehicles turning left, letting vehicles going straight pass first; y: yellow light; ℓi: lane i).

In summary, at the beginning of time step t, the agent observes intersection state St = (P, V, L) ∈ S for traffic signal control, where S denotes the whole state space.

Agent Action: As shown in Fig. 4, after observing intersection state St at the beginning of each time step t, the agent chooses one action At ∈ A, A = {0, 1}: turning on green lights for west-east traffic (At = 0) or for north-south traffic (At = 1), and then executes the chosen action. Green lights for each action last for a fixed time interval of length τg. When the green light interval ends, the current time step t ends and a new time step t + 1 begins. The agent then observes the new intersection state St+1 and chooses the next action At+1 (the same action may be chosen consecutively across time steps, e.g., steps t − 1 and t in Fig. 4). If the chosen action At+1 at time step t + 1 is the same as the previous action At, the agent simply keeps the current traffic signal settings unchanged. If the chosen action At+1 is different from the previous action At, then before the selected action At+1 is executed, the following transition traffic signals are actuated to clear vehicles going straight and vehicles at the left-turn waiting area. First, turn on yellow lights for vehicles going straight; all yellow lights last for a fixed time interval of length τy. Then, turn on green lights of duration τg for left-turn vehicles. Finally, turn on yellow lights for left-turn vehicles. An example in Fig. 5 shows the traffic signal timing corresponding to the chosen actions in Fig. 4.
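As a rough illustration of the action execution just described, the following sketch spells out the signal sequence when the chosen action changes. The duration values and the simulator hooks set_lights and wait are placeholders, not part of the paper.

```python
# Illustrative durations; the paper only specifies them symbolically.
TAU_G = 30  # green interval length tau_g in seconds (placeholder value)
TAU_Y = 3   # yellow interval length tau_y in seconds (placeholder value)

def execute_action(prev_action, action, set_lights, wait):
    """Run the signals for one time step of the chosen action.

    set_lights(phase) and wait(seconds) are hypothetical simulator hooks.
    Action 0 = green for west-east traffic, action 1 = green for north-south.
    """
    if prev_action is not None and action != prev_action:
        # Transition signals to clear the previously served direction.
        set_lights(("yellow_straight", prev_action)); wait(TAU_Y)
        set_lights(("green_left_turn", prev_action)); wait(TAU_G)
        set_lights(("yellow_left_turn", prev_action)); wait(TAU_Y)
    # Green light interval for the selected action ends the time step.
    set_lights(("green", action)); wait(TAU_G)
```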
Reward: To reduce traffic congestion, it is reasonable to reward the agent at each time step for choosing some action if the time of vehicles staying at the intersection decreases. Specifically, the agent observes the vehicle staying time twice every time step to determine its change, as shown in Fig. 6. The first observation is at the beginning of the green light interval at each time step and the second observation is at the end of the green light interval at each time step.

Let w_{i,t} be the staying time (in seconds) of vehicle i from the time the vehicle enters one road of the intersection to the beginning of the green light interval at time step t (vehicle i should still be at the intersection, otherwise w_{i,t} = 0), and w′_{i,t} be the staying time of vehicle i from the time the vehicle enters one road of the intersection to the end of the green light interval at time step t. Similarly, let W_t = Σ_i w_{i,t} be the sum of staying time of all vehicles at the beginning of the green light interval at time step t, and W′_t = Σ_i w′_{i,t} be the sum of staying time of all vehicles at the end of the green light interval at time step t. For example, at time step t in Fig. 6, W_t is observed at the beginning of time step t because the green light interval starts at the beginning of time step t, and W′_t is observed at the end of time step t. However, at time step t + 1, W_{t+1} is observed not at the beginning of time step t + 1 but when the transition interval ends and the green light interval begins; W′_{t+1} is observed at the end of time step t + 1, i.e., when the green light interval ends. At time step t, if the staying time decreases, W′_t < W_t, the agent should be rewarded; if the staying time increases, W′_t > W_t, the agent should be penalized. Thus, we define the reward R_t for the agent choosing some action at time step t as follows

R_t = W_t − W′_t.   (3)
Agent Goal: Recall that the goal of the agent is to reduce vehicle staying time at the intersection in the long run. Suppose the agent observes intersection state St at the beginning of time step t, then makes action decisions according to some action policy π thereafter, and receives a sequence of rewards after time step t: Rt, Rt+1, Rt+2, Rt+3, · · ·. If the agent aims to reduce vehicle staying time at the intersection for only one time step t, it is sufficient for the agent to choose one action that maximizes the immediate reward Rt as defined in (3). Since the agent aims to reduce vehicle staying time in the long run, the agent needs to find an action policy π∗ that maximizes the following cumulative future reward, namely the Q-value,

Q_π(s, a) = E[ R_t + γ R_{t+1} + γ² R_{t+2} + · · · | S_t = s, A_t = a, π ] = E[ Σ_{k=0}^{∞} γ^k R_{t+k} | S_t = s, A_t = a, π ]   (4)

where the expectation is with respect to the action policy π, and γ is a discount parameter, 0 ≤ γ ≤ 1, reflecting how much weight the agent puts on future rewards: γ = 0 means the agent is shortsighted, only considering the immediate reward Rt, while γ approaching 1 means the agent is more farsighted, considering future rewards more heavily.

More formally, the agent needs to find an action policy π∗ such that

π∗ = arg max_π Q_π(s, a),  for all s ∈ S, a ∈ A.   (5)

Denote the optimal Q-values under action policy π∗ by Q∗(s, a) = Q_{π∗}(s, a).
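To make the role of γ in (4) concrete, here is a small numerical sketch of the discounted sum of a reward sequence; the reward values are made-up numbers:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * R_{t+k} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [10.0, -5.0, 8.0, 2.0]          # hypothetical R_t, R_{t+1}, ...
print(discounted_return(rewards, 0.0))    # 10.0: only the immediate reward
print(discounted_return(rewards, 0.9))    # 13.438: future rewards also count
```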
III. DEEP REINFORCEMENT LEARNING ALGORITHM FOR TRAFFIC SIGNAL CONTROL

In this section, we introduce a deep reinforcement learning algorithm that extracts useful features from raw traffic data and finds the optimal traffic signal control policy π∗, together with the experience replay and target network mechanisms that improve algorithm stability.

If the agent already knows the optimal Q-values Q∗(s, a) for all state-action pairs s ∈ S, a ∈ A, the optimal action policy π∗ is simply to choose the action a that achieves the optimal value Q∗(s, a) under intersection state s. Therefore, the agent needs to find the optimal Q-values Q∗(s, a) next. For the optimal Q-values Q∗(s, a), we have the following recursive relationship, known as the Bellman optimality equation [10],

Q∗(s, a) = E[ R_t + γ max_{a′} Q∗(S_{t+1}, a′) | S_t = s, A_t = a ],  for all s ∈ S, a ∈ A.   (6)

The intuition is that the optimal cumulative future reward the agent receives equals the immediate reward it receives after choosing action a at intersection state s plus the optimal future reward thereafter. In principle, we can solve (6) to get the optimal Q-values Q∗(s, a) if the number of total states is finite and we know all details of the underlying system model, such as the transition probabilities of intersection states and the corresponding expected rewards. However, it is too difficult, if not impossible, to get this information in reality. Complex traffic situations at the intersection constitute an enormous number of intersection states, making it hard to find transition probabilities for those states. Instead of solving (6) directly, we resort to approximating those optimal Q-values Q∗(s, a) by a parameterized deep neural network (DNN) such that the output of the neural network Q(s, a; θ) ≈ Q∗(s, a), where θ are the features/parameters that will be learned from raw traffic data.

DNN Structure: We construct such a DNN network following the approach in [11] and [12], where the network input is the observed intersection state St = (P, V, L) and the output is a vector of estimated Q-values Q(St, a; θ) for all actions a ∈ A under the observed state St. The detailed architecture of the DNN network is given in Fig. 7: (1) the position matrix P is fed to a stacked sub-network, where the first layer convolves matrix P with 16 filters of 4 × 4 with stride 2 and applies a rectified linear unit (ReLU) activation function, and the second layer convolves the first layer output with 32 filters of 2 × 2 with stride 1 and also applies ReLU; (2) the speed matrix V is fed to another stacked sub-network which has the same structure as the previous sub-network but with different parameters; (3) the traffic signal state vector L is concatenated with the flattened outputs of the two sub-networks, forming the input of the third layer in Fig. 7. The third and fourth layers are fully connected layers of 128 and 64 units, respectively, each followed by a ReLU activation function. The final output layer is a fully connected linear layer outputting a vector of Q-values, where each vector entry corresponds to the estimated Q-value Q(St, a; θ) for an action a ∈ A under state St.
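The following PyTorch sketch mirrors this structure. The 32 × 32 matrix size, the length of the signal vector L, and the class and variable names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the Q-network described around Fig. 7."""

    def __init__(self, matrix_shape=(32, 32), signal_dim=2, num_actions=2):
        super().__init__()
        def branch():
            # 16 filters of 4x4, stride 2, ReLU; then 32 filters of 2x2, stride 1, ReLU.
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=2, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
        self.pos_branch = branch()    # processes position matrix P
        self.speed_branch = branch()  # processes speed matrix V (separate weights)
        with torch.no_grad():         # infer the flattened size of one branch
            n = self.pos_branch(torch.zeros(1, 1, *matrix_shape)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(2 * n + signal_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_actions),  # linear output: one Q-value per action
        )

    def forward(self, P, V, L):
        x = torch.cat([self.pos_branch(P), self.speed_branch(V), L], dim=1)
        return self.head(x)

# Example: a batch of one state with 32x32 matrices and a 2-dim signal vector.
net = QNetwork()
q_values = net(torch.zeros(1, 1, 32, 32), torch.zeros(1, 1, 32, 32), torch.zeros(1, 2))
```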
DNN Training: The whole training algorithm is summarized in Algorithm 1 and illustrated in Fig. 8. Note that time at line 8 simulates the real-world time in seconds, while time step at line 9 is one period during which agent events occur, as shown in Fig. 4. At each time step t, the agent records the observed interaction experience Et = (St, At, Rt, St+1) into a replay memory M = {E1, E2, · · · , Et}. The replay memory is of finite capacity; when it is full, the oldest data is discarded. To learn DNN features/parameters θ such that the outputs Q(s, a; θ) best approximate Q∗(s, a), the agent needs training data: an input data set X = {(St, At) : t ≥ 1} and the corresponding targets y = {Q∗(St, At) : t ≥ 1}. For the input data set, (St, At) can be retrieved from the replay memory M. However, the target Q∗(St, At) is not known. As in [11], we use its estimate Rt + γ max_{a′} Q(St+1, a′; θ′) as the target instead, where Q(St+1, a′; θ′) is the output of a separate target network with parameters θ′, as shown in Fig. 8 (see (8) for how to set θ′), and the input of the target network is the corresponding St+1 from the interaction experience Et = (St, At, Rt, St+1). Define Q(St+1, a′; θ′) = 0 if the episode terminates at time step t + 1. The target network has the same architecture as the DNN network shown in Fig. 7. Thus, the targets are y = {Rt + γ max_{a′} Q(St+1, a′; θ′) : t ≥ 1}.

Fig. 8. DNN training: minibatches of experiences are randomly sampled from the replay memory, the target network (parameters θ′) computes the targets, and the DNN parameters θ are updated by RMSProp.

After collecting training data, the agent learns the features/parameters θ by training the DNN network to minimize the following mean squared error (MSE)

MSE(θ) = (1/m) Σ_{t=1}^{m} [ R_t + γ max_{a′} Q(S_{t+1}, a′; θ′) − Q(S_t, A_t; θ) ]²   (7)

where m is the size of the input data set X.
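Continuing the PyTorch sketch, the loss in (7) on a batch of experiences could be computed as follows; q_net and target_net stand for two networks with the architecture sketched above, and the batch layout is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """Mean squared error of (7) on a minibatch of experiences.

    batch: (P, V, L, actions, rewards, P1, V1, L1, done) tensors, where
    (P, V, L) is S_t, (P1, V1, L1) is S_{t+1}, actions is an int64 tensor,
    and done marks terminal time steps (target contribution set to 0).
    """
    P, V, L, actions, rewards, P1, V1, L1, done = batch
    q_values = q_net(P, V, L).gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():                      # targets use the separate θ'
        next_q = target_net(P1, V1, L1).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - done)
    return F.mse_loss(q_values, targets)
```

One optimizer step of torch.optim.RMSprop on this loss then corresponds to one update of θ in the training procedure described next.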
However, if m is large, the computational cost of minimizing MSE(θ) is high. To reduce the computational cost, we adopt the stochastic gradient descent algorithm RMSProp [13] with minibatches of size 32. Following this method, when the agent trains the DNN network, it randomly draws 32 samples from the replay memory M to form 32 input data and target pairs (referred to as experience replay), and then uses these 32 input data and target pairs to update the DNN parameters/features θ by the RMSProp algorithm.

After updating the DNN features/parameters θ, the agent also needs to update the target network parameters θ′ as follows (we call it soft update) [14]

θ′ = βθ + (1 − β)θ′   (8)

where β is the update rate, β ≪ 1.

The explanation of why the experience replay and target network mechanisms can improve algorithm stability has been given in [11].
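A minimal sketch of the soft update in (8) for two PyTorch networks of the same architecture; the value β = 0.001 is a placeholder, the paper only requires β ≪ 1:

```python
import torch

def soft_update(target_net, q_net, beta=0.001):
    """In-place soft update: θ' ← βθ + (1 − β)θ'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(1.0 - beta).add_(beta * p)
```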
Optimal Action Policy: Ideally, after the agent is trained, it will reach a good estimate of the optimal Q-values and learn the optimal action policy accordingly. In reality, however, the agent may not learn a good estimate of those optimal Q-values, because the agent has only experienced limited intersection states so far, not the overall state space; thus Q-values for states not experienced may not be well estimated. Moreover, the state space itself may be changing continuously, making the current estimated Q-values out of date. Therefore, the agent always faces a trade-off problem: whether to exploit already learned Q-values (which may not be accurate or may be out of date) and select the action with the greatest Q-value, or to explore other possible actions to improve the Q-value estimates and finally improve the action policy. We adopt a simple yet effective trade-off method, the ε-greedy method. Following this method, at each time step the agent selects the action with the currently greatest estimated Q-value with probability 1 − ε (exploitation) and randomly selects one action with probability ε (exploration).
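The ε-greedy rule just described can be sketched as follows; the ε value is a placeholder:

```python
import random
import torch

def epsilon_greedy(q_net, state, num_actions=2, epsilon=0.1):
    """Pick a random action with probability epsilon, else the argmax-Q action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                 # exploration
    P, V, L = state
    with torch.no_grad():
        return int(q_net(P, V, L).argmax(dim=1).item())      # exploitation
```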
IV. SIMULATION EVALUATION

In this section, we first verify our deep reinforcement learning algorithm by simulations in terms of vehicle staying time, vehicle delay and algorithm stability; we then compare the vehicle delay of our algorithm to that of another two popular traffic signal control algorithms.
Fig. 9. Average of the sum of vehicle staying time at the intersection.

From Fig. 9, we can see that the average sum of vehicle staying time at the intersection decreases as the agent is trained for more episodes and finally reduces to some small values, indicating that the agent does learn a good action policy from training. We can also see that after 800 episodes, the average vehicle staying time keeps stable at small values, indicating that our algorithm converges to a good action policy and that the algorithm stabilizing mechanisms, experience replay and target network, work effectively.

The average values of vehicle delay at each separate road are presented in Fig. 10. From this figure, we see that the average vehicle delay at each road is reduced greatly as the agent is trained for more episodes, indicating that our algorithm achieves adaptive and efficient traffic signal control. After the agent learns a good action policy, the average vehicle delay reduces to small values (around 90.5 seconds for road 0, 107.2 seconds for road 1, 91.5 seconds for road 2 and 109.4 seconds for road 3) and stays stable thereafter. From these stable values, we also know that our algorithm learns a fair policy: the average vehicle delay for roads with different vehicle arrival rates does not differ too much. This is because a long vehicle staying time, and thus vehicle delay, at any road leads to a penalty for the agent (see (3)), causing the agent to adjust its action policy accordingly.

Fig. 10. Average vehicle delay at each road (seconds) versus training episode.

Next, we compare the vehicle delay performance of our algorithm to that of another two popular traffic signal control algorithms, the longest queue first algorithm (turning on green lights for the eligible traffic with the most queued vehicles) [16] and the fixed time control algorithm (turning on green lights for eligible traffic using a predetermined cycle), under the same simulation settings as in Section IV-A. However, we change the vehicle arrival rates by a parameter ρ as ρPij, 0.1 ≤ ρ ≤ 1, during simulation, where the values of Pij, i ∈ {0, 1, 2, 3}, j ∈ {4, 5, 6, 7}, are given in Section IV-A. Simulation results are summarized in Fig. 11.
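For reference, one plausible reading of the longest queue first baseline (turn on green lights for the eligible traffic with the most queued vehicles) is sketched below; this is our interpretation for illustration, not the exact rule of [16]:

```python
def longest_queue_first(queue_lengths):
    """queue_lengths: {'west_east': queued vehicles, 'north_south': queued vehicles}.

    Returns action 0 (green for west-east) or 1 (green for north-south).
    """
    return 0 if queue_lengths['west_east'] >= queue_lengths['north_south'] else 1

# Example: 7 vehicles queued west-east, 12 queued north-south -> action 1.
print(longest_queue_first({'west_east': 7, 'north_south': 12}))
```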
From Fig. 11, we can see that for the busy roads 0, 2, the average vehicle delay of our deep reinforcement learning algorithm is the lowest all the time: up to 86% reduction when compared to the fixed time control algorithm and up to 47% reduction when compared to the longest queue first algorithm. The longest queue first algorithm can adapt to real-time traffic demand somewhat. However, it only considers halting vehicles in queues; vehicles not in queues but arriving soon are ignored, which is also useful information for traffic signal control. Our algorithm considers real-time traffic information of all relevant vehicles and therefore outperforms the other two algorithms. Another observation from Fig. 11(a) and Fig. 11(c) is that as traffic demand increases, the average vehicle delay of our algorithm increases only slightly, indicating that our algorithm indeed adapts to dynamic traffic demand to reduce traffic congestion. However, this comes at the cost of a slight increase in average vehicle delay at the less busy roads 1, 3, as shown in the zoomed-in portions of Fig. 11(b) and Fig. 11(d).

Fig. 11. Average vehicle delay at each road under varying traffic demand ρ for the deep reinforcement learning, longest queue first and fixed time control algorithms.

V. RELATED WORK

In this section, we review related work on adopting deep reinforcement learning for traffic signal control.

After formulating the traffic signal control problem as a reinforcement learning problem, Li et al. proposed to use a deep stacked autoencoder (SAE) neural network to estimate the optimal Q-values [17], where the algorithm takes the number of queued vehicles as input and the queue difference between west-east traffic and north-south traffic as reward. By simulation, they compared the performance of their deep reinforcement learning algorithm to that of a conventional reinforcement learning algorithm (i.e., without a deep neural network) for traffic signal control, and concluded that the deep reinforcement learning algorithm can reduce average traffic delay by 14%. However, they did not detail how the target network is used for Q-value estimation nor how the target network parameters are updated, which is important for stabilizing the algorithm. Furthermore, they simulated an uncommon intersection scenario, where turning left and turning right are not allowed and there is no yellow clearance time. Whether their algorithm works for realistic intersections remains unknown. Different from this work, our algorithm does not use the human-crafted feature, vehicle queue length, but automatically extracts all useful features from raw traffic data. Our algorithm works effectively for realistic intersections.

Aiming at realistic intersections, Genders et al. [12] also proposed a deep reinforcement learning algorithm to adaptively control traffic signals, where convolutional neural networks are used to approximate the optimal Q-values. Their algorithm takes the vehicle position matrix, the vehicle speed matrix and the latest traffic signal as input, the change in cumulative vehicle delay as reward, and uses a target network to estimate the target Q-values. Through simulations, they showed that their algorithm could effectively reduce cumulative vehicle delay and vehicle travel time at an intersection. However, a well-known problem with deep reinforcement learning is algorithm instability due to the moving target problem, as explained in [18]. The authors did not mention how to solve this problem, a major
drawback of their work. Furthermore, they did not consider fair traffic signal control issues, as they mentioned, and their intersection model does not have left-turn waiting areas, which are a commonly adopted and efficient mechanism for reducing vehicle delay at an intersection. In comparison, our algorithm not only improves algorithm stability but also finds a fair traffic signal control policy for common intersections with left-turn waiting areas.

Pol addressed the moving target problem of deep reinforcement learning for traffic signal control in [18] and proposed to use a separate target network to approximate the target Q-values. Specifically, they fix the target network parameters θ′ for M time steps during training, but update the DNN network parameters θ every time step and copy the DNN parameters θ into the target network parameters θ′ every M time steps (referred to as hard update). By simulation, they showed that algorithm stability is improved if M is set to a proper value, neither small nor large. However, this proper value of M cannot be easily found in practice. Moreover, they used an inefficient method to represent vehicle position information, which results in great computation cost during training. Specifically, the author used a binary position matrix: one indicating the presence of a vehicle at a position and zero indicating the absence of a vehicle at that position. Instead of covering only the road area relevant to traffic signal control, they set the binary matrix to cover a whole rectangular area around the intersection. Since vehicles cannot run in areas except roads, most entries of the binary matrix are zero and redundant, making the binary matrix inefficient. Differently, our algorithm solves the moving target problem by softly updating the target network parameters θ′, without needing to find a proper value of M. Moreover, our algorithm represents vehicle position information efficiently (the vehicle position matrix only covers the intersection roads), thus reducing training computation cost.
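For contrast with the soft update in (8), the hard update of [18] copies θ into θ′ only every M time steps; a sketch, where M = 500 is a placeholder value for the tuning parameter discussed above:

```python
import torch

def hard_update(target_net, q_net, step, M=500):
    """Copy θ into θ' every M training steps (hard update)."""
    if step % M == 0:
        target_net.load_state_dict(q_net.state_dict())
```

With the soft update of (8), no such M has to be tuned; θ′ tracks θ continuously at rate β.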
VI. CONCLUSION

We proposed a deep reinforcement learning algorithm for adaptive traffic signal control to reduce traffic congestion. Our algorithm, which uses a deep convolutional neural network, can automatically extract useful features from raw real-time traffic data and learn the optimal traffic signal control policy. By adopting experience replay and target network mechanisms, we improved algorithm stability in the sense that our algorithm converges to a good traffic signal control policy. Simulation results showed that our algorithm significantly reduces vehicle delay when compared to another two popular algorithms, the longest queue first algorithm and the fixed time control algorithm, and that our algorithm learns a fair traffic signal control policy such that no vehicles at any road wait too long to pass through the intersection.

REFERENCES

[1] D. Zhao, Y. Dai, and Z. Zhang, “Computational intelligence in urban traffic signal control: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 485–494, July 2012.
[2] M. Alsabaan, W. Alasmary, A. Albasir, and K. Naik, “Vehicular networks for a greener environment: A survey,” IEEE Communications Surveys & Tutorials, vol. 15, no. 3, pp. 1372–1388, Third Quarter 2013.
[3] A. A. Zaidi, B. Kulcsár, and H. Wymeersch, “Back-pressure traffic signal control with fixed and adaptive routing for urban vehicular networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2134–2143, August 2016.
[4] J. Gregoire, X. Qian, E. Frazzoli, A. de La Fortelle, and T. Wongpiromsarn, “Capacity-aware backpressure traffic signal control,” IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, June 2015.
[5] P. LA and S. Bhatnagar, “Reinforcement learning with function approximation for traffic signal control,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, June 2011.
[6] B. Yin, M. Dridi, and A. E. Moudni, “Approximate dynamic programming with recursive least-squares temporal difference learning for adaptive traffic signal control,” in IEEE 54th Annual Conference on Decision and Control (CDC), 2015.
[7] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, “Reinforcement learning-based multi-agent system for network traffic signal control,” IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128–135, June 2010.
[8] P. Mannion, J. Duggan, and E. Howley, An Experimental Review of Reinforcement Learning Algorithms for Adaptive Traffic Signal Control.