Deep Reinforcement Learning Algorithm With Experience Replay and Target Network
Abstract—Adaptive traffic signal control, which adjusts traffic signal timing according to real-time traffic, has been shown to be an effective method to reduce traffic congestion. Available works on adaptive traffic signal control make responsive traffic signal control decisions based on human-crafted features (e.g., vehicle queue length). However, human-crafted features are abstractions of raw traffic data (e.g., position and speed of vehicles), which ignore some useful traffic information and lead to suboptimal traffic signal controls. In this paper, we propose a deep reinforcement learning algorithm that automatically extracts all useful features (machine-crafted features) from raw real-time traffic data and learns the optimal policy for adaptive traffic signal control. To improve algorithm stability, we adopt experience replay and target network mechanisms. Simulation results show that our algorithm reduces vehicle delay by up to 47% and 86% when compared to another two popular traffic signal control algorithms, the longest queue first algorithm and the fixed time control algorithm, respectively.

J. Gao and M. Ito are with the Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan. E-mail: {jtgao,ito}@is.naist.jp. Y. Shen is with the School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, P.R. China. E-mail: ylshen@mail.xidian.edu.cn. J. Liu is with the National Institute of Informatics, Japan. N. Shiratori is with Tohoku University, Sendai, Japan.

I. INTRODUCTION

Traffic congestion has led to serious social problems: long travelling time, fuel consumption, air pollution, etc. [1], [2]. Factors responsible for traffic congestion include the proliferation of vehicles, inadequate traffic infrastructure and inefficient traffic signal control. However, we cannot stop people from buying vehicles, and building new traffic infrastructure is costly. The relatively easy solution is to improve the efficiency of traffic signal control. Fixed-time traffic signal control is commonly used, where traffic signal timing at an intersection is predetermined and optimized offline based on history traffic data (not real-time traffic demands). However, traffic demands may change from time to time, making predetermined settings of traffic signal timing out of date. Therefore, fixed-time traffic signal control cannot adapt to dynamic and bursty traffic demands, resulting in traffic congestion.

In contrast, adaptive traffic signal control, which adjusts traffic signal timing according to real-time traffic demand, has been shown to be an effective method to reduce traffic congestion [3]–[8]. For example, Zaidi et al. [3] and Gregoire et al. [4] proposed adaptive traffic signal control algorithms based on the back-pressure method, which is similar to pushing water (here vehicles) to flow through a network of pipes (roads) by pressure gradients (the number of queued vehicles) [9]. The authors in [5]–[8] proposed to use reinforcement learning methods to adaptively control traffic signals, where they modelled the control problem as a Markov decision process [10]. However, all these works make responsive traffic signal control decisions based on human-crafted features, such as vehicle queue length and average vehicle delay. Human-crafted features are abstractions of raw traffic data (e.g., position and speed of vehicles), which ignore some useful traffic information and lead to suboptimal traffic signal controls. For example, vehicle queue length does not consider vehicles that are not yet in the queue but will arrive soon, which is also useful information for controlling traffic signals; average vehicle delay only reflects history traffic data, not real-time traffic demand.

In this paper, instead of using human-crafted features, we propose a deep reinforcement learning algorithm that automatically extracts all features (machine-crafted features) useful for adaptive traffic signal control from raw real-time traffic data and learns the optimal traffic signal control policy. Specifically, we model the control problem as a reinforcement learning problem [10]. Then, we use a deep convolutional neural network to extract useful features from raw real-time traffic data (i.e., vehicle position, speed and traffic signal state) and output the optimal traffic signal control decision. A well-known problem with deep reinforcement learning is that the algorithm may be unstable or even diverge in decision making [11]. To improve algorithm stability, we adopt two methods proposed in [11]: experience replay and target network (see details in Section III).

The rest of this paper is organized as follows. In Section II, we introduce the intersection model and define the reinforcement learning components: intersection state, agent action, reward and agent goal. In Section III, we present details of our proposed deep reinforcement learning algorithm for traffic signal control. In Section IV, we verify our algorithm by simulations and compare its performance to popular traffic signal control algorithms. In Section V, we review related work on adopting deep reinforcement learning for traffic signal control and its limitations, and we conclude the whole paper in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first introduce the intersection model and then formulate the traffic signal control problem as a reinforcement learning problem.
Fig. 3. (a) Snapshot of traffic at road 0. (b) Matrix of vehicle position. (c) Matrix of normalized vehicle speed.
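For concreteness, the sketch below shows one way such position and speed matrices could be assembled from raw vehicle data. The cell length, lane count, observation range and speed limit used here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative assumptions (not the paper's exact settings):
# each lane is discretized into cells of CELL_LEN meters.
CELL_LEN = 5.0      # meters per cell
NUM_LANES = 4       # lanes observed on one road
NUM_CELLS = 12      # cells per lane covered by the observation
SPEED_LIMIT = 16.7  # m/s, used to normalize speeds into [0, 1]

def build_state_matrices(vehicles):
    """vehicles: list of (lane_index, distance_to_stop_line_m, speed_mps).

    Returns a binary position matrix P and a normalized speed matrix V
    for one road, in the spirit of Fig. 3(b) and Fig. 3(c).
    """
    P = np.zeros((NUM_LANES, NUM_CELLS))
    V = np.zeros((NUM_LANES, NUM_CELLS))
    for lane, dist, speed in vehicles:
        cell = int(dist // CELL_LEN)
        if 0 <= lane < NUM_LANES and 0 <= cell < NUM_CELLS:
            P[lane, cell] = 1.0                             # vehicle present
            V[lane, cell] = min(speed / SPEED_LIMIT, 1.0)   # normalized speed
    return P, V

# Example: a moving vehicle on lane 0 and a stopped vehicle on lane 3.
P, V = build_state_matrices([(0, 14.0, 15.0), (3, 2.0, 0.0)])
```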
Fig. 4. Agent events at each time step: 1. observe intersection state; 2. choose action; 3. execute action.

Fig. 5. Example of traffic signal timing for actions in Fig. 4 (r: red light; G: green light; g: green light for vehicles turning left, letting vehicles going straight pass first; y: yellow light; ℓi: lane i).

In summary, at the beginning of time step t, the agent observes intersection state St = (P, V, L) ∈ S for traffic signal control, where S denotes the whole state space.

Agent Action: As shown in Fig. 4, after observing intersection state St at the beginning of each time step t, the agent chooses one action At ∈ A, A = {0, 1}: turning on green lights for west-east traffic (At = 0) or for north-south traffic (At = 1), and then executes the chosen action. Green lights for each action last for a fixed time interval of length τg. When the green light interval ends, the current time step t ends and a new time step t + 1 begins. The agent then observes the new intersection state St+1 and chooses the next action At+1 (the same action may be chosen consecutively across time steps, e.g., steps t − 1 and t in Fig. 4). If the chosen action At+1 at time step t + 1 is the same as the previous action At, the agent simply keeps the current traffic signal settings unchanged. If the chosen action At+1 is different from the previous action At, then before the selected action At+1 is executed, the following transition traffic signals are actuated to clear vehicles going straight and vehicles at the left-turn waiting area. First, turn on yellow lights for vehicles going straight; all yellow lights last for a fixed time interval of length τy. Then, turn on green lights of duration τg for left-turn vehicles. Finally, turn on yellow lights for left-turn vehicles. An example in Fig. 5 shows the traffic signal timing corresponding to the chosen actions in Fig. 4.
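As a rough illustration of the action execution just described, the following sketch spells out the signal sequence when the chosen action changes. The duration values and the simulator hooks set_lights and wait are placeholders, not part of the paper.

```python
# Illustrative durations; the paper only specifies them symbolically.
TAU_G = 30  # green interval length tau_g in seconds (placeholder value)
TAU_Y = 3   # yellow interval length tau_y in seconds (placeholder value)

def execute_action(prev_action, action, set_lights, wait):
    """Run the signals for one time step of the chosen action.

    set_lights(phase) and wait(seconds) are hypothetical simulator hooks.
    Action 0 = green for west-east traffic, action 1 = green for north-south.
    """
    if prev_action is not None and action != prev_action:
        # Transition signals to clear the previously served direction.
        set_lights(("yellow_straight", prev_action)); wait(TAU_Y)
        set_lights(("green_left_turn", prev_action)); wait(TAU_G)
        set_lights(("yellow_left_turn", prev_action)); wait(TAU_Y)
    # Green light interval for the selected action ends the time step.
    set_lights(("green", action)); wait(TAU_G)
```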
Reward: To reduce traffic congestion, it is reasonable to reward the agent at each time step for choosing some action if the time of vehicles staying at the intersection decreases. Specifically, the agent observes the vehicle staying time twice every time step to determine its change, as shown in Fig. 6. The first observation is at the beginning of the green light interval at each time step and the second observation is at the end of the green light interval at each time step.

Let w_{i,t} be the staying time (in seconds) of vehicle i from the time the vehicle enters one road of the intersection to the beginning of the green light interval at time step t (vehicle i should still be at the intersection, otherwise w_{i,t} = 0), and w′_{i,t} be the staying time of vehicle i from the time the vehicle enters one road of the intersection to the end of the green light interval at time step t. Similarly, let W_t = Σ_i w_{i,t} be the sum of staying time of all vehicles at the beginning of the green light interval at time step t, and W′_t = Σ_i w′_{i,t} be the sum of staying time of all vehicles at the end of the green light interval at time step t. For example, at time step t in Fig. 6, W_t is observed at the beginning of time step t because the green light interval starts at the beginning of time step t, and W′_t is observed at the end of time step t. However, at time step t + 1, W_{t+1} is observed not at the beginning of time step t + 1 but when the transition interval ends and the green light interval begins; W′_{t+1} is observed at the end of time step t + 1, i.e., when the green light interval ends. At time step t, if the staying time decreases, W′_t < W_t, the agent should be rewarded; if the staying time increases, W′_t > W_t, the agent should be penalized. Thus, we define the reward R_t for the agent choosing some action at time step t as follows

R_t = W_t − W′_t.   (3)
Agent Goal: Recall that the goal of the agent is to reduce vehicle staying time at the intersection in the long run. Suppose the agent observes intersection state St at the beginning of time step t, then makes action decisions according to some action policy π thereafter, and receives a sequence of rewards after time step t: Rt, Rt+1, Rt+2, Rt+3, · · ·. If the agent aims to reduce vehicle staying time at the intersection for only one time step t, it is sufficient for the agent to choose one action that maximizes the immediate reward Rt as defined in (3). Since the agent aims to reduce vehicle staying time in the long run, the agent needs to find an action policy π∗ that maximizes the following cumulative future reward, namely the Q-value,

Q_π(s, a) = E[ R_t + γ R_{t+1} + γ² R_{t+2} + · · · | S_t = s, A_t = a, π ] = E[ Σ_{k=0}^{∞} γ^k R_{t+k} | S_t = s, A_t = a, π ]   (4)

where the expectation is with respect to the action policy π, and γ is a discount parameter, 0 ≤ γ ≤ 1, reflecting how much weight the agent puts on future rewards: γ = 0 means the agent is shortsighted, only considering the immediate reward Rt, while γ approaching 1 means the agent is more farsighted, considering future rewards more heavily.

More formally, the agent needs to find an action policy π∗ such that

π∗ = arg max_π Q_π(s, a),  for all s ∈ S, a ∈ A.   (5)

Denote the optimal Q-values under action policy π∗ by Q∗(s, a) = Q_{π∗}(s, a).
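To make the role of γ in (4) concrete, here is a small numerical sketch of the discounted sum of a reward sequence; the reward values are made-up numbers:

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * R_{t+k} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [10.0, -5.0, 8.0, 2.0]          # hypothetical R_t, R_{t+1}, ...
print(discounted_return(rewards, 0.0))    # 10.0: only the immediate reward
print(discounted_return(rewards, 0.9))    # 13.438: future rewards also count
```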
III. DEEP REINFORCEMENT LEARNING ALGORITHM FOR TRAFFIC SIGNAL CONTROL

In this section, we introduce a deep reinforcement learning algorithm that extracts useful features from raw traffic data and finds the optimal traffic signal control policy π∗, together with the experience replay and target network mechanisms that improve algorithm stability.

If the agent already knows the optimal Q-values Q∗(s, a) for all state-action pairs s ∈ S, a ∈ A, the optimal action policy π∗ is simply to choose the action a that achieves the optimal value Q∗(s, a) under intersection state s. Therefore, the agent needs to find the optimal Q-values Q∗(s, a) next. For the optimal Q-values Q∗(s, a), we have the following recursive relationship, known as the Bellman optimality equation [10],

Q∗(s, a) = E[ R_t + γ max_{a′} Q∗(S_{t+1}, a′) | S_t = s, A_t = a ],  for all s ∈ S, a ∈ A.   (6)

The intuition is that the optimal cumulative future reward the agent receives equals the immediate reward it receives after choosing action a at intersection state s plus the optimal future reward thereafter. In principle, we can solve (6) to get the optimal Q-values Q∗(s, a) if the number of total states is finite and we know all details of the underlying system model, such as the transition probabilities of intersection states and the corresponding expected rewards. However, it is too difficult, if not impossible, to get this information in reality. Complex traffic situations at the intersection constitute an enormous number of intersection states, making it hard to find transition probabilities for those states. Instead of solving (6) directly, we resort to approximating those optimal Q-values Q∗(s, a) by a parameterized deep neural network (DNN) such that the output of the neural network Q(s, a; θ) ≈ Q∗(s, a), where θ are the features/parameters that will be learned from raw traffic data.

DNN Structure: We construct such a DNN network following the approach in [11] and [12], where the network input is the observed intersection state St = (P, V, L) and the output is a vector of estimated Q-values Q(St, a; θ) for all actions a ∈ A under the observed state St. The detailed architecture of the DNN network is given in Fig. 7: (1) the position matrix P is fed to a stacked sub-network, where the first layer convolves matrix P with 16 filters of 4 × 4 with stride 2 and applies a rectified linear unit (ReLU) activation function, and the second layer convolves the first layer output with 32 filters of 2 × 2 with stride 1 and also applies ReLU; (2) the speed matrix V is fed to another stacked sub-network which has the same structure as the previous sub-network but with different parameters; (3) the traffic signal state vector L is concatenated with the flattened outputs of the two sub-networks, forming the input of the third layer in Fig. 7. The third and fourth layers are fully connected layers of 128 and 64 units, respectively, each followed by a ReLU activation function. The final output layer is a fully connected linear layer outputting a vector of Q-values, where each vector entry corresponds to the estimated Q-value Q(St, a; θ) for an action a ∈ A under state St.
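The following PyTorch sketch mirrors this structure. The 32 × 32 matrix size, the length of the signal vector L, and the class and variable names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the Q-network described around Fig. 7."""

    def __init__(self, matrix_shape=(32, 32), signal_dim=2, num_actions=2):
        super().__init__()
        def branch():
            # 16 filters of 4x4, stride 2, ReLU; then 32 filters of 2x2, stride 1, ReLU.
            return nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=2, stride=1), nn.ReLU(),
                nn.Flatten(),
            )
        self.pos_branch = branch()    # processes position matrix P
        self.speed_branch = branch()  # processes speed matrix V (separate weights)
        with torch.no_grad():         # infer the flattened size of one branch
            n = self.pos_branch(torch.zeros(1, 1, *matrix_shape)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(2 * n + signal_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_actions),  # linear output: one Q-value per action
        )

    def forward(self, P, V, L):
        x = torch.cat([self.pos_branch(P), self.speed_branch(V), L], dim=1)
        return self.head(x)

# Example: a batch of one state with 32x32 matrices and a 2-dim signal vector.
net = QNetwork()
q_values = net(torch.zeros(1, 1, 32, 32), torch.zeros(1, 1, 32, 32), torch.zeros(1, 2))
```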
DNN Training: The whole training algorithm is summarized in Algorithm 1 and illustrated in Fig. 8. Note that time at line 8 simulates the real-world time in seconds, while time step at line 9 is one period during which agent events occur, as shown in Fig. 4. At each time step t, the agent records the observed interaction experience Et = (St, At, Rt, St+1) into a replay memory M = {E1, E2, · · · , Et}. The replay memory is of finite capacity; when it is full, the oldest data is discarded. To learn DNN features/parameters θ such that the outputs Q(s, a; θ) best approximate Q∗(s, a), the agent needs training data: an input data set X = {(St, At) : t ≥ 1} and the corresponding targets y = {Q∗(St, At) : t ≥ 1}. For the input data set, (St, At) can be retrieved from the replay memory M. However, the target Q∗(St, At) is not known. As in [11], we use its estimate Rt + γ max_{a′} Q(St+1, a′; θ′) as the target instead, where Q(St+1, a′; θ′) is the output of a separate target network with parameters θ′, as shown in Fig. 8 (see (8) for how to set θ′), and the input of the target network is the corresponding St+1 from the interaction experience Et = (St, At, Rt, St+1). Define Q(St+1, a′; θ′) = 0 if the episode terminates at time step t + 1. The target network has the same architecture as the DNN network shown in Fig. 7. Thus, the targets are y = {Rt + γ max_{a′} Q(St+1, a′; θ′) : t ≥ 1}.

Fig. 8. DNN training: minibatches of experiences are randomly sampled from the replay memory, the target network (parameters θ′) computes the targets, and the DNN parameters θ are updated by RMSProp.

After collecting training data, the agent learns the features/parameters θ by training the DNN network to minimize the following mean squared error (MSE)

MSE(θ) = (1/m) Σ_{t=1}^{m} [ R_t + γ max_{a′} Q(S_{t+1}, a′; θ′) − Q(S_t, A_t; θ) ]²   (7)

where m is the size of the input data set X.
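Continuing the PyTorch sketch, the loss in (7) on a batch of experiences could be computed as follows; q_net and target_net stand for two networks with the architecture sketched above, and the batch layout is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """Mean squared error of (7) on a minibatch of experiences.

    batch: (P, V, L, actions, rewards, P1, V1, L1, done) tensors, where
    (P, V, L) is S_t, (P1, V1, L1) is S_{t+1}, actions is an int64 tensor,
    and done marks terminal time steps (target contribution set to 0).
    """
    P, V, L, actions, rewards, P1, V1, L1, done = batch
    q_values = q_net(P, V, L).gather(1, actions.view(-1, 1)).squeeze(1)
    with torch.no_grad():                      # targets use the separate θ'
        next_q = target_net(P1, V1, L1).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - done)
    return F.mse_loss(q_values, targets)
```

One optimizer step of torch.optim.RMSprop on this loss then corresponds to one update of θ in the training procedure described next.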
However, if m is large, the computational cost of minimizing MSE(θ) is high. To reduce the computational cost, we adopt the stochastic gradient descent algorithm RMSProp [13] with minibatches of size 32. Following this method, when the agent trains the DNN network, it randomly draws 32 samples from the replay memory M to form 32 input data and target pairs (referred to as experience replay), and then uses these 32 input data and target pairs to update the DNN parameters/features θ by the RMSProp algorithm.

After updating the DNN features/parameters θ, the agent also needs to update the target network parameters θ′ as follows (we call it soft update) [14]

θ′ = βθ + (1 − β)θ′   (8)

where β is the update rate, β ≪ 1.

The explanation of why the experience replay and target network mechanisms can improve algorithm stability has been given in [11].
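A minimal sketch of the soft update in (8) for two PyTorch networks of the same architecture; the value β = 0.001 is a placeholder, the paper only requires β ≪ 1:

```python
import torch

def soft_update(target_net, q_net, beta=0.001):
    """In-place soft update: θ' ← βθ + (1 − β)θ'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), q_net.parameters()):
            p_target.mul_(1.0 - beta).add_(beta * p)
```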
Optimal Action Policy: Ideally, after the agent is trained, it will reach a good estimate of the optimal Q-values and learn the optimal action policy accordingly. In reality, however, the agent may not learn a good estimate of those optimal Q-values, because the agent has only experienced limited intersection states so far, not the overall state space; thus Q-values for states not experienced may not be well estimated. Moreover, the state space itself may be changing continuously, making the current estimated Q-values out of date. Therefore, the agent always faces a trade-off problem: whether to exploit already learned Q-values (which may not be accurate or may be out of date) and select the action with the greatest Q-value, or to explore other possible actions to improve the Q-value estimates and finally improve the action policy. We adopt a simple yet effective trade-off method, the ε-greedy method. Following this method, at each time step the agent selects the action with the currently greatest estimated Q-value with probability 1 − ε (exploitation) and randomly selects one action with probability ε (exploration).
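The ε-greedy rule just described can be sketched as follows; the ε value is a placeholder:

```python
import random
import torch

def epsilon_greedy(q_net, state, num_actions=2, epsilon=0.1):
    """Pick a random action with probability epsilon, else the argmax-Q action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)                 # exploration
    P, V, L = state
    with torch.no_grad():
        return int(q_net(P, V, L).argmax(dim=1).item())      # exploitation
```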
IV. SIMULATION EVALUATION

In this section, we first verify our deep reinforcement learning algorithm by simulations in terms of vehicle staying time, vehicle delay and algorithm stability; we then compare the vehicle delay of our algorithm to that of another two popular traffic signal control algorithms.
Fig. 9. Average of the sum of vehicle staying time at the intersection.

From Fig. 9, we can see that the average sum of vehicle staying time at the intersection decreases as the agent is trained for more episodes and finally reduces to some small values, indicating that the agent does learn a good action policy from training. We can also see that after 800 episodes, the average vehicle staying time keeps stable at small values, indicating that our algorithm converges to a good action policy and that the algorithm stabilizing mechanisms, experience replay and target network, work effectively.

The average values of vehicle delay at each separate road are presented in Fig. 10. From this figure, we see that the average vehicle delay at each road is reduced greatly as the agent is trained for more episodes, indicating that our algorithm achieves adaptive and efficient traffic signal control. After the agent learns a good action policy, the average vehicle delay reduces to small values (around 90.5 seconds for road 0, 107.2 seconds for road 1, 91.5 seconds for road 2 and 109.4 seconds for road 3) and stays stable thereafter. From these stable values, we also know that our algorithm learns a fair policy: the average vehicle delay for roads with different vehicle arrival rates does not differ too much. This is because a long vehicle staying time, and thus vehicle delay, at any road leads to a penalty for the agent (see (3)), causing the agent to adjust its action policy accordingly.

Fig. 10. Average vehicle delay at each road (seconds) versus training episode.

Next, we compare the vehicle delay performance of our algorithm to that of another two popular traffic signal control algorithms, the longest queue first algorithm (turning on green lights for the eligible traffic with the most queued vehicles) [16] and the fixed time control algorithm (turning on green lights for eligible traffic using a predetermined cycle), under the same simulation settings as in Section IV-A. However, we change the vehicle arrival rates by a parameter ρ as ρPij, 0.1 ≤ ρ ≤ 1, during simulation, where the values of Pij, i ∈ {0, 1, 2, 3}, j ∈ {4, 5, 6, 7}, are given in Section IV-A. Simulation results are summarized in Fig. 11.
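For reference, one plausible reading of the longest queue first baseline (turn on green lights for the eligible traffic with the most queued vehicles) is sketched below; this is our interpretation for illustration, not the exact rule of [16]:

```python
def longest_queue_first(queue_lengths):
    """queue_lengths: {'west_east': queued vehicles, 'north_south': queued vehicles}.

    Returns action 0 (green for west-east) or 1 (green for north-south).
    """
    return 0 if queue_lengths['west_east'] >= queue_lengths['north_south'] else 1

# Example: 7 vehicles queued west-east, 12 queued north-south -> action 1.
print(longest_queue_first({'west_east': 7, 'north_south': 12}))
```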
From Fig. 11, we can see that for the busy roads 0, 2, the average vehicle delay of our deep reinforcement learning algorithm is the lowest all the time: up to 86% reduction when compared to the fixed time control algorithm and up to 47% reduction when compared to the longest queue first algorithm. The longest queue first algorithm can adapt to real-time traffic demand somewhat. However, it only considers halting vehicles in queues; vehicles not in queues but arriving soon are ignored, which is also useful information for traffic signal control. Our algorithm considers real-time traffic information of all relevant vehicles and therefore outperforms the other two algorithms. Another observation from Fig. 11(a) and Fig. 11(c) is that as traffic demand increases, the average vehicle delay of our algorithm increases only slightly, indicating that our algorithm indeed adapts to dynamic traffic demand to reduce traffic congestion. However, this comes at the cost of a slight increase in average vehicle delay at the less busy roads 1, 3, as shown in the zoomed-in portions of Fig. 11(b) and Fig. 11(d).

Fig. 11. Average vehicle delay at each road under varying traffic demand ρ for the deep reinforcement learning, longest queue first and fixed time control algorithms.

V. RELATED WORK

In this section, we review related work on adopting deep reinforcement learning for traffic signal control.

After formulating the traffic signal control problem as a reinforcement learning problem, Li et al. proposed to use a deep stacked autoencoder (SAE) neural network to estimate the optimal Q-values [17], where the algorithm takes the number of queued vehicles as input and the queue difference between west-east traffic and north-south traffic as reward. By simulation, they compared the performance of their deep reinforcement learning algorithm to that of a conventional reinforcement learning algorithm (i.e., without a deep neural network) for traffic signal control, and concluded that the deep reinforcement learning algorithm can reduce average traffic delay by 14%. However, they did not detail how the target network is used for Q-value estimation nor how the target network parameters are updated, which is important for stabilizing the algorithm. Furthermore, they simulated an uncommon intersection scenario, where turning left and turning right are not allowed and there is no yellow clearance time. Whether their algorithm works for realistic intersections remains unknown. Different from this work, our algorithm does not use the human-crafted feature, vehicle queue length, but automatically extracts all useful features from raw traffic data. Our algorithm works effectively for realistic intersections.

Aiming at realistic intersections, Genders et al. [12] also proposed a deep reinforcement learning algorithm to adaptively control traffic signals, where convolutional neural networks are used to approximate the optimal Q-values. Their algorithm takes the vehicle position matrix, the vehicle speed matrix and the latest traffic signal as input, the change in cumulative vehicle delay as reward, and uses a target network to estimate the target Q-values. Through simulations, they showed that their algorithm could effectively reduce cumulative vehicle delay and vehicle travel time at an intersection. However, a well-known problem with deep reinforcement learning is algorithm instability due to the moving target problem, as explained in [18]. The authors did not mention how to solve this problem, a major
drawback of their work. Furthermore, they did not consider fair traffic signal control issues, as they mentioned, and their intersection model does not have left-turn waiting areas, which are a commonly adopted and efficient mechanism for reducing vehicle delay at an intersection. In comparison, our algorithm not only improves algorithm stability but also finds a fair traffic signal control policy for common intersections with left-turn waiting areas.

Pol addressed the moving target problem of deep reinforcement learning for traffic signal control in [18] and proposed to use a separate target network to approximate the target Q-values. Specifically, they fix the target network parameters θ′ for M time steps during training, but update the DNN network parameters θ every time step and copy the DNN parameters θ into the target network parameters θ′ every M time steps (referred to as hard update). By simulation, they showed that algorithm stability is improved if M is set to a proper value, neither small nor large. However, this proper value of M cannot be easily found in practice. Moreover, they used an inefficient method to represent vehicle position information, which results in great computation cost during training. Specifically, the author used a binary position matrix: one indicating the presence of a vehicle at a position and zero indicating the absence of a vehicle at that position. Instead of covering only the road area relevant to traffic signal control, they set the binary matrix to cover a whole rectangular area around the intersection. Since vehicles cannot run in areas except roads, most entries of the binary matrix are zero and redundant, making the binary matrix inefficient. Differently, our algorithm solves the moving target problem by softly updating the target network parameters θ′, without needing to find a proper value of M. Moreover, our algorithm represents vehicle position information efficiently (the vehicle position matrix only covers the intersection roads), thus reducing training computation cost.
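For contrast with the soft update in (8), the hard update of [18] copies θ into θ′ only every M time steps; a sketch, where M = 500 is a placeholder value for the tuning parameter discussed above:

```python
import torch

def hard_update(target_net, q_net, step, M=500):
    """Copy θ into θ' every M training steps (hard update)."""
    if step % M == 0:
        target_net.load_state_dict(q_net.state_dict())
```

With the soft update of (8), no such M has to be tuned; θ′ tracks θ continuously at rate β.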
VI. CONCLUSION

We proposed a deep reinforcement learning algorithm for adaptive traffic signal control to reduce traffic congestion. Our algorithm, which uses a deep convolutional neural network, can automatically extract useful features from raw real-time traffic data and learn the optimal traffic signal control policy. By adopting experience replay and target network mechanisms, we improved algorithm stability in the sense that our algorithm converges to a good traffic signal control policy. Simulation results showed that our algorithm significantly reduces vehicle delay when compared to another two popular algorithms, the longest queue first algorithm and the fixed time control algorithm, and that our algorithm learns a fair traffic signal control policy such that no vehicles at any road wait too long to pass through the intersection.

REFERENCES

[1] D. Zhao, Y. Dai, and Z. Zhang, “Computational intelligence in urban traffic signal control: A survey,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 485–494, July 2012.
[2] M. Alsabaan, W. Alasmary, A. Albasir, and K. Naik, “Vehicular networks for a greener environment: A survey,” IEEE Communications Surveys & Tutorials, vol. 15, no. 3, pp. 1372–1388, Third Quarter 2013.
[3] A. A. Zaidi, B. Kulcsár, and H. Wymeersch, “Back-pressure traffic signal control with fixed and adaptive routing for urban vehicular networks,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 8, pp. 2134–2143, August 2016.
[4] J. Gregoire, X. Qian, E. Frazzoli, A. de La Fortelle, and T. Wongpiromsarn, “Capacity-aware backpressure traffic signal control,” IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, June 2015.
[5] P. LA and S. Bhatnagar, “Reinforcement learning with function approximation for traffic signal control,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, June 2011.
[6] B. Yin, M. Dridi, and A. E. Moudni, “Approximate dynamic programming with recursive least-squares temporal difference learning for adaptive traffic signal control,” in IEEE 54th Annual Conference on Decision and Control (CDC), 2015.
[7] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, “Reinforcement learning-based multi-agent system for network traffic signal control,” IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128–135, June 2010.
[8] P. Mannion, J. Duggan, and E. Howley, An Experimental Review of Reinforcement Learning Algorithms for Adaptive Traffic Signal Control.