MTLight: Efficient Multi-Task Reinforcement Learning for Traffic Signal Control

Liwen Zhu
Peking University
liwenzhu@pku.edu.cn
Peixi Peng
Peking University
pxpeng@pku.edu.cn
Zongqing Lu
Peking University
zongqing.lu@pku.edu.cn
Yonghong Tian
Peking University
yhtian@pku.edu.cn
Abstract

Traffic signal control has a great impact on alleviating traffic congestion in modern cities. Deep reinforcement learning (RL) has been widely used for this task in recent years, demonstrating promising performance but also facing challenges such as limited performance and sample inefficiency. To handle these challenges, MTLight is proposed to enhance the agent observation with a latent state, which is learned from numerous traffic indicators. Multiple auxiliary and supervisory tasks are constructed to learn the latent state, and two types of embedded latent features, the task-specific feature and the task-shared feature, are used to make the latent state more abundant. Extensive experiments conducted on CityFlow demonstrate that MTLight has leading convergence speed and asymptotic performance. We further simulate a peak-hour pattern in all scenarios with increasing control difficulty, and the results indicate that MTLight is highly adaptable.

1 Introduction

Traffic signal control aims to coordinate traffic signals across intersections to improve the traffic efficiency of a district or a city, and plays an important role in efficient transportation. Most conventional methods control traffic signals with fixed-time plans Koonce & Rodegerdts (2008) or hand-crafted heuristics Kouvelas et al. (2014), which rely heavily on expert knowledge and in-depth analysis of regional historical traffic, making them difficult to transfer to other regions. Recently, deep reinforcement learning (DRL) based methods Guo et al. (2021); Jintao et al. (2020); Pan et al. (2020); He & Shin (2020); Tong et al. (2021); Wang et al. (2020); Gu et al. (2020); Liu et al. (2021); Xu et al. (2021); Zhang et al. (2021) employ a deep neural network to control an intersection, where the network is learned by directly interacting with the environment. However, because of the large number of traffic indicators (number of vehicles, queue length, waiting time, speed, etc.), the complex observations and the dynamic environment, the problem is challenging and remains unsolved.

Since the observation, reward and dynamics of each traffic signal are closely related to those of the others, optimizing traffic signal control in a large-scale road network is naturally modeled as a multi-agent reinforcement learning (MARL) problem. Most existing works Wei et al. (2019a); Zhang et al. (2020b); Chen et al. (2020); Zheng et al. (2019a) learn the policy of each agent conditioned only on the raw observations of the intersection, ignoring the global state, which is accessible in a smart city. As stated in Zheng et al. (2019b), different metrics have a considerable impact on the traffic signal control task. Hence, the observation design of an agent should involve not only the raw observations of the intersection but also the global state. A good agent observation design makes full use of samples and improves both policy performance and sample efficiency. However, the global state contains a huge number of traffic indicators or metrics, and it is hard to hand-design a suitable and non-redundant agent observation from them. On the one hand, an overly concise observation cannot adequately and comprehensively represent the state characteristics, which degrades the estimation of state transitions and in turn affects action selection. On the other hand, if an overly complex combination of metrics is used as the observation, the weights of the different metrics are difficult to define precisely, and the result may be data redundancy and dimension explosion, which not only increases the computational cost but also makes it hard for the agent to learn.

Figure 1: Multi-Task module forms task-shared and task-specific latent states to enhance the agent observation.

To provide an adequate representation for the traffic signal control task, a latent state is introduced. Specifically, the raw observation is specific to each intersection and consists of several variables with concrete semantic meanings (i.e., the number of vehicles on each incoming lane and the current signal phase). The raw observation is then enhanced by the latent space. To learn the latent space from the global state, multiple auxiliary and supervisory tasks related to traffic signal control are constructed. That is, several statistics of the global state history are taken as inputs, an RNN-based network is applied first, and several branches are then introduced to predict multiple types of statistics of the global state, such as the flow distribution and the travel time distribution. To make the latent space more abundant, two types of embedding features are extracted: the task-specific feature and the task-shared feature. The former is extracted by the task-specific branch and represents task-driven information, while the latter comes from the task-shared layer and can express more general underlying characteristics. Hence, they are complementary to each other, and both are used to enhance the raw observation. Finally, conditioned on the enhanced observation, the policy is learned by DRL Mnih et al. (2015). Note that the multiple tasks are learned simultaneously with the DRL policy, which makes the latent space more adaptive to policy learning.

2 Problem Statement

2.1 Problem Definition

We consider a multi-agent traffic signal control problem. The task is modeled as a Markov Game Littman (1994), which can be denoted by a tuple $\mathcal{G}=\langle\mathcal{N},\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{P},\mathcal{R},\mathcal{H},\gamma\rangle$. $\mathcal{N}\equiv\{1,\ldots,n\}$ is a finite set of agents, and each intersection in the scenario is controlled by an agent. $\mathcal{S}$ is the finite global state space. $\mathcal{A}$ denotes the action space for an individual agent. The joint action $\bm{a}\in\mathbf{A}\equiv\mathcal{A}^{n}$ is a collection of individual actions $[a_{i}]_{i=1}^{n}$. At each timestep, each agent $i$ receives an observation $o_{i}\in\mathcal{O}$ and selects an action $a_{i}$; the joint action results in the next state $s^{\prime}$ according to the transition function $\mathcal{P}(s^{\prime}\mid s,\bm{a})$ and a reward $r=\mathcal{R}(s,\bm{a})$ for each agent. $\mathcal{H}$ is the time horizon and $\gamma\in[0,1)$ is the discount factor.

2.2 Agent Design

Each intersection in the system is controlled by an agent. In the following, we introduce the observation design, action design and reward design of the RL agent.

  • Observation. Our primitive observation consists of two parts: (1) the number of vehicles on each incoming lane $\mathbf{f}_{t}^{v}$; (2) the current signal phase $\mathbf{f}_{t}^{s}$. Both can be obtained directly from the simulator; the concepts are described in detail in Section B.4. The raw observation of agent $i$ is defined by

    $$o_{i}=\{\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s}\}, \tag{1}$$

    where $\mathbf{f}_{t}^{v}=\{V_{l_{1}^{in}},V_{l_{2}^{in}},\ldots,V_{l_{m}^{in}}\}$ and $l^{in}=\{l_{1}^{in},\ldots,l_{m}^{in}\}$ is a finite set of incoming lanes in the intersection. The current signal phase is $\mathbf{f}_{t}^{s}=p_{k}$, $k\in\{1,\ldots,K\}$, where $K$ is the total number of phases. Each phase $p$ is represented as a one-hot vector. Our goal is to learn a latent space that enhances the raw observation so as to make better use of the samples.

  • Action. The action of each agent is to choose the phase for the next time interval. Note that in practice the phases may be organized sequentially, whereas directly selecting a phase makes the traffic control plan more flexible. The action of agent $i$ is defined by

    $$a_{i}=\{\mathbf{f}_{t}^{s}\}, \tag{2}$$

    where $\mathbf{f}_{t}^{s}=p_{k}$, $k\in\{1,\ldots,K\}$.

  • Reward. We define the reward as the negative of the queue length on incoming lanes, which is generally accepted as reasonable in previous work Zheng et al. (2019b); Huang et al. (2021); Zang et al. (2020); Zheng et al. (2019a); Wei et al. (2019b). The reward of agent $i$ is defined by

    $$r_{i}=-\sum^{M}_{m}q_{l^{in}_{m}}, \tag{3}$$

    where $q_{l^{in}_{m}}$ is the queue length on incoming lane $l^{in}_{m}$.
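The agent design above can be illustrated with a minimal sketch, assuming per-lane vehicle counts and queue lengths are already available as dictionaries. The lane names, the sample values and the helper names (`one_hot`, `build_observation`, `compute_reward`) are hypothetical and not taken from the paper; the phase index is zero-based here, whereas the text indexes phases from 1 to K.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[int]:
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("phase index out of range")
    return [1 if k == index else 0 for k in range(size)]


def build_observation(vehicle_counts: Dict[str, int],
                      incoming_lanes: List[str],
                      phase_index: int,
                      num_phases: int) -> List[int]:
    """Eq. (1): vehicle counts on incoming lanes followed by the one-hot phase."""
    f_v = [vehicle_counts[lane] for lane in incoming_lanes]
    f_s = one_hot(phase_index, num_phases)
    return f_v + f_s


def compute_reward(queue_lengths: Dict[str, int],
                   incoming_lanes: List[str]) -> int:
    """Eq. (3): negative sum of the queue lengths over the incoming lanes."""
    return -sum(queue_lengths[lane] for lane in incoming_lanes)


# Hypothetical snapshot of one intersection with four incoming lanes.
lanes = ["north_in", "south_in", "east_in", "west_in"]
counts = {"north_in": 5, "south_in": 2, "east_in": 0, "west_in": 7}
queues = {"north_in": 3, "south_in": 0, "east_in": 0, "west_in": 4}

obs = build_observation(counts, lanes, phase_index=1, num_phases=4)
reward = compute_reward(queues, lanes)
print(obs)     # [5, 2, 0, 7, 0, 1, 0, 0]
print(reward)  # -7
```

Since the action in Eq. (2) is itself a phase, the integer passed as `phase_index` at the next step would simply be the index the agent selected.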

3 Method

In this section, we will introduce the main modules of our proposed method MTLight, which focuses on learning task-related task-shared latent state and task-specific latent state by introducing an auxiliary Multi-Task network to help policy learning. The whole process of MTLight is described in Algorithm 1, and the framework of MTLight is shown in Fig. 2.

MTLight consists of a Multi-Task network and an agent network. For the latter, Deep Q-Network (DQN) Mnih et al. (2015) is employed as the function approximator to estimate the Q-value function, which is consistent with previous methods Chen et al. (2020); Wei et al. (2019b; a); Zheng et al. (2019a); Wei et al. (2018). The Multi-Task module adopts the hard parameter sharing paradigm Caruana (1997), which is generally applied by sharing the hidden layers between all tasks while keeping several task-specific output layers.

3.1 Multi-Task Learning for Latent State

Figure 2: MTLight consists of a multi-task network and a policy network. The RL agent is augmented with a task-shared latent state $\mathbf{o}_{t}^{\mathrm{shr}}$ and a task-specific latent state $\mathbf{o}_{t}^{\mathrm{spe}}$.

For each agent, its raw observation includes the number of vehicles $\mathbf{f}_{t}^{v}$ and the current signal phase $\mathbf{f}_{t}^{s}$. In addition, several pieces of information from the global state are given: the number of incoming cars in the last $\tau$ steps, denoted as $\mathbf{f}_{t-\tau:t}^{c}=[\mathbf{f}_{t-\tau}^{c},\mathbf{f}_{t-\tau+1}^{c},\ldots,\mathbf{f}_{t}^{c}]$; the average travel time during the past $\tau$ steps, denoted as $\mathbf{f}_{t-\tau:t}^{tr}=[\mathbf{f}_{t-\tau}^{tr},\mathbf{f}_{t-\tau+1}^{tr},\ldots,\mathbf{f}_{t}^{tr}]$; the queue length during the past $\tau$ steps, denoted as $\mathbf{f}_{t-\tau:t}^{q}=[\mathbf{f}_{t-\tau}^{q},\mathbf{f}_{t-\tau+1}^{q},\ldots,\mathbf{f}_{t}^{q}]$; and the number of vehicles on the road during the past $\tau$ steps, denoted as $\mathbf{f}_{t-\tau:t}^{vr}=[\mathbf{f}_{t-\tau}^{vr},\mathbf{f}_{t-\tau+1}^{vr},\ldots,\mathbf{f}_{t}^{vr}]$.

The Multi-Task module includes the following four tasks:

  1. Flow distribution approximation. We use $\mathcal{T}_{flow}$ to denote the traffic flow distribution estimation task, i.e., to predict the mean $\mu_{f}$ and variance $\sigma_{f}^{2}$ of the flow arrival rate from the start up to time step $t$. The task can be denoted as:

    $$(\mu_{f},\sigma_{f}^{2})\leftarrow[\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s},\mathbf{f}_{t-\tau:t}^{c},\mathbf{f}_{t-\tau:t}^{tr},\mathbf{f}_{t-\tau:t}^{q},\mathbf{f}_{t-\tau:t}^{vr}]. \tag{4}$$

  2. Travel time distribution approximation. We use $\mathcal{T}_{travel}$ to denote the travel time distribution estimation task, i.e., to predict the mean $\mu_{tr}$ and variance $\sigma_{tr}^{2}$ of the average travel time of vehicles that have completed their trips from the start up to time step $t$:

    $$(\mu_{tr},\sigma_{tr}^{2})\leftarrow[\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s},\mathbf{f}_{t-\tau:t}^{c},\mathbf{f}_{t-\tau:t}^{tr},\mathbf{f}_{t-\tau:t}^{q},\mathbf{f}_{t-\tau:t}^{vr}]. \tag{5}$$

  3. Next queue length approximation. We use $\mathcal{T}_{queue}$ to denote the next queue length estimation task, i.e., to predict the average number $q$ of vehicles in the queue at the next step:

    $$q\leftarrow[\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s},\mathbf{f}_{t-\tau:t}^{c},\mathbf{f}_{t-\tau:t}^{tr},\mathbf{f}_{t-\tau:t}^{q},\mathbf{f}_{t-\tau:t}^{vr}]. \tag{6}$$

  4. Vehicles on the road approximation. We use $\mathcal{T}_{vehicles}$ to denote the vehicles on the road approximation task, i.e., to predict the number of vehicles $V^{r}$ existing in the system:

    $$V^{r}\leftarrow[\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s},\mathbf{f}_{t-\tau:t}^{c},\mathbf{f}_{t-\tau:t}^{tr},\mathbf{f}_{t-\tau:t}^{q},\mathbf{f}_{t-\tau:t}^{vr}]. \tag{7}$$

    Note that vehicles that have completed their trips or have not yet entered the road network are not counted.
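As an illustration of how supervision targets for the distribution tasks might be produced, the sketch below maintains the mean and variance of a stream "from the start up to time step t" with Welford's online algorithm. This is an assumption made for illustration: the text does not state how the targets are computed, and the class name `RunningStats`, the use of the population variance and the sample arrival counts are all hypothetical.

```python
class RunningStats:
    """Running mean and population variance via Welford's online algorithm."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of all samples seen so far (0.0 if none)."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical per-step arrival counts (vehicles entering the network).
arrivals = [4, 6, 5, 9, 6]

stats = RunningStats()
targets = []  # one (mean, variance) pair per time step
for a in arrivals:
    stats.update(a)
    targets.append((stats.mean, stats.variance))

mu_f, var_f = targets[-1]
print(mu_f, var_f)  # approximately 6.0 and 2.8
```

The same helper could serve the travel time task by feeding it the travel times of completed trips instead of arrival counts.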

The above tasks act as auxiliary tasks to learn the latent space. Since $\mathbf{f}_{t-\tau:t}^{c}$, $\mathbf{f}_{t-\tau:t}^{tr}$, $\mathbf{f}_{t-\tau:t}^{q}$ and $\mathbf{f}_{t-\tau:t}^{vr}$ have different scales and their dimensions differ from those of $\mathbf{f}_{t}^{v}$ and $\mathbf{f}_{t}^{s}$, four independent linear layers with ReLU activations are first employed to scale them respectively:

$$\mathbf{h}^{c}=\mathrm{ReLU}(\mathbf{W}_{1}\mathbf{f}_{t-\tau:t}^{c}+\mathbf{b}_{1}),\quad \mathbf{h}^{tr}=\mathrm{ReLU}(\mathbf{W}_{2}\mathbf{f}_{t-\tau:t}^{tr}+\mathbf{b}_{2}), \tag{8}$$
$$\mathbf{h}^{q}=\mathrm{ReLU}(\mathbf{W}_{3}\mathbf{f}_{t-\tau:t}^{q}+\mathbf{b}_{3}),\quad \mathbf{h}^{vr}=\mathrm{ReLU}(\mathbf{W}_{4}\mathbf{f}_{t-\tau:t}^{vr}+\mathbf{b}_{4}). \tag{9}$$

Then a linear layer and a ReLU function are used to compute the hidden state after concatenating all embedded inputs:

$$\mathbf{H}_{t}=\mathrm{ReLU}(\mathbf{W}(\mathbf{f}_{t}^{v},\mathbf{f}_{t}^{s},\mathbf{h}^{c},\mathbf{h}^{tr},\mathbf{h}^{q},\mathbf{h}^{vr})+\mathbf{b}). \tag{10}$$

Based on $\mathbf{H}_{t}$, a task-shared network module is used to generate the task-shared latent feature ($\mathbf{o}_{t}^{\mathrm{shr}}$, also called apparent state). Then, four independent branches, one per task, are introduced to calculate the task-specific latent feature ($\mathbf{o}_{t}^{\mathrm{spe}}$, also called mental state) from $\mathbf{o}_{t}^{\mathrm{shr}}$. The specific implementation of the network architecture is listed in the supplementary.

We use a single latent variable model to extract hierarchical latent features, following insights from Zhao et al. (2017). That is, the mental state is the output of the shared layer after the GRU in the Multi-Task network and can express more general underlying characteristics. In contrast, the apparent state is the concatenation of the outputs of the task-specific layers and represents task-driven information. In other words, the mental state is more coarse-grained, while the apparent state is more fine-grained. Hence, they are complementary to each other and are both used in our method.
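The forward pass of Eqs. (8) to (10) and the shared/specific split can be sketched in plain Python as follows. This is a rough illustration under several assumptions, not the architecture from the supplementary: the recurrent (GRU) layer is omitted, every history input is treated as a vector of τ+1 scalars, all layer sizes are arbitrary, weights are random, and the class name `MultiTaskSketch` and its helpers are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]
Params = Tuple[List[List[float]], List[float]]  # (weight matrix, bias)


def init_linear(rng: random.Random, out_dim: int, in_dim: int) -> Params:
    """Random weights and zero bias; for illustration only."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim


def linear(params: Params, x: Vector) -> Vector:
    """Compute W x + b for one input vector."""
    w, b = params
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class MultiTaskSketch:
    """Hard parameter sharing: one shared trunk, four task branches."""

    def __init__(self, obs_dim: int, tau: int, embed_dim: int = 4,
                 hidden_dim: int = 8, shared_dim: int = 6,
                 spec_dim: int = 3, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Eqs. (8)-(9): one embedding layer per history input (c, tr, q, vr).
        self.embed = [init_linear(rng, embed_dim, tau + 1) for _ in range(4)]
        # Eq. (10): hidden state from the concatenated inputs.
        self.hidden = init_linear(rng, hidden_dim, obs_dim + 4 * embed_dim)
        # Task-shared layer followed by four task-specific branches.
        self.shared = init_linear(rng, shared_dim, hidden_dim)
        self.branches = [init_linear(rng, spec_dim, shared_dim) for _ in range(4)]
        # Output heads: (mean, var), (mean, var), queue length, vehicle count.
        self.heads = [init_linear(rng, n, spec_dim) for n in (2, 2, 1, 1)]

    def forward(self, raw_obs: Vector, histories: List[Vector]):
        h = [relu(linear(p, f)) for p, f in zip(self.embed, histories)]
        concat = list(raw_obs) + [v for emb in h for v in emb]
        hidden_t = relu(linear(self.hidden, concat))
        o_shr = relu(linear(self.shared, hidden_t))
        specs = [relu(linear(p, o_shr)) for p in self.branches]
        o_spe = [v for s in specs for v in s]  # concatenated branch features
        preds = [linear(p, s) for p, s in zip(self.heads, specs)]
        return o_shr, o_spe, preds


tau = 3
raw_obs = [5.0, 2.0, 0.0, 7.0, 0.0, 1.0, 0.0, 0.0]
histories = [[1.0] * (tau + 1) for _ in range(4)]  # placeholder histories
net = MultiTaskSketch(obs_dim=len(raw_obs), tau=tau)
o_shr, o_spe, preds = net.forward(raw_obs, histories)
enhanced = raw_obs + o_shr + o_spe  # observation passed to the policy
print(len(o_shr), len(o_spe), len(enhanced))  # 6 12 26
```

In training, the four prediction heads would be fitted to their auxiliary targets while `enhanced` feeds the Q-network; the loss terms are left out of this sketch.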

3.2 Policy with Latent State

With the help of the latent state, the agent observation is enhanced from $\mathbf{o}_{t}$ to $(\mathbf{o}_{t},\mathbf{o}_{t}^{\mathrm{shr}},\mathbf{o}_{t}^{\mathrm{spe}})$. For the policy $\pi^{\theta}$, the objective is to maximize the cumulative reward:

$$\max_{\theta}J(\theta)=\mathbb{E}_{a_{t}\sim\pi^{\theta}(a_{t}\mid\mathbf{o}_{t},\mathbf{o}_{t}^{\mathrm{shr}},\mathbf{o}_{t}^{\mathrm{spe}})}\sum_{t=0}^{\mathcal{H}-1}\gamma^{t}r_{t+1}. \tag{11}$$

An agent that maximises Eq. 11 acts optimally under uncertainty and is called Bayes-optimal Ghavamzadeh et al. (2015), assuming we treat the knowledge over related tasks as our epistemic prior about the environment. The Multi-Task module reduces the complexity of the model and gives it informative priors. Besides, it can reduce representation bias by pushing the learning algorithm to find a solution in the smaller region of representations where the related tasks intersect, rather than in the larger region of a single task. This incentivises faster and better convergence.
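As a concrete reading of Eq. 11, the minimal Python sketch below builds the enhanced observation and evaluates the discounted return of one recorded reward sequence. The function names, the use of flat lists, and plain concatenation as the way of combining the three parts are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def enhance_observation(
    o_t: Sequence[float],
    o_shr: Sequence[float],
    o_spe: Sequence[float],
) -> List[float]:
    """Combine (o_t, o_t^shr, o_t^spe) into one policy input.

    Concatenation is an assumption made for this sketch.
    """
    return [*o_t, *o_shr, *o_spe]


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_{t=0}^{H-1} gamma^t * r_{t+1} for one trajectory.

    rewards[t] holds r_{t+1}, so len(rewards) plays the role of the horizon H.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For instance, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5 + 0.25. The expectation in Eq. 11 would then be estimated by averaging this quantity over trajectories sampled from the policy.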

4 experiment

We conduct the experiments on CityFlow Zhang et al. (2019), a city-level open-source simulation platform for traffic signal control. The simulator serves as the environment: it provides the state for traffic signal control, the agents act by changing the phase of the traffic lights, and the simulator returns feedback.

Please refer to Appendix D.1 and Appendix D.2 for the detailed settings of road network and traffic flow configuration. Baselines are described in detail in Appendix F.

4.1 Performance Comparison

Figure 3: Illustration of strategies for all RL methods under Real configuration in Hangzhou.
Table 1: Overall performance comparison on Hangzhou, Jinan, New York and Shenzhen under Real and Synthetic configurations. Average travel time is reported in seconds. "Mean" in the last column is the average over the scenarios in the previous 8 columns.

| Model | Hangzhou real | Hangzhou syn_peak | Jinan real | Jinan syn_peak | New York real | New York syn_peak | Shenzhen real | Shenzhen syn_peak | Mean |
|---|---|---|---|---|---|---|---|---|---|
| MaxPressure | 416.82 | 2320.65 | 355.12 | 1218.13 | 380.42 | 1481.48 | 389.45 | 1387.87 | 993.74 |
| Fixedtime | 718.29 | 1787.58 | 814.09 | 1739.69 | 1849.78 | 2086.59 | 786.54 | 1845.03 | 1453.45 |
| SOTL | 1209.26 | 2062.49 | 1453.97 | 1991.03 | 1890.55 | 2140.15 | 1376.52 | 2098.09 | 1777.76 |
| Individual RL | 743.00 | 1819.57 | 843.63 | 1745.07 | 1867.86 | 2100.68 | 769.47 | 1845.34 | 1466.83 |
| MetaLight | 480.77 | 1576.32 | 784.98 | 1854.38 | 261.34 | 2145.49 | 694.83 | 2083.26 | 1235.17 |
| PressLight | 529.64 | 1754.09 | 809.87 | 1930.98 | 302.87 | 1846.76 | 639.04 | 1832.76 | 1205.75 |
| CoLight | 297.89 | 1077.29 | 511.43 | 1217.17 | 159.81 | 1457.56 | 438.45 | 1367.38 | 815.87 |
| GeneraLight | 335.18 | 1574.93 | 585.89 | 1616.28 | 1208.73 | 1686.49 | 792.22 | 1574.10 | 1171.73 |
| Base | 705.85 | 1718.37 | 808.28 | 1703.21 | 903.82 | 2097.84 | 728.49 | 1937.45 | 1325.41 |
| Base+raw | 684.34 | 1845.92 | 623.94 | 1835.45 | 592.34 | 1934.04 | 703.56 | 1845.32 | 1258.11 |
| Base+shr | 313.28 | 1146.79 | 499.88 | 1325.27 | 463.15 | 1416.65 | 438.69 | 1371.53 | 871.91 |
| Base+spe | 431.55 | 1446.63 | 517.09 | 1430.96 | 431.65 | 1669.61 | 684.83 | 1442.35 | 1006.83 |
| MTLight | 161.24 | 1011.67 | 346.93 | 1176.02 | 209.46 | 1394.15 | 402.57 | 1284.93 | 748.37 |
Figure 4: Performance of RL methods under real configurations.

Tab. 1 lists the comparative results, from which we observe the following. 1) In general, RL methods perform better than conventional methods, which indicates the advantage of RL. Moreover, MTLight outperforms the other methods in almost all cities and flow configurations, which demonstrates the effectiveness of the method. 2) MTLight generalizes well across scenarios and configurations. For example, MaxPressure performs well in $\mathcal{D}_{Hangzhou}$ under the Real configuration, whereas under the Synthetic traffic conditions it performs significantly worse than the other methods. In contrast, MTLight not only achieves good performance under diverse configurations of $\mathcal{D}_{Hangzhou}$ but also shows great stability. 3) MTLight outperforms Individual RL, MetaLight and PressLight by 693.46, 461.80 and 432.38 seconds in mean travel time, respectively. The reason is that these methods learn each traffic light's policy using only its own observation and ignore the influence of the neighbors, while MTLight treats the neighbors as a latent part of the environment to help learning. 4) CoLight models the neighbors' information and GeneraLight can adapt to a variety of flows, so both perform well. Nevertheless, MTLight is superior to them in multiple scenarios, with mean improvements of 42.5 and 398 seconds, respectively. Compared to them, MTLight benefits from prior knowledge learned by the Multi-Task network to make more accurate decisions.
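To make the "Mean" column of Tab. 1 easy to check, the short Python sketch below recomputes it for three rows and reports which of those three models has the lowest travel time in each scenario. The per-scenario numbers are copied from Tab. 1; the helper names and the choice of rows are ours, for illustration only.

```python
# Average travel times in seconds, copied from Tab. 1 in column order:
# Hangzhou real/syn_peak, Jinan real/syn_peak,
# New York real/syn_peak, Shenzhen real/syn_peak.
TABLE_1_ROWS = {
    "Fixedtime": [718.29, 1787.58, 814.09, 1739.69, 1849.78, 2086.59, 786.54, 1845.03],
    "CoLight": [297.89, 1077.29, 511.43, 1217.17, 159.81, 1457.56, 438.45, 1367.38],
    "MTLight": [161.24, 1011.67, 346.93, 1176.02, 209.46, 1394.15, 402.57, 1284.93],
}


def mean_travel_time(row):
    """Average of the eight per-scenario values, rounded as in the table."""
    return round(sum(row) / len(row), 2)


def best_model_per_scenario(rows):
    """For each scenario column, return the model with the lowest travel time."""
    n_scenarios = len(next(iter(rows.values())))
    return [min(rows, key=lambda model: rows[model][i]) for i in range(n_scenarios)]


means = {model: mean_travel_time(row) for model, row in TABLE_1_ROWS.items()}
winners = best_model_per_scenario(TABLE_1_ROWS)
```

Among these three rows, MTLight has the lowest travel time in every scenario except New York real, where CoLight is lower.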

Fig. 4 shows the performance of all RL methods on $\mathcal{D}_{Hangzhou}$ under the Real traffic pattern; MTLight converges faster and reaches better asymptotic performance. Fig. 5 shows the performance of all RL methods on $\mathcal{D}_{Hangzhou}$ under the Synthetic traffic pattern. We can conclude that MTLight converges quickly and learns effectively during the peak hour, while the other methods improve only weakly during training.

Figure 5: Performance of RL methods under synthetic peak configurations.

Fig. 8 and Tab. 5 illustrate the turning statistics of vehicle routes. Taking $\mathcal{D}_{Hangzhou}$ Real as an example, the frequencies of turning left and going straight are 14% and 86%, respectively (right turns are not considered because they are not controlled by the lights). Fig. 3 shows the percentage of each phase for the RL methods, from which we find: 1) The total left-turn phase share of MTLight is 15.3%, which closely matches the left-turn frequency of 14% and indicates that the strategy is interpretable. 2) The GeneraLight left-turn ratio of 10.9% is also close, but because it has an excessive proportion of straight phases, it may leave left-turn vehicles stranded, resulting in increased travel time. 3) Individual RL favors phases 1 and 2, which together account for as much as 65.9%; MetaLight prefers going-straight phases; PressLight is biased toward phase 1; and CoLight assigns a relatively even distribution to each phase rather than aligning with the traffic flow direction. These observations demonstrate the limitations of the other RL methods in multi-agent environments, while MTLight can learn more stable strategies by introducing task-shared and task-specific latent states.

4.2 Ablations

To better validate the contribution of each component, four variants of MTLight are evaluated under a variety of scenarios, as shown in Tab. 1.

  • Base only keeps the policy network and removes the Multi-Task network.

  • Base+raw only keeps the policy network and discards Multi-Task network, but directly uses the original input of Multi-Task module as part of the observation.

  • Base+shr retains the Multi-Task network and the policy, but only has task-shared latent state and removes task-specific latent state.

  • Base+spe retains the Multi-Task network and the policy. In contrast to Base+shr, Base+spe has only the task-specific latent state and removes the task-shared latent state.

Note that MTLight contains all modules: the policy network and the Multi-Task network with both the task-specific and the task-shared latent state.
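The differences between the variants come down to which parts are fed to the policy. The Python sketch below encodes that reading; the dictionary keys, the function name, and concatenation as the combination rule are hypothetical choices for illustration, not the authors' code.

```python
from typing import Dict, List, Sequence

# Which parts each variant passes to the policy (our reading of the list above):
#   "obs" = original observation, "raw" = raw input of the Multi-Task module,
#   "shr" = task-shared latent state, "spe" = task-specific latent state.
VARIANT_PARTS: Dict[str, Sequence[str]] = {
    "Base": ("obs",),
    "Base+raw": ("obs", "raw"),
    "Base+shr": ("obs", "shr"),
    "Base+spe": ("obs", "spe"),
    "MTLight": ("obs", "shr", "spe"),
}


def policy_input(variant: str, parts: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the parts that the given variant exposes to the policy."""
    return [x for key in VARIANT_PARTS[variant] for x in parts[key]]
```

With latent states of dimension 5 each (Table 2) and, say, an 8-dimensional observation (an assumed size), the MTLight input would have 18 entries while Base would keep 8.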

The quantitative evaluation results are presented in Tab. 1. We obtain the following findings: 1) Among the four variants, Base performs worst. The reason is that it is hard to learn an effective policy independently in the multi-agent traffic signal control task, where the surrounding environment changes dynamically but Base has no sense of it. 2) Compared with Base and Base+raw, the improvements of Base+shr and Base+spe demonstrate the effectiveness of the task-shared latent state $\mathbf{o}_{t}^{\mathrm{shr}}$ and the task-specific latent state $\mathbf{o}_{t}^{\mathrm{spe}}$, respectively. $\mathbf{o}_{t}^{\mathrm{shr}}$ reflects prior information that is constant over time across multiple related tasks, while $\mathbf{o}_{t}^{\mathrm{spe}}$ reflects prior information that is aligned with the latest changing trends; both help the policy make Bayes-optimal decisions. 3) $\mathbf{o}_{t}^{\mathrm{shr}}$ and $\mathbf{o}_{t}^{\mathrm{spe}}$ are both effective because each is an efficient representation of environmental features. Compared to them, the superiority of MTLight indicates that $\mathbf{o}_{t}^{\mathrm{shr}}$ and $\mathbf{o}_{t}^{\mathrm{spe}}$ are complementary to each other. Overall, all of the proposed components contribute positively to the final results.

5 conclusion

We introduced MTLight, an efficient Multi-Task reinforcement learning method for traffic signal control that scales to complex multi-agent urban road networks of different sizes. We showed that MTLight's latent structure learns a hierarchical latent representation of related tasks, separating the task-shared and task-specific latent states. On datasets from several cities, we demonstrated that this latent representation, derived from multiple related tasks, together with conditioning the policy on it, allows an agent to adapt to a complex environment. We conclude that maintaining prior approximations over related tasks helps compared to model-free approaches, especially when the environment contains too much information to be fully expressed by a hand-designed state.

In future work, the latent prior could be learned from expert data prepared in advance using imitation learning techniques Song et al. (2018), or by using existing multi-agent algorithms to pre-train the Multi-Task network.

References

  • Abdoos et al. (2011) Monireh Abdoos, Nasser Mozayani, and Ana LC Bazzan. Traffic light control in non-stationary environments based on multi agent q-learning. In ITSC. IEEE, 2011.
  • Abdoos et al. (2013) Monireh Abdoos, Nasser Mozayani, and Ana LC Bazzan. Holonic multi-agent system for traffic signals control. Engineering Applications of Artificial Intelligence, 2013.
  • Arel et al. (2010) Itamar Arel, Cong Liu, Tom Urbanik, and Airton G Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 2010.
  • Bellemare et al. (2019) Marc Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on optimal representations for reinforcement learning. Advances in neural information processing systems, 32, 2019.
  • Caruana (1997) Rich Caruana. Multitask learning. Machine learning, 1997.
  • Chen et al. (2020) Chacha Chen, Hua Wei, Nan Xu, Guanjie Zheng, Ming Yang, Yuanhao Xiong, Kai Xu, and Zhenhui Li. Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. In AAAI, 2020.
  • Chiu (1992) Stephen Chiu. Adaptive traffic signal control using fuzzy logic. In Proceedings of the Intelligent Vehicles '92 Symposium. IEEE, 1992.
  • Chiu & Chand (1993) Stephen Chiu and Sujeet Chand. Self-organizing traffic control via fuzzy logic. In IEEE Conference on Decision and Control. IEEE, 1993.
  • Chu et al. (2019) Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforcement learning for large-scale traffic signal control. ITS, 2019.
  • Cools et al. (2013) Seung-Bae Cools, Carlos Gershenson, and Bart D’Hooghe. Self-organizing traffic lights: A realistic simulation. In Advances in applied self-organizing systems. Springer, 2013.
  • Dusparic & Cahill (2009) Ivana Dusparic and Vinny Cahill. Distributed w-learning: Multi-policy optimization in self-organizing systems. In self-adaptive and self-organizing systems. IEEE, 2009.
  • El-Tantawy et al. (2013) Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto. IEEE TITS, 2013.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML. PMLR, 2017.
  • Ghavamzadeh et al. (2015) Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 2015.
  • Gu et al. (2020) Jingjing Gu, Qiang Zhou, Jingyuan Yang, Yanchi Liu, Fuzhen Zhuang, Yanchao Zhao, and Hui Xiong. Exploiting interpretable patterns for flow prediction in dockless bike sharing systems. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • Guo et al. (2021) Xin Guo, Zhengxu Yu, Pengfei Wang, Zhongming Jin, Jianqiang Huang, Deng Cai, Xiaofei He, and Xiansheng Hua. Urban traffic light control via active multi-agent communication and supply-demand modeling. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • He & Shin (2020) Suining He and Kang G Shin. Spatio-temporal capsule-based reinforcement learning for mobility-on-demand coordination. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • Huang et al. (2021) Xingshuai Huang, Di Wu, Michael Jenkin, and Benoit Boulet. Modellight: Model-based meta-reinforcement learning for traffic signal control. arXiv preprint arXiv:2111.08067, 2021.
  • Hunt et al. (1981) PB Hunt, DI Robertson, RD Bretherton, and RI Winton. Scoot-a traffic responsive method of coordinating signals. Technical report, 1981.
  • Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
  • Jiang et al. (2021) Qize Jiang, Jingze Li, Weiwei Sun, and Baihua Zheng. Dynamic lane traffic signal control with group attention and multi-timescale reinforcement learning. In IJCAI, 2021.
  • Jintao et al. (2020) KE Jintao, Hai Yang, Jieping Ye, et al. Learning to delay in ride-sourcing systems: a multi-agent deep reinforcement learning framework. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • Koonce & Rodegerdts (2008) Peter Koonce and Lee Rodegerdts. Traffic signal timing manual. Technical report, United States. Federal Highway Administration, 2008.
  • Kouvelas et al. (2014) Anastasios Kouvelas, Jennie Lioris, S Alireza Fayazi, and Pravin Varaiya. Maximum pressure controller for stabilizing queues in signalized arterial networks. Transportation Research Record, 2014.
  • Kuyer et al. (2008) Lior Kuyer, Shimon Whiteson, Bram Bakker, and Nikos Vlassis. Multiagent reinforcement learning for urban traffic control using coordination graphs. In ECML-PKDD. Springer, 2008.
  • Lin et al. (2019) Xingyu Lin, Harjatin Baweja, George Kantor, and David Held. Adaptive auxiliary task weighting for reinforcement learning. Advances in neural information processing systems, 2019.
  • Littman (1994) Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings. Elsevier, 1994.
  • Liu et al. (2021) Jia Liu, Tianrui Li, Shenggong Ji, Peng Xie, Shengdong Du, Fei Teng, and Junbo Zhang. Urban flow pattern mining based on multi-source heterogeneous data fusion and knowledge graph embedding. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Lowrie (1990) PR Lowrie. Scats, sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic. 1990.
  • Lyle et al. (2021) Clare Lyle, Mark Rowland, Georg Ostrovski, and Will Dabney. On the effect of auxiliary tasks on representation dynamics. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
  • Mannion et al. (2016) Patrick Mannion, Jim Duggan, and Enda Howley. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic road transport support systems. Springer, 2016.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
  • Ndirango & Lee (2019) Anthony Ndirango and Tyler Lee. Generalization in multitask deep neural classifiers: a statistical physics approach. Advances in Neural Information Processing Systems, 2019.
  • Nishi et al. (2018) Tomoki Nishi, Keisuke Otaki, Keiichiro Hayakawa, and Takayoshi Yoshimura. Traffic signal control based on reinforcement learning with graph convolutional neural nets. In ITSC. IEEE, 2018.
  • Oh et al. (2017) Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In ICML. PMLR, 2017.
  • Oroojlooy et al. (2020) Afshin Oroojlooy, Mohammadreza Nazari, Davood Hajinezhad, and Jorge Silva. Attendlight: Universal attention-based reinforcement learning model for traffic signal control. arXiv preprint arXiv:2010.05772, 2020.
  • Pan et al. (2020) Zheyi Pan, Wentao Zhang, Yuxuan Liang, Weinan Zhang, Yong Yu, Junbo Zhang, and Yu Zheng. Spatio-temporal meta learning for urban traffic prediction. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • Rizzo et al. (2019) Stefano Giovanni Rizzo, Giovanna Vantini, and Sanjay Chawla. Time critic policy gradient methods for traffic signal control in complex and congested scenarios. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
  • Roess et al. (2004) Roger P Roess, Elena S Prassas, and William R McShane. Traffic engineering. Pearson/Prentice Hall, 2004.
  • Ruder (2017) Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
  • Song et al. (2018) Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. Advances in neural information processing systems, 2018.
  • Svanes & Delaney (1981) Torgny Svanes and James R Delaney. Scat: System control analysis and training simulator. In Human Detection and Diagnosis of System Failures. Springer, 1981.
  • Tong et al. (2021) Yongxin Tong, Dingyuan Shi, Yi Xu, Weifeng Lv, Zhiwei Qin, and Xiaocheng Tang. Combinatorial optimization meets reinforcement learning: Effective taxi order dispatching at large-scale. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Tongloy et al. (2017) T Tongloy, S Chuwongin, K Jaksukam, C Chousangsuntorn, and S Boonsang. Asynchronous deep reinforcement learning for the mobile robot navigation with supervised auxiliary tasks. In International Conference on Robotics and Automation Engineering (ICRAE), pp.  68–72. IEEE, 2017.
  • Van der Pol & Oliehoek (2016) Elise Van der Pol and Frans A Oliehoek. Coordinated deep reinforcement learners for traffic light control. NeurIPS, 2016.
  • Varaiya (2013) Pravin Varaiya. The max-pressure controller for arbitrary networks of signalized intersections. In Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 2013.
  • Wang et al. (2020) Senzhang Wang, Jiannong Cao, and Philip Yu. Deep learning for spatio-temporal data mining: A survey. IEEE transactions on knowledge and data engineering, 2020.
  • Webster (1958) FV Webster. Traffic signal settings. Technical report, 1958.
  • Webster (1966) FV Webster. Traffic signals. Road research technical paper, 1966.
  • Wei et al. (2018) Hua Wei, Guanjie Zheng, Huaxiu Yao, and Zhenhui Li. Intellilight: A reinforcement learning approach for intelligent traffic light control. In SIGKDD, 2018.
  • Wei et al. (2019a) Hua Wei, Chacha Chen, Guanjie Zheng, Kan Wu, Vikash Gayah, Kai Xu, and Zhenhui Li. Presslight: Learning max pressure control to coordinate traffic signals in arterial network. In SIGKDD, 2019a.
  • Wei et al. (2019b) Hua Wei, Nan Xu, Huichu Zhang, Guanjie Zheng, Xinshi Zang, Chacha Chen, Weinan Zhang, Yanmin Zhu, Kai Xu, and Zhenhui Li. Colight: Learning network-level cooperation for traffic signal control. In CIKM, 2019b.
  • Xiong et al. (2019) Yuanhao Xiong, Guanjie Zheng, Kai Xu, and Zhenhui Li. Learning traffic signal control from demonstrations. In CIKM, 2019.
  • Xu et al. (2021) Bingyu Xu, Yaowei Wang, Zhaozhi Wang, Huizhu Jia, and Zongqing Lu. Hierarchically and cooperatively learning traffic signal control. In AAAI, 2021.
  • Yu et al. (2020) Zhengxu Yu, Shuxian Liang, Long Wei, Zhongming Jin, Jianqiang Huang, Deng Cai, Xiaofei He, and Xian-Sheng Hua. Macar: Urban traffic light control via active multi-agent communication and action rectification. In IJCAI, 2020.
  • Zang et al. (2020) Xinshi Zang, Huaxiu Yao, Guanjie Zheng, Nan Xu, Kai Xu, and Zhenhui Li. Metalight: Value-based meta-reinforcement learning for traffic signal control. In AAAI, 2020.
  • Zhang et al. (2021) Feng Zhang, Yani Liu, Ningxuan Feng, Cheng Yang, Jidong Zhai, Shuhao Zhang, Bingsheng He, Jiazao Lin, Xiao Zhang, and Xiaoyong Du. Periodic weather-aware lstm with event mechanism for parking behavior prediction. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Zhang et al. (2019) Huichu Zhang, Siyuan Feng, Chang Liu, Yaoyao Ding, Yichen Zhu, Zihan Zhou, Weinan Zhang, Yong Yu, Haiming Jin, and Zhenhui Li. Cityflow: A multi-agent reinforcement learning environment for large scale city traffic scenario. In WWW, 2019.
  • Zhang et al. (2020a) Huichu Zhang, Markos Kafouros, and Yong Yu. Planlight: Learning to optimize traffic signal control with planning and iterative policy improvement. IEEE Access, 2020a.
  • Zhang et al. (2020b) Huichu Zhang, Chang Liu, Weinan Zhang, Guanjie Zheng, and Yong Yu. Generalight: Improving environment generalization of traffic signal control via meta reinforcement learning. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020b.
  • Zhang & Yang (2021) Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Zhao et al. (2017) Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep generative models. In ICML. PMLR, 2017.
  • Zheng et al. (2019a) Guanjie Zheng, Yuanhao Xiong, Xinshi Zang, Jie Feng, Hua Wei, Huichu Zhang, Yong Li, Kai Xu, and Zhenhui Li. Learning phase competition for traffic signal control. In CIKM, 2019a.
  • Zheng et al. (2019b) Guanjie Zheng, Xinshi Zang, Nan Xu, Hua Wei, Zhengyao Yu, Vikash Gayah, Kai Xu, and Zhenhui Li. Diagnosing reinforcement learning for traffic signal control. arXiv, 2019b.

Appendix A Appendix


Table 2: Implementation details of MTLight

| Items | Details |
|---|---|
| Number of policy steps | 3600 |
| Discount factor $\gamma$ | 0.95 |
| Policy $\epsilon$ | 0.1 $\rightarrow$ 0.01 |
| $\epsilon$ decay rate | 0.995 |
| Policy learning rate | 0.005 |
| Policy minibatch | 32 |
| Task-shared latent space dim | 5 |
| Task-specific latent space dim | 5 |
| Task-shared latent state coef | 10 |
| Task-specific latent state coef | 10 |
| Policy network architecture | 2 hidden layers, 20 nodes each, ReLU activations |
| Policy network optimizer | RMSprop with learning rate 0.001 and MSE loss |
| Multi-Task architecture | 5 MLP embedding layers, 2 shared FC layers before GRU, GRU with hidden size 64, 1 shared FC layer after GRU, 4 task-specific FC layers, 4 output task layers, ReLU activations |
| Multi-Task optimizer | Adam with learning rate 0.01 and MSE loss |
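To give a feel for the exploration settings in Table 2, the Python sketch below evaluates one possible reading of the $\epsilon$ schedule: $\epsilon$ starts at 0.1, is multiplied by 0.995 at each update, and is clipped at 0.01. The multiplicative form and the notion of one "update" are assumptions; the table does not state when the decay is applied.

```python
def epsilon_after(n_updates: int, start: float = 0.1,
                  floor: float = 0.01, decay: float = 0.995) -> float:
    """Epsilon after n multiplicative decays, never going below the floor."""
    return max(floor, start * decay ** n_updates)


def updates_until_floor(start: float = 0.1, floor: float = 0.01,
                        decay: float = 0.995) -> int:
    """Smallest number of decays after which epsilon has reached the floor."""
    n = 0
    while start * decay ** n > floor:
        n += 1
    return n
```

Under these assumptions `updates_until_floor()` returns 460, so exploration would settle at its minimum after a few hundred updates.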

Appendix B related work

B.1 Conventional and Adaptive Traffic Signal Control

Most conventional traffic signal control methods are designed based on fixed-time signal control Webster (1958), actuated control Chiu (1992) or self-organizing traffic signal control Chiu & Chand (1993); Cools et al. (2013); Lowrie (1990); Svanes & Delaney (1981); Hunt et al. (1981). These approaches rely on expert knowledge and often perform unsatisfactorily in complicated real-world situations. To solve this problem, several optimization-based methods Roess et al. (2004); Varaiya (2013); Kouvelas et al. (2014) have been proposed to optimize average travel time, throughput, etc.; they decide the traffic signal plans according to observed data instead of human priors. However, these approaches typically rely on strict assumptions that might not hold in real-world cases Webster (1966). Furthermore, the optimization problems are usually hard to solve and require significant computing power in complex scenarios.

B.2 RL-based Traffic Signal Control

RL-based traffic signal control methods aim to learn the policy from interactions with the environment. Earlier studies use tabular Q-learning El-Tantawy et al. (2013); Abdoos et al. (2013); Dusparic & Cahill (2009); Abdoos et al. (2011), where the states in an environment are required to be discretized and low-dimensional. To address unmanageably large or continuous state spaces, recent advances employ deep RL with more complex continuous state representations (like images or feature vectors) to map the high-dimensional states into actions.

Efforts have been made to design strategies that formulate the task as a single agent Wei et al. (2018); Mannion et al. (2016); Huang et al. (2021); Zang et al. (2020); Oroojlooy et al. (2020); Jiang et al. (2021); Rizzo et al. (2019) or as some isolated intersections Zheng et al. (2019b; a); Xiong et al. (2019); Wei et al. (2019a); Chen et al. (2020); Oroojlooy et al. (2020); Zhang et al. (2020b; a), i.e., each agent makes decisions on its own. These methods are usually easy to scale, but they may have difficulty achieving globally optimal performance due to a lack of collaboration. To solve this problem, another approach is to jointly model the actions of the learning agents with centralized optimization Van der Pol & Oliehoek (2016); Kuyer et al. (2008). However, as the number of agents increases, joint optimization usually leads to dimensional explosion, which has inhibited the widespread adoption of such methods in large-scale traffic signal control. To overcome this difficulty, another type of method is implemented in a decentralized manner, taking into account the collaboration between neighbors through appropriate reward and state design Arel et al. (2010); Nishi et al. (2018); Wei et al. (2019b); Xu et al. (2021). Methods such as El-Tantawy et al. (2013); Chu et al. (2019) add neighboring information into states, Nishi et al. (2018); Wei et al. (2019b); Yu et al. (2020); Guo et al. (2021) add neighbors' hidden features into states, and Xu et al. (2021) optimizes neighborhood travel time as an additional reward. However, simply concatenating neighboring information is not well justified, because the influence of neighboring intersections is not balanced. Unlike the above methods, which add neighbor information to the state, our method learns task-shared and task-specific latent states by constructing a Multi-Task network.

B.3 Multi-Task Learning

Multi-Task Learning (MTL) Caruana (1997) is a learning paradigm that aims to jointly learn multiple related tasks so that the knowledge contained in one task can be leveraged by the others. Past works Oh et al. (2017); Zhang & Yang (2021); Ruder (2017); Ndirango & Lee (2019) have found that, by sharing a representation among related tasks and jointly learning all the tasks, better generalization can be achieved than by learning each task independently. Constructing auxiliary tasks to help the main task is a branch of Multi-Task Learning. Reinforcement learning is known to be sample inefficient, and transferring knowledge from auxiliary tasks is a powerful tool for improving learning efficiency Jaderberg et al. (2016); Lin et al. (2019); Lyle et al. (2021); Tongloy et al. (2017); Bellemare et al. (2019). Lin et al. (2019) combine different auxiliary tasks that provide gradient directions to speed up the training of the main reinforcement learning task. In comparison, our work aims to transfer knowledge from task-related auxiliary tasks as a prior to the main reinforcement learning task, to ultimately boost performance. Specifically, we model the Multi-Task network as a latent structure where the task-shared latent state is generated from early layers and the task-specific latent state is generated from deeper layers. This incentivises the policy to learn Bayes-optimal behaviour: the policy can take into account its uncertainty over the comprehensive information when choosing actions.

B.4 Preliminaries

In this section, we first introduce some basic concepts related to traffic signal control (TSC) that have been widely recognized in previous work Wei et al. (2019b); Zheng et al. (2019a); Zhang et al. (2020b); Wei et al. (2019a); Chen et al. (2020); Zang et al. (2020). Note that the concepts can be easily generalized to other intersections with different structures.

Figure 6: Illustration of phase.
  • Incoming/Outgoing Lanes. The incoming lanes are the lanes on which vehicles are about to enter the intersection. They usually comprise three basic types, "left-turn", "straight" and "right-turn", from inner to outer. The outgoing lanes are the lanes on which vehicles are about to leave the intersection.

  • Roadnet. A roadnet is a part of a dataset that represents an area of a city. A roadnet consists of signalized intersections, unsignalized intersections, and lanes connecting the intersections. Generally, the lane lengths, number of lanes and relative locations of intersections vary from one roadnet to another.

  • Phase. A phase is a controller timing unit associated with the control of one or more movements, representing a permutation and combination of different traffic flows. The 4-phase setting is the most common configuration in reality, illustrated in Fig. 6, but the number of phases can vary due to different intersection topologies (3-way, 5-way intersections, etc.).

  • Queue Length. Queue length is the number of vehicles waiting at an intersection due to a red light. Vehicles on an incoming lane with a speed of less than 0.1 m/s are considered to be waiting.

  • Average Travel Time. The travel time of a vehicle is the time difference between entering and leaving a particular area. The average travel time of all vehicles in a road network is the most frequently used measure to evaluate the performance of traffic signal control Wei et al. (2019b; a); Zhang et al. (2020b); Chen et al. (2020); Zheng et al. (2019a).

  • Flow Distribution. Flow distribution is the distribution of traffic entering the road network, which is generally expressed by the arrival rate of vehicles, i.e., the volume of traffic entering the road network per unit time.

  • Vehicles on Road. Vehicles on road are the running vehicles, i.e., vehicles that have entered the road network and have not yet reached their end point. The number of vehicles on road represents the real-time load on the road network.
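Two of the quantities defined above, queue length and average travel time, can be computed directly from per-vehicle data. The Python sketch below follows the definitions in the list; the function names and the input formats (a list of speeds, a list of enter/leave time pairs) are assumptions made for illustration and do not correspond to a simulator API.

```python
from typing import Iterable, Sequence, Tuple

WAITING_SPEED_THRESHOLD = 0.1  # m/s, as in the Queue Length definition


def queue_length(incoming_lane_speeds: Iterable[float]) -> int:
    """Count vehicles on incoming lanes whose speed is below the threshold."""
    return sum(1 for v in incoming_lane_speeds if v < WAITING_SPEED_THRESHOLD)


def average_travel_time(trips: Sequence[Tuple[float, float]]) -> float:
    """Mean of (leave_time - enter_time) over vehicles, in seconds.

    Only vehicles with both timestamps are passed in; how vehicles that are
    still on the road are handled is left open in this sketch.
    """
    durations = [leave - enter for enter, leave in trips]
    return sum(durations) / len(durations)
```

For example, speeds of 0.0, 0.05, 0.1 and 3.2 m/s give a queue length of 2, since a vehicle moving at exactly 0.1 m/s is not below the threshold.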

Appendix C Algorithm

The algorithm is shown in Alg. 1.

Input: Roadnet file; traffic flow file; number of training episodes $E$; frequency of updating policy $t_p$; frequency of updating multi-task network $t_m$; total simulation time $T$
Output: Set of optimized parameters for the intersections; optimized parameters for the multi-task network
1 Initialize task-shared and task-specific latent states $\mathbf{o}_t^{\mathrm{shr}}$, $\mathbf{o}_t^{\mathrm{spe}}$
2 Initialize policy replay buffer $\mathcal{B}^{\pi}$
3 Initialize policy $\pi^{\theta}$ and multi-task network $\mathbf{M}^{\phi}$
4 Initialize reward of each agent $\{r_i \mid i \in 1, \ldots, n\}$
5 for episode $\leftarrow 1, 2, \ldots, E$ do
6     for step $t \leftarrow 1, 2, \ldots, T$ do
7         Collect original observations for all agents
8         Add task-shared $\mathbf{o}_t^{\mathrm{shr}}$ and task-specific $\mathbf{o}_t^{\mathrm{spe}}$ latent states to the observations
9         for agent $i \leftarrow 1, 2, \ldots, n$ do
10            Select action according to $\pi^{\theta}$
11        end
12        Apply joint action $\bm{a}$ to the environment
13        Get new observations and environmental reward
14        Collect trajectories into replay buffer $\mathcal{B}^{\pi}$
15        Get multi-task network input $\mathbf{f}_t^{v}, \mathbf{f}_t^{s}, \mathbf{f}_t^{c}, \mathbf{f}_t^{tr}, \mathbf{f}_t^{q}, \mathbf{f}_t^{vr}$ from the environment
16        Predict results using multi-task network $\mathbf{M}^{\phi}$
17        Get task-shared $\mathbf{o}_t^{\mathrm{shr}}$ and task-specific $\mathbf{o}_t^{\mathrm{spe}}$ latent states from $\mathbf{M}^{\phi}$
18        Calculate statistics from 0 up to $t$ as supervised signal
19        if $t = t_p$ then
20            Train policy $\pi^{\theta}$ by maximizing reward in Eq. 11
21            Clear $\mathcal{B}^{\pi}$
22        if $t = t_m$ then
23            Calculate loss from the results of step 16 and step 18
24            Train multi-task network $\mathbf{M}^{\phi}$
25        end
26        if $t = T$ then
27            Collect the average travel time of all vehicles as the evaluation criterion
28        end
29    end
30 end
Algorithm 1 Training Process of MTLight
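
The control flow of Alg. 1 can be sketched in Python as follows. This is only a schematic under stated assumptions: `StubEnv`, the random action choice and the placeholder latent state are hypothetical stand-ins for the simulator, the policy $\pi^{\theta}$ and the multi-task network $\mathbf{M}^{\phi}$, and the update conditions are interpreted as "every $t_p$ (or $t_m$) steps", which the pseudocode does not state explicitly.

```python
import random


class StubEnv:
    """Hypothetical stand-in for the traffic simulator (not CityFlow)."""

    def __init__(self, n_agents, seed=0):
        self.n_agents = n_agents
        self.rng = random.Random(seed)

    def observe(self):
        # One scalar feature per agent, wrapped in a list.
        return [[self.rng.random()] for _ in range(self.n_agents)]

    def step(self, joint_action):
        rewards = [-self.rng.random() for _ in joint_action]
        return self.observe(), rewards


def train(env, episodes, horizon, t_p, t_m, n_phases=4, seed=0):
    """Run the loop structure of Alg. 1 and return (policy updates, multi-task updates)."""
    rng = random.Random(seed)
    policy_updates = 0
    multitask_updates = 0
    buffer = []            # policy replay buffer
    latent = [0.0, 0.0]    # placeholder task-shared and task-specific latent state
    for _ in range(episodes):
        obs = env.observe()
        for t in range(1, horizon + 1):
            # Step 8: append the latent state to every agent's observation.
            augmented = [o + latent for o in obs]
            # Steps 9-10: a random phase stands in for the learned policy.
            joint_action = [rng.randrange(n_phases) for _ in augmented]
            # Steps 12-14: act, observe, store the transition.
            obs, rewards = env.step(joint_action)
            buffer.append((augmented, joint_action, rewards))
            # Steps 15-17: placeholder for the multi-task network's latent output.
            latent = [sum(rewards) / len(rewards), t / horizon]
            # Steps 19-21: policy update (assumed periodic), then clear the buffer.
            if t % t_p == 0:
                policy_updates += 1
                buffer.clear()
            # Steps 22-24: multi-task network update (assumed periodic).
            if t % t_m == 0:
                multitask_updates += 1
    return policy_updates, multitask_updates


print(train(StubEnv(n_agents=3), episodes=2, horizon=20, t_p=5, t_m=10))  # prints: (8, 4)
```

With a horizon of 20 steps, $t_p = 5$ and $t_m = 10$, each episode triggers four policy updates and two multi-task updates, which is what the printed counts reflect.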

Appendix D Datasets

D.1 Road Networks

The evaluation scenarios come from four real road network maps of different scales, including Hangzhou (China), Jinan (China), New York (USA) and Shenzhen (China), illustrated in Fig. 7. The road networks and data of Hangzhou, Jinan and New York are from public datasets (https://traffic-signal-control.github.io/). The road network map of Shenzhen was built by us from OpenStreetMap; the road network map and data of Shenzhen will be released to facilitate future research. The road networks of Jinan and Hangzhou contain 12 and 16 intersections in $4 \times 3$ and $4 \times 4$ grids, respectively. The road network of New York includes 48 intersections in a $16 \times 3$ grid. The road network of Shenzhen contains 33 intersections and, unlike the other three maps, is not a grid.

Figure 7: Illustration of the road networks. From left to right, the figures show the road networks of Jinan (China), Hangzhou (China), New York (USA) and Shenzhen (China), containing 12 ($4 \times 3$), 16 ($4 \times 4$), 48 ($16 \times 3$) and 33 (non-grid) traffic signals, respectively.
Figure 8: Turning statistics of vehicle routes.
Table 3: Arrival rate of the real-world traffic datasets (arrival rate in vehicles/300s)
| Dataset | # Intersections | Mean | Std | Max | Min |
| --- | --- | --- | --- | --- | --- |
| $\mathcal{D}_{Hangzhou}$ | 16 ($4 \times 4$) | 248.58 | 42.25 | 333 | 212 |
| $\mathcal{D}_{Jinan}$ | 12 ($4 \times 3$) | 524.58 | 102.91 | 672 | 256 |
| $\mathcal{D}_{NewYork}$ | 48 ($16 \times 3$) | 235.33 | 5.84 | 244 | 224 |
| $\mathcal{D}_{Shenzhen}$ | 33 (Non-grid) | 147.92 | 79.35 | 255 | 22 |
Table 4: Data statistics of the synthetic traffic dataset, shared by $\mathcal{D}_{Hangzhou}$, $\mathcal{D}_{Jinan}$, $\mathcal{D}_{NewYork}$ and $\mathcal{D}_{Shenzhen}$
| Time (s) | Arrival rate (vehicles/s) | Incoming vehicles | Accumulated vehicles |
| --- | --- | --- | --- |
| 0-600 | 1.00 | 600 | 600 |
| 600-1200 | 0.25 | 150 | 750 |
| 1200-1800 | 4.00 | 2400 | 3150 |
| 1800-2400 | 2.00 | 1200 | 4350 |
| 2400-3000 | 0.20 | 120 | 4470 |
| 3000-3600 | 0.50 | 300 | 4770 |
Table 5: Statistics of turning frequency at intersections in all routes.
| Turn type | Hangzhou real | Hangzhou syn_peak | Jinan real | Jinan syn_peak | New York real | New York syn_peak |
| --- | --- | --- | --- | --- | --- | --- |
| turn left | 1093 (14%) | 5175 (24%) | 3044 (20%) | 5833 (30%) | 3886 (18%) | 7169 (20%) |
| go straight | 6620 (86%) | 16293 (76%) | 12175 (80%) | 13704 (70%) | 17498 (82%) | 27976 (80%) |
| turn right | 3184 | 8752 | 5972 | 8747 | 4021 | 7421 |
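
The percentages in Tab. 5 appear to be computed over left-turn and straight movements only, with right turns excluded from the denominator. This is an inference from the reported counts rather than a statement in the text; the short Python check below (with a hypothetical helper name) reproduces the reported values under that assumption.

```python
def turn_shares(left, straight):
    """Return rounded percentage shares of left turns and straight movements.

    Assumption: right turns are excluded from the denominator.
    """
    total = left + straight
    return round(100 * left / total), round(100 * straight / total)


# (left, straight) counts copied from Tab. 5
counts = {
    "Hangzhou real": (1093, 6620),
    "Hangzhou syn_peak": (5175, 16293),
    "Jinan real": (3044, 12175),
    "Jinan syn_peak": (5833, 13704),
    "New York real": (3886, 17498),
    "New York syn_peak": (7169, 27976),
}
for name, (left, straight) in counts.items():
    print(name, turn_shares(left, straight))
```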

D.2 Flow configurations

We run the experiments under two traffic flow configurations: real traffic flow and synthetic traffic flow. The real traffic flow is real-world hourly statistical data with slight variance in vehicle arrival rates, as shown in Tab. 3. Since real-world strategies tend to break down during bottleneck periods (peak hours), we also use synthetic datasets to better evaluate traffic light control methods in a flat-peak-flat scenario. These datasets have a more dramatic variance in vehicle arrival rates, as shown in Tab. 4. The traffic flow configurations are described in detail below:

  • Real. The traffic flows of Hangzhou (China), Jinan (China) and New York (USA) are from the public datasets, which are processed from multiple sources. The traffic flow of Shenzhen (China) was generated by us from traffic trajectories collected by 80 red-light cameras and 16 monitoring cameras over one hour. The data statistics are listed in Tab. 3.

  • Synthetic. The synthetic configuration is a mixed traffic flow with a total of 4770 vehicles in one hour, which simulates a heavy peak. The arrival rate changes every 10 minutes to simulate the uneven traffic flow distribution of the real world. The vehicle arrival rates and cumulative traffic flow are detailed in Tab. 4.
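
As a quick check of the synthetic configuration, the Python sketch below recomputes the incoming and accumulated vehicle counts from the per-interval arrival rates of Tab. 4. The function and variable names are illustrative only. Note that 0.5 vehicles/s over the last 600 s gives 300 incoming vehicles, which is what the stated total of 4770 implies.

```python
# (start_s, end_s, arrival rate in vehicles/s), copied from Tab. 4
SCHEDULE = [
    (0, 600, 1.00),
    (600, 1200, 0.25),
    (1200, 1800, 4.00),
    (1800, 2400, 2.00),
    (2400, 3000, 0.20),
    (3000, 3600, 0.50),
]


def accumulate(schedule):
    """Return (start, end, incoming, accumulated) for each interval of the schedule."""
    rows = []
    total = 0
    for start, end, rate in schedule:
        incoming = round(rate * (end - start))  # vehicles arriving in this interval
        total += incoming
        rows.append((start, end, incoming, total))
    return rows


for row in accumulate(SCHEDULE):
    print(row)
# The final accumulated value is 4770, matching the total flow stated above.
```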

Appendix E Evaluation Criteria

Following existing studies Wei et al. (2019b; a); Xiong et al. (2019); Chen et al. (2020); Zang et al. (2020), we use the average travel time to evaluate the performance of different methods for traffic signal control. The average travel time indicates the overall traffic situation in an area over a period of time. For a detailed definition of average travel time, see Section B.4. Since the number of vehicles and the origin-destination (OD) positions are fixed, a better traffic signal control strategy yields a lower average travel time.
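
A minimal Python sketch of this criterion is given below. The function name and the dictionary-based inputs are hypothetical, and charging vehicles that have not left by the end of the episode with the time elapsed up to the horizon is an assumption of this sketch; the text does not specify how unfinished vehicles are handled.

```python
def average_travel_time(enter_times, leave_times, horizon):
    """Mean of (leave time - enter time) over all vehicles that entered the area.

    Assumption: a vehicle without a recorded leave time is counted until `horizon`.
    """
    durations = []
    for vehicle_id, t_in in enter_times.items():
        t_out = leave_times.get(vehicle_id, horizon)
        durations.append(t_out - t_in)
    return sum(durations) / len(durations) if durations else 0.0


enter = {"a": 0, "b": 10, "c": 3500}   # seconds at which each vehicle entered
leave = {"a": 120, "b": 250}           # vehicle "c" is still on the road at the horizon
att = average_travel_time(enter, leave, horizon=3600)
print(round(att, 2))  # (120 + 240 + 100) / 3, prints: 153.33
```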

Appendix F Baselines

Our method is compared with two categories of methods: conventional transportation methods and RL methods. Some existing RL-based traffic signal control methods, such as AttendLight Oroojlooy et al. (2020) and SD-MaCAR Guo et al. (2021), are evaluated under different experimental settings (e.g., road network or traffic flow) and their source code is not yet available; therefore, they are not compared in our experiments. Note that, for a fair comparison, all the RL methods are learned without any pre-trained parameters and all methods are evaluated under the same settings. The results are obtained by running the source code (https://github.com/traffic-signal-control/RL_signals). All the baselines are run with three random seeds, and the mean is taken as the final result. The action interval is five seconds for each method, and the horizon is 3600 seconds for each episode. Specifically, the compared methods are:

F.1 Conventional methods

  • MaxPressure Varaiya (2013) is a leading conventional method, which greedily chooses the phase with the maximum pressure. The pressure is defined as the difference of vehicle density between the incoming lane and the outgoing lane, and the vehicle density means the actual number of vehicles divided by the maximum permissible vehicle number.

  • Fixedtime Koonce & Rodegerdts (2008) with random offset Roess et al. (2004) executes each phase in a phase loop with a pre-defined span of phase duration, which is widely used for steady traffic.

  • SOTL Cools et al. (2013) specifies a pre-defined threshold for the number of waiting vehicles on approaching lanes. Once the number of waiting vehicles exceeds the threshold, the signal switches to the next phase.
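
The MaxPressure and SOTL rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the baseline code used in the experiments: the lane names, the phase layout and the convention that a phase's pressure is the sum over its (incoming lane, outgoing lane) pairs are all hypothetical choices made for this example.

```python
def density(count, capacity):
    """Vehicle density: actual number of vehicles over the maximum permissible number."""
    return count / capacity


def phase_pressure(movements, counts, capacity):
    """Sum of (incoming density - outgoing density) over a phase's lane pairs."""
    return sum(
        density(counts[lane_in], capacity[lane_in])
        - density(counts[lane_out], capacity[lane_out])
        for lane_in, lane_out in movements
    )


def max_pressure_action(phases, counts, capacity):
    """Greedily pick the phase with the maximum pressure."""
    return max(phases, key=lambda name: phase_pressure(phases[name], counts, capacity))


def sotl_should_switch(waiting_vehicles, threshold):
    """SOTL rule: switch to the next phase once the waiting count exceeds the threshold."""
    return waiting_vehicles > threshold


# Hypothetical two-phase intersection; each phase lists (incoming, outgoing) lane pairs.
phases = {
    "NS": [("N_in", "S_out"), ("S_in", "N_out")],
    "EW": [("E_in", "W_out"), ("W_in", "E_out")],
}
counts = {"N_in": 8, "S_in": 6, "E_in": 2, "W_in": 3,
          "S_out": 1, "N_out": 2, "W_out": 0, "E_out": 4}
capacity = {lane: 10 for lane in counts}

print(max_pressure_action(phases, counts, capacity))  # prints: NS
print(sotl_should_switch(waiting_vehicles=5, threshold=4))  # prints: True
```

In this example the north-south phase has pressure 1.1 against 0.1 for east-west, so MaxPressure serves north-south.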

F.2 RL-based methods

  • Individual RL Wei et al. (2018) performs independent control for each agent in the multi-agent environment, with each intersection controlled by one agent. The replay buffer and network parameters are not shared, and each model is updated independently. There is no information transfer between agents, and no neighbor information is considered.

  • MetaLight Zang et al. (2020) is a value-based meta reinforcement learning method via parameter initialization, which is based on MAML Finn et al. (2017). MetaLight is originally a single-agent approach for meta-learning on multiple separate tasks. Here we extend it to a multi-agent scenario without considering neighbor information.

  • PressLight Wei et al. (2019a) combines the traditional traffic method MaxPressure Varaiya (2013) with RL. It is an RL method that optimizes the pressure of each intersection.

  • CoLight Wei et al. (2019b) uses graph convolution and an attention mechanism to model neighbor information, and then uses this neighbor information to optimize the queue length.

  • GeneraLight Zhang et al. (2020b) is a meta reinforcement learning method that uses a generative adversarial network to generate diverse traffic flows and uses them to build training environments.