
Journal of Rail Transport Planning & Management 26 (2023) 100394


Solving the train dispatching problem via deep reinforcement learning

Valerio Agasucci a,b, Giorgio Grani c,∗, Leonardo Lamorgese b
a DIAG, Sapienza University of Rome, Rome, Italy
b OPTRAIL, Rome, Italy
c Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
∗ Corresponding author. E-mail addresses: agasucci@diag.uniroma1.it (V. Agasucci), g.grani@uniroma1.it (G. Grani), leonardo.lamorgese@optrail.com (L. Lamorgese).

https://doi.org/10.1016/j.jrtpm.2023.100394
Received 23 September 2021; Received in revised form 1 April 2023; Accepted 4 May 2023; Available online 13 May 2023
© 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

ARTICLE INFO

Keywords: Scheduling, Reinforcement learning, Optimization

ABSTRACT

Every day, railways experience disturbances and disruptions, both on the network and the fleet side, that affect the stability of rail traffic. Induced delays propagate through the network, leading to a mismatch between demand and offer for goods and passengers and, in turn, to a loss in service quality. In these cases, it is the duty of human traffic controllers, the so-called dispatchers, to do their best to minimize the impact on traffic. However, dispatchers inevitably have a limited depth of perception of the knock-on effect of their decisions, particularly on how they affect areas of the network that are outside their direct control. In recent years, much work in Decision Science has been devoted to developing methods to solve the problem automatically and support the dispatchers in this challenging task. This paper investigates Machine Learning-based methods for tackling this problem, proposing two different Deep Q-Learning methods (Decentralized and Centralized). Numerical results show the superiority of these techniques with respect to classical linear Q-Learning based on matrices. Moreover, the Centralized approach is compared with a MILP formulation, showing interesting results. The experiments are inspired by data provided by a U.S. class 1 railroad.

1. Introduction

A railway system is a complex network of interconnected tracks, where train traffic is controlled via the signaling system by
activating switches and signals. The management and operation of a railway system is the task of the Infrastructure Manager (IM).
A Train Operating Company (TOC) organizes its fleet to accommodate expected demands, maximizing revenue and coverage. In
regions where, for geographical and historical reasons, the market is made up predominantly of freight-hauling companies, such as North America and Australia, the IM and the TOC are often the same entity. That is, rail operators are generally vertically
integrated, owning and operating both the network and the fleet. In most countries however, the IM is a single public authority,
which rents its network to TOCs to operate their train services. This is particularly common for systems that operate predominantly
passenger traffic, like in Europe. Here, the IM interacts with the TOCs to establish a plan for train traffic, the so-called timetable. This
process, referred to as timetabling, takes place offline, generally every 3 to 12 months, and is a challenging and time-consuming
task. Its main goal is to assign routes to trains and create a schedule that is conflict-free and, typically, presents some elements
of periodicity. Despite a vast body of literature in this field of research, in practice timetables are, to this day, still in many cases hand-engineered by specialized personnel, who base their decisions largely on experience, regulations, safety measures, business rules,
and the various requests expressed by the TOCs (the latter being especially delicate and time-consuming in liberalized markets with multiple competing TOCs).
In operating a railway system, the IM ideally attempts to adhere perfectly to the timetable. Unfortunately, as anyone who has
ever taken a train will be familiar with, this seldom happens, as small disturbances, or in some cases serious disruptions, occur
daily. A train malfunction, switch or track failure, delays in the preparation of the train or in passengers embarking, and plenty of
other issues may create train delays and ultimately affect the overall network, sometimes in unforeseen ways. In some cases small delays are recovered by simply driving trains faster but, very often, online re-routing and re-scheduling decisions have to be taken to reduce delays and increase efficiency. In the literature, this online decision-making process is referred to as the Train Dispatching problem (TD), a real-time variant of the above-mentioned Train Timetabling problem (known to be NP-hard, Caprara et al., 2002). The very limited time available for computation (often only a few seconds) and the size and complexity of real-life instances make this problem very challenging to solve in practice. The real-time nature of the problem in particular largely limits the solution
approaches that can be used effectively. Operationally, the task of dispatching trains is in the hands of the IM’s traffic controllers,
the dispatchers, each assigned to a specific portion of the network (a station, a junction, a part of a line, etc). To this day, dispatchers
are generally provided with little or no decision support in this process, which makes it challenging for them to go beyond their
local view of the network and take into account the knock-on effects of their decisions, especially in areas of the network that are
outside their direct control.
The Train Dispatching problem has sparked much research interest over the years, in particular in the Optimization community.
Different models and solution approaches have been proposed over the years: exact methods (Lamorgese and Mannino, 2015; Lamorgese et al., 2018; Liao et al., 2021; Narayanaswami and Rangaraj, 2011, 2013), heuristics based on MILP formulations (Yan and Yang, 2012; Boccia et al., 2012, 2013; Adenso-Díaz et al., 1999), tabu search (Tornquist and Persson, 2005), genetic algorithms (Higgins et al., 1997), classic greedy heuristics (Cai et al., 1998), and neighborhood search (Samà et al., 2017).
Refer to Lamorgese et al. (2018) for a deeper insight on the Train Dispatching problem, and to the many surveys on the topic for
an overview of these approaches (e.g. Cacchiani et al., 2014; Cai et al., 1998; Caprara et al., 2002; Chen et al., 2018; Corman et al.,
2017; Corman and Meng, 2014; Dolan and Moré, 2002; Drori et al., 2020; Fang et al., 2015; Ghasempour and Heydecker, 2019;
Higgins et al., 1997; Khadilkar, 2018; Khalil et al., 2017; Lamorgese and Mannino, 2015; Lamorgese et al., 2018; Liao et al., 2021;
Narayanaswami and Rangaraj, 2011, 2013; Ning et al., 2019; Obara et al., 2018; Pellegrini et al., 2015; Samà et al., 2017; Scarselli
et al., 2008; Schrittwieser et al., 2019; Šemrov et al., 2016; Sharma et al., 2017; Silver et al., 2018, 2017a,b; Sutton and Barto, 2018;
Törnquist, 2006). Recent developments in both Machine Learning and Optimization have led to the definition of new learning-based
paradigms to solve hard problems. Our focus is on Reinforcement Learning, for which many interesting results have been achieved so far. In particular, with the well-known AlphaGo algorithm (see Silver et al. (2017b)) the authors tackled the complex game of Go, developing an outstanding framework able to beat the best human players. Several approaches have followed AlphaGo, like AlphaZero (Silver et al., 2017a, 2018) and MuZero (Schrittwieser et al., 2019), each time increasing the achievable degree of generalization. Recently, in Khalil et al. (2017), this approach was extended to combinatorial problems with a more general range of
possible applications. To achieve their results, the authors combined Deep Q-Learning with graph convolutional neural networks,
which are a specialized class of models for graph-like structures. A very similar approach was followed in Drori et al. (2020).

Our contribution. In this paper, TD is tackled by means of Deep Q-Learning on a single railway line. More specifically, two
approaches are investigated: Decentralized and Centralized. In the former, each train can be seen as an independent agent with the
ability to see only a part of the network, namely some tracks ahead/behind it and not beyond. This approach has the advantage
that it can be easily generalized, but, on the other hand, may lack the depth of prediction useful to express network dynamics. The
latter method takes as input the entire line and learns to deal with delay propagation. Moreover, we use a Graph Neural Network
to estimate this delay, reflecting the railway topology. Both methods generalize to different railway sizes. From an application standpoint, we find the use of Deep Reinforcement Learning approaches (such as Deep Q-Learning) very promising, as they present the advantage of shifting the computational burden to the learning stage. While enumerative algorithms typical of
Combinatorial Optimization and Constraint Programming have proven to be effective in several train dispatching contexts, scalability
remains a daunting challenge. On the other hand, under the assumption that the model can be trained effectively, Deep RL could
achieve a (quasi) real-time performance, unattainable with classical optimization approaches. Finally, we point out that a possible
reason for the slow adoption of automatic dispatching software in the industry is the diversity of business rules and requirements
for such software in different regions and markets. The step of adapting the software to a specific set of such rules and requirements
could be accelerated (or indeed skipped) using a Deep RL-based approach, which builds its internal input–output representation
based on provided data and requires virtually no knowledge of the rules themselves.
For all the above, we believe that it is worth pursuing the application of Deep RL techniques in the field of train dispatching.
This is not the first article proposing the use of RL for train dispatching, so we highlight the differences with respect to earlier work.
Papers (Khadilkar, 2018; Khalil et al., 2017; Lamorgese and Mannino, 2015; Lamorgese et al., 2018; Liao et al., 2021; Narayanaswami
and Rangaraj, 2011, 2013; Ning et al., 2019; Obara et al., 2018; Pellegrini et al., 2015; Samà et al., 2017; Scarselli et al., 2008;
Schrittwieser et al., 2019; Šemrov et al., 2016) use a linear Q-Learning approach to tackle the problem. In Ghasempour and
Heydecker (2019), the authors present an approach based on approximate dynamic programming. Recent papers (Ning et al., 2019;
Obara et al., 2018; Wang et al., 2019) introduce a deep network to predict the next action. In Liao et al. (2021) Deep RL is used to solve energy-aimed timetable rescheduling (finding the optimal timetable minimizing energy usage); in Wang et al. (2021) an RL approach is exploited to reschedule the timetable of the high-speed railway line between Shanghai and Beijing, while in Wang et al. (2023) multi-agent reinforcement learning is applied to a metro system. Finally, in Ying et al. (2020) and Ying et al.
(2021) an Actor-Critic method is used to schedule the underground train service in London. However, these approaches are quite
limited in terms of the size of the instances that can be solved. Indeed, in Ning et al. (2019), Obara et al. (2018) and Wang et al.
(2019) the railway network considered in the experiments is limited to 7 or 8 stations and in Ning et al. (2019) and Wang et al.
(2019) train traffic is considered only in one direction. In this paper we present an approach which is able to tackle larger instances
(up to 29 stations in our experiments) and handle both traffic directions. Furthermore, we model other important factors in the dispatching process, such as train length and other business rules described in Section 2.1, which are not all covered in the cited papers. Train length has been addressed in some of the works cited above, but it appears to be a factor that significantly increases the computational complexity of the proposed procedures, whereas here it is directly embedded. This results in a model that is more faithful to real-life requirements and an algorithm that can solve instances of some practical relevance (e.g. a regional line). In addition, it may be worth noting that our experiments are carried out on data from the US railway network, which presents its own challenges with respect to other regions in the world, like the European, Chinese and Japanese ones. Finally, in the computational section we show how the Deep Q-Learning approaches presented here perform better than their linear counterpart, the matrix Q-learning approach proposed in Khadilkar (2018).
The paper is organized as follows: in Section 2, basic concepts of Train Dispatching and Reinforcement Learning are introduced formally. In Section 3, Graph Neural Networks are introduced. In Section 4, the two algorithms are discussed: specifically, in Section 4.1 the Decentralized approach is presented, whereas in Section 4.2 the Centralized approach is described. In Section 5, a numerical analysis is conducted to assess the effectiveness of the Deep Reinforcement Learning approaches for the train dispatching problem. Finally, in Section 6 a comparison between our approach and a well-known Mixed-Integer Programming approach is provided.

2. Preliminaries

Two basic ingredients characterize this paper: the Train Dispatching problem (TD) and (Deep) Reinforcement Learning (DeepRL).
In the following, we give a short presentation of both, aiming to be introductory rather than comprehensive.

2.1. Train dispatching problem

The fundamental elements of the Train Dispatching problem (TD) are trains, the railway network, and paths. Trains are of course
the means of transportation in this system. In practice, different types of trains can operate at the same time on a network. Two
of the most important attributes of a train for dispatching decisions are its priority and its length. Priority represents the relative importance of a train with respect to other trains, and is generally reflected in the decisions taken by dispatchers to recover from delay. For example, cargo trains usually have a lower priority than passenger ones. A train's length depends on various aspects, such as technology, market and the nature of the service.
A railway network is composed of a set of tracks, switches and signals. A common way to represent this network is to see it as an
alternating sequence of tracks and stations. A station is a logical entity in the network that comprises multiple tracks and switches
where, typically, the majority of routing and scheduling actions take place. Tracks are physical connections between two stations.
Trains traveling in opposite directions can never occupy a track simultaneously, while certain tracks allow this for trains traveling
in the same direction.
In terms of infrastructure, stations are effectively also a set of interconnected track segments. An interlocking route is a sequence
of track segments that connects two signals. Different routes can be conflicting, that is, certain movements may not be allowed on
such routes at the same time. The interlocking is precisely the signaling device that prevents these conflicts. Certain interlocking routes also include stopping points, which allow train activities, such as embarking passengers or loading goods. Generally, each stopping point can be occupied by at most one train at a time. A switch is a mechanical installation enabling trains to be guided from one track to another. A switch can be occupied by one train at a time. Finally, a path is the expected sequence of tracks and stations for a given train. The path is specified by a starting station, a set of boarding points, and the arrival station.
The elements introduced above allow us to model real-life operations with a level of approximation that is sufficient for the purposes
of this paper.
In a few words, TD is the problem of managing train traffic in real-time by taking scheduling and routing decisions in order
to maximize system efficiency. Such dispatching decisions are subject to a number of constraints related to the signaling system,
infrastructure capacity, business rules, train physics etc. In railway systems operated following an official timetable, the goal is
typically to adhere to it as much as possible. When deviations from this plan occur, the dispatcher's objective is to recover from these deviations, minimizing train delay.
A particularly pernicious situation is the occurrence of a deadlock. A group of trains is said to be in deadlock if none of them can move because another train is blocking the next track of its path. A deadlock is the result of a lack of information (e.g. wrong train length assumptions) or, more often, is induced by human error (erroneous dispatching decisions). A critical requirement of any automatic dispatching system is that it avoids creating deadlocks at all costs.


Fig. 1. The reinforcement learning framework.

2.2. Reinforcement learning

Reinforcement Learning (RL) is an approximate version of Dynamic Programming, as stated in Bertsekas (2019). This paradigm learns in the sense that the approximating function it builds takes into account statistical information obtained from previous iterations of the same algorithm or from external data. The term Deep Reinforcement Learning (DeepRL) usually refers to RL where a deep neural
network is used to build the approximation. We refer to Sutton and Barto (2018) and Bertsekas (2019) for a comprehensive discussion
on RL.
RL is usually explained through agent and environment interaction. The agent is the part of the algorithm that learns and takes
decisions. To do so, it analyzes the surrounding environment. Once it takes a decision, the environment reacts to this decision and
the agent perceives the effect its action has produced. To understand if the action taken has been successful or not, the agent may
receive a reward associated with the action and the new environment observed.
In the Decentralized approach of Section 4.1, the agent is the train and the environment is the observable line, i.e. the portion of line and trains visible to the agent (more details in Section 4.1). In the Centralized approach, by contrast, the agent acts as a line coordinator, deciding for each train, and the environment is the entire line.
For the purposes of this paper, the reinforcement procedure moves forward following the rolling horizon of events, indexed by $t$. An event takes place every time there is a decision to make that may affect the global objective. In the RL vocabulary, the state is the formal representation of the environment at a time step $t$, and it is formalized by the vector $s_t \in \mathcal{S}_t$, where $\mathcal{S}_t \subseteq \mathbb{R}^m$ is the set of all possible states at time $t$. The action taken by the agent is $a_t \in \mathcal{A}_t$, where $\mathcal{A}_t \subseteq \mathbb{R}^n$ is the set of all possible actions at time $t$. Once the agent observes $s_t$ and carries out $a_t$, the environment reacts by producing the new state $s_{t+1} \in \mathcal{S}_{t+1}$ and a reward $r_{t+1} \in \mathcal{R}$, where $\mathcal{R} \subseteq \mathbb{R}$ is the set of all possible rewards. In this case, $\mathcal{R}$ is one-dimensional but, in general, it could take the form of a vector, depending on the problem examined. Fig. 1 shows a simplified flowchart of RL.
Q-learning is a branch of RL that uses an action-value function (usually referred to as q-function) to identify the action to take.
More formally, the q-function $Q(s_t, a_t)$ represents the reward that the system is expected to achieve by taking said action given the state. At each step $t$, the action that brings the highest expected reward is chosen, namely:

$$a_t = \arg\max_{a \in \mathcal{A}_t} Q(s_t, a)$$
Pseudo-code for a Q-learning algorithm is presented in Algorithm 1.


Algorithm 1: Q-learning algorithm
Input: a set of instances $\mathcal{P}$, the memory $\mathcal{M} = \emptyset$, a loss function $\ell(\hat{y}, y)$, $\epsilon \in (0, 1)$, $\gamma \in (0, 1]$, #episodes, #moves
Output: the trained predictor $Q(\cdot, \cdot)$
for $k = 1, \ldots$, #episodes do
    Sample $P \in \mathcal{P}$
    Initialize the episode: $t = 0$, $s_0 = 0$
    for $t = 0, \ldots$, #moves do
        Select an action $a_t$: with probability $\epsilon$ draw $a_t$ uniformly from $\mathcal{A}_t$; with probability $1 - \epsilon$ set $a_t = \arg\max_{a \in \mathcal{A}_t} Q(s_t, a)$
        Observe $s_{t+1}, r_{t+1} = \text{ENVIRONMENT}(s_t, a_t)$
        Store $(s_t, a_t, y_{t+1})$ in the memory $\mathcal{M}$, where $y_{t+1} = r_{t+1}$ if $t + 1 = T$, and $y_{t+1} = r_{t+1} + \gamma \max_{a \in \mathcal{A}_{t+1}} Q(s_{t+1}, a)$ otherwise
        Sample a batch $(x, y) \subseteq \mathcal{M}$
        Learn by making one step of stochastic gradient descent w.r.t. the loss $\ell(Q(x), y)$
        if $t + 1 = T$ then break
    end
end


$\ell(\hat{y}, y)$ is the loss function used to perform the training, i.e. a measure of the error committed by the model that one wants to minimize. Examples of loss functions are the Mean Squared Error, Cross-Entropy, Mean Absolute Error and so on; see Wang et al. (2022) for an overview of the topic. $\epsilon \in (0, 1)$ is the probability of choosing a random action, $T$ is the final state of the process (i.e. when it is not possible to move anymore), and $\gamma \in (0, 1]$ is the future discount, a hyper-parameter reflecting the fact that future rewards may be less important. To make the estimation more consistent, a replay mechanism is usually used, so that the agent interacts with the system for a few virtual steps before learning. This is suitable in situations where the environment can be efficiently manipulated or simulated, so that the computational cost is only marginally affected.
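As a concrete illustration of Algorithm 1, the following sketch shows one possible Deep Q-learning loop with an ε-greedy policy and a replay memory, written in PyTorch. It is only a minimal sketch under our own assumptions: the environment interface (reset and step), the state and action dimensions, and all hyper-parameter values are illustrative placeholders, not the settings used in the paper.

```python
import random
import torch
import torch.nn as nn

# Illustrative placeholders (not the paper's settings): state size, number of actions,
# and a small fully connected network approximating Q(s, .).
STATE_DIM, N_ACTIONS = 24, 3
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
memory, eps, gamma, batch_size = [], 0.1, 0.95, 32

def train(env, episodes=100, moves=200):
    """env is a hypothetical simulator with reset() -> state and step(a) -> (state, reward, done)."""
    for _ in range(episodes):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        for _ in range(moves):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(N_ACTIONS)
            else:
                with torch.no_grad():
                    a = int(q_net(s).argmax())
            s_next, r, done = env.step(a)
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            # target y_{t+1}: r if the episode ends, else r + gamma * max_a Q(s_{t+1}, a)
            with torch.no_grad():
                y = r if done else r + gamma * float(q_net(s_next).max())
            memory.append((s, a, y))
            # one step of stochastic gradient descent on a mini-batch from the replay memory
            batch = random.sample(memory, min(batch_size, len(memory)))
            states = torch.stack([b[0] for b in batch])
            actions = torch.tensor([b[1] for b in batch]).unsqueeze(1)
            targets = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            q_pred = q_net(states).gather(1, actions).squeeze(1)
            loss = loss_fn(q_pred, targets)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            s = s_next
            if done:
                break
```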

3. Graph neural networks

Graph convolutional neural networks (GNNs), introduced by Scarselli et al. (2008), are used to approximate the Q-value function. Differently from a standard deep neural network, a GNN exploits the graph structure of the railway input to learn the Q-value function. The name convolutional derives from the fact that information is aggregated, shared and propagated among nodes according to how they are connected. This process is known as message passing, since the information of each node is propagated to the others for a given number of steps. The outputs of message passing are the so-called embedding vectors (one embedding for each node of the graph). The embedding vector collects information not only from the node to which it refers, but also from other nodes of the graph. For a richer overview of GNNs, please refer to Zhou et al. (2018).
The input of the network is a graph $G(N, A)$, where $N$ is the set of vertices and $A$ the set of arcs. With each node $j \in N$ is associated a vector of features $x_j \in \mathbb{R}^n$, where $n$ is the number of features. In our message passing architecture, we use the scheme adopted by Almasan et al. (2019). Given an input graph, for each node $i$ we define the set of neighbors $ne_i = \{j \in N : (j, i) \in A\}$.
The first step of the GNN is to apply a mean function to all the neighbor features of each node $i$:

$$m_i = \frac{\sum_{j \in ne_i} x_j}{|ne_i|}, \quad i \in N \qquad (1)$$

where $m_i$ is the mean of the features of the neighbors of node $i$. In Fig. 2 we show an example of a graph, the features and the mean (computed as reported above) associated with each node. Then $m_i$ and $x_i$ are concatenated and given as input to a feed-forward neural network $ReLU1$, with $ReLU$ (see Sharma et al. (2017)) as activation function:

$$l_i = ReLU\left(W_1^{\top} \, concat(x_i, m_i) + b_1\right), \quad i \in N$$

The term $l_i$ is the output, $W_1$ the weights and $b_1$ the biases of the neural network $ReLU1$.
Depending on the node connections, the outputs are passed to a second neural network $ReLU2$, with $ReLU$ as activation function:

$$h_i = ReLU\left(W_2\left(l_i + \sum_{j \in ne_i} l_j\right) + b_2\right), \quad i \in N$$

where $h_i$ is the embedding vector generated as the output of the neural network $ReLU2$, having $W_2$ and $b_2$ as weights and biases, respectively. The message passing returns an $h_i$ for each $i \in N$. All the steps in the message passing can be performed several times; however, for most applications 2 or 3 times are sufficient. In our experiments, we do so twice. The message passing ensures that the information of each node is propagated not only to its neighbors, but also to more distant nodes. $ReLU1$ and $ReLU2$ have the same category of parameters: $W_1$ and $b_1$ for $ReLU1$; $W_2$ and $b_2$ for $ReLU2$. The embedding vector $h_i$ may have a different size than $x_i$. This is justified by the fact that $h_i$ carries information not only from $x_i$ but also from other nodes. To perform another message passing step, $h_i$ becomes the new input of (1). Finally, to obtain the target (in this work, the Q-values), a last network $f(\cdot)$ is applied to the sum of the $h_i$:

$$o = f\left(\sum_{i \in N} h_i\right)$$

In Fig. 3, we show the network described above applied to the graph in Fig. 2.
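The message-passing scheme just described can be sketched in a few lines of PyTorch, as below. This is only a simplified illustration: for brevity the embedding size is kept equal to the input feature size so the same two layers can be reused across rounds, and the layer sizes, the dense adjacency representation and the tiny example graph are our own assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class SimpleGNN(nn.Module):
    """Sketch of the mean-aggregate / concatenate / sum message passing described above."""
    def __init__(self, in_dim, hid_dim, out_dim, rounds=2):
        super().__init__()
        self.relu1 = nn.Sequential(nn.Linear(2 * in_dim, hid_dim), nn.ReLU())  # ReLU1
        self.relu2 = nn.Sequential(nn.Linear(hid_dim, in_dim), nn.ReLU())      # ReLU2
        self.readout = nn.Linear(in_dim, out_dim)                               # final network f
        self.rounds = rounds

    def forward(self, x, adj):
        # x: (num_nodes, in_dim) node features; adj: (num_nodes, num_nodes) 0/1 adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = x
        for _ in range(self.rounds):
            m = (adj @ h) / deg                          # m_i: mean of the neighbors' features, Eq. (1)
            l = self.relu1(torch.cat([h, m], dim=1))     # l_i = ReLU1(concat(x_i, m_i))
            h = self.relu2(l + adj @ l)                  # h_i = ReLU2(l_i + sum of the neighbors' l_j)
        return self.readout(h.sum(dim=0))                # o = f(sum_i h_i)

# Tiny usage example: a 3-node path graph with 4 features per node and 3 output Q-values.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
gnn = SimpleGNN(in_dim=4, hid_dim=8, out_dim=3)
q_values = gnn(torch.randn(3, 4), adj)
```

Because the same layers are applied at every node and the readout sums over all nodes, a single trained model can be evaluated on graphs (railway lines) of different sizes, which is precisely the property exploited by the Centralized approach in Section 4.2.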

4. Proposed methods

In this section, basic ideas regarding states, actions, and rewards to solve TD are presented. Two approaches are proposed:
Decentralized and Centralized. The difference between them lies mainly in the topology of the state and in the reward mechanism,
and therefore in the ability of the approximating q-function to capture the right policy. When describing the state, both of the
approaches use the concept of resource. A resource is a track or a stopping point that can be occupied by one or more trains.
The majority of models for TD rely on several approximations to make the problem more tractable and the mathematics easier to handle. Using RL allows us to integrate many of these hidden aspects into the environment, introducing a new level of complexity with relatively small effort. One of the most important is train length, which translates into a computational burden for traditional optimization algorithms, whereas here it is a feature directly embedded in the system. To avoid this burden, some models in the literature tend to approximate a train with a point, but this is not the case in real life, where a train may occupy more than one resource at the same time. This is especially critical in railway systems that operate predominantly freight services (such as in the North American market), where trains are often longer than the available infrastructure to accommodate them, and the risk of dispatcher-induced deadlocks is very real.


Fig. 2. Graph and its features.

Fig. 3. Message passing on the graph in Fig. 2.

Another important step is to model safety rules, like the safety distance inside the same track or the role of switches. For each track, we check the safety distance between two consecutive trains. Additionally, we model the rule that if a switch is occupied by a train, then no other train can use the same switch; therefore, the train that has to cross will be held until the switch is freed. Finally, we introduce the minimum headway time between occupations of a track by consecutive trains. Given a pair of trains following each other, we compute the time elapsed since the first train entered the track; if this quantity is smaller than a certain threshold, the headway time, the second train has to wait. In our experiments, the headway time, the safety distance and all the other railway parameters are exogenous values inspired by the US class 1 railroad we are working with. A simple rule has been implemented to avoid deadlocks between two crossing trains in specific circumstances. In short, given a generic train $T_0$ positioned in $R_0$ that wishes to occupy the resource $R_1$, the method evaluates the position of all visible trains $T_i$ and the single-track resources¹ $R_i$, for $i = 1, \ldots, K$,² converging to $R_1$. For each train $T_i$, the flow of resources $\mathcal{F}_i$ from $R_i$ to $R_1$ is then computed. Finally, if there exists at least one $i \in \{1, \ldots, K\}$ such that a train $T_i$ occupies a resource $R_i$ in $\mathcal{F}_i$, then $T_0$ must wait to avoid a deadlock.
The inclusion of all these aspects allows a greater adherence to the real-world problem than many optimization approaches
attain, and the relative ease in doing so is, in our opinion, one of the advantages of a deep RL approach.
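To make the deadlock-avoidance rule above more concrete, the sketch below checks the chain of single-track resources converging to the resource a train wants to enter and holds the train if a crossing train already occupies one of them. It is a deliberate simplification under our own assumptions (resources known by name, each exposing whether it is single-track and which direction its occupant travels); the actual rule in the paper also reasons on the flows $\mathcal{F}_i$ of all visible trains.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Resource:
    name: str
    single_track: bool            # True if the resource has no parallel resource
    occupant_dir: Optional[str]   # "up"/"down" if occupied by a train, None if free

def must_wait(requesting_dir: str, chain_to_target: List[Resource]) -> bool:
    """Hold the requesting train if a crossing train already occupies one of the
    single-track resources converging to the resource it wishes to enter."""
    for res in chain_to_target:
        if not res.single_track:
            break                 # a parallel resource offers a place where trains can meet
        if res.occupant_dir is not None and res.occupant_dir != requesting_dir:
            return True           # opposing train inside the converging single-track flow
    return False

# Example: an up-hill train asks for a resource whose converging chain contains a
# down-hill train two single-track resources ahead -> the train is held.
chain = [Resource("R1", True, None), Resource("R2", True, "down"), Resource("R3", False, None)]
print(must_wait("up", chain))   # True
```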

4.1. Decentralized approach

In this subsection, we present the Decentralized approach, introducing the structure of the state when only partial observability of the rail network is available to the agent.
As studied in Chen et al. (2018) and Yang and Wang (2019), enriching the information available to the agent (i.e. the state) affects the space of policies to be learned. For this reason, the following six features are associated with each resource:

1. status, which is a discrete value chosen among stopping point, track, blocked,³ and failure⁴
2. number of trains
3. train priority (if there is more than one train, the train with the highest priority is reported)
4. direction, which is a discrete value chosen among: follower, crossing, or empty⁵
5. length check, a Boolean identifying whether the train length is less than or equal to the resource size
6. number of parallel resources⁶ with respect to the current one

¹ A single-track resource is a resource that does not have a parallel resource.
² $R_i$ is directly connected with $R_{i+1}$, for all $i = 1, \ldots, K$.
³ Temporarily reserved by another train.
⁴ The resource is out of service.
⁵ A train T1 is a follower for a train T2 if they have the same direction; otherwise T1 is a crossing train for T2.
⁶ Given a resource $R_1$ that connects two resources $R_A$ and $R_B$, a resource $R_2$ is a parallel resource of $R_1$ if $R_2$ also connects $R_A$ and $R_B$.

Fig. 4. The deep network utilized.

As for the learning process, the model that approximates the q-function is a feed-forward deep neural network (FNN) with two fully connected hidden layers of 60 neurons each, and a third layer mapping into the space of the actions. The output of the third layer is then combined with a mask, disabling infeasible moves. Fig. 4 shows the structure of the network. First, the input (the state) passes through a layer of neurons, where each neuron is associated with a ReLU (see Sharma et al. (2017)) activation function. The output of the first layer goes to a second one with the same kind of activation functions. Then, the output is multiplied by the action mask, filtering allowed actions from prohibited ones. An action mask is a vector whose components are equal to one if the corresponding action is allowed, and zero otherwise.
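The network of Fig. 4 can be written compactly as below: two fully connected hidden layers of 60 ReLU units, an output layer over the three actions, and an element-wise multiplication by the binary action mask. The input size (number of visible resources times six features) is an illustrative placeholder.

```python
import torch
import torch.nn as nn

class MaskedQNetwork(nn.Module):
    """Two 60-unit ReLU hidden layers plus an action layer whose output is multiplied
    by the action mask, as described for the Decentralized approach."""
    def __init__(self, n_features: int, n_actions: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, 60), nn.ReLU(),
            nn.Linear(60, 60), nn.ReLU(),
            nn.Linear(60, n_actions),
        )

    def forward(self, state: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask has one component per action: 1 if the move is allowed, 0 otherwise
        return self.body(state) * mask

# Example: a state describing 6 visible resources x 6 features, where "halt" and
# "go to the best resource" are allowed but "go to a reachable resource" is not.
net = MaskedQNetwork(n_features=36)
q = net(torch.randn(36), torch.tensor([1.0, 1.0, 0.0]))
action = int(q.argmax())
```

Note that when all unmasked Q-values happen to be negative, a plain multiplicative mask can make a forbidden action (with value zero) look best; adding a large negative constant to forbidden entries is a common alternative design choice.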
The typical goal of a dispatcher is to take routing and scheduling decisions that minimize some measure of delay. One way to
model these decisions is to establish whether a train can access a specific resource or whether it has to stop/hold before entering
it. The algorithm takes this decision when a train reaches a control point. When possible, the train will take the best-programmed
resource as default. This happens when, for instance, all the resources ahead of the train are free. The best resource is the one
ensuring a minimum programmed running-time for the specific train.
In Fig. 5, for example, we have four resources, 1a, 1b, 1c and 1d, with unimpeded running times of 800, 1000, 1200 and 1300, respectively. The unimpeded running time is the minimum running time that a train needs to go from its current position to its destination (in this example through 1a, 1b, 1c or 1d) if it were never held. Resource 1a represents the best choice, since it accumulates the least unimpeded running time, while the alternatives (1b, 1c, 1d) are all higher.
On the other hand, it may happen that all the next reachable resources are not available, so the train is forced to stop. In all the
other situations, the algorithm acts as if the train were able to take decisions by itself. The state of the system (from the point of
view of the train) does not represent the entire network, but only a limited number of resources ahead and behind. For each visible
resource, the six features discussed before are considered.
Given the state, the action to be taken is one of the following:

• halt
• go to the best resource
• go to a reachable resource

The first action (stopping a train) can always be taken, while the other two possibilities depend on the state.
Fig. 5. Best and reachable resources.

The most delicate part in the Decentralized approach is the definition of the reward. While the global objective is to minimize the total weighted delay, the agent's viewpoint does not allow expressing the reward associated with a state–action pair in terms of the actual impact on the overall network objective. For this reason, the reward is assigned at the end of each episode, and the data collected is then stored as $(s_t, a_t, Q(s_t, a_t), r_t)$, where $Q(s_t, a_t)$ is the output of the neural network. In particular, the reward for each generated state–action pair is given a large penalty value if the episode ends up in a deadlock, a minor penalty if the weighted delay is greater than 1.25 times the minimum weighted delay found so far, and a prize if the weighted delay is less than or equal to 1.25 times the minimum weighted delay found so far. This strategy is inspired by Khadilkar (2018), where the authors adopt a classical matrix-based Q-learning approach on a single track. We define the delay of a train as the difference between its actual running time and its unimpeded running time. The actual running time is the timing of the solution taken by our algorithm, which may differ from the planned one, whereas the unimpeded running time is the minimum running time that a train needs to go from its current position to its destination. By multiplying this value by a priority factor, reflecting how disruptive a delay would be, we obtain the weighted delay of the train considered. The choice of weighting reflects the different priorities of trains, so that a delay has more serious consequences if it refers to a passenger train rather than a freight one. This is, in our experience, a common choice for companies. Summing up all the single weighted delays, we obtain the total weighted delay.
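As a small worked example of the objective, the total weighted delay can be computed as below using the priority coefficients reported later in Section 5.1 (ω1 = 20, ω2 = 10, ω3 = 5, ω4 = 2, ω5 = 1); the train data are made-up numbers, used only to show the arithmetic.

```python
# Priority weights from Section 5.1: omega_1 = 20, ..., omega_5 = 1.
OMEGA = {1: 20, 2: 10, 3: 5, 4: 2, 5: 1}

def total_weighted_delay(trains):
    """Each train is a tuple (priority, actual_running_time, unimpeded_running_time)."""
    return sum(OMEGA[p] * (actual - unimpeded) for p, actual, unimpeded in trains)

# Made-up example: a priority-1 train delayed by 120 and a priority-5 train delayed by 300.
trains = [(1, 1920, 1800), (5, 2700, 2400)]
print(total_weighted_delay(trains))   # 20 * 120 + 1 * 300 = 2700
```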
The last aspect to be discussed is memory management. Since no prior knowledge is used, the composition of the sample is
critical to drive a smooth learning process. Therefore, the memory was divided into three data-sets: best, normal and deadlock. The
best memory stores all the samples ending up in a weighted delay that is less than or equal to 1.25 times the best delay found so far, the normal memory stores the ones with a weighted delay greater than 1.25 times the best found, and the deadlock memory stores all the unsuccessful instances.
Action–state–reward items are stored according to the level of reward if and only if the action mask in the FNN allows more
than one action, so as to strengthen the learning process only on critical moves. Deadlock and normal memory data-sets are never
deleted, while the best memory is updated every time a new best weighted delay is found.
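One possible way to organize the three data-sets is sketched below. The 1.25 threshold and the rule of storing a sample only when the action mask allows more than one action follow the description above; the container layout and the refresh of the best memory on a new best delay are our own reading of the text, not code from the paper.

```python
class ThreeWayMemory:
    """Best / normal / deadlock replay data-sets for the Decentralized approach (sketch)."""
    def __init__(self, threshold: float = 1.25):
        self.best, self.normal, self.deadlock = [], [], []
        self.best_delay = float("inf")
        self.threshold = threshold

    def store_episode(self, samples, n_allowed_actions, weighted_delay, ended_in_deadlock):
        # Keep only samples taken at decision points where more than one action was allowed.
        samples = [s for s, n in zip(samples, n_allowed_actions) if n > 1]
        if ended_in_deadlock:
            self.deadlock.extend(samples)             # never deleted
        elif weighted_delay <= self.threshold * self.best_delay:
            if weighted_delay < self.best_delay:
                self.best_delay = weighted_delay
                self.best = []                        # refresh the best memory on a new best delay
            self.best.extend(samples)
        else:
            self.normal.extend(samples)               # never deleted
```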

4.2. Centralized approach

The idea behind the Centralized approach is that the DeepRL algorithm may take advantage of knowing the state of the overall network at each step, and attempt to learn the particular dynamics of the network thanks to the GNN. In this case, the agent can be seen as a line coordinator, deciding critical issues at control points and predicting the expected effect on the network. This mimics to some extent the behavior of human dispatchers, who have full control over a limited part of the network. Moreover, by considering the whole network as the state, the use of a GNN instead of a feed-forward neural network allows the same neural network to be adopted for different railway sizes. In feed-forward networks, the input is fixed in size, requiring a different model for every railway case, whereas with GNNs the same model applies to multiple railway networks.
In this approach the state is a graph: each node is a resource (track or stopping point) and there is an arc when two resources are linked. With each node is associated a vector of features that describes the characteristics of the train in that resource. This is a one-hot encoded vector with the following characteristics (a construction sketch is given after the list):

• the first 𝑛 bits are reserved for the train’s priority. The 𝑖th bit is set to 1, if the train belongs to the 𝑖th priority class, and 0
otherwise.
• one bit at position 𝑛 + 1, for the train’s direction in that resource
• 𝑚 bits, from position 𝑛 + 2 to position 𝑛 + 2 + 𝑚, are reserved for the class length of the train
• finally, one bit in position 𝑛 + 2 + 𝑚 + 1 states if a decision has to be taken for the train in that resource.
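A sketch of how such a node feature vector could be assembled is shown below. The exact bit layout is only described verbally in the text, so the positions used here (n priority bits, one direction bit, m length-class bits, one decision bit) and the chosen values of n and m are illustrative assumptions.

```python
import numpy as np

def node_features(priority, direction_up, length_class, decision_pending,
                  n_priorities=5, n_length_classes=6):
    """One-hot style feature vector for a resource occupied by a train (sketch).
    Returns an all-zero vector for an empty resource (priority is None)."""
    v = np.zeros(n_priorities + 1 + n_length_classes + 1)
    if priority is None:
        return v
    v[priority - 1] = 1                            # priority-class bits
    v[n_priorities] = 1 if direction_up else 0     # direction bit
    v[n_priorities + 1 + length_class] = 1         # length-class bits
    v[-1] = 1 if decision_pending else 0           # "decision to be taken" bit
    return v

# Example: a priority-2, up-hill train of length class 3 with a pending decision.
print(node_features(priority=2, direction_up=True, length_class=3, decision_pending=True))
```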

The action space is the same as in the Decentralized approach; the reward mechanism, however, takes into account the actual delay in absolute value, rather than a measure of how good such a delay is with respect to the best achievable. More in detail, we define the reward-to-go from a time step $t$ as:

$$R_t = -\sum_{k=t}^{T} r_k$$

where $r_k$ is the difference in delay between step $k$ and step $k - 1$. The reward-to-go is a negative quantity because we are minimizing the delay. The Q-function, given a state $s_t$ and an action $a_t$, is the expected reward-to-go:

$$Q(s_t, a_t) = \mathbb{E}\left[ R_t \mid s = s_t, a = a_t \right] = \mathbb{E}\left[ -\sum_{k=t}^{T} r_k \,\middle|\, s = s_t, a = a_t \right]$$

To estimate the reward-to-go and measure the goodness of an action, we therefore minimize the following loss:

$$\mathcal{L} = \sum_{t=1}^{T} \left( Q(s_t, a_t) - R_t \right)^2 .$$
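The targets $R_t$ and the loss above can be computed directly from the sequence of per-step delay increments, as in the following sketch (the delay values are made up; the predicted Q-values would come from the GNN).

```python
import torch

def reward_to_go(delay_increments: torch.Tensor) -> torch.Tensor:
    """R_t = -sum_{k=t}^{T} r_k, where r_k is the delay increase at step k."""
    # A reversed cumulative sum gives all the tail sums in one pass.
    tail_sums = torch.flip(torch.cumsum(torch.flip(delay_increments, dims=[0]), dim=0), dims=[0])
    return -tail_sums

# Made-up episode: delay increases of 0, 30, 0 and 60 over four decision steps.
r = torch.tensor([0.0, 30.0, 0.0, 60.0])
R = reward_to_go(r)                       # tensor([-90., -90., -60., -60.])
q_pred = torch.zeros(4)                   # stand-in for Q(s_t, a_t) predicted by the GNN
loss = torch.sum((q_pred - R) ** 2)       # the loss L minimized during training
```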

5. Numerical results

The test bed for the experiments is inspired by data provided by a U.S. class 1 railroad. The two algorithms were compared to the linear Q-learning algorithm proposed in Khadilkar (2018), since the primary aim of this paper is to show that DeepRL is superior to linear Q-learning. In particular, the quality of the solutions in terms of final delay is the main driver used to assess the quality of an algorithm. As for computation time, all the procedures need less than one second to complete an episode, and therefore are well suited for this online/real-time application.
All tests were performed on an Intel i5 processor, with no use of GPUs. Our experiments take into account railway networks with different numbers of resources, numbers of trains and train priorities.
The experiments were conducted to investigate the following properties:

1. the ability to solve unseen instances generated from the same distribution
2. the ability to generalize when the number and the length of the trains increase
3. the ability to generalize when the number of resources in the railway network changes.

5.1. First experiment

In this experiment the considered railway network is mainly characterized by the alternation of one station and one track (single-track), but it also includes some parts of the network where two stations are connected by two single tracks or more (multiple-track). In total, the network has 134 resources (tracks and stopping points); 15 tracks and 29 stopping points have at least one parallel resource, while 33 are single-track resources. The time window is two hours. In Fig. 6, we report a small section of our railway to show what we consider as a stopping point and as a track. Resources 2a and 2b are stopping points where trains can meet or pass. Elements 1, 3, 4a, 4b, 5a and 5b are tracks, where the pairs 4a, 4b and 5a, 5b are parallel, meaning they are connected to the same stations. We refer to both tracks and stopping points as resources. The track is effectively the portion of the network that connects two stations. Stations can have stopping points or they can be crossovers, like C1 or C2, where a crossover is composed of one or more switches connecting two parallel tracks, or a parallel track and a single one. We do not model the switches, considering them as part of the next track. As explained in Section 4, if a train occupies a switch no other train can use it.

Fig. 6. Railway example.
Traffic characteristics for each instance are described in terms of: number of trains, position, direction, priority and length.
The range for each parameter is realistic, as again it is inspired by input provided by the U.S. class 1 railroad. More specifically:

• the number of trains 𝑁 is chosen randomly between 4 and 10 with the following probability distribution: 𝑃 (𝑁 = 4) = 0.1,
𝑃 (𝑁 = 5) = 0.2, 𝑃 (𝑁 = 6) = 0.2, 𝑃 (𝑁 = 7) = 0.2, 𝑃 (𝑁 = 8) = 0.15, 𝑃 (𝑁 = 9) = 0.1, 𝑃 (𝑁 = 10) = 0.05
• the position is chosen using a uniform distribution on the available resources
• the direction is chosen uniformly
• the priority 𝐴 is chosen randomly between 1 and 5 with the following probability distribution 𝑃 (𝐴 = 1) = 0.05, 𝑃 (𝐴 = 2) = 0.15,
𝑃 (𝐴 = 3) = 0.23, 𝑃 (𝐴 = 4) = 0.27, 𝑃 (𝐴 = 5) = 0.3
• the length is chosen uniformly in the set {4000, 4500, 5000, 5500, 6000, 6500}, expressed in feet

Given the above-mentioned distributions, to generate an instance we repeat the following steps (a sampling sketch is given after the list):

1. select the number of trains 𝑁


2. for each train in 𝑁:

(a) select its initial position


(b) select its direction (down-hill, up-hill)
(c) select its priority
(d) select its length.
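A sketch of this instance-generation procedure, using the distributions listed in this section, is given below; the resource identifiers are placeholders and the lengths are in feet, as stated above.

```python
import random

N_TRAINS = ([4, 5, 6, 7, 8, 9, 10], [0.1, 0.2, 0.2, 0.2, 0.15, 0.1, 0.05])
PRIORITIES = ([1, 2, 3, 4, 5], [0.05, 0.15, 0.23, 0.27, 0.3])
LENGTHS = [4000, 4500, 5000, 5500, 6000, 6500]            # feet

def generate_instance(available_resources):
    n = random.choices(N_TRAINS[0], weights=N_TRAINS[1])[0]
    trains = []
    for _ in range(n):
        trains.append({
            "position": random.choice(available_resources),        # uniform over the resources
            "direction": random.choice(["down-hill", "up-hill"]),   # uniform
            "priority": random.choices(PRIORITIES[0], weights=PRIORITIES[1])[0],
            "length": random.choice(LENGTHS),                       # uniform, in feet
        })
    return trains

# Example with placeholder resource names for the 134-resource line of this experiment.
instance = generate_instance([f"res_{i}" for i in range(134)])
```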


Table 1
Basic statistics on 100 test instances from the same distribution of the training.
Delay Linear Q-learning Decentralized Centralized
Minimum 886 0 0
Average 42 399.04 9914.864 9295.297
Maximum 174 262 30 578 35 441
Std dev 35 422.18 7287.825 7398.979
# deadlocks 4 5 4

Table 2
Wins, draws and defeats on 100 network test instances.
Win/Draw/Loss Centralized Decentralized Q-Learning
Centralized – 44-47-9 93-0-7
Decentralized 9-47-44 – 96-0-4
Q-Learning 7-0-93 4-0-96 –

Priority, direction and position are sampled according to the probability distributions described above also in the next experiments. Additionally, the delay is multiplied by a penalty factor to express the relative importance of a train class and its
priority. Given a priority 𝐴 ∈ {1, 2, 3, 4, 5}, the coefficient 𝜔𝐴 is such that 𝜔1 = 20, 𝜔2 = 10, 𝜔3 = 5, 𝜔4 = 2 and 𝜔5 = 1. Delays are
affected by a penalty factor higher than one, which is linked to the priority. Since we are reporting delays multiplied by this factor,
their values may appear high.
The models were trained on 100 randomly generated instances, with the fixed rail network described above, and tested on 100
unseen instances with the same specifications. The training phase involved running 10 000 episodes for each algorithm. An episode
is stopped either when all trains are at their last resource or after a 2-hour plan is produced. In this context, we define an episode
as the production of a 2-hour plan. In Table 1 we show some statistics, like minimum, average and maximum delay, its standard
deviation and the number of deadlocks. In Table 2, we report the number of instances won, drawn, and lost by each approach (Centralized, Decentralized and Q-Learning). We have a win when the delay found is smaller, a draw when it is comparable, and a loss when it is higher with respect to the values obtained by the other approach in the comparison. Looking at the results, we see that the Centralized approach reaches a lower delay than the Decentralized one in 44 instances, a comparable one in 47, and a higher one in only 9 of them. Comparing Centralized and Q-Learning, in 93 instances the Centralized approach reaches a lower delay than Q-Learning, and similarly for the Decentralized approach. This highlights that the Centralized approach has the best performance on
this test set. Fig. 7 reports the performance profiles for the three algorithms w.r.t. the value of the delay obtained on 100 instances from the same distribution as the training set. Performance profiles have been used as proposed in Dolan and Moré (2002). Given a set of solvers $\mathcal{S}$ and a set of problems $\mathcal{P}$, the performance profile takes as input the ratio between the performance (i.e. the value of the weighted delay) of a solver $i \in \mathcal{S}$ on problem $p \in \mathcal{P}$ and the best performance obtained by any solver in $\mathcal{S}$ on the same problem. Consider the cumulative function $\rho_i(\tau) = |\{p \in \mathcal{P} : r_{p,i} \le \tau\}| / |\mathcal{P}|$, where $t_{p,i}$ is the delay and $r_{p,i} = t_{p,i} / \min\{t_{p,i'} : i' \in \mathcal{S}\}$. The performance profile is the plot of the functions $\rho_i(\tau)$ for $i \in \mathcal{S}$. Informally, the higher the curve, the better the corresponding algorithm. Here, the performance of the algorithms is determined by comparing the delays obtained on each instance. The graph shows the superiority of the deep architectures with respect to the simple linear one, which is reasonable given the known complexity of the problem. In general, the Centralized model seems to perform better than the Decentralized one.
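For reference, the performance-profile curves of Fig. 7 can be computed from a matrix of delays (one row per instance, one column per algorithm) as in the sketch below; the delay values used in the example are made up.

```python
import numpy as np

def performance_profile(delays: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """delays[p, i] is the weighted delay of solver i on problem p.
    Returns rho[i, k] = fraction of problems with ratio r_{p,i} <= taus[k]."""
    best = np.maximum(delays.min(axis=1, keepdims=True), 1e-9)   # guard against zero best delays
    ratios = delays / best                                       # r_{p,i} = t_{p,i} / min_{i'} t_{p,i'}
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus]).T

# Made-up delays for 4 instances and 3 algorithms (Centralized, Decentralized, Q-learning).
delays = np.array([[100., 120., 400.],
                   [ 90.,  90., 250.],
                   [200., 180., 600.],
                   [ 50.,  70., 300.]])
rho = performance_profile(delays, np.linspace(1.0, 5.0, 9))      # one curve per algorithm
```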
Table 1 summarizes some basic performance statistics. As one can see, the average delay of the DeepRL approaches is
considerably smaller, so DeepRL captures inner non-linearities more efficiently and appears more effective than linear Q-learning.

5.2. Second experiment

As a second experiment, we test the ability of the three algorithms to generalize to longer trains with different frequencies.
Longer trains in a network increase the probability of ending up in a deadlock. In fact, long trains can simultaneously occupy more than one resource and one or more switches. Even very simple configurations (e.g. one station with two stopping points and two crossing trains occupying both stopping points and switches) can lead to deadlocks. In other words, taking into account the length of trains introduces a whole new level of complexity, which is especially relevant for freight-based traffic, where trains tend to be very long. The time window considered is two hours.
In particular, for the test set the new parameters adopted are:

• Number of trains 𝑁; chosen randomly between 4 and 12 with the following probability distribution: 𝑃 (𝑁 = 4) = 0.05,
𝑃 (𝑁 = 5) = 0.15, 𝑃 (𝑁 = 6) = 0.15, 𝑃 (𝑁 = 7) = 0.15, 𝑃 (𝑁 = 8) = 0.15, 𝑃 (𝑁 = 9) = 0.1, 𝑃 (𝑁 = 10) = 0.1, 𝑃 (𝑁 = 11) = 0.1,
𝑃 (𝑁 = 12) = 0.05
• Length; chosen uniformly in the set {4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000}, to be intended in feet

The focus is on these two parameters, since it has been observed empirically that they seem to be major factors influencing the complexity of a TD problem.
In this case, the DeepRL Decentralized model is trained for 20 000 episodes, whereas the linear Q-learning is trained for 50 000; for the Centralized model we use the same training as in the previous experiment.


Fig. 7. Performance profiles for the three algorithms compared together based on the value of the delay obtained on 100 instances of the same distribution of
the training set.

Table 3
Basic statistics on 200 test instances with more and longer trains than the training set.
Delay Linear Q-learning Decentralized Centralized
Minimum 0 0 0
Average 50 573.49 13 966.1 13 012.775
Maximum 256 883.1 86 309.01 54 979
Std dev 52 318.19 14 092.37 12 045.295
# deadlocks 18 26 14

Table 4
Wins, draws and defeats on 200 test instances with more and longer trains than the training set.
Win/Draw/Loss Centralized Decentralized Q-Learning
Centralized – 107-70-23 182-3-15
Decentralized 23-70-107 – 189-3-8
Q-Learning 15-3-182 8-3-189 –

The Q-learning and Decentralized DeepRL models were trained on 200 instances from the distribution described in the first experiment (except for length and number of trains) and tested on 200 instances with the specifications given above. Table 3 summarizes some basic performance statistics for this experiment. Both the Decentralized and the Centralized approaches outperform Q-Learning; moreover, the Centralized approach produces almost half as many deadlocks as the Decentralized one and, as in the previous experiment, its average delay is lower. Linear Q-Learning appears to find fewer deadlocks than the Decentralized approach. However, this is arguably induced by the fact that, in many test instances, the Q-learning based algorithm halts trains before they can reach a potential deadlock. This leads to huge delays, but falls short of explicitly creating a deadlock. In other words, these results are somewhat biased, since in real life a plan where trains are halted continuously (like those produced here by the linear Q-Learning approach) would be deemed equally unacceptable.
A last observation on the number of deadlocks is that all the algorithms compared in Table 3 act essentially as greedy heuristics powered by learning. For this reason, their greedy nature may lead to unwanted solutions, which are more evident when the model cannot see the entire network, as in the Decentralized approach. We must not forget that the test instances are meant to also consider overcrowded situations, where a feasible solution may be hard to reach for a heuristic method. In our experience, training for longer periods or increasing the number of weights in the models may alleviate this issue. Regarding the comparison among the different approaches, also in this case both the Centralized and the Decentralized approaches outperform Q-learning. In Table 4, we can see that in 107 of the 200 test instances the Centralized approach returns a lower delay than the Decentralized one, while in 182 test instances it returns a lower delay than Q-Learning. Fig. 8 shows the performance profiles in this case.

5.3. Third experiment: different sizes of network

In this last experiment, we test the generalization capability when increasing the number of resources in the network. We define three different railways, R1, R2 and R3, with the same time window of two hours. Some characteristics of the railways are shown in Table 5. Railway R1 has the following parameters in terms of train traffic characteristics:


Fig. 8. Performance profiles for the three algorithms based on the value of the delay obtained on 100 instances with a higher number of trains (10 to 15) than the training set.

Table 5
Railway network characteristics.
Railway Total resources Stopping points Single track resources Double tracks
𝑅1 155 79 52 12
𝑅2 180 96 60 12
𝑅3 196 106 60 15

Table 6
Basic statistics on a new railway with 155 resources.
Delay Linear Q-learning Decentralized Centralized
Minimum 1572 0 0
Average 54 611.55 16 597.912 14 074.175
Maximum 153 198.146 51 952.006 47 270.46
Std dev 43 666.76 14 144.157 12 861.334
# deadlocks 5 5 5

• Number of trains 𝑁; chosen randomly between 6 and 13 with the following probability distribution: 𝑃 (𝑁 = 6) = 0.05,
𝑃 (𝑁 = 7) = 0.1, 𝑃 (𝑁 = 8) = 0.1, 𝑃 (𝑁 = 9) = 0.15, 𝑃 (𝑁 = 10) = 0.15, 𝑃 (𝑁 = 11) = 0.1, 𝑃 (𝑁 = 12) = 0.1, 𝑃 (𝑁 = 13) = 0.1
• Length; chosen uniformly in the set {4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000}

Railways 𝑅2 and 𝑅3 have the following parameters in terms of train traffic characteristics:

• Number of trains 𝑁; chosen randomly between 7 and 15 with the following probability distribution: 𝑃 (𝑁 = 7) = 0.05,
𝑃 (𝑁 = 8) = 0.1, 𝑃 (𝑁 = 9) = 0.15, 𝑃 (𝑁 = 10) = 0.15, 𝑃 (𝑁 = 11) = 0.15, 𝑃 (𝑁 = 12) = 0.15, 𝑃 (𝑁 = 13) = 0.1, 𝑃 (𝑁 = 14) = 0.1,
𝑃 (𝑁 = 15) = 0.05
• Length; chosen uniformly in the set {4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000}

We generated 50 test instances for each railway configuration. We did not train our networks on these new infrastructures; rather, we use the models trained in the first experiment. Tables 6-8 report some statistics for the three approaches. As shown in the tables, the two Deep Learning approaches outperform linear Q-Learning, while the Centralized one again obtains better results in terms of weighted delay with respect to the Decentralized algorithm. In Tables 9-11 we can see that in about two thirds of the instances the Centralized approach reaches a lower delay than the Decentralized one, and a comparable one in roughly a fourth of them.

6. Mixed integer programming comparison

Finally, we compare our results for the Centralized approach with those obtained using a mixed integer program (MIP). The formulation is taken from Samà et al. (2017) and it models both the routing and the scheduling of trains. However, several of the features covered by our Reinforcement Learning model were not easy to include in a mathematical formulation, and we therefore had to neglect some of them: for example, a crucial aspect such as modeling the train length, and therefore its actual occupation of resources. The MIP model would have suffered computationally with this length adjustment, and it would not have been a fair


Table 7
Basic statistics on a new railway with 180 resources.
Delay Linear Q-learning Decentralized Centralized
Minimum 5412 1074 392
Average 68 347.693 27 082.873 17 157.326
Maximum 246 231.062 112 981.007 52 735.47
Std dev 48 582.189 24 898.419 14 324.176
# deadlocks 4 11 14

Table 8
Basic statistics on a new railway with 196 resources.
Delay Linear Q-learning Decentralized Centralized
Minimum 4846 0 0
Average 61 033.829 21 250.22 16 844.557
Maximum 159 952.036 61 314 56 500
Std dev 39 812.026 15 701.39 12 972.443
# deadlocks 7 9 11

Table 9
Wins, draws and defeats on a new railway with 155 resources.
Win/Draw/Loss Centralized Decentralized Q-Learning
Centralized – 32-11-7 47-0-3
Decentralized 7-11-32 – 47-0-3
Q-Learning 3-0-47 3-0-47 –

Table 10
Wins, draws and defeats on new railway with 180 resources.
Win/Draw/Loss Centralized Decentralized Q-Learning
Centralized – 29-16-5 39-0-11
Decentralized 5-16-29 – 41-0-9
Q-Learning 11-0-39 9-0-41 –

Table 11
Wins, draws and defeats on a new railway with 196 resources.
Win/Draw/Loss Centralized Decentralized Q-Learning
Centralized – 30-14-6 44-0-6
Decentralized 6-14-30 – 46-0-4
Q-Learning 6-0-44 4-0-46 –

comparison. For this reason, solutions of the MIP model with a lower value than the RL solutions may potentially be infeasible. We report a brief description of the model adopted from Samà et al. (2017) in the Appendix.
We ran the tests on the same machine as in the previous experiments, and we report the results in Table 12. To recap, the five different test sets were characterized by:

• 100 test instances on 137 resources (100 T 137 R)


• 200 test instances on 137 resources (200 T 137 R)
• 50 test instances on 155 resources (50 T 155 R)
• 50 test instances on 180 resources (50 T 180 R)
• 50 test instances on 196 resources (50 T 196 R)

We ran each test instance for 10 min. For each of the five test sets we consider:

• the number of instances solved within 10 min, where an instance is considered solved if an incumbent is found within the given time limit (Solved MIP).
• the number of instances solved both by the MIP approach and by RL, meaning the RL algorithm does not end in a deadlock (Both solved).
• the average delay according to the MIP approach over the instances in the set Solved MIP (Avg delay (MIP)).


Table 12
Comparison between the MIP approach and the Centralized RL approach on the five test sets.
Test Solved (MIP) Avg delay (MIP) Avg delay 2 (MIP) Avg delay (RL) Both solved
100 T 137 R 98 8779 8716 9387 93
200 T 137 R 187 11 712 11 503 12 893 173
50 T 150 R 37 23 550 22 960 14 666 28
50 T 185 R 41 30 339 27 494 16 936 28
50 T 197 R 43 32 926 32 504 16 379 33

• the average delay according to the MIP approach over the instances in the set Both solved (Avg delay 2 (MIP)).
• the average delay according to the RL approach over the instances in the set Both solved (Avg delay (RL)).

As we can see, when the number of resources and/or trains increases, the number of variables also grows, and the number of instances solved by the MIP decreases accordingly. At the same time, the average delay found by the MIP approach becomes higher than the one found by RL: while the MIP values are lower on the first two test groups, they are significantly higher on the remaining sets. In particular, on test sets 100 T 137 R and 200 T 137 R, which have the smallest number of resources and trains, the average delay obtained by the MIP approach is slightly lower than that of the Centralized approach. However, when the number of resources grows (from 137 to 155, 180 and 196) and the maximum number of trains is higher (from 10 and 12 to 13 and 15), the MIP average delay on the Both solved instances is roughly 1.5 times the RL one for 50 T 155 R and 50 T 180 R, and almost twice the RL value for 50 T 196 R.
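For reference, these ratios follow directly from the Avg delay 2 (MIP) and Avg delay (RL) columns of Table 12; the short snippet below recomputes them.

```python
# Ratio of MIP average delay to RL average delay on the "Both solved" instances
# (values taken from Table 12).
for label, mip, rl in [("50 T 155 R", 22960, 14666),
                       ("50 T 180 R", 27494, 16936),
                       ("50 T 196 R", 32504, 16379)]:
    print(label, round(mip / rl, 2))   # -> 1.57, 1.62, 1.98
```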

7. Conclusions

This study compares the use of Deep Q-learning with linear Q-learning for tackling the train dispatching problem. Two Deep Q-learning approaches were proposed: Decentralized and Centralized. The former considers a train as an agent with a limited perception of the rail network. The latter observes the entire network and uses a GNN to estimate the rewards, which allows the size of the railway network to change without training a new neural network every time. Computational results inspired by data provided by a U.S. class 1 railroad show that the deep approaches perform better than the linear case. The generalization to larger problems, both in the number of trains and in the size of the railway, leaves room for improvement, both in the search strategy and in the complexity of the tested networks and instances. When the instances are generated by the same distribution as the training set, the algorithms deal efficiently with the problem, providing solutions in a very short time. This aspect is crucial in train dispatching, as indeed in any online/real-time planning problem, and in our opinion it is one of the reasons that makes this research direction interesting. While solution algorithms based on different paradigms (e.g., Optimization, CP) have proven to tackle the problem effectively in certain cases, scaling and computational-burden issues are always around the corner. As shown in Section 6, we compared our Centralized approach with a MILP formulation, showing that the discrepancy between the two methods grows with the complexity of the problem. In future research, we may compare our approach with, or embed our heuristics into, more complex models and algorithms such as those of Pellegrini et al. (2015) and Corman et al. (2017). Moreover, Deep Reinforcement Learning (as ML-based approaches in general) has the advantage of shifting the computational burden to the algorithm's learning stage. Unlike in enumerative algorithms, online response time then becomes basically negligible. In other words, under the assumption that implicitly mapping states, actions and rewards (which presents its own, clear scaling issues) can be done effectively, Deep RL could represent a breakthrough in this application. For this reason, and the others listed in the Contribution paragraph in Section 1, we believe that it is worth investigating and further advancing this research direction.
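To illustrate why the online response time of a trained Deep Q-network is negligible, the following sketch shows that a dispatching decision reduces to a single forward pass followed by an argmax; the state encoding and network sizes are placeholders, not those used in our models.

```python
# A dispatching decision with a trained Deep Q-network is one forward pass plus an argmax.
import torch
import torch.nn as nn

state_dim, n_actions = 32, 4                       # hypothetical sizes
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

state = torch.randn(1, state_dim)                  # stand-in for an encoded traffic situation
with torch.no_grad():
    action = torch.argmax(q_net(state), dim=1).item()   # greedy dispatching decision
print("chosen action:", action)
```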

CRediT authorship contribution statement

Valerio Agasucci: Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization,
Writing – original draft, Writing – review & editing. Giorgio Grani: Conceptualization, Data curation, Formal analysis, Investigation,
Methodology, Resources, Software, Validation, Visualization, Supervision, Writing – original draft, Writing – review & editing.
Leonardo Lamorgese: Resources, Writing – original draft, Writing – review & editing.

Declaration of competing interest

None.

Appendix

We represent the railway network as a graph G = (N, F, A), where each node n ∈ N is a resource. There are two types of arcs: fixed and disjunctive. For each train and each routing, fixed arcs connect two subsequent resources, while disjunctive arcs represent logical alternatives, of which at most one can be activated. A disjunctive arc models precedence constraints on a resource that can be shared by two trains.
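As an illustration, a possible in-memory representation of this graph is sketched below; the class and field names are our own choices for exposition, not structures used in the paper.

```python
from dataclasses import dataclass, field

# A possible layout for the graph G = (N, F, A). Names are illustrative only.

@dataclass(frozen=True)
class FixedArc:          # element of F
    train: int           # train k
    routing: int         # routing r of train k
    tail: str            # resource p
    head: str            # resource j, visited right after p on this routing
    weight: float        # traversal time w_{krp,krj}

@dataclass(frozen=True)
class DisjunctivePair:   # element of A: two mutually exclusive precedences on a shared resource p
    first: tuple         # arc (krj, ump): train u enters p only after train k has reached j
    second: tuple        # arc (umi, krp): train k enters p only after train u has reached i
    weight: float        # headway/release time w^A

@dataclass
class RailwayGraph:
    resources: set = field(default_factory=set)         # node set N
    fixed: list = field(default_factory=list)            # fixed arcs F
    disjunctive: list = field(default_factory=list)      # disjunctive pairs A
```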


The first group of variables consists of the routing variables y_{um}, which are equal to one if routing m is chosen for train u and zero otherwise. The second group consists of the disjunctive variables x_{(krj,ump),(umi,krp)}, which are driven to one if train k on routing r enters resource p before train u on routing m, and zero otherwise. Finally, the variables t_{krj} are associated with time, namely the instant at which train k enters resource j when following routing r.
The resulting mathematical formulation is as follows:

$$
\begin{aligned}
\min\quad & \sum_{n \in T} p_n\, t_n && \\
\text{s.t.}\quad & t_{krj} - t_{krp} \;\ge\; w_{krp,krj} - M\,(1 - y_{kr}), && \forall\,(krp,\,krj) \in F \qquad (i)\\
& t_{ump} - t_{krj} \;\ge\; w^{A}_{krj,ump} - M\,(2 - y_{um} - y_{kr}) - M\,\bigl(1 - x_{(krj,ump),(umi,krp)}\bigr), && \forall\,\bigl((krj,ump),(umi,krp)\bigr) \in A \qquad (ii)\\
& t_{krp} - t_{umi} \;\ge\; w^{A}_{umi,krp} - M\,(2 - y_{um} - y_{kr}) - M\, x_{(krj,ump),(umi,krp)}, && \forall\,\bigl((krj,ump),(umi,krp)\bigr) \in A \qquad (iii)\\
& \sum_{a=1}^{R_b} y_{ab} = 1, && \forall\, b \in \{1,\dots,Z\} \qquad (iv)\\
& x_{(krj,ump),(umi,krp)} \in \{0,1\}, \quad y_{um} \in \{0,1\}, \quad t_{krj} \in \mathbb{R}_{+} &&
\end{aligned}
$$

Constraints (i) ensure that, if train k chooses routing r, then the entry times into two subsequent resources p and j of that routing differ by at least the time needed to traverse resource p. Constraints (ii)–(iii) are precedence constraints for trains that share the same resource: when both routings are selected, exactly one of the two orderings is enforced, according to the value of the disjunctive variable. Constraints (iv) ensure that each train uses exactly one routing.
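To make the structure of the model concrete, the following is a minimal sketch of how it could be implemented with an off-the-shelf MILP library (here PuLP, assuming it and its bundled CBC solver are available). The toy instance, the resource names, the traversal times and the big-M value are illustrative assumptions, not data from the paper.

```python
# Minimal PuLP sketch of the formulation above: two trains, one routing each, one shared
# resource B. Train 1 traverses A -> B -> C, train 2 traverses D -> B -> E.
import pulp

M = 1e4  # big-M constant (illustrative)

prob = pulp.LpProblem("toy_dispatching", pulp.LpMinimize)

# Routing variables y[train, routing] (one routing per train in this toy example).
y = {(1, "r"): pulp.LpVariable("y_1r", cat="Binary"),
     (2, "m"): pulp.LpVariable("y_2m", cat="Binary")}

# Entry-time variables t[train, routing, resource].
events = [(1, "r", "A"), (1, "r", "B"), (1, "r", "C"),
          (2, "m", "D"), (2, "m", "B"), (2, "m", "E")]
t = {e: pulp.LpVariable(f"t_{e[0]}_{e[1]}_{e[2]}", lowBound=0) for e in events}

# Disjunctive variable: x = 1 if train 1 enters the shared resource B before train 2.
x = pulp.LpVariable("x_B", cat="Binary")

# Objective: entry times into the last resource of each routing (priorities p_n = 1).
prob += t[(1, "r", "C")] + t[(2, "m", "E")]

# (i) fixed-arc constraints: traversal times along each chosen routing.
fixed = [((1, "r", "A"), (1, "r", "B"), 3), ((1, "r", "B"), (1, "r", "C"), 4),
         ((2, "m", "D"), (2, "m", "B"), 2), ((2, "m", "B"), (2, "m", "E"), 5)]
for tail, head, w in fixed:
    k, r = tail[0], tail[1]
    prob += t[head] - t[tail] >= w - M * (1 - y[(k, r)])

# (ii)-(iii) disjunctive constraints on the shared resource B (headway w_A = 1).
w_A = 1
prob += t[(2, "m", "B")] - t[(1, "r", "C")] >= w_A - M * (2 - y[(1, "r")] - y[(2, "m")]) - M * (1 - x)
prob += t[(1, "r", "B")] - t[(2, "m", "E")] >= w_A - M * (2 - y[(1, "r")] - y[(2, "m")]) - M * x

# (iv) each train chooses exactly one routing (trivial with a single routing per train).
prob += y[(1, "r")] == 1
prob += y[(2, "m")] == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], {v.name: v.value() for v in prob.variables()})
```

With both routing variables fixed to one, the solver simply chooses the ordering of the two trains on the shared resource B that minimizes the sum of their completion times.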

References

Adenso-Díaz, B., González, M. Oliva, González-Torre, P., 1999. On-line timetable re-scheduling in regional train services. Transp. Res. B 33 (6), 387–398.
Almasan, Paul, Suárez-Varela, José, Badia-Sampera, Arnau, Rusek, Krzysztof, Barlet-Ros, Pere, Cabellos-Aparicio, Albert, 2019. Deep reinforcement learning meets
graph neural networks: Exploring a routing optimization use case. arXiv preprint arXiv:1910.07421.
Bertsekas, Dimitri P., 2019. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, MA.
Boccia, M., Mannino, C., Vasiliev, I., 2012. Solving the dispatching problem on multi-track territories by mixed integer linear programming. In: Proc. RAS
Competition/INFORMS Meet.. pp. 1–16.
Boccia, Maurizio, Mannino, Carlo, Vasilyev, Igor, 2013. The dispatching problem on multitrack territories: Heuristic approaches based on mixed integer linear
programming. Networks 62 (4), 315–326.
Cacchiani, Valentina, Huisman, Dennis, Kidd, Martin, Kroon, Leo, Toth, Paolo, Veelenturf, Lucas, Wagenaar, Joris, 2014. An overview of recovery models and
algorithms for real-time railway rescheduling. Transp. Res. B 63, 15–37.
Cai, Xiaoqiang, Goh, C.J., Mees, Alistair I., 1998. Greedy heuristics for rapid scheduling of trains on a single track. IIE Trans. 30 (5), 481–493.
Caprara, Alberto, Fischetti, Matteo, Toth, Paolo, 2002. Modeling and solving the train timetabling problem. Oper. Res. 50 (5), 851–861.
Chen, Yichen, Li, Lihong, Wang, Mengdi, 2018. Scalable bilinear 𝜋 learning using state and action features. arXiv preprint arXiv:1804.10328.
Corman, Francesco, D’Ariano, Andrea, Marra, Alessio D, Pacciarelli, Dario, Samà, Marcella, 2017. Integrating train scheduling and delay management in real-time
railway traffic control. Transp. Res. E 105, 213–239.
Corman, Francesco, Meng, Lingyun, 2014. A review of online dynamic models and algorithms for railway traffic management. IEEE Trans. Intell. Transp. Syst.
16 (3), 1274–1284.
Dolan, Elizabeth D., Moré, Jorge J., 2002. Benchmarking optimization software with performance profiles. Math. Program. 91 (2), 201–213.
Drori, Iddo, Kharkar, Anant, Sickinger, William R, Kates, Brandon, Ma, Qiang, Ge, Suwen, Dolev, Eden, Dietrich, Brenda, Williamson, David P, Udell, Madeleine,
2020. Learning to solve combinatorial optimization problems on real-world graphs in linear time. arXiv preprint arXiv:2006.03750.
Fang, Wei, Yang, Shengxiang, Yao, Xin, 2015. A survey on problem models and solution approaches to rescheduling in railway networks. IEEE Trans. Intell.
Transp. Syst. 16 (6), 2997–3016.
Ghasempour, Taha, Heydecker, Benjamin, 2019. Adaptive railway traffic control using approximate dynamic programming. Transp. Res. C.
Higgins, Andrew, Kozan, Erhan, Ferreira, Luis, 1997. Heuristic techniques for single line train scheduling. J. Heuristics 3 (1), 43–62.
Khadilkar, Harshad, 2018. A scalable reinforcement learning algorithm for scheduling railway lines. IEEE Trans. Intell. Transp. Syst. 20 (2), 727–736.
Khalil, Elias, Dai, Hanjun, Zhang, Yuyu, Dilkina, Bistra, Song, Le, 2017. Learning combinatorial optimization algorithms over graphs. In: Advances in Neural
Information Processing Systems. pp. 6348–6358.
Lamorgese, Leonardo, Mannino, Carlo, 2015. An exact decomposition approach for the real-time train dispatching problem. Oper. Res. 63 (1), 48–64.
Lamorgese, Leonardo, Mannino, Carlo, Pacciarelli, Dario, Krasemann, Johanna Törnquist, 2018. Train dispatching. Handb. Optim. Railw. Ind. 265–283.
Liao, Jinlin, Yang, Guang, Zhang, Shiwen, Zhang, Feng, Gong, Cheng, 2021. A deep reinforcement learning approach for the energy-aimed train timetable
rescheduling problem under disturbances. IEEE Trans. Transp. Electrif. 7 (4), 3096–3109.
Narayanaswami, Sundaravalli, Rangaraj, Narayan, 2011. Scheduling and rescheduling of railway operations: A review and expository analysis. Technol. Oper.
Manage. 2 (2), 102–122.
Narayanaswami, Sundaravalli, Rangaraj, Narayan, 2013. Modelling disruptions and resolving conflicts optimally in a railway schedule. Comput. Ind. Eng. 64 (1),
469–481.
Ning, Lingbin, Li, Yidong, Zhou, Min, Song, Haifeng, Dong, Hairong, 2019. A deep reinforcement learning approach to high-speed train timetable rescheduling
under disturbances. In: 2019 IEEE Intelligent Transportation Systems Conference. ITSC, IEEE, pp. 3469–3474.
Obara, Mitsuaki, Kashiyama, Takehiro, Sekimoto, Yoshihide, 2018. Deep reinforcement learning approach for train rescheduling utilizing graph theory. In: 2018
IEEE International Conference on Big Data (Big Data). IEEE, pp. 4525–4533.


Pellegrini, Paola, Marlière, Grégory, Pesenti, Raffaele, Rodriguez, Joaquin, 2015. RECIFE-MILP: An effective MILP-based heuristic for the real-time railway traffic
management problem. IEEE Trans. Intell. Transp. Syst. 16 (5), 2609–2619.
Samà, Marcella, Corman, Francesco, Pacciarelli, Dario, et al., 2017. A variable neighbourhood search for fast train scheduling and routing during disturbed
railway traffic situations. Comput. Oper. Res. 78, 480–499.
Scarselli, Franco, Gori, Marco, Tsoi, Ah Chung, Hagenbuchner, Markus, Monfardini, Gabriele, 2008. The graph neural network model. IEEE Trans. Neural Netw.
20 (1), 61–80.
Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis,
Graepel, Thore, et al., 2019. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265.
Šemrov, Darja, Marsetič, Rok, Žura, Marijan, Todorovski, Ljupčo, Srdic, Aleksander, 2016. Reinforcement learning approach for train rescheduling on a single-track
railway. Transp. Res. B 86, 250–267.
Sharma, Sagar, Sharma, Simone, Athaiya, Anidhya, 2017. Activation functions in neural networks. Towards Data Sci. 6 (12), 310–316.
Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan,
Graepel, Thore, et al., 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.
Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, et al., 2018. A general reinforcement learning
algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), 1140–1144.
Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian,
et al., 2017b. Mastering the game of go without human knowledge. Nature 550 (7676), 354–359.
Sutton, Richard S., Barto, Andrew G., 2018. Reinforcement Learning: An Introduction. MIT Press.
Törnquist, Johanna, 2006. Computer-based decision support for railway traffic scheduling and dispatching: A review of models and algorithms. In: 5th Workshop
on Algorithmic Methods and Models for Optimization of Railways (ATMOS’05). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Törnquist, Johanna, Persson, Jan A., 2005. Train traffic deviation handling using tabu search and simulated annealing. In: Proceedings of the 38th Annual Hawaii
International Conference on System Sciences. IEEE, p. 73a.
Wang, Xuekai, D’Ariano, Andrea, Su, Shuai, Tang, Tao, 2023. Cooperative train control during the power supply shortage in metro system: A multi-agent
reinforcement learning approach. Transp. Res. B 170, 244–278.
Wang, Yin, Lv, Yisheng, Zhou, Jianying, Yuan, Zhiming, Zhang, Qi, Zhou, Min, 2021. A policy-based reinforcement learning approach for high-speed railway
timetable rescheduling. In: 2021 IEEE International Intelligent Transportation Systems Conference. ITSC, IEEE, pp. 2362–2367.
Wang, Qi, Ma, Yue, Zhao, Kun, Tian, Yingjie, 2022. A comprehensive survey of loss functions in machine learning. Ann. Data Sci. 9 (2), 187–212.
Wang, Rongsheng, Zhou, Min, Li, Yidong, Zhang, Qi, Dong, Hairong, 2019. A timetable rescheduling approach for railway based on monte carlo tree search. In:
2019 IEEE Intelligent Transportation Systems Conference. ITSC, IEEE, pp. 3738–3743.
Yan, C., Yang, L., 2012. Mixed-integer programming based approaches for the movement planner problem: Model, heuristics and decomposition. In: Proc. RAS
Problem Solving Competition. pp. 1–14.
Yang, Lin F., Wang, Mengdi, 2019. Sample-optimal parametric q-learning using linearly additive features. arXiv preprint arXiv:1902.04779.
Ying, Cheng-shuo, Chow, Andy H.F., Chin, Kwai-Sang, 2020. An actor-critic deep reinforcement learning approach for metro train scheduling with rolling stock
circulation under stochastic demand. Transp. Res. B 140, 210–235.
Ying, Cheng-Shuo, Chow, Andy HF, Wang, Yi-Hui, Chin, Kwai-Sang, 2021. Adaptive metro service schedule and train composition with a proximal policy
optimization approach based on deep reinforcement learning. IEEE Trans. Intell. Transp. Syst..
Zhou, Jie, Cui, Ganqu, Zhang, Zhengyan, Yang, Cheng, Liu, Zhiyuan, Wang, Lifeng, Li, Changcheng, Sun, Maosong, 2018. Graph neural networks: A review of
methods and applications. arXiv preprint arXiv:1812.08434.

