Learning to Solve Job Shop Scheduling under Uncertainty

Guillaume Infantes
Jolibrain, Toulouse, France
guillaume.infantes@jolibrain.com

Stéphanie Roussel
ONERA-DTIS, Université de Toulouse, France
stephanie.roussel@onera.fr

Pierre Pereira
Jolibrain, Toulouse, France
pierre.pereira@jolibrain.com

Antoine Jacquet
Jolibrain, Toulouse, France
antoine.jacquet@jolibrain.com

Emmanuel Benazera
Jolibrain, Toulouse, France
emmanuel.benazera@jolibrain.com

Abstract

The Job-Shop Scheduling Problem (JSSP) is a combinatorial optimization problem where tasks need to be scheduled on machines in order to minimize criteria such as makespan or delay. To address more realistic scenarios, we associate a probability distribution with the duration of each task. Our objective is to generate a robust schedule, i.e. one that minimizes the average makespan. This paper introduces a new approach that leverages Deep Reinforcement Learning (DRL) techniques to search for robust solutions, with an emphasis on JSSPs with uncertain durations. Key contributions of this research include: (1) advancements in DRL applications to JSSPs, enhancing generalization and scalability, and (2) a novel method for addressing JSSPs with uncertain durations. The Wheatley approach, which integrates Graph Neural Networks (GNNs) and DRL, is made publicly available for further research and applications.

1 Introduction

Job-shop scheduling problems (JSSPs) are combinatorial optimization problems that involve assigning tasks to resources (e.g., machines) in a way that minimizes criteria such as makespan, tardiness, or total flow time. While the scheduling of production resources plays an important role in many industries, the JSSP formulation does not handle uncertainty because of its simplifying assumptions. This has direct practical consequences, such as scheduling from fixed factors, which can have a significant impact on scheduling performance, and ignoring machine breakdowns or material shortages, which leads to poor solutions when faced with real-world uncertainty.

Solving such combinatorial optimization problems to optimality is NP-hard, and while there has been a lot of progress in solver performance [5], classical approaches often remain impractical on large instances. Therefore, approximation and heuristics-based methods have been proposed [20], but handling uncertainty within these methods remains challenging. Recent works have considered learning algorithms for such problems and report early advances with deep reinforcement learning (DRL) techniques [6, 11, 27]. Because it models the world as a runnable environment from which the algorithm learns directly, DRL offers a more natural way to handle uncertainty in JSSPs. As Reinforcement Learning methods are robust to noise, the uncertainty in the problem statement, which is reflected in the learner's environment, is naturally handled by the algorithm.

We present two contributions for tackling JSSP with uncertain durations.

First, this work shows a range of improvements over the DRL and JSSPs literature, from neural network architectures to training hyper-parameters and reward definitions. These directly lead to better generalization and scalability, both to same-size problems and to larger problems.

Second, the proposed method solves JSSPs with uncertain durations and beats optimal deterministic solutions computed on expected uncertainty. This is relevant to the general use-case where uncertainty cannot be known in advance and where the best deterministic schedule uses expected uncertainty on task durations.

Overall, this leads to a very flexible and efficient approach, capable of naturally handling duration uncertainty, with top results on existing Taillard benchmarks while setting a new benchmark reference for JSSPs with uncertain durations. The approach, code-named Wheatley, combines Graph Neural Networks (GNNs) and DRL techniques. The code is made available under an Open Source license at https://github.com/jolibrain/wheatley/.

The paper is organized as follows: related works are introduced in Section 2, then the JSSP with uncertainty is formalized as a Markov Decision Process (MDP) in Section 3. In Section 4, we detail the core technical contributions. Section 5 is dedicated to experiments on both deterministic and stochastic JSSPs. Finally, we conclude and discuss future works in Section 6.

2 Related work

This section provides an overview of techniques developed to address both deterministic and stochastic versions of the Job Shop Scheduling Problem [25].

Deterministic JSSPs. Mathematical programming, including techniques such as Constraint Programming (CP) or Integer Linear Programming (ILP), has been favored for solving JSSPs due to its precision and ability to model complex scheduling problems. However, the time and resources required to achieve solutions can be very high for large scenarios [5].

Priority Dispatching Rules (PDRs) are heuristic-based strategies that assign priorities to jobs based on predefined criteria; they make local decisions at each step by picking the highest priority job. Common criteria used in PDRs include the Shortest Processing Time (jobs with the shortest processing time are given priority) and the Earliest Due Date (jobs due the earliest are prioritized). This simplifies the scheduling process, making PDRs particularly useful for real-time or large-scale scenarios where rapid decision-making is essential. However, while PDRs are computationally efficient, they rarely yield the optimal solution. A comprehensive evaluation can be found in [20].

Recently, there has been a surge in machine learning and data-driven approaches to solve JSSPs. Instead of relying on handcrafted heuristics, the Learning to Dispatch strategy (L2D) uses machine learning to emulate successful dispatching strategies, as described in [28]. A significant advantage of this method is its size-agnostic nature, as it uses the disjunctive graph representation of the JSSPs. Graph Neural Networks (GNNs) process these graphs to capture intricate relations between operations and their constraints. Deep Reinforcement Learning (DRL) guides the decision-making process, optimizing scheduling decisions based on the features extracted by the GNNs. In [16], the authors leverage a GNN to convert the JSSP graph into node representations. These representations assist in determining scheduling actions. Proximal Policy Optimization (PPO) is employed as a training method for the GNN-derived node embeddings and the associated policy. This method employs an event-based simulator for the JSSP and directly incorporates times into the states of the base Markov Decision Process. This specificity complicates its adaptation to uncertain scenarios. [22] also uses GNNs and PPO for addressing the flexible job-shop problem where the agent also has to choose machines for tasks; the authors add nodes for machines and use two different types of message-passing. The same problem is addressed using a bipartite graph and custom message passing in [9], with good results. The Reinforced Adaptive Staircase Curriculum Learning (RASCL) approach [8] is a Curriculum Learning method that adjusts difficulty levels during learning by dynamically focusing on challenging instances.

Robustness in JSSPs. Stochastic JSSPs (SJSSP) account for uncertainties in processing times by modeling them as random variables. Some techniques use classic solvers to generate robust solutions by anticipating potential disruptions or modeling worst-case scenarios. For instance, in [13], several processing time scenarios are sampled for a given JSSP. The objective is then to generate a single schedule that is good for all sampled scenarios. PDRs can also be used, but they can lack a global view, as in the deterministic case. Several works address SJSSP through meta-heuristics, as described in [2]. Other techniques involving genetic algorithms and their hybridization are also widely used [3, 21]. [1] introduces a way to robustify a solution to a deterministic relaxation of the SJSSP. [14] presents a scheduling approach that is both proactive and reactive: a multi-agent architecture is responsible for the proactive robust scheduling, and a repair procedure handles machine breakdowns and the arrival of rush jobs.

Some works address the dynamic variant of SJSSP, in which new jobs can arrive at any time. [26] is a review of such extensions along with the corresponding proposed solving methods. Classical approaches involve complex mathematical programming models that do not scale well [15]. Online reactive recovery approaches are a possible solution if no robust solution is available [17]. More recent approaches explore the use of DRL and GNNs [12]. However, to the best of our knowledge, no work uses such techniques for the SJSSP.

3 JSSP with Uncertainty as an MDP

In this section, we first recall the JSSP definition. Then, we describe how to represent uncertainty and define the corresponding MDP. We finally present how to use Reinforcement Learning (RL) for solving the MDP.

3.1 Background

A JSSP is defined as a pair $(\mathcal{J},\mathcal{M})$, where $\mathcal{J}$ is a set of jobs and $\mathcal{M}$ is a set of machines. Each job $J_i \in \mathcal{J}$ must go through $n_i$ machines in $\mathcal{M}$ in a given order $O_{i1} \to \ldots \to O_{in_i}$, where each element $O_{ij}$ ($1 \leq j \leq n_i$) is called an operation of $J_i$. The binary relation $\to$ is a precedence constraint. The size of a JSSP instance is classically denoted as $|\mathcal{J}| \times |\mathcal{M}|$. In the following, the set of all operations is denoted $\mathcal{O}$. To be executed, each operation $O_{ij} \in \mathcal{O}$ requires a unique machine $m_{ij} \in \mathcal{M}$ during a processing time denoted $p_{ij}$ ($p_{ij} \in \mathbb{N}^{+}$).
Each machine can only process one job at a time, and preemption is not allowed.

A solution $\sigma$ of a JSSP instance is a function that assigns a start date $S_{ij}$ to each operation $O_{ij}$ so that precedences between operations of each job are respected and there is no temporal overlap between operations that are performed on the same machine. The completion time of an operation $O_{ij}$ is $C_{ij} = S_{ij} + p_{ij}$. A solution $\sigma$ is optimal if it minimizes the makespan $C_{max} = \max_{O_{ij} \in \mathcal{O}} \{C_{ij}\}$, i.e. the maximal completion time of operations.

As described in [18], the disjunctive graph $\mathcal{G} = (\mathcal{O},\mathcal{C},\mathcal{D})$ of a JSSP $(\mathcal{J},\mathcal{M})$ is defined as follows:

  • $\mathcal{O}$ is the set of vertices, i.e. there is one vertex for each operation $o \in \mathcal{O}$;

  • $\mathcal{C}$ is a set of directed arcs representing the precedence constraints between operations of each job (conjunctions);

  • $\mathcal{D}$ is a set of edges (disjunctions), each of which connects a pair of operations requiring the same machine for processing.
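The three components above can be built mechanically from an instance description. The following is a minimal sketch, not the Wheatley implementation: the instance encoding (each job as a list of `(machine, duration)` pairs), the function name `build_disjunctive_graph` and the 3×3 instance are assumptions made for illustration.

```python
from itertools import combinations

def build_disjunctive_graph(jobs):
    """jobs[i][j] = (machine, duration) of the j-th operation of job i (0-indexed)."""
    # Vertices: one per operation, identified by (job index, rank in job).
    operations = [(i, j) for i, job in enumerate(jobs) for j in range(len(job))]
    # Conjunctive arcs: consecutive operations of the same job.
    conjunctions = [((i, j), (i, j + 1))
                    for i, job in enumerate(jobs) for j in range(len(job) - 1)]
    # Disjunctive edges: unordered pairs of operations requiring the same machine.
    by_machine = {}
    for (i, j) in operations:
        by_machine.setdefault(jobs[i][j][0], []).append((i, j))
    disjunctions = [frozenset(pair)
                    for ops in by_machine.values()
                    for pair in combinations(ops, 2)]
    return operations, conjunctions, disjunctions

# Hypothetical instance with 3 jobs and 3 machines; durations are arbitrary.
jobs = [
    [(0, 3), (1, 2), (2, 2)],
    [(2, 2), (0, 4), (1, 1)],
    [(2, 4), (0, 3), (1, 2)],
]
operations, conjunctions, disjunctions = build_disjunctive_graph(jobs)
```

Disjunctions are stored as unordered `frozenset` pairs because they carry no direction until a selection orients them.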

Figure 1(a) shows the disjunctive graph of a JSSP with 3 jobs and 3 machines. A selection is a state of the graph in which a direction is chosen for some edges in $\mathcal{D}$, denoted $\mathcal{D}^{O}$. If an edge $(O_{ij}, O_{i'j'})$ in $\mathcal{D}$ becomes oriented (in that order), then it represents that operation $O_{ij}$ is performed before $O_{i'j'}$ on their associated machine. A selection is valid if the set of oriented arcs ($\mathcal{C} \cup \mathcal{D}^{O}$) makes the graph acyclic. A solution $\sigma$ can be defined by a valid selection in which all edges in $\mathcal{D}$ have a direction ($\mathcal{D} = \mathcal{D}^{O}$) and in which start dates of operations are the earliest possible dates consistent with the selection precedences, as done in a classical Schedule Generation Scheme (SGS). Figure 1(b) illustrates a valid selection for the toy JSSP instance of Figure 1(a).

Figure 1: Disjunctive graph representation. (a) Disjunctive graph example in which each color represents a different machine. (b) Selection example with $\mathcal{D}^{O} = \{(O_{11}, O_{22}), (O_{11}, O_{32}), (O_{22}, O_{32}), (O_{21}, O_{31}), (O_{21}, O_{13})\}$.
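To make the link between a complete valid selection and a solution concrete, the earliest start dates can be obtained by a longest-path computation over the oriented graph. The sketch below is illustrative only: the function name `earliest_start_dates`, the operation labels and the 2-job, 2-machine instance are assumptions, and a topological traversal is just one possible way to implement an SGS.

```python
from collections import defaultdict, deque

def earliest_start_dates(durations, arcs):
    """durations: {op: p}; arcs: (before, after) pairs, i.e. C together with D^O.
    Returns {op: S} with earliest start dates; raises if the graph has a cycle."""
    successors = defaultdict(list)
    indegree = {op: 0 for op in durations}
    for u, v in arcs:
        successors[u].append(v)
        indegree[v] += 1
    start = {op: 0 for op in durations}
    ready = deque(op for op, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            # v cannot start before u completes.
            start[v] = max(start[v], start[u] + durations[u])
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != len(durations):
        raise ValueError("invalid selection: the oriented graph has a cycle")
    return start

# Hypothetical instance: O11 and O22 share a machine, as do O21 and O12.
durations = {"O11": 3, "O12": 2, "O21": 2, "O22": 4}
conjunctive = [("O11", "O12"), ("O21", "O22")]
selection = [("O11", "O22"), ("O21", "O12")]
start = earliest_start_dates(durations, conjunctive + selection)
makespan = max(start[op] + durations[op] for op in durations)
```

The cycle check corresponds to the validity condition on selections: an invalid selection has no consistent start dates.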

3.2 Representing uncertainty

JSSPs can be easily extended with duration uncertainty as bounds on tasks’ duration, and effect uncertainty as failure outcomes of a task. While task failures could be represented using special nodes representing completely different outcomes, this would push the boundaries outside JSSP formal capabilities. In this work, failures are handled as retries that consume an uncertain time duration a fixed maximum number of times. In the following, we focus on uncertain task duration as a generic enough scheme to capture relevant uncertainty use-cases.

In the classical JSSP definition, every operation $O_{ij}$ has a deterministic processing time $p_{ij}$. We extend this definition by saying that processing times are not known in advance, but that each operation $O_{ij}$ has an associated probability distribution $\mathbb{P}_{ij}$ over its possible duration values. The objective is to minimize the average makespan, which is formally defined by $\max_{O_{ij} \in \mathcal{O}} \int_{0}^{+\infty} (S_{ij} + p_{ij}) \, d\mathbb{P}_{ij}$.
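As a small illustration of attaching a distribution to each operation, the sketch below draws one duration per operation. The triangular family, the `(min, mode, max)` parameters and the names `duration_params` and `sample_durations` are assumptions of this example; any distribution $\mathbb{P}_{ij}$ over durations could be substituted.

```python
import random

# One (min, mode, max) triple per operation; values are hypothetical.
duration_params = {"O11": (2, 3, 6), "O12": (1, 2, 4)}

def sample_durations(params, rng):
    """Draw one duration per operation from its own distribution."""
    # random.triangular takes its arguments in the order (low, high, mode).
    return {op: rng.triangular(low, high, mode)
            for op, (low, mode, high) in params.items()}

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
samples = [sample_durations(duration_params, rng) for _ in range(10_000)]
# Empirical mean duration of O11; a triangular law has mean (min + mode + max) / 3.
mean_O11 = sum(s["O11"] for s in samples) / len(samples)
```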

3.3 Sequential Decision Making

The scheduling problem boils down to a Markov Decision Process. The inputs of the problem are the original disjunctive graph $\mathcal{G} = (\mathcal{O},\mathcal{C},\mathcal{D})$ and the probability distribution over duration values $\mathbb{P}_{ij}$ associated with each operation $O_{ij}$.

State

The state $s_t$ at decision step $t$ is defined by:

  • the current selection $\mathcal{D}^{O}_{t}$,

  • the set of already scheduled operations $\mathcal{S}_t$ at step $t$.

The initial state $s_0$ is the disjunctive graph representing the original JSSP instance with $\mathcal{D}^{O}_{0} = \emptyset$ and $\mathcal{S}_0 = \emptyset$. The terminal state $s_T$ is a complete solution where $\mathcal{D}^{O}_{T} = \mathcal{D}$, i.e. all disjunctive arcs have been assigned a direction, and $\mathcal{S}_T = \mathcal{O}$, i.e. all operations have been scheduled.

Actions

At each step $t$, candidate actions consist in selecting an operation $O_{ij}$ to put directly after the last scheduled operation on the corresponding machine. This is a simple way to ensure that cycles are never added in $\mathcal{G}$. Intuitively, it consists in choosing an operation to do before all the ones that have not yet been scheduled, and updating the current selection accordingly. Furthermore, we require that an operation be a candidate for selection only if its preceding operations in the same job have been scheduled. As exactly one operation is scheduled at each step, the final state is reached at step $T = \mathit{card}(\mathcal{O})$.

Candidate actions $\mathcal{A}_t$ at step $t$ are formally defined by $\mathcal{A}_t = \{O_{ij} \in \mathcal{O} \mid O_{ij} \notin \mathcal{S}_t \text{ and } \forall j' < j,\ O_{ij'} \in \mathcal{S}_t\}$.
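The definition of $\mathcal{A}_t$ translates almost literally into code. The sketch below is illustrative: the encoding of operations as `(i, j)` pairs, the dictionary `n_ops` giving $n_i$ per job and the function name `candidate_actions` are assumptions, and the boolean mask at the end only hints at how such a set could feed an action-masking scheme.

```python
def candidate_actions(n_ops, scheduled):
    """n_ops: {i: n_i}; scheduled: set of (i, j) pairs in S_t, with j starting at 1.
    Returns A_t: unscheduled operations whose job predecessors are all scheduled."""
    return {(i, j)
            for i, n_i in n_ops.items()
            for j in range(1, n_i + 1)
            if (i, j) not in scheduled
            and all((i, jp) in scheduled for jp in range(1, j))}

# Hypothetical 3x3 instance: three jobs with three operations each.
n_ops = {1: 3, 2: 3, 3: 3}
initial = candidate_actions(n_ops, set())
later = candidate_actions(n_ops, {(1, 1), (2, 1)})

# Boolean mask over all operations, in a fixed order.
all_ops = [(i, j) for i in sorted(n_ops) for j in range(1, n_ops[i] + 1)]
mask = [op in later for op in all_ops]
```

At most one operation per job is a candidate at any step, so the action set never exceeds $|\mathcal{J}|$ elements.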

Transitions

If the chosen action at step $t$ consists in selecting the operation $O_{ij}$ in $\mathcal{A}_t$, then it leads to adding $O_{ij}$ to the set of scheduled operations and adding the arc $(O_{kl}, O_{ij})$ to $\mathcal{D}^{O}$ for each operation $O_{kl}$ already scheduled at step $t$ on the same machine. Formally, this gives:

  • $\mathcal{S}_{t+1} = \mathcal{S}_t \cup \{O_{ij}\}$

  • $\mathcal{D}^{O}_{t+1} = \mathcal{D}^{O}_{t} \cup \{(O_{kl}, O_{ij}) \in \mathcal{D} \mid O_{kl} \in \mathcal{S}_t \text{ and } m_{kl} = m_{ij}\}$
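The two update rules can be sketched as a pure function on the state. This is an illustrative sketch: the function name `step`, the string labels for operations and the machine assignment are assumptions, and arcs are represented as ordered pairs.

```python
def step(scheduled, selection, machine_of, op):
    """Apply the action `op` and return (S_{t+1}, D^O_{t+1}); inputs are not mutated."""
    # One arc from every already scheduled operation on the same machine to `op`.
    new_arcs = {(prev, op) for prev in scheduled
                if machine_of[prev] == machine_of[op]}
    return scheduled | {op}, selection | new_arcs

# Hypothetical machine assignment for a few operations of a 3x3 instance.
machine_of = {"O11": 0, "O22": 0, "O32": 0, "O21": 2, "O31": 2, "O13": 2}

scheduled, selection = set(), set()
for op in ["O11", "O21", "O22", "O31"]:
    scheduled, selection = step(scheduled, selection, machine_of, op)
```

Because every new arc points from an already scheduled operation to the newly scheduled one, the scheduling order is a topological order of the oriented graph, which is why no cycle can appear.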

Reward/Cost

In most approaches to solving MDPs with reinforcement learning, it is preferred to use non-sparse rewards, i.e. an informative reward signal at every step. For instance, the authors of [28] propose to use the difference of makespan induced by the assignment of the task, as this naturally sums to the makespan at the end of the trajectory. In the presence of uncertainty, while we could compute bounds on the makespan in the same way, it is not obvious how to aggregate such bounds into a uni-dimensional reward signal.

We use a different approach: we draw durations only when the schedule is complete (at time $T$) and give the resulting makespan as a cost. The start date of each operation is its earliest possible date considering the conjunctive and disjunctive precedence arcs in the solution. All other rewards are null. Formally, it is defined as follows:

  • $\forall t < T$, $r_t = 0$;

  • $r_T = \max_{O_{ij} \in \mathcal{O}} (S_{ij} + p_{ij}^{\mathit{sample}})$ where $p_{ij}^{\mathit{sample}} \sim \mathbb{P}_{ij}$.
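The sparse cost can be sketched as follows. This is an illustration under stated assumptions, not the Wheatley simulator: durations are drawn from a triangular law with hypothetical `(min, mode, max)` parameters, operations are `(job, rank)` pairs, and start dates are computed by replaying the scheduling order, which matches the earliest-date rule because each action places its operation after the last one scheduled on its machine.

```python
import random

def sampled_terminal_cost(order, machine_of, params, rng):
    """order: operations (job, rank) in the order they were scheduled.
    Returns r_T, the makespan under one draw of the durations."""
    job_ready, machine_ready = {}, {}
    makespan = 0.0
    for op in order:
        job = op[0]
        m = machine_of[op]
        low, mode, high = params[op]
        p = rng.triangular(low, high, mode)  # sampled duration (assumed triangular)
        # Earliest start: after the job predecessor and the previous operation on m.
        start = max(job_ready.get(job, 0.0), machine_ready.get(m, 0.0))
        job_ready[job] = machine_ready[m] = start + p
        makespan = max(makespan, start + p)
    return makespan

def episode_rewards(order, machine_of, params, rng):
    # Null reward at every step before the last, sampled makespan at the last step.
    cost = sampled_terminal_cost(order, machine_of, params, rng)
    return [0.0] * (len(order) - 1) + [cost]

# Hypothetical 2-job, 2-machine episode.
order = [(1, 1), (2, 1), (1, 2), (2, 2)]
machine_of = {(1, 1): 0, (1, 2): 1, (2, 1): 1, (2, 2): 0}
params = {(1, 1): (2, 3, 5), (1, 2): (1, 2, 4), (2, 1): (1, 2, 3), (2, 2): (3, 4, 6)}
rewards = episode_rewards(order, machine_of, params, random.Random(0))
```

Averaging the last entry of `rewards` over many draws gives a Monte Carlo estimate of the average makespan of that scheduling order.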

As reinforcement learning aims at minimizing the expectation of the sum of costs along trajectories, this corresponds to our objective of minimizing the average makespan.

3.4 Solving the MDP with Reinforcement Learning

We use a reinforcement learning setup, where the agent selects tasks to schedule and gets the corresponding partial schedules as observations along with rewards. With this modeling, the effects of actions are deterministic (as they only add edges in the graph); all uncertainty is in the reward value.

The objective is to find a policy that minimizes the average makespan over a set of test problems that are not used during the training phase. To do so, we use a simulator that generates problems that are close to the test problems, and aim at obtaining a policy that minimizes the expectation of makespan over the problems generated by the simulator. Such a policy has to be able to generalize to test problems, i.e. give good results without further learning. As the only source of uncertainty is the durations, of which only the distribution parameters are observable, we want our parametric policy to be able to adapt to these parameters.

Algorithm

In order to learn a policy, we consider a parametric policy and use the Proximal Policy Optimization (PPO) algorithm [19], with action masking [7]. PPO is an on-policy actor-critic RL algorithm. Its current stochastic policy is the actor, while the critic estimates the quality of the current state. More precisely, as shown in Algorithm 1, the algorithm starts by randomly initializing the parameters of the policy (actor) and of the value function estimator (critic). Then, for a given number of iterations, it starts by collecting trajectory data in the form (observation, action, next observation) on train problem instances, where actions are chosen using the current stochastic policy. It then computes the makespans corresponding to the train instances by sampling durations and applying the chosen order of actions. Using this, it computes returns at every timestep, then advantages (difference between sampled returns and value estimation) using the current critic. Observed graphs are then rewired (see Section 4.1). The PPO update algorithm itself samples a subset of corresponding observations, actions, advantages, value predictions and returns, then updates the actor parameters (including the GNN) using the gradient of the advantages, and updates the critic (including the GNN) using the gradient of the mean squared error between the critic value and the observed returns. This PPO update is repeated a small number of times or until the variation in the policy would lead to out-of-distribution critic estimations (because samples are collected using the old policy, i.e. the one before PPO updates).

We then evaluate the current policy (actor) by playing the argmax of the stochastic policy on a given set of validation problems. We repeat these steps until the policy does not seem to improve on validation instances (the $N$ in the external for loop is an upper bound on the number of iterations, which is set to a large value and the for loop is interrupted manually).

Generate validation instances; compute heuristic and OR-Tools performance on these instances
// actor ∼ current policy $\pi_\theta$
Init actor
// critic ∼ value function estimator
Init critic
for $i = 1, 2, \ldots, N$ do
    // Collect dataset
    Generate train instances
    Collect trials data $\mathcal{D}_i = ((s_t, a_t, r_t, s_{t+1}), \ldots)$ using current actor
    For each trial: sample makespan using system simulator (= final cost)
    Compute returns on trials
    Compute advantages on trials using current critic
    Rewire graphs in trial data
    // PPO update algorithm
    repeat
        Sample a minibatch of n data points over shuffled collected data
        Update actor over the minibatch data towards advantage maximization
        Update critic by MSE regression
    until max number of iterations or too large KL-divergence between current and updated policy
    Evaluate current policy (actor) on validation instances
Algorithm 1: General algorithm
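
The return and advantage steps of Algorithm 1 can be illustrated with a minimal sketch. It assumes that the only reward is the negated sampled makespan received at the end of a trial and that returns are undiscounted by default; the function names, the sign convention and the `gamma` parameter are illustrative assumptions, not taken from the Wheatley code.

```python
from typing import List


def returns_from_terminal_cost(n_steps: int, makespan: float, gamma: float = 1.0) -> List[float]:
    """Return-to-go at each step when the only reward is -makespan at the end."""
    # Step t is (n_steps - 1 - t) steps away from the terminal reward.
    return [-(gamma ** (n_steps - 1 - t)) * makespan for t in range(n_steps)]


def advantages(returns: List[float], values: List[float]) -> List[float]:
    """Advantage = sampled return minus the critic's value estimate."""
    return [g - v for g, v in zip(returns, values)]


# Toy trial: 3 dispatch decisions, sampled makespan of 100.
rets = returns_from_terminal_cost(n_steps=3, makespan=100.0)
advs = advantages(rets, values=[-110.0, -105.0, -100.0])
print(rets, advs)
```

A positive advantage here means the sampled schedule turned out better (shorter) than the critic expected at that step.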

4 GNN Implementation: Rewiring, Embedding and Addressing Uncertainty

An overview of the architecture is shown in Figure 2. The agent takes as input a partial schedule in the form of a graph, as in Figure 1(b). Several components, described in this section, make up the actor and allow it to choose one action. The simulator processes this action by adding arcs to the schedule graph, and simulates the uncertainty when the final state is reached. Note that PPO uses the schedule graph, the action, the reward and the value estimate to update the embedders and the GNN.

Figure 2: General Architecture. [Diagram labels. Agent: Graph Rewirer, Node Embedder, Edge Embedder, GNN, Value Estimator (graph logit), Action Selector (nodes logits). Simulator: Graph Update, System Simulator (uncertainty). Exchanged data: action $O_{ij}$, partial schedule graph. PPO: reward, value, update parameters.]

4.1 Graph Rewiring and Embedding

The representation above models partial schedules (where some conflicts are not yet resolved) as disjunctive graphs. Generally speaking, Message-Passing Graph Neural Networks (MP-GNNs) use the graph structure as a computational lattice, meaning that information has to follow the graph adjacencies and only them. We thus have to distinguish the input graph from the graph used by the MP-GNN. This is known as “graph rewiring” in the MP-GNN literature. In our case, using only precedences as adjacencies would explicitly forbid information from flowing from future tasks to the present dispatch choice, which is definitely not what we want: the agent should choose the task to dispatch based on its effects on future conflicts, so information must flow from future tasks to the present one.

Precedences.

In order to keep the rewired graph as small as possible, we remove from $\mathcal{D}^{O}$ all edges that are not necessary to obtain the complete order. For instance, we remove the edge $(O_{11}, O_{32})$ from Figure 1(b). Links are then added in the rewired graph in both directions for every precedence, with different types for precedence and reverse-precedence edges. This enables learned operators to differentiate between chronological and reverse-chronological links and allows the network to pass information forward and backward, depending on what is found useful during the learning phase.
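
As an illustration of this step, the sketch below removes redundant precedence edges (a transitive reduction, valid for acyclic graphs) and then emits each remaining precedence in both directions with its own type. The task names `A`, `B`, `C`, the function names and the string edge types are hypothetical and only serve the example.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def reachable(adj: Dict[str, Set[str]], src: str, skip: Edge) -> Set[str]:
    """Nodes reachable from src without using the edge `skip`."""
    seen: Set[str] = set()
    stack = [src]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if (u, v) == skip or v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return seen


def transitive_reduction(edges: List[Edge]) -> List[Edge]:
    """Keep only the edges of a DAG that are not implied by a longer path."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    return [(u, v) for (u, v) in edges if v not in reachable(adj, u, (u, v))]


def rewire_precedences(edges: List[Edge]) -> List[Tuple[str, str, str]]:
    """Add a typed forward and backward link for every necessary precedence."""
    typed = []
    for u, v in transitive_reduction(edges):
        typed.append((u, v, "precedence"))
        typed.append((v, u, "reverse_precedence"))
    return typed


# A -> C is implied by A -> B -> C, so it is dropped before rewiring.
print(rewire_precedences([("A", "B"), ("B", "C"), ("A", "C")]))
```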

Conflicts.

The remaining challenge is to allow messages to circulate in the GNN between tasks sharing the same machine. Two options are possible: 1) adding a node representing each machine, with edges in both directions between this machine node and the tasks that use the machine, or 2) directly connecting tasks that share a machine, resulting in a clique per machine in the message-passing graph. In this paper, we choose the second approach as it showed better results than the first one in preliminary experiments.
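
The second option can be sketched as follows, assuming a plain mapping from task name to machine identifier. Representing each clique as a pair of directed edges per task pair is an assumption made for the example, as are the function name and the `"conflict"` label.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def conflict_edges(machine_of: Dict[str, int]) -> List[Tuple[str, str, str]]:
    """Connect every ordered pair of distinct tasks that share a machine."""
    by_machine: Dict[int, List[str]] = {}
    for task, machine in machine_of.items():
        by_machine.setdefault(machine, []).append(task)
    edges = []
    for tasks in by_machine.values():
        # permutations(tasks, 2) yields both (u, v) and (v, u).
        for u, v in permutations(tasks, 2):
            edges.append((u, v, "conflict"))
    return edges


# Three tasks on machine 0 form a clique; the single task on machine 1 has no conflict.
print(conflict_edges({"O11": 0, "O21": 0, "O31": 0, "O12": 1}))
```

A clique over k tasks adds k(k-1) directed edges, so this option grows quadratically with the number of tasks per machine, whereas the machine-node option grows linearly.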

Figure 3: Rewired graph example with precedences, backward precedences and conflicts as cliques. Each type of arc has its own encoding. Operations $O_{11}$, $O_{21}$, $O_{22}$ and $O_{31}$ have here been scheduled in this order. [Nodes: $O_{11}$, $O_{12}$, $O_{13}$, $O_{21}$, $O_{22}$, $O_{23}$, $O_{31}$, $O_{32}$, $O_{33}$. Legend: jobs precedence ($\mathcal{C}$) + scheduling precedence choices; backward jobs precedence + backward scheduling precedence choices; machine conflicts.]

Edge attributes.

Edge attributes explicitly give each edge a type, allowing the network to learn to pass different messages along reverse-precedence, precedence and conflict links. This helps the GNN to effectively handle interactions between tasks of different jobs that share machines.

Node attributes.

We define for node $n_{ij}$ associated with task $O_{ij}$ the following attributes: a boolean $A_{ij}$ indicating whether the corresponding task has already been scheduled (assigned); a boolean $\mathit{Sel}_{ij}$ indicating whether the node is selectable; and the machine identifier $M_{ij}$. We also give the parameters of the probability distribution of task durations, and the corresponding task completion time distribution parameters. Task completion times are initialized as if there were no conflicts, i.e. using only the durations of the task and of the previous one from the same job. They are updated once tasks are assigned, taking into account conflicts with previously assigned tasks.

Graph Pooling.

For the GNN to give a global summary of the nodes, there are two options: either a global isotropic operator on nodes such as mean, maximum or sum (or any combination), or a special node that is connected to every task node. The latter case is equivalent to learning a custom pooling operator.
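
The two options can be contrasted in a small sketch. Node embeddings are represented as plain lists of floats; the function names and the `"pool"` node name are hypothetical.

```python
from typing import List, Tuple


def pool(node_embeddings: List[List[float]], op: str = "mean") -> List[float]:
    """Isotropic pooling: the same operator is applied to every feature column."""
    columns = list(zip(*node_embeddings))
    if op == "sum":
        return [sum(c) for c in columns]
    if op == "max":
        return [max(c) for c in columns]
    if op == "mean":
        return [sum(c) / len(c) for c in columns]
    raise ValueError(f"unknown pooling operator: {op}")


def virtual_node_edges(tasks: List[str], pool_node: str = "pool") -> List[Tuple[str, str]]:
    """Alternative: link every task node to one special node, in both directions."""
    edges = []
    for t in tasks:
        edges.append((t, pool_node))
        edges.append((pool_node, t))
    return edges


print(pool([[1.0, 2.0], [3.0, 6.0]], "mean"))
print(virtual_node_edges(["O11", "O12"]))
```

With the special node, the summary is produced by the same learned message-passing operators as the rest of the graph, which is why it behaves like a learned pooling operator.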

Output of the GNN.

The message-passing GNN yields a value for every node, and a global value for the graph (either from the special node or from the chosen isotropic operator). As nodes represent tasks, these values can be directly used as values for later action selection. In our implementation, we also concatenate the global value to every node value.

Ability to Deal with Different Problem Sizes.

The GNN outputs a logit for each node, and there is a one-to-one mapping between nodes and actions, whatever the number of nodes/actions. Internally, the message-passing scheme collects messages from all neighbors, making the whole pipeline agnostic to the number of nodes. Learning the best actions boils down to node regression, with target values given by the reinforcement learning loop. This still requires careful implementation with respect to data structures and batching, but the direct mapping from nodes to actions makes it possible to handle different problem sizes.

4.2 Handling Uncertainty

On the observation/agent side, durations are defined by the distributions $\mathbb{P}_{ij}$. From these duration distributions, we can compute approximate distributions of task completion times simply by propagating completion time parameters recursively along the precedence graph whenever precedences are added. The real duration of the full schedule is computed only once the complete schedule is known, based on all $\mathbb{P}_{ij}$. It is then passed as a cost signal. As the RL algorithm naturally handles the uncertainty of MDPs, it learns to evaluate partial schedule quality based on the expectation of costs, which is exactly our objective.
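
Under the assumption that the only cost is the real makespan received at the end of an episode, the objective described above can be written as follows. The symbols $\sigma$ (the complete schedule produced by the policy), $d_{ij}$ (a sampled duration) and $J$ are notation introduced here for illustration; $M_{real}$ is the real makespan defined in Section 5.1.

```latex
J(\theta) \;=\; \mathbb{E}_{\sigma \sim \pi_\theta}\,
\mathbb{E}_{d_{ij} \sim \mathbb{P}_{ij}}
\left[ M_{real}(\sigma, d) \right],
\qquad
\theta^{*} \;=\; \arg\min_{\theta} J(\theta)
```

Each training trial provides one sample of the inner expectation, so averaging over many trials approximates the expected makespan that a robust schedule should minimize.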

4.3 Implementation details

Connecting to PPO.

In most generic PPO implementations, the actor (policy) consists of a feature extractor whose structure depends on the data type of the observation, followed by an MLP with an output dimension matching the number of actions. The same holds for the critic (value estimator), except that its output dimension is 1. Some layers can be shared (the feature extractor and the first layers of the MLPs). In our case, we do not want to use such a generic structure because we have a one-to-one matching between the nodes of the observation and the actions. We thus always keep the number of nodes as a dimension of the data tensors.

Graph embedder

The graph embedder builds the rewired graph by adding edges as described in Section 4.1. It embeds node attributes using a learnable MLP, and edge attributes (here, the edge type only) using a learnable embedding. The output dimension of the embeddings is an open hyper-parameter, hidden_dim; we found a size of 64 to work well in our experiments.

Message-passing GNN

As a message-passing GNN, we use EGATConv from the DGL library [24], which enriches GATv2 graph convolutions [4] with edge attributes. We used 4 attention heads, leading to an output of size 4 × hidden_dim. This dimension is reduced to hidden_dim using learnable MLPs before being passed to the next layer (in the spirit of the feed-forward networks used in transformers). The output of a layer can be summed with the input of the layer, using residual connections. For most of our experiments, we used 10 such layers.

Action selection

Action selection aims at giving action probabilities given the values (logits) output by the GNN. We can either use the logits output by the last layer, or use a concatenation of the logits output by every layer. We furthermore concatenate the global graph logits of every layer, leading to a data size of $((n\_layers + 1) \times \text{hidden\_dim}) \times 2$ per node. This dimension is reduced to 1 using a learnable linear combination (the minimal case of an MLP; we did not find using a full MLP to be useful). Finally, a distribution is built upon these logits by normalizing them, taking action masks into account at this point. As node numbers correspond to action numbers, we directly obtain the action identifier when drawing a value from the distribution.
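
A minimal sketch of these two computations follows: the per-node input size of the action selector, and a masked normalization of per-node logits. The use of a softmax for the normalization is an assumption (the text only says the logits are normalized with masks taken into account), and the function names are hypothetical.

```python
import math
from typing import List


def selector_input_dim(n_layers: int, hidden_dim: int) -> int:
    """Per-node size: node and graph values from the input and every layer."""
    return (n_layers + 1) * hidden_dim * 2


def masked_action_probs(logits: List[float], mask: List[bool]) -> List[float]:
    """Softmax over selectable nodes only; masked nodes get probability 0."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("at least one action must be selectable")
    top = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# With the settings reported above (10 layers, hidden_dim = 64).
print(selector_input_dim(n_layers=10, hidden_dim=64))
# The second node has the largest logit but is not selectable.
print(masked_action_probs([1.0, 5.0, 1.0], [True, False, True]))
```

Because there is one logit per node, the index drawn from this distribution is directly the identifier of the task to dispatch.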

Normalization

Across all neural network components, we did not find any kind of normalization to be useful. On the other hand, durations are normalized to the [0,1] range.

5 Experiments

5.1 Uncertainty Modeling

The framework presented in this paper can accommodate any kind of duration probability distribution. The main parameters of this distribution belong to the node features and must therefore be described formally.

In order to deal with duration uncertainties, for each operation $O_{ij}$, we use a triangular distribution with 3 parameters $min^{p}_{ij}$, $max^{p}_{ij}$ and $mode^{p}_{ij}$, as this is often used in the context of manufacturing processes.

We also have in the simulator the real processing time of a task, denoted $real^{p}_{ij}$, which is observed by the agent in the final state. With such a definition, task completion times can be represented by their min, max and mode times as follows: $min^{C}_{ij} = S_{ij} + min^{p}_{ij}$, $max^{C}_{ij} = S_{ij} + max^{p}_{ij}$, $mode^{C}_{ij} = S_{ij} + mode^{p}_{ij}$. Task completion time parameters are updated when adding precedences to the graph (min, max and mode start times are computed based on precedence relations). We give both duration distribution parameters and task completion time distribution parameters as node attributes. The real task completion times $real^{C}_{ij} = S_{ij} + real^{p}_{ij}$ can be computed only during the simulation of a complete schedule, giving the real makespan $M_{real} = \max_{ij}(real^{C}_{ij})$, used as a cost given to the learning agent.
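
The propagation of completion time parameters and the computation of the real makespan can be sketched as follows. The sketch assumes that tasks are visited in a topological order of the current precedence graph and that each start time parameter is the largest corresponding completion parameter among the predecessors, which is an approximation; the function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[float, float, float]  # (min, mode, max)


def propagate(order: List[str], preds: Dict[str, List[str]],
              dur: Dict[str, Triple]) -> Dict[str, Triple]:
    """Completion (min, mode, max) of each task, visited in topological order."""
    completion: Dict[str, Triple] = {}
    for task in order:
        # Start parameter k = largest completion parameter k among predecessors.
        start = tuple(
            max((completion[p][k] for p in preds.get(task, [])), default=0.0)
            for k in range(3)
        )
        completion[task] = tuple(s + d for s, d in zip(start, dur[task]))
    return completion


def real_makespan(order: List[str], preds: Dict[str, List[str]],
                  real_dur: Dict[str, float]) -> float:
    """Makespan once real durations are known, for the same precedence graph."""
    end: Dict[str, float] = {}
    for task in order:
        start = max((end[p] for p in preds.get(task, [])), default=0.0)
        end[task] = start + real_dur[task]
    return max(end.values())


order = ["O11", "O21", "O12"]
preds = {"O12": ["O11", "O21"]}  # O12 waits for its job and machine predecessors
dur = {"O11": (9.5, 10.0, 11.0), "O21": (19.0, 20.0, 22.0), "O12": (4.75, 5.0, 5.5)}
print(propagate(order, preds, dur)["O12"])
print(real_makespan(order, preds, {"O11": 10.25, "O21": 19.5, "O12": 5.0}))
```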

5.2 Benchmarks and Baselines

5.2.1 Benchmarks

Our approach has been tested on instances generated using Taillard rules [23]: durations are uniformly drawn in [1,99], and the machine assignment is randomly chosen. For stochastic instances, this duration corresponds to $mode^{p}_{ij}$. Minimum and maximum values are uniformly drawn in $[0.95 \times mode^{p}_{ij}, 1.1 \times mode^{p}_{ij}]$, meaning that tasks can take at most 5% less time and at most 10% more time than the mode value.
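
A generator following these rules might look like the sketch below. It assumes that each job visits every machine exactly once in a random order, that the minimum is drawn in $[0.95 \times mode, mode]$ and the maximum in $[mode, 1.1 \times mode]$; these details, like the function names, are assumptions made for the example and not a description of the Wheatley generator.

```python
import random
from typing import List, Tuple

Operation = Tuple[int, float, float, float]  # (machine, min, mode, max)


def generate_instance(n_jobs: int, n_machines: int, rng: random.Random) -> List[List[Operation]]:
    """Taillard-like stochastic instance: one operation per machine for each job."""
    instance = []
    for _ in range(n_jobs):
        machines = list(range(n_machines))
        rng.shuffle(machines)  # random machine order for this job
        job = []
        for machine in machines:
            mode = float(rng.randint(1, 99))
            low = rng.uniform(0.95 * mode, mode)    # at most 5% shorter
            high = rng.uniform(mode, 1.1 * mode)    # at most 10% longer
            job.append((machine, low, mode, high))
        instance.append(job)
    return instance


def sample_real_duration(op: Operation, rng: random.Random) -> float:
    """Draw a real duration from the triangular distribution of an operation."""
    _, low, mode, high = op
    return rng.triangular(low, high, mode)


rng = random.Random(0)
instance = generate_instance(n_jobs=3, n_machines=4, rng=rng)
print(instance[0][0], sample_real_duration(instance[0][0], rng))
```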

5.2.2 Baselines

In order to evaluate the performance of Wheatley, we compare several approaches:

  • we first train Wheatley on instances of various sizes. More precisely, W-nxm denotes our approach trained on instances of size $n \times m$, with $(n, m) \in \{(6,6), (10,10), (15,15)\}$;

  • for deterministic instances, we can compare with L2D using values reported in the associated paper ([28]);

  • we test several popular Priority Dispatch Rules ([20]), namely Most Operations Remaining (MOPNR), Shortest Processing Time (SPT), Most Work Remaining (MWKR) and Minimum Ratio of Flow Due Date to Most Work Remaining (FDD/WKR). When computing a schedule, these rules use the mode duration of operations. Next, we retrieve the operations sequence scheduled for each machine and run this sequence with the real operation duration, as done in Schedule Generation Scheme (SGS);

  • we use the CP-SAT solver of OR-Tools, denoted OR-Tools in the tests on deterministic instances. For stochastic instances, we use it with mode durations and with real durations, respectively denoted OR-Tools mode and OR-Tools real. As for PDRs, we retrieve the order on each machine and use SGS;

  • for stochastic instances, we implement the approach proposed in [13], here denoted CP-stoc, which consists in finding the schedule that minimizes the average makespan over a given number of sampled instances. We found that using 50 samples was a very good compromise between solution quality and computation time (100 samples bring little improvement and take too much time). It is implemented with CP Optimizer 22.10 through docplex ([10]).
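
The way priority dispatch rules are evaluated on stochastic instances (build per-machine sequences first, then replay them with real durations) can be sketched as follows for MOPNR. Ties are broken by lowest job index, which is an assumption, and all names are hypothetical. In this simplified form MOPNR only counts remaining operations, so durations matter only during the replay; rules such as SPT would additionally read the mode durations when dispatching.

```python
from typing import Dict, List, Tuple

Op = Tuple[int, int]  # (job index, operation index within the job)


def mopnr_machine_sequences(jobs: List[List[int]]) -> Dict[int, List[Op]]:
    """jobs[j][o] is the machine of operation o of job j. Returns per-machine order."""
    next_op = [0] * len(jobs)
    sequences: Dict[int, List[Op]] = {}
    while any(next_op[j] < len(jobs[j]) for j in range(len(jobs))):
        # Job with the most remaining operations; ties go to the lowest index.
        j = max(range(len(jobs)), key=lambda k: (len(jobs[k]) - next_op[k], -k))
        sequences.setdefault(jobs[j][next_op[j]], []).append((j, next_op[j]))
        next_op[j] += 1
    return sequences


def replay_makespan(sequences: Dict[int, List[Op]], durations: Dict[Op, float]) -> float:
    """Start each operation as early as its job order and machine order allow."""
    end: Dict[Op, float] = {}
    position = {m: 0 for m in sequences}
    remaining = sum(len(seq) for seq in sequences.values())
    while remaining:
        progressed = False
        for m, seq in sequences.items():
            while position[m] < len(seq):
                j, o = seq[position[m]]
                if o > 0 and (j, o - 1) not in end:
                    break  # job predecessor not finished yet, try another machine
                job_ready = end[(j, o - 1)] if o > 0 else 0.0
                machine_ready = end[seq[position[m] - 1]] if position[m] > 0 else 0.0
                end[(j, o)] = max(job_ready, machine_ready) + durations[(j, o)]
                position[m] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise ValueError("machine sequences are inconsistent with job order")
    return max(end.values())


jobs = [[0, 1], [1, 0]]  # two jobs, two machines
seqs = mopnr_machine_sequences(jobs)
mode = {(0, 0): 3.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
real = {(0, 0): 3.5, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
print(seqs, replay_makespan(seqs, mode), replay_makespan(seqs, real))
```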

Classical techniques like CP-stoc and OR-Tools/CP-SAT are anytime algorithms that need to compute a solution for every problem, while our approach uses a large offline training time and the resulting agent only takes a small inference time for every problem. We decided to give 3 minutes to classical techniques, as they tend to give very good solutions very quickly, including for large problems, but generally need up to hours to find an optimal solution as soon as the problem size becomes large. On the other hand, Wheatley takes from 1 hour to a few days of training (depending on the problem size), but has a fixed inference time that can become very small when correctly optimized (linear in the number of tasks). The number of iterations needed to reach the best model is given in Table 1.

                 W-6x6   W-10x10   W-15x15
deterministic    962     542       519
stochastic       712     714       434
Table 1: Epoch number for best model

5.3 Results

               Deterministic                   Stochastic
Evaluation     W-6x6   W-10x10   W-15x15       W-6x6   W-10x10   W-15x15
6 × 6          508*    521       521           700*    714       715
10 × 10        927     890*      915           1269    1217*     1232
15 × 15        1557    1388*     1392          2297    1889*     1889*
20 × 15        1798    1583*     1622          2585    2181*     2188
20 × 20        2314    1959      1888*         3632    2643      2608*
Table 2: Comparison of Wheatley with respect to training instance sizes (* marks the best value on each row, for each setting).

5.3.1 Wheatley baselines

We first compare the three Wheatley baselines together: we have tested them on small instances, both deterministic and stochastic.

Table 2 presents results obtained for deterministic and stochastic instances on Taillard problems of several sizes. For each size n×m𝑛𝑚n\times mitalic_n × italic_m, we have generated 100 instances and, for the stochastic evaluation, we have then sampled one duration scenario for each instance. We then compute the average makespan for each set of instances sizes. Results show that W-10x10 is a good compromise, both for deterministic and stochastic problems. Therefore, in the following, we only present results associated with this approach.

5.3.2 Deterministic JSSP

We compare W-10x10 with the baselines presented previously for the deterministic case. In Table 3, we present the average makespan and the average gap obtained for all instances of each category size, where the gap for an approach $a$ is equal to $100 \cdot \frac{\mathit{makespan}(a) - \mathit{makespan}^{\mathit{best}}}{\mathit{makespan}^{\mathit{best}}}$. Note that we do not present the results obtained for each PDR but only the best one.

Evaluation   W-10x10        L2D            Best PDR       OR-Tools
6 × 6        521 (7.4)      571 (17.7)     545 (12.4)     485* (0)
10 × 10      890 (9.6)      993 (22.3)     948 (16.8)     812* (0)
15 × 15      1389 (17.2)    1501 (26.7)    1419 (19.8)    1185* (0)
20 × 15      1583 (16.9)    -              1642 (21.3)    1354* (0)
20 × 20      1959 (24.9)    2026 (29.2)    1870 (19.3)    1568* (0)
30 × 10      1829 (5.5)     -              1878 (8.9)     1725* (0)
30 × 15      2043 (14.5)    -              2092 (17.3)    1784* (0)
30 × 20      2377 (22.0)    -              2331 (19.7)    1948* (0)
50 × 15      3060 (8.3)     -              3079 (9.0)     2825* (0)
50 × 20      3322 (14.9)    -              3295 (14.0)    2891* (0)
60 × 10      3357 (1.7)     -              3376 (2.3)     3301* (0)
100 × 20     5886 (6.9)     -              5786 (5.1)     5507* (0)
Table 3: Results on deterministic Taillard instances: average makespan (average gap). * marks the highlighted result on each row.
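
The gap reported in Table 3 follows directly from the footnote formula. The helper below is a minimal sketch; applying it to the averaged makespans of the 6 × 6 row gives a value close to the reported one, but in general the average of per-instance gaps differs from the gap of the averages.

```python
def gap(makespan: float, best: float) -> float:
    """Relative distance to the best known makespan, in percent."""
    return 100.0 * (makespan - best) / best


# 6 x 6 row of Table 3: W-10x10 average (521) against OR-Tools average (485).
print(round(gap(521, 485), 1))
```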

Results show that OR-Tools outperforms the other approaches for these sizes, but W-10x10 manages to get close results, even for large instances. More precisely, in comparison with L2D, which is also an approach based on DRL and GNN that was developed for solving JSSPs ([28]), W-10x10 returns better schedules on average. W-10x10 competes with the best PDR, which is mostly MOPNR in the case of instances larger than $20 \times 20$. These results show that Wheatley is able to learn task selection strategies that generalize to much larger problems.

5.3.3 Stochastic JSSP

Evaluation   W-10x10         Wd-10x10       MOPNR          CP-stoc         OR-Tools mode   OR-Tools real
6 × 6        714 (16.3)      817 (33.1)     699 (13.8)     669* (9.0)      728 (18.6)      614 (0)
10 × 10      1217 (21.5)     1464 (46.1)    1252 (25.0)    1177* (17.5)    1262 (25.9)     1002 (0)
15 × 15      1889 (29.3)     2406 (64.7)    1988 (36.1)    1872* (28.1)    1925 (31.8)     1461 (0)
20 × 15      2181* (30.5)    2729 (63.3)    2314 (38.5)    2222 (33.0)     2244 (34.3)     1571 (0)
20 × 20      2643 (36.4)     3511 (81.2)    2708 (40.0)    2631* (35.8)    2619 (35.1)     1938 (0)
30 × 10      2425* (14.1)    3511 (65.2)    2532 (19.1)    2476 (16.5)     2598 (22.2)     2126 (0)
30 × 15      2792* (26.7)    3251 (47.5)    2964 (34.5)    2892 (31.2)     2943 (33.5)     2204 (0)
30 × 20      3305* (36.9)    4186 (73.3)    3390 (40.4)    3355 (39.0)     3299 (36.6)     2415 (0)
50 × 15      4043* (16.5)    4413 (27.1)    4262 (22.8)    4239 (22.1)     4435 (27.7)     3472 (0)
50 × 20      4520* (26.8)    5351 (50.1)    4679 (31.2)    4682 (31.3)     4758 (33.4)     3566 (0)
60 × 10      4315* (6.3)     4475 (10.2)    4451 (9.6)     4442 (9.4)      4579 (12.8)     4061 (0)
100 × 20     7591* (11.8)    8377 (23.3)    7956 (17.1)    8203 (20.8)     8188 (20.5)     6793 (0)
Table 4: Results on stochastic Taillard instances: average makespan (average gap). * marks the highlighted result on each row; OR-Tools real is the reference for the gap.

Table 4 shows the results obtained for stochastic problems. Note that OR-Tools real is a perfect-information solver, in the sense that it works with the real operation durations, which are unknown to the other approaches at scheduling time. Therefore, the makespan computed by OR-Tools real is much lower than that of the other approaches. Results show that CP-stoc is the closest to OR-Tools real for small problem sizes: even though it works with only 50 scenarios, it manages to find a good average makespan. However, when the instance size increases, W-10x10 clearly outperforms the other approaches. We also present the results of the deterministic version of Wheatley run on modes, denoted Wd-10x10. This shows that Wheatley is able to successfully generalize to larger problems.

Figure 4: Cumulative makespan of W-10x10 and CP-stoc for 100 duration scenarios. (a) 6 × 6 instance. (b) 100 × 20 instance.

In order to further compare the approaches in terms of the scatter of the results, we have sampled 100 duration scenarios for one problem of size $6 \times 6$ and 100 scenarios for one problem of size $100 \times 20$. Cumulative makespans are presented in Figure 4. It shows that the results presented in Table 4 are representative of several scenarios. In fact, for the $6 \times 6$ problem, CP-stoc returns the lowest makespans, then OR-Tools, then MOPNR and W-10x10, which are equivalent (Figure 4(a)). That order is completely reversed in the case of the $100 \times 20$ problem, in which W-10x10 returns the best results (Figure 4(b)).
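
One way to build such a cumulative view from sampled scenarios is an empirical cumulative distribution, sketched below. Whether Figure 4 uses exactly this construction is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple


def empirical_cdf(makespans: List[float]) -> List[Tuple[float, float]]:
    """Pairs (makespan, fraction of scenarios with a makespan at most this large)."""
    ordered = sorted(makespans)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


# Four sampled scenarios for one fixed schedule.
print(empirical_cdf([30.0, 10.0, 20.0, 20.0]))
```

A curve that lies further to the left indicates an approach whose schedule yields lower makespans across the sampled scenarios.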

6 Conclusion and Future Works

This paper presents Wheatley, a novel approach for solving JSSPs with uncertain operations duration. It combines Graph Neural Networks and Deep Reinforcement Learning techniques in order to learn a policy that iteratively selects the next operation to execute on each machine. The policy is updated during the training phase through PPO. Results show that Wheatley is competitive in the case of deterministic JSSPs and outperforms other approaches for stochastic JSSPs. Moreover, Wheatley is able to generalize to larger instances.

This work could be extended in several directions. First, it would be possible to extend the experiments with other JSSPs data and particularly instances coming from the industry. It would also be interesting to study the effect of pretraining the policy before running PPO. Finally, we are convinced that the GNN and DRL could be applied to other scheduling problems, such as the Resource-Constrained Project Scheduling Problem, in which handling uncertainty is essential in an industrial context.

References

  • [1] J. Beck and Nic Wilson. Proactive algorithms for scheduling with probabilistic durations. In IJCAI International Joint Conference on Artificial Intelligence, pages 1201–1206, 01 2005.
  • [2] Leonora Bianchi, Marco Dorigo, Luca Maria Gambardella, and Walter J Gutjahr. A survey on metaheuristics for stochastic combinatorial optimization. Natural Computing, 8:239–287, 2009.
  • [3] Mohammed Boukedroun, David Duvivier, Abdessamad Ait-el Cadi, Vincent Poirriez, and Moncef Abbas. A hybrid genetic algorithm for stochastic job-shop scheduling problems. RAIRO: Operations Research (2804-7303), 57(4), 2023.
  • [4] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? CoRR, abs/2105.14491, 2021.
  • [5] Giacomo Da Col and Erich C. Teppan. Industrial-size job shop scheduling with constraint programming. Operations Research Perspectives, 9:100249, 2022.
  • [6] Bao-An Han and Jian-Jun Yang. Research on adaptive job shop scheduling problems based on dueling double DQN. IEEE Access, 8:186474–186495, 2020.
  • [7] Shengyi Huang and Santiago Ontañón. A closer look at invalid action masking in policy gradient algorithms. The International FLAIRS Conference Proceedings, 35, May 2022.
  • [8] Zangir Iklassov, Dmitrii Medvedev, Ruben Solozabal Ochoa de Retana, and Martin Takác. On the study of curriculum learning for inferring dispatching policies on the job shop scheduling. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 5350–5358. ijcai.org, 2023.
  • [9] Yeong-Dae Kwon, Jinho Choo, Iljoo Yoon, Minah Park, Duwon Park, and Youngjune Gwon. Matrix encoding networks for neural combinatorial optimization. In 35th Conference on Neural Information Processing Systems (NEURIPS), 2021.
  • [10] Philippe Laborie, Jérôme Rogerie, Paul Shaw, and Petr Vilím. IBM ILOG CP Optimizer for scheduling: 20+ years of scheduling with constraints at IBM/ILOG. Constraints, 23:210–250, 2018.
  • [11] Chien-Liang Liu, Chuan-Chin Chang, and Chun-Jan Tseng. Actor-critic deep reinforcement learning for solving job shop scheduling problems. IEEE Access, 8:71752–71762, 2020.
  • [12] Chien-Liang Liu and Tzu-Hsuan Huang. Dynamic job-shop scheduling problems using graph neural network and deep reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2023.
  • [13] LocalSolver. Stochastic job shop scheduling problems, 2023.
  • [14] Ping Lou, Quan Liu, Zude Zhou, Huaiqing Wang, and Sherry Xiaoyun Sun. Multi-agent-based proactive–reactive scheduling for a job shop. The International Journal of Advanced Manufacturing Technology, 59:311–324, 2012.
  • [15] P.B. Luh, Dong Chen, and L.S. Thakur. An effective approach for job-shop scheduling with uncertain processing requirements. IEEE Transactions on Robotics and Automation, 15(2):328–339, 1999.
  • [16] Junyoung Park, Jaehyeong Chun, Sang Hun Kim, Youngkook Kim, and Jinkyoo Park. Learning to schedule job-shop problems: representation and policy learning using graph neural network and reinforcement learning. International Journal of Production Research, 59(11):3360–3377, Jan 2021.
  • [17] A. S. Raheja and V. Subramaniam. Reactive recovery of job shop schedules – a review. The International Journal of Advanced Manufacturing Technology, 19:756–763, 2002.
  • [18] Bernard Roy and Bernard Sussmann. Les problèmes d’ordonnancement avec contraintes disjonctives. Note DS, 9, 1964.
  • [19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
  • [20] Veronique Sels, Nele Gheysen, and Mario Vanhoucke. A comparison of priority rules for the job shop scheduling problem under different flow time-and tardiness-related objective functions. International Journal of Production Research, 50(15):4255–4270, 2012.
  • [21] J. Shen and Y. Zhu. Chance-constrained model for uncertain job shop scheduling problem. Soft Computing, 20:2383–2391, 2016.
  • [22] Wen Song, Xinyang Chen, Qiqiang Li, and Zhiguang Cao. Flexible job-shop scheduling via graph neural network and deep reinforcement learning. IEEE Transactions on Industrial Informatics, 19(2):1600–1610, 2023.
  • [23] Eric Taillard. Benchmarks for basic scheduling problems. European Journal of Operational Research, 64(2):278–285, 1993.
  • [24] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315, 2019.
  • [25] Hegen Xiong, Shuangyuan Shi, Danni Ren, and Jinjin Hu. A survey of job shop scheduling problem: The types and models. Computers & Operations Research, 142:105731, 2022.
  • [26] Hegen Xiong, Shuangyuan Shi, Danni Ren, and Jinjin Hu. A survey of job shop scheduling problem: The types and models. Computers & Operations Research, 142, 2022.
  • [27] Yunhui Zeng, Zijun Liao, Yuanzhi Dai, Rong Wang, Xiu Li, and Bo Yuan. Hybrid intelligence for dynamic job-shop scheduling with deep reinforcement learning and attention mechanism, 2022.
  • [28] Cong Zhang, Wen Song, Zhiguang Cao, Jie Zhang, Puay Siew Tan, and Xu Chi. Learning to dispatch for job shop scheduling via deep reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1621–1632. Curran Associates, Inc., 2020.