Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Hierarchical Neural Constructive Solver for Real-world TSP Scenarios

Yong Liang Goh Grabtaxi Holdings Pte Ltd &
National University of Singapore
Singapore
gyl@u.nus.edu
Zhiguang Cao Singapore Management UniversitySingapore zgcao@smu.edu.sg Yining Ma National University of SingaporeSingapore yiningma@u.nus.edu Yanfei Dong National University of SingaporeSingapore dyanfei@u.nus.edu Mohammed Haroon Dupty National University of SingaporeSingapore haroon@nus.edu.sg  and  Wee Sun Lee National University of SingaporeSingapore leews@nus.edu.sg
(2024)
Abstract.

Existing neural constructive solvers for routing problems have predominantly employed transformer architectures, conceptualizing the route construction as a set-to-sequence learning task. However, their efficacy has primarily been demonstrated on entirely random problem instances that inadequately capture real-world scenarios. In this paper, we introduce realistic Traveling Salesman Problem (TSP) scenarios relevant to industrial settings and derive the following insights: (1) The optimal next node (or city) to visit often lies within proximity to the current node, suggesting the potential benefits of biasing choices based on current locations. (2) Effectively solving the TSP requires robust tracking of unvisited nodes and warrants succinct grouping strategies. Building upon these insights, we propose integrating a learnable choice layer inspired by Hypernetworks to prioritize choices based on the current location, and a learnable approximate clustering algorithm inspired by the Expectation-Maximization algorithm to facilitate grouping the unvisited cities. Together, these two contributions form a hierarchical approach towards solving the realistic TSP by considering both immediate local neighbourhoods and learning an intermediate set of node representations. Our hierarchical approach yields superior performance compared to both classical and recent transformer models, showcasing the efficacy of the key designs.

neural constructive solver, traveling salesman problem, deep reinforcement learning
journalyear: 2024copyright: rightsretainedconference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 25–29, 2024; Barcelona, Spainbooktitle: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spaindoi: 10.1145/3637528.3672053isbn: 979-8-4007-0490-1/24/08ccs: Computing methodologies Machine learningccs: Computing methodologies Neural networksccs: Computing methodologies Sequential decision makingccs: Computing methodologies Learning latent representations

1. Introduction

The Traveling Salesman Problem (TSP) is a classical combinatorial optimization problem. Simply put, the TSP asks the following: given a set of cities, what is the shortest route where a salesman can visit every city only once and return back to his starting city? While it is simple to state, the TSP is a very difficult problem known to be NP-hard. Nevertheless, the TSP is a crucial problem to study, as many parallel problems can be reduced to solving the TSP, such as chip placement (Kumar and Luo, 2003), the study of spin glass problems in physics (Kirkpatrick and Toulouse, 1985), DNA sequencing (Caserta and Voß, 2014), and many others.

Refer to caption
Figure 1. Comparing subset of TSP drawn from USA map (first row) against a uniform distribution (second row). Left-most plots show the base distribution. The subsets follow certain underlying structures for the USA case compared to completely random problems.

Given its prevalence across a multitude of domains, the TSP has been extensively researched in the community. Particularly, the main approaches can be broken down into exact methods and approximate methods. Exact methods often materialize in the form of mathematical programming. Some popular exact solvers, such as Concorde (Applegate, 2003), are developed based on linear programming and cutting planes. Approximate methods tend to be in the form of expert heuristics. An example would be the Lin-Kernighan-Helsgaun (LKH-3) algorithm, which utilizes heuristics and local search methods to update and improve initial solutions. As their names describe, exact methods return the true optimal routes while approximate ones return solutions often within some error bound of the optimal one. As the size of the problems grows, exact methods are intractable due to the NP-hard nature of the problem.

More recently, the deep learning community has put much effort into establishing practical neural solvers. These typically appear in the form of deep reinforcement learning (Bello et al., 2016; Kool et al., 2018; Kwon et al., 2020; Kim et al., 2022; Gao et al., 2023), which presents a label-free approach to improve the models. This is preferred over supervised learning approaches (e.g, (Joshi et al., 2019; Sun and Yang, 2023)) since they require large amounts of labelled data, which is often challenging to obtain given the limited scalability of exact solvers.

It is important to note that learning-based solvers may perform well on specific target distributions they are trained on, but often suffer from poor generalization to other arbitrary instances. This is acceptable if the target distributions reflect real-world cases. However, prior works have mostly only been trained and tested on problems derived from random synthetic distributions, which may not accurately represent real-world applications. While some learned solvers have been tested on realistic TSP instances, such as those from the TSPLIB (Reinelt, [n. d.]), a collection of TSP instances observed in the real world, these realistic problem collections are typically too small to train on. Therefore, there is a gap in studying the performance of learning-based solvers in more realistic scenarios from the real world.

In this paper, we propose and study a setup that closely mirrors practical scenarios. Consider a logistic company with a large set of M𝑀Mitalic_M fixed locations that it can deliver to. The company will do many delivery trips to these locations but for each trip, only a small subset V𝑉Vitalic_V of the M𝑀Mitalic_M locations, e.g., locations that placed orders for that day, need to be visited. This observation motivates us to generate realistic distributions as follows: select a set of M𝑀Mitalic_M fixed locations from the real world to reflect real-world location distributions, and build each problem instance by randomly sampling a subset of V𝑉Vitalic_V locations from the set of M𝑀Mitalic_M locations.

We construct the locations using real-world datasets of cities from the USA, Japan, and Burma. When state-of-the-art neural solvers are trained in realistic settings, their performance may not be satisfactory. To improve the neural solvers, we focus on exploiting the nature of the node construction process. Existing neural solvers typically construct TSP tours autoregressively, solving the problem of visiting all unvisited cities starting from the current city and returning to the starting city. This suggests learning more effective and generalizable representations of the decision information regarding the current city and unvisited cities.

Specifically, we first observe that neural solvers often struggle to select a proper next city in the neighbourhood of the current city in our proposed realistic setup. This highlights the importance of exploiting more effective decision information about the current city. To this end, we propose to incorporate a customized hypernetwork layer (Ha et al., 2016) that leverages the embedding of the current city to modify the choice of the next city to visit.

Moreover, to obtain an improved representation of unvisited cities, we design a hierarchical representation that divides the cities into C𝐶Citalic_C partitions and use an embedding to represent the unvisited cities in each of the C𝐶Citalic_C partitions. We perform the partitioning using a soft clustering method, inspired by the EM algorithm. Given its differentiable property of EM, we can propagate the gradient through the clustering to learn the encoder parameters effectively.

We showcase the effectiveness of the hypernetwork and the hierarchical representation of unvisited cities in experiments with the proposed realistic setting. We highlight the following contributions:

  • We introduce a more realistic TSP setting using real-world data to more convincingly demonstrate the effectiveness of neural TSP solvers.

  • We make the key observation that neural solvers often struggle with node selection within a small locality, and we design a hypernetwork layer to emphasize local choices.

  • We further exploit the nature of solving structured TSPs by representing the set of unvisited cities with multiple key embeddings using a differentiable soft clustering algorithm instead of conventional simplistic pooling methods.

2. Related Work

2.1. Constructive Neural Solvers

Deep reinforcement learning forms the hallmark of training constructive neural solvers. Early works from (Bello et al., 2016) proposed to use the Pointer Network (Vinyals et al., 2015) based on the sequence-to-sequence architecture in (Sutskever et al., 2014) to solve TSP and Knapsack problems. They employ an actor-critic approach and achieved strong results on the TSP. Follow-up works from (Nazari et al., 2018) further improve the performance of the Pointer Network.

Following on from this, the transformer network based on attention (Vaswani et al., 2017) was proposed by (Kool et al., 2018) to solve the TSP, Capacited Vehicle Routing Problem (CVRP) and the Orienteering Problem (OP). Primarily, the work showed that one can train a neural solver using the REINFORCE algorithm (Williams, 1992) and a simple greedy rollout of the network with a lagging baseline. Since then, multiple works based on the same architecture have been proposed to improve the predictive power of such solvers further (Zhang et al., 2023). POMO (Kwon et al., 2020) was introduced and observed that constructive solvers were limited by their starting nodes. Hence, to effectively explore the search space of solutions, one should use all nodes as starting nodes, effectively constructing a simple beam search. Additionally, they showed that a stable baseline can be found in the average of all solutions. Sym-NCO (Kim et al., 2022) was proposed to exploit the symmetry of TSP by introducing symmetry losses to regularize the transformer network. Recently, ELG was introduced by (Gao et al., 2023) that defined a local learnable policy based on a k-Nearest Neighbor graph. They exemplified the generalizability of the network on large CVRP instances.

2.2. Improvement Neural Solvers

Apart from constructive methods, another approach to solving the TSP looks at improvement solvers. This methodology is inspired by algorithms such as 2-opt, whereby the solver first starts with a complete route, and a heuristic is then used to select edges to delete or add so as to edit the solution. Such methods are measured based on how quickly they can reach a strong solution. Work such as (d O Costa et al., 2020) uses neural networks to learn the 2-opt heuristic for improvement, while (Wu et al., 2021) uses the transformer to select node pairs for swaps for the TSP. Ma et al. (Ma et al., 2021) then extended the transformer network to learn node and edge embeddings separately, which is then upgraded for pickup and delivery problems in (Ma et al., 2022) and flexible k-opt in (Ma et al., 2023), pushing the iterative solver’s performance further.

2.3. Search-based techniques

The previous two approaches are based on some form of learning-based search: constructive solvers try to perform a global search by learning heuristics entirely from data, whereas iterative solvers learn to guide local search techniques instead. Besides these, there is a class of search-based techniques that involve applying search during inference. Efficient Active Search (EAS) was proposed by (Hottung et al., 2021) to introduce lightweight learnable layers at inference that could be tuned to improve the predictive power of a model on test samples. Other works such as (Fu et al., 2021) showcase that one can leverage a small pre-trained network and combine it with search techniques such as Monte-Carlo Tree Search (MCTS) (Chaslot et al., 2008) so as to solve large-scale TSPs. The work in (Choo et al., 2022) then combined both MCTS and EAS to improve the search capabilities further. Lastly, another work (Kool et al., 2022) showcased how one could combine dynamic programming with a pre-trained neural network to scale the TSP to 10,000 nodes.

Previous works attempt to regularize the networks via symmetry or scale the solver to larger problems. However, they essentially are still based on transformer models trained on arbitrary distributions. Apart from ELG, these works do not consider the impact of local choices nor explore further how to represent unvisited cities better, which are critical aspects of solving the TSP in our view.

3. Our Approach

Refer to caption
Figure 2. POMO model making a minor mistake with a poor selection of a local node
Refer to caption
Figure 3. POMO model making a major mistake by not visiting nodes that are near it, causing cross-cluster routes that are inefficient
Refer to caption
Figure 4. Overview of our proposed architecture. Given a TSP instance, we learn contextual embeddings of the cities in a set of cluster representations with an EM-inspired differentiable technique. In addition, our policy is dynamically adapted with a local hypernetwork which emphasises the completion of local cluster before moving on to new distant cities.

Generally, we find that neural solvers tend to make two classes of errors in route construction compared to the optimal solution for such practical scenarios. The first class often appears as a minor error, where a poor decision is made in a local neighbourhood, as shown in Figure 2. This results in a sub-optimal route because a local choice is not picked first. The second class tends to appear in problems with more structure, where the agent fails to visit all reasonable nodes within a local cluster and has to backtrack to the area, exemplified in Figure 3. This tends to give solutions that are significantly poorer than optimal.

Our approach seeks to tackle these two errors more effectively. We propose two main architectural improvements to the base transformer model to address these issues. Firstly, we propose a learnable local choice decoder that accentuates certain choices based on the agent’s current location (and hence locality). Secondly, we propose a differentiable clustering algorithm to learn a set of representations to capture and summarize the set of remaining cities. Our full approach is illustrated in Figure 4.

3.1. Recap: Constructive Neural Solvers

In this subsection, we review previous works in well-known constructive neural solvers such as the attention model (Kool et al., 2018) and the POMO model (Kwon et al., 2020). The problem can be defined as an instance s𝑠sitalic_s on a fully connected graph of n𝑛nitalic_n nodes, where each node represents a city. Each node nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT where i{1,,n}𝑖1𝑛i\in\{1,...,n\}italic_i ∈ { 1 , … , italic_n } has features xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, typically the 2D coordinate. A solution is defined as a TSP tour, and is stated as a permutation of the nodes given by τ=(τ1,,τn)𝜏subscript𝜏1subscript𝜏𝑛\mathbf{\tau}=(\tau_{1},...,\tau_{n})italic_τ = ( italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), such that τt{1,,n}subscript𝜏𝑡1𝑛\tau_{t}\in\{1,...,n\}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ { 1 , … , italic_n }. Note that since the tour does not allow backtracking, τtτt,ttformulae-sequencesubscript𝜏𝑡subscript𝜏superscript𝑡for-all𝑡superscript𝑡\tau_{t}\neq\tau_{t^{\prime}},\forall t\neq t^{\prime}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ≠ italic_τ start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , ∀ italic_t ≠ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. The formulation yields the following policy:

(1) pθ(τ|s)=t=1npθ(τt|s,τ1:t1)subscript𝑝𝜃conditional𝜏𝑠subscriptsuperscriptproduct𝑛𝑡1subscript𝑝𝜃conditionalsubscript𝜏𝑡𝑠subscript𝜏:1𝑡1p_{\theta}(\mathbf{\tau}|s)=\prod^{n}_{t=1}p_{\theta}(\tau_{t}|s,\mathbf{\tau}% _{1:t-1})italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ | italic_s ) = ∏ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s , italic_τ start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT )

Since the model is trained via reinforcement learning, the policy’s action thus refers to the selection of the next node τtsubscript𝜏𝑡\tau_{t}italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT given the current state. The entire policy pθsubscript𝑝𝜃p_{\theta}italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT is parameterized with a neural network which involves both an encoder and a decoder. The encoder is a standard transformer model represented as such

(2) 𝐡~i=LNl(𝐡il1+MHAil(h1l1,,hnl1))subscript~𝐡𝑖superscriptLN𝑙superscriptsubscript𝐡𝑖𝑙1superscriptsubscriptMHA𝑖𝑙superscriptsubscript1𝑙1superscriptsubscript𝑛𝑙1\mathbf{\tilde{h}}_{i}=\textsc{LN}^{l}(\mathbf{h}_{i}^{l-1}+\textsc{MHA}_{i}^{% l}(h_{1}^{l-1},...,h_{n}^{l-1}))over~ start_ARG bold_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = LN start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ( bold_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT + MHA start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ( italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT , … , italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l - 1 end_POSTSUPERSCRIPT ) )
(3) 𝐡il=LNl(𝐡~𝐢+FF(𝐡~𝐢))superscriptsubscript𝐡𝑖𝑙superscriptLN𝑙subscript~𝐡𝐢FFsubscript~𝐡𝐢\mathbf{h}_{i}^{l}=\textsc{LN}^{l}(\mathbf{\tilde{h}_{i}}+\textsc{FF}(\mathbf{% \tilde{h}_{i}}))bold_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT = LN start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ( over~ start_ARG bold_h end_ARG start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT + FF ( over~ start_ARG bold_h end_ARG start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT ) )

where hilsubscriptsuperscript𝑙𝑖h^{l}_{i}italic_h start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the embedding of the i𝑖iitalic_i-th node at the l𝑙litalic_l-th layer, d𝑑ditalic_d the dimension size of the embeddings, MHA is the standard multi-headed attention layer (Vaswani et al., 2017), LN is a layer normalization function, and FF is a simple feed-forward multi-layer perceptron (MLP). Each node’s embeddings go through a total of L𝐿Litalic_L-layers before being passed into a decoder.

For the decoder, the solution is produced in an autoregressive fashion. In this case, at time step t𝑡titalic_t, the decoder receives the following

(4) 𝐡(c)=𝐡lastL+𝐡startLsubscript𝐡𝑐superscriptsubscript𝐡last𝐿superscriptsubscript𝐡start𝐿\mathbf{h}_{(c)}=\mathbf{h}_{\textsc{last}}^{L}+\mathbf{h}_{\textsc{start}}^{L}bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT = bold_h start_POSTSUBSCRIPT last end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT + bold_h start_POSTSUBSCRIPT start end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT

where 𝐡(c)subscript𝐡𝑐\mathbf{h}_{(c)}bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT is known as a contextual embedding. In this instance, the context given is the sum of the L𝐿Litalic_L-th encoder layer’s representation of the current node and the starting node. This is then first passed through a multi-headed attention layer where visited nodes are masked, followed by a single-headed attention layer for decision-making. In this decision layer, we obtain the following

(5) Q=WQ(H),K=WK(H),V=WV(H)formulae-sequence𝑄subscript𝑊𝑄𝐻formulae-sequence𝐾subscript𝑊𝐾𝐻𝑉subscript𝑊𝑉𝐻Q=W_{Q}(H),K=W_{K}(H),V=W_{V}(H)italic_Q = italic_W start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( italic_H ) , italic_K = italic_W start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ( italic_H ) , italic_V = italic_W start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT ( italic_H )

where H𝐻Hitalic_H is the set of all node embedding after the decoder’s multi-headed attention layer. With this, we compute the compatibility of the query with all nodes together with a mask, where the mask indicates previously visited nodes. This ensures that the decoder cannot pick an already visited node. Mathematically, we use the following attention mechanism

(6) aj={Utanh(QKd)jτt,t<totherwisesubscript𝑎𝑗cases𝑈tanh𝑄superscript𝐾top𝑑formulae-sequence𝑗subscript𝜏superscript𝑡for-allsuperscript𝑡𝑡otherwisea_{j}=\begin{cases}U\cdot\textsc{tanh}(\frac{QK^{\top}}{\sqrt{d}})&j\neq\tau_{% t^{\prime}},\forall t^{\prime}<t\\ -\infty&$\text{otherwise}$\end{cases}italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = { start_ROW start_CELL italic_U ⋅ tanh ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) end_CELL start_CELL italic_j ≠ italic_τ start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , ∀ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT < italic_t end_CELL end_ROW start_ROW start_CELL - ∞ end_CELL start_CELL otherwise end_CELL end_ROW

where we clip the values between [U,U]𝑈𝑈[-U,U][ - italic_U , italic_U ] using a tanh function as with works in (Kool et al., 2018; Bello et al., 2016; Kwon et al., 2020). These values are then normalized with a simple softmax function, giving us the following decision-making policy:

(7) pi=pθ(τt=i|s,τ1:t1)=eajjeajsubscript𝑝𝑖subscript𝑝𝜃subscript𝜏𝑡conditional𝑖𝑠subscript𝜏:1𝑡1superscript𝑒subscript𝑎𝑗subscript𝑗superscript𝑒subscript𝑎𝑗p_{i}=p_{\theta}(\tau_{t}=i|s,\tau_{1:t-1})=\frac{e^{a_{j}}}{\sum_{j}e^{a_{j}}}italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_i | italic_s , italic_τ start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT ) = divide start_ARG italic_e start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG

Finally, to train the model, the REINFORCE algorithm (Williams, 1992) is used, where works such as POMO (Kwon et al., 2020) and Sym-NCO (Kim et al., 2022) use a shared baseline of all starting points. Concretely, the expected return J𝐽Jitalic_J is maximized with gradient ascent and is approximated by

(8) θJ(θ)1Ni=1N(R(τi)bi(s))θlogpθ(τi|s)subscript𝜃𝐽𝜃1𝑁subscriptsuperscript𝑁𝑖1𝑅superscript𝜏𝑖superscript𝑏𝑖𝑠subscript𝜃subscript𝑝𝜃conditionalsuperscript𝜏𝑖𝑠\nabla_{\theta}J(\theta)\approx\frac{1}{N}\sum^{N}_{i=1}(R(\tau^{i})-b^{i}(s))% \nabla_{\theta}\log p_{\theta}(\tau^{i}|s)∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_J ( italic_θ ) ≈ divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ( italic_R ( italic_τ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) - italic_b start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_s ) ) ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT roman_log italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT | italic_s )

where R(τi)𝑅superscript𝜏𝑖R(\tau^{i})italic_R ( italic_τ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) is the reward of permutation τisuperscript𝜏𝑖\tau^{i}italic_τ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT, the tour length. pθ(τi|s)subscript𝑝𝜃conditionalsuperscript𝜏𝑖𝑠p_{\theta}(\tau^{i}|s)italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_τ start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT | italic_s ) =t=1Tpθ(ati|s,a1:t1i)absentsubscriptsuperscriptproduct𝑇𝑡1subscript𝑝𝜃conditionalsubscriptsuperscript𝑎𝑖𝑡𝑠subscriptsuperscript𝑎𝑖:1𝑡1=\prod^{T}_{t=1}p_{\theta}(a^{i}_{t}|s,a^{i}_{1:t-1})= ∏ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | italic_s , italic_a start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 : italic_t - 1 end_POSTSUBSCRIPT ) is the product of action probabilities for the trajectory, where actions refer to the selection of the next node to move to, s𝑠sitalic_s refers to the state of the problem, the set of all nodes, current node, and starting node. The baseline is calculated as the average return of all starting points, given by

(9) bi(s)=bshared(s)=1Nj=iNR(τj)isuperscript𝑏𝑖𝑠subscript𝑏shared𝑠1𝑁subscriptsuperscript𝑁𝑗𝑖𝑅superscript𝜏𝑗for-all𝑖b^{i}(s)=b_{\text{shared}}(s)=\frac{1}{N}\sum^{N}_{j=i}R(\tau^{j})\forall iitalic_b start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ( italic_s ) = italic_b start_POSTSUBSCRIPT shared end_POSTSUBSCRIPT ( italic_s ) = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = italic_i end_POSTSUBSCRIPT italic_R ( italic_τ start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ∀ italic_i

3.2. Improving local decision making with the Choice Decoder

More often than not, good selections for the TSP tend to lie within the locality of the current position. In the standard transformer architecture, the decoder attempts to capture this by using the current node’s representation in a single-headed attention layer, as shown in Equations 6 and 7. In fact, Equation 6 is essentially a kernel function between the current querying node and the candidate nodes, an observation made in (Tsai et al., 2019). Effectively, we can view the compatibility score as

(10) ϕ(Q,K)=Utanh(QKd)italic-ϕ𝑄𝐾𝑈tanh𝑄superscript𝐾top𝑑\phi(Q,K)=U\cdot\textsc{tanh}(\frac{QK^{\top}}{\sqrt{d}})italic_ϕ ( italic_Q , italic_K ) = italic_U ⋅ tanh ( divide start_ARG italic_Q italic_K start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG )

Given that we aim to focus more on the current node’s vicinity and features, we propose to generate a set of attention weights based on the current embeddings. This aims to amplify or nullify the compatibility scores further by conditioning it on the current node. Hypernetworks (Ha et al., 2016) are small neural networks designed to generate a set of weights for a main network. Its goal is to serve as a form of relaxed weight sharing. This approach allows the hypernetwork to take as input some information about the problem, such as its structure, and adapt the main network’s weights towards the problem. Inspired by this approach, we construct the set of attention weights using an MLP as a hypernetwork, with the input being the current node’s embedding. Concretely, we modify the compatibility function as follows:

(11) a^j=Utanh((QWchoice)Kd)subscript^𝑎𝑗𝑈tanh𝑄subscript𝑊choicesuperscript𝐾top𝑑\hat{a}_{j}=U\cdot\textsc{tanh}(\frac{(QW_{\textsc{choice}})K^{\top}}{\sqrt{d}})over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_U ⋅ tanh ( divide start_ARG ( italic_Q italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT ) italic_K start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG )

such that

(12) Wchoice=MLP(Q),WchoiceRdxdformulae-sequencesubscript𝑊choiceMLP𝑄subscript𝑊choicesuperscriptR𝑑𝑥𝑑W_{\textsc{choice}}=\textsc{MLP}(Q),W_{\textsc{choice}}\in\mathrm{R}^{dxd}italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT = MLP ( italic_Q ) , italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT ∈ roman_R start_POSTSUPERSCRIPT italic_d italic_x italic_d end_POSTSUPERSCRIPT

Effectively, Wchoicesubscript𝑊choiceW_{\textsc{choice}}italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT serves as a set of learnable conditional parameters that serve to alter the compatibility scores based on current embeddings. However, implementing this as a full matrix is extremely expensive. Instead, we realize Wchoicesubscript𝑊choiceW_{\textsc{choice}}italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT as a diagonal matrix, effectively reducing the compatibility score function to

(13) ϕ(Q,K|Q)=a~j=Utanh((QWchoice)Kd)italic-ϕ𝑄conditional𝐾𝑄subscript~𝑎𝑗𝑈tanh𝑄subscript𝑊choicesuperscript𝐾top𝑑\phi(Q,K|Q)=\tilde{a}_{j}=U\cdot\textsc{tanh}(\frac{(Q*W_{\textsc{choice}})K^{% \top}}{\sqrt{d}})italic_ϕ ( italic_Q , italic_K | italic_Q ) = over~ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_U ⋅ tanh ( divide start_ARG ( italic_Q ∗ italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT ) italic_K start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG )

where * is the element-wise product. Inherently, Wchoicesubscript𝑊choiceW_{\textsc{choice}}italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT now reduces to a set of learnable scalars which serve to modify the compatibility scores between Q𝑄Qitalic_Q and K𝐾Kitalic_K. Additionally, these learnable scalars are now conditioned upon Q𝑄Qitalic_Q, the current node, since it is generated via the MLP. The complexity of the MLP also now reduces from producing an output of Rd×dsuperscriptR𝑑𝑑\mathrm{R}^{d\times d}roman_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT to RdsuperscriptR𝑑\mathrm{R}^{d}roman_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT.

3.3. Exploiting structure with Hierarchical Decoder

In its current state, the decoder models the TSP as a set-to-sequence function. A key aspect of the input is the contextual embedding, 𝐡(c)subscript𝐡𝑐\mathbf{h}_{(c)}bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT. This embedding serves to represent the current state the model is in and is often a combination of the starting node, current node, and some global representation of the problem. For the global representation, works such as (Kool et al., 2018) and (Kwon et al., 2020) use an average of all node embeddings, while others such as (Jin et al., 2023), maintain an average of all visited nodes so far. Essentially, all of these approaches attempt to capture various nuances of the TSP.

However, for realistic problem settings, it is important to exploit the structure of the distribution of the cities. A single global representation would not be effective enough to capture the intricate correlations present between the cities. One notable and ubiquitous case is the presence of cluster patterns wherein certain cities are located near to each other while being distant to others. This clustering pattern, if captured within the global and contextual representation, can potentially provide the model with important clues to determine the next city to visit. Thus, for such problems, we propose to maintain a set of C𝐶Citalic_C representations that are able to summarize the set of unvisited cities left, instead of a simple single representation. We postulate that this is meaningful as structured problems have frequent cities in fixed areas of the map, and being able to identify if a node belongs to a certain area could be beneficial to the decision-making process.

Prior works in other domains have shown the efficacy of cluster construction in applications such as node classification (Dong et al., 2024). In this work, we wish to group all cities into C𝐶Citalic_C representations. To this end, we design the following layer inspired by the Expectation-Maximization (EM) algorithm (Moon, 1996). We first briefly review the EM algorithm for the Gaussian Mixture Model (GMM). Let θ={πk,μk}𝜃subscript𝜋𝑘subscript𝜇𝑘\mathbf{\theta}=\{\pi_{k},\mathbf{\mu}_{k}\}italic_θ = { italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } denote the set of parameters, the coefficients of Gaussians π𝜋\piitalic_π and its associated means μ𝜇\muitalic_μ (covariance ΣksubscriptΣ𝑘\Sigma_{k}roman_Σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is assumed to be known), 𝐗={𝐱(i)}𝐗superscript𝐱𝑖\mathbf{X}=\{\mathbf{x}^{(i)}\}bold_X = { bold_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } denote the set of data points, and 𝐙={z(i)}𝐙superscript𝑧𝑖\mathbf{Z}=\{z^{(i)}\}bold_Z = { italic_z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } denote the set of latent variables associated with the data. The maximum-likelihood objective is given by

(14) logp(𝐗|π,μ)=i=1Nlogz(i)=1Kp(𝐱(n)|z(n);μ)p(z(n)|π)𝑝conditional𝐗𝜋𝜇subscriptsuperscript𝑁𝑖1subscriptsuperscript𝐾superscript𝑧𝑖1𝑝conditionalsuperscript𝐱𝑛superscript𝑧𝑛𝜇𝑝conditionalsuperscript𝑧𝑛𝜋\log p(\mathbf{X}|\pi,\mathbf{\mu})=\sum^{N}_{i=1}\log\sum^{K}_{z^{(i)}=1}p(% \mathbf{x}^{(n)}|z^{(n)};\mathbf{\mu})p(z^{(n)}|\pi)roman_log italic_p ( bold_X | italic_π , italic_μ ) = ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT roman_log ∑ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = 1 end_POSTSUBSCRIPT italic_p ( bold_x start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT | italic_z start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ; italic_μ ) italic_p ( italic_z start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT | italic_π )

In general, Equation 14 yields no closed-form solution. Additionally, it is non-convex, and its derivatives are expensive to compute. Since latent variable z(i)superscript𝑧𝑖z^{(i)}italic_z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT exists for every observation and we have a sum inside a log, we look at the EM algorithm to solve this. Typically, the EM algorithm involves two steps: the E-step computes the posterior probability over z𝑧zitalic_z given the current parameters, and the M-step, which assumes that given that the data was generated with z(i)=ksuperscript𝑧𝑖𝑘z^{(i)}=kitalic_z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_k, finds the set of parameters that maximizes this. Effectively, for a standard GMM, this yields the following E-step, given an initial set of parameters θ𝜃\mathbf{\theta}italic_θ:

(15) γk=p(z(i)=k|𝐱(i);θold)=πk𝒩(𝐱|μk)j=1Kπj𝒩(𝐱|μj)subscript𝛾𝑘𝑝superscript𝑧𝑖conditional𝑘superscript𝐱𝑖superscript𝜃oldsubscript𝜋𝑘𝒩conditional𝐱subscript𝜇𝑘subscriptsuperscript𝐾𝑗1subscript𝜋𝑗𝒩conditional𝐱subscript𝜇𝑗\gamma_{k}=p(z^{(i)}=k|\mathbf{x}^{(i)};\mathbf{\theta}^{\text{old}})=\frac{% \pi_{k}\mathcal{N}(\mathbf{x}|\mathbf{\mu}_{k})}{\sum^{K}_{j=1}\pi_{j}\mathcal% {N}(\mathbf{x}|\mathbf{\mu}_{j})}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_p ( italic_z start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_k | bold_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ; italic_θ start_POSTSUPERSCRIPT old end_POSTSUPERSCRIPT ) = divide start_ARG italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT caligraphic_N ( bold_x | italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT caligraphic_N ( bold_x | italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_ARG

where γksubscript𝛾𝑘\gamma_{k}italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT can be viewed as the responsibility of cluster k𝑘kitalic_k towards data point 𝐱(i)superscript𝐱𝑖\mathbf{x}^{(i)}bold_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Then, for GMMs, the following update equations can be applied in the M-step:

(16) μk=1Nki=1Nγk(i)𝐱(i)subscript𝜇𝑘1subscript𝑁𝑘subscriptsuperscript𝑁𝑖1superscriptsubscript𝛾𝑘𝑖superscript𝐱𝑖\mathbf{\mu}_{k}=\frac{1}{N_{k}}\sum^{N}_{i=1}\gamma_{k}^{(i)}\mathbf{x}^{(i)}italic_μ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT bold_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT
(17) πk=NkN,wherei=iNγk(i)subscript𝜋𝑘subscript𝑁𝑘𝑁wheresubscriptsuperscript𝑁𝑖𝑖superscriptsubscript𝛾𝑘𝑖\pi_{k}=\frac{N_{k}}{N},\text{where}\sum^{N}_{i=i}\gamma_{k}^{(i)}italic_π start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG italic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG , where ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = italic_i end_POSTSUBSCRIPT italic_γ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT

Essentially, Equation 15 estimates the contribution of each Gaussian model given the current set of parameters. While Equations 16 and 17, highlight closed-form update equations for the Gaussian parameters.

Now, suppose a TSP instance drawn from a fixed map can be represented efficiently with C𝐶Citalic_C latent representations. Since we are dealing with subsets of problems from the same space, these C𝐶Citalic_C representations can be fixed and learnable. Let 𝐂RNc×d𝐂superscriptRsubscript𝑁𝑐𝑑\mathbf{C}\in\mathrm{R}^{N_{c}\times d}bold_C ∈ roman_R start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT denote this set of representations, where we have C={c1,c2,,cNc}𝐶subscript𝑐1subscript𝑐2subscript𝑐subscript𝑁𝑐C=\{c_{1},c_{2},...,c_{N_{c}}\}italic_C = { italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT }. We propose to learn and update these representations by considering a mixture model, where the latent variables are modelled by these latent embeddings. Similar to the EM algorithm, we produce a set of mixing coefficients using an attention layer and its attention weights. Concretely, our soft clustering algorithm estimates its mixing coefficients via the attention mechanism, and using these scores, the clusters are then updated with a weighted sum of the embeddings. This can be shown as

(18) h^i=WHhisubscript^𝑖subscript𝑊𝐻subscript𝑖\hat{h}_{i}=W_{H}h_{i}over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
(19) cj^=WCcj^subscript𝑐𝑗subscript𝑊𝐶subscript𝑐𝑗\hat{c_{j}}=W_{C}c_{j}over^ start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG = italic_W start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
(20) πi,j=softmax(hi^cj^d)subscript𝜋𝑖𝑗softmax^subscript𝑖superscript^subscript𝑐𝑗top𝑑\pi_{i,j}=\textsc{softmax}(\frac{\hat{h_{i}}\hat{c_{j}}^{\top}}{\sqrt{d}})italic_π start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = softmax ( divide start_ARG over^ start_ARG italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG over^ start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG )
(21) cj=iπi,jhisubscript𝑐𝑗subscript𝑖subscript𝜋𝑖𝑗subscript𝑖c_{j}=\sum_{i}\pi_{i,j}h_{i}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

where ΠΠ\Piroman_Π is the matrix containing all coefficients πi,jsubscript𝜋𝑖𝑗\pi_{i,j}italic_π start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT, WHsubscript𝑊𝐻W_{H}italic_W start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT and WCsubscript𝑊𝐶W_{C}italic_W start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT are parameters for the attentional scores, and C𝐶Citalic_C is the set of learnable embeddings for the distribution. Given a single set of parameters in the attention layer, the embeddings H𝐻Hitalic_H and C𝐶Citalic_C are passed through this same layer a total of B𝐵Bitalic_B times iteratively, mimicking a ”rollout” of a soft clustering algorithm of E-steps and M-steps within each iteration. Loosely, we can see that Equation 20 resembles a similar calculation of the E-step, wherein we use a set of parameters to estimate the coefficients instead of minimizing for the Euclidean distance in the GMM case. Equation 21 is similar to the M-step of GMMs, where we update the centers (in our case the embeddings are the latent variables) with a weighted sum of the data.

Once we have the set of C𝐶Citalic_C embeddings, we update the representation at every step of decoding by subtracting a weighted sum of the current node’s embedding; this computes the weighted sum of embeddings of unvisited cities, instead of all cities. Thus, at time step T𝑇Titalic_T, if the agent is current at node i𝑖iitalic_i, we update C𝐶Citalic_C via

(22) cj=cj(πi,jhi),jNcformulae-sequencesubscriptsuperscript𝑐𝑗subscript𝑐𝑗subscript𝜋𝑖𝑗subscript𝑖for-all𝑗subscript𝑁𝑐c^{\prime}_{j}=c_{j}-(\pi_{i,j}*h_{i}),\forall j\in N_{c}italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - ( italic_π start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∗ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , ∀ italic_j ∈ italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT

Then, we now construct a new context embedding such that

(23) 𝐡(c)=Wcombine[hlastL,c1,c2,,cNc]+𝐡firstLsubscript𝐡𝑐subscript𝑊combinesubscriptsuperscript𝐿lastsubscript𝑐1subscript𝑐2subscript𝑐subscript𝑁𝑐subscriptsuperscript𝐡𝐿first\mathbf{h}_{(c)}=W_{\textsc{combine}}[h^{L}_{\textsc{last}},c_{1},c_{2},...,c_% {N_{c}}]+\mathbf{h}^{L}_{\textsc{first}}bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT combine end_POSTSUBSCRIPT [ italic_h start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT last end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] + bold_h start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT first end_POSTSUBSCRIPT

where []delimited-[][\cdot][ ⋅ ] is the concatenation operation, and Wcombinesubscript𝑊combineW_{\textsc{combine}}italic_W start_POSTSUBSCRIPT combine end_POSTSUBSCRIPT is a simple linear layer to combine the embeddings. We keep 𝐡firstLsubscriptsuperscript𝐡𝐿first\mathbf{h}^{L}_{\textsc{first}}bold_h start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT first end_POSTSUBSCRIPT separate so as to preserve the importance of the starting node. As the decoder constructs the solutions, the C𝐶Citalic_C embeddings get updated along the way, maintaining a small set of unvisited cities so as to keep track of the solution.

Algorithm 1 Psuedo code of soft clustering algorithm
1:procedure CLUSTER(encoder embeddings H𝐻Hitalic_H, number of centers Ncsubscript𝑁𝑐N_{c}italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, number of iterations B𝐵Bitalic_B, initial embeddings C𝐶Citalic_C, embedding size d𝑑ditalic_d)
2:     for b1𝑏1b\leftarrow 1italic_b ← 1 to B𝐵Bitalic_B do
3:         H^WH(H)^𝐻subscript𝑊𝐻𝐻\hat{H}\leftarrow W_{H}(H)over^ start_ARG italic_H end_ARG ← italic_W start_POSTSUBSCRIPT italic_H end_POSTSUBSCRIPT ( italic_H )
4:         C^WC(C)^𝐶subscript𝑊𝐶𝐶\hat{C}\leftarrow W_{C}(C)over^ start_ARG italic_C end_ARG ← italic_W start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT ( italic_C )
5:         π=softmax(H^C^d)𝜋softmax^𝐻superscript^𝐶top𝑑\pi=\textsc{softmax}(\frac{\hat{H}\hat{C}^{\top}}{\sqrt{d}})italic_π = softmax ( divide start_ARG over^ start_ARG italic_H end_ARG over^ start_ARG italic_C end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) \triangleright Compute attention scores
6:         C=iπihi𝐶subscript𝑖subscript𝜋𝑖subscript𝑖C=\sum_{i}\pi_{i}h_{i}italic_C = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT \triangleright Update the centers with data
7:         Cout=C^+Csubscript𝐶out^𝐶𝐶C_{\textsc{out}}=\hat{C}+Citalic_C start_POSTSUBSCRIPT out end_POSTSUBSCRIPT = over^ start_ARG italic_C end_ARG + italic_C \triangleright Residual connection
8:         C=Norm(Cout)𝐶Normsubscript𝐶outC=\textsc{Norm}(C_{\textsc{out}})italic_C = Norm ( italic_C start_POSTSUBSCRIPT out end_POSTSUBSCRIPT ) \triangleright Layer normalization
9:     end for
10:     return C𝐶Citalic_C
11:end procedure
Algorithm 2 Psuedo code of one step of decoding in the hierarchical neural constructive solver
1:function UPDATE(current node embedding hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, cluster centers C𝐶Citalic_C, cluster weights π𝜋\piitalic_π)
2:     cj=cj(πi,jhi),cjCformulae-sequencesubscriptsuperscript𝑐𝑗subscript𝑐𝑗subscript𝜋𝑖𝑗subscript𝑖for-allsubscript𝑐𝑗𝐶c^{\prime}_{j}=c_{j}-(\pi_{i,j}*h_{i}),\forall c_{j}\in Citalic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - ( italic_π start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ∗ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , ∀ italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ italic_C
3:     return Csuperscript𝐶C^{\prime}italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
4:end function
5:procedure DECODE(encoder embeddings H𝐻Hitalic_H, cluster centers C𝐶Citalic_C, cluster weights π𝜋\piitalic_π, starting points P𝑃Pitalic_P)
6:     Cupdate(hlast,C,π)superscript𝐶updatesubscriptlast𝐶𝜋C^{\prime}\leftarrow\textsc{update}(h_{\textsc{last}},C,\pi)italic_C start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← update ( italic_h start_POSTSUBSCRIPT last end_POSTSUBSCRIPT , italic_C , italic_π ) \triangleright Remove visited embedding
7:     𝐡(c)Wcombine[hlast,c1,c2,,cNc]+𝐡firstLsubscript𝐡𝑐subscript𝑊combinesubscriptlastsubscript𝑐1subscript𝑐2subscript𝑐subscript𝑁𝑐subscriptsuperscript𝐡𝐿first\mathbf{h}_{(c)}\leftarrow W_{\textsc{combine}}[h_{\textsc{last}},c_{1},c_{2},% ...,c_{N_{c}}]+\mathbf{h}^{L}_{\textsc{first}}bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT ← italic_W start_POSTSUBSCRIPT combine end_POSTSUBSCRIPT [ italic_h start_POSTSUBSCRIPT last end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] + bold_h start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT first end_POSTSUBSCRIPT
8:     WchoiceMLP(hlast)subscript𝑊𝑐𝑜𝑖𝑐𝑒MLPsubscriptlastW_{choice}\leftarrow\textsc{MLP}(h_{\textsc{last}})italic_W start_POSTSUBSCRIPT italic_c italic_h italic_o italic_i italic_c italic_e end_POSTSUBSCRIPT ← MLP ( italic_h start_POSTSUBSCRIPT last end_POSTSUBSCRIPT ) \triangleright Hypernetwork
9:     𝐡(c)MHA(𝐡(c),K,V)subscriptsuperscript𝐡𝑐MHAsubscript𝐡𝑐𝐾𝑉\mathbf{h^{\prime}}_{(c)}\leftarrow\textsc{MHA}(\mathbf{h}_{(c)},K,V)bold_h start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT ← MHA ( bold_h start_POSTSUBSCRIPT ( italic_c ) end_POSTSUBSCRIPT , italic_K , italic_V )
10:     a^j=Ctanh((QWchoice)Kd)subscript^𝑎𝑗𝐶tanh𝑄subscript𝑊choicesuperscript𝐾top𝑑\hat{a}_{j}=C\cdot\textsc{tanh}(\frac{(QW_{\textsc{choice}})K^{\top}}{\sqrt{d}})over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_C ⋅ tanh ( divide start_ARG ( italic_Q italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT ) italic_K start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG )
11:     ujsoftmax(a^j)subscript𝑢𝑗softmaxsubscript^𝑎𝑗u_{j}\leftarrow\textsc{softmax}(\hat{a}_{j})italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← softmax ( over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) \triangleright Create action probabilities
12:     pisample(u)subscript𝑝𝑖sample𝑢p_{i}\leftarrow\textsc{sample}(u)italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← sample ( italic_u )
13:     return i,pi𝑖subscript𝑝𝑖i,p_{i}italic_i , italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT\triangleright Return the selected node and its probability
14:end procedure

In totality, our approach forms a hierarchy in the decision-making process; an intermediate level of C𝐶Citalic_C embeddings represent the set of unvisited cities and their groupings, and immediate local decision-making is learnt and skewed by Wchoicesubscript𝑊choiceW_{\textsc{choice}}italic_W start_POSTSUBSCRIPT choice end_POSTSUBSCRIPT to favour decisions based on current positions. Algorithms 1 and 2 highlights the overall flow of our approach.

4. Experimental Setup

Table 1. List of augmentations suggested by (Kwon et al., 2020)
f(x,y)𝑓𝑥𝑦f(x,y)italic_f ( italic_x , italic_y )
(x,y)𝑥𝑦(x,y)( italic_x , italic_y ) (y,x)𝑦𝑥(y,x)( italic_y , italic_x )
(x,1y)𝑥1𝑦(x,1-y)( italic_x , 1 - italic_y ) (y,1x)𝑦1𝑥(y,1-x)( italic_y , 1 - italic_x )
(1x,y)1𝑥𝑦(1-x,y)( 1 - italic_x , italic_y ) (1y,x)1𝑦𝑥(1-y,x)( 1 - italic_y , italic_x )
(1x,1y)1𝑥1𝑦(1-x,1-y)( 1 - italic_x , 1 - italic_y ) (1y,1x)1𝑦1𝑥(1-y,1-x)( 1 - italic_y , 1 - italic_x )

4.1. Data Generation

We present three different benchmarks for comparison. For all scenarios, we look at TSP-100 problems. Firstly, we generate random uniform data on a [0,1]01[0,1][ 0 , 1 ] square and a fixed test set of 10,0001000010,00010 , 000 instances. This is done to show if the addition of other layers interferes with the base performance of the transformer model.

Refer to caption
Figure 5. World maps used for experiments

To generate realistic data, we sample instances from 3 different countries, available at (nat, [n. d.]) and shown in Figure 5. Namely, they are

  • USA13509 - 13,509 cities across the United States of America, each with a population ¿500

  • BM33708 - 33,708 cities across the country of Burma

  • JA9847 - 9,847 cities across the country of Japan

Each country is first normalized to a [0,1]01[0,1][ 0 , 1 ] square. Then, at every training epoch, we randomly sample problems of size 100 from the map. Naturally, clusters that are denser across the country will be sampled more frequently. A test set size of 10,0001000010,00010 , 000 samples is also drawn and held aside for evaluation. Each test set is fully solved via Concorde (Applegate, 2003) to get the optimal length of each tour. We define 1 epoch to be 100,000100000100,000100 , 000 samples, and the models are trained for 200200200200 epochs to prevent overfitting. In totality, the model sees 20,000,0002000000020,000,00020 , 000 , 000 different samples. Thirdly, we define a limited data setting. This mimics a typical practical problem where the company does not have unlimited access to data. We first define a small dataset of 50,0005000050,00050 , 000 samples for such a setting. Based on the experiment size, we sample the necessary amount of data. Likewise, the models are tested on the same test set of 10,0001000010,00010 , 000 samples.

Additionally, to show the generality of our approach beyond the logistics domain, we include the PCB3038 dataset from TSPLib, where the goal is to find the shortest path across a circuit board layout. Here, we can view the problem as such: from all possible holes the board can have, and given a subset of these holes, what is the optimal path of traversal? Solving such problems with high degrees of accuracy leads to cost savings in manufacturing and improved yields.

Table 2. Model performance on 10,000 generated samples from the uniform random distribution. Best model in bold.
Model Tour Length Opt. Gap (%) Aug. Tour Length Aug. Opt Gap (%)
Concorde 7.7649 - 7.7649 -
POMO 7.8824 1.5134% 7.8114 0.5997%
Sym-NCO 7.9106 1.8772% 7.8148 0.6425%
ELG 7.8223 0.7393% 7.7861 0.2734%
Ours 7.8145 0.6397% 7.7809 0.2063%
Table 3. Performance of various models on realistic TSP100 instances from 3 different countries. Best models are in bold.
Dataset Model Tour Length Opt. Gap (%) Aug. Tour Length Aug. Opt. Gap (%) Inference Time (10k samples)
Concorde 5.6209 - 5.6209 - 56 min 37 sec
POMO 5.6958 1.3334% 5.6922 1.2677% 2 min 46 sec
USA13509 Sym-NCO 5.7022 1.4650% 5.6604 0.7219% 2 min 49 sec
ELG 5.6660 0.8022% 5.6641 0.7691% 2 min 58 sec
Ours 5.6548 0.6024% 5.6533 0.5762% 3 min 03 sec
Concorde 3.5341 - 3.5341 - 52 min 55 sec
POMO 3.5621 0.7926% 3.5620 0.7893% 2 min 38 sec
JA9857 Sym-NCO 3.5670 0.9315% 3.5497 0.4421% 2 min 40 sec
ELG 3.5576 0.6659% 3.5574 0.6596% 2 min 49 sec
Ours 3.5438 0.2741% 3.5435 0.2670% 2 min 51 sec
Concorde 5.0019 - 5.0019 - 59 min 22 sec
POMO 5.0823 1.6060% 5.0746 1.4540% 2 min 55 sec
BM33708 Sym-NCO 5.0561 1.0828% 5.0354 0.6683% 2 min 58 sec
ELG 5.0587 1.1343% 5.0528 1.0162% 3 min 02 sec
Ours 5.0383 0.7261% 5.0328 0.6166% 3 min 06 sec
Table 4. Ablation study on the USA13509 map
Dataset Model Tour Length Opt. Gap (%) Aug. Tour Length Aug. Opt. Gap (%)
POMO only 5.6958 1.3334% 5.6922 1.2677%
POMO + Choice 5.6676 0.8300% 5.6659 0.7997%
USA13509 POMO + Choice Free 5.7381 2.1038% 5.7311 1.9791%
POMO + Choice + Average Tracking 5.6648 0.7807% 5.6638 0.7633%
POMO + Choice + Soft Clustering Tracking 5.6548 0.6024% 5.6533 0.5762%

4.2. Benchmark Models

We compare our approach to the following constructive neural solvers: POMO (Kwon et al., 2020) the classical transformer that forms the basis for many follow-up works, Sym-NCO (Kim et al., 2022) a follow-up work from POMO that improves neural solvers by exploiting problem symmetry, and ELG (Gao et al., 2023) a recent work that also focuses on locality by defining a separate local policy based on its k-Nearest Neighbors. All models are trained based on the POMO shared baseline. It should be noted that in ELG, the authors introduced a different training algorithm. Since we wish to compare the efficacy of the architectural contributions, we train the ELG model using POMO. We reimplement all models and ran on the proposed dataset for our experimental results.

4.3. Hyperparameters

Since the neural models all share the same underlying backbone POMO transformer model, we retain the same set of hyperparameters across them to ensure the contributions are purely architectural. We utilize 6 layers of the transformer encoder and 2 layers of the transformer decoder. All models are trained for 200 epochs, with 100,000 episodes per epoch and a 1e41superscript𝑒41e^{-4}1 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT learning rate using the Adam optimizer (Kingma and Ba, 2014). Gradient clipping is set to [10,10]1010[-10,10][ - 10 , 10 ] for all models. For ELG, we used their recommended 50-nearest neighbours. As for our approach, we set Nc=5subscript𝑁𝑐5N_{c}=5italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 5 embeddings and B=5𝐵5B=5italic_B = 5 iterations for the hierarchical approach by using a validation set of 1,000 samples for verification. Additional details for hyperparameters can be found in Appendix A.1.

4.4. Performance Metric

All models are measured by the optimality gap, the percentage gap between the neural model’s tour length and the optimal tour length. This is given by

(24) O=(1NiNRi1NiNLi1)100𝑂1𝑁superscriptsubscript𝑖𝑁subscript𝑅𝑖1𝑁superscriptsubscript𝑖𝑁subscript𝐿𝑖1100O=(\frac{\frac{1}{N}\sum_{i}^{N}R_{i}}{\frac{1}{N}\sum_{i}^{N}L_{i}}-1)*100italic_O = ( divide start_ARG divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG - 1 ) ∗ 100

where Lisubscript𝐿𝑖L_{i}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the tour length of test instance i𝑖iitalic_i computed by Concorde. Also, we perform instance augmentation, just like in POMO, which involves various translations and reflections across the x𝑥xitalic_x and y𝑦yitalic_y axes (Kwon et al., 2020), as shown in Table 1.

5. Results

Refer to caption
Figure 6. Optimality gaps across various dataset sizes (lower is better). The x-axis dictates the model types, and the y-axis denotes the optimality gap. The first row showcases standard data input, and the second shows data after augmentation. Countries are USA, Japan, and Burma, from left to right. Numerical details can be found in Appendix A.2.

5.1. Performance on Uniform Random Distribution

Table 2 highlights the overall performance of the models on the classic uniform random sampling dataset. Most models have solid performance, with augmentation playing a prominent role in boosting the predictive power. Our model and ELG also show the importance of having some form of local feature to improve the decision-making process. The addition of our unvisited city tracking via soft clustering provides additional boost to the model’s predictive power since we can identify locations on the square.

5.2. Performance on Structured Distributions

Table 3 showcases the different model’s performance on TSP100 instances drawn from various countries. Our model has a clear advantage in the USA and Japan, with a narrow margin in Burma. Interestingly, we also see that augmentation has minimal effect on increasing the solver’s performance, except for the case of Sym-NCO. It is likely because the maps are no longer symmetric in nature, and simple transformations do not improve the chances of finding a very different route. For Sym-NCO, we see that since it’s specifically trained to exploit augmentations by considering all symmetries during training, its strong performance only appears when it can perform those transformations.

5.3. Performance on Varied Sizes

Figure 6 compares the model’s performance between the complete set of training, 50,000 fixed samples, and 10,000 fixed samples. We can see that the model performance degrades as expected once data is limited. However, the ranking of the models is still similar at the 50,000 sample mark. Interestingly, reducing the dataset further to 10,000 samples sees ELG becoming the top neural constructive solver. We attribute ELG’s strong performance on limited data to its local policy scheme. ELG uses well-crafted local features in the form of polar coordinates to create their local policy. The direct use of these features allows their local policy to learn valuable features to alter the action probabilities easily with less data. It should be noted that our choice hypernetwork is possibly a larger function class that also encompasses this approach. Hence, our model can learn a better overall function when more data is present.

5.4. Performance on PCB3038

Table 5 showcases the performance of the models on the PCB3038 TSPLib dataset. Evidently, the dataset is no longer as symmetric as before, since Sym-NCO struggles to greatly improve upon POMO. Instead, some local features are important to have, as displayed by ELG and our model’s improvement over POMO. Since our approach is a superset function of ELG, it can learn a better-performing solver, significantly improving upon POMO.

5.5. Evolution of Embeddings

Soft clustering the nodes in the embedding space is a crucial part of our overall approach. Therefore, we evaluate if our algorithm indeed is able to cluster the nodes into a set of meaningful clusters as the training progresses. Specifically, we conduct a 2D t-SNE plot of the node and cluster embeddings during the training process. Figure 7 evidently shows that as the training progresses, the cluster centroid embeddings are better able to separate the node embeddings in the latent space. This plausibly leads to a better representation of the unvisited embeddings and hence the problem space, allowing the model to identify the groups of nodes so as to select similar ones first.

Table 5. Model performance on 10,000 on the PCB3038 dataset to show efficacy of models on problems from domains beyond logistics.
Dataset Model Tour Length Opt. Gap (%) Aug. Tour Length Aug. Opt. Gap (%)
Concorde 7.5866 - 7.5866 -
POMO 7.7952 2.7494% 7.6984 1.4743%
PCB3038 Sym-NCO 7.7702 2.4201% 7.6776 1.1996%
ELG 7.6882 1.3393% 7.6387 0.6876%
Ours 7.6417 0.7236% 7.6074 0.2746%
Refer to caption
Figure 7. 2D t-SNE plot of cluster centroids (in blue) and a set of node embeddings (in orange) as the training progresses. Over time, the centroids surround and segregate the embeddings better.

5.6. Ablation Studies

As shown in Table 4, we perform ablation studies for our network and consider three different variations. Firstly, POMO + Choice adds only the local choice layer based on the hypernetwork. Secondly, POMO + Choice Free replaces the local choice hypernetwork layer with free parameters, meaning that the weights are no longer conditioned on the current node’s embedding but rather allowed to learn freely. Thirdly, POMO + Choice + Average Tracking removes the soft clustering layer, instead averaging all unvisited nodes into a single embedding. From the table, we first see that the local choice layer is essential. Additionally, if we were to remove the ability to condition on the current node’s position instead, the model would perform exceptionally poorly - even worse than the original transformer network. Finally, if we adopt the simplistic approach of averaging all unvisited cities, the model performance also suffers, showcasing the importance of our soft clustering layer.

6. Conclusion

In this work, we propose a more realistic approach to representing and generating Traveling Salesman Problems (TSPs) in real-world contexts. Our investigation reveals that previous state-of-the-art neural constructive solvers do not fully exploit the problem’s intricacies to enhance predictive capability. To address this gap, we present a dual strategy to deal with the problem from two fronts. Firstly, we emphasize the importance of considering current agent positions, leading us to introduce a hypernetwork, which enables dynamic fine-tuning to the decision-making process based on the agent’s current node position. Secondly, we recognize that realistic TSP scenarios are often structured, and therefore, improving solutions in such scenarios necessitates a deeper understanding of the structure of the set of unvisited nodes. Instead of treating all unvisited nodes uniformly, we propose a soft clustering algorithm inspired by the EM algorithm. This approach enhances the neural solver’s performance by grouping nodes based on similarities, thereby increasing the likelihood of selecting nodes from the same cluster for early resolution. We illustrate the effectiveness of these methods across diverse geographical structures. Importantly, our methods are complementary and can be integrated with existing models like ELG or Sym-NCO to create more robust solutions.

Acknowledgements.
This work was funded by the Grab-NUS AI Lab, a joint collaboration between GrabTaxi Holdings Pte. Ltd. and National University of Singapore, and the Industrial Postgraduate Program (Grant: S18-1198-IPP-II) funded by the Economic Development Board of Singapore. This research is also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-RP-2022-031).

References

  • (1)
  • nat ([n. d.]) [n. d.]. https://www.math.uwaterloo.ca/tsp/world/countries.html
  • Applegate (2003) David Applegate. 2003. Concorde: A code for solving traveling salesman problems. http://www. tsp. gatech. edu/concorde. html (2003).
  • Bello et al. (2016) Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016).
  • Caserta and Voß (2014) Marco Caserta and Stefan Voß. 2014. A hybrid algorithm for the DNA sequencing problem. Discrete Applied Mathematics 163 (2014), 87–99.
  • Chaslot et al. (2008) Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. 2008. Monte-carlo tree search: A new framework for game ai. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Vol. 4. 216–217.
  • Choo et al. (2022) Jinho Choo, Yeong-Dae Kwon, Jihoon Kim, Jeongwoo Jae, André Hottung, Kevin Tierney, and Youngjune Gwon. 2022. Simulation-guided beam search for neural combinatorial optimization. Advances in Neural Information Processing Systems 35 (2022), 8760–8772.
  • d O Costa et al. (2020) Paulo R d O Costa, Jason Rhuggenaath, Yingqian Zhang, and Alp Akcay. 2020. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In Asian conference on machine learning. PMLR, 465–480.
  • Dong et al. (2024) Yanfei Dong, Mohammed Haroon Dupty, Lambert Deng, Zhuanghua Liu, Yong Liang Goh, and Wee Sun Lee. 2024. Differentiable Cluster Graph Neural Network. arXiv:2405.16185 [cs.LG]
  • Fu et al. (2021) Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. 2021. Generalize a small pre-trained model to arbitrarily large TSP instances. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 7474–7482.
  • Gao et al. (2023) Chengrui Gao, Haopu Shang, Ke Xue, Dong Li, and Chao Qian. 2023. Towards generalizable neural solvers for vehicle routing problems via ensemble with transferrable local policy. arXiv preprint arXiv:2308.14104 (2023).
  • Ha et al. (2016) David Ha, Andrew Dai, and Quoc V. Le. 2016. HyperNetworks. arXiv:1609.09106 [cs.LG]
  • Hottung et al. (2021) André Hottung, Yeong-Dae Kwon, and Kevin Tierney. 2021. Efficient active search for combinatorial optimization problems. arXiv preprint arXiv:2106.05126 (2021).
  • Jin et al. (2023) Yan Jin, Yuandong Ding, Xuanhao Pan, Kun He, Li Zhao, Tao Qin, Lei Song, and Jiang Bian. 2023. Pointerformer: Deep Reinforced Multi-Pointer Transformer for the Traveling Salesman Problem. arXiv preprint arXiv:2304.09407 (2023).
  • Joshi et al. (2019) Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. 2019. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227 (2019).
  • Kim et al. (2022) Minsu Kim, Junyoung Park, and Jinkyoo Park. 2022. Sym-nco: Leveraging symmetricity for neural combinatorial optimization. Advances in Neural Information Processing Systems 35 (2022), 1936–1949.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kirkpatrick and Toulouse (1985) Scott Kirkpatrick and Gérard Toulouse. 1985. Configuration space analysis of travelling salesman problems. Journal de Physique 46, 8 (1985), 1277–1292.
  • Kool et al. (2022) Wouter Kool, Herke van Hoof, Joaquim Gromicho, and Max Welling. 2022. Deep policy dynamic programming for vehicle routing problems. In International conference on integration of constraint programming, artificial intelligence, and operations research. Springer, 190–213.
  • Kool et al. (2018) Wouter Kool, Herke Van Hoof, and Max Welling. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 (2018).
  • Kumar and Luo (2003) Ratnesh Kumar and Zhonghui Luo. 2003. Optimizing the operation sequence of a chip placement machine using TSP model. IEEE Transactions on Electronics Packaging Manufacturing 26, 1 (2003), 14–21.
  • Kwon et al. (2020) Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Youngjune Gwon, and Seungjai Min. 2020. Pomo: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 21188–21198.
  • Ma et al. (2023) Yining Ma, Zhiguang Cao, and Yeow Meng Chee. 2023. Learning to Search Feasible and Infeasible Regions of Routing Problems with Flexible Neural k-Opt. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Ma et al. (2022) Yining Ma, Jingwen Li, Zhiguang Cao, Wen Song, Hongliang Guo, Yuejiao Gong, and Yeow Meng Chee. 2022. Efficient Neural Neighborhood Search for Pickup and Delivery Problems. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. 4776–4784.
  • Ma et al. (2021) Yining Ma, Jingwen Li, Zhiguang Cao, Wen Song, Le Zhang, Zhenghua Chen, and Jing Tang. 2021. Learning to iteratively solve routing problems with dual-aspect collaborative transformer. Advances in Neural Information Processing Systems 34 (2021), 11096–11107.
  • Moon (1996) Todd K Moon. 1996. The expectation-maximization algorithm. IEEE Signal processing magazine 13, 6 (1996), 47–60.
  • Nazari et al. (2018) Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. 2018. Reinforcement learning for solving the vehicle routing problem. Advances in neural information processing systems 31 (2018).
  • Reinelt ([n. d.]) Gerhard Reinelt. [n. d.]. http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/
  • Sun and Yang (2023) Zhiqing Sun and Yiming Yang. 2023. Difusco: Graph-based diffusion solvers for combinatorial optimization. arXiv preprint arXiv:2302.08224 (2023).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014).
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Transformer dissection: a unified understanding of transformer’s attention via the lens of kernel. arXiv preprint arXiv:1908.11775 (2019).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. Advances in neural information processing systems 28 (2015).
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (1992), 229–256.
  • Wu et al. (2021) Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang, and Andrew Lim. 2021. Learning improvement heuristics for solving routing problems. IEEE transactions on neural networks and learning systems 33, 9 (2021), 5057–5069.
  • Ye et al. (2024) Haoran Ye, Jiarui Wang, Helan Liang, Zhiguang Cao, Yong Li, and Fanzhang Li. 2024. Glop: Learning global partition and local construction for solving large-scale routing problems in real-time. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 20284–20292.
  • Zhang et al. (2023) Cong Zhang, Yaoxin Wu, Yining Ma, Wen Song, Zhang Le, Zhiguang Cao, and Jie Zhang. 2023. A review on learning to solve combinatorial optimisation problems in manufacturing. IET Collaborative Intelligent Manufacturing 5, 1 (2023), e12072.

Appendix A Appendix

A.1. Hyperparameters

As described in the main body, we retain all original settings of the transformer. This yields the following:

  • 6 encoder layers

  • 2 decoder layers

  • We removed global graph embedding as it was found to harm performance

  • Clipping value U𝑈Uitalic_U is set to 10.010.010.010.0

  • Batch size for training is 128128128128 samples

  • A total of 200 epochs were run, where each epoch consisted of 100,000100000100,000100 , 000 episodes

  • Adam optimizer was used with a weight decay of 1e61𝑒61e-61 italic_e - 6

  • Learning rate of 0.00010.00010.00010.0001 was used

  • Unique to our model:

    • The soft clustering layer is applied B=5𝐵5B=5italic_B = 5 times

    • We use a total of Nc=5subscript𝑁𝑐5N_{c}=5italic_N start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = 5 embeddings for the clustering

A.2. Model performance across varying dataset size

Table 6 showcases in detail the bar plot results from Figure 6. Here, we compare the model’s performance across three different sizes: a full training dataset of 20,000,000 samples, a small dataset of 50,000 samples repeated across 200 epochs, and an extremely small dataset of 10,000 samples repeated across 200 epochs. In totality, we can see that even though we reduced the dataset size significantly, the performance at 50,000 samples is still remarkable. The models retain the ranking in performance in this scenario. However, reducing this dataset further sees ELG become the top performing model. This is likely because ELG leverages distinct features in the form of polar coordinates in a small k-Nearest Neighborhood. By having these explicit features, the model requires less data to translate it to a meaningful representation. Whereas for our model, we learn the importance of the pairings entirely in the latent space, requiring more data. It should also be noted that our representation of locality is likely a superset of ELG’s approach.

Table 6. Optimality gaps across various countries and dataset sizes. Best performing models in bold.
Dataset # of Samples Model Tour Length Opt. Gap (%) Aug. Tour Length Aug. Opt. Gap (%)
Concorde 5.6209 - 5.6209 -
POMO 5.7190 1.7458% 5.7166 1.7024%
USA13509 50,000 Sym-NCO 5.7345 2.0394% 5.6912 1.2691%
ELG 5.6848 1.1361% 5.6824 1.0936%
Ours 5.6769 0.9954% 5.6750 0.9617%
Concorde 5.6209 - 5.6209 -
POMO 5.8875 4.7433% 5.8848 4.6950%
USA13509 10,000 Sym-NCO 5.8598 4.2695% 5.8125 3.4282%
ELG 5.7849 2.9172% 5.7771 2.7782%
Ours 5.8152 3.4575% 5.8079 3.3265%
Concorde 3.5341 - 3.5341 -
POMO 3.5699 1.0141% 3.5697 1.0073%
JA9847 50,000 Sym-NCO 3.5724 1.0837% 3.5622 0.7960%
ELG 3.5508 0.4723% 3.5504 0.4627%
Ours 3.5506 0.4667% 3.5504 0.4629%
Concorde 3.5341 - 3.5341 -
POMO 3.6137 2.2514% 3.6136 2.2494%
JA9847 10,000 Sym-NCO 3.6037 1.9695% 3.5841 1.4154%
ELG 3.5868 1.4917% 3.5863 1.4779%
Ours 3.5943 1.7037% 3.5942 1.7016%
Concorde 5.0019 - 5.0019 -
POMO 5.1228 2.4158% 5.1164 2.2883%
BM33708 50,000 Sym-NCO 5.0878 1.7164% 5.0532 1.0242%
ELG 5.0648 1.2570% 5.0572 1.1055%
Ours 5.0566 1.0925% 5.0520 0.9833%
Concorde 5.0019 - 5.0019 -
POMO 5.1732 3.4248% 5.1659 3.2784%
BM33708 10,000 Sym-NCO 5.1666 3.2914% 5.1307 2.5740%
ELG 5.1173 2.3064% 5.1080 2.2109%
Ours 5.5157 2.9934% 5.1440 2.8393%

A.3. Comparison with General Solvers

Table 7. Comparison of GLOP and our model on the various datasets.
Dataset Model Aug. Tour Length Aug. Opt. Gap (%) Run-time
Concorde 5.6209 -
USA13509 GLOP 5.6704 0.8799% 6 min 07 sec
Ours 5.6533 0.5762% 3 min 03 sec
Concorde 3.5341 -
JA9847 GLOP 3.5839 1.4094% 6 min 04 sec
Ours 3.5435 0.2670% 3 min 01 sec
Concorde 5.0019 -
BM33408 GLOP 5.0510 0.9822% 6 min 07 sec
Ours 5.0328 0.6166% 3 min 04 sec

While our approach biases the solver towards the distribution, recent works such as GLOP (Ye et al., 2024) proposed generic solvers that utilize the attention model as a Hamiltonian path solver. Table 7 compares our neural solver against GLOP. As GLOP’s main premise is to break down a large problem and solve multiple sub-paths using local solvers trained on smaller sizes. Since our test results are on TSP100, we utilize the TSP25 and TSP50 solvers. Overall, we can see that our approach is stronger than a generically trained solver in both speed and performance. Nevertheless, since GLOP is based on the transformer model, our contributions can easily be integrated.