LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model

Duy M. H. Nguyen1,2,3, Nghiem T. Diep ∗3, Trung Q. Nguyen∗3,4, Hoang-Bao Le3, Tai Nguyen3,
Tien Nguyen5,6, TrungTin Nguyen8, Nhat Ho9, Pengtao Xie10, Roger Wattenhofer11,
James Zhou12, Daniel Sonntag† 3,7, Mathias Niepert† 1,2
1 Max Planck Research School for Intelligent Systems (IMPRS-IS), 2 University of Stuttgart,
3 German Research Centre for Artificial Intelligence (DFKI), 4 Technical University of Munich,
5 University Medical Center Göttingen, 6 Max Planck Institute for Multidisciplinary Sciences,
7 Oldenburg University, 8 University of Queensland, 9 University of Texas at Austin,
10 University of California San Diego, 11 ETH Zurich, 12 Stanford University.
Abstract

State-of-the-art medical multi-modal large language models (med-MLLM), such as LLaVA-Med or BioMedGPT, leverage instruction-following data in their pre-training stages. However, those models primarily focus on scaling the model size and data volume to boost performance while mainly relying on the autoregressive learning objectives. Surprisingly, we reveal that such learning schemes might result in a weak alignment between vision and language modalities, making these models highly reliant on extensive pre-training datasets, a significant challenge in medical domains due to the expensive and time-consuming nature of curating high-quality instruction-following instances. We address this challenge with a new multi-graph alignment algorithm, namely LoGra-Med, which enforces triplet correlations on the latent embedding space among image modalities, conversation-based descriptions, and extended contextual captions. Owing to this technique, the model is encouraged to capture the semantic meaning of the context, handle linguistic variability where the captions or questions may differ from training instances, and learn cross-modal associations, linking visual elements with various textual interpretations. To scale our algorithm to the med-MLLM setting, we also design an efficient end-to-end learning scheme based on advanced black-box gradient-estimation techniques that permit fast forward and backward steps through the LLM (LLaMa 7B). Empirical results show that we can match the performance of LLaVA-Med pre-trained on 600K image-text pairs from PMC-15M for Medical VQA tasks and significantly outperform it when trained on only 10% of the data. For instance, on VQA-RAD, we exceed LLaVA-Med (both trained on 10%) by 20.13% and achieve near parity with the 100% pre-training setting (72.52% vs. 72.64%).
Additionally, we also surpass other SOTA pre-training methods and med-MLLM such as BiomedGPT on visual chatbot or RadFM on zero-shot image classification with VQA, showcasing the power of multi-graph alignment in improving vision-language integration for medical-MLLM.

∗ Co-second contribution, † Co-senior authors.

1 Introduction

Generic multi-modal large language models (MLLM) are an emerging class of models that integrate processing and generation across text, images, and audio. Models like GPT-4V (Achiam et al., 2023), LLaVa (Liu et al., 2024b), and Next-GPT (Wu et al., 2023b) handle tasks from image captioning to complex visual reasoning. A critical component in training these models is instruction-following (IF) data (Lou et al., 2023), which involves complex, multi-turn interactions (Sun et al., 2024) where the model is expected to respond to specific instructions or questions about the image. In the medical domain, current efforts have focused on building medical MLLM by curating specialized IF datasets encompassing medical images, clinical notes, and diagnostic criteria (Xie et al., 2024). These datasets are used to adapt general-purpose MLLM, aiming to transfer foundational knowledge of generic concepts and reduce the computation costs required in training. For instance, in LLaVA-Med (Li et al., 2024), 600K biomedical image-text pairs are sampled from PMC-15M (Zhang et al., 2023c) and GPT-4 is used to create instruction data from the text inputs, resulting in approximately 60K multi-modal IF data points. In the next step, (i) vision encoders and language decoders are taken from LLaVa and are first aligned through trainable projection layers before (ii) the models (with the exception of the vision encoders) are trained together on the collected medical IF data. Both steps are called the pre-training phase, where the autoregressive function is used as the primary objective. The model can then be fine-tuned on various downstream tasks.

Figure 1: Illustration of the data-hungry behavior of auto-regressive modeling in LLaVA-Med when varying pre-training IF data size. Models are fine-tuned and performance is reported on VQA-RAD.

Following the above approach, most later works have focused on scaling up the amount of medical IF data (Xie et al., 2024; Zhang et al., 2023a; He et al., 2024) or increasing the model size by incorporating larger vision encoders or language decoders (Wu et al., 2023a; Jiang et al., 2024) while relying on the same standard autoregressive learning scheme. Contrary to this, we question the effectiveness of autoregressive objective functions when learning medical-MLLM with IF data. Surprisingly, our findings reveal that autoregressive learning is highly data-hungry during pre-training, i.e., without sufficient medical IF samples, model performance plummets on downstream tasks, even after fine-tuning. To illustrate this, we pre-trained LLaVA-Med using only 10% of the data and compared it to the version trained on 100%. Both models were fine-tuned on two medical visual question-answering tasks, VQA-RAD (Lau et al., 2018) and PathVQA (He et al., 2020), and their average performance on open- and close-ended questions was compared. The results show a dramatic decline: from 72.64% to 52.39% on VQA-RAD and from 64.06% to 56.15% on PathVQA (Figure 1). This underscores the instability of medical-MLLM trained with autoregressive methods and highlights the problem that these methods require the curation of enough medical IF data to achieve satisfactory performance.

To address this challenge, we present a novel multi-graph alignment algorithm, namely LoGra-Med, that improves the model’s ability to learn complex interactions between vision and language modalities, mitigating the limitations of autoregressive functions when trained on limited instruction-following data. Specifically, given pairs of input images with instruction data, we use GPT-4 (Achiam et al., 2023) to form a longer version of the instruction, with more in-context explanations for concepts and correlations among entities while preserving the same meaning. These data pairs are fed into the MLLM, where a vision encoder extracts features for the image while the language model (LLaMa) computes latent embeddings of the instruction data and its extended version. The embedding vectors obtained from different IF samples in a batch are then used to construct three graphs. The first graph’s vertices are the image features, while the vertices of the two other graphs are the embeddings of the instructions and their extended versions. We subsequently learn feature representations by solving the combinatorial multi-graph alignment problem between these three graphs, so that, given these embeddings as input, the alignment outputs the triplet coupling among the image, its instruction data, and the longer extension. By jointly learning this alignment with autoregressive methods, we aim to preserve the LLMs’ sequential generation capabilities while enhancing the model’s ability to capture deeper semantic relationships. This approach goes beyond memorization, increases resilience to linguistic variability, and significantly improves cross-modal coherence.

Our method differentiates itself from existing multi-modal alignment techniques for LLM (Park et al., 2024; Li et al., 2023a; Chen et al., 2023a) in two ways. First, while prior contrastive objectives primarily focus on learning projections to connect frozen vision encoders with frozen language models, our algorithm directly trains LLM using the multi-graph framework. Second, we unify and generalize pairwise contrastive learning, which focuses on individual image-caption pairs, by incorporating global constraints through graph structures. This allows us to integrate both feature and structural consistencies using graph edges, enhancing robustness for similar entities (whether images or descriptions) commonly found in medical datasets. Although combinatorial graph alignment is generally non-differentiable and solving multi-graph alignment is computationally expensive, we overcome these challenges by leveraging modern implicit maximum likelihood estimation techniques (Niepert et al., 2021; Minervini et al., 2023). This enables efficient gradient estimation and allows for fast forward and backward propagation through large LLMs (e.g., LLaMa-7B), using a barycenter graph (Agueh & Carlier, 2011) for alignment. As a result, the model can scale effectively with extensive datasets on large LLMs while maintaining alignment performance.

In summary, we make the following key contributions:

  • We reveal the data-demanding nature of autoregressive modeling in pre-training medical-MLLM (LLaVa-Med), showing that insufficient instruction-following data leads to significant performance drops on downstream tasks, even after fine-tuning.

  • We introduce a novel multi-graph alignment objective that establishes triplet constraints among images, their instruction-following context, and their enriched versions. Additionally, we developed an efficient solver for training with LLMs and outlined theoretical properties related to distance and the shortest path in the geodesic space of multi-modal graphs.

  • We demonstrate that using a small amount of pre-training data can achieve performance comparable to LLaVa-Med trained on 100% of the data. Additionally, when trained on larger datasets, LoGra-Med outperforms several state-of-the-art medical MLLMs and advanced multi-modal pre-training algorithms across three Medical VQA tasks, medical visual chat, and the average zero-shot image classification performance on 23 datasets.

2 Related Work

Medical Multi-modal LLMs. Recent developments in medical-MLLM like Biomed-GPT (Zhang et al., 2023a), MedFlamingo (Moor et al., 2023), Med-Dr (He et al., 2024), LLAVA-Med (Li et al., 2024), and Med-PaLMs (Singhal et al., 2023; Tu et al., 2024) are transforming healthcare by integrating diverse data types and scaling medical instruction data. Biomed-GPT excels with multiple biomedical modalities, MedFlamingo focuses on few-shot learning for medical visual question answering, and LLAVA-Med leverages large-scale biomedical image-text pairs for improved performance. Commonly, these models emphasize scaling medical instruction data and increasing model parameters to enhance accuracy and applicability in real-world medical scenarios. In contrast, our approach examines the widely used autoregressive pre-training algorithms and demonstrates that incorporating long-context multi-graph alignment of existing instruction samples can significantly enhance medical-MLLM performance without requiring larger models or extensive datasets.

Visual Instruction Tuning. Visual instruction tuning techniques aim to bridge the gap between frozen vision-language models and frozen LLMs trained on unimodal data, enabling them to work effectively in a multi-modal context. These methods involve (i) learning a multi-layer perceptron (MLP) layer to map embeddings from the vision model to the language model as LLaVa (Liu et al., 2024b), VideoLLM (Chen et al., 2023a); (ii) using adapter-based adjustment as LLaMa-adapter (Zhang et al., 2024), Voxposer (Huang et al., 2023), or (iii) learning multi-modal perceiver by gated cross-attention (Alayrac et al., 2022) or Q-Former as in BLIP-2 (Li et al., 2023a). Pre-training algorithms to train these models can be combined with both auto-regressive and contrastive learning (Park et al., 2024; Zhai et al., 2023) or image-text matching as in (Li et al., 2022; 2023a). Our algorithm differs from those by focusing on directly training LLMs rather than lightweight projectors. This requires a fast solver capable of efficiently handling forward and backward passes through large-scale LLMs with extensive parameters.

Vision-language Pretraining Algorithm. Pre-training algorithms commonly applied for vision-language models, like CLIP (Radford et al., 2021), follow various strategies. Generative approaches, such as masked prediction in language models (Devlin, 2018; Song et al., 2020), or autoregressive algorithms that predict sequential text in LLMs (Liu et al., 2024b; Zhang et al., 2024), are widely used. Another direction focuses on discriminative methods, which learn contrastive distances between image-text pairs (Liu et al., 2023; Zhai et al., 2023; Khan & Fu, 2023), optimal transport (Chen et al., 2022; Nguyen et al., 2024a), or impose clustering constraints (Park et al., 2024). Our function departs from these by generalizing them into a combinatorial graph-matching formulation across cross-domain graphs. While LVM-Med (MH Nguyen et al., 2024) is the most similar to our approach, it targets alignment within vision tasks, whereas we align images, instruction-following data, and extended contextual information.

Graph alignment across $K$ domains ($K \geq 3$) is highly computationally intensive. Current methods, such as multi-marginal optimal transport (Lin et al., 2022; Piran et al., 2024), Wasserstein barycenters (Nguyen et al., 2024b), and multi-adjacency matrix assumptions (Bernard et al., 2019; Swoboda et al., 2019), relax the problem but are limited to small-scale tasks and require multiple solver steps, making them inefficient for LLM training. In contrast, our approach utilizes heuristic solvers (Swoboda et al., 2017; Rolínek et al., 2020) and modern gradient estimation techniques for black-box optimization (Niepert et al., 2021; Minervini et al., 2023), enabling scalable and efficient performance for large language models.

3 Multi-graph Alignment Learning

We denote the vision encoder, projector, and large language model (LLM) by $f_{\theta}(\cdot)$, $h_{\phi}(\cdot)$, and $g_{\sigma}(\cdot)$, respectively. Figure 2 illustrates our LoGra-Med algorithm, which learns parameters for these models by solving a triplet alignment between modalities in instruction tuning data. Below, we summarize the notations used before describing each component in detail.

Figure 2: Overview of LoGra-Med: We train the large language model $g_{\sigma}$ and projector $h_{\phi}$ by aligning a triplet consisting of the input image, its instruction data, and its extended long-context version using a structure-aware multi-graph alignment between $\mathcal{G}_v$, $\mathcal{G}_a$, and $\mathcal{G}_{ae}$ through a barycenter graph $\mathcal{G}_{br}$.

Notation. Given any tensor $\bm{\mathsfit{T}} = (\mathsfit{T}_{i,j,k,l})$ and matrix $\bm{M} = (M_{k,l})$, we use $\bm{\mathsfit{T}} \otimes \bm{M}$ to denote the tensor-matrix multiplication, i.e., the matrix $\left(\sum_{k,l} \mathsfit{T}_{i,j,k,l} M_{k,l}\right)_{i,j}$. Given $\bm{Y} = [\bm{y}_1, \bm{y}_2, \ldots, \bm{y}_N] \in \mathbb{R}^{N \times d}$, we define $\mathbb{E}(\bm{Y}) = \frac{1}{N}\sum_{i=1}^{N} \bm{y}_i \in \mathbb{R}^{d}$. Moreover, we define the matrix scalar (or inner) product associated with the Frobenius norm between two matrices $\bm{M} = (M_{i,j})$ and $\bm{N} = (N_{i,j})$ as $\langle \cdot, \cdot \rangle$, i.e., $\langle \bm{M}, \bm{N} \rangle = \sum_{i,j} M_{i,j} N_{i,j}$. We write $[M] = \{1, 2, \ldots, M\}$ for any natural number $M$.
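
As a small illustration of this notation (not code from the paper), the following dependency-free Python sketch implements the tensor-matrix product, the row mean $\mathbb{E}(\cdot)$, the Frobenius inner product, and the index set $[M]$ on nested lists. All function names are hypothetical and chosen only for this example.

```python
def tensor_matrix_product(T, M):
    """Return the matrix (sum_{k,l} T[i][j][k][l] * M[k][l])_{i,j}."""
    return [
        [
            sum(
                T[i][j][k][l] * M[k][l]
                for k in range(len(M))
                for l in range(len(M[0]))
            )
            for j in range(len(T[0]))
        ]
        for i in range(len(T))
    ]


def row_mean(Y):
    """E(Y): average the N rows of an N x d matrix into a length-d vector."""
    n = len(Y)
    return [sum(row[c] for row in Y) / n for c in range(len(Y[0]))]


def frobenius_inner(M, N):
    """<M, N> = sum_{i,j} M[i][j] * N[i][j]."""
    return sum(m * n for row_m, row_n in zip(M, N) for m, n in zip(row_m, row_n))


def index_set(m):
    """[M] = {1, 2, ..., M} for a natural number M."""
    return list(range(1, m + 1))
```

In practice these operations would typically be expressed with an array library; plain lists are used here only to keep the example self-contained.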

3.1 Long-context enriched medical instruction-following data

Recent research has demonstrated that incorporating long-context significantly enhances LLMs’ ability to process complex inputs and improves instruction-following by retaining more relevant information (Liu et al., 2024c; An et al., 2024; Pawar et al., 2024). Building on this insight, we extend medical instruction-following data by generating long-context paraphrased versions of existing samples in the dataset, providing an additional enriched perspective of the original data.

In particular, a typical instruction sample includes $\{\bm{X}_v, [\bm{X}_q^1, \bm{X}_a^1], \ldots, [\bm{X}_q^L, \bm{X}_a^L]\}$, where $\bm{X}_v$ is an input image, $\bm{X}_q^l$ a question, and $\bm{X}_a^l$ an answer at round $l$ of an $L$-round conversation. In the medical domain, most of the questions are generic, and the information in the answer usually covers the question, so we only focus on extending the answer $\bm{X}_a$. We leverage the GPT API with a prompt to form a longer context for each $\bm{X}_a^l$ by:

$$\bm{X}_{ae}^{l} = \mathrm{GPT}\left(\bm{X}_q^l, \bm{X}_a^l, \mathrm{prompt}\right), \quad \forall l \in [L]. \tag{1}$$

The details for $\mathrm{prompt}$ are presented in the Appendix. In short, we ask GPT to provide additional explanations for concepts that appear in the original answer $\bm{X}_a$ while keeping the content consistent. An example output for $\bm{X}_{ae}^l$ is illustrated in Figure 3.
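
To make the per-round structure of Eq. (1) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: `generate` is a hypothetical callable standing in for whatever client sends text to GPT, and `PLACEHOLDER_PROMPT` is an invented stand-in for the actual prompt, which is given only in the Appendix.

```python
from typing import Callable, List, Tuple

# Assumption: stand-in wording only; the real prompt is in the paper's Appendix.
PLACEHOLDER_PROMPT = (
    "Explain the concepts in the answer in more detail "
    "while keeping its content consistent."
)


def extend_answers(
    rounds: List[Tuple[str, str]],
    generate: Callable[[str], str],
    prompt: str = PLACEHOLDER_PROMPT,
) -> List[str]:
    """Apply Eq. (1) to each round l in [L]: X_ae^l = GPT(X_q^l, X_a^l, prompt).

    `rounds` holds the (question, answer) pairs of one conversation, and
    `generate` is a hypothetical wrapper around the language-model call.
    """
    extended = []
    for question, answer in rounds:
        request = f"{prompt}\n\nQuestion: {question}\nAnswer: {answer}"
        extended.append(generate(request))
    return extended
```

Passing the model call in as a function keeps the sketch independent of any particular API client and makes it easy to substitute a stub when testing.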

3.2 Multi-graph construction on vision-language embedding

For each image $\bm{X}_v \in \mathbb{R}^{3 \times H \times W}$, where $(H, W)$ are the original spatial dimensions, we divide it into a sequence of visual patches $\bm{U} = [u_i]_{i=1}^{N}$ with $N = (H \times W)/U$, where $U$ is the patch size. Using a pre-trained ViT model $f_{\theta}$, we extract patch-wise features as $\bm{V} = f_{\theta}(\bm{U}) \in \mathbb{R}^{N \times d_v}$ and apply a projector to map them into the projected embedding $\bm{Z} = h_{\phi}(\bm{V}) \in \mathbb{R}^{N \times d}$. We then pool the features from the image patches to define a global description as $\bm{Z}_v = \mathbb{E}(\bm{Z}) \in \mathbb{R}^{d}$.
For each language input $\bm{X}_c^l \in \{\bm{X}_a^l, \bm{X}_{ae}^l\}$ with $c \in \{a, ae\}$, we assume it has $M$ tokens, i.e., $\bm{X}_c^l = [\bm{x}_j]_{j=1}^{M} \in \mathbb{R}^{M}$, and feed it into the LLM to extract a set of embeddings $\bm{Z}_c^l = g_{\sigma}([\bm{x}_j]_{j=1}^{M}) = [\bm{e}_j]_{j=1}^{M} \in \mathbb{R}^{M \times d}$. We subsequently aggregate all $L$ rounds of each instruction-tuning sample to define $\bm{Z}_c = \frac{1}{L}\sum_{l=1}^{L} \mathbb{E}(\bm{Z}_c^l)$, which collects the average text embedding of the original answers ($c = a$) and of their longer-context extended versions ($c = ae$), respectively.
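
The two pooling steps, $\bm{Z}_v = \mathbb{E}(\bm{Z})$ for an image and $\bm{Z}_c = \frac{1}{L}\sum_{l} \mathbb{E}(\bm{Z}_c^l)$ for its text, can be sketched as follows. This is an illustrative, dependency-free example that treats the encoder outputs as already computed nested lists; the function names are hypothetical.

```python
def mean_rows(Z):
    """E(Z): average the rows of an (n x d) matrix into a length-d vector."""
    n = len(Z)
    return [sum(row[c] for row in Z) / n for c in range(len(Z[0]))]


def pool_image(Z_patches):
    """Z_v = E(Z): global image descriptor from N projected patch embeddings."""
    return mean_rows(Z_patches)


def pool_text(Z_rounds):
    """Z_c = (1/L) * sum_l E(Z_c^l): mean over rounds of per-round token means.

    `Z_rounds` is a list of L matrices; each matrix has one row per token,
    and rounds may contain different numbers of tokens.
    """
    per_round = [mean_rows(Z_l) for Z_l in Z_rounds]
    return mean_rows(per_round)
```

Note that each round contributes equally to $\bm{Z}_c$ regardless of its token count, because the token mean is taken per round before averaging over rounds.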

Figure 3: Illustration for creating the longer-context instruction-following data powered by GPT-4o.

Given a batch size of B𝐵Bitalic_B instruction-tuning samples, we now construct three graphs 𝒢v=(𝒱v,v)subscript𝒢𝑣subscript𝒱𝑣subscript𝑣{\mathcal{G}}_{v}=({\mathcal{V}}_{v},{\mathcal{E}}_{v})caligraphic_G start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = ( caligraphic_V start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ), 𝒢a=(𝒱a,a)subscript𝒢𝑎subscript𝒱𝑎subscript𝑎{\mathcal{G}}_{a}=({\mathcal{V}}_{a},{\mathcal{E}}_{a})caligraphic_G start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = ( caligraphic_V start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ), and 𝒢ae=(𝒱ae,ae)subscript𝒢𝑎𝑒subscript𝒱𝑎𝑒subscript𝑎𝑒{\mathcal{G}}_{ae}=({\mathcal{V}}_{ae},{\mathcal{E}}_{ae})caligraphic_G start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT = ( caligraphic_V start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT ) representing for visual image features, text embedding encoded by LLM for original answers and their long-context embedding extended by GPT. 
Specifically, for each triplet pair {𝑿v(k),[𝑿al](k),[𝑿ael](k)}k,(k[B])subscriptsuperscriptsubscript𝑿𝑣𝑘superscriptdelimited-[]superscriptsubscript𝑿𝑎𝑙𝑘superscriptdelimited-[]superscriptsubscript𝑿𝑎𝑒𝑙𝑘𝑘𝑘delimited-[]𝐵\{{\bm{X}}_{v}^{(k)},[{\bm{X}}_{a}^{l}]^{(k)},[{\bm{X}}_{ae}^{l}]^{(k)}\}_{k},% \,(k\in[B]){ bold_italic_X start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , [ bold_italic_X start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT , [ bold_italic_X start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ( italic_k ∈ [ italic_B ] ), we add a node representing 𝑿v(k)superscriptsubscript𝑿𝑣𝑘{\bm{X}}_{v}^{(k)}bold_italic_X start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT to 𝒱vsubscript𝒱𝑣{\mathcal{V}}_{v}caligraphic_V start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, a node for [𝑿al](k)superscriptdelimited-[]superscriptsubscript𝑿𝑎𝑙𝑘[{\bm{X}}_{a}^{l}]^{(k)}[ bold_italic_X start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT to 𝒱esubscript𝒱𝑒{\mathcal{V}}_{e}caligraphic_V start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT, and finally a node for [𝑿ael](k)superscriptdelimited-[]superscriptsubscript𝑿𝑎𝑒𝑙𝑘[{\bm{X}}_{ae}^{l}]^{(k)}[ bold_italic_X start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT to 𝒱aesubscript𝒱𝑎𝑒{\mathcal{V}}_{ae}caligraphic_V start_POSTSUBSCRIPT italic_a italic_e end_POSTSUBSCRIPT. 
This results in a set of nodes 𝒱v={𝑿v(1),,𝑿v(B)}subscript𝒱𝑣superscriptsubscript𝑿𝑣1superscriptsubscript𝑿𝑣𝐵{\mathcal{V}}_{v}=\{{\bm{X}}_{v}^{(1)},...,{\bm{X}}_{v}^{(B)}\}caligraphic_V start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = { bold_italic_X start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , bold_italic_X start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_B ) end_POSTSUPERSCRIPT }; 𝒱c={[𝑿cl](1),,[𝑿cl](B)}subscript𝒱𝑐superscriptdelimited-[]superscriptsubscript𝑿𝑐𝑙1superscriptdelimited-[]superscriptsubscript𝑿𝑐𝑙𝐵{\mathcal{V}}_{c}=\{[{\bm{X}}_{c}^{l}]^{(1)},...,[{\bm{X}}_{c}^{l}]^{(B)}\}caligraphic_V start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = { [ bold_italic_X start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , [ bold_italic_X start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT ( italic_B ) end_POSTSUPERSCRIPT } for each c{a,ae}𝑐𝑎𝑎𝑒c\in\{a,ae\}italic_c ∈ { italic_a , italic_a italic_e }. 
We equip node-level feature matrices for these graphs using their embedding computed above, i.e., 𝑭v={𝒁v(1),,𝒁v(B)}subscript𝑭𝑣superscriptsubscript𝒁𝑣1superscriptsubscript𝒁𝑣𝐵{\bm{F}}_{v}=\{{\bm{Z}}_{v}^{(1)},...,{\bm{Z}}_{v}^{(B)}\}bold_italic_F start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = { bold_italic_Z start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , bold_italic_Z start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_B ) end_POSTSUPERSCRIPT }, 𝑭c={𝒁c(1),,𝒁c(B)}subscript𝑭𝑐superscriptsubscript𝒁𝑐1superscriptsubscript𝒁𝑐𝐵{\bm{F}}_{c}=\{{\bm{Z}}_{c}^{(1)},...,{\bm{Z}}_{c}^{(B)}\}bold_italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = { bold_italic_Z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , bold_italic_Z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_B ) end_POSTSUPERSCRIPT }. The edges for v,csubscript𝑣subscript𝑐{\mathcal{E}}_{v},{\mathcal{E}}_{c}caligraphic_E start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , caligraphic_E start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT afterward can be created through the k-nearest neighbors algorithm given the feature node matrices 𝑭v,𝑭csubscript𝑭𝑣subscript𝑭𝑐{\bm{F}}_{v},\,{\bm{F}}_{c}bold_italic_F start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , bold_italic_F start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. Finally, we can run a message-passing network mα(.)m_{\alpha}(.)italic_m start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT ( . ) on three built graphs to learn richer node representations. 
This approach has proven effective for representation learning (Tang et al., 2022; Ju et al., 2024), resulting in aggregated feature-node matrices $\{\hat{\bm{Z}}_s^{(1)},\dots,\hat{\bm{Z}}_s^{(B)}\}=m_\alpha(\bm{F}_s,\mathcal{E}_s)$, with $s\in\{v,a,ae\}$.
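As a rough illustration of this construction, the following plain-Python sketch builds k-nearest-neighbor edges from a small batch of embeddings and applies one round of neighbor averaging. The function names `knn_edges` and `mean_message_passing`, the Euclidean distance, and the toy embeddings are assumptions made for this example; the mean aggregation is only a stand-in for the learned network $m_\alpha$, whose architecture is not specified in this section.

```python
import math


def knn_edges(features, k):
    """Return directed edges (i, j) linking each node i to its k nearest neighbors."""
    edges = set()
    for i, fi in enumerate(features):
        # Euclidean distance from node i to every other node in the batch.
        dists = [(math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i]
        for _, j in sorted(dists)[:k]:
            edges.add((i, j))
    return edges


def mean_message_passing(features, edges):
    """One round of mean aggregation: each node averages itself with its neighbors.

    This is a hypothetical stand-in for the learned message-passing network.
    """
    out = []
    for i, fi in enumerate(features):
        neighbors = [features[j] for (src, j) in edges if src == i]
        group = [fi] + neighbors
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


# Toy batch of B = 4 two-dimensional embeddings (made-up values).
F_v = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
E_v = knn_edges(F_v, k=1)
Z_hat_v = mean_message_passing(F_v, E_v)
print(sorted(E_v))  # [(0, 1), (1, 0), (2, 0), (3, 2)]
print(Z_hat_v)
```

The same two steps would be repeated for each $s\in\{v,a,ae\}$ to obtain the three graphs.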

3.3 Second-order graph alignment problem

We first provide background on second-order graph alignment between two arbitrary graphs $\mathcal{G}_1=(\mathcal{V}_1,\mathcal{E}_1)$ and $\mathcal{G}_2=(\mathcal{V}_2,\mathcal{E}_2)$. This problem, also known as the quadratic assignment problem, has been applied to several problems in vision and graphics to find correspondences between two graph structures that are consistent with both node features and structural constraints (Zanfir & Sminchisescu, 2018; Haller et al., 2022; Ehm et al., 2024).

We denote by $\bm{V}\in\{0,1\}^{|\mathcal{V}_1||\mathcal{V}_2|}$, with $|\mathcal{V}_1|=M$ and $|\mathcal{V}_2|=N$, the indicator matrix of matched vertices, that is, $V_{i,j}=1$ if a vertex $v_i\in\mathcal{V}_1$ is matched with $v_j\in\mathcal{V}_2$ and $V_{i,j}=0$ otherwise. That is, $\bm{V}$ is a binary matrix with exactly one non-zero entry in each row and column.
Similarly, we set $\bm{\mathsf{E}}\in\{0,1\}^{|\mathcal{E}_1||\mathcal{E}_2|}$ as the indicator tensor of matched edges, that is, $\mathsf{E}_{i,k,j,l}=1$ if $V_{i,j}=1$ and $V_{k,l}=1$, and $\mathsf{E}_{i,k,j,l}=0$ otherwise. This implies that the tensor $\bm{\mathsf{E}}$ is fully determined by the matrix $\bm{V}$, that is, $\mathsf{E}_{i,k,j,l}=V_{i,j}V_{k,l}$.
We also define the vertex affinity matrix and edge affinity tensor as $\bm{A}^v\in\mathbb{R}^{|\mathcal{V}_1||\mathcal{V}_2|}$ and $\bm{\mathsf{A}}^e\in\mathbb{R}^{|\mathcal{E}_1||\mathcal{E}_2|}$, respectively. The set $\mathcal{A}(\mathcal{G}_1,\mathcal{G}_2)$ contains all admissible pairs $(\bm{V},\bm{\mathsf{E}})$ that encode a valid matching between $\mathcal{G}_1$ and $\mathcal{G}_2$:

$$\mathcal{A}(\mathcal{G}_1,\mathcal{G}_2)=\left\{\bm{V}\in\{0,1\}^{M\times N}:\sum_{i=1}^{M}V_{i,j}=1,\ \sum_{j=1}^{N}V_{i,j}=1\right\}. \tag{2}$$

The second-order graph alignment (SoGA) problem is now defined as:

$$\text{SoGA}(\bm{A}^v,\bm{\mathsf{A}}^e)=\operatorname*{arg\,min}_{\bm{V}\in\mathcal{A}(\mathcal{G}_1,\mathcal{G}_2)}\langle\bm{A}^v+\bm{\mathsf{A}}^e\otimes\bm{V},\bm{V}\rangle=\operatorname*{arg\,min}_{\bm{V}\in\mathcal{A}(\mathcal{G}_1,\mathcal{G}_2)}\sum_{i,j}A^v_{i,j}V_{i,j}+\sum_{i,j,k,l}\mathsf{A}^e_{i,j,k,l}V_{i,j}V_{k,l}. \tag{3}$$
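To make the objective in Eq. (3) concrete, the sketch below evaluates it by exhaustive enumeration on a three-node toy problem. It assumes $M=N$, stores the edge affinity as a nested list indexed `A_e[i][j][k][l]`, and fills it with absolute differences of pairwise distances; the names `soga_objective` and `soga_bruteforce` and all numeric values are hypothetical. Enumeration is only feasible for very small graphs and is not the solver used in this work.

```python
from itertools import permutations


def soga_objective(A_v, A_e, perm):
    """Value of the Eq. (3) objective for the matching i -> perm[i]."""
    n = len(perm)
    linear = sum(A_v[i][perm[i]] for i in range(n))
    quadratic = sum(A_e[i][perm[i]][k][perm[k]] for i in range(n) for k in range(n))
    return linear + quadratic


def soga_bruteforce(A_v, A_e):
    """Enumerate all one-to-one matchings (M = N) and return the best V and its cost."""
    n = len(A_v)
    best = min(permutations(range(n)), key=lambda p: soga_objective(A_v, A_e, p))
    V = [[1 if best[i] == j else 0 for j in range(n)] for i in range(n)]
    return V, soga_objective(A_v, A_e, best)


# Toy data: G1 is the path 0-1-2; G2 is the same path with vertices 0 and 1 swapped.
D1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]  # pairwise hop distances in G1
D2 = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]  # pairwise hop distances in G2
A_v = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # made-up vertex costs
A_e = [[[[abs(D1[i][k] - D2[j][l]) for l in range(3)] for k in range(3)]
        for j in range(3)] for i in range(3)]

V, cost = soga_bruteforce(A_v, A_e)
print(V, cost)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]] 0
```

The recovered matching sends vertex 0 to 1, 1 to 0, and 2 to 2, which is the relabeling used to build the second graph.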

3.4 Scalable Multi-graph Alignment

Our aim is to solve the graph alignment among the three graphs $\mathcal{G}_v$, $\mathcal{G}_a$, and $\mathcal{G}_{ae}$ to form a triplet constraint between the input image embedding, its original instruction embedding, and its long-context extended version. However, solving a structure-aware graph alignment between $K$ domains is computationally expensive. One potential solution is to perform pairwise graph alignments $\binom{K}{2}$ times, as shown in Eq. (3), while applying specific constraints to maintain consistency between correspondences (Bernard et al., 2019; Swoboda et al., 2019). However, this method becomes impractical as $K$ increases, making it unsuitable for larger-scale problems.

Another direction leverages the barycenter concept from optimal transport, which identifies a central distribution that minimizes the weighted sum of Wasserstein distances to the given input distributions. We follow this idea to reformulate the alignment of $K$ graphs into $K$ separate alignments with a barycenter graph. Unlike previous unsupervised methods that estimate the barycenter before aligning, we directly define the barycenter using known triplet pairs across the three graphs. This significantly reduces complexity, making our solver more efficient in LLM settings.
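The difference in the number of alignment problems can be checked with a few lines of Python. This is only a counting illustration; the helper name `num_alignments` is made up for this example.

```python
from math import comb


def num_alignments(K):
    """Return (number of pairwise alignments, number of barycenter alignments)."""
    return comb(K, 2), K


for K in (3, 5, 10):
    pairwise, via_barycenter = num_alignments(K)
    print(f"K={K}: pairwise={pairwise}, barycenter={via_barycenter}")
```

For $K=3$ both counts equal 3, so the gain in the three-graph setting comes from avoiding the cross-pair consistency constraints and the barycenter estimation step; the gap in the number of problems only opens up for larger $K$ (45 versus 10 at $K=10$).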

Specifically, we define a new barycenter graph $\mathcal{G}_{br}=(\mathcal{V}_{br},\mathcal{E}_{br})$, where $\mathcal{V}_{br}=\left\{v_{br}^{(1)},\dots,v_{br}^{(B)}\right\}$ with $v_{br}^{(k)}=\overline{\bm{X}}^{(k)}=\{\bm{X}_v^{(k)},[\bm{X}_a^{l}]^{(k)},[\bm{X}_{ae}^{l}]^{(k)}\}$, and the corresponding node features are
$\bm{F}_{br}=\dfrac{1}{3}\left\{\sum_{s}\hat{\bm{Z}}_s^{(1)},\dots,\sum_{s}\hat{\bm{Z}}_s^{(B)}\right\}$ with $s\in\{v,a,ae\}$. The edge set $\mathcal{E}_{br}$ is formed in the same way as for the other graphs, by running k-nearest neighbors on the node features $\bm{F}_{br}$. We now state the multi-graph alignment as:

$$\text{SoGA}(\bm{A}_s^v,\bm{\mathsf{A}}_s^e)=\operatorname*{arg\,min}_{\bm{V}_s\in\mathcal{A}(\mathcal{G}_s,\mathcal{G}_{br})}\sum_{s\in\{v,a,ae\}}\langle\bm{A}_s^v+\bm{\mathsf{A}}_s^e\otimes\bm{V}_s,\bm{V}_s\rangle, \tag{4}$$

where $\bm{V}_s$ is the indicator matrix representing a valid mapping between $\mathcal{G}_s$ and $\mathcal{G}_{br}$, and $\bm{A}_s^v\in\mathbb{R}^{|\mathcal{V}_s||\mathcal{V}_{br}|}$ and $\bm{\mathsf{A}}_s^e\in\mathbb{R}^{|\mathcal{E}_s||\mathcal{E}_{br}|}$ are the vertex affinity matrix and edge affinity tensor between $\mathcal{G}_s$ and $\mathcal{G}_{br}$.
For example, $\left(\bm{A}_s^v\right)_{ij}=d\left(\hat{\bm{Z}}_s^{(i)},\dfrac{1}{3}\sum_{s}\hat{\bm{Z}}_s^{(j)}\right)$, where $d(\cdot)$ is a distance metric (e.g., cosine distance) measuring similarity between the $i^{th}$ node in $\mathcal{G}_s$ and the $j^{th}$ node in $\mathcal{G}_{br}$.
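The barycenter features and the vertex affinity can be sketched as follows. The example averages the per-sample embeddings of the three graphs and then fills $\bm{A}_s^v$ with cosine distances to the barycenter nodes. The function names, the dictionary layout keyed by `"v"`, `"a"`, `"ae"`, and the toy vectors are assumptions for illustration, and the code expects non-zero embedding vectors.

```python
import math


def barycenter_features(Z_hat):
    """Average the per-sample embeddings over all graphs in Z_hat (a dict keyed by s)."""
    graphs = list(Z_hat.values())
    F_br = []
    for sample in zip(*graphs):  # one tuple of vectors per sample index
        F_br.append([sum(col) / len(graphs) for col in zip(*sample)])
    return F_br


def cosine_distance(x, y):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def vertex_affinity(Z_s, F_br):
    """Entry (i, j) is the distance between node i of G_s and node j of G_br."""
    return [[cosine_distance(z, f) for f in F_br] for z in Z_s]


# Made-up aggregated embeddings for B = 2 samples.
Z_hat = {
    "v": [[2.0, 0.0], [0.0, 2.0]],
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "ae": [[3.0, 0.0], [0.0, 3.0]],
}
F_br = barycenter_features(Z_hat)
A_a_v = vertex_affinity(Z_hat["a"], F_br)
print(F_br)   # [[2.0, 0.0], [0.0, 2.0]]
print(A_a_v)  # [[0.0, 1.0], [1.0, 0.0]]
```

In this toy case the diagonal of the affinity matrix is zero, meaning each node is closest to the barycenter node built from its own triplet.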

To address the NP-hard nature of aligning each graph to the barycenter graph $\mathcal{G}_{br}$, which arises from its combinatorial complexity, we employ efficient heuristic solvers based on Lagrange decomposition techniques (Swoboda et al., 2017; Rolínek et al., 2020).

3.5 Backpropagation with Black-box Gradient Estimation

Let $\hat{\bm{V}}_s=\text{SoGA}(\bm{A}_s^v,\bm{\mathsf{A}}_s^e)$ be the solution obtained from the solver. We aim to learn feature representations for the LLM such that $\hat{\bm{V}}_s$ is identical to the true triplet alignments explicitly indicated by the barycenter graph. Denoting by $\bm{V}_s^{*}$ the optimal mapping between the graph $\mathcal{G}_s$ and $\mathcal{G}_{br}$, we compute the following total Hamming loss:

$$\mathcal{L}(\hat{\bm{V}}_s,\bm{V}_s^{*})=\sum_{s\in\{v,a,ae\}}\langle\hat{\bm{V}}_s,(1-\bm{V}_s^{*})\rangle+\langle\bm{V}_s^{*},(1-\hat{\bm{V}}_s)\rangle. \tag{5}$$
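Eq. (5) counts the entries in which the predicted and target matchings disagree. A minimal plain-Python sketch is given below; the function names and the dictionary layout are assumptions for this example, and the target is taken to be the identity matching (node $k$ of each graph paired with barycenter node $k$), which is our reading of how the barycenter graph is constructed.

```python
def hamming_loss(V_hat, V_star):
    """<V_hat, 1 - V_star> + <V_star, 1 - V_hat> for one pair of binary matrices."""
    total = 0
    for row_hat, row_star in zip(V_hat, V_star):
        for p, t in zip(row_hat, row_star):
            total += p * (1 - t) + t * (1 - p)
    return total


def total_hamming_loss(V_hats, V_stars):
    """Sum of the per-graph losses; both arguments are dicts keyed by s."""
    return sum(hamming_loss(V_hats[s], V_stars[s]) for s in V_hats)


identity = [[1, 0], [0, 1]]
swapped = [[0, 1], [1, 0]]

V_stars = {"v": identity, "a": identity, "ae": identity}
V_hats = {"v": identity, "a": swapped, "ae": identity}
print(total_hamming_loss(V_hats, V_stars))  # 4
```

Only the graph $s=a$ is mismatched here, and a fully swapped $2\times 2$ matching disagrees with the identity in all four entries.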

However, computing the gradient of the loss function with respect to the matching problem inputs $\left(\bm{A}_s^v,\bm{\mathsf{A}}_s^e\right)$, i.e., $\nabla_{\bm{A}_s^v,\bm{\mathsf{A}}_s^e}\mathcal{L}\left(\hat{\bm{V}}_s,\bm{V}_s^{*}\right)$, poses a challenge due to the piecewise constant nature of the graph matching objective in Eq. (4) (Pogančić et al., 2020; Rolínek et al., 2020). To address this, we resort to IMLE (Niepert et al., 2021; Minervini et al., 2023), a technique that estimates gradients over solutions of the combinatorial optimization problem by taking the difference between solutions of the matching problem perturbed by Gumbel noise.

In particular, given $(\epsilon,\epsilon')\sim\mathrm{Gumbel}(0,1)$ and for each $s\in\{v,a,ae\}$, we compute:

$$\left(\dfrac{\partial\mathcal{L}}{\partial\bm{A}_s^v},\dfrac{\partial\mathcal{L}}{\partial\bm{\mathsf{A}}_s^e}\right)\approx\tilde{\bm{V}}_s-\text{SoGA}\left(\bm{A}_{s,\lambda}^v,\bm{\mathsf{A}}_{s,\lambda}^e\right)\text{ where }\tilde{\bm{V}}_s=\text{SoGA}\left(\bm{A}_s^v+\epsilon,\bm{\mathsf{A}}_s^e+\epsilon'\right), \tag{6}$$
$$\left(\bm{A}_{s,\lambda}^v,\bm{\mathsf{A}}_{s,\lambda}^e\right)=\left(\bm{A}_s^v+\epsilon,\bm{\mathsf{A}}_s^e+\epsilon'\right)-\lambda\nabla_{\tilde{\bm{V}}_s}\mathcal{L}(\tilde{\bm{V}}_s,\bm{V}_s^{*}),\text{ where }\lambda\text{ is a step size.} \tag{7}$$
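The sketch below transcribes Eq. (6) and Eq. (7) for a heavily simplified setting and should be read as an illustration under stated assumptions, not as the training code. It keeps only the vertex term (a linear assignment problem), replaces the Lagrange decomposition solver with brute-force enumeration, and uses the fact that the gradient of the Hamming loss in Eq. (5) with respect to $\tilde{\bm{V}}_s$ is $1-2\bm{V}_s^{*}$. The names `solve_linear_matching`, `gumbel_noise`, `imle_gradient`, and the `noise_scale` and `seed` parameters are hypothetical. How the correction in Eq. (7) is applied to the edge tensor is not detailed here, so the edge term is omitted.

```python
import math
import random
from itertools import permutations


def solve_linear_matching(A):
    """Brute-force argmin of <A, V> over permutation matrices (vertex term only)."""
    n = len(A)
    best = min(permutations(range(n)), key=lambda p: sum(A[i][p[i]] for i in range(n)))
    return [[1 if best[i] == j else 0 for j in range(n)] for i in range(n)]


def gumbel_noise(n, scale, rng):
    """n x n matrix of Gumbel(0, 1) samples multiplied by `scale`."""
    def sample():
        u = max(rng.random(), 1e-12)  # guard against log(0)
        return -scale * math.log(-math.log(u))
    return [[sample() for _ in range(n)] for _ in range(n)]


def imle_gradient(A, V_star, lam, noise_scale=1.0, seed=0):
    """Estimate dL/dA as V_tilde - SoGA(A_lambda), following Eq. (6) and Eq. (7)."""
    rng = random.Random(seed)
    n = len(A)
    eps = gumbel_noise(n, noise_scale, rng)
    A_pert = [[A[i][j] + eps[i][j] for j in range(n)] for i in range(n)]
    V_tilde = solve_linear_matching(A_pert)
    # Gradient of the Hamming loss (5) with respect to V_tilde is 1 - 2 * V_star.
    A_lam = [[A_pert[i][j] - lam * (1 - 2 * V_star[i][j]) for j in range(n)]
             for i in range(n)]
    V_lam = solve_linear_matching(A_lam)
    return [[V_tilde[i][j] - V_lam[i][j] for j in range(n)] for i in range(n)]


A = [[0.0, 1.0], [1.0, 0.0]]
V_star = [[1, 0], [0, 1]]
grad = imle_gradient(A, V_star, lam=2.0, noise_scale=0.0)
print(grad)  # [[1, -1], [-1, 1]]
```

Because the estimate is a difference of two permutation matrices, its entries lie in $\{-1,0,1\}$ and every row sums to zero. With `noise_scale=0.0` the call is deterministic; a non-zero scale adds the Gumbel perturbation of Eq. (6).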

3.6 Structure alignment-based graph distance properties

In this section, we present theoretical insights into the graph-matching problem outlined in Eq. (3). Specifically, we demonstrate that once the optimal matching between two graphs is established, it defines a valid metric distance. Additionally, the geodesic path (i.e., the shortest path distance) connecting the two graphs in the manifold space can be derived based on the computed matching alignments. We define a discrete distance between two graphs, given a solution of the matching alignment, as:

$$d_{\text{SoGA}}(\mathcal{G}_1,\mathcal{G}_2)=\min_{\bm{V}\in\mathcal{A}(\mathcal{G}_1,\mathcal{G}_2)}\Big(\sum_{i,j}A^v_{i,j}V_{i,j}+\sum_{i,j,k,l}\mathsf{A}^e_{i,j,k,l}V_{i,j}V_{k,l}\Big). \tag{8}$$

We consider labelled graphs as tuples of the form 𝒢=(𝒱,,f,s)𝒢𝒱subscript𝑓subscript𝑠{\mathcal{G}}=({\mathcal{V}},{\mathcal{E}},{\mathcal{L}}_{f},{\mathcal{L}}_{s})caligraphic_G = ( caligraphic_V , caligraphic_E , caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ), where the labelling function f:𝒱:subscript𝑓maps-to𝒱{\mathcal{L}}_{f}:{\mathcal{V}}\mapsto\mathcal{F}caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT : caligraphic_V ↦ caligraphic_F assigns each vertex vi𝒱subscript𝑣𝑖𝒱v_{i}\in{\mathcal{V}}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V to a feature fi=f(vi)subscript𝑓𝑖subscript𝑓subscript𝑣𝑖f_{i}={\mathcal{L}}_{f}(v_{i})italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) in some feature space (,df)subscript𝑑𝑓(\mathcal{F},d_{f})( caligraphic_F , italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ). Similarly, we denote s:𝒱𝒮:subscript𝑠maps-to𝒱𝒮{\mathcal{L}}_{s}:{\mathcal{V}}\mapsto\mathcal{S}caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT : caligraphic_V ↦ caligraphic_S as a structure function which links each vertex vi𝒱subscript𝑣𝑖𝒱v_{i}\in{\mathcal{V}}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V with its structure information si=s(vi)subscript𝑠𝑖subscript𝑠subscript𝑣𝑖s_{i}={\mathcal{L}}_{s}(v_{i})italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), e.g.,  edge information, in some structure space (𝒮,ds)𝒮subscript𝑑𝑠(\mathcal{S},d_{s})( caligraphic_S , italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ). 
By associating a weight to each vertex, we allow the graph $\mathcal{G}$ to be represented by a fully supported mixing measure $\mu=\sum_{i=1}^{N}w_{i}\delta_{(f_{i},s_{i})}$ over the product between feature space and structure space $\mathcal{F}\times\mathcal{S}$. Notably, $\mu$ is not necessarily a probability measure as the summation of its weights can be different from one. We have the vertex affinity matrix between two graphs as $\bm{A}^{v}\in\mathbb{R}^{M\times N}$, where $A^{v}_{i,j}=d_{f}(f_{i},f_{j})$.
Structural similarity is measured by pairwise distances within each graph, represented by $\bm{\mathsfit{A}}^{e}\in\mathbb{R}^{|\mathcal{E}_{1}||\mathcal{E}_{2}|}$, with $\mathsfit{A}^{e}_{i,j,k,l}=|d_{s}(s_{i},s_{k})-d_{s}(s_{j},s_{l})|$, where $d_{s}(\cdot)$ models node distance, such as the shortest path.
We then define the space of all structured graphs $(\mathcal{F}\times\mathcal{S},d_{f},\mu)$ over a metric feature space $(\mathcal{F},d_{f})$ as $\mathbb{S}(\mathcal{F})$, where $(\mathcal{S},d_{s})$ is a metric structure space and $\mu=\sum_{i=1}^{N}w_{i}\delta_{(f_{i},s_{i})}$ is a mixing measure over $\mathcal{F}\times\mathcal{S}$.
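As a minimal sketch of the two affinity structures defined above, the following pure-Python code builds a vertex affinity matrix and an edge affinity tensor for two small graphs. It assumes a Euclidean $d_f$, shortest-path hop counts for $d_s$, and a dense tensor over all node pairs; the function names and toy graphs are illustrative and not taken from the paper's implementation.

```python
import math

def shortest_paths(adj):
    """All-pairs shortest-path hop counts (Floyd-Warshall) for an unweighted graph."""
    n = len(adj)
    dist = [[0.0 if i == j else (1.0 if adj[i][j] else math.inf)
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def vertex_affinity(feats1, feats2):
    """A_v[i][j] = d_f(f_i, f_j); Euclidean d_f is an assumption."""
    return [[math.dist(f1, f2) for f2 in feats2] for f1 in feats1]

def edge_affinity(d1, d2):
    """A_e[i][j][k][l] = |d_s(s_i, s_k) - d_s(s_j, s_l)|.

    i, k index nodes of graph 1; j, l index nodes of graph 2.
    """
    m, n = len(d1), len(d2)
    return [[[[abs(d1[i][k] - d2[j][l]) for l in range(n)]
              for k in range(m)] for j in range(n)] for i in range(m)]

# Toy graphs (hypothetical): a 3-node path and a single edge.
adj1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
adj2 = [[0, 1], [1, 0]]
feats1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
feats2 = [(0.0, 0.0), (0.0, 1.0)]

d1, d2 = shortest_paths(adj1), shortest_paths(adj2)
A_v = vertex_affinity(feats1, feats2)   # shape M x N = 3 x 2
A_e = edge_affinity(d1, d2)             # shape M x N x M x N
```

The dense four-index tensor grows as $M^2N^2$, so a practical implementation would likely restrict it to existing edges, as the $|\mathcal{E}_1||\mathcal{E}_2|$ dimension in the text suggests.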

Theorem 1 (Metric properties).

The distance $d_{\text{SoGA}}$ in Eq. (8) defines a metric in $\mathbb{S}(\mathcal{F})$.

Theorem 1 is proved in Appendix A.1. Intuitively, we can compare two graphs, $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, described by their mixing measures $\mu_{1}=\sum_{i=1}^{M}w_{1i}\delta_{(f_{1i},s_{1i})}$ and $\mu_{2}=\sum_{j=1}^{N}w_{2j}\delta_{(f_{2j},s_{2j})}$, with uniform vertex weights (i.e., $w_{1i}=1,\ \forall i\in[M]$, $w_{2j}=1,\ \forall j\in[N]$) and shortest path structure matrices. The SoGA distance is zero if and only if there exists a one-to-one mapping between the graph vertices that preserves both shortest paths and features, and both graphs have the same number of vertices.
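The zero-distance condition can be illustrated with a brute-force check. The sketch below does not compute $d_{\text{SoGA}}$ itself (Eq. (8) is not reproduced here); it only searches for a one-to-one vertex mapping that preserves features and shortest paths, which is the condition stated above. The function name, tolerance, and toy graphs are hypothetical, and the factorial search is only meant for very small graphs.

```python
from itertools import permutations

def zero_distance_witness(feats1, d1, feats2, d2, tol=1e-9):
    """Return a vertex mapping preserving features and shortest paths, or None.

    feats*: list of feature tuples; d*: shortest-path matrices.
    perm[i] is the vertex of graph 2 matched to vertex i of graph 1.
    """
    if len(feats1) != len(feats2):
        return None  # different vertex counts: the condition cannot hold
    n = len(feats1)
    for perm in permutations(range(n)):
        same_feats = all(
            all(abs(a - b) <= tol for a, b in zip(feats1[i], feats2[perm[i]]))
            for i in range(n)
        )
        same_paths = all(
            abs(d1[i][k] - d2[perm[i]][perm[k]]) <= tol
            for i in range(n) for k in range(n)
        )
        if same_feats and same_paths:
            return perm
    return None

# A 3-node path and a relabelled copy of it (toy data).
feats_a = [(0.0,), (1.0,), (2.0,)]
paths_a = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
feats_b = [(2.0,), (0.0,), (1.0,)]
paths_b = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]

witness = zero_distance_witness(feats_a, paths_a, feats_b, paths_b)
print(witness)  # (1, 2, 0)
```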

Moreover, we derive a new constant-speed geodesic in the SoGA context via Theorem 2, which is proved in Appendix A.2. This geodesic property is useful in optimal transport (discrete measures) to define the dynamic formulation, and it can be used in representation learning by sampling new data on the geodesic connecting two endpoint graphs.

Theorem 2 (Geodesic space).

The space $\mathbb{S}(\mathcal{F})$ equipped with the $d_{\text{SoGA}}$ distance is geodesic.
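For reference, the standard metric-space notion of a constant-speed geodesic can be written with the symbols used here. The curve name $\gamma$ is our notation, and the explicit construction from Appendix A.2 is not reproduced; this is only the generic definition that the geodesic property refers to.

```latex
% Constant-speed geodesic joining two structured graphs G_1 and G_2
\gamma : [0,1] \to \mathbb{S}(\mathcal{F}), \qquad
\gamma(0) = \mathcal{G}_{1}, \quad \gamma(1) = \mathcal{G}_{2},
\\[4pt]
d_{\text{SoGA}}\bigl(\gamma(s), \gamma(t)\bigr)
  = |t - s| \; d_{\text{SoGA}}(\mathcal{G}_{1}, \mathcal{G}_{2})
  \qquad \forall\, s, t \in [0,1].
```

A metric space is called geodesic when every pair of its points is joined by such a curve, so intermediate graphs $\gamma(t)$ with $t\in(0,1)$ are candidates for the interpolated samples mentioned above.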

4 Experiments

4.1 Implementation Details

Model architectures. We use the LLaMA large language model (Touvron et al., 2023), the CLIP-ViT-L-Patch14 visual encoder (Radford et al., 2021), and an MLP projection similar to LLaVA 1.5 (Liu et al., 2024a). Stage 1 follows the standard LLaVA-Med (Li et al., 2024) setup, while stage 2 incorporates our multi-graph alignment with autoregressive training. For multi-graph alignment, a 2-layer graph convolutional network is applied to the output of the Projection and LLM Decoder (handling both image and text modalities). We train for 1 epoch in stage 1 and 3 epochs in stage 2 using the same dataset as LLaVA-Med. The model is optimized using Adam (Kingma, 2014) with the CosineAnnealingLR scheduler and learning rates of 2e-3 and 2e-5 for stages 1 and 2, respectively.
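To make the learning-rate schedule concrete, the sketch below evaluates the closed-form cosine annealing curve (without restarts) for the two stage-specific base learning rates. The number of steps, the minimum learning rate of 0, and per-step rather than per-epoch updates are assumptions, since the text does not state them.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr, eta_min=0.0):
    """Closed-form cosine annealing: decays base_lr to eta_min over total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)

# Base learning rates reported for the two stages.
STAGE_LR = {"stage1": 2e-3, "stage2": 2e-5}

TOTAL_STEPS = 1000  # hypothetical schedule length
for stage, base_lr in STAGE_LR.items():
    samples = [cosine_annealing_lr(s, TOTAL_STEPS, base_lr)
               for s in (0, TOTAL_STEPS // 2, TOTAL_STEPS)]
    print(stage, ["%.2e" % lr for lr in samples])
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to `eta_min` at the final step.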

Pre-training data. We use the same dataset as LLaVA-Med (Li et al., 2024). For stage 1, there are 600K image-text pairs filtered from PMC-15M, all converted to instruction-following data with simple instructions for describing the image. For stage 2, there are 60K image-text pairs extracted from PMC-15M covering 5 modalities: CXR (chest X-ray), CT (computed tomography), MRI (magnetic resonance imaging), histopathology, and gross (i.e., macroscopic) pathology. To convert these pairs to an instruction-following format, the LLaVA-Med authors (Li et al., 2024) use language-only GPT-4 to generate multi-round questions and answers in a tone as if it could see the image.

Running-time. We train LoGra-Med on 4 A100 GPUs with 80GB each for both stages. Training takes 7 hours for stage 1 and 7.5 hours for stage 2. With the original LLaVA-Med (version 1.5) (Li et al., 2024), stage 1 finishes in 6.5 hours and stage 2 in 7 hours. In total, we need one additional hour to complete the whole pre-training process compared to LLaVA-Med.
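The overhead can be checked with a few lines of arithmetic. The sketch assumes that the LLaVA-Med timings were measured on the same 4-GPU setup, which the text implies but does not state explicitly.

```python
GPUS = 4  # A100 80GB GPUs used for LoGra-Med; assumed identical for LLaVA-Med

hours = {
    "LoGra-Med": {"stage1": 7.0, "stage2": 7.5},
    "LLaVA-Med 1.5": {"stage1": 6.5, "stage2": 7.0},
}

totals = {name: sum(stages.values()) for name, stages in hours.items()}
extra_hours = totals["LoGra-Med"] - totals["LLaVA-Med 1.5"]
gpu_hours = {name: GPUS * total for name, total in totals.items()}
relative_overhead = extra_hours / totals["LLaVA-Med 1.5"]

print(totals)        # {'LoGra-Med': 14.5, 'LLaVA-Med 1.5': 13.5}
print(extra_hours)   # 1.0
print(gpu_hours)     # {'LoGra-Med': 58.0, 'LLaVA-Med 1.5': 54.0}
print(f"{relative_overhead:.1%}")  # 7.4%
```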

4.2 Data-hungry Problem of Autoregressive

We begin by presenting the data-intensive nature of autoregressive training. To assess this, we used LLaVA-Med, a state-of-the-art multimodal large language model in the biomedical domain. LLaVA-Med follows a two-stage training process: Stage 1 aligns image-text tokens with biomedical concepts, and Stage 2 fine-tunes the model for instruction-based tasks. We pre-trained LLaVA-Med on varying amounts of data (10%, 40%, 70%) and used the resulting checkpoints for fine-tuning on the visual question answering downstream task. As shown in Figure 1, performance on the VQA-RAD dataset declines sharply when using 10% of the pre-training data compared with training on the full 100%. To the best of our knowledge, this is the first time that the data dependency of the auto-regressive mechanism in med-MLLMs has been revealed. This evidence indicates that, without enough instruction-tuning data, auto-regressive training creates only a weak connection between visual features and text embeddings in the LLM.
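One simple way to build such reduced pre-training sets is a seeded shuffle followed by prefix selection, sketched below. Whether the subsets in this study were nested, and which seed or sampling scheme was used, is not stated; those choices are assumptions made for illustration.

```python
import random

def subsample(items, fraction, seed=0):
    """Keep a fixed fraction of items, chosen by a seeded shuffle.

    Using the same seed for every fraction makes the subsets nested
    (the 10% subset is contained in the 40% subset, and so on).
    """
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)
    keep = round(fraction * len(items))
    return [items[i] for i in sorted(order[:keep])]

# Hypothetical stand-in for a list of pre-training sample ids.
sample_ids = list(range(1000))
subsets = {frac: subsample(sample_ids, frac) for frac in (0.1, 0.4, 0.7)}
print({frac: len(ids) for frac, ids in subsets.items()})
```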

To tackle this issue, LoGra-Med enhances image-text alignment through semantic constraints between input images, their instruction responses, and the long-context extended captions. We experiment with LoGra-Med under the same settings as LLaVA-Med, pre-training on varying data sizes and using the checkpoints for instruction fine-tuning. Figure 1 demonstrates that LoGra-Med effectively learns the biomedical concept alignment task, producing strong pre-training checkpoints for downstream fine-tuning while using only 10% of the data. For instance, while LLaVA-Med is at 52.39% with 10%, LoGra-Med reaches 72.52%, which is approximately the performance of LLaVA-Med pre-trained on 100%. Across 10%, 40%, and 70% data usage, LoGra-Med shows stable performance with consistent improvement, positioning it as a competitive model compared to other med-MLLMs like BioMed-GPT (Zhang et al., 2023a) or Med-MoE (Jiang et al., 2024). This suggests that the multi-graph alignment strategy effectively mitigates the data-intensive demands of autoregressive mechanisms in MLLMs.

4.3 Multi-modal Pre-training Comparison

To validate the benefit of our multi-graph alignment, we compare it with other vision-language pre-training methods, designed either to train CLIP-style models or to bridge frozen vision and language models in visual instruction tuning.

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg. | SLAKE Open | SLAKE Closed | SLAKE Avg. | PathVQA Open | PathVQA Closed | PathVQA Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med (100%) | 63.65 | 81.62 | 72.64 | 83.44 | 83.41 | 83.43 | 36.78 | 91.33 | 64.06 | 73.37 |
| LLaVA-Med (10%) | 43.38 (↓20.27) | 61.4 (↓20.22) | 52.39 (↓20.25) | 80.94 (↓2.5) | 80.29 (↓3.12) | 80.62 (↓2.81) | 24.26 (↓13.69) | 88.03 (↓3.18) | 56.15 (↓7.91) | 63.05 (↓10.32) |
| InfoNCE | 59.39 | 77.57 | 68.48 | 82.4 | 83.17 | 82.78 | 34.59 | 91.45 | 63.02 | 71.43 |
| PLOT | 16.86 | 26.47 | 21.67 | 37.81 | 56.25 | 47.03 | 11.79 | 81.36 | 46.58 | 38.42 |
| SigLIP | 56.99 | 77.94 | 67.47 | 80.86 | 80.53 | 80.69 | 18.08 | 50.85 | 34.465 | 60.88 |
| VLAP | 57.49 | 76.47 | 66.98 | 80.05 | 82.21 | 81.13 | 32.21 | 91.16 | 61.685 | 69.93 |
| LoGra-Med | 66.02 | 79.04 | 72.52 | 84.92 | 85.1 | 85.01 | 37.25 | 91.45 | 64.34 | 73.96 |

Table 1: Fine-tuning performance on MedVQA downstream datasets (pre-trained 10%). Bold indicates the best values among pre-training algorithms, excluding LLaVA-Med (pre-trained 100%).
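The column definitions of Table 1 are not spelled out in this section, but the reported numbers appear consistent with "Avg." being the mean of the open and closed scores and "Overall" being the mean of the three per-dataset averages. The sketch below recomputes both for two rows of Table 1 under that assumption; small differences of about 0.01 are expected because the table presumably rounds unrounded scores.

```python
def summarize(scores):
    """scores: {dataset: (open, closed)} -> (per-dataset averages, overall mean)."""
    avgs = {name: (open_ + closed) / 2 for name, (open_, closed) in scores.items()}
    overall = sum(avgs.values()) / len(avgs)
    return avgs, overall

# Open/closed scores copied from Table 1.
llava_med_100 = {"VQA-RAD": (63.65, 81.62), "SLAKE": (83.44, 83.41), "PathVQA": (36.78, 91.33)}
logra_med_10 = {"VQA-RAD": (66.02, 79.04), "SLAKE": (84.92, 85.1), "PathVQA": (37.25, 91.45)}

llava_avgs, llava_overall = summarize(llava_med_100)
logra_avgs, logra_overall = summarize(logra_med_10)

print({k: round(v, 2) for k, v in llava_avgs.items()}, round(llava_overall, 2))
print({k: round(v, 2) for k, v in logra_avgs.items()}, round(logra_overall, 2))
```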

Datasets. We evaluate pre-trained models on three prominent biomedical VQA datasets: VQA-RAD (Lau et al., 2018), SLAKE (Liu et al., 2021), and PathVQA (He et al., 2020). VQA-RAD includes 3,515 questions across 315 radiology images, while SLAKE contains 642 radiology images from various body parts and over 7k QA pairs. PathVQA, focused on pathology, features 5k images and 32.8k questions. All datasets include open-ended (e.g., what, why, where) and closed-ended (yes/no or two-option) question types. We provide more details in the Appendix.

Baselines. We compare four approaches, including InfoNCE-based methods (Khan & Fu, 2023; Liu et al., 2023), SigLIP (Zhai et al., 2023), PLOT (Chen et al., 2022), and VLAP (Park et al., 2024). Among these, SigLIP applies a sigmoid loss to image-text pairs, removing the need for a global view of the pairwise similarities for normalization and thereby allowing it to scale to large batch sizes. PLOT defines optimal transport as a distance between visual image patches and text embeddings. In contrast, VLAP uses assignment prediction to bridge the modality gap between the visual and LLM embeddings. We train the baselines under the same settings as LoGra-Med with varying pre-training data rates and compare their performance on downstream tasks.
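The difference between an InfoNCE-style objective and a sigmoid pairwise objective can be sketched on a small similarity matrix. This is a simplified illustration of the two loss families, not the baselines' training code; the temperature, scale, and bias values are illustrative assumptions (SigLIP treats temperature and bias as learnable).

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def info_nce(sim, temperature=0.07):
    """Symmetric softmax contrastive loss; each row is normalized over the batch."""
    n = len(sim)

    def cross_entropy_rows(mat):
        total = 0.0
        for i in range(n):
            logits = [mat[i][j] / temperature for j in range(n)]
            peak = max(logits)
            log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
            total += log_z - logits[i]  # matching pair sits on the diagonal
        return total / n

    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))

def sigmoid_pairwise(sim, scale=10.0, bias=-10.0):
    """Sigmoid loss: every image-text pair is an independent binary problem."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += softplus(-label * (scale * sim[i][j] + bias))  # -log sigmoid
    return total / n

# Cosine similarities for a batch of 2 pairs: aligned vs. swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(aligned), info_nce(swapped))
print(sigmoid_pairwise(aligned), sigmoid_pairwise(swapped))
```

Because the sigmoid loss has no softmax over the batch, each pair's term can be computed without access to the full similarity matrix, which is the property that makes large batches easier to distribute.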

Results. In Tables 1, 2, and 12, we compare the performance of LoGra-Med and the baselines when pre-trained with 10%, 40%, and 100% of instruction-tuning data. While most contrastive baselines improve LLaVA-Med at 10%, LoGra-Med consistently outperforms LLaVA-Med and other methods overall. In open-ended questions, which require external knowledge, LoGra-Med delivers the best results. Notably, it maintains stable improvements across all three VQA datasets, unlike other methods that peak at 40% pre-training and decline afterward. For example, SigLIP scores 72.14% (Average) and 72.12% (Overall) on VQA-RAD at 40% but drops by over 6% and 1%, respectively, at 100%. In contrast, LoGra-Med continues to improve, reaching 74.91% (Average) and 74.75% (Overall).

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg. | SLAKE Open | SLAKE Closed | SLAKE Avg. | PathVQA Open | PathVQA Closed | PathVQA Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med (100%) | 63.65 | 81.62 | 72.64 | 83.44 | 83.41 | 83.43 | 36.78 | 91.33 | 64.06 | 73.37 |
| LLaVA-Med (40%) | 62.23 (↓1.42) | 79.41 (↓2.21) | 70.82 (↓1.82) | 84.42 (↑0.98) | 83.65 (↓0.24) | 84.04 (↑0.61) | 31.86 (↓4.92) | 84.99 (↓6.34) | 58.43 (↓5.63) | 71.09 (↓2.28) |
| InfoNCE | 63.11 | 77.57 | 70.34 | 82.68 | 83.89 | 83.29 | 33.58 | 89.62 | 61.6 | 71.74 |
| PLOT | 64.36 | 79.41 | 71.89 | 83.38 | 82.93 | 83.16 | 35.11 | 89.59 | 62.35 | 72.46 |
| SigLIP | 63.02 | 81.25 | 72.14 | 81.26 | 80.29 | 80.77 | 36.01 | 90.86 | 63.435 | 72.12 |
| VLAP | 63.17 | 79.04 | 71.11 | 83.38 | 83.89 | 83.64 | 35.62 | 90.83 | 63.225 | 72.66 |
| LoGra-Med | 66.01 | 82.72 | 74.37 | 84.17 | 85.82 | 84.99 | 37.41 | 91.27 | 64.34 | 74.57 |

Table 2: Fine-tuning performance on MedVQA downstream datasets (pre-trained 40%). Bold indicates the best values among pre-training algorithms, excluding LLaVA-Med (pre-trained 100%).

4.4 Med-VQA Comparison with Medical MLLMs

We now compare LoGra-Med pre-trained with 100% data against other medical foundation models, each trained on varying datasets and employing different architectures or model sizes.

Baselines. We compare with eight competitors, both generic and medical foundation models, including LLaVA (Liu et al., 2024b), LLaVA-Med (Li et al., 2024), Med-Flamingo (Moor et al., 2023), Med-Dr (He et al., 2024), Biomed-GPT (Zhang et al., 2023a), M2I2 (Li et al., 2023b), GPT-4o (Achiam et al., 2023) and Med-MoE (Jiang et al., 2024). Whilst LLaVA and GPT-4o have no medical background, the others are pre-trained on a variety of biomedical knowledge. With the exception of LLaVA, which we reproduced, the results for the other baselines are taken from the literature. Moreover, we also present an enhanced version, LoGra-Med + DCI, which integrates multi-scale visual features from vision encoders (Yao et al., 2024), potentially benefiting medical image analysis by considering both local (detailed) and global (contextual) features.

| Method | #Params | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg. | SLAKE Open | SLAKE Closed | SLAKE Avg. | PathVQA Open | PathVQA Closed | PathVQA Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med (Li et al., 2024) | 7B | 63.65 | 81.62 | 72.64 | 83.44 | 83.41 | 83.43 | 36.78 | 91.33 | 64.06 | 73.37 |
| BiomedGPT-B (Zhang et al., 2023a) | 182M | 60.9 | 81.3 | 71.1 | 84.3 | 89.9 | 87.1 | 28 | 88 | 58 | 72.07 |
| M2I2 (Li et al., 2023b) | - | 61.8 | 81.6 | 71.7 | 74.7 | 91.1 | 82.9 | 36.3 | 88 | 62.15 | 72.25 |
| BioMed-CLIP (Zhang et al., 2023d) | 422M | 67.6 | 79.8 | 73.7 | 82.5 | 89.7 | 86.1 | | | | |
| Med-Dr (He et al., 2024) | 40B | 37.5 | 78.9 | 58.2 | 74.2 | 83.4 | 78.8 | 33.5 | 90.2 | 61.85 | 66.28 |
| LLaVA (general) (Liu et al., 2024b) | 7B | 50 | 65.1 | 57.55 | 78.2 | 63.2 | 70.7 | 7.7 | 63.2 | 35.45 | 54.57 |
| GPT-4 (Achiam et al., 2023) | 200B | 39.5 | 78.9 | 59.2 | 33.6 | 43.6 | 38.6 | | | | |
| Med-MoE (Phi2) (Jiang et al., 2024) | 3.6B | 58.55 | 82.72 | 70.64 | 85.06 | 85.58 | 85.32 | 34.74 | 91.98 | 63.36 | 73.11 |
| Med-MoE (Stable LM) (Jiang et al., 2024) | 2B | 50.08 | 80.07 | 65.08 | 83.16 | 83.41 | 83.29 | 33.79 | 91.30 | 62.55 | 70.3 |
| LoGra-Med | 7B | 66.35 | 83.46 | 74.91 | 85.34 | 85.58 | 85.46 | 36.82 | 90.92 | 63.87 | 74.75 |
| LoGra-Med + DCI | 7B | 67.31 | 81.99 | 74.65 | 84.88 | 85.82 | 85.23 | 37.77 | 91.86 | 64.82 | 74.9 |

Table 3: Performance comparison on MedVQA against other med-MLLMs.

Results. Overall, the two LoGra-Med versions perform better than the baseline models (Table 3). In particular, the version combined with DCI achieves the best PathVQA average score (64.82%) and the best overall result (74.9%). Compared to LLaVA-Med, LoGra-Med exhibits significant improvements on every task, such as 2.01% on VQA-RAD, 2.03% on SLAKE, and 0.76% on PathVQA. Furthermore, it is important to note that both the plain and DCI versions of LoGra-Med demonstrate competitive performance across all VQA datasets, despite having significantly fewer parameters. For instance, the two 7B-parameter LoGra-Med versions outperform the 40B-parameter Med-Dr across all three datasets.

4.5 Medical Visual Chatbot Evaluation

Datasets. Following the experimental settings in LLaVA-Med, we evaluate LoGra-Med on a biomedical multimodal conversational dataset containing a total of 193 questions, comprising conversation (143 samples) and description (50 samples) questions. Additionally, the experimental setup categorizes the test samples into five medical domains: Chest X-ray (CXR), Magnetic Resonance Imaging (MRI), Histology, Gross, and Computed Tomography (CT).

Baselines. We compare against several SOTA multimodal large language models, including general models like LLaVA and GPT-4o, as well as medical-focused models such as LLaVA-Med and its variants, Med-Flamingo, Med-Dr, and Biomed-GPT. We use the officially provided weights for all comparison baselines without additional reproduction steps. The details of the evaluation protocol using GPT-4 as a scorer are presented in the Appendix.

Results. Table 4 summarizes our findings: in most settings, the two LoGra-Med variants deliver the top scores. We provide additional analytics and visualizations of typical model outputs in the Appendix.

| Method | #Params | Question Type: Conver. | Question Type: Descr. | Domain: CXR | Domain: MRI | Domain: Histology | Domain: Gross | Domain: CT | Overall |
|---|---|---|---|---|---|---|---|---|---|
| LLAVA | 7B | 39.40 | 26.20 | 41.60 | 33.40 | 38.40 | 32.91 | 33.40 | 36.1 |
| LLAVA-Med 1.0 | 7B | 47.4 | 33.99 | 51.31 | 36.32 | 45.61 | 41.09 | 44.87 | 43.93 |
| LLAVA-Med 1.5 | 7B | 46.78 | 34.58 | 54.58 | 36.5 | 41.85 | 40.3 | 45.02 | 43.62 |
| MedFlamingo | 8.3B | 28.58 | 13.89 | 26.93 | 21.34 | 22.09 | 32.71 | 22.25 | 24.77 |
| Med-Dr | 40B | 35.61 | 19.28 | 38.98 | 26.28 | 29.10 | 35.40 | 28.30 | 31.38 |
| Biomed-GPT | 182M | 20.71 | 17.99 | 27.53 | 18.50 | 17.18 | 14.72 | 22.08 | 20.01 |
| GPT-4o | 200B | 42.04 | 25.47 | 42.77 | 39.74 | 38.68 | 31.40 | 35.59 | 37.75 |
| LoGra-Med | 7B | 48.49 | 34.32 | 58.37 | 36.82 | 46.05 | 45.19 | 38.24 | 44.82 |
| LoGra-Med + DCI | 7B | 48.99 | 34.01 | 59.9 | 32.34 | 51.88 | 42.53 | 38.28 | 45.11 |

Table 4: Medical visual chatbot evaluation. Results are reported using GPT-4 as the scorer.

| Method | VQA-RAD | SLAKE |
|---|---|---|
| LoGra-Med (Full) | 74.37 | 84.99 |
| LVM-Med with synonyms | 72.39 | 82.93 |
| LoGra-Med in two stages | 72.81 | 84.14 |
| LVM-Med w/o long-context | 72.12 | 81.95 |
| LVM-Med w/o message passing | 73.90 | 84.29 |

Table 5: LoGra-Med ablation study. Results are presented as average scores on VQA-RAD and SLAKE, using pre-trained weights on 40%. The two key factors are emphasized.

4.6 Zero-shot Image Classification as MedVQA

Datasets. We assess the generalization of LoGra-Med on zero-shot image classification by adapting public datasets from Hu et al. (2024). We convert image classification tasks into visual question answering, where the model selects the correct answer from input options based on the given image (see Figure 6 in the Appendix). We focus on three key data modalities prevalent in our pre-training: Microscopy, Computed Tomography (CT), and Chest X-Ray (CXR). This evaluation spans several downstream tasks, including 8 datasets for Microscopy, 4 for CT, and 11 for CXR, totaling 23 datasets.
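Converting a classification sample into a multiple-choice question can be sketched as follows. The prompt template, option lettering, and example labels are hypothetical; the actual prompts follow Hu et al. (2024) and are not reproduced in this section.

```python
import string

def build_mc_question(question, options):
    """Format a classification task as a lettered multiple-choice VQA prompt."""
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options for single-letter labels")
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with the correct option.")
    return "\n".join(lines)

# Hypothetical chest X-ray classification sample.
labels = ["normal", "viral pneumonia", "covid-19"]
prompt = build_mc_question("What abnormality is shown in this chest X-ray?", labels)
print(prompt)
```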

Baselines. We use checkpoints from LLaVA-Med, Med-Flamingo, and RadFM (Wu et al., 2023a) for zero-shot inference on the collected datasets. Notably, RadFM is pre-trained on 16M 2D and 3D medical scans, while LoGra-Med is trained on just 600K instruction-following data. For baseline models, we follow the prompts proposed by Hu et al. (2024), with detailed evaluations using third-party software to align model outputs with ground-truth answers, as outlined in the Appendix.
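Aligning free-form model outputs with the candidate answers can be approximated with string similarity. The sketch below uses Python's `difflib` as a stand-in; the third-party software actually used is described in the Appendix and may behave differently, and the helper names and example outputs are hypothetical.

```python
import difflib

def match_option(model_output, options):
    """Return the option whose text is most similar to the model output."""
    scores = [
        difflib.SequenceMatcher(None, model_output.lower(), option.lower()).ratio()
        for option in options
    ]
    best = max(range(len(options)), key=scores.__getitem__)
    return options[best]

def accuracy(outputs, option_lists, answers):
    """Fraction of samples whose matched option equals the ground truth."""
    hits = sum(
        match_option(out, opts) == ans
        for out, opts, ans in zip(outputs, option_lists, answers)
    )
    return hits / len(answers)

options = ["normal", "pneumonia", "covid-19"]
predicted = match_option("The image shows pneumonia.", options)
print(predicted)
```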

Figure 4: LoGra-Med performance on 23 zero-shot image classification tasks within three data modalities.

Results. Figure 4 illustrates the average performance of LoGra-Med across Microscopy, CT, and Chest X-Ray modalities, with the total number of images and question-answer items listed below. Detailed results for each dataset within these modality groups are provided in Tables 8, 9, and 10 in the Appendix. Overall, LoGra-Med outperforms other models across all datasets, especially excelling in the microscopy modality, where it exceeds the runner-up, RadFM, by 8.2%. We attribute these benefits to the strong alignment between visual features and language embeddings achieved through triplet constraints, which compel the model to capture deeper semantic relationships.

4.7 Ablation Studies

We validate LoGra-Med under the following settings: (i) applying multi-graph alignment in both stages instead of only in stage 2 (default); (ii) performing simple paraphrasing during instruction tuning by replacing keywords with their synonyms; (iii) using only two graphs in the alignment step (excluding long-context graphs); and (iv) applying message passing to enhance node features. Table 5 shows that each proposed factor boosts the final performance, with multi-graph alignment and the use of ChatGPT as an extender being the most crucial components.
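Setting (ii), simple paraphrasing by synonym replacement, can be sketched as a dictionary-based substitution. The synonym table and example sentence are hypothetical; the keywords and synonym source used in the ablation are not specified in this section.

```python
import re

# Hypothetical synonym table for illustration only.
SYNONYMS = {"image": "picture", "shows": "depicts", "abnormality": "anomaly"}

def paraphrase(text, synonyms=SYNONYMS):
    """Replace known keywords with synonyms, preserving initial capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)

print(paraphrase("The image shows an abnormality."))
# The picture depicts an anomaly.
```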

5 Conclusion

We have shown that enforcing triplet correlations among image modalities, their instruction data, and the extended contextual captions can improve vision-language alignment, which is often lacking in models trained with auto-regressive objectives, especially when less pre-training data is available, as demonstrated for the LLaVA-Med model. We also present LoGra-Med, a new multi-graph alignment algorithm to handle such requirements, which is efficient in training and matches LLaVA-Med's performance with just 10% of the training data while outperforming other state-of-the-art methods on various tasks. These findings underscore that selecting the appropriate learning algorithm for training MLLMs is as crucial as scaling model size or data volume.

Limitations and Future Work

While we have conducted several experiments using the LLaVA model, it is crucial to validate LoGra-Med with other architectures, such as the Flamingo model (Alayrac et al., 2022). Additionally, integrating a vision encoder or LLM specifically trained on medical sources (Chen et al., 2023b; MH Nguyen et al., 2024) holds promise for enhancing performance and improving robustness, an approach we have not yet evaluated. Lastly, we propose to investigate adaptor-based methods (Hu et al., 2022; Zhang et al., 2023b) for fine-tuning on downstream tasks, as frozen backbone models can significantly reduce memory usage and maximize efficiency.

References

  • (1) Chest ct-scan images dataset. https://tianchi.aliyun.com/dataset/93929. Accessed: 2024-09-30.
  • (2) Covid ct dataset. https://tianchi.aliyun.com/dataset/106604. Accessed: 2024-09-30.
  • blo (2023) Blood cell images. https://www.kaggle.com/datasets/paultimothymooney/blood-cells, 2023. Accessed: 2024-09-30.
  • cov (2023) Covid-19 image dataset: 3 way classification - covid-19, viral pneumonia, normal. https://tianchi.aliyun.com/dataset/93853, 2023.
  • nlm (2023) Nlm - malaria data. https://lhncbc.nlm.nih.gov/LHC-research/LHC-projects/image-processing/malaria-datasheet.html, 2023. Accessed: 2024-09-30.
  • xra (2023) X-ray hand small joint classification dataset (based on bone age scoring method rus-chn). https://aistudio.baidu.com/datasetdetail/69582/0, 2023. Accessed: 2024-09-30.
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Agueh & Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • An et al. (2024) Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. arXiv preprint arXiv:2402.17463, 2024.
  • Asraf & Islam (2021) A Asraf and Z Islam. Covid19 pneumonia and normal chest x-ray pa dataset. mendeley data v1 (2021), 2021.
  • Bernard et al. (2019) Florian Bernard, Johan Thunberg, Paul Swoboda, and Christian Theobalt. Hippi: Higher-order projected power iterations for scalable multi-matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10284–10293, 2019.
  • Chen et al. (2022) Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models. International Conference on Learning Representations, 2022.
  • Chen et al. (2023a) Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, et al. Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a.
  • Chen (2018) Pingjun Chen. Knee osteoarthritis severity grading dataset, 2018.
  • Chen et al. (2023b) Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023b.
  • Chizat & Bach (2018) Lénaïc Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Chowdhury et al. (2020) Muhammad EH Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar Reajul Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al Emadi, et al. Can ai help in screening viral and covid-19 pneumonia? IEEE Access, 8:132665–132676, 2020.
  • Cohen et al. (2020) Joseph Paul Cohen, Paul Morrison, Lan Dao, Karsten Roth, Tim Q Duong, and Marzyeh Ghassemi. Covid-19 image data collection: Prospective predictions are the future. arXiv preprint arXiv:2006.11988, 2020.
  • Devlin (2018) Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Ehm et al. (2024) Viktoria Ehm, Maolin Gao, Paul Roetzer, Marvin Eisenberger, Daniel Cremers, and Florian Bernard. Partial-to-partial shape matching with geometric consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  27488–27497, 2024.
  • Gupta & Gupta (2019) Anubha Gupta and Ritu Gupta. Isbi 2019 c-nmc challenge: Classification in cancer cell imaging. Select Proceedings, 2, 2019.
  • Haller et al. (2022) Stefan Haller, Lorenz Feineis, Lisa Hutschenreiter, Florian Bernard, Carsten Rother, Dagmar Kainmüller, Paul Swoboda, and Bogdan Savchynskyy. A comparative study of graph matching algorithms in computer vision. In European Conference on Computer Vision, pp.  636–653. Springer, 2022.
  • He et al. (2024) Sunan He, Yuxiang Nie, Zhixuan Chen, Zhiyuan Cai, Hongmei Wang, Shu Yang, and Hao Chen. Meddr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv preprint arXiv:2404.15127, 2024.
  • He et al. (2020) Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
  • Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. International Conference on Learning Representations, 2022.
  • Hu et al. (2024) Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22170–22183, 2024.
  • Huang et al. (2023) Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
  • Jaeger et al. (2014) Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery, 4(6):475, 2014.
  • Javadi & Mirroshandel (2019) Soroush Javadi and Seyed Abolghasem Mirroshandel. A novel deep learning method for automatic assessment of human sperm images. Computers in biology and medicine, 109:182–194, 2019.
  • Jiang et al. (2024) Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, and Zuozhu Liu. Moe-tinymed: Mixture of experts for tiny medical large vision-language models. arXiv preprint arXiv:2404.10237, 2024.
  • Ju et al. (2024) Wei Ju, Zheng Fang, Yiyang Gu, Zequn Liu, Qingqing Long, Ziyue Qiao, Yifang Qin, Jianhao Shen, Fang Sun, Zhiping Xiao, et al. A comprehensive survey on deep graph representation learning. Neural Networks, pp.  106207, 2024.
  • Kather et al. (2018) J. N. Kather, N. Halama, and A. Marx. 100,000 histological images of human colorectal cancer and healthy tissue, April 2018. Accessed: 2024-09-30.
  • Kermany et al. (2018) Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172(5):1122–1131, 2018.
  • Khan & Fu (2023) Zaid Khan and Yun Fu. Contrastive alignment of vision to language through parameter-efficient transfer learning. International Conference on Learning Representations, 2023.
  • Kingma (2014) Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lau et al. (2018) Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
  • Li et al. (2024) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
  • Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
  • Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730–19742. PMLR, 2023a.
  • Li et al. (2023b) Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, and Shenjun Zhong. Self-supervised vision-language pretraining for medial visual question answering. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pp.  1–5, 2023b. doi: 10.1109/ISBI53787.2023.10230743.
  • Lin et al. (2022) Tianyi Lin, Nhat Ho, Marco Cuturi, and Michael I. Jordan. On the complexity of approximating multi-marginal optimal transport. Journal of Machine Learning Research, 23:1–43, 2022.
  • Liu et al. (2021) Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp.  1650–1654, 2021. doi: 10.1109/ISBI48211.2021.9434010.
  • Liu et al. (2024a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  26296–26306, 2024a.
  • Liu et al. (2024b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
  • Liu et al. (2023) Lizhao Liu, Xinyu Sun, Tianhang Xiang, Zhuangwei Zhuang, Liuren Yin, and Mingkui Tan. Contrastive vision-language alignment makes efficient instruction learner. arXiv preprint arXiv:2311.17945, 2023.
  • Liu et al. (2024c) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024c.
  • Lou et al. (2023) Renze Lou, Kai Zhang, and Wenpeng Yin. A comprehensive survey on instruction following. arXiv preprint arXiv:2303.10475, 2023.
  • MH Nguyen et al. (2024) Duy MH Nguyen, Hoang Nguyen, Nghiem Diep, Tan Ngoc Pham, Tri Cao, Binh Nguyen, Paul Swoboda, Nhat Ho, Shadi Albarqouni, Pengtao Xie, et al. Lvm-med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. Advances in Neural Information Processing Systems, 36, 2024.
  • Minervini et al. (2023) Pasquale Minervini, Luca Franceschi, and Mathias Niepert. Adaptive perturbation-based gradient estimation for discrete latent variable models. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  9200–9208, 2023.
  • Moor et al. (2023) Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pp.  353–367. PMLR, 2023.
  • Mémoli (2011) Facundo Mémoli. Gromov–Wasserstein Distances and the Metric Approach to Object Matching. Foundations of Computational Mathematics, 11(4):417–487, August 2011. ISSN 1615-3375, 1615-3383.
  • Mémoli & Sapiro (2005) Facundo Mémoli and Guillermo Sapiro. A Theoretical and Computational Framework for Isometry Invariant Recognition of Point Cloud Data. Foundations of Computational Mathematics, 5(3):313–347, July 2005. ISSN 1615-3383.
  • Nanni et al. (2016) Loris Nanni, Michelangelo Paci, Florentino Luciano Caetano dos Santos, Heli Skottman, Kati Juuti-Uusitalo, and Jari Hyttinen. Texture descriptors ensembles enable image-based classification of maturation of human stem cell-derived retinal pigmented epithelium. PLoS One, 11(2):e0149399, 2016.
  • Nguyen et al. (2024a) Duy MH Nguyen, An T Le, Trung Q Nguyen, Nghiem T Diep, Tai Nguyen, Duy Duong-Tran, Jan Peters, Li Shen, Mathias Niepert, and Daniel Sonntag. Dude: Dual distribution-aware context prompt learning for large vision-language model. Asian Conference on Machine Learning, 2024a.
  • Nguyen et al. (2024b) Duy MH Nguyen, Nina Lukashina, Tai Nguyen, An T Le, TrungTin Nguyen, Nhat Ho, Jan Peters, Daniel Sonntag, Viktor Zaverkin, and Mathias Niepert. Structure-aware E(3)-invariant molecular conformer aggregation networks. International Conference on Machine Learning, 2024b.
  • Niepert et al. (2021) Mathias Niepert, Pasquale Minervini, and Luca Franceschi. Implicit mle: backpropagating through discrete exponential family distributions. Advances in Neural Information Processing Systems, 34:14567–14579, 2021.
  • Park et al. (2024) Jungin Park, Jiyoung Lee, and Kwanghoon Sohn. Bridging vision and language spaces with assignment prediction. International Conference on Learning Representations, 2024.
  • Pawar et al. (2024) Saurav Pawar, SM Tonmoy, SM Zaman, Vinija Jain, Aman Chadha, and Amitava Das. The what, why, and how of context length extension techniques in large language models–a detailed survey. arXiv preprint arXiv:2401.07872, 2024.
  • Piran et al. (2024) Zoe Piran, Michal Klein, James Thornton, and Marco Cuturi. Contrasting multiple representations with the multi-marginal matching gap. International Conference on Machine Learning, 2024.
  • Pogančić et al. (2020) Marin Vlastelica Pogančić, Anselm Paulus, Vit Musil, Georg Martius, and Michal Rolinek. Differentiation of blackbox combinatorial solvers. In International Conference on Learning Representations, 2020.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rajpurkar et al. (2017) Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy Ding, Tony Duan, Hershel Mehta, Brandon Yang, Kaylie Zhu, Dillon Laird, Robyn L Ball, et al. Mura: Large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957, 2017.
  • Rolínek et al. (2020) Michal Rolínek, Vít Musil, Anselm Paulus, Marin Vlastelica, Claudio Michaelis, and Georg Martius. Optimizing rank-based metrics with blackbox differentiation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7620–7630, 2020.
  • Rolínek et al. (2020) Michal Rolínek, Paul Swoboda, Dominik Zietlow, Anselm Paulus, Vít Musil, and Georg Martius. Deep graph matching via blackbox differentiation of combinatorial solvers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII, pp.  407–424. Springer, 2020.
  • Shaker et al. (2017) Fariba Shaker, S Amirhassan Monadjemi, Javad Alirezaie, and Ahmad Reza Naghsh-Nilchi. A dictionary learning approach for human sperm heads classification. Computers in biology and medicine, 91:181–190, 2017.
  • Singhal et al. (2023) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023.
  • Soares & Angelov (2020) Eduardo Soares and Plamen Angelov. A large dataset of real patients ct scans for covid-19 identification. Harv. Dataverse, 1:1–8, 2020.
  • Song et al. (2020) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33:16857–16867, 2020.
  • Spanhol et al. (2015) Fabio A Spanhol, Luiz S Oliveira, Caroline Petitjean, and Laurent Heutte. A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering, 63(7):1455–1462, 2015.
  • Suckling (1994) John Suckling. The mammographic images analysis society digital mammogram database. In Excerpta Medica. International Congress Series, 1994, volume 1069, pp.  375–378, 1994.
  • Sun et al. (2024) Yuchong Sun, Che Liu, Kun Zhou, Jinwen Huang, Ruihua Song, Wayne Xin Zhao, Fuzheng Zhang, Di Zhang, and Kun Gai. Parrot: Enhancing multi-turn instruction following for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  9729–9750, 2024.
  • Swoboda et al. (2017) Paul Swoboda, Carsten Rother, Hassan Abu Alhaija, Dagmar Kainmuller, and Bogdan Savchynskyy. A study of lagrangean decompositions and dual ascent solvers for graph matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1607–1616, 2017.
  • Swoboda et al. (2019) Paul Swoboda, Ashkan Mokarian, Christian Theobalt, Florian Bernard, et al. A convex relaxation for multi-graph matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11156–11165, 2019.
  • Tang et al. (2022) Shixiang Tang, Feng Zhu, Lei Bai, Rui Zhao, Chenyu Wang, and Wanli Ouyang. Unifying visual contrastive learning for object recognition from a graph perspective. In European Conference on Computer Vision, pp.  649–667. Springer, 2022.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Tu et al. (2024) Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
  • Wang et al. (2020) Linda Wang, Zhong Qiu Lin, and Alexander Wong. Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images. Scientific reports, 10(1):19549, 2020.
  • Wu et al. (2023a) Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463, 2023a.
  • Wu et al. (2023b) Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023b.
  • Xie et al. (2024) Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, et al. Medtrinity-25m: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900, 2024.
  • Yao et al. (2024) Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for mllms. arXiv preprint arXiv:2405.13800, 2024.
  • Zanfir & Sminchisescu (2018) Andrei Zanfir and Cristian Sminchisescu. Deep learning of graph matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2684–2693, 2018.
  • Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  11975–11986, 2023.
  • Zhang et al. (2023a) Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, et al. Biomedgpt: a unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv preprint arXiv:2305.17100, 2023a.
  • Zhang et al. (2023b) Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
  • Zhang et al. (2024) Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Yu Qiao, Hongsheng Li, and Peng Gao. Llama-adapter: Efficient fine-tuning of large language models with zero-initialized attention. In The Twelfth International Conference on Learning Representations, 2024.
  • Zhang et al. (2023c) Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2(3):6, 2023c.
  • Zhang et al. (2023d) Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. URL https://arxiv.org/abs/2303.00915, 2023d.

Supplementary Material for
“LoGra-Med: Long-Context Multi-Graph Learning for Medical Vision-Language Models”

Appendix A Proofs of the Main Theoretical Results

In this appendix, we provide detailed technical proofs of our main theoretical results.

A.1 Proof of Theorem 1

For simplicity, we describe each labeled graph $\mathcal{G}$ discussed above only by its measure $\mu$, which encodes the whole structured graph.

To prove Theorem 1, for any two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ in the structured graph space $\mathbb{S}(\mathcal{F})$, described respectively by their mixing measures $\mu_1 = \sum_{i=1}^{M} w_{1i}\,\delta_{(f_{1i}, s_{1i})}$ and $\mu_2 = \sum_{j=1}^{N} w_{2j}\,\delta_{(f_{2j}, s_{2j})}$, we wish to prove the following properties:

  1. Positivity: $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) > 0$ for any $\mathcal{G}_1 \neq \mathcal{G}_2$.

  2. Equality relation: $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) = 0$ if and only if $\mathcal{G}_1 = \mathcal{G}_2$.

  3. Symmetry: $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) = d_{\text{SoGA}}(\mathcal{G}_2, \mathcal{G}_1)$.

  4. Triangle inequality: $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_3) \leq d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) + d_{\text{SoGA}}(\mathcal{G}_2, \mathcal{G}_3)$ for any graph $\mathcal{G}_3$.

Note first that properties 1 (Positivity) and 3 (Symmetry) hold trivially.
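As an informal sanity check of these four properties, the sketch below brute-forces a SoGA-style distance on tiny graphs. It rests on several illustration-only assumptions that are not taken from the paper's implementation: the graphs have equal size and unit weights, so every admissible coupling is a permutation; features and positions are scalars with $d_f(a,b) = |a-b|$ and $d_s(a,b) = |a-b|$; and the objective has the form $\sum_{i,j,k,l}\left[d_f(f_{1i}, f_{2j}) + |d_s(s_{1i}, s_{1k}) - d_s(s_{2j}, s_{2l})|\right] P_{i,j} P_{k,l}$ that appears in the proof of the triangle inequality. The names `soga_cost` and `soga_distance` are hypothetical.

```python
from itertools import permutations


def soga_cost(g1, g2, sigma):
    """Cost of matching node i of g1 to node sigma[i] of g2 (unit weights assumed)."""
    f1, s1 = g1
    f2, s2 = g2
    n = len(f1)
    total = 0.0
    for i in range(n):
        for k in range(n):
            # Feature term d_f(f_1i, f_2sigma(i)) with d_f = absolute difference.
            feat = abs(f1[i] - f2[sigma[i]])
            # Structure term |d_s(s_1i, s_1k) - d_s(s_2sigma(i), s_2sigma(k))|.
            struct = abs(abs(s1[i] - s1[k]) - abs(s2[sigma[i]] - s2[sigma[k]]))
            total += feat + struct
    return total


def soga_distance(g1, g2):
    """Minimise the cost over all permutations (only feasible for tiny graphs)."""
    n = len(g1[0])
    return min(soga_cost(g1, g2, sigma) for sigma in permutations(range(n)))


# Each toy graph is a pair (features, positions); the values are arbitrary.
g1 = ([0.0, 1.0, 2.0], [0.0, 1.0, 3.0])
g2 = ([0.5, 1.5, 2.5], [0.0, 2.0, 3.0])
g3 = ([2.0, 0.0, 1.0], [1.0, 4.0, 0.0])

d12 = soga_distance(g1, g2)
d21 = soga_distance(g2, g1)
d23 = soga_distance(g2, g3)
d13 = soga_distance(g1, g3)
d11 = soga_distance(g1, g1)

print("symmetry gap:", abs(d12 - d21))
print("triangle slack:", d12 + d23 - d13)
print("self distance:", d11)
```

The enumeration over all permutations grows factorially, so this is only meant for checking the properties on graphs with a handful of nodes.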

Proof of 2. Equality relation. The equality relation follows immediately from Proposition 1 below, which is proved in Appendix B.1.

Proposition 1 (Equality relation).

For any two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ in the structured graph space $\mathbb{S}(\mathcal{F})$, described respectively by their mixing measures $\mu_1 = \sum_{i=1}^{M} w_{1i}\,\delta_{(f_{1i}, s_{1i})}$ and $\mu_2 = \sum_{j=1}^{N} w_{2j}\,\delta_{(f_{2j}, s_{2j})}$, it holds that $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) = 0$ if and only if $M = N$ and there exists a bijection $\sigma : [M] \mapsto [M]$ such that:

  E1. $\forall i \in [M] : w_{1i} = w_{2\sigma(i)}$.

  E2. $\forall i \in [M] : f_{1i} = f_{2\sigma(i)}$.

  E3. $\forall i, k \in [M]^{2} : d_s(s_{1i}, s_{1k}) = d_s(s_{2\sigma(i)}, s_{2\sigma(k)})$.
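To make conditions E1 to E3 concrete, the following sketch searches by brute force for a bijection $\sigma$ that satisfies all three. It is an illustration under stated assumptions, not the paper's procedure: features and positions are scalars, $d_s(a,b) = |a-b|$, and the helper name `find_matching_bijection` is hypothetical.

```python
from itertools import permutations


def find_matching_bijection(g1, g2, tol=1e-12):
    """Return a permutation sigma satisfying E1-E3, or None if there is none.

    Each graph is a triple (weights, features, positions) of equal-length lists.
    """
    w1, f1, s1 = g1
    w2, f2, s2 = g2
    if len(w1) != len(w2):  # The proposition requires M = N.
        return None
    m = len(w1)
    for sigma in permutations(range(m)):
        # E1: weights agree under sigma.
        e1 = all(abs(w1[i] - w2[sigma[i]]) <= tol for i in range(m))
        # E2: features agree under sigma.
        e2 = all(abs(f1[i] - f2[sigma[i]]) <= tol for i in range(m))
        # E3: pairwise structure distances agree under sigma.
        e3 = all(
            abs(abs(s1[i] - s1[k]) - abs(s2[sigma[i]] - s2[sigma[k]])) <= tol
            for i in range(m)
            for k in range(m)
        )
        if e1 and e2 and e3:
            return sigma
    return None


g1 = ([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], [0.0, 1.0, 3.0])
# g2 lists the nodes of g1 in the order (2, 0, 1) and shifts every position by 10.
g2 = ([1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [13.0, 10.0, 11.0])
# g3 keeps the features of g1 but changes one pairwise distance.
g3 = ([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], [0.0, 1.0, 5.0])

sigma_12 = find_matching_bijection(g1, g2)
sigma_13 = find_matching_bijection(g1, g3)
print(sigma_12, sigma_13)
```

The pair `g1`, `g2` shows that E3 constrains only pairwise structure distances: the positions themselves differ by a shift, yet the two graphs are identified by the proposition.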

Proof of 4. Triangle inequality. Let us consider two arbitrary graphs $\mathcal{G}_1$ and $\mathcal{G}_2$, described respectively by their probability measures $\mu_1 = \sum_{i=1}^{M} w_{1i}\,\delta_{(f_{1i}, s_{1i})}$ and $\mu_2 = \sum_{j=1}^{N} w_{2j}\,\delta_{(f_{2j}, s_{2j})}$.
For any graph $\mathcal{G}_3$ described by its probability measure $\mu_3 = \sum_{k=1}^{K} w_{3k}\,\delta_{(f_{3k}, s_{3k})}$, we define $\bm{P} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_2)$ and $\bm{Q} \in \mathcal{A}(\mathcal{G}_2, \mathcal{G}_3)$ as two optimal couplings of the SoGA distance between $\mu_1$ and $\mu_2$ and between $\mu_2$ and $\mu_3$, respectively, i.e.,

\begin{align*}
\bm{P} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_2) &= \left\{ \bm{P} \in \{0,1\}^{M \times N} : \sum_{i=1}^{M} P_{i,j} = w_{1j} = 1,\ \sum_{j=1}^{N} P_{i,j} = w_{2i} = 1 \right\}, \\
\bm{Q} \in \mathcal{A}(\mathcal{G}_2, \mathcal{G}_3) &= \left\{ \bm{Q} \in \{0,1\}^{N \times K} : \sum_{j=1}^{N} Q_{j,k} = w_{2k} = 1,\ \sum_{k=1}^{K} Q_{j,k} = w_{3j} = 1 \right\}.
\end{align*}

We then construct $\bm{R} = \left( \sum_{j} \frac{P_{i,j} Q_{j,k}}{w_{2j}} \right)_{i,k}$. Then it holds that $\bm{R} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_3)$. Indeed, we have

\[
\sum_{i} R_{i,k} = \sum_{i} \sum_{j} \frac{P_{i,j} Q_{j,k}}{w_{2j}} = \sum_{j} \sum_{i} P_{i,j} \frac{Q_{j,k}}{w_{2j}} = \sum_{j} w_{1j} \frac{Q_{j,k}}{w_{2j}} = \sum_{j} Q_{j,k} = 1.
\]
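The composition step can be checked numerically on a small case. The sketch below builds $R_{i,k} = \sum_j P_{i,j} Q_{j,k} / w_{2j}$ from two binary couplings and verifies both marginals; the row marginal $\sum_k R_{i,k} = 1$ follows by the same argument as the column marginal above. The matrices, the unit weights, and the helper name `compose_couplings` are illustrative assumptions.

```python
def compose_couplings(P, Q, w2):
    """Return R with R[i][k] = sum_j P[i][j] * Q[j][k] / w2[j]."""
    M, N, K = len(P), len(Q), len(Q[0])
    return [
        [sum(P[i][j] * Q[j][k] / w2[j] for j in range(N)) for k in range(K)]
        for i in range(M)
    ]


# Binary couplings with unit row and column sums (permutation matrices).
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
Q = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
w2 = [1.0, 1.0, 1.0]  # Unit weights of the intermediate graph.

R = compose_couplings(P, Q, w2)
col_sums = [sum(R[i][k] for i in range(len(R))) for k in range(len(R[0]))]
row_sums = [sum(row) for row in R]

print("R =", R)
print("column sums:", col_sums)
print("row sums:", row_sums)
```

With unit weights the construction reduces to the matrix product of $\bm{P}$ and $\bm{Q}$, that is, to composing the two underlying permutations, so $\bm{R}$ is again binary.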

By the suboptimality of $\bm{R}$ and the triangle inequalities of $d_f$ and $|\cdot|$, we have

\begin{align*}
d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_3)
&\leq \sum_{i,j,k,l} \left[ d_f(f_{1i}, f_{3j}) + |d_s(s_{1i}, s_{1k}) - d_s(s_{3j}, s_{3l})| \right] R_{i,j} R_{k,l} \\
&= \sum_{i,j,k,l} \left[ d_f(f_{1i}, f_{3j}) + |d_s(s_{1i}, s_{1k}) - d_s(s_{3j}, s_{3l})| \right] \sum_{t} \frac{P_{i,t} Q_{t,j}}{w_{2t}} \sum_{d} \frac{P_{k,d} Q_{d,l}}{w_{2d}} \\
&= \sum_{i,j,k,l,t,d} \left[ d_f(f_{1i}, f_{3j}) + |d_s(s_{1i}, s_{1k}) - d_s(s_{3j}, s_{3l})| \right] \frac{P_{i,t} Q_{t,j}}{w_{2t}} \frac{P_{k,d} Q_{d,l}}{w_{2d}} \\
&\leq \sum_{i,j,k,l,t,d} \left[ d_f(f_{1i}, f_{2t}) + d_f(f_{2t}, f_{3j}) \right] \frac{P_{i,t} Q_{t,j}}{w_{2t}} \frac{P_{k,d} Q_{d,l}}{w_{2d}} \\
&\quad + \sum_{i,j,k,l,t,d} \left[ |d_s(s_{1i}, s_{1k}) - d_s(s_{2t}, s_{2d})| + |d_s(s_{2t}, s_{2d}) - d_s(s_{3j}, s_{3l})| \right] \frac{P_{i,t} Q_{t,j}}{w_{2t}} \frac{P_{k,d} Q_{d,l}}{w_{2d}} \\
&= \sum_{i,j,k,l,t,d} \left[ d_f(f_{1i}, f_{2t}) + |d_s(s_{1i}, s_{1k}) - d_s(s_{2t}, s_{2d})| \right] \frac{P_{i,t} P_{k,d}}{w_{2t}} \frac{Q_{t,j} Q_{d,l}}{w_{2d}} \\
&\quad + \sum_{i,j,k,l,t,d} \left[ d_f(f_{2t}, f_{3j}) + |d_s(s_{2t}, s_{2d}) - d_s(s_{3j}, s_{3l})| \right] \frac{P_{i,t} Q_{t,j}}{w_{2t}} \frac{P_{k,d} Q_{d,l}}{w_{2d}}
\end{align*}
=i,k,t,d[df(f1i,f2t)+|ds(s1i,s1k)ds(s2t,s2d)|]Pi,tPk,djQt,jw2tlQd,lw2dabsentsubscript𝑖𝑘𝑡𝑑delimited-[]subscript𝑑𝑓subscript𝑓1𝑖subscript𝑓2𝑡subscript𝑑𝑠subscript𝑠1𝑖subscript𝑠1𝑘subscript𝑑𝑠subscript𝑠2𝑡subscript𝑠2𝑑subscript𝑃𝑖𝑡subscript𝑃𝑘𝑑subscript𝑗subscript𝑄𝑡𝑗subscript𝑤2𝑡subscript𝑙subscript𝑄𝑑𝑙subscript𝑤2𝑑\displaystyle=\sum_{i,k,t,d}\left[d_{f}(f_{1i},f_{2t})+|d_{s}(s_{1i},s_{1k})-d% _{s}(s_{2t},s_{2d})|\right]P_{i,t}P_{k,d}\sum_{j}\frac{Q_{t,j}}{w_{2t}}\sum_{l% }\frac{Q_{d,l}}{w_{2d}}= ∑ start_POSTSUBSCRIPT italic_i , italic_k , italic_t , italic_d end_POSTSUBSCRIPT [ italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT 1 italic_i end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT ) + | italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 1 italic_k end_POSTSUBSCRIPT ) - italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT ) | ] italic_P start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_k , italic_d end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT divide start_ARG italic_Q start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_w start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT divide start_ARG italic_Q start_POSTSUBSCRIPT italic_d , italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_w start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT end_ARG
+j,l,t,d[df(f2t,f3j)+|ds(s2t,s2d)ds(s3j,s3l)|]Qt,jQd,liPi,tw2tkPk,dw2d.subscript𝑗𝑙𝑡𝑑delimited-[]subscript𝑑𝑓subscript𝑓2𝑡subscript𝑓3𝑗subscript𝑑𝑠subscript𝑠2𝑡subscript𝑠2𝑑subscript𝑑𝑠subscript𝑠3𝑗subscript𝑠3𝑙subscript𝑄𝑡𝑗subscript𝑄𝑑𝑙subscript𝑖subscript𝑃𝑖𝑡subscript𝑤2𝑡subscript𝑘subscript𝑃𝑘𝑑subscript𝑤2𝑑\displaystyle\quad+\sum_{j,l,t,d}\left[d_{f}(f_{2t},f_{3j})+|d_{s}(s_{2t},s_{2% d})-d_{s}(s_{3j},s_{3l})|\right]Q_{t,j}Q_{d,l}\sum_{i}\frac{P_{i,t}}{w_{2t}}% \sum_{k}\frac{P_{k,d}}{w_{2d}}.+ ∑ start_POSTSUBSCRIPT italic_j , italic_l , italic_t , italic_d end_POSTSUBSCRIPT [ italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT 3 italic_j end_POSTSUBSCRIPT ) + | italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT ) - italic_d start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 3 italic_j end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 3 italic_l end_POSTSUBSCRIPT ) | ] italic_Q start_POSTSUBSCRIPT italic_t , italic_j end_POSTSUBSCRIPT italic_Q start_POSTSUBSCRIPT italic_d , italic_l end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT divide start_ARG italic_P start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT end_ARG start_ARG italic_w start_POSTSUBSCRIPT 2 italic_t end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT divide start_ARG italic_P start_POSTSUBSCRIPT italic_k , italic_d end_POSTSUBSCRIPT end_ARG start_ARG italic_w start_POSTSUBSCRIPT 2 italic_d end_POSTSUBSCRIPT end_ARG .

Note that we have

$$
\sum_{j}\frac{Q_{t,j}}{w_{2t}}=\sum_{l}\frac{Q_{d,l}}{w_{2d}}=\sum_{i}\frac{P_{i,t}}{w_{2t}}=\sum_{k}\frac{P_{k,d}}{w_{2d}}=1.
$$

Substituting these identities gives the desired result:

$$
\begin{aligned}
d_{\text{SoGA}}(\mathcal{G}_{1},\mathcal{G}_{3}) &\leq \sum_{i,k,t,d}\left[d_{f}(f_{1i},f_{2t})+|d_{s}(s_{1i},s_{1k})-d_{s}(s_{2t},s_{2d})|\right]P_{i,t}P_{k,d} \\
&\quad+\sum_{j,l,t,d}\left[d_{f}(f_{2t},f_{3j})+|d_{s}(s_{2t},s_{2d})-d_{s}(s_{3j},s_{3l})|\right]Q_{t,j}Q_{d,l} \\
&= d_{\text{SoGA}}(\mathcal{G}_{1},\mathcal{G}_{2})+d_{\text{SoGA}}(\mathcal{G}_{2},\mathcal{G}_{3})\quad\text{(since $\bm{P}$ and $\bm{Q}$ are the optimal plans)}.
\end{aligned}
$$
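The chain of inequalities above relies only on the triangle inequalities for $d_f$ and for the absolute value, together with the normalisation identities, so the same bound holds for the glued plan built from any two couplings, not only optimal ones. The following NumPy sketch checks this numerically on random toy graphs. The helper names (`northwest_corner`, `random_coupling`, `soga_cost`, `toy_graph`), the Euclidean choice for $d_f$, and the use of pairwise distance matrices for $d_s$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def northwest_corner(a, b):
    """Build one exact coupling of weight vectors a and b (northwest-corner rule)."""
    a, b = a.copy(), b.copy()
    plan = np.zeros((a.size, b.size))
    i = j = 0
    while i < a.size and j < b.size:
        mass = min(a[i], b[j])
        plan[i, j] = mass
        a[i] -= mass
        b[j] -= mass
        if a[i] <= 1e-15:
            i += 1
        else:
            j += 1
    return plan

def random_coupling(a, b):
    """Mix two exact couplings so the plan is neither sparse nor a pure product."""
    lam = rng.uniform(0.2, 0.8)
    return lam * northwest_corner(a, b) + (1 - lam) * np.outer(a, b)

def soga_cost(F_a, C_a, F_b, C_b, V):
    """Sum over i,j,k,l of [d_f(f_ai, f_bj) + |C_a[i,k] - C_b[j,l]|] V_ij V_kl."""
    feat = np.linalg.norm(F_a[:, None, :] - F_b[None, :, :], axis=-1)
    struct = np.abs(C_a[:, None, :, None] - C_b[None, :, None, :])
    return np.einsum("ij,kl,ijkl->", V, V, feat[:, :, None, None] + struct)

def toy_graph(n, dim=3):
    """Random node features, pairwise structure distances, and node weights."""
    F = rng.normal(size=(n, dim))
    S = rng.normal(size=(n, 2))
    C = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    w = rng.uniform(0.5, 1.5, size=n)
    return F, C, w / w.sum()

(F1, C1, w1), (F2, C2, w2), (F3, C3, w3) = toy_graph(4), toy_graph(5), toy_graph(6)
P = random_coupling(w1, w2)        # couples G1 and G2 (arbitrary, not optimal)
Q = random_coupling(w2, w3)        # couples G2 and G3 (arbitrary, not optimal)
R = P @ np.diag(1.0 / w2) @ Q      # glued plan: R_ij = sum_t P_it Q_tj / w_2t

lhs = soga_cost(F1, C1, F3, C3, R)
rhs = soga_cost(F1, C1, F2, C2, P) + soga_cost(F2, C2, F3, C3, Q)
print(np.allclose(R.sum(axis=1), w1), np.allclose(R.sum(axis=0), w3), lhs <= rhs)
```

The glued plan has the marginals of $\mathcal{G}_1$ and $\mathcal{G}_3$, so it is an admissible coupling between them; taking $\bm{P}$ and $\bm{Q}$ optimal then turns the bound on its cost into the triangle inequality for $d_{\text{SoGA}}$.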

A.2 Proof of Theorem 2

Theorem 2 enables us to characterise the optimal transport problem between two measures as a curve in the space of measures, with the objective of minimising its total length. This formulation is also useful for deriving global minima results for non-convex particle gradient descent in an optimisation context, which is a valuable application of gradient flows (Chizat & Bach, 2018). By definition, a geodesic between $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ is a shortest path between these two graphs. In particular, distances along constant speed geodesic paths are straightforward to compute, because these paths are directly embedded into the real line $\mathbb{R}$ as follows: $d_{\text{SoGA}}(\mathcal{G}_{1},\mathcal{G}_{2})=|t-u|^{-1}d_{\text{SoGA}}(p(u),p(t))$ for all $0\leq u\neq t\leq 1$ and for any path (continuous map) $p$ connecting $\mathcal{G}_{1}$ to $\mathcal{G}_{2}$ such that $p(u)=\mathcal{G}_{1}$ and $p(t)=\mathcal{G}_{2}$.
To prove Theorem 2, we first collect in Definition 1 some fundamental material from metric geometry for a general metric space $(\mathbb{M},d)$.

Definition 1 (Length and geodesic spaces).

Let $(\mathbb{M},d)$ be a metric space and let $x,y\in\mathbb{M}$ be two points. We say that a path (curve) $p:[0,1]\mapsto\mathbb{M}$ connects (or joins) $x$ to $y$ if $p(0)=x$, $p(1)=y$, and $p$ is a continuous map.

We also define the length $L(p)\in\mathbb{R}$ of a path $p:[0,1]\mapsto\mathbb{M}$ as

$$
L(p):=\sup\sum_{i=1}^{n}d(p(t_{i}),p(t_{i+1})),
$$

where we take the supremum over all $n\geq 1$ and all $n$-tuples $t_{1}<\ldots<t_{n}$ in $[0,1]$.

We call a metric space $\mathbb{M}$ a length space if for all $x,y\in\mathbb{M}$, $d(x,y)=\inf_{p}L(p)$, where the infimum is taken over all paths $p$ connecting $x$ to $y$.

We call a length space a geodesic space if for all $x,y\in\mathbb{M}$, there exists a path $p(x,y):[0,1]\mapsto\mathbb{M}$ such that

$$
d(x,y)=\min_{p(x,y)}L(p(x,y)).
$$

We also call the path $p(x,y)$ a geodesic between $x$ and $y$.

Finally, we define a path $p:[0,1]\mapsto\mathbb{M}$ as a constant speed geodesic if and only if

$$
d(p(u),p(t))=|t-u|\,d(p(0),p(1)),\quad\forall u,t\in[0,1].
$$
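To make Definition 1 concrete, the sketch below approximates the length of a path by a polygonal sum on a fine partition and checks the constant speed property for a straight segment in the Euclidean plane. The setting ($\mathbb{R}^2$ with the Euclidean norm), the two example paths, and the function names are assumptions chosen purely for illustration; a finite partition only gives a lower bound on the supremum in the definition of $L(p)$.

```python
import numpy as np

def polygonal_length(path, ts):
    """Sum of d(p(t_i), p(t_{i+1})) over consecutive points of the partition ts."""
    points = np.array([path(t) for t in ts])
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])

def segment(t):
    """Straight line from x to y, the candidate constant speed geodesic."""
    return (1 - t) * x + t * y

def detour(t):
    """Another continuous path from x to y that bends away from the segment."""
    return segment(t) + np.array([0.0, np.sin(np.pi * t)])

ts = np.linspace(0.0, 1.0, 2001)
segment_length = polygonal_length(segment, ts)
detour_length = polygonal_length(detour, ts)

# Constant speed property: d(p(u), p(t)) = |t - u| d(p(0), p(1)).
u, t = 0.2, 0.7
speed_gap = abs(np.linalg.norm(segment(u) - segment(t))
                - abs(t - u) * np.linalg.norm(x - y))
print(segment_length, detour_length, speed_gap)
```

The segment attains the distance between its endpoints and satisfies the constant speed identity up to floating-point error, while the bent path is strictly longer, which is the sense in which a geodesic is a shortest path.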

For the proof of Theorem 2, we first consider an optimal coupling $\bm{V}^{*}$ for the SoGA distance between two graphs $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, i.e.,

$$
d_{\text{SoGA}}(\mathcal{G}_{1},\mathcal{G}_{2})=\min_{\bm{V}\in\mathcal{A}(\mathcal{G}_{1},\mathcal{G}_{2})}O(\bm{A}^{v},\bm{\mathsfit{A}}^{e},\bm{V})=O(\bm{A}^{v},\bm{\mathsfit{A}}^{e},\bm{V}^{*}),
$$

where the two graphs are described respectively by their mixing measures $\mu_{0}=\sum_{i=1}^{M}w_{0i}\delta_{(f_{0i},s_{0i})}$ and $\mu_{1}=\sum_{j=1}^{N}w_{1j}\delta_{(f_{1j},s_{1j})}$. Moreover, for any $t\in[0,1]$ we define $\nu_{t}:\mathcal{F}\times\mathcal{S}_{0}\times\mathcal{F}\times\mathcal{S}_{1}\mapsto\mathcal{F}\times\mathcal{S}_{0}\times\mathcal{S}_{1}$ such that

$$
\nu_{t}(f_{0},s_{0},f_{1},s_{1})=((1-t)f_{0}+tf_{1},s_{0},s_{1}),\quad\text{and}\quad\mu_{t}:=\nu_{t}\#\bm{V}^{*}=\sum_{i=1}^{M}\sum_{j=1}^{N}V^{*}_{i,j}\,\delta_{((1-t)f_{0i}+tf_{1j},\,s_{0i},\,s_{1j})},
$$

and on the metric space $\mathcal{S}_{0}\times\mathcal{S}_{1}$, we define the distance

$$
d_{t}:=(1-t)d_{s_{0}}\oplus td_{s_{1}}:\quad(1-t)d_{s_{0}}\oplus td_{s_{1}}((s_{0i},s_{0j}),(s_{1k},s_{1l}))=(1-t)d_{s}(s_{0i},s_{1k})+td_{s}(s_{0j},s_{1l})
$$

for any $((s_{0i},s_{0j}),(s_{1k},s_{1l}))\in\mathcal{S}_{0}\times\mathcal{S}_{1}$. Here, $\#$ denotes the push-forward operator, such that $\nu_{t}\#\bm{V}^{*}(\mathbb{A})=\bm{V}^{*}(\nu_{t}^{-1}(\mathbb{A}))$ for any Borel set $\mathbb{A}$ of a $\sigma$-algebra. For simplicity, we only consider $(\mathcal{F},d_{f})=(\mathbb{R}^{d},\|\cdot\|)$, where $\|\cdot\|$ is the Euclidean norm.
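For discrete measures, the push-forward under $\nu_t$ simply moves the feature part of each coupled pair $(i,j)$ to the interpolated point $(1-t)f_{0i}+tf_{1j}$ and keeps the coupling mass $V_{i,j}$ as its weight. The sketch below builds these atoms with NumPy. It is a minimal illustration under stated assumptions: the function name `interpolate_measure` is hypothetical, the structure components $(s_{0i},s_{1j})$ are left implicit in the pair index, and the product coupling used here is only a placeholder for the optimal plan $\bm{V}^{*}$.

```python
import numpy as np

def interpolate_measure(F0, F1, V, t):
    """Atoms (1 - t) f_0i + t f_1j with weights V_ij, flattened over pairs (i, j)."""
    atoms = (1 - t) * F0[:, None, :] + t * F1[None, :, :]   # shape (M, N, d)
    return atoms.reshape(-1, F0.shape[1]), V.reshape(-1)

rng = np.random.default_rng(1)
M, N, dim = 3, 4, 2
F0, F1 = rng.normal(size=(M, dim)), rng.normal(size=(N, dim))
w0, w1 = np.full(M, 1 / M), np.full(N, 1 / N)
V = np.outer(w0, w1)   # placeholder coupling, NOT the optimal plan V*

t = 0.3
atoms, weights = interpolate_measure(F0, F1, V, t)

# The feature mean of mu_t interpolates linearly between the means of mu_0 and mu_1.
mean_t = weights @ atoms
expected_mean = (1 - t) * (w0 @ F0) + t * (w1 @ F1)
print(atoms.shape, weights.sum(), np.allclose(mean_t, expected_mean))
```

At $t=0$ every atom coincides with a feature of the first graph and at $t=1$ with a feature of the second, so the family $(\mu_t)_{t\in[0,1]}$ moves mass along straight lines between coupled nodes.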

Then we aim to prove that $(\mathcal{F}\times\mathcal{S}_{0}\times\mathcal{S}_{1},(1-t)d_{s_{0}}\oplus td_{s_{1}},\mu_{t})_{t\in[0,1]}$ is a constant speed geodesic joining $(\mathcal{F}\times\mathcal{S}_{0},d_{s_{0}},\mu_{0})$ and $(\mathcal{F}\times\mathcal{S}_{1},d_{s_{1}},\mu_{1})$, for arbitrary elements $(\mathcal{F}\times\mathcal{S}_{0},d_{s_{0}},\mu_{0})$ and $(\mathcal{F}\times\mathcal{S}_{1},d_{s_{1}},\mu_{1})$ in the metric space $(\mathbb{S}(\mathcal{F}),d_{\text{SoGA}})$.

To do so, we consider any $u,t\in[0,1]$ such that $u\neq t$. By definition, we have to prove that

$$
d_{\text{SoGA}}(\mu_{u},\mu_{t})=|t-u|\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).\tag{9}
$$

Indeed, to prove equation (9), we first recall that

$$
\begin{aligned}
\mu_{u} &:=\nu_{u}\#\bm{V}^{*}=\sum_{i=1}^{M}\sum_{j=1}^{N}V^{*}_{i,j}\,\delta_{((1-u)f_{0i}+uf_{1j},\,s_{0i},\,s_{1j})}, \\
\mu_{t} &:=\nu_{t}\#\bm{V}^{*}=\sum_{i=1}^{M}\sum_{j=1}^{N}V^{*}_{i,j}\,\delta_{((1-t)f_{0i}+tf_{1j},\,s_{0i},\,s_{1j})}, \\
d_{\text{SoGA}}(\mu_{0},\mu_{1}) &=\sum_{i,j,k,l}\left[d_{f}(f_{0i},f_{1j})+|d_{s}(s_{0i},s_{1k})-d_{s}(s_{0j},s_{1l})|\right]V^{*}_{i,j}V^{*}_{k,l}.
\end{aligned}
$$

We then define the coupling γu,t=(μu×μt)#𝑽𝒜(μu,μt)superscript𝛾𝑢𝑡subscript𝜇𝑢subscript𝜇𝑡#superscript𝑽𝒜subscript𝜇𝑢subscript𝜇𝑡\gamma^{u,t}=(\mu_{u}\times\mu_{t})\#{\bm{V}}^{*}\in{\mathcal{A}}(\mu_{u},\mu_% {t})italic_γ start_POSTSUPERSCRIPT italic_u , italic_t end_POSTSUPERSCRIPT = ( italic_μ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT × italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) # bold_italic_V start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∈ caligraphic_A ( italic_μ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ). By the suboptimality of γu,tsuperscript𝛾𝑢𝑡\gamma^{u,t}italic_γ start_POSTSUPERSCRIPT italic_u , italic_t end_POSTSUPERSCRIPT, it holds that:

$$
\begin{aligned}
d_{\text{SoGA}}(\mu_{u},\mu_{t})
&\leq\sum_{i,j,k,l}\left[d_{f}(f_{0i},f_{1j})+|d_{t}((s_{0i},s_{0j}),(s_{1k},s_{1l}))-d_{u}((s_{0i},s_{0j}),(s_{1k},s_{1l}))|\right]\gamma^{u,t}_{i,j}\gamma^{u,t}_{k,l}\\
&=\sum_{i,j,k,l}\Big[d_{f}((1-t)f_{0i}+tf_{1j},(1-u)f_{0i}+uf_{1j})\\
&\quad+|(1-t)d_{s}(s_{0i},s_{1k})+td_{s}(s_{0j},s_{1l})-(1-u)d_{s}(s_{0i},s_{1k})-ud_{s}(s_{0j},s_{1l})|\Big]{V}^{*}_{i,j}{V}^{*}_{k,l}\\
&=\sum_{i,j,k,l}\left[(t-u)d_{f}(f_{0i},f_{1j})+|(t-u)d_{s}(s_{0i},s_{1k})-(t-u)d_{s}(s_{0j},s_{1l})|\right]{V}^{*}_{i,j}{V}^{*}_{k,l}\\
&=|t-u|\sum_{i,j,k,l}\left[d_{f}(f_{0i},f_{1j})+|d_{s}(s_{0i},s_{1k})-d_{s}(s_{0j},s_{1l})|\right]{V}^{*}_{i,j}{V}^{*}_{k,l}\\
&=|t-u|\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).
\end{aligned}
$$

Here, we used the fact that $d_{f}$ is the Euclidean norm, hence

$$
d_{f}((1-t)f_{0i}+tf_{1j},(1-u)f_{0i}+uf_{1j})=\|(1-t)f_{0i}+tf_{1j}-(1-u)f_{0i}-uf_{1j}\|=|t-u|\,d_{f}(f_{0i},f_{1j}).
$$
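The scaling identity above can be checked numerically. The following is a minimal sketch, not part of the original proof: the helper names `interpolate` and `interpolation_gap` and the randomly drawn feature vectors are illustrative assumptions, and $d_{f}$ is taken to be the Euclidean distance as stated in the text.

```python
import math
import random


def interpolate(a, b, t):
    """Return the point (1 - t) * a + t * b, coordinate by coordinate."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_gap(a, b, t, u):
    """Return both sides of ||p_t - p_u|| = |t - u| * ||a - b||."""
    lhs = math.dist(interpolate(a, b, t), interpolate(a, b, u))
    rhs = abs(t - u) * math.dist(a, b)
    return lhs, rhs


rng = random.Random(0)  # fixed seed so the check is reproducible
f0 = [rng.uniform(-1.0, 1.0) for _ in range(5)]  # hypothetical feature f_{0i}
f1 = [rng.uniform(-1.0, 1.0) for _ in range(5)]  # hypothetical feature f_{1j}
lhs, rhs = interpolation_gap(f0, f1, t=0.7, u=0.2)
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

The two sides agree up to floating-point rounding because the difference of the two interpolated points is exactly $(t-u)(f_{1j}-f_{0i})$.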

Therefore, we have

$$
d_{\text{SoGA}}(\mu_{u},\mu_{t})\leq|t-u|\,d_{\text{SoGA}}(\mu_{0},\mu_{1}). \tag{10}
$$

The remaining task is to prove that

$$
d_{\text{SoGA}}(\mu_{u},\mu_{t})\geq|t-u|\,d_{\text{SoGA}}(\mu_{0},\mu_{1}). \tag{11}
$$

To show this inequality, we note that, by the triangle inequality for $d_{\text{SoGA}}$ together with inequality (10), for any $0\leq u\leq t\leq 1$ it holds that

$$
\begin{aligned}
d_{\text{SoGA}}(\mu_{0},\mu_{1})
&\leq d_{\text{SoGA}}(\mu_{0},\mu_{u})+d_{\text{SoGA}}(\mu_{u},\mu_{t})+d_{\text{SoGA}}(\mu_{t},\mu_{1})\\
&\leq u\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(t-u)\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(1-t)\,d_{\text{SoGA}}(\mu_{0},\mu_{1})\\
&=d_{\text{SoGA}}(\mu_{0},\mu_{1}).
\end{aligned}
$$

Hence every inequality in this chain holds with equality, and for any $0\leq u\leq t\leq 1$ we obtain

$$
\begin{aligned}
&d_{\text{SoGA}}(\mu_{0},\mu_{u})+d_{\text{SoGA}}(\mu_{u},\mu_{t})+d_{\text{SoGA}}(\mu_{t},\mu_{1})\\
&\quad=u\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(t-u)\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(1-t)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).
\end{aligned}
\tag{12}
$$

Suppose that

$$
d_{\text{SoGA}}(\mu_{u},\mu_{t})<(t-u)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).
$$

Then combining with the fact that

$$
d_{\text{SoGA}}(\mu_{0},\mu_{u})\leq u\,d_{\text{SoGA}}(\mu_{0},\mu_{1}),\text{ and }d_{\text{SoGA}}(\mu_{t},\mu_{1})\leq(1-t)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}),
$$

we have

$$
\begin{aligned}
&d_{\text{SoGA}}(\mu_{0},\mu_{u})+d_{\text{SoGA}}(\mu_{u},\mu_{t})+d_{\text{SoGA}}(\mu_{t},\mu_{1})\\
&\quad<u\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(t-u)\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+(1-t)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).
\end{aligned}
$$

This contradicts equation (12). Hence the desired inequality (11) holds. Finally, combining (10) and (11), we obtain

$$
d_{\text{SoGA}}(\mu_{u},\mu_{t})=|t-u|\,d_{\text{SoGA}}(\mu_{0},\mu_{1}). \tag{13}
$$
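The lower bound (11) can also be reached without arguing by contradiction. The derivation below is a sketch of that alternative route, not the authors' argument; it uses only the triangle inequality and inequality (10) applied to the pairs $(0,u)$ and $(t,1)$, with $0\leq u\leq t\leq 1$.

```latex
\begin{aligned}
d_{\text{SoGA}}(\mu_{0},\mu_{1})
&\leq d_{\text{SoGA}}(\mu_{0},\mu_{u})+d_{\text{SoGA}}(\mu_{u},\mu_{t})+d_{\text{SoGA}}(\mu_{t},\mu_{1})\\
&\leq u\,d_{\text{SoGA}}(\mu_{0},\mu_{1})+d_{\text{SoGA}}(\mu_{u},\mu_{t})+(1-t)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}),\\
\text{so}\quad d_{\text{SoGA}}(\mu_{u},\mu_{t})
&\geq\bigl(1-u-(1-t)\bigr)\,d_{\text{SoGA}}(\mu_{0},\mu_{1})
=(t-u)\,d_{\text{SoGA}}(\mu_{0},\mu_{1}).
\end{aligned}
```

Rearranging the second line isolates $d_{\text{SoGA}}(\mu_{u},\mu_{t})$, which gives the same bound as (11).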

Appendix B Proofs of Technical Results

B.1 Proof of Proposition 1

First, let us suppose that $d_{\text{SoGA}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})=0$. We wish to prove the existence of a bijection $\sigma$ satisfying E1, E2, and E3. Indeed, let ${\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})$ be any admissible transportation plan that encodes a valid matching between ${\mathcal{G}}_{1}$ and ${\mathcal{G}}_{2}$. Then we define:

$$
d(s_{1i},s_{1k})=\frac{1}{2}\left[d_{f}(f_{1i},f_{1k})+d_{s}(s_{1i},s_{1k})\right],\quad\forall i,k\in[M]^{2}, \tag{14}
$$

$$
d(s_{2j},s_{2l})=\frac{1}{2}\left[d_{f}(f_{2j},f_{2l})+d_{s}(s_{2j},s_{2l})\right],\quad\forall j,l\in[M]^{2}. \tag{15}
$$

Recall that the SGM discrepancy is defined as:

$$
\begin{aligned}
d_{\text{SoGA}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})
&=\min_{{\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})}\left(\sum_{i,j}{A}^{v}_{i,j}{V}_{i,j}+\sum_{i,j,k,l}{\mathsfit{A}}^{e}_{i,j,k,l}{V}_{i,j}{V}_{k,l}\right)=\min_{{\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})}O({\bm{A}}^{v},{\bm{\mathsfit{A}}}^{e},{\bm{V}})\\
&=\min_{{\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})}\left[O_{f}({\bm{A}}^{v},{\bm{V}})+O_{s}({\bm{\mathsfit{A}}}^{e},{\bm{V}})\right].
\end{aligned}
\tag{16}
$$

Recall that the vertex affinity matrix ${\bm{A}}^{v}\in\mathbb{R}^{M\times M}$, with entries $A^{v}_{i,j}=d_{f}(f_{1i},f_{2j})$, was introduced in the previous section. The edge affinity tensor, denoted by ${\bm{\mathsfit{A}}}^{e}$, is defined as follows: ${\mathsfit{A}}^{e}_{i,j,k,l}=|d_{s}(s_{1i},s_{1k})-d_{s}(s_{2j},s_{2l})|$.
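To make the objective in (16) concrete, the sketch below evaluates it for permutation plans and minimizes by brute force over all bijections. It is an illustration under stated assumptions, not the paper's solver: the function names, the toy features, and the toy structure distances are hypothetical, $d_{f}$ is taken to be Euclidean, and exhaustive enumeration is only feasible for very small $M$.

```python
import itertools
import math


def soga_objective(f1, f2, ds1, ds2, sigma):
    """Objective O(A^v, A^e, V) for the plan with V[i][sigma[i]] = 1."""
    m = len(f1)
    # Vertex term: sum_i d_f(f_{1i}, f_{2 sigma(i)}), Euclidean d_f assumed.
    vertex = sum(math.dist(f1[i], f2[sigma[i]]) for i in range(m))
    # Edge term: only entries with j = sigma(i), l = sigma(k) survive.
    edge = sum(
        abs(ds1[i][k] - ds2[sigma[i]][sigma[k]])
        for i in range(m)
        for k in range(m)
    )
    return vertex + edge


def soga_bruteforce(f1, f2, ds1, ds2):
    """Minimize the objective over all bijections (small graphs only)."""
    m = len(f1)
    return min(
        soga_objective(f1, f2, ds1, ds2, sigma)
        for sigma in itertools.permutations(range(m))
    )


# Hypothetical toy graph: node features and pairwise structure distances.
f1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
ds1 = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]

# Second graph: the same graph with nodes relabeled by `perm`.
perm = (2, 0, 1)
f2 = [None] * 3
ds2 = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    f2[perm[i]] = f1[i]
    for k in range(3):
        ds2[perm[i]][perm[k]] = ds1[i][k]

d_same = soga_bruteforce(f1, f2, ds1, ds2)

# Perturb one feature so that no bijection can match all features exactly.
f2_changed = list(f2)
f2_changed[0] = (5.0, 5.0)
d_changed = soga_bruteforce(f1, f2_changed, ds1, ds2)
print(d_same, d_changed)
```

For the relabeled copy the relabeling itself makes both terms vanish, while the perturbed feature forces a strictly positive vertex term for every bijection, in line with the zero-discrepancy characterization being proved.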

Let ${\bm{V}}^{*}$ be the optimal coupling for $d_{\text{SoGA}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})$. Then we have

$$
O_{f}({\bm{A}}^{v},{\bm{V}}^{*})+O_{s}({\bm{\mathsfit{A}}}^{e},{\bm{V}}^{*})=\min_{{\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})}\left[O_{f}({\bm{A}}^{v},{\bm{V}})+O_{s}({\bm{\mathsfit{A}}}^{e},{\bm{V}})\right]=d_{\text{SoGA}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})=0. \tag{17}
$$

Since both $O_{f}({\bm{A}}^{v},{\bm{V}}^{*})$ and $O_{s}({\bm{\mathsfit{A}}}^{e},{\bm{V}}^{*})$ are non-negative, we conclude that $O_{f}({\bm{A}}^{v},{\bm{V}}^{*})=O_{s}({\bm{\mathsfit{A}}}^{e},{\bm{V}}^{*})=0$. We now use the following Lemma 1, which is proved in Appendix B.2.

Lemma 1.

Given the definition $\overline{{\mathsfit{A}}}^{e}_{i,j,k,l}=|d(s_{1i},s_{1k})-d(s_{2j},s_{2l})|$, where $d(s_{1i},s_{1k})$ and $d(s_{2j},s_{2l})$ are given in equations (14) and (15), respectively, it holds that

$$
O_{s}(\overline{{\bm{\mathsfit{A}}}}^{e},{\bm{V}}^{*})=\sum_{i,j,k,l}\overline{{\mathsfit{A}}}^{e}_{i,j,k,l}{V}^{*}_{i,j}{V}^{*}_{k,l}=\sum_{i,j,k,l}|d(s_{1i},s_{1k})-d(s_{2j},s_{2l})|{V}^{*}_{i,j}{V}^{*}_{k,l}=0. \tag{18}
$$

Moreover, there exists a bijection $\sigma:[M]\mapsto[N]$, with $M=N$, that is a weight-preserving and distance-$d$-preserving isometry, as follows:

  1. E1.

    $\forall i\in[M]:w_{1i}=w_{2\sigma(i)}$.

  2. E3*.

    $\forall i,k\in[M]^{2}:d(s_{1i},s_{1k})=d(s_{2\sigma(i)},s_{2\sigma(k)})$.

Since ${\bm{V}}^{*}$ is the optimal coupling w.r.t. the distance $d$, in the sense that

$$
O_{s}(\overline{{\bm{\mathsfit{A}}}}^{e},{\bm{V}}^{*})=\min_{{\bm{V}}\in{\mathcal{A}}({\mathcal{G}}_{1},{\mathcal{G}}_{2})}O_{s}(\overline{{\bm{\mathsfit{A}}}}^{e},{\bm{V}})=0, \tag{19}
$$

${\bm{V}}^{*}$ is supported by $\sigma$ and satisfies ${\bm{V}}^{*}={\bm{I}}_{M\times N}\times\sigma$. Therefore, $O_{f}({\bm{A}}^{v},{\bm{V}}^{*})=\sum_{i,j}d_{f}(f_{1i},f_{2\sigma(i)}){V}^{*}_{i,j}=\sum_{i}d_{f}(f_{1i},f_{2\sigma(i)})\sum_{j}{V}^{*}_{i,j}=\sum_{i}d_{f}(f_{1i},f_{2\sigma(i)})=0$. Here, we used the fact that

\begin{align*}
\bm{V}^{*} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_2) = \left\{ \bm{V} \in \{0,1\}^{M \times N} : \sum_{i=1}^{M} V_{i,j} = w_{1j} = 1, \ \sum_{j=1}^{N} V_{i,j} = w_{2i} = 1 \right\}.
\end{align*}

Note that the terms $d_f(f_{1i}, f_{2\sigma(i)})$, $i \in [M]$, are all non-negative. Since their sum is zero, each term must vanish, i.e., $d_f(f_{1i}, f_{2\sigma(i)}) = 0, \forall i \in [M]$. This is equivalent to $f_{1i} = f_{2\sigma(i)}, \forall i \in [M]$, since $d_f$ is a metric, which is the desired E2. Therefore, we also have $d_f(f_{1i}, f_{1k}) = d_f(f_{2\sigma(i)}, f_{2\sigma(k)}), \forall i, k \in [M]$. Combining equations (14), (15), and E3*, we have

\begin{align*}
d(s_{1i}, s_{1k}) &= \frac{1}{2}\left[ d_f(f_{1i}, f_{1k}) + d_s(s_{1i}, s_{1k}) \right], \tag{20} \\
d(s_{2\sigma(i)}, s_{2\sigma(k)}) &= \frac{1}{2}\left[ d_f(f_{2\sigma(i)}, f_{2\sigma(k)}) + d_s(s_{2\sigma(i)}, s_{2\sigma(k)}) \right], \quad \forall i, k \in [M]^2. \tag{21}
\end{align*}

This leads to the desired result, i.e., E3: $d_s(s_{1i}, s_{1k}) = d_s(s_{2\sigma(i)}, s_{2\sigma(k)}), \forall i, k \in [M]^2$.
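The subtraction behind this step can be written out explicitly. The block below is a sketch that assumes E3* denotes the identity $d(s_{1i}, s_{1k}) = d(s_{2\sigma(i)}, s_{2\sigma(k)})$ for the combined distance $d$ (this reading of E3* is an assumption). It subtracts (21) from (20) and uses the equality of the $d_f$ terms established above:

```latex
\begin{align*}
0 &= d(s_{1i}, s_{1k}) - d(s_{2\sigma(i)}, s_{2\sigma(k)}) \\
  &= \frac{1}{2}\left[ d_f(f_{1i}, f_{1k}) - d_f(f_{2\sigma(i)}, f_{2\sigma(k)}) \right]
   + \frac{1}{2}\left[ d_s(s_{1i}, s_{1k}) - d_s(s_{2\sigma(i)}, s_{2\sigma(k)}) \right] \\
  &= \frac{1}{2}\left[ d_s(s_{1i}, s_{1k}) - d_s(s_{2\sigma(i)}, s_{2\sigma(k)}) \right],
\end{align*}
```

so the two $d_s$ terms coincide for every pair $(i, k)$.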

Now, let us suppose that $M = N$ and that there exists a bijection $\sigma : [M] \to [N]$ satisfying E1, E2, and E3. We wish to prove that $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) = 0$. To this end, consider the transportation plan $\bm{V}^{*} = \bm{I}_{M \times N} \times \sigma$, i.e., $\bm{V}^{*}$ is associated with $i \mapsto i$ and $j \mapsto \sigma(i)$. Using E1, it holds that $\bm{V}^{*} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_2)$. Moreover, via E2 and E3, we also have

\begin{align*}
d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) &= \min_{\bm{V} \in \mathcal{A}(\mathcal{G}_1, \mathcal{G}_2)} O(\bm{A}^v, \bm{\mathsfit{A}}^e, \bm{V}) \leq \sum_{i,j} A^v_{i,j} V^{*}_{i,j} + \sum_{i,j,k,l} \mathsfit{A}^e_{i,j,k,l} V^{*}_{i,j} V^{*}_{k,l} \\
&= \sum_{i,j} d_f(f_{1i}, f_{2j}) V^{*}_{i,j} + \sum_{i,j,k,l} \left| d_s(s_{1i}, s_{1k}) - d_s(s_{2j}, s_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} \\
&= \sum_{i,j} d_f(f_{1i}, f_{2\sigma(i)}) V^{*}_{i,j} + \sum_{i,j,k,l} \left| d_s(s_{1i}, s_{1k}) - d_s(s_{2\sigma(i)}, s_{2\sigma(k)}) \right| V^{*}_{i,j} V^{*}_{k,l} = 0.
\end{align*}

Since every term of the objective is non-negative, $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) \geq 0$, and this leads to the desired result that $d_{\text{SoGA}}(\mathcal{G}_1, \mathcal{G}_2) = 0$.
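As a numerical sanity check of this direction, the following minimal sketch builds two graphs that differ only by a relabeling $\sigma$ and evaluates the objective at the plan induced by $\sigma$. It assumes Euclidean distances for $d_f$ and $d_s$; the function name `soga_objective` and all variable names are hypothetical and chosen for illustration. It evaluates the objective at a given plan only and does not solve the minimization that defines $d_{\text{SoGA}}$.

```python
import numpy as np

def soga_objective(f1, f2, s1, s2, V):
    """Node term sum_ij d_f(f1_i, f2_j) V_ij plus
    edge term sum_ijkl |d_s(s1_i, s1_k) - d_s(s2_j, s2_l)| V_ij V_kl.
    d_f and d_s are Euclidean here (an illustrative assumption)."""
    # Node cost: Av[i, j] = d_f(f1_i, f2_j)
    Av = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=-1)
    # Intra-graph structure distances
    D1 = np.linalg.norm(s1[:, None, :] - s1[None, :, :], axis=-1)  # d_s(s1_i, s1_k)
    D2 = np.linalg.norm(s2[:, None, :] - s2[None, :, :], axis=-1)  # d_s(s2_j, s2_l)
    # Edge cost: Ae[i, j, k, l] = |D1[i, k] - D2[j, l]|
    Ae = np.abs(D1[:, None, :, None] - D2[None, :, None, :])
    return np.sum(Av * V) + np.einsum("ijkl,ij,kl->", Ae, V, V)

rng = np.random.default_rng(0)
M = 5
f1 = rng.normal(size=(M, 3))
s1 = rng.normal(size=(M, 2))

# Second graph: the same nodes, relabeled by a bijection sigma (E2 and E3 hold).
sigma = rng.permutation(M)
f2 = np.empty_like(f1)
s2 = np.empty_like(s1)
f2[sigma] = f1  # f2[sigma[i]] = f1[i]
s2[sigma] = s1  # s2[sigma[i]] = s1[i]

# Plan induced by sigma: V*[i, sigma(i)] = 1, zero elsewhere.
V_star = np.zeros((M, M))
V_star[np.arange(M), sigma] = 1.0

value_at_sigma = soga_objective(f1, f2, s1, s2, V_star)

# A different permutation plan (columns shifted by one) for comparison.
V_shifted = np.roll(V_star, 1, axis=1)
value_shifted = soga_objective(f1, f2, s1, s2, V_shifted)
```

Here `value_at_sigma` vanishes up to floating-point error, matching the upper bound derived above, while the shifted plan yields a strictly positive value for generic random features.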

B.2 Proof of Lemma 1

By the definition of $d$ in terms of the metrics $d_f$ and $d_s$, and by the triangle inequality, we have

\begin{align*}
O_s(\overline{\bm{\mathsfit{A}}}^e, \bm{V}^{*}) &= \sum_{i,j,k,l} \left| d(s_{1i}, s_{1k}) - d(s_{2j}, s_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} \\
&= \sum_{i,j,k,l} \left| \frac{1}{2}\left[ d_f(f_{1i}, f_{1k}) + d_s(s_{1i}, s_{1k}) \right] - \frac{1}{2}\left[ d_f(f_{2j}, f_{2l}) + d_s(s_{2j}, s_{2l}) \right] \right| V^{*}_{i,j} V^{*}_{k,l} \\
&= \sum_{i,j,k,l} \left| \frac{1}{2}\left[ d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right] + \frac{1}{2}\left[ d_s(s_{1i}, s_{1k}) - d_s(s_{2j}, s_{2l}) \right] \right| V^{*}_{i,j} V^{*}_{k,l} \\
&\leq \frac{1}{2} \sum_{i,j,k,l} \left| d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} + \frac{1}{2} \sum_{i,j,k,l} \left| d_s(s_{1i}, s_{1k}) - d_s(s_{2j}, s_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} \\
&= \frac{1}{2} \sum_{i,j,k,l} \left| d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} + \frac{1}{2} O_s(\bm{\mathsfit{A}}^e, \bm{V}^{*}) \\
&= \frac{1}{2} \sum_{i,j,k,l} \left| d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} \quad \text{(since $O_s(\bm{\mathsfit{A}}^e, \bm{V}^{*}) = 0$)}. \tag{22}
\end{align*}

Using the triangle inequality of the metric $d_f$ again, we have

\begin{align*}
d_f(f_{1i}, f_{1k}) &\leq d_f(f_{1i}, f_{2j}) + d_f(f_{2j}, f_{2l}) + d_f(f_{2l}, f_{1k}), \\
d_f(f_{2j}, f_{2l}) &\leq d_f(f_{2j}, f_{1i}) + d_f(f_{1i}, f_{1k}) + d_f(f_{1k}, f_{2l}).
\end{align*}

By the symmetry of $d_f$, this is equivalent to

\begin{align*}
d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) &\leq d_f(f_{1i}, f_{2j}) + d_f(f_{1k}, f_{2l}), \\
d_f(f_{2j}, f_{2l}) - d_f(f_{1i}, f_{1k}) &\leq d_f(f_{1i}, f_{2j}) + d_f(f_{1k}, f_{2l}). \tag{23}
\end{align*}

We consider the two index sets $I_1 = \{(i, j, k, l) : d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \leq 0\}$ and $I_2 = \{(i, j, k, l) : d_f(f_{2j}, f_{2l}) - d_f(f_{1i}, f_{1k}) \leq 0\}$. Combining equations (22) and (23), it holds that

\begin{align*}
O_s(\overline{\bm{\mathsfit{A}}}^e, \bm{V}^{*}) &\leq \frac{1}{2} \sum_{i,j,k,l} \left| d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right| V^{*}_{i,j} V^{*}_{k,l} \\
&= \frac{1}{2} \sum_{(i,j,k,l) \in I_1} \left[ d_f(f_{2j}, f_{2l}) - d_f(f_{1i}, f_{1k}) \right] V^{*}_{i,j} V^{*}_{k,l} \\
&\quad + \frac{1}{2} \sum_{(i,j,k,l) \in I_2} \left[ d_f(f_{1i}, f_{1k}) - d_f(f_{2j}, f_{2l}) \right] V^{*}_{i,j} V^{*}_{k,l} \\
&\leq \frac{1}{2} \sum_{(i,j,k,l) \in I_1} \left[ d_f(f_{1i}, f_{2j}) + d_f(f_{1k}, f_{2l}) \right] V^{*}_{i,j} V^{*}_{k,l} \\
&\quad + \frac{1}{2} \sum_{(i,j,k,l) \in I_2} \left[ d_f(f_{1i}, f_{2j}) + d_f(f_{1k}, f_{2l}) \right] V^{*}_{i,j} V^{*}_{k,l} \\
&= \frac{1}{2} \sum_{i,j,k,l} \left[ d_f(f_{1i}, f_{2j}) + d_f(f_{1k}, f_{2l}) \right] V^{*}_{i,j} V^{*}_{k,l} \\
&= \frac{M}{2} \sum_{i,j} d_f(f_{1i}, f_{2j}) V^{*}_{i,j} + \frac{M}{2} \sum_{k,l} d_f(f_{1k}, f_{2l}) V^{*}_{k,l} = M\, O_f(\bm{A}^v, \bm{V}^{*}) = 0. \tag{24}
\end{align*}

Hence, $O_{s}(\overline{\bm{\mathsfit{A}}}^{e},\bm{V}^{*})=0$, since $O_{s}(\overline{\bm{\mathsfit{A}}}^{e},\bm{V}^{*})\geq 0$. Here, $\bm{V}^{*}$ is the optimal coupling such that

$$
O_{s}(\overline{\bm{\mathsfit{A}}}^{e},\bm{V}^{*})=\min_{\bm{V}\in\mathcal{A}(\mathcal{G}_{1},\mathcal{G}_{2})}O_{s}(\overline{\bm{\mathsfit{A}}}^{e},\bm{V}).
\tag{25}
$$

Hence, in accordance with Theorem 5.1 from Mémoli (2011); Mémoli & Sapiro (2005), there exists an isomorphism between the metric spaces associated with $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, described respectively by their mixing measures $\mu_{1}=\sum_{i=1}^{M}w_{1i}\delta_{(f_{1i},s_{1i})}$ and $\mu_{2}=\sum_{j=1}^{N}w_{2j}\delta_{(f_{2j},s_{2j})}$. That is, $M=N$ and there exists a bijection $\sigma:[M]\mapsto[N]$ that is a weight-preserving and distance-$d$-preserving isometry, satisfying:

  1. E1. $\forall i\in[M]: w_{1i}=w_{2\sigma(i)}$.

  2. E3*. $\forall i,k\in[M]^{2}: d(s_{1i},s_{1k})=d(s_{2\sigma(i)},s_{2\sigma(k)})$.
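
The two conditions can be checked numerically for a candidate bijection. The following is a minimal sketch, assuming the points $s$ are coordinate tuples and $d$ is the Euclidean distance; the function names `satisfies_e1_e3` and `find_isometry`, the tolerance, and the toy data are illustrative and not part of the proof.

```python
import math
from itertools import permutations


def satisfies_e1_e3(sigma, w1, w2, s1, s2, dist=math.dist, tol=1e-9):
    """Return True if sigma (sigma[i] is the image of i) meets E1 and E3*."""
    m = len(w1)
    if len(w2) != m or sorted(sigma) != list(range(m)):
        return False  # sigma must be a bijection between equal-sized index sets
    # E1: weights are preserved, w1[i] == w2[sigma(i)].
    e1 = all(abs(w1[i] - w2[sigma[i]]) <= tol for i in range(m))
    # E3*: pairwise distances are preserved under sigma.
    e3 = all(
        abs(dist(s1[i], s1[k]) - dist(s2[sigma[i]], s2[sigma[k]])) <= tol
        for i in range(m)
        for k in range(m)
    )
    return e1 and e3


def find_isometry(w1, w2, s1, s2, dist=math.dist, tol=1e-9):
    """Brute-force search over all bijections; only practical for tiny M."""
    if len(w1) != len(w2):
        return None  # M != N, so no bijection exists
    for sigma in permutations(range(len(w1))):
        if satisfies_e1_e3(sigma, w1, w2, s1, s2, dist, tol):
            return list(sigma)
    return None


# Toy example (hypothetical data): graph 2 is graph 1 relabelled and shifted by (5, 5).
w1 = [0.2, 0.3, 0.5]
s1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
w2 = [0.3, 0.5, 0.2]
s2 = [(6.0, 5.0), (5.0, 7.0), (5.0, 5.0)]
sigma = find_isometry(w1, w2, s1, s2)  # [2, 0, 1]
```

The exhaustive search is only meant to make the definition concrete; it scales factorially in $M$ and is not how the alignment is computed in practice.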

Each answer is shown in its original form (Original Conversation) and in its extended form (Long Context).

[Image]
User: What is the organ shown in the image?
Original answer: The image is a CT scan of the abdomen and pelvis, focusing on the adrenal glands.
Long-context answer: The organ depicted in the image is the adrenal glands, which are located on top of each kidney. In this CT scan of the abdomen and pelvis, the adrenal glands appear as small, triangular-shaped structures that play a crucial role in hormone production, including cortisol and adrenaline, which are vital for stress response and metabolism.

User: What do the black arrows indicate?
Original answer: The black arrows in the image are pointing to the bilateral adrenal enlargement, which means that both adrenal glands are larger than normal.
Long-context answer: The black arrows in the image indicate bilateral adrenal enlargement, which signifies that both adrenal glands are enlarged beyond their normal size. This condition can be associated with various clinical implications, such as hormonal imbalances or underlying diseases, and may require further evaluation to determine the cause.

User: Are both adrenal glands enlarged?
Original answer: Yes, both adrenal glands appear to be enlarged in the CT scan. This bilateral adrenal enlargement may be due to various causes, and further evaluation by a healthcare professional is necessary to determine the underlying reason for this finding.
Long-context answer: Yes, both adrenal glands are enlarged in the CT scan. This bilateral adrenal enlargement can result from various conditions, including:
  - Adrenal adenomas
  - Cushing’s syndrome
  - Hyperplasia
  - Metastatic disease
Further evaluation by a healthcare professional is essential to identify the specific cause of this finding.

[Image]
User: What is the age and gender of the patient?
Original answer: The patient is a 23-year-old female.
Long-context answer: The patient is a 23-year-old female, indicating her age and gender clearly. This information is essential for understanding her demographic profile in a clinical context.

User: What is the main finding in the chest x-ray?
Original answer: The main finding in the chest x-ray is lymphadenopathy, which refers to the enlargement of lymph nodes.
Long-context answer: The main finding in the chest x-ray is lymphadenopathy, which indicates the enlargement of lymph nodes. This condition may suggest underlying issues such as infection, malignancy, or autoimmune disorders, warranting further investigation to determine the cause and appropriate management.

User: What is the patient’s medical condition?
Original answer: The patient has respiratory symptoms and a confirmed H1N1 infection. Additionally, the patient has a history of crack addiction.
Long-context answer: The patient is diagnosed with respiratory symptoms and has a confirmed H1N1 infection. Additionally, the patient has a significant history of crack addiction, which may impact their overall health and treatment options.

Table 6: Example of a conversation extended with long-context explanations.

Appendix C Medical Visual Chatbot

Evaluation Protocol

We evaluate the ability of models to serve as a medical visual chatbot as follows. Each of the 193 novel questions in the dataset has a corresponding ground-truth answer. We generate a response to each question from the LMM. Next, we ask GPT-4 to score the helpfulness, relevance, accuracy, and level of detail of the LMM's response relative to the ground-truth answer. The score is on a scale of 0 to 100, where a higher score indicates better overall performance. During our project, we were unable to access the GPT-4 version used by LLaVA-Med because it had been deprecated, so we use GPT-4o as the judge instead. We also use this GPT version to reproduce the results reported in the LLaVA-Med paper and observe a decrease in performance. This may be because GPT-4o is a stricter judge than the previous version and therefore scores the model’s responses more harshly.

Results

Table 4 shows the experimental results of LoGra-Med alongside competitive methods, with the highest scores in bold. Our two method variants, based on LLaVA 1.5 with and without the DCI technique, outperform the others on conversation samples and achieve results comparable to LLaVA-Med 1.5 on description samples. In evaluations across five medical domains, our methods surpass the baselines in three (CXR, Histology, and Gross), positioning LoGra-Med as the state-of-the-art overall. These findings highlight how the multi-graph alignment strategy and extended answer contexts enhance VQA chatbot performance in the biomedical domain.

Qualitative results are shown in Table 7, which lists LoGra-Med’s generated responses for a series of example questions and image contexts. The top and middle parts of the table show detailed descriptions of a chest X-ray image and a CT scan image, respectively, while the bottom part shows a conversational exchange about the origin and extent of a mass found in a medical image. As demonstrated, LoGra-Med provides clinically relevant answers with a high level of detail, which improves the interpretability and usability of its outputs. In cases where the given image and/or question is ambiguous, as in the middle dialogue, LoGra-Med should produce a clear explanation without adding extra information or showing signs of hallucination.

Detailed Description

[Image]
Caption: Chest X-ray: Normal with no signs of metastatic lesions
User: Walk through the important details of the image
LoGra-Med: The image is a chest X-ray (CXR) taken on the 1st postoperative day. It provides a view of the patient’s chest, including the lungs, heart, and surrounding structures. Since this is a postoperative image, it may be used to assess the patient’s condition after a surgical procedure, monitor for any complications, or evaluate the effectiveness of the treatment.

[Image]
Caption: Viral Sialadenitis. Axial computed tomography in a 14-year-old female with bilateral parotid pain and swelling demonstrates symmetric enlargement of the parotid glands with subtle stranding in the adjacent fat (white arrows). Blood work was positive for parainfluenza virus.
User: Offer a thorough analysis of the image
LoGra-Med: As an AI, I cannot view the actual image, but I can help you interpret the image based on the visual aspects you describe. If you have any questions about the content or context of the image, feel free to ask, and I’ll do my best to provide assistance.
Answer (Short): Chest X-ray showing dilated cardiomyopathy

Conversation

[Image]
Caption: A computed tomography scan shows that the mass arises from the posterior costal arc of the eighth rib on the patients left side and does not involve the medulla.
User: What is the origin of the mass?
LoGra-Med: The mass appears to be originating from the right lung.
User: Does the mass involve the medulla?
LoGra-Med: According to the image, the mass does not appear to involve the medulla. The medulla is a part of the brainstem, and its involvement would indicate that the mass has extended to this area. In this case, the mass seems to be affecting other parts of the brainstem and cerebellum.
Answer (Short): -

Table 7: Examples of the medical visual chatbot. Top: a detailed description for a Chest X-Ray (CXR) image; Middle: a detailed description for a CT scan image; Bottom: a conversation exchange regarding a medical condition.

Appendix D Zero-shot Image Classification as MedVQA

Figure 6 provides several examples of microscopy and CT images. The top section displays three microscopy images along with their respective question-option pairs, while the bottom section presents three CT image samples with their question-option pairs. The ground-truth options are highlighted in blue. The total numbers of images and question-answering items for the three groups of medical image modalities are shown in Figure 4.

Evaluation method. Following Hu et al. (2024), we use the question-answering score as the metric to report model performance. Specifically, we combine the question expression and all candidate options to construct the prompt. Our prompt template is as follows:

“This is a medical question with several Options, and there is only one correct answer among these options. Please select the correct answer for the question. Remember, you can only select one option. The Question is:⟨Question⟩. ### The candidate Options are:⟨Options⟩.”

The MLLM receives this prompt and the corresponding image and is asked to generate a response. We then use difflib, a Python standard-library module for comparing sequences, to compute the similarity between the response and each candidate option, and pick the option with the highest similarity as the final prediction. Accuracy is computed by comparing the prediction with the ground-truth answer.
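
The matching step can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the text names only the difflib package, so the use of `difflib.SequenceMatcher(...).ratio()`, the case-insensitive comparison, the "A: ..., B: ..." rendering of the options, and the helper names `build_prompt`, `pick_option`, and `accuracy` are all assumptions.

```python
import difflib

# Prompt wording copied from the template above; the placeholders are filled by build_prompt.
PROMPT_TEMPLATE = (
    "This is a medical question with several Options, and there is only one "
    "correct answer among these options. Please select the correct answer for "
    "the question. Remember, you can only select one option. "
    "The Question is:{question}. ### The candidate Options are:{options}."
)


def build_prompt(question, options):
    """Fill the template. Assumption: options are rendered as 'A: ..., B: ...'."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    rendered = ", ".join(f"{letters[i]}: {opt}" for i, opt in enumerate(options))
    return PROMPT_TEMPLATE.format(question=question, options=rendered)


def pick_option(response, options):
    """Return the candidate option most similar to the free-text response."""
    def similarity(option):
        # Assumption: SequenceMatcher ratio on lowercased strings.
        return difflib.SequenceMatcher(None, response.lower(), option.lower()).ratio()
    return max(options, key=similarity)


def accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly equal the ground-truth option."""
    if not ground_truths:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```

For example, `pick_option("The answer is Lymphocytes", ["Neutrophils", "Melanocytes", "Lymphocytes", "Hepatocytes"])` selects `"Lymphocytes"`. Because the prediction is always the closest option, a response that names no option still yields a prediction, which is worth keeping in mind when reading low accuracies.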

| Microscopy Image Dataset | LLaVA-Med | Med-Flamingo | RadFM | Ours |
|---|---|---|---|---|
| CRC100k (Kather et al., 2018) | 24.74 | 17.18 | 27.48 | 28.06 |
| ALL Challenge (Gupta & Gupta, 2019) | 29.24 | 13.20 | 39.88 | 27.49 |
| BioMediTech (Nanni et al., 2016) | 39.14 | 16.08 | 47.84 | 46.97 |
| Blood Cell (blo, 2023) | 21.11 | 15.25 | 16.95 | 29.87 |
| BreakHis (Spanhol et al., 2015) | 23.27 | 13.62 | 18.26 | 33.74 |
| NLM-Malaria (nlm, 2023) | 30.67 | 6.76 | 32.43 | 66.67 |
| HuSHeM (Shaker et al., 2017) | 16.85 | 18.18 | 11.36 | 25.84 |
| MHSMA (Javadi & Mirroshandel, 2019) | 29.64 | 39.66 | 38.41 | 39.70 |
| Avg. | 26.83 | 17.49 | 29.08 | 37.29 |

Table 8: Performance comparison on various microscopy image datasets.

| CXR Dataset | LLaVA-Med | Med-Flamingo | RadFM | Ours |
|---|---|---|---|---|
| RUS CHN (xra, 2023) | 28.05 | 20.19 | 29.88 | 41.88 |
| Mura (Rajpurkar et al., 2017) | 20.70 | 25.91 | 43.47 | 30.19 |
| Pulmonary Chest MC (Jaeger et al., 2014) | 21.05 | 27.03 | 10.81 | 47.37 |
| MIAS (Suckling, 1994) | 25.35 | 38.30 | 28.37 | 42.96 |
| Pulmonary Chest Shenzhen (Jaeger et al., 2014) | 26.35 | 32.54 | 36.95 | 19.93 |
| COVIDx CXR-4 (Wang et al., 2020) | 28.25 | 25.83 | 48.14 | 22.68 |
| Knee Osteoarthritis (Chen, 2018) | 11.20 | 22.24 | 6.19 | 8.69 |
| Chest X-Ray PA (Asraf & Islam, 2021) | 29.06 | 38.04 | 38.28 | 49.41 |
| CoronaHack (Cohen et al., 2020) | 19.74 | 33.67 | 22.99 | 47.81 |
| Covid-19 tianchi (cov, 2023) | 16.67 | 45.26 | 33.68 | 30.21 |
| Covid19 heywhale (Chowdhury et al., 2020) | 22.03 | 56.31 | 23.37 | 29.28 |
| Avg. | 23.18 | 32.01 | 28.84 | 33.67 |

Table 9: Performance comparison across CXR datasets.

| CT Dataset | LLaVA-Med | Med-Flamingo | RadFM | Ours |
|---|---|---|---|---|
| Chest CT Scan (che, ) | 25.72 | 20.00 | 25.06 | 20.09 |
| SARS-CoV-2 CT (Soares & Angelov, 2020) | 28.79 | 40.92 | 44.55 | 34.95 |
| Covid CT (cov, ) | 22.61 | 21.72 | 28.79 | 37.19 |
| OCT & X-Ray 2017 (Kermany et al., 2018) | 27.21 | 28.08 | 29.46 | 47.89 |
| Avg. | 26.08 | 27.68 | 31.97 | 35.03 |

Table 10: Performance comparison on various CT (Computed Tomography) datasets.

Results. Detailed per-dataset results for each data modality are given in Tables 8, 9, and 10.

Appendix E LLM Prompting

We illustrate in Figure 5 how to leverage the GPT-4 API to analyze and extend the original answers. For detailed responses in specific cases, refer to Table 6.

System Prompt

You possess in-depth biomedical knowledge in checking the quality of the answer to a given instruction. From the given input, which is a pair of instruction and answer, your task involves the following steps:

  1. Explain why the given answer is not good for its instruction. Please analyze based on the Helpfulness, Relevance, Accuracy, Level of Detail, and Structure fields.

  2. Generate a better answer based on the reasons pointed out above, while preserving the same content. To achieve that, you may want to adjust the level of details, add bullet points, or use comprehensive words, etc. Because these answers are about biomedical knowledge, you must keep all the medical terminology and important words in the new better answer. The new better answer should be in a tone that you are also seeing the image and answering the question.

  3. Output a JSON object containing the following keys (note that double quotes should not be used):

    {
      "explanation": {
        "helpfulness": <comment on helpfulness, max 20 tokens>,
        "relevance": <comment on relevance, max 20 tokens>,
        "accuracy": <comment on accuracy, max 20 tokens>,
        "detail": <comment on detail, max 20 tokens>,
        "structure": <comment on structure, max 20 tokens>
      },
      "revision": <improved version of the answer, max 2x tokens of input if > 2 tokens, otherwise max 20 tokens>
    }
Figure 5: Instructions provided to the system for analyzing the quality of answers based on different criteria and generating a revised response in JSON format.
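
A reply that follows this prompt can be checked before the revised answer is used. The sketch below is hypothetical and rests on several assumptions not stated in the paper: that the reply parses as valid JSON, that tokens are approximated by whitespace-separated words, and that the helper names `revision_token_limit` and `parse_reply` are ours.

```python
import json

# Keys requested under "explanation" in the Figure 5 system prompt.
EXPLANATION_KEYS = ("helpfulness", "relevance", "accuracy", "detail", "structure")


def revision_token_limit(original_answer):
    """Length cap from the prompt: 2x the input length if it has more than 2 tokens, else 20.

    Assumption: tokens are approximated by whitespace-separated words.
    """
    n_tokens = len(original_answer.split())
    return 2 * n_tokens if n_tokens > 2 else 20


def parse_reply(raw_reply, original_answer):
    """Parse a reply, check the requested keys, and flag over-long revisions."""
    data = json.loads(raw_reply)  # assumes the reply is valid JSON
    explanation = data.get("explanation", {})
    missing = [key for key in EXPLANATION_KEYS if key not in explanation]
    if missing or "revision" not in data:
        raise ValueError(f"reply is missing fields: {missing or ['revision']}")
    revision = data["revision"]
    limit = revision_token_limit(original_answer)
    return {
        "revision": revision,
        "within_limit": len(revision.split()) <= limit,
    }
```

The length check only flags a revision rather than rejecting it, since a word count is a rough proxy for the token budget named in the prompt.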

Appendix F Additional Results for Multi-modal Pre-training Comparison

F.1 MedVQA datasets

We train and evaluate LoGra-Med on three biomedical VQA datasets, including VQA-RAD, SLAKE, and PathVQA. The dataset statistics are summarized in detail in Table 13.

  • VQA-RAD dataset is a collection of 2248 QA pairs and 515 radiology images which are evenly distributed over the chest, head, and abdomen. Over half of the answers are closed-ended (i.e., yes/no type), while the rest are open-ended with short phrase answers.

  • SLAKE dataset contains 642 radiology images and over 7000 diverse QA pairs. It includes rich modalities and human body parts such as the brain, neck, chest, abdomen, and pelvic cavity. This dataset is bilingual in English and Chinese, and in our experiments, we only considered the English subset.

  • PathVQA dataset contains pathology images. It has a total of 32795 QA pairs and 4315 pathology images. The questions in this dataset are of two types: open-ended questions (why, where, how, what, etc.) and closed-ended questions.

F.2 Results

Tables 11 and 12 present the results using 70% and 100% of the data. Overall, LoGra-Med demonstrates a steady improvement and consistently outperforms other pre-training methods across nearly all settings.
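
The Avg. and Overall columns of Tables 11 and 12 are not defined in this section. The reported values are consistent, up to small rounding differences, with the following reading, which we state as an inference from the numbers rather than as the authors' definition: Avg. is the mean of the Open and Closed scores of a dataset, and Overall is the mean of the three per-dataset Avg. values. The worked numbers use the LoGra-Med row of Table 12.

```latex
\mathrm{Avg.}_{d}=\frac{\mathrm{Open}_{d}+\mathrm{Closed}_{d}}{2},
\qquad
\mathrm{Overall}=\frac{1}{3}\sum_{d\in\{\text{VQA-RAD},\,\text{SLAKE},\,\text{PathVQA}\}}\mathrm{Avg.}_{d}

% LoGra-Med, 100% pre-training (Table 12):
\mathrm{Avg.}_{\text{VQA-RAD}}=\frac{66.35+83.46}{2}=74.905\approx 74.91,
\qquad
\mathrm{Overall}=\frac{74.91+85.46+63.87}{3}\approx 74.75
```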

| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg. | SLAKE Open | SLAKE Closed | SLAKE Avg. | PathVQA Open | PathVQA Closed | PathVQA Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med (100%) | 63.65 | 81.62 | 72.64 | 83.44 | 83.41 | 83.43 | 36.78 | 91.33 | 64.06 | 73.37 |
| LLaVA-Med (70%) | 65.96 (↑2.31) | 81.62 (↓0) | 73.79 (↑1.13) | 84.16 (↑0.72) | 83.17 (↓0.24) | 83.67 (↑0.24) | 37.39 (↑0.61) | 92.27 (↑0.94) | 64.83 (↑0.77) | 74.1 (↑0.64) |
| InfoNCE | 64.18 | 77.94 | 71.06 | 70.9 | 82.69 | 76.80 | 33.58 | 88.5 | 61.04 | 69.63 |
| PLOT | 60.13 | 78.31 | 69.22 | 82.48 | 83.89 | 83.185 | 29.23 | 85.7 | 57.478 | 69.96 |
| SigLIP | 61.68 | 78.68 | 70.18 | 82.04 | 83.17 | 82.61 | 34.43 | 90.3 | 62.37 | 71.72 |
| VLAP | 64.08 | 79.41 | 71.75 | 84.94 | 85.1 | 85.02 | 36.44 | 91.51 | 63.98 | 73.58 |
| LoGra-Med | 67.12 | 81.99 | 74.56 | 84.81 | 84.86 | 84.84 | 37.26 | 91.77 | 64.52 | 74.64 |

Table 11: Fine-tuning performance on MedVQA downstream datasets (pre-training on 70% of the data). Bold indicates the best values among pre-training algorithms, excluding LLaVA-Med (pre-trained on 100%).

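| Method | VQA-RAD Open | VQA-RAD Closed | VQA-RAD Avg. | SLAKE Open | SLAKE Closed | SLAKE Avg. | PathVQA Open | PathVQA Closed | PathVQA Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med (100%) | 63.65 | 81.62 | 72.64 | 83.44 | 83.41 | 83.43 | 36.78 | 91.33 | 64.06 | 73.37 |
| InfoNCE | 66.01 | 79.41 | 72.71 | 83.23 | 83.41 | 83.32 | 35.01 | 89.53 | 62.27 | 72.77 |
| PLOT | 63.58 | 77.21 | 70.4 | 82.44 | 84.86 | 83.65 | 34.45 | 89.97 | 62.21 | 72.09 |
| SigLIP | 57.11 | 74.26 | 65.69 | 85.07 | 83.41 | 84.24 | 36.47 | 89.38 | 62.925 | 70.95 |
| VLAP | 60.93 | 79.78 | 70.36 | 84.74 | 83.17 | 83.955 | 35.86 | 89.65 | 62.755 | 72.36 |
| LoGra-Med | 66.35 | 83.46 | 74.91 | 85.34 | 85.58 | 85.46 | 36.82 | 90.92 | 63.87 | 74.75 |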

Table 12: Fine-tuning performance on MedVQA downstream datasets (pre-training on 100% of the data).

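| Dataset | VQA-RAD Train | VQA-RAD Test | SLAKE Train | SLAKE Val | SLAKE Test | PathVQA Train | PathVQA Val | PathVQA Test |
|---|---|---|---|---|---|---|---|---|
| # Images | 313 | 203 | 450 | 96 | 96 | 2599 | 858 | 858 |
| # QA Pairs | 1797 | 451 | 4919 | 1053 | 1061 | 19755 | 6279 | 6761 |
| # Open | 770 | 179 | 2976 | 631 | 645 | 9949 | 3144 | 3370 |
| # Closed | 1027 | 272 | 1943 | 422 | 416 | 9806 | 3135 | 3391 |
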
Table 13: Dataset statistics for 3 medical VQA datasets: VQA-RAD, SLAKE, and PathVQA.

[Image]
Q: What are the types of cells depicted in this image?
A: Neutrophils
B: Melanocytes
C: Lymphocytes
D: Hepatocytes

[Image]
Q: What is the diagnosis of the histopathology in this image?
A: Breast hyperplasia without atypia histopathology
B: Normal breast histopathology
C: Benign breast histopathology
D: Fibrocystic breast histopathology

[Image]
Q: What is the probable diagnosis depicted in this image?
A: Chronic myeloid leukemia
B: Multiple myeloma
C: Hodgkin’s lymphoma
D: Acute lymphoblastic leukemia

[Image]
Q: What is the diagnosis of the cancer seen in this image?
A: Adenocarcinoma of the right hilum, T3 N1 M0, Stage IIb
B: Mesothelioma of the right hilum, T2 N1 M0, Stage IIb
C: Large cell carcinoma of the left hilum, T2 N2 M0, Stage IIIa
D: Non-small cell carcinoma of the left hilum, T2 N0 M0, Stage I

[Image]
Q: Is COVID-19 apparent in this CT scan image?
A: No
B: Yes

[Image]
Q: Which imaging technique was utilized to obtain this image?
A: Ultrasound
B: Optical Coherence Tomography
C: Magnetic Resonance Imaging (MRI)
D: Thermography

Figure 6: Examples from the OmniMedVQA dataset: microscopy (top) and CT images (bottom) with corresponding questions and options, with the correct answers highlighted in blue.